Spark-Avro学习4之使用AvroWritePartitioned存储AVRO文件时进行划分
2016-05-02 11:26
489 查看
更多Spark学习examples代码请见:https://github.com/xubo245/SparkLearning
1.主要是partition存储avro文件
2.代码:
/**
* @author xubo
* @time 20160502
* ref https://github.com/databricks/spark-avro */
package org.apache.spark.avro.learning
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.text.SimpleDateFormat
import java.util.Date
import com.databricks.spark.avro._
/**
* partitioned Avro records
*/
object AvroWritePartitioned {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("AvroWritePartitioned").setMaster("local")
val sc = new SparkContext(conf)
// import needed for the .avro method to be added
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = Seq((2012, 8, "Batman", 9.8),
(2012, 8, "Hero", 8.7),
(2012, 7, "Robot", 5.5),
(2011, 7, "Git", 2.0))
.toDF("year", "month", "title", "rating")
df.show
val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date())
df.write.partitionBy("year", "month").avro("file/data/avro/output/episodes/WriteAvro" + iString)
val dfread = sqlContext.read
.format("com.databricks.spark.avro")
.load("file/data/avro/output/episodes/WriteAvro" + iString)
dfread.show
val dfread2 = sqlContext.read.avro("file/data/avro/output/episodes/WriteAvro" + iString)
dfread2.show
}
}
3,结果:
+----+-----+------+------+
|year|month| title|rating|
+----+-----+------+------+
|2012| 8|Batman| 9.8|
|2012| 8| Hero| 8.7|
|2012| 7| Robot| 5.5|
|2011| 7| Git| 2.0|
+----+-----+------+------+
2016-05-02 11:25:15 WARN :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: fe80:0:0:0:200:5efe:ca26:54d2%20, but we couldn't find any external IP address!
+------+------+----+-----+
| title|rating|year|month|
+------+------+----+-----+
| Robot| 5.5|2012| 7|
| Git| 2.0|2011| 7|
|Batman| 9.8|2012| 8|
| Hero| 8.7|2012| 8|
+------+------+----+-----+
+------+------+----+-----+
| title|rating|year|month|
+------+------+----+-----+
| Robot| 5.5|2012| 7|
| Git| 2.0|2011| 7|
|Batman| 9.8|2012| 8|
| Hero| 8.7|2012| 8|
+------+------+----+-----+
1.主要是partition存储avro文件
2.代码:
/**
* @author xubo
* @time 20160502
* ref https://github.com/databricks/spark-avro */
package org.apache.spark.avro.learning
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.text.SimpleDateFormat
import java.util.Date
import com.databricks.spark.avro._
/**
* partitioned Avro records
*/
object AvroWritePartitioned {
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("AvroWritePartitioned").setMaster("local")
val sc = new SparkContext(conf)
// import needed for the .avro method to be added
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = Seq((2012, 8, "Batman", 9.8),
(2012, 8, "Hero", 8.7),
(2012, 7, "Robot", 5.5),
(2011, 7, "Git", 2.0))
.toDF("year", "month", "title", "rating")
df.show
val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date())
df.write.partitionBy("year", "month").avro("file/data/avro/output/episodes/WriteAvro" + iString)
val dfread = sqlContext.read
.format("com.databricks.spark.avro")
.load("file/data/avro/output/episodes/WriteAvro" + iString)
dfread.show
val dfread2 = sqlContext.read.avro("file/data/avro/output/episodes/WriteAvro" + iString)
dfread2.show
}
}
3,结果:
+----+-----+------+------+
|year|month| title|rating|
+----+-----+------+------+
|2012| 8|Batman| 9.8|
|2012| 8| Hero| 8.7|
|2012| 7| Robot| 5.5|
|2011| 7| Git| 2.0|
+----+-----+------+------+
2016-05-02 11:25:15 WARN :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: fe80:0:0:0:200:5efe:ca26:54d2%20, but we couldn't find any external IP address!
+------+------+----+-----+
| title|rating|year|month|
+------+------+----+-----+
| Robot| 5.5|2012| 7|
| Git| 2.0|2011| 7|
|Batman| 9.8|2012| 8|
| Hero| 8.7|2012| 8|
+------+------+----+-----+
+------+------+----+-----+
| title|rating|year|month|
+------+------+----+-----+
| Robot| 5.5|2012| 7|
| Git| 2.0|2011| 7|
|Batman| 9.8|2012| 8|
| Hero| 8.7|2012| 8|
+------+------+----+-----+
相关文章推荐
- 第二周编程题-时间换算
- Win7下编译Qt5.4OCI驱动
- php支持断点续传、分块下载的类
- 重装Ubuntu配置编程环境LAMP,J2EE从0开始
- Spark RDD API详解Map和Reduce
- 3 C语言 流程控制 循环 跳转
- ios开发学习笔记--数据持久化之偏好设置(NSUserDefault)
- js的clearInterval()
- 第一周编程题-逆序的三位数
- ACS5.X -AD-tacacs+ authentication
- HDU-ACM2049--错排问题的应用
- 网络仿真工具TOTEM之——环境配置
- Oracle 11g R2 RAC高可用性连接
- vs2012新建项目产生的问题
- leetcode 258---Add Digits, 关于C++中负数取余
- poj2135 最小费用最大流模板
- CentOS 7Install PrestaShop
- HDU-ACM2048
- 快速排序以及归并排序
- LeetCode 245. Shortest Word Distance III