SparkSQL JSON Data Operations (1.3 -> 1.4)
2015-08-05 23:00
1. User-defined schema
data
Each JSON string has the following format:

{
  "partner_code": "demo",
  "app_name": "web",
  "person_info": { "name": "张三", "age": 18 },
  "items": [
    { "item_id": 1, "item_name": "王家村", "group": "group1" },
    { "item_id": 2, "item_name": "李家澡堂", "item_detail": { "platform_count": 2 }, "group": "group2" }
  ]
}
spark1.3
In Spark 1.3 we handled it like this:

import org.apache.spark.sql.types._

// Define the schema: nested objects are read as string-to-string maps
val struct = StructType(
  StructField("partner_code", StringType, true) ::
  StructField("app_name", StringType, true) ::
  StructField("person_info", MapType(StringType, StringType, true)) ::
  StructField("items", ArrayType(MapType(StringType, StringType, true))) ::
  Nil)

// Load the file as an RDD of JSON strings, then apply the schema
val data = sc.textFile("path/jsonFile")
val df = sqlContext.jsonRDD(data, struct)
df.printSchema
df.show
spark1.4
import org.apache.spark.sql.types._

// Define the schema (same as in Spark 1.3)
val struct = StructType(
  StructField("partner_code", StringType, true) ::
  StructField("app_name", StringType, true) ::
  StructField("person_info", MapType(StringType, StringType, true)) ::
  StructField("items", ArrayType(MapType(StringType, StringType, true))) ::
  Nil)

// In Spark 1.4, sqlContext.read replaces jsonRDD and reads the file directly
val df = sqlContext.read.schema(struct).json("path/jsonFile")
Output

// df.printSchema
root
 |-- partner_code: string (nullable = true)
 |-- app_name: string (nullable = true)
 |-- person_info: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- items: array (nullable = true)
 |    |-- element: map (containsNull = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)

// df.show
+------------+--------+--------------------+--------------------+
|partner_code|app_name|         person_info|               items|
+------------+--------+--------------------+--------------------+
|        demo|     web|Map(name -> 张三, a...|List(Map(item_id ...|
+------------+--------+--------------------+--------------------+
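With a map-typed schema like this one, individual values can be pulled out by key. The following is a minimal sketch, assuming Spark 1.4 or later running in local mode. The app name `json-map-demo`, the one-record in-memory RDD (used here instead of `path/jsonFile`), and the column aliases are illustrative choices, not part of the original setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

// Assumption: local mode. In spark-shell, sc and sqlContext already exist.
val conf = new SparkConf().setAppName("json-map-demo").setMaster("local[1]")
val sc = SparkContext.getOrCreate(conf) // getOrCreate is available from Spark 1.4
val sqlContext = new SQLContext(sc)

// Same user-defined schema as in the article
val struct = StructType(
  StructField("partner_code", StringType, true) ::
  StructField("app_name", StringType, true) ::
  StructField("person_info", MapType(StringType, StringType, true)) ::
  StructField("items", ArrayType(MapType(StringType, StringType, true))) ::
  Nil)

// One sample record held in memory (illustrative stand-in for the file)
val json = """{"partner_code":"demo","app_name":"web","person_info":{"name":"张三","age":18},"items":[{"item_id":1,"item_name":"王家村","group":"group1"},{"item_id":2,"item_name":"李家澡堂","item_detail":{"platform_count":2},"group":"group2"}]}"""
val df = sqlContext.read.schema(struct).json(sc.parallelize(Seq(json)))

// getItem looks up a map value by key, or an array element by index
val picked = df.select(
  df("person_info").getItem("name").as("name"),
  df("items").getItem(0).getItem("item_name").as("first_item_name"))
picked.show()

val row = picked.first()
val name = row.getString(0)
val firstItemName = row.getString(1)
```

Because every map value is declared as `StringType`, numeric fields such as `age` come back as strings; cast them (for example with `.cast("int")`) where a number is needed.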
2. Automatically inferred schema
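Using the built-in schema inference directly is more convenient, but it produces a large number of struct types, and if the structure is complex and varies from record to record, it also produces many null values.

// No schema definition needed; Spark infers it from the data
val df = sqlContext.read.json("path/jsonFile")
df.printSchema
df.show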
Output

// df.printSchema
root
 |-- app_name: string (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- group: string (nullable = true)
 |    |    |-- item_detail: struct (nullable = true)
 |    |    |    |-- platform_count: long (nullable = true)
 |    |    |-- item_id: long (nullable = true)
 |    |    |-- item_name: string (nullable = true)
 |-- partner_code: string (nullable = true)
 |-- person_info: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)

// df.show
+--------+--------------------+------------+-----------+
|app_name|               items|partner_code|person_info|
+--------+--------------------+------------+-----------+
|     web|List([group1,null...|        demo|    [18,张三]|
+--------+--------------------+------------+-----------+
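To make the point about nulls concrete, here is a hedged sketch that lets Spark infer the schema and then reads nested struct fields. It assumes Spark 1.4 or later in local mode; the app name `json-infer-demo`, the in-memory one-record RDD, and the column aliases are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumption: local mode. In spark-shell, sc and sqlContext already exist.
val conf = new SparkConf().setAppName("json-infer-demo").setMaster("local[1]")
val sc = SparkContext.getOrCreate(conf) // getOrCreate is available from Spark 1.4
val sqlContext = new SQLContext(sc)

// One sample record held in memory (illustrative stand-in for the file)
val json = """{"partner_code":"demo","app_name":"web","person_info":{"name":"张三","age":18},"items":[{"item_id":1,"item_name":"王家村","group":"group1"},{"item_id":2,"item_name":"李家澡堂","item_detail":{"platform_count":2},"group":"group2"}]}"""

// No schema supplied: nested objects are inferred as structs
val df = sqlContext.read.json(sc.parallelize(Seq(json)))

// getField reads a struct member; getItem picks an array element by index
val picked = df.select(
  df("person_info").getField("age").as("age"),
  df("items").getItem(0).getField("item_detail").as("first_detail"),
  df("items").getItem(1).getField("item_detail").getField("platform_count").as("second_count"))

val row = picked.first()
val age = row.getLong(0)               // inferred as long, not string
val firstDetailIsNull = row.isNullAt(1) // first item has no item_detail
val secondCount = row.getLong(2)
```

The inferred element struct is the union of all fields seen across items, so any item that lacks `item_detail` carries a null in that position. With a user-defined map schema, the missing key is simply absent from the map.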