Spark 1.6 Preview
2015-12-29 17:43
A new Dataset API
The RDD API is very flexible, but in some cases its execution is hard to optimize. The DataFrame API enables optimized execution internally, but it lacks some of the nice perks of the RDD API (e.g. it is harder to use UDFs, and there are no strong types in Scala/Java). The goal of Spark Datasets is to let developers easily write transformations over domain objects, while also providing the performance and robustness advantages of the Spark SQL execution engine.
See SPARK-9999 for details.
According to the design doc, the Dataset API is a superset of the DataFrame API: a DataFrame is simply a Dataset of Row objects (Dataset[Row]). A DataFrame operates on generic Row records, while a Dataset can operate on user-defined types - String, POJOs, and so on.
Requirements
Fast - In most cases, the performance of Datasets should be equal to or better than working with RDDs. Encoders should be as fast or faster than Kryo and Java serialization, and unnecessary conversion should be avoided.
Typesafe - Similar to RDDs, objects and functions that operate on those objects should provide compile-time safety where possible. When converting from data where the schema is not known at compile-time (for example data read from an external source such as JSON), the conversion function should fail-fast if there is a schema mismatch.
Support for a variety of object models - Default encoders should be provided for a variety of object models: primitive types, case classes, tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard conventions, such as Avro SpecificRecords, should also work out of the box.
Java Compatible - Datasets should provide a single API that works in both Scala and Java. Where possible, shared types like Array will be used in the API. Where not possible, overloaded functions should be provided for both languages. Scala concepts, such as ClassTags, should not be required in the user-facing API.
Interoperates with DataFrames - Users should be able to seamlessly transition between Datasets and DataFrames, without writing conversion boilerplate. When names in the input schema line up with fields in the given class, no extra mapping should be necessary. Libraries like MLlib should not need to provide different interfaces for accepting DataFrames and Datasets as input.
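To make these requirements concrete, here is a minimal Scala sketch of the Dataset API as it looks in Spark 1.6; the Person case class and the sample data are made-up placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical domain object; encoders for case classes come out of the box.
case class Person(name: String, age: Long)

object DatasetDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ds-demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Build a typed Dataset from local data.
    val ds = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()

    // Typed, compile-time-checked transformations on domain objects.
    val names = ds.filter(_.age >= 18).map(_.name)

    // Seamless interop with DataFrames, in both directions.
    val df = ds.toDF()        // Dataset[Person] -> DataFrame
    val ds2 = df.as[Person]   // DataFrame -> Dataset[Person]

    names.show()
    sc.stop()
  }
}
```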
Automatic memory configuration
Developers no longer need to agonize over tuning the memory split: Spark can now automatically grow and shrink the proportions of execution and storage memory at runtime, as needed. This yields a clear performance improvement for joins and aggregations.
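For reference, a sketch of the knobs behind the unified memory manager in 1.6 (the values shown are the 1.6 defaults, included only for illustration):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Heap fraction shared by execution and storage (1.6 default: 0.75).
  .set("spark.memory.fraction", "0.75")
  // Part of that region sheltered from eviction by execution (default: 0.5).
  .set("spark.memory.storageFraction", "0.5")
  // Flip to "true" to restore the pre-1.6 static split.
  .set("spark.memory.useLegacyMode", "false")
```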
Optimized state storage in Spark Streaming
Spark Streaming gains a new state-tracking API based on a "delta-tracking" approach, which greatly reduces the cost of maintaining large amounts of state across batches. Concretely, the old updateStateByKey is superseded by the new trackStateByKey (renamed to mapWithState in the final 1.6 release).
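A minimal sketch of the new API (shown here under its final name mapWithState; the running-count logic and the input stream are assumptions for illustration):

```scala
import org.apache.spark.streaming.{State, StateSpec, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// Assumes an existing StreamingContext `ssc` and a DStream[(String, Int)] `words`.
def runningCounts(ssc: StreamingContext, words: DStream[(String, Int)]): Unit = {
  ssc.checkpoint("/tmp/streaming-ckpt") // state tracking requires checkpointing

  // Only keys seen in the current batch are updated; untouched state is not rewritten.
  val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
    val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum)
    (word, sum)
  }

  val stateStream = words.mapWithState(StateSpec.function(mappingFunc))
  stateStream.print()
}
```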
Pipeline persistence in Spark ML
Spark ML pipelines can now be persisted and reloaded in a later run. This is very useful, for example, for pipelines that train large models.
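A hedged sketch of what this looks like in 1.6 (the stages and paths are placeholders; in 1.6 only some pipeline stages support save/load):

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Persist the pipeline definition and load it back later.
pipeline.save("/tmp/lr-pipeline")
val reloaded = Pipeline.load("/tmp/lr-pipeline")

// A fitted model can be persisted the same way:
// val model = pipeline.fit(trainingDF)          // trainingDF assumed to exist
// model.save("/tmp/lr-model")
// val sameModel = PipelineModel.load("/tmp/lr-model")
```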