Spark Big Data Platform
2016-01-04 18:37
Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
BDAS, the Berkeley Data Analytics Stack, is an open source software stack that integrates software components being built by the AMPLab to make sense of Big Data.


BlinkDB is a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. It allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars.
Two key ideas behind BlinkDB:
An adaptive optimization framework that builds and maintains a set of multi-dimensional samples from the original data over time.
A dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy and/or response time requirements.
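The idea of answering from a sample and attaching an error bar can be illustrated with a small sketch. This is not BlinkDB's implementation: `Approx`, `approxMean`, and `answerWithin` are hypothetical names, the sample is drawn uniformly with replacement, and the error bar is a plain 95% normal-approximation interval.

```scala
import scala.util.Random

// Approximate answer: a point estimate plus a symmetric error bar.
final case class Approx(estimate: Double, errorBar: Double)

// Estimate the mean of `data` from `sampleSize` rows drawn uniformly with replacement.
// errorBar is the 95% normal-approximation half-width: 1.96 * standard error.
def approxMean(data: IndexedSeq[Double], sampleSize: Int, rng: Random): Approx = {
  require(data.nonEmpty && sampleSize >= 2, "need non-empty data and sampleSize >= 2")
  val sample = Vector.fill(sampleSize)(data(rng.nextInt(data.length)))
  val mean = sample.sum / sampleSize
  val variance = sample.map(x => (x - mean) * (x - mean)).sum / (sampleSize - 1)
  Approx(mean, 1.96 * math.sqrt(variance / sampleSize))
}

// Toy version of sample selection: try the candidate sample sizes from smallest
// to largest and stop at the first answer whose error bar is within `maxError`.
// If none qualifies, the answer from the largest sample is returned.
def answerWithin(data: IndexedSeq[Double], sampleSizes: Seq[Int],
                 maxError: Double, rng: Random): Approx = {
  require(sampleSizes.nonEmpty, "need at least one candidate sample size")
  val sizes = sampleSizes.sorted.iterator
  var answer = approxMean(data, sizes.next(), rng)
  while (answer.errorBar > maxError && sizes.hasNext)
    answer = approxMean(data, sizes.next(), rng)
  answer
}
```

A caller with a tight accuracy requirement passes a small `maxError` and pays for a larger sample; a caller who wants a fast answer passes a loose bound and gets the smallest sample.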


Spark component | vs. | Hadoop-ecosystem counterpart |
---|---|---|
Spark Core | <------> | Apache Hadoop MapReduce |
Spark Streaming | <------> | Apache Storm |
Spark SQL | <------> | Apache Hive |
Spark GraphX | <------> | MPI (taobao) |
Spark MLlib | <------> | Apache Mahout |
Why Spark is fast:
In-memory computing
Directed Acyclic Graph (DAG) engine: the engine can see the whole computation graph in advance, so it can optimize it
Delay scheduling
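Delay scheduling can be sketched as a single decision: when a node frees up, launch a task whose data lives there if one exists; otherwise wait a little for a better slot, and give up on locality only after a threshold. The sketch below is a simplified model, not Spark's scheduler; `Task`, `pickTask`, and the `maxWaitMs` threshold are hypothetical names.

```scala
// Hypothetical model: a task and the nodes that hold its input data.
final case class Task(id: Int, preferredNodes: Set[String])

// Decide which pending task, if any, to launch on a node that just became free.
// waitedMs:  how long the job has already been skipped while waiting for locality.
// maxWaitMs: how long the job is willing to wait before accepting a non-local slot.
def pickTask(freeNode: String, pending: Seq[Task],
             waitedMs: Long, maxWaitMs: Long): Option[Task] = {
  val local = pending.find(_.preferredNodes.contains(freeNode))
  if (local.isDefined) local                          // data-local task: launch now
  else if (waitedMs >= maxWaitMs) pending.headOption  // waited long enough: run non-local
  else None                                           // skip this slot and keep waiting
}
```

Returning `None` is the "delay": the slot is left to other jobs, on the bet that a node holding the data frees up soon.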
Resilient Distributed Dataset
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
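The five properties above map naturally onto an interface. The following is a toy model in plain Scala, not Spark's `RDD` class: `Split`, `ToyRDD`, `SeqRDD`, and `MappedRDD` are hypothetical names, and the partitioner is reduced to an optional key-to-partition function.

```scala
// Hypothetical, simplified stand-ins; not Spark's actual classes.
final case class Split(index: Int)

trait ToyRDD[T] {
  def splits: Seq[Split]                                   // 1. a list of partitions
  def compute(split: Split): Iterator[T]                   // 2. a function for computing each split
  def dependencies: Seq[ToyRDD[_]]                         // 3. dependencies on other RDDs
  def partitioner: Option[Any => Int] = None               // 4. optional partitioner (key => partition)
  def preferredLocations(split: Split): Seq[String] = Nil  // 5. optional preferred locations

  // Compute every split in order and concatenate the results.
  def collect(): Seq[T] = splits.flatMap(s => compute(s).toSeq)
}

// A source dataset: a local sequence dealt round-robin into `numSplits` splits.
final class SeqRDD[T](data: Seq[T], numSplits: Int) extends ToyRDD[T] {
  val splits: Seq[Split] = (0 until numSplits).map(Split(_))
  def compute(split: Split): Iterator[T] =
    data.zipWithIndex.collect { case (x, i) if i % numSplits == split.index => x }.iterator
  val dependencies: Seq[ToyRDD[_]] = Nil
}

// A derived dataset: same splits as its parent, each computed from the parent's split.
final class MappedRDD[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
  def splits: Seq[Split] = parent.splits
  def compute(split: Split): Iterator[B] = parent.compute(split).map(f)
  val dependencies: Seq[ToyRDD[_]] = Seq(parent)
}
```

Because `MappedRDD` stores only its parent and a function, a lost split can be rebuilt by calling `compute` again, which is the sense in which such a dataset is resilient.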
Storage Strategy
class StorageLevel private(
    private var useDisk_ : Boolean,
    private var useMemory_ : Boolean,
    private var deserialized_ : Boolean,
    private var replication_ : Int = 1)

// useDisk = false, useMemory = true, deserialized = true
val MEMORY_ONLY = new StorageLevel(false, true, true)
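To make the flag combinations easier to read, here is a simplified model in plain Scala. It is not Spark's `StorageLevel` class: `Level` and `describe` are hypothetical, and the four values are meant to mirror the flags behind Spark's `MEMORY_ONLY`, `MEMORY_ONLY_SER`, `MEMORY_AND_DISK`, and `DISK_ONLY_2` in the three-flag form shown above.

```scala
// Hypothetical, simplified model of the storage flags; not Spark's StorageLevel.
final case class Level(useDisk: Boolean, useMemory: Boolean,
                       deserialized: Boolean, replication: Int = 1) {
  // Human-readable summary of what the flag combination means.
  def describe: String = {
    val where = (useMemory, useDisk) match {
      case (true, true)   => "memory, spilling to disk"
      case (true, false)  => "memory only"
      case (false, true)  => "disk only"
      case (false, false) => "not stored"
    }
    val form = if (deserialized) "deserialized objects" else "serialized bytes"
    s"$where, $form, $replication replica(s)"
  }
}

val memoryOnly    = Level(useDisk = false, useMemory = true, deserialized = true)
val memoryOnlySer = Level(useDisk = false, useMemory = true, deserialized = false)
val memoryAndDisk = Level(useDisk = true, useMemory = true, deserialized = true)
val diskOnly2     = Level(useDisk = true, useMemory = false, deserialized = false, replication = 2)
```

Keeping objects deserialized makes access faster but uses more memory; storing serialized bytes is the reverse trade.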
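RDD operations: transformations and actions
Lazy evaluation: a transformation only records how to derive a new RDD; nothing is computed until an action asks for a result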
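Lazy evaluation can be demonstrated without Spark. In the toy class below (hypothetical names `LazyDataset` and `fromSeq`, not Spark's API), transformations only wrap the upstream iterator factory in another function, and actions are the only methods that pull data through the pipeline.

```scala
// Hypothetical toy class, not Spark's API.
final class LazyDataset[A](source: () => Iterator[A]) {
  // Transformations: describe the work and return a new dataset; nothing runs yet.
  def map[B](f: A => B): LazyDataset[B] = new LazyDataset[B](() => source().map(f))
  def filter(p: A => Boolean): LazyDataset[A] = new LazyDataset[A](() => source().filter(p))

  // Actions: run the recorded pipeline and return a concrete result.
  def collect(): List[A] = source().toList
  def count(): Int = source().size
}

def fromSeq[A](data: Seq[A]): LazyDataset[A] = new LazyDataset[A](() => data.iterator)

// Usage: count how many times the mapping function actually runs.
var evaluated = 0
val doubledOverFour = fromSeq(1 to 5).map { x => evaluated += 1; x * 2 }.filter(_ > 4)
val evaluatedBeforeAction = evaluated      // still 0: no action has been called
val collected = doubledOverFour.collect()  // the action triggers map and filter
```

Every action re-runs the whole pipeline from the source, which is why caching an intermediate result matters when a dataset is reused.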