Spark Learning Notes: First Look
2015-11-01 15:21
1 Spark website: http://spark.apache.org/
2 Version used for these notes: 1.5.0
Spark architecture, as read from the official documentation
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
As in other distributed systems, a Spark application runs as a set of independent processes across the cluster's nodes. These processes are coordinated by the SparkContext object in the main program, which is called the driver program.
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications.
There are several kinds of cluster managers (Spark's own standalone manager, Mesos, YARN, etc.); their main job is to allocate resources across applications. The SparkContext must first connect to one of them before the application can run on the cluster.
Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
Once connected, Spark acquires executors on the cluster nodes (processes that run computations and store data), then the SparkContext sends the application code to those executors, and finally dispatches tasks for them to run.
Points to note:
1 Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
Different SparkContexts (i.e. different Spark applications) cannot share data with each other unless the data is written to an external storage system.
2 Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
Spark does not care which cluster manager sits underneath it; as long as it can acquire executor processes that can talk to each other, it runs fine even on managers that also host other applications, such as Mesos or YARN.
3 The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port and spark.fileserver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
The driver program listens for and accepts connections from its executors for its entire lifetime, so network connectivity between the driver node and the worker nodes must be maintained throughout, and the driver must be reachable from the workers.
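When a firewall sits between the driver and the workers, the driver-side ports can be pinned in `conf/spark-defaults.conf` so the firewall rules can be opened precisely. The port numbers below are arbitrary examples chosen for illustration, not Spark defaults (by default these ports are chosen randomly):

```
# conf/spark-defaults.conf -- pin driver-side ports (example values)
spark.driver.port        7078
spark.fileserver.port    7079
spark.blockManager.port  7080
```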
4 Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
Ideally the driver machine and the executor machines sit in the same local area network, physically close to each other; for remote use, open an RPC connection to a nearby driver rather than running the driver far from the workers.
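One way to follow this advice when submitting from a remote machine is spark-submit's cluster deploy mode, which launches the driver on a node inside the cluster so driver–executor traffic stays on the cluster's LAN. The master URL, class name, and jar path below are placeholders, not values from these notes:

```shell
# Launch the driver inside the cluster (standalone mode shown).
spark-submit \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --class com.example.Main \
  app.jar
```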