Overview of Spark, YARN, and HDFS
2015-11-27 18:54
Spark is a relatively recent addition to the Hadoop ecosystem. It is an analytics engine and framework capable of running queries up to 100 times faster than traditional MapReduce jobs. In addition to the performance boost, developers can write Spark jobs in Scala, Python, or Java. Spark can load data directly from disk, from memory, and from other data storage
technologies such as Amazon S3, the Hadoop Distributed File System (HDFS), HBase, and Cassandra.
Submitting Spark Jobs
Due to the iterative nature of Spark development, scripts are often written interactively, either as standalone scripts or in a notebook. This results in compact scripts that consist of tight, functional code.
A Spark script can be submitted to a Spark cluster using various methods:
Running the script directly on the head node
Using the acluster submit command from the client
Interactively in an IPython shell or Jupyter Notebook on the cluster
Using the spark-submit script from the client
To run a script on the head node, simply execute python example.py on the cluster. Developing locally on test data and pushing the same analytics scripts to the cluster is a
key feature of Anaconda Cluster. With a cluster created and Spark scripts developed, you can use the acluster submit command to automatically push the script to the head node
and run it on the Spark cluster.
The examples below use the acluster submit command, but any of the above methods can be used to submit a job to the Spark cluster.
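For example, the first two submission methods look like the following (the script name spark_example.py is a placeholder, matching the spark-submit example later in this article):

```shell
# Run the script directly on the head node
python spark_example.py

# Or push the script from the client to the head node and run it on the cluster
acluster submit spark_example.py
```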
Running Spark in Different Modes
Anaconda Cluster can install Spark in standalone mode via the spark-standalone plugin or with the YARN resource manager via the spark-yarn plugin. YARN can be useful when resource management on the cluster is an issue, e.g., when resources need to be shared by many tasks, users, and applications.
Spark scripts can be configured to run in standalone mode:
from pyspark import SparkConf

conf = SparkConf()
conf.setMaster('spark://<HOSTNAME_SPARK_MASTER>:7077')
or with YARN by setting the yarn-client as the master within the script:
conf = SparkConf()
conf.setMaster('yarn-client')
You can also submit jobs with YARN by setting --master yarn-client as an option to the spark-submit command:
spark-submit --master yarn-client spark_example.py
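Putting the pieces together, a minimal spark_example.py might look like the sketch below. The application name and HDFS path are placeholders; the master is deliberately left out of the script so that spark-submit can supply it via the --master option.

```python
# Minimal sketch of spark_example.py; the app name and input path are placeholders.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('line-count')
sc = SparkContext(conf=conf)

# Count the lines of a file previously uploaded to HDFS
lines = sc.textFile('hdfs:///tmp/iris/iris.csv')
print('Number of lines:', lines.count())

sc.stop()
```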
Working with Data in HDFS
Moving data in and around HDFS can be difficult. If you need to move data from your local machine to HDFS, from Amazon S3 to HDFS, from Amazon S3 to Redshift, from HDFS to Hive, and so on, we recommend using odo, which is part of the Blaze ecosystem. Odo efficiently migrates data from the source to the
target through a network of conversions.
Use odo to upload a file:
from odo import odo

# Load local data into HDFS
auth = {'user': 'hdfs', 'port': '14000'}
odo('./iris.csv', 'hdfs://{}:/tmp/iris/iris.csv'.format(HEAD_NODE_IP), **auth)
Use odo to upload URLs:
# Load data from a URL into HDFS
auth = {'user': 'hdfs', 'port': '14000'}
url = 'https://raw.githubusercontent.com/ContinuumIO/blaze/master/blaze/examples/data/iris.csv'
odo(url, 'hdfs://{}:/tmp/iris/iris.csv'.format(HEAD_NODE_IP), **auth)
If you are unfamiliar with Spark and/or SQL, we recommend using Blaze to express selections, aggregations, group-bys, etc.
in a dataframe-like style. Blaze provides Python users with a familiar interface to query data that exists in other data storage systems.
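As a sketch of that dataframe-like style, assuming Blaze is installed and using the iris CSV uploaded above (the column names are assumptions based on the standard iris dataset):

```python
# Sketch of Blaze's dataframe-like interface; column names assume the
# standard iris dataset (sepal_length, species, ...).
from blaze import data, by

iris = data('iris.csv')

# Selection: rows with long sepals
long_sepals = iris[iris.sepal_length > 5.0]

# Group-by with an aggregation: mean sepal length per species
means = by(iris.species, avg_length=iris.sepal_length.mean())
```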