Your First Spark Program
2015-05-16 14:25
In this article, William will build our first Spark program and run it on the Spark cluster we set up in the previous articles.
Java Program
For a Java program, Maven is a good choice for managing dependencies and every step of the build and release process.
Generating a Java Project with Maven
mvn archetype:generate -DgroupId=com.spark.example -DartifactId=JavaSparkPi -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
Maven creates the files for us in the standard directory layout:
JavaSparkPi
|-src
  |-main
    |-java
      |-com
        |-spark
          |-example
            |-App.java
  |-test
    |-java
      |-com
        |-spark
          |-example
            |-AppTest.java
|-pom.xml
Next we take the official Spark example JavaSparkPi.java as our sample program, so we replace the generated App.java with it:
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.spark.example;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

import java.util.ArrayList;
import java.util.List;

/**
 * Computes an approximation of Pi
 * Usage: JavaSparkPi [slices]
 */
public final class JavaSparkPi {

  public static void main(String[] args) throws Exception {
    // Create the SparkConf object
    SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi");
    // Create the JavaSparkContext object
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
    int n = 100000 * slices;
    List<Integer> l = new ArrayList<Integer>(n);
    for (int i = 0; i < n; i++) {
      l.add(i);
    }

    // Distribute the samples across the cluster as an RDD with `slices` partitions
    JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

    // For each element, draw a random point in the square [-1, 1] x [-1, 1]
    // and count how many fall inside the unit circle
    int count = dataSet.map(new Function<Integer, Integer>() {
      @Override
      public Integer call(Integer integer) {
        double x = Math.random() * 2 - 1;
        double y = Math.random() * 2 - 1;
        return (x * x + y * y < 1) ? 1 : 0;
      }
    }).reduce(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer integer, Integer integer2) {
        return integer + integer2;
      }
    });

    System.out.println("Pi is roughly " + 4.0 * count / n);

    jsc.stop();
  }
}
This program computes an approximation of Pi. It takes an optional slices argument (default 2): the larger the value of slices, the more accurate the result, but the heavier the computation. The estimate works because a point drawn uniformly from the square [-1, 1] × [-1, 1] lands inside the unit circle with probability π/4, so 4.0 * count / n converges to π.
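If you compile with Java 8, the same map/reduce pair can also be written with lambda expressions, since Function and Function2 each declare a single call method. The following is only a sketch of that variant (the class name JavaSparkPiLambda is our own, not part of the Spark examples); the anonymous-class version above is what ships with Spark and also works on Java 6/7:

package com.spark.example;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.ArrayList;
import java.util.List;

// Sketch: a Java 8 lambda version of JavaSparkPi (hypothetical class name)
public final class JavaSparkPiLambda {

  public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPiLambda");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
    int n = 100000 * slices;
    List<Integer> l = new ArrayList<Integer>(n);
    for (int i = 0; i < n; i++) {
      l.add(i);
    }

    // Same Monte Carlo estimate as above, with lambdas replacing the
    // anonymous Function/Function2 classes
    int count = jsc.parallelize(l, slices)
        .map(i -> {
          double x = Math.random() * 2 - 1;
          double y = Math.random() * 2 - 1;
          return (x * x + y * y < 1) ? 1 : 0;
        })
        .reduce((a, b) -> a + b);

    System.out.println("Pi is roughly " + 4.0 * count / n);
    jsc.stop();
  }
}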
Configuring Dependencies in the pom
The JavaSparkPi class references classes in the org.apache.spark package, so the dependency must be declared in the pom file. Edit JavaSparkPi/pom.xml and add the following under dependencies:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
</dependency>
We can inspect the dependencies with the mvn dependency:tree command:
mvn dependency:tree -Ddetail=true
[INFO] com.spark.example:JavaSparkPi:jar:1.0-SNAPSHOT
[INFO] +- junit:junit:jar:3.8.1:test
[INFO] \- org.apache.spark:spark-core_2.10:jar:1.2.0:compile
[INFO]    +- com.twitter:chill_2.10:jar:0.5.0:compile
[INFO]    |  \- com.esotericsoftware.kryo:kryo:jar:2.21:compile
[INFO]    |     +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:compile
[INFO]    |     +- com.esotericsoftware.minlog:minlog:jar:1.2:compile
[INFO]    |     \- org.objenesis:objenesis:jar:1.2:compile
[INFO]    +- com.twitter:chill-java:jar:0.5.0:compile
[INFO]    +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO]    |  |  +- commons-cli:commons-cli:jar:1.2:compile
[INFO]    |  |  +- org.apache.commons:commons-math:jar:2.1:compile
[INFO]    |  |  +- xmlenc:xmlenc:jar:0.52:compile
[INFO]    |  |  +- commons-io:commons-io:jar:2.1:compile
[INFO]    |  |  +- commons-logging:commons-logging:jar:1.1.1:compile
[INFO]    |  |  +- commons-lang:commons-lang:jar:2.5:compile
[INFO]    |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO]    |  |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO]    |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
[INFO]    |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO]    |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO]    |  |  +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
[INFO]    |  |  +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
[INFO]    |  |  +- org.apache.avro:avro:jar:1.7.4:compile
[INFO]    |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO]    |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
[INFO]    |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
[INFO]    |  |     \- org.tukaani:xz:jar:1.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
[INFO]    |  |  \- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
[INFO]    |  +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
[INFO]    |  |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
[INFO]    |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
[INFO]    |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
[INFO]    |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
[INFO]    |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
[INFO]    |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
[INFO]    |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
[INFO]    |  |  |  |  |  |  +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
[INFO]    |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
[INFO]    |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
[INFO]    |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
[INFO]    |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
[INFO]    |  |  |  |  |     |     \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
[INFO]    |  |  |  |  |     |        \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
[INFO]    |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
[INFO]    |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
[INFO]    |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
[INFO]    |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
[INFO]    |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO]    |  |  |  |  |  +- asm:asm:jar:3.1:compile
[INFO]    |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
[INFO]    |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
[INFO]    |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO]    |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
[INFO]    |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO]    |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO]    |  |  |  |  |  |     \- javax.activation:activation:jar:1.1:compile
[INFO]    |  |  |  |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
[INFO]    |  |  |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
[INFO]    |  |  |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
[INFO]    |  |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
[INFO]    |  |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
[INFO]    |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
[INFO]    |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
[INFO]    +- org.apache.spark:spark-network-common_2.10:jar:1.2.0:compile
[INFO]    +- org.apache.spark:spark-network-shuffle_2.10:jar:1.2.0:compile
[INFO]    +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO]    |  +- commons-codec:commons-codec:jar:1.3:compile
[INFO]    |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO]    +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO]    |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO]    |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO]    |  +- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO]    |  |  \- jline:jline:jar:0.9.94:compile
[INFO]    |  \- com.google.guava:guava:jar:14.0.1:compile
[INFO]    +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO]    |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO]    |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO]    |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO]    |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO]    |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO]    |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO]    |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO]    +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO]    +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO]    +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO]    |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO]    |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO]    |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO]    |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
[INFO]    +- org.apache.commons:commons-lang3:jar:3.3.2:compile
[INFO]    +- org.apache.commons:commons-math3:jar:3.1.1:compile
[INFO]    +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
[INFO]    +- org.slf4j:slf4j-api:jar:1.7.5:compile
[INFO]    +- org.slf4j:jul-to-slf4j:jar:1.7.5:compile
[INFO]    +- org.slf4j:jcl-over-slf4j:jar:1.7.5:compile
[INFO]    +- log4j:log4j:jar:1.2.17:compile
[INFO]    +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile
[INFO]    +- com.ning:compress-lzf:jar:1.0.0:compile
[INFO]    +- org.xerial.snappy:snappy-java:jar:1.1.1.6:compile
[INFO]    +- net.jpountz.lz4:lz4:jar:1.2.0:compile
[INFO]    +- org.roaringbitmap:RoaringBitmap:jar:0.4.5:compile
[INFO]    +- commons-net:commons-net:jar:2.2:compile
[INFO]    +- org.spark-project.akka:akka-remote_2.10:jar:2.3.4-spark:compile
[INFO]    |  +- org.spark-project.akka:akka-actor_2.10:jar:2.3.4-spark:compile
[INFO]    |  |  \- com.typesafe:config:jar:1.2.1:compile
[INFO]    |  +- io.netty:netty:jar:3.8.0.Final:compile
[INFO]    |  +- org.spark-project.protobuf:protobuf-java:jar:2.5.0-spark:compile
[INFO]    |  \- org.uncommons.maths:uncommons-maths:jar:1.2.2a:compile
[INFO]    +- org.spark-project.akka:akka-slf4j_2.10:jar:2.3.4-spark:compile
[INFO]    +- org.scala-lang:scala-library:jar:2.10.4:compile
[INFO]    +- org.json4s:json4s-jackson_2.10:jar:3.2.10:compile
[INFO]    |  +- org.json4s:json4s-core_2.10:jar:3.2.10:compile
[INFO]    |  |  +- org.json4s:json4s-ast_2.10:jar:3.2.10:compile
[INFO]    |  |  +- com.thoughtworks.paranamer:paranamer:jar:2.6:compile
[INFO]    |  |  \- org.scala-lang:scalap:jar:2.10.0:compile
[INFO]    |  |     \- org.scala-lang:scala-compiler:jar:2.10.0:compile
[INFO]    |  |        \- org.scala-lang:scala-reflect:jar:2.10.0:compile
[INFO]    |  \- com.fasterxml.jackson.core:jackson-databind:jar:2.3.1:compile
[INFO]    |     +- com.fasterxml.jackson.core:jackson-annotations:jar:2.3.0:compile
[INFO]    |     \- com.fasterxml.jackson.core:jackson-core:jar:2.3.1:compile
[INFO]    +- org.apache.mesos:mesos:jar:shaded-protobuf:0.18.1:compile
[INFO]    +- io.netty:netty-all:jar:4.0.23.Final:compile
[INFO]    +- com.clearspring.analytics:stream:jar:2.7.0:compile
[INFO]    +- com.codahale.metrics:metrics-core:jar:3.0.0:compile
[INFO]    +- com.codahale.metrics:metrics-jvm:jar:3.0.0:compile
[INFO]    +- com.codahale.metrics:metrics-json:jar:3.0.0:compile
[INFO]    +- com.codahale.metrics:metrics-graphite:jar:3.0.0:compile
[INFO]    +- org.tachyonproject:tachyon-client:jar:0.5.0:compile
[INFO]    |  \- org.tachyonproject:tachyon:jar:0.5.0:compile
[INFO]    +- org.spark-project:pyrolite:jar:2.0.1:compile
[INFO]    +- net.sf.py4j:py4j:jar:0.8.2.1:compile
[INFO]    \- org.spark-project.spark:unused:jar:1.0.0:compile
Configuring maven-shade-plugin
By default, Maven keeps a local repository folder named .m2 that stores all of the dependency jars, so the package phase only records the dependencies by name and does not bundle their actual contents into the output jar.
When the program runs on a cluster, however, the situation changes: we can hardly guarantee that every dependency jar exists on every worker machine, so we need to build a single fat jar (an uber JAR, also called an assembly JAR) that contains all of the dependencies.
We can achieve this with the maven-shade-plugin. Add the following to the pom:
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
Building the JAR with Maven
mvn clean compile package
After the build succeeds, we can see JavaSparkPi-1.0-SNAPSHOT.jar in the JavaSparkPi/target folder. At 76 MB, it clearly has all of the dependency jars bundled in.
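A quick way to confirm what was bundled is to list the jar's contents with the JDK's jar tool; the grep filter below is just an illustration:

jar tf JavaSparkPi/target/JavaSparkPi-1.0-SNAPSHOT.jar | grep org/apache/spark | head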
Here William is only demonstrating how to build an uber JAR. In fact, the spark-core jar is guaranteed to be present on every machine in the cluster, so there is no need to include it in the packaged jar.
Edit JavaSparkPi/pom.xml and set the scope of the spark-core dependency to provided:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
  <scope>provided</scope>
</dependency>
Run mvn clean compile package again; this time JavaSparkPi-1.0-SNAPSHOT.jar is only about 5 KB and no longer contains the spark-core dependency jars.
Submitting the Program to the Spark Cluster
./bin/spark-submit \
  --class com.spark.example.JavaSparkPi \
  --master spark://192.168.32.130:7077 \
  $HOME/JavaSparkPi/target/JavaSparkPi-1.0-SNAPSHOT.jar \
  1000
Python Program
Python is a scripting language, so deployment is relatively simple and there is no compile step. Using the same Pi-approximation example:

./bin/spark-submit \
  --master spark://192.168.32.130:7077 \
  examples/src/main/python/pi.py \
  1000
Referenced .zip, .egg, and .py files can be added with the --py-files argument.
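For example, if pi.py depended on a local helper module, the submission might look like this (deps.zip and helper.py are hypothetical file names used only for illustration):

./bin/spark-submit \
  --master spark://192.168.32.130:7077 \
  --py-files deps.zip,helper.py \
  examples/src/main/python/pi.py \
  1000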