
The First Spark Program

In this article, William will build our first Spark program and run it on the Spark cluster set up in the previous article.

Java Program

For a Java program, Maven is a good choice for managing dependencies and each step of the build and release process.

Generate a Java project with Maven

mvn archetype:generate -DgroupId=com.spark.example -DartifactId=JavaSparkPi
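
Depending on the Maven version, the command above may prompt interactively for the archetype and its version; a non-interactive variant (a sketch using the standard quickstart archetype) would be:

mvn archetype:generate -DgroupId=com.spark.example -DartifactId=JavaSparkPi \
    -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false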


Maven creates the project files in the standard directory layout:

JavaSparkPi
|-src
  |-main
    |-java
      |-com
        |-spark
          |-example
            |-App.java
  |-test
    |-java
      |-com
        |-spark
          |-example
            |-AppTest.java
|-pom.xml


Next we take the official Spark example JavaSparkPi.java as our sample, so replace the generated App.java with it:


/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements.  See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License.  You may obtain a copy of the License at
*
*    http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package com.spark.example;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

import java.util.ArrayList;
import java.util.List;

/**
 * Computes an approximation to Pi.
 * Usage: JavaSparkPi [slices]
 */
public final class JavaSparkPi {

    public static void main(String[] args) throws Exception {
        // Create the SparkConf object
        SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi");
        // Create the JavaSparkContext object
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
        int n = 100000 * slices;
        List<Integer> l = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++) {
            l.add(i);
        }

        JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

        int count = dataSet.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer integer) {
                double x = Math.random() * 2 - 1;
                double y = Math.random() * 2 - 1;
                return (x * x + y * y < 1) ? 1 : 0;
            }
        }).reduce(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer integer, Integer integer2) {
                return integer + integer2;
            }
        });

        System.out.println("Pi is roughly " + 4.0 * count / n);

        jsc.stop();
    }
}


This program computes an approximation of Pi by Monte Carlo sampling: points are drawn uniformly from the square [-1, 1] x [-1, 1], and the fraction that falls inside the unit circle approaches Pi/4, hence the result 4.0 * count / n. It takes a slices parameter (default 2); the larger slices is, the more sample points are generated, so the result is more accurate but the computation is heavier.
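
To make the Pi/4 argument concrete, here is a minimal plain-Java sketch (not part of the original example and not using Spark) that runs the same Monte Carlo estimate in a single process:

public class LocalPi {
    public static void main(String[] args) {
        int n = 1000000;   // number of random points
        int count = 0;     // points that land inside the unit circle
        for (int i = 0; i < n; i++) {
            double x = Math.random() * 2 - 1;
            double y = Math.random() * 2 - 1;
            if (x * x + y * y < 1) {
                count++;
            }
        }
        // area(circle) / area(square) = Pi/4, so Pi is roughly 4 * count / n
        System.out.println("Pi is roughly " + 4.0 * count / n);
    }
}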

Configure the pom dependencies

The JavaSparkPi class references classes from the org.apache.spark package, so the dependency must be declared in the pom file.

Edit JavaSparkPi/pom.xml and add the following to the dependencies section:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
</dependency>


We can inspect the resolved dependencies with the mvn dependency plugin:

mvn dependency:tree -Ddetail=true


[INFO] com.spark.example:JavaSparkPi:jar:1.0-SNAPSHOT
[INFO] +- junit:junit:jar:3.8.1:test
[INFO] \- org.apache.spark:spark-core_2.10:jar:1.2.0:compile
[INFO]    +- com.twitter:chill_2.10:jar:0.5.0:compile
[INFO]    |  \- com.esotericsoftware.kryo:kryo:jar:2.21:compile
[INFO]    |     +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:compile
[INFO]    |     +- com.esotericsoftware.minlog:minlog:jar:1.2:compile
[INFO]    |     \- org.objenesis:objenesis:jar:1.2:compile
[INFO]    +- com.twitter:chill-java:jar:0.5.0:compile
[INFO]    +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO]    |  |  +- commons-cli:commons-cli:jar:1.2:compile
[INFO]    |  |  +- org.apache.commons:commons-math:jar:2.1:compile
[INFO]    |  |  +- xmlenc:xmlenc:jar:0.52:compile
[INFO]    |  |  +- commons-io:commons-io:jar:2.1:compile
[INFO]    |  |  +- commons-logging:commons-logging:jar:1.1.1:compile
[INFO]    |  |  +- commons-lang:commons-lang:jar:2.5:compile
[INFO]    |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO]    |  |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO]    |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
[INFO]    |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO]    |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO]    |  |  +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
[INFO]    |  |  +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
[INFO]    |  |  +- org.apache.avro:avro:jar:1.7.4:compile
[INFO]    |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO]    |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
[INFO]    |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
[INFO]    |  |     \- org.tukaani:xz:jar:1.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
[INFO]    |  |  \- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
[INFO]    |  +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
[INFO]    |  |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
[INFO]    |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
[INFO]    |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
[INFO]    |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
[INFO]    |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
[INFO]    |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
[INFO]    |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
[INFO]    |  |  |  |  |  |  +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
[INFO]    |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
[INFO]    |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
[INFO]    |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
[INFO]    |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
[INFO]    |  |  |  |  |     |     \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
[INFO]    |  |  |  |  |     |        \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
[INFO]    |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
[INFO]    |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
[INFO]    |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
[INFO]    |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
[INFO]    |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO]    |  |  |  |  |  +- asm:asm:jar:3.1:compile
[INFO]    |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
[INFO]    |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
[INFO]    |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO]    |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
[INFO]    |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO]    |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO]    |  |  |  |  |  |     \- javax.activation:activation:jar:1.1:compile
[INFO]    |  |  |  |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
[INFO]    |  |  |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
[INFO]    |  |  |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
[INFO]    |  |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
[INFO]    |  |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
[INFO]    |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
[INFO]    |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
[INFO]    +- org.apache.spark:spark-network-common_2.10:jar:1.2.0:compile
[INFO]    +- org.apache.spark:spark-network-shuffle_2.10:jar:1.2.0:compile
[INFO]    +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO]    |  +- commons-codec:commons-codec:jar:1.3:compile
[INFO]    |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO]    +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO]    |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO]    |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO]    |  +- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO]    |  |  \- jline:jline:jar:0.9.94:compile
[INFO]    |  \- com.google.guava:guava:jar:14.0.1:compile
[INFO]    +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO]    |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO]    |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO]    |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO]    |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO]    |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO]    |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO]    |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO]    +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO]    +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO]    +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO]    |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO]    |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO]    |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO]    |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
[INFO]    +- org.apache.commons:commons-lang3:jar:3.3.2:compile
[INFO]    +- org.apache.commons:commons-math3:jar:3.1.1:compile
[INFO]    +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
[INFO]    +- org.slf4j:slf4j-api:jar:1.7.5:compile
[INFO]    +- org.slf4j:jul-to-slf4j:jar:1.7.5:compile
[INFO]    +- org.slf4j:jcl-over-slf4j:jar:1.7.5:compile
[INFO]    +- log4j:log4j:jar:1.2.17:compile
[INFO]    +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile
[INFO]    +- com.ning:compress-lzf:jar:1.0.0:compile
[INFO]    +- org.xerial.snappy:snappy-java:jar:1.1.1.6:compile
[INFO]    +- net.jpountz.lz4:lz4:jar:1.2.0:compile
[INFO]    +- org.roaringbitmap:RoaringBitmap:jar:0.4.5:compile
[INFO]    +- commons-net:commons-net:jar:2.2:compile
[INFO]    +- org.spark-project.akka:akka-remote_2.10:jar:2.3.4-spark:compile
[INFO]    |  +- org.spark-project.akka:akka-actor_2.10:jar:2.3.4-spark:compile
[INFO]    |  |  \- com.typesafe:config:jar:1.2.1:compile
[INFO]    |  +- io.netty:netty:jar:3.8.0.Final:compile
[INFO]    |  +- org.spark-project.protobuf:protobuf-java:jar:2.5.0-spark:compile
[INFO]    |  \- org.uncommons.maths:uncommons-maths:jar:1.2.2a:compile
[INFO]    +- org.spark-project.akka:akka-slf4j_2.10:jar:2.3.4-spark:compile
[INFO]    +- org.scala-lang:scala-library:jar:2.10.4:compile
[INFO]    +- org.json4s:json4s-jackson_2.10:jar:3.2.10:compile
[INFO]    |  +- org.json4s:json4s-core_2.10:jar:3.2.10:compile
[INFO]    |  |  +- org.json4s:json4s-ast_2.10:jar:3.2.10:compile
[INFO]    |  |  +- com.thoughtworks.paranamer:paranamer:jar:2.6:compile
[INFO]    |  |  \- org.scala-lang:scalap:jar:2.10.0:compile
[INFO]    |  |     \- org.scala-lang:scala-compiler:jar:2.10.0:compile
[INFO]    |  |        \- org.scala-lang:scala-reflect:jar:2.10.0:compile
[INFO]    |  \- com.fasterxml.jackson.core:jackson-databind:jar:2.3.1:compile
[INFO]    |     +- com.fasterxml.jackson.core:jackson-annotations:jar:2.3.0:compile
[INFO]    |     \- com.fasterxml.jackson.core:jackson-core:jar:2.3.1:compile
[INFO]    +- org.apache.mesos:mesos:jar:shaded-protobuf:0.18.1:compile
[INFO]    +- io.netty:netty-all:jar:4.0.23.Final:compile
[INFO]    +- com.clearspring.analytics:stream:jar:2.7.0:compile
[INFO]    +- com.codahale.metrics:metrics-core:jar:3.0.0:compile
[INFO]    +- com.codahale.metrics:metrics-jvm:jar:3.0.0:compile
[INFO]    +- com.codahale.metrics:metrics-json:jar:3.0.0:compile
[INFO]    +- com.codahale.metrics:metrics-graphite:jar:3.0.0:compile
[INFO]    +- org.tachyonproject:tachyon-client:jar:0.5.0:compile
[INFO]    |  \- org.tachyonproject:tachyon:jar:0.5.0:compile
[INFO]    +- org.spark-project:pyrolite:jar:2.0.1:compile
[INFO]    +- net.sf.py4j:py4j:jar:0.8.2.1:compile
[INFO]    \- org.spark-project.spark:unused:jar:1.0.0:compile


Configure maven-shade-plugin

By default, Maven keeps every dependency jar in a local repository folder named .m2. The package phase therefore produces a jar that contains only our own classes and simply references the dependencies by name; their actual contents are not bundled in.

Things are different when the program runs on a cluster: we can hardly guarantee that every dependency jar is present on all worker hosts, so we need to build one big jar (an uber JAR, also called an assembly JAR) that contains all of the dependencies.

This can be done with the maven-shade-plugin. Add the following to the pom:

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
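
One practical note not covered in the original article: if any dependency jar is signed, the merged uber JAR can fail at launch with an invalid-signature error. A common remedy, sketched below, is to let the shade plugin strip the signature files; this <configuration> block would go inside the <plugin> element above:

<configuration>
  <filters>
    <filter>
      <!-- strip signature files copied from signed dependency jars -->
      <artifact>*:*</artifact>
      <excludes>
        <exclude>META-INF/*.SF</exclude>
        <exclude>META-INF/*.DSA</exclude>
        <exclude>META-INF/*.RSA</exclude>
      </excludes>
    </filter>
  </filters>
</configuration>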


Build the JAR with Maven

mvn clean compile package


After a successful build, a JavaSparkPi-1.0-SNAPSHOT.jar file appears under the JavaSparkPi/target folder. It is about 76 MB, which shows that all of the dependency jars have indeed been bundled in.

Here William only wanted to demonstrate how to build an uber JAR; in fact the spark-core jar is guaranteed to be present on every host of the cluster, so there is no need to include it in the packaged jar.

Edit JavaSparkPi/pom.xml and set the scope of the spark-core dependency to provided:


<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
  <scope>provided</scope>
</dependency>


Run mvn clean compile package again; this time JavaSparkPi-1.0-SNAPSHOT.jar is only about 5 KB, since the spark-core dependency jars are no longer bundled in.
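
To confirm what actually ended up in the jar (a quick check, assuming a standard JDK is on the PATH), list its contents:

jar tf JavaSparkPi/target/JavaSparkPi-1.0-SNAPSHOT.jar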


Submit the program to the Spark cluster

./bin/spark-submit --class com.spark.example.JavaSparkPi --master spark://192.168.32.130:7077 $HOME/JavaSparkPi/target/JavaSparkPi-1.0-SNAPSHOT.jar 1000
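
Before submitting to the cluster, it can be useful to sanity-check the jar in local mode (a sketch; local[2] and the small slice count are only for a quick test):

./bin/spark-submit --class com.spark.example.JavaSparkPi --master local[2] $HOME/JavaSparkPi/target/JavaSparkPi-1.0-SNAPSHOT.jar 10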


Python Program

Python is a scripting language, so deployment is relatively simple: there is no compilation step. We again use the example program that approximates Pi:

./bin/spark-submit \
--master spark://192.168.32.130:7077 \
examples/src/main/python/pi.py \
1000


Any .zip, .egg, or .py files that the program depends on can be added with the --py-files option.
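
For example (a sketch; deps.zip stands for a hypothetical archive of the Python dependencies your script imports):

./bin/spark-submit \
--master spark://192.168.32.130:7077 \
--py-files deps.zip \
examples/src/main/python/pi.py \
1000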