Spark WordCount in Scala and Python
Word count program
Scala implementation
1. Install the Scala plugin in IDEA.
2. Create a Maven project and add the Scala SDK. pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cn.spark.learn</groupId>
<artifactId>spark-learn</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.10.6</scala.version>
<scala.compat.version>2.10</scala.compat.version>
</properties>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.5.2</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.2</version>
</dependency>
</dependencies>
<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-make:transitive</arg>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.18.1</version>
<configuration>
<useFile>false</useFile>
<disableXmlReport>true</disableXmlReport>
<includes>
<include>**/*Test.*</include>
<include>**/*Suite.*</include>
</includes>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>cn.itcast.spark.WordCount</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
3. Code implementation
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
object WordCount {
def main(args: Array[String]): Unit = {
// Create the conf; set the application name and run mode. local[2] means local mode with two threads, which produces two result partitions
val conf = new SparkConf().setAppName("wordcount").setMaster("local[2]")
// Create the SparkContext
val sc = new SparkContext(conf)
// Start the computation
// textFile reads the input data from HDFS
val file: RDD[String] = sc.textFile("hdfs://mini1:9000/words.txt")
// Flatten: split each line into individual words
val words: RDD[String] = file.flatMap(_.split(" "))
// Map each word to a (word, 1) pair
val tuple: RDD[(String, Int)] = words.map((_, 1))
// Sum the counts for each word
val result: RDD[(String, Int)] = tuple.reduceByKey(_ + _)
// Sort by count in descending order
val resultBy: RDD[(String, Int)] = result.sortBy(_._2, false)
// Print the result
resultBy.foreach(println)
}
}
The output printed to the console may not appear sorted. The reason is that local[2] starts two threads and produces two result partitions (two output files), each printed independently; local[1] produces a single partition, which amounts to a global sort.
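If a globally ordered listing is needed regardless of the thread count, a minimal sketch looks like this (it reuses the resultBy RDD from the program above; the output path is a hypothetical example):
// Collect the sorted result to the driver and print it in order
// (only safe when the result is small enough to fit in driver memory).
resultBy.collect().foreach(println)
// Or merge into a single output file: coalesce(1) concatenates the
// range-partitioned output of sortBy in order, without a shuffle.
resultBy.coalesce(1).saveAsTextFile("hdfs://mini1:9000/wordcount-sorted/")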
4. Submit to the cluster for execution
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
object WordCount {
def main(args: Array[String]): Unit = {
// Create the conf and set the application name. In local mode, local[2] would run two threads and produce two result partitions:
// val conf = new SparkConf().setAppName("wordcount").setMaster("local[2]")
// For cluster execution, do not hard-code the master; it is supplied to spark-submit instead
val conf = new SparkConf().setAppName("wordcount")
// Create the SparkContext
val sc = new SparkContext(conf)
// Start the computation
// textFile reads the input from the path given as the first argument
val file: RDD[String] = sc.textFile(args(0))
// Flatten: split each line into individual words
val words: RDD[String] = file.flatMap(_.split(" "))
// Map each word to a (word, 1) pair
val tuple: RDD[(String, Int)] = words.map((_, 1))
// Sum the counts for each word
val result: RDD[(String, Int)] = tuple.reduceByKey(_ + _)
// Sort by count in descending order
val resultBy: RDD[(String, Int)] = result.sortBy(_._2, false)
// Instead of printing, save the result to the output path given as the second argument
// resultBy.foreach(println)
resultBy.saveAsTextFile(args(1))
}
}
Package the project with IDEA and upload the jar to any machine in the cluster.
Submit the job:
spark-submit --master spark://mini1:7077 --class cn.itcast.spark.WordCount original-spark-learn-1.0-SNAPSHOT.jar hdfs://mini1:9000/words.txt hdfs://mini1:9000/ceshi-scala/
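A note on the jar name: maven-shade-plugin renames the unshaded build to original-<artifactId>-<version>.jar and publishes the shaded, dependency-bundled jar under the plain name; since the cluster already provides the Spark classes, the slim original jar is sufficient here. To verify the result, the output directory can be read back with, for example, hdfs dfs -cat hdfs://mini1:9000/ceshi-scala/part-*.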
Python implementation
#!/usr/bin/python
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("pywordCount").setMaster("spark://mini1:7077")
sc = SparkContext(conf=conf)

sc.textFile("hdfs://mini1:9000/words.txt") \
    .flatMap(lambda a: a.split(" ")) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .saveAsTextFile("hdfs://mini1:9000/wordcount/ceshi/")
spark-submit wordcount.py
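One caveat, based on Spark's configuration precedence: because setMaster is hard-coded in the script, it overrides any --master flag passed to spark-submit. Dropping setMaster from the SparkConf and running spark-submit --master spark://mini1:7077 wordcount.py instead keeps the script portable across clusters.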