
Writing a Spark Word Count in IDEA

2018-04-07 20:38
1. Create a new project


2. Select a Maven project, then click Next


3. Fill in the Maven GAV (groupId, artifactId, version), then click Next



4. Enter the project name, then click Finish
5. Once the Maven project is created, click Enable Auto-Import



If you cannot create Scala classes after creating the Maven project, follow the steps below



Make sure to match your own Scala version number



Note: remember to check the options below, otherwise the jar cannot be built







This option does not need to be selected, because the default c:\\user\..\.m2\settings.xml is used


6. Configure the Maven pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>cn.bw.spark</groupId>
<artifactId>WordCount2</artifactId>
<version>2.0</version>

<properties>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
<encoding>UTF-8</encoding>
<scala.version>2.10.6</scala.version>
<scala.compat.version>2.10</scala.compat.version>
</properties>

<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>${scala.version}</version>
</dependency>

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.6.1</version>
</dependency>

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.10</artifactId>
<version>1.6.1</version>
</dependency>

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.6.4</version>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>
</dependencies>

<build>
<sourceDirectory>src/main/scala</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<arg>-make:transitive</arg>
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.18.1</version>
<configuration>
<useFile>false</useFile>
<disableXmlReport>true</disableXmlReport>
<includes>
<include>**/*Test.*</include>
<include>**/*Suite.*</include>
</includes>
</configuration>
</plugin>

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>cn.bw.spark.WordCount</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>

</project>
7. Rename src/main/java and src/test/java to src/main/scala and src/test/scala respectively, so they match the configuration in pom.xml
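Step 7 can also be done from a shell in the project root; a sketch (the rmdir calls assume the default java directories are still empty):

```shell
# Create the scala source directories that pom.xml expects
mkdir -p src/main/scala src/test/scala
# Remove the default java directories (rmdir only succeeds if they are empty)
rmdir src/main/java src/test/java 2>/dev/null || true
```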





8. Create a new Scala class, with type Object



9. Write the Spark program
package cn.bw.spark

import org.apache.spark.{SparkContext, SparkConf}

object WordCount {
  def main(args: Array[String]) {
    // Create a SparkConf and set the application name
    val conf = new SparkConf().setAppName("WC")
    // Create the SparkContext, the entry point for submitting a Spark app
    val sc = new SparkContext(conf)
    // Use sc to create an RDD and run the transformations and action:
    // split lines into words, count each word, sort by count descending,
    // and save the result
    sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _, 1)
      .sortBy(_._2, false)
      .saveAsTextFile(args(1))
    // Stop sc to end the job
    sc.stop()
  }
}
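The pipeline in step 9 can be checked without a cluster, since the same transformations exist on plain Scala collections. A minimal sketch (the WordCountLocal object and its sample input are illustrative, not part of the project):

```scala
// A local, Spark-free sketch of the same word-count logic, using plain
// Scala collections instead of an RDD.
object WordCountLocal {
  def wordCount(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(_.split(" "))                  // split each line into words
      .groupBy(identity)                      // group identical words
      .map { case (w, ws) => (w, ws.size) }   // count each group
      .toSeq
      .sortBy(-_._2)                          // descending count, like sortBy(_._2, false)

  def main(args: Array[String]): Unit = {
    println(wordCount(Seq("a b a", "b a")))   // → List((a,3), (b,2))
  }
}
```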

10. Package with Maven: first make sure the main class in pom.xml is correct





Click the Maven Projects panel on the right side of IDEA



Expand Lifecycle, select clean and package, then click Run Maven Build



11. Take the successfully built jar and upload it to a node in the Spark cluster



12. First start HDFS and the Spark cluster
13. Submit the Spark application with spark-submit (note the order of the arguments)
# Run the spark-submit script from the Spark installation's bin directory.
# --class is the fully qualified name of your main class;
# --executor-memory sets the memory per executor and
# --total-executor-cores the number of cores.
# The positional arguments are, in order: the path to the jar,
# the input file path, and the output path.
/usr/local/spark-1.5.2-bin-hadoop2.6/bin/spark-submit \
--class cn.bw.spark.WordCount \
--master spark://node1.bw.cn:7077 \
--executor-memory 512m \
--total-executor-cores 1 \
/root/spark-mvn-1.0-SNAPSHOT.jar \
hdfs://node1.bw.cn:9000/words.txt \
hdfs://node1.bw.cn:9000/out
Check the result of the job:
hdfs dfs -cat hdfs://node1.bw.cn:9000/out/part-00000
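Because the program saves an RDD[(String, Int)], each line of part-00000 is a tuple rendered as (word,count). A small hypothetical helper (not part of the original post) that parses such a line back into a pair:

```scala
// Hypothetical helper: parse one "(word,count)" line as written by
// saveAsTextFile for an RDD[(String, Int)].
object OutputLine {
  def parse(line: String): (String, Int) = {
    val inner = line.stripPrefix("(").stripSuffix(")")
    val comma = inner.lastIndexOf(',')        // last comma separates the count
    (inner.take(comma), inner.drop(comma + 1).trim.toInt)
  }
}
```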


                                            
Tags: spark, idea