Distributed configure (hadoop 2.7.2 & spark 2.1.0)
2017-04-26 14:44
1. environment
2. configure details
2.1 download the required software
2.1.1 configure the requirements
2.1.1.1 java8 environment
2.1.1.2 scala-2.11.8 environment
2.1.1.3 sbt-0.13.15 environment
2.1.1.4 maven-3.3.9 environment
2.2 configure the distributed system
2.2.1 configure Hadoop
2.2.1.1 compile Hadoop
2.2.1.2 configure Hadoop in Standalone Operation
2.2.1.3 configure Hadoop in Pseudo-Distributed Operation
2.2.1.4 configure Hadoop in Fully-Distributed Operation
2.2.1.5 Test the calculation of hadoop distributed
2.2.2 configure spark
2.2.2.1 compile spark
2.2.2.2 configure spark distributed in hadoop
2.3 develop spark in IDEA(intellij)
2.3.1 configure java environment in windows
2.3.2 configure build.sbt
2.3.3 example code
1. environment
Hadoop 2.7.2, Spark 2.1.0, Scala 2.11.8, sbt 0.13.15, Java 1.8, Maven 3.3.9, protobuf 2.5.0, findbugs 2.0.2
2. configure details
2.1 download the required software
download the hadoop source code from https://dist.apache.org/repos/dist/release/hadoop/common/
download the spark source code from http://spark.apache.org/downloads.html
download scala from http://www.scala-lang.org/download/all.html
download sbt from http://www.scala-sbt.org/download.html
download java from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
download maven from http://maven.apache.org/download.cgi
download protobuf from https://github.com/google/protobuf/tree/master/src
download findbugs from https://sourceforge.net/projects/findbugs/?source=typ_redirect
2.1.1 configure the requirements
2.1.1.1 java8 environment
First, remove any Java environment already present on the system.
```
# list all installed java packages
rpm -qa | grep java
# then remove each of them
rpm -e --nodeps XXXXX   # XXXXX is each package printed by 'rpm -qa | grep java'
# upload jdk-8u131-linux-x64.tar.gz, which can be downloaded from the Oracle website
tar -zxvf jdk-8u131-linux-x64.tar.gz
vim /etc/profile
# add these lines to the file (mind the path):
JAVA_HOME=/usr/local/java/jdk1.8.0_131
JRE_HOME=$JAVA_HOME/jre
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export PATH JAVA_HOME CLASSPATH
source /etc/profile
```
Type java -version and javac in the console; if you see messages like these, Java is installed successfully:
```
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Usage: javac <options> <source files>
where possible options include:
  -g                         Generate all debugging info
  -g:none                    Generate no debugging info
  -g:{lines,vars,source}     Generate only some debugging info
  -nowarn                    Generate no warnings
  -verbose                   Output messages about what the compiler is doing
  -deprecation               Output source locations where deprecated APIs are used
  -classpath <path>          Specify where to find user class files and annotation processors
  -cp <path>                 Specify where to find user class files and annotation processors
  -sourcepath <path>         Specify where to find input source files
  -bootclasspath <path>      Override location of bootstrap class files
  -extdirs <dirs>            Override location of installed extensions
  -endorseddirs <dirs>       Override location of endorsed standards path
```
2.1.1.2 scala-2.11.8 environment
```
tar -zxvf scala-2.11.8.tgz
vim /etc/profile
# add the following lines to the file:
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
source /etc/profile
```
Type scala -version in the console; if this message appears, Scala is installed successfully:

```
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
```
2.1.1.3 sbt-0.13.15 environment
```
curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
yum install sbt
sbt   # the first run downloads sbt's own dependencies
```
Type sbt sbtVersion in the console; if these messages appear, sbt is installed successfully:
```
[info] Set current project to sbt (in build file:/usr/local/sbt/)
[info] 0.13.15
```
2.1.1.4 maven-3.3.9 environment
```
tar -zxvf apache-maven-3.3.9-bin.tar.gz
vim /etc/profile
# add the following lines to the file:
export MAVEN_HOME=/usr/local/maven/apache-maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin
source /etc/profile
```
Type mvn -v in the console; if these messages appear, Maven is installed successfully:
```
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T11:41:47-05:00)
Maven home: /usr/local/maven/apache-maven-3.3.9
Java version: 1.8.0_131, vendor: Oracle Corporation
Java home: /usr/local/java/jdk1.8.0_131/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-573.el6.x86_64", arch: "amd64", family: "unix"
```
2.2 configure the distributed system
2.2.1 configure Hadoop
2.2.1.1 compile Hadoop
First, uncompress hadoop-2.7.2-src.tar.gz:
```
tar -zxvf hadoop-2.7.2-src.tar.gz
```
Second, download and install Maven and protobuf.
```
tar -zxvf apache-maven-3.3.9-bin.tar.gz
cd apache-maven-3.3.9
vim /etc/profile
export MAVEN_HOME=/your directory/apache-maven-3.3.9
export PATH=.:$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin
source /etc/profile
ln -s /your directory/apache-maven-3.3.9/bin/mvn /usr/bin/mvn

tar -zxvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix=/your directory/protobuf-2.5.0
make
make install
vim /etc/profile
# add the following lines to the file:
export PATH=$PATH:/usr/local/protobuf/protobuf-2.5.0/bin/
export PKG_CONFIG_PATH=/usr/local/protobuf/protobuf-2.5.0/lib/pkgconfig/
source /etc/profile
```
Third, download the Hadoop dependencies with Maven and compile:
```
cd hadoop-2.7.2-src
mvn clean package -Pdist,native -DskipTests -Dtar
```
While compiling the Hadoop source code, an error like this may appear:
```
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (dist) on project hadoop-dist: An Ant BuildException has occured: exec returned: 1
[ERROR] around Ant part ...<exec failonerror="true" dir="/usr/local/hadoop/hadoop-2.7.2-src/hadoop-dist/target" executable="sh">... @ 38:104 in /usr/local/hadoop/hadoop-2.7.2-src/hadoop-dist/target/antrun/build-main.xml
```
In that case, install these four dependencies:

cmake: `yum install cmake` (on CentOS)

findbugs: download it from https://sourceforge.net/projects/findbugs/?source=typ_redirect, uncompress it, and then:

```
yum install ant
unzip findbugs-2.0.2-source.zip
cd /your findbugs directory
ant
export FINDBUGS_HOME=/usr/local/findbugs/findbugs-2.0.2
```

openssl-dev: `yum install openssl-devel` (on CentOS)

zlib-dev: `yum install zlib-devel` (on CentOS)
Finally, when the following messages appear, Hadoop has compiled successfully:
```
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 21:45 min
[INFO] Finished at: 2017-04-24T19:20:15+08:00
[INFO] Final Memory: 119M/419M
[INFO] ------------------------------------------------------------------------
```
Now we can get the compiled code from:
/hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2.tar.gz
2.2.1.2 configure Hadoop in Standalone Operation
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. And we can test it right now:
```
mkdir distributed_input
cp /your directory/hadoop-2.7.2/etc/hadoop/*.xml distributed_input
/your directory/hadoop-2.7.2/bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /usr/local/distributed_input /usr/local/distributed_output 'dfs[a-z.]+'
cat distributed_output/*
```
After a few seconds, these messages will appear in the console:
```
File System Counters
    FILE: Number of bytes read=1153568
    FILE: Number of bytes written=2210810
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
Map-Reduce Framework
    Map input records=1
    Map output records=1
    Map output bytes=17
    Map output materialized bytes=25
    Input split bytes=134
    Combine input records=0
    Combine output records=0
    Reduce input groups=1
    Reduce shuffle bytes=25
    Reduce input records=1
    Reduce output records=1
    Spilled Records=2
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=0
    Total committed heap usage (bytes)=638582784
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=123
File Output Format Counters
    Bytes Written=23
```
That means we have run the example successfully!
To invoke the hadoop command directly, configure the profile like this:
```
vim /etc/profile
export HADOOP_HOME=/your directory/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin
source /etc/profile
```
2.2.1.3 configure Hadoop in Pseudo-Distributed Operation
Pseudo-distributed mode also runs on a single node, but each Hadoop daemon runs in its own Java process.
Several files need to be modified:
hosts, ssh, network, ifcfg-eth0, resolv.conf
```
# modify the hosts file
vim /etc/hosts
# add the master's and workers' IPs and hostnames; here is my example:
127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.159.132 master
192.168.159.134 worker2
192.168.159.133 worker3

# modify the hostname
vim /etc/sysconfig/network
# add the proper lines; here is my example:
NETWORKING=yes
HOSTNAME=master   # on a worker, set HOSTNAME to that worker's name
# restart the network
service network restart

# set up ssh keys, because every machine must be able to connect to the others without a password
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
# do the above on each machine, then collect the keys on the master and send the merged file to all workers
# log in to worker2, then:
ssh-copy-id -i master   # copy worker2's key to the master
# log in to worker3, then:
ssh-copy-id -i master   # copy worker3's key to the master
# on the master:
scp /root/.ssh/authorized_keys worker2:/root/.ssh/   # send the merged keys to worker2
scp /root/.ssh/authorized_keys worker3:/root/.ssh/   # send the merged keys to worker3

# make the directories for hadoop files/hdfs/logs and so on; the directory tree looks like:
# distribute_data
# ├── hadoop
# │   ├── data
# │   ├── hdfs
# │   ├── logs
# │   ├── name
# │   └── temp
# └── spark

####################################################################################################
# The steps below are optional.
####################################################################################################
# modify the network adapter
vim /etc/sysconfig/network-scripts/ifcfg-eth0
# add the following lines:
DEVICE=eth0
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=yes
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
NAME="System eth0"
HWADDR=00:02:c9:03:00:31:78:f2
PEERDNS=yes
PEERROUTES=yes
IPADDR=192.168.159.130
NETMASK=255.255.255.0
GATEWAY=192.168.159.2
DNS1=100.100.100.1

# modify the DNS
vim /etc/resolv.conf
# add the following lines (they depend on your system and network):
nameserver 100.100.100.1
nameserver 114.114.114.114
nameserver 8.8.8.8
```
etc/hadoop/hadoop-env.sh
```
# modify the JAVA_HOME variable
export JAVA_HOME=/your JAVA_HOME directory
# here is mine:
# export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.121-0.b13.el7_3.x86_64
```
etc/hadoop/slaves
```
# add the hostnames of all worker machines to the slaves file
worker2
worker3
```
etc/hadoop/core-site.xml
```
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <!-- Size of read/write buffer used in SequenceFiles. -->
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <!-- hadoop temp directory; it depends on you -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/distribute_data/hadoop/temp</value>
  </property>
</configuration>
```
etc/hadoop/hdfs-site.xml
```
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>master:50090</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/distribute_data/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/distribute_data/hadoop/hdfs/data</value>
  </property>
</configuration>
```
etc/hadoop/mapred-site.xml
```
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
  </property>
</configuration>
```
etc/hadoop/yarn-site.xml
```
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
</configuration>
```
After modifying the slaves / core-site.xml / hdfs-site.xml / mapred-site.xml / yarn-site.xml files on the master, copy them to the workers, like this:
```
scp core-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp core-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp hdfs-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp hdfs-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp yarn-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp yarn-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
# copy the edited mapred-site.xml (created from mapred-site.xml.template), not the unedited template
scp mapred-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp mapred-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
```
Finally, format the HDFS file system and start the cluster.
```
cd /your directory/hadoop-2.7.2
./bin/hdfs namenode -format
```
If these messages come up, the file system was formatted successfully:
```
17/04/26 16:02:20 INFO common.Storage: Storage directory /usr/local/distribute_data/hadoop/hdfs/name has been successfully formatted.
17/04/26 16:02:20 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
17/04/26 16:02:20 INFO util.ExitUtil: Exiting with status 0
17/04/26 16:02:20 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/192.168.159.132
************************************************************/
```
Then start the cluster:
```
./sbin/start-all.sh   # the hadoop team actually recommends using start-dfs.sh and start-yarn.sh instead
```
If these messages come up, the cluster started successfully:
```
Starting namenodes on [master]
master: starting namenode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-namenode-master.out
localhost: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-datanode-master.out
worker2: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-datanode-worker2.out
worker3: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-datanode-worker3.out
Starting secondary namenodes [master]
master: starting secondarynamenode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-resourcemanager-master.out
worker2: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-nodemanager-worker2.out
localhost: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-nodemanager-master.out
worker3: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-nodemanager-worker3.out
```
Type jps in the master's console; we can see:
```
18544 NodeManager
17540 NameNode
17764 DataNode
18437 ResourceManager
19557 Jps
18092 SecondaryNameNode
```
2.2.1.4 configure Hadoop in Fully-Distributed Operation
Fully-distributed mode is very similar to pseudo-distributed mode; in fact, the configuration steps are the same as above.
2.2.1.5 Test the calculation of hadoop distributed.
```
cp /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/*.xml /usr/local/distribute_data/hadoop/data/
/your directory/hadoop-2.7.2/bin/hdfs dfs -mkdir /in
hdfs dfs -put /usr/local/distribute_data/hadoop/data/* /in
# run the same grep example as in standalone mode, this time against HDFS, and inspect the result:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /in /out 'dfs[a-z.]+'
hdfs dfs -cat /out/*
```
2.2.2 configure spark
2.2.2.1 compile spark
First, add MAVEN_OPTS to the profile; this helps avoid heap-space errors during the build.
```
vim /etc/profile
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
source /etc/profile
```
Second, uncompress the Spark source code, download the Spark dependencies with Maven, and compile it.
```
tar -zxvf spark-2.1.0.tgz
cd spark-2.1.0
# declare the hadoop version; it must match the version installed above
./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.2 -DskipTests clean package
# this takes a while
```
If this error message appears:
```
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 20.189 s
[INFO] Finished at: 2017-04-27T14:58:14+08:00
[INFO] Final Memory: 41M/211M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-tags_2.11: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: CompileFailed -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :spark-tags_2.11
```
make sure the right versions of scala-2.11.8 and maven-3.3.9 are installed, then reboot the system and run the same command again; in practice a reboot fixes this error.
If these messages appear, the Spark libraries have been downloaded and compiled successfully:
```
[INFO] Spark Project Parent POM ........................... SUCCESS [ 13.441 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 17.570 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 17.121 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 22.711 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 14.905 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 22.380 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 22.645 s]
[INFO] Spark Project Core ................................. SUCCESS [10:59 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [04:20 min]
[INFO] Spark Project GraphX ............................... SUCCESS [02:17 min]
[INFO] Spark Project Streaming ............................ SUCCESS [04:35 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [08:20 min]
[INFO] Spark Project SQL .................................. SUCCESS [14:59 min]
[INFO] Spark Project ML Library ........................... SUCCESS [09:13 min]
[INFO] Spark Project Tools ................................ SUCCESS [01:12 min]
[INFO] Spark Project Hive ................................. SUCCESS [11:40 min]
[INFO] Spark Project REPL ................................. SUCCESS [03:23 min]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [02:26 min]
[INFO] Spark Project YARN ................................. SUCCESS [05:34 min]
[INFO] Spark Project Assembly ............................. SUCCESS [01:25 min]
[INFO] Spark Project External Flume Sink .................. SUCCESS [02:42 min]
[INFO] Spark Project External Flume ....................... SUCCESS [03:03 min]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [ 40.208 s]
[INFO] Spark Integration for Kafka 0.8 .................... SUCCESS [02:38 min]
[INFO] Spark Project Examples ............................. SUCCESS [08:19 min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [01:10 min]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [04:44 min]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [02:35 min]
[INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [03:42 min]
[INFO] Spark Project Java 8 Tests ......................... SUCCESS [06:04 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:58 h
[INFO] Finished at: 2017-04-27T19:44:19+08:00
[INFO] Final Memory: 88M/852M
[INFO] ------------------------------------------------------------------------
```
2.2.2.2 configure spark distributed in hadoop.
Once Spark has compiled successfully, a conf directory is available; what we need to do is add the hostname of every machine to the slaves file and configure the spark-env.sh file.
```
# add these lines to the slaves file:
master
worker2
worker3

# add these lines to the spark-env.sh file:
export SPARK_MASTER_IP=master
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=512M
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.2/etc/hadoop
```
And for convenience, we can configure the profile for spark environment.
```
vim /etc/profile
# add the following lines:
export SPARK_HOME=/your directory/spark-2.1.0
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
```
Finally, send the compiled Spark code to the worker machines.
```
scp -r spark-2.1.0 worker2:/usr/local/spark/
scp -r spark-2.1.0 worker3:/usr/local/spark/
```
Now we can start spark-2.1.0 on hadoop-2.7.2 and run some tests.
```
cd /your directory/spark-2.1.0
./sbin/start-all.sh
```
Then start the spark-shell; the following messages will appear:
```
./bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/04/28 14:01:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/28 14:01:14 WARN spark.SparkConf: SPARK_WORKER_INSTANCES was detected (set to '1').
This is deprecated in Spark 1.0+.
Please instead use:
 - ./spark-submit with --num-executors to specify the number of executors
 - Or set SPARK_EXECUTOR_INSTANCES
 - spark.executor.instances to configure the number of instances in the spark config.
Spark context Web UI available at http://127.0.0.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1493402474643).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
```
2.3 develop spark in IDEA(intellij)
2.3.1 configure java environment in Windows.
```
JAVA_HOME='your java directory'   # typing 'java -verbose' shows the java directory on your PC
CLASSPATH=%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar;
PATH=%JAVA_HOME%\bin;   # add this to Path
```
Then type java -version, javac, and java; if you see messages like these, Java is installed successfully:
```
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\user>java -version
java version "1.8.0_112"
Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)

C:\Users\user>javac
Usage: javac <options> <source files>
where possible options include:
  -g                         Generate all debugging info
  ... (the full option listing follows, as shown in section 2.1.1.1)

C:\Users\user>java
Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
  ... (the full option listing follows)
```
2.3.2 configure build.sbt
```
// add the following to build.sbt
scalaVersion := "2.11.8"   // should match the Scala version installed above, since spark-core_2.11 is built for Scala 2.11
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
```
2.3.3 example code
```scala
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\spark\\IntelliJ IDEA\\work sheet\\hadoop-2.7.2")
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext("local", "wordcount", conf)
    val line = sc.textFile(args(0))
    line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}
```
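To see what the chain of transformations computes without starting Spark, here is a minimal sketch of the same word count on plain Scala collections; `groupBy` plus a per-group `sum` stands in for Spark's `reduceByKey`, and the object name and sample input are made up for illustration.

```scala
object WordCountSketch {
  // Same shape as the Spark pipeline above, on plain Scala collections:
  // flatMap splits each line into words, map pairs each word with a count of 1,
  // and groupBy + sum plays the role of reduceByKey's shuffle-and-reduce.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    count(Seq("dfs appears here", "dfs again")).foreach(println)
}
```

For example, `count(Seq("a b", "a"))` yields `Map("a" -> 2, "b" -> 1)`, which is exactly the per-word total the Spark job prints for each word in the input file.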
We need to assign the program argument under 'Edit Configurations' in IntelliJ:
Program argument: E:\Distributed\Configuration\Distributed_configure.md
Result:
```
....
17/05/08 17:55:41 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/05/08 17:55:41 INFO DAGScheduler: ResultStage 1 (collect at test.scala:20) finished in 0.100 s
17/05/08 17:55:41 INFO DAGScheduler: Job 0 finished: collect at test.scala:20, took 1.311414 s
(,1977)
(the,95)
(```,88)
([INFO],51)
(to,47)
(#,43)
(Spark,33)
(SUCCESS,32)
(and,27)
(Project,26)
(in,25)
(export,23)
(min],22)
(it,19)
....
```