Distributed configure (hadoop 2.7.2 & spark 2.1.0)
2017-04-26 14:44
1. environment
2. configure details
2.1 download the required software
2.1.1 configure the requirements
2.1.1.1 java8 environment
2.1.1.2 scala-2.11.8 environment
2.1.1.3 sbt-0.13.15 environment
2.1.1.4 maven-3.3.9 environment
2.2 configure the distributed system
2.2.1 configure Hadoop
2.2.1.1 compile Hadoop
2.2.1.2 configure Hadoop in Standalone Operation
2.2.1.3 configure Hadoop in Pseudo-Distributed Operation
2.2.1.4 configure Hadoop in Fully-Distributed Operation
2.2.1.5 Test the calculation of hadoop distributed
2.2.2 configure spark
2.2.2.1 compile spark
2.2.2.2 configure spark distributed in hadoop
2.3 develop spark in IDEA(intellij)
2.3.1 configure java environment in windows
2.3.2 configure build.sbt
2.3.3 example code
1. environment
Hadoop 2.7.2, Spark 2.1.0, Scala 2.11.8, sbt 0.13.15, Java 1.8, Maven 3.3.9, protobuf 2.5.0, findbugs 2.0.2
2. configure details
2.1 download the required software
download the hadoop source code from https://dist.apache.org/repos/dist/release/hadoop/common/
download the spark source code from http://spark.apache.org/downloads.html
download scala from http://www.scala-lang.org/download/all.html
download sbt from http://www.scala-sbt.org/download.html
download java from http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
download maven from http://maven.apache.org/download.cgi
download protobuf from https://github.com/google/protobuf/tree/master/src
download findbugs from https://sourceforge.net/projects/findbugs/?source=typ_redirect
2.1.1 configure the requirements
2.1.1.1 java8 environment
First, remove any Java environment already present on the system.
```
# list all installed java packages
rpm -qa | grep java
# then remove each of them
rpm -e --nodeps XXXXX   # XXXXX is each package printed by 'rpm -qa | grep java'
# upload jdk-8u131-linux-x64.tar.gz, which can be downloaded from the Oracle website
tar -zxvf jdk-8u131-linux-x64.tar.gz
vim /etc/profile
# add these lines to the file (mind the path):
JAVA_HOME=/usr/local/java/jdk1.8.0_131
JRE_HOME=$JAVA_HOME/jre
PATH=$JAVA_HOME/bin:$PATH
CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
export PATH JAVA_HOME CLASSPATH
source /etc/profile
```
Type java -version and javac in the console; if you see messages like these, Java is installed successfully:
```
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Usage: javac <options> <source files>
where possible options include:
  -g                         Generate all debugging info
  -g:none                    Generate no debugging info
  -g:{lines,vars,source}     Generate only some debugging info
  -nowarn                    Generate no warnings
  -verbose                   Output messages about what the compiler is doing
  -deprecation               Output source locations where deprecated APIs are used
  -classpath <path>          Specify where to find user class files and annotation processors
  -cp <path>                 Specify where to find user class files and annotation processors
  -sourcepath <path>         Specify where to find input source files
  -bootclasspath <path>      Override location of bootstrap class files
  -extdirs <dirs>            Override location of installed extensions
  -endorseddirs <dirs>       Override location of endorsed standards path
```
2.1.1.2 scala-2.11.8 environment
```
tar -zxvf scala-2.11.8.tgz
vim /etc/profile
# add the following lines to the file:
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
source /etc/profile
```
Type scala -version in the console; if this message appears, Scala is installed successfully:

```
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
```
2.1.1.3 sbt-0.13.15 environment
```
curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
yum install sbt
sbt   # the first run downloads sbt's own dependencies
```
Type sbt sbtVersion in the console; if these messages appear, sbt is installed successfully:
```
[info] Set current project to sbt (in build file:/usr/local/sbt/)
[info] 0.13.15
```
2.1.1.4 maven-3.3.9 environment
```
tar -zxvf apache-maven-3.3.9-bin.tar.gz
vim /etc/profile
# add the following lines to the file:
export MAVEN_HOME=/usr/local/maven/apache-maven-3.3.9
export PATH=$PATH:$MAVEN_HOME/bin
source /etc/profile
```
Type mvn -v in the console; if these messages appear, Maven is installed successfully:
```
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T11:41:47-05:00)
Maven home: /usr/local/maven/apache-maven-3.3.9
Java version: 1.8.0_131, vendor: Oracle Corporation
Java home: /usr/local/java/jdk1.8.0_131/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-573.el6.x86_64", arch: "amd64", family: "unix"
```
2.2 configure the distributed system
2.2.1 configure Hadoop
2.2.1.1 compile Hadoop
First, uncompress hadoop-2.7.2-src.tar.gz:
```
tar -zxvf hadoop-2.7.2-src.tar.gz
```
Second, download and install Maven and protobuf.
```
tar -zxvf apache-maven-3.3.9-bin.tar.gz
cd apache-maven-3.3.9
vim /etc/profile
export MAVEN_HOME=/your directory/apache-maven-3.3.9
export PATH=.:$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin
source /etc/profile
ln -s /your directory/apache-maven-3.3.9/bin/mvn /usr/bin/mvn

tar -zxvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix=/your directory/protobuf-2.5.0
make
make install
vim /etc/profile
# add the following lines to the file:
export PATH=$PATH:/usr/local/protobuf/protobuf-2.5.0/bin/
export PKG_CONFIG_PATH=/usr/local/protobuf/protobuf-2.5.0/lib/pkgconfig/
source /etc/profile
```
Third, download the Hadoop dependencies with Maven and compile:
```
cd hadoop-2.7.2-src
mvn clean package -Pdist,native -DskipTests -Dtar
```
While compiling the Hadoop source code, an error like this may appear:
```
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (dist) on project hadoop-dist: An Ant BuildException has occured: exec returned: 1
[ERROR] around Ant part ...<exec failonerror="true" dir="/usr/local/hadoop/hadoop-2.7.2-src/hadoop-dist/target" executable="sh">... @ 38:104 in /usr/local/hadoop/hadoop-2.7.2-src/hadoop-dist/target/antrun/build-main.xml
```
In that case, install these four dependencies:

cmake: `yum install cmake` (on CentOS)

findbugs: download it from https://sourceforge.net/projects/findbugs/?source=typ_redirect, uncompress it, and then:

```
yum install ant
unzip findbugs-2.0.2-source.zip
cd /your findbugs directory
ant
export FINDBUGS_HOME=/usr/local/findbugs/findbugs-2.0.2
```

openssl-dev: `yum install openssl-devel` (on CentOS)

zlib-dev: `yum install zlib-devel` (on CentOS)
Finally, when the following messages appear, Hadoop has compiled successfully:
```
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 21:45 min
[INFO] Finished at: 2017-04-24T19:20:15+08:00
[INFO] Final Memory: 119M/419M
[INFO] ------------------------------------------------------------------------
```
Now we can get the compiled code from:
/hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2.tar.gz
2.2.1.2 configure Hadoop in Standalone Operation
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. And we can test it right now:
```
mkdir distributed_input
cp /your directory/hadoop-2.7.2/etc/hadoop/*.xml distributed_input
/your directory/hadoop-2.7.2/bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /usr/local/distributed_input /usr/local/distributed_output 'dfs[a-z.]+'
cat distributed_output/*
```
After a few seconds, these messages will appear in the console:
```
File System Counters
    FILE: Number of bytes read=1153568
    FILE: Number of bytes written=2210810
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
Map-Reduce Framework
    Map input records=1
    Map output records=1
    Map output bytes=17
    Map output materialized bytes=25
    Input split bytes=134
    Combine input records=0
    Combine output records=0
    Reduce input groups=1
    Reduce shuffle bytes=25
    Reduce input records=1
    Reduce output records=1
    Spilled Records=2
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=0
    Total committed heap usage (bytes)=638582784
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=123
File Output Format Counters
    Bytes Written=23
```
That means we have run the example successfully!
To invoke the hadoop command directly, configure the profile like this:
```
vim /etc/profile
export HADOOP_HOME=/your directory/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin
source /etc/profile
```
2.2.1.3 configure Hadoop in Pseudo-Distributed Operation
Pseudo-distributed mode also runs on a single node, but each Hadoop daemon runs in its own Java process.
Several files need to be modified:
hosts, ssh, network, ifcfg-eth0, resolv.conf
```
# modify the hosts file
vim /etc/hosts
# add the master's and workers' IPs and hostnames; here is my example:
127.0.0.1       localhost localhost.localdomain localhost4 localhost4.localdomain4
::1             localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.159.132 master
192.168.159.134 worker2
192.168.159.133 worker3

# modify the hostname
vim /etc/sysconfig/network
# add the proper lines; here is my example:
NETWORKING=yes
HOSTNAME=master   # on a worker, set HOSTNAME to that worker's name
# restart the network
service network restart

# set up ssh keys, because every machine must be able to connect to the others without a password
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
# do the above on each machine, then collect the keys on the master and send the merged file to all workers
# log in to worker2, then:
ssh-copy-id -i master   # copy worker2's key to the master
# log in to worker3, then:
ssh-copy-id -i master   # copy worker3's key to the master
# on the master:
scp /root/.ssh/authorized_keys worker2:/root/.ssh/   # send the merged keys to worker2
scp /root/.ssh/authorized_keys worker3:/root/.ssh/   # send the merged keys to worker3

# make the directories for hadoop files/hdfs/logs and so on; the directory tree looks like:
# distribute_data
# ├── hadoop
# │   ├── data
# │   ├── hdfs
# │   ├── logs
# │   ├── name
# │   └── temp
# └── spark

####################################################################################################
# The steps below are optional.
####################################################################################################
# modify the network adapter
vim /etc/sysconfig/network-scripts/ifcfg-eth0
# add the following lines:
DEVICE=eth0
TYPE=Ethernet
ONBOOT=yes
NM_CONTROLLED=yes
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=no
NAME="System eth0"
HWADDR=00:02:c9:03:00:31:78:f2
PEERDNS=yes
PEERROUTES=yes
IPADDR=192.168.159.130
NETMASK=255.255.255.0
GATEWAY=192.168.159.2
DNS1=100.100.100.1

# modify the DNS
vim /etc/resolv.conf
# add the following lines (they depend on your system and network):
nameserver 100.100.100.1
nameserver 114.114.114.114
nameserver 8.8.8.8
```
etc/hadoop/hadoop-env.sh
```
# modify the JAVA_HOME variable
export JAVA_HOME=/your JAVA_HOME directory
# here is mine:
# export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.121-0.b13.el7_3.x86_64
```
etc/hadoop/slaves
```
# add the hostnames of all worker machines to the slaves file
worker2
worker3
```
etc/hadoop/core-site.xml
```
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <!-- Size of read/write buffer used in SequenceFiles. -->
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <!-- hadoop temp directory; it depends on you -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/distribute_data/hadoop/temp</value>
  </property>
</configuration>
```
etc/hadoop/hdfs-site.xml
```
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>master:50090</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/distribute_data/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/distribute_data/hadoop/hdfs/data</value>
  </property>
</configuration>
```
etc/hadoop/mapred-site.xml
```
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
  </property>
</configuration>
```
etc/hadoop/yarn-site.xml
```
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
</configuration>
```
After modifying the slaves / core-site.xml / hdfs-site.xml / mapred-site.xml / yarn-site.xml files on the master, copy them to the workers, like this:
```
scp core-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp core-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp hdfs-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp hdfs-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp yarn-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp yarn-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
# copy the edited mapred-site.xml (created from mapred-site.xml.template), not the unedited template
scp mapred-site.xml worker2:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
scp mapred-site.xml worker3:/usr/local/hadoop/hadoop-2.7.2/etc/hadoop/
```
Finally, format the HDFS file system and start the cluster.
```
cd /your directory/hadoop-2.7.2
./bin/hdfs namenode -format
```
If these messages come up, the file system was formatted successfully:
```
17/04/26 16:02:20 INFO common.Storage: Storage directory /usr/local/distribute_data/hadoop/hdfs/name has been successfully formatted.
17/04/26 16:02:20 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
17/04/26 16:02:20 INFO util.ExitUtil: Exiting with status 0
17/04/26 16:02:20 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master/192.168.159.132
************************************************************/
```
Then start the cluster:
```
./sbin/start-all.sh   # the hadoop team actually recommends using start-dfs.sh and start-yarn.sh instead
```
If these messages come up, the cluster started successfully:
```
Starting namenodes on [master]
master: starting namenode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-namenode-master.out
localhost: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-datanode-master.out
worker2: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-datanode-worker2.out
worker3: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-datanode-worker3.out
Starting secondary namenodes [master]
master: starting secondarynamenode, logging to /usr/local/hadoop/hadoop-2.7.2/logs/hadoop-root-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-resourcemanager-master.out
worker2: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-nodemanager-worker2.out
localhost: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-nodemanager-master.out
worker3: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.2/logs/yarn-root-nodemanager-worker3.out
```
Type jps in the master's console; we can see:
```
18544 NodeManager
17540 NameNode
17764 DataNode
18437 ResourceManager
19557 Jps
18092 SecondaryNameNode
```
2.2.1.4 configure Hadoop in Fully-Distributed Operation
Fully-distributed mode is very similar to pseudo-distributed mode; in fact, the configuration steps are the same as above.
2.2.1.5 Test the calculation of hadoop distributed.
```
cp /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/*.xml /usr/local/distribute_data/hadoop/data/
/your directory/hadoop-2.7.2/bin/hdfs dfs -mkdir /in
hdfs dfs -put /usr/local/distribute_data/hadoop/data/* /in
# run the same grep example as in standalone mode, this time against HDFS, and inspect the result:
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar grep /in /out 'dfs[a-z.]+'
hdfs dfs -cat /out/*
```
2.2.2 configure spark
2.2.2.1 compile spark
First, add MAVEN_OPTS to the profile; this helps avoid heap-space errors during the build.
```
vim /etc/profile
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
source /etc/profile
```
Second, uncompress the Spark source code, download the Spark dependencies with Maven, and compile it.
```
tar -zxvf spark-2.1.0.tgz
cd spark-2.1.0
# declare the hadoop version; it must match the version installed above
./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.2 -DskipTests clean package
# this takes a while
```
If this error message appears:
```
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 20.189 s
[INFO] Finished at: 2017-04-27T14:58:14+08:00
[INFO] Final Memory: 41M/211M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-tags_2.11: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.: CompileFailed -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :spark-tags_2.11
```
make sure the right versions of scala-2.11.8 and maven-3.3.9 are installed, then reboot the system and run the same command again; in practice a reboot fixes this error.
If these messages appear, the Spark libraries have been downloaded and compiled successfully:
```
[INFO] Spark Project Parent POM ........................... SUCCESS [ 13.441 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 17.570 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 17.121 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 22.711 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 14.905 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 22.380 s]
[INFO] Spark Project Launcher ............................. SUCCESS [ 22.645 s]
[INFO] Spark Project Core ................................. SUCCESS [10:59 min]
[INFO] Spark Project ML Local Library ..................... SUCCESS [04:20 min]
[INFO] Spark Project GraphX ............................... SUCCESS [02:17 min]
[INFO] Spark Project Streaming ............................ SUCCESS [04:35 min]
[INFO] Spark Project Catalyst ............................. SUCCESS [08:20 min]
[INFO] Spark Project SQL .................................. SUCCESS [14:59 min]
[INFO] Spark Project ML Library ........................... SUCCESS [09:13 min]
[INFO] Spark Project Tools ................................ SUCCESS [01:12 min]
[INFO] Spark Project Hive ................................. SUCCESS [11:40 min]
[INFO] Spark Project REPL ................................. SUCCESS [03:23 min]
[INFO] Spark Project YARN Shuffle Service ................. SUCCESS [02:26 min]
[INFO] Spark Project YARN ................................. SUCCESS [05:34 min]
[INFO] Spark Project Assembly ............................. SUCCESS [01:25 min]
[INFO] Spark Project External Flume Sink .................. SUCCESS [02:42 min]
[INFO] Spark Project External Flume ....................... SUCCESS [03:03 min]
[INFO] Spark Project External Flume Assembly .............. SUCCESS [ 40.208 s]
[INFO] Spark Integration for Kafka 0.8 .................... SUCCESS [02:38 min]
[INFO] Spark Project Examples ............................. SUCCESS [08:19 min]
[INFO] Spark Project External Kafka Assembly .............. SUCCESS [01:10 min]
[INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [04:44 min]
[INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [02:35 min]
[INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [03:42 min]
[INFO] Spark Project Java 8 Tests ......................... SUCCESS [06:04 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:58 h
[INFO] Finished at: 2017-04-27T19:44:19+08:00
[INFO] Final Memory: 88M/852M
[INFO] ------------------------------------------------------------------------
```
2.2.2.2 configure spark distributed in hadoop.
Once Spark has compiled successfully, a conf directory is available; what we need to do is add the hostname of every machine to the slaves file and configure the spark-env.sh file.
```
# add these lines to the slaves file:
master
worker2
worker3

# add these lines to the spark-env.sh file:
export SPARK_MASTER_IP=master
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=512M
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.2/etc/hadoop
```
And for convenience, we can configure the profile for spark environment.
```
vim /etc/profile
# add the following lines:
export SPARK_HOME=/your directory/spark-2.1.0
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile
```
Finally, send the compiled Spark code to the worker machines.
```
scp -r spark-2.1.0 worker2:/usr/local/spark/
scp -r spark-2.1.0 worker3:/usr/local/spark/
```
Now we can start spark-2.1.0 on hadoop-2.7.2 and run some tests.
```
cd /your directory/spark-2.1.0
./sbin/start-all.sh
```
Then start the spark-shell; the following messages will appear:
```
./bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/04/28 14:01:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/28 14:01:14 WARN spark.SparkConf: SPARK_WORKER_INSTANCES was detected (set to '1').
This is deprecated in Spark 1.0+.
Please instead use:
 - ./spark-submit with --num-executors to specify the number of executors
 - Or set SPARK_EXECUTOR_INSTANCES
 - spark.executor.instances to configure the number of instances in the spark config.
Spark context Web UI available at http://127.0.0.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1493402474643).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
```
2.3 develop spark in IDEA(intellij)
2.3.1 configure java environment in Windows.
```
JAVA_HOME='your java directory'   # typing 'java -verbose' shows the java directory on your PC
CLASSPATH=%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar;
PATH=%JAVA_HOME%\bin;   # add this to Path
```
Then type java -version, javac, and java; if you see messages like these, Java is installed successfully:
```
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\user>java -version
java version "1.8.0_112"
Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)

C:\Users\user>javac
Usage: javac <options> <source files>
where possible options include:
  -g                         Generate all debugging info
  ... (the full option listing follows, as shown in section 2.1.1.1)

C:\Users\user>java
Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
  ... (the full option listing follows)
```
2.3.2 configure build.sbt
```
// add the following to build.sbt
scalaVersion := "2.11.8"   // should match the Scala version installed above, since spark-core_2.11 is built for Scala 2.11
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
```
2.3.3 example code
```scala
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\spark\\IntelliJ IDEA\\work sheet\\hadoop-2.7.2")
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext("local", "wordcount", conf)
    val line = sc.textFile(args(0))
    line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}
```
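To see what the chain of transformations computes without starting Spark, here is a minimal sketch of the same word count on plain Scala collections; `groupBy` plus a per-group `sum` stands in for Spark's `reduceByKey`, and the object name and sample input are made up for illustration.

```scala
object WordCountSketch {
  // Same shape as the Spark pipeline above, on plain Scala collections:
  // flatMap splits each line into words, map pairs each word with a count of 1,
  // and groupBy + sum plays the role of reduceByKey's shuffle-and-reduce.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .map((_, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit =
    count(Seq("dfs appears here", "dfs again")).foreach(println)
}
```

For example, `count(Seq("a b", "a"))` yields `Map("a" -> 2, "b" -> 1)`, which is exactly the per-word total the Spark job prints for each word in the input file.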
We need to assign the program argument under 'Edit Configurations' in IntelliJ:
Program argument: E:\Distributed\Configuration\Distributed_configure.md
Result:
```
....
17/05/08 17:55:41 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/05/08 17:55:41 INFO DAGScheduler: ResultStage 1 (collect at test.scala:20) finished in 0.100 s
17/05/08 17:55:41 INFO DAGScheduler: Job 0 finished: collect at test.scala:20, took 1.311414 s
(,1977)
(the,95)
(```,88)
([INFO],51)
(to,47)
(#,43)
(Spark,33)
(SUCCESS,32)
(and,27)
(Project,26)
(in,25)
(export,23)
(min],22)
(it,19)
....
```