Spark: Implementing WordCount in Scala and Java
2014-11-27 22:37
http://www.cnblogs.com/byrhuangqiang/p/4017725.html
To write Scala in IDEA, I installed and configured the IntelliJ IDEA development environment today and learned my way around it. IDEA really is excellent; once you get the hang of it, it is very comfortable to work with. For how to set up a Scala + IDEA development environment, see the references at the end of this post.
This post implements WordCount in both Scala and Java. The Java version, JavaWordCount, is one of Spark's bundled examples ($SPARK_HOME/examples/src/main/java/org/apache/spark/examples/JavaWordCount.java).
1. Environment
OS: Red Hat Enterprise Linux Server release 6.4 (Santiago)
Hadoop: Hadoop 2.4.1
JDK: 1.7.0_60
Spark: 1.1.0
Scala: 2.11.2
IDE: IntelliJ IDEA 13.1.3
Note: on the Windows client you need to install IDEA, Scala, and the JDK, and install the Scala plugin for IDEA.
2. Word Count in Scala
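The Scala source itself does not survive in this copy of the post, but the run log below references map, reduceByKey, and collect at WordCount.scala:26. A minimal sketch consistent with that pipeline, on the Spark 1.1-era RDD API, might look like the following (the object name and argument handling are assumptions, not the author's exact code):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on Spark 1.x

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)
    // Read the input file, split each line into words, emit (word, 1),
    // then sum the counts per word across partitions.
    val counts = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect() // bring the (word, count) pairs back to the driver
    counts.foreach(println)
    sc.stop()
  }
}
```

Judging from the log, the job was packaged as WordCountByscala.jar and submitted to the standalone master at spark://eb174:7077.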
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/10/10 19:24:51 INFO SecurityManager: Changing view acls to: ebupt,
14/10/10 19:24:51 INFO SecurityManager: Changing modify acls to: ebupt,
14/10/10 19:24:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ebupt, ); users with modify permissions: Set(ebupt, )
14/10/10 19:24:52 INFO Slf4jLogger: Slf4jLogger started
14/10/10 19:24:52 INFO Remoting: Starting remoting
14/10/10 19:24:52 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@eb174:56344]
14/10/10 19:24:52 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@eb174:56344]
14/10/10 19:24:52 INFO Utils: Successfully started service 'sparkDriver' on port 56344.
14/10/10 19:24:52 INFO SparkEnv: Registering MapOutputTracker
14/10/10 19:24:52 INFO SparkEnv: Registering BlockManagerMaster
14/10/10 19:24:52 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20141010192452-3398
14/10/10 19:24:52 INFO Utils: Successfully started service 'Connection manager for block manager' on port 41110.
14/10/10 19:24:52 INFO ConnectionManager: Bound socket to port 41110 with id = ConnectionManagerId(eb174,41110)
14/10/10 19:24:52 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/10/10 19:24:52 INFO BlockManagerMaster: Trying to register BlockManager
14/10/10 19:24:52 INFO BlockManagerMasterActor: Registering block manager eb174:41110 with 265.4 MB RAM
14/10/10 19:24:52 INFO BlockManagerMaster: Registered BlockManager
14/10/10 19:24:52 INFO HttpFileServer: HTTP File server directory is /tmp/spark-8051667e-bfdb-4ecd-8111-52992b16bb13
14/10/10 19:24:52 INFO HttpServer: Starting HTTP Server
14/10/10 19:24:52 INFO Utils: Successfully started service 'HTTP file server' on port 48233.
14/10/10 19:24:53 INFO Utils: Successfully started service 'SparkUI' on port 4040.
14/10/10 19:24:53 INFO SparkUI: Started SparkUI at http://eb174:4040
14/10/10 19:24:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/10/10 19:24:53 INFO SparkContext: Added JAR file:/home/ebupt/test/WordCountByscala.jar at http://10.1.69.174:48233/jars/WordCountByscala.jar with timestamp 1412940293532
14/10/10 19:24:53 INFO AppClient$ClientActor: Connecting to master spark://eb174:7077...
14/10/10 19:24:53 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
14/10/10 19:24:53 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=278302556
14/10/10 19:24:53 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 265.3 MB)
14/10/10 19:24:53 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20141010192453-0009
14/10/10 19:24:53 INFO AppClient$ClientActor: Executor added: app-20141010192453-0009/0 on worker-20141008204132-eb176-49618 (eb176:49618) with 1 cores
14/10/10 19:24:53 INFO SparkDeploySchedulerBackend: Granted executor ID app-20141010192453-0009/0 on hostPort eb176:49618 with 1 cores, 1024.0 MB RAM
14/10/10 19:24:53 INFO AppClient$ClientActor: Executor added: app-20141010192453-0009/1 on worker-20141008204132-eb175-56337 (eb175:56337) with 1 cores
14/10/10 19:24:53 INFO SparkDeploySchedulerBackend: Granted executor ID app-20141010192453-0009/1 on hostPort eb175:56337 with 1 cores, 1024.0 MB RAM
14/10/10 19:24:53 INFO AppClient$ClientActor: Executor updated: app-20141010192453-0009/0 is now RUNNING
14/10/10 19:24:53 INFO AppClient$ClientActor: Executor updated: app-20141010192453-0009/1 is now RUNNING
14/10/10 19:24:53 INFO MemoryStore: ensureFreeSpace(12633) called with curMem=163705, maxMem=278302556
14/10/10 19:24:53 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 12.3 KB, free 265.2 MB)
14/10/10 19:24:53 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on eb174:41110 (size: 12.3 KB, free: 265.4 MB)
14/10/10 19:24:53 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
14/10/10 19:24:54 INFO FileInputFormat: Total input paths to process : 1
14/10/10 19:24:54 INFO SparkContext: Starting job: collect at WordCount.scala:26
14/10/10 19:24:54 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:26)
14/10/10 19:24:54 INFO DAGScheduler: Got job 0 (collect at WordCount.scala:26) with 2 output partitions (allowLocal=false)
14/10/10 19:24:54 INFO DAGScheduler: Final stage: Stage 0(collect at WordCount.scala:26)
14/10/10 19:24:54 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/10/10 19:24:54 INFO DAGScheduler: Missing parents: List(Stage 1)
14/10/10 19:24:54 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[3] at map at WordCount.scala:26), which has no missing parents
14/10/10 19:24:54 INFO MemoryStore: ensureFreeSpace(3400) called with curMem=176338, maxMem=278302556
14/10/10 19:24:54 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.3 KB, free 265.2 MB)
14/10/10 19:24:54 INFO MemoryStore: ensureFreeSpace(2082) called with curMem=179738, maxMem=278302556
14/10/10 19:24:54 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.0 KB, free 265.2 MB)
14/10/10 19:24:54 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on eb174:41110 (size: 2.0 KB, free: 265.4 MB)
14/10/10 19:24:54 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
14/10/10 19:24:54 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MappedRDD[3] at map at WordCount.scala:26)
14/10/10 19:24:54 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
14/10/10 19:24:56 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@eb176:35482/user/Executor#1456950111] with ID 0
14/10/10 19:24:56 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, eb176, ANY, 1238 bytes)
14/10/10 19:24:56 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@eb175:35502/user/Executor#-1231100997] with ID 1
14/10/10 19:24:56 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, eb175, ANY, 1238 bytes)
14/10/10 19:24:56 INFO BlockManagerMasterActor: Registering block manager eb176:33296 with 530.3 MB RAM
14/10/10 19:24:56 INFO BlockManagerMasterActor: Registering block manager eb175:32903 with 530.3 MB RAM
14/10/10 19:24:57 INFO ConnectionManager: Accepted connection from [eb176/10.1.69.176:39218]
14/10/10 19:24:57 INFO ConnectionManager: Accepted connection from [eb175/10.1.69.175:55227]
14/10/10 19:24:57 INFO SendingConnection: Initiating connection to [eb176/10.1.69.176:33296]
14/10/10 19:24:57 INFO SendingConnection: Initiating connection to [eb175/10.1.69.175:32903]
14/10/10 19:24:57 INFO SendingConnection: Connected to [eb175/10.1.69.175:32903], 1 messages pending
14/10/10 19:24:57 INFO SendingConnection: Connected to [eb176/10.1.69.176:33296], 1 messages pending
14/10/10 19:24:57 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on eb175:32903 (size: 2.0 KB, free: 530.3 MB)
14/10/10 19:24:57 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on eb176:33296 (size: 2.0 KB, free: 530.3 MB)
14/10/10 19:24:57 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on eb176:33296 (size: 12.3 KB, free: 530.3 MB)
14/10/10 19:24:57 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on eb175:32903 (size: 12.3 KB, free: 530.3 MB)
14/10/10 19:24:58 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 1697 ms on eb175 (1/2)
14/10/10 19:24:58 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 1715 ms on eb176 (2/2)
14/10/10 19:24:58 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/10/10 19:24:58 INFO DAGScheduler: Stage 1 (map at WordCount.scala:26) finished in 3.593 s
14/10/10 19:24:58 INFO DAGScheduler: looking for newly runnable stages
14/10/10 19:24:58 INFO DAGScheduler: running: Set()
14/10/10 19:24:58 INFO DAGScheduler: waiting: Set(Stage 0)
14/10/10 19:24:58 INFO DAGScheduler: failed: Set()
14/10/10 19:24:58 INFO DAGScheduler: Missing parents for Stage 0: List()
14/10/10 19:24:58 INFO DAGScheduler: Submitting Stage 0 (ShuffledRDD[4] at reduceByKey at WordCount.scala:26), which is now runnable
14/10/10 19:24:58 INFO MemoryStore: ensureFreeSpace(2096) called with curMem=181820, maxMem=278302556
14/10/10 19:24:58 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.0 KB, free 265.2 MB)
14/10/10 19:24:58 INFO MemoryStore: ensureFreeSpace(1338) called with curMem=183916, maxMem=278302556
14/10/10 19:24:58 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1338.0 B, free 265.2 MB)
14/10/10 19:24:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on eb174:41110 (size: 1338.0 B, free: 265.4 MB)
14/10/10 19:24:58 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
14/10/10 19:24:58 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ShuffledRDD[4] at reduceByKey at WordCount.scala:26)
14/10/10 19:24:58 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/10/10 19:24:58 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 2, eb175, PROCESS_LOCAL, 1008 bytes)
14/10/10 19:24:58 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 3, eb176, PROCESS_LOCAL, 1008 bytes)
14/10/10 19:24:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on eb175:32903 (size: 1338.0 B, free: 530.3 MB)
14/10/10 19:24:58 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on eb176:33296 (size: 1338.0 B, free: 530.3 MB)
14/10/10 19:24:58 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to sparkExecutor@eb175:59119
14/10/10 19:24:58 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 144 bytes
14/10/10 19:24:58 INFO MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to sparkExecutor@eb176:39028
14/10/10 19:24:58 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 3) in 109 ms on eb176 (1/2)
14/10/10 19:24:58 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 2) in 120 ms on eb175 (2/2)
14/10/10 19:24:58 INFO DAGScheduler: Stage 0 (collect at WordCount.scala:26) finished in 0.123 s
14/10/10 19:24:58 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/10/10 19:24:58 INFO SparkContext: Job finished: collect at WordCount.scala:26, took 3.815637915 s
(scala,1)
(Function2,1)
(JavaSparkContext,1)
(JavaRDD,1)
(Tuple2,1)
(,1)
(org,7)
(apache,7)
(JavaPairRDD,1)
(java,7)
(function,4)
(api,7)
(Function,1)
(PairFunction,1)
(spark,7)
(FlatMapFunction,1)
(import,8)
14/10/10 19:24:58 INFO SparkUI: Stopped Spark web UI at http://eb174:4040
14/10/10 19:24:58 INFO DAGScheduler: Stopping DAGScheduler
14/10/10 19:24:58 INFO SparkDeploySchedulerBackend: Shutting down all executors
14/10/10 19:24:58 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
14/10/10 19:24:58 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(eb176,33296)
14/10/10 19:24:58 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(eb176,33296)
14/10/10 19:24:58 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(eb176,33296) not found
14/10/10 19:24:58 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(eb175,32903)
14/10/10 19:24:58 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(eb175,32903)
14/10/10 19:24:58 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(eb175,32903)
14/10/10 19:24:58 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@5e92c11b
14/10/10 19:24:58 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@5e92c11b
java.nio.channels.CancelledKeyException
	at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:310)
	at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139)
14/10/10 19:24:59 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/10/10 19:24:59 INFO ConnectionManager: Selector thread was interrupted!
14/10/10 19:24:59 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(eb176,33296)
14/10/10 19:24:59 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(eb176,33296) not found
14/10/10 19:24:59 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(eb176,33296)
14/10/10 19:24:59 WARN ConnectionManager: All connections not cleaned up
14/10/10 19:24:59 INFO ConnectionManager: ConnectionManager stopped
14/10/10 19:24:59 INFO MemoryStore: MemoryStore cleared
14/10/10 19:24:59 INFO BlockManager: BlockManager stopped
14/10/10 19:24:59 INFO BlockManagerMaster: BlockManagerMaster stopped
14/10/10 19:24:59 INFO SparkContext: Successfully stopped SparkContext
14/10/10 19:24:59 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/10/10 19:24:59 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/10/10 19:24:59 INFO Remoting: Remoting shut down
14/10/10 19:24:59 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
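The (word, count) tuples in the middle of the log are the collected result; judging from tokens like JavaSparkContext and FlatMapFunction, the input file was apparently the import block of JavaWordCount.java. The same map-then-reduceByKey aggregation can be sketched on plain Scala collections, with a made-up input, to see the semantics without a cluster (a local illustration, not part of the original post):

```scala
object LocalWordCount {
  // Mirrors Spark's flatMap -> map(word => (word, 1)) -> reduceByKey(_ + _)
  // pipeline using ordinary collections: group the (word, 1) pairs by word
  // and sum the counts within each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .groupBy(_._1)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val text = Seq("import org apache spark", "import org apache")
    // Print counts in descending order, like eyeballing the collected output
    wordCount(text).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

The grouping-and-summing is what reduceByKey does, except that on a cluster the summing happens per partition first and the groups are brought together by a shuffle.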
5. References
On using IDEA: Scala from scratch: writing hello world with IntelliJ IDEA
WordCount in Scala: Developing a Spark wordcount and submitting it to run on the cluster
WordCount in Java: Writing a Spark program in Java, a simple example and how to run it; Running a Wordcount program on Spark on YARN
Spark Programming Guide