Hello Samza Is Not That Easy
2015-06-28 13:42
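Why is it not that easy to say hello? Because along the way you not only wait close to an hour for YARN, Kafka, and ZooKeeper to download, but you also hit two problems that keep the tutorial from running smoothly. I explain both below, alongside the original text.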
Hello Samza
The hello-samza project is a stand-alone project designed to help you run your first Samza job.
Get the Code
You'll need to check out and publish Samza, since it's not available in a Maven repository right now.
git clone http://git-wip-us.apache.org/repos/asf/incubator-samza.git
cd incubator-samza
./gradlew -PscalaVersion=2.8.1 clean publishToMavenLocal
Next, check out the hello-samza project.
git clone git://github.com/linkedin/hello-samza.git
This project contains everything you'll need to run your first Samza jobs.
Start a Grid
A Samza grid usually comprises three different systems: YARN, Kafka, and ZooKeeper. The hello-samza project comes with a script called "grid" to help you set up these systems. Start by running:
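There is a landmine here. Before running this command, change the download URLs on lines 18 and 20 of the grid script to addresses that actually work: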
DOWNLOAD_KAFKA=https://dist.apache.org/repos/dist/release/kafka/0.8.0/kafka_2.8.0-0.8.0.tar.gz
DOWNLOAD_ZOOKEEPER=http://apache.mirrors.pair.com/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz
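If you would rather not edit the script by hand, the change above can be sketched as a small shell helper. This is only a sketch: `set_var_line` is a hypothetical name, and it assumes that lines 18 and 20 of bin/grid begin with `DOWNLOAD_KAFKA=` and `DOWNLOAD_ZOOKEEPER=`, as the two replacement lines suggest.

```shell
# Rewrite the value of a VAR=... assignment line in a script.
# Usage: set_var_line FILE VAR NEW_VALUE
# Assumes NEW_VALUE contains no '|' or '&', which sed would treat specially.
set_var_line() {
  file="$1"; var="$2"; value="$3"
  # '|' is the sed delimiter because URLs contain '/'.
  # Writing back with cat (not mv) keeps the script's executable bit.
  sed "s|^${var}=.*|${var}=${value}|" "$file" > "$file.tmp" &&
    cat "$file.tmp" > "$file" && rm -f "$file.tmp"
}

# Example, run from the hello-samza root (commented out to avoid side effects):
# set_var_line bin/grid DOWNLOAD_KAFKA "https://dist.apache.org/repos/dist/release/kafka/0.8.0/kafka_2.8.0-0.8.0.tar.gz"
# set_var_line bin/grid DOWNLOAD_ZOOKEEPER "http://apache.mirrors.pair.com/zookeeper/zookeeper-3.4.5/zookeeper-3.4.5.tar.gz"
```

Matching on the variable name rather than the line number keeps the helper working if the script's line numbering differs from the copy described here.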
Once the URLs are changed, you can run the command below.
bin/grid
This command will download, install, and start ZooKeeper, Kafka, and YARN. All package files will be put in a sub-directory called "deploy" inside hello-samza's root folder.
If you get a complaint that JAVA_HOME is not set, then you'll need to set it. This can be done on Mac OS X by running:
export JAVA_HOME=$(/usr/libexec/java_home)
Once the grid command completes, you can verify that YARN is up and running by going to http://localhost:8088. This is the YARN UI.
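The same check can be done from the terminal. The sketch below is a generic retry helper (`wait_until` is a hypothetical name, not part of hello-samza) that polls a command until it succeeds. The commented example assumes curl is installed and probes the YARN UI address given above.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    # Pause before the next attempt, but not after the last one.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (commented out to avoid side effects):
# wait_until 30 2 curl -sf -o /dev/null http://localhost:8088 && echo "YARN UI is up"
```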
Build a Samza Job Package
Before you can run a Samza job, you need to build a package for it. This package is what YARN uses to deploy your jobs on the grid.
mvn clean package
mkdir -p deploy/samza
tar -xvf ./samza-job-package/target/samza-job-package-0.7.0-dist.tar.gz -C deploy/samza
Run a Samza Job
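After you've built your Samza package, you can start a job on the grid using the run-job.sh script.
Before running this command, change 6667 to 6665 on line 35 of $PWD/deploy/samza/config/wikipedia-feed.properties. Port 6667 may be unreachable, in which case you will never see any data pushed to Kafka. With that change made, you can run the script below without worry :)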
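The port edit can also be scripted. This is a minimal sketch: `replace_on_line` is a hypothetical helper, and it relies on the port really sitting on line 35 of your copy of the file, so check the line before running it.

```shell
# Replace OLD with NEW on one specific line of a file.
# Usage: replace_on_line FILE LINE_NUMBER OLD NEW
# Assumes OLD and NEW contain no '/' or other sed metacharacters.
replace_on_line() {
  file="$1"; line="$2"; old="$3"; new="$4"
  # The leading line number limits the substitution to that line only.
  sed "${line}s/${old}/${new}/" "$file" > "$file.tmp" &&
    cat "$file.tmp" > "$file" && rm -f "$file.tmp"
}

# Example, run from the hello-samza root (commented out to avoid side effects):
# replace_on_line deploy/samza/config/wikipedia-feed.properties 35 6667 6665
```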
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
The job will consume a feed of real-time edits from Wikipedia, and produce them to a Kafka topic called "wikipedia-raw". Give the job a minute to start up, and then tail the Kafka topic:
deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw
Pretty neat, right? Now, check out the YARN UI again (http://localhost:8088). This time around, you'll see your Samza job is running!
Generate Wikipedia Statistics
Let's calculate some statistics based on the messages in the wikipedia-raw topic. Start two more jobs:
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-parser.properties
deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-stats.properties
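Since the run-job.sh invocations differ only in the name of the properties file, they can be wrapped in a small helper. This is a sketch only: `start_job` and the `DRY_RUN` switch are hypothetical conventions of this example, not part of hello-samza.

```shell
# Start a hello-samza job by name. Run from the hello-samza root directory.
# With DRY_RUN=1 the command is printed instead of executed.
start_job() {
  job="$1"
  # Rebuild the positional parameters as the full run-job.sh command line.
  set -- deploy/samza/bin/run-job.sh \
    --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory \
    --config-path="file://$PWD/deploy/samza/config/$job.properties"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Example (commented out to avoid side effects):
# for job in wikipedia-parser wikipedia-stats; do start_job "$job"; done
```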
The first job (wikipedia-parser) parses the messages in wikipedia-raw, and extracts information about the size of the edit, who made the change, etc. You can take a look at its output with:
deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-edits
The last job (wikipedia-stats) reads messages from the wikipedia-edits topic, and calculates counts, every ten seconds, for all edits that were made during that window. It outputs these counts to the wikipedia-stats topic.
deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-stats
The messages in the stats topic look like this:
{"is-talk":2,"bytes-added":5276,"edits":13,"unique-titles":13}
{"is-bot-edit":1,"is-talk":3,"bytes-added":4211,"edits":30,"unique-titles":30,"is-unpatrolled":1,"is-new":2,"is-minor":7}
{"bytes-added":3180,"edits":19,"unique-titles":19,"is-unpatrolled":1,"is-new":1,"is-minor":3}
{"bytes-added":2218,"edits":18,"unique-titles":18,"is-unpatrolled":2,"is-new":2,"is-minor":3}
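To get a feel for these counts, you can total one field across several windows. The sketch below uses only grep and awk. `sum_field` and the file name stats.txt are hypothetical, and the helper handles flat messages with integer values like the samples above, not JSON in general.

```shell
# Sum one integer field across JSON messages read from standard input.
# Usage: sum_field FIELD_NAME < messages
sum_field() {
  # grep -o prints each "field":value match on its own line;
  # awk splits on ':' and adds up the values (prints 0 if nothing matched).
  grep -o "\"$1\":-\{0,1\}[0-9]*" |
    awk -F: '{ total += $2 } END { print total + 0 }'
}

# Example, after saving some consumer output to stats.txt (commented out):
# sum_field edits < stats.txt
```

The leading quote in the pattern keeps a field such as "edits" from also matching "is-bot-edit".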
If you check the YARN UI again, you'll see that all three jobs are now listed.
Shutdown
After you're done, you can clean everything up using the same grid script.
bin/grid stop yarn
bin/grid stop kafka
bin/grid stop zookeeper
Congratulations! You've now set up a local grid that includes YARN, Kafka, and ZooKeeper, and run a Samza job on it. Next up, check out the Background and API Overview pages.