
Some useful Storm material collected from around the web

2012-09-24 10:55
IMHO, Flume has lots of built-in data sources, decorators (batching, archiving) and sinks (HDFS etc.), so it's easy to adopt in an existing system and easy to extend, especially for log files. Kafka can be used for basic log aggregation and is a general-purpose producer-consumer messaging system with high throughput.

I feel Flume focuses more on aggregation whereas Kafka focuses on super-fast messaging. Even though both can be sources for Storm, if you do log aggregation and log data processing together, Flume might be better; if you want extremely low-latency processing of log data, Kafka would be the winner.
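A minimal sketch of the "producer-consumer" side mentioned above: pushing log lines into Kafka from the machine that has them. This is not from the original thread; it uses the current org.apache.kafka.clients producer API (newer than the client that existed when the thread was written), and the broker address and topic name are assumptions.

// Minimal sketch: publish log lines to Kafka. Broker address and topic
// name ("app-logs") are made up for illustration.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogLineProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092");   // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each log line becomes one message on the assumed "app-logs" topic,
            // keyed by the host it came from.
            producer.send(new ProducerRecord<>("app-logs", "web01", "GET /index.html 200"));
        }
    }
}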

Min, so it sounds like Kafka is more suitable for a real-time solution provided by Storm. Aggregation calculations can already be done inside a Storm bolt with a real ESP engine, for example Drools Fusion or Esper. I am building a solution with Drools Fusion.
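For a sense of where that aggregation logic sits, here is a sketch of a bolt that aggregates in memory. This is a plain counting bolt, not Esper or Drools Fusion; those engines would replace the HashMap logic below. It uses the backtype.storm API of that era, and the "key" field name is an assumption about the upstream tuples.

// Plain in-memory aggregation inside a bolt (a stand-in for an ESP engine).
import java.util.HashMap;
import java.util.Map;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class CountingAggregatorBolt extends BaseBasicBolt {
    private final Map<String, Long> counts = new HashMap<String, Long>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Assumes upstream tuples carry a "key" field, e.g. a log level or host name.
        String key = input.getStringByField("key");
        long count = counts.containsKey(key) ? counts.get(key) + 1 : 1;
        counts.put(key, count);
        collector.emit(new Values(key, count));   // emit the running count per key
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("key", "count"));
    }
}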

Sam, I'm starting with an AMQP solution for now.

Thank you for the reply. I thought the problem was in the parameters of the exchange, but it turns out it was in the parameters of the queue. Solved.
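To illustrate the kind of mismatch being debugged there: re-declaring a queue (or exchange) with parameters different from the one already on the broker is rejected, so producer and consumer sides must declare them identically. The sketch below uses the RabbitMQ Java client as one possible AMQP implementation; the exchange/queue names and durability choices are assumptions, not what the poster actually used.

// Sketch: exchange and queue declarations must match what already exists
// on the broker (durable, exclusive, auto-delete, arguments).
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class LogQueueSetup {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");                        // assumed broker host
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();

        // Durable direct exchange for log traffic (assumed name "logs").
        channel.exchangeDeclare("logs", "direct", true);
        // Queue parameters (durable=true, exclusive=false, autoDelete=false)
        // must agree with whatever was declared first on the broker.
        channel.queueDeclare("storm-logs", true, false, false, null);
        channel.queueBind("storm-logs", "logs", "app");

        channel.close();
        conn.close();
    }
}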

Several strategies for collecting Spout data sources:

We have several log files on different remote machines, sharded by nginx and all in the same format. How can I tail these files in parallel as the input of a Spout?

1. Is there some way I can control a Spout so that it runs on a specific machine with a log file? For example, I have 4 log servers; with setSpout(id, class, 4), I would just place the 4 spouts on the 4 log servers.

2. Or I can only gather all log files into one and have that tailed by a Spout, which doesn't sound efficient.

3. Also, I heard Facebook has developed PTail in their real-time system; how could I imitate it?

4. Of course I can push the content of the logs into an MQ, which means I would have to write a daemon process and run it on each machine (a sketch of such a daemon follows below).

Thank you for your advice!
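A rough sketch of option 4 above: a small per-machine daemon that tails one log file and hands each new line to a message queue. The log path is an assumption, pushToQueue() is a placeholder for whatever client is chosen (Kafka, Kestrel, AMQP, ...), and rotation handling and error recovery are omitted.

// Tail a log file like `tail -f` and push each new line to an MQ client.
import java.io.RandomAccessFile;

public class LogTailDaemon {
    public static void main(String[] args) throws Exception {
        RandomAccessFile log = new RandomAccessFile("/var/log/nginx/access.log", "r");
        long offset = log.length();          // start at the end of the file
        while (true) {
            long len = log.length();
            if (len < offset) {
                offset = 0;                  // file truncated or rotated: start over
            }
            if (len > offset) {
                log.seek(offset);
                String line;
                while ((line = log.readLine()) != null) {
                    pushToQueue(line);       // hand the line to the MQ client
                }
                offset = log.getFilePointer();
            }
            Thread.sleep(500);               // poll interval
        }
    }

    private static void pushToQueue(String line) {
        // Placeholder: send the line to Kafka/Kestrel/AMQP here.
        System.out.println(line);
    }
}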

Another option would be using Flume. You might have to build a FlumeSinkSpout.

Some tools:

* Flume (we use this in our performance monitoring and analytics services)
* Scribe
* Kafka
* Fluentd
* ...

Advice from the Storm author:

How you get log data into Storm depends on what you want your message processing guarantees to be. If you don't care about dropping messages, then you could do logs -> Scribe -> ScribeSpout. Scribe would have to be able to discover where the ScribeSpout tasks are (since they could be started on any machine), which could be accomplished using Zookeeper. I believe Scribe has this discovery functionality built in already.

If you care about processing every message, then you should do an architecture like logs -> Scribe -> Kestrel -> KestrelSpout.
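Either way, the pipeline ends up wired into a topology. The sketch below shows only that wiring; KestrelLogSpout and LogParserBolt are placeholder class names standing in for the storm-kestrel spout and whatever processing bolt you write, and the parallelism and worker counts are arbitrary.

// Wiring the "logs -> Kestrel -> spout -> bolt" path into a Storm topology.
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class LogTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("log-spout", new KestrelLogSpout(), 4);   // placeholder spout class
        builder.setBolt("log-parser", new LogParserBolt(), 8)      // placeholder bolt class
               .shuffleGrouping("log-spout");

        Config conf = new Config();
        conf.setNumWorkers(4);
        StormSubmitter.submitTopology("log-processing", conf, builder.createTopology());
    }
}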

Others:

I have a very similar requirement where I have to pipe data from many log files to a bunch of aggregators so that they can do the aggregation. I was thinking of setting up the log file readers as spouts and the aggregators as bolts.

So now I have to run the log file reader's main() and, in it, emit the log file lines.

A few questions:

1. Do you recommend doing this?

2. How can I have a main() from which I can emit tuples? I can't package this class as a regular spout because it HAS to run on the right host, and Storm does not give you that control as far as I know.

In 0.8.0 you can write a custom class to dispatch your log tailing spout to the right boxes.

We have a small daemon tailing logs and feeding Kafka.   Much lighter weight than Flume, gives us transactionality, and easily allows multiple topologies to share the same data source.
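For reference, a minimal sketch of the "log file reader as a spout" idea from this thread, using the backtype.storm API of that era. The file path is an assumption, and the caveat discussed above still applies: Storm decides where the spout runs, so the file must be reachable from any worker (or the spout pinned via a custom scheduler).

// Minimal spout that reads a log file and emits one tuple per line.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class LogFileSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private BufferedReader reader;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            this.reader = new BufferedReader(new FileReader("/var/log/app/app.log")); // assumed path
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            String line = reader.readLine();
            if (line != null) {
                collector.emit(new Values(line));   // one tuple per log line
            } else {
                Thread.sleep(100);                  // nothing new yet; back off briefly
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}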

Hi =)  My options/answers inline (mind you I'm learning Storm right now), but hope I can help some.

I looked at this post and my question is related to it
https://groups.google.com/forum/#!msg/storm-user/Zvy6RrT8RHo/OVB03TlT2HgJ

I am looking to tail log files from many servers into Storm and then carry out staged processing on the messages from these log files. So what I would like is to have a spout on each of the servers on which the log files exist that will emit the log lines, and then have the bolts that consume them running on different machines within the same data center.

Because you can't (currently) define which machine a Spout should run on (check out issue #164: https://github.com/nathanmarz/storm/issues/164), I think this could be an issue. Once this is complete, though, you should be able to define this in the topology and, using a library (as previously suggested in that linked thread), monitor logs via a custom Spout.

The questions I have are:

1. Is this not recommended? From the post mentioned above it seems to me that it is not. If so why?


I don't see why not - this looks like a good use case. I agree with some of the other respondents in the referenced post: using something like Flume or another library to produce a Spout to do this is a good approach. Plus, you can contribute that back =)

2. If we are to do this, I need to be able to write a spout which has a main() so that I can read the log and emit the tuples. I can't package these in the topology since I need to control exactly where they should run (where the log files are).

Right now, that's kinda true (see the reference above about issue #164). Once you can define your Spout deployment location, this could be done in a normal Spout IMO. For now, you can make a solution where you're reading the data from the logs (using a library as previously identified) and pushing it to Kestrel, for example. Then, using storm-kestrel (the Kestrel spout), you can bring the lines into Storm for processing.
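As a sketch of the "push log lines to Kestrel, read them with the Kestrel spout" approach suggested above: Kestrel speaks the memcached text protocol, so a plain socket SET is enough to enqueue a line from the machine that holds the log. The host, port and queue name below are assumptions.

// Enqueue one log line into a Kestrel queue over the memcached text protocol.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class KestrelEnqueue {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("kestrel-host", 22133)) {   // 22133 is Kestrel's usual memcache port
            OutputStream out = socket.getOutputStream();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));

            byte[] payload = "GET /index.html 200".getBytes(StandardCharsets.UTF_8);
            String header = "set log_lines 0 0 " + payload.length + "\r\n";  // "log_lines" queue is assumed
            out.write(header.getBytes(StandardCharsets.US_ASCII));
            out.write(payload);
            out.write("\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            System.out.println(in.readLine());   // expect "STORED" on success
        }
    }
}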