Learning Big Data: Flume Introduction and Installation

2017-04-20 21:07

Flume

Lab environment:

shiyanlou

- CentOS 6.6 (64-bit)

- JDK 1.7.0_55 (64-bit)

- Hadoop 1.1.2

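Introduction to Flume

Flume is a log collection system originally developed by Cloudera and now maintained as an Apache project. It lets you plug customizable data senders into a logging system to collect data, and it can apply simple processing to that data before writing it to a variety of customizable receivers.

Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data.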

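Flume Features

Reliability: three levels of delivery guarantee are available: end-to-end, store on failure, and best effort

Scalability: Flume's three major components (collector, master, and storage tier) are all scalable

Manageability: ZooKeeper and gossip keep configuration data consistent and highly available, and multiple Masters are supported

Extensibility: Flume is written in Java, and users can add new functionality to it.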

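Flume Architecture

The most important abstraction in the architecture is the data flow, which describes the path data takes from where it is produced, through transport and processing, to the destination where it is finally written.

The solid lines in the figure above represent the data flow.

Agents collect data. An agent is where a data stream originates in Flume, and it forwards that stream to a collector. The collector, in turn, aggregates data and usually produces a larger stream.

Flume can collect data from sources such as console, RPC (Thrift-RPC), text (a file), tail (UNIX tail), syslog (the syslog logging system, in both TCP and UDP modes), and exec (command execution).

Likewise, Flume's receivers can be console, text (a file), dfs (an HDFS file), RPC (Thrift-RPC), syslogTCP (a TCP syslog logging system), and others.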

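Data collection works in two main modes:

- Push Sources: an external system actively pushes data into Flume, for example RPC and syslog

- Polling Sources: Flume fetches data from an external system, typically by polling, for example text and exec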
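
As a rough illustration of the two modes, the sketch below writes a source-only configuration in the Flume 1.x properties format that is used later in this article. Everything in it is an assumption made for the example: the function name write_source_demo, the source names push1 and poll1, port 5140, and the path /var/log/messages. It is not a complete agent, because no sink is defined.

```shell
#!/bin/sh
# Sketch only: write a Flume 1.x source configuration that contrasts a push
# source with a polling-style source. All names and values are illustrative.
write_source_demo() {
    # $1: path of the properties file to create (hypothetical file name)
    cat > "$1" <<'EOF'
a1.sources = push1 poll1
a1.channels = c1

# Push: an external syslog client connects and sends events to this port
a1.sources.push1.type = syslogtcp
a1.sources.push1.host = localhost
a1.sources.push1.port = 5140
a1.sources.push1.channels = c1

# Polling style: Flume itself runs a command and reads whatever it prints
a1.sources.poll1.type = exec
a1.sources.poll1.command = tail -F /var/log/messages
a1.sources.poll1.channels = c1

a1.channels.c1.type = memory
EOF
}
```

Calling write_source_demo sources-demo.properties creates the file in the current directory. With the push source Flume only listens, while with the exec source Flume starts the command and reads its output itself.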

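Note that in Flume, agent pairs with collector, while source pairs with sink.

Source and sink describe the characteristics of the sending and receiving ends (data format, encoding, and so on), whereas agent and collector describe functional roles.

The Flume Master manages the configuration of data flows. Multiple Flume Masters synchronize their data with the gossip protocol.

Agent, collector, and Master as described here belong to the older Flume OG (0.9.x) design. Flume NG 1.x, the version installed below, has no Master or collector; each agent is configured as a set of sources, channels, and sinks.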

Installing and Deploying Flume

Download page:

http://flume.apache.org/download.html

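Unpack the archive, move it to /app, and then edit /etc/profile:

cd /home/shiyanlou/install-pack
tar -xzf flume-1.5.2-bin.tar.gz
mv apache-flume-1.5.2-bin /app/flume-1.5.2
sudo vi /etc/profile

Add the following lines to /etc/profile: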

export FLUME_HOME=/app/flume-1.5.2
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=$PATH:$FLUME_HOME/bin

Reload the profile and check that the Flume bin directory is on the PATH:

source /etc/profile
echo $PATH
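
Reading the PATH by eye is easy to get wrong, so a small check can help. The function below is a minimal sketch; the name check_flume_home is hypothetical, and the only thing it tests is whether the given directory contains an executable bin/flume-ng launcher.

```shell
#!/bin/sh
# check_flume_home DIR: succeed only if DIR looks like a Flume install,
# i.e. it contains an executable bin/flume-ng launcher.
check_flume_home() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "FLUME_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$dir/bin/flume-ng" ]; then
        echo "no executable flume-ng under $dir/bin" >&2
        return 1
    fi
    echo "Flume found in $dir"
}
```

A typical call would be check_flume_home "$FLUME_HOME" && flume-ng version, which prints the installed Flume version only when the directory check passes.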

Create flume-env.sh from its template:

cd /app/flume-1.5.2/conf
cp flume-env.sh.template flume-env.sh
sudo vi flume-env.sh

Set the JDK location and JVM options in flume-env.sh:

JAVA_HOME=/app/lib/jdk1.7.0_55
JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"

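Create the agent configuration from its template:

cp flume-conf.properties.template flume-conf.properties
sudo vi flume-conf.properties

Edit flume-conf.properties so that it defines a single agent, a1, with a netcat source, a logger sink, and a memory channel: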

# The configuration file needs to define the sources, the channels and the sinks.
# Sources, channels and sinks are defined per agent, in this case called 'a1'
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# For each one of the sources, the type is defined
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# The source writes its events to this channel
a1.sources.r1.channels = c1

# Each sink's type must be defined
a1.sinks.k1.type = logger

# Specify the channel the sink should use
a1.sinks.k1.channel = c1

# Each channel's type is defined.
a1.channels.c1.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
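
A common mistake in a file like this is a source or sink that points at a channel name that was never declared. The function below is a sketch of a quick consistency check and not part of Flume; the name check_wiring is hypothetical, and it only understands plain key = value lines.

```shell
#!/bin/sh
# check_wiring FILE AGENT: verify that every channel referenced by a source
# or sink of AGENT is also declared in "AGENT.channels".
check_wiring() {
    awk -v agent="$2" '
        /^[ \t]*#/ { next }                 # skip comment lines
        {
            eq = index($0, "=")
            if (eq == 0) next               # not a key = value line
            key = substr($0, 1, eq - 1)
            val = substr($0, eq + 1)
            gsub(/[ \t]/, "", key)
            sub(/^[ \t]+/, "", val)
            sub(/[ \t\r]+$/, "", val)
            if (key == agent ".channels") {
                n = split(val, names)
                for (i = 1; i <= n; i++) declared[names[i]] = 1
            } else if (key ~ ("^" agent "\\.sources\\.[^.]+\\.channels$") ||
                       key ~ ("^" agent "\\.sinks\\.[^.]+\\.channel$")) {
                n = split(val, names)
                for (i = 1; i <= n; i++) used[names[i]] = key
            }
        }
        END {
            bad = 0
            for (c in used) {
                if (!(c in declared)) {
                    printf "%s refers to undeclared channel %s\n", used[c], c
                    bad = 1
                }
            }
            exit bad
        }
    ' "$1"
}
```

Running check_wiring ./conf/flume-conf.properties a1 returns success when the wiring is consistent and otherwise prints the offending property and returns a non-zero status.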

Start agent a1 and send its log output to the console:

cd /app/flume-1.5.2
./bin/flume-ng agent --conf ./conf/ --conf-file ./conf/flume-conf.properties --name a1 -Dflume.root.logger=INFO,console

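The following test cannot be run in the shiyanlou environment:

Open another terminal:

# Install telnet first if it is missing: sudo yum install telnet
telnet localhost 44444
hello world

The original terminal, where the agent is running, then logs the message sent through telnet.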
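
If telnet is not available but nc (netcat) is, the same test can be scripted. This is a sketch under assumptions: the function name send_event is hypothetical, and it assumes an nc binary that accepts the -w timeout option.

```shell
#!/bin/sh
# send_event HOST PORT MESSAGE: send one line to a Flume netcat source.
# Assumes an nc (netcat) binary that supports the -w timeout option.
send_event() {
    if ! command -v nc >/dev/null 2>&1; then
        echo "nc is not installed" >&2
        return 127
    fi
    # The netcat source treats each newline-terminated line as one event.
    printf '%s\n' "$3" | nc -w 2 "$1" "$2"
}
```

For the agent configured above, the call would be send_event localhost 44444 "hello world". With its default settings the netcat source should answer each received line with OK, and the event then shows up in the agent's console log.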

The second example collects a Hadoop log file into HDFS. Create another configuration file:

cd /app/flume-1.5.2/conf
cp flume-conf.properties.template flume-conf2.properties
sudo vi flume-conf2.properties

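a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Exec source: run tail -F and turn each output line into an event
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /app/hadoop-1.1.2/logs/hadoop-shiyanlou-namenode-b393a04554e1.log

# HDFS sink: write events as plain text under /class12/out_flume
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/class12/out_flume
a1.sinks.k1.hdfs.filePrefix = events-
# Timestamp rounding (10-minute buckets); it only matters when hdfs.path
# uses time escape sequences such as %H%M
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
# Roll to a new file at about 4 MB; rollCount = 0 disables rolling by event count.
# hdfs.rollInterval is not set, so its default of 30 seconds also triggers rolls.
a1.sinks.k1.hdfs.rollSize = 4000000
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.batchSize = 10

# Memory channel: buffers up to 1000 events, at most 100 per transaction
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100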

Start the agent with the new configuration:

cd /app/flume-1.5.2
./bin/flume-ng agent --conf ./conf/ --conf-file ./conf/flume-conf2.properties --name a1 -Dflume.root.logger=INFO,console

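The agent now keeps collecting new lines from the NameNode log named in the exec source command and writes them to HDFS.

List the files under /class12/out_flume in HDFS and print one of them (the numeric suffix in the file name is generated by Flume and will differ on your system):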

hadoop fs -ls /class12/out_flume
hadoop fs -cat /class12/out_flume/events-.1433921305493
