Hadoop Ecosystem
2016-06-08 16:59
https://blogs.walkingtree.in/2013/09/26/hadoop-ecosystem/
Problem Statement:
When we start learning Hadoop technology, we come across many components in its ecosystem. It helps to know what specific purpose each component serves within that ecosystem.
Scope of the Article:
This article describes the role of each major component in the Hadoop ecosystem.
Details:
The following diagram depicts the general Hadoop ecosystem. Not all components are mandatory, and one component often complements another.
Hadoop Distributed File System(HDFS):
HDFS is a distributed file system that spreads data across multiple servers and recovers automatically when a node fails. It is built around a write-once, read-many model: it does not support multiple concurrent writers and allows only one writer at a time. A typical Hadoop cluster can hold petabytes of data on this file system.
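As an illustrative sketch (the command names below are standard HDFS shell usage, but the paths are hypothetical), day-to-day interaction with HDFS looks like this:

```
# Copy a local file into HDFS (hypothetical paths)
hdfs dfs -put /local/data/events.log /user/demo/events.log

# List the directory and read the file back
hdfs dfs -ls /user/demo
hdfs dfs -cat /user/demo/events.log

# Appending is allowed for a single writer; concurrent writers are not supported
hdfs dfs -appendToFile more-events.log /user/demo/events.log
```

These commands only make sense against a running cluster; they are shown here to give a feel for the file-system-like interface.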
HBase:
HBase is a distributed, column-oriented database, whereas HDFS is a file system; HBase is built on top of HDFS and stores its data in HDFS. HBase does not support SQL, but it removes the concurrent-write limitation of HDFS. HBase is not a replacement for HDFS. If your big data solution needs concurrent writes, HBase can be used.
MapReduce:
MapReduce is a framework for distributed, parallel data processing. It provides a programming model for processing large data sets; MapReduce programs can be written in Java, Ruby, Python, and C++. It has an inherent capability to run programs in parallel across multiple nodes in a big data cluster, and because processing is distributed across nodes we can expect better performance and throughput. MapReduce processes data in two stages, map and reduce. Map converts the input data into an intermediate format, essentially key-value pairs. Reduce combines all map outputs that share a common key and generates a reduced set of key-value pairs. The framework has two components, the job tracker and the task tracker: the job tracker acts as the master and assigns tasks to the slaves, while the task tracker carries out the actual execution of each task and reports back to the job tracker.
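The two stages can be mimicked in plain Python (no Hadoop involved; this only sketches the map, shuffle, and reduce flow for a word count):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # Reduce: combine all values that share the same key
    return (key, sum(values))

lines = ["big data big cluster", "data data cluster"]

# Shuffle: group intermediate pairs by key, as the framework would between stages
grouped = defaultdict(list)
for line in lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)  # {'big': 2, 'data': 3, 'cluster': 2}
```

In real MapReduce the mapper and reducer run on different nodes and the framework performs the shuffle over the network; the logic per stage is the same.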
YARN:
YARN stands for 'Yet Another Resource Negotiator'. MapReduce was rewritten to overcome the potential bottleneck of the single job tracker in the old MapReduce, which was responsible for both job scheduling and monitoring task progress. YARN splits these two responsibilities into two separate daemons, the resource manager and the application master. Existing MapReduce programs can run directly on YARN, though sometimes small changes are needed.
Avro:
Avro is a data serialization format that brings data interoperability among the multiple components of Apache Hadoop, and most components in Hadoop now support it. It works on the basic premise that data produced by one component should be readily consumed by another.
Avro has the following features:
Rich data types
Fast and compact serialization
Support for many programming languages, such as Java and Python
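For illustration, an Avro schema is itself written as JSON. A hypothetical schema for a user record might look like this (the record name, namespace, and fields are invented for the example):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example.avro",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

The union type `["null", "string"]` is the standard Avro idiom for an optional field; any component that holds this schema can read records produced by any other.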
Pig:
Pig is a platform for big data analysis and processing. An immediate question comes to mind: MapReduce serves the same purpose, so what extra benefit does Pig provide? Pig adds one more level of abstraction to data processing, which makes data processing jobs much easier to write and maintain. At compile time, a Pig script is converted into a series of MapReduce programs, which are then executed according to the logic in the script. Pig has two pieces:
The language used to write programs, named Pig Latin
The execution environment in which Pig scripts run
Pig can process terabytes of data with half a dozen lines of code.
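As a sketch (the file name and field layout here are hypothetical), a complete word count in Pig Latin takes only a few lines:

```
-- Load a hypothetical text file, one line per record
lines  = LOAD 'input.txt' AS (line:chararray);
-- Split each line into individual words
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words and count each group
groups = GROUP words BY word;
counts = FOREACH groups GENERATE group AS word, COUNT(words);
STORE counts INTO 'wordcount_out';
```

Each of these statements compiles down to map and reduce steps, but the author never writes a mapper or reducer by hand.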
Hive:
Hive is a data warehousing framework on top of Hadoop. Hive lets you write SQL-like queries to process and analyze big data stored in HDFS. It is primarily intended for people who want to process big data but do not have a programming background in Java or related technologies. During execution, Hive scripts are converted into a series of MapReduce jobs.
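For illustration (the table, columns, and HDFS path are hypothetical), a Hive query reads like ordinary SQL:

```sql
-- Define a table over data already sitting in HDFS
CREATE EXTERNAL TABLE page_views (
  user_id BIGINT,
  url     STRING,
  view_ts TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/demo/page_views';

-- This query is compiled into one or more MapReduce jobs
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

The `EXTERNAL` keyword tells Hive the data is managed outside Hive, so dropping the table does not delete the underlying HDFS files.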
Sqoop:
Sqoop is a tool for transferring data from relational database environments such as Oracle, MySQL, and PostgreSQL into the Hadoop environment. It can move large amounts of data into a Hadoop system and can store the data in HDFS in Avro format.
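A typical import (the connection string, credentials, and table name here are hypothetical) looks like this:

```
# Import one MySQL table into HDFS, storing the rows as Avro data files
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username demo --password-file /user/demo/.db-password \
  --table orders \
  --target-dir /user/demo/orders \
  --as-avrodatafile
```

Under the hood, Sqoop generates a MapReduce job that reads slices of the table in parallel, which is how it sustains large transfer volumes.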
Zookeeper:
ZooKeeper is a distributed coordination and governing service for a Hadoop cluster. ZooKeeper runs on multiple nodes in the cluster, and in general the Hadoop nodes and ZooKeeper nodes are the same machines. ZooKeeper can send notifications whenever a change happens to a master node or any of its children. In Hadoop this is useful for detecting that a particular node is down and for planning the necessary communication around node failure.
Mahout:
Mahout adds data mining and machine learning capabilities to big data. It can be used, for example, to build a recommendation engine based on users' usage patterns.
Summary:
In this article we surveyed the Hadoop ecosystem and learned the primary purpose of each of its components.
References:
http://hadoop.apache.org