HDInsight - 1: Introduction
2015-08-17 08:53
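I need to look into HDInsight for work lately, so I am taking notes here. The official site is naturally the most authoritative source, so everything below is lifted from it: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/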
Some basic concepts and definitions
Hadoop on HDInsight
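Anyone doing big data knows Hadoop, so how do HDInsight and Hadoop relate? HDInsight is Microsoft's Azure-based software architecture for big data analytics and management, and it uses the Hadoop distribution from HDP (Hortonworks Data Platform). One thing to note: when we say "Hadoop" we usually mean the whole Hadoop ecosystem, including Storm, HBase, and so on, not just the little elephant. You can think of HDInsight as an implementation of Apache Hadoop on Microsoft Azure that ships the corresponding Storm, HBase, Pig, Hive, Sqoop, Oozie, Ambari, and so on, and of course it also ties in Microsoft's own Excel, SSAS, and SSRS.
HDInsight supports two operating systems, Linux and Microsoft's own Windows. The main differences are: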
CATEGORY | HADOOP ON LINUX | HADOOP ON WINDOWS |
Cluster OS | Ubuntu 12.04 Long Term Support (LTS) | Windows Server 2012 R2 |
Cluster Type | Hadoop | Hadoop, HBase, Storm |
Deployment | Azure Management Portal, Azure CLI, Azure PowerShell | Azure Management Portal, Azure CLI, Azure PowerShell, HDInsight .NET SDK |
Cluster UI | Ambari | Cluster Dashboard |
Remote Access | Secure Shell (SSH) | Remote Desktop Protocol (RDP) |
Hadoop (the "Query" workload): Provides reliable data storage with HDFS, and a simple MapReduce programming model to process and analyze data in parallel.
HBase (the "NoSQL" workload): A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data - potentially billions of rows times millions of columns. See Overview of HBase on HDInsight.
Apache Storm (the "Stream" workload): A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.
Ambari: Cluster provisioning, management, and monitoring.
Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment.
Hive & HCatalog: Structured Query Language (SQL)-like querying, and a table and storage management layer.
Mahout: Machine learning.
MapReduce and YARN: Distributed processing and resource management.
Oozie: Workflow management.
Phoenix: Relational database layer over HBase.
Pig: Simpler scripting for MapReduce transformations.
Sqoop: Data import and export.
Tez: Allows data-intensive processes to run efficiently at scale.
ZooKeeper: Coordination of processes in distributed systems.
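To make the "simple MapReduce programming model" in the list above concrete, here is a minimal single-process sketch of a word count. It only imitates the map, shuffle, and reduce phases in plain Python; it does not use Hadoop or any HDInsight API, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in order on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data on Azure", "big clusters on Azure"]))
```

On a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network; the sketch keeps everything in one process so the data flow is easy to follow.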
HBase
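It comes in two flavors. One is Apache HBase: open source, NoSQL, built on Hadoop and modeled after Google's BigTable, with good support for accessing massive amounts of unstructured and semi-structured data. The other is HDInsight HBase, Microsoft's own offering, which stores its data directly in Azure Blob storage. HBase data can be managed from the hbase shell with the create, get, put, and scan commands; get reads a single row, while scan reads multiple rows. There is also a REST-based C# API you can call.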
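As a rough mental model of what create, put, get, and scan do, here is a toy in-memory table in Python. It is only a sketch under stated assumptions: ToyTable and its methods are hypothetical names, and this is neither the hbase shell nor any HBase client API. It mirrors a few HBase behaviors: column families are declared when the table is created, rows are kept sorted by row key, and a scan returns rows from an inclusive start key up to an exclusive stop key.

```python
from bisect import bisect_left, insort
from typing import Dict, Iterator, List, Optional, Tuple

Row = Dict[str, str]  # maps "family:qualifier" to a value


class ToyTable:
    """In-memory stand-in for one HBase table (hypothetical, not the real API)."""

    def __init__(self, name: str, families: List[str]) -> None:
        # Like `create`: a table is declared together with its column families.
        self.name = name
        self.families = set(families)
        self._keys: List[str] = []  # row keys, kept in sorted order
        self._rows: Dict[str, Row] = {}

    def put(self, row_key: str, column: str, value: str) -> None:
        """Write one cell; the column must belong to a declared family."""
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self._rows:
            insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key: str) -> Optional[Row]:
        """Read a single row by key, or None if it does not exist."""
        row = self._rows.get(row_key)
        return dict(row) if row is not None else None

    def scan(self, start: str = "", stop: Optional[str] = None) -> Iterator[Tuple[str, Row]]:
        """Yield rows with start <= key < stop, in row-key order."""
        i = bisect_left(self._keys, start)
        while i < len(self._keys):
            key = self._keys[i]
            if stop is not None and key >= stop:
                break
            yield key, dict(self._rows[key])
            i += 1


if __name__ == "__main__":
    table = ToyTable("contacts", ["personal", "office"])
    table.put("1000", "personal:name", "Alice")
    table.put("1001", "office:phone", "555-0100")
    print(table.get("1000"))
    print(list(table.scan("1000", "1002")))
```

Because rows are ordered by key, a scan over a key range is a contiguous read, which is why row-key design matters so much in HBase.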
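HBase use cases
It traces back to what Google built for its own web search: when you search for "Three-Body", it returns every page that contains "Three-Body". Beyond that, the use cases include:
Key-value storage, which suits message management, for example at Facebook.
Sensor data, including but not limited to social data, time-series data, and audit logs.
Real-time query; for example, Phoenix is a SQL query engine for Apache HBase.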
Storm
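According to the official site, it is a distributed, fault-tolerant, open-source computation system that can process data in real time with Hadoop. Storm on HDInsight has the following features:
An SLA commitment of 99.9% uptime ("three nines")
Storm components can be written in Java, C#, or Python
Built-in scale-up and scale-down mechanisms
Integrates with Event Hubs, Virtual Network, SQL, Blob storage, and DocumentDB
Real-time processing scenarios: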
Internet of Things (IoT)
Fraud detection
Social analytics
Extract, Transform, Load (ETL)
Network monitoring
Search
Mobile engagement
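For a feel of how a stream like the IoT or network-monitoring scenarios above gets processed, here is a small Python sketch that borrows Storm's vocabulary: a spout is the source of the stream and bolts are the processing steps chained after it. Everything here is an assumption made for illustration (the function names, the temperature readings, the 30.0 limit); it uses plain generators, not the Storm API, and runs in one process with none of Storm's distribution or fault tolerance.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Reading = Tuple[str, float]  # (device_id, temperature)


def sensor_spout(readings: Iterable[Reading]) -> Iterator[Reading]:
    """Spout: the source of the stream. Here it just replays a finite list."""
    for reading in readings:
        yield reading


def threshold_bolt(stream: Iterable[Reading], limit: float) -> Iterator[Reading]:
    """Bolt: pass through only the readings above the limit."""
    for device_id, temperature in stream:
        if temperature > limit:
            yield device_id, temperature


def alert_count_bolt(stream: Iterable[Reading]) -> Iterator[Dict[str, int]]:
    """Bolt: keep a running alert count per device, emitted after each tuple."""
    counts: Counter = Counter()
    for device_id, _ in stream:
        counts[device_id] += 1
        yield dict(counts)


def run_topology(readings: Iterable[Reading], limit: float = 30.0) -> Dict[str, int]:
    """Wire spout -> threshold bolt -> counting bolt and return the final counts."""
    snapshots = list(alert_count_bolt(threshold_bolt(sensor_spout(readings), limit)))
    return snapshots[-1] if snapshots else {}


if __name__ == "__main__":
    sample = [("a", 25.0), ("b", 31.5), ("a", 35.0), ("b", 40.0)]
    print(run_topology(sample))
```

Each tuple is handled as it arrives rather than after the whole input has been collected, which is the main difference from a batch job.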
Spark
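Apache Spark is an open-source parallel processing framework that supports in-memory big data analytics. It is a good fit for:
Interactive data analysis and BI
Iterative machine learning (algorithms that make repeated passes over the same dataset, which is where keeping the data in memory pays off)
Streaming and real-time data processing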
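To show what "iterative" means for machine learning, here is a minimal pure-Python sketch that fits a straight line by batch gradient descent. It does not use Spark; the function name fit_line, the learning rate, and the iteration count are assumptions chosen for the example. The point is the loop: every iteration reads the entire dataset again, so a framework that keeps the data in memory avoids reloading it from disk on each pass.

```python
from typing import List, Tuple


def fit_line(
    points: List[Tuple[float, float]],
    rate: float = 0.05,
    iterations: int = 2000,
) -> Tuple[float, float]:
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    if not points:
        raise ValueError("need at least one point")
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        # Each iteration is a full pass over the same in-memory dataset.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= rate * grad_w
        b -= rate * grad_b
    return w, b


if __name__ == "__main__":
    data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # points on y = 2x + 1
    w, b = fit_line(data)
    print(f"w={w:.4f}, b={b:.4f}")
```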