Pig and Hive at Yahoo!
2012-01-04 01:25
http://developer.yahoo.com/blogs/hadoop/posts/2010/08/pig_and_hive_at_yahoo/
Yahoo! has begun evaluating Hive for use as part of its Hadoop stack. Since, in many people's minds, Hive and Pig are roughly equivalent and Pig Latin is very close to SQL, this has led to some confusion: why are we interested in using both technologies? As we have looked at our workloads and analyzed our use cases, we have concluded that different use cases require different tools. In this post, I will walk through our thinking on why both of these tools belong in our toolkit, and when each is appropriate.
Data preparation and presentation
Let me begin with a little background on processing and using large data. Data processing often splits into three separate tasks: data collection, data preparation, and data presentation. I will not discuss the data collection phase, because I want to focus on Pig and Hive, neither of which plays a role in that phase.
The data preparation phase is often known as ETL (Extract Transform Load) or the
data factory. "Factory" is a good analogy because it captures the essence of what is being done in this stage: Just as a physical factory brings in raw materials and outputs products ready for consumers, so a data factory brings in raw data and produces
data sets ready for data users to consume. Raw data is loaded in, cleaned up, conformed to the selected data model, joined with other data sources, and so on. Users in this phase are generally engineers, data specialists, or researchers.
The data presentation phase is usually referred to as the data warehouse. A warehouse stores products ready for consumers; they need only come and select the proper products off of the shelves. In this phase, users may be engineers using the data
for their systems, analysts, or decision-makers.
Given the different workloads and different users for each phase, we have found that different tools work best in each phase. Pig (combined with a workflow system such as Oozie) is best suited for the data factory, and Hive for the data warehouse.
Data factory use cases
In our data factory we have observed three distinct workloads: pipelines, iterative processing, and research.
Pipelines
The classical data pipelines bring in a data feed, and clean and transform it. A common example of such a feed is logs from Yahoo!'s web servers. These logs undergo a cleaning step where bots, company internal views, and clicks are removed.
We also do transformations such as, for each click, finding the page view that preceded that click.
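To make the two steps above concrete, here is a minimal sketch in plain Python of what such a pipeline does (a production version would be a Pig Latin script). All field names, the bot list, and the internal IP prefix are hypothetical, invented for illustration.

```python
# Illustrative sketch of the log-cleaning pipeline described above.
# Field names, bot agents, and the internal IP prefix are assumptions.
BOT_AGENTS = {"Googlebot", "YahooSlurp"}   # hypothetical bot identifiers
INTERNAL_PREFIX = "10."                    # hypothetical internal IP range

def clean(log_records):
    """Drop bot traffic and company-internal views."""
    return [r for r in log_records
            if r["agent"] not in BOT_AGENTS
            and not r["ip"].startswith(INTERNAL_PREFIX)]

def attach_preceding_view(records):
    """For each click, find the same user's most recent page view before it."""
    out = []
    latest_view = {}                       # user -> most recent page view so far
    for r in sorted(records, key=lambda r: r["ts"]):
        if r["type"] == "view":
            latest_view[r["user"]] = r["url"]
        elif r["type"] == "click":
            out.append({**r, "preceding_view": latest_view.get(r["user"])})
    return out
```

In the real pipeline each of these steps is a relational operation (FILTER, JOIN, and so on) over billions of records, which is exactly what Pig Latin expresses directly.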
See my previous
post that discusses why we have found Pig to be the best tool for implementing these pipelines. We use it together with our workflow tool,
Oozie, to construct pipelines, some of which run tens of thousands of Pig jobs a day.
Iterative processing
A closely related use case is iterative processing. In this case, there is usually one very large data set that is maintained. Typical processing on that data set involves bringing in small new pieces of data that will change the state of
the large data set.
For example, consider a data set that contained all the news stories currently known to Yahoo! News. You can envision this as a huge graph, where each story is a node. In this graph, there are links between stories that reference the same events. Every few
minutes a new set of stories comes in, and the tools need to add these to the graph, find related stories and create links, and remove old stories that these new stories supersede.
What sets this apart from the standard pipeline case is the constant inflow of small changes, which requires an incremental processing model to process the data in a reasonable amount of time.
For example, if the process has already done a join against the graph of all news stories, and a small set of new stories arrives, re-running the join across the whole set is not desirable; it could take hours or days.
Instead, joining against the new incremental data and using the results together with the results from the previous full join is the correct approach. This will take only a few minutes. Standard database operations can be implemented in this incremental
way in Pig Latin, making Pig a good tool for this use case.
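The incremental-join idea above can be sketched in a few lines of plain Python (again, a real implementation would be a Pig Latin script over HDFS data; the record shapes here are invented for illustration):

```python
# Sketch of a full join versus an incremental join, as described above.
# "stories" and "refs" are hypothetical record shapes.
def join(stories, refs):
    """Full join: pair each reference with the story it points at."""
    by_id = {s["id"]: s for s in stories}
    return [(by_id[r["story"]], r["event"])
            for r in refs if r["story"] in by_id]

def incremental_join(prev_result, new_stories, refs):
    """Join only the new stories, then merge with the previous full result."""
    return prev_result + join(new_stories, refs)
```

As long as the new stories are disjoint from the old ones, joining only the increment and appending it to the previous result yields the same output as re-running the full join, at a small fraction of the cost.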
Research
A third use case is research. Yahoo! has many scientists who use our grid tools to comb through the petabytes of data we have. Many of these researchers want to quickly write a script to test a theory or gain deeper insight.
But, in the data factory, data may not be in a nice, standardized state yet. This makes Pig a good fit for this use case as well, since it supports data with partial or unknown schemas, and semi-structured or unstructured data.
Pig integration with streaming also makes it easy for researchers to take a Perl or Python script they have already debugged on a small data set and run it against a huge data set.
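The kind of script the paragraph above describes might look like this: a small Python filter that reads records on stdin and writes matches to stdout, so the same file can be debugged locally on a sample and then invoked from a Pig script (via Pig's STREAM operator) against the full data set. The tab-separated field layout and the search pattern are assumptions for illustration.

```python
# A hypothetical research script of the kind used with Pig streaming:
# reads tab-separated (url, message) lines, emits those matching a pattern.
import re
import sys

PATTERN = re.compile(r"error", re.IGNORECASE)  # hypothetical query

def process(lines):
    """Yield the input lines whose message field matches PATTERN."""
    for line in lines:
        url, _, message = line.rstrip("\n").partition("\t")
        if PATTERN.search(message):
            yield f"{url}\t{message}"

if __name__ == "__main__":
    # Same code path whether fed a local sample or streamed from Pig.
    for out in process(sys.stdin):
        print(out)
```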
Data warehouse use cases
In the data warehouse phase of processing, we see two dominant use cases: business-intelligence analysis and ad-hoc queries. In the first case, users connect the data to business intelligence (BI) tools, such as MicroStrategy, to generate reports or do further analysis.
In the second case, data analysts or decision-makers issue ad-hoc queries.
In both cases, the relational model and SQL are the best fit. Indeed, data warehousing has been one of the core use cases for SQL through much of its history. It has the right constructs to support the types of queries and tools that analysts want to use.
And it is already in use by both the tools and users in the field.
The Hadoop subproject Hive provides a SQL interface and relational model for Hadoop. The Hive team has begun work to integrate with BI tools via interfaces such as
ODBC.
Pig + Hive = Hadoop toolkit
The widespread use of Pig at Yahoo! has enabled the migration of our data factory processing to Hadoop. With the adoption of Hive, we will be able to move much of our data warehousing to Hadoop as well. Having the data factory and the data warehouse on the same system will lower data-loading time into the warehouse: as soon as the factory is finished with a data set, it is available in the warehouse.
It will also enable us to share — across both the factory and the warehouse — metadata, monitoring, and management tools; support and operations teams; and hardware.
So we are excited to add Hive to our toolkit, and look forward to using both these tools together as we lean on Hadoop to do more and more of our data processing.