MapReduce and MPP: Two sides of the Big Data coin?
2014-10-13 10:50
Summary: To many, Big Data goes hand-in-hand with Hadoop + MapReduce. But MPP (Massively Parallel Processing) and data warehouse appliances are Big Data technologies too. The MapReduce and MPP worlds have been fairly separate, but they are now starting to collide. And that’s a good thing.
When the Big Data moniker is applied to a discussion, it’s often assumed that Hadoop is, or should be, involved. But perhaps that’s just doctrinaire.
Hadoop, at its core, consists of HDFS (the Hadoop Distributed File System) and MapReduce. The latter is a computational approach that involves breaking large volumes of data down into smaller batches, and processing them separately.
A cluster of computing nodes, each one built on commodity hardware, will scan the batches and aggregate their data. Then the multiple nodes’ output gets merged to generate the final result data. In a separate post, I’ll provide a more detailed and precise
explanation of MapReduce, but this high-level explanation will do for now.
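As a rough illustration of that flow, here is a minimal single-process sketch of the classic word-count example in Python. This is a conceptual toy, not Hadoop's actual Java API: the map phase emits key/value pairs, a shuffle step groups them by key (as the framework does between phases), and the reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for each word in one input split.
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate the grouped values for one key.
    return key, sum(values)

documents = ["big data big deal", "big cluster"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 3, 'data': 1, 'deal': 1, 'cluster': 1}
```

In a real cluster, the map calls run on many nodes at once, each against its own batch of data, and the shuffle moves intermediate pairs across the network so each reducer sees all values for its keys.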
But Big Data’s not all about MapReduce. There’s another computational approach to distributed query processing, called Massively Parallel Processing, or MPP. MPP has a lot in common with MapReduce: in MPP, as in MapReduce, processing of data is distributed across a bank of compute nodes; these nodes process their data in parallel, and the node-level output sets are assembled together to produce a final result set. MapReduce and MPP are relatives. They might be siblings, parent-and-child, or maybe just kissing cousins.
But, for a variety of reasons, MPP and MapReduce are used in rather different scenarios. You will find MPP employed in high-end data warehousing appliances. Almost all of these products started out as offerings from pure-play companies, but there’s been a lot of recent M&A activity that has taken MPP mainstream. Some MPP products, like Teradata and ParAccel, are independent to this day. But other MPP appliance products have been assimilated into the mega-vendor world: Netezza was acquired by IBM; Vertica by HP; Greenplum by EMC; and Microsoft’s acquisition of DATAllegro resulted in an MPP version of SQL Server, called Parallel Data Warehouse Edition (SQL PDW, for short).
MPP gets used on expensive, specialized hardware tuned for CPU, storage and network performance. MapReduce and Hadoop, by contrast, find themselves deployed to clusters of commodity servers that in turn use commodity disks. The commodity nature of typical Hadoop hardware (and the free nature of Hadoop software) means that clusters can grow as data volumes do, whereas MPP products are bound by the cost of, and finite hardware in, the appliance, and by the relatively high cost of the software.
MPP and MapReduce are separated by more than just hardware. MapReduce’s native control mechanism is Java code (to implement the Map and Reduce logic), whereas MPP products are queried with SQL (Structured Query Language). Hive, a subproject of the overall Apache Hadoop project, essentially provides a SQL abstraction over MapReduce. Nonetheless, Hadoop is natively controlled through imperative code, while MPP appliances are queried through declarative queries. In a great many cases, SQL is easier and more productive than writing MapReduce jobs, and database professionals with the SQL skill set are more plentiful and less costly than Hadoop specialists.
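The declarative-versus-imperative contrast can be made concrete with a small sketch. Python’s built-in sqlite3 stands in here for an MPP engine’s SQL front end (an illustrative assumption only; a real appliance would parallelize the query across nodes): the same aggregation is a one-line GROUP BY in SQL, versus an explicit hand-written loop of the kind a MapReduce job would spell out.

```python
import sqlite3

rows = [("web", 120), ("mobile", 80), ("web", 60)]

# Declarative: state the result you want; the engine plans the work.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (channel TEXT, amount INT)")
con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(con.execute(
    "SELECT channel, SUM(amount) FROM sales GROUP BY channel"))

# Imperative: spell out the grouping and summing yourself,
# much as a hand-coded MapReduce job does.
loop_totals = {}
for channel, amount in rows:
    loop_totals[channel] = loop_totals.get(channel, 0) + amount

print(sql_totals == loop_totals)  # True
```

Both produce the same totals; the difference is who writes the execution plan, the engine or the developer. That difference is largely why SQL skills transfer so cheaply while MapReduce expertise stays scarce.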
But there’s no reason that SQL + MPP couldn’t be implemented on commodity hardware and, for that matter, no reason why MapReduce couldn’t be used in data warehouse appliance environments. MPP and MapReduce are both Big Data technologies.
They’re also products of different communities and cultures, but that doesn’t justify their continued separate evolution.
The MPP and Hadoop/MapReduce worlds are destined for unification. Perhaps that’s why Teradata’s Aster Data nCluster mashes up SQL, MPP and MapReduce. Or why Teradata and Hortonworks (an offshoot of Yahoo’s Hadoop team) have announced a partnership to make Hadoop and Teradata work together. And that’s probably why Microsoft is also working with Hortonworks, not only to implement Hadoop on Windows Azure (Microsoft’s cloud computing platform) and Windows Server, but also to integrate it with SQL Server business intelligence products and technologies.
Big Data is data, and it’s big, whether in a hulking data warehouse or a sprawling Hadoop cluster. Data warehouse and Hadoop practitioners have more in common than they might care to admit. Sure, one group has been more corporate
and the other more academic- or research-oriented. But those delineations are subsiding and the technology delineations should subside as well.
For now, expect to see lots of permutations of Hadoop and its ecosystem components with data warehouse, business intelligence, predictive analytics and data visualization technologies. In the future, be prepared to see these specialty
areas more unified, rationalized and seamlessly combined. The companies that get there first will have real competitive advantage. Companies that continue to just jam these things together will have a tougher time.