Hadoop Futures at Structure Big Data: DataStax Brisk, EMC, and MapR
2011-10-13 13:17
225 查看
he Structure Big Data conference was filled with news and rumors of new
Hadoop offerings. During a MapReduce panel DataStax
announced Brisk, a distribution of Hadoop using
Cassandra to store data instead of the Hadoop Distributed File System. EMC published a full page ad in the conference program that stated "05.09.11. EMC Greenplum. Apache Hadoop." And GigaOm, the conference presenter,
published an
article that speculated that stealth mode startup MapR Technologies "is building a propertiary version of Hadoop and is likely to launch later this year." The day after the conference Hadoop was named as the Guardian's "Innovator
of the Year" and Cloudera engineer Todd Lipcon presented Hadoop in a
keynote at EclipseCon.
GigaOm reported that MapR is
building a proprietary replacement for the Hadoop Distributed File System that is said to be three times faster than the current open source version. It comes with snapshots and no NameNode single point of failure (SPOF) and is said to be API compatible
with HDFS, so it can be a drop-in replacement.
DataStax (formerly Riptano) provides support and commercial products for Cassandra, such as the recently announced management tool
OpsCenter for Apache Cassandra. During the panel VP of Product, Ben Werther said that Brisk was motivated by customers like Netflix, which stores all their streaming data in Cassandra, and which are also
heavy users of Hive for analytics. He noted Netflix wants to be able to have interactive response to Hive queries on ClickStream data without ETL delay. Werther told InfoQ that Brisk will ship within 45 days of the announcement,
and that DataStax will be offering commercial support for the distribution. He also said that OpsCenter will allow managing multiple Data Centers, replica sets, and include basic Hadoop monitoring. Werther said that Twitter's
Rainbird project for realtime counter analytics using Cassandra will soon be available in open source.
Brisk is based on Apache Hadoop 20.2 and includes:
CassandraFS, a Hadoop-compatible file system that stores data using Cassandra.
Input and output formats to read and write Cassandra column families for Hadoop jobs
Hive support to read and write data stored in Cassandra and to allow transposing data, converting wide rows into multiple narrow rows.
Updates for the JobTracker (JT) to allow restarting it when nodes fail. However, Werther clarified that Brisk does nothing to persist in-memory JobTracker state, so while Brisk will start up a new JT, running jobs wouldn't be able to
complete
Pre-built configuration: Werther told InfoQ DataStax had simplified the whole stack, with a set of predefined flags so Cassandra comes up as both real time and Hadoop nodes.
Cassandra is a BigTable inspired NoSQL database with a Dynamo architecture. It was initially created and open sourced by Facebook, but the majority of committers on the project work at DataStax
including the project chairman and company co-founder Jonathan Ellis. Currently DataStax employs no Hadoop committers. Cassandra supports replication of data across multiple data centers, range scans, separate column families for storing data, and has recently
added secondary indexes and the ability to replicate data to different replica groups to allow analysis to access a recent copy of data without impacting production serving requirements.
InfoQ asked Werther about the maturity of Cassandra and how it compares to
HBase. Notably, Facebook which created Cassandra has been using HBase for serving large-scale
messaging and for
real-time analytics. He claimed that while Hadoop has a large community, HBase has a tiny one, whereas Cassandra has a larger community and more momentum. DataStax uses bug fixes, the backlog of unfixed bugs, community discussions, and downloads as metrics
to compare adoption. In response to InfoQ's questioning about
problems in past Cassandra deployments (such as for Digg) Werther said that the "rapidly maturing" technology was sometimes used a little early or in the wrong way, but that they have large successful customers including Cisco, Rackspace, Constant Contact,
Real Networks, and Netflix. Werther also stated that Facebook had been invested in HBase so their decision to use it over Cassandra had more to do with internal decision making. He also claimed that consistency of storage is a red herring because Cassandra's
support for eventual consistency is an option and one can run it with strong consistency.
When asked Werther said that Brisk is still being tested internally - there are no beta customers for the technology yet. InfoQ asked about large scale uses of Cassandra. Werther said the largest production deployment is a 700 node cluster being used by
a government agency. In terms of transaction volumes, he said Twitter performs 200,000 writes per second for data ingest. In terms of data storage, he said there were clusters that are storing in the "low hundreds of Terabytes" of data.
InfoQ interviewed Werther and lead engineer Jake Luciani about the architecture of Brisk and the file system implementation, CassandraFS. Some of the key differences between current Hadoop DFS (HDFS) versions, possible improvements to HDFS, and the planned
CassandraFS are outlined below:
Technically, CassandraFS creates a table with paths as the key and inodes as a value including metadata like the file owner, permissions, and a list of blocks. It then has another table with block id's as the key and serialized blocks as values.
Werther noted that Brisk works with other Hadoop ecosystem code. In response to InfoQ's questions about how customers can load log data that didn't originate in Cassandra, he said customers could use Cloudera
Flume, which they have verified can be used works with Brisk. Likewise, Werther noted that the Cloudera
Hue browser-based interface for Hadoop works with Brisk.
Hadoop offerings. During a MapReduce panel DataStax
announced Brisk, a distribution of Hadoop using
Cassandra to store data instead of the Hadoop Distributed File System. EMC published a full page ad in the conference program that stated "05.09.11. EMC Greenplum. Apache Hadoop." And GigaOm, the conference presenter,
published an
article that speculated that stealth mode startup MapR Technologies "is building a propertiary version of Hadoop and is likely to launch later this year." The day after the conference Hadoop was named as the Guardian's "Innovator
of the Year" and Cloudera engineer Todd Lipcon presented Hadoop in a
keynote at EclipseCon.
GigaOm reported that MapR is
building a proprietary replacement for the Hadoop Distributed File System that is said to be three times faster than the current open source version. It comes with snapshots and no NameNode single point of failure (SPOF) and is said to be API compatible
with HDFS, so it can be a drop-in replacement.
DataStax (formerly Riptano) provides support and commercial products for Cassandra, such as the recently announced management tool
OpsCenter for Apache Cassandra. During the panel VP of Product, Ben Werther said that Brisk was motivated by customers like Netflix, which stores all their streaming data in Cassandra, and which are also
heavy users of Hive for analytics. He noted Netflix wants to be able to have interactive response to Hive queries on ClickStream data without ETL delay. Werther told InfoQ that Brisk will ship within 45 days of the announcement,
and that DataStax will be offering commercial support for the distribution. He also said that OpsCenter will allow managing multiple Data Centers, replica sets, and include basic Hadoop monitoring. Werther said that Twitter's
Rainbird project for realtime counter analytics using Cassandra will soon be available in open source.
Brisk is based on Apache Hadoop 20.2 and includes:
CassandraFS, a Hadoop-compatible file system that stores data using Cassandra.
Input and output formats to read and write Cassandra column families for Hadoop jobs
Hive support to read and write data stored in Cassandra and to allow transposing data, converting wide rows into multiple narrow rows.
Updates for the JobTracker (JT) to allow restarting it when nodes fail. However, Werther clarified that Brisk does nothing to persist in-memory JobTracker state, so while Brisk will start up a new JT, running jobs wouldn't be able to
complete
Pre-built configuration: Werther told InfoQ DataStax had simplified the whole stack, with a set of predefined flags so Cassandra comes up as both real time and Hadoop nodes.
Cassandra is a BigTable inspired NoSQL database with a Dynamo architecture. It was initially created and open sourced by Facebook, but the majority of committers on the project work at DataStax
including the project chairman and company co-founder Jonathan Ellis. Currently DataStax employs no Hadoop committers. Cassandra supports replication of data across multiple data centers, range scans, separate column families for storing data, and has recently
added secondary indexes and the ability to replicate data to different replica groups to allow analysis to access a recent copy of data without impacting production serving requirements.
InfoQ asked Werther about the maturity of Cassandra and how it compares to
HBase. Notably, Facebook which created Cassandra has been using HBase for serving large-scale
messaging and for
real-time analytics. He claimed that while Hadoop has a large community, HBase has a tiny one, whereas Cassandra has a larger community and more momentum. DataStax uses bug fixes, the backlog of unfixed bugs, community discussions, and downloads as metrics
to compare adoption. In response to InfoQ's questioning about
problems in past Cassandra deployments (such as for Digg) Werther said that the "rapidly maturing" technology was sometimes used a little early or in the wrong way, but that they have large successful customers including Cisco, Rackspace, Constant Contact,
Real Networks, and Netflix. Werther also stated that Facebook had been invested in HBase so their decision to use it over Cassandra had more to do with internal decision making. He also claimed that consistency of storage is a red herring because Cassandra's
support for eventual consistency is an option and one can run it with strong consistency.
When asked Werther said that Brisk is still being tested internally - there are no beta customers for the technology yet. InfoQ asked about large scale uses of Cassandra. Werther said the largest production deployment is a 700 node cluster being used by
a government agency. In terms of transaction volumes, he said Twitter performs 200,000 writes per second for data ingest. In terms of data storage, he said there were clusters that are storing in the "low hundreds of Terabytes" of data.
InfoQ interviewed Werther and lead engineer Jake Luciani about the architecture of Brisk and the file system implementation, CassandraFS. Some of the key differences between current Hadoop DFS (HDFS) versions, possible improvements to HDFS, and the planned
CassandraFS are outlined below:
Current HDFS | Possible HDFS Improvements | CassandraFS |
---|---|---|
The Name Node (NN) is a single point of failure (SPOF) | Several approaches to amerliorating and elimating the NN SPOF are being developed. | CassandraFS stores data in Cassandra, which has no SPOF. |
File metadata is held in RAM by a single process, limiting the total files. | Federated HDFS and use of BookKeeper are approaches to scaling HDFS that are being developed. | CassandraFS offers virtually unlimited file scalability. |
No WAN Replication Support | No WAN Replication Support | Cassandra supports Multi-Data Center Replication |
Supports append (in Cloudera Distribution for Hadoop 3 and Apache Hadoop 0.21) | n/a | The design allows for append, but the first release won't support it. However, HDFS Append has mostly been used to support HBase, which is an unlikely technology for those using Brisk. |
Werther noted that Brisk works with other Hadoop ecosystem code. In response to InfoQ's questions about how customers can load log data that didn't originate in Cassandra, he said customers could use Cloudera
Flume, which they have verified can be used works with Brisk. Likewise, Werther noted that the Cloudera
Hue browser-based interface for Hadoop works with Brisk.
相关文章推荐
- Structure Big Data揭示Hadoop未来
- Beyond Hadoop: Next-Generation Big Data Architectures(zz)
- An introduction to Apache Hadoop for big data
- Big Data Integration with Hadoop: A Q&A Spotlig...
- Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing
- Pentaho Work with Big Data(一)—— Kettle连接Hadoop集群
- Pentaho Work with Big Data(七)—— 从Hadoop集群抽取数据
- 灵活管理Hadoop各发行版的运维利器 - vSphere Big Data Extensions
- Hadoop 不是萬能,破除七大迷思讓你做好 Big Data/Cloud Analysis
- Solr + Hadoop = Big Data Love
- MS Bigdata HDInsight -Process, analyze, and gain new insights from big data using the power of Apache Hadoop
- Note of big data dummies:Looking at Real-Time and Non-Real-Time Requirements
- "Big Data"- Reporting Over Hadoop using Hive-Intellicus 5.2
- Thinking in BigData(八)大数据Hadoop核心架构HDFS+MapReduce+Hbase+Hive内部机理详解
- Data Structure Binary Tree: Connect nodes at same level using constant extra space
- hadoop datanode 问题 INFO org.apache.hadoop.ipc.RPC: Server at /:9000 not available yet, Zzzzz..
- Big Data Analytics Beyond Hadoop
- Thinking in BigData(九)大数据hadoop集群下离线数据存储和挖掘架构
- Thinking in BigData(八)大数据Hadoop核心架构HDFS+MapReduce+Hbase+Hive内部机理详解
- Pentaho Work with Big Data(三)—— 向Hadoop集群导入数据