您的位置：首页 > 运维架构

Hadoop Futures at Structure Big Data: DataStax Brisk, EMC, and MapR

2011-10-13 13:17 225 查看

he Structure Big Data conference was filled with news and rumors of new
Hadoop offerings. During a MapReduce panel DataStax
announced Brisk, a distribution of Hadoop using
Cassandra to store data instead of the Hadoop Distributed File System. EMC published a full page ad in the conference program that stated "05.09.11. EMC Greenplum. Apache Hadoop." And GigaOm, the conference presenter,
published an
article that speculated that stealth mode startup MapR Technologies "is building a propertiary version of Hadoop and is likely to launch later this year." The day after the conference Hadoop was named as the Guardian's "Innovator
of the Year" and Cloudera engineer Todd Lipcon presented Hadoop in a
keynote at EclipseCon.
GigaOm reported that MapR is

building a proprietary replacement for the Hadoop Distributed File System that is said to be three times faster than the current open source version. It comes with snapshots and no NameNode single point of failure (SPOF) and is said to be API compatible
with HDFS, so it can be a drop-in replacement.
DataStax (formerly Riptano) provides support and commercial products for Cassandra, such as the recently announced management tool
OpsCenter for Apache Cassandra. During the panel VP of Product, Ben Werther said that Brisk was motivated by customers like Netflix, which stores all their streaming data in Cassandra, and which are also
heavy users of Hive for analytics. He noted Netflix wants to be able to have interactive response to Hive queries on ClickStream data without ETL delay. Werther told InfoQ that Brisk will ship within 45 days of the announcement,
and that DataStax will be offering commercial support for the distribution. He also said that OpsCenter will allow managing multiple Data Centers, replica sets, and include basic Hadoop monitoring. Werther said that Twitter's

Rainbird project for realtime counter analytics using Cassandra will soon be available in open source.

Brisk is based on Apache Hadoop 20.2 and includes:

CassandraFS, a Hadoop-compatible file system that stores data using Cassandra.
Input and output formats to read and write Cassandra column families for Hadoop jobs
Hive support to read and write data stored in Cassandra and to allow transposing data, converting wide rows into multiple narrow rows.
Updates for the JobTracker (JT) to allow restarting it when nodes fail. However, Werther clarified that Brisk does nothing to persist in-memory JobTracker state, so while Brisk will start up a new JT, running jobs wouldn't be able to
complete
Pre-built configuration: Werther told InfoQ DataStax had simplified the whole stack, with a set of predefined flags so Cassandra comes up as both real time and Hadoop nodes.

Cassandra is a BigTable inspired NoSQL database with a Dynamo architecture. It was initially created and open sourced by Facebook, but the majority of committers on the project work at DataStax
including the project chairman and company co-founder Jonathan Ellis. Currently DataStax employs no Hadoop committers. Cassandra supports replication of data across multiple data centers, range scans, separate column families for storing data, and has recently
added secondary indexes and the ability to replicate data to different replica groups to allow analysis to access a recent copy of data without impacting production serving requirements.

InfoQ asked Werther about the maturity of Cassandra and how it compares to
HBase. Notably, Facebook which created Cassandra has been using HBase for serving large-scale

messaging and for
real-time analytics. He claimed that while Hadoop has a large community, HBase has a tiny one, whereas Cassandra has a larger community and more momentum. DataStax uses bug fixes, the backlog of unfixed bugs, community discussions, and downloads as metrics
to compare adoption. In response to InfoQ's questioning about
problems in past Cassandra deployments (such as for Digg) Werther said that the "rapidly maturing" technology was sometimes used a little early or in the wrong way, but that they have large successful customers including Cisco, Rackspace, Constant Contact,
Real Networks, and Netflix. Werther also stated that Facebook had been invested in HBase so their decision to use it over Cassandra had more to do with internal decision making. He also claimed that consistency of storage is a red herring because Cassandra's
support for eventual consistency is an option and one can run it with strong consistency.

When asked Werther said that Brisk is still being tested internally - there are no beta customers for the technology yet. InfoQ asked about large scale uses of Cassandra. Werther said the largest production deployment is a 700 node cluster being used by
a government agency. In terms of transaction volumes, he said Twitter performs 200,000 writes per second for data ingest. In terms of data storage, he said there were clusters that are storing in the "low hundreds of Terabytes" of data.

InfoQ interviewed Werther and lead engineer Jake Luciani about the architecture of Brisk and the file system implementation, CassandraFS. Some of the key differences between current Hadoop DFS (HDFS) versions, possible improvements to HDFS, and the planned
CassandraFS are outlined below:

Current HDFS	Possible HDFS Improvements	CassandraFS
The Name Node (NN) is a single point of failure (SPOF)	Several approaches to amerliorating and elimating the NN SPOF are being developed.	CassandraFS stores data in Cassandra, which has no SPOF.
File metadata is held in RAM by a single process, limiting the total files.	Federated HDFS and use of BookKeeper are approaches to scaling HDFS that are being developed.	CassandraFS offers virtually unlimited file scalability.
No WAN Replication Support	No WAN Replication Support	Cassandra supports Multi-Data Center Replication
Supports append (in Cloudera Distribution for Hadoop 3 and Apache Hadoop 0.21)	n/a	The design allows for append, but the first release won't support it. However, HDFS Append has mostly been used to support HBase, which is an unlikely technology for those using Brisk.

Technically, CassandraFS creates a table with paths as the key and inodes as a value including metadata like the file owner, permissions, and a list of blocks. It then has another table with block id's as the key and serialized blocks as values.
Werther noted that Brisk works with other Hadoop ecosystem code. In response to InfoQ's questions about how customers can load log data that didn't originate in Cassandra, he said customers could use Cloudera
Flume, which they have verified can be used works with Brisk. Likewise, Werther noted that the Cloudera
Hue browser-based interface for Hadoop works with Brisk.

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签：

相关文章推荐

新的分享

章节导航