HBase Study Notes (2): HBase with MapReduce Example
2016-06-20 18:27
Writing a MapReduce application
HBase runs on top of Hadoop, specifically HDFS. Data in HBase is partitioned
and replicated like any other data in HDFS, which means a MapReduce program
run over data stored in HBase enjoys all the same advantages as a regular
MapReduce program. This is why your MapReduce calculation can execute the same
HBase scan as the multithreaded example and attain far greater throughput: in
the MapReduce application, the scan executes simultaneously on multiple nodes,
removing the bottleneck of funneling all data through a single machine. If
you're running MapReduce on the same cluster that's running HBase, it also
takes advantage of whatever data co-location is available. Putting it all
together, the Shakespearean counting example looks like the following listing.
package HBaseIA.TwitBase.mapreduce;

//...

public class CountShakespeare {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "TwitBase Shakespeare counter");
    job.setJarByClass(CountShakespeare.class);

    // Scan only the column holding the twit text.
    Scan scan = new Scan();
    scan.addColumn(TwitsDAO.TWITS_FAM, TwitsDAO.TWIT_COL);

    // Wire the scan into the job as the map input.
    TableMapReduceUtil.initTableMapperJob(
        Bytes.toString(TwitsDAO.TABLE_NAME),
        scan,
        Map.class,
        ImmutableBytesWritable.class,
        Result.class,
        job);

    // Map-only job: no reducer, no output files; results live in counters.
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

  public static class Map extends TableMapper<Text, LongWritable> {

    public static enum Counters { ROWS, SHAKESPEAREAN }

    private boolean containsShakespeare(String msg) {
      //...
    }

    @Override
    protected void map(ImmutableBytesWritable rowkey,
                       Result result,
                       Context context) {
      byte[] b = result.getColumnLatest(
          TwitsDAO.TWITS_FAM, TwitsDAO.TWIT_COL).getValue();
      String msg = Bytes.toString(b);
      if (msg != null && !msg.isEmpty()) {
        context.getCounter(Counters.ROWS).increment(1);
      }
      if (containsShakespeare(msg)) {
        context.getCounter(Counters.SHAKESPEAREAN).increment(1);
      }
    }
  }
}
CountShakespeare is pretty simple; it packages a Mapper implementation and a main
method. It also takes advantage of the HBase-specific MapReduce helper class
TableMapper and the TableMapReduceUtil utility class that we talked about earlier in
the chapter. Also notice the lack of a reducer. This example doesn’t need to perform
additional computation in the reduce phase. Instead, map output is collected via job
counters.
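Conceptually, a job counter is just a named long tally that each map task increments locally and the framework sums across all tasks at job completion. The bookkeeping can be sketched with the standard library alone; the enum, `tally` map, and `increment` helper below are illustrative stand-ins for Hadoop's `context.getCounter(...).increment(...)`, not the real API:

```java
import java.util.EnumMap;
import java.util.List;

public class CounterSketch {
    // Stand-in for the Counters enum the mapper declares.
    enum Counters { ROWS, SHAKESPEAREAN }

    // Stand-in for the per-task counter context; the framework
    // aggregates these tallies across all map tasks.
    static final EnumMap<Counters, Long> tally = new EnumMap<>(Counters.class);

    static void increment(Counters c) {
        tally.merge(c, 1L, Long::sum);
    }

    // Mimics the map() body: count every non-empty row, and count
    // separately the rows that mention Shakespeare.
    static void map(String msg) {
        if (msg != null && !msg.isEmpty()) {
            increment(Counters.ROWS);
        }
        if (msg != null && msg.toLowerCase().contains("shakespeare")) {
            increment(Counters.SHAKESPEAREAN);
        }
    }

    public static void main(String[] args) {
        for (String twit : List.of("to be or not to be -- shakespeare",
                                   "just had lunch",
                                   "reading Shakespeare tonight")) {
            map(twit);
        }
        System.out.println(tally.get(Counters.ROWS));          // 3
        System.out.println(tally.get(Counters.SHAKESPEAREAN)); // 2
    }
}
```

In the real job, the driver reads the aggregated values from the completed `Job` object after `waitForCompletion` returns.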
Counters are fun and all, but what about writing back to HBase? We've developed a
similar algorithm specifically for detecting references to Hamlet. The mapper is similar
to the Shakespearean example, except that its [k2,v2] output types are
[ImmutableBytesWritable,Put]: basically, an HBase rowkey and an instance of the Put
command you learned about in the previous chapter. Here's the reducer code:
public static class Reduce
    extends TableReducer<ImmutableBytesWritable, Put,
        ImmutableBytesWritable> {

  @Override
  protected void reduce(ImmutableBytesWritable rowkey,
      Iterable<Put> values, Context context)
      throws IOException, InterruptedException {
    // Emit only the first Put for each rowkey; the duplicates
    // would all write the same cell anyway.
    Iterator<Put> i = values.iterator();
    if (i.hasNext()) {
      context.write(rowkey, i.next());
    }
  }
}
There's not much to it. The reducer implementation accepts [k2,{v2}], the rowkey
and a list of Puts, as input. In this case, each Put sets the info:hamlet_tag column
to true. A Put need only be executed once for each user, so only the first is emitted
to the output context object. The [k3,v3] tuples produced are also of type
[ImmutableBytesWritable,Put]. You let the Hadoop machinery handle execution of
the Puts, which keeps the reduce implementation idempotent.
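The "emit only the first value" trick is what makes the reduce step safe to retry: however many duplicate Puts a user's twits generate, the output for that rowkey is always the same single Put. Stripped of the HBase types, the pattern can be sketched with plain-Java stand-ins (the `reduce` signature below is illustrative, not the real TableReducer API):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FirstValueReduceSketch {
    // Stand-in for reduce([k2,{v2}]) -> [k3,v3]: for each rowkey,
    // emit only the first of the grouped values. Re-running over the
    // same input produces the same output, so the step is idempotent.
    static Map<String, String> reduce(Map<String, List<String>> grouped) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            Iterator<String> i = e.getValue().iterator();
            if (i.hasNext()) {
                out.put(e.getKey(), i.next()); // first Put wins
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        grouped.put("user1", List.of("put-a", "put-b", "put-c"));
        grouped.put("user2", List.of("put-d"));
        System.out.println(reduce(grouped));
        // {user1=put-a, user2=put-d}
    }
}
```

Because the output depends only on the grouped input, a failed and re-executed reduce task writes exactly the same Puts, which is what lets Hadoop retry tasks freely.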