
HBase Study Notes II: HBase with MapReduce Example

2016-06-20 18:27
Writing a MapReduce application

HBase is running on top of Hadoop, specifically the HDFS. Data in HBase is partitioned and replicated like any other data in the HDFS. That means running a MapReduce program over data stored in HBase has all the same advantages as a regular MapReduce program. This is why your MapReduce calculation can execute the same HBase scan as the multithreaded example and attain far greater throughput. In the MapReduce application, the scan is executing simultaneously on multiple nodes. This removes the bottleneck of all data moving through a single machine. If you’re running MapReduce on the same cluster that’s running HBase, it’s also taking advantage of any collocation that might be available. Putting it all together, the Shakespearean counting example looks like the following listing.

package HBaseIA.TwitBase.mapreduce;

//...

public class CountShakespeare {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "TwitBase Shakespeare counter");
    job.setJarByClass(CountShakespeare.class);

    Scan scan = new Scan();
    scan.addColumn(TwitsDAO.TWITS_FAM, TwitsDAO.TWIT_COL);
    TableMapReduceUtil.initTableMapperJob(
        Bytes.toString(TwitsDAO.TABLE_NAME),
        scan,
        Map.class,
        ImmutableBytesWritable.class,
        Result.class,
        job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }

  public static class Map extends TableMapper<Text, LongWritable> {

    private boolean containsShakespeare(String msg) {
      //...
    }

    @Override
    protected void map(ImmutableBytesWritable rowkey, Result result,
        Context context) {
      byte[] b = result.getColumnLatest(TwitsDAO.TWITS_FAM,
          TwitsDAO.TWIT_COL).getValue();
      String msg = Bytes.toString(b);

      if ((msg != null) && !msg.isEmpty()) {
        context.getCounter(Counters.ROWS).increment(1);
      }

      if (containsShakespeare(msg)) {
        context.getCounter(Counters.SHAKESPEAREAN).increment(1);
      }
    }

    public static enum Counters { ROWS, SHAKESPEAREAN }
  }
}


CountShakespeare is pretty simple; it packages a Mapper implementation and a main method. It also takes advantage of the HBase-specific MapReduce helper class TableMapper and the TableMapReduceUtil utility class that we talked about earlier in the chapter. Also notice the lack of a reducer. This example doesn’t need to perform additional computation in the reduce phase. Instead, map output is collected via job counters.
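Counters like these show up in the job’s console output, but you can also read them back programmatically once the job finishes. As a minimal sketch (not part of the original listing; the variable names are illustrative), the last line of main could be replaced with something like this:

    // Read the aggregated counter values back from the Job object
    // after the job has finished.
    boolean success = job.waitForCompletion(true);
    if (success) {
      org.apache.hadoop.mapreduce.Counters counters = job.getCounters();
      long rows =
          counters.findCounter(Map.Counters.ROWS).getValue();
      long shakespearean =
          counters.findCounter(Map.Counters.SHAKESPEAREAN).getValue();
      System.out.println("rows examined:      " + rows);
      System.out.println("Shakespearean rows: " + shakespearean);
    }
    System.exit(success ? 0 : 1);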

Counters are fun and all, but what about writing back to HBase? We’ve developed a similar algorithm specifically for detecting references to Hamlet. The mapper is similar to the Shakespearean example, except that its [k2,v2] output types are [ImmutableBytesWritable,Put]: basically, an HBase rowkey and an instance of the Put command you learned in the previous chapter.
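That mapper isn’t reproduced here, so the following is only a rough sketch of what it might look like; the HamletMap class name, the containsHamlet() helper, and the info family / hamlet_tag qualifier constants are illustrative assumptions, not the original TwitBase code:

    public static class HamletMap
        extends TableMapper<ImmutableBytesWritable, Put> {

      // Hypothetical constants; the real code would keep these in a DAO.
      private static final byte[] INFO_FAM = Bytes.toBytes("info");
      private static final byte[] HAMLET_COL = Bytes.toBytes("hamlet_tag");

      private boolean containsHamlet(String msg) {
        // Placeholder detection logic for the sketch.
        return msg.toLowerCase().contains("hamlet");
      }

      @Override
      protected void map(ImmutableBytesWritable rowkey, Result result,
          Context context) throws IOException, InterruptedException {
        byte[] b = result.getColumnLatest(TwitsDAO.TWITS_FAM,
            TwitsDAO.TWIT_COL).getValue();
        String msg = Bytes.toString(b);

        if (msg != null && !msg.isEmpty() && containsHamlet(msg)) {
          // [k2,v2] = [rowkey, Put]: tag this user's row.
          Put p = new Put(rowkey.get());
          p.add(INFO_FAM, HAMLET_COL, Bytes.toBytes(true));
          context.write(rowkey, p);
        }
      }
    }

Here’s the reducer code: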

 

public static class Reduce extends
    TableReducer<ImmutableBytesWritable, Put, ImmutableBytesWritable> {

  @Override
  protected void reduce(ImmutableBytesWritable rowkey, Iterable<Put> values,
      Context context) throws IOException, InterruptedException {
    // Only the first Put per rowkey is needed; emit it and drop the rest.
    Iterator<Put> i = values.iterator();

    if (i.hasNext()) {
      context.write(rowkey, i.next());
    }
  }
}


There’s not much to it. The reducer implementation accepts [k2,{v2}], the rowkey and a list of Puts, as input. In this case, each Put is setting the info:hamlet_tag column to true. A Put need only be executed once for each user, so only the first is emitted to the output context object. The [k3,v3] tuples produced are also of type [ImmutableBytesWritable,Put]. You let the Hadoop machinery handle execution of the Puts to keep the reduce implementation idempotent.
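For completeness, here’s a sketch of how the driver for such a job might be wired together. The TagHamlet class name and the target table are assumptions for this sketch; the TableMapReduceUtil calls are the standard helpers, and it’s initTableReducerJob that configures TableOutputFormat so the framework executes the Puts your reducer emits:

    package HBaseIA.TwitBase.mapreduce;

    //...

    // Illustrative driver, not one of the book's listings.
    public class TagHamlet {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "TwitBase Hamlet tagger");
        job.setJarByClass(TagHamlet.class);

        Scan scan = new Scan();
        scan.addColumn(TwitsDAO.TWITS_FAM, TwitsDAO.TWIT_COL);

        // Map phase: read twits, emit [rowkey, Put] pairs.
        TableMapReduceUtil.initTableMapperJob(
            Bytes.toString(TwitsDAO.TABLE_NAME),
            scan,
            HamletMap.class,
            ImmutableBytesWritable.class,
            Put.class,
            job);

        // Reduce phase: initTableReducerJob sets TableOutputFormat, so
        // the Puts written to the context are applied to the target
        // table ("users" here is an assumed name).
        TableMapReduceUtil.initTableReducerJob(
            "users",
            Reduce.class,
            job);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }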