
Hadoop computing: deriving user attributes from file names with ansj word segmentation + BloomFilter + Hadoop

2014-12-03 00:00

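Suppose you have many terabytes of logs, and one of the fields is a file name. How do you infer a user's attributes from those file names, for example whether a file is a variety show (综艺) or a Korean drama (韩剧)?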

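1 Determine the file categories

First, how to decide which category a file belongs to. My initial idea was to run a search on each file name, but after some investigation that approach turned out to be far too painful, so I dropped it without hesitation!

I went with a second approach:

Variety shows: compile a list of variety-show keywords from Youku Tudou. A few hundred can usually be collected.

Korean dramas: the same method works here too.

Insert the keywords into MongoDB, as shown below ("n" is the keyword, "c" is the category code, with 2 meaning variety show):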

mongos> db.file_keyword.save({"n":"奔跑吧兄弟","c":2})
WriteResult({ "nInserted" : 1 })
mongos> db.file_keyword.save({"n":"这就是生活","c":2})
WriteResult({ "nInserted" : 1 })
mongos> db.file_keyword.save({"n":"勇敢的心","c":2})
WriteResult({ "nInserted" : 1 })
mongos> db.file_keyword.save({"n":"侣行","c":2})
WriteResult({ "nInserted" : 1 })
mongos> db.file_keyword.save({"n":"优酷全明星","c":2})
WriteResult({ "nInserted" : 1 })
mongos> db.file_keyword.save({"n":"鸿观","c":2})
WriteResult({ "nInserted" : 1 })

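2 Add a job to Hadoop that queries the records from the database and writes them to a file on HDFS.

3 In the second job, the one that does the real work, distribute this file to the tasks: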

job.addCacheFile(new Path(args[1]).toUri());

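4 In the task's setup initialization function, two things need to be done: initialize the BloomFilters and the user-defined dictionary. The latter is needed for word segmentation.

First, a quick primer on BloomFilter: it is a bit array in which every added item sets a fixed number of hashed bit positions. A lookup can return a false positive, but never a false negative.

The functions: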

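// Imports needed by this fragment (the members below live inside the Mapper class):
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Random;

private static final int bitArraySize = 1024 * 1024 * 20; // bits per filter
private static final int numHashFunc = 6; // bit positions set per keyword
private BitSet zongyiBloomFilter; // c=2, variety shows; must be allocated before use
private BitSet koreanBloomFilter; // c=3, Korean dramas; must be allocated before use

// Derives numHashFunc bit positions from the MD5 digest of obj.
protected int[] getHashIndexes(String obj) {
    int[] indexes = new int[numHashFunc];
    long seed = 0;
    try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        // Use a fixed charset so that every node hashes the same bytes.
        byte[] digest = md.digest(obj.getBytes(StandardCharsets.UTF_8));
        // Fold the first 6 digest bytes (48 bits) into the seed.
        for (int i = 0; i < 6; i++) {
            seed ^= ((long) digest[i] & 0xFF) << (8 * i);
        }
    } catch (NoSuchAlgorithmException e) {
        // Do not swallow this: with seed 0 every keyword would map to the same bits.
        throw new IllegalStateException("MD5 is not available", e);
    }
    // The seeded generator yields the same positions for the same string.
    Random gen = new Random(seed);
    for (int i = 0; i < numHashFunc; i++) {
        indexes[i] = gen.nextInt(bitArraySize);
    }
    return indexes;
}

// Sets every bit position derived from obj.
public void add(BitSet bf, String obj) {
    for (int index : getHashIndexes(obj)) {
        bf.set(index);
    }
}

// Returns false if obj was definitely never added; true means "probably added".
public boolean contains(BitSet bf, String obj) {
    for (int index : getHashIndexes(obj)) {
        if (!bf.get(index)) {
            return false;
        }
    }
    return true;
}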
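To try the functions above outside of Hadoop, they can be wrapped in a small standalone class. This is only a minimal sketch: the class name `KeywordBloomFilter` and the method name `mightContain` are hypothetical and do not appear in the original job, and the sketch pins the charset to UTF-8 and fails loudly if MD5 is missing, which is an assumption rather than the original behavior.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Random;

// Hypothetical wrapper; the post keeps these methods directly in the Mapper.
final class KeywordBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashFunc;

    KeywordBloomFilter(int size, int numHashFunc) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashFunc = numHashFunc;
    }

    // Same scheme as getHashIndexes: seed a PRNG from the first 6 MD5 bytes.
    private int[] indexes(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long seed = 0;
            for (int i = 0; i < 6; i++) {
                seed ^= ((long) digest[i] & 0xFF) << (8 * i);
            }
            Random gen = new Random(seed);
            int[] out = new int[numHashFunc];
            for (int i = 0; i < numHashFunc; i++) {
                out[i] = gen.nextInt(size);
            }
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    void add(String s) {
        for (int i : indexes(s)) {
            bits.set(i);
        }
    }

    // false: definitely never added; true: probably added.
    boolean mightContain(String s) {
        for (int i : indexes(s)) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        KeywordBloomFilter zongyi = new KeywordBloomFilter(1024 * 1024 * 20, 6);
        zongyi.add("奔跑吧兄弟");
        zongyi.add("侣行");
        // Added keywords are always reported as present (no false negatives).
        System.out.println(zongyi.mightContain("奔跑吧兄弟"));
        System.out.println(zongyi.mightContain("侣行"));
    }
}
```

Keeping one filter object per category mirrors the two `BitSet` fields in the post, just with the bit array and the hashing bundled together.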

As for why a BloomFilter is used: mainly to save memory. The curious can search for the details.
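The memory and accuracy trade-off can be estimated with a little arithmetic. The sketch below uses the standard approximation for the false-positive rate of a Bloom filter with m bits, k hash positions and n inserted items, p ≈ (1 - e^(-kn/m))^k. The class and method names are hypothetical, and n = 500 is an assumed stand-in for "a few hundred" keywords.

```java
// Hypothetical helper for back-of-the-envelope Bloom filter sizing.
final class BloomFilterMath {
    // Approximate false-positive rate: (1 - e^(-k*n/m))^k
    static double falsePositiveRate(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    // Size of an m-bit array in MiB.
    static double mebibytes(long bits) {
        return bits / 8.0 / (1024 * 1024);
    }

    public static void main(String[] args) {
        long m = 1024L * 1024 * 20; // bitArraySize from the post
        int k = 6;                  // numHashFunc from the post
        long n = 500;               // assumed keyword count per category
        System.out.println("bit array size (MiB): " + mebibytes(m));
        System.out.println("approx. false-positive rate: " + falsePositiveRate(m, k, n));
    }
}
```

With these parameters each filter takes 2.5 MiB of bits, and the estimated false-positive rate for a few hundred keywords is vanishingly small, so the bit array is generously sized for this workload.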

2 Initialize the BloomFilters and build the user-defined dictionary at the same time. The segmenter used here is ansj.

Key code:

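if ("2".equals(c)) {
    // zongyi (variety show): c == "2"
    String n = array[2];
    // Register the keyword in the variety-show BloomFilter ...
    this.add(zongyiBloomFilter, n);
    // ... and in the ansj user dictionary, so the segmenter keeps it as one token.
    UserDefineLibrary.insertWord(n, "userDefine", 1000);
    context.getCounter("ComputerProfileHDFS",
            "readZongYiTags").increment(1);
}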

3 Use the segmenter and the BloomFilters

// Is any token of the file name contained in one of the two BloomFilters?
List<Term> parse = ToAnalysis.parse(filename);
for (Term t : parse) {
    String name = t.getName();
    // Note: a tag is appended once per matching token.
    if (this.contains(zongyiBloomFilter, name)) {
        tags += " " + "综艺"; // variety show
        context.getCounter("ComputerProfileHDFS",
                "_MapVideoZongYiRecord").increment(1);
    }
    if (this.contains(koreanBloomFilter, name)) {
        tags += " " + "韩剧"; // Korean drama
        context.getCounter("ComputerProfileHDFS",
                "_MapVideoKoreanRecord").increment(1);
    }
}
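The matching loop above appends a tag for every token that hits a filter, so a file name containing two variety-show keywords gets the same tag twice. If one tag per category is preferred, the loop could collect tags in a set, as in the sketch below. Everything here is an assumption for illustration: `FilenameTagger` and `tagsFor` are hypothetical names, the token list stands in for the output of the ansj segmenter, the two predicates stand in for the `contains(...)` calls on the two BloomFilters, and the Korean-drama keyword is a made-up sample entry.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical helper: tags a tokenized file name, at most once per category.
final class FilenameTagger {
    static String tagsFor(List<String> tokens,
                          Predicate<String> isZongyi,
                          Predicate<String> isKorean) {
        // LinkedHashSet drops duplicates and keeps first-seen order.
        Set<String> tags = new LinkedHashSet<>();
        for (String token : tokens) {
            if (isZongyi.test(token)) {
                tags.add("综艺"); // variety show
            }
            if (isKorean.test(token)) {
                tags.add("韩剧"); // Korean drama
            }
        }
        return String.join(" ", tags);
    }

    public static void main(String[] args) {
        // Plain sets stand in for the BloomFilter lookups in this sketch.
        Set<String> zongyi = new HashSet<>(Arrays.asList("奔跑吧兄弟", "侣行"));
        Set<String> korean = new HashSet<>(Arrays.asList("来自星星的你")); // assumed sample keyword
        List<String> tokens = Arrays.asList("奔跑吧兄弟", "侣行", "mp4");
        System.out.println(tagsFor(tokens, zongyi::contains, korean::contains));
    }
}
```

Unlike the original loop, this version returns the tags without a leading space. Whether duplicates matter depends on how the downstream step consumes the tag string.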

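Run results (shown as a screenshot in the original post):

This shows that the method works.

PS:

1) The more complete the vocabulary in your MongoDB, the higher the match rate! It does not affect the time of the later lookups, because each BloomFilter check always tests the same numHashFunc bit positions regardless of how many keywords were added.

2) Optimization: combine small input files into splits of at most 128 MB, so that fewer map tasks are launched: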

job.setInputFormatClass(CombineTextInputFormat.class);

job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", 128*1024*1024);
