
Counting Chinese Word Frequencies with Pig + Ansj

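2016-01-12 20:21

Lately I have really enjoyed using Pig. Its built-in functions cover most needs, and it supports user defined functions (UDFs). It can load plain text, Avro, and other data formats. illustrate shows the result of each step of a Pig script, and describe shows the schema of an alias. Best of all, it runs MapReduce jobs from lightweight scripts.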

1. Word Count

Compared with Chinese, English text is fairly regular: it can be split into words on spaces and punctuation.

A = load '/user/.*/req-temp/text.txt' as (text:chararray);
B = foreach A generate flatten(TOKENIZE(text)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;

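Pig's built-in function TOKENIZE uses StringTokenizer to split English text into words (see the TOKENIZE source code). It extends the abstract class EvalFunc<T> and returns the words as a DataBag. To count the frequency of individual words, the bag has to be unnested with flatten. The abstract class EvalFunc<T> is the base class for functions used in Pig's foreach .. generate .. statement to transform data fields; its exec() method is called at run time, once for each input tuple.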

public class TOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        ...
        DataBag output = mBagFactory.newDefaultBag();
        ...
        String delim = " \",()*";
        ...
        StringTokenizer tok = new StringTokenizer((String)o, delim, false);
        while (tok.hasMoreTokens()) {
            output.add(mTupleFactory.newTuple(tok.nextToken()));
        }
        return output;
        ...
    }
}
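To make the tokenize, flatten, group, count flow concrete, here is a minimal plain-Java sketch that runs without Pig or Hadoop. It reuses the delimiter string from the TOKENIZE excerpt above; the class name WordCountSketch, its method names, and the sample sentences are made up for illustration, and a List stands in for Pig's DataBag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; not part of Pig.
public class WordCountSketch {
    // Same delimiter set as in the TOKENIZE excerpt: space, quote, comma, parentheses, asterisk.
    public static final String DELIM = " \",()*";

    // Plays the role of TOKENIZE: one line of text in, a "bag" (here a List) of tokens out.
    public static List<String> tokenize(String text) {
        List<String> bag = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(text, DELIM, false);
        while (tok.hasMoreTokens()) {
            bag.add(tok.nextToken());
        }
        return bag;
    }

    // Plays the role of flatten + group + COUNT: every token of every line
    // becomes its own record, and records are counted per distinct word.
    public static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : tokenize(line)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "pig runs (map reduce) jobs",
                "pig \"loves\" jobs");
        System.out.println(countWords(lines));
    }
}
```

Note that the delimiter set shown in the excerpt does not include the period, so with this sketch a token such as "jobs." would be counted separately from "jobs".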

2. Chinese Word Segmentation with Ansj

Writing a Pig UDF requires the following Maven dependencies:

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>${hadoop.version}</version>
<scope>provided</scope>
</dependency>

<dependency>
<groupId>org.apache.pig</groupId>
<artifactId>pig</artifactId>
<version>${pig.version}</version>
<scope>provided</scope>
</dependency>

<dependency>
<groupId>org.ansj</groupId>
<artifactId>ansj_seg-all-in-one</artifactId>
<version>3.0</version>
</dependency>

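Run hadoop version to get the Hadoop version and pig -i to get the Pig version. Make sure the dependency versions match the Pig deployed on the cluster; otherwise the job fails with an error like this: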


ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias D


Then, following the pattern of TOKENIZE.java, adapt it into a Chinese word segmentation UDF, Segment.java:


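package com.pig.udf;

import java.io.IOException;
import java.util.List;

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.ansj.util.FilterModifWord;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class Segment extends EvalFunc<DataBag> {

    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        try {
            // Null or empty input yields a null result, as in TOKENIZE.
            if (input == null)
                return null;
            if (input.size() == 0)
                return null;
            Object o = input.get(0);
            if (o == null)
                return null;
            DataBag output = mBagFactory.newDefaultBag();
            if (!(o instanceof String)) {
                int errCode = 2114;
                String msg = "Expected input to be chararray, but" +
                        " got " + o.getClass().getName();
                throw new ExecException(msg, errCode, PigException.BUG);
            }

            // filter punctuation: "w" is the part-of-speech tag for punctuation
            FilterModifWord.insertStopNatures("w");
            List<Term> words = ToAnalysis.parse((String) o);
            words = FilterModifWord.modifResult(words);

            // One single-field tuple per segmented word.
            for (Term word : words) {
                output.add(mTupleFactory.newTuple(word.getName()));
            }
            return output;
        } catch (ExecException ee) {
            throw ee;
        }
    }

    @SuppressWarnings("deprecation")
    @Override
    public Schema outputSchema(Schema input) {
        ...
    }
    ...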

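Ansj supports stop words defined by part of speech: FilterModifWord.insertStopNatures("w"); filters out terms tagged as punctuation. Package the source code into a jar, put it on HDFS, and then register the jar to call the UDF: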

REGISTER ../piglib/udf-0.0.1-SNAPSHOT-jar-with-dependencies.jar
A = load '/user/.*/renmin.txt' as (text:chararray);
B = foreach A generate flatten(com.pig.udf.Segment(text)) as word;
C = group B by word;
D = foreach C generate COUNT(B), group;
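The part-of-speech stop filter used in Segment.java can be pictured with a small stand-alone Java sketch. This is not Ansj's implementation: TaggedTerm and dropStopNatures are hypothetical names, the tags are assigned by hand rather than produced by a segmenter, and matching the tag exactly is an assumption of the sketch. The Chinese tokens are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of filtering segmented terms by part-of-speech tag.
public class StopNatureSketch {

    // Stand-in for a segmented term: surface text plus a part-of-speech tag.
    public static final class TaggedTerm {
        public final String name;
        public final String nature;

        public TaggedTerm(String name, String nature) {
            this.name = name;
            this.nature = nature;
        }
    }

    // Keeps the text of every term whose tag is not in stopNatures
    // (exact tag match; an assumption of this sketch).
    public static List<String> dropStopNatures(List<TaggedTerm> terms, Set<String> stopNatures) {
        List<String> kept = new ArrayList<>();
        for (TaggedTerm term : terms) {
            if (!stopNatures.contains(term.nature)) {
                kept.add(term.name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hand-tagged tokens: "n" = noun, "w" = punctuation.
        List<TaggedTerm> terms = Arrays.asList(
                new TaggedTerm("\u5bb6\u98ce", "n"),   // jiafeng (family values)
                new TaggedTerm("\uff0c", "w"),         // full-width comma
                new TaggedTerm("\u539a\u7231", "n"),   // hou'ai (deep affection)
                new TaggedTerm("\u3002", "w"));        // ideographic full stop
        Set<String> stopNatures = new HashSet<>(Arrays.asList("w"));
        // Only the two nouns remain; both punctuation terms are dropped.
        System.out.println(dropStopNatures(terms, stopNatures));
    }
}
```

Dropping punctuation before the group step keeps commas, quotation marks, and full stops from showing up as high-frequency "words" in the final counts.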

Take a passage from a People's Daily editorial as input:


树好家风,严管才是厚爱。古人说:“居官所以不能清白者,率由家人喜奢好侈使然也。”要看到,好的家风,能系好人生的“第一粒扣子”。“修身、齐家”,才能“治国、平天下”,领导干部首先要“正好家风、管好家人、处好家事”,才能看好“后院”、堵住“后门”。“父母之爱子,则为之计深远”,与其冒着风险给子女留下大笔钱财,不如给子女留下好家风、好作风,那才是让子女受益无穷的东西,才是真正的“为之计深远”。


The resulting word frequencies are:


...
(3,能)
(2,要)
(2,计)
(1,与其)
(1,作风)
(1,使然)
(1,修身)
(1,厚爱)
(1,受益)
...


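As the output shows, without a user-defined dictionary loaded, Ansj's segmentation is not ideal: it fails to segment idioms and similar fixed expressions correctly. For instance, 计 shows up as a standalone token with a count of 2.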