
Using Stanford CoreNLP

2016-05-06 20:44
I recently needed to lemmatize some English words, heard that CoreNLP was a good tool for this, and decided to give it a try.

CoreNLP is a natural-language-processing toolbox developed at Stanford University. It is simple to use and powerful, offering named entity recognition, part-of-speech tagging, lemmatization, constituency parse trees, coreference resolution, and more.

CoreNLP is written in Java and requires JDK 1.8 to run; 1.7 does not appear to be supported, which is worth keeping in mind.

The official CoreNLP documentation is sparse, but the bundled example files are enough to figure out how everything works. After a bit of experimentation, it turned out to be quite pleasant to use.

Environment:

Windows 7 64-bit

JDK 1.8

Required jar files:

(screenshot of the required jar files omitted)

Note: only English is tested here, which is why the stanford-corenlp-3.6.0-models.jar file is used. To process Chinese, download the corresponding Chinese models jar from the official site and add it to the project as well.
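For Maven users, the English setup corresponds to the artifact edu.stanford.nlp:stanford-corenlp:3.6.0, plus the same artifact with the "models" classifier. If you do want to try Chinese, a minimal sketch of loading the Chinese pipeline might look like this (assuming the Chinese models jar is on the classpath; it bundles a default configuration file named StanfordCoreNLP-chinese.properties):

import java.io.IOException;
import java.util.Properties;

import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class ChineseDemo {
    public static void main(String[] args) throws IOException {
        // Load the default Chinese configuration bundled in the Chinese models jar
        Properties props = new Properties();
        props.load(IOUtils.readerFromString("StanfordCoreNLP-chinese.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("...");  // your Chinese text here
        pipeline.annotate(document);
    }
}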

Let's go straight to the code, which is fairly simple:

package com.luchi.corenlp;

import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.hcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.hcoref.data.CorefChain;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.util.CoreMap;

public class TestNLP {

    public void test() {
        // Build a StanfordCoreNLP pipeline and configure its annotators,
        // e.g. "lemma" for lemmatization, "ner" for named entity recognition
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // The text to process
        String text = "judy has been to china . she likes people there . and she went to Beijing ";

        // Create an empty Annotation over the text
        Annotation document = new Annotation(text);

        // Run all annotators on the text
        pipeline.annotate(document);

        // Retrieve the results
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Traverse the words in the current sentence;
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // The token's surface text (i.e. the word after tokenization)
                String word = token.get(TextAnnotation.class);
                System.out.println(word);
                // Part-of-speech tag
                String pos = token.get(PartOfSpeechAnnotation.class);
                System.out.println(pos);
                // Named entity tag
                String ne = token.get(NamedEntityTagAnnotation.class);
                System.out.println(ne);
                // Lemma
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println(lemma);
            }

            // The parse tree of the sentence
            Tree tree = sentence.get(TreeAnnotation.class);
            tree.pennPrint();

            // The dependency graph of the sentence
            SemanticGraph graph = sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
        }

        // Coreference chains: each chain holds the set of mentions
        // that refer to the same entity.
        // Both sentence and token offsets start at 1.
        Map<Integer, CorefChain> corefChains = document.get(CorefChainAnnotation.class);
        if (corefChains == null) { return; }
        for (Map.Entry<Integer, CorefChain> entry : corefChains.entrySet()) {
            System.out.println("Chain " + entry.getKey() + " ");
            for (CorefChain.CorefMention m : entry.getValue().getMentionsInTextualOrder()) {
                // Subtract one: the indices count from 1 but Lists start from 0
                List<CoreLabel> tokens = sentences.get(m.sentNum - 1).get(CoreAnnotations.TokensAnnotation.class);
                // Subtract two for end: one for 0-based indexing, and one because
                // we want the last token of the mention, not the one following it
                System.out.println("  " + m + ", i.e., 0-based character offsets [" + tokens.get(m.startIndex - 1).beginPosition() +
                        ", " + tokens.get(m.endIndex - 2).endPosition() + ")");
            }
        }
    }

    public static void main(String[] args) {
        TestNLP nlp = new TestNLP();
        nlp.test();
    }

}


The comments above explain each step, so we can simply look at the output to see what the code does:

Per-token results:

(screenshot of the per-token output omitted)
In the original sentence:

"judy" is tagged NN (a noun), its named-entity tag is O, and its lemma comes back as "judy".

Note that "has" is correctly lemmatized to "have".

And "Beijing" receives the named-entity tag LOCATION, meaning the place name was recognized.
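For reference, since the code prints each token's word, POS tag, NER tag and lemma on consecutive lines, the console output for the first two tokens should look roughly like this (reconstructed from the description above, not a verbatim screenshot):

judy
NN
O
judy
has
VBZ
O
have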

Next, let's look at the parse tree (taking the first sentence as an example):

(screenshot of the parse tree omitted)
Finally, let's look at the coreference results:

(screenshot of the coreference chains omitted)
Each chain contains the mentions that refer to the same entity. In chain 1, for example, the two occurrences of "she" appear in different positions, but both refer to "judy" in the first sentence, which matches the meaning of the original text, so the resolution is correct. The offsets printed are each mention's 0-based character offsets in the text.
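If you only need one canonical name per entity rather than every mention, CorefChain also exposes the chain's representative mention. A minimal sketch, reusing the document and corefChains map from the example above:

// For each chain, print the mention CoreNLP considers most representative,
// e.g. a proper name rather than a pronoun
for (CorefChain chain : corefChains.values()) {
    CorefChain.CorefMention rep = chain.getRepresentativeMention();
    System.out.println("Chain " + chain.getChainID() + " -> " + rep.mentionSpan);
}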

Of course, I only need CoreNLP's lemmatization feature, so the code above can be trimmed down to do just that. The test code is as follows:
package com.luchi.corenlp;

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class Lemma {

    // Lemmatize the input string, returning the space-joined lemmas
    public String stemmed(String inputStr) {
        // "lemma" only needs the tokenize, ssplit and pos annotators upstream,
        // so we can drop ner, parse and dcoref for a much faster pipeline
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation(inputStr);
        pipeline.annotate(document);
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        StringBuilder outputStr = new StringBuilder();
        for (CoreMap sentence : sentences) {
            // Traverse the words in the current sentence;
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                String lemma = token.get(LemmaAnnotation.class);
                outputStr.append(lemma).append(" ");
            }
        }
        return outputStr.toString();
    }

    public static void main(String[] args) {
        Lemma lemma = new Lemma();
        String input = "jack had been to china three months ago. he likes china very much,and he is falling in love with this country";
        String output = lemma.stemmed(input);
        System.out.print("Original  :");
        System.out.println(input);
        System.out.print("Lemmatized:");
        System.out.println(output);
    }

}


The output is:

(screenshot of the output omitted)
The results are quite accurate.
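One practical note: constructing a StanfordCoreNLP pipeline loads large model files and is by far the most expensive step, so if you plan to lemmatize many strings it is worth building the pipeline once and reusing it. A minimal sketch of that refactoring (hypothetical class name, same API calls as above):

import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class ReusableLemmatizer {

    // Built once in the constructor; annotate() can then be called many times
    private final StanfordCoreNLP pipeline;

    public ReusableLemmatizer() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        this.pipeline = new StanfordCoreNLP(props);
    }

    public String lemmatize(String text) {
        Annotation document = new Annotation(text);
        pipeline.annotate(document);
        StringBuilder sb = new StringBuilder();
        for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                sb.append(token.get(LemmaAnnotation.class)).append(' ');
            }
        }
        return sb.toString().trim();
    }
}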