Custom named entities with OpenNLP

2018-03-30 00:00

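This article looks at how to define custom named entities with OpenNLP: annotating the training data, training a model, and applying the model.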

Maven

<dependency>
<groupId>org.apache.opennlp</groupId>
<artifactId>opennlp-tools</artifactId>
<version>1.8.4</version>
</dependency>

Practice

Training the model

// train the name finder
String typedEntities = "<START:organization> NATO <END>\n" +
"<START:location> United States <END>\n" +
"<START:organization> NATO Parliamentary Assembly <END>\n" +
"<START:location> Edinburgh <END>\n" +
"<START:location> Britain <END>\n" +
"<START:person> Anders Fogh Rasmussen <END>\n" +
"<START:location> U . S . <END>\n" +
"<START:person> Barack Obama <END>\n" +
"<START:location> Afghanistan <END>\n" +
"<START:person> Rasmussen <END>\n" +
"<START:location> Afghanistan <END>\n" +
"<START:date> 2010 <END>";
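// MockInputStreamFactory is a helper from OpenNLP's own test sources, not the opennlp-tools jar;
// any InputStreamFactory that serves the training text can be used in its place
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
        new PlainTextByLineStream(new MockInputStreamFactory(typedEntities), "UTF-8"));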

TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
params.put(TrainingParameters.ITERATIONS_PARAM, 70);
params.put(TrainingParameters.CUTOFF_PARAM, 1);

TokenNameFinderModel nameFinderModel = NameFinderME.train("eng", null, sampleStream,
params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));
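
The training call above passes a BioCodec as the sequence codec. As a rough illustration of what a begin/inside/outside encoding does with annotated spans, here is a minimal sketch that uses only the JDK. The class BioSketch and its encode method are hypothetical, and the label strings ("-start", "-cont", "other") reflect an assumption about what BioCodec emits that is worth checking against the library source.

```java
import java.util.Arrays;

// Hypothetical illustration of BIO-style labelling; not OpenNLP code.
class BioSketch {

    /**
     * Turns spans over tokens into one label per token.
     * spans[i] = {start, end} with end exclusive; types[i] is the entity type of spans[i].
     */
    static String[] encode(int tokenCount, int[][] spans, String[] types) {
        String[] tags = new String[tokenCount];
        Arrays.fill(tags, "other");              // tokens outside any entity
        for (int s = 0; s < spans.length; s++) {
            int start = spans[s][0];
            int end = spans[s][1];
            tags[start] = types[s] + "-start";   // first token of the entity
            for (int i = start + 1; i < end; i++) {
                tags[i] = types[s] + "-cont";    // remaining tokens of the entity
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Tokens: NATO | United States | Barack Obama
        String[] tags = encode(5, new int[][] {{0, 1}, {1, 3}, {3, 5}},
                new String[] {"organization", "location", "person"});
        System.out.println(String.join(" ", tags));
    }
}
```

The classifier then predicts one such label per token, and consecutive start/cont labels are merged back into spans when decoding.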


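OpenNLP marks entities in the training text with <START> and <END> tags. To name the entity type, add it after a colon following START, for example <START:person>. The tags and the tokens must be separated by whitespace.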
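
To make the annotation format concrete, the following sketch parses one annotated line into plain tokens plus typed token spans, using only the JDK. It is an illustration of the format, not OpenNLP's own parser: the class AnnotationSketch, the TypedSpan holder and the parse method are hypothetical names, and the "default" type for an untyped <START> tag is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the <START:type> ... <END> format; not OpenNLP code.
class AnnotationSketch {

    /** A typed span over token indexes; end is exclusive. */
    static final class TypedSpan {
        final int start;
        final int end;
        final String type;

        TypedSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /** Appends the plain tokens of the line to tokens and returns the annotated spans. */
    static List<TypedSpan> parse(String line, List<String> tokens) {
        List<TypedSpan> spans = new ArrayList<>();
        int start = -1;          // token index where the open entity began, -1 if none is open
        String type = null;
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;        // an empty line yields a single empty part
            }
            if (part.equals("<START>") || (part.startsWith("<START:") && part.endsWith(">"))) {
                // "<START:" is 7 characters long; the type sits between it and the closing ">"
                type = part.equals("<START>") ? "default" : part.substring(7, part.length() - 1);
                start = tokens.size();
            } else if (part.equals("<END>")) {
                if (start < 0) {
                    throw new IllegalArgumentException("<END> without a matching <START>");
                }
                spans.add(new TypedSpan(start, tokens.size(), type));
                start = -1;
            } else {
                tokens.add(part);
            }
        }
        if (start >= 0) {
            throw new IllegalArgumentException("<START> without a matching <END>");
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        List<TypedSpan> spans = parse(
                "<START:person> Barack Obama <END> visited <START:location> Afghanistan <END>", tokens);
        System.out.println(tokens);
        System.out.println(spans);
    }
}
```

Spans are half-open token ranges, which is the same convention the getStart() and getEnd() calls rely on when the model is used later in the article.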

Parameter notes

ALGORITHM_PARAM

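Selects the training algorithm; "MAXENT" is the maximum entropy trainer. In the words of the OpenNLP documentation: "On the engineering level, using maxent is an excellent way of creating programs which perform very difficult classification tasks very well."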

ITERATIONS_PARAM

Number of training iterations. This description comes from the command-line help, where the corresponding option is ignored if -params is used.

CUTOFF_PARAM

Minimum number of times a feature must be seen in the training data to be used by the model.
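
The effect of a cutoff can be sketched with plain Java collections. This is only an illustration of the idea, not the indexing code OpenNLP runs: the class CutoffSketch, the keepFrequent method and the feature strings such as "w=nato" are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of a feature cutoff; not OpenNLP code.
class CutoffSketch {

    /** Keeps the features that occur at least cutoff times in the observed list. */
    static Set<String> keepFrequent(List<String> observedFeatures, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (String feature : observedFeatures) {
            counts.merge(feature, 1, Integer::sum);
        }
        Set<String> kept = new TreeSet<>();      // sorted, so the result is deterministic
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> observed = Arrays.asList("w=nato", "w=nato", "w=obama", "w=2010");
        System.out.println(keepFrequent(observed, 1));   // every feature survives
        System.out.println(keepFrequent(observed, 2));   // only features seen twice or more
    }
}
```

With a training set as small as the twelve lines above, most features occur only once, which is presumably why the example sets the cutoff to 1. A higher value (the library default is 5, as far as I know) would leave the model with almost nothing to learn from.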

Using the model

Once the model has been trained, it can be used to find entities in tokenized text.

NameFinderME nameFinder = new NameFinderME(nameFinderModel);

// now test if it can detect the sample sentences

String[] sentence = "NATO United States Barack Obama".split("\\s+");

Span[] names = nameFinder.find(sentence);

Stream.of(names)
        .forEach(span -> {
            // a span covers the tokens from getStart() (inclusive) to getEnd() (exclusive)
            String named = IntStream.range(span.getStart(), span.getEnd())
                    .mapToObj(i -> sentence[i])
                    .collect(Collectors.joining(" "));
            System.out.println("find type: " + span.getType() + ",name: " + named);
        });

The output is:

find type: organization,name: NATO
find type: location,name: United States
find type: person,name: Barack Obama

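Summary

OpenNLP's annotation format for custom named entities leaves room for customization: developers can define the named entities that are specific to their own domain, which improves recognition accuracy for those entities.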

doc

opennlp-1.8.4-docs

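Chinese named entity recognition with OpenNLP (part 1: preprocessing and training the model)

Chinese named entity recognition with OpenNLP (part 2: loading the model and recognizing entities)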