
Three Chinese word-segmentation toolkits: Stanford CoreNLP / NLPIR (CAS) / LTP (HIT), and their basic usage

2017-03-10 16:49
Preface:

A semester has gone by and I have written quite a bit of code, but never organized it properly; I will tidy it up gradually. This first post may be a bit miscellaneous, so please bear with me. The goal is simply that when I review this material later it will not feel so unfamiliar. If it also helps a fellow NLPer, so much the better.

This post briefly introduces three segmentation systems: Stanford CoreNLP, NLPIR (from the Chinese Academy of Sciences), and LTP (from Harbin Institute of Technology), covering everything from download to calling them with simple example code.

 

1. Stanford CoreNLP

Website: http://stanfordnlp.github.io/CoreNLP/

 

Features: Stanford CoreNLP integrates many of Stanford's NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools.

 

Download: get CoreNLP plus the model jar for your language (English is the default). For Chinese, the model jar is stanford-chinese-corenlp-2016-10-31-models.jar.

Programming language:

Stanford CoreNLP is written in Java; current releases require Java 1.8+.

You can use Stanford CoreNLP from the command line, via its Java programmatic API, via third-party APIs for most major modern programming languages, or via a service. It works on Linux, OS X, and Windows.

 

(This post demonstrates usage from the command line, from Java, and from Python.)

1.1 Command line (the server defaults to port 9000)

# Run the server using all jars in the current directory (e.g., the CoreNLP home directory)

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer [port] [timeout]

Open http://localhost:9000/ in a browser to try it out.

If the corpus is Chinese, you need to pass the corresponding properties file. The command line is:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000 -props StanfordCoreNLP-chinese.properties
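Once the server is up, any HTTP client can talk to it: the pipeline properties go into a URL-encoded JSON query parameter, and the raw UTF-8 text goes in the POST body. Below is a minimal Python 3 sketch; the helper names are mine, and it assumes the server from the command above is running on localhost:9000:

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

SERVER = "http://localhost:9000"  # assumption: server started as shown above

def build_url(props):
    """Encode the annotator properties into the server's query string."""
    return SERVER + "/?properties=" + quote(json.dumps(props))

def annotate(text, props):
    """POST raw UTF-8 text; the server replies in the requested outputFormat."""
    req = Request(build_url(props), data=text.encode("utf-8"))
    return urlopen(req).read().decode("utf-8")

if __name__ == "__main__":
    props = {"annotators": "tokenize,ssplit,pos,lemma,ner", "outputFormat": "json"}
    print(build_url(props))
    # With the server running, a live call would be:
    #   print(annotate("李航老师的《统计方法》在市面上很畅销。", props))
```

The same request can of course be made with curl or any other HTTP library; the pycorenlp wrapper in section 1.3 does essentially this under the hood.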

 

1.2 Calling it from Java

Import the jars from the CoreNLP folder, plus stanford-chinese-corenlp-2016-10-31-models.jar, into your project.

 

package Test;

/**
 * Created by Roy on 2016/11/13.
 */
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

import java.util.List;

public class TestCoreNLP {
    public static void main(String[] args) {
        StanfordCoreNLP nlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
        // read some text in the text variable
        String text = "李航老师的《统计方法》在市面上很畅销。";
        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);
        nlp.annotate(document);

        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        System.out.println("word\tpos\tlemma\tner");
        for(CoreMap sentence: sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
                String word = token.get(TextAnnotation.class);
                String pos = token.get(PartOfSpeechAnnotation.class);
                String ne = token.get(NamedEntityTagAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println(word+"\t"+pos+"\t"+lemma+"\t"+ne);
            }
   
        }
      
    }
}


Named entity recognition (NER):

Reference: http://blog.csdn.net/shijiebei2009/article/details/42525091

The stanford-ner-2012-11-11-chinese archive can be found and downloaded at the link above.

package Test;

/**
 * Created by Roy on 2016/11/14.
 */
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class Ner {
    private static AbstractSequenceClassifier<CoreLabel> ner;

    public Ner() {
        String serializedClassifier = "classifiers/chinese.misc.distsim.crf.ser.gz";
        if (ner == null) {
            ner = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
        }
    }

    public String doNer(String sent) {
        return ner.classifyWithInlineXML(sent);
    }
}
 

package Test;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.Properties;
/**
 * Created by Roy on 2016/11/14.
 */
public class NerforAText {
    public static CRFClassifier<CoreLabel> segmenter;
    public NerforAText(){
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        segmenter = new CRFClassifier<CoreLabel>(props);
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);
        segmenter.flags.setProperties(props);
    }
    public static String doSegment(String text){
        String[] strs = segmenter.segmentString(text).toArray(new String[0]);
        String result="";
        for (String s:strs){
            result=result+s+" ";
        }
        System.out.println(result);
        return result;
    }
    public static void main(String args[]){
        String text="习近平祝贺特朗普当选美国总统。习近平表示,中美建交37年来,两国关系不断向前发展,给两国人民带来了实实在在的利益,也促进了世界和地区和平、稳定、繁荣。";
        NerforAText nerforAText =new NerforAText();
        String seg=nerforAText.doSegment(text);
        Ner ner=new Ner();
        System.out.println(ner.doNer(seg));

    }
}

 

1.3 Python: install pycorenlp

On Linux, simply run pip install pycorenlp.

Project page: https://github.com/smilli/py-corenlp
For usage from other programming languages, see http://stanfordnlp.github.io/CoreNLP/other-languages.html

Step 1: start the CoreNLP server from the command line (as in 1.1).

Step 2: run the Python code.

# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")  # Python 2 encoding setup
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://127.0.0.1:9000')
line = "习近平 主席 指出 ,我们 要 深入 学习 两学一做系列 活动"
print line
# besides 'text', the outputFormat can also be 'json' etc.; see the official site
output = nlp.annotate(line, properties={'annotators': 'tokenize,ssplit,pos,lemma,ner', 'outputFormat': 'text'})
print output.decode('utf-8')
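With 'outputFormat': 'json' instead of 'text', the server returns a document object whose tokens are nested under sentences, each carrying fields such as word, pos, and ner. A small Python 3 helper to flatten that structure; the function name is mine, and the field layout follows CoreNLP's JSON output format:

```python
import json

def extract_tokens(output):
    """Flatten CoreNLP JSON output into (word, pos, ner) triples."""
    doc = json.loads(output) if isinstance(output, str) else output
    rows = []
    for sentence in doc.get("sentences", []):
        for tok in sentence.get("tokens", []):
            rows.append((tok.get("word"), tok.get("pos"), tok.get("ner")))
    return rows

if __name__ == "__main__":
    # a hand-made miniature document in the same shape the server returns
    sample = {"sentences": [{"tokens": [
        {"word": "李航", "pos": "NR", "ner": "PERSON"},
        {"word": "老师", "pos": "NN", "ner": "O"},
    ]}]}
    print(extract_tokens(sample))  # → [('李航', 'NR', 'PERSON'), ('老师', 'NN', 'O')]
```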


2. The NLPIR segmentation system

Download the corresponding segmentation package from http://ictclas.nlpir.org/downloads

If initialization fails, each month you need to go to the https://github.com/NLPIR-team/NLPIR project, under NLPIR/License/license for a month/, download the matching XX.user file, and replace your local XX.user with it.

 

import java.io.UnsupportedEncodingException;
import com.sun.jna.Library;
import com.sun.jna.Native;

public class NLP {
    // Define the CLibrary interface, extending com.sun.jna.Library
    public interface CLibrary extends Library {
        // Define and initialize the interface's static instance
        CLibrary Instance = (CLibrary) Native.loadLibrary(
                System.getProperty("user.dir") + "\\source\\NLPIR", CLibrary.class);

        // native function declarations
        public int NLPIR_Init(byte[] sDataPath, int encoding, byte[] sLicenceCode);
        public String NLPIR_ParagraphProcess(String sSrc, int bPOSTagged);
        public String NLPIR_GetKeyWords(String sLine, int nMaxKeyLimit, boolean bWeightOut);
        public double NLPIR_FileProcess(String sSourceFilename, String sResultFilename, int bPOStagged);
        public String NLPIR_GetFileKeyWords(String sLine, int nMaxKeyLimit, boolean bWeightOut);
        public String NLPIR_WordFreqStat(String sText);
        public String NLPIR_FileWordFreqStat(String sSourceFilename);
        public void NLPIR_Exit();
    }

    public static String transString(String aidString, String ori_encoding, String new_encoding) {
        try {
            return new String(aidString.getBytes(ori_encoding), new_encoding);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String argu = "";
        String system_charset = "utf-8";
        int charset_type = 1;

        int init_flag = CLibrary.Instance.NLPIR_Init(argu.getBytes(system_charset),
                charset_type, "0".getBytes(system_charset));
        if (0 == init_flag) {
            System.err.println("Initialization failed!");
            return;
        }
        String sInput = "据悉,质检总局已将最新有关情况再次通报美方,要求美方加强对输华玉米的产地来源、运输及仓储等环节的管控措施,有效避免输华玉米被未经我国农业部安全评估并批准的转基因品系污染。";

        try {
            String nativeBytes = CLibrary.Instance.NLPIR_ParagraphProcess(sInput, 3);
            System.out.println("Segmentation result: " + nativeBytes);
            String nativeByte = CLibrary.Instance.NLPIR_GetKeyWords(sInput, 10, false);
            System.out.print("\nKeyword extraction result: " + nativeByte);
            String wordFreq = CLibrary.Instance.NLPIR_WordFreqStat(sInput);
            System.out.print("\nWord frequency statistics: " + wordFreq);
            CLibrary.Instance.NLPIR_Exit();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}


3. LTP and pyltp (on Linux)

Installation guide: http://ltp.readthedocs.io/zh_CN/latest/install.html

Download the LTP project files: https://github.com/HIT-SCIR/ltp/releases

Download the LTP model files: http://pan.baidu.com/share/link?shareid=1988562907&uk=2738088569#list/path=%2F

Unpack the project files, then build from the project root (make sure CMake is installed):

./configure

make

Install pyltp: pip install pyltp

 

# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 encoding setup
from pyltp import Segmentor

def read_txt(filename):
    # open the input file and the output file
    f = open('/mnt/e/code/run/' + filename, 'r')
    w = open('/mnt/e/code/Segmentation/test/seg_' + filename, 'w')
    # initialize the segmenter and load an external user dictionary
    segmentor = Segmentor()
    segmentor.load_with_lexicon('/mnt/e/Pris/duozhuan/code/data/cws.model',
                                '/mnt/e/Pris/duozhuan/code/data/pro-noun.txt')
    count = 0
    for content in f:
        parts = content.split("\t")
        if len(parts) < 2:  # skip empty or malformed lines
            continue
        index, line = parts[0], parts[1]
        words = segmentor.segment(line)
        str_line = ''
        for word in words:
            str_line += word + ' '
        w.write(index + "\t" + str_line + '\n')
        count += 1
        if count % 100 == 0:
            print count  # progress indicator
    segmentor.release()
    f.close()
    w.close()

if __name__ == '__main__':
    read_txt("q_with_id.txt")


Closing remarks:

Personally, I feel each of these three toolkits has its strengths and weaknesses. For segmentation specifically, loading an external dictionary matters a lot. Even then, some words still get split despite being in the external dictionary, such as digit+Chinese or letter+Chinese compounds, so a post-processing step is needed.
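The post-processing just mentioned can be sketched as a single pass that re-joins an alphanumeric token with its Chinese neighbor whenever the concatenation appears in the user dictionary. This is a hypothetical Python 3 helper of my own, not part of any of the three toolkits:

```python
import re

# a token made purely of ASCII digits/letters (the problematic boundary cases)
_ALNUM = re.compile(r'^[0-9A-Za-z]+$')

def merge_tokens(tokens, user_dict):
    """Re-join pairs like digit+Chinese or letter+Chinese that the
    segmenter split even though the compound is in the user dictionary."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = tokens[i] + tokens[i + 1]
            # only consider pairs where at least one side is purely alphanumeric
            crosses = _ALNUM.match(tokens[i]) or _ALNUM.match(tokens[i + 1])
            if crosses and joined in user_dict:
                merged.append(joined)
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged

# example: "3月" is in the dictionary but came back as two tokens
print(merge_tokens(["3", "月", "去", "北京"], {"3月"}))  # → ['3月', '去', '北京']
```

Restricting the merge to alphanumeric/Chinese boundaries keeps the pass from re-gluing ordinary Chinese words that the segmenter split on purpose.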

 