Three word segmentation toolkits: basic usage of Stanford CoreNLP / NLPIR (Chinese Academy of Sciences) / LTP (Harbin Institute of Technology)
2017-03-10 16:49
Foreword:
Over the past semester I have written quite a bit of code but never organized it properly; I will tidy it up gradually. This first post may be somewhat scattered, for which I apologize. The goal is simply that when I review this material later I will not be so rusty. If it also helps fellow NLPers, that would be my pleasure.
This post briefly introduces three segmentation systems, Stanford CoreNLP, NLPIR (Chinese Academy of Sciences), and LTP (Harbin Institute of Technology), from download through calling them with simple example code.
1. Stanford CoreNLP
Website: http://stanfordnlp.github.io/CoreNLP/
Features: Stanford CoreNLP integrates many of Stanford's NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools.
Download: the CoreNLP jar plus the jar for the corresponding language models (English by default); for Chinese, for example, it is stanford-chinese-corenlp-2016-10-31-models.jar.
Programming language:
Stanford CoreNLP is written in Java; current releases require Java 1.8+. You can use Stanford CoreNLP from the command line, via its Java programmatic API, via third-party APIs for most major modern programming languages, or via a service. It works on Linux, OS X, and Windows.
(This post demonstrates usage from the command line, Java, and Python.)
1.1 Command line (the server listens on port 9000 by default)
# Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer [port] [timeout]
Open http://localhost:9000/ in a browser and you can run annotations interactively.
If the corpus is Chinese, you need to load the corresponding properties file. The command line is as follows:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000 -props StanfordCoreNLP-chinese.properties
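Under the hood, both the browser demo and third-party clients talk to this server over HTTP: the annotators and output format are passed as a JSON "properties" object in the query string, and the text to annotate goes in the POST body. A minimal Python 3 sketch of how such a request URL is assembled; the helper name build_annotate_url is my own illustration, not part of any client library:

```python
import json
from urllib.parse import urlencode

def build_annotate_url(host, annotators, output_format):
    # The CoreNLP server reads a JSON "properties" object from the query string;
    # the raw text to annotate is then sent as the POST body.
    props = {'annotators': annotators, 'outputFormat': output_format}
    return host + '/?' + urlencode({'properties': json.dumps(props)})

url = build_annotate_url('http://localhost:9000', 'tokenize,ssplit,pos', 'json')
print(url)
```

A client library like pycorenlp essentially builds this URL for you and POSTs the text to it.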
1.2 Calling from Java code
Import the jars from the CoreNLP folder, plus stanford-chinese-corenlp-2016-10-31-models.jar, into your project.
package Test;

/**
 * Created by Roy on 2016/11/13.
 */
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.List;

public class TestCoreNLP {
    public static void main(String[] args) {
        StanfordCoreNLP nlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
        // read some text in the text variable
        String text = "李航老师的《统计方法》在市面上很畅销。";
        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);
        nlp.annotate(document);
        // these are all the sentences in this document;
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        System.out.println("word\tpos\tlemma\tner");
        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence;
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                String word = token.get(TextAnnotation.class);
                String pos = token.get(PartOfSpeechAnnotation.class);
                String ne = token.get(NamedEntityTagAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println(word + "\t" + pos + "\t" + lemma + "\t" + ne);
            }
        }
    }
}
Named entity recognition (NER):
Reference: http://blog.csdn.net/shijiebei2009/article/details/42525091
The stanford-ner-2012-11-11-chinese archive can be found and downloaded at the link above.
package Test;

/**
 * Created by Roy on 2016/11/14.
 */
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class Ner {
    private static AbstractSequenceClassifier<CoreLabel> ner;

    public Ner() {
        String serializedClassifier = "classifiers/chinese.misc.distsim.crf.ser.gz";
        if (ner == null) {
            ner = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
        }
    }

    public String doNer(String sent) {
        return ner.classifyWithInlineXML(sent);
    }
}
package Test;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.List;
import java.util.Properties;

/**
 * Created by Roy on 2016/11/14.
 */
public class NerforAText {
    public static CRFClassifier<CoreLabel> segmenter;

    public NerforAText() {
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        segmenter = new CRFClassifier<CoreLabel>(props);
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);
        segmenter.flags.setProperties(props);
    }

    public static String doSegment(String text) {
        // segmentString returns a List<String>; join the tokens with spaces
        // (the original cast to String[] via toArray() would throw at runtime)
        List<String> strs = segmenter.segmentString(text);
        StringBuilder result = new StringBuilder();
        for (String s : strs) {
            result.append(s).append(" ");
        }
        System.out.println(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String text = "习近平祝贺特朗普当选美国总统。习近平表示,中美建交37年来,两国关系不断向前发展,给两国人民带来了实实在在的利益,也促进了世界和地区和平、稳定、繁荣。";
        NerforAText nerforAText = new NerforAText();
        String seg = NerforAText.doSegment(text);
        Ner ner = new Ner();
        System.out.println(ner.doNer(seg));
    }
}
1.3 Python: install pycorenlp
On Linux, simply run pip install pycorenlp.
Website: https://github.com/smilli/py-corenlp
For usage from other programming languages, see http://stanfordnlp.github.io/CoreNLP/other-languages.html
Step 1: start the CoreNLP server from the command line, as in 1.1.
Step 2: run the Python code.
# coding: utf-8
# (Python 2, as in the original post)
import re
import os
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://127.0.0.1:9000')
line = "习近平 主席 指出 ,我们 要 深入 学习 两学一做系列 活动"
print line
# besides 'text', the output format can also be 'json' etc.; see the official site
output = nlp.annotate(line, properties={'annotators': 'tokenize,ssplit,pos,lemma,ner',
                                        'outputFormat': 'text'})
print output.decode('utf-8')
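When outputFormat is set to 'json' instead of 'text', the server returns a nested structure of sentences and tokens that is easier to process programmatically. A small sketch of pulling (word, POS, NER) triples out of such a response; the sample dict below is hand-written to mirror the JSON layout, not actual server output:

```python
def extract_token_tags(doc):
    # Flatten (word, pos, ner) triples from a CoreNLP JSON document
    return [(tok["word"], tok["pos"], tok["ner"])
            for sent in doc["sentences"]
            for tok in sent["tokens"]]

# Hand-written sample mirroring the shape of a 'json' response
sample = {"sentences": [{"tokens": [
    {"word": u"习近平", "pos": "NR", "ner": "PERSON"},
    {"word": u"主席", "pos": "NN", "ner": "O"},
]}]}

print(extract_token_tags(sample))
```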
2. The NLPIR segmentation system
Download the segmentation package from http://ictclas.nlpir.org/downloads
If initialization fails, each month you need to go to the https://github.com/NLPIR-team/NLPIR repository, under
NLPIR/License/license for a month/,
and download the corresponding XX.user file to replace the local XX.user.
import java.io.UnsupportedEncodingException;
import com.sun.jna.Library;
import com.sun.jna.Native;

public class NLP {
    // Define the CLibrary interface, extending com.sun.jna.Library
    public interface CLibrary extends Library {
        // Define and initialize the interface's static instance, loading the native NLPIR library
        CLibrary Instance = (CLibrary) Native.loadLibrary(
                System.getProperty("user.dir") + "\\source\\NLPIR", CLibrary.class);

        // native function declarations
        public int NLPIR_Init(byte[] sDataPath, int encoding, byte[] sLicenceCode);
        public String NLPIR_ParagraphProcess(String sSrc, int bPOSTagged);
        public String NLPIR_GetKeyWords(String sLine, int nMaxKeyLimit, boolean bWeightOut);
        public double NLPIR_FileProcess(String sSourceFilename, String sResultFilename, int bPOStagged);
        public String NLPIR_GetFileKeyWords(String sLine, int nMaxKeyLimit, boolean bWeightOut);
        public String NLPIR_WordFreqStat(String sText);
        public String NLPIR_FileWordFreqStat(String sSourceFilename);
        public void NLPIR_Exit();
    }

    public static String transString(String aidString, String ori_encoding, String new_encoding) {
        try {
            return new String(aidString.getBytes(ori_encoding), new_encoding);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String argu = "";
        String system_charset = "utf-8";
        int charset_type = 1;
        int init_flag = CLibrary.Instance.NLPIR_Init(argu.getBytes(system_charset),
                charset_type, "0".getBytes(system_charset));
        if (0 == init_flag) {
            System.err.println("Initialization failed!");
            return;
        }
        String sInput = "据悉,质检总局已将最新有关情况再次通报美方,要求美方加强对输华玉米的产地来源、运输及仓储等环节的管控措施,有效避免输华玉米被未经我国农业部安全评估并批准的转基因品系污染。";
        String nativeBytes = null;
        try {
            nativeBytes = CLibrary.Instance.NLPIR_ParagraphProcess(sInput, 3);
            System.out.println("Segmentation result: " + nativeBytes);
            String nativeByte = CLibrary.Instance.NLPIR_GetKeyWords(sInput, 10, false);
            System.out.print("\nKeyword extraction result: " + nativeByte);
            String wordFreq = CLibrary.Instance.NLPIR_WordFreqStat(sInput);
            System.out.print("\nWord frequency statistics: " + wordFreq);
            CLibrary.Instance.NLPIR_Exit();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
3. LTP and pyltp (Linux)
Installation guide: http://ltp.readthedocs.io/zh_CN/latest/install.html
Download the LTP project files: https://github.com/HIT-SCIR/ltp/releases
Download the LTP model files: http://pan.baidu.com/share/link?shareid=1988562907&uk=2738088569#list/path=%2F
Unpack the project files and build from the project root (make sure CMake is installed):
./configure
make
Install pyltp: pip install pyltp
# -*- coding: utf-8 -*-
# (Python 2, as in the original post)
import sys
import re
import glob  # module for listing all file names under a directory
reload(sys)
sys.setdefaultencoding('utf-8')  # encoding
from pyltp import Segmentor

def read_txt(filename):
    # open the input file and the output file
    f = open('/mnt/e/code/run/' + filename, 'r')
    w = open('/mnt/e/code/Segmentation/test/seg_' + filename, 'w')
    # initialize the segmentor, loading an external lexicon alongside the model
    segmentor = Segmentor()
    segmentor.load_with_lexicon('/mnt/e/Pris/duozhuan/code/data/cws.model',
                                '/mnt/e/Pris/duozhuan/code/data/pro-noun.txt')
    count = 0
    content = f.readline()
    while content:
        # each input line is "<id>\t<text>"
        index = content.split("\t")[0]
        line = content.split("\t")[1]
        words = segmentor.segment(line)
        str_line = ''
        # note: the original reused the outer counter i as the loop variable here,
        # clobbering the progress counter; use a separate variable instead
        for word in words:
            str_line += word + ' '
        w.write(index + "\t" + str_line + '\n')
        count += 1
        if count % 100 == 0:
            print count
        content = f.readline()
    segmentor.release()
    f.close()
    w.close()

if __name__ == '__main__':
    read_txt("q_with_id.txt")
Closing remarks:
In my experience each of these three segmentation tools has its pros and cons. For segmentation specifically, loading an external lexicon matters a great deal. Even so, some words are still split apart even when they exist in the external lexicon, for example digit+Chinese or letter+Chinese combinations; these require post-processing.
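As a concrete example of that post-processing, here is a minimal sketch that re-joins a digit or letter token with the Chinese token that follows it. The function name and the merging rule are my own illustration, not part of any of the three tools:

```python
import re

def merge_mixed_tokens(tokens):
    # Re-join tokens like "3" + "号线" -> "3号线" or "A" + "股" -> "A股",
    # which segmenters often split even when the merged word is in the lexicon.
    merged = []
    for tok in tokens:
        if merged and re.match(r'^[0-9A-Za-z]+$', merged[-1]) \
                and re.match(u'^[\u4e00-\u9fff]', tok):
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged

print(merge_mixed_tokens([u"乘坐", u"3", u"号线", u"去", u"A", u"股", u"市场"]))
```

Real pipelines will need a more careful rule set (e.g. units, dates, product names), but the same merge-adjacent-tokens pattern applies.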