Using Stanford NLP with Chinese
2016-01-11 00:00
The Stanford NLP tools provide two components for processing Chinese: a word segmenter and a parser. For details, see:
http://nlp.stanford.edu/software/parser-faq.shtml#o
1. Chinese segmenter
Download: http://nlp.stanford.edu/software/
Stanford Chinese Word Segmenter: a Java implementation of a CRF-based Chinese word segmenter.
The package is fairly large and needs a lot of memory at runtime, so when running it from Eclipse you need to raise the JVM heap limit:
Run -> Run Configurations -> Arguments -> VM arguments -> -Xmx800m (800 MB maximum heap)
Demo code (modified, untested):
import java.util.Properties;
import edu.stanford.nlp.ie.crf.CRFClassifier;

Properties props = new Properties();
props.setProperty("sighanCorporaDict", "data");
// props.setProperty("NormalizationTable", "data/norm.simp.utf8");
// props.setProperty("normTableEncoding", "UTF-8");
// needed because CTBSegDocumentIteratorFactory accesses it
props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
// props.setProperty("testFile", args[0]);
props.setProperty("inputEncoding", "UTF-8");
props.setProperty("sighanPostProcessing", "true");

CRFClassifier classifier = new CRFClassifier(props);
classifier.loadClassifierNoExceptions("data/ctb.gz", props);
// flags must be re-set after the model data is loaded
classifier.flags.setProperties(props);
// classifier.writeAnswers(classifier.test(args[0]));
// classifier.testAndWriteAnswers(args[0]);
String result = classifier.testString("我是中国人!");
System.out.println(result);
2. Stanford Parser
References: http://nlp.stanford.edu/software/parser-faq.shtml#o
http://blog.csdn.net/leeharry/archive/2008/03/06/2153583.aspx
Depending on which trained model is loaded, the parser handles either English or Chinese. The input is an already-segmented sentence; the output is the part-of-speech tags and the parse tree of the sentence (with dependency relations).
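Since the parser expects pre-segmented input, the space-delimited string produced by the segmenter has to be turned into a word list before parsing. A minimal sketch (the sentence is taken from the demo below; the split itself is plain Java, no Stanford classes involved):

```java
import java.util.Arrays;
import java.util.List;

public class SegToWords {
    public static void main(String[] args) {
        // Output of the segmenter: tokens joined by single spaces.
        String segmented = "他 和 我 在 学校 里 常 打 桌球 。";
        // The parser's apply(...) takes a word list, so split first.
        List<String> words = Arrays.asList(segmented.split(" "));
        System.out.println(words.size()); // 10
        System.out.println(words.get(0)); // 他
    }
}
```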
English demo (included in the downloaded archive):

LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});
String[] sent = { "This", "is", "an", "easy", "sentence", "." };
Tree parse = (Tree) lp.apply(Arrays.asList(sent));
parse.pennPrint();
System.out.println();
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
Collection tdl = gs.typedDependenciesCollapsed();
System.out.println(tdl);
System.out.println();
TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
tp.printTree(parse);

The Chinese version differs slightly:
// LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
LexicalizedParser lp = new LexicalizedParser("xinhuaFactored.ser.gz");
// lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});
// String[] sent = { "This", "is", "an", "easy", "sentence", "." };
String[] sent = { "他", "和", "我", "在", "学校", "里", "常", "打", "桌球", "。" };
String sentence = "他和我在学校里常打桌球。";
Tree parse = (Tree) lp.apply(Arrays.asList(sent));
// Tree parse = (Tree) lp.apply(sentence);
parse.pennPrint();
System.out.println();
/*
TreebankLanguagePack tlp = new PennTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
Collection tdl = gs.typedDependenciesCollapsed();
System.out.println(tdl);
System.out.println();
*/
// English only:
// TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
// Chinese:
TreePrint tp = new TreePrint("wordsAndTags,penn,typedDependenciesCollapsed", new ChineseTreebankLanguagePack());
tp.printTree(parse);
Sometimes, however, we do not just want the dependency relations printed out; we want to work with the parse tree (graph) itself. In that case, use a program along these lines:
String[] sent = { "他", "和", "我", "在", "学校", "里", "常", "打", "桌球", "。" };
ParserSentence ps = new ParserSentence(); // ParserSentence: the author's own wrapper class, not part of Stanford NLP
Tree parse = ps.parserSentence(sent);
parse.pennPrint();
TreebankLanguagePack tlp = new ChineseTreebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
Collection tdl = gs.typedDependenciesCollapsed();
System.out.println(tdl);
System.out.println();
for (Object o : tdl) {
    // TypedDependency(GrammaticalRelation reln, TreeGraphNode gov, TreeGraphNode dep)
    TypedDependency td = (TypedDependency) o;
    System.out.println(td);
}
// GrammaticalStructure's method getGrammaticalRelation(TreeGraphNode gov, TreeGraphNode dep) returns the grammatical relation between two words.
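Each TypedDependency prints in the form reln(gov-index, dep-index), e.g. nsubj(打-8, 他-1). If you only have this printed form (rather than the TypedDependency objects above), the triples can be recovered with a small regex. The helper below is a hypothetical sketch in plain Java, not part of Stanford NLP, and assumes the default printed format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DepTriple {
    // Matches e.g. "nsubj(打-8, 他-1)" -> { "nsubj", "打", "他" }.
    private static final Pattern DEP =
            Pattern.compile("(\\w+)\\((.+)-(\\d+), (.+)-(\\d+)\\)");

    static String[] parse(String dep) {
        Matcher m = DEP.matcher(dep);
        if (!m.matches()) {
            return null; // not in the expected printed form
        }
        // group 1: relation name, group 2: governor word, group 4: dependent word
        return new String[]{ m.group(1), m.group(2), m.group(4) };
    }

    public static void main(String[] args) {
        String[] t = parse("nsubj(打-8, 他-1)");
        System.out.println(t[0] + " " + t[1] + " " + t[2]); // nsubj 打 他
    }
}
```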