【NLP】Play with Stanford NLP
2017-08-08 07:20
PlayNLP on GitHub
A Powerful Parser with xinhuaFactoredSegmenting.ser.gz
#!/bin/bash
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -encoding utf-8 \
  -outputFormat "penn,typedDependenciesCollapsed" \
  edu/stanford/nlp/models/lexparser/xinhuaFactoredSegmenting.ser.gz \
  $1
The above command line can be used as a general-purpose utility for parsing Chinese sentences.
Note: the xinhuaFactoredSegmenting.ser.gz model performs word segmentation itself, so raw (unsegmented) Chinese text can be fed to it directly.
example input and output:
目前,《新华日报》国内外总发行量40万份
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Parsing file: /tmp/2.txt
Parsing [sent. 1 len. 11]: 目前 , 《 新华 日报 》 国内外 总 发行量 40万 份
(ROOT (IP (NP (NT 目前)) (PU ,) (NP (PU 《) (NR 新华) (NN 日报) (PU 》)) (NP (NP (NN 国内外)) (ADJP (JJ 总)) (NP (NN 发行量))) (VP (QP (CD 40万) (CLP (M 份))))))
nmod:tmod(40万-10, 目前-1)
punct(40万-10, ,-2)
punct(日报-5, 《-3)
compound:nn(日报-5, 新华-4)
nmod:topic(40万-10, 日报-5)
punct(日报-5, 》-6)
compound:nn(发行量-9, 国内外-7)
amod(发行量-9, 总-8)
nsubj(40万-10, 发行量-9)
root(ROOT-0, 40万-10)
mark:clf(40万-10, 份-11)
Parsed file: /tmp/2.txt [1 sentences].
Parsed 11 words in 1 sentences (9.58 wds/sec; 0.87 sents/sec).
Segment with Custom Dictionary
java -mx1g -cp seg.jar edu.stanford.nlp.ie.crf.CRFClassifier \
  -sighanCorporaDict data \
  -loadClassifier data/ctb.gz \
  -testFile preprocess-$1.txt \
  -inputEncoding UTF-8 \
  -sighanPostProcessing true \
  -serDictionary data/dict-chris6.ser.gz,data/cedict.txt,data/ntusd.txt \
  -keepAllWhitespaces false >$1_seged.txt
Check segment.sh in stanford-segmenter-3.8.0.zip.
command-line:
java -mx2g -cp $BASEDIR/*: edu.stanford.nlp.ie.crf.CRFClassifier \
  -sighanCorporaDict ./data \
  -textFile shit.txt \
  -inputEncoding UTF-8 \
  -sighanPostProcessing true \
  -keepAllWhitespaces false \
  -loadClassifier ./data/ctb.gz \
  -serDictionary ./data/dict-chris6.ser.gz,./names.txt
#!/bin/sh
usage() {
  echo "Usage: $0 [ctb|pku] filename encoding kBest" >&2
  echo "  ctb : use Chinese Treebank segmentation" >&2
  echo "  pku : Beijing University segmentation" >&2
  echo "  kBest: print kBest best segmentations; 0 means kBest mode is off." >&2
  echo >&2
  echo "Example: $0 ctb test.simp.utf8 UTF-8 0" >&2
  echo "Example: $0 pku test.simp.utf8 UTF-8 0" >&2
  exit
}

if [ $# -lt 4 -o $# -gt 5 ]; then
  usage
fi

ARGS="-keepAllWhitespaces false"
if [ $# -eq 5 -a "$1" = "-k" ]; then
  ARGS="-keepAllWhitespaces true"
  lang=$2
  file=$3
  enc=$4
  kBest=$5
else
  if [ $# -eq 4 ]; then
    lang=$1
    file=$2
    enc=$3
    kBest=$4
  else
    usage
  fi
fi

if [ $lang = "ctb" ]; then
  echo "(CTB):" >&2
elif [ $lang = "pku" ]; then
  echo "(PKU):" >&2
else
  echo "First argument should be either ctb or pku. Abort"
  exit
fi

echo -n "File: " >&2
echo $file >&2
echo -n "Encoding: " >&2
echo $enc >&2
echo "-------------------------------" >&2

BASEDIR=`dirname $0`
DATADIR=$BASEDIR/data
#LEXDIR=$DATADIR/lexicons
JAVACMD="java -mx2g -cp $BASEDIR/*: edu.stanford.nlp.ie.crf.CRFClassifier -sighanCorporaDict $DATADIR -textFile $file -inputEncoding $enc -sighanPostProcessing true $ARGS"
DICTS=$DATADIR/dict-chris6.ser.gz,./names.txt

KBESTCMD=""
if [ $kBest != "0" ]; then
  KBESTCMD="-kBest $kBest"
fi

if [ $lang = "ctb" ]; then
  $JAVACMD -loadClassifier $DATADIR/ctb.gz -serDictionary $DICTS $KBESTCMD
elif [ $lang = "pku" ]; then
  $JAVACMD -loadClassifier $DATADIR/pku.gz -serDictionary $DICTS $KBESTCMD
fi
see:
DICTS=$DATADIR/dict-chris6.ser.gz,./names.txt
demo:
$ cat names.txt
哈马尼克斯
啊部
阿三的
猫
跳
上
树枝
黑色的
$ cat shit.txt
哈马尼克斯啊部阿三的。
$ sh segment.sh ctb shit.txt UTF-8 0
(CTB):
File: shit.txt
Encoding: UTF-8
-------------------------------
Invoked on Tue Aug 08 07:16:58 CST 2017 with arguments: -sighanCorporaDict ./data -textFile shit.txt -inputEncoding UTF-8 -sighanPostProcessing true -keepAllWhitespaces false -loadClassifier ./data/ctb.gz -serDictionary ./data/dict-chris6.ser.gz,./names.txt
serDictionary=./data/dict-chris6.ser.gz,./names.txt
loadClassifier=./data/ctb.gz
sighanCorporaDict=./data
inputEncoding=UTF-8
textFile=shit.txt
sighanPostProcessing=true
keepAllWhitespaces=false
Loading Chinese dictionaries from 2 files:
  ./data/dict-chris6.ser.gz
  ./names.txt
./names.txt: 8 entries
Done. Unique words in ChineseDictionary is: 423204.
Loading classifier from ./data/ctb.gz ... done [20.0 sec].
Loading character dictionary file from ./data/dict/character_list [done].
Loading affix dictionary from ./data/dict/in.ctb [done].
哈马尼克斯 啊部 阿三的 。
CRFClassifier tagged 11 words in 1 documents at 81.48 words per second.
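Since names.txt is just a plain word list (one entry per line), a small helper can keep it growing as you encounter new words. This is a minimal sketch under that assumption; the add_to_dict function name is my own, not part of Stanford NLP:

```shell
# Hypothetical helper: append new words to the custom dictionary file
# (one word per line) and deduplicate it, so the segmenter picks the
# words up on its next run via -serDictionary.
add_to_dict() {
  dict=$1; shift
  for w in "$@"; do
    echo "$w"
  done >> "$dict"
  sort -u -o "$dict" "$dict"   # remove duplicate entries in place
}
```

Usage: add_to_dict names.txt 新词一 新词二, then rerun sh segment.sh ctb input.txt UTF-8 0.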
Train a Specific Parser with a Corpus in Penn Treebank Format
corpus$ cat train.txt
(ROOT (IP (NP (NN 哈马尼克斯)) (VP (ADVP (VBD 啊部)) (VP (NN 阿三的))) (. 。)))
nsubj(阿三的-3, 哈马尼克斯-1)
advmod(阿三的-3, 啊部-2)
root(ROOT-0, 阿三的-3)
dep(阿三的-3, 。-4)
(ROOT (IP (NP (ADJP (JJ 黑色的)) (NP (NN 猫))) (VP (ADVP (VBD 跳)) (VP (IN 上) (NP (NN 树枝)))) (. 。)))
amod(猫-2, 黑色的-1)
nsubj(树枝-5, 猫-2)
advmod(树枝-5, 跳-3)
dep(树枝-5, 上-4)
root(ROOT-0, 树枝-5)
dep(树枝-5, 。-6)
train.sh
$ cat train.sh
#!/bin/bash
java -cp "*" -mx800m edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -evals "factDA,tsv" \
  -chineseFactored -PCFG -hMarkov 1 -nomarkNPconj -compactGrammar 0 \
  -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams \
  -PCFG \
  -chinesePCFG \
  -saveToSerializedFile ./trained.ser.gz \
  -maxLength 40 \
  -encoding utf-8 \
  -train $1 \
  -test $1
command-line:
sh train.sh train.txt
output:
$ sh train.sh train.txt
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
done [read 12 trees]. Time elapsed: 0 ms
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType true
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
forceCNF false
doPCFG true
doDep false
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags false
nPrune false
Using ChineseTreebankParserParams chineseSplitDouHao=false chineseSplitPunct=true chineseSplitPunctLR=true markVVsisterIP=true markVPadjunct=true chineseSplitVP=0 mergeNNVV=false unaryIP=false unaryCP=false paRootDtr=false markPsisterIP=false markIPsisterVVorP=true markADgrandchildOfIP=false gpaAD=false markIPsisterBA=true markNPmodNP=true markNPconj=false markMultiNtag=false markIPsisDEC=false markIPconj=false markIPadjsubj=false markPostverbalP=false markPostverbalPP=false baseNP=false headFinder=levy discardFrags=false dominatesV=false
done. Time elapsed: 35 ms
done. Time elapsed: 22 ms
done. Time elapsed: 32 ms
done Time elapsed: 0 ms
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType true
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
forceCNF false
doPCFG true
doDep false
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags false
nPrune false
Using ChineseTreebankParserParams chineseSplitDouHao=false chineseSplitPunct=true chineseSplitPunctLR=true markVVsisterIP=true markVPadjunct=true chineseSplitVP=0 mergeNNVV=false unaryIP=false unaryCP=false paRootDtr=false markPsisterIP=false markIPsisterVVorP=true markADgrandchildOfIP=false gpaAD=false markIPsisterBA=true markNPmodNP=true markNPconj=false markMultiNtag=false markIPsisDEC=false markIPconj=false markIPadjsubj=false markPostverbalP=false markPostverbalPP=false baseNP=false headFinder=levy discardFrags=false dominatesV=false
Parsing [len. 4]: 哈马尼克斯 啊部 阿三的 。
(ROOT (IP (NP (NN 哈马尼克斯)) (VP (ADVP (VBD 啊部)) (VP (NN 阿三的))) (. 。)))
P: 100.0 R: 100.0 pcfg LP/LR F1: 100.0 N: 1.0
P: 100.0 R: 100.0 factor LP/LR F1: 100.0 N: 1.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 1.0
Parsing [len. 1]: 哈马尼克斯-1
(ROOT (FRAG 哈马尼克斯-1))
P: 0.0 R: 0.0 pcfg LP/LR F1: 0.0 N: 2.0
P: 0.0 R: 0.0 factor LP/LR F1: 0.0 N: 2.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 2.0
Parsing [len. 1]: 啊部-2
(ROOT (FRAG 啊部-2))
P: 0.0 R: 0.0 pcfg LP/LR F1: 0.0 N: 3.0
P: 0.0 R: 0.0 factor LP/LR F1: 0.0 N: 3.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 3.0
Parsing [len. 1]: 阿三的-3
(ROOT (FRAG 阿三的-3))
P: 0.0 R: 0.0 pcfg LP/LR F1: 0.0 N: 4.0
P: 0.0 R: 0.0 factor LP/LR F1: 0.0 N: 4.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 4.0
Parsing [len. 1]: 。-4
(ROOT (FRAG 。-4))
P: 0.0 R: 0.0 pcfg LP/LR F1: 0.0 N: 5.0
P: 0.0 R: 0.0 factor LP/LR F1: 0.0 N: 5.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 5.0
Parsing [len. 6]: 黑色的 猫 跳 上 树枝 。
(ROOT (IP (NP (ADJP (JJ 黑色的)) (NP (NN 猫))) (VP (ADVP (VBD 跳)) (VP (IN 上) (NP (NN 树枝)))) (. 。)))
P: 100.0 R: 100.0 pcfg LP/LR F1: 100.0 N: 6.0
P: 100.0 R: 100.0 factor LP/LR F1: 100.0 N: 6.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 6.0
Parsing [len. 1]: 黑色的-1
(ROOT (FRAG 黑色的-1))
P: 0.0 R: 0.0 pcfg LP/LR F1: 0.0 N: 7.0
P: 0.0 R: 0.0 factor LP/LR F1: 0.0 N: 7.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 7.0
Parsing [len. 1]: 猫-2
(ROOT (FRAG 猫-2))
P: 0.0 R: 0.0 pcfg LP/LR F1: 0.0 N: 8.0
P: 0.0 R: 0.0 factor LP/LR F1: 0.0 N: 8.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 8.0
Parsing [len. 1]: 跳-3
(ROOT (FRAG 跳-3))
P: 0.0 R: 0.0 pcfg LP/LR F1: 0.0 N: 9.0
P: 0.0 R: 0.0 factor LP/LR F1: 0.0 N: 9.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 9.0
Parsing [len. 1]: 上-4
(ROOT (FRAG 上-4))
P: 0.0 R: 0.0 pcfg LP/LR F1: 0.0 N: 10.0
P: 0.0 R: 0.0 factor LP/LR F1: 0.0 N: 10.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 10.0
Parsing [len. 1]: 树枝-5
(ROOT (FRAG 树枝-5))
P: 0.0 R: 0.0 pcfg LP/LR F1: 0.0 N: 11.0
P: 0.0 R: 0.0 factor LP/LR F1: 0.0 N: 11.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 11.0
Parsing [len. 1]: 。-6
(ROOT (FRAG 。-6))
P: 0.0 R: 0.0 pcfg LP/LR F1: 0.0 N: 12.0
P: 0.0 R: 0.0 factor LP/LR F1: 0.0 N: 12.0
P: 100.0 R: 100.0 factor Tag F1: 100.0 N: 12.0
pcfg LP/LR summary evalb: LP: 100.0 LR: 61.9 F1: 76.47 Exact: 16.66 N: 12
dep DA summary evalb: LP: 0.0 LR: 0.0 F1: 0.0 Exact: 0.0 N: 0
factor LP/LR summary evalb: LP: 100.0 LR: 61.9 F1: 76.47 Exact: 16.66 N: 12
factor DA summary evalb: LP: 0.0 LR: 0.0 F1: 0.0 Exact: 0.0 N: 0
factor Tag summary evalb: LP: 100.0 LR: 100.0 F1: 100.0 Exact: 100.0 N: 12
factF1 factDA factEx pcfgF1 depDA factTA num
76.47 16.67 76.47 100.00 12
Fall Back to Parsing a Manually Tagged Sentence
If you try to parse a never-seen sentence like this one: 哈马尼克斯啊部阿三的。 then parsing may fail.
But if you POS-tag this sentence:
$ cat shit.txt
哈马尼克斯/NN 啊部/VBD 阿三的/NN 。/.
and assume that this tagging reflects the intended meaning of the sentence,
then you can use the following script, parse-pre-tagged.sh, to parse the tagged sentence:
#!/bin/bash
java -mx500m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -encoding utf-8 \
  -sentences newline \
  -tokenized \
  -tagSeparator / \
  -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer \
  -tokenizerMethod newCoreLabelTokenizerFactory \
  -outputFormat "penn,typedDependenciesCollapsed" \
  edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz \
  $1
Note: here we use the PCFG model chinesePCFG.ser.gz; the -tokenized and -tagSeparator / options make the parser respect the given segmentation and POS tags.
demo:
sh parse-pre-tagged.sh shit.txt
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Parsing file: shit.txt
Parsing [sent. 1 len. 4]: 哈马尼克斯 啊部 阿三的 。
(ROOT (IP (NP (NN 哈马尼克斯)) (VP (ADVP (VBD 啊部)) (VP (NN 阿三的))) (. 。)))
nsubj(阿三的-3, 哈马尼克斯-1)
advmod(阿三的-3, 啊部-2)
root(ROOT-0, 阿三的-3)
dep(阿三的-3, 。-4)
Parsed file: shit.txt [1 sentences].
Parsed 4 words in 1 sentences (10.78 wds/sec; 2.70 sents/sec).
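The word/TAG input format can be produced mechanically from two parallel files, one token per line and one tag per line. A sketch, assuming hypothetical file names tokens.txt and tags.txt:

```shell
# Join parallel token and tag files into the word/TAG format expected
# by parse-pre-tagged.sh: word and tag joined by '/', tokens of one
# sentence separated by spaces on a single line.
printf '哈马尼克斯\n啊部\n阿三的\n。\n' > tokens.txt
printf 'NN\nVBD\nNN\n.\n' > tags.txt
paste -d/ tokens.txt tags.txt | tr '\n' ' ' > tagged.txt
cat tagged.txt   # → 哈马尼克斯/NN 啊部/VBD 阿三的/NN 。/.
```

Note this simple pipeline emits everything on one line; with multi-sentence input you would split per sentence first, since the parser is invoked with -sentences newline.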
Save the Words to the Custom Dictionary
While parsing a manually tagged sentence, the new words (tokens) should also be pushed into the custom dictionary for future use.

Use the Manually Parsed PTB to Train the Parser
With parse-pre-tagged.sh, we now have a small corpus in PTB format:

$ cat corpus.txt
(ROOT (IP (NP (NN 哈马尼克斯)) (VP (ADVP (VBD 啊部)) (VP (NN 阿三的))) (. 。)))
nsubj(阿三的-3, 哈马尼克斯-1)
advmod(阿三的-3, 啊部-2)
root(ROOT-0, 阿三的-3)
dep(阿三的-3, 。-4)
(ROOT (IP (NP (ADJP (JJ 黑色的)) (NP (NN 猫))) (VP (ADVP (VBD 跳)) (VP (IN 上) (NP (NN 树枝)))) (. 。)))
amod(猫-2, 黑色的-1)
nsubj(树枝-5, 猫-2)
advmod(树枝-5, 跳-3)
dep(树枝-5, 上-4)
root(ROOT-0, 树枝-5)
dep(树枝-5, 。-6)
view train.sh
#!/bin/bash
java -cp "*" -mx800m edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -evals "factDA,tsv" \
  -chineseFactored -PCFG -hMarkov 1 -nomarkNPconj -compactGrammar 0 \
  -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams \
  -PCFG \
  -chinesePCFG \
  -saveToSerializedFile ./trained.ser.gz \
  -maxLength 40 \
  -encoding utf-8 \
  -train $1 \
  -test $1
Demo of training:
sh train.sh corpus.txt
After training, we get trained.ser.gz.
Use this model to parse specific sentences, such as:
$ cat test.txt
猫 啊部 阿三的 。
猫 啊部 树枝 。
树枝 跳 上 哈马尼克斯 。
$ cat parse-with-model.sh
#!/bin/bash
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -encoding utf-8 \
  -outputFormat "penn,typedDependenciesCollapsed" \
  ./trained.ser.gz \
  $1
demo:
sh parse-with-model.sh test.txt
output:
Parsing file: ./test.txt
Parsing [sent. 1 len. 4]: 猫 啊部 阿三的 。
(ROOT (IP (NP (NN 猫)) (VP (ADVP (VBD 啊部)) (VP (NN 阿三的))) (. 。)))
nsubj(阿三的-3, 猫-1)
advmod(阿三的-3, 啊部-2)
root(ROOT-0, 阿三的-3)
dep(阿三的-3, 。-4)
Parsing [sent. 2 len. 4]: 猫 啊部 树枝 。
(ROOT (IP (NP (NN 猫)) (VP (ADVP (VBD 啊部)) (VP (NN 树枝))) (. 。)))
nsubj(树枝-3, 猫-1)
advmod(树枝-3, 啊部-2)
root(ROOT-0, 树枝-3)
dep(树枝-3, 。-4)
Parsing [sent. 3 len. 5]: 树枝 跳 上 哈马尼克斯 。
(ROOT (IP (NP (NN 树枝)) (VP (ADVP (VBD 跳)) (VP (IN 上) (NP (NN 哈马尼克斯)))) (. 。)))
nsubj(哈马尼克斯-4, 树枝-1)
advmod(哈马尼克斯-4, 跳-2)
dep(哈马尼克斯-4, 上-3)
root(ROOT-0, 哈马尼克斯-4)
dep(哈马尼克斯-4, 。-5)
Parsed file: ./test.txt [3 sentences].
Parsed 13 words in 3 sentences (61.03 wds/sec; 14.08 sents/sec).
Parse a Non-segmented Sentence with a Specific Model
If your input sentence is not segmented, first segment it with the custom dictionary as described above. The segmenter's output can then be used as the input to parse-with-model.sh.
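The two steps can be chained in a small wrapper. This is a sketch: the parse_raw name is my own, and it assumes segment.sh and parse-with-model.sh from the sections above sit in the current directory:

```shell
# Hypothetical wrapper: segment a raw (unsegmented) file using the CRF
# segmenter with the custom dictionary, then feed the segmented text to
# the parser loaded with trained.ser.gz.
parse_raw() {
  raw=$1
  sh segment.sh ctb "$raw" UTF-8 0 > segmented.txt  # segment with names.txt
  sh parse-with-model.sh segmented.txt              # parse segmented text
}
```

Usage: parse_raw raw.txt prints the trees and dependencies for each sentence in raw.txt.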