Notes (2009): An Error-Driven Word-Character Hybrid Model for Joint CWS and POS Tagging
2012-08-12 16:47
An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging
Authors: Canasai Kruengkrai, Kiyotaka Uchimoto, Jun'ichi Kazama, Yiou Wang, Kentaro Torisawa, and Hitoshi Isahara (Kobe University)
Source: Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 513–521, Suntec, Singapore, 2-7 August 2009.
Word-character hybrid tagging combined with the MIRA algorithm; a further improvement on the 2004-2007 line of work by Tetsuji Nakagawa.
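Introduction
Joint word segmentation and POS tagging received wide attention from 2004 to 2009 (Ng and Low, 2004; Nakagawa and Uchimoto, 2007; Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b).
The word-character hybrid tagging model was first proposed in 2004: a Markov model for words and an ME (maximum entropy) model for characters (Nakagawa, 2004; Nakagawa and Uchimoto, 2007).
MIRA algorithm (Crammer, 2004; Crammer et al., 2005; McDonald, 2006).
Corpus
Penn Chinese Treebank (Xia et al., 2000), abbreviated CTB below.
Main text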
Background
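1 The search space is a lattice built from the word-character hybrid model (Nakagawa and Uchimoto, 2007).
2 Word level: look the word up in the dictionary and, if it is found, label it with its part of speech (POS). Character level: label each character with a position-of-character (POC) tag and a POS tag (Asahara, 2003; Nakagawa, 2004).
3 Words found in the dictionary are handled at the word level; words not found are handled at the character level.
4 At test time, dynamic programming is used to search for the best candidate path.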
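The four steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `build_lattice` and `viterbi`, the `max_len` limit, and the caller-supplied `score` function are all assumptions, and the character-level nodes carry a single placeholder tag "CHAR" instead of the combined POC and POS tags the paper uses.

```python
from typing import Callable, Dict, List, Tuple

# A lattice node: (start offset, end offset, surface string, tag).
Node = Tuple[int, int, str, str]
BOS: Node = (0, 0, "<s>", "BOS")  # hypothetical begin-of-sentence node


def build_lattice(sentence: str, lexicon: Dict[str, List[str]],
                  max_len: int = 4) -> List[Node]:
    """Word-level nodes for dictionary matches plus one character-level
    fallback node per character (placeholder tag, an assumption)."""
    nodes: List[Node] = []
    for i in range(len(sentence)):
        for j in range(i + 1, min(len(sentence), i + max_len) + 1):
            for pos in lexicon.get(sentence[i:j], []):
                nodes.append((i, j, sentence[i:j], pos))
        nodes.append((i, i + 1, sentence[i], "CHAR"))
    return nodes


def viterbi(sentence: str, nodes: List[Node],
            score: Callable[[Node, Node], float]) -> List[Node]:
    """Dynamic programming over the lattice; score(prev, node) is supplied
    by the caller (in the paper it would come from the learned weights)."""
    n = len(sentence)
    ends_at: Dict[int, List[Node]] = {i: [] for i in range(n + 1)}
    ends_at[0].append(BOS)
    best = {BOS: (0.0, None)}  # node -> (best score so far, back-pointer)
    # Sorting by start offset guarantees every predecessor is scored first.
    for node in sorted(nodes, key=lambda nd: (nd[0], nd[1])):
        cands = [(best[p][0] + score(p, node), p) for p in ends_at[node[0]]]
        if not cands:
            continue
        best[node] = max(cands, key=lambda c: c[0])
        ends_at[node[1]].append(node)
    last = max(ends_at[n], key=lambda nd: best[nd][0])
    path: List[Node] = []
    while last != BOS:
        path.append(last)
        last = best[last][1]
    return path[::-1]
```

Because every character has a fallback node, every position in the sentence stays reachable even when the dictionary matches nothing, which is the role the character level plays for unknown words.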
Policies for correct path selection
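If a word is very rare in the training corpus, it is likely to behave like an OOV word (Baayen and Sproat, 1996); the effectiveness of this idea has been confirmed (Ratnaparkhi, 1996; Nagata, 1999; Nakagawa, 2004).
The paper adopts this as its baseline policy: count word frequencies in the training corpus and mark every word whose frequency falls below a threshold r as a hypothesized (artificial) OOV word. The threshold r is then tuned by hand to balance the number of IV words against the number of artificial OOV words for the best result. Question: does this mean that only words with frequency above r are used to build the dictionary that drives the word level?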
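A minimal sketch of the frequency split behind the baseline policy. The function name `split_by_frequency` is hypothetical, and treating the boundary as inclusive (count at most r) is an assumption: with r=1 a strict "below r" test would select nothing, so the inclusive reading is used here.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def split_by_frequency(corpus: Iterable[List[str]],
                       r: int) -> Tuple[Set[str], Set[str]]:
    """Split the training vocabulary into IV words and artificial OOV words.

    corpus: gold-segmented sentences, each a list of words.
    r: frequency threshold; words seen at most r times become artificial
       OOV (the inclusive boundary is an assumption, see above).
    """
    counts = Counter(word for sentence in corpus for word in sentence)
    artificial_oov = {w for w, c in counts.items() if c <= r}
    in_vocabulary = set(counts) - artificial_oov
    return in_vocabulary, artificial_oov
```

Raising r moves more words from the IV set to the artificial OOV set, which is the balance the note describes tuning by hand.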
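10-fold cross validation: one part for validation and nine parts for training, with r=1; unidentified unknown words are collected from each validation part.
Error-driven policy: learn unknown words from three sources: 1) artificial OOV words obtained from the training corpus, 2) unidentified unknown words obtained from the validation data, and 3) identified words, i.e. dictionary words.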
Training method
The method of McDonald (2006): k-best MIRA with a 0/1 loss function.
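To make the update rule concrete, here is a deliberately simplified sketch. It shows a 1-best MIRA step with a 0/1 loss, not the k-best variant the paper uses (k-best MIRA solves a small quadratic program over k constraints at once). The function names and the sparse-dictionary feature representation are assumptions for illustration.

```python
from typing import Dict, Hashable


def zero_one_loss(gold: Hashable, predicted: Hashable) -> float:
    """0/1 loss: 0 if the predicted analysis equals the gold one, else 1."""
    return 0.0 if gold == predicted else 1.0


def mira_update(weights: Dict[str, float],
                gold_feats: Dict[str, float],
                pred_feats: Dict[str, float],
                loss: float) -> Dict[str, float]:
    """One 1-best MIRA step (a simplification of k-best MIRA).

    Makes the smallest change to the weights such that the gold analysis
    outscores the predicted one by a margin of at least `loss`.
    """
    # Feature difference vector: f(gold) - f(predicted).
    diff = dict(gold_feats)
    for name, value in pred_feats.items():
        diff[name] = diff.get(name, 0.0) - value
    margin = sum(weights.get(name, 0.0) * v for name, v in diff.items())
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return weights  # identical feature vectors, nothing to update
    tau = max(0.0, (loss - margin) / norm_sq)  # step size
    for name, v in diff.items():
        weights[name] = weights.get(name, 0.0) + tau * v
    return weights
```

After one step the margin constraint for that example is satisfied, so repeating the same update leaves the weights unchanged.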
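Features cover both the character level and the word level, with unigram (27) and bigram (18) features. w denotes a word and p a POS tag; the categories of T are listed in Table 4 (TB takes the first character of the word).
With N=10 training iterations and k-best=5, the infrequent-word threshold r=3 works best. Results: the best seg score is 0.9787 and the best seg&tag score is 0.9364.
Compared with Ng and Low (2004) on CTB 3.0, Zhang and Clark (2008) on CTB 4.0, and Jiang et al. (2008) on CTB 5.0, this paper reports the best results.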
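I still do not fully understand the error-driven method in this part. Why must artificial OOV words be set up at all? What role do they play? What knowledge can actually be learned from them, and how does that help find real OOV words? I will reread it in a few days.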
We now describe our new approach to leverage additional examples of unknown words. Intuition suggests that even though the system can handle some unknown words (note: those generated artificially with the threshold r), many unidentified unknown words remain that cannot be recovered by the system; we wish to learn the characteristics of such unidentified unknown words. We propose the following simple scheme:
• Divide the training corpus into ten equal sets and perform 10-fold cross validation to find the errors.
• For each trial, train the word-character hybrid model with the baseline policy (r = 1) using nine sets and estimate errors using the remaining validation set.
• Collect unidentified unknown words from each validation set.
Several types of errors are produced by the baseline model, but we only focus on those caused by unidentified unknown words, which can be easily collected in the evaluation process. As described later in Section 5.2, we measure the recall on out-of-vocabulary (OOV) words. Here, we define unidentified unknown words as OOV words in each validation set that cannot be recovered by the system. After ten cross validation runs, we get a list of the unidentified unknown words derived from the whole training corpus. Note that the unidentified unknown words in the cross validation are not necessary to be infrequent words, but some overlap may exist.
Finally, we obtain the artificial unknown words that combine the unidentified unknown words in cross validation and infrequent words for learning unknown words. We refer to this approach as the error-driven policy.
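The scheme quoted above can be sketched as a cross-validation loop. This is only an illustration under several assumptions: `train` and `segment` are hypothetical callbacks standing in for the hybrid model, folds are formed by striding over the corpus, and a word counts as "recovered" when its character span appears in the predicted segmentation (POS tags are ignored here).

```python
from typing import Callable, List, Sequence, Set, Tuple

Sentence = List[str]  # a gold-segmented sentence as a list of words


def spans(words: Sequence[str]) -> List[Tuple[int, int]]:
    """Character offsets (start, end) of each word in the sentence."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out


def collect_unidentified_unknown(
        corpus: Sequence[Sentence],
        train: Callable[[List[Sentence]], object],
        segment: Callable[[object, str], List[str]],
        folds: int = 10) -> Set[str]:
    """Run k-fold cross validation and collect the OOV words of each
    validation fold that the fold's model fails to recover."""
    unidentified: Set[str] = set()
    for k in range(folds):
        held_out = list(corpus[k::folds])
        training = [s for i, s in enumerate(corpus) if i % folds != k]
        model = train(training)
        vocab = {w for s in training for w in s}
        for gold in held_out:
            predicted = set(spans(segment(model, "".join(gold))))
            for span, word in zip(spans(gold), gold):
                # OOV with respect to this fold, and not recovered.
                if word not in vocab and span not in predicted:
                    unidentified.add(word)
    return unidentified
```

Under the error-driven policy, the artificial unknown words would then be the union of this set with the infrequent words selected by the threshold r.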
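The paper uses the method of Baayen and Sproat (1996) as its baseline, so I may need to read that paper. How does it combine unknown words with unidentified unknown words, and how does it learn from them?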