
Notes - 2009 - An Error-Driven Word-Character Hybrid Model for Joint CWS and POS Tagging

2012-08-12 16:47
An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging

Authors (Kobe University): Canasai Kruengkrai, Kiyotaka Uchimoto, Jun'ichi Kazama, Yiou Wang, Kentaro Torisawa, and Hitoshi Isahara

Published in: Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 513-521, Suntec, Singapore, 2-7 August 2009.

The paper combines word-character-based tagging with the MIRA algorithm; it is a further improvement on Tetsuji Nakagawa's 2004-2007 line of work.

Introduction

Joint Chinese word segmentation and POS tagging received wide attention from 2004 to 2009 (Ng and Low, 2004; Nakagawa and Uchimoto, 2007; Zhang and Clark, 2008; Jiang et al., 2008a; Jiang et al., 2008b).

The word-character hybrid tagging model was introduced in 2004: a Markov model at the word level and a maximum-entropy (ME) model at the character level (Nakagawa, 2004; Nakagawa and Uchimoto, 2007).

The MIRA algorithm (Crammer, 2004; Crammer et al., 2005; McDonald, 2006)

Corpus

Penn Chinese Treebank (Xia et al., 2000) (hereafter CTB)

Main content

Background

1. Search space: a lattice built on the word-character hybrid model (Nakagawa and Uchimoto, 2007).

2. Word level: look the word up in the dictionary first and, if found, tag it with its part of speech (POS). Character level: tag each character with its position-of-character (POC) and POS (Asahara, 2003; Nakagawa, 2004).

3. Words found in the dictionary are handled at the word level; words not in the dictionary are handled at the character level.

4. At test time, a dynamic-programming search finds the best candidate path through the lattice.
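As a rough illustration of steps 1-4, here is a minimal lattice construction and best-path search in Python. The toy dictionary, the POC tag set, and the scoring function are my own placeholders, not the paper's actual model:

```python
from collections import defaultdict

# hypothetical dictionary: word -> possible POS tags (illustrative only)
DICT = {"北京": ["NR"], "大学": ["NN"], "北京大学": ["NR"]}
POC_TAGS = ["B", "I", "S", "E"]  # begin / inside / single / end

def build_lattice(sent):
    """Each node = (start, end, surface, tag, level), indexed by end position."""
    lattice = defaultdict(list)
    for start in range(len(sent)):
        # word-level nodes from the dictionary
        for end in range(start + 1, len(sent) + 1):
            w = sent[start:end]
            for pos in DICT.get(w, []):
                lattice[end].append((start, end, w, pos, "word"))
        # character-level nodes, which can cover unknown words
        for poc in POC_TAGS:
            lattice[start + 1].append((start, start + 1, sent[start], poc, "char"))
    return lattice

def score(node):
    # toy score: prefer longer dictionary words (an assumption for the demo)
    start, end, w, tag, level = node
    return (end - start) * 2.0 if level == "word" else 1.0

def viterbi(sent):
    """Dynamic-programming search for the best-scoring path through the lattice."""
    lattice = build_lattice(sent)
    best = {0: (0.0, [])}          # position -> (best score, best path)
    for end in range(1, len(sent) + 1):
        for node in lattice[end]:
            start = node[0]
            if start not in best:
                continue
            s = best[start][0] + score(node)
            if end not in best or s > best[end][0]:
                best[end] = (s, best[start][1] + [node])
    return best[len(sent)][1]

print([(n[2], n[3]) for n in viterbi("北京大学")])
```

A string not covered by the dictionary falls back entirely to character-level nodes, which is exactly how the hybrid model leaves room for unknown words.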

Policies for correct path selection

If a word is very rare in the training corpus, it is likely to behave like an OOV word (Baayen and Sproat, 1996); the effectiveness of this idea has been verified repeatedly (Ratnaparkhi, 1996; Nagata, 1999; Nakagawa, 2004).

The paper adopts this idea as its baseline policy: first count word frequencies in the training corpus, then mark every word whose frequency falls below a threshold r (these become the artificial OOV words). The threshold r is tuned by hand to balance the numbers of IV words and artificial OOV words for the best result. My question: does this mean that only words with frequency above r are used to build the dictionary, i.e. as the basis for the word level?

10-fold cross-validation: 9 parts for training, 1 for validation, with r = 1; unidentified unknown words are collected from each validation set.
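A small sketch of how the artificial OOV words might be generated from word frequencies (whether the cutoff is strictly below r or at most r is my assumption; I use <= here):

```python
from collections import Counter

def artificial_unknowns(train_sents, r):
    """train_sents: list of segmented sentences (lists of words).
    Returns the set of words with frequency <= r, treated as artificial OOV."""
    freq = Counter(w for sent in train_sents for w in sent)
    return {w for w, c in freq.items() if c <= r}

train = [["我", "爱", "北京"], ["我", "去", "上海"], ["我", "爱", "你"]]
print(artificial_unknowns(train, r=1))  # {'北京', '去', '上海', '你'}
```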

Error-driven policy: learn unknown words from three sources: 1) artificial OOV words, obtained from the training corpus; 2) unidentified unknown words, obtained from the validation sets; and 3) identified words, i.e. dictionary words.

Training method

McDonald's (2006) method: k-best MIRA with a 0/1 loss function.
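For intuition, here is a simplified 1-best MIRA update (the paper uses k-best MIRA; this single-constraint version with a plain dict of feature weights is only a sketch):

```python
def mira_update(w, phi_gold, phi_pred, loss, C=1.0):
    """One MIRA step: move weights just enough that the gold analysis
    outscores the prediction by at least `loss`, capped by C."""
    diff = {f: phi_gold.get(f, 0.0) - phi_pred.get(f, 0.0)
            for f in set(phi_gold) | set(phi_pred)}
    norm_sq = sum(v * v for v in diff.values())
    if norm_sq == 0.0:
        return w                       # prediction equals gold: no update
    margin = sum(w.get(f, 0.0) * v for f, v in diff.items())
    tau = min(C, max(0.0, (loss - margin) / norm_sq))
    for f, v in diff.items():
        w[f] = w.get(f, 0.0) + tau * v
    return w

w = mira_update({}, {"gold_feat": 1.0}, {"pred_feat": 1.0}, loss=1.0)
print(w)  # both features moved by tau = 0.5
```

The k-best variant constrains the gold path against the k highest-scoring wrong paths at once instead of only the single best one.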

Features cover both the character and word levels: unigram (27 templates) and bigram (18 templates). w denotes a word, p a POS tag; the classification of T is given in Table 4 (TB denotes the first character of a word).
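A rough sketch of what such unigram/bigram feature templates could look like (the concrete 27 + 18 templates and the T classification from Table 4 are not reproduced here; all template names are illustrative):

```python
def unigram_features(w, p):
    # w: surface word, p: its POS tag; TB-style first-character feature included
    return [f"w={w}", f"p={p}", f"w_p={w}_{p}", f"first_char={w[0]}_{p}"]

def bigram_features(w1, p1, w2, p2):
    # features over an adjacent pair of (word, POS) nodes on the lattice path
    return [f"p1_p2={p1}_{p2}", f"w1_p2={w1}_{p2}", f"w1_w2={w1}_{w2}"]

print(unigram_features("北京", "NR"))
print(bigram_features("北京", "NR", "大学", "NN"))
```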



Best setting: N = 10 iterations, k-best = 5, infrequent-word threshold r = 3. Results: best segmentation F-score 0.9787; best joint seg & tag F-score 0.9364.

Compared with Ng and Low (2004) on CTB 3.0, Zhang and Clark (2008) on CTB 4.0, and Jiang et al. (2008) on CTB 5.0, this paper reports the best results.


This error-driven part is still not entirely clear to me. Why must artificial OOV words be created? What role do they play? What exactly can be learned from them, and how does that help discover real OOV words? I will come back to this in a few days.

We now describe our new approach to leverage additional examples of unknown words. Intuition suggests that even though the system can handle some unknown words (the ones generated artificially with the threshold r), many unidentified unknown words remain that cannot be recovered by the system; we wish to learn the characteristics of such unidentified unknown words. We propose the following simple scheme:

• Divide the training corpus into ten equal sets and perform 10-fold cross validation to find the errors.

• For each trial, train the word-character hybrid model with the baseline policy (r = 1) using nine sets and estimate errors using the remaining validation set.

• Collect unidentified unknown words from each validation set.
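The three steps above can be sketched as follows; `train_model` and `decode` are placeholders for training and decoding the hybrid model, not real APIs:

```python
def collect_unidentified_unknowns(corpus, train_model, decode, k=10):
    """corpus: list of sentences, each a list of (word, pos) pairs.
    Returns the OOV words that the baseline system failed to recover."""
    unidentified = set()
    folds = [corpus[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [s for j, f in enumerate(folds) if j != i for s in f]
        vocab = {w for sent in train for w, _ in sent}
        model = train_model(train)                 # baseline policy, r = 1
        for gold in held_out:
            predicted = decode(model, gold)
            gold_words = {w for w, _ in gold}
            pred_words = {w for w, _ in predicted}
            # OOV words of this fold that the system did not recover
            unidentified |= (gold_words - vocab) - pred_words
    return unidentified

# toy demo with placeholder model functions
dummy_train = lambda data: None
dummy_decode = lambda model, sent: []              # recovers nothing
corpus = [[("a", "N")], [("b", "N")], [("a", "N")], [("c", "N")]]
print(collect_unidentified_unknowns(corpus, dummy_train, dummy_decode, k=2))
```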

Several types of errors are produced by the baseline model, but we only focus on those caused by unidentified unknown words, which can be easily collected in the evaluation process. As described later in Section 5.2, we measure the recall on out-of-vocabulary (OOV) words. Here, we define unidentified unknown words as OOV words in each validation set that cannot be recovered by the system. After ten cross validation runs, we get a list of the unidentified unknown words derived from the whole training corpus. Note that the unidentified unknown words in the cross validation are not necessarily infrequent words, but some overlap may exist. Finally, we obtain the artificial unknown words that combine the unidentified unknown words in cross validation and infrequent words for learning unknown words. We refer to this approach as the error-driven policy.
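As I read it, the final combination step is a simple set union, and the overlap the text mentions is the intersection of the two sets (the example words below are hypothetical):

```python
unidentified = {"微博", "博客"}   # hypothetical cross-validation errors
infrequent = {"博客", "论坛"}     # hypothetical words below the frequency threshold r

# artificial unknown words = CV errors + infrequent words
artificial_unknown = unidentified | infrequent
overlap = unidentified & infrequent   # "some overlap may exist"
print(artificial_unknown, overlap)
```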

The paper builds its baseline on Baayen and Sproat (1996); I may need to read that paper. How exactly are the artificial unknown words combined with the unidentified unknown words, and how is the learning done?