NLP Study Notes: Word Segmentation (excerpts, reposted from 52nlp)

2010-08-06 13:53

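Word segmentation
a) Tokenization
i. Goal: divide text into a sequence of words
ii. A word is a string of contiguous alphanumeric characters with a space on either side; it may contain hyphens and apostrophes but no other punctuation marks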
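
The working definition above can be approximated with a regular expression. This is a minimal sketch, not a tokenizer taken from the lecture: the pattern `WORD_RE` and the function name `tokenize` are illustrative assumptions, and the pattern simply drops surrounding punctuation instead of requiring literal spaces on both sides.

```python
import re

# Assumed pattern: a run of ASCII letters/digits, optionally continued by
# hyphen- or apostrophe-joined runs (e.g. "85-year-old", "won't").
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[-'’][A-Za-z0-9]+)*")


def tokenize(text):
    """Return the words in `text`; all other punctuation is discarded."""
    return WORD_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("John's 85-year-old grandmother won't wash."))
```

Because the trailing period is discarded, a definition this simple cannot tell an abbreviation that ends in a period from an ordinary word followed by a sentence-final period.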

b) What’s a word?
i. English:
1. “Wash.” vs. “wash”
2. “won’t”, “John’s”
3. “pro-Arab”, “the idea of a child-as-required-yuppie-possession must be motivating them”, “85-year-old grandmother”
ii. East Asian languages:
1. There are no spaces between words

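c) Word segmentation
i. Rule-based approach: morphological analysis based on a lexicon and grammatical knowledge
ii. Corpus-based approach: learn from corpora
iii. Issues to consider: coverage, ambiguity, accuracy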

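d) Motivation for statistical segmentation
i. Unknown-word problem:
1. The text contains domain terms and proper names
ii. Grammatical constraints may not be sufficient:
1. Example: alternative segmentations of a noun phrase
iii. Example 1
1. Segmentation: sha-choh/ken/gyoh-mu/bu-choh
2. Translation: “president/and/business/general manager”
iv. Example 2
1. Segmentation: sha-choh/ken-gyoh/mu/bu-choh
2. Translation: “president/subsidiary business/Tsutomi [a name]/general manager”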

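e) A segmentation algorithm:
i. Key idea: for each candidate boundary, compare the frequency of the n-grams adjacent to that boundary with the frequency of the n-grams that straddle it.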
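
The key idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not necessarily the exact procedure from the lecture: it uses a single n-gram order `n`, a fixed vote `threshold`, and hypothetical helper names (`ngram_counts`, `boundary_score`, `segment`). The notes do not say how votes from different orders are combined or how the threshold is chosen.

```python
from collections import Counter


def ngram_counts(corpus, n):
    """Count every character n-gram in an iterable of unsegmented strings."""
    counts = Counter()
    for line in corpus:
        for i in range(len(line) - n + 1):
            counts[line[i:i + n]] += 1
    return counts


def _gram(text, start, n):
    """Return text[start:start+n], or None if it would run off either end."""
    if start < 0 or start + n > len(text):
        return None
    return text[start:start + n]


def boundary_score(text, k, counts, n):
    """Fraction of (adjacent, straddling) n-gram pairs in which the adjacent
    n-gram is more frequent, for a boundary between text[k-1] and text[k]."""
    # n-grams that end at, or start at, the candidate boundary
    adjacent = [g for g in (_gram(text, k - n, n), _gram(text, k, n)) if g]
    # n-grams that cross the candidate boundary
    straddling = [g for g in (_gram(text, k - n + j, n) for j in range(1, n)) if g]
    pairs = [(a, s) for a in adjacent for s in straddling]
    if not pairs:
        return 0.0
    wins = sum(counts[a] > counts[s] for a, s in pairs)
    return wins / len(pairs)


def segment(text, corpus, n=2, threshold=0.5):
    """Cut `text` wherever the boundary score exceeds `threshold` (assumed rule)."""
    counts = ngram_counts(corpus, n)
    pieces, start = [], 0
    for k in range(1, len(text)):
        if boundary_score(text, k, counts, n) > threshold:
            pieces.append(text[start:k])
            start = k
    pieces.append(text[start:])
    return pieces
```

With the toy corpus `["abab", "cdcd", "abcd", "cdab"]`, the bigrams "ab" and "cd" each occur four times while "bc" occurs once, so `segment("abcd", corpus)` returns `['ab', 'cd']`: the only boundary whose adjacent bigrams outnumber the straddling one is the middle position.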

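f) Experimental framework
i. Corpus: 150 MB of 1993 Nikkei newswire text
ii. Manual segmentation: 50 sequences for the development set (used to tune parameters) and 50 sequences for the test set
iii. Baseline algorithms: the ChaSen and JUMAN morphological analyzers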

g) Evaluation measures
i. tp: true positive (TP), a positive example that the model predicts as positive;
ii. fp: false positive (FP), a negative example that the model predicts as positive;
iii. tn: true negative (TN), a negative example that the model predicts as negative;
iv. fn: false negative (FN), a positive example that the model predicts as negative;
v. Precision: the proportion of selected items that the system got right:
P = tp / (tp + fp)
vi. Recall: the proportion of target items that the system selected:
R = tp / (tp + fn)
vii. F-measure (the harmonic mean of P and R):
F = 2 * P * R / (P + R)
viii. Word precision (P) is the percentage of proposed brackets that match word-level brackets in the annotation;
ix. Word recall (R) is the percentage of word-level brackets that are proposed by the algorithm.
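
As a worked illustration of these measures, the sketch below scores a proposed segmentation against a reference one by comparing word-level brackets. The function names `word_spans` and `precision_recall_f` are hypothetical, and representing each bracket as a `(start, end)` character offset pair is an assumption made for this example.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character brackets."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall_f(proposed, gold):
    """Word precision, word recall and F-measure of `proposed` against `gold`."""
    p_spans, g_spans = word_spans(proposed), word_spans(gold)
    tp = len(p_spans & g_spans)   # proposed brackets that match the annotation
    fp = len(p_spans - g_spans)   # proposed brackets absent from the annotation
    fn = len(g_spans - p_spans)   # annotated brackets that were not proposed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

For example, if the proposal is `["ab", "c", "d"]` and the annotation is `["ab", "cd"]`, one of the three proposed brackets is correct and one of the two annotated brackets is found, so P = 1/3, R = 1/2 and F = 0.4.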

Full original text: see http://www.52nlp.cn/mit-nlp-second-lesson-word-counting-third-part
