Chinese Language Processing at Penn -- Penn Chinese Treebank
2015-09-07 02:29
Penn's Chinese Language Processing program is anchored by linguistic corpora annotated with morphological, syntactic, semantic and discourse structures. The Penn
Chinese Treebank is a segmented, part-of-speech tagged, and fully bracketed corpus that currently has 500 thousand words (over 824K Chinese characters). The sources of
this corpus are mostly Xinhua newswire, Sinorama news magazine and Hong Kong News. The segmentation, POS-tagging and syntactic
bracketing standards are fully documented.
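To make "segmented, part-of-speech tagged, and fully bracketed" concrete, here is a minimal Python sketch that reads one bracketed tree and recovers the segmented words with their tags. The example sentence is made up for illustration and is not taken from the corpus; the labels `IP`, `NP`, `VP`, `NR`, `VV` and `NN` are assumed to follow the treebank's tagset, and the helper names `parse_bracketed` and `tagged_words` are hypothetical.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # leaf: a segmented word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()


def tagged_words(node):
    """Yield (word, pos_tag) pairs from left to right."""
    label, children = node
    for child in children:
        if isinstance(child, str):
            yield (child, label)
        else:
            yield from tagged_words(child)


# Made-up example ("Zhang San likes music"), not a sentence from the corpus.
EXAMPLE = "(IP (NP (NR 张三)) (VP (VV 喜欢) (NP (NN 音乐))))"
tree = parse_bracketed(EXAMPLE)
pairs = list(tagged_words(tree))
```

The word boundaries give the segmentation, the innermost labels give the part-of-speech tags, and the nesting gives the syntactic bracketing.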
The Chinese Proposition Bank adds a layer of semantic annotation to the Chinese Treebank. This layer of semantic annotation
mainly deals with the predicate-argument structure of Chinese verbs. This task is also called semantic role labeling in the sense that each verb is expected to take a fixed number of arguments and each argument plays a role with regard to
the verb. The draft annotation guidelines and the annotation of the first installment of the corpus (250K words) are near completion.
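A predicate-argument annotation of this kind can be pictured as a verb plus a small set of labeled arguments. The sketch below is only an illustration: the class name `Proposition`, the method `role_of`, the example sentence, and the numbered role labels `ARG0` and `ARG1` (borrowed from the usual proposition bank convention) are assumptions, not the project's actual file format.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One verb and the arguments it takes (hypothetical structure)."""
    predicate: str
    arguments: dict = field(default_factory=dict)  # role label -> text span

    def role_of(self, span):
        """Return the role label of a text span, or None if it is not an argument."""
        for role, text in self.arguments.items():
            if text == span:
                return role
        return None


# Made-up example: "张三 喜欢 音乐" ("Zhang San likes music").
prop = Proposition(
    predicate="喜欢",
    arguments={"ARG0": "张三", "ARG1": "音乐"},
)
```

Semantic role labeling is then the task of filling in `arguments` automatically for each verb in a sentence.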
Extending the idea of predicate-argument structure to discourse, we are also in the initial stages of building a Chinese
Discourse Treebank in which discourse connectives are treated as predicates that take arguments. A discourse connective can be a subordinating conjunction, a coordinating conjunction, or an anaphoric adverbial expression. Sometimes discourse relations can
even be inferred when explicit discourse markers are not available.
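The same idea can be sketched for discourse: a connective plays the role of the predicate and the two text spans it relates are its arguments, with the connective left empty when the relation has to be inferred. The class name `DiscourseRelation`, its fields, and both example sentences are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscourseRelation:
    """A discourse connective treated as a predicate with two arguments."""
    connective: Optional[str]  # None when no explicit marker is present
    arg1: str
    arg2: str

    @property
    def is_explicit(self):
        """True when the relation is signaled by an explicit connective."""
        return self.connective is not None


# Made-up explicit relation: "It was cold, but he still went out."
explicit = DiscourseRelation("但是", "天气很冷", "他还是出门了")

# Made-up implicit relation: "It rained. The match was cancelled."
implicit = DiscourseRelation(None, "下雨了", "比赛取消了")
```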
Other Chinese annotation projects carried out at Penn include coreference annotation and sense-tagging. Since most of our data have English translations, we are also building parallel
Chinese-English treebanks and proposition banks.
In the context of NLP research, building annotated corpora is of course only part of the larger picture, a means to an end. The goal is to train natural language systems. To that end, we have built Chinese segmenters and part-of-speech
taggers, parsers, semantic role labelers, word sense and coreference disambiguators. We have also built machine translation and information extraction systems.
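Translation:
Penn's Chinese language processing program is built on corpora annotated with morphological, syntactic, semantic and discourse structures. The Penn Chinese Treebank currently has 500 thousand words (over 824K Chinese characters) and is segmented, part-of-speech tagged, and fully bracketed. The corpus comes from Xinhua newswire, Sinorama news magazine and Hong Kong News. The segmentation,
POS-tagging and syntactic bracketing standards are fully documented.
The Chinese Proposition Bank adds a layer of semantic annotation. This layer mainly deals with the predicate-argument structure of Chinese verbs. This task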