A part-of-speech tagging tool for biomedical corpora: GENIA tagger
2015-04-23 10:59
GENIA Tagger
- part-of-speech tagging, shallow parsing, and named entity recognition for biomedical text -
What's New
20 Oct. 2006 A demo page is available.
6 Oct. 2006 Version 3.0: The tagger now performs named entity recognition.
Overview
The GENIA tagger analyzes English sentences and outputs the base forms, part-of-speech tags, chunk tags, and named entity tags. The tagger is specifically tuned for biomedical text such as MEDLINE abstracts. If you need to extract information from biomedical documents, this tagger might be a useful preprocessing tool. You can try the tagger on a demo page.
How to use the tagger
You need gcc to build the tagger.
1. Download the latest version of the tagger
Apr. 16 2007 geniatagger-3.0.1.tar.gz (source package for Unix)
2. Expand the archive
> tar xvzf geniatagger.tar.gz
3. Make
> cd geniatagger/
> make
4. Tag sentences
Prepare a text file containing one sentence per line, then
> ./geniatagger < RAWTEXT > TAGGEDTEXT
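If you would rather drive the tagger from a script than through shell redirection, the step above can be sketched in Python roughly as follows. The function name `run_tagger` and its parameters are hypothetical, not part of the GENIA tagger distribution. The sketch also assumes the binary should be started from its build directory so that it can locate its model files.

```python
import subprocess


def run_tagger(sentences, tagger_dir, command=("./geniatagger",)):
    """Send one sentence per line to the tagger and return its raw output.

    `tagger_dir` is the directory where `make` was run (an assumption:
    the binary is started from there so it can find its model files).
    """
    # Collapse internal whitespace, including newlines, so that each
    # sentence occupies exactly one input line; drop empty entries.
    lines = [" ".join(s.split()) for s in sentences]
    payload = "\n".join(line for line in lines if line) + "\n"

    result = subprocess.run(
        list(command),
        cwd=tagger_dir,
        input=payload,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the tagger fails
    )
    return result.stdout


# Usage (requires a built tagger; the path is only an illustration):
# tagged = run_tagger(["Inhibition of NF-kappaB activation was observed."],
#                     "geniatagger")
```

The `command` parameter exists only so that the wrapper can be exercised with a stand-in program when the tagger itself is not installed.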
The tagger outputs the base forms, part-of-speech (POS) tags, chunk tags, and named entity (NE) tags in the following tab-separated format.
word1	base1	POStag1	chunktag1	NEtag1
word2	base2	POStag2	chunktag2	NEtag2
  :	  :	  :	  :	  :
Chunks are represented in the IOB2 format (B for BEGIN, I for INSIDE, and O for OUTSIDE).
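A minimal sketch of reading this format back into structured records is shown below. The names `Token` and `parse_tagged` are hypothetical. The sketch assumes that every non-empty line carries exactly five tab-separated fields and that sentences are separated by blank lines; check both assumptions against the output of your own build.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    base: str
    pos: str
    chunk: str
    ne: str


def parse_tagged(text):
    """Parse tagger output into a list of sentences (lists of Token)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Assumed sentence boundary: a blank line.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"expected 5 tab-separated fields, got {len(fields)}: {line!r}"
            )
        current.append(Token(*fields))
    if current:
        sentences.append(current)
    return sentences
```

Raising on a malformed line, rather than skipping it silently, makes it easier to notice when the input was not actually one sentence per line.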
Example
> echo "Inhibition of NF-kappaB activation reversed the anti-apoptotic effect of isochamaejasmin." | ./geniatagger
Inhibition	Inhibition	NN	B-NP	O
of	of	IN	B-PP	O
NF-kappaB	NF-kappaB	NN	B-NP	B-protein
activation	activation	NN	I-NP	O
reversed	reverse	VBD	B-VP	O
the	the	DT	B-NP	O
anti-apoptotic	anti-apoptotic	JJ	I-NP	O
effect	effect	NN	I-NP	O
of	of	IN	B-PP	O
isochamaejasmin	isochamaejasmin	NN	B-NP	O
.	.	.	O	O
You can easily extract four noun phrases ("Inhibition", "NF-kappaB activation", "the anti-apoptotic effect", and "isochamaejasmin") from this output by looking at the chunk tags. You can also find a protein name with the named entity tags.
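The extraction described here can be sketched as a small IOB2 decoder. The helper name `iob2_spans` is hypothetical, and the rows are copied by hand from the example output (word, chunk tag, NE tag). The sketch assumes the NE column uses the same B-/I-/O scheme as the chunk column, which the `B-protein` tag in the example suggests.

```python
def iob2_spans(words, tags):
    """Group words into (label, phrase) spans according to IOB2 tags."""
    spans, label, buffer = [], None, []
    for word, tag in zip(words, tags):
        prefix, _, kind = tag.partition("-")
        if prefix == "I" and kind == label:
            buffer.append(word)  # continue the span in progress
            continue
        if buffer:
            spans.append((label, " ".join(buffer)))  # close the span
        if prefix in ("B", "I"):
            # B- opens a new span; a stray I- is leniently treated as B-.
            label, buffer = kind, [word]
        else:
            label, buffer = None, []  # O: outside any span
    if buffer:
        spans.append((label, " ".join(buffer)))
    return spans


# (word, chunk tag, NE tag) rows from the example sentence.
rows = [
    ("Inhibition", "B-NP", "O"),
    ("of", "B-PP", "O"),
    ("NF-kappaB", "B-NP", "B-protein"),
    ("activation", "I-NP", "O"),
    ("reversed", "B-VP", "O"),
    ("the", "B-NP", "O"),
    ("anti-apoptotic", "I-NP", "O"),
    ("effect", "I-NP", "O"),
    ("of", "B-PP", "O"),
    ("isochamaejasmin", "B-NP", "O"),
    (".", "O", "O"),
]
words = [r[0] for r in rows]
noun_phrases = [p for lab, p in iob2_spans(words, [r[1] for r in rows]) if lab == "NP"]
proteins = [p for lab, p in iob2_spans(words, [r[2] for r in rows]) if lab == "protein"]
print(noun_phrases)
print(proteins)
```

The same function serves both columns because it only looks at the tag prefix and the label after the hyphen.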
Part-of-Speech Tagging Performance
General-purpose part-of-speech taggers do not usually perform well on biomedical text because lexical characteristics of biomedical documents are considerably different from those of newspaper articles, which are often used as the training data for a general-purpose tagger. The GENIA tagger is trained not only on the Wall Street Journal corpus but also on the GENIA corpus and the PennBioIE corpus [1], so the tagger works well on various types of biomedical documents.
The table below shows the tagging accuracies of a tagger trained with different sets of documents. For details of the performance, see [2] (the latest version uses a different tagging algorithm [3] and gives slightly better performance than reported in the paper).
Tagger | Wall Street Journal | GENIA corpus |
---|---|---|
A tagger trained on the WSJ corpus | 97.05% | 85.19% |
A tagger trained on the GENIA corpus | 78.57% | 98.49% |
GENIA tagger | 96.94% | 98.26% |
Chunking Performance
(to be evaluated)
Named Entity Recognition Performance
The named entity tagger is trained on the NLPBA data set. The features and parameters were tuned using the training data. The final performance on the evaluation set is as follows.
Entity Type | Recall | Precision | F-score |
---|---|---|---|
Protein | 81.41 | 65.82 | 72.79 |
DNA | 66.76 | 65.64 | 66.20 |
RNA | 68.64 | 60.45 | 64.29 |
Cell Line | 59.60 | 56.12 | 57.81 |
Cell Type | 70.54 | 78.51 | 74.31 |
Overall | 75.78 | 67.45 | 71.37 |
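As a quick sanity check on the table, the sketch below recomputes the F-score column from the recall and precision columns. It assumes the F-score is the balanced F1 measure (the harmonic mean of precision and recall); the source does not state this, but the reported figures are consistent with it. The names `f_score` and `reported` are hypothetical.

```python
def f_score(recall, precision):
    """Balanced F1: harmonic mean of precision and recall (in percent)."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision, reported F-score) copied from the table above.
reported = {
    "Protein": (81.41, 65.82, 72.79),
    "DNA": (66.76, 65.64, 66.20),
    "RNA": (68.64, 60.45, 64.29),
    "Cell Line": (59.60, 56.12, 57.81),
    "Cell Type": (70.54, 78.51, 74.31),
    "Overall": (75.78, 67.45, 71.37),
}

for name, (recall, precision, f_reported) in reported.items():
    print(f"{name}: computed {f_score(recall, precision):.2f}, "
          f"reported {f_reported:.2f}")
```

Note that recall exceeds precision for every entity type except Cell Type, so the tagger tends to over-predict entities rather than miss them.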
References
[1] S. Kulick, A. Bies, M. Liberman, M. Mandel, R. McDonald, M. Palmer, A. Schein and L. Ungar. Integrated Annotation for Biomedical Information Extraction, HLT/NAACL 2004 Workshop: Biolink 2004, pp. 61-68.
[2] Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, and Jun'ichi Tsujii, Developing a Robust Part-of-Speech Tagger for Biomedical Text, Advances in Informatics - 10th Panhellenic Conference on Informatics, LNCS 3746, pp. 382-392, 2005.
[3] Yoshimasa Tsuruoka and Jun'ichi Tsujii, Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data, Proceedings of HLT/EMNLP 2005, pp. 467-474.
#-----------------------------------------------------------------------
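Source of the text above: http://www.nactem.ac.uk/GENIA/tagger/
GENIA Tagger Demo: http://text0.mib.man.ac.uk/software/geniatagger/
geniatagger-3.0.1 download: http://pan.baidu.com/s/1hqznbta (A gripe: I downloaded all sorts of geniatagger packages and not one of them would build with make. I had searched for so long that I nearly gave up on the package altogether before finally finding a version that builds. It seems to have come from this GitHub repository, though I downloaded so many versions that I lost track of which was which: https://github.com/ninjin/geniatagger), http://pan.baidu.com/s/1qW1E4jE (these versions do not seem to work), or http://download.csdn.net/detail/u010454729/8623187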