
STS Data Analysis

2015-12-28 22:16

2012 - train

- MSR-Paraphrase, Microsoft Research Paraphrase Corpus
http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/
750 pairs of sentences.

- MSR-Video, Microsoft Research Video Description Corpus
http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/
750 pairs of sentences.

- SMTeuroparl: WMT2008 development dataset (Europarl section)
http://www.statmt.org/wmt08/shared-evaluation-task.html
734 pairs of sentences.

2012 - test

- MSR-Paraphrase, Microsoft Research Paraphrase Corpus
http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/
750 pairs of sentences.

- MSR-Video, Microsoft Research Video Description Corpus
http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/
750 pairs of sentences.

- SMTeuroparl: WMT2008 development dataset (Europarl section)
http://www.statmt.org/wmt08/shared-evaluation-task.html
459 pairs of sentences.

In addition, it contains two surprise datasets comprising the
following collections:

- SMTnews: news conversation sentence pairs from WMT
399 pairs of sentences.

- OnWN: pairs of sentences where the first comes from OntoNotes and
the second from a WordNet definition.
750 pairs of sentences.
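
The 2012 pair counts listed above can be tallied per split with a short sketch. The dictionary layout and the names `STS2012_PAIRS` and `total_pairs` are illustrative assumptions; only the counts come from the listing.

```python
# Pair counts copied from the 2012 train/test listing above.
# The dictionary layout and names are illustrative, not part of the STS release.
STS2012_PAIRS = {
    "train": {
        "MSR-Paraphrase": 750,
        "MSR-Video": 750,
        "SMTeuroparl": 734,
    },
    "test": {
        "MSR-Paraphrase": 750,
        "MSR-Video": 750,
        "SMTeuroparl": 459,
        "SMTnews": 399,
        "OnWN": 750,
    },
}


def total_pairs(split):
    """Return the total number of sentence pairs listed for one split."""
    return sum(STS2012_PAIRS[split].values())


if __name__ == "__main__":
    for split in STS2012_PAIRS:
        print(split, total_pairs(split))
```

Summing the listed figures gives 2234 pairs for train and 3108 pairs for test.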

2013 - test

- STS.input.headlines.txt: we used headlines mined from several news
sources by European Media Monitor using the RSS feed.
http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html

- STS.input.OnWN.txt: The sentences are sense definitions from WordNet
and OntoNotes.

- STS.input.FNWN.txt: The sentences are sense definitions from WordNet
and FrameNet. Note that some FrameNet definitions involve more than
one sentence.

// missing

- STS.input.SMT.txt: This SMT dataset comes from DARPA GALE HTER and
HyTER, where one sentence is an MT output and the other is a
reference translation where a reference is generated based on human
post editing (provided by LDC) or an original human reference
(provided by LDC) or a human generated reference based on FSM as
described in (Dreyer and Marcu, NAACL 2012). The reference comes
from post edited translations.
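
Reading one of the STS.input.*.txt files listed above could look like the following minimal sketch. It assumes each line holds the two sentences of a pair separated by a tab, which this listing does not state; any further tab-separated fields (such as source metadata) are ignored. The function names `parse_sts_input` and `load_sts_input` are hypothetical.

```python
def parse_sts_input(lines):
    """Parse sentence pairs from an iterable of lines.

    Assumption: each non-empty line is "sentence1<TAB>sentence2", optionally
    followed by extra tab-separated fields, which are ignored here.
    """
    pairs = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {number}: expected two tab-separated sentences")
        pairs.append((fields[0].strip(), fields[1].strip()))
    return pairs


def load_sts_input(path):
    """Read a file such as STS.input.headlines.txt and return its sentence pairs."""
    with open(path, encoding="utf-8") as handle:
        return parse_sts_input(handle)
```

For example, `load_sts_input("STS.input.headlines.txt")` would return a list of `(sentence1, sentence2)` tuples, provided the file follows the assumed layout.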

2014 - test

- STS.input.image.txt: The Image Descriptions data set is a subset of
the PASCAL VOC-2008 data set (Rashtchian et al., 2010). The PASCAL
VOC-2008 data set consists of 1,000 images and has been used by a
number of image description systems. The image captions of the data
set are released under a CreativeCommons Attribution-ShareAlike
license; the descriptions themselves are free.

- STS.input.OnWN.txt: The sentences are sense definitions from WordNet
and OntoNotes. 5 pairs of sentences.

- STS.input.tweet-news.txt: The tweet-news data set is a subset of the
Linking-Tweets-to-News data set (Guo et al., 2013), which consists
of 34,888 tweets and 12,704 news articles. The tweets are the
comments on the news articles. The news sentences are the titles of
news articles.

- STS.input.deft-news.txt: A subset of news article data in the DEFT
project.

- STS.input.deft-forum.txt: A subset of discussion forum data in the
DEFT project.

- STS.input.headlines.txt: we used headlines mined from several news
sources by European Media Monitor using the RSS feed.
http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html

2015 - test (with some raw data)

- STS.input.image.txt: The Image Descriptions data set is a subset of
the Flickr dataset presented in (Rashtchian et al., 2010), which
consisted of 8108 hand-selected images from Flickr, depicting
actions and events of people or animals, with five captions per
image. The image captions of the data set are released under a
CreativeCommons Attribution-ShareAlike license.

- STS.input.headlines.txt: We used headlines mined from several news
sources by European Media Monitor using their RSS feed from April 2,
2013 to July 28, 2014. This period was selected to avoid overlap
with STS 2014 data.
http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html

- STS.input.answers-students.txt: The source of these pairs is the
BEETLE corpus (Dzikovska et al., 2010), a question-answer data
set collected and annotated during the evaluation of the BEETLE II
tutorial dialogue system. The BEETLE II system is an intelligent
tutoring engine that teaches students basic electricity and
electronics. The corpus was used in the student response analysis
task of SemEval-2013. Given a question, a known correct "reference
answer" and the "student answer", the goal of the task was to assess
student answers as correct, contradictory or incorrect (partially
correct, irrelevant or not in the domain). For STS, we selected
pairs of answers made up of single sentences.

- STS.input.answers-forum.txt: This data set consists of paired
answers collected from the Stack Exchange question and answer
websites (http://stackexchange.com/). Some of the paired answers are
in response to the same question, while others are in response to
different questions. Each answer in the pair consists of a statement
composed of a single sentence or sentence fragment. For
multi-sentence answers, we extract the single sentence from the
larger answer that appears to best summarize the answer. The Stack
Exchange data license requires that we provide additional metadata
that allows participants to recover the source of each paired
answer. Systems submitted to the shared task must not make use of
this metadata in any way to assign STS scores or to otherwise inform
the operation of the system.

- STS.input.belief: The data is collected from DEFT Committed Belief
Annotation dataset (LDC2014E55). All source documents are English
Discussion Forum data.