
An Introduction to Cornell University's Movie Dialogue Corpus (Cornell Movie-Dialogs Corpus)

2016-12-05 15:33

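This public resource is referenced by many open-source projects and papers in natural language processing (NLP), so I read its README carefully and recorded the key points here.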

All files use " +++$+++ " as the field separator.
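
Splitting on this separator can be sketched in a few lines of Python. The helper name `split_fields` is my own and not part of the corpus; the sample row is one of the movie_lines.txt samples quoted in this post.

```python
SEP = " +++$+++ "  # field separator used by every file in the corpus

def split_fields(line):
    """Split one corpus row into its raw string fields."""
    # Strip only the newline so that spaces inside the separator stay intact.
    return line.rstrip("\n").split(SEP)

sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
fields = split_fields(sample)
print(fields)  # ['L1045', 'u0', 'm0', 'BIANCA', 'They do not!']
```

Every field comes back as a string; converting numbers or lists is left to the per-file parsing.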

- movie_titles_metadata.txt

   - contains information about each movie title

   - fields:

       - movieID

       - movie title

       - movie year

       - IMDB rating

       - no. IMDB votes

       - genres in the format ['genre1','genre2',...,'genreN']
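
A minimal sketch of turning one row of this file into typed values, assuming the six fields appear in the order listed above. The function name `parse_movie_title` and the example row are made up for illustration; the row is not copied from the corpus. The genre list is written like a Python list literal, so `ast.literal_eval` can read it without running arbitrary code.

```python
import ast

SEP = " +++$+++ "

def parse_movie_title(line):
    """Parse one movie_titles_metadata.txt row into a dict (hypothetical helper)."""
    movie_id, title, year, rating, votes, genres = line.rstrip("\n").split(SEP)
    return {
        "movieID": movie_id,
        "title": title,
        "year": year,  # kept as a string in case some values are not plain integers
        "imdb_rating": float(rating),
        "imdb_votes": int(votes),
        "genres": ast.literal_eval(genres),  # "['a', 'b']" -> ['a', 'b']
    }

# Made-up row for illustration only, not real corpus data.
example = "m999 +++$+++ an example movie +++$+++ 2000 +++$+++ 7.50 +++$+++ 1234 +++$+++ ['comedy', 'romance']"
movie = parse_movie_title(example)
print(movie["genres"])  # ['comedy', 'romance']
```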

- movie_characters_metadata.txt

   - contains information about each movie character

   - fields:

       - characterID

       - character name

       - movieID

       - movie title

       - gender ("?" for unlabeled cases)

       - position in credits ("?" for unlabeled cases)
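
Because gender and credits position use "?" for unlabeled cases, it helps to map that marker to `None` while parsing. The sketch below assumes the field order listed above and assumes that labeled credit positions are integers; `parse_character` and the example row are hypothetical, and the row is not taken from the corpus.

```python
SEP = " +++$+++ "

def unlabeled_to_none(value):
    """Map the corpus marker '?' (unlabeled) to None."""
    return None if value == "?" else value

def parse_character(line):
    """Parse one movie_characters_metadata.txt row into a dict (hypothetical helper)."""
    char_id, name, movie_id, movie_title, gender, position = line.rstrip("\n").split(SEP)
    position = unlabeled_to_none(position)
    return {
        "characterID": char_id,
        "name": name,
        "movieID": movie_id,
        "movie_title": movie_title,
        "gender": unlabeled_to_none(gender),
        # Assumption: any labeled position is an integer.
        "credits_position": int(position) if position is not None else None,
    }

# Made-up row for illustration only, not real corpus data.
example = "u999 +++$+++ EXAMPLE +++$+++ m999 +++$+++ an example movie +++$+++ ? +++$+++ 3"
character = parse_character(example)
print(character["gender"], character["credits_position"])  # None 3
```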

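The two files below are the important ones: one contains all the text, and the other describes how the utterances relate to one another.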

- movie_lines.txt

   - contains the actual text of each utterance

   - fields:

       - lineID

       - characterID (who uttered this phrase)

       - movieID

       - character name

       - text of the utterance

The first 5 samples:

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!

L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!

L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.

L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?

L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
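
To look up utterances later, the rows can be indexed by lineID. This is a sketch, not the corpus authors' tooling: `load_lines` is a name I chose, and skipping rows that do not have exactly five fields is my own conservative choice. The demo uses the five sample rows shown above.

```python
SEP = " +++$+++ "

def load_lines(rows):
    """Build a dict mapping lineID -> utterance record from movie_lines.txt rows."""
    lines = {}
    for row in rows:
        # maxsplit=4 keeps the utterance whole even if its text contained the separator.
        parts = row.rstrip("\n").split(SEP, 4)
        if len(parts) != 5:
            continue  # skip rows that do not match the expected five fields
        line_id, char_id, movie_id, name, text = parts
        lines[line_id] = {
            "characterID": char_id,
            "movieID": movie_id,
            "name": name,
            "text": text,
        }
    return lines

samples = [
    "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!",
    "L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!",
    "L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.",
    "L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?",
    "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.",
]
lines = load_lines(samples)
print(lines["L985"]["text"])  # I hope so.
```

`load_lines` accepts any iterable of strings, so an open file object works as well. When reading the real file, the encoding has to be chosen explicitly; which encoding is correct is something to verify against your copy of the corpus.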

- movie_conversations.txt

   - the structure of the conversations

   - fields:

       - characterID of the first character involved in the conversation

       - characterID of the second character involved in the conversation

       - movieID of the movie in which the conversation occurred

       - list of the utterances that make the conversation, in chronological order: ['lineID1','lineID2',...,'lineIDN']; has to be matched with movie_lines.txt to reconstruct the actual content

The first 5 samples:

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']
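
Matching a conversation against movie_lines.txt can be sketched as follows. The function names are my own. The conversation row is one of the samples above, but the texts in `texts` are placeholders: this post does not quote the text of L198 or L199. Building adjacent (utterance, reply) pairs is one common way to use the corpus, not something the README prescribes.

```python
import ast

SEP = " +++$+++ "

def parse_conversation(row):
    """Parse one movie_conversations.txt row into a dict (hypothetical helper)."""
    first_char, second_char, movie_id, line_ids = row.rstrip("\n").split(SEP)
    return {
        "characters": (first_char, second_char),
        "movieID": movie_id,
        "lineIDs": ast.literal_eval(line_ids),  # "['L198', 'L199']" -> list of str
    }

def reconstruct(conversation, text_by_line_id):
    """Return the utterance texts in chronological order.

    Raises KeyError if a lineID is missing from text_by_line_id.
    """
    return [text_by_line_id[line_id] for line_id in conversation["lineIDs"]]

def adjacent_pairs(utterances):
    """Pair each utterance with the one that follows it."""
    return list(zip(utterances, utterances[1:]))

conv = parse_conversation("u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']")
# Placeholder texts; in practice this mapping comes from movie_lines.txt.
texts = {"L198": "<text of L198>", "L199": "<text of L199>"}
pairs = adjacent_pairs(reconstruct(conv, texts))
print(pairs)  # [('<text of L198>', '<text of L199>')]
```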

- raw_script_urls.txt

   - the URLs from which the raw sources were retrieved

Source:

http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html


