Study Notes on "Natural Language Processing with Python" (Part 3)

2017-01-12 22:30

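I now start on Chapter 2 of the book, "Accessing Text Corpora and Lexical Resources".

I. Accessing Text Corpora

1. The Gutenberg corpus (gutenberg)

Content: NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which holds roughly 36,000 free electronic books.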

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt', u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt', u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt', u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt', u'shakespeare-macbeth.txt', u'whitman-leaves.txt']

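nltk.corpus.gutenberg.fileids() returns the identifiers of the files that make up the Gutenberg corpus bundled with NLTK.

corpus: a collection of written or spoken texts

fileid: file identifier

So the meaning of the call is clear: from the Gutenberg part of NLTK's corpus collection, list all the file identifiers (the book titles).

Example 1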

>>> emma=nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)

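Example 2:

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
    num_chars=len(gutenberg.raw(fileid))    # characters, including spaces
    num_words=len(gutenberg.words(fileid))  # word and punctuation tokens
    num_sents=len(gutenberg.sents(fileid))  # sentences
    num_vocab=len(set([w.lower() for w in gutenberg.words(fileid)]))  # distinct case-folded tokens
    # average word length, average sentence length, average uses per vocabulary item
    print int(num_chars/num_words),int(num_words/num_sents),int(num_words/num_vocab),fileid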

4 24 26 austen-emma.txt
4 26 16 austen-persuasion.txt
4 28 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 19 5 blake-poems.txt
4 19 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 20 12 carroll-alice.txt
4 20 11 chesterton-ball.txt
4 22 11 chesterton-brown.txt
4 18 10 chesterton-thursday.txt
4 20 24 edgeworth-parents.txt
4 25 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 11 8 shakespeare-caesar.txt
4 12 7 shakespeare-hamlet.txt
4 12 6 shakespeare-macbeth.txt
4 36 12 whitman-leaves.txt

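raw() returns the contents of a text as one unprocessed string, so its length is a count of characters; words() splits the text into words; sents() (sentences) splits it into sentences.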
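To make the three statistics in Example 2 concrete, here is a minimal sketch in plain Python 3 (the sessions above are Python 2) that computes them for a short in-memory string. The function name corpus_stats, the sample sentence, and the regex-based sentence splitting are assumptions made for illustration. NLTK's own tokenizers are more careful and keep punctuation as separate tokens, so the numbers will not match NLTK's exactly.

```python
import re

def corpus_stats(raw):
    """Return (avg word length, avg sentence length, avg uses per vocabulary item)."""
    # Naive stand-in for sents(): split on runs of ., ! or ? and drop empty pieces
    sents = [s.split() for s in re.split(r'[.!?]+', raw) if s.strip()]
    # Naive stand-in for words(): whitespace-separated tokens of every sentence
    words = [w for sent in sents for w in sent]
    vocab = set(w.lower() for w in words)
    # len(raw) counts every character, spaces and punctuation included
    return (len(raw) // len(words),
            len(words) // len(sents),
            len(words) // len(vocab))

sample = "The cat sat. The cat ran! A dog sat."
print(corpus_stats(sample))  # (4, 3, 1)
```

Because the character count includes spaces and punctuation, the first figure overstates the real average word length; the same caveat applies to the 4 in the first column of the output above.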

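2. Web and chat text (webtext)

NLTK's web text collection includes posts from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews.

Example 1:

>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
    print fileid,webtext.raw(fileid)[:65],'...'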

firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...

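Exercise from p. 63 of the book:

Your turn: pick a language of interest in udhr.fileids() and define a variable raw_text= udhr.raw(+). Then use nltk.FreqDist(raw_text).plot() to plot the frequency distribution of the letters of the text.

>>> from nltk.corpus import udhr
>>> def search_word(word):
    # return the first fileid whose name contains the given word
    for w in udhr.fileids():
        if word in w.lower():
            return w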

>>> search_word('english')
u'English-Latin1'
>>> raw_text=udhr.raw('English-Latin1')

>>> nltk.FreqDist(raw_text).plot()
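Note that nltk.FreqDist(raw_text) iterates over the raw string, so it counts every character, spaces and punctuation included. If only letters are wanted, a filter can be applied first. The sketch below uses collections.Counter as a stand-in for FreqDist and runs in Python 3 without NLTK; the function name letter_freq is hypothetical, and the sample string is the opening sentence of Article 1 of the declaration typed in by hand rather than read from the corpus.

```python
from collections import Counter

def letter_freq(raw_text):
    """Count alphabetic characters only, ignoring case."""
    return Counter(ch.lower() for ch in raw_text if ch.isalpha())

raw_text = "All human beings are born free and equal in dignity and rights."
freq = letter_freq(raw_text)

# Show the three most frequent letters with their counts
for letter, count in freq.most_common(3):
    print(letter, count)
```

With NLTK installed, the same filtered generator could be passed to nltk.FreqDist before calling plot().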

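3. Instant messaging chat corpus (nps_chat)

The corpus is organized into 15 files, each containing several hundred posts collected on a given date from an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom). For example, 10-19-20s_706posts.xml contains 706 posts gathered from the 20s chatroom on October 19, 2006.

Example 1: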

>>> from nltk.corpus import nps_chat
>>> chatroom=nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
[u'i', u'do', u"n't", u'want', u'hot', u'pics', u'of', u'a', u'female', u',', u'I', u'can', u'look', u'in', u'a', u'mirror', u'.']

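Note: methods used on the previous two corpora, such as words(), also work on this corpus, for example nps_chat.words('10-19-20s_706posts.xml').

4. The Brown corpus (brown)

The Brown Corpus was the first million-word electronic corpus of English. It contains text from 500 different sources, categorized by genre, such as news, editorial, and so on.

It includes the following categories:

Example 1: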

>>> from nltk.corpus import brown
>>> brown.categories()
[u'adventure', u'belles_lettres', u'editorial', u'fiction', u'government', u'hobbies', u'humor', u'learned', u'lore', u'mystery', u'news', u'religion', u'reviews', u'romance', u'science_fiction']

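Note that categories() is only available on categorized corpora such as brown; none of the three corpora above (gutenberg, webtext, nps_chat) can call it.

Example 2: some functions common to corpus readers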

>>> brown.words(categories='news')
[u'The', u'Fulton', u'County', u'Grand', u'Jury', ...]
>>> brown.words(fileids='cg22')
[u'Does', u'our', u'society', u'have', u'a', ...]
>>> brown.sents(categories=['news','editorial','reviews'])
[[u'The', u'Fulton', u'County', u'Grand', u'Jury', u'said', u'Friday', u'an', u'investigation', u'of', u"Atlanta's", u'recent', u'primary', u'election', u'produced', u'``', u'no', u'evidence', u"''", u'that', u'any', u'irregularities', u'took', u'place', u'.'], [u'The', u'jury', u'further', u'said', u'in', u'term-end', u'presentments', u'that', u'the', u'City', u'Executive', u'Committee', u',', u'which', u'had', u'over-all', u'charge', u'of', u'the', u'election', u',', u'``', u'deserves', u'the', u'praise', u'and', u'thanks', u'of', u'the', u'City', u'of', u'Atlanta', u"''", u'for', u'the', u'manner', u'in', u'which', u'the', u'election', u'was', u'conducted', u'.'], ...]

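Note that the assignment inside the parentheses uses a single equals sign ("=" rather than "=="). We are passing a specific value for categories as a keyword argument, not testing whether categories equals some value.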
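The difference between the two operators can be seen in a small Python 3 sketch. The dictionary TOY_CORPUS and the function words() here are hypothetical stand-ins written for this note, not NLTK's implementation; they only imitate the way a corpus reader accepts either one category name or a list of names.

```python
# Hypothetical two-genre corpus: category name -> tokens
TOY_CORPUS = {
    'news': ['The', 'jury', 'said'],
    'humor': ['It', 'was', 'funny'],
}

def words(categories=None):
    """Return the tokens of the given category or categories (all if None)."""
    if categories is None:
        categories = sorted(TOY_CORPUS)
    elif isinstance(categories, str):
        categories = [categories]
    return [w for c in categories for w in TOY_CORPUS[c]]

# "=" inside a call binds the parameter name to a value (keyword argument)
print(words(categories='news'))

# "==" compares two values and evaluates to True or False
category = 'news'
print(category == 'news')
```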

Example 3:

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.

>>> news_text=brown.words(categories='news')
>>> fdist=nltk.FreqDist([w.lower() for w in news_text])
>>> modals=['can','could','may','might','must','will']
>>> for m in modals:
    print m+':',fdist[m]

can: 94
could: 87
may: 93
might: 38
must: 53
will: 389

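Exercise from p. 58:

Your turn: choose a different section of the Brown Corpus and adapt the preceding example to count a selection of wh words, such as what, when, where, who, and why.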

>>> brown.categories()
[u'adventure', u'belles_lettres', u'editorial', u'fiction', u'government', u'hobbies', u'humor', u'learned', u'lore', u'mystery', u'news', u'religion', u'reviews', u'romance', u'science_fiction']
>>> ad_text=brown.words(categories='adventure')
>>> fdist=nltk.FreqDist([w for w in ad_text if 'wh' in w])
>>> fdist
FreqDist({u'when': 126, u'what': 110, u'which': 100, u'who': 91, u'where': 53, u'while': 45, u'white': 26, u'why': 13, u'whispered': 11, u'whole': 9, ...})
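One thing to watch in this solution: the test 'wh' in w is a case-sensitive substring match, so it also counts words such as white, whispered and whole, and it skips capitalized forms such as What. A stricter variant is sketched below in Python 3, using collections.Counter in place of FreqDist and a short hand-written token list in place of the Brown Corpus; the names WH_WORDS and count_wh_words are hypothetical.

```python
from collections import Counter

# The five wh-words named in the exercise
WH_WORDS = ('what', 'when', 'where', 'who', 'why')

def count_wh_words(tokens, wh_words=WH_WORDS):
    """Count case-folded tokens that are exactly one of the wh-words."""
    return Counter(w.lower() for w in tokens if w.lower() in wh_words)

tokens = ['What', 'did', 'he', 'say', 'when', 'the', 'white', 'horse',
          'stopped', 'somewhere', '?', 'Why', ',', 'what', 'indeed', '.']

# The substring test picks up 'white' and 'somewhere' and misses 'What' and 'Why'
substring_hits = [w for w in tokens if 'wh' in w]
print(substring_hits)

# The exact, case-folded test keeps only the intended words
print(count_wh_words(tokens))
```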

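5. The Reuters corpus (reuters)

The Reuters Corpus contains 10,788 news documents. The documents are classified into 90 topics and grouped into two sets, "training" and "test".

Example 1: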

>>> from nltk.corpus import reuters

>>> reuters.categories('training/9865')
[u'barley', u'corn', u'grain', u'wheat']
>>> reuters.categories(['training/9865','training/9866'])
[u'barley', u'corn', u'gold', u'grain', u'wheat']

>>> reuters.fileids('barley')
[u'test/15618', u'test/15649', u'test/15676', u'test/15728'...]
>>> reuters.fileids(['barley','corn'])

[u'test/14832', u'test/14858', u'test/15033', u'test/15043', u'test/15106'...]

reuters.categories() returns the categories of the given files (as a union).

reuters.fileids() returns the files that belong to the given categories (as a union).
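The "union" behaviour in both directions can be imitated with ordinary sets. The following Python 3 sketch uses a made-up mapping FILE_CATEGORIES with hypothetical document ids (doc1 to doc3); the two functions only mirror the shape of the Reuters calls above and are not NLTK's code.

```python
# Hypothetical labelling: file id -> set of topic labels (a file may have several)
FILE_CATEGORIES = {
    'doc1': {'barley', 'corn'},
    'doc2': {'gold'},
    'doc3': {'corn', 'wheat'},
}

def categories(fileids):
    """Sorted union of the categories of the given file or files."""
    if isinstance(fileids, str):
        fileids = [fileids]
    return sorted(set().union(*(FILE_CATEGORIES[f] for f in fileids)))

def fileids(cats):
    """Sorted list of files labelled with at least one of the given categories."""
    if isinstance(cats, str):
        cats = [cats]
    wanted = set(cats)
    return sorted(f for f, labels in FILE_CATEGORIES.items() if labels & wanted)

print(categories(['doc1', 'doc2']))  # ['barley', 'corn', 'gold']
print(fileids('corn'))               # ['doc1', 'doc3']
```

Because topics overlap, a file can appear under several categories, which is why both queries return a union rather than a partition.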

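Example 2:

Note: this code needs import nltk first; otherwise the second and third snippets will not run. When two or more texts are involved, NLTK's list-concatenation facility is required.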

>>> reuters.words('training/9865')[:14]
[u'FRENCH', u'FREE', u'MARKET', u'CEREAL', u'EXPORT', u'BIDS', u'DETAILED', u'French', u'operators', u'have', u'requested', u'licences', u'to', u'export']

>>> reuters.words(['training/9865','training/9880'])
[u'FRENCH', u'FREE', u'MARKET', u'CEREAL', u'EXPORT', ...]
>>> reuters.words(categories='barley')
[u'FRENCH', u'FREE', u'MARKET', u'CEREAL', u'EXPORT', ...]
>>> reuters.words(categories=['barley','corn'])
[u'THAI', u'TRADE', u'DEFICIT', u'WIDENS', u'IN', ...]

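6. The inaugural address corpus (inaugural)

The book describes this corpus as a collection of 55 texts, one for each presidential address (the version listed below contains 56, running through 2009). An interesting property of the collection is its time dimension.

Example 1: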

>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
[u'1789-Washington.txt', u'1793-Washington.txt', u'1797-Adams.txt', u'1801-Jefferson.txt', u'1805-Jefferson.txt', u'1809-Madison.txt', u'1813-Madison.txt', u'1817-Monroe.txt', u'1821-Monroe.txt', u'1825-Adams.txt', u'1829-Jackson.txt', u'1833-Jackson.txt', u'1837-VanBuren.txt', u'1841-Harrison.txt', u'1845-Polk.txt', u'1849-Taylor.txt', u'1853-Pierce.txt', u'1857-Buchanan.txt', u'1861-Lincoln.txt', u'1865-Lincoln.txt', u'1869-Grant.txt', u'1873-Grant.txt', u'1877-Hayes.txt', u'1881-Garfield.txt', u'1885-Cleveland.txt', u'1889-Harrison.txt', u'1893-Cleveland.txt', u'1897-McKinley.txt', u'1901-McKinley.txt', u'1905-Roosevelt.txt', u'1909-Taft.txt', u'1913-Wilson.txt', u'1917-Wilson.txt', u'1921-Harding.txt', u'1925-Coolidge.txt', u'1929-Hoover.txt', u'1933-Roosevelt.txt', u'1937-Roosevelt.txt', u'1941-Roosevelt.txt', u'1945-Roosevelt.txt', u'1949-Truman.txt', u'1953-Eisenhower.txt', u'1957-Eisenhower.txt', u'1961-Kennedy.txt', u'1965-Johnson.txt', u'1969-Nixon.txt', u'1973-Nixon.txt', u'1977-Carter.txt', u'1981-Reagan.txt', u'1985-Reagan.txt', u'1989-Bush.txt', u'1993-Clinton.txt', u'1997-Clinton.txt', u'2001-Bush.txt', u'2005-Bush.txt', u'2009-Obama.txt']
>>> [fileid[:4] for fileid in inaugural.fileids]

Traceback (most recent call last):
File "<pyshell#21>", line 1, in <module>
[fileid[:4] for fileid in inaugural.fileids]
TypeError: 'instancemethod' object is not iterable
>>> [fileid[:4] for fileid in inaugural.fileids()]
[u'1789', u'1793', u'1797', u'1801', u'1805', u'1809', u'1813', u'1817', u'1821', u'1825', u'1829', u'1833', u'1837', u'1841', u'1845', u'1849', u'1853', u'1857', u'1861', u'1865', u'1869', u'1873', u'1877', u'1881', u'1885', u'1889', u'1893', u'1897', u'1901', u'1905', u'1909', u'1913', u'1917', u'1921', u'1925', u'1929', u'1933', u'1937', u'1941', u'1945', u'1949', u'1953', u'1957', u'1961', u'1965', u'1969', u'1973', u'1977', u'1981', u'1985', u'1989', u'1993', u'1997', u'2001', u'2005', u'2009']

Example 2:
Plot how words beginning with america and citizen are distributed over the years

>>> cfd=nltk.ConditionalFreqDist(
    (target,fileid[:4])
    for target in ['america','citizen']
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    if w.lower().startswith(target))

This code is equivalent to the following:

cfd=nltk.ConditionalFreqDist()
for fileid in inaugural.fileids():
    for w in inaugural.words(fileid):
        for target in ['america','citizen']:
            if w.lower().startswith(target):
                cfd[target][fileid[:4]]+=1

cfd.plot()

I find this second version easier to follow.
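The equivalence of the two forms can be checked without NLTK. The Python 3 sketch below models a conditional frequency distribution as a defaultdict of Counter objects, which is an assumption made for illustration; TEXTS is a hypothetical two-file miniature corpus and cfd_from_pairs is a made-up helper name.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus: file id -> tokens
TEXTS = {
    '1789-A.txt': ['Fellow', 'Citizens', 'of', 'America'],
    '1793-B.txt': ['American', 'citizens', 'and', 'citizenship'],
}
TARGETS = ['america', 'citizen']

def cfd_from_pairs(pairs):
    """Build {condition: Counter(sample)} from (condition, sample) pairs."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

# Form 1: a generator of (condition, sample) pairs, as in the first listing
pairs = ((target, fileid[:4])
         for target in TARGETS
         for fileid in TEXTS
         for w in TEXTS[fileid]
         if w.lower().startswith(target))
cfd_a = cfd_from_pairs(pairs)

# Form 2: explicit nested loops that increment the counts directly
cfd_b = defaultdict(Counter)
for fileid in TEXTS:
    for w in TEXTS[fileid]:
        for target in TARGETS:
            if w.lower().startswith(target):
                cfd_b[target][fileid[:4]] += 1

print(cfd_a == cfd_b)  # True
```

The loops are nested in a different order in the two forms, but since each matching (target, year) pair is counted exactly once either way, the resulting counts are the same.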

[Figure: plot produced by cfd.plot()]

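II. Structure of Text Corpora

1. The six corpora introduced above are gutenberg, webtext, brown, nps_chat, inaugural, and reuters. They fall into the following types:

2. Basic corpus functions defined in NLTK: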

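Example                      Description
fileids()                    the files of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories()                 the categories of the corpus
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified files
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified files
sents(categories=[c1,c2])    the sentences of the specified categories
abspath(fileid)              the location of the given file on disk
encoding(fileid)             the encoding of the file (if known)
open(fileid)                 open a stream for reading the given corpus file
root()                       the path to the root of the locally installed corpus
readme()                     the contents of the README file of the corpus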

3. Loading your own corpus

Example 1:

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root=r'E:\codes\matlab'
>>> wordlists=PlaintextCorpusReader(corpus_root,'.*')
>>> wordlists.fileids()
['BA.m', 'degree.m']
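Windows paths like the corpus_root above contain backslashes, which Python treats as the start of an escape sequence in ordinary string literals. Whether such a path survives intact depends on which letters happen to follow each backslash, so a raw string (or doubled backslashes) is the safer habit. The Python 3 sketch below shows the difference with a hypothetical path, E:\temp\new, chosen because \t and \n are real escape sequences.

```python
# Ordinary literal: \t becomes a tab and \n becomes a newline
plain = 'E:\temp\new'

# Raw literal: every backslash is kept as written
raw = r'E:\temp\new'

# Doubling the backslashes produces the same string as the raw literal
escaped = 'E:\\temp\\new'

print(repr(plain))     # shows the tab and newline that crept in
print(raw == escaped)  # True
print(len(plain), len(raw))
```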

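Exercise from p. 68 of the book:

Your turn: working with the news and romance genres of the Brown Corpus, find out which days of the week are most newsworthy and which are most romantic. Define a variable days containing a list of the days of the week, i.e. ['Monday', ...]. Then tabulate the counts for these words using cfd.tabulate(samples=days). Next, try the same thing with a plot in place of the table. You can control the output order of the days with the help of an extra parameter: conditions=['Monday', ...].

>>> days=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

>>> cfd=nltk.ConditionalFreqDist(
    (genre,day)
    for genre in ['news','romance']
    for day in days
    for w in brown.words(categories=genre)
    if w==day)

>>> cfd.plot(samples=days)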
>>> cfd.tabulate(samples=days)
           Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
   news        54        43        22        20        41        33        51 
romance         2         3         3         1         3         4         5
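The solution above walks through each genre's word list once per day name. A single pass per genre with a set lookup gives the same counts. The Python 3 sketch below illustrates this with collections.Counter and prints a table in a layout similar to tabulate(); the names count_days and tabulate and the toy token lists are hypothetical, standing in for brown.words(categories=genre).

```python
from collections import Counter

DAYS = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday']

def count_days(tokens_by_genre, days=DAYS):
    """Map each genre to a Counter of the day names found in its tokens."""
    wanted = set(days)  # set membership keeps this to one pass per genre
    return {genre: Counter(w for w in tokens if w in wanted)
            for genre, tokens in tokens_by_genre.items()}

def tabulate(counts, days=DAYS):
    """Format the counts as a right-aligned text table, one row per genre."""
    width = max(len(d) for d in days) + 1
    label = max(len(g) for g in counts)
    rows = [' ' * label + ''.join(d.rjust(width) for d in days)]
    for genre, freq in counts.items():
        cells = ''.join(str(freq[d]).rjust(width) for d in days)
        rows.append(genre.rjust(label) + cells)
    return '\n'.join(rows)

# Hypothetical token lists standing in for the two Brown genres
toy = {
    'news': ['Monday', 'the', 'jury', 'met', 'Friday', 'and', 'Monday'],
    'romance': ['She', 'smiled', 'on', 'Sunday'],
}
counts = count_days(toy)
print(tabulate(counts))
```

Passing the days as an explicit list fixes the column order, which plays the same role as the samples=days argument in the session above.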
