
Python Natural Language Processing (2): Accessing Text Corpora and Lexical Resources

2017-02-16 21:38
I. Accessing Text Corpora

A text corpus is a large body of text. It usually contains many individual texts, but for convenience we treat them, joined end to end, as one single text.

1. The Gutenberg Corpus

nltk includes a small selection of texts from the Project Gutenberg electronic text archive. To use this corpus, load the nltk package in the Python interpreter and then try nltk.corpus.gutenberg.fileids(). For example:

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>>


The output shows which texts from this corpus ship with nltk. We can then work with any of them.

1) Counting words. For example:

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427
>>>
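The same word list also supports a quick lexical-richness check. A minimal sketch, not in the original post (the NLTK book reports that each distinct word of Emma is used about 26 times on average):

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> vocab = set(w.lower() for w in emma)   # case-folded vocabulary
>>> len(emma) / len(vocab)                 # average uses per distinct word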


2) Concordancing a text. For example:

>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprise")
Displaying 1 of 1 matches:
that Emma could not but feel some surprise , and a little displeasure , on he
>>>


3) Getting the characters, words, and sentences of a text. For example:

>>> from nltk.corpus import gutenberg
>>> for fileid in gutenberg.fileids():
...     raw = gutenberg.raw(fileid)
...     num_chars = len(raw)
...     words = gutenberg.words(fileid)
...     num_words = len(words)
...     sents = gutenberg.sents(fileid)
...     num_sents = len(sents)
...     vocab = set([w.lower() for w in gutenberg.words(fileid)])
...     num_vocab = len(vocab)
...     print("%d %d %d %s" % (num_chars, num_words, num_sents, fileid))
...
887071 192427 7752 austen-emma.txt
466292 98171 3747 austen-persuasion.txt
673022 141576 4999 austen-sense.txt
4332554 1010654 30103 bible-kjv.txt
38153 8354 438 blake-poems.txt
249439 55563 2863 bryant-stories.txt
84663 18963 1054 burgess-busterbrown.txt
144395 34110 1703 carroll-alice.txt
457450 96996 4779 chesterton-ball.txt
406629 86063 3806 chesterton-brown.txt
320525 69213 3742 chesterton-thursday.txt
935158 210663 10230 edgeworth-parents.txt
1242990 260819 10059 melville-moby_dick.txt
468220 96825 1851 milton-paradise.txt
112310 25833 2163 shakespeare-caesar.txt
162881 37360 3106 shakespeare-hamlet.txt
100351 23140 1907 shakespeare-macbeth.txt
711215 154883 4250 whitman-leaves.txt

>>> raw[:1000]
"[Leaves of Grass by Walt Whitman 1855]\n\n\nCome, said my soul,\nSuch verses for my Body let us write, (for we are one,)\nThat should I after return,\nOr, long, long hence, in other spheres,\nThere to some group of mates the chants resuming,\n(Tallying Earth's soil, trees, winds, tumultuous waves,)\nEver with pleas'd smile I may keep on,\nEver and ever yet the verses owning--as, first, I here and now\nSigning for Soul and Body, set to them my name,\n\nWalt Whitman\n\n\n\n[BOOK I. INSCRIPTIONS]\n\n}One's-Self I Sing\n\nOne's-self I sing, a simple separate person,\nYet utter the word Democratic, the word En-Masse.\n\nOf physiology from top to toe I sing,\nNot physiognomy alone nor brain alone is worthy for the Muse, I say\nthe Form complete is worthier far,\nThe Female equally with the Male I sing.\n\nOf Life immense in passion, pulse, and power,\nCheerful, for freest action form'd under the laws divine,\nThe Modern Man I sing.\n\n\n\n}As I Ponder'd in Silence\n\nAs I ponder'd in silence,\nReturning upon my poems, c"
>>> words
['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', ...]
>>> sents
[['[', 'Leaves', 'of', 'Grass', 'by', 'Walt', 'Whitman', '1855', ']'], ['Come', ',', 'said', 'my', 'soul', ',', 'Such', 'verses', 'for', 'my', 'Body', 'let', 'us', 'write', ',', '(', 'for', 'we', 'are', 'one', ',)', 'That', 'should', 'I', 'after', 'return', ',', 'Or', ',', 'long', ',', 'long', 'hence', ',', 'in', 'other', 'spheres', ',', 'There', 'to', 'some', 'group', 'of', 'mates', 'the', 'chants', 'resuming', ',', '(', 'Tallying', 'Earth', "'", 's', 'soil', ',', 'trees', ',', 'winds', ',', 'tumultuous', 'waves', ',)', 'Ever', 'with', 'pleas', "'", 'd', 'smile', 'I', 'may', 'keep', 'on', ',', 'Ever', 'and', 'ever', 'yet', 'the', 'verses', 'owning', '--', 'as', ',', 'first', ',', 'I', 'here', 'and', 'now', 'Signing', 'for', 'Soul', 'and', 'Body', ',', 'set', 'to', 'them', 'my', 'name', ','], ...]
>>>


Here raw() gives the original character content of the file, with no tokenization applied; words() gives the tokens; and sents() gives the sentences, each of which is itself stored as a list of tokens. Besides words(), raw(), and sents(), most nltk corpus readers provide a variety of other access methods.
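Since sents() returns an ordinary list of token lists, standard list operations apply directly. A small sketch in the spirit of the NLTK book, locating the longest sentence in Macbeth (output omitted):

>>> from nltk.corpus import gutenberg
>>> macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
>>> macbeth_sentences[1037]   # index into the sentence list like any Python list
>>> longest_len = max(len(s) for s in macbeth_sentences)
>>> [s for s in macbeth_sentences if len(s) == longest_len]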

2. Web and Chat Text

Project Gutenberg contains thousands of books; they are fairly formal and represent established literature. Beyond that, nltk includes a small collection of web text, with content from a Firefox discussion forum, conversations overheard in New York, the movie script of Pirates of the Caribbean, personal advertisements, and wine reviews. This part of the collection can be accessed as follows:



>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
...     print("%s %s ..." % (fileid, webtext.raw(fileid)[:65]))
...
firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se ...
grail.txt SCENE 1: [wind] [clop clop clop]
KING ARTHUR: Whoa there!  [clop ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawb ...
>>>


3. Instant Messaging Chat Sessions Corpus

This corpus was originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators. It contains over 10,000 posts, divided into 15 files, each holding several hundred posts gathered from an age-specific chat room on a particular date. The file names record the date, the chat room, and the number of posts. For example:
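A minimal example; the file name below follows the naming convention just described (706 posts collected from the 20s chat room on 10/19/2006), and the sample output is the one shown in the NLTK book:

>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']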

4. The Brown Corpus

The Brown Corpus was the first million-word electronic corpus of English. It contains text from 500 sources, categorized by genre, such as news, editorial, and so on. It is mainly used to study systematic differences between genres (a kind of linguistic inquiry known as stylistics). We can access the corpus as a list of words or as a list of sentences.

1) Reading by particular categories or files

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
>>>


2) Comparing the use of modal verbs across genres

>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print("%s: %d" % (m, fdist[m]))
...
can: 94
could: 87
may: 93
might: 38
must: 53
will: 389
>>>
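To run the same comparison over several genres at once, a conditional frequency distribution can do all the counting in one pass. A sketch following the corresponding NLTK book example (the resulting table of counts is omitted):

>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> cfd.tabulate(conditions=genres, samples=modals)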


5. The Reuters Corpus

The Reuters Corpus contains 10,788 news documents totalling 1.3 million words. The documents are classified into 90 topics and split into two sets, training and test; the split is there so that algorithms which automatically detect a document's topic can be trained and then evaluated. Unlike the Brown Corpus, the categories of the Reuters Corpus overlap, since a news story often covers more than one topic. We can look up the topics covered by one or more documents, or the documents included in one or more categories, as sketched below.
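A sketch of these lookups, following the NLTK book (the file IDs and category names come from the corpus itself; long lists are abbreviated):

>>> from nltk.corpus import reuters
>>> reuters.fileids()
['test/14826', 'test/14828', 'test/14829', 'test/14832', ...]
>>> reuters.categories('training/9865')   # the topics covered by one document
['barley', 'corn', 'grain', 'wheat']
>>> reuters.fileids('barley')             # the documents filed under one topic
['test/15618', 'test/15649', 'test/15676', ...]
>>> reuters.words('training/9865')[:14]
['FRENCH', 'FREE', 'MARKET', 'CEREAL', 'EXPORT', 'BIDS', 'DETAILED', 'French',
'operators', 'have', 'requested', 'licences', 'to', 'export']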

The udhr corpus, which contains the Universal Declaration of Human Rights in over 300 languages, can be explored in the same spirit; the following example plots cumulative word-length distributions for several languages:

>>> from nltk.corpus import udhr
>>> languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut',
... 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
...     (lang, len(word))
...     for lang in languages
...     for word in udhr.words(lang + '-Latin1'))
>>> cfd.plot(cumulative=True)
>>>





9. Basic corpus functionality defined in nltk

Example                        Description
fileids()                      the files of the corpus
fileids([categories])          the files of the corpus corresponding to these categories
categories()                   the categories of the corpus
categories([fileids])          the categories of the corpus corresponding to these files
raw()                          the raw content of the corpus
raw(fileids=[f1,f2,f3])        the raw content of the specified files
raw(categories=[c1,c2])        the raw content of the specified categories
words()                        the words of the whole corpus
words(fileids=[f1,f2,f3])      the words of the specified files
words(categories=[c1,c2])      the words of the specified categories
sents()                        the sentences of the whole corpus
sents(fileids=[f1,f2,f3])      the sentences of the specified files
sents(categories=[c1,c2])      the sentences of the specified categories
abspath(fileid)                the location of the given file on disk
encoding(fileid)               the encoding of the file (if known)
open(fileid)                   open a stream for reading the given corpus file
root()                         the path to the root of the locally installed corpus
readme()                       the contents of the README file of the corpus
10. Loading your own corpus

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = r"E:\corpora"   # local directory holding your own texts (the stock nltk data here lives under D:\)
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()   # 'aaaaaaaaaaa.txt' and 'luo.txt' are the user's own files
['README', 'aaaaaaaaaaa.txt', 'austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'luo.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>>


Once your own corpus has been loaded successfully, the full range of corpus functions above can be applied to it.
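The same pattern extends to other reader classes when the texts are not plain. A sketch using BracketParseCorpusReader for a local copy of the parsed Penn Treebank, as in the NLTK book (the paths below are illustrative, not from the original post):

>>> from nltk.corpus import BracketParseCorpusReader
>>> corpus_root = r"C:\corpora\penntreebank\parsed\mrg\wsj"   # hypothetical local path
>>> file_pattern = r".*/wsj_.*\.mrg"                          # matches files like 00/wsj_0001.mrg
>>> ptb = BracketParseCorpusReader(corpus_root, file_pattern)
>>> ptb.fileids()[:4]
>>> len(ptb.sents())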