WordNet Interface (some useful WordNet functions, annotated for easy lookup)
2014-04-17 15:20
WordNet Interface
WordNet is just another NLTK corpus reader, and can be imported like this:
>>> from nltk.corpus import wordnet
For more compact code, we recommend:
>>> from nltk.corpus import wordnet as wn
Words
Look up a word using synsets(), which returns the word's synonym sets; the optional pos argument lets you constrain the part of speech of the word:
>>> wn.synsets('dog') # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'),
Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]
>>> wn.synsets('dog', pos=wn.VERB)
[Synset('chase.v.01')]
The other parts of speech are NOUN, ADJ and ADV. A synset is identified with a 3-part name of the form: word.pos.nn:
(The examples here use the attribute-style access of NLTK 2.x. In NLTK 3.x, definition, examples, lemmas, name and synset are methods and must be called, e.g. definition().)
>>> wn.synset('dog.n.01')
Synset('dog.n.01')
>>> print(wn.synset('dog.n.01').definition)  # definition (gloss)
a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
>>> len(wn.synset('dog.n.01').examples)
1
>>> print(wn.synset('dog.n.01').examples[0])  # example sentence
the dog barked all night
>>> wn.synset('dog.n.01').lemmas  # lemmas
[Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]
>>> [str(lemma.name) for lemma in wn.synset('dog.n.01').lemmas]  # list all lemma names
['dog', 'domestic_dog', 'Canis_familiaris']
>>> wn.lemma('dog.n.01.dog').synset
Synset('dog.n.01')
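The 3-part name format described above can be taken apart with plain string handling. The following is only a minimal sketch: parse_synset_name is a hypothetical helper written for this note, not part of NLTK.

```python
def parse_synset_name(name):
    """Split a 'word.pos.nn' synset name into (word, pos, sense_number).

    Hypothetical helper for illustration; it is not an NLTK function.
    rsplit is used so that only the last two dots act as separators.
    """
    word, pos, number = name.rsplit(".", 2)
    return word, pos, int(number)


print(parse_synset_name("dog.n.01"))    # ('dog', 'n', 1)
print(parse_synset_name("chase.v.01"))  # ('chase', 'v', 1)
```

Splitting from the right keeps the word part intact even if it contains underscores or other punctuation.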
Synsets
Synset: a set of synonyms that share a common meaning.
>>> dog = wn.synset('dog.n.01')
>>> dog.hypernyms()  # hypernyms (more general terms)
[Synset('domestic_animal.n.01'), Synset('canine.n.02')]
>>> # hyponyms (more specific terms)
>>> dog.hyponyms() # doctest: +ELLIPSIS
[Synset('puppy.n.01'), Synset('great_pyrenees.n.01'), Synset('basenji.n.01'), ...]
>>> dog.member_holonyms()
[Synset('pack.n.06'), Synset('canis.n.01')]
>>> dog.root_hypernyms()
[Synset('entity.n.01')]
>>> wn.synset('dog.n.01').lowest_common_hypernyms(wn.synset('cat.n.01'))  # hypernym shared by both
[Synset('carnivore.n.01')]
Each synset contains one or more lemmas, which represent a specific sense of a specific word.
Note that some relations are defined by WordNet only over Lemmas:
>>> good = wn.synset('good.a.01')
>>> good.antonyms()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Synset' object has no attribute 'antonyms'
>>> good.lemmas[0].antonyms()
[Lemma('bad.a.01.bad')]
The relations that are currently defined in this way are antonyms, derivationally_related_forms and pertainyms.
Lemmas
(Annotation: I don't fully understand what this part is for; pointers from anyone who knows would be welcome!)
>>> eat = wn.lemma('eat.v.03.eat')
>>> eat
Lemma('feed.v.06.eat')
>>> print(eat.key)
eat%2:34:02::
>>> eat.count()
4
>>> wn.lemma_from_key(eat.key)
Lemma('feed.v.06.eat')
>>> wn.lemma_from_key(eat.key).synset
Synset('feed.v.06')
>>> wn.lemma_from_key('feebleminded%5:00:00:retarded:00')
Lemma('backward.s.03.feebleminded')
>>> for lemma in wn.synset('eat.v.03').lemmas:
...     print(lemma, lemma.count())
...
Lemma('feed.v.06.feed') 3
Lemma('feed.v.06.eat') 4
>>> for lemma in wn.lemmas('eat', 'v'):
...     print(lemma, lemma.count())
...
Lemma('eat.v.01.eat') 61
Lemma('eat.v.02.eat') 13
Lemma('feed.v.06.eat') 4
Lemma('eat.v.04.eat') 0
Lemma('consume.v.05.eat') 0
Lemma('corrode.v.01.eat') 0
Lemmas can also have relations between them:
>>> vocal = wn.lemma('vocal.a.01.vocal')
>>> vocal.derivationally_related_forms()
[Lemma('vocalize.v.02.vocalize')]
>>> vocal.pertainyms()
[Lemma('voice.n.02.voice')]
>>> vocal.antonyms()
[Lemma('instrumental.a.01.instrumental')]
The three relations above exist only on lemmas, not on synsets.
Verb Frames
>>> wn.synset('think.v.01').frame_ids
[5, 9]
>>> for lemma in wn.synset('think.v.01').lemmas:
...     print(lemma, lemma.frame_ids)
...     print(" | ".join(lemma.frame_strings))
...
Lemma('think.v.01.think') [5, 9]
Something think something Adjective/Noun | Somebody think somebody
Lemma('think.v.01.believe') [5, 9]
Something believe something Adjective/Noun | Somebody believe somebody
Lemma('think.v.01.consider') [5, 9]
Something consider something Adjective/Noun | Somebody consider somebody
Lemma('think.v.01.conceive') [5, 9]
Something conceive something Adjective/Noun | Somebody conceive somebody
>>> wn.synset('stretch.v.02').frame_ids
[8]
>>> for lemma in wn.synset('stretch.v.02').lemmas:
...     print(lemma, lemma.frame_ids)
...     print(" | ".join(lemma.frame_strings))
...
Lemma('stretch.v.02.stretch') [8, 2]
Somebody stretch something | Somebody stretch
Lemma('stretch.v.02.extend') [8]
Somebody extend something
Similarity
>>> dog = wn.synset('dog.n.01')
>>> cat = wn.synset('cat.n.01')
>>> hit = wn.synset('hit.v.01')
>>> slap = wn.synset('slap.v.01')
synset1.path_similarity(synset2): Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1. By default, a fake root node is now added for verbs, so cases where previously no path could be found (and None was returned) now return a value. The old behavior can be restored by setting simulate_root to False. A score of 1 represents identity, i.e. comparing a sense with itself returns 1.
>>> dog.path_similarity(cat) # doctest: +ELLIPSIS
0.2...
>>> hit.path_similarity(slap) # doctest: +ELLIPSIS
0.142...
>>> wn.path_similarity(hit, slap) # doctest: +ELLIPSIS
0.142...
>>> print(hit.path_similarity(slap, simulate_root=False))
None
>>> print(wn.path_similarity(hit, slap, simulate_root=False))
None
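To make the shortest-path idea concrete, here is a rough sketch on a tiny hand-made taxonomy. Everything in it is an assumption for illustration: the TOY_HYPERNYMS links are invented and not read from WordNet, and the score is taken to be 1 / (shortest path length + 1), which is consistent with the identity score of 1 described above. It is not NLTK's implementation.

```python
from collections import deque

# Hypothetical, heavily simplified is-a links (child -> parents); not real WordNet data.
TOY_HYPERNYMS = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "animal": [],
}


def shortest_path_length(a, b, hypernyms=TOY_HYPERNYMS):
    """Breadth-first search over the is-a links, treated as undirected edges.

    This shortcut is adequate for a tree-shaped toy taxonomy like the one above.
    """
    neighbours = {node: set(parents) for node, parents in hypernyms.items()}
    for node, parents in hypernyms.items():
        for parent in parents:
            neighbours.setdefault(parent, set()).add(node)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path


def path_similarity(a, b):
    """Assumed scoring: 1 / (shortest path length + 1); None if unconnected."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1.0 / (dist + 1)


print(path_similarity("dog", "cat"))  # 0.2
print(path_similarity("dog", "dog"))  # 1.0
```

In this toy graph dog and cat are four links apart (dog, canine, carnivore, feline, cat), so the score is 1/5.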
synset1.lch_similarity(synset2): Leacock-Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p / (2d)), where p is the shortest path length and d is the taxonomy depth.
>>> dog.lch_similarity(cat) # doctest: +ELLIPSIS
2.028...
>>> hit.lch_similarity(slap) # doctest: +ELLIPSIS
1.312...
>>> wn.lch_similarity(hit, slap) # doctest: +ELLIPSIS
1.312...
>>> print(hit.lch_similarity(slap, simulate_root=False))
None
>>> print(wn.lch_similarity(hit, slap, simulate_root=False))
None
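The formula -log(p / (2d)) is easy to evaluate by hand. The sketch below is a standalone helper written for this note, not an NLTK function, and the inputs are assumptions: the text above does not say which p and d the library uses. With p = 5 and d = 19 the result happens to come out at about 2.028, and with p = 7 and d = 13 at about 1.312, the same figures as the scores shown above.

```python
import math


def lch_score(path_length, taxonomy_depth):
    """Leacock-Chodorow score: -log(p / (2 * d)), using the natural logarithm."""
    if path_length <= 0 or taxonomy_depth <= 0:
        raise ValueError("p and d must both be positive")
    return -math.log(path_length / (2.0 * taxonomy_depth))


# Assumed inputs for illustration only; not values read from WordNet.
print(round(lch_score(5, 19), 3))  # 2.028
print(round(lch_score(7, 13), 3))  # 1.312
```

A shorter path or a deeper taxonomy both increase the score, so, unlike path similarity, the value is not confined to the range 0 to 1.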
synset1.wup_similarity(synset2): Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific
ancestor node). Note that at this time the scores given do _not_ always agree with those given by Pedersen's Perl implementation of Wordnet Similarity.
The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for
the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.
>>> dog.wup_similarity(cat) # doctest: +ELLIPSIS
0.857...
>>> hit.wup_similarity(slap)
0.25
>>> wn.wup_similarity(hit, slap)
0.25
>>> print(hit.wup_similarity(slap, simulate_root=False))
None
>>> print(wn.wup_similarity(hit, slap, simulate_root=False))
None
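The description above does not spell out the Wu-Palmer formula. A common statement of it is 2 * depth(LCS) / (depth(s1) + depth(s2)), and the sketch below assumes that form, with depth counted in nodes so that the root has depth 1. The TOY_PARENT tree is invented for illustration and every node has a single parent, so the tie-breaking rules for multiple LCS candidates described above do not arise. This is not NLTK's implementation.

```python
# Hypothetical single-parent toy taxonomy (child -> parent); None marks the root.
TOY_PARENT = {
    "entity": None,
    "animal": "entity",
    "carnivore": "animal",
    "canine": "carnivore",
    "feline": "carnivore",
    "dog": "canine",
    "cat": "feline",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = TOY_PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1 (an assumption of this sketch)."""
    return len(path_to_root(node))


def least_common_subsumer(a, b):
    """Deepest ancestor shared by a and b; a node counts as its own ancestor."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_a:
            return node
    return None


def wup_score(a, b):
    """Assumed formula: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = least_common_subsumer(a, b)
    if lcs is None:
        return None
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


print(least_common_subsumer("dog", "cat"))  # carnivore
print(wup_score("dog", "cat"))              # 0.6
```

Here dog and cat both sit at depth 5 and their LCS, carnivore, at depth 3, giving 2 * 3 / (5 + 5) = 0.6.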
wordnet_ic Information Content: Load an information content file from the wordnet_ic corpus.
>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')
Or you can create an information content dictionary from a corpus (or anything that has a words() method).
>>> from nltk.corpus import genesis
>>> genesis_ic = wn.ic(genesis, False, 0.0)
synset1.res_similarity(synset2, ic): Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node). Note that
for any similarity measure that uses information content, the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created.
>>> dog.res_similarity(cat, brown_ic) # doctest: +ELLIPSIS
7.911...
>>> dog.res_similarity(cat, genesis_ic) # doctest: +ELLIPSIS
7.204...
synset1.jcn_similarity(synset2, ic): Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
>>> dog.jcn_similarity(cat, brown_ic) # doctest: +ELLIPSIS
0.449...
>>> dog.jcn_similarity(cat, genesis_ic) # doctest: +ELLIPSIS
0.285...
synset1.lin_similarity(synset2, ic): Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the
two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
>>> dog.lin_similarity(cat, semcor_ic) # doctest: +ELLIPSIS
0.886...
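The three information-content measures differ only in how they combine the same three numbers: IC(s1), IC(s2) and IC(lcs). The sketch below evaluates the equations quoted above on made-up IC values. The function names and the numbers are hypothetical, and returning infinity when the Jiang-Conrath denominator is zero is this sketch's own choice, not necessarily what the library does.

```python
def resnik_score(ic_lcs):
    """Resnik: the information content of the least common subsumer itself."""
    return ic_lcs


def jcn_score(ic_s1, ic_s2, ic_lcs):
    """Jiang-Conrath: 1 / (IC(s1) + IC(s2) - 2 * IC(lcs))."""
    distance = ic_s1 + ic_s2 - 2.0 * ic_lcs
    if distance == 0:
        return float("inf")  # assumption of this sketch for the zero-distance case
    return 1.0 / distance


def lin_score(ic_s1, ic_s2, ic_lcs):
    """Lin: 2 * IC(lcs) / (IC(s1) + IC(s2))."""
    return 2.0 * ic_lcs / (ic_s1 + ic_s2)


# Made-up information-content values, for illustration only.
ic_s1, ic_s2, ic_lcs = 8.0, 9.0, 7.0
print(resnik_score(ic_lcs))                       # 7.0
print(round(jcn_score(ic_s1, ic_s2, ic_lcs), 3))  # 0.333
print(round(lin_score(ic_s1, ic_s2, ic_lcs), 3))  # 0.824
```

Because all three depend on IC values, changing the corpus that produced those values changes every score, which matches the note above about corpus dependence.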
Access to all Synsets
Iterate over all the noun synsets:
>>> for synset in list(wn.all_synsets('n'))[:10]:
...     print(synset)
...
Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('abstraction.n.06')
Synset('thing.n.12')
Synset('object.n.01')
Synset('whole.n.02')
Synset('congener.n.03')
Synset('living_thing.n.01')
Synset('organism.n.01')
Synset('benthos.n.02')
Get all synsets for this word, possibly restricted by POS:
>>> wn.synsets('dog') # doctest: +ELLIPSIS
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), ...]
>>> wn.synsets('dog', pos='v')
[Synset('chase.v.01')]
Walk through the noun synsets looking at their hypernyms:
>>> from itertools import islice
>>> for synset in islice(wn.all_synsets('n'), 5):
...     print(synset, synset.hypernyms())
...
Synset('entity.n.01') []
Synset('physical_entity.n.01') [Synset('entity.n.01')]
Synset('abstraction.n.06') [Synset('entity.n.01')]
Synset('thing.n.12') [Synset('physical_entity.n.01')]
Synset('object.n.01') [Synset('physical_entity.n.01')]
Morphy
Look up forms not in WordNet, with the help of Morphy (annotation: this finds the base form of a word):
>>> wn.morphy('denied', wn.NOUN)
>>> print(wn.morphy('denied', wn.VERB))
deny
>>> wn.synsets('denied', wn.NOUN)
[]
>>> wn.synsets('denied', wn.VERB) # doctest: +NORMALIZE_WHITESPACE
[Synset('deny.v.01'), Synset('deny.v.02'), Synset('deny.v.03'), Synset('deny.v.04'),
Synset('deny.v.05'), Synset('traverse.v.03'), Synset('deny.v.07')]
Morphy uses a combination of inflectional ending rules and exception lists to handle a variety of different possibilities:
>>> print(wn.morphy('dogs'))
dog
>>> print(wn.morphy('churches'))
church
>>> print(wn.morphy('aardwolves'))
aardwolf
>>> print(wn.morphy('abaci'))
abacus
>>> print(wn.morphy('book', wn.NOUN))
book
>>> wn.morphy('hardrock', wn.ADV)
>>> wn.morphy('book', wn.ADJ)
>>> wn.morphy('his', wn.NOUN)
>>>
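The combination of ending rules and exception lists can be imitated in a few lines. This is a toy sketch only: the lexicon, the exception entry and the suffix rules below are a hypothetical miniature chosen to cover the noun examples above, and toy_morphy is not how NLTK's morphy is actually written.

```python
# Hypothetical miniature data; the real WordNet lexicon and exception lists are far larger.
TOY_LEXICON = {"dog", "church", "aardwolf", "abacus", "book"}
TOY_EXCEPTIONS = {"abaci": "abacus"}
# (suffix, replacement) pairs for nouns; an illustrative subset assumed for this sketch.
TOY_NOUN_RULES = [("ches", "ch"), ("ves", "f"), ("ies", "y"), ("s", "")]


def toy_morphy(word):
    """Return a base form found in the toy lexicon, or None if nothing matches."""
    if word in TOY_EXCEPTIONS:      # irregular forms are looked up directly
        return TOY_EXCEPTIONS[word]
    if word in TOY_LEXICON:         # already a base form
        return word
    for suffix, replacement in TOY_NOUN_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in TOY_LEXICON:  # only accept forms the lexicon knows
                return candidate
    return None


for form in ["dogs", "churches", "aardwolves", "abaci", "book", "his"]:
    print(form, "->", toy_morphy(form))
```

Checking each candidate against the lexicon is what stops a word like "his" from being stripped to a non-word: "hi" is not in the toy lexicon, so the result is None.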
Synset Closures
Compute transitive closures of synsets:
>>> dog = wn.synset('dog.n.01')
>>> hypo = lambda s: s.hyponyms()
>>> hyper = lambda s: s.hypernyms()
>>> list(dog.closure(hypo, depth=1)) == dog.hyponyms()
True
>>> list(dog.closure(hyper, depth=1)) == dog.hypernyms()
True
>>> list(dog.closure(hypo)) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
[Synset('puppy.n.01'), Synset('great_pyrenees.n.01'), Synset('basenji.n.01'),
Synset('newfoundland.n.01'), Synset('lapdog.n.01'), Synset('poodle.n.01'),
Synset('leonberg.n.01'), Synset('toy_dog.n.01'), Synset('spitz.n.01'), ...]
>>> list(dog.closure(hyper)) # doctest: +NORMALIZE_WHITESPACE
[Synset('domestic_animal.n.01'), Synset('canine.n.02'), Synset('animal.n.01'),
Synset('carnivore.n.01'), Synset('organism.n.01'), Synset('placental.n.01'),
Synset('living_thing.n.01'), Synset('mammal.n.01'), Synset('whole.n.02'),
Synset('vertebrate.n.01'), Synset('object.n.01'), Synset('chordate.n.01'),
Synset('physical_entity.n.01'), Synset('entity.n.01')]
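A transitive closure of this kind is a breadth-first walk that keeps applying the relation until nothing new turns up. The sketch below shows the idea on an invented toy graph. The toy_closure function and the TOY_HYPERNYMS data are hypothetical; the sketch mirrors the calling style of the example above but makes no claim about how the library's closure is implemented.

```python
from collections import deque

# Hypothetical toy hypernym links (node -> list of hypernyms); not real WordNet data.
TOY_HYPERNYMS = {
    "dog": ["domestic_animal", "canine"],
    "domestic_animal": ["animal"],
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "animal": ["entity"],
    "entity": [],
}


def toy_closure(start, rel, depth=-1):
    """Yield every node reachable from start via rel, breadth-first, each at most once.

    depth limits the number of relation steps; -1 means no limit.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, level = queue.popleft()
        if depth >= 0 and level >= depth:
            continue  # do not expand beyond the requested depth
        for nxt in rel(node):
            if nxt not in seen:
                seen.add(nxt)
                yield nxt
                queue.append((nxt, level + 1))


hyper = lambda s: TOY_HYPERNYMS[s]
print(list(toy_closure("dog", hyper, depth=1)))
# ['domestic_animal', 'canine']
print(list(toy_closure("dog", hyper)))
# ['domestic_animal', 'canine', 'animal', 'carnivore', 'entity']
```

With depth=1 the closure is just the direct relation, which is the same property the two True results above check for the real synsets.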
Regression Tests
(Annotation: known error cases can be looked up here.)
Bug 85: morphy returns the base form of a word, if its input is given as a base form for a POS for which that word is not defined:
>>> wn.synsets('book', wn.NOUN)
[Synset('book.n.01'), Synset('book.n.02'), Synset('record.n.05'), Synset('script.n.01'), Synset('ledger.n.01'), Synset('book.n.06'), Synset('book.n.07'), Synset('koran.n.01'), Synset('bible.n.01'), Synset('book.n.10'), Synset('book.n.11')]
>>> wn.synsets('book', wn.ADJ)
[]
>>> wn.morphy('book', wn.NOUN)
'book'
>>> wn.morphy('book', wn.ADJ)
Bug 160: wup_similarity breaks when the two synsets have no common hypernym
>>> t = wn.synsets('picasso')[0]
>>> m = wn.synsets('male')[1]
>>> t.wup_similarity(m) # doctest: +ELLIPSIS
0.631...
>>> t = wn.synsets('titan')[1]
>>> s = wn.synsets('say', wn.VERB)[0]
>>> print(t.wup_similarity(s))
None
Bug 21: "instance of" not included in LCS (very similar to bug 160)
>>> a = wn.synsets("writings")[0]
>>> b = wn.synsets("scripture")[0]
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> a.jcn_similarity(b, brown_ic) # doctest: +ELLIPSIS
0.175...
Bug 221: Verb root IC is zero
>>> from nltk.corpus.reader.wordnet import information_content
>>> s = wn.synsets('say', wn.VERB)[0]
>>> information_content(s, brown_ic) # doctest: +ELLIPSIS
4.623...
Bug 161: Comparison between WN keys/lemmas should not be case sensitive
>>> k = wn.synsets("jefferson")[0].lemmas[0].key
>>> wn.lemma_from_key(k)
Lemma('jefferson.n.01.Jefferson')
>>> wn.lemma_from_key(k.upper())
Lemma('jefferson.n.01.Jefferson')
Bug 99: WordNet root_hypernyms gives incorrect results
>>> from nltk.corpus import wordnet as wn
>>> for s in wn.all_synsets(wn.NOUN):
...     if s.root_hypernyms()[0] != wn.synset('entity.n.01'):
...         print(s, s.root_hypernyms())
...
>>>
Bug 382: JCN Division by zero error
>>> tow = wn.synset('tow.v.01')
>>> shlep = wn.synset('shlep.v.02')
>>> from nltk.corpus import wordnet_ic
>>> brown_ic = wordnet_ic.ic('ic-brown.dat')
>>> tow.jcn_similarity(shlep, brown_ic) # doctest: +ELLIPSIS
1...e+300
Bug 428: Depth is zero for instance nouns
>>> s = wn.synset("lincoln.n.01")
>>> s.max_depth() > 0
True
Bug 429: Information content smoothing used old reference to all_synsets
>>> genesis_ic = wn.ic(genesis, True, 1.0)
Bug 430: all_synsets used wrong pos lookup when synsets were cached
>>> for ii in wn.all_synsets(): pass
>>> for ii in wn.all_synsets(): pass
Bug 470: shortest_path_distance ignored instance hypernyms
>>> google = wordnet.synsets("google")[0]
>>> earth = wordnet.synsets("earth")[0]
>>> google.wup_similarity(earth) # doctest: +ELLIPSIS
0.1...
Bug 484: similarity metrics returned -1 instead of None for no LCS
>>> t = wn.synsets('fly', wn.VERB)[0]
>>> s = wn.synsets('say', wn.VERB)[0]
>>> print(s.shortest_path_distance(t))
None
>>> print(s.path_similarity(t, simulate_root=False))
None
>>> print(s.lch_similarity(t, simulate_root=False))
None
>>> print(s.wup_similarity(t, simulate_root=False))
None
Bug 427: "pants" does not return all the senses it should
>>> from nltk.corpus import wordnet
>>> wordnet.synsets("pants",'n')
[Synset('bloomers.n.01'), Synset('pant.n.01'), Synset('trouser.n.01'), Synset('gasp.n.01')]
Bug 482: Some nouns not being lemmatised by WordNetLemmatizer().lemmatize
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> WordNetLemmatizer().lemmatize("eggs", pos="n")
'egg'
>>> WordNetLemmatizer().lemmatize("legs", pos="n")
'leg'
Bug 284: instance hypernyms not used in similarity calculations
>>> wn.synset('john.n.02').lch_similarity(wn.synset('dog.n.01')) # doctest: +ELLIPSIS
1.335...
>>> wn.synset('john.n.02').wup_similarity(wn.synset('dog.n.01')) # doctest: +ELLIPSIS
0.571...
>>> wn.synset('john.n.02').res_similarity(wn.synset('dog.n.01'), brown_ic) # doctest: +ELLIPSIS
2.224...
>>> wn.synset('john.n.02').jcn_similarity(wn.synset('dog.n.01'), brown_ic) # doctest: +ELLIPSIS
0.075...
>>> wn.synset('john.n.02').lin_similarity(wn.synset('dog.n.01'), brown_ic) # doctest: +ELLIPSIS
0.252...
>>> wn.synset('john.n.02').hypernym_paths() # doctest: +ELLIPSIS
[[Synset('entity.n.01'), ..., Synset('john.n.02')]]
Issue 541: add domains to wordnet
>>> wn.synset('code.n.03').topic_domains()
[Synset('computer_science.n.01')]
>>> wn.synset('pukka.a.01').region_domains()
[Synset('india.n.01')]
>>> wn.synset('freaky.a.01').usage_domains()
[Synset('slang.n.02')]
Issue 629: wordnet failures when python run with -O optimizations
>>> # Run the test suite with python -O to check this
>>> wn.synsets("brunch")
[Synset('brunch.n.01'), Synset('brunch.v.01')]
Issue 395: wordnet returns incorrect result for lowest_common_hypernyms of chef and policeman
>>> wn.synset('policeman.n.01').lowest_common_hypernyms(wn.synset('chef.n.01'))
[Synset('person.n.01')]