
NLP05 - Gensim Source Code [Packages and Interfaces]

2017-10-28 21:20


Abstract: A rough look at the file structure and interfaces of the gensim package, to get an intuitive sense of what the gensim source code contains; this is the first step in getting to know the Gensim source. It covers the file structure, the core interfaces, the Corpora module, the Models module, the Similarity module, scripts, sklearn integration, summarization and keywords, unit tests, and topic coherence.

0. File Structure

Opening up the gensim package, the following directory structure appears:



The modules are split into corpora, models, and so on; in addition, interfaces.py holds the core interfaces, matutils.py the math utilities, and utils.py the common helper functions. nosy.py is unimportant; it only watches the .py files for changes.
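Roughly, the top-level layout (simplified to the parts discussed in the sections below) is:

gensim/
    corpora/           # streaming corpus I/O formats (section 2)
    models/            # document representation algorithms (section 3)
    similarities/      # similarity queries (section 4)
    parsing/           # raw-text preprocessing, Porter stemmer (section 5)
    scripts/           # conversion helper scripts (section 6)
    summarization/     # keywords and TextRank summarization (section 8)
    test/              # unit tests (section 9)
    topic_coherence/   # topic coherence measures (section 10)
    interfaces.py      # core interfaces
    matutils.py        # math utilities
    utils.py           # common helpers
    nosy.py            # watches the .py files for changes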

1. Gensim Core Interfaces [interfaces.py]



1.1 CorpusABC

Interface (abstract base class) for corpora. A corpus is simply an iterable, where each iteration step yields one document:

>>> for doc in corpus:
>>>     # do something with the doc...


A document is a sequence of (fieldId, fieldValue) 2-tuples:

>>> for attr_id, attr_value in doc:
>>>     # do something with the attribute
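As a quick sketch of this contract (not from the original post; the toy texts are made up), a streaming corpus can be built by subclassing CorpusABC and yielding one bag-of-words document per iteration step:

from gensim import corpora, interfaces

class MyCorpus(interfaces.CorpusABC):
    """A toy streaming corpus: yields one (fieldId, fieldValue) document per text."""
    def __init__(self, texts, dictionary):
        self.texts = texts
        self.dictionary = dictionary

    def __iter__(self):
        for text in self.texts:
            yield self.dictionary.doc2bow(text)  # one sparse document per step

    def __len__(self):
        return len(self.texts)

texts = [['human', 'computer', 'interaction'], ['graph', 'trees', 'minors']]
dictionary = corpora.Dictionary(texts)
corpus = MyCorpus(texts, dictionary)
for doc in corpus:
    print(doc)  # e.g. [(0, 1), (1, 1), (2, 1)]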


1.2 SimilarityABC

Abstract interface for similarity searches over a corpus.

In all instances, there is a corpus against which we want to perform the similarity search.

For each similarity search, the input is a document and the output are its similarities to individual corpus documents.

Similarity queries are realized by calling self[query_document].

There is also a convenience wrapper, where iterating over self yields similarities of each document in the corpus against the whole corpus (i.e., the query is each corpus document in turn).
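A minimal sketch of both usage patterns with one concrete subclass, MatrixSimilarity (the toy corpus below is made up):

from gensim import corpora, similarities

texts = [['human', 'computer', 'interaction'],
         ['graph', 'trees', 'minors'],
         ['graph', 'minors', 'survey']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# a concrete SimilarityABC subclass: dense cosine-similarity index over the corpus
index = similarities.MatrixSimilarity(corpus, num_features=len(dictionary))

# query via self[query_document]
query = dictionary.doc2bow(['graph', 'minors'])
print(index[query])        # similarities of the query against every corpus document

# the convenience wrapper: iterate over the index itself
for sims in index:
    print(sims)            # each corpus document queried against the whole corpus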

1.3 TransformationABC

Interface for transformations. A 'transformation' is any object which accepts a sparse document via the dictionary notation [] and returns another sparse document in its stead:
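For instance, TfidfModel is one such transformation; a small sketch (toy corpus made up):

from gensim import corpora, models

texts = [['human', 'computer', 'interaction'], ['graph', 'trees', 'minors']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

tfidf = models.TfidfModel(corpus)      # a TransformationABC subclass
doc = dictionary.doc2bow(['graph', 'minors'])
print(tfidf[doc])                      # another sparse document: [(id, tf-idf weight), ...]
print(list(tfidf[corpus]))             # the [] notation also wraps whole corpora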

2. Corpora Module

This package contains implementations of various streaming corpus I/O formats.
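As a usage sketch of these storage formats (the file paths below are made up), the same bag-of-words corpus can be serialized to, and streamed back from, different on-disk formats:

from gensim import corpora

texts = [['human', 'computer', 'interaction'], ['graph', 'trees', 'minors']]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# each corpus class implements one on-disk format; serialize() streams documents to disk
corpora.MmCorpus.serialize('/tmp/demo.mm', bow)        # Matrix Market format
corpora.BleiCorpus.serialize('/tmp/demo.lda-c', bow)   # Blei's LDA-C format

# loading streams the documents back instead of reading everything into memory
for doc in corpora.MmCorpus('/tmp/demo.mm'):
    print(doc)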



The class hierarchy; each subclass can be viewed as one storage format for a corpus:



3. Models Module

This package contains algorithms for extracting document representations from their raw bag-of-word counts.
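A short sketch of that idea (toy corpus made up): train a model on the raw bag-of-words counts, then use it to extract a new representation for each document.

from gensim import corpora, models

texts = [['human', 'computer', 'interaction'],
         ['graph', 'trees', 'minors'],
         ['graph', 'minors', 'survey']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)   # train on raw BoW counts
for doc in corpus:
    print(lsi[doc])   # the extracted representation: the document as a 2-topic vector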

File structure under the models package:



The inheritance relationships among them:



4. Similarity Module

This package contains implementations of pairwise similarity queries.

There are only two files: docsim.py and index.py.

The classes in docsim.py all inherit from the SimilarityABC interface.
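For instance, the Similarity class in docsim.py builds a sharded, disk-backed index, so it scales beyond memory; a rough sketch (paths and corpus made up):

import tempfile
from gensim import corpora, similarities

texts = [['human', 'computer', 'interaction'],
         ['graph', 'trees', 'minors'],
         ['graph', 'minors', 'survey']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# index shards are written under the given prefix
index = similarities.Similarity(tempfile.mkdtemp() + '/shard',
                                corpus, num_features=len(dictionary))
query = dictionary.doc2bow(['graph', 'minors'])
print(index[query])   # cosine similarity of the query against each indexed document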

Class diagram of the Similarity module:



5. Parsing Module

This package contains functions to preprocess raw text.

It contains two files:

preprocessing.py: document preprocessing, e.g. stop-word removal, case folding, and so on.

porter.py: the Porter stemming algorithm, from the paper Porter, 1980, "An algorithm for suffix stripping", Program, Vol. 14, no. 3, pp. 130-137. More on the algorithm: http://www.tartarus.org/~martin/PorterStemmer

Stemming reduces inflected forms, e.g. plurals and third-person endings, back to a common stem, for example:

"""Get rid of plurals and -ed or -ing. E.g.,

caresses  ->  caress
ponies    ->  poni
ties      ->  ti
caress    ->  caress
cats      ->  cat

feed      ->  feed
agreed    ->  agree
disabled  ->  disable

matting   ->  mat
mating    ->  mate
meeting   ->  meet
milling   ->  mill
messing   ->  mess

meetings  ->  meet
"""


6. scripts

This is a collection of scripts for convenient processing and conversion. For example:

glove2word2vec.py converts vectors from the GloVe format to the word2vec text format:

    USAGE: $ python -m gensim.scripts.glove2word2vec --input <GloVe vector file> --output <Word2vec vector file>
    Where:
        <GloVe vector file>: Input GloVe .txt file
        <Word2vec vector file>: Desired name of output Word2vec .txt file
word2vec2tensor.py converts a word2vec model into the 2D tensor TSV format:

    USAGE: $ python -m gensim.scripts.word2vec2tensor --input <Word2Vec model file> --output <TSV tensor filename prefix> [--binary <Word2Vec binary flag>]
    Where:
        <Word2Vec model file>: Input Word2Vec model
        <TSV tensor filename prefix>: 2D tensor TSV output file name prefix
        <Word2Vec binary flag>: Set True if the Word2Vec model is binary. Defaults to False.
    Output:
        The script creates two TSV files, a 2D tensor format file and a word embedding
        metadata file; both use the --output file name as prefix.
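The conversion can also be driven from Python; a rough sketch, assuming a recent gensim that exposes the glove2word2vec() helper (file names are placeholders):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# convert GloVe text vectors into word2vec text format (paths are placeholders)
glove2word2vec('glove.6B.50d.txt', 'glove.6B.50d.w2v.txt')

# the converted file then loads like any other word2vec text model
vectors = KeyedVectors.load_word2vec_format('glove.6B.50d.w2v.txt')
print(vectors.most_similar('computer', topn=3))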


7. sklearn Integration

Scikit-learn wrappers for gensim: SklearnWrapperLdaModel and SklearnWrapperLsiModel.
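A rough sketch of how such a wrapper fits the usual scikit-learn fit/transform workflow; note that the import path below is an assumption (it has moved between gensim versions, and later releases expose the wrappers under gensim.sklearn_api instead), and the toy corpus is made up:

from gensim import corpora
# assumed gensim 2.x import path; adjust for your gensim version
from gensim.sklearn_integration import SklearnWrapperLdaModel

texts = [['human', 'computer', 'interaction'], ['graph', 'trees', 'minors']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = SklearnWrapperLdaModel(num_topics=2, id2word=dictionary)
lda.fit(corpus)                 # scikit-learn style estimator API
print(lda.transform(corpus))    # documents as topic-distribution vectors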

8. summarization

8.1 Keywords

def keywords(text, ratio=0.2, words=None, split=False, scores=False, pos_filter=['NN', 'JJ'], lemmatize=False, deacc=True)

The keyword computation is graph-based.

8.2 Summarization

def summarize(text, ratio=0.2, word_count=None, split=False)

It mainly uses the TextRank algorithm, and the computation is likewise graph-based.
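A quick usage sketch of both functions (the input text is made up; very short inputs only trigger a warning, real documents should be longer):

from gensim.summarization import keywords, summarize

text = ('Automatic summarization is the process of shortening a text document '
        'with software to create a summary that keeps the major points of the '
        'original document. Keyword extraction selects the words that best '
        'describe the document. Both tasks can be solved with graph-based '
        'ranking algorithms such as TextRank.')

print(keywords(text, ratio=0.2, split=True))   # top keywords as a list
print(summarize(text, ratio=0.5))              # extractive summary built with TextRank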

8.3 Related Data Structures and Algorithms

BM25 [bm25.py]

The TextRank algorithm

Graph [common.py, graph.py]

9. Unit Tests

10. Topic Coherence

There are measures for evaluating topic models; some related material (a usage sketch follows the links below):

What is Topic Coherence?

https://rare-technologies.com/what-is-topic-coherence/

Exploring the Space of Topic Coherence Measures

http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf

Evaluating topic coherence measures

https://mimno.infosci.cornell.edu/nips2013ws/nips2013tm_submission_7.pdf

Topic Coherence To Evaluate Topic Models

http://qpleple.com/topic-coherence-to-evaluate-topic-models/

A demo notebook on topic coherence:

https://nbviewer.jupyter.org/github/dsquareindia/gensim/blob/280375fe14adea67ce6384ba7eabf362b05e6029/docs/notebooks/topic_coherence_tutorial.ipynb

Topic mining and classification based on semantic coherence (in Chinese): http://blog.csdn.net/shirdrn/article/details/7076505
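Inside gensim itself this is exposed through the CoherenceModel class; a minimal sketch (the toy corpus and LDA model are made up):

from gensim import corpora
from gensim.models import LdaModel, CoherenceModel

texts = [['human', 'computer', 'interaction'],
         ['graph', 'trees', 'minors'],
         ['graph', 'minors', 'survey']]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)

# 'u_mass' works directly on the corpus; window-based measures such as 'c_v'
# (see the links above) take the tokenized texts instead
cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence='u_mass')
print(cm.get_coherence())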

[Author: happyprince, http://blog.csdn.net/ld326/article/details/78379449]
Tags: nlp Gensim python