NLP05 - Gensim Source Code [Packages and Interfaces]
2017-10-28 21:20
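Abstract: A rough look at the file structure and interfaces of the gensim package, to get an intuitive feel for what the gensim source contains. This is the first step in getting to know the Gensim source code. It covers the file structure, the core interfaces, the Corpora, Models, Similarity, and Parsing modules, scripts, sklearn integration, summarization and keywords, unit tests, and topic coherence.
0. File structure
The gensim package is organized into modules for corpora, models, and so on. In addition, interfaces.py holds the core interfaces, matutils.py the math utilities, and utils.py the common helper functions. nosy.py is not important; it is used to monitor whether .py files have been modified.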
1. Gensim core interfaces [interfaces.py]
1.1 CorpusABC
Interface (abstract base class) for corpora. A corpus is simply an iterable, where each iteration step yields one document:
>>> for doc in corpus:
>>>     # do something with the doc...
A document is a sequence of (fieldId, fieldValue) 2-tuples:
>>> for attr_id, attr_value in doc:
>>>     # do something with the attribute
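To make the two loops above concrete, here is a minimal sketch of an object that satisfies the same contract: iterating over it yields one document per step, and each document is a list of (id, value) 2-tuples. The class name `TinyStreamingCorpus` and its `token2id` attribute are made up for illustration; this is plain Python, not gensim's own code, and it does not subclass CorpusABC.

```python
from collections import Counter


class TinyStreamingCorpus:
    """Hypothetical corpus: an iterable that yields one sparse document per step."""

    def __init__(self, texts):
        self.texts = texts
        # Assign each distinct token an integer field id, in order of first appearance.
        self.token2id = {}
        for text in texts:
            for token in text.split():
                self.token2id.setdefault(token, len(self.token2id))

    def __iter__(self):
        # Documents are built lazily, one per iteration step.
        for text in self.texts:
            counts = Counter(self.token2id[token] for token in text.split())
            # A document is a sorted list of (field id, field value) 2-tuples.
            yield sorted(counts.items())

    def __len__(self):
        return len(self.texts)


corpus = TinyStreamingCorpus(["human computer interface", "computer computer graph"])
for doc in corpus:
    for attr_id, attr_value in doc:
        print(attr_id, attr_value)
```

Because the object only promises to be iterable, the same loops would work unchanged if `__iter__` read documents from a file instead of an in-memory list.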
1.2 SimilarityABC
Abstract interface for similarity searches over a corpus. In all instances, there is a corpus against which we want to perform the similarity search.
For each similarity search, the input is a document and the output is its similarities to the individual corpus documents.
Similarity queries are realized by calling self[query_document].
There is also a convenience wrapper, where iterating over self yields the similarities of each document in the corpus against the whole corpus (i.e., the query is each corpus document in turn).
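The calling convention can be sketched with a toy index. This is only an illustration of the `self[query_document]` and iteration protocol, assuming cosine similarity over sparse (id, value) documents; `TinySimilarityIndex` and `cosine` are hypothetical names, not gensim classes, and gensim's real indexes are implemented differently.

```python
import math


def cosine(doc_a, doc_b):
    """Cosine similarity between two sparse documents given as (id, value) pairs."""
    a, b = dict(doc_a), dict(doc_b)
    dot = sum(value * b.get(field_id, 0.0) for field_id, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinySimilarityIndex:
    """Hypothetical index following the query protocol described above."""

    def __init__(self, corpus):
        # The corpus against which all similarity searches are performed.
        self.corpus = list(corpus)

    def __getitem__(self, query_document):
        # One similarity value per corpus document, in corpus order.
        return [cosine(query_document, doc) for doc in self.corpus]

    def __iter__(self):
        # Convenience wrapper: each corpus document is used as the query in turn.
        for doc in self.corpus:
            yield self[doc]


index = TinySimilarityIndex([[(0, 1.0), (1, 1.0)], [(1, 2.0)], [(2, 3.0)]])
sims = index[[(1, 1.0)]]   # similarities of one query to every corpus document
all_sims = list(index)     # one row of similarities per corpus document
```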
1.3 TransformationABC
Interface for transformations. A 'transformation' is any object which accepts a sparse document via the dictionary notation [] and returns another sparse document in its stead.
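As a rough illustration of the `[]` convention, the sketch below defines a toy transformation that turns any nonzero value into 1.0. `BinarizeTransformation` is a made-up name and not a gensim class; it only shows the shape of the protocol (sparse document in, sparse document out).

```python
class BinarizeTransformation:
    """Hypothetical transformation: replaces every nonzero value with 1.0."""

    def __getitem__(self, sparse_document):
        # Accept a sparse document via [] and return another sparse document.
        # Entries whose value is zero are dropped, as is usual for sparse data.
        return [(field_id, 1.0) for field_id, value in sparse_document if value != 0]


binarize = BinarizeTransformation()
transformed = binarize[[(0, 3.0), (2, 0.0), (5, 7.0)]]
print(transformed)
```

Since the output is again a sparse document, transformations written this way can be chained, for example `second[first[doc]]`.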
2. Corpora module
This package contains implementations of various streaming corpus I/O formats.
In the class hierarchy, each subclass can be viewed as one storage format for a corpus.
3. Models module
This package contains algorithms for extracting document representations from their raw bag-of-word counts.
[Figure: file structure of the models package]
[Figure: inheritance relationships among the model classes]
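4. Similarity module
This package contains implementations of pairwise similarity queries.
It has only two files: docsim.py and index.py.
The index classes in docsim.py (for example Similarity, MatrixSimilarity, and SparseMatrixSimilarity) inherit from the SimilarityABC interface.
[Figure: class diagram of the similarity module]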
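5. Parsing module
This package contains functions to preprocess raw text.
It contains two files:
preprocessing.py: document preprocessing, for example stop-word removal and case normalization.
porter.py: the Porter Stemming Algorithm, from the paper
Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
no. 3, pp 130-137.
More information on the algorithm: http://www.tartarus.org/~martin/PorterStemmer
Stemming reduces inflected forms such as plurals and third-person forms to a common stem, which is not necessarily a dictionary word, for example: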
"""Get rid of plurals and -ed or -ing. E.g.,
   caresses  ->  caress
   ponies    ->  poni
   ties      ->  ti
   caress    ->  caress
   cats      ->  cat
   feed      ->  feed
   agreed    ->  agree
   disabled  ->  disable
   matting   ->  mat
   mating    ->  mate
   meeting   ->  meet
   milling   ->  mill
   messing   ->  mess
   meetings  ->  meet
"""
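The first five examples in that docstring (the plural cases) follow from step 1a of Porter's algorithm, which can be sketched on its own. This is a simplified illustration of that single step, not the code in porter.py; the function name `porter_step_1a` is made up, and the -ed / -ing cases need further rules that are not shown here.

```python
def porter_step_1a(word):
    """Sketch of Porter step 1a only: SSES -> SS, IES -> I, SS -> SS, S -> ''."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for plural in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(plural, "->", porter_step_1a(plural))
```

The order of the checks matters: the longest matching suffix has to be tested first, otherwise "caress" would lose its final "s".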
6. scripts
This is a collection of scripts for convenient processing and conversion, for example:
glove2word2vec.py converts the GloVe vectors format to the word2vec text format.
USAGE: $ python -m gensim.scripts.glove2word2vec --input <GloVe vector file> --output <Word2vec vector file>
Where:
    <GloVe vector file>: Input GloVe .txt file
    <Word2vec vector file>: Desired name of output Word2vec .txt file
word2vec2tensor.py converts a word2vec model to tensor form.
USAGE: $ python -m gensim.scripts.word2vec2tensor --input <Word2Vec model file> --output <TSV tensor filename prefix> [--binary] <Word2Vec binary flag>
Where:
    <Word2Vec model file>: Input Word2Vec model
    <TSV tensor filename prefix>: 2D tensor TSV output file name prefix
    <Word2Vec binary flag>: Set True if Word2Vec model is binary. Defaults to False.
Output:
    The script will create two TSV files: a 2D tensor format file and a word embedding metadata file. Both files will use the --output file name as prefix.
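The difference between the two text formats handled by glove2word2vec is small: the word2vec text format starts with a header line giving the number of vectors and their dimensionality, while a GloVe .txt file has no such line. The sketch below shows that idea on in-memory text. The helper name `glove_to_word2vec_text` is hypothetical, and the real script works on files rather than strings.

```python
import io


def glove_to_word2vec_text(glove_lines):
    """Prepend the '<vector count> <dimensions>' header used by the word2vec text format."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    if not rows:
        raise ValueError("no vectors found")
    # Each row is a token followed by its vector components, separated by spaces.
    dimensions = len(rows[0].split(" ")) - 1
    header = "{} {}".format(len(rows), dimensions)
    return "\n".join([header] + rows) + "\n"


# Stand-in for a GloVe .txt file with two 3-dimensional vectors.
glove_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
converted = glove_to_word2vec_text(glove_file)
print(converted)
```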
7. sklearn integration
Scikit-learn wrappers for gensim: SklearnWrapperLdaModel and SklearnWrapperLsiModel
8. summarization
8.1 Keywords
def keywords(text, ratio=0.2, words=None, split=False, scores=False, pos_filter=['NN', 'JJ'], lemmatize=False, deacc=True)
Keyword extraction is computed on a graph.
8.2 Summarization
def summarize(text, ratio=0.2, word_count=None, split=False)
This mainly uses the TextRank algorithm, which is also computed on a graph.
8.3 Related data structures and algorithms
BM25 [bm25.py], TextRank algorithm
Graph [commons.py, graph.py]
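TextRank ranks the nodes of a graph (sentences or words) with a PageRank-style score. The sketch below is a generic power-iteration version on a small weighted, undirected graph, meant only to show the idea; the function name, the damping value of 0.85, and the fixed iteration count are assumptions, and it is not how gensim's summarization package computes its scores.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """PageRank-style scores by power iteration.

    weights maps each node to a dict {neighbor: edge weight}; for an
    undirected graph every edge is listed in both directions.
    """
    nodes = sorted(weights)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    # Total outgoing weight per node, used to normalize each contribution.
    out_weight = {node: sum(weights[node].values()) for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            incoming = sum(
                scores[other] * weights[other].get(node, 0.0) / out_weight[other]
                for other in nodes
                if out_weight[other] > 0
            )
            new_scores[node] = (1.0 - damping) / n + damping * incoming
        scores = new_scores
    return scores


# Three "sentences" in a chain: s1 is connected to both s0 and s2.
graph = {
    "s0": {"s1": 1.0},
    "s1": {"s0": 1.0, "s2": 1.0},
    "s2": {"s1": 1.0},
}
scores = pagerank(graph)
print(sorted(scores, key=scores.get, reverse=True))
```

In a summarizer the edge weights would come from a sentence-similarity measure such as BM25, and the highest-scoring sentences would be kept for the summary.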
9. Unit tests
10. topic coherence
Topic models come with evaluation measures. Related material on this:
What is Topic Coherence?
https://rare-technologies.com/what-is-topic-coherence/
Exploring the Space of Topic Coherence Measures
http://svn.aksw.org/papers/2015/WSDM_Topic_Evaluation/public.pdf
Evaluating topic coherence measures
https://mimno.infosci.cornell.edu/nips2013ws/nips2013tm_submission_7.pdf
Topic Coherence To Evaluate Topic Models
http://qpleple.com/topic-coherence-to-evaluate-topic-models/
A demonstration of topic coherence:
https://nbviewer.jupyter.org/github/dsquareindia/gensim/blob/280375fe14adea67ce6384ba7eabf362b05e6029/docs/notebooks/topic_coherence_tutorial.ipynb
Topic mining and classification based on semantic coherence: http://blog.csdn.net/shirdrn/article/details/7076505
[Author: happyprince, http://blog.csdn.net/ld326/article/details/78379449]