NLP03 - Gensim Transformations and Similarity Queries
2017-10-26 18:21
Abstract: a hands-on walkthrough of the official Gensim tutorials, recording the practical steps for future reference and for fellow learners.
Sources
Topics and Transformations: https://radimrehurek.com/gensim/tut2.html
Similarity Queries: https://radimrehurek.com/gensim/tut3.html
Overview and starter code
All of the data below was generated in the previous post and is loaded here for further processing. For the code that builds the dataset, see http://blog.csdn.net/ld326/article/details/78353338.

Transformations
```python
import os
import logging

from gensim import corpora, models
from gensim.models import ldamodel, hdpmodel

"""
Vector transformations: converting documents from one vector representation
into another serves two goals:
1. To bring out hidden structure in the corpus, discover relationships between
   words, and use them to describe the documents in a new and (hopefully) more
   semantic way.
2. To make the document representation more compact. This improves both
   efficiency (the new representation consumes fewer resources) and efficacy
   (marginal data trends are ignored: noise reduction).
"""

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the dataset generated in the first tutorial
if os.path.exists("tmp/deerwester.dict"):
    dictionary = corpora.Dictionary.load('tmp/deerwester.dict')
    corpus = corpora.MmCorpus('tmp/deerwester.mm')
    print("Used files generated from first tutorial")
else:
    print("Please run the first tutorial to generate the dataset")

# First transformation: TF-IDF. Maps the integer-valued bag-of-words space into
# a real-valued vector space of the same dimensionality; rare features receive
# larger weights.
doc_bow = [(0, 1), (1, 1)]
tfidf = models.TfidfModel(corpus)  # step 1 -- initialize the model
# step 2 -- use the model to transform vectors
print('Transformed: [(0, 1), (1, 1)] -> %s' % str(tfidf[doc_bow]))

# Transform the whole corpus: tfidf[corpus] only creates a wrapper around it;
# the actual conversion happens on the fly while iterating over the documents.
corpus_tfidf = tfidf[corpus]
print('Transforming the whole corpus:')
for doc in corpus_tfidf:
    print(doc)

# Second transformation: Latent Semantic Indexing (LSI),
# bow -> tfidf -> fold-in-lsi (a double wrapper over the original corpus).
# LSI maps the bag-of-words or (better) TF-IDF space into a low-dimensional
# latent space, and supports incremental updates.
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]
lsi.print_topics(2)  # LSI folds the TF-IDF corpus into a latent 2-D space (num_topics=2)

# Save and reload the model (same for tfidf, lda, ...)
lsi.save('tmp/model.lsi')
lsi = models.LsiModel.load('tmp/model.lsi')
print('Topics:')
for doc in corpus_lsi:  # bow->tfidf and tfidf->lsi are both executed here, on the fly
    print(doc)

# Third transformation: Latent Dirichlet Allocation (LDA), a probabilistic
# extension of LSA that maps bag-of-words counts into a low-dimensional topic space.
lda_m = ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lda = lda_m[corpus_tfidf]
print('LDA topics:')
for doc in corpus_lda:  # bow->tfidf and tfidf->lda are both executed here
    print(doc)

# Fourth transformation: Hierarchical Dirichlet Process (HDP), a non-parametric
# Bayesian method (no num_topics parameter). Still immature; use with care.
hdp_m = hdpmodel.HdpModel(corpus_tfidf, id2word=dictionary)
corpus_hdp = hdp_m[corpus_tfidf]
print('HDP topics:')
for doc in corpus_hdp:  # bow->tfidf and tfidf->hdp are both executed here
    print(doc)
```
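The 0.7071 weights that the TF-IDF step prints for `doc_bow` can be reproduced by hand. A minimal sketch, assuming the deerwester corpus statistics (9 documents) and that each query term occurs in 2 of them (assumed numbers for illustration, not read from the saved files):

```python
import math

# Assumed corpus statistics (hypothetical: 9 documents, each term in 2 of them)
num_docs = 9
doc_freq = {0: 2, 1: 2}
doc_bow = [(0, 1), (1, 1)]  # (term_id, raw term frequency)

# Gensim's default TF-IDF weighting: tf * log2(N / df), followed by L2-normalization
weights = [(term, tf * math.log2(num_docs / doc_freq[term])) for term, tf in doc_bow]
norm = math.sqrt(sum(w * w for _, w in weights))
tfidf_vec = [(term, w / norm) for term, w in weights]
print(tfidf_vec)  # both terms share the same idf, so each weight is 1/sqrt(2) ~ 0.7071
```

Because both terms have equal document frequency, the exact idf value cancels out in the normalization and each weight ends up at 1/sqrt(2).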
Output:
```
Used files generated from first tutorial
Transformed: [(0, 1), (1, 1)] -> [(0, 0.7071067811865476), (1, 0.7071067811865476)]
Transforming the whole corpus:
[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(2, 0.44424552527467476), (3, 0.3244870206138555), (4, 0.44424552527467476), (5, 0.44424552527467476), (6, 0.3244870206138555), (7, 0.44424552527467476)]
[(0, 0.5710059809418182), (3, 0.4170757362022777), (6, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (3, 0.7184811607083769), (8, 0.49182558987264147)]
[(4, 0.6282580468670046), (6, 0.45889394536615247), (7, 0.6282580468670046)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(5, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
Topics:
[(0, 0.066007833960902457), (1, -0.52007033063618502)]
[(0, 0.19667592859142424), (1, -0.76095631677000475)]
[(0, 0.08992639972446359), (1, -0.72418606267525076)]
[(0, 0.075858476521781348), (1, -0.63205515860034289)]
[(0, 0.10150299184980076), (1, -0.57373084830029542)]
[(0, 0.70321089393783143), (1, 0.1611518021402569)]
[(0, 0.87747876731198338), (1, 0.16758906864659284)]
[(0, 0.90986246868185816), (1, 0.14086553628718884)]
[(0, 0.61658253505692828), (1, -0.053929075663894474)]
LDA topics:
[(0, 0.43458819379220637), (1, 0.56541180620779363)]
[(0, 0.81496850961257628), (1, 0.18503149038742375)]
[(0, 0.42398633716223832), (1, 0.57601366283776179)]
[(0, 0.27284273175270496), (1, 0.72715726824729499)]
[(0, 0.78961713183549342), (1, 0.21038286816450658)]
[(0, 0.27046663485539763), (1, 0.72953336514460243)]
[(0, 0.23738848068832419), (1, 0.76261151931167581)]
[(0, 0.27880748145925172), (1, 0.72119251854074828)]
[(0, 0.64582453064154577), (1, 0.35417546935845429)]
HDP topics:
[(0, 0.52809018857755152), (1, 0.26553584360391191), (2, 0.051983022299278911), (3, 0.03886440092379026), (4, 0.029208773283318439), (5, 0.021975161934421561), (6, 0.016341636429796386), (7, 0.012334101686562229)]
[(0, 0.074099556860078936), (1, 0.76127241803674017), (2, 0.041537586886014721), (3, 0.030963727406811727), (4, 0.023292083182027672), (5, 0.017524199077066691), (6, 0.013031743285505505)]
[(0, 0.48437437363602542), (1, 0.32579435434464177), (2, 0.048112070195248331), (3, 0.035668320063564463), (4, 0.02681313352299294), (5, 0.020172686197132431), (6, 0.015001253112536254), (7, 0.011322426747553908)]
[(0, 0.34034073129107029), (1, 0.45101804381381155), (2, 0.052557228855974372), (3, 0.039278480525089025), (4, 0.029532032771654788), (5, 0.022218461058883556), (6, 0.016522573829929454), (7, 0.012470666950317679)]
[(0, 0.094173500744889241), (1, 0.070742471007394905), (2, 0.67977887835637896), (3, 0.039071041145108294), (4, 0.029387407680202364), (5, 0.022109782761934496), (6, 0.016441782387953123), (7, 0.012409688403626294)]
[(0, 0.6243532030336858), (1, 0.093931816536177784), (2, 0.070751477981715785), (3, 0.053146621419023092), (4, 0.039904675525158521), (5, 0.030018566595092239), (6, 0.022323090455904304), (7, 0.016848696236704659), (8, 0.012506198130003471)]
[(0, 0.68886300433857051), (1, 0.077826159523606567), (2, 0.058598719917462287), (3, 0.043975113679573102), (4, 0.033055336187415012), (5, 0.024868191804956103), (6, 0.018493053625294565), (7, 0.013957916979065739), (8, 0.010360473758344944)]
[(0, 0.49182518940144543), (1, 0.30015320078544222), (2, 0.052469639197019941), (3, 0.039153986517730457), (4, 0.029430280423350139), (5, 0.022140594553867854), (6, 0.016464688924120365), (7, 0.01242697745241703)]
[(0, 0.27845680026889291), (1, 0.51398051031892455), (2, 0.052268236069297966), (3, 0.039060866109315549), (4, 0.029386884419279093), (5, 0.022109784517034499), (6, 0.016441782381119849), (7, 0.012409688403623432)]
```
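What LSI does under the hood is essentially a truncated SVD of the term-document matrix: keep only the k largest singular values and project every document into the resulting k-dimensional "topic" space. A minimal numpy sketch with a made-up 4-term x 5-document TF-IDF matrix (all numbers are illustrative, not taken from the corpus above):

```python
import numpy as np

# Hypothetical 4-term x 5-document TF-IDF matrix (illustrative numbers only)
A = np.array([
    [0.8, 0.7, 0.0, 0.1, 0.0],
    [0.6, 0.8, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.7, 0.6],
    [0.1, 0.0, 0.7, 0.8, 0.9],
])

# LSI amounts to a truncated SVD: A ~= U_k S_k V_k^T, keeping k "topics"
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_2d = (np.diag(s[:k]) @ Vt[:k]).T  # each row: one document in the 2-topic space
print(docs_2d.shape)  # (5, 2)
```

Each row of `docs_2d` plays the role of the 2-tuples printed under "Topics:" above; documents about the same theme end up pointing in similar directions.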
Similarity Queries
```python
import os

from gensim import corpora, models, similarities

# Load the dataset generated in the first tutorial
if os.path.exists("tmp/deerwester.dict"):
    dictionary = corpora.Dictionary.load('tmp/deerwester.dict')
    corpus = corpora.MmCorpus('tmp/deerwester.mm')
    print("Used files generated from first tutorial")
else:
    print("Please run the first tutorial to generate the dataset")

# Rebuild the TF-IDF and LSI models (see the previous script)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]
lsi.print_topics(2)

# Similarity queries: convert the query document to LSI space
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
print(vec_lsi)

# Build the index. similarities.MatrixSimilarity keeps all vectors in memory;
# for larger corpora use similarities.Similarity instead.
index = similarities.MatrixSimilarity(lsi[corpus])
index.save('tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('tmp/deerwester.index')

# Perform a similarity query against the corpus
sims = index[vec_lsi]
print('Document: %s' % doc)
print(list(enumerate(sims)))

# Print sorted (document number, similarity score) 2-tuples
print('Sorted results:')
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
```
Output:
```
Used files generated from first tutorial
[(0, 0.079104751174449178), (1, 0.57328352430794038)]
Document: Human computer interaction
[(0, 0.99994081), (1, 0.99467081), (2, 0.99994278), (3, 0.999879), (4, 0.99935204), (5, -0.08804217), (6, -0.0515742), (7, -0.023664713), (8, 0.1938726)]
Sorted results:
[(2, 0.99994278), (0, 0.99994081), (3, 0.999879), (4, 0.99935204), (1, 0.99467081), (8, 0.1938726), (7, -0.023664713), (6, -0.0515742), (5, -0.08804217)]
```
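The scores returned by MatrixSimilarity are cosine similarities between the query and each corpus document in LSI space, which is why they range from -1 to 1 rather than 0 to 1. A hand-rolled sketch with hypothetical 2-D LSI vectors (the numbers are made up for illustration; LSI topic signs also vary between runs):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-D LSI vectors
query = [0.08, 0.57]   # a query vector, roughly like vec_lsi above
doc2 = [0.09, 0.52]    # a document pointing in nearly the same direction
doc5 = [0.70, -0.09]   # a document about an unrelated topic

print(cosine(query, doc2))  # close to 1
print(cosine(query, doc5))  # near 0
```

This matches the pattern in the run above: documents on the query's topic score near 1, unrelated ones hover around (or below) 0.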
[Author: happyprince, http://blog.csdn.net/ld326/article/details/78357172]