NLP03 - Gensim Transformations and Similarity Queries
2017-10-26 18:21
Abstract: a hands-on walkthrough of the official Gensim tutorials, recording the practical steps for future reference and for fellow learners.
Sources
Topics and Transformations: https://radimrehurek.com/gensim/tut2.html
Similarity Queries: https://radimrehurek.com/gensim/tut3.html
Overview and starter code
All of the data below was generated in the previous post and is loaded here for further processing. For the code that builds the dataset, see http://blog.csdn.net/ld326/article/details/78353338.

Transformations
```python
import os
import logging

from gensim import corpora, models
from gensim.models import ldamodel, hdpmodel

"""
Vector transformations: converting documents from one vector representation
into another serves two goals:
1. To bring out hidden structure in the corpus, discover relationships between
   words, and use them to describe the documents in a new and (hopefully) more
   semantic way.
2. To make the document representation more compact. This improves both
   efficiency (the new representation consumes fewer resources) and efficacy
   (marginal data trends are ignored: noise reduction).
"""

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Load the dataset generated in the first tutorial
if os.path.exists("tmp/deerwester.dict"):
    dictionary = corpora.Dictionary.load('tmp/deerwester.dict')
    corpus = corpora.MmCorpus('tmp/deerwester.mm')
    print("Used files generated from first tutorial")
else:
    print("Please run the first tutorial to generate the dataset")

# First transformation: TF-IDF. Maps the integer-valued bag-of-words space into
# a real-valued vector space of the same dimensionality; rare features receive
# larger weights.
doc_bow = [(0, 1), (1, 1)]
tfidf = models.TfidfModel(corpus)  # step 1 -- initialize the model
# step 2 -- use the model to transform vectors
print('Transformed: [(0, 1), (1, 1)] -> %s' % str(tfidf[doc_bow]))

# Transform the whole corpus: tfidf[corpus] only creates a wrapper around it;
# the actual conversion happens on the fly while iterating over the documents.
corpus_tfidf = tfidf[corpus]
print('Transforming the whole corpus:')
for doc in corpus_tfidf:
    print(doc)

# Second transformation: Latent Semantic Indexing (LSI),
# bow -> tfidf -> fold-in-lsi (a double wrapper over the original corpus).
# LSI maps the bag-of-words or (better) TF-IDF space into a low-dimensional
# latent space, and supports incremental updates.
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]
lsi.print_topics(2)  # LSI folds the TF-IDF corpus into a latent 2-D space (num_topics=2)

# Save and reload the model (same for tfidf, lda, ...)
lsi.save('tmp/model.lsi')
lsi = models.LsiModel.load('tmp/model.lsi')
print('Topics:')
for doc in corpus_lsi:  # bow->tfidf and tfidf->lsi are both executed here, on the fly
    print(doc)

# Third transformation: Latent Dirichlet Allocation (LDA), a probabilistic
# extension of LSA that maps bag-of-words counts into a low-dimensional topic space.
lda_m = ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lda = lda_m[corpus_tfidf]
print('LDA topics:')
for doc in corpus_lda:  # bow->tfidf and tfidf->lda are both executed here
    print(doc)

# Fourth transformation: Hierarchical Dirichlet Process (HDP), a non-parametric
# Bayesian method (no num_topics parameter). Still immature; use with care.
hdp_m = hdpmodel.HdpModel(corpus_tfidf, id2word=dictionary)
corpus_hdp = hdp_m[corpus_tfidf]
print('HDP topics:')
for doc in corpus_hdp:  # bow->tfidf and tfidf->hdp are both executed here
    print(doc)
```
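The 0.7071 weights that the TF-IDF step prints for `doc_bow` can be reproduced by hand. A minimal sketch, assuming the deerwester corpus statistics (9 documents) and that each query term occurs in 2 of them (assumed numbers for illustration, not read from the saved files):

```python
import math

# Assumed corpus statistics (hypothetical: 9 documents, each term in 2 of them)
num_docs = 9
doc_freq = {0: 2, 1: 2}
doc_bow = [(0, 1), (1, 1)]  # (term_id, raw term frequency)

# Gensim's default TF-IDF weighting: tf * log2(N / df), followed by L2-normalization
weights = [(term, tf * math.log2(num_docs / doc_freq[term])) for term, tf in doc_bow]
norm = math.sqrt(sum(w * w for _, w in weights))
tfidf_vec = [(term, w / norm) for term, w in weights]
print(tfidf_vec)  # both terms share the same idf, so each weight is 1/sqrt(2) ~ 0.7071
```

Because both terms have equal document frequency, the exact idf value cancels out in the normalization and each weight ends up at 1/sqrt(2).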
Output:
```
Used files generated from first tutorial
Transformed: [(0, 1), (1, 1)] -> [(0, 0.7071067811865476), (1, 0.7071067811865476)]
Transforming the whole corpus:
[(0, 0.5773502691896257), (1, 0.5773502691896257), (2, 0.5773502691896257)]
[(2, 0.44424552527467476), (3, 0.3244870206138555), (4, 0.44424552527467476), (5, 0.44424552527467476), (6, 0.3244870206138555), (7, 0.44424552527467476)]
[(0, 0.5710059809418182), (3, 0.4170757362022777), (6, 0.4170757362022777), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (3, 0.7184811607083769), (8, 0.49182558987264147)]
[(4, 0.6282580468670046), (6, 0.45889394536615247), (7, 0.6282580468670046)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(5, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]
Topics:
[(0, 0.066007833960902457), (1, -0.52007033063618502)]
[(0, 0.19667592859142424), (1, -0.76095631677000475)]
[(0, 0.08992639972446359), (1, -0.72418606267525076)]
[(0, 0.075858476521781348), (1, -0.63205515860034289)]
[(0, 0.10150299184980076), (1, -0.57373084830029542)]
[(0, 0.70321089393783143), (1, 0.1611518021402569)]
[(0, 0.87747876731198338), (1, 0.16758906864659284)]
[(0, 0.90986246868185816), (1, 0.14086553628718884)]
[(0, 0.61658253505692828), (1, -0.053929075663894474)]
LDA topics:
[(0, 0.43458819379220637), (1, 0.56541180620779363)]
[(0, 0.81496850961257628), (1, 0.18503149038742375)]
[(0, 0.42398633716223832), (1, 0.57601366283776179)]
[(0, 0.27284273175270496), (1, 0.72715726824729499)]
[(0, 0.78961713183549342), (1, 0.21038286816450658)]
[(0, 0.27046663485539763), (1, 0.72953336514460243)]
[(0, 0.23738848068832419), (1, 0.76261151931167581)]
[(0, 0.27880748145925172), (1, 0.72119251854074828)]
[(0, 0.64582453064154577), (1, 0.35417546935845429)]
HDP topics:
[(0, 0.52809018857755152), (1, 0.26553584360391191), (2, 0.051983022299278911), (3, 0.03886440092379026), (4, 0.029208773283318439), (5, 0.021975161934421561), (6, 0.016341636429796386), (7, 0.012334101686562229)]
[(0, 0.074099556860078936), (1, 0.76127241803674017), (2, 0.041537586886014721), (3, 0.030963727406811727), (4, 0.023292083182027672), (5, 0.017524199077066691), (6, 0.013031743285505505)]
[(0, 0.48437437363602542), (1, 0.32579435434464177), (2, 0.048112070195248331), (3, 0.035668320063564463), (4, 0.02681313352299294), (5, 0.020172686197132431), (6, 0.015001253112536254), (7, 0.011322426747553908)]
[(0, 0.34034073129107029), (1, 0.45101804381381155), (2, 0.052557228855974372), (3, 0.039278480525089025), (4, 0.029532032771654788), (5, 0.022218461058883556), (6, 0.016522573829929454), (7, 0.012470666950317679)]
[(0, 0.094173500744889241), (1, 0.070742471007394905), (2, 0.67977887835637896), (3, 0.039071041145108294), (4, 0.029387407680202364), (5, 0.022109782761934496), (6, 0.016441782387953123), (7, 0.012409688403626294)]
[(0, 0.6243532030336858), (1, 0.093931816536177784), (2, 0.070751477981715785), (3, 0.053146621419023092), (4, 0.039904675525158521), (5, 0.030018566595092239), (6, 0.022323090455904304), (7, 0.016848696236704659), (8, 0.012506198130003471)]
[(0, 0.68886300433857051), (1, 0.077826159523606567), (2, 0.058598719917462287), (3, 0.043975113679573102), (4, 0.033055336187415012), (5, 0.024868191804956103), (6, 0.018493053625294565), (7, 0.013957916979065739), (8, 0.010360473758344944)]
[(0, 0.49182518940144543), (1, 0.30015320078544222), (2, 0.052469639197019941), (3, 0.039153986517730457), (4, 0.029430280423350139), (5, 0.022140594553867854), (6, 0.016464688924120365), (7, 0.01242697745241703)]
[(0, 0.27845680026889291), (1, 0.51398051031892455), (2, 0.052268236069297966), (3, 0.039060866109315549), (4, 0.029386884419279093), (5, 0.022109784517034499), (6, 0.016441782381119849), (7, 0.012409688403623432)]
```
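What LSI does under the hood is essentially a truncated SVD of the term-document matrix: keep only the k largest singular values and project every document into the resulting k-dimensional "topic" space. A minimal numpy sketch with a made-up 4-term x 5-document TF-IDF matrix (all numbers are illustrative, not taken from the corpus above):

```python
import numpy as np

# Hypothetical 4-term x 5-document TF-IDF matrix (illustrative numbers only)
A = np.array([
    [0.8, 0.7, 0.0, 0.1, 0.0],
    [0.6, 0.8, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.7, 0.6],
    [0.1, 0.0, 0.7, 0.8, 0.9],
])

# LSI amounts to a truncated SVD: A ~= U_k S_k V_k^T, keeping k "topics"
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_2d = (np.diag(s[:k]) @ Vt[:k]).T  # each row: one document in the 2-topic space
print(docs_2d.shape)  # (5, 2)
```

Each row of `docs_2d` plays the role of the 2-tuples printed under "Topics:" above; documents about the same theme end up pointing in similar directions.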
Similarity Queries
```python
import os

from gensim import corpora, models, similarities

# Load the dataset generated in the first tutorial
if os.path.exists("tmp/deerwester.dict"):
    dictionary = corpora.Dictionary.load('tmp/deerwester.dict')
    corpus = corpora.MmCorpus('tmp/deerwester.mm')
    print("Used files generated from first tutorial")
else:
    print("Please run the first tutorial to generate the dataset")

# Rebuild the TF-IDF and LSI models (see the previous script)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]
lsi.print_topics(2)

# Similarity queries: convert the query document to LSI space
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
print(vec_lsi)

# Build the index. similarities.MatrixSimilarity keeps all vectors in memory;
# for larger corpora use similarities.Similarity instead.
index = similarities.MatrixSimilarity(lsi[corpus])
index.save('tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('tmp/deerwester.index')

# Perform a similarity query against the corpus
sims = index[vec_lsi]
print('Document: %s' % doc)
print(list(enumerate(sims)))

# Print sorted (document number, similarity score) 2-tuples
print('Sorted results:')
sims = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims)
```
Output:
```
Used files generated from first tutorial
[(0, 0.079104751174449178), (1, 0.57328352430794038)]
Document: Human computer interaction
[(0, 0.99994081), (1, 0.99467081), (2, 0.99994278), (3, 0.999879), (4, 0.99935204), (5, -0.08804217), (6, -0.0515742), (7, -0.023664713), (8, 0.1938726)]
Sorted results:
[(2, 0.99994278), (0, 0.99994081), (3, 0.999879), (4, 0.99935204), (1, 0.99467081), (8, 0.1938726), (7, -0.023664713), (6, -0.0515742), (5, -0.08804217)]
```
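The scores returned by MatrixSimilarity are cosine similarities between the query and each corpus document in LSI space, which is why they range from -1 to 1 rather than 0 to 1. A hand-rolled sketch with hypothetical 2-D LSI vectors (the numbers are made up for illustration; LSI topic signs also vary between runs):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-D LSI vectors
query = [0.08, 0.57]   # a query vector, roughly like vec_lsi above
doc2 = [0.09, 0.52]    # a document pointing in nearly the same direction
doc5 = [0.70, -0.09]   # a document about an unrelated topic

print(cosine(query, doc2))  # close to 1
print(cosine(query, doc5))  # near 0
```

This matches the pattern in the run above: documents on the query's topic score near 1, unrelated ones hover around (or below) 0.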
[Author: happyprince, http://blog.csdn.net/ld326/article/details/78357172]