The end of December summary
2014-12-30 16:28
Abstract
This month, my work has mainly been on a bilingual topic model, namely PolyLDA. Like monolingual LDA, PolyLDA can be inferred with either Gibbs sampling or variational inference. Because I am familiar with Gibbs sampling, I chose a Gibbs-sampling-based PolyLDA to process the bilingual corpus. The attachment is the final result file from running it on a sample parallel corpus.
PolyLDA
PolyLDA is short for Polylingual LDA. It assumes that a single document has words in multiple languages, and that all language versions of the document share a common distribution over topics. Each topic has a separate facet (a word distribution) for each language; the topics end up consistent across languages because the shared document-level topic distribution links the languages together.
Gibbs & Variational Inference
Variational Inference:
-- MapReduce: an LDA implementation based on variational inference can be run on Hadoop, which handles very large data sets effectively.
Gibbs sampling:
-- Drawback: convergence of the sampler to its stationary distribution is difficult to diagnose, and the sampling algorithm can be slow to converge in high-dimensional models.
Thus, if we want to deal with very large data sets, variational inference is the best choice for us.
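For smaller corpora, the Gibbs-sampling route to PolyLDA mentioned in the abstract can be sketched as below. This is only a minimal illustration, not the code I actually ran and not Mr.LDA's API: the function name `polylda_gibbs`, the symmetric priors `alpha` and `beta`, and the toy corpus are all assumptions made for the example. Each token's topic is resampled with weight proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta), where n_dk is shared by all languages of a document tuple while n_kw, n_k and V are kept per language.

```python
import random

def polylda_gibbs(docs, num_topics, vocab_sizes, alpha=0.1, beta=0.01,
                  iters=100, seed=0):
    """Collapsed Gibbs sampling sketch for polylingual LDA (hypothetical helper).

    docs[d][l] is the list of word ids of document tuple d in language l.
    vocab_sizes[l] is the vocabulary size of language l.
    Returns (n_dk, n_kw, z): document-topic counts, per-language
    topic-word counts, and the final topic assignment of every token.
    """
    rng = random.Random(seed)
    num_langs = len(vocab_sizes)
    # Topic counts per document tuple, shared across all languages.
    n_dk = [[0] * num_topics for _ in docs]
    # Per-language topic-word counts and per-language topic totals.
    n_kw = [[[0] * vocab_sizes[l] for _ in range(num_topics)]
            for l in range(num_langs)]
    n_k = [[0] * num_topics for _ in range(num_langs)]

    # Random initialisation of topic assignments.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for l, words in enumerate(doc):
            z[d].append([])
            for w in words:
                k = rng.randrange(num_topics)
                z[d][l].append(k)
                n_dk[d][k] += 1
                n_kw[l][k][w] += 1
                n_k[l][k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for l, words in enumerate(doc):
                for i, w in enumerate(words):
                    # Remove the token's current assignment from the counts.
                    k = z[d][l][i]
                    n_dk[d][k] -= 1
                    n_kw[l][k][w] -= 1
                    n_k[l][k] -= 1
                    # Unnormalised conditional probability of each topic.
                    weights = [
                        (n_dk[d][t] + alpha) * (n_kw[l][t][w] + beta)
                        / (n_k[l][t] + vocab_sizes[l] * beta)
                        for t in range(num_topics)
                    ]
                    k = rng.choices(range(num_topics), weights=weights)[0]
                    # Record the new assignment.
                    z[d][l][i] = k
                    n_dk[d][k] += 1
                    n_kw[l][k][w] += 1
                    n_k[l][k] += 1
    return n_dk, n_kw, z

# Toy parallel corpus (made up): two languages, word ids instead of strings.
toy_docs = [
    [[0, 1, 1, 2], [0, 0, 1]],
    [[3, 4, 4], [2, 3, 3, 3]],
    [[0, 1, 4], [0, 2]],
]
n_dk, n_kw, z = polylda_gibbs(toy_docs, num_topics=2, vocab_sizes=[5, 4])
```

Because `n_dk` is updated by tokens of every language, a topic that is active in one language of a document is pulled towards the same topic index in the other language, which is what keeps the per-language facets aligned. The sketch does nothing to diagnose convergence, so the drawback noted above still applies.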
Mr.LDA
Mr.LDA is used for multilingual topic modeling with variational inference in MapReduce, which fits well into a distributed environment. Moreover, compared to LDA based on Gibbs sampling, Mr.LDA is easily extensible.
Two main extensions
Informed priors: to guide topic discovery.
PolyLDA: to extract topics from a multilingual corpus.
In the package published on GitHub, PolyLDA is a separate branch of Mr.LDA. PolyLDA runs well and produces a final result, but the result files cannot be read directly. They have to be decoded and extracted with additional tools, whose download link and usage instructions are missing from the README. The README is out of date and only contains instructions for running the monolingual topic model. Unfortunately, this means that so far I can only run the monolingual topic model with Mr.LDA.
However, the paper is very good: it describes in detail the differences between Gibbs sampling and variational inference. According to its comparison, variational inference does better than Gibbs sampling in both execution speed and likelihood. Given its good ideas and powerful source code, I think mastering Mr.LDA is the final step in my study of topic models.
Further work
Mr.LDA
Hadoop: learn it in order to read the Mr.LDA code
Tf-idf