Similarity of texts: The Vector Space Model with Python
2012-12-08 11:50
I'm working on a little task that compares the similarity of text documents. One of the most common methods for doing this is called the Vector Space Model. In short, you map the words from the documents you want to compare onto a vector based on the words found across all documents. Then you find the cosine of the angle between the vectors of the documents you want to compare. This is called the cosine measure. When the cosine measure is 0, the documents have no similarity; a value of 1 means the documents are identical.

I found an example implementation of a basic document search engine by Maciej Ceglowski, written in Perl, here. I thought I'd find the equivalent libraries in Python and code up an implementation.
Parse and stem the documents.
When comparing words, it is important to compare the word stems: e.g., "cat" and "cats" should always be compared as simply "cat". There are a few word stemming algorithms already available. I found an implementation of the Porter Stemming algorithm in Python here. If you want to run the attached file, you'll need to download porter.py.
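The attached porter.py does the real work; as a quick illustration of why stemming matters, here is a toy suffix stripper (not the Porter algorithm, just a hypothetical stand-in) that reduces common plural forms to a shared stem:

```python
def toy_stem(word):
    # NOT the Porter algorithm -- a toy suffix stripper, only to show the idea
    # that "cat" and "cats" should reduce to the same stem before comparison
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"       # ponies -> pony
    if word.endswith("es") and len(word) > 3:
        return word[:-2]             # dishes -> dish
    if word.endswith("s") and len(word) > 2:
        return word[:-1]             # cats -> cat
    return word

print(toy_stem("cats"))    # cat
print(toy_stem("ponies"))  # pony
```

The Porter algorithm handles far more suffix classes than this; the point is only that both documents end up counting the same token.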
I filter the documents through a regular expression to pick out everything composed of a-z, a "-", or a single quote. I also convert the words to lower case. All the words in all the documents are added to a dictionary that keeps track of each word and the number of times it has been used. Before adding a word to the dictionary, I check it against a list of stop words. Words like "I", "am", "you", and "and" make documents appear more related than they really are. I found that a good list of stop words comes with the PostgreSQL tsearch2 full-text indexing module. Maciej pointed out in the Perl implementation that it is important to check the stop words before stemming the word.
import re
import porter

splitter = re.compile("[a-z\-']+", re.I)
stemmer = porter.PorterStemmer()
stop_words = ['i', 'am', 'the', 'you']  # replace with real stop words
all_words = dict()

def add_word(word):
    w = word.lower()  # or you could pass in lower-case words to begin with
    if w not in stop_words:
        ws = stemmer.stem(w, 0, len(w) - 1)
        all_words.setdefault(ws, 0)
        all_words[ws] += 1
Reorganize the master word list
There is probably a better way to do this. Perhaps an object that keeps the keys sorted to begin with, or something. As far as I know, though, Python doesn't have a native dictionary with sorted keys, so I simply create a new dictionary that contains a tuple with the index of the key and the count obtained previously.
key_idx = dict()  # key -> (position, count)
keys = all_words.keys()
keys.sort()
for i in range(len(keys)):
    key_idx[keys[i]] = (i, all_words[keys[i]])
del keys       # not necessary, but I didn't need these any longer
del all_words
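In more recent Python, the same index can be built in one dict comprehension over the sorted items; a sketch with made-up counts standing in for all_words:

```python
# example stemmed counts standing in for the all_words dictionary
all_words = {'appl': 3, 'cat': 5, 'dog': 2}

# position in sorted key order, plus the count carried over
key_idx = {w: (i, c) for i, (w, c) in enumerate(sorted(all_words.items()))}

print(key_idx)  # {'appl': (0, 3), 'cat': (1, 5), 'dog': (2, 2)}
```
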
Map each document onto its own vector
This is why you need the ordered dictionary. Each word from each document maps onto a vector that represents all the words. If the list of all words is "apple", "cat", "dog", and you have a document containing the word "cat", the resulting vector for the document is: [0, 1, 0].
The arrays for all your documents might be really big. Fortunately, NumPy offers a way to represent sparse array data. You can create a zeroed-out vector and then set the values for words individually. For this example, I just use 1 if the word is included in the document. You could instead use the frequency of each word to set values between 0 and 1 for more complex query requirements (like comparing documents against a search query).
from numpy import zeros

def doc_vec(doc):
    v = zeros(len(key_idx))  # returns array([0, 0, 0, ...]) of length len(key_idx)
    for word in splitter.findall(doc):
        # key_idx.get returns (key index, key count) or None
        keydata = key_idx.get(stemmer.stem(word, 0, len(word) - 1).lower(), None)
        if keydata:
            v[keydata[0]] = 1
    return v
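If you do want frequency weighting instead of the 0/1 values, one hypothetical variant looks like this (tf_vector and the inline key_idx are illustrative names, not part of the attached file, and stemming is omitted for brevity):

```python
import re
from collections import Counter

splitter = re.compile(r"[a-z\-']+", re.I)

def tf_vector(doc, key_idx):
    # weight each indexed word by its share of the document's words,
    # so every component falls between 0 and 1
    words = [w.lower() for w in splitter.findall(doc)]
    counts = Counter(words)
    v = [0.0] * len(key_idx)
    for w, c in counts.items():
        if w in key_idx:
            v[key_idx[w][0]] = c / float(len(words))
    return v

key_idx = {'cat': (0, 2), 'dog': (1, 3)}  # made-up index
print(tf_vector("dog dog cat bird", key_idx))  # [0.25, 0.5]
```

Words not in the index (like "bird" above) are simply dropped, the same as in doc_vec.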
Use NumPy to complete the cosine measure calculation
The cosine measure is calculated by taking the dot product of the two vectors and dividing by the product of the norms of the vectors:
cos(A,B) = dot(A,B) / ( || A || * || B || )
To do the vector math, you could implement your own routines, but there is already a good linear algebra implementation for Python: just download NumPy from www.scipy.org.
from numpy import dot
from numpy.linalg import norm

v1 = doc_vec(doc1)
v2 = doc_vec(doc2)
print "Similarity: %s" % float(dot(v1, v2) / (norm(v1) * norm(v2)))
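One edge case that one-liner misses: if a document contains no indexed words at all, its vector is all zeros, the norm is 0, and the division blows up. A plain-Python sketch of the same formula with a guard (cosine is an illustrative helper name, not from the attached file):

```python
import math

def cosine(v1, v2):
    # dot(A, B) / (||A|| * ||B||), returning 0.0 when either vector is all zeros
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    if n1 == 0 or n2 == 0:
        return 0.0  # no indexed words in one document: treat as no similarity
    return dot / (n1 * n2)

print(cosine([0, 1, 0], [0, 1, 0]))  # 1.0
print(cosine([1, 0], [0, 1]))        # 0.0
```
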
I found a handy little online implementation of the cosine measure here that helped verify this was working correctly.
That's it. The attached Python Cosine Measure Implementation has a compare function that takes two documents and returns the similarity value.
import ds2
s = ds2.compare("I like dogs and cats", "My cat runs from dogs.")
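For reference, the whole pipeline can be sketched without the attached files. This self-contained version skips stemming (which is why "cat" and "cats" don't match here) and uses a tiny made-up stop word list:

```python
import re
import math

splitter = re.compile(r"[a-z\-']+", re.I)
stop_words = {'i', 'am', 'the', 'you', 'and', 'my', 'from'}  # made-up list

def tokens(doc):
    # lower-case words with stop words removed; stemming omitted for brevity
    return [w for w in (t.lower() for t in splitter.findall(doc))
            if w not in stop_words]

def compare(doc1, doc2):
    t1, t2 = tokens(doc1), tokens(doc2)
    vocab = sorted(set(t1) | set(t2))          # ordered master word list
    v1 = [1.0 if w in t1 else 0.0 for w in vocab]
    v2 = [1.0 if w in t2 else 0.0 for w in vocab]
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(compare("I like dogs and cats", "My cat runs from dogs."))
```

Without stemming, only "dogs" matches between the two sentences, giving a score of 1/3; add stemming and "cats"/"cat" would match too, pushing the score higher.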