TF-IDF Term Weight Calculation
2017-09-01 16:22
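1. TF-IDF
Term frequency (tf): how often a term occurs in a document. The larger tf is, the more important the term is to that document.
Document frequency (df): how many documents contain the term. The larger df is, the less important the term is.
Term weight formula:
tf-idf = W(t,d) = tf(t,d) * log(N / df(t))
W(t,d): the weight of term t in document d
tf(t,d): the frequency of term t in document d
N: the number of documents
df(t): the number of documents that contain term t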
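As a quick sanity check of the formula, here is a worked example with illustrative numbers: a term that occurs once in a 7-word document (so tf = 1/7) and appears in 2 documents of a 4-document collection (N = 4, df = 2). The natural logarithm is assumed here; a different base only rescales every weight by the same constant.

```latex
\[
W(t,d) = \mathrm{tf}(t,d)\cdot\ln\frac{N}{\mathrm{df}(t)}
       = \frac{1}{7}\cdot\ln\frac{4}{2}
       \approx 0.1429 \times 0.6931
       \approx 0.0990
\]
```

A term that appears in every document gets ln(N/N) = 0, so its weight is zero however often it occurs.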
2. Java Implementation
```java
package com.javacore.algorithm;

import java.util.Arrays;
import java.util.List;

/**
 * Created by bee on 17/3/13.
 * @version 1.0
 * @author blog.csdn.net/napoay
 */
public class TfIdfCal {

    /**
     * Calculate the term frequency.
     * @param doc  word vector of a doc
     * @param term a word
     * @return occurrences of the term divided by the length of the doc
     */
    public double tf(List<String> doc, String term) {
        double termFrequency = 0;
        for (String str : doc) {
            if (str.equalsIgnoreCase(term)) {
                termFrequency++;
            }
        }
        return termFrequency / doc.size();
    }

    /**
     * Calculate the document frequency.
     * @param docs the set of all docs
     * @param term a word
     * @return the number of docs which contain the word
     */
    public int df(List<List<String>> docs, String term) {
        int n = 0;
        // Compare string contents with isEmpty(); term != "" only compares references.
        if (term != null && !term.isEmpty()) {
            for (List<String> doc : docs) {
                for (String word : doc) {
                    if (term.equalsIgnoreCase(word)) {
                        n++;
                        break; // count each doc at most once
                    }
                }
            }
        } else {
            System.out.println("term must not be null or empty");
        }
        return n;
    }

    /**
     * Calculate the inverse document frequency.
     * @param docs the set of all docs
     * @param term a word
     * @return idf; Infinity if no doc contains the term (df == 0)
     */
    public double idf(List<List<String>> docs, String term) {
        int df = df(docs, term); // compute once and reuse
        System.out.println("N:" + docs.size());
        System.out.println("DF:" + df);
        return Math.log(docs.size() / (double) df);
    }

    /**
     * Calculate tf-idf.
     * @param doc  a doc
     * @param docs document set
     * @param term a word
     * @return the tf-idf weight of the term in the doc
     */
    public double tfIdf(List<String> doc, List<List<String>> docs, String term) {
        return tf(doc, term) * idf(docs, term);
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("人工", "智能", "成为", "互联网", "大会", "焦点");
        List<String> doc2 = Arrays.asList("谷歌", "推出", "开源", "人工", "智能", "系统", "工具");
        List<String> doc3 = Arrays.asList("互联网", "的", "未来", "在", "人工", "智能");
        List<String> doc4 = Arrays.asList("谷歌", "开源", "机器", "学习", "工具");
        List<List<String>> documents = Arrays.asList(doc1, doc2, doc3, doc4);

        TfIdfCal calculator = new TfIdfCal();
        System.out.println(calculator.tf(doc2, "开源"));
        System.out.println(calculator.df(documents, "开源"));
        double tfidf = calculator.tfIdf(doc2, documents, "谷歌");
        System.out.println("TF-IDF (谷歌) = " + tfidf);
        // Hand-computed cross-check: ln(N / df) * tf with N = 4, df = 2, tf = 1/7.
        System.out.println(Math.log(4 / 2) * 1.0 / 7);
    }
}
```

Output:
0.14285714285714285
2
N:4
DF:2
TF-IDF (谷歌) = 0.09902102579427789
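One case the formula leaves open is a term that no document contains: df(t) is then 0 and N / df(t) is undefined. The sketch below is one possible way to handle that, and it also computes the weight of every term in a document in a single call. The class `TfIdfSketch` and its method `weights` are hypothetical names introduced only for this illustration, and returning 0 for an unseen term is an assumption rather than part of the definition above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration; not part of the original post.
class TfIdfSketch {

    // Relative term frequency: occurrences of the term divided by document length.
    static double tf(List<String> doc, String term) {
        if (doc.isEmpty()) {
            return 0.0; // avoid 0/0 on an empty document
        }
        long count = doc.stream().filter(term::equalsIgnoreCase).count();
        return (double) count / doc.size();
    }

    // Number of documents that contain the term at least once.
    static long df(List<List<String>> docs, String term) {
        return docs.stream()
                .filter(doc -> doc.stream().anyMatch(term::equalsIgnoreCase))
                .count();
    }

    // tf * ln(N / df). Assumption: a term found in no document gets weight 0.
    static double tfIdf(List<String> doc, List<List<String>> docs, String term) {
        long df = df(docs, term);
        if (df == 0) {
            return 0.0;
        }
        return tf(doc, term) * Math.log((double) docs.size() / df);
    }

    // Weight of every distinct term in doc, in order of first appearance.
    static Map<String, Double> weights(List<String> doc, List<List<String>> docs) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String term : doc) {
            result.putIfAbsent(term, tfIdf(doc, docs, term));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("人工", "智能", "成为", "互联网", "大会", "焦点");
        List<String> doc2 = Arrays.asList("谷歌", "推出", "开源", "人工", "智能", "系统", "工具");
        List<String> doc3 = Arrays.asList("互联网", "的", "未来", "在", "人工", "智能");
        List<String> doc4 = Arrays.asList("谷歌", "开源", "机器", "学习", "工具");
        List<List<String>> documents = Arrays.asList(doc1, doc2, doc3, doc4);

        // Print one "term<TAB>weight" line per distinct term of doc2.
        weights(doc2, documents).forEach((term, w) -> System.out.println(term + "\t" + w));
    }
}
```

With the four sample documents, terms that occur in only one document (such as "推出") receive the largest weight in doc2, while "人工" and "智能", which occur in three of the four documents, receive the smallest.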
Reference: http://blog.csdn.net/napoay/article/details/65449877