TF-IDF
2015-11-23 21:18
TF-IDF is short for term frequency-inverse document frequency.
The tf-idf value increases proportionally with the number of times a word appears in a document, but is offset by the frequency of the word in the corpus, which helps adjust for the fact that some words appear more frequently in general.
Term Frequency
The number of times a term occurs in a document is called its term frequency.
Inverse document frequency
Suppose we want to rank a set of documents by relevance to the query "the brown cow". A simple approach is to weight each document by how often the query terms occur in it. However, because the term "the" is so common, this tends to incorrectly emphasize documents that happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword for distinguishing relevant from non-relevant documents, unlike the less common words "brown" and "cow". Hence an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.
The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.
Step 1: Compute the term frequency.
Term frequency (TF) = number of times the word occurs in the document
Since documents vary in length, the term frequency is normalized so that different documents can be compared:
Term frequency (TF) = number of times the word occurs in the document / total number of words in the document
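The normalized term frequency above can be sketched in Python (a minimal illustration assuming whitespace tokenization; the function name is my own):

```python
def term_frequency(word, document):
    """TF = occurrences of the word in the document / total words in the document."""
    words = document.split()
    return words.count(word) / len(words)

doc = "the brown cow jumped over the brown fence"
print(term_frequency("brown", doc))  # 2 of 8 words -> 0.25
```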
Step 2: Compute the inverse document frequency.
Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1))
The more common a word is, the larger the denominator, and the smaller the IDF, approaching 0. The +1 in the denominator avoids division by zero when no document contains the word.
Step 3: Compute TF-IDF.
TF-IDF = term frequency (TF) * inverse document frequency (IDF)
As we can see, TF-IDF is proportional to how often a word appears in a document, and inversely related to how often the word appears across the corpus as a whole.
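Putting the three steps together, here is a minimal Python sketch (assuming whitespace tokenization and the IDF formula given above; corpus and names are illustrative):

```python
import math

def tf(word, document):
    # Step 1: term frequency, normalized by document length
    words = document.split()
    return words.count(word) / len(words)

def idf(word, corpus):
    # Step 2: IDF = log(total documents / (documents containing the word + 1))
    containing = sum(1 for doc in corpus if word in doc.split())
    return math.log(len(corpus) / (containing + 1))

def tf_idf(word, document, corpus):
    # Step 3: TF-IDF = TF * IDF
    return tf(word, document) * idf(word, corpus)

corpus = [
    "the brown cow eats grass",
    "the cat sat on the mat",
    "the dog chased the cat",
]
# "the" appears in every document, so its TF-IDF is much lower than that of "brown"
print(tf_idf("the", corpus[0], corpus))
print(tf_idf("brown", corpus[0], corpus))
```

Note that with the +1 smoothing in the denominator, a word that appears in every document gets a slightly negative IDF rather than exactly 0; some implementations smooth differently, e.g. by also adding 1 to the numerator.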
The advantages of the TF-IDF algorithm are that it is simple and fast, and its results match reality reasonably well. Its drawback is that measuring a word's importance purely by term frequency is not comprehensive: sometimes important words do not occur very often. Moreover, the algorithm cannot capture positional information; a word appearing early in a document and a word appearing late are treated as equally important, which is not correct.