
TF-IDF Calculation, Part 3

2011-08-31 00:14
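Inverse document frequency (IDF) measures how important a term is across the whole corpus: the fewer documents a term appears in, the higher its IDF. To compute the IDF of a given term, divide the total number of documents by the number of documents that contain the term, then take the logarithm of the quotient.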

$$\mathrm{IDF}_i = \log \frac{N}{n_i}$$

Here $N$ is the total number of documents in the corpus, and $n_i$ is the number of documents that contain term $i$.
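To make the formula concrete, here is a minimal sketch that evaluates it for two cases. The class name `IdfFormulaDemo`, the helper `idf`, and the sample numbers are made up for illustration; the natural logarithm is assumed because the listing in this post uses `Math.log`.

```java
class IdfFormulaDemo {
    // IDF_i = ln(N / n_i). Hypothetical helper, not part of the post's own class.
    public static double idf(int totalDocs, int docsWithTerm) {
        if (docsWithTerm <= 0 || docsWithTerm > totalDocs) {
            throw new IllegalArgumentException("need 0 < n_i <= N");
        }
        // Cast to double so the division is not truncated to an integer.
        return Math.log((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        // A term that appears in every document carries no weight: ln(1) = 0.0
        System.out.println(idf(1000, 1000));
        // A term found in only 10 of 1000 documents: ln(100), about 4.605
        System.out.println(idf(1000, 10));
    }
}
```

A different logarithm base would only scale every IDF value by the same constant, so the ranking of terms stays the same.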

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;

class IDFs
{
    // Holds term -> document count while counting, then term -> IDF value.
    HashMap<String, Double> IDFsList = new HashMap<String, Double>();
    // The corpus: one inner list of terms per document.
    ArrayList<ArrayList<String>> IDFsMainFileList = new ArrayList<ArrayList<String>>();
    // Every distinct term, in the order it was first seen.
    ArrayList<String> IDFsave = new ArrayList<String>();
    // N: the total number of documents in the corpus.
    int Ncount;

    public IDFs(ArrayList<ArrayList<String>> idf)
    {
        IDFsMainFileList = idf;
        Ncount = IDFsMainFileList.size();
    }

    // Despite its name, this method prints nothing: it computes and returns the IDF of every term.
    public HashMap<String, Double> PrintIDFs()
    {
        // Start from a clean state so that a second call does not add counts to old IDF values.
        IDFsList.clear();
        IDFsave.clear();
        for (int i = 0; i < IDFsMainFileList.size(); i++)
        {
            ArrayList<String> IDFsSubFileList = IDFsMainFileList.get(i);
            // Terms already counted for this document; each document counts a term at most once.
            HashSet<String> seen = new HashSet<String>();
            for (int j = 0; j < IDFsSubFileList.size(); j++)
            {
                String term = IDFsSubFileList.get(j);
                // add() returns false when the term was already seen in this document.
                if (seen.add(term))
                {
                    if (!IDFsList.containsKey(term))
                    {
                        // First document that contains this term.
                        IDFsList.put(term, 1.0);
                        IDFsave.add(term);
                    }
                    else
                    {
                        // One more document contains this term.
                        IDFsList.put(term, IDFsList.get(term) + 1);
                    }
                }
            }
        }
        // Replace each document count n_i with ln(N / n_i); Math.log is the natural logarithm.
        for (int k = 0; k < IDFsave.size(); k++)
        {
            double nTotal = IDFsList.get(IDFsave.get(k));
            double idfs = Math.log(Ncount / nTotal);
            IDFsList.put(IDFsave.get(k), idfs);
        }
        return IDFsList;
    }
}
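For comparison, the same counting can be sketched more compactly with `HashSet` and `Map.merge`. This is an alternative illustration, not the class above: the names `IdfCorpusDemo` and `computeIdf` and the three-document corpus are invented for the example, and it assumes every document is already split into a list of terms.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class IdfCorpusDemo {
    // Returns term -> ln(N / n_i) for every term that occurs in the corpus.
    public static Map<String, Double> computeIdf(List<List<String>> docs) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            // The set collapses repeats, so each document counts a term once.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            idf.put(e.getKey(), Math.log((double) docs.size() / e.getValue()));
        }
        return idf;
    }

    public static void main(String[] args) {
        // Made-up corpus, for illustration only.
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("apple", "banana", "apple"),
            Arrays.asList("apple", "cherry"),
            Arrays.asList("apple", "banana"));
        Map<String, Double> idf = computeIdf(docs);
        System.out.println("apple  -> " + idf.get("apple"));  // in 3 of 3 documents: ln(1) = 0.0
        System.out.println("banana -> " + idf.get("banana")); // in 2 of 3 documents: ln(1.5)
        System.out.println("cherry -> " + idf.get("cherry")); // in 1 of 3 documents: ln(3)
    }
}
```

Keeping the document counts in a separate integer map avoids reusing one map for both counts and IDF values, which is what makes a repeated call on the original class risky.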