
Analyzing word frequency in an English txt file

2014-10-21 08:20

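Requirements:

Write a program that counts word frequencies in an English text in txt format and prints the ten most frequent words. The text size is 30K to 300K.

Steps:

1. Read a txt text file;
2. Count the words that appear in the text and the number of times each one appears;
3. Define an array containing English adverbs, pronouns, articles, prepositions and other words that carry no real meaning;
4. Sort the words that were read and print the 10 most frequent ones.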
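
The steps above can be sketched compactly as follows. This is a minimal illustration, not the program presented in this post: the class name `WordFrequencySketch`, the method `topWords`, and the four-word stop list are hypothetical, step 1 is replaced by an in-memory string, and words are lower-cased before counting (an assumption that the requirements do not state).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordFrequencySketch {
    // Step 3: hypothetical, deliberately tiny stop-word list for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and"));

    /** Steps 2 and 4: count the words, skip stop words, return the n most frequent. */
    public static List<Map.Entry<String, Integer>> topWords(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        Matcher matcher = Pattern.compile("[a-zA-Z]+").matcher(text);
        while (matcher.find()) {
            String word = matcher.group().toLowerCase(); // assumption: ignore case
            if (!STOP_WORDS.contains(word)) {
                counts.merge(word, 1, Integer::sum); // add 1 to the word's count
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken alphabetically to keep the order deterministic.
        entries.sort((left, right) -> {
            int byCount = right.getValue().compareTo(left.getValue());
            return byCount != 0 ? byCount : left.getKey().compareTo(right.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Step 1 (reading the file) is replaced by a fixed string in this sketch.
        String text = "The cat and the dog. The cat sat.";
        for (Map.Entry<String, Integer> entry : topWords(text, 2)) {
            System.out.println(entry.getKey() + " \t " + entry.getValue());
        }
    }
}
```

With the sample string, "cat" is counted twice and "dog" and "sat" once each, so the two entries returned are "cat" followed by "dog".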

Programming language: Java

Test file: D:\\test1.txt 419K

Performance testing tool: VisualVM 1.3.8

Program code:

Array of English adverbs, pronouns, articles, prepositions and other words that carry no real meaning:

String strA [] = {"your","had","I","their","not","ago","him","men","day","eighty","able","only","still","In","man","The","will",
"you","years","year","whose","what","with","yours","yes","a","an","are","all","any","been","both","each","either","one","two",
"three","four","five","six","seven","eight","nine","ten","none","little","few","many","much","other","another","some","no",
"every","nobody","anybody","somebody","everybody","when","where","how","who","there","where","is","was","were","do","did",
"this","that","in","on","at","as","first","second","third","fourth","fifth","sixth","ninth","above","over","below","under",
"beside","behind","of","the","after","from","since","for","which","by","next","last","tomorrow","yesterday","before","because",
"against","except","beyond","along","among","but","so","towards","to","it","me","i","he","she","his","they","them","her","its",
"and","has","have","my","would","then","too","or","our","off","we","be","into","well","can","having","being","even","us","these",
"those","if","ours"};


Full code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class wordCount {
    public static void main(String[] args) throws Exception {

        long time1 = System.currentTimeMillis();
        // Stop words: adverbs, pronouns, articles, prepositions and similar words to exclude
        String strA [] = {"your","had","I","their","not","ago","him","men","day","eighty","able","only","still","In","man","The","will","you",
                "years","year","whose","what","with","yours","yes","a","an","are","all","any","been","both","each","either","one","two","three","four","five",
                "six","seven","eight","nine","ten","none","little","few","many","much","other","another","some","no","every","nobody","anybody","somebody",
                "everybody","when","where","how","who","there","where","is","was","were","do","did","this","that","in","on","at","as","first","second","third",
                "fourth","fifth","sixth","ninth","above","over","below","under","beside","behind","of","the","after","from","since","for","which","by","next",
                "last","tomorrow","yesterday","before","because","against","except","beyond","along","among","but","so","towards","to","it","me","i","he","she",
                "his","they","them","her","its","and","has","have","my","would","then","too","or","our","off","we","be","into","well","can","having","being",
                "even","us","these","those","if","ours"};
        BufferedReader reader = new BufferedReader(new FileReader(
                "D:\\text1.txt"));
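        StringBuffer buffer = new StringBuffer();
        String line = null;
        while ((line = reader.readLine()) != null) {
            // readLine() strips the line terminator, so append a space to keep the
            // last word of one line from merging with the first word of the next
            buffer.append(line).append(' ');
        }
        reader.close();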
        Pattern expression = Pattern.compile("[a-zA-Z]+"); // regular expression that matches a word
        String string = buffer.toString();
        Matcher matcher = expression.matcher(string);
        Map<String, Integer> map = new TreeMap<String, Integer>();
        String word = "";
        int times = 0;
        while (matcher.find()) { // while another word is found
            word = matcher.group(); // the word is the key in the tree map
            boolean isStopWord = false;
            for (int i = 0; i < strA.length; i++) {
                if (word.equals(strA[i])) {
                    isStopWord = true;
                    break;
                }
            }
            if (isStopWord) {
                continue; // skip stop words instead of counting them under the key ""
            }
            if (map.containsKey(word)) { // key present: the word has been seen before
                times = map.get(word); // current count for the word
                map.put(word, times + 1);
            } else {
                map.put(word, 1); // first occurrence: add the word to the map
            }
        }
        /*
         * Core idea: to sort the TreeMap by value rather than by key, put the Map.Entry
         * objects into a list, write a comparator, and sort with
         * Collections.sort(list, comparator).
         */
        List<Map.Entry<String, Integer>> list = new ArrayList<Map.Entry<String, Integer>>(map.entrySet());
        /*
         * Custom comparator: compares the word counts (values)
         */
        Comparator<Map.Entry<String, Integer>> comparator = new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> left, Map.Entry<String, Integer> right) {
                return (left.getValue()).compareTo(right.getValue());
            }
        };
        Collections.sort(list, comparator); // ascending by count
        // Print the ten most frequent words, starting from the end of the sorted list
        int last = list.size() - 1;
        for (int i = last; i > last - 10 && i >= 0; i--) {
            System.out.print("Top" + (last - i + 1) + " : ");
            System.out.println(list.get(i).getKey() + " \t " + list.get(i).getValue());
        }
        long time2 = System.currentTimeMillis();
        System.out.println("Elapsed time:");
        System.out.println(time2 - time1 + "ms");
    }

}
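
The value-sorting step in the listing above can also be written with the comparator helpers available since Java 8. The sketch below is only an illustration: the class name `SortByValueSketch` and the method `byCountDescending` are hypothetical. It sorts in descending order so that the most frequent word comes first and the top entries can be read from the front of the list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortByValueSketch {
    /** Returns the map's entries ordered by count, largest first. */
    public static List<Map.Entry<String, Integer>> byCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // comparingByValue() orders ascending; reversed() flips it to descending.
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("apple", 1);
        counts.put("banana", 3);
        counts.put("cherry", 2);
        for (Map.Entry<String, Integer> entry : byCountDescending(counts)) {
            System.out.println(entry.getKey() + " \t " + entry.getValue());
        }
    }
}
```

Because `List.sort` is stable and a TreeMap iterates its keys in sorted order, words with equal counts stay in alphabetical order.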


Run result:

Performance test:

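Analysis and shortcomings:

A StringBuffer holds the text read from the file, and each word extracted from it is checked against the array strA[ ] so that words matching an array entry are dropped.
This check loops over the array for every word, which adds considerably to the program's running time, and strA[ ] does not list all of the English adverbs, pronouns, prepositions, articles and other words that carry no real meaning.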
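
One way to avoid the per-word loop described above is to keep the stop words in a `HashSet`, whose `contains` check takes constant time on average instead of scanning every array element. The sketch below is a hedged illustration, not a measured improvement: the class name `StopWordSetSketch`, the method `isStopWord`, and the abbreviated word list are hypothetical, and the lower-casing is an added assumption (the comparison in the program is case-sensitive).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordSetSketch {
    // Abbreviated, hypothetical list; the full contents of strA[] would go here.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "in", "and"));

    /** Returns true if the word should be left out of the frequency count. */
    public static boolean isStopWord(String word) {
        // Lower-casing first means "The" and "the" need only one entry in the set.
        return STOP_WORDS.contains(word.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("The"));   // true
        System.out.println(isStopWord("whale")); // false
    }
}
```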

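Directions for improvement and extension:

The program can also compute word frequencies for larger files, although the running time may increase. Optimizing the data structure that stores the words, or improving the method used to look up stop words, could also reduce the running time. Beyond that, the contents of strA[ ] should be completed so that the final result is what we actually need.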
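
One possible direction for larger files, sketched under assumptions: count the words line by line instead of first collecting the whole text in one buffer, and store the counts in a `HashMap`. The class name `StreamingCountSketch` and the method `countWords` are hypothetical, and stop-word filtering is left out to keep the example short. For a real file the lines could come from `java.nio.file.Files.lines(...)`; the example feeds in two fixed lines instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class StreamingCountSketch {
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    /** Counts the words of each line in turn, so only one line is held at a time. */
    public static Map<String, Integer> countWords(Stream<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        lines.forEach(line -> {
            Matcher matcher = WORD.matcher(line);
            while (matcher.find()) {
                // merge() inserts 1 for a new word or adds 1 to the existing count
                counts.merge(matcher.group(), 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(Stream.of("to be or", "not to be"));
        System.out.println("to = " + counts.get("to"));
        System.out.println("or = " + counts.get("or"));
    }
}
```

Whether this shortens the running time for the 419K test file would have to be measured; the sketch only shows the structure.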