Analyzing word frequency in an English txt file
2014-10-21 08:20
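Requirements:
Write a program that counts word frequencies in an English text file in txt format and prints the ten most frequent words. The text is 30K to 300K in size. Steps:
1. Read a txt text file;
2. Count the words that appear in the text and the number of times each appears;
3. Define an array of adverbs, pronouns, articles, prepositions, and other words that carry little meaning on their own, so they can be excluded from the count;
4. Sort the words that were read and print the 10 most frequent ones.
Programming language: Java
Test file: D:\\test1.txt (419K)
Profiling tool: VisualVM 1.3.8
Program code:
The array of adverbs, pronouns, articles, prepositions, and other low-content English words: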
String strA [] = {"your","had","I","their","not","ago","him","men","day","eighty","able","only","still","In","man","The","will",
"you","years","year","whose","what","with","yours","yes","a","an","are","all","any","been","both","each","either","one","two",
"three","four","five","six","seven","eight","nine","ten","none","little","few","many","much","other","another","some","no",
"every","nobody","anybody","somebody","everybody","when","where","how","who","there","where","is","was","were","do","did",
"this","that","in","on","at","as","first","second","third","fourth","fifth","sixth","ninth","above","over","below","under",
"beside","behind","of","the","after","from","since","for","which","by","next","last","tomorrow","yesterday","before","because",
"against","except","beyond","along","among","but","so","towards","to","it","me","i","he","she","his","they","them","her","its",
"and","has","have","my","would","then","too","or","our","off","we","be","into","well","can","having","being","even","us","these",
"those","if","ours"};
Full code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class wordCount {
    public static void main(String[] args) throws Exception {
        long time1 = System.currentTimeMillis();
        // Stop words: adverbs, pronouns, articles, prepositions and similar low-content words
        String strA [] = {"your","had","I","their","not","ago","him","men","day","eighty","able","only","still","In","man","The","will","you",
            "years","year","whose","what","with","yours","yes","a","an","are","all","any","been","both","each","either","one","two","three","four","five",
            "six","seven","eight","nine","ten","none","little","few","many","much","other","another","some","no","every","nobody","anybody","somebody",
            "everybody","when","where","how","who","there","where","is","was","were","do","did","this","that","in","on","at","as","first","second","third",
            "fourth","fifth","sixth","ninth","above","over","below","under","beside","behind","of","the","after","from","since","for","which","by","next",
            "last","tomorrow","yesterday","before","because","against","except","beyond","along","among","but","so","towards","to","it","me","i","he","she",
            "his","they","them","her","its","and","has","have","my","would","then","too","or","our","off","we","be","into","well","can","having","being",
            "even","us","these","those","if","ours"};
        // Open the test file named in the setup above
        BufferedReader reader = new BufferedReader(new FileReader("D:\\test1.txt"));
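        StringBuffer buffer = new StringBuffer();
        String line = null;
        while ((line = reader.readLine()) != null) {
            // readLine() strips the line terminator, so re-add a separator; otherwise the
            // last word of one line fuses with the first word of the next
            buffer.append(line).append('\n');
        }
        reader.close();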
        Pattern expression = Pattern.compile("[a-zA-Z]+"); // regular expression that matches one word
        String string = buffer.toString();
        Matcher matcher = expression.matcher(string);
        Map<String, Integer> map = new TreeMap<String, Integer>();
        String word = "";
        int times = 0;
        while (matcher.find()) { // true while another word is found
            word = matcher.group(); // the matched word, used as the map key
            boolean isStopWord = false;
            for (int i = 0; i < strA.length; i++) {
                if (word.equals(strA[i])) { // case-sensitive comparison
                    isStopWord = true;
                    break;
                }
            }
            if (isStopWord) {
                continue; // skip stop words instead of counting them under an empty key
            }
            if (map.containsKey(word)) { // the word has been seen before
                times = map.get(word); // current count for this word
                map.put(word, times + 1);
            } else {
                map.put(word, 1); // first occurrence: add it to the map
            }
        }
        /*
         * Core idea: a TreeMap is ordered by key, not by value. To order by value,
         * copy the Map.Entry objects into a list, define a comparator, and sort
         * with Collections.sort(list, comparator).
         */
        List<Map.Entry<String, Integer>> list = new ArrayList<Map.Entry<String, Integer>>(map.entrySet());
        /*
         * Comparator that compares entries by word count (the value), ascending
         */
        Comparator<Map.Entry<String, Integer>> comparator = new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> left, Map.Entry<String, Integer> right) {
                return (left.getValue()).compareTo(right.getValue());
            }
        };
        Collections.sort(list, comparator); // sort ascending by count
        // Print up to ten entries, starting from the highest count at the end of the list
        int last = list.size() - 1;
        for (int i = last; i >= 0 && i > last - 10; i--) {
            String key = list.get(i).getKey();
            Integer value = list.get(i).getValue();
            System.out.println("Top" + (last - i + 1) + " : " + key + " \t " + value);
        }
        long time2 = System.currentTimeMillis();
        System.out.println("Elapsed time:");
        System.out.println(time2 - time1 + "ms");
    }
}
Output:
Performance test:
Analysis and shortcomings:
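The program uses a StringBuffer to hold the text read from the file, then checks every word against the array strA[ ] and discards the words that appear in it. That check loops over the array for each word in the text, which adds considerably to the running time. In addition, strA[ ] does not list every English adverb, pronoun, preposition, article, and other low-content word.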
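One common way to avoid the per-word loop over strA[ ] is to load the stop words into a HashSet, where a lookup takes roughly constant time on average. The following is only a minimal sketch under stated assumptions, not the program used in this post: the class name StopWordFilter, the method isStopWord, and the shortened word list are hypothetical, and lowercasing each word before the lookup is an added choice that the original code does not make.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopWordFilter {
    // Shortened, illustrative list (assumption); the full strA[] contents would go here.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "of", "in", "and", "to", "i"));

    // Returns true if the word, compared in lowercase, is in the stop-word set.
    public static boolean isStopWord(String word) {
        return STOP_WORDS.contains(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(isStopWord("The"));
        System.out.println(isStopWord("whale"));
    }
}
```

With a set, the cost of filtering no longer grows with the length of the stop-word list, and lowercasing means "The" and "the" need only one entry.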
Directions for improvement and extension:
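The program can also count word frequencies in larger files, though the running time will likely increase. Optimizing the data structure that stores the words, or improving the method used to look words up, would also reduce the running time. Finally, filling out the contents of strA[ ] would bring the final result closer to what we actually want.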
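As one possible direction for larger files, the text can be counted line by line instead of being buffered whole, with a HashMap in place of the TreeMap because key order is not needed before the sort by value. This is a hedged sketch rather than a measured improvement: it assumes Java 8 or later (for Map.merge and Map.Entry.comparingByValue), the names StreamingWordCount, count, and top are hypothetical, and stop-word filtering is left out to keep it short.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamingWordCount {
    private static final Pattern WORD = Pattern.compile("[a-zA-Z]+");

    // Counts words one line at a time, so the whole file is never held in memory.
    public static Map<String, Integer> count(BufferedReader reader) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            Matcher m = WORD.matcher(line);
            while (m.find()) {
                // Insert 1 for a new word, otherwise add 1 to the existing count.
                counts.merge(m.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns at most n entries, ordered by descending count.
    public static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample text (assumption) standing in for a FileReader on the test file.
        String sample = "whale sea whale ship\nsea whale";
        Map<String, Integer> counts = count(new BufferedReader(new StringReader(sample)));
        for (Map.Entry<String, Integer> e : top(counts, 10)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

To run it on a real file, the StringReader in main could be swapped for a FileReader on the file's path.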