您的位置:首页 > 其它

找出一个文件中出现次数最多的10个单词

2009-04-23 10:51 555 查看
记得当初面试的时候,一哥们给我出的上机题。

咋一下感觉很简单,可做着觉得有了些问题,反正当时做的挺糟糕的。

这几天又时间,又把这个问题总结了下。

总体来说,它涉及到两点主要的算法问题: 查找(相同单词,统计出现次数),排序(按 次数排序)

这个题目有几点很容易忽略:

1)单靠出现次数能唯一确定前10个单词么?如果有个数重复的怎么办? 随机抽取? 那么程序的输出岂不是也随机? 但算法的一个很重要的概念就是 对于特定的输入,应该有固定的输出。否则谁敢用你的算法。

2)既然一个纬度不足够唯一确定结果,则需要再添加一个纬度。比如,当出现次数一致的情况下,按字母顺序排序。这样就可以了。

3)所以,一定要排序。我当时的想法时,首先统计出所有出现的单词以及他们出现的次数。(这个可以有很多种方法,但用hashtable是最方便的应该)。 然后呢? 选定前10个作为基准值,按顺序排好,从第11个开始,逐个同10个当中按顺序比较,如果遇见比它小的,则从当前位置把元素往后移动(最后一个被挤出去),然后把它加在这个空出的位置上。虽然实现了,但是结果是不确定的。取决与他们出现在hashtable中的次序。而且,反复移动,效率也不高

4)现在觉得,算法一定得排序,然后才能得到唯一确定的输出结果。

下面是用C#实现的算法,基本需求的是根据输入文件,把统计后的单词按出现次数从小到大写到输入文件中

class Program
{
static void Main(string[] args)
{
QuickSortUtil sortUitl = new QuickSortUtil();
sortUitl.CountWord(@"../data/input.txt", @"../data/output.txt");
Console.WriteLine("Done!");
Console.Read();
}
}

class QuickSortUtil
{
public void CountWord(string inputFilePath, string outputFilePath)
{
Hashtable allWords = getAllWords(inputFilePath);
List<WordInfo> allWordInfos = new List<WordInfo>();
foreach (string key in allWords.Keys)
{
allWordInfos.Add(new WordInfo(key, (int)allWords[key]));
}
qucikSort(allWordInfos, 0, allWordInfos.Count - 1);
writeToFile(allWordInfos, outputFilePath);
}

private Hashtable getAllWords(string filePath)
{
//since if the hashtable need to increase it's capacity frequently,
//it will affect the efficence,so give it a large enough basic capacity
Hashtable allWords = new Hashtable(10240);
using (StreamReader sr = new StreamReader(filePath, Encoding.Default))
{
string line = null;
char[] seperators = new char[] { ' ', ',', ';', '.','!','"' };
string[] words = null;
while ((line = sr.ReadLine()) != null)
{
//you can decide whether to add the ToLower() function
line = line.ToLower();
//do use the StringSplitOptions.RemoveEmptyEntries when use Split
words = line.Split(seperators, StringSplitOptions.RemoveEmptyEntries);
if (words != null && words.Length > 0)
{
for (int i = 0; i < words.Length; i++)
{
if (allWords.ContainsKey(words[i]))
{
allWords[words[i]] = (int)allWords[words[i]] + 1;
}
else
{
allWords.Add(words[i], 1);
}
}
}
}
}

return allWords;
}

private void qucikSort(List<WordInfo> allWordInfos, int low, int high)
{
if (low >= high)
{
return;
}

//use the local varaiable for index.
//do not change the original low and high because we will use them later
int pLow = low;
int pHigh = high;
//record the stand value
WordInfo value = allWordInfos[low];

while (pLow < pHigh)
{
//note: if two value equals, do not move.
//we only swap the value if it's bigger than the stand value
while ((WordInfo.Compare(allWordInfos[pHigh],value) >=0) && pHigh > pLow)
{
pHigh--;
}
//so here, if pHigh == pLow, the "<" check will be false
if (WordInfo.Compare(allWordInfos[pHigh], value) < 0)
{
allWordInfos[pLow] = allWordInfos[pHigh];
allWordInfos[pHigh] = value;
pLow++;
}

while ((WordInfo.Compare(allWordInfos[pLow],value) <= 0) && pHigh > pLow)
{
pLow++;
}
if (WordInfo.Compare(allWordInfos[pLow], value) > 0)
{
allWordInfos[pHigh] = allWordInfos[pLow];
allWordInfos[pLow] = value;
pHigh--;
}
}

//pLow and pHigh is expected to be equal here
System.Diagnostics.Trace.Assert(pLow == pHigh);

qucikSort(allWordInfos, low, pLow - 1);
qucikSort(allWordInfos, pLow + 1, high);
}

private void writeToFile(List<WordInfo> allWordInfos, string outputFilePath)
{
using (StreamWriter sw = new StreamWriter(outputFilePath, false, Encoding.Default))
{
foreach (WordInfo wi in allWordInfos)
{
sw.WriteLine("{0}:{1}", wi.Word, wi.Count);
}
}
}
}

public class WordInfo
{
private string _word = string.Empty;

public string Word
{
get { return _word; }
set { _word = value; }
}
private int _count = 0;

public int Count
{
get { return _count; }
set { _count = value; }
}

public WordInfo()
{
}

public WordInfo(string word, int count)
{
this._word = word;
this._count = count;
}

public WordInfo(WordInfo info)
{
this._word = info.Word;
this._count = info.Count;
}

public static int Compare(WordInfo info1, WordInfo info2)
{

if (info1 == null || info2 == null)
{
if (info1 == null)
{
if (info2 == null)
{
return 0;
}
else
{
return -1;
}
}

return 1;
}

if (info1.Count > info2.Count)
{
return 1;
}

if (info1.Count < info2.Count)
{
return -1;
}

return string.Compare(info1.Word, info2.Word);
}
}

算法中几点注意的:

1)使用hashtable时给它设置个相对合理的初始容量。否则hashtable扩容时需要重新对以前的元素计算位置,很影响效率

2)排序的时候用快速排序

3)Split字符串的时候,自动的去掉空内容StringSplitOptions.RemoveEmptyEntries

4)是否考虑字符串的大小写。 即大小写不一样的单词是否视为同一单词
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: 
相关文章推荐