
The Ultimate Comeback: A Complete Guide to Learning American English. The Latest American English Word Frequency List

2012-08-09 11:16
The Latest American English Word Frequency List
2010-04-10 13:04

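(Added April 13: Over the past two days I checked this word list's coverage against some articles from the web and a set of GMAT materials. Its 20,000 words really do cover almost everything; only one or two rare words could not be found. The word families of its first 5,000 words probably add up to more than 10,000 words, and anyone who can use them fluently already has quite good English.)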
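The coverage check described in the note above can be approximated in a few lines. This is a minimal sketch, not the procedure actually used: the function name `coverage`, the sample sentence, and the tiny word list are all hypothetical, and it matches lowercase surface forms directly. The real list is made of lemmas, so a serious check would lemmatize the text first.

```python
import re

def coverage(text, wordlist):
    """Return (share of tokens found in wordlist, sorted list of missing words)."""
    known = {w.lower() for w in wordlist}
    # Tokenize into runs of letters, allowing one internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0, []
    covered = sum(1 for t in tokens if t in known)
    missing = sorted({t for t in tokens if t not in known})
    return covered / len(tokens), missing

# Hypothetical sample data, for illustration only.
sample_text = "The court ruled that the company broke the law."
sample_list = ["the", "court", "that", "company", "law", "broke"]

share, missing = coverage(sample_text, sample_list)
print(f"coverage: {share:.1%}, missing: {missing}")
```

Running it against a longer article and the downloaded 5,000-word list would show which words fall outside the list, which is the kind of spot check the note describes.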

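I am preparing for an MBA program that starts in August, so lately I have been searching online for word lists to build up my vocabulary. Many of the obscure words in GMAT and GRE word lists appear only in specialized writing and are rarely used in everyday study and life, so studying them is inefficient. I then found a 6,138-word frequency list that is very popular online, but my head was spinning before I got through it: for one thing it is drawn from British English, and for another its spellings are old-fashioned, with words such as "whilst". In modern American usage, "whilst" surely ranks outside the top 20,000 words, which shows how dated that list is. Persistence paid off, and I finally found a recent word list derived from the Corpus of Contemporary American English (COCA).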

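COCA, the Corpus of Contemporary American English, is the largest American linguistics research project of this century, with a standing comparable to that of the influential British National Corpus (BNC). Most of the English frequency lists we use today come from the BNC; in other words, they reflect British English, and texts that date mostly from the 1980s and early 1990s.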

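COCA is still being extended and currently holds about 400 million words of text. The material covers the twenty years from 1990 to 2009 and includes the most widely read fiction and magazines (TIME, The New Yorker, and others are among the participating sources), film and television programs, a large volume of telephone and face-to-face conversation transcripts, and even the 9/11 report. The texts are classified and counted statistically by period, text type, and so on, which amounts to compiling a new usage dictionary of American English complete with word frequencies and popular usages.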

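Building on the COCA results so far, Brigham Young University in the United States used computational methods to extract from the corpus the 20,000 most frequent words in American English, together with their collocates.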

The list file for the top 5,000 words can already be downloaded:

http://www.wordfrequency.info/?freeList=y

Click "download the list" at the bottom of the page.

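In addition, samples of the 5,000-word and 20,000-word e-books (between them containing roughly 5,000 sample words) can also be downloaded for free; see http://www.wordfrequency.info/files/entries.pdf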

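The best thing about this word list is that each word comes not only with its frequency and synonyms but also with its "collocate set". A collocate set is the collection of words that occur most closely and most often alongside a given word. With it we can see the few dozen most common usages and contexts in which Americans use that word. For example, in the collocate set for "break", the top four neighboring words are law, heart, news, and rule, so we can guess that the most frequent usages are "break the law", "break someone's heart", "breaking news", and "break the rule". For building a feel for the language, this is far more useful than the example sentences in a dictionary.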
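To make the idea of a collocate set concrete, here is a minimal sketch that counts the words appearing within a small window around a node word. It is only an illustration under stated assumptions: the function name `collocates`, the window size, the stop-word set, and the toy text are hypothetical, and it ranks by raw co-occurrence counts on surface forms. How the site itself scores and lemmatizes collocates is not described here, so this should not be read as its method.

```python
import re
from collections import Counter

# Hypothetical stop-word set; a real one would be much larger.
STOP_WORDS = {"the", "my", "they", "do", "not", "a", "to"}

def collocates(text, node, window=2, top=4):
    """Return the `top` most frequent words within `window` tokens of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            # Skip the node itself and uninformative function words.
            if j != i and tokens[j] not in STOP_WORDS:
                counts[tokens[j]] += 1
    return counts.most_common(top)

# Toy text, for illustration only.
toy_text = "Never break the law. They break the law often. Do not break my heart."
result = collocates(toy_text, "break")
print(result)
```

On a real corpus the window would normally respect sentence boundaries, and inflected forms such as "broke" and "breaking" would be grouped under the lemma "break".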

Below is the English description of its features; alternatively, go straight to the site at http://www.wordfrequency.info.

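Also, if you help publicize the list by posting about it in a large forum for English learners (a single post is enough) and then email them the link, you can get the e-book with the 5,000-word frequency list and collocates for free. The printed edition of the book is also available on Amazon.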

So far, this is the best word list I have seen.

COMPARE (to data from the British National Corpus / American National Corpus)

There are many English word lists and frequency lists out on the Web. Some are good, some are very bad. Not all frequency lists are created equal.

One should be very, very suspicious of word lists that are taken from small samples of web data, outdated texts, or corpora that are too small to effectively model what is happening in the real world. Or worse, word lists that don't give you any idea what they are based on. As the saying goes: "garbage in (bad texts), garbage out (frequency lists)".

Rather than focusing too much on a comparison with specific wordlists that are out there on the Web, here are some questions you might ask yourself as you consider downloading or purchasing a word list:

Depth and accuracy. Why do so many wordlists on the web contain just the top 1000-3000 words of English? Why not the top 10,000 or 20,000? It's because even a bad corpus (the collection of texts that the word lists are based on) can produce a moderately accurate list for the very most frequent words. But because the corpus is neither deep nor balanced enough, you start getting messy data for medium and lower frequency words. Ask to see samples of the top 10,000 or 20,000 words (e.g. every 7th or 10th word). If they don't have it, then you should be very, very suspicious of that word list.

Genres. Does the corpus contain texts from a wide variety of genres -- spoken, fiction, popular magazines, newspapers, and academic journals? Frequency lists that are based on just one of these may only contain 40-50% of the words from a more balanced corpus. Our frequency list is based on the Corpus of Contemporary American English (COCA), which is almost perfectly balanced across genres.

Size. COCA contains more than 400 million words, and each of the top 20,000 words occurs at least 300 times. In a small 10-20 million word corpus, some of these words would occur just 7-8 times. At that point, the lower frequency words might make it into the list "by chance", whereas others are left out. No such problem with COCA.
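The arithmetic behind the "Size" point can be checked directly. The sketch below assumes a word keeps the same relative frequency in a smaller corpus and uses the round figures quoted above (400 million words, 300 occurrences); the helper name `expected_count` is hypothetical.

```python
def expected_count(count, corpus_size, target_size):
    """Scale a raw count to a corpus of a different size, assuming the same rate."""
    return count * target_size / corpus_size

COCA_SIZE = 400_000_000  # "more than 400 million words", per the text
MIN_COUNT = 300          # minimum occurrences of a top-20,000 word, per the text

for size in (10_000_000, 20_000_000):
    print(f"{size:>11,} words -> about {expected_count(MIN_COUNT, COCA_SIZE, size)} occurrences")
```

Under these assumptions the "7-8 times" figure corresponds to the 10-million-word end of the range (7.5 occurrences); at 20 million words the same word would be expected about 15 times, which is still few enough for chance to affect the ranking.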

How recent is it? Language change happens. If the word list is based on 15-to-20-year-old texts (or much worse, 100-year-old public domain novels), then it will be missing many of the words from the modern language. COCA is based on texts from 1990-2009 (20 million words each year) -- or in other words, virtually right up to the current time.

Is it just a bare wordlist? Word lists are nice, but to be really useful (especially for language learning) there ought to be some indication of what these words mean and how they are used. Most of our frequency lists contain the top 20-30 collocates (nearby words) for each word in the list, which creates a great "sketch" of each word.

--------------------------------------------------------------------------------

Summary. There are many word frequency lists out on the web. Some are just OK, and some are truly bad. The frequency lists that we have created are the only ones that are based on a large, recent, and balanced corpus of English, and which provide indications of the meaning and use of each word.

Word frequency lists and dictionary

from the Corpus of Contemporary American English


This site contains what we believe is the most accurate frequency data of English, and it comes in a number of different formats (see the table below).
Any frequency list is only as good as the corpus (collection of texts) that it is based on. Our data is based on the only large, genre-balanced, up-to-date corpus of American English -- the 450 million word Corpus of Contemporary American English. You can be sure that the data that you find here represents what you would encounter in the real world.
If you are a language learner, you can use the frequency lists to maximize your study of vocabulary in a way that is not possible with any other resource. If you are a (computational) linguist, you will have access to highly accurate, robust and useful data for research and for Natural Language Processing. (More information on how to use this data.)
The English frequency data comes in a number of different formats, shown below. You can also get frequency data for Spanish and Portuguese or Academic English.
Basic word lists. Top 5,000-60,000 words (lemmas)
Genre frequency. See the frequency of each of the top 60,000 lemmas -- in spoken, fiction, popular magazine, newspapers, and academic, as well as more than 40 sub-genres like NEWS-Financial or ACAD-Medicine. You can then use this data to create your own customized lists for particular genres and sub-genres.
Collocates. Collocates = "nearby words", and they provide great insight into the meaning and use of words -- more than any other lists. See (a maximum of) 200-300 collocates for each of the 60,000 words, giving nearly 4,800,000 node word / collocate pairs.
N-grams. Up to 155 million unique 2-5 grams (2-5 word sequences), with frequencies for each string. Allows you to search for the patterns in which a word occurs.
eBook. The 20,000 most frequent words (lemmas) in American English, along with the 20-30 most frequent collocates and the synonyms for each word
Printed book (from Routledge). The top 5,000 words (including collocates) and thematic lists
Free word list. Basic list of the top 5,000 lemmas
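As an illustration of what the N-grams entry above refers to, the following sketch counts every 2-word to 5-word sequence in a token list. It is a toy under stated assumptions: the function name `ngram_counts` and the sample sentence are hypothetical, tokenization is a plain whitespace split, and nothing here reflects how the site's own n-gram data was built.

```python
from collections import Counter

def ngram_counts(tokens, n_min=2, n_max=5):
    """Count every contiguous sequence of n_min..n_max tokens."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # range() is empty when the token list is shorter than n.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy sentence, for illustration only.
tokens = "the end of the day is the end of the week".split()
counts = ngram_counts(tokens)

# Show the sequences that occur more than once.
for gram, freq in counts.most_common():
    if freq > 1:
        print(" ".join(gram), freq)
```

Looking up all n-grams that contain a given word is then a simple filter over the keys, which is the "patterns in which a word occurs" use mentioned above.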