
Data Mining Winter 2010 Resources (from last year's course website):

TheFind Shopping Search Engine Dataset

Craigslist Data (data will be uploaded soon!)

All Tweets and some associated metadata from June 2009

Memetracker Dataset (More than 1 million news media and blog articles per day since August 2008)

Wikipedia: Entire edit history of the English Wikipedia as of March 2003.

Wikipedia web server logs

IM buddy graph from March 2005.

Yahoo AltaVista Web Graph.

Yahoo Messenger Dataset.

Yahoo Music Dataset.

Last Year's Final. Note that the subject matter is somewhat different this year, so you should not assume the coverage on this year's
final will be exactly the same. It will, however, cover all material in the course up to and including the 3/4 lecture.

References. In this file we'll post citations or links to papers you may wish to read to learn more about certain topics
covered in the class. They are not required reading.

Yahoo! Catalog of data sets available. Note: Jure Leskovec will have to apply for any sets you want, and we must agree not to distribute
them further. There may be a delay, so get requests in early.

Yannis Antonellis and Jawed Karim offer a file that contains information about the search queries that were used to reach pages on the Stanford Web server. The format is

visitor_hash [timestamp] requested_url "referer_from_a_search_engine"
E.g.,

a997c1950718d75c03f22ca8715e50b3[28/Feb/2007:23:45:47 -0800]/group/svsa/cgi-bin/www/officers.php"http://www.google.com/search?sourceid=navclient&ie=UTF-8&rls=HPIB,HPIB:2006-47,HPIB:en&q=sexy+random+facts"
See http://www.stanford.edu/~antonell/tags_dataset.html for more information about
how to get and use this file.
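
If you plan to work with this file programmatically, here is a rough Python sketch for splitting each line into its fields. It assumes the fields are concatenated exactly as in the example above (hex visitor hash, bracketed timestamp, request path, quoted referer); you may need to adjust the pattern once you see the real file.

import re
from urllib.parse import urlparse, parse_qs

# Assumed layout, inferred from the example line above:
#   <hex visitor hash>[<timestamp>]<requested path>"<referer URL>"
LINE_RE = re.compile(
    r'^(?P<visitor_hash>[0-9a-f]+)'   # anonymized visitor hash
    r'\[(?P<timestamp>[^\]]+)\]'      # e.g. 28/Feb/2007:23:45:47 -0800
    r'(?P<requested_url>[^"]+)'       # path on the Stanford web server
    r'"(?P<referer>[^"]*)"$'          # referring search-engine URL
)

def parse_line(line):
    """Split one log line into its fields; returns None if the line doesn't match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    rec = m.groupdict()
    # Pull the search query (the q= parameter) out of the referer, if present.
    qs = parse_qs(urlparse(rec["referer"]).query)
    rec["query"] = qs.get("q", [""])[0]
    return rec

example = ('a997c1950718d75c03f22ca8715e50b3'
           '[28/Feb/2007:23:45:47 -0800]'
           '/group/svsa/cgi-bin/www/officers.php'
           '"http://www.google.com/search?sourceid=navclient&ie=UTF-8'
           '&rls=HPIB,HPIB:2006-47,HPIB:en&q=sexy+random+facts"')
print(parse_line(example))   # the extracted query is 'sexy random facts'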

The Stanford WebBase project provides a crawl, and may even be talked into providing a specialized crawl if you have a need. Find description here.
Find how to access web pages in the repository here.

Here's a data set that might be interesting to some of you as you think about your project, e.g., on news clustering, identifying trends in news stories, etc. There is a nominal fee to get the DVD with the data, but if someone is really interested I'm sure
we could arrange to make it available: http://open.blogs.nytimes.com/2009/01/12/fatten-up-your-corpus/

Excerpt: Available for noncommercial research license from The Linguistic Data Consortium (LDC), the corpus spans 20 years of newspapers between 1987 and 2007 (that's 7,475 issues, to be exact). This collection includes the text of 1.8 million articles written
at The Times (for wire service articles, you'll have to look elsewhere). Of these, more than 1.5 million have been manually annotated by The New York Times Index with distinct tags for people, places, topics and organizations drawn from a controlled vocabulary.
A further 650,000 articles also include summaries written by indexers from the New York Times Index. The corpus is provided as a collection of XML documents in the News Industry Text Format and includes open source Java tools for parsing documents into memory
resident objects.
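
If you end up using this corpus, here is a minimal Python sketch for pulling the headline and body text out of one NITF XML document. The element names (hedline/hl1 and p) follow the public NITF specification and the file path shown is hypothetical, so check both against the actual LDC release.

import xml.etree.ElementTree as ET

def read_nitf_article(path):
    """Extract the headline and body paragraphs from one NITF XML file."""
    root = ET.parse(path).getroot()
    headline = root.findtext(".//hedline/hl1", default="")
    paragraphs = [p.text or "" for p in root.iter("p")]  # all <p> elements in the article
    return {"headline": headline.strip(), "body": "\n".join(paragraphs)}

# Hypothetical path; the real directory layout depends on the LDC distribution.
# article = read_nitf_article("nyt_corpus/data/2007/01/01/1815699.xml")
# print(article["headline"])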

A former CS345A student and last year's TA have started a company, Cellixis, to build a cellphone-based advisor. They offer two data sets that might be of interest; both are based on restaurant reviews:

A corpus of restaurants and reviews (100+ thousand restaurants, text of reviews can be tagged by part-of-speech). They are interested, for example, in knowing the keywords or key phrases (consecutive words) that best characterize different kinds of restaurants.
As a baseline for word occurrence, they can also provide a sample corpus of the web (10+ million pages), and average single word stats over that corpus.
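
One simple way to attack the "characteristic keywords" question is to score each term by how much more often it appears in a category's reviews than in the background web corpus. The sketch below uses a smoothed log frequency ratio; the function name and toy data are made up, and the company's own approach may well differ.

from collections import Counter
import math

def characteristic_terms(category_docs, background_counts, background_total, top_k=20, alpha=1.0):
    """Rank terms by smoothed log(P(term | category reviews) / P(term | background corpus))."""
    cat_counts = Counter(w for doc in category_docs for w in doc.lower().split())
    cat_total = sum(cat_counts.values())
    vocab = len(background_counts) + len(cat_counts)   # rough vocabulary size for smoothing
    scores = {}
    for term, c in cat_counts.items():
        p_cat = (c + alpha) / (cat_total + alpha * vocab)
        p_bg = (background_counts.get(term, 0) + alpha) / (background_total + alpha * vocab)
        scores[term] = math.log(p_cat / p_bg)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy usage with made-up counts:
reviews = ["amazing sushi and fresh sashimi", "best sushi rolls in town"]
bg = Counter({"the": 1000, "and": 800, "sushi": 5, "pizza": 50, "best": 120})
print(characteristic_terms(reviews, bg, background_total=sum(bg.values()), top_k=5))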

A training set of (user id, restaurant id, rating) tuples. They can also provide a corpus of restaurant info and reviews (in case a model-based approach is used). This data can be used in a manner similar to the Netflix data, but they are not offering $1M for
a good solution. And you would have to excise a small portion of the data yourself to measure your performance, whereas Netflix retains its test data.
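
For the rating data, the evaluation loop looks just like Netflix's, except that you hold out the test set yourself. Below is a rough sketch of that setup with a trivial per-restaurant-mean baseline scored by RMSE; the tuples shown are made up.

import math
import random
from collections import defaultdict

def split_holdout(ratings, test_frac=0.1, seed=0):
    """Excise a random fraction of (user_id, restaurant_id, rating) tuples as a test set."""
    rnd = random.Random(seed)
    shuffled = ratings[:]
    rnd.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def item_mean_baseline(train):
    """Predict each restaurant's mean training rating, falling back to the global mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, r in train:
        sums[item] += r
        counts[item] += 1
    global_mean = sum(r for _, _, r in train) / len(train)
    means = {item: sums[item] / counts[item] for item in sums}
    return lambda user, item: means.get(item, global_mean)

def rmse(predict, test):
    return math.sqrt(sum((predict(u, i) - r) ** 2 for u, i, r in test) / len(test))

# Toy usage with made-up tuples:
data = [("u1", "r1", 5), ("u1", "r2", 3), ("u2", "r1", 4), ("u2", "r3", 2), ("u3", "r2", 4)]
train, test = split_holdout(data, test_frac=0.4)
print(rmse(item_mean_baseline(train), test))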

If you are interested in obtaining either of these data sets, they can be reached by email at love-cs345 at cellixis dt cm.

ACM has just issued its Multimedia Grand Challenge(s). Many of these involve images in a way that we don't have the resources to deal with
in the next month, but you might want to read the material to see if anything looks doable and interesting.