
Advanced Topics in Data Mining Spring 2011

Posted: 2011-10-28 10:46


Books (PDFs):

Mining Massive Datasets by A. Rajaraman, J. Ullman.

Networks, Crowds, and Markets: Reasoning About a Highly Connected World by D. Easley, J. Kleinberg.

Data-Intensive Text Processing with MapReduce by J. Lin, C. Dyer.


Datasets:

SNAP network datasets

60 large social and information network datasets

Wikipedia

Complete edit history of Wikipedia articles: Which user edited what article at what time.

Wikipedia page to page link data

DBpedia: A richly labeled graph of Wikipedia entities.

Freebase: An entity graph of people, places and things.

Ratings and purchases (movies, music, etc.)

Amazon product co-purchasing network: 600k products and all their metadata.

KDD Cup 2011: 300M ratings from 1M users on 600k songs, albums and artists.

IMDB database: Everything about every movie ever made.

Movielens: User movie rating data.

Yahoo! Webscope Catalog of datasets

Yahoo! Webscope dataset collection. Contains Language Data, Graph and Social Data, Ratings Data, Advertising and Market Data, and Competition Data.

Note: Jure Leskovec will have to apply for any sets you want, and we must agree not to distribute them further. There may be a delay, so get requests in early.

Co-authorship and Citation Networks

DBLP: Digital Bibliography & Library Project. More info.

Arxiv citation and co-authorship networks: Data is from KDD Cup 2003.

Internet (Autonomous Systems) topology

AS Graphs

Who trusts whom data at Trustlet

Trust network datasets from Trustlet.org


Stanford-only datasets

Instant messenger buddy graph from March 2005. There are 227 million nodes and 7.3 billion undirected edges.

Altavista web graph from 2002. 1.4 billion nodes, 5.5 billion edges.

Memetracker2: 1 million blog posts, news media articles, tweets and Facebook wall posts per hour for the period from August 1 to August 31, 2010. 181GB of compressed data.

The New York Times Annotated Corpus: over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata.

TheFind: product information data (price, category, related products) extracted from 239 different websites.

Twitter: About 500 million tweets over a 7 month period. Data description.

Wikipedia: Complete revision history of Wikipedia -- every edit of every article with full content.

Wikipedia webserver logs: Hourly Wikipedia page access statistics.

Yahoo! Messenger: Instant Messenger graph with some additional information

Data can be accessed here. Email Jure if you do not have a password.


Other Datasets

Yannis Antonellis and Jawed Karim offer a file that contains information about the search queries that were used to reach pages on the Stanford Web server. See http://www.stanford.edu/~antonell/tags_dataset.html

The Stanford WebBase project provides a crawl, and may even be talked into providing a specialized crawl if you have a need. Find description here.
Find how to access web pages in the repository here.