Advanced Topics in Data Mining Spring 2011
2011-10-28 10:46
Books (PDFs):
Mining Massive Datasets by A. Rajaraman, J. Ullman.
Networks, Crowds, and Markets: Reasoning About a Highly Connected World by D. Easley, J. Kleinberg.
Data-Intensive Text Processing with MapReduce by J. Lin, C. Dyer.
Datasets:
SNAP network datasets: 60 large social and information network datasets.
Wikipedia
Complete edit history of Wikipedia articles: Which user edited what article at what time.
Wikipedia page to page link data
DBpedia: A richly labeled graph of Wikipedia entities.
Freebase: An entity graph of people, places and things.
Ratings and purchases (movies, music, etc.)
Amazon product co-purchasing network: 600k products and all their metadata.
KDD Cup 2011: 300M ratings from 1M users on 600k songs, albums and artists.
IMDB database: Everything about every movie ever made.
Movielens: User movie rating data.
Yahoo! Webscope Catalog of datasets
Yahoo! Webscope dataset collection. Contains Language Data, Graph and Social Data, Ratings Data, Advertising and Market Data, and Competition Data.
Note: Jure Leskovec will have to apply for any sets you want, and we must agree not to distribute them further. There may be a delay, so get requests in early.
Co-authorship and Citation Networks
DBLP: Digital Bibliography & Library Project. More info.
Arxiv citation and co-authorship networks: Data is from KDD Cup 2003.
Internet (Autonomous Systems) topology
AS Graphs
Who trusts whom data at Trustlet
Trust network datasets from Trustlet.org
Stanford only datasets
Instant messenger buddy graph from March 2005. There are 227 million nodes and 7.3 billion undirected edges.
Altavista web graph from 2002. 1.4 billion nodes, 5.5 billion edges.
Memetracker2. 1 million blog posts, news media articles, tweets and Facebook wall posts per hour for the period from August 1 to August 31, 2010. 181GB of compressed data.
The New York Times Annotated Corpus: over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007 with article metadata.
TheFind: product information data (price, category, related products) extracted from 239 different websites.
Twitter: About 500 million tweets over a 7 month period. Data description.
Wikipedia: Complete revision history of Wikipedia -- every edit of every article with full content.
Wikipedia webserver logs: Hourly Wikipedia page access statistics.
Yahoo! Messenger: Instant Messenger graph with some additional information
Data can be accessed here. Email Jure if you do not have a password.
Other Datasets
Yannis Antonellis and Jawed Karim offer a file that contains information about the search queries that were used to reach pages on the Stanford Web server. See http://www.stanford.edu/~antonell/tags_dataset.html
The Stanford WebBase project provides a crawl, and may even be talked into providing a specialized crawl if you have a need. Find description here. Find how to access web pages in the repository here.