[Dataset] CTR-SR datasets (CiteULike-a and CiteULike-t)
2014-02-24 17:41
Data download URL: http://www.datatang.com/data/45466
Data description:
CiteULike-a was used in the paper 'Collaborative Topic Regression with Social Regularization' [Wang, Chen and Li]. It was collected from CiteULike and Google Scholar. CiteULike allows users to create their own collections of articles. Each article has an abstract, a title, and tags. Other information, such as authors, groups, posting time, and keywords, is not used in that paper. Details can be found at http://www.citeulike.org/faq/data.adp.
Part of the data comes from [Wang and Blei]. Note that the original dataset in [Wang and Blei] does not contain relations between items. We collected the tag information from CiteULike and the citations from Google Scholar.
The text information (item content) of citeulike-a is preprocessed by following the same procedure as that in [Wang and Blei].
Some statistics are listed as follows:
#users 5551
#items 16980
#tags 46391
#citations 44709
#user-item pairs 204987
DATA FILES
citations.dat citations between articles
item-tag.dat tags corresponding to articles; each line holds the tags of one article (note that this is the version prior to preprocessing and thus has more tags than were used in the paper)
mult.dat bag of words for each article
raw-data.csv raw data
tags.dat tags, sorted by tag ID
users.dat rating matrix (user-item matrix)
vocabulary.dat corresponding words for file mult.dat
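The file list above does not describe the line layouts. The following Python sketch shows one way to read users.dat and mult.dat, under the assumption (not stated in this README) that users.dat lines look like `<count> <item_id> ...` and mult.dat lines look like `<n_terms> <word_id>:<count> ...`. The function names are hypothetical, and the sample lines are made up for illustration rather than taken from the dataset.

```python
def parse_users_line(line):
    """Parse one users.dat line, ASSUMED to be '<count> <item_id> <item_id> ...'."""
    fields = line.split()
    if not fields:
        return []
    count = int(fields[0])
    item_ids = [int(f) for f in fields[1:]]
    if count != len(item_ids):
        raise ValueError(f"expected {count} items, found {len(item_ids)}")
    return item_ids


def parse_mult_line(line):
    """Parse one mult.dat line, ASSUMED to be '<n_terms> <word_id>:<count> ...'."""
    fields = line.split()
    if not fields:
        return {}
    n_terms = int(fields[0])
    bag = {}
    for field in fields[1:]:
        word_id, count = field.split(":")
        bag[int(word_id)] = int(count)
    if n_terms != len(bag):
        raise ValueError(f"expected {n_terms} terms, found {len(bag)}")
    return bag


# Made-up sample lines in the assumed layouts (not real dataset rows).
sample_users = ["3 10 42 7", "1 5"]
sample_mult = ["2 0:3 5:1"]

user_items = [parse_users_line(line) for line in sample_users]
bags = [parse_mult_line(line) for line in sample_mult]
print(user_items)
print(bags)
```

If the real files follow a different layout, the count checks raise a ValueError instead of silently producing a wrong matrix. With real data, the same functions can be mapped over the lines of an opened file.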
---------------------------------------------------------------------------------------------------------------
CiteULike-t was used in the paper 'Collaborative Topic Regression with Social Regularization' [Wang, Chen and Li]. It was collected from CiteULike and Google Scholar. CiteULike allows users to create their own collections of articles. Each article has an abstract, a title, and tags. Other information, such as authors, groups, posting time, and keywords, is not used in that paper. Details can be found at http://www.citeulike.org/faq/data.adp.
We collected it independently of the citeulike-a dataset. We manually selected 273 seed tags and collected all the articles with at least one of these tags. We also crawled the citations between the articles from Google Scholar. Note that the final number of tags associated with the collected articles is far larger than the number of seed tags (273).
The text information (item content) of citeulike-t is preprocessed by following the same procedure as that for citeulike-a. After removing the stop words, we choose the top 20000 discriminative words according to their tf-idf values as our vocabulary.
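The vocabulary selection described above can be illustrated with a small Python sketch. The README does not say how tf-idf scores are aggregated across articles, so this sketch makes an assumption: each word is ranked by its highest tf-idf score in any single document. The function name `select_vocabulary`, the whitespace tokenisation, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def select_vocabulary(docs, stop_words, top_k):
    """Return top_k words ranked by their highest tf-idf score in any document.

    ASSUMPTION: the max-over-documents ranking is an illustrative choice,
    not the procedure confirmed by the dataset authors.
    """
    # Lowercase, split on whitespace, and drop stop words.
    cleaned = [
        [w for w in doc.lower().split() if w not in stop_words] for doc in docs
    ]
    n_docs = len(cleaned)

    # Document frequency: number of documents containing each word.
    df = Counter()
    for words in cleaned:
        df.update(set(words))

    best = {}
    for words in cleaned:
        if not words:
            continue
        tf = Counter(words)
        for word, count in tf.items():
            score = (count / len(words)) * math.log(n_docs / df[word])
            best[word] = max(best.get(word, 0.0), score)

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(best, key=lambda w: (-best[w], w))
    return ranked[:top_k]


toy_docs = [
    "the topic model the topic",
    "the social network",
    "the tag recommendation tag",
]
print(select_vocabulary(toy_docs, stop_words={"the"}, top_k=2))
```

For the dataset itself, `top_k` would be 20000 and the documents would be the article titles and abstracts.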
Some statistics are listed as follows:
#users 7947
#items 25975
#tags 52946
#citations 32565
#user-item pairs 134860
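As a quick sanity check on the statistics listed for the two datasets, the Python sketch below computes how sparse each user-item matrix is. The numbers are copied from the statistics above. The helper name `density` is hypothetical.

```python
# (users, items, user-item pairs), copied from the statistics listed above.
STATS = {
    "citeulike-a": (5551, 16980, 204987),
    "citeulike-t": (7947, 25975, 134860),
}


def density(users, items, pairs):
    """Fraction of the user-item matrix that is observed."""
    return pairs / (users * items)


for name, (users, items, pairs) in STATS.items():
    d = density(users, items, pairs)
    print(f"{name}: density={d:.4%}, avg items per user={pairs / users:.1f}")
```

Both matrices are well under 1% dense, and citeulike-t is the sparser of the two.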
DATA FILES
citations.dat citations between articles
tag-item.dat articles corresponding to tags; each line lists the articles that share one tag (note that this differs from the item-tag.dat file in the citeulike-a dataset, and that this is the version prior to preprocessing and thus has more tags than were used in the paper)
mult.dat bag of words for each article
rawtext.dat raw data
tags.dat tags, sorted by tag ID
users.dat rating matrix (user-item matrix)
vocabulary.dat corresponding words for file mult.dat
Cited paper
Collaborative Topic Regression with Social Regularization for Tag Recommendation (IJCAI-2013)