[Dataset] CTR-SR Datasets (CiteULike-a and CiteULike-t)

2014-02-24 17:41

Data download: http://www.datatang.com/data/45466

Data description:

CiteULike-a was used in the paper 'Collaborative Topic Regression with Social Regularization' [Wang, Chen and Li]. It was collected from CiteULike and Google Scholar. CiteULike allows users to create their own collections of articles. Each article has an abstract, a title, and tags. Other information, such as authors, groups, posting time, and keywords, is not used in that paper. Details can be found at http://www.citeulike.org/faq/data.adp.
The dataset is partly from [Wang and Blei]. Note that the original dataset in [Wang and Blei] does not contain relations between items; we collected the tag information from CiteULike and the citations from Google Scholar.

The text information (item content) of citeulike-a is preprocessed by following the same procedure as that in [Wang and Blei].

Some statistics are listed as follows:

#users                    5551

#items                    16980

#tags                    46391

#citations                44709

#user-item pairs        204987

DATA FILES

citations.dat   citations between articles

item-tag.dat   tags of each article, one line per article (note that this is the version prior to preprocessing and thus contains more tags than were used in the paper)

mult.dat       bag of words for each article

raw-data.csv   raw data

tags.dat       tags, sorted by tag ID

users.dat       rating matrix (user-item matrix)

vocabulary.dat   corresponding words for file mult.dat
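
The description above does not spell out the line layout of users.dat and mult.dat. The following is a minimal Python sketch of a reader, under two assumptions that should be checked against the downloaded files: that each line of users.dat is "<count> <item-id> <item-id> ..." (one line per user), and that each line of mult.dat is "<number-of-distinct-words> <word-id>:<frequency> ..." (one line per article). The function names parse_users and parse_mult are hypothetical, and the sample lines are made up for illustration.

```python
from typing import Dict, Iterable, List


def parse_users(lines: Iterable[str]) -> List[List[int]]:
    """Return one list of item ids per user.

    ASSUMPTION: each line is '<count> <item-id> <item-id> ...'.
    """
    libraries: List[List[int]] = []
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        count = int(fields[0])
        items = [int(x) for x in fields[1:]]
        if count != len(items):
            raise ValueError(f"expected {count} items, found {len(items)}")
        libraries.append(items)
    return libraries


def parse_mult(lines: Iterable[str]) -> List[Dict[int, int]]:
    """Return one {word_id: frequency} mapping per article.

    ASSUMPTION: each line is '<n-distinct-words> <word-id>:<freq> ...'.
    """
    docs: List[Dict[int, int]] = []
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        doc: Dict[int, int] = {}
        for pair in fields[1:]:
            word_id, freq = pair.split(":")
            doc[int(word_id)] = int(freq)
        if int(fields[0]) != len(doc):
            raise ValueError("declared word count does not match the line")
        docs.append(doc)
    return docs


# Made-up sample lines, not taken from the dataset.
sample_users = ["3 0 5 7", "1 2"]
sample_mult = ["2 0:1 4:3"]
print(parse_users(sample_users))
print(parse_mult(sample_mult))
```

With real data, the same functions can be fed an open file handle, for example parse_users(open("users.dat")). The length checks are there so that a wrong format assumption fails loudly instead of silently producing a shifted matrix.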

---------------------------------------------------------------------------------------------------------------

CiteULike-t was used in the paper 'Collaborative Topic Regression with Social Regularization' [Wang, Chen and Li]. It was collected from CiteULike and Google Scholar. CiteULike allows users to create their own collections of articles. Each article has an abstract, a title, and tags. Other information, such as authors, groups, posting time, and keywords, is not used in that paper. Details can be found at http://www.citeulike.org/faq/data.adp.
It was collected by us independently of the citeulike-a dataset. We manually selected 273 seed tags and collected all the articles with at least one of these tags. We also crawled the citations between the articles from Google Scholar. Note that the final number of tags associated with the collected articles is far larger than the number of seed tags (273).

The text information (item content) of citeulike-t is preprocessed by following the same procedure as that used for citeulike-a. After removing the stop words, we choose the top 20000 discriminative words according to their tf-idf values as our vocabulary.
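
As an illustration of the vocabulary selection step, here is a minimal Python sketch. The text only says that words are ranked "according to the tf-idf values"; how per-document scores are combined into one score per word is not stated, so this sketch assumes each word is scored by its maximum tf-idf over all documents. The function name select_vocabulary, the tokenized input, and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def select_vocabulary(docs: List[List[str]], stop_words: Set[str],
                      top_k: int) -> List[str]:
    """Pick the top_k words by maximum tf-idf over documents.

    ASSUMPTION: 'max over documents' is our choice of aggregation;
    the dataset description does not specify one.
    """
    # Step 1: remove stop words.
    filtered = [[w for w in doc if w not in stop_words] for doc in docs]
    n_docs = len(filtered)

    # Step 2: document frequency of every remaining word.
    doc_freq: Counter = Counter()
    for doc in filtered:
        doc_freq.update(set(doc))

    # Step 3: best tf-idf score seen for each word.
    best: Dict[str, float] = {}
    for doc in filtered:
        if not doc:
            continue
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            score = tf * idf
            if score > best.get(word, float("-inf")):
                best[word] = score

    # Step 4: rank by score, breaking ties alphabetically.
    ranked = sorted(best, key=lambda w: (-best[w], w))
    return ranked[:top_k]


# Toy corpus for illustration only; the dataset uses top_k = 20000.
toy_docs = [
    ["the", "topic", "model", "topic"],
    ["the", "graph", "model"],
    ["the", "graph", "citation"],
]
print(select_vocabulary(toy_docs, stop_words={"the"}, top_k=2))
```

Words that appear in every document get an idf of zero and sink to the bottom of the ranking, which is the sense in which the selected words are "discriminative".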

Some statistics are listed as follows:

#users                    7947

#items                    25975

#tags                    52946

#citations                32565

#user-item pairs        134860
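
The statistics of the two datasets can be compared through the density of the user-item matrix, that is, the number of observed user-item pairs divided by #users × #items. The short Python sketch below computes it from the figures listed in this description; the function name density is hypothetical.

```python
def density(n_users: int, n_items: int, n_pairs: int) -> float:
    """Fraction of the user-item matrix that is observed."""
    return n_pairs / (n_users * n_items)


# (#users, #items, #user-item pairs) as listed in this description.
stats = {
    "citeulike-a": (5551, 16980, 204987),
    "citeulike-t": (7947, 25975, 134860),
}

for name, (n_users, n_items, n_pairs) in stats.items():
    print(f"{name}: density = {density(n_users, n_items, n_pairs):.4%}")
```

By this measure citeulike-t (roughly 0.065%) is more than three times sparser than citeulike-a (roughly 0.22%).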

DATA FILES

citations.dat   citations between articles

tag-item.dat   articles of each tag, one line per tag listing the articles that carry it (note that this layout differs from item-tag.dat in citeulike-a, and that this is the version prior to preprocessing and thus contains more tags than were used in the paper)

mult.dat       bag of words for each article

rawtext.dat       raw data

tags.dat       tags, sorted by tag ID

users.dat       rating matrix (user-item matrix)

vocabulary.dat   corresponding words for file mult.dat
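
Because tag-item.dat is organized per tag while item-tag.dat in citeulike-a is organized per article, code written for one layout needs the mapping inverted before it can be reused on the other. The following Python sketch shows that inversion on already-parsed data. It assumes tag ids and article ids are 0-based and that the tag id equals the line index; both are assumptions to verify against the files. The function name invert_tag_item is hypothetical.

```python
from typing import List


def invert_tag_item(tag_to_items: List[List[int]],
                    n_items: int) -> List[List[int]]:
    """Turn a per-tag list of article ids into a per-article list of tag ids.

    ASSUMPTION: ids are 0-based and tag_to_items[t] holds the articles
    of tag t (tag id = line index in tag-item.dat).
    """
    item_to_tags: List[List[int]] = [[] for _ in range(n_items)]
    for tag_id, items in enumerate(tag_to_items):
        for item_id in items:
            item_to_tags[item_id].append(tag_id)
    return item_to_tags


# Made-up example: 3 tags over 3 articles.
tag_to_items = [[0, 2], [2], [1, 0]]
print(invert_tag_item(tag_to_items, n_items=3))
```

Tags are visited in increasing id order, so each article's tag list comes out sorted without an extra sorting pass.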

Cited paper

Collaborative Topic Regression with Social Regularization for Tag Recommendation (IJCAI-2013)
