
[Dataset] CTR-SR Datasets (CiteULike-a and CiteULike-t)

2014-02-24 17:41


Data download: http://www.datatang.com/data/45466




Data description:

CiteULike-a was used in the paper 'Collaborative Topic Regression with Social Regularization' [Wang, Chen and Li]. It was collected from CiteULike and Google Scholar. CiteULike allows users to create their own collections of articles. Each article has an abstract, a title, and tags. Other information, such as authors, groups, posting time, and keywords, is not used in that paper. Details can be found at http://www.citeulike.org/faq/data.adp.
The dataset is partly from [Wang and Blei]. Note that the original dataset in [Wang and Blei] does not contain relations between items; we collected the tag information from CiteULike and the citations from Google Scholar.

The text information (item content) of citeulike-a is preprocessed by following the same procedure as that in [Wang and Blei].

Some statistics are listed as follows:

#users                     5551

#items                     16980

#tags                     46391

#citations                 44709

#user-item pairs         204987

DATA FILES

citations.dat    citations between articles

item-tag.dat    tags corresponding to articles; each line holds the tags of one article (note that this is the version before preprocessing, so it contains more tags than were used in the paper)

mult.dat        bag of words for each article

raw-data.csv    raw data

tags.dat        tags, sorted by tag ID

users.dat        rating matrix (user-item matrix)

vocabulary.dat    corresponding words for file mult.dat
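
The file list above does not spell out the on-disk layout. As a minimal sketch, assuming users.dat follows the layout "<count> <item_id> <item_id> ..." (one line per user) and mult.dat follows the LDA-C style "<n_unique> <word_id>:<count> ..." (one line per article), the lines could be parsed as below. Both layouts, and all function names here, are assumptions for illustration and should be checked against the downloaded files.

```python
from typing import Dict, Iterable, List


def parse_users_line(line: str) -> List[int]:
    """Parse one users.dat line, ASSUMED to be '<count> <item_id> ...'."""
    fields = line.split()
    if not fields:
        return []
    count = int(fields[0])
    items = [int(f) for f in fields[1:]]
    if count != len(items):
        raise ValueError(f"expected {count} items, found {len(items)}")
    return items


def parse_mult_line(line: str) -> Dict[int, int]:
    """Parse one mult.dat line, ASSUMED to be '<n_unique> <word_id>:<count> ...'."""
    fields = line.split()
    if not fields:
        return {}
    n_unique = int(fields[0])
    bag: Dict[int, int] = {}
    for pair in fields[1:]:
        word_id, cnt = pair.split(":")
        bag[int(word_id)] = int(cnt)
    if n_unique != len(bag):
        raise ValueError(f"expected {n_unique} words, found {len(bag)}")
    return bag


def count_user_item_pairs(lines: Iterable[str]) -> int:
    """Total number of (user, item) pairs across all users.dat lines."""
    return sum(len(parse_users_line(line)) for line in lines)


# Toy lines in the assumed layout (not taken from the real files).
print(parse_users_line("3 10 20 30"))
print(parse_mult_line("2 0:3 7:1"))
print(count_user_item_pairs(["2 1 2", "1 5"]))
```

If the assumed layout holds, running count_user_item_pairs over the real users.dat should reproduce the #user-item pairs figure listed in the statistics, which makes it a quick sanity check after download.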

---------------------------------------------------------------------------------------------------------------

CiteULike-t was used in the paper 'Collaborative Topic Regression with Social Regularization' [Wang, Chen and Li]. It was collected from CiteULike and Google Scholar. CiteULike allows users to create their own collections of articles. Each article has an abstract, a title, and tags. Other information, such as authors, groups, posting time, and keywords, is not used in that paper. Details can be found at http://www.citeulike.org/faq/data.adp.
We collected this dataset independently of citeulike-a. We manually selected 273 seed tags and collected all the articles with at least one of these tags. We also crawled the citations between the articles from Google Scholar. Note that the final number of tags associated with the collected articles is far larger than the number of seed tags (273).
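
The seed-tag collection step can be illustrated with a small sketch. It keeps every article that carries at least one seed tag, then gathers all tags attached to the kept articles, which is why the final tag count can far exceed the number of seeds. The function name, the in-memory dictionary, and the toy tags are hypothetical; the source does not describe the actual crawler.

```python
from typing import Dict, Iterable, Set, Tuple


def select_by_seed_tags(
    article_tags: Dict[str, Set[str]], seed_tags: Iterable[str]
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Keep articles with at least one seed tag; return them and all their tags."""
    seeds = set(seed_tags)
    # An article qualifies if its tag set intersects the seed set.
    kept = {a: tags for a, tags in article_tags.items() if tags & seeds}
    # Every tag on a kept article is retained, seed or not.
    all_tags: Set[str] = set().union(*kept.values())
    return kept, all_tags


# Toy data, invented for illustration only.
articles = {
    "a1": {"lda", "topic-models", "bayesian"},
    "a2": {"svm", "kernels"},
    "a3": {"lda", "nlp", "text-mining"},
}
kept, all_tags = select_by_seed_tags(articles, seed_tags=["lda"])
print(sorted(kept))      # articles that matched a seed tag
print(len(all_tags))     # tags seen on those articles, more than the 1 seed
```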

The text information (item content) of citeulike-t is preprocessed by following the same procedure as that used for citeulike-a. After removing the stop words, we choose the top 20000 discriminative words according to their tf-idf values as our vocabulary.
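
A rough sketch of tf-idf based vocabulary selection follows. The description does not say how per-document tf-idf values are aggregated into one score per word, so this sketch assumes each word is scored by its maximum tf-idf over all documents, with tf taken as relative frequency and idf as log(N / df). Those choices, the function name, and the toy documents are assumptions, not the authors' exact procedure.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def select_vocabulary(docs: List[List[str]], stop_words: Set[str], top_k: int) -> List[str]:
    """Return the top_k words ranked by their maximum tf-idf over all documents."""
    # Step 1: remove stop words.
    filtered = [[w for w in doc if w not in stop_words] for doc in docs]
    n_docs = len(filtered)

    # Step 2: document frequency of each remaining word.
    df: Counter = Counter()
    for doc in filtered:
        df.update(set(doc))

    # Step 3: score each word by its best tf-idf value (ASSUMED aggregation).
    best: Dict[str, float] = {}
    for doc in filtered:
        if not doc:
            continue
        for word, count in Counter(doc).items():
            score = (count / len(doc)) * math.log(n_docs / df[word])
            best[word] = max(best.get(word, 0.0), score)

    # Step 4: highest score first; ties broken alphabetically for determinism.
    ranked = sorted(best, key=lambda w: (-best[w], w))
    return ranked[:top_k]


# Toy corpus, invented for illustration; the dataset uses top_k = 20000.
docs = [
    ["the", "topic", "model", "topic"],
    ["the", "matrix", "factorization"],
    ["the", "topic", "regression"],
]
vocab = select_vocabulary(docs, stop_words={"the"}, top_k=2)
print(vocab)
```

Words that occur in many documents receive a low idf and fall toward the bottom of the ranking, which is what makes the retained words "discriminative" in this sense.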

Some statistics are listed as follows:

#users                     7947

#items                     25975

#tags                     52946

#citations                 32565

#user-item pairs         134860
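
To put the two sets of statistics side by side, the short sketch below derives the density of each user-item matrix and the average number of items per user from the figures listed for citeulike-a and citeulike-t. The helper name is hypothetical; the inputs are the numbers quoted in this description.

```python
from typing import Dict


def summarize(n_users: int, n_items: int, n_pairs: int) -> Dict[str, float]:
    """Derive simple sparsity figures from the listed dataset statistics."""
    return {
        "density": n_pairs / (n_users * n_items),  # fraction of filled cells
        "items_per_user": n_pairs / n_users,
        "users_per_item": n_pairs / n_items,
    }


stats = {
    "citeulike-a": summarize(n_users=5551, n_items=16980, n_pairs=204987),
    "citeulike-t": summarize(n_users=7947, n_items=25975, n_pairs=134860),
}

for name, s in stats.items():
    print(f"{name}: density={s['density']:.4%}, "
          f"items/user={s['items_per_user']:.2f}, "
          f"users/item={s['users_per_item']:.2f}")
```

With these numbers citeulike-t comes out considerably sparser than citeulike-a: about 0.065% of its user-item matrix is filled, against about 0.22% for citeulike-a.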

DATA FILES

citations.dat    citations between articles

tag-item.dat    articles corresponding to tags; each line holds the articles associated with one tag (note that this layout differs from the one used in citeulike-a, and that this is the version before preprocessing, so it contains more tags than were used in the paper)

mult.dat        bag of words for each article

rawtext.dat        raw data

tags.dat        tags, sorted by tag ID

users.dat        rating matrix (user-item matrix)

vocabulary.dat    corresponding words for file mult.dat
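
Because tag-item.dat is organized per tag while the corresponding file in citeulike-a is organized per article, it can be convenient to invert it into a per-article mapping. The sketch below assumes that line i of tag-item.dat describes tag i (zero-based) and that each line is "<count> <item_id> <item_id> ..."; if the real file has no leading count, pass has_count_prefix=False. The layout, the function name, and the flag are all assumptions to verify against the downloaded file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def invert_tag_item(lines: Iterable[str], has_count_prefix: bool = True) -> Dict[int, List[int]]:
    """Turn per-tag item lists into a per-item list of tag ids."""
    item_tags: Dict[int, List[int]] = defaultdict(list)
    for tag_id, line in enumerate(lines):  # ASSUMED: line index is the tag id
        fields = [int(f) for f in line.split()]
        items = fields[1:] if has_count_prefix else fields
        for item_id in items:
            item_tags[item_id].append(tag_id)
    return dict(item_tags)


# Toy lines in the assumed layout (not taken from the real file).
toy_lines = ["2 0 3", "1 3", "0"]
print(invert_tag_item(toy_lines))
```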


Cited paper

Collaborative Topic Regression with Social Regularization for Tag Recommendation (IJCAI-2013)