您的位置:首页 > 其它

Mahout推荐算法之ItemBased

2014-10-11 13:08 459 查看

Mahout推荐之ItemBased

一、 算法原理

(一) 基本原理

如下图评分矩阵所示:行为user,列为item.



图(1)

该算法的原理:

1. 计算Item之间的相似度。

2. 对用户U做推荐



公式(一)
Map tmp ;

Map tmp1 ;



for(item a in userRatedItems){

rate =userforItemRate(a)

ListsimItem =getSimItem(a);

For(Jin simItem){

Item b =j;

Simab=sim(a,b);

Tmp.add(b,Tmp .get(b)+simab*rate)

tmp1.add(b, tmp1.get(b)+simab)



}

}

Maptmp2=temp/temp1

Sortbyval(tmp2)

return topK(tmp2,k)


(二) 相似度计算

1. Cos相似度



公式(二)
2. 皮尔逊相似度



公式(三)
3. 调整的cos相似度



公式(四)
(三) 采样
计算全量的itemPair之间的相似度耗费大量的时间,也是没有必要的,所以需要采样,减小计算量。

二、 单机模式实现

(一) 候选Item搜索

计算所有Item Pair之间的相似度在单机模式下是不现实的,需要在海量的候选集中搜索出一部分最有可能的候选集用于计算。Mahout提供了4中候选Item选择策略。

1. AllSimilarItemsCandidateItemsStrategy

@Override
FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {
FastIDSet candidateItemIDs = new FastIDSet();
for (long itemID : preferredItemIDs) {
candidateItemIDs.addAll(similarity.allSimilarItemIDs(itemID));
}
candidateItemIDs.removeAll(preferredItemIDs);
return candidateItemIDs;
}
2. AllUnknownItemsCandidateItemsStrategy

@Override
protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {
FastIDSet possibleItemIDs = new FastIDSet(dataModel.getNumItems());
LongPrimitiveIterator allItemIDs = dataModel.getItemIDs();
while (allItemIDs.hasNext()) {
possibleItemIDs.add(allItemIDs.nextLong());
}
possibleItemIDs.removeAll(preferredItemIDs);
return possibleItemIDs;
}
3. PreferredItemsNeighborhoodCandidateItemsStrategy

@Override
protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {
FastIDSet possibleItemsIDs = new FastIDSet();
for (long itemID : preferredItemIDs) {
PreferenceArray itemPreferences = dataModel.getPreferencesForItem(itemID);
int numUsersPreferringItem = itemPreferences.length();
for (int index = 0; index < numUsersPreferringItem; index++) {
possibleItemsIDs.addAll(dataModel.getItemIDsFromUser(itemPreferences.getUserID(index)));
}
}
possibleItemsIDs.removeAll(preferredItemIDs);
return possibleItemsIDs;
}
4. SamplingCandidateItemsStrategy

private static int computeMaxFrom(int factor, int numThings) {
if (factor == NO_LIMIT_FACTOR) {
return MAX_LIMIT;
}
long max = (long) (factor * (1.0 + Math.log(numThings) / LOG2));
return max > MAX_LIMIT ? MAX_LIMIT : (int) max;
}

@Override
protected FastIDSet doGetCandidateItems(long[] preferredItemIDs, DataModel dataModel) throws TasteException {
LongPrimitiveIterator preferredItemIDsIterator = new LongPrimitiveArrayIterator(preferredItemIDs);
if (preferredItemIDs.length > maxItems) {
double samplingRate = (double) maxItems / preferredItemIDs.length;
// log.info("preferredItemIDs.length {}, samplingRate {}", preferredItemIDs.length, samplingRate);
preferredItemIDsIterator =
new SamplingLongPrimitiveIterator(preferredItemIDsIterator, samplingRate);
}
FastIDSet possibleItemsIDs = new FastIDSet();
while (preferredItemIDsIterator.hasNext()) {
long itemID = preferredItemIDsIterator.nextLong();
PreferenceArray prefs = dataModel.getPreferencesForItem(itemID);
int prefsLength = prefs.length();
if (prefsLength > maxUsersPerItem) {
Iterator<Preference> sampledPrefs =
new FixedSizeSamplingIterator<Preference>(maxUsersPerItem, prefs.iterator());
while (sampledPrefs.hasNext()) {
addSomeOf(possibleItemsIDs, dataModel.getItemIDsFromUser(sampledPrefs.next().getUserID()));
}
} else {
for (int i = 0; i < prefsLength; i++) {
addSomeOf(possibleItemsIDs, dataModel.getItemIDsFromUser(prefs.getUserID(i)));
}
}
}
possibleItemsIDs.removeAll(preferredItemIDs);
return possibleItemsIDs;
}

private void addSomeOf(FastIDSet possibleItemIDs, FastIDSet itemIDs) {
if (itemIDs.size() > maxItemsPerUser) {
LongPrimitiveIterator it =
new SamplingLongPrimitiveIterator(itemIDs.iterator(), (double) maxItemsPerUser / itemIDs.size());
while (it.hasNext()) {
possibleItemIDs.add(it.nextLong());
}
} else {
possibleItemIDs.addAll(itemIDs);
}
}
(二) 估值

protected float doEstimatePreference(long userID, PreferenceArray preferencesFromUser, long itemID)
throws TasteException {
double preference = 0.0;
double totalSimilarity = 0.0;
int count = 0;
double[] similarities = similarity.itemSimilarities(itemID, preferencesFromUser.getIDs());
for (int i = 0; i < similarities.length; i++) {
double theSimilarity = similarities[i];
if (!Double.isNaN(theSimilarity)) {
// Weights can be negative!
preference += theSimilarity * preferencesFromUser.getValue(i);
totalSimilarity += theSimilarity;
count++;
}
}
// Throw out the estimate if it was based on no data points, of course, but also if based on
// just one. This is a bit of a band-aid on the 'stock' item-based algorithm for the moment.
// The reason is that in this case the estimate is, simply, the user's rating for one item
// that happened to have a defined similarity. The similarity score doesn't matter, and that
// seems like a bad situation.
if (count <= 1) {
return Float.NaN;
}
float estimate = (float) (preference / totalSimilarity);
if (capper != null) {
estimate = capper.capEstimate(estimate);
}
return estimate;
}
(三) 推荐

1. 根据历史评分列表推荐

这种推荐方式根据用户之前产生过评分的item做推荐,推荐结果按照估计值的大小排序。

@Override
public List<RecommendedItem> recommend(long userID,
int howMany, IDRescorer rescorer) throws TasteException {
Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");
log.debug("Recommending items for user ID '{}'", userID);
PreferenceArray preferencesFromUser = getDataModel().getPreferencesFromUser(userID);
if (preferencesFromUser.length() == 0) {
return Collections.emptyList();
}
FastIDSet possibleItemIDs = getAllOtherItems(userID, preferencesFromUser);
TopItems.Estimator<Long> estimator = new Estimator(userID, preferencesFromUser);
List<RecommendedItem> topItems = TopItems.getTopItems(howMany, possibleItemIDs.iterator(), rescorer,
estimator);
log.debug("Recommendations are: {}", topItems);
return topItems;
}

2. Because推荐

这种推荐方式用于实时推荐。

@Override
public List<RecommendedItem> recommendedBecause(long userID, long itemID, int howMany) throws TasteException {
Preconditions.checkArgument(howMany >= 1, "howMany must be at least 1");
DataModel model = getDataModel();
TopItems.Estimator<Long> estimator = new RecommendedBecauseEstimator(userID, itemID);
PreferenceArray prefs = model.getPreferencesFromUser(userID);
int size = prefs.length();
FastIDSet allUserItems = new FastIDSet(size);
for (int i = 0; i < size; i++) {
allUserItems.add(prefs.getItemID(i));
}
allUserItems.remove(itemID);
return TopItems.getTopItems(howMany, allUserItems.iterator(), null, estimator);
}

//估值方法
@Override
public double estimate(Long itemID) throws TasteException {
Float pref = getDataModel().getPreferenceValue(userID, itemID);
if (pref == null) {
return Float.NaN;
}
double similarityValue = similarity.itemSimilarity(recommendedItemID, itemID);
return (1.0 + similarityValue) * pref;
}

三、 MapReduce模式实现

(一) 将偏好文件转换成评分矩阵(PreparePreferenceMatrixJob)

(二) 计算共现矩阵相似度(RowSimilarityJob)

(三) 挑选最相似的K个Item

(四) 用户偏好向量和相似降维后的共现矩阵做乘法

(五) 过滤制定的user\titem

(六) 生成最终的推荐结果

四、 实例演示

1. 单机模式

1) 批量推荐

DataModel dataModel =
new FileDataModel(new File("p/pereference"));

ItemSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);

ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel,similarity );

System.out.println(recommender.recommend(10, 10));
2) Because推荐

DataModel dataModel = new FileDataModel(new File("p/pereference"));

ItemSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);

ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel,similarity );

System.out.println(recommender.recommendedBecause(10, 10328, 100));

2. MapReduce模式

API

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(args)

--input
偏好数据路径,文本文件。格式 userid\t itemid\t preference

--output
推荐结果路径

-- numRecommendations
推荐个数

--usersFile
需要做出推荐的user,默认全部做推荐

--itemsFile
需要做出推荐的item,默认全部做推荐

--filterFile
文件格式文本,userid\itemid 。目的是给userid的用户不要推荐itemid的item

--booleanData
是否是布尔数据

--maxPrefsPerUser
最大偏好值

--minPrefsPerUser
最小偏好值

--maxSimilaritiesPerItem
给每一个Item计算最多的相似item数目

--maxPrefsPerUserInItemSimilarity
ItemSimilarity估计item相似度时,对每一个user最多偏好数目

--similarityClassname
SIMILARITY_PEARSON_CORRELATION、SIMILARITY_COOCCURRENCE、SIMILARITY_LOGLIKELIHOOD、SIMILARITY_TANIMOTO_COEFFICIENT、SIMILARITY_CITY_BLOCK、SIMILARITY_COSINE、SIMILARITY_EUCLIDEAN_DISTANCE

--threshold
删除低于该阈值的item对

--outputPathForSimilarityMatrix
指定生成的item相似矩阵路径,文本文件,格式为 itemA \t itemB \t 相似值

实例

String [] args ={"--input","p",
"--output","recommender",
"--numRecommendations","10",
"--outputPathForSimilarityMatrix","simMatrix",
"--similarityClassname","SIMILARITY_PEARSON_CORRELATION"}

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob.main(args);

五、 参考文献

1. M.Deshpandeand G. Karypis. Item-based top-n recommendation algorithms.

2. B.M.Sarwar, G. Karypis, J.A. Konstan, and J. Reidl. Item-based collaborativefiltering recommendation algorithms.

3. Item-based collaborative filtering

4. Accuratelycomputing running variance
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: