
Recommender Systems: Neighborhood-Based Algorithms

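2018-01-03 17:39
I have recently been reading Xiang Liang's 《推荐系统实践》 (Recommender System Practice). The book gives only code fragments rather than complete programs, so I filled in the missing parts on top of the original snippets, following the book's descriptions.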

UserCF (user-based collaborative filtering):

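Let $N(u)$ denote the set of items for which user u has given positive feedback, and $N(v)$ the corresponding set for user v. The Jaccard similarity of the two users is:

$$w_{uv} = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|}$$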

The cosine similarity is:

$$w_{uv} = \frac{|N(u) \cap N(v)|}{\sqrt{|N(u)|\,|N(v)|}}$$
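To make the two similarity measures above concrete, here is a minimal sketch that evaluates both on a pair of toy item sets. The function names `jaccard` and `cosine` and the sample data are illustrative assumptions, not code from the book, and Python 3 division is assumed.

```python
import math

def jaccard(n_u, n_v):
    """Jaccard similarity of two item sets (0.0 if both are empty)."""
    union = n_u | n_v
    return len(n_u & n_v) / len(union) if union else 0.0

def cosine(n_u, n_v):
    """Cosine similarity of two item sets (0.0 if either is empty)."""
    if not n_u or not n_v:
        return 0.0
    return len(n_u & n_v) / math.sqrt(len(n_u) * len(n_v))

# Hypothetical toy data: the items each user gave positive feedback on.
n_u = {"a", "b", "d"}
n_v = {"a", "c"}

print(jaccard(n_u, n_v))  # 1 shared item / 4 items in the union
print(cosine(n_u, n_v))   # 1 shared item / sqrt(3 * 2)
```

Both measures depend only on set sizes, so they ignore how strongly a user liked an item. That matches the implicit-feedback setting used in the rest of this post.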

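Once the interest similarity between users is known, UserCF recommends to a user the items liked by the K users whose interests are most similar to theirs. The following formula measures how interested user u is in item i:

$$p(u,i) = \sum_{v \in S(u,K) \cap N(i)} w_{uv} r_{vi}$$

Here $S(u,K)$ is the set of the K users whose interests are closest to those of user u, $N(i)$ is the set of users who have interacted with item i, and $r_{vi}$ is user v's interest in item i. With implicit feedback, $r_{vi} = 1$.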
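As a worked example of the interest formula, the sketch below evaluates the sum directly for one user and one item. The helper `user_interest`, the four-user toy dataset, and the hard-coded cosine similarities for user A are assumptions made for illustration. Ties between equally similar neighbors are broken by dictionary order here.

```python
import math

def user_interest(u, i, train, W, K):
    """Sum w_uv * r_vi over the K nearest neighbors v of u that interacted with i."""
    # S(u, K): the K users most similar to u
    neighbors = sorted(W[u], key=W[u].get, reverse=True)[:K]
    return sum(W[u][v] * train[v][i] for v in neighbors if i in train[v])

# Hypothetical implicit-feedback data: user -> {item: 1}
train = {
    "A": {"a": 1, "b": 1, "d": 1},
    "B": {"a": 1, "c": 1},
    "C": {"b": 1, "e": 1},
    "D": {"c": 1, "d": 1, "e": 1},
}

# Cosine similarities of A to the other users, computed by hand from train
W = {"A": {"B": 1 / math.sqrt(6), "C": 1 / math.sqrt(6), "D": 1 / 3}}

p_c = user_interest("A", "c", train, W, K=3)  # contributions from B and D
p_e = user_interest("A", "e", train, W, K=3)  # contributions from C and D
print(p_c, p_e)
```

With K=3 every other user is a neighbor of A, so items c and e each collect the weight of two neighbors. Lowering K drops the least similar users from the sum.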

The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Sun Dec 31 12:46:42 2017

@author: lanlandetian
"""
import math
import operator

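'''
# Naive version that compares every pair of users, kept for reference.
# W is the similarity matrix.
def UserSimilarity(train):
    W = dict()
    for u in train.keys():
        W[u] = dict()
        for v in train.keys():
            if u == v:
                continue
            # train[u] maps item -> rating, so intersect the key sets
            W[u][v] = len(set(train[u]) & set(train[v]))
            W[u][v] /= math.sqrt(len(train[u]) * len(train[v]) * 1.0)
    return W
'''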

def UserSimilarity(train):
    # build the inverted table: item -> set of users who interacted with it
    item_users = dict()
    for u, items in train.items():
        for i in items.keys():
            if i not in item_users:
                item_users[i] = set()
            item_users[i].add(u)

    # C[u][v]: number of items both u and v interacted with
    # N[u]: number of items u interacted with
    C = dict()
    N = dict()
    for i, users in item_users.items():
        for u in users:
            N.setdefault(u, 0)
            N[u] += 1
            C.setdefault(u, {})
            for v in users:
                if u == v:
                    continue
                C[u].setdefault(v, 0)
                C[u][v] += 1

    # calculate the final similarity matrix W (cosine similarity);
    # W is built as a new dict so that C is left unmodified
    W = dict()
    for u, related_users in C.items():
        W[u] = dict()
        for v, cuv in related_users.items():
            W[u][v] = cuv / math.sqrt(N[u] * N[v])
    return W

def Recommend(user, train, W, K=3):
    rank = dict()
    interacted_items = train[user]
    # iterate over the K users most similar to `user`
    for v, wuv in sorted(W[user].items(), key=operator.itemgetter(1),
                         reverse=True)[0:K]:
        for i, rvi in train[v].items():
            # skip items the user has already interacted with
            if i in interacted_items:
                continue
            rank.setdefault(i, 0)
            rank[i] += wuv * rvi
    return rank

def Recommendation(users, train, W, K=3):
    # for each user, return (item, score) pairs sorted by descending score
    result = dict()
    for user in users:
        rank = Recommend(user, train, W, K)
        R = sorted(rank.items(), key=operator.itemgetter(1),
                   reverse=True)
        result[user] = R
    return result


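Improved user similarity (UserCF_IIF):

Two users' behavior toward unpopular items says more about how similar their interests are than their behavior toward popular items. The improved user similarity is therefore:

$$w_{uv} = \frac{\sum_{i \in N(u) \cap N(v)} \frac{1}{\log(1+|N(i)|)}}{\sqrt{|N(u)|\,|N(v)|}}$$

In this formula, the factor $\frac{1}{\log(1+|N(i)|)}$ penalizes the influence of popular items on the similarity, where $N(i)$ is the set of users who have interacted with item i.

The code is similar to that of UserCF.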
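Since the post gives no listing for this variant, here is a minimal sketch of how the popularity penalty could be folded into the inverted-table approach. The function name `user_similarity_iif` and the toy data are assumptions. The only change from plain cosine similarity is that each shared item adds 1/log(1+|N(i)|) instead of 1.

```python
import math

def user_similarity_iif(train):
    """User similarity with a penalty on popular items (a sketch, not the book's code)."""
    # inverted table: item -> set of users who interacted with it
    item_users = {}
    for u, items in train.items():
        for i in items:
            item_users.setdefault(i, set()).add(u)

    C = {}  # C[u][v]: penalized co-occurrence weight
    N = {}  # N[u]: number of items u interacted with
    for i, users in item_users.items():
        penalty = 1.0 / math.log(1 + len(users))  # smaller for popular items
        for u in users:
            N[u] = N.get(u, 0) + 1
            C.setdefault(u, {})
            for v in users:
                if u != v:
                    C[u][v] = C[u].get(v, 0.0) + penalty

    return {u: {v: cuv / math.sqrt(N[u] * N[v]) for v, cuv in related.items()}
            for u, related in C.items()}

# Hypothetical data: everyone has the popular item, only A and B share the rare one
train = {
    "A": {"hot": 1, "rare": 1},
    "B": {"hot": 1, "rare": 1},
    "C": {"hot": 1, "other": 1},
}
W = user_similarity_iif(train)
print(W["A"]["B"], W["A"]["C"])
```

In this toy data the item shared by all three users contributes less per pair than the item shared only by A and B, which is the effect the formula is designed to produce.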

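ItemCF (item-based collaborative filtering):

Let $N(i)$ denote the set of users who have interacted with item i. The similarity of items i and j is then

$$w_{ij} = \frac{|N(i) \cap N(j)|}{\sqrt{|N(i)|\,|N(j)|}}$$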

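Once the item similarities are known, ItemCF computes user u's interest in item i with the following formula:

$$p(u,i) = \sum_{j \in N(u) \cap S(i,K)} w_{ij} r_{uj}$$

Here $S(i,K)$ is the set of the K items most similar to item i, $N(u)$ is the set of items user u has interacted with, and $r_{uj}$ is user u's interest in item j.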
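The item-based formula can be evaluated directly in the same way. In this sketch the helper `item_interest`, the toy data, and the hand-computed cosine similarities are illustrative assumptions. The rating is taken from the item j that the user has already interacted with, and is 1 under implicit feedback.

```python
def item_interest(u, i, train, W, K):
    """Sum w_ij * r_uj over the K items j most similar to i that u interacted with."""
    # S(i, K): the K items most similar to i
    similar = sorted(W[i], key=W[i].get, reverse=True)[:K]
    return sum(W[i][j] * train[u][j] for j in similar if j in train[u])

# Hypothetical implicit-feedback data: user -> {item: 1}
train = {
    "A": {"a": 1, "b": 1, "d": 1},
    "B": {"a": 1, "c": 1},
    "C": {"b": 1, "e": 1},
    "D": {"c": 1, "d": 1, "e": 1},
}

# Item-item cosine similarities for items c and e, computed by hand from train:
# every item has two users and each listed pair shares one user, so w = 1/sqrt(4)
W = {
    "c": {"a": 0.5, "d": 0.5, "e": 0.5},
    "e": {"b": 0.5, "c": 0.5, "d": 0.5},
}

p_c = item_interest("A", "c", train, W, K=3)  # a and d are in A's history
p_e = item_interest("A", "e", train, W, K=3)  # b and d are in A's history
print(p_c, p_e)
```

The listing that follows in the post loops the other way around: for each item in the user's history it takes that item's K nearest neighbors and accumulates scores for them.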

The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Sun Dec 31 13:09:26 2017

@author: lanlandetian
"""

import math
import operator

def ItemSimilarity(train):
    # calculate co-rated users between items
    # C[i][j]: number of users who interacted with both i and j
    # N[i]: number of users who interacted with i
    C = dict()
    N = dict()
    for u, items in train.items():
        for i in items:
            N.setdefault(i, 0)
            N[i] += 1
            C.setdefault(i, {})
            for j in items:
                if i == j:
                    continue
                C[i].setdefault(j, 0)
                C[i][j] += 1

    # calculate the final similarity matrix W (cosine similarity);
    # W is built as a new dict so that C is left unmodified
    W = dict()
    for i, related_items in C.items():
        W[i] = dict()
        for j, cij in related_items.items():
            W[i][j] = cij / math.sqrt(N[i] * N[j])
    return W

def Recommend(user_id, train, W, K=3):
    rank = dict()
    ru = train[user_id]
    # for each item i in the user's history, score its K most similar items
    for i, pi in ru.items():
        for j, wij in sorted(W[i].items(),
                             key=operator.itemgetter(1), reverse=True)[0:K]:
            # skip items the user has already interacted with
            if j in ru:
                continue
            rank.setdefault(j, 0)
            rank[j] += pi * wij
    return rank

#class Node:
#    def __init__(self):
#        self.weight = 0
#        self.reason = dict()
#
#def Recommend(user_id,train, W,K =3):
#    rank = dict()
#    ru = train[user_id]
#    for i,pi in ru.items():
#        for j,wij in sorted(W[i].items(), \
#                           key = operator.itemgetter(1), reverse = True)[0:K]:
#            if j in ru:
#                continue
#            if j not in rank:
#                rank[j] = Node()
#            rank[j].reason.setdefault(i,0)
#            rank[j].weight += pi * wij
#            rank[j].reason[i] = pi * wij
#    return rank

def Recommendation(users, train, W, K=3):
    # for each user, return (item, score) pairs sorted by descending score
    result = dict()
    for user in users:
        rank = Recommend(user, train, W, K)
        R = sorted(rank.items(), key=operator.itemgetter(1),
                   reverse=True)
        result[user] = R
    return result


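Improved item similarity (ItemCF_IUF):

An active user should contribute less to item similarity than an inactive one, so an IUF (Inverse User Frequency) term is added to correct the item similarity formula:

$$w_{ij} = \frac{\sum_{u \in N(i) \cap N(j)} \frac{1}{\log(1+|N(u)|)}}{\sqrt{|N(i)|\,|N(j)|}}$$

The code is similar to that of ItemCF.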
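As no listing is given for this variant either, the following is a minimal sketch of one way to add the IUF term. The function name `item_similarity_iuf` and the toy data are assumptions. Each user adds 1/log(1+|N(u)|) to every pair of items in their history instead of 1.

```python
import math

def item_similarity_iuf(train):
    """Item similarity with a penalty on active users (a sketch, not the book's code)."""
    C = {}  # C[i][j]: IUF-weighted co-occurrence of items i and j
    N = {}  # N[i]: number of users who interacted with item i
    for u, items in train.items():
        if not items:
            continue  # log(1 + 0) = 0 would cause a division by zero
        weight = 1.0 / math.log(1 + len(items))  # smaller for active users
        for i in items:
            N[i] = N.get(i, 0) + 1
            C.setdefault(i, {})
            for j in items:
                if i != j:
                    C[i][j] = C[i].get(j, 0.0) + weight

    return {i: {j: cij / math.sqrt(N[i] * N[j]) for j, cij in related.items()}
            for i, related in C.items()}

# Hypothetical implicit-feedback data: user -> {item: 1}
train = {
    "A": {"a": 1, "b": 1, "d": 1},
    "B": {"a": 1, "c": 1},
    "C": {"b": 1, "e": 1},
    "D": {"c": 1, "d": 1, "e": 1},
}
W = item_similarity_iuf(train)
print(W["a"]["b"], W["a"]["c"])
```

In this toy data items a and b co-occur through user A (three items) while a and c co-occur through user B (two items), so the pair linked by the less active user ends up with the higher similarity. Under plain cosine similarity both pairs would score the same.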

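The book represents the dataset as a dict. I have implemented the complete pipeline on GitHub, including data loading and the final cross-validation.

The GitHub repository is at: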

https://github.com/1092798448/RecSys.git