MovieLens 1M: a Python data analysis exercise
2018-03-01 23:06
Dataset source: https://grouplens.org/datasets/movielens/1m/
![](https://img-blog.csdn.net/20180301230918330?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvZG9uZ3lhbndlbjYwMzY=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70)
Result screenshots:
![](https://img-blog.csdn.net/20180302101726135?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvZG9uZ3lhbndlbjYwMzY=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
![](https://img-blog.csdn.net/20180302101752287?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvZG9uZ3lhbndlbjYwMzY=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
![](https://img-blog.csdn.net/20180302110419611?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvZG9uZ3lhbndlbjYwMzY=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
![](https://img-blog.csdn.net/20180302113158871?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvZG9uZ3lhbndlbjYwMzY=/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/SouthEast)
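Code:

    import pandas as pd

    # read_table parameters used below:
    #   sep    : delimiter string (read_table defaults to a tab, read_csv to a comma).
    #            A separator longer than one character, other than '\s+', is treated
    #            as a regular expression and requires the python parser. Regex
    #            delimiters are prone to ignoring quoted data. Regex example: '\r\t'
    #   header : int or list of ints, default 'infer'; row number(s) holding the
    #            column names and marking where the data starts.
    #   names  : array-like, default None; column names for the result. If the file
    #            has no header row, pass header=None as well.
    #   engine : parser engine. The C engine is faster, the python engine is more
    #            feature-complete. Passing engine='python' here avoids the warning.
    uname = ['user_id', 'gender', 'age', 'occupation', 'zip']
    users = pd.read_table(r'D:\demo1\ml-1m\users.dat', sep='::', header=None,
                          names=uname, engine='python')

    rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
    ratings = pd.read_table(r'D:\demo1\ml-1m\ratings.dat', sep='::', header=None,
                            names=rnames, engine='python')

    mname = ['movie_id', 'title', 'genres']
    movies = pd.read_table(r'D:\demo1\ml-1m\movies.dat', sep='::', header=None,
                           names=mname, engine='python')

    # Merge the three tables on their shared columns (movie_id, then user_id)
    data = pd.merge(pd.merge(movies, ratings), users)
    print(data.loc[0])  # .ix[0] is deprecated; use .loc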
Result:

    movie_id                                1
    title                    Toy Story (1995)
    genres        Animation|Children's|Comedy
    user_id                                 1
    rating                                  5
    timestamp                       978824268
    gender                                  F
    age                                     1
    occupation                             10
    zip                                 48067
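To try the loading and merging step without downloading the dataset, here is a minimal sketch that reads a few made-up rows in the same `::`-delimited layout from memory. The sample rows and the `read_dat` helper are assumptions for illustration only; they are not part of the original post.

```python
import io
import pandas as pd

# Made-up sample rows in the same "::"-delimited layout as the .dat files.
users_raw = "1::F::1::10::48067\n2::M::56::16::70072\n"
ratings_raw = "1::1::5::978824268\n2::1::4::978300760\n2::2::3::978299000\n"
movies_raw = (
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

def read_dat(text, names):
    """Hypothetical helper: parse '::'-separated text into a DataFrame."""
    # A multi-character separator is treated as a regex, so the python engine is needed.
    return pd.read_csv(io.StringIO(text), sep="::", header=None,
                       names=names, engine="python")

users = read_dat(users_raw, ["user_id", "gender", "age", "occupation", "zip"])
ratings = read_dat(ratings_raw, ["user_id", "movie_id", "rating", "timestamp"])
movies = read_dat(movies_raw, ["movie_id", "title", "genres"])

# With no `on=` argument, merge joins on the columns the two frames share:
# movie_id for movies/ratings, then user_id for the result and users.
data = pd.merge(pd.merge(movies, ratings), users)
print(data.shape)
print(list(data.columns))
```

Each rating row picks up the matching movie and user columns, so the merged frame has one row per rating.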
    # Pivot table signature, for reference:
    # pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
    #                    fill_value=None, margins=False, dropna=True, margins_name='All')

    # Mean rating per title, with one column per gender
    mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
    print(mean_ratings[:5])
Result:

    gender                                F         M
    title
    $1,000,000 Duck (1971)         3.375000  2.761905
    'Night Mother (1986)           3.388889  3.352941
    'Til There Was You (1997)      2.675676  2.733333
    'burbs, The (1989)             2.793478  2.962085
    ...And Justice for All (1979)  3.828571  3.689024
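What the pivot table does is easier to see on a tiny frame. The sketch below uses made-up ratings (not MovieLens data) and also shows an equivalent `groupby` plus `unstack` spelling, as one possible way to read the same operation.

```python
import pandas as pd

# Made-up ratings for two hypothetical titles "A" and "B".
data = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "gender": ["F", "F", "M", "F", "M"],
    "rating": [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, cells hold the mean rating.
mean_ratings = data.pivot_table("rating", index="title", columns="gender", aggfunc="mean")

# Same numbers: group by both keys, average, then move gender into the columns.
via_groupby = data.groupby(["title", "gender"])["rating"].mean().unstack("gender")

print(mean_ratings)
```

For title "A" the two female ratings 4 and 2 average to 3.0, while the single male rating stays 5.0.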
    # Filter out movies with fewer than 200 ratings: first count ratings per title
    ratings_groupby_title = data.groupby('title').size()
    print(ratings_groupby_title[:5])
Result:

    title
    $1,000,000 Duck (1971)            37
    'Night Mother (1986)              70
    'Til There Was You (1997)         52
    'burbs, The (1989)               303
    ...And Justice for All (1979)    199
    dtype: int64
    # Keep the titles that have at least 200 ratings
    active_titles = ratings_groupby_title.index[ratings_groupby_title >= 200]
    print(active_titles)
Result:

    Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)',
           u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
           u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
           u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
           u'2001: A Space Odyssey (1968)', u'2010 (1984)',
           ...
           u'Year of Living Dangerously (1982)', u'Yellow Submarine (1968)',
           u'Yojimbo (1961)', u'You've Got Mail (1998)',
           u'Young Frankenstein (1974)', u'Young Guns (1988)',
           u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
           u'Zero Effect (1998)', u'eXistenZ (1999)'],
          dtype='object', name=u'title', length=1426)
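The count-then-filter pattern can be sketched on a small frame. The data and the `MIN_RATINGS` threshold below are made up for illustration; the threshold stands in for the 200 used in this exercise.

```python
import pandas as pd

# Made-up data: title "A" has 3 ratings, "B" has 1, "C" has 2.
data = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [5, 4, 3, 2, 4, 5],
})
MIN_RATINGS = 2  # hypothetical threshold, playing the role of 200

# Number of ratings per title, indexed by title.
counts = data.groupby("title").size()

# Boolean indexing on the index keeps only titles with enough ratings.
active_titles = counts.index[counts >= MIN_RATINGS]

# The same titles can also be used to filter the raw rows directly.
filtered = data[data["title"].isin(active_titles)]

print(list(active_titles))
print(len(filtered))
```

Computing `counts` once and reusing it avoids running the same `groupby` twice.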
    mean_ratings = mean_ratings.loc[active_titles]
    # Sort by column F in descending order.
    # ascending must be the boolean False, not the string 'False'.
    top_female_rating = mean_ratings.sort_values(by='F', ascending=False)
    print(top_female_rating[:10])

Result (this listing was produced with `ascending='False'`; the non-empty string was treated as true, so the sort ran in ascending order and the table shows the ten titles that women rated lowest):

    gender                                                     F         M
    title
    Battlefield Earth (2000)                            1.574468  1.616949
    Barb Wire (1996)                                    1.585366  2.100386
    Showgirls (1995)                                    1.709091  2.166667
    Jaws 3-D (1983)                                     1.863636  1.851064
    Rocky V (1990)                                      1.878788  2.132780
    Speed 2: Cruise Control (1997)                      1.906667  1.863014
    Avengers, The (1998)                                1.915254  2.017467
    Anaconda (1997)                                     2.000000  2.248447
    Nightmare on Elm Street 5: The Dream Child, A (...  2.052632  1.981481
    Howard the Duck (1986)                              2.074627  2.103542
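The effect of the `ascending` flag is easy to check on a tiny frame. This is a sketch with made-up means for three hypothetical titles, not values from the dataset.

```python
import pandas as pd

# Made-up mean ratings for three hypothetical titles.
mean_ratings = pd.DataFrame(
    {"F": [3.5, 4.6, 1.6], "M": [3.0, 4.4, 1.7]},
    index=pd.Index(["A", "B", "C"], name="title"),
)

# Boolean False sorts from highest to lowest female rating.
top_female = mean_ratings.sort_values(by="F", ascending=False)

# Boolean True (the default) sorts from lowest to highest.
bottom_female = mean_ratings.sort_values(by="F", ascending=True)

print(top_female.index.tolist())
print(bottom_female.index.tolist())
```

Passing a real boolean keeps the intent unambiguous; a quoted `'False'` is a non-empty string and does not mean "descending".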
Measuring rating disagreement
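    # diff = male mean minus female mean; the most negative values are
    # the titles that women rated higher than men
    mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
    sorted_by_diff = mean_ratings.sort_values(by='diff')
    print(sorted_by_diff[:5])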
Result:

    gender                                                     F         M      diff
    title
    Dirty Dancing (1987)                                3.790378  2.959596 -0.830782
    To Wong Foo, Thanks for Everything! Julie Newma...  3.486842  2.795276 -0.691567
    Jumpin' Jack Flash (1986)                           3.254717  2.578358 -0.676359
    Grease (1978)                                       3.975265  3.367041 -0.608224
    Relic, The (1997)                                   3.309524  2.723077 -0.586447
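Because `diff` is the male mean minus the female mean, the top of the ascending sort holds the titles women preferred and the bottom holds the titles men preferred. A minimal sketch on made-up means (hypothetical titles "A", "B", "C") shows how to read both ends:

```python
import pandas as pd

# Made-up mean ratings, not taken from MovieLens.
mean_ratings = pd.DataFrame(
    {"F": [3.8, 3.0, 2.5], "M": [3.0, 3.1, 3.5]},
    index=pd.Index(["A", "B", "C"], name="title"),
)

# Negative diff: women rated the title higher. Positive diff: men did.
mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]
sorted_by_diff = mean_ratings.sort_values(by="diff")

# First rows of the ascending sort: preferred by women.
preferred_by_women = sorted_by_diff.head(1)

# Reversing the row order puts the titles preferred by men first.
preferred_by_men = sorted_by_diff[::-1].head(1)

print(preferred_by_women.index.tolist())
print(preferred_by_men.index.tolist())
```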