您的位置:首页 > 编程语言 > Python开发

MovieLens 1M之python数据分析练习

2018-03-01 23:06 429 查看
数据集来源https://grouplens.org/datasets/movielens/1m/



代码区:

import pandas as pd
uname=['user_id','gender','age','occupation','zip']
users=pd.read_table(r'D:\demo1\ml-1m\users.dat',sep='::',header=None,names=uname,engine = 'python')
'''
sep : str, default ‘,’
指定分隔符。如果不指定参数,则会尝试使用逗号分隔。分隔符长于一个字符并且不是‘\s+’,
将使用python的语法分析器。并且忽略数据中的逗号。正则表达式例子:'\r\t'

header : int or list of ints, default ‘infer’指定行数用来作为列名,数据开始行数。

names : array-like, default None
用于结果的列名列表,如果数据文件中没有列标题行,就需要执行header=None。
engine解析器引擎使用。C引擎速度更快,而python引擎目前更加完善。除去警告
'''

rnames=['user_id','movie_id','rating','timestamp']
ratings=pd.read_table(r'D:\demo1\ml-1m\ratings.dat',sep='::',header=None,names=rnames,engine = 'python')
mname=['movie_id','title','genres']
movies=pd.read_table(r'D:\demo1\ml-1m\movies.dat',sep='::',header=None,names=mname,engine = 'python')

data=pd.merge(pd.merge(movies,ratings),users)
print data.loc[0]#ix[0]已经deprecated弃用


结果:





movie_id                                1
title                    Toy Story (1995)
genres        Animation|Children's|Comedy
user_id                                 1
rating                                  5
timestamp                       978824
4000
268
gender                                  F
age                                     1
occupation                             10
zip                                 48067


'''
#枢轴表pandas.pivot_table(data, values=None,
index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
'''
mean_ratings=data.pivot_table('rating',index='title',columns='gender',aggfunc='mean')
print mean_ratings[:5]


result:



gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024


#过滤数据不足200条的电影
ratings_groupby_title=data.groupby('title').size()
print ratings_groupby_title[:5]


reslut:

title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64




active_titles=data.groupby('title').size().index[data.groupby('title').size()>=200]
print active_titles


result:

Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)',
u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
u'2001: A Space Odyssey (1968)', u'2010 (1984)',
...
u'Year of Living Dangerously (1982)', u'Yellow Submarine (1968)',
u'Yojimbo (1961)', u'You've Got Mail (1998)',
u'Young Frankenstein (1974)', u'Young Guns (1988)',
u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
u'Zero Effect (1998)', u'eXistenZ (1999)'],
dtype='object', name=u'title', length=1426)


mean_ratings=mean_ratings.loc[active_titles]
#对F列进行降序
top_female_rating=mean_ratings.sort_values(by='F',ascending='False')
print top_female_rating[:10]


result:

gender                                                     F         M
title
Battlefield Earth (2000)                            1.574468  1.616949
Barb Wire (1996)                                    1.585366  2.100386
Showgirls (1995)                                    1.709091  2.166667
Jaws 3-D (1983)                                     1.863636  1.851064
Rocky V (1990)                                      1.878788  2.132780
Speed 2: Cruise Control (1997)                      1.906667  1.863014
Avengers, The (1998)                                1.915254  2.017467
Anaconda (1997)                                     2.000000  2.248447
Nightmare on Elm Street 5: The Dream Child, A (...  2.052632  1.981481
Howard the Duck (1986)                              2.074627  2.103542


计算评分分歧

mean_ratings['diff']=mean_ratings['M']-mean_ratings['F']
sorted_by_diff=mean_ratings.sort_values(by='diff')
print sorted_by_diff[:5]


result:

gender                                                     F         M
title
Dirty Dancing (1987)                                3.790378  2.959596
To Wong Foo, Thanks for Everything! Julie Newma...  3.486842  2.795276
Jumpin' Jack Flash (1986)                           3.254717  2.578358
Grease (1978)                                       3.975265  3.367041
Relic, The (1997)                                   3.309524  2.723077

gender                                                  diff
title
Dirty Dancing (1987)                               -0.830782
To Wong Foo, Thanks for Everything! Julie Newma... -0.691567
Jumpin' Jack Flash (1986)                          -0.676359
Grease (1978)                                      -0.608224
Relic, The (1997)                                  -0.586447
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: