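1 Analyzing movie rating data with pandas
The data are movie ratings contributed by MovieLens users from the late 1990s to the early 2000s. They include the ratings themselves, movie metadata (genre and release year), and demographic data about the users (age, zip code, gender, occupation, and so on). The data set contains about 1 million ratings of roughly 4,000 movies from about 6,000 users. It is split into three tables: ratings, user information, and movie information.
1.1 Loading and displaying the raw data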
import pandas as pd
# Read the users table and assign column names.
# The '::' separator is longer than one character, so pandas treats it as a regex.
# engine='python' is passed explicitly because the C engine does not support regex separators;
# without it pandas falls back to the Python engine and emits a ParserWarning.
userColumnsNames = ['user_id','gender','age','occupation','zip']
user = pd.read_table(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\movielens\users.dat',sep='::',header=None,names=userColumnsNames,engine='python')
# Read the ratings table and assign column names
rNames = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_table(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\movielens\ratings.dat',sep='::',header=None,names=rNames,engine='python')
# Read the movies table and assign column names
moviesNames = ['movie_id','title','genres']
movies = pd.read_table(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\movielens\movies.dat',sep='::',header=None,names=moviesNames,engine='python')
user[:5]
| | user_id | gender | age | occupation | zip |
|---|---|---|---|---|---|
| 0 | 1 | F | 1 | 10 | 48067 |
| 1 | 2 | M | 56 | 16 | 70072 |
| 2 | 3 | M | 25 | 15 | 55117 |
| 3 | 4 | M | 45 | 7 | 02460 |
| 4 | 5 | M | 25 | 20 | 55455 |
ratings[:5]
| | user_id | movie_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 |
| 1 | 1 | 661 | 3 | 978302109 |
| 2 | 1 | 914 | 3 | 978301968 |
| 3 | 1 | 3408 | 4 | 978300275 |
| 4 | 1 | 2355 | 5 | 978824291 |
movies[:5]
| | movie_id | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |
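The loading step in 1.1 can be tried without the files on disk. The sketch below parses a few rows copied from the users table above out of an in-memory string. It uses pd.read_csv rather than pd.read_table, and the names sample_users, user_columns, and sample, as well as the dtype argument for the zip column, are assumptions added for illustration and are not part of the original code.

```python
import io
import pandas as pd

# A few rows in the same '::'-separated layout as users.dat
sample_users = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "4::M::45::7::02460\n"
)

user_columns = ['user_id', 'gender', 'age', 'occupation', 'zip']
sample = pd.read_csv(
    sample_users,
    sep='::',            # multi-character separator, interpreted as a regex
    header=None,         # the file has no header row
    names=user_columns,
    engine='python',     # the C engine does not support regex separators
    dtype={'zip': str},  # assumption: read zip codes as text to keep leading zeros
)
print(sample)
```

With only these three rows, leaving out the dtype argument would let pandas parse the zip column as integers, and 02460 would lose its leading zero.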
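1.2 Computing each movie's average rating by gender
The user table holds gender and age, the movies table holds the titles, and the ratings table holds the scores, so the three tables have to be merged into one before this can be computed.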
data = pd.merge(pd.merge(ratings,user),movies)
data.head(5)
| | user_id | movie_id | rating | timestamp | gender | age | occupation | zip | title | genres |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1193 | 5 | 978300760 | F | 1 | 10 | 48067 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 1 | 2 | 1193 | 5 | 978298413 | M | 56 | 16 | 70072 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 2 | 12 | 1193 | 4 | 978220179 | M | 25 | 12 | 32793 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 3 | 15 | 1193 | 4 | 978199279 | M | 25 | 7 | 22903 | One Flew Over the Cuckoo's Nest (1975) | Drama |
| 4 | 17 | 1193 | 5 | 978158471 | M | 50 | 1 | 95350 | One Flew Over the Cuckoo's Nest (1975) | Drama |
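The nested pd.merge call names no join keys. When on is omitted, pd.merge joins on the column names the two frames share, which here are user_id for the first merge and movie_id for the second. The sketch below spells the keys out on toy frames; the frame names and some of the values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the three tables (partly made-up values)
ratings_small = pd.DataFrame({
    'user_id': [1, 2, 2],
    'movie_id': [1193, 1193, 1],
    'rating': [5, 5, 4],
})
users_small = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_small = pd.DataFrame({
    'movie_id': [1, 1193],
    'title': ['Toy Story (1995)', "One Flew Over the Cuckoo's Nest (1975)"],
})

# Implicit keys, as in the post: pandas joins on the shared column names
merged = pd.merge(pd.merge(ratings_small, users_small), movies_small)

# The same inner joins with the keys written out
explicit = (
    ratings_small
    .merge(users_small, on='user_id', how='inner')
    .merge(movies_small, on='movie_id', how='inner')
)

print(merged)
print(merged.equals(explicit))
```

Writing the keys explicitly is a little longer, but it guards against an accidental join on a column that happens to share a name across tables.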
# Compute each movie's mean rating separately for each gender
mean_ratings = pd.pivot_table(data,values = 'rating',index = ['title'],columns = ['gender'],aggfunc = 'mean')
mean_ratings[:5]
| gender | F | M |
|---|---|---|
| title | | |
| $1,000,000 Duck (1971) | 3.375000 | 2.761905 |
| 'Night Mother (1986) | 3.388889 | 3.352941 |
| 'Til There Was You (1997) | 2.675676 | 2.733333 |
| 'burbs, The (1989) | 2.793478 | 2.962085 |
| ...And Justice for All (1979) | 3.828571 | 3.689024 |
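To make the pivot table less of a black box, the sketch below computes the same kind of table two ways on toy data: once with pd.pivot_table and once with groupby followed by mean() and unstack(). The frame toy and its values are made up for illustration.

```python
import pandas as pd

# Made-up ratings: two movies rated by women (F) and men (M)
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, mean rating in each cell
by_pivot = pd.pivot_table(toy, values='rating', index='title',
                          columns='gender', aggfunc='mean')

# The same numbers: group, average, then move 'gender' into the columns
by_group = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(by_pivot)
print(by_group)
```

For movie A the two female ratings 4 and 2 average to 3.0 in both tables.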
# Count the number of ratings each movie received from each gender
data.groupby(['title','gender']).size().unstack()[:10]
| gender | F | M |
|---|---|---|
| title | | |
| $1,000,000 Duck (1971) | 16.0 | 21.0 |
| 'Night Mother (1986) | 36.0 | 34.0 |
| 'Til There Was You (1997) | 37.0 | 15.0 |
| 'burbs, The (1989) | 92.0 | 211.0 |
| ...And Justice for All (1979) | 35.0 | 164.0 |
| 1-900 (1994) | 1.0 | 1.0 |
| 10 Things I Hate About You (1999) | 232.0 | 468.0 |
| 101 Dalmatians (1961) | 187.0 | 378.0 |
| 101 Dalmatians (1996) | 150.0 | 214.0 |
| 12 Angry Men (1957) | 141.0 | 475.0 |
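The counts above are displayed as floats (16.0 rather than 16). One plausible explanation, assumed here rather than confirmed by the post, is that some title/gender combinations elsewhere in the data have no ratings, so unstack() fills those cells with NaN, which forces a float type. The toy sketch below reproduces that effect and shows fill_value=0 as one way to get a count of zero instead; the data are made up.

```python
import pandas as pd

# Made-up data: movie B has no rating from women
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B'],
    'gender': ['F', 'M', 'M', 'M'],
})

# The missing (B, F) combination becomes NaN
counts = toy.groupby(['title', 'gender']).size().unstack()

# fill_value=0 records the missing combination as a count of zero
counts_filled = toy.groupby(['title', 'gender']).size().unstack(fill_value=0)

print(counts)
print(counts_filled)
```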
# Keep only movies that received at least 250 ratings
numComm_by_title = data.groupby(['title']).size()
numComm_by_title[:5]  # title is the index
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
dtype: int64
active_titles = numComm_by_title.index[numComm_by_title >= 250]
print(active_titles.dtype)
active_titles.size
object
1216
active_titles[:5]
Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)',
u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
u'12 Angry Men (1957)'],
dtype='object', name=u'title')
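# .loc selects rows by index label; here it keeps only the titles in active_titles
# (the .ix indexer used in older code has been removed from recent pandas versions)
mean_ratings = mean_ratings.loc[active_titles]
# For details on mean_ratings, use mean_ratings? (in IPython) or help(mean_ratings)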
mean_ratings[:5]
| gender | F | M |
|---|---|---|
| title | | |
| 'burbs, The (1989) | 2.793478 | 2.962085 |
| 10 Things I Hate About You (1999) | 3.646552 | 3.311966 |
| 101 Dalmatians (1961) | 3.791444 | 3.500000 |
| 101 Dalmatians (1996) | 3.240000 | 2.911215 |
| 12 Angry Men (1957) | 4.184397 | 4.328421 |
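The filtering step can be followed end to end on a small example: count ratings per title, keep the titles that meet a threshold, and use those labels to select rows from the table of means. Everything in the sketch below (the data, the names, and the threshold of 2) is made up for illustration.

```python
import pandas as pd

# Made-up ratings: A has 3 ratings, B has 1, C has 2
toy = pd.DataFrame({
    'title': ['A', 'A', 'A', 'B', 'C', 'C'],
    'gender': ['F', 'M', 'M', 'F', 'F', 'M'],
    'rating': [4, 5, 3, 2, 5, 1],
})

means = toy.pivot_table(values='rating', index='title',
                        columns='gender', aggfunc='mean')

# Number of ratings per title, indexed by title
n_by_title = toy.groupby('title').size()

min_ratings = 2  # hypothetical threshold; the post uses 250
popular = n_by_title[n_by_title >= min_ratings].index

# Label-based row selection keeps only the popular titles
filtered = means.loc[popular]
print(filtered)
```

Movie B drops out because it has a single rating, which is the same reason the post discards sparsely rated titles before comparing averages.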
1.3 Which movies do women like most?
# sort_values sorts the rows by the values in a column (sort_index(by=...) is deprecated)
top_female_ratings = mean_ratings.sort_values(by='F',ascending = False)
top_female_ratings[:10]
| gender | F | M |
|---|---|---|
| title | | |
| Close Shave, A (1995) | 4.644444 | 4.473795 |
| Wrong Trousers, The (1993) | 4.588235 | 4.478261 |
| Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) | 4.572650 | 4.464589 |
| Wallace & Gromit: The Best of Aardman Animation (1996) | 4.563107 | 4.385075 |
| Schindler's List (1993) | 4.562602 | 4.491415 |
| Shawshank Redemption, The (1994) | 4.539075 | 4.560625 |
| Grand Day Out, A (1992) | 4.537879 | 4.293255 |
| To Kill a Mockingbird (1962) | 4.536667 | 4.372611 |
| Creature Comforts (1990) | 4.513889 | 4.272277 |
| Usual Suspects, The (1995) | 4.513317 | 4.518248 |
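Ranking by one column and keeping the top rows can also be written with DataFrame.nlargest. The sketch below compares it with sorting and slicing on a made-up table; the names and numbers are illustrative only.

```python
import pandas as pd

# Made-up mean ratings per title
means = pd.DataFrame(
    {'F': [3.5, 4.6, 2.1], 'M': [3.0, 4.4, 2.9]},
    index=pd.Index(['A', 'B', 'C'], name='title'),
)

# Sort by the female mean, highest first, then keep the first two rows
top_sorted = means.sort_values(by='F', ascending=False)[:2]

# nlargest expresses "top 2 by column F" directly
top_n = means.nlargest(2, 'F')

print(top_sorted)
print(top_n)
```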
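1.3 Finding the movies with the largest rating gap between men and women
Which movies best reflect the difference between men and women? Not the highest-rated or the lowest-rated ones, but those whose average ratings differ most between the two groups. The code below finds them.
# Add a column to the pivot table holding the male mean rating minus the female mean rating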
mean_ratings['diff'] = mean_ratings['M']-mean_ratings['F']
mean_ratings['diff'][:10]
title
'burbs, The (1989) 0.168607
10 Things I Hate About You (1999) -0.334586
101 Dalmatians (1961) -0.291444
101 Dalmatians (1996) -0.328785
12 Angry Men (1957) 0.144024
13th Warrior, The (1999) 0.056000
2 Days in the Valley (1996) -0.244076
20,000 Leagues Under the Sea (1954) 0.039102
2001: A Space Odyssey (1968) 0.304156
2010 (1984) -0.033097
Name: diff, dtype: float64
mean_ratings_M = mean_ratings.sort_values(by='diff',ascending = False)
mean_ratings_M[:5]  # movies that men rate higher than women
| gender | F | M | diff |
|---|---|---|---|
| title | | | |
| Good, The Bad and The Ugly, The (1966) | 3.494949 | 4.221300 | 0.726351 |
| Kentucky Fried Movie, The (1977) | 2.878788 | 3.555147 | 0.676359 |
| Dumb & Dumber (1994) | 2.697987 | 3.336595 | 0.638608 |
| Longest Day, The (1962) | 3.411765 | 4.031447 | 0.619682 |
| Cable Guy, The (1996) | 2.250000 | 2.863787 | 0.613787 |
# Movies that women rate higher than men: reverse the row order
mean_ratings_F = mean_ratings_M[::-1]
mean_ratings_F[:10]
| gender | F | M | diff |
|---|---|---|---|
| title | | | |
| Dirty Dancing (1987) | 3.790378 | 2.959596 | -0.830782 |
| Jumpin' Jack Flash (1986) | 3.254717 | 2.578358 | -0.676359 |
| Grease (1978) | 3.975265 | 3.367041 | -0.608224 |
| Little Women (1994) | 3.870588 | 3.321739 | -0.548849 |
| Steel Magnolias (1989) | 3.901734 | 3.365957 | -0.535777 |
| Anastasia (1997) | 3.800000 | 3.281609 | -0.518391 |
| Rocky Horror Picture Show, The (1975) | 3.673016 | 3.160131 | -0.512885 |
| Color Purple, The (1985) | 4.158192 | 3.659341 | -0.498851 |
| Age of Innocence, The (1993) | 3.827068 | 3.339506 | -0.487561 |
| Free Willy (1993) | 2.921348 | 2.438776 | -0.482573 |
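The two tables above split the gap by direction. If the goal is a single ranking by the size of the gap regardless of which group rated higher, one option is to sort by the absolute value of diff. This is a sketch of that variation on made-up numbers, not something the original post does.

```python
import pandas as pd

# Made-up mean ratings per title
means = pd.DataFrame(
    {'F': [4.5, 3.5, 3.8], 'M': [4.5, 4.2, 3.0]},
    index=pd.Index(['A', 'B', 'C'], name='title'),
)

# Male mean minus female mean, as in the post
means['diff'] = means['M'] - means['F']

# Order the titles by the size of the gap, largest first, ignoring its sign
order = means['diff'].abs().sort_values(ascending=False).index
by_gap = means.reindex(order)

print(by_gap)
```

Here C comes first (women rate it about 0.8 higher), then B (men rate it about 0.7 higher), and A last with no gap.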
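1.4 Finding the most divisive movies
Looking only at the ratings themselves, the most divisive movies can be found by computing the variance or standard deviation of each movie's ratings.
# Standard deviation of the ratings for each movie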
ratings_title_std = data.groupby(['title'])['rating'].std()
print(type(ratings_title_std))
ratings_title_std[:5]
<class 'pandas.core.series.Series'>
title
$1,000,000 Duck (1971) 1.092563
'Night Mother (1986) 1.118636
'Til There Was You (1997) 1.020159
'burbs, The (1989) 1.107760
...And Justice for All (1979) 0.878110
Name: rating, dtype: float64
# Sort a Series by value with sort_values (order() is deprecated); sort_index sorts by index label instead
ratings_title_std_sort = ratings_title_std.sort_values(ascending = False)
ratings_title_std_sort[:5]
title
Foreign Student (1994) 2.828427
Criminal Lovers (Les Amants Criminels) (1999) 2.309401
Identification of a Woman (Identificazione di una donna) (1982) 2.121320
Sunset Park (1996) 2.121320
Eaten Alive (1976) 2.121320
Name: rating, dtype: float64
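The standard deviations above are computed over all titles, not only those with at least 250 ratings. The top value, 2.828427, equals the sample standard deviation of just two ratings of 1 and 5, which is consistent with (though not proof of) these titles having very few ratings. The sketch below shows the effect on made-up data and one hedged way to apply a minimum rating count first; the names and the threshold are assumptions.

```python
import pandas as pd

# Made-up ratings: A has two extreme ratings, B has four similar ones
toy = pd.DataFrame({
    'title': ['A', 'A', 'B', 'B', 'B', 'B'],
    'rating': [1, 5, 3, 4, 3, 4],
})

# Standard deviation and number of ratings per title
stats = toy.groupby('title')['rating'].agg(['std', 'count'])

min_ratings = 3  # hypothetical threshold; section 1.2 uses 250
stable = stats[stats['count'] >= min_ratings].sort_values(by='std', ascending=False)

print(stats)
print(stable)
```

Title A has the largest standard deviation, but it rests on two ratings only, so it is dropped once the threshold is applied.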
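2 Summary
This post covered part of what pandas offers, including grouping with .groupby(), reshaping with .pivot_table(), and sorting (.sort_index(), .sort_values()), which together make it quick to summarize and display data. Comparable tools for this kind of work include Excel and Tableau.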