
Python Data Analysis, Chapter 2-2

2017-06-21 20:29

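1 Analyzing movie rating data with pandas

The data consists of movie ratings provided by MovieLens users from the late 1990s to the early 2000s. It includes the ratings themselves, movie metadata (genres and release year), and demographic data about the users (age, zip code, gender, and occupation). The dataset holds about 1 million ratings from roughly 6,000 users on about 4,000 movies. It is split into three tables: ratings, user information, and movie information.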

1.1 Loading and displaying the raw data

import pandas as pd

# Read the users table and assign column names.
# sep='::' is longer than one character, so pandas treats it as a regex and needs engine='python';
# passing it explicitly avoids the ParserWarning about falling back from the 'c' engine.
# Raw strings (r'...') keep the backslashes in the Windows paths from being read as escapes.
userColumnsNames = ['user_id','gender','age','occupation','zip']
user = pd.read_table(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\movielens\users.dat',sep='::',header=None,names=userColumnsNames,engine='python')

# Read the ratings table and assign column names
rNames = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_table(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\movielens\ratings.dat',sep='::',header=None,names=rNames,engine='python')

# Read the movies table and assign column names
moviesNames = ['movie_id','title','genres']
movies = pd.read_table(r'E:\python\pythonDataAnalysis\pydata-book-master\ch02\movielens\movies.dat',sep='::',header=None,names=moviesNames,engine='python')

user[:5]


	user_id	gender	age	occupation	zip
0	1	F	1	10	48067
1	2	M	56	16	70072
2	3	M	25	15	55117
3	4	M	45	7	02460
4	5	M	25	20	55455

ratings[:5]


	user_id	movie_id	rating	timestamp
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291

movies[:5]


	movie_id	title	genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama
4	5	Father of the Bride Part II (1995)	Comedy
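
The '::' separator in the .dat files is longer than one character, so pandas treats it as a regular expression and parses with the Python engine. The sketch below reproduces this on a small in-memory sample, so the behavior can be checked without the data files. The two rows are copied from the users table shown above; the names `sample`, `user_columns` and `users_sample` are introduced here for illustration only, and reading `zip` as a string is an assumption made so that a leading zero such as 02460 survives in a sample this small.

```python
import io
import pandas as pd

# Two rows in the same '::'-separated layout as users.dat
sample = io.StringIO("1::F::1::10::48067\n4::M::45::7::02460\n")

user_columns = ['user_id', 'gender', 'age', 'occupation', 'zip']
users_sample = pd.read_csv(
    sample,
    sep='::',             # multi-character separator, interpreted as a regex
    header=None,
    names=user_columns,
    engine='python',      # the C engine does not support regex separators
    dtype={'zip': str},   # keep zip codes as text so leading zeros are preserved
)
print(users_sample)
```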

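1.2 Computing each movie's average rating by gender

The user table holds gender and age, the movies table holds the titles, and the ratings table holds the scores, so the three tables have to be merged before the averages can be computed.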

data = pd.merge(pd.merge(ratings,user),movies)
data.head(5)


	user_id	movie_id	rating	timestamp	gender	age	occupation	zip	title	genres
0	1	1193	5	978300760	F	1	10	48067	One Flew Over the Cuckoo’s Nest (1975)	Drama
1	2	1193	5	978298413	M	56	16	70072	One Flew Over the Cuckoo’s Nest (1975)	Drama
2	12	1193	4	978220179	M	25	12	32793	One Flew Over the Cuckoo’s Nest (1975)	Drama
3	15	1193	4	978199279	M	25	7	22903	One Flew Over the Cuckoo’s Nest (1975)	Drama
4	17	1193	5	978158471	M	50	1	95350	One Flew Over the Cuckoo’s Nest (1975)	Drama
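
When no `on` argument is given, pd.merge joins on the column names the two frames share, which here is user_id for ratings and user, and movie_id for the result and movies. A minimal sketch with the join keys spelled out, using tiny hand-made frames whose values are taken from the tables above (the `_small` names are illustrative only):

```python
import pandas as pd

ratings_small = pd.DataFrame({'user_id': [1, 2], 'movie_id': [1193, 1193], 'rating': [5, 5]})
users_small = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_small = pd.DataFrame({
    'movie_id': [1193],
    'title': ["One Flew Over the Cuckoo's Nest (1975)"],
    'genres': ['Drama'],
})

# Same two-step join as above, but with the key columns named explicitly
merged = (ratings_small
          .merge(users_small, on='user_id')
          .merge(movies_small, on='movie_id'))
print(merged)
```

Naming the keys makes the join less likely to change silently if a table later gains another column with a shared name.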

# Average rating of each movie, split by gender
mean_ratings = pd.pivot_table(data,values = 'rating',index = ['title'],columns = ['gender'],aggfunc = 'mean')
mean_ratings[:5]


gender	F	M
title
$1,000,000 Duck (1971)	3.375000	2.761905
'Night Mother (1986)	3.388889	3.352941
'Til There Was You (1997)	2.675676	2.733333
'burbs, The (1989)	2.793478	2.962085
...And Justice for All (1979)	3.828571	3.689024
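
The pivot table above is a grouped mean reshaped into a wide layout. As a rough illustration on made-up ratings (the titles A and B and the frame `toy` are hypothetical), the sketch below computes the same table twice, once with pivot_table and once with groupby followed by unstack:

```python
import pandas as pd

toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'M', 'M', 'F', 'F'],
    'rating': [4, 2, 4, 5, 3],
})

# Wide table of mean ratings: one row per title, one column per gender
by_pivot = pd.pivot_table(toy, values='rating', index='title',
                          columns='gender', aggfunc='mean')

# Same numbers: group by both keys, average, then move gender into the columns
by_groupby = toy.groupby(['title', 'gender'])['rating'].mean().unstack()

print(by_pivot)
print(by_groupby)
```

Title B has no ratings from men in this toy data, so its M cell is NaN in both results.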

# Number of ratings each movie received, split by gender
data.groupby(['title','gender']).size().unstack()[:10]


gender	F	M
title
$1,000,000 Duck (1971)	16.0	21.0
'Night Mother (1986)	36.0	34.0
'Til There Was You (1997)	37.0	15.0
'burbs, The (1989)	92.0	211.0
...And Justice for All (1979)	35.0	164.0
1-900 (1994)	1.0	1.0
10 Things I Hate About You (1999)	232.0	468.0
101 Dalmatians (1961)	187.0	378.0
101 Dalmatians (1996)	150.0	214.0
12 Angry Men (1957)	141.0	475.0

# Keep only movies with at least 250 ratings
numComm_by_title = data.groupby(['title']).size()
numComm_by_title[:5]  # title is the index

title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64

active_titles = numComm_by_title.index[numComm_by_title >= 250]
print(active_titles.dtype)
active_titles.size

object
1216

active_titles[:5]

Index([u"'burbs, The (1989)", u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)'],
      dtype='object', name=u'title')

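# .loc selects rows by index label. The original code used .ix, which accepted both
# labels and integer positions but has since been removed from pandas.
mean_ratings = mean_ratings.loc[active_titles]
# In IPython, mean_ratings? or help(mean_ratings) shows details about the object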
mean_ratings[:5]


gender	F	M
title
'burbs, The (1989)	2.793478	2.962085
10 Things I Hate About You (1999)	3.646552	3.311966
101 Dalmatians (1961)	3.791444	3.500000
101 Dalmatians (1996)	3.240000	2.911215
12 Angry Men (1957)	4.184397	4.328421
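
The filtering step above has two parts: a boolean mask over the rating counts picks the index labels to keep, and .loc then selects those rows from the table of means. A small sketch on made-up numbers (titles A, B, C and all values are hypothetical):

```python
import pandas as pd

counts = pd.Series({'A': 300, 'B': 12, 'C': 250})
means = pd.DataFrame({'F': [3.5, 4.0, 2.5], 'M': [3.0, 4.5, 2.0]},
                     index=['A', 'B', 'C'])

min_ratings = 250  # same threshold as in the text

# Index labels of the titles that meet the threshold
popular = counts.index[counts >= min_ratings]

# Label-based row selection keeps only those titles
filtered = means.loc[popular]
print(filtered)
```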

1.3 Which movies do women like most?

# sort_values replaces sort_index(by=...), which is deprecated and raised a FutureWarning in the original run
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings[:10]


gender	F	M
title
Close Shave, A (1995)	4.644444	4.473795
Wrong Trousers, The (1993)	4.588235	4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)	4.572650	4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)	4.563107	4.385075
Schindler’s List (1993)	4.562602	4.491415
Shawshank Redemption, The (1994)	4.539075	4.560625
Grand Day Out, A (1992)	4.537879	4.293255
To Kill a Mockingbird (1962)	4.536667	4.372611
Creature Comforts (1990)	4.513889	4.272277
Usual Suspects, The (1995)	4.513317	4.518248

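1.4 Movies with the largest rating gap between men and women

Which movies best reflect the difference between men and women? Not the highest-rated or the lowest-rated ones, but the ones with the largest gap between the two groups' average ratings. The code below finds them.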

# Add a column to the pivot table holding the male mean minus the female mean
mean_ratings['diff'] = mean_ratings['M']-mean_ratings['F']
mean_ratings['diff'][:10]

title
'burbs, The (1989)                     0.168607
10 Things I Hate About You (1999)     -0.334586
101 Dalmatians (1961)                 -0.291444
101 Dalmatians (1996)                 -0.328785
12 Angry Men (1957)                    0.144024
13th Warrior, The (1999)               0.056000
2 Days in the Valley (1996)           -0.244076
20,000 Leagues Under the Sea (1954)    0.039102
2001: A Space Odyssey (1968)           0.304156
2010 (1984)                           -0.033097
Name: diff, dtype: float64

# sort_values replaces sort_index(by=...), which is deprecated and raised a FutureWarning in the original run
mean_ratings_M = mean_ratings.sort_values(by='diff', ascending=False)

mean_ratings_M[:5]  # movies rated higher by men


gender	F	M	diff
title
Good, The Bad and The Ugly, The (1966)	3.494949	4.221300	0.726351
Kentucky Fried Movie, The (1977)	2.878788	3.555147	0.676359
Dumb & Dumber (1994)	2.697987	3.336595	0.638608
Longest Day, The (1962)	3.411765	4.031447	0.619682
Cable Guy, The (1996)	2.250000	2.863787	0.613787

# Movies rated higher by women: the same ordering, reversed
mean_ratings_F = mean_ratings_M[::-1]
mean_ratings_F[:10]


gender	F	M	diff
title
Dirty Dancing (1987)	3.790378	2.959596	-0.830782
Jumpin’ Jack Flash (1986)	3.254717	2.578358	-0.676359
Grease (1978)	3.975265	3.367041	-0.608224
Little Women (1994)	3.870588	3.321739	-0.548849
Steel Magnolias (1989)	3.901734	3.365957	-0.535777
Anastasia (1997)	3.800000	3.281609	-0.518391
Rocky Horror Picture Show, The (1975)	3.673016	3.160131	-0.512885
Color Purple, The (1985)	4.158192	3.659341	-0.498851
Age of Innocence, The (1993)	3.827068	3.339506	-0.487561
Free Willy (1993)	2.921348	2.438776	-0.482573
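
The two lists above are the two ends of a single ordering by diff. If only the size of the gap matters and not its direction, one option is to order by the absolute value of diff instead. A sketch using three rows copied from the tables above (the names `gaps`, `order` and `by_gap` are illustrative):

```python
import pandas as pd

gaps = pd.DataFrame(
    {'F': [3.790378, 3.494949, 4.184397],
     'M': [2.959596, 4.221300, 4.328421]},
    index=['Dirty Dancing (1987)',
           'Good, The Bad and The Ugly, The (1966)',
           '12 Angry Men (1957)'],
)
gaps['diff'] = gaps['M'] - gaps['F']

# Order the titles by the size of the gap, largest first, ignoring its sign
order = gaps['diff'].abs().sort_values(ascending=False).index
by_gap = gaps.loc[order]
print(by_gap)
```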

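1.5 Finding the most divisive movies

To find the movies that divide viewers the most, independent of gender, compute the variance or standard deviation of each movie's ratings.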

# Standard deviation of each movie's ratings
ratings_title_std = data.groupby(['title'])['rating'].std()
print(type(ratings_title_std))
ratings_title_std[:5]

<class 'pandas.core.series.Series'>

title
$1,000,000 Duck (1971)           1.092563
'Night Mother (1986)             1.118636
'Til There Was You (1997)        1.020159
'burbs, The (1989)               1.107760
...And Justice for All (1979)    0.878110
Name: rating, dtype: float64

# Sort the Series by value with sort_values; sort_index would sort by the index labels instead.
# The original code used Series.order, which is deprecated (FutureWarning) and later removed.
ratings_title_std_sort = ratings_title_std.sort_values(ascending=False)
ratings_title_std_sort[:5]

title
Foreign Student (1994)                                             2.828427
Criminal Lovers (Les Amants Criminels) (1999)                      2.309401
Identification of a Woman (Identificazione di una donna) (1982)    2.121320
Sunset Park (1996)                                                 2.121320
Eaten Alive (1976)                                                 2.121320
Name: rating, dtype: float64
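
The top value, 2.828427, equals the sample standard deviation of just two ratings of 1 and 5, which suggests that the highest-ranked titles here may have very few ratings, since the standard deviation was computed over all titles rather than only the frequently rated ones. One way to reduce that effect, sketched below on made-up ratings (titles X and Y and the threshold of 3 are hypothetical), is to require a minimum number of ratings before ranking:

```python
import pandas as pd

# Sample standard deviation (ddof=1) of two ratings, 1 and 5
print(pd.Series([1, 5]).std())

toy = pd.DataFrame({
    'title':  ['X', 'X', 'Y', 'Y', 'Y', 'Y'],
    'rating': [1, 5, 3, 4, 4, 5],
})

# Rating count and standard deviation per title
stats = toy.groupby('title')['rating'].agg(['count', 'std'])

min_count = 3  # hypothetical threshold; the text uses 250 on the full data
ranked = stats[stats['count'] >= min_count].sort_values(by='std', ascending=False)
print(ranked)
```

On the full data, the same idea could be applied by selecting only the titles in active_titles before sorting.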

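2 Summary

This post covered part of what pandas offers, including grouping with .groupby(), reshaping with .pivot_table(), and sorting with .sort_index() and .sort_values(). Together they make it quick to summarize and display data. Comparable tools for this kind of summary and visualization include Excel and Tableau.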
