Data Analysis with Python (2): The MovieLens 1M Dataset

2017-09-27 23:20
# -*- coding: utf-8 -*-
"""
Created on Thu Sep 21 12:24:37 2017

@author: Douzi
"""

import pandas as pd

# User information
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ch02/movielens/users.dat', sep='::', header=None, names=unames, engine='python')

# Ratings
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ch02/movielens/ratings.dat', sep='::', header=None, names=rnames,engine='python')

# Movie information
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ch02/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')

users[:5]
Out[113]:
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

ratings[:5]
Out[114]:
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

movies[:5]
Out[115]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
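The `sep='::'` calls above rely on pandas treating a separator longer than one character as a regular expression, which only the Python parsing engine supports. Below is a minimal sketch of that behaviour. It assumes an in-memory string in place of users.dat; the two rows are copied from the users output above, and `sample` and `sample_users` are names made up for this illustration.

```python
import io
import pandas as pd

# Two rows in the same '::'-delimited layout as users.dat
# (values copied from the users[:5] output; the string itself is illustrative).
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A multi-character separator is interpreted as a regular expression,
# which the default C parser cannot handle, hence engine='python'.
sample_users = pd.read_csv(io.StringIO(sample), sep='::', header=None,
                           names=unames, engine='python')
print(sample_users)
```

Leaving out `engine='python'` typically still works, but pandas falls back to the Python engine and emits a warning, so passing it explicitly keeps the output clean.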


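Merging the data

Once the three tables are merged, the ratings can be aggregated by any number of user or movie attributes.

Compute the mean rating of each movie by gender. This produces another DataFrame that holds the mean ratings, with movie titles as row labels and gender as column labels.

Group by title and use size() to get a Series containing the number of ratings in each movie group.

To find the movies that female viewers liked best, sort by the F column in descending order.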
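The merge, pivot and group-size steps can be tried on a handful of rows before running them on the full dataset. The sketch below uses tiny made-up tables (all `toy_*` names and values are hypothetical) that only share their column names with the real files.

```python
import pandas as pd

# Tiny made-up tables; only the column names match the MovieLens files.
toy_ratings = pd.DataFrame({'user_id': [1, 1, 2, 2],
                            'movie_id': [10, 20, 10, 20],
                            'rating': [5, 3, 4, 1]})
toy_users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
toy_movies = pd.DataFrame({'movie_id': [10, 20],
                           'title': ['Movie A', 'Movie B']})

# merge() joins on the columns the two frames have in common:
# user_id for ratings/users, then movie_id for the result and movies.
toy_data = pd.merge(pd.merge(toy_ratings, toy_users), toy_movies)

# Mean rating per title (rows) and gender (columns).
toy_mean = toy_data.pivot_table(values='rating', index='title',
                                columns='gender', aggfunc='mean')

# Number of ratings per title.
toy_counts = toy_data.groupby('title').size()

print(toy_mean)
print(toy_counts)
```

Because each toy movie has exactly one rating per gender, every cell of `toy_mean` is simply that single rating, which makes the reshaping easy to check by eye.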

# -*- coding: utf-8 -*-

import pandas as pd

# User information
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

users = pd.read_table('pydata-book-master/ch02/movielens/users.dat', sep='::', header=None, names=unames, engine='python')

# Ratings
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']

ratings = pd.read_table('pydata-book-master/ch02/movielens/ratings.dat', sep='::', header=None, names=rnames,engine='python')

# Movie information
mnames = ['movie_id', 'title', 'genres']

movies = pd.read_table('pydata-book-master/ch02/movielens/movies.dat', sep='::', header=None, names=mnames, engine='python')

data = pd.merge(pd.merge(ratings, users), movies)

data.iloc[0]

mean_ratings = data.pivot_table('rating', index='title',
                                columns='gender', aggfunc='mean')

mean_ratings[:5]

# Filter out movies that received fewer than 250 ratings.
# Group by title, then use size() to get a Series with the number of ratings per title.
ratings_by_title = data.groupby('title').size()

ratings_by_title[:10]

active_titles = ratings_by_title.index[ratings_by_title >= 250]

# This index holds the titles of movies with at least 250 ratings;
# use it to select the matching rows from mean_ratings.
mean_ratings = mean_ratings.loc[active_titles]

top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)

top_female_ratings[:10]


Result:

top_female_ratings[:10]
Out[4]:
gender                                                     F         M
title
Close Shave, A (1995)                               4.644444  4.473795
Wrong Trousers, The (1993)                          4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
Schindler's List (1993)                             4.562602  4.491415
Shawshank Redemption, The (1994)                    4.539075  4.560625
Grand Day Out, A (1992)                             4.537879  4.293255
To Kill a Mockingbird (1962)                        4.536667  4.372611
Creature Comforts (1990)                            4.513889  4.272277
Usual Suspects, The (1995)                          4.513317  4.518248
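The filter-then-sort pattern behind this table can be checked on made-up numbers. The sketch below assumes a recent pandas release, where label-based selection is spelled `.loc` and sorting by a column is spelled `sort_values` (older releases used `.ix` and `sort_index(by=...)` for the same steps). All `toy_*` names and values are hypothetical.

```python
import pandas as pd

# Hypothetical mean ratings and rating counts for three movies.
titles = pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title')
toy_mean = pd.DataFrame({'F': [4.5, 3.0, 5.0], 'M': [4.0, 3.5, 2.0]},
                        index=titles)
toy_counts = pd.Series([300, 260, 12], index=titles)

# Keep only titles with at least 250 ratings (boolean mask on the index).
toy_active = toy_counts.index[toy_counts >= 250]

# Select those rows by label, then sort by the F column, highest first.
toy_top_female = toy_mean.loc[toy_active].sort_values(by='F', ascending=False)
print(toy_top_female)
```

Movie C has the highest F score but only 12 ratings, so it is dropped before sorting. This is the reason for the 250-rating threshold: a mean over a handful of ratings is not a reliable ranking signal.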


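Measuring rating disagreement

Find the movies on which male and female viewers disagree the most.

# Add a column to mean_ratings holding the difference between the mean ratings, then sort by it: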

mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']

sorted_by_diff = mean_ratings.sort_values(by='diff')


# Sorting by 'diff' in ascending order puts first the movies with the largest gap that female viewers preferred.
sorted_by_diff[:15]
Out[9]:
gender                                        F         M      diff
title
Dirty Dancing (1987)                   3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358 -0.676359
Grease (1978)                          3.975265  3.367041 -0.608224
Little Women (1994)                    3.870588  3.321739 -0.548849
Steel Magnolias (1989)                 3.901734  3.365957 -0.535777
Anastasia (1997)                       3.800000  3.281609 -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131 -0.512885
Color Purple, The (1985)               4.158192  3.659341 -0.498851
Age of Innocence, The (1993)           3.827068  3.339506 -0.487561
Free Willy (1993)                      2.921348  2.438776 -0.482573
French Kiss (1995)                     3.535714  3.056962 -0.478752
Little Shop of Horrors, The (1960)     3.650000  3.179688 -0.470312
Guys and Dolls (1955)                  4.051724  3.583333 -0.468391
Mary Poppins (1964)                    4.197740  3.730594 -0.467147
Patch Adams (1998)                     3.473282  3.008746 -0.464536


# Reversing the sorted result and taking the first 15 rows gives the movies that male viewers preferred.
sorted_by_diff[::-1][:15]

Out[11]:
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020
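Since diff is defined as M minus F, the most negative values mark movies preferred by female viewers and the most positive ones mark movies preferred by male viewers. A minimal sketch with made-up numbers (all `toy_*` names, `female_pref` and `male_pref` are hypothetical) shows how one ascending sort serves both rankings:

```python
import pandas as pd

# Hypothetical mean ratings by gender for three movies.
toy_mean = pd.DataFrame(
    {'F': [4.0, 3.0, 3.5], 'M': [3.0, 3.8, 3.5]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'))

# Negative diff: rated higher by women. Positive diff: rated higher by men.
toy_mean['diff'] = toy_mean['M'] - toy_mean['F']

# Ascending sort: the female-preferred extreme comes first.
toy_sorted = toy_mean.sort_values(by='diff')
female_pref = toy_sorted.index[0]

# Reversing the rows puts the male-preferred extreme first.
male_pref = toy_sorted[::-1].index[0]

print(toy_sorted)
print(female_pref, male_pref)
```

A movie such as Movie C, where both groups agree, ends up in the middle of the sorted frame and appears in neither top list.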


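# To measure disagreement among viewers regardless of gender,
# compute the standard deviation of the ratings grouped by movie title

rating_std_by_title = data.groupby('title')['rating'].std()

# Keep only the titles in active_titles

rating_std_by_title = rating_std_by_title.loc[active_titles]

# Sort the Series by value in descending order

rating_std_by_title.sort_values(ascending=False)[:10]


rating_std_by_title.sort_values(ascending=False)[:10]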
Out[17]:
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
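To see why a high standard deviation signals disagreement, compare a movie whose ratings are split between extremes with one whose ratings cluster together. The sketch below uses made-up ratings (the `toy_*` names are hypothetical) and relies on `std()` computing the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical ratings: Movie A splits viewers, Movie B is rated consistently.
toy_data = pd.DataFrame({
    'title': ['Movie A'] * 4 + ['Movie B'] * 4,
    'rating': [1, 5, 1, 5, 3, 3, 4, 3],
})

# Sample standard deviation of the ratings for each title.
toy_std = toy_data.groupby('title')['rating'].std()

# Most divisive titles first.
toy_ranked = toy_std.sort_values(ascending=False)
print(toy_ranked)
```

Movie A has a mean of 3, close to Movie B's 3.25, yet its ratings are far more spread out, which is exactly the situation the mean alone cannot reveal.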