Data Analysis with Python (2): The MovieLens 1M Dataset
2017-09-27 23:20
# -*- coding: utf-8 -*-
"""
Created on Thu Sep 21 12:24:37 2017

@author: Douzi
"""
import pandas as pd

# User information
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ch02/movielens/users.dat', sep='::', header=None,
                      names=unames, engine='python')

# Ratings
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ch02/movielens/ratings.dat', sep='::', header=None,
                        names=rnames, engine='python')

# Movie information
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ch02/movielens/movies.dat', sep='::', header=None,
                       names=mnames, engine='python')

users[:5]
Out[113]:
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455

ratings[:5]
Out[114]:
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

movies[:5]
Out[115]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
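The loading calls above pass engine='python' because pandas treats a separator longer than one character, such as '::', as a regular expression, which the default C parser does not support. The following is a minimal sketch of the same call on an in-memory sample, so it runs without the data files. The two rows are copied from the users output above; the name sample_users and the dtype={'zip': str} argument are assumptions added here so that the leading zero in 02460 survives in such a small sample.

```python
import io

import pandas as pd

# Two rows in the same '::'-delimited layout as users.dat
# (values taken from the users[:5] output shown above).
sample = io.StringIO("1::F::1::10::48067\n4::M::45::7::02460\n")

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A multi-character separator is interpreted as a regex, so the Python
# parsing engine is requested explicitly.
# dtype={'zip': str} is an assumption of this sketch: it keeps zip codes
# as text so that '02460' does not become the integer 2460.
sample_users = pd.read_csv(sample, sep='::', header=None, names=unames,
                           engine='python', dtype={'zip': str})

print(sample_users)
```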
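Merging the data
Once the three tables are merged, the rating data can be aggregated by any user or movie attribute.
Compute the mean rating of each movie by gender. This produces another DataFrame that holds the mean ratings, with movie titles as row labels and gender as column labels.
Group by title and use size() to get a Series containing the number of ratings in each title group:
To find the movies that female viewers liked most, sort by the F column in descending order.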
# -*- coding: utf-8 -*-
import pandas as pd

# User information
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('pydata-book-master/ch02/movielens/users.dat', sep='::',
                      header=None, names=unames, engine='python')

# Ratings
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('pydata-book-master/ch02/movielens/ratings.dat', sep='::',
                        header=None, names=rnames, engine='python')

# Movie information
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('pydata-book-master/ch02/movielens/movies.dat', sep='::',
                       header=None, names=mnames, engine='python')

# merge() joins on the columns the two tables share:
# user_id for ratings/users, then movie_id for the result and movies
data = pd.merge(pd.merge(ratings, users), movies)
data.iloc[0]  # .ix was removed from pandas; use .iloc for positional access

mean_ratings = data.pivot_table('rating', index='title', columns='gender',
                                aggfunc='mean')
mean_ratings[:5]

# Filter out movies that received fewer than 250 ratings.
# Group by title, then use size() to get a Series of group sizes per title.
ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]
active_titles = ratings_by_title.index[ratings_by_title >= 250]

# active_titles holds the titles with at least 250 ratings;
# use it to select the matching rows of mean_ratings.
mean_ratings = mean_ratings.loc[active_titles]  # .ix replaced by .loc

# sort_index(by=...) was removed from pandas; sort_values(by=...) replaces it
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings[:10]
Result:
top_female_ratings[:10]
Out[4]:
gender                                                     F         M
title
Close Shave, A (1995)                               4.644444  4.473795
Wrong Trousers, The (1993)                          4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)       4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation...  4.563107  4.385075
Schindler's List (1993)                             4.562602  4.491415
Shawshank Redemption, The (1994)                    4.539075  4.560625
Grand Day Out, A (1992)                             4.537879  4.293255
To Kill a Mockingbird (1962)                        4.536667  4.372611
Creature Comforts (1990)                            4.513889  4.272277
Usual Suspects, The (1995)                          4.513317  4.518248
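The merge, pivot, and filter steps are easier to follow on a table small enough to check by hand. The sketch below repeats them on three made-up tables; all names prefixed with toy_, the titles 'A' and 'B', and the threshold of 3 ratings (standing in for 250) are hypothetical and not part of the MovieLens data.

```python
import pandas as pd

# Hypothetical miniature tables, not MovieLens data.
toy_users = pd.DataFrame({'user_id': [1, 2, 3], 'gender': ['F', 'M', 'F']})
toy_movies = pd.DataFrame({'movie_id': [10, 20], 'title': ['A', 'B']})
toy_ratings = pd.DataFrame({
    'user_id':  [1, 2, 3, 1, 2],
    'movie_id': [10, 10, 10, 20, 20],
    'rating':   [5, 3, 4, 2, 4],
})

# merge() joins on the shared column names: user_id first, then movie_id.
toy = pd.merge(pd.merge(toy_ratings, toy_users), toy_movies)

# Mean rating per title (rows) and gender (columns).
toy_means = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')

# Number of ratings per title, then keep titles with at least 3 ratings
# (the toy stand-in for the 250-rating threshold).
counts = toy.groupby('title').size()
popular = counts.index[counts >= 3]

# Label-based row selection, then sort by the female mean, highest first.
toy_top = toy_means.loc[popular].sort_values(by='F', ascending=False)

print(toy_means)
print(toy_top)
```

Movie 'B' has only two ratings, so it is dropped by the filter even though it appears in toy_means.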
Measuring rating disagreement
Find the movies on which male and female viewers disagree the most.
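# Add a column to mean_ratings holding the difference between the mean ratings, then sort by it:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
# sort_index(by=...) was removed from pandas; sort_values(by=...) replaces it
sorted_by_diff = mean_ratings.sort_values(by='diff')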
# Sorting by 'diff' in ascending order puts first the movies with the largest gap that female viewers preferred.
sorted_by_diff[:15]
Out[9]:
gender                                        F         M      diff
title
Dirty Dancing (1987)                   3.790378  2.959596 -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358 -0.676359
Grease (1978)                          3.975265  3.367041 -0.608224
Little Women (1994)                    3.870588  3.321739 -0.548849
Steel Magnolias (1989)                 3.901734  3.365957 -0.535777
Anastasia (1997)                       3.800000  3.281609 -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131 -0.512885
Color Purple, The (1985)               4.158192  3.659341 -0.498851
Age of Innocence, The (1993)           3.827068  3.339506 -0.487561
Free Willy (1993)                      2.921348  2.438776 -0.482573
French Kiss (1995)                     3.535714  3.056962 -0.478752
Little Shop of Horrors, The (1960)     3.650000  3.179688 -0.470312
Guys and Dolls (1955)                  4.051724  3.583333 -0.468391
Mary Poppins (1964)                    4.197740  3.730594 -0.467147
Patch Adams (1998)                     3.473282  3.008746 -0.464536
# Reversing the sorted result and taking the first 15 rows gives the movies that male viewers preferred.
sorted_by_diff[::-1][:15]
Out[11]:
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
Porky's (1981)                          2.296875  2.836364  0.539489
Animal House (1978)                     3.628906  4.167192  0.538286
Exorcist, The (1973)                    3.537634  4.067239  0.529605
Fright Night (1985)                     2.973684  3.500000  0.526316
Barb Wire (1996)                        1.585366  2.100386  0.515020
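Because diff is computed as M minus F, its sign tells which group liked a movie more: negative values mean a higher female mean, positive values a higher male mean. A small sketch with made-up means illustrates why the ascending sort and its reverse produce the two lists above. The table toy_means, the titles 'X', 'Y', 'Z', and the names by_diff, female_pref, and male_pref are hypothetical.

```python
import pandas as pd

# Hypothetical mean ratings for three titles, not MovieLens values.
toy_means = pd.DataFrame(
    {'F': [4.0, 3.0, 3.5], 'M': [3.0, 4.0, 3.5]},
    index=pd.Index(['X', 'Y', 'Z'], name='title'),
)

# Male mean minus female mean: negative means female viewers rated higher.
toy_means['diff'] = toy_means['M'] - toy_means['F']

# Ascending sort: the most female-preferred title comes first.
by_diff = toy_means.sort_values(by='diff')
female_pref = by_diff.index[0]

# Reversing the rows puts the most male-preferred title first.
male_pref = by_diff[::-1].index[0]

print(by_diff)
print(female_pref, male_pref)
```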
# Standard deviation of the ratings, grouped by movie title
rating_std_by_title = data.groupby('title')['rating'].std()
# Keep only the titles in active_titles (.ix replaced by .loc)
rating_std_by_title = rating_std_by_title.loc[active_titles]
# Sort the Series by value in descending order
# (Series.order() was removed from pandas; sort_values() replaces it)
rating_std_by_title.sort_values(ascending=False)[:10]

rating_std_by_title.sort_values(ascending=False)[:10]
Out[17]:
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
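The standard deviation captures how widely ratings are spread for a title regardless of who gave them, so two movies with the same mean can rank very differently. The sketch below shows this on made-up ratings; the table toy, the titles 'P' and 'Q', and the names toy_std and ranked are hypothetical. Note that pandas computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical ratings: both titles average 3.0, but 'P' splits viewers
# between 1 and 5 while everyone gives 'Q' a 3.
toy = pd.DataFrame({
    'title':  ['P', 'P', 'P', 'P', 'Q', 'Q', 'Q', 'Q'],
    'rating': [1, 5, 1, 5, 3, 3, 3, 3],
})

# Per-title sample standard deviation (pandas default ddof=1).
toy_std = toy.groupby('title')['rating'].std()

# Highest spread first, as in the listing above.
ranked = toy_std.sort_values(ascending=False)

print(toy.groupby('title')['rating'].mean())
print(ranked)
```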