
Analyzing a Million Movie Ratings with pandas

2016-05-29 17:48

Analyzing movie data with pandas

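Life is short, use Python.

This article uses Python for data analysis. pandas is one of the key Python packages for data analysis; other important ones are numpy and matplotlib.

Install pandas (the same on Linux, Mac, and Windows):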

pip install pandas

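Movie data source: http://grouplens.org/datasets/movielens/

Download the data file and unzip it. It contains the following 4 files:

users.dat: user data

movies.dat: movie data

ratings.dat: rating data

README: description of the files

The README describes the format of the source data files: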

users.dat (UserID::Gender::Age::Occupation::Zip-code)

movies.dat (MovieID::Title::Genres)

ratings.dat (UserID::MovieID::Rating::Timestamp)

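Notes: Occupation is the user's occupation, Zip-code is the postal code, Timestamp is the time of the rating, and Genres is the movie's genres (see the README for more details).

The fields within each record are separated by ::

Environment:

OS: Windows

Language: Python 3.4

Editor: Jupyter

Reading the data with pandas.

Import the required packages: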

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


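Read the data. Define the column names first, because the source files have no header row, only records whose fields are separated by '::'.

user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']  # column names for the users table


Read the file; adjust the path to wherever your copy is stored.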

users = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\users.dat', sep='::', header=None, names=user_names)


D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
if __name__ == '__main__':


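The warning above can be ignored. It only says that the data was loaded with the Python parsing engine instead of the C engine, because the C engine does not support regex separators and a multi-character separator such as '::' is treated as one. Passing engine='python' to read_table makes the warning go away.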
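As a minimal sketch of reading '::'-separated records without triggering the warning, the snippet below parses four records in the users.dat layout from an in-memory string. The names sample and users_demo are illustrative, pd.read_csv is used here and accepts the same arguments, and forcing zip to be read as text is an assumption made so that leading zeros survive.

```python
import io
import pandas as pd

# Four records in the users.dat layout (UserID::Gender::Age::Occupation::Zip-code)
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "3::M::25::15::55117\n"
    "4::M::45::7::02460\n"
)
user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']

users_demo = pd.read_csv(
    sample,
    sep='::',            # multi-character separator, treated as a regex
    header=None,         # the file has no header row
    names=user_names,
    engine='python',     # choose the Python engine explicitly, so no ParserWarning
    dtype={'zip': str},  # assumption: keep zip codes as text to preserve leading zeros
)
print(users_demo)
```

The same keyword arguments can be passed to the read_table calls in this article, with the file path in place of the in-memory sample.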

Check how many records there are, and look at the first 5 rows.

print(len(users))
users.head()


6040

   user_id  gender  age  occupation    zip
0        1       F    1          10  48067
1        2       M   56          16  70072
2        3       M   25          15  55117
3        4       M   45           7  02460
4        5       M   25          20  55455

Read the movies and ratings data in the same way.

ratings_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\ratings.dat', sep='::', header=None, names=ratings_names)
movies_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\movies.dat', sep='::', header=None, names=movies_names)


D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
from ipykernel import kernelapp as app
D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.


Loading takes a little while because there are about one million rating records.

Look at the ratings and movies tables.

print(len(ratings))
ratings.head()


1000209

   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291

print(len(movies))
movies.head()


3883

   movie_id  title                               genres
0         1  Toy Story (1995)                    Animation|Children's|Comedy
1         2  Jumanji (1995)                      Adventure|Children's|Fantasy
2         3  Grumpier Old Men (1995)             Comedy|Romance
3         4  Waiting to Exhale (1995)            Comedy|Drama
4         5  Father of the Bride Part II (1995)  Comedy

There are just over 1 million ratings.

Merge the 3 tables into a single table, data.

data = pd.merge(pd.merge(users, ratings), movies)
print(len(data))
data.head()


1000209

   user_id  gender  age  occupation    zip  movie_id  rating  timestamp  title                                   genres
0        1       F    1          10  48067      1193       5  978300760  One Flew Over the Cuckoo's Nest (1975)  Drama
1        2       M   56          16  70072      1193       5  978298413  One Flew Over the Cuckoo's Nest (1975)  Drama
2       12       M   25          12  32793      1193       4  978220179  One Flew Over the Cuckoo's Nest (1975)  Drama
3       15       M   25           7  22903      1193       4  978199279  One Flew Over the Cuckoo's Nest (1975)  Drama
4       17       M   50           1  95350      1193       5  978158471  One Flew Over the Cuckoo's Nest (1975)  Drama

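The nested pd.merge calls above rely on pandas picking the join keys by itself: when no key is given, it joins on the column names the two frames share. A minimal sketch with tiny made-up frames (users_demo, ratings_demo, movies_demo and their values are illustrative, not taken from the dataset) shows the same join with the keys spelled out:

```python
import pandas as pd

# Tiny made-up tables with the same key columns as the real ones
users_demo = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
ratings_demo = pd.DataFrame({'user_id': [1, 1, 2],
                             'movie_id': [10, 20, 10],
                             'rating': [5, 3, 4]})
movies_demo = pd.DataFrame({'movie_id': [10, 20],
                            'title': ['Movie A (1990)', 'Movie B (2000)']})

# Explicit keys: users-ratings share user_id, the result and movies share movie_id
step1 = pd.merge(users_demo, ratings_demo, on='user_id')
merged = pd.merge(step1, movies_demo, on='movie_id')

# Implicit keys, as written in the article
implicit = pd.merge(pd.merge(users_demo, ratings_demo), movies_demo)

print(merged)
print(merged.equals(implicit))
```

Naming the key with on= is a little more verbose, but it guards against an accidental join on some other column that happens to share a name.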
Look at all the ratings given by the user with id 1.

data[data.user_id==1]


       user_id  gender  age  occupation    zip  movie_id  rating  timestamp  title                                            genres
    0        1       F    1          10  48067      1193       5  978300760  One Flew Over the Cuckoo's Nest (1975)           Drama
 1725        1       F    1          10  48067       661       3  978302109  James and the Giant Peach (1996)                 Animation|Children's|Musical
 2250        1       F    1          10  48067       914       3  978301968  My Fair Lady (1964)                              Musical|Romance
 2886        1       F    1          10  48067      3408       4  978300275  Erin Brockovich (2000)                           Drama
 4201        1       F    1          10  48067      2355       5  978824291  Bug's Life, A (1998)                             Animation|Children's|Comedy
 5904        1       F    1          10  48067      1197       3  978302268  Princess Bride, The (1987)                       Action|Adventure|Comedy|Romance
 8222        1       F    1          10  48067      1287       5  978302039  Ben-Hur (1959)                                   Action|Adventure|Drama
 8926        1       F    1          10  48067      2804       5  978300719  Christmas Story, A (1983)                        Comedy|Drama
10278        1       F    1          10  48067       594       4  978302268  Snow White and the Seven Dwarfs (1937)           Animation|Children's|Musical
11041        1       F    1          10  48067       919       4  978301368  Wizard of Oz, The (1939)                         Adventure|Children's|Drama|Musical
12759        1       F    1          10  48067       595       5  978824268  Beauty and the Beast (1991)                      Animation|Children's|Musical
13819        1       F    1          10  48067       938       4  978301752  Gigi (1958)                                      Musical
14006        1       F    1          10  48067      2398       4  978302281  Miracle on 34th Street (1947)                    Drama
14386        1       F    1          10  48067      2918       4  978302124  Ferris Bueller's Day Off (1986)                  Comedy
15859        1       F    1          10  48067      1035       5  978301753  Sound of Music, The (1965)                       Musical
16741        1       F    1          10  48067      2791       4  978302188  Airplane! (1980)                                 Comedy
18472        1       F    1          10  48067      2687       3  978824268  Tarzan (1999)                                    Animation|Children's
18914        1       F    1          10  48067      2018       4  978301777  Bambi (1942)                                     Animation|Children's
19503        1       F    1          10  48067      3105       5  978301713  Awakenings (1990)                                Drama
20183        1       F    1          10  48067      2797       4  978302039  Big (1988)                                       Comedy|Fantasy
21674        1       F    1          10  48067      2321       3  978302205  Pleasantville (1998)                             Comedy
22832        1       F    1          10  48067       720       3  978300760  Wallace & Gromit: The Best of Aardman Animatio…  Animation
23270        1       F    1          10  48067      1270       5  978300055  Back to the Future (1985)                        Comedy|Sci-Fi
25853        1       F    1          10  48067       527       5  978824195  Schindler's List (1993)                          Drama|War
28157        1       F    1          10  48067      2340       3  978300103  Meet Joe Black (1998)                            Romance
28501        1       F    1          10  48067        48       5  978824351  Pocahontas (1995)                                Animation|Children's|Musical|Romance
28883        1       F    1          10  48067      1097       4  978301953  E.T. the Extra-Terrestrial (1982)                Children's|Drama|Fantasy|Sci-Fi
31152        1       F    1          10  48067      1721       4  978300055  Titanic (1997)                                   Drama|Romance
32698        1       F    1          10  48067      1545       4  978824139  Ponette (1996)                                   Drama
32771        1       F    1          10  48067       745       3  978824268  Close Shave, A (1995)                            Animation|Comedy|Thriller
33428        1       F    1          10  48067      2294       4  978824291  Antz (1998)                                      Animation|Children's
34073        1       F    1          10  48067      3186       4  978300019  Girl, Interrupted (1999)                         Drama
34504        1       F    1          10  48067      1566       4  978824330  Hercules (1997)                                  Adventure|Animation|Children's|Comedy|Musical
34973        1       F    1          10  48067       588       4  978824268  Aladdin (1992)                                   Animation|Children's|Comedy|Musical
36324        1       F    1          10  48067      1907       4  978824330  Mulan (1998)                                     Animation|Children's
36814        1       F    1          10  48067       783       4  978824291  Hunchback of Notre Dame, The (1996)              Animation|Children's|Musical
37204        1       F    1          10  48067      1836       5  978300172  Last Days of Disco, The (1998)                   Drama
37339        1       F    1          10  48067      1022       5  978300055  Cinderella (1950)                                Animation|Children's|Musical
37916        1       F    1          10  48067      2762       4  978302091  Sixth Sense, The (1999)                          Thriller
40375        1       F    1          10  48067       150       5  978301777  Apollo 13 (1995)                                 Drama
41626        1       F    1          10  48067         1       5  978824268  Toy Story (1995)                                 Animation|Children's|Comedy
43703        1       F    1          10  48067      1961       5  978301590  Rain Man (1988)                                  Drama
45033        1       F    1          10  48067      1962       4  978301753  Driving Miss Daisy (1989)                        Drama
45685        1       F    1          10  48067      2692       4  978301570  Run Lola Run (Lola rennt) (1998)                 Action|Crime|Romance
46757        1       F    1          10  48067       260       4  978300760  Star Wars: Episode IV - A New Hope (1977)        Action|Adventure|Fantasy|Sci-Fi
49748        1       F    1          10  48067      1028       5  978301777  Mary Poppins (1964)                              Children's|Comedy|Musical
50759        1       F    1          10  48067      1029       5  978302205  Dumbo (1941)                                     Animation|Children's|Musical
51327        1       F    1          10  48067      1207       4  978300719  To Kill a Mockingbird (1962)                     Drama
52255        1       F    1          10  48067      2028       5  978301619  Saving Private Ryan (1998)                       Action|Drama|War
54908        1       F    1          10  48067       531       4  978302149  Secret Garden, The (1993)                        Children's|Drama
55246        1       F    1          10  48067      3114       4  978302174  Toy Story 2 (1999)                               Animation|Children's|Comedy
56831        1       F    1          10  48067       608       4  978301398  Fargo (1996)                                     Crime|Drama|Thriller
59344        1       F    1          10  48067      1246       4  978302091  Dead Poets Society (1989)                        Drama

Mean rating of each movie by gender.

mean_ratings_by_gender = data.pivot_table(values='rating',index='title',columns='gender', aggfunc='mean')
mean_ratings_by_gender.head(10)  # show the first 10 rows


gender                                    F         M
title
$1,000,000 Duck (1971)             3.375000  2.761905
'Night Mother (1986)               3.388889  3.352941
'Til There Was You (1997)          2.675676  2.733333
'burbs, The (1989)                 2.793478  2.962085
...And Justice for All (1979)      3.828571  3.689024
1-900 (1994)                       2.000000  3.000000
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000
101 Dalmatians (1996)              3.240000  2.911215
12 Angry Men (1957)                4.184397  4.328421

Add a column to mean_ratings_by_gender holding the difference between the female and male mean ratings.

mean_ratings_by_gender['diff'] = mean_ratings_by_gender.F - mean_ratings_by_gender.M
mean_ratings_by_gender.head()


gender                                F         M       diff
title
$1,000,000 Duck (1971)         3.375000  2.761905   0.613095
'Night Mother (1986)           3.388889  3.352941   0.035948
'Til There Was You (1997)      2.675676  2.733333  -0.057658
'burbs, The (1989)             2.793478  2.962085  -0.168607
...And Justice for All (1979)  3.828571  3.689024   0.139547

Which movies show the largest gap between male and female ratings (rated high by men and low by women, or high by women and low by men)?

mean_ratings_by_gender.sort_values(by='diff',ascending=True).head()
# rated higher by men, lower by women


gender                                        F         M       diff
title
Tigrero: A Film That Was Never Made (1994)  1.0  4.333333  -3.333333
Neon Bible, The (1995)                      1.0  4.000000  -3.000000
Enfer, L' (1994)                            1.0  3.750000  -2.750000
Stalingrad (1993)                           1.0  3.593750  -2.593750
Killer: A Journal of Murder (1995)          1.0  3.428571  -2.428571

mean_ratings_by_gender.sort_values(by='diff',ascending=False).head()
# rated higher by women, lower by men


gender                                                              F         M      diff
title
James Dean Story, The (1957)                                 4.000000  1.000000  3.000000
Spiders, The (Die Spinnen, 1. Teil: Der Goldene See) (1919)  4.000000  1.000000  3.000000
Country Life (1994)                                          5.000000  2.000000  3.000000
Babyfever (1994)                                             3.666667  1.000000  2.666667
Woman of Paris, A (1923)                                     5.000000  2.428571  2.571429

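The extreme differences above come from titles where one gender's mean is a round number such as 1.0 or 5.0, which suggests very few ratings behind it. One hedged way to make the comparison more robust is to keep only titles with enough ratings before sorting. The sketch below uses made-up data; the toy frame and the min_ratings threshold are assumptions for illustration only.

```python
import pandas as pd

# Made-up ratings: title A has four ratings, title B only two
toy = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'M', 'F', 'M'],
    'rating': [4, 5, 3, 4, 1, 5],
})

# Mean rating per title and gender, plus the female-minus-male difference
by_gender = toy.pivot_table(values='rating', index='title',
                            columns='gender', aggfunc='mean')
by_gender['diff'] = by_gender['F'] - by_gender['M']

# Keep only titles with at least min_ratings ratings (hypothetical threshold)
min_ratings = 4
counts = toy.groupby('title').size()
frequent = counts.index[counts >= min_ratings]
filtered = by_gender.loc[frequent].sort_values(by='diff')

print(by_gender)
print(filtered)
```

On the real data, the same filter can be applied with the rating counts per title, using whatever threshold seems reasonable for the question being asked.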
Number of ratings per movie.

total_rating_by_title = data.groupby('title').size()
total_rating_by_title    # the index is the movie title, the values are the number of ratings


title
$1,000,000 Duck (1971)                              37
'Night Mother (1986)                                70
'Til There Was You (1997)                           52
'burbs, The (1989)                                 303
...And Justice for All (1979)                      199
1-900 (1994)                                         2
10 Things I Hate About You (1999)                  700
101 Dalmatians (1961)                              565
101 Dalmatians (1996)                              364
12 Angry Men (1957)                                616
13th Warrior, The (1999)                           750
187 (1997)                                          55
2 Days in the Valley (1996)                        286
20 Dates (1998)                                    139
20,000 Leagues Under the Sea (1954)                575
200 Cigarettes (1999)                              181
2001: A Space Odyssey (1968)                      1716
2010 (1984)                                        470
24 7: Twenty Four Seven (1997)                       5
24-hour Woman (1998)                                 9
28 Days (2000)                                     505
3 Ninjas: High Noon On Mega Mountain (1998)         47
3 Strikes (2000)                                     4
301, 302 (1995)                                      9
39 Steps, The (1935)                               253
400 Blows, The (Les Quatre cents coups) (1959)     187
42 Up (1998)                                        88
52 Pick-Up (1986)                                  140
54 (1998)                                          259
7th Voyage of Sinbad, The (1958)                   258
...
Wrongfully Accused (1998)                          123
Wyatt Earp (1994)                                  270
X-Files: Fight the Future, The (1998)              996
X-Men (2000)                                      1511
X: The Unknown (1956)                               12
Xiu Xiu: The Sent-Down Girl (Tian yu) (1998)        69
Yankee Zulu (1994)                                   2
Yards, The (1999)                                   77
Year My Voice Broke, The (1987)                     27
Year of Living Dangerously (1982)                  391
Year of the Horse (1997)                             4
Yellow Submarine (1968)                            399
Yojimbo (1961)                                     215
You Can't Take It With You (1938)                   77
You So Crazy (1994)                                 13
You've Got Mail (1998)                             838
Young Doctors in Love (1982)                        79
Young Frankenstein (1974)                         1193
Young Guns (1988)                                  562
Young Guns II (1990)                               369
Young Poisoner's Handbook, The (1995)               79
Young Sherlock Holmes (1985)                       379
Young and Innocent (1937)                           10
Your Friends and Neighbors (1998)                  109
Zachariah (1971)                                     2
Zed & Two Noughts, A (1985)                         29
Zero Effect (1998)                                 301
Zero Kelvin (Kjærlighetens kjøtere) (1995)           2
Zeus and Roxanne (1997)                             23
eXistenZ (1999)                                    410
dtype: int64


The 10 movies with the most ratings.

top_10_total_rating = total_rating_by_title.sort_values(ascending=False).head(10)
top_10_total_rating


title
American Beauty (1999)                                   3428
Star Wars: Episode IV - A New Hope (1977)                2991
Star Wars: Episode V - The Empire Strikes Back (1980)    2990
Star Wars: Episode VI - Return of the Jedi (1983)        2883
Jurassic Park (1993)                                     2672
Saving Private Ryan (1998)                               2653
Terminator 2: Judgment Day (1991)                        2649
Matrix, The (1999)                                       2590
Back to the Future (1985)                                2583
Silence of the Lambs, The (1991)                         2578
dtype: int64


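As you can see, the most-rated movies are mostly ones we know well, so they can generally be regarded as popular movies.
Next, look at the 10 movies with the highest mean rating (note: the maximum score is 5.0).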


mean_ratings_by_title = data.groupby('title')['rating'].mean()  # Series: mean rating per title
top_10_mean_ratings = mean_ratings_by_title.sort_values(ascending=False).head(10)
top_10_mean_ratings


title
Gate of Heavenly Peace, The (1995)           5.0
Lured (1947)                                 5.0
Ulysses (Ulisse) (1954)                      5.0
Smashing Time (1967)                         5.0
Follow the Bitch (1998)                      5.0
Song of Freedom (1936)                       5.0
Bittersweet Motel (2000)                     5.0
Baby, The (1973)                             5.0
One Little Indian (1973)                     5.0
Schlafes Bruder (Brother of Sleep) (1995)    5.0
Name: rating, dtype: float64


Mean ratings of the 10 most-rated movies.


mean_ratings_by_title[top_10_total_rating.index]


title
American Beauty (1999)                                   4.317386
Star Wars: Episode IV - A New Hope (1977)                4.453694
Star Wars: Episode V - The Empire Strikes Back (1980)    4.292977
Star Wars: Episode VI - Return of the Jedi (1983)        4.022893
Jurassic Park (1993)                                     3.763847
Saving Private Ryan (1998)                               4.337354
Terminator 2: Judgment Day (1991)                        4.058513
Matrix, The (1999)                                       4.315830
Back to the Future (1985)                                3.990321
Silence of the Lambs, The (1991)                         4.351823
Name: rating, dtype: float64


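This shows that the 10 most-rated movies do not rank especially high by mean rating, and that some of the highest-rated movies are ones we have never heard of. Is something wrong with the data? Not really.
Suppose some poor movie is watched by very few people, and those few give it a very high score. The result is a movie with very few ratings but a very high mean rating.


If you are not convinced, look at the data: the number of ratings for each of the 10 highest-rated movies.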


total_rating_by_title[top_10_mean_ratings.index]


title
Gate of Heavenly Peace, The (1995)           3
Lured (1947)                                 1
Ulysses (Ulisse) (1954)                      1
Smashing Time (1967)                         2
Follow the Bitch (1998)                      1
Song of Freedom (1936)                       1
Bittersweet Motel (2000)                     1
Baby, The (1973)                             1
One Little Indian (1973)                     1
Schlafes Bruder (Brother of Sleep) (1995)    1
dtype: int64
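The small-sample effect is easier to spot when the mean and the number of ratings sit side by side. A minimal sketch, on made-up data (the toy frame and its titles are illustrative), computes both in a single groupby:

```python
import pandas as pd

# Made-up ratings: one title rated once, another rated four times
toy = pd.DataFrame({
    'title':  ['Obscure', 'Popular', 'Popular', 'Popular', 'Popular'],
    'rating': [5, 5, 4, 4, 3],
})

# Mean rating and number of ratings per title, in one table
stats = toy.groupby('title')['rating'].agg(['mean', 'count'])
stats_sorted = stats.sort_values(by='mean', ascending=False)

print(stats_sorted)
```

Here the title with a single rating tops the list by mean, and the count column shows at a glance why that ranking should not be trusted.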


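Now recompute the top 10, this time among popular movies only. Here a movie counts as popular if it has more than 1000 ratings.
First select the popular movies.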


hot_movie = total_rating_by_title[total_rating_by_title>1000]
print(len(hot_movie))
hot_movie


207

title
2001: A Space Odyssey (1968)                          1716
Abyss, The (1989)                                     1715
African Queen, The (1951)                             1057
Air Force One (1997)                                  1076
Airplane! (1980)                                      1731
Aladdin (1992)                                        1351
Alien (1979)                                          2024
Aliens (1986)                                         1820
Amadeus (1984)                                        1382
American Beauty (1999)                                3428
American Pie (1999)                                   1389
American President, The (1995)                        1033
Animal House (1978)                                   1207
Annie Hall (1977)                                     1334
Apocalypse Now (1979)                                 1176
Apollo 13 (1995)                                      1251
Arachnophobia (1990)                                  1367
Armageddon (1998)                                     1110
As Good As It Gets (1997)                             1424
Austin Powers: International Man of Mystery (1997)    1205
Austin Powers: The Spy Who Shagged Me (1999)          1434
Babe (1995)                                           1751
Back to the Future (1985)                             2583
Back to the Future Part II (1989)                     1158
Back to the Future Part III (1990)                    1148
Batman (1989)                                         1431
Batman Returns (1992)                                 1031
Beauty and the Beast (1991)                           1060
Beetlejuice (1988)                                    1495
Being John Malkovich (1999)                           2241
...
Superman (1978)                                       1222
Talented Mr. Ripley, The (1999)                       1331
Taxi Driver (1976)                                    1240
Terminator 2: Judgment Day (1991)                     2649
Terminator, The (1984)                                2098
Thelma & Louise (1991)                                1417
There's Something About Mary (1998)                   1371
This Is Spinal Tap (1984)                             1118
Thomas Crown Affair, The (1999)                       1089
Three Kings (1999)                                    1021
Time Bandits (1981)                                   1010
Titanic (1997)                                        1546
Top Gun (1986)                                        1010
Total Recall (1990)                                   1996
Toy Story (1995)                                      2077
Toy Story 2 (1999)                                    1585
True Lies (1994)                                      1400
Truman Show, The (1998)                               1005
Twelve Monkeys (1995)                                 1511
Twister (1996)                                        1110
Untouchables, The (1987)                              1127
Usual Suspects, The (1995)                            1783
Wayne's World (1992)                                  1120
When Harry Met Sally... (1989)                        1568
Who Framed Roger Rabbit? (1988)                       1799
Willy Wonka and the Chocolate Factory (1971)          1313
Witness (1985)                                        1046
Wizard of Oz, The (1939)                              1718
X-Men (2000)                                          1511
Young Frankenstein (1974)                             1193
dtype: int64


# mean ratings of the popular movies
hot_movie_mean_rating = mean_ratings_by_title[hot_movie.index]
print(len(hot_movie_mean_rating))
hot_movie_mean_rating


207

title
2001: A Space Odyssey (1968)                          4.068765
Abyss, The (1989)                                     3.683965
African Queen, The (1951)                             4.251656
Air Force One (1997)                                  3.588290
Airplane! (1980)                                      3.971115
Aladdin (1992)                                        3.788305
Alien (1979)                                          4.159585
Aliens (1986)                                         4.125824
Amadeus (1984)                                        4.251809
American Beauty (1999)                                4.317386
American Pie (1999)                                   3.709863
American President, The (1995)                        3.793804
Animal House (1978)                                   4.053024
Annie Hall (1977)                                     4.141679
Apocalypse Now (1979)                                 4.243197
Apollo 13 (1995)                                      4.073541
Arachnophobia (1990)                                  3.002926
Armageddon (1998)                                     3.191892
As Good As It Gets (1997)                             3.950140
Austin Powers: International Man of Mystery (1997)    3.710373
Austin Powers: The Spy Who Shagged Me (1999)          3.388424
Babe (1995)                                           3.891491
Back to the Future (1985)                             3.990321
Back to the Future Part II (1989)                     3.343696
Back to the Future Part III (1990)                    3.242160
Batman (1989)                                         3.600978
Batman Returns (1992)                                 2.976722
Beauty and the Beast (1991)                           3.885849
Beetlejuice (1988)                                    3.567893
Being John Malkovich (1999)                           4.125390
...
Superman (1978)                                       3.536825
Talented Mr. Ripley, The (1999)                       3.503381
Taxi Driver (1976)                                    4.183871
Terminator 2: Judgment Day (1991)                     4.058513
Terminator, The (1984)                                4.152050
Thelma & Louise (1991)                                3.680311
There's Something About Mary (1998)                   3.904449
This Is Spinal Tap (1984)                             4.179785
Thomas Crown Affair, The (1999)                       3.641873
Three Kings (1999)                                    3.807052
Time Bandits (1981)                                   3.694059
Titanic (1997)                                        3.583441
Top Gun (1986)                                        3.686139
Total Recall (1990)                                   3.682365
Toy Story (1995)                                      4.146846
Toy Story 2 (1999)                                    4.218927
True Lies (1994)                                      3.634286
Truman Show, The (1998)                               3.861692
Twelve Monkeys (1995)                                 3.945731
Twister (1996)                                        3.173874
Untouchables, The (1987)                              4.007986
Usual Suspects, The (1995)                            4.517106
Wayne's World (1992)                                  3.600893
When Harry Met Sally... (1989)                        4.073342
Who Framed Roger Rabbit? (1988)                       3.679822
Willy Wonka and the Chocolate Factory (1971)          3.861386
Witness (1985)                                        3.996176
Wizard of Oz, The (1939)                              4.247963
X-Men (2000)                                          3.820649
Young Frankenstein (1974)                             4.250629
Name: rating, dtype: float64


# the 10 highest-rated movies among those with more than 1000 ratings
top_10_rating_movie = hot_movie_mean_rating.sort_values(ascending=False).head(10)
top_10_rating_movie


title
Shawshank Redemption, The (1994)                                               4.554558
Godfather, The (1972)                                                          4.524966
Usual Suspects, The (1995)                                                     4.517106
Schindler's List (1993)                                                        4.510417
Raiders of the Lost Ark (1981)                                                 4.477725
Rear Window (1954)                                                             4.476190
Star Wars: Episode IV - A New Hope (1977)                                      4.453694
Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)    4.449890
Casablanca (1942)                                                              4.412822
Sixth Sense, The (1999)                                                        4.406263
Name: rating, dtype: float64


# Use this magic in IPython (or Jupyter) only; omit it elsewhere
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 11)
y = top_10_rating_movie.values
name = top_10_rating_movie.index

# Plot the mean ratings against rank
plt.plot(x, y, 'r-o')

# Annotate each point with its movie title
for i in range(10):
    plt.text(x[i], y[i], name[i])

# Set the axis ranges
plt.xlim(0, 15)
plt.ylim(4.4, 4.56)

# Set the axis labels
#plt.xlabel('Rank')
#plt.ylabel('Rating')

#plt.show()  # use this outside IPython




That plot is ugly, so here is a better one:


import matplotlib.pyplot as plt
import numpy as np

plt.rcdefaults()

people = name
y_pos = np.arange(len(people))
performance = y

# Horizontal bar chart of the mean ratings (plain means, so no error bars)
plt.barh(y_pos, performance, align='center', alpha=0.4)
plt.yticks(y_pos, people)

#plt.xlabel('Rating')
#plt.title('Rank')

#plt.show()  # use this outside IPython
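Because all ten means fall between 4.4 and 4.6, the differences are hard to see on a bar chart whose axis starts at 0. The sketch below is one possible variant, not the article's original plot: it rebuilds the top 10 from the values printed earlier, sorts them so the best title ends up on top, and zooms the x axis. The names top10 and ordered, the axis limits, and the Agg backend are assumptions made for illustration.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs without a display; drop in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Mean ratings of the top 10 popular movies, copied from the output above
top10 = pd.Series({
    'Shawshank Redemption, The (1994)': 4.554558,
    'Godfather, The (1972)': 4.524966,
    'Usual Suspects, The (1995)': 4.517106,
    "Schindler's List (1993)": 4.510417,
    'Raiders of the Lost Ark (1981)': 4.477725,
    'Rear Window (1954)': 4.476190,
    'Star Wars: Episode IV - A New Hope (1977)': 4.453694,
    'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)': 4.449890,
    'Casablanca (1942)': 4.412822,
    'Sixth Sense, The (1999)': 4.406263,
}, name='rating')

# Sort ascending so the highest-rated title is drawn at the top of the chart
ordered = top10.sort_values()

fig, ax = plt.subplots(figsize=(10, 4))
ax.barh(list(ordered.index), ordered.values, align='center', alpha=0.4)
ax.set_xlim(4.3, 4.6)  # zoom in; note that the bars no longer start at zero
ax.set_xlabel('Mean rating')
ax.set_title('Top 10 movies with more than 1000 ratings')
fig.tight_layout()
# fig.savefig('top10.png')  # hypothetical file name; uncomment to write the image
```

Zooming the axis exaggerates small differences, so the axis label and limits should stay visible whenever the chart is shared.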


