
Scraping Douban's 2016 Movies and Genres with Python

2017-01-20 22:53

Description

Okay, keeping it simple this time.

I suddenly felt like watching a movie, so I grabbed Python and pulled Douban's year-end movie list, tallying the rating rankings and genre breakdown along the way. Pretty simple, all things considered.

The 2016 movies are (roughly) all behind this link:

'https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=time&page_limit=365&page_start=0'


You can actually hit this Douban endpoint directly with a GET request. The page_limit parameter controls how many entries are returned, so setting it to a sufficiently large number gets back the whole list.
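For illustration, the same query string can be built with urllib.parse.urlencode (Python 3 here; the parameter names and values come straight from the link above):

```python
from urllib.parse import urlencode

base = 'https://movie.douban.com/j/search_subjects'

params = {
    'type': 'movie',
    'tag': '热门',        # percent-encodes to %E7%83%AD%E9%97%A8, as in the link above
    'sort': 'time',
    'page_limit': 365,   # how many entries to return; a large value fetches the whole year
    'page_start': 0,
}

# urlencode percent-encodes non-ASCII values (UTF-8 by default in Python 3)
url = base + '?' + urlencode(params)
```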

The response looks roughly like this: a JSON object whose subjects array carries each movie's rate, title, and url.



I considered using BeautifulSoup, but it doesn't help here, so it's honest re matching instead.
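As an aside, the reason BeautifulSoup has nothing to chew on is that this endpoint returns JSON, not HTML, so the standard json module would also do the job. A sketch with a hand-written response in the same shape (the field names match the regex used in the code below):

```python
import json

# Hand-written stand-in for the endpoint's real output; note the
# escaped slashes ("\/") that the real response also contains.
raw = ('{"subjects": [{"rate": "8.7", "title": "Example", '
       '"url": "https:\\/\\/movie.douban.com\\/subject\\/1\\/"}]}')

data = json.loads(raw)
# json.loads un-escapes "\/" back to "/" automatically
movies = [(m['title'], m['rate'], m['url']) for m in data['subjects']]
```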

After scraping, everything is stored in a dict. Sorting by key is the fun part: pull the dict's keys out into a list, sort that list, and then iterating over the sorted list and looking up each key gives you the values in key order.

In code:

d = {}
d['olahiuj'] = 'handsome'
for key in sorted(d.keys()):
    print d[key]


Prefer sorted over sort here, because sorted returns a new list instead of modifying the original.
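The difference is easy to demonstrate (Python 3 syntax, with throwaway rating values):

```python
a = [7.5, 9.1, 8.3]
b = sorted(a)          # builds a new sorted list; a is left untouched

c = [7.5, 9.1, 8.3]
c.sort()               # sorts in place and returns None; c itself is reordered
```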

Next comes fetching each scraped URL and pulling out its genres, which is again just re matching. This part is particularly slow, so it can be multithreaded, but keep the request rate down and try to look like a real human (laughs).
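The genre extraction boils down to a single findall over the movie page. The pattern here is the one used in the full code below; the HTML fragment is a hand-written stand-in for a real Douban page:

```python
import re

# Hand-written fragment in the shape of a Douban movie detail page
page = ('<span property="v:genre">剧情</span>'
        '<span property="v:genre">喜剧</span>')

# Non-greedy (.+?) so each span yields one genre
genres = re.findall(r'<span property="v:genre">(.+?)</span>', page)
```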

Then we again use a dict to hold each genre and its count, and dump the result to a csv file.
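In Python 3 the counting dict can be written with dict.get, or in one expression with collections.Counter; a sketch with hypothetical genre data:

```python
from collections import Counter

# Hypothetical per-movie genre lists, standing in for the scraped data
movies = [['Drama', 'Comedy'], ['Drama'], ['Action', 'Drama']]

counts = {}
for genres in movies:
    for genre in genres:
        # dict.get replaces the Python 2 has_key check
        counts[genre] = counts.get(genre, 0) + 1

# collections.Counter does the same tally in one expression
counts2 = Counter(g for genres in movies for g in genres)
```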

Python ships with a csv module, so just import it:

import csv


The main reason for choosing csv over other formats is that csv files can be opened and browsed in Excel.

Writing works like this:

with open('filename.csv', 'wb') as csvfile:
    blah = csv.writer(csvfile, dialect = 'excel')
    blah.writerow([1, 2, 3])


To make sure each item of the list lands in its own column, set dialect to 'excel'. Also, what you write out has to be a list (I think?).
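One caveat if you run this under Python 3: the csv module wants files opened in text mode with newline='' rather than the 'wb' mode used above (this sketch writes to a throwaway file under the temp directory):

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'douban_demo.csv')

# Python 3: text mode plus newline='' (the csv module handles line endings itself)
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(['title', 'rate'])       # each list item lands in its own column
    writer.writerow(['Example', '8.7'])

# Read it back the same way to check the round trip
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
```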

I was also going to visualize the data with a chart or two, but that can wait until tomorrow. By the way, what's with the same-sex (同性) genre having 11 movies, and what's with it ranking first?

Code

# -*- coding: utf-8 -*-
from requests.packages.urllib3.exceptions import InsecureRequestWarning
import threading
import requests
import time
import csv
import os
import re

def getPage(html, url, headers, params = {}, referer = ''):
    # Skip certificate verification for https URLs and send a Referer
    flags = True
    if url[:5] == 'https':
        flags = False
        headers['Referer'] = referer
    response = html.get(url, headers = headers, params = params, verify = flags)
    page = response.content
    return page

def find(string, page, flags = 0):
    pattern = re.compile(string, flags = flags)
    results = re.findall(pattern, page)
    return results

def work(html, url, headers, cnt):
    # The urls scraped from the JSON contain escaped slashes ("\/");
    # strip the backslashes before requesting the page
    url = url.replace('\\', '')
    page = getPage(html, url, headers)
    types = find(r'<span property="v:genre">(.+?)</span>', page)
    global mutex, rec
    # The genre counter is shared between threads, so guard it with the lock
    mutex.acquire()
    print cnt
    for item in types:
        if rec.has_key(item):
            rec[item] += 1
        else:
            rec[item] = 1
    mutex.release()

def init():
    html = requests.session()
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)'}
    page = getPage(html, 'https://movie.douban.com/j/search_subjects', headers,
                   params = {'type': 'movie', 'tag': '热门', 'sort': 'time',
                             'page_limit': '400', 'page_start': '0'})
    results = find(r'"rate":"(.+?)",.+?"title":"(.+?)","url":"(.+?)"', page)
    urls = [item[2] for item in results]
    rates = [item[0] for item in results]
    titles = [item[1] for item in results]

    # Selection sort by rating, descending; compare as floats rather than
    # strings, so that e.g. "10.0" ranks above "9.1"
    for i in xrange(len(urls)):
        for j in xrange(i + 1, len(urls)):
            if float(rates[i]) < float(rates[j]):
                rates[i], rates[j] = rates[j], rates[i]
                urls[i], urls[j] = urls[j], urls[i]
                titles[i], titles[j] = titles[j], titles[i]

    with open('douban.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, dialect = 'excel')
        for i in xrange(len(rates)):
            spamwriter.writerow([titles[i], urls[i], rates[i]])

    # One thread per movie page; rec accumulates genre counts under mutex
    global mutex, rec
    mutex = threading.Lock()
    rec = {}
    jobs = []
    cnt = 0
    for i in xrange(len(urls)):
        cnt += 1
        job = threading.Thread(target = work, args = (html, urls[i], headers, cnt))
        job.start()
        jobs.append(job)

    for job in jobs:
        job.join()

    with open('douban_type.csv', 'wb') as csvfile:
        spamwriter = csv.writer(csvfile, dialect = 'excel')
        for key in sorted(rec.keys(), reverse = True):
            spamwriter.writerow([key, rec[key]])

if __name__ == '__main__':
    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
    init()