
Batch-downloading images from a website with Python

2016-02-28 23:11

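Today I "accidentally" wandered onto a site with plenty of pretty girls. Annoyingly, each page shows only one or two images, and my home WiFi is poor. So, true to my "code monkey" nature, I wrote a small script to pull the images down and look at them locally. Plenty of experts have already built similar tools, but in the spirit of learning and practice I worked through it myself to pick up a few new tricks.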

First, the result, then the script that produced it:

# -*- coding:utf8 -*-
import urllib2
import re
import requests
from lxml import etree
import os

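def check_save_path(save_path):
    # Create the target directory (and any missing parents) if it does not exist yet.
    if not os.path.isdir(save_path):
        os.makedirs(save_path)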

def get_image_name(image_link):
    # Use the last path component of the URL as the file name.
    file_name = os.path.basename(image_link)
    return file_name

def save_image(image_link, save_path):
    file_name = get_image_name(image_link)
    file_path = os.path.join(save_path, file_name)
    print("Preparing to download %s" % image_link)
    try:
        image_data = urllib2.urlopen(url=image_link, timeout=5).read()
        # Write only after the download succeeds; "with" closes the file.
        with open(file_path, "wb") as file_handler:
            file_handler.write(image_data)
    except Exception as ex:
        print(ex)

def get_image_link_from_web_page(web_page_link):
    image_link_list = []
    print(web_page_link)
    try:
        html_content = urllib2.urlopen(url=web_page_link, timeout=5).read()
        html_tree = etree.HTML(html_content)
        print(str(html_tree))
        link_list = html_tree.xpath('//p/img/@src')
        for link in link_list:
            # str.find() returns -1 (truthy) when there is no match,
            # so test for the substring with "in" instead.
            if "uploadfile" in link:
                image_link_list.append("http://www.xgyw.cc/" + link)
    except Exception as ex:
        print(ex)
    return image_link_list

def get_page_link_list_from_index_page(base_page_link):
    try:
        html_content = urllib2.urlopen(url=base_page_link, timeout=5).read()
        html_tree = etree.HTML(html_content)
        print(str(html_tree))
        # The pager links live in <div class="page">.
        link_tmp_list = html_tree.xpath('//div[@class="page"]/a/@href')
        page_link_list = []
        for link_tmp in link_tmp_list:
            page_link_list.append("http://www.xgyw.cc/" + link_tmp)
        return page_link_list
    except Exception as ex:
        print(ex)
        return []

def get_page_title_from_index_page(base_page_link):
    try:
        html_content = urllib2.urlopen(url=base_page_link, timeout=5).read()
        html_tree = etree.HTML(html_content)
        print(str(html_tree))
        page_title_list = html_tree.xpath('//td/div[@class="title"]')
        # Fall back to an empty title when the node has no text.
        page_title_tmp = page_title_list[0].text or ""
        print(page_title_tmp)
        return page_title_tmp
    except Exception as ex:
        print(ex)
        return ""

def get_image_from_web(base_page_link, save_path):
    check_save_path(save_path)
    # Visit every page listed in the pager and download each image found there.
    page_link_list = get_page_link_list_from_index_page(base_page_link)
    for page_link in page_link_list:
        image_link_list = get_image_link_from_web_page(page_link)
        for image_link in image_link_list:
            save_image(image_link, save_path)

base_page_link = "http://www.xgyw.cc/tuigirl/tuigirl1346.html"
page_title = get_page_title_from_index_page(base_page_link)
# Save under the page title when one was found, otherwise under "other".
if page_title:
    save_path = "N:\\PIC\\" + page_title
else:
    save_path = "N:\\PIC\\other\\"

get_image_from_web(base_page_link, save_path)
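
The script above targets Python 2.7. As a rough sketch of the saving step for readers on Python 3, the snippet below takes the file name from the URL path and joins it to the save directory with os.path.join instead of a hard-coded backslash. The helper names build_file_path and save_bytes and the example.com URL are made up for illustration and are not part of the original script.

```python
import os
from urllib.parse import urlparse


def build_file_path(save_dir, image_link):
    """Return save_dir joined with the file name taken from the URL path."""
    # urlparse(...).path drops any query string such as "?x=1".
    file_name = os.path.basename(urlparse(image_link).path)
    return os.path.join(save_dir, file_name)


def save_bytes(save_dir, image_link, data):
    """Create save_dir if needed, write data to the derived path, return the path."""
    os.makedirs(save_dir, exist_ok=True)
    file_path = build_file_path(save_dir, image_link)
    # The "with" block closes the file even if the write fails.
    with open(file_path, "wb") as file_handler:
        file_handler.write(data)
    return file_path
```

For example, save_bytes("pics", "http://www.example.com/uploadfile/a.jpg?x=1", data) would write to a file named a.jpg inside pics, creating the directory first if it is missing.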


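How the code works:

Use urllib2.urlopen(url).read() to fetch the page data, then etree.HTML() to parse the page into an element tree, so that XPath expressions can pick out the values of specific nodes. Finally, walk through all the pages to collect the images to download and save them locally.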
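
The parse, select and filter steps can be tried out without any third-party package. Note that this is not the lxml/XPath approach used in the script: the sketch below swaps in the standard library's html.parser (Python 3) and collects the src of every img tag rather than only //p/img/@src. ImageSrcCollector, extract_image_links and the example.com address are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in the document."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_links(html_text, base_url, marker="uploadfile"):
    """Return absolute URLs of images whose src contains marker."""
    collector = ImageSrcCollector()
    collector.feed(html_text)
    # urljoin copes with src values that do or do not start with "/".
    return [urljoin(base_url, src) for src in collector.sources if marker in src]


sample = '<p><img src="/uploadfile/a.jpg"><img src="/static/logo.png"/></p>'
print(extract_image_links(sample, "http://www.example.com/"))
# -> ['http://www.example.com/uploadfile/a.jpg']
```

Testing with "in" avoids the str.find() pitfall, where a return value of -1 (no match) counts as true, and urljoin avoids the doubled slash that plain string concatenation can produce.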

--=========================================================

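Installing Python packages:

Many Python packages have no Windows installer, or no x64 build, which makes it hard for beginners to get started quickly. You can install the packages you need with pip or easy_install; for installation instructions see https://pypi.python.org/pypi/setuptools

I use easy_install. Python 2.7 is installed on my machine at C:\Python27\python.exe. Download ez_setup.py, save it to the root of the C: drive, then open cmd and run: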

C:\Python27\python.exe "c:\ez_setup.py"

This installs easy_install. When it finishes, easy_install-2.7.exe appears under C:\Python27\Scripts. To install the requests package locally, for example, run:

"C:\Python27\Scripts\easy_install-2.7.exe" requests

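Once the install finishes, one quick way to confirm that a package can actually be imported is to ask Python itself. A minimal sketch that should run on both Python 2.7 and Python 3; is_installed is a hypothetical helper name, and the package names checked are simply the two third-party imports used by the script.

```python
import importlib


def is_installed(package_name):
    """Return True if package_name can be imported, False otherwise."""
    try:
        importlib.import_module(package_name)
    except ImportError:
        return False
    return True


for name in ("requests", "lxml"):
    status = "installed" if is_installed(name) else "missing"
    print("%s: %s" % (name, status))
```
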
--==========================================================

As usual, I'm closing the post with a girl picture: TuiGirl issue 68. If you want the images, search Baidu yourself.
