
July Algorithm (七月算法) course "Python Crawlers", Lesson 3: Crawler Fundamentals and a Simple Crawler Implementation

2017-01-02 10:05

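This lesson touches on many topics: CSS, XPath, JSON, DOM and SAX, regular expressions, Selenium, and more. W3School and RUNOOB.COM are good places to read up on each of them.

A few CSS examples in web pages

Save each example under the file name given and open it in a browser to see the effect.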

css_background_color.html：

<html>
<head>

<style type="text/css">

body {background-color: yellow}
h1 {background-color: #00ff00}
h2 {background-color: transparent}
p {background-color: rgb(250,0,255)}
p.no2 {background-color: gray; padding: 20px;}

</style>

</head>

<body>

<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<p>This is a paragraph</p>
<p class="no2">This paragraph has padding set.</p>

</body>
</html>

css_board_color.html:

<html>
<head>

<style type="text/css">
p.one
{
border-style: solid;
border-color: #0000ff
}
p.two
{
border-style: solid;
border-color: #ff0000 #0000ff
}
p.three
{
border-style: solid;
border-color: #ff0000 #00ff00 #0000ff
}
p.four
{
border-style: solid;
border-color: #ff0000 #00ff00 #0000ff rgb(250,0,255)
}
</style>

</head>

<body>

<p class="one">One-colored border!</p>

<p class="two">Two-colored border!</p>

<p class="three">Three-colored border!</p>

<p class="four">Four-colored border!</p>

<p><b>Note:</b> the "border-width" property has no effect when used on its own. Set the border with the "border-style" property first.</p>

</body>
</html>

css_font_family.html:

<html>
<head>
<style type="text/css">
p.serif{font-family:"Times New Roman",Georgia,Serif}
p.sansserif{font-family:Arial,Verdana,Sans-serif}
</style>
</head>

<body>
<h1>CSS font-family</h1>
<p class="serif">This is a paragraph, shown in the Times New Roman font.</p>
<p class="sansserif">This is a paragraph, shown in the Arial font.</p>

</body>
</html>

css_text_decoration.html:

<html>
<head>
<style type="text/css">
h1 {text-decoration: overline}
h2 {text-decoration: line-through}
h3 {text-decoration: underline}
h4 {text-decoration:blink}
a {text-decoration: none}
</style>
</head>

<body>
<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<p><a href="http://www.w3school.com.cn/index.html">This is a link</a></p>
</body>

</html>
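The class selectors used in these stylesheets (p.one, p.two, and so on) are the same kind of selector a crawler uses to pick elements out of a page. The following is a minimal sketch of that idea using only the standard library; the ClassTextCollector class and the shortened page string are illustrative assumptions, not part of the course material, and the sketch only handles a simple tag.class selector on non-nested elements.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of elements matching a simple `tag.cls` selector."""

    def __init__(self, tag, cls):
        super().__init__()
        self.tag = tag
        self.cls = cls
        self.inside = False   # True while we are inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated class names.
        classes = (dict(attrs).get('class') or '').split()
        if tag == self.tag and self.cls in classes:
            self.inside = True
            self.results.append('')

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.results[-1] += data


# Shortened, hypothetical version of the border example page.
page = '''<body>
<p class="one">One-colored border!</p>
<p class="two">Two-colored border!</p>
</body>'''

collector = ClassTextCollector('p', 'two')
collector.feed(page)
print(collector.results)
```

A real crawler would normally rely on a library with full CSS selector support; this hand-written matcher is only meant to show what "select by tag and class" means.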

JSON encoding and decoding

import json

obj = {'one': '一', 'two': '二'}
encoded = json.dumps(obj)
print(type(encoded))
print(encoded)
decoded = json.loads(encoded)
print(type(decoded))
print(decoded)

<class 'str'>
{"one": "\u4e00", "two": "\u4e8c"}
<class 'dict'>
{'one': '一', 'two': '二'}
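The encoded string above shows \u4e00 and \u4e8c because json.dumps escapes non-ASCII characters by default. A short sketch of the ensure_ascii parameter, which keeps the characters readable; the variable names escaped and readable are chosen here only for illustration.

```python
import json

obj = {'one': '一', 'two': '二'}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(obj)

# ensure_ascii=False keeps the original characters in the output string.
readable = json.dumps(obj, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same dictionary.
assert json.loads(escaped) == json.loads(readable) == obj
```

When writing the readable form to a file, open the file with an explicit encoding such as UTF-8 so the characters are stored correctly.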

Processing XML in Python: DOM

The program below uses book.xml, whose content is:

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>

from xml.dom import minidom

doc = minidom.parse('book.xml')
root = doc.documentElement    # the <bookstore> root element
# print(dir(root))
print(root.nodeName)
books = root.getElementsByTagName('book')
print(type(books))
for book in books:
    titles = book.getElementsByTagName('title')
    # the first child node of <title> is its text node
    print(titles[0].childNodes[0].nodeValue)

bookstore
<class 'xml.dom.minicompat.NodeList'>
Harry Potter
Learning XML
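The DOM program above prints only the titles. As a sketch of how the same tree can yield the attribute and the price as well, the example below parses the book.xml content from an inline string so that it runs on its own; the read_books helper and the BOOK_XML constant are names made up for this illustration.

```python
from xml.dom import minidom

# Same content as book.xml, inlined so the example is self-contained.
BOOK_XML = """<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>"""


def read_books(xml_text):
    """Return one dict per <book> with its title, lang attribute and price."""
    doc = minidom.parseString(xml_text)
    records = []
    for book in doc.getElementsByTagName('book'):
        title = book.getElementsByTagName('title')[0]
        price = book.getElementsByTagName('price')[0]
        records.append({
            'title': title.firstChild.nodeValue,        # text node of <title>
            'lang': title.getAttribute('lang'),         # attribute value
            'price': float(price.firstChild.nodeValue),
        })
    return records


books = read_books(BOOK_XML)
print(books)
```

Because DOM loads the whole document into memory before anything can be read, this style suits small files like this one.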

Processing XML in Python: SAX

from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        self.element = name    # remember the current element for char_data
        print('element: %s, attrs: %s' % (name, str(attrs)))

    def end_element(self, name):
        print('end element: %s' % name)

    def char_data(self, text):
        if text.strip():    # skip whitespace-only text between elements
            print("%s's text is %s" % (self.element, text))

handler = DefaultSaxHandler()
parser = ParserCreate()
# register the callbacks that expat invokes while it reads the document
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
with open('book.xml', 'r') as f:
    parser.Parse(f.read(), True)    # True marks this as the final chunk of input

element: bookstore, attrs: {}
element: book, attrs: {}
element: title, attrs: {'lang': 'eng'}
title's text is Harry Potter
end element: title
element: price, attrs: {}
price's text is 29.99
end element: price
end element: book
element: book, attrs: {}
element: title, attrs: {'lang': 'eng'}
title's text is Learning XML
end element: title
element: price, attrs: {}
price's text is 39.95
end element: price
end element: book
end element: bookstore
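The listing above drives the expat parser directly. For comparison, here is a sketch of the same event-driven approach with the standard xml.sax module, collecting the books into a list instead of printing each event. The BookHandler class and its attribute names are assumptions made for this example, and the XML is inlined so the code runs without book.xml.

```python
import xml.sax


class BookHandler(xml.sax.ContentHandler):
    """Build a list of {'title': ..., 'price': ...} dicts from SAX events."""

    def __init__(self):
        super().__init__()
        self.books = []
        self._current = None   # dict for the <book> being read, if any
        self._buffer = []      # text pieces of the element being read

    def startElement(self, name, attrs):
        if name == 'book':
            self._current = {}
        self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect them all.
        self._buffer.append(content)

    def endElement(self, name):
        if name in ('title', 'price') and self._current is not None:
            self._current[name] = ''.join(self._buffer).strip()
        elif name == 'book':
            self.books.append(self._current)
            self._current = None
        self._buffer = []


BOOK_XML = b"""<bookstore>
<book>
<title lang="eng">Harry Potter</title>
<price>29.99</price>
</book>
<book>
<title lang="eng">Learning XML</title>
<price>39.95</price>
</book>
</bookstore>"""

handler = BookHandler()
xml.sax.parseString(BOOK_XML, handler)
print(handler.books)
```

Unlike DOM, the handler never holds the whole document, only the book currently being read, which is why SAX-style parsing is usually preferred for large files.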

Python regular expressions

import re

m = re.match(r'\d{3}\-\d{3,8}', '010-12345')
# print(dir(m))
print(m.string)
print(m.pos, m.endpos)

# Grouping
print('Grouping')
m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
print(m.groups())
print(m.group(0))    # group 0 is the whole match
print(m.group(1))
print(m.group(2))

# Splitting
print('Splitting')
p = re.compile(r'\d+')
print(type(p))    # '_sre.SRE_Pattern' on Python 3.6 and earlier, 're.Pattern' from 3.7
print(p.split('one1two3three3four4'))

# Validate an hh:mm:ss time and capture its three parts
t = '20:15:45'
m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
print(m.groups())

010-12345
0 9
Grouping
('010', '12345')
010-12345
010
12345
Splitting
<class '_sre.SRE_Pattern'>
['one', 'two', 'three', 'four', '']
('20', '15', '45')
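The time pattern above spells out every tens digit as a separate alternative. A possible shorter form, offered as a sketch rather than as the course's own solution, uses character classes and named groups. It accepts the same strings; TIME_RE and parse_time are names introduced here for illustration.

```python
import re

# hour: 0-23 with optional leading zero; minute and second: 0-59 likewise.
TIME_RE = re.compile(
    r'^(?P<hour>[01]?[0-9]|2[0-3])'
    r':(?P<minute>[0-5]?[0-9])'
    r':(?P<second>[0-5]?[0-9])$'
)


def parse_time(text):
    """Return the time parts as integers, or None if text is not a valid time."""
    m = TIME_RE.match(text)
    if m is None:
        return None
    return {name: int(value) for name, value in m.groupdict().items()}


print(parse_time('20:15:45'))
print(parse_time('24:00:00'))
```

Checking the return value for None matters here: calling groups() on a failed match raises AttributeError, which is easy to hit when the input comes from a scraped page.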

Scraping data from an e-commerce site

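Selenium installation notes:

Selenium itself can be installed directly with pip.

You also need to download chromedriver: https://sites.google.com/a/chromium.org/chromedriver/getting-started

For an installation tutorial, see: http://www.cnblogs.com/fnng/archive/2013/05/29/3106515.html

For usage tutorials, see:

Python + selenium自动化测试 (Python + Selenium automated testing);

Python爬虫利器五之Selenium的用法 (Python crawler tools, part 5: using Selenium);

Selenium with Python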

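from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
import time

# find_element(By..., ...) is used here because the find_element_by_* helpers were removed in Selenium 4.3
browser = webdriver.Chrome()
browser.set_page_load_timeout(30)    # raise an error if a page takes longer than 30 seconds to load
browser.get('http://www.17huo.com/search.html?sq=2&keyword=%E7%BE%8A%E6%AF%9B')
page_info = browser.find_element(By.CSS_SELECTOR, 'body > div.wrap > div.pagem.product_list_pager > div')
# print(page_info.text)
# take the part before the full-width comma; its second space-separated token is the page count
pages = int((page_info.text.split('，')[0]).split(' ')[1])
for page in range(pages):
    if page > 2:    # only crawl the first three pages
        break
    url = 'http://www.17huo.com/?mod=search&sq=2&keyword=%E7%BE%8A%E6%AF%9B&page=' + str(page + 1)
    browser.get(url)
    # scroll to the bottom so that lazily loaded items are rendered
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)   # without this wait the page is not fully loaded
    goods = browser.find_element(By.CSS_SELECTOR, 'body > div.wrap > div:nth-child(2) > div.p_main > ul').find_elements(By.TAG_NAME, 'li')
    print('%d页有%d件商品' % ((page + 1), len(goods)))    # "page N has M items"
    for good in goods:
        try:
            title = good.find_element(By.CSS_SELECTOR, 'a:nth-child(1) > p:nth-child(2)').text
            price = good.find_element(By.CSS_SELECTOR, 'div > a > span').text
            print(title, price)
        except NoSuchElementException:
            # the item does not follow the usual layout; print its raw text instead
            print(good.text)
browser.quit()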

1页有24件商品
2017年春季套头半高圆领纯色羊 ¥105.00
男士羊毛衫 9829 P95 有 ¥95.00
男士羊毛衫1629 P95 断M ¥95.00
男士羊毛衫针织16807 P95 ¥95.00
男士羊毛衫 5266 P95 白 ¥95.00
男士羊毛衫 6072 P75 黑 ¥75.00
男士羊毛衫 8013 P75 白 ¥75.00
男士羊毛衫8606 P95 白断 ¥95.00
男士羊毛衫8656 P95 白断 ¥95.00
男士羊毛衫 6602 P95 断 ¥95.00
8621 P95 秋冬男士羊毛衫 ¥95.00
9993 P70男士羊毛衫毛衣 ¥115.00
男士羊毛衫 55081 P75 ¥75.00
6887 P95 男士羊毛衫 只 ¥115.00
6888 P95 男士羊毛衫 尺 ¥115.00
A01 P95 男士羊毛衫 黑断 ¥95.00
A02 P95 秋冬男士羊毛衫 ¥95.00
A09 P95 秋冬男士羊毛衫 ¥95.00
冬装加厚羊毛衫 大码毛衣8007 ¥110.00
冬装加厚羊毛衫 大码毛衣8008 ¥110.00
冬装加厚羊毛衫 大码毛衣8009 ¥110.00
冬装加厚羊毛衫 大码毛衣8010 ¥110.00
冬装加厚羊毛衫 大码毛衣8011 ¥110.00
冬装加厚羊毛衫 大码毛衣8016 ¥110.00
2页有24件商品
冬装加厚羊毛衫 大码毛衣8018 ¥110.00
冬装加厚羊毛衫 大码毛衣8019 ¥110.00
冬装中领羊毛衫 半领羊绒羊毛混纺 ¥110.00
冬装加厚羊毛衫 高领羊绒羊毛混纺 ¥110.00
冬装加厚羊毛衫 保暖毛衣8015 ¥110.00
冬装加厚羊毛衫 大码毛衣8001 ¥110.00
冬装加厚羊毛衫 大码毛衣8002 ¥110.00
冬装加厚羊毛衫 大码毛衣8004 ¥110.00
冬装加厚羊毛衫 大码毛衣8005 ¥110.00
冬装加厚羊毛衫 大码毛衣8006 ¥110.00
AB16P50 韩版半高领砖红灰 ¥50.00
（成分在详情页）冬装中长款新款呢 ¥165.00
时尚长袖镶钻拼接羊毛纯色连帽插肩 ¥125.00
圆领纯色羊毛卫衣/绒衫2017年 ¥115.00
2016秋冬装男士套头半高领羊毛 ¥200.00
2199 P95 厚款 羊毛衫男 ¥95.00
2335 P95 厚款 羊毛衫男 ¥95.00
2616 P95 厚款 羊毛衫男 ¥95.00
2017年春季七分袖中长款修身时 ¥100.00
气质时尚2017年春季羊毛纯色V ¥100.00
针织衫/毛衣纯色甜美羊毛套头长袖 ¥90.00
2017年春季是低圆领气质羊毛针 ¥65.00
[转卖]学院风宽松茧型连帽牛角扣 ¥130.00
2016 冬季男士高品质羊毛衫 ¥155.00
3页有24件商品
2016 冬季男士高品质羊毛衫 ¥155.00
2016 冬季男士高品质羊毛衫 ¥155.00
2016 冬季男士高端羊毛呢大衣 ¥430.00
【实拍】灰色夹棉加厚 2016 ¥125.00
夹棉加厚 【实拍】 围脖羊毛呢 ¥125.00
【实拍】大货已出 2016秋冬新 ¥65.00
【实拍】大货已出 2016新款 ¥65.00
【实拍】大货已出 2016新款 ¥85.00
【实拍】大货已出 2016新款 ¥75.00
【韩模实拍】大货已出 韩版千鸟格 ¥115.00
【韩模实拍】大货已出 韩版大口袋 ¥130.00
【实拍】大货已出 夹棉加厚牛角 ¥150.00
长袖2017年春季中长款修身时尚 ¥160.00
气质时尚羊毛韩版简约甜美毛呢外套 ¥150.00
长袖短款修身2017年春季时尚优 ¥110.00
2017年春季套头钉珠宽松适中羊 ¥125.00
2017年春季时尚羊毛休闲圆领针 ¥125.00
2017年春季针织衫/开衫休闲羊 ¥110.00
适中宽松潮流纯色知性V领羊毛中老 ¥95.00
2017年春季套头气质时尚适中宽 ¥95.00
圆领针织衫/开衫羊毛知性纯色气质 ¥95.00
时尚适中宽松针织衫/开衫单排扣羊 ¥115.00
2017年春季气质时尚中老年女装 ¥115.00
纯色长袖开衫单排扣2016年冬季 ¥148.00
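The crawler prints each item as a title followed by a ¥ price. If the results are to be sorted or filtered afterwards, the price has to become a number. The following sketch assumes every item line ends with "¥" plus a decimal amount, as in the output above; ITEM_RE and parse_item are hypothetical helper names, not part of the original script.

```python
import re
from decimal import Decimal

# Title (as short as possible), optional whitespace, then "¥" and the amount.
ITEM_RE = re.compile(r'^(?P<title>.+?)\s*¥(?P<price>\d+(?:\.\d+)?)$')


def parse_item(line):
    """Split a printed item line into (title, price); return None otherwise."""
    m = ITEM_RE.match(line.strip())
    if m is None:
        return None
    return m.group('title'), Decimal(m.group('price'))


# Three lines copied from the output above; the last one is a page header.
lines = [
    '2017年春季套头半高圆领纯色羊 ¥105.00',
    '男士羊毛衫 9829 P95 有 ¥95.00',
    '1页有24件商品',
]

items = [parsed for parsed in map(parse_item, lines) if parsed is not None]
cheapest = min(items, key=lambda item: item[1])
print(cheapest)
```

Decimal is used instead of float so that prices compare and add up exactly. In a real crawler it would be simpler to convert price inside the loop, before printing, rather than re-parsing the printed text.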
