
Web Scraping Study Notes 1: Basics and Simple Page Fetching

2018-03-16 15:36
Needing to scrape some data, I started learning about Python web crawlers and went through a few online tutorials over the past two days; the barrier to entry is low and it is quick to pick up. Python crawlers mainly rely on the urllib package to fetch website content. Its core call is:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Only url is required; every other argument has a default. Below is a simple grab of a website's source code:

import urllib.request

url = "http://www.jlis.cn"                    # assign the site address to url
request = urllib.request.Request(url)         # build the request
response = urllib.request.urlopen(request)    # open the URL
data = response.read()
html = data.decode("utf-8")                   # decode the page source
print(html)

After you hit a site many times, it may notice that the visits come from a machine rather than a person and refuse your requests. At that point you need to add a header, and there are two ways to do so. The first is to pass it as a parameter: for example, in the scrape of the Youdao Translate page below, the header is built as a dict and passed in through urllib.request.
import urllib.request
import urllib.parse
import json

content = input("Enter the text you want to translate: ")

url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&sessionFrom="
header = {}
header["User-Agent"] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36'

data = {}
data["i"] = content
data["from"] = 'AUTO'
data["to"] = 'AUTO'
data["smartresult"] = 'dict'
data["client"] = 'fanyideskweb'
data["salt"] = '1521113606768'
data["sign"] = 'e1fb3adb0b4f6746766430de73a2ccf1'
data["doctype"] = 'json'
data["version"] = '2.1'
data["keyfrom"] = 'fanyi.web'
data["action"] = 'FY_BY_CLICKBUTTION'
data["typoResult"] = 'false'
data = urllib.parse.urlencode(data).encode("utf-8")

req = urllib.request.Request(url,data,header)
response = urllib.request.urlopen(req)
html = response.read().decode("utf-8")

target = json.loads(html)
print("翻译结果为:",target['translateResult'][0][0]['tgt'])
The second way is to use add_header:

req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36')
response = urllib.request.urlopen(req)

To keep a site from getting suspicious and banning our IP address, we sometimes need to rotate through several IP addresses, which is where proxy IPs come in:

import urllib.request
import random

url = "http://www.whatismyip.com.tw"
iplist = ['49.64.151.177:61202', '171.37.42.137:61202']            # free proxy IPs collected from the web
proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})   # pick one proxy at random
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36')]
urllib.request.install_opener(opener)
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')
print(html)
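Free proxies die quickly, so a single random.choice can leave the script stuck on a dead address. Below is a minimal sketch of a fallback loop, assuming the sample IPs above are long expired and only illustrate the shape of the list; it uses opener.open instead of install_opener so each attempt stays local rather than changing global state:

import random
import urllib.request
import urllib.error

url = "http://www.whatismyip.com.tw"
iplist = ['49.64.151.177:61202', '171.37.42.137:61202']    # sample addresses; replace with live proxies

random.shuffle(iplist)                                     # try the proxies in random order
for ip in iplist:
    proxy_support = urllib.request.ProxyHandler({'http': ip})
    opener = urllib.request.build_opener(proxy_support)
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36')]
    try:
        response = opener.open(url, timeout=10)            # request through this proxy only
        print(response.read().decode('utf-8'))
        break                                              # success, stop trying proxies
    except (urllib.error.URLError, OSError) as e:
        print("Proxy", ip, "failed:", e)                   # dead proxy, fall through to the next one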
                                            