
Python for beginners: scraping one specific user's posts from a forum thread (Hupu, 《健美大神之路》)

2017-08-21 23:01

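A blogger on Hupu has been translating 《健美大神之路》 and I enjoyed it, but I could not find an e-book version, so I decided to scrape the posts myself and save them to a text file.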

I used two tools: urllib2 and BeautifulSoup.

The two biggest problems I ran into were encoding errors and other users' replies ending up in the text I scraped.

I still do not fully understand the first problem; I eventually got past it by trial and error:

for string in tags.next_sibling.next_sibling.find('div',class_='quote-content').strings:
    string_gbk=string.encode('utf-8')
    file.write(string_gbk)

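If the second line does not call .encode('utf-8'), a gbk codec error is raised. Note that, despite its name, string_gbk holds UTF-8 bytes.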
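
The encoding step can also be handled by the file object itself. The following is a minimal sketch, assuming Python 3 (the post itself uses Python 2, where `io.open` accepts the same `encoding` argument); `save_text` and the sample strings are hypothetical names used only for illustration.

```python
import os
import tempfile


def save_text(path, chunks):
    """Write an iterable of text chunks to `path` as UTF-8."""
    # Passing encoding= lets the file object convert text to bytes itself,
    # so the platform default codec is never consulted.
    with open(path, "w", encoding="utf-8") as out:
        for chunk in chunks:
            out.write(chunk)


# Hypothetical sample data standing in for the scraped strings.
sample = ["健美大神之路", " chapter 1"]
target = os.path.join(tempfile.mkdtemp(), "book.txt")
save_text(target, sample)

# Read the file back as raw bytes to confirm what was stored.
with open(target, "rb") as raw:
    data = raw.read()
```

With the encoding fixed when the file is opened, the loop can write the strings directly and no per-string `.encode()` call is needed.
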
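The second problem comes down to finding how the posts by the specific user differ from everyone else's in the HTML source, and then filtering on that difference.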

url='https://bbs.hupu.com/19201877.html'

At first I looked for tags like

<div class="quote-content">

and pulled out their strings.

Every user's post sits inside a tag like this, so naturally everything got scraped.

Once I understood the cause, I inspected and compared the markup carefully to work out how to pick out only the blocks I wanted:

<div class="quote-content">

Then I noticed something obvious: every post starts with a header containing the poster's information, and that was the way in.
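
The idea of filtering on the post header can be sketched roughly as follows. This is only an illustration under stated assumptions: the markup in `SAMPLE_HTML`, the profile URLs, the `post-body` class, and the function name `posts_by` are all hypothetical and merely imitate the layout described here; the real Hupu page may differ. It is written for Python 3 and uses the built-in `html.parser` backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates the header-then-body layout.
SAMPLE_HTML = """
<div class="floor">
  <div class="author"><div><a href="https://my.hupu.com/111">translator</a></div></div>
  <div class="post-body"><div class="quote-content">Chapter one text.</div></div>
</div>
<div class="floor">
  <div class="author"><div><a href="https://my.hupu.com/222">someone else</a></div></div>
  <div class="post-body"><div class="quote-content">Nice thread!</div></div>
</div>
"""


def posts_by(html, profile_url):
    """Return the text of every post whose header links to profile_url."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for header in soup.find_all("div", class_="author"):
        link = header.find("a", href=True)
        if link is None or link["href"] != profile_url:
            continue  # header belongs to another user
        # find_next_sibling() returns the next tag and skips text nodes.
        body = header.find_next_sibling("div")
        content = body.find("div", class_="quote-content") if body else None
        if content is not None:
            texts.append(content.get_text(strip=True))
    return texts


wanted = posts_by(SAMPLE_HTML, "https://my.hupu.com/111")
```

Checking the header first and only then descending into the body keeps other users' replies out of the result, even though every post uses the same `quote-content` class.
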

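Even with the right direction and the right tool (BeautifulSoup's sibling navigation), I kept getting errors because I misunderstood how siblings work, that is, how the HTML tree is actually structured. Two points strike me as especially important: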

1.

In real documents, the .next_sibling or .previous_sibling of a tag will usually be a string containing whitespace. Going back to the “three sisters” document:

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

You might think that the .next_sibling of the first <a> tag would be the second <a> tag. But actually, it’s a string: the comma and newline that separate the first <a> tag from the second:

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# u',\n'

The second <a> tag is actually the .next_sibling of the comma:

link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

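That is the official BeautifulSoup documentation's explanation of a common mistake with sibling nodes.
2. Always examine the HTML tag (node) structure from the outside in, largest container first; the browser's Inspect Element tool is better for this than viewing the raw source.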
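
The whitespace-sibling pitfall from point 1 can be reproduced in a few lines. This is a small sketch assuming Python 3 and the built-in `html.parser` backend; the "three sisters" fragment is taken from the BeautifulSoup documentation, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = """<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>"""

soup = BeautifulSoup(html_doc, "html.parser")
first = soup.a

# .next_sibling returns whatever node comes next, here the text ",\n".
raw_next = first.next_sibling
is_text = isinstance(raw_next, NavigableString)

# Chaining .next_sibling twice reaches the second <a> tag.
chained = first.next_sibling.next_sibling

# find_next_sibling() skips text nodes and returns the next matching tag.
tag_next = first.find_next_sibling("a")
```

Chaining `.next_sibling` works only while exactly one text node sits between the tags, so `find_next_sibling()` is usually the less fragile choice when the spacing of the page might change.
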

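Finally, here is the code. It is only a small program, so I did not lay it out as a proper project; I recommend building your own project with modules and classes, which is good for your logic and object-oriented thinking.

# -*- coding:utf-8 -*-
# Python 2 script (urllib2 does not exist in Python 3).
import urllib2
from bs4 import BeautifulSoup

# The thread spans pages 1-5; pages 2-5 carry a "-N" suffix.
start_url='https://bbs.hupu.com/19201877.html'
all_urls=[]
all_urls.append(start_url)
for x in range(2,6):
    all_urls.append('https://bbs.hupu.com/19201877-'+str(x)+'.html')

# The with block closes book.txt even if a request fails.
with open('book.txt','w') as file:
    for url in all_urls:
        request=urllib2.Request(url)
        response=urllib2.urlopen(request)
        cont=response.read()
        soup=BeautifulSoup(cont,"lxml",from_encoding='utf-8')
        # Each post header is a <div class="author">.
        for tags in soup.find_all('div',class_="author"):
            # print tags.next_sibling.next_sibling.find('div',class_='quote-content')
            # Keep only posts whose header links to the target user's profile.
            if tags.div.a['href']=='https://my.hupu.com/232157742256797':
                # The first next_sibling is a text node; the second is the post body.
                for string in tags.next_sibling.next_sibling.find('div',class_='quote-content').strings:
                    string_gbk=string.encode('utf-8')  # UTF-8 bytes, despite the name
                    file.write(string_gbk)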
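
The page-URL bookkeeping in the script can be pulled into a small helper. This is a sketch only: the URL pattern is the one used in the code above, while the function name `page_urls` and its parameters are hypothetical. It runs on both Python 2 and Python 3.

```python
def page_urls(thread_id, last_page):
    """Build the list of page URLs for a thread with `last_page` pages."""
    base = "https://bbs.hupu.com/{}".format(thread_id)
    # Page 1 has no suffix; later pages append "-N" before ".html".
    urls = [base + ".html"]
    urls.extend("{}-{}.html".format(base, n) for n in range(2, last_page + 1))
    return urls


urls = page_urls(19201877, 5)
```

Taking the last page number as an argument avoids the off-by-one that is easy to introduce with a hard-coded `range(2, 6)`.
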
