Detailed Notes on BeautifulSoup
2015-08-02 14:39
If you have a full Python setup with pip, install it from the command line with pip install beautifulsoup4
Importing it into a project
from bs4 import BeautifulSoup
Quick start
```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
```
Initializing a BeautifulSoup object
```python
soup = BeautifulSoup(html_doc, 'html.parser')
```
Pretty-printed output
```python
print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
```
Accessing tags
```python
soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u"The Dormouse's story"

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>
```
Accessing a tag's attributes
```python
soup.p['class']
# ['title']  (in bs4, class is a multi-valued attribute, so it comes back as a list)

soup.a  # only finds the first match
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```
Searching with conditions
```python
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
```
Getting the text shown on the page

Use print(soup.get_text()). Important: soup.get_text("", True) (separator "", strip=True) can also help work around encoding problems.

```python
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
```
Getting comments: the soup.tag.string object
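As a small sketch (the markup below is invented for illustration), a comment comes back from .string as a Comment object:

```python
from bs4 import BeautifulSoup, Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string                     # the text of the comment
is_comment = isinstance(comment, Comment)   # Comment is a NavigableString subclass
```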
Tag
name: tag.name gives the tag's name
attributes: tag['attribute'] gives the value of one of the tag's attributes
tag.prettify() pretty-prints the entire tag
NavigableString
tag.string: gets the string inside the tag
unicode_string = unicode(tag.string) converts it to a plain Unicode string (Python 2; on Python 3 use str())
tag.string.replace_with("No longer bold") replaces the tag's string
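A minimal sketch of replace_with(), using throwaway markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
tag.string.replace_with("No longer bold")  # swap the NavigableString in place
result = str(tag)
```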
BeautifulSoup: this object represents the document itself. You can also treat it as one big Tag; most Tag methods work on it.
Navigating between tags

soup.body.b
```python
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
.contents and .children

tag.contents returns a list of the tag's direct children; tag.children is a generator over the same children.
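A quick sketch of the difference, with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

direct = head_tag.contents             # a list of direct children
also_direct = list(head_tag.children)  # an iterator over the same children
```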
.descendants iterates over all of a tag's children, grandchildren, and so on:

```python
head_tag = soup.head
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
```
.string and .strings

.string: if a tag contains only one NavigableString, this retrieves it directly.
If there is more than one, use .strings, a generator. It yields the strings verbatim, so '\n' shows up too.
Use .stripped_strings instead to get the strings without the extra whitespace and newlines.
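A short sketch contrasting .strings and .stripped_strings (markup invented):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  one  </p>\n<p>  two  </p>", "html.parser")

raw = list(soup.strings)             # verbatim, whitespace and '\n' included
clean = list(soup.stripped_strings)  # trimmed, whitespace-only strings skipped
```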
Tag relationships

.parent
.parents
.next_sibling and .previous_sibling
.next_element and .previous_element
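A small sketch of these navigation attributes, on invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")

parent_name = soup.b.parent.name  # 'p'
sibling = soup.b.next_sibling     # the <i> tag
next_el = soup.b.next_element     # the string 'one' inside <b>
```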
Searching

First, the kinds of values each filter argument can accept:
1. A string: soup.find_all('b')
2. A regular expression: soup.find_all(re.compile("^b"))
3. A list: soup.find_all(["a", "b"]) matches <a> tags or <b> tags
4. A function: a custom function that takes a tag and returns a boolean
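A sketch of the function filter (the helper name and markup are made up here):

```python
from bs4 import BeautifulSoup

html = '<p class="title">a</p><p class="story">b</p><p id="x">c</p>'
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    # keep tags that define class but not id
    return tag.has_attr("class") and not tag.has_attr("id")

matches = soup.find_all(has_class_but_no_id)
```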
A closer look at find_all()
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
name: a tag name that limits the search; it accepts any of the values listed above.
The keyword arguments: id, href, etc. (an attribute like data-foo is not a valid Python identifier, so pass it via attrs={"data-foo": "value"} instead):

```python
soup.find_all(href=re.compile(r"\.jpg"))
soup.find_all(id=True)
soup.find_all(href=re.compile("elsie"), id='link1')
```
Searching by CSS class: class is a reserved word in Python, so the class_ keyword is used instead.
Older versions used: soup.find_all("a", attrs={"class": "sister"})
The value can be a string, a function returning a boolean, or even a regular expression.
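A short sketch comparing the class_ styles (markup invented):

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="sister">A</a><a class="brother">B</a>', "html.parser")

by_string = soup.find_all("a", class_="sister")
by_regex = soup.find_all("a", class_=re.compile("^sis"))
by_attrs = soup.find_all("a", attrs={"class": "sister"})  # older style
```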
The string argument

With the string argument you search for strings instead of tags; it accepts the same kinds of values as name.
On its own, it returns only the matching strings.
Combined with a tag name, it returns the tags whose string matches.
string is simply the newer name for the old text argument.
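A sketch of the string argument on invented markup:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="#">Elsie</a><a href="#">Lacie</a>', "html.parser")

strings = soup.find_all(string=re.compile("ie"))  # bare matching strings
tags = soup.find_all("a", string="Elsie")         # tags whose string matches
```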
The recursive argument: if set to False, only direct children are searched; Beautiful Soup does not recurse further down.
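A minimal sketch of recursive=False, using invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>t</title></head></html>", "html.parser")

deep = soup.html.find_all("title")                      # searches all descendants
shallow = soup.html.find_all("title", recursive=False)  # direct children only
```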
Output

1. Pretty-printed output
tag.prettify() formats the output as Unicode, with each tag on its own line, much like a code formatter.
2. Non-pretty-printed output

If you just want the string, use str(soup):

```python
str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

unicode(soup.a)  # Python 2
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
```

Use encode() to get a bytestring, and decode() to get Unicode.
3. Formatters

Each of the methods above accepts a formatter= argument so you can control how the output is escaped.
formatter="html":

```python
print(soup.prettify(formatter="html"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
```
formatter=None:

```python
print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>
```
A custom function:

```python
def uppercase(str):
    return str.upper()

print(soup.prettify(formatter=uppercase))
# <html>
#  <body>
#   <p>
#    IL A DIT <<SACRÉ BLEU!>>
#   </p>
#  </body>
# </html>

print(link_soup.a.prettify(formatter=uppercase))
# <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
#  A LINK
# </a>
```
If you're writing your own function, you should know about the EntitySubstitution class in the bs4.dammit module. This class implements Beautiful Soup's standard formatters as class methods: the "html" formatter is EntitySubstitution.substitute_html, and the "minimal" formatter is EntitySubstitution.substitute_xml. You can use these functions to simulate formatter="html" or formatter="minimal", but then do something extra.
For example:

```python
from bs4.dammit import EntitySubstitution

def uppercase_and_substitute_html_entities(str):
    return EntitySubstitution.substitute_html(str.upper())

print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
# <html>
#  <body>
#   <p>
#    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
#   </p>
#  </body>
# </html>
```
get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

```python
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)

soup.get_text()
# u'\nI linked to example.com\n'

soup.i.get_text()
# u'example.com'
```

You can specify a string to be used to join the bits of text together:

```python
soup.get_text("|")
# u'\nI linked to |example.com|\n'
```

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

```python
soup.get_text("|", strip=True)
# u'I linked to|example.com'
```

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

```python
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
```