BeautifulSoup

Latest recommended article published 2020-03-20 16:11:40

License: CC 4.0 BY-SA

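This article covers the basic installation and use of BeautifulSoup4: how to parse an HTML document, how to find specific tags and their attributes, and how to traverse the document tree.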

Installation

pip install beautifulsoup4

Usage example

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

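Common operations on html_doc:

from bs4 import BeautifulSoup
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_doc, "html.parser")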

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

soup.title.parent.name
# 'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# ['title']  (class is a multi-valued attribute, so a list is returned)

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Get the link (href) of the first <a> tag in the document
soup.find_all('a')[0].get('href')
# 'http://example.com/elsie'

# Get all the text content of the document
soup.get_text()

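Kinds of objects

Navigable strings (NavigableString)

Get: tag.string
Replace: tag.string.replace_with("No longer bold")
A string cannot contain anything else (a tag can contain strings or other tags), so strings do not support the .contents or .string attributes or the find() method.

To use a NavigableString outside Beautiful Soup, call str() on it (unicode() in Python 2) to convert it into an ordinary Unicode string. Otherwise the string keeps a reference to the entire parse tree even after you have finished with Beautiful Soup, which wastes memory.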
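
The points about NavigableString can be sketched roughly as follows. This is a minimal illustration, assuming the built-in html.parser backend and a made-up one-tag document; the variable names are not from the original article.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical one-tag document used only for this illustration
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

# tag.string is a NavigableString: a str subclass that stays linked to the tree
text_node = tag.string
is_navigable = isinstance(text_node, NavigableString)

# str() produces a plain Python string with no link back to the parse tree
plain = str(text_node)

# replace_with() swaps the string in place inside the tree
tag.string.replace_with("No longer bold")
rendered = str(tag)
```

Keeping only `plain` around lets the parse tree be garbage-collected once `soup` goes out of scope.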

BeautifulSoup

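It represents the document as a whole. Most of the time you can treat it as a Tag object.
Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no tag name and no attributes of its own.
For convenience it is still given a special .name whose value is "[document]".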
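
A short sketch of what this looks like in practice, assuming the html.parser backend and a small made-up document:

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration
soup = BeautifulSoup("<html><head><title>Demo</title></head></html>", "html.parser")

# The document object reports a placeholder name rather than a real tag name
doc_name = soup.name

# It still supports the same search methods as a Tag
title_text = soup.find("title").string
```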

Comments and special strings
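
As a rough illustration of this category, an HTML comment is parsed into a Comment object, which is a special kind of NavigableString. The sketch below assumes the html.parser backend; the markup is made up for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: a tag whose only child is an HTML comment
markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string
is_comment = isinstance(comment, Comment)
is_string_too = isinstance(comment, NavigableString)

# The comment delimiters are not part of the string value
comment_text = str(comment)
```

Checking `isinstance(node, Comment)` is one way to skip comments when collecting visible text.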

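Navigating the document tree

.contents and .children

A tag's .contents is a list of its direct children. A string has no .contents attribute, because a string has no child nodes.
.children is an iterator over a tag's direct children.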
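
A minimal sketch of the difference, assuming the html.parser backend and a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head = soup.head

# .contents is a list, so it can be indexed and measured
child_count = len(head.contents)

# .children yields the same direct children one at a time
child_names = [child.name for child in head.children]

# The text inside <title> is a string node, which has no children of its own
title_text = head.contents[0].contents[0]
try:
    title_text.contents
    string_has_contents = True
except AttributeError:
    string_has_contents = False
```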

.descendants

Iterates recursively over all of a tag's descendants: its children, their children, and so on.
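
To make the contrast with direct children concrete, here is a small sketch under the same assumptions (html.parser, made-up fragment):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration
soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")

# Direct children of <head>: only the <title> tag
direct = list(soup.head.children)

# Descendants of <head>: the <title> tag and the string inside it
nested = list(soup.head.descendants)
```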

.string

If a tag contains more than one child node, it is ambiguous which child .string should refer to, so .string is None.
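
A brief sketch of both cases, assuming the html.parser backend and a made-up fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p> has a single chain of children, <div> has two
soup = BeautifulSoup("<p><b>one</b></p><div><i>a</i><i>b</i></div>", "html.parser")

# <p> has exactly one child (<b>), which has exactly one string, so .string resolves
single = soup.p.string

# <div> has two children, so .string cannot choose and yields None
multiple = soup.div.string
```

When `.string` is None, `get_text()` or `.strings` can be used to collect the text instead.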

.strings and .stripped_strings
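
As a rough sketch of how the two differ: .strings yields every string under a tag as-is, while .stripped_strings trims whitespace and skips strings that are only whitespace. The example assumes the html.parser backend and a made-up fragment.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with deliberate extra whitespace
soup = BeautifulSoup("<p>\n  Hello <b> world </b>\n</p>", "html.parser")

# Every string under <p>, whitespace included
raw = list(soup.p.strings)

# The same strings with surrounding whitespace removed and blank ones dropped
clean = list(soup.p.stripped_strings)
```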

Parent nodes

.parent

.parents
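
A minimal sketch of both attributes, assuming the html.parser backend and a made-up document: .parent gives the immediate parent, and .parents walks upward until it reaches the document object.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration
soup = BeautifulSoup("<html><body><p><a id='x'>link</a></p></body></html>", "html.parser")
link = soup.a

# The tag that directly contains <a>
parent_name = link.parent.name

# Every ancestor, from the nearest one up to the BeautifulSoup object itself
ancestor_names = [ancestor.name for ancestor in link.parents]
```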

Sibling nodes

.next_sibling and .previous_sibling

.next_siblings and .previous_siblings

Iterate over the siblings that come after or before the current node.
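
A small sketch of the sibling attributes, assuming the html.parser backend; the tag names b, c and d are arbitrary placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three sibling tags under one parent
soup = BeautifulSoup("<a><b>text1</b><c>text2</c><d>text3</d></a>", "html.parser")

next_name = soup.b.next_sibling.name          # the tag right after <b>
prev_name = soup.c.previous_sibling.name      # the tag right before <c>
no_previous = soup.b.previous_sibling         # <b> is the first child

# The plural forms iterate over all remaining siblings in that direction
following = [sibling.name for sibling in soup.b.next_siblings]
preceding = [sibling.name for sibling in soup.d.previous_siblings]
```

In real documents the nearest sibling is often a whitespace string rather than a tag, so it is worth checking the node type before using it.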

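Going back and forth

.next_element and .previous_element

.next_elements and .previous_elements

The .next_elements and .previous_elements iterators move forward or backward through the document in the order it was parsed, as if the document were being parsed again.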
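
The difference from the sibling attributes can be sketched like this, assuming the html.parser backend and a made-up fragment: .next_element follows parse order, so it steps into a tag's contents before moving past the tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment used only for this illustration
soup = BeautifulSoup("<p><b>bold</b>after</p>", "html.parser")
bold = soup.b

# The next sibling is whatever follows <b> inside <p>
sibling = bold.next_sibling

# The next element is whatever was parsed right after the <b> start tag
element = bold.next_element

# Walking forward from <b> visits its own string first, then the text after it
forward = [str(node) for node in bold.next_elements]

# Walking one step backward from <b> reaches its parent tag
backward_name = bold.previous_element.name
```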

Searching the document tree

find() and find_all()

soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
# string= is the current name of this argument; older versions call it text=
soup.find(string=re.compile("sisters"))
# 'Once upon a time there were three little sisters; and their names were\n'

# True matches anything, so this finds every tag in the document but returns no string nodes
soup.find_all(True)

# Given a list, Beautiful Soup returns everything that matches any item in the list.
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

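# Tags with special attributes (names such as data-* that cannot be keyword arguments) can be found by passing a dictionary as attrs:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', "html.parser")
data_soup.find_all(attrs={"data-foo": "value"})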
# [<div data-foo="value">foo!</div>]

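find() behaves like find_all() with limit=1. The only difference is that find_all() returns a list containing the single result, while find() returns the result directly. When nothing matches, find_all() returns an empty list and find() returns None.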
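
A minimal sketch of that comparison, assuming the html.parser backend and a made-up document; the tag name nosuchtag is deliberately absent from it.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration
soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

# Same match, two shapes: a one-element list versus the tag itself
as_list = soup.find_all("title", limit=1)
as_tag = soup.find("title")

# No match: an empty list versus None
missing_list = soup.find_all("nosuchtag")
missing_tag = soup.find("nosuchtag")
```

Because find() can return None, chained access such as `soup.find("nosuchtag").string` raises AttributeError, so the result is worth checking first.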

Other find methods

find_parents() and find_parent()
find_next_siblings() and find_next_sibling()
find_previous_siblings() and find_previous_sibling()
find_all_next() and find_next()
find_all_previous() and find_previous()
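
These methods take the same filters as find() and find_all() but search in a different direction from the starting tag. The sketch below assumes the html.parser backend and a shortened, made-up version of the three-sisters paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the example document, trimmed for brevity
html = (
    '<p class="story"><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a", id="link1")
last = soup.find("a", id="link3")

# Upward: the nearest enclosing <p>
parent_class = first.find_parent("p")["class"]

# Sideways: later <a> siblings, and the closest earlier <a> sibling
later_ids = [a["id"] for a in first.find_next_siblings("a")]
earlier_id = last.find_previous_sibling("a")["id"]

# Parse order: the first <a> encountered after or before the starting tag
next_id = first.find_next("a")["id"]
previous_id = last.find_previous("a")["id"]
```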
