Python Third-Party Libraries: Beautiful Soup

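This article shows how to parse an HTML document with Python's BeautifulSoup library, using a concrete example to extract the page title, the body text, and the links.

Beautiful Soup is a library for parsing HTML and XML.

Official documentation: Chinese translation

The example below is taken from the official documentation: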

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # build the parse tree with the built-in parser
print(soup.prettify())  # pretty-printed output, one tag per line

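# Get all text content (get_text is a method and must be called;
# printing soup.get_text without parentheses shows the bound method instead)
print(soup.get_text())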

# Get the title tag, its name, its text, and its parent's name
print(soup.title)
print(soup.title.name)
print(soup.title.string)
print(soup.title.parent.name)

# Get the first p tag and its class attribute
print(soup.p)
print(soup.p['class'])

# Get the first a tag, its class attribute, and the tag with id="link3"
print(soup.a)
print(soup.a['class'])
print(soup.find(id="link3"))

# Find all a tags
for link in soup.find_all('a'):
    print(link)

Output:

Formatted output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

All text content, as returned by soup.get_text() (printing soup.get_text without the parentheses would show the bound method and the full markup instead):

The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.

...
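
The text returned by get_text() keeps the whitespace and line breaks of the source markup. As a minimal sketch (the snippet string and variable names below are made up for illustration and are not part of the original example), passing a separator and strip=True yields a tidier single-line result:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this illustration
snippet = "<p>\n  Hello, <b>world</b>\n</p>"
soup = BeautifulSoup(snippet, "html.parser")

# Default behaviour: text nodes are joined exactly as they appear
raw = soup.get_text()

# Strip each text node, drop the empty ones, and join the rest with a space
clean = soup.get_text(" ", strip=True)

print(repr(raw))
print(repr(clean))
```

Whether to strip depends on the use case: the raw form preserves layout, while the stripped form is usually easier to search or store.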

Title information:

<title>The Dormouse's story</title>
title
The Dormouse's story
head

First p tag:

<p class="title"><b>The Dormouse's story</b></p>
['title']

First a tag:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
['sister']
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

All a tags:

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
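
The loop above prints whole tags. To pull out just the link targets, a common pattern is to read the href attribute of each tag. The sketch below assumes a shortened copy of the sample markup; the names html and links are chosen here for illustration:

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document, containing only the links
html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect (link text, URL) pairs; .get() returns None if href is missing
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

for text, href in links:
    print(text, href)
```

Using a.get("href") rather than a["href"] avoids a KeyError on anchors that have no href attribute.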