BeautifulSoup Tutorial with Examples (优快云 blog)

Article link: https://blog.youkuaiyun.com/a6181816/article/details/145499207

BeautifulSoup Tutorial

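BeautifulSoup is a Python library for parsing HTML and XML documents, commonly used for web scraping and data extraction. This tutorial walks through it step by step and ends with a table of commonly used parameters and methods.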

1. Installing BeautifulSoup

First, install beautifulsoup4 and lxml (lxml is optional; Python's built-in html.parser needs no extra install):

pip install beautifulsoup4 lxml

2. Importing the library

from bs4 import BeautifulSoup

3. Creating a BeautifulSoup object

Load an HTML document into BeautifulSoup:

# Sample HTML document
html_doc = """
<html>
<head><title>示例网页</title></head>
<body>
    <h1>标题</h1>
    <p class="content">这是一个段落。</p>
    <a href="https://example.com">链接</a>
    <div id="main">
        <p>另一个段落</p>
    </div>
</body>
</html>
"""

# Create the BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')  # parse with the lxml parser
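
The parser name passed as the second argument must correspond to a parser that is actually installed. The following is a minimal sketch, assuming lxml may be missing on some machines, that falls back to the built-in html.parser. The helper name make_soup and the sample markup are illustrative and not part of the original tutorial.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


# Hypothetical sample markup, used only for this sketch.
sample = "<html><head><title>Example page</title></head><body><p>Hello</p></body></html>"
soup = make_soup(sample)
print(soup.title.string)
```

The two parsers can build slightly different trees for malformed HTML, so pinning one parser explicitly tends to give more reproducible results than relying on whichever happens to be available.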

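4. Common methods

4.1 Finding a single element

find(): returns the first matching element, or None if nothing matches.

title = soup.find('title')  # find the <title> tag
print(title.text)  # Output: 示例网页

find() with an attribute filter:

p_tag = soup.find('p', class_='content')  # find the <p> tag whose class is content
print(p_tag.text)  # Output: 这是一个段落。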
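
Because find() returns None when nothing matches, accessing .text on the result raises AttributeError for a missing element. The sketch below shows one way to guard against that. The helper first_text and the sample markup are hypothetical names introduced only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this sketch.
html = '<html><body><p class="content">This is a paragraph.</p></body></html>'
soup = BeautifulSoup(html, "html.parser")


def first_text(node, name, default="", **attrs):
    """Return the stripped text of the first matching tag, or default if none matches."""
    tag = node.find(name, **attrs)
    return tag.get_text(strip=True) if tag is not None else default


print(first_text(soup, "p", class_="content"))
print(first_text(soup, "h2", default="(no h2)"))
```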

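4.2 Finding all elements

find_all(): returns a list of every matching element (an empty list if there are none).

p_tags = soup.find_all('p')  # find all <p> tags
for p in p_tags:
    print(p.text)

find_all() with an attribute filter:

a_tags = soup.find_all('a', href=True)  # find all <a> tags that have an href attribute
for a in a_tags:
    print(a['href'])  # Output: https://example.com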
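
find_all() accepts a few other filter forms that are often useful. The sketch below, run against a small made-up document rather than the tutorial's sample, shows a list of tag names, a class_ filter, the limit argument, and href=True side by side.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this sketch.
html = """
<body>
  <h1>Heading</h1>
  <p class="content">First</p>
  <p class="content note">Second</p>
  <a href="https://example.com">Link</a>
  <a name="anchor">No href</a>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of the listed tags, in document order.
headings_and_paragraphs = soup.find_all(["h1", "p"])

# class_ matches a tag if any one of its CSS classes equals the value.
content_paragraphs = soup.find_all("p", class_="content")

# limit stops the search after the given number of matches.
first_paragraph_only = soup.find_all("p", limit=1)

# href=True keeps only tags that carry an href attribute at all.
links_with_href = soup.find_all("a", href=True)

print([tag.name for tag in headings_and_paragraphs])
print([a["href"] for a in links_with_href])
```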

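4.3 Extracting text and attributes

text: the text inside a tag, including the text of its descendants.

h1_tag = soup.find('h1')
print(h1_tag.text)  # Output: 标题

get(): returns the value of an attribute, or None if the tag does not have it.

a_tag = soup.find('a')
print(a_tag.get('href'))  # Output: https://example.com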
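
A few details of text and attribute access are easy to miss: .text keeps surrounding whitespace, get_text() can strip and join pieces, and the class attribute comes back as a list. The following sketch illustrates these on a made-up fragment that is not part of the original sample.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this sketch.
html = '<div><p class="content note"> Hello <b>world</b> </p><a>no href</a></div>'
soup = BeautifulSoup(html, "html.parser")

p_tag = soup.find("p")
a_tag = soup.find("a")

raw_text = p_tag.text                         # same as p_tag.get_text(), whitespace kept
clean_text = p_tag.get_text(" ", strip=True)  # strip each piece, join with a space
classes = p_tag.get("class")                  # multi-valued attribute: a list of class names
missing = a_tag.get("href")                   # None, whereas a_tag["href"] would raise KeyError
fallback = a_tag.get("href", "#")             # a default can be supplied, as with dict.get

print(repr(raw_text), repr(clean_text), classes, missing, fallback)
```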

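4.4 Navigating the document tree

parent: the parent node.

p_tag = soup.find('p')
print(p_tag.parent.name)  # Output: body

children: an iterator over the direct child nodes. Whitespace between tags counts as a text node, so it shows up here too.

div_tag = soup.find('div')
for child in div_tag.children:
    print(child)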
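
Since .children yields whitespace text nodes alongside tags, code that only cares about elements usually filters them out. A minimal sketch of two ways to do that follows. The markup is a made-up variation on the tutorial's div, with an extra span added for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, used only for this sketch.
html = """<div id="main">
    <p>Another paragraph</p>
    <span>Note</span>
</div>"""
soup = BeautifulSoup(html, "html.parser")
div_tag = soup.find("div", id="main")

# All direct children, including the whitespace-only text nodes between tags.
all_children = list(div_tag.children)

# Keep only element nodes.
tag_children = [child for child in all_children if isinstance(child, Tag)]

# An alternative: find_all(True) matches every tag, recursive=False limits it to direct children.
direct_tags = div_tag.find_all(True, recursive=False)

print(len(all_children), [tag.name for tag in tag_children])
```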

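next_sibling and previous_sibling: the adjacent sibling nodes. In the sample document the next sibling of the first <p> is the whitespace text node that follows it, not the <a> tag.

p_tag = soup.find('p')
print(repr(p_tag.next_sibling))  # the next sibling node, here a whitespace-only text node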
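
When the goal is the next or previous element rather than the adjacent node, find_next_sibling() and find_previous_sibling() skip over text nodes. The sketch below contrasts them with next_sibling on a small fragment modeled on, but not identical to, the tutorial's sample.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this sketch.
html = """<body>
<p class="content">This is a paragraph.</p>
<a href="https://example.com">Link</a>
</body>"""
soup = BeautifulSoup(html, "html.parser")
p_tag = soup.find("p")

raw_next = p_tag.next_sibling                   # the newline text node between </p> and <a>
next_tag = p_tag.find_next_sibling()            # the next sibling that is a tag: <a>
prev_tag = next_tag.find_previous_sibling("p")  # back to the <p>, filtered by tag name

print(repr(raw_next), next_tag.name, prev_tag.name)
```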

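5. Table of common parameters and methods

Method/attribute	Description
`BeautifulSoup(html, parser)`	Creates a BeautifulSoup object; `parser` can be `'lxml'`, `'html.parser'`, and so on.
`find(tag, attrs)`	Returns the first matching tag, or `None` if there is no match; `attrs` is a dictionary of attributes (e.g. `{'class': 'content'}`).
`find_all(tag, attrs)`	Returns a list of all matching tags.
`text`	The text content inside a tag, including its descendants.
`get(attr)`	Returns the value of an attribute (e.g. `href`, `class`), or `None` if it is missing; `class` is returned as a list.
`parent`	The parent node.
`children`	An iterator over the direct child nodes, text nodes included.
`next_sibling`	The next sibling node (may be a whitespace text node).
`previous_sibling`	The previous sibling node (may be a whitespace text node).
`select(css_selector)`	Finds elements with a CSS selector (e.g. `soup.select('div#main p')`).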
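
select() appears only in the table, so here is a short sketch of it next to select_one(), which returns a single element or None. The markup is an English stand-in for the tutorial's sample document and is an assumption of this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this sketch.
html = """
<body>
  <p class="content">This is a paragraph.</p>
  <a href="https://example.com">Link</a>
  <div id="main"><p>Another paragraph</p></div>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

nested = soup.select("div#main p")              # <p> tags inside the div with id="main"
first_content = soup.select_one("p.content")    # first <p> with class "content"
external = soup.select('a[href^="https://"]')   # <a> tags whose href starts with https://
nothing = soup.select_one("table")              # None when nothing matches

print([p.get_text() for p in nested])
print(first_content.get_text(), [a["href"] for a in external], nothing)
```

select() always returns a list, so an empty result is falsy rather than None; select_one() mirrors find() in returning None.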

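6. Example: extracting all links from a web page

from bs4 import BeautifulSoup
import requests

# Fetch the page
url = "https://example.com"
response = requests.get(url, timeout=10)  # a timeout keeps the request from hanging indefinitely
response.raise_for_status()  # raise an exception on 4xx/5xx responses
html_content = response.text

# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, 'lxml')

# Extract all links
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])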
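
Real pages often contain relative, duplicate, or non-HTTP links (mailto: and the like). The following is a rough sketch, not part of the original example, of how the extraction step could normalize them with urljoin from the standard library. The function name extract_links, the base URL, and the sample markup are assumptions, and the HTML is passed in as a string so the snippet runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute http(s) links found in html, de-duplicated, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    links = []
    for a in soup.find_all("a", href=True):
        # Resolve relative hrefs against the page URL.
        absolute = urljoin(base_url, a["href"].strip())
        # Skip mailto:, javascript:, and other non-HTTP schemes, plus repeats.
        if absolute.startswith(("http://", "https://")) and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Hypothetical page content, used only for this sketch.
page = """
<a href="/about">About</a>
<a href="https://example.com/about">About again</a>
<a href="mailto:someone@example.com">Mail</a>
<a href="docs/index.html">Docs</a>
"""

for link in extract_links(page, "https://example.com/blog/"):
    print(link)
```

In a real scraper the html argument would typically be response.text and base_url would be response.url, so that redirects are taken into account.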