BeautifulSoup: Traversing and Searching

fmingzh

License: CC 4.0 BY-SA

Category: python

Article link: https://blog.youkuaiyun.com/misterfm/article/details/80420120

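This article shows how to parse HTML with the BeautifulSoup library: traversing the document tree, looking up specific elements, and using the more advanced search filters that make it easier to extract data from web pages.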

# -*- coding: utf-8 -*-
import re
from bs4 import BeautifulSoup
html_doc = """ 
<html><head><title>The Dormouse's story</title></head> 
<body> 
<p class="title"><b>The Dormouse's story</b></p> 
<p class="story">Once upon a time there were three little sisters; and their names were 
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, 
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and 
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; 
and they lived at the bottom of a well.</p> 
<p class="story">...</p> 
"""
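# Note: in BeautifulSoup, the <head> tag has only one child (<title>) but two descendants (<title> and the string inside it)

# Build the BeautifulSoup object using the lxml parser (requires the lxml package)
soup = BeautifulSoup(html_doc, 'lxml')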
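####### Traversing the document tree ###############
print(soup.head)    # the <head> tag
print(soup.title)   # the <title> tag
print(soup.body.b)   # the first <b> tag under <body>
print(soup.a)        # attribute-style access returns only the first tag with that name
print(soup.find_all('a'))  # all <a> tags
print('-------')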
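print(soup.head.contents)   # direct children of <head>, returned as a list
print(soup.head.contents[0]) # first child of <head>: the <title> Tag (not a string)
print(soup.head.contents[0].contents) # children of <title>: a list holding its single string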
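print('-------descendants-------')  # recursively iterate over all descendants
for child in soup.head.descendants:
    print(child)
print(soup.head.string)  # .string works when a tag has a single string child, or a single child tag that itself has a .string
print('--strings--')  # soup.strings yields every string; soup.stripped_strings also strips surrounding whitespace and skips whitespace-only strings
for string in soup.stripped_strings:
    print(repr(string))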
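print(soup.title.parent)  # parent of <title>: the <head> tag
print(soup.title.string.parent)  # parent of the string: the <title> tag
print('---parents--')
for parent in soup.a.parents:  # every ancestor of the first <a>, innermost first
    # .parents never yields None; the last item is the BeautifulSoup object, whose name is '[document]'
    print(parent.name)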
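# .next_sibling and .previous_sibling: move to the adjacent sibling node (which may be a string)
# .next_siblings and .previous_siblings: iterate over all following or preceding siblings
# .next_element and .previous_element: the object (string or tag) parsed immediately after or before this one
# .next_elements and .previous_elements: iterate forward or backward through the document in parse order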
######## Searching the document tree ###############
print(soup.find_all(["a", "b"])) # all <a> and <b> tags
for tag in soup.find_all(True): # True matches every tag
    print(tag.name)
print(soup.find_all(id='link2')) # tags whose id attribute equals 'link2'
print(soup.find_all(id=True))  # tags that have an id attribute, whatever its value
print(soup.find_all(href=re.compile("elsie"))) # tags whose href matches the regex
print(soup.find_all(href=re.compile("elsie"), id='link1'))  # both filters must match
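# soup.find_all("a", class_="sister")  # search by CSS class name
# soup.find_all(string="Elsie")        # search for strings (text= is the older name for this argument)
# soup.select('a[href]')               # CSS selector: <a> tags that have an href attribute
print('------')
print(soup.get_text())  # all the text in the document, with tags stripped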
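
The sibling and element attributes, and the three searches that appear only as comments in the script, are listed but never run. The following is a minimal sketch of how they behave, not part of the original script. It makes several assumptions: Python 3, a shortened single-paragraph copy of the sample HTML, the built-in `html.parser` instead of `lxml` (so no extra dependency is needed), and a bs4 release recent enough to accept the `string=` argument and `select()`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document (assumption: one paragraph is enough here)
html_doc = (
    '<p class="story">Once upon a time there were three little sisters; and their names were '
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; '
    'and they lived at the bottom of a well.</p>'
)
# Assumption: html.parser instead of the lxml parser used in the script above
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("a", id="link1")

# .next_sibling returns the very next node, which here is a text node, not a tag
sep = first.next_sibling
print(repr(sep))

# find_next_sibling / find_next_siblings skip strings and return only matching tags
second = first.find_next_sibling("a")
following_ids = [tag["id"] for tag in first.find_next_siblings("a")]
print(second["id"], following_ids)

# .next_element follows parse order, so it descends into the tag's own string
inner = first.next_element
print(repr(inner))

# The searches that are commented out in the script
by_class = soup.find_all("a", class_="sister")   # by CSS class
by_string = soup.find_all(string="Elsie")        # by exact string
with_href = soup.select("a[href]")               # by CSS selector
print(len(by_class), by_string, len(with_href))
```

Note the difference between `.next_sibling` and `.next_element`: the first stays at the same level of the tree, while the second steps into the tag's contents. That is why `first.next_sibling` is the separator text after the link, but `first.next_element` is the link's own text.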