Web Scraping: The BeautifulSoup Parsing Library

云风Com

Article link: https://blog.youkuaiyun.com/weixin_46318370/article/details/108661205

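This article shows how to parse HTML documents with BeautifulSoup: getting tags, their contents, and their attributes, and traversing the document tree. It also demonstrates several search techniques, including find_all and CSS selectors.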

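To follow along you need a file named test_baidu.html; I have included it at the end of this post.

Tag: a tag and its contents (this only returns the first match it finds)
Run the following program:

from bs4 import BeautifulSoup

# Read the raw bytes and let BeautifulSoup work out the encoding
with open('test_baidu.html', 'rb') as file:
    html = file.read()

bs = BeautifulSoup(html, "html.parser")
print(bs.title)        # the first <title> tag, including its contents
print(type(bs.title))

The output is: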

<title>百度一下，你就知道</title>
<class 'bs4.element.Tag'>
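
To make the "first match only" point concrete, here is a minimal sketch. The inline HTML and the variable names are made up for illustration and are not taken from test_baidu.html. It compares attribute-style access with find and find_all.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for a real page
sample_html = """
<html><head><title>Demo</title></head>
<body>
<a href="/first">first</a>
<a href="/second">second</a>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

first_link = soup.a             # attribute access: the first <a> only
same_link = soup.find("a")      # find() also returns the first match
all_links = soup.find_all("a")  # find_all() returns every match

print(first_link)
print(first_link is same_link)
print(len(all_links))
```

If you need every matching tag rather than the first one, find_all is the call to reach for.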

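Accessing bs.title this way returns the tag together with its contents. To get only the text inside the tag, use:

print(bs.title.string)

The result is a NavigableString, which is covered next.

NavigableString: the string inside a tag (bs.title.string)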

from bs4 import BeautifulSoup

with open('test_baidu.html', 'rb') as file:
    html = file.read()

bs = BeautifulSoup(html, "html.parser")
print(bs.title.string)        # only the text inside <title>
print(type(bs.title.string))

Output:

百度一下，你就知道
<class 'bs4.element.NavigableString'>
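
One thing worth knowing about .string: it only yields a value when the tag has a single child. The sketch below uses a made-up snippet, not the test file, to show what happens when a tag has mixed content, and how get_text() can be used instead.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: <p> contains a string and a nested <b> tag
sample_html = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(sample_html, "html.parser")

bold_text = soup.b.string      # <b> has exactly one child string
para_string = soup.p.string    # <p> has two children, so this is None
para_text = soup.p.get_text()  # joins all strings below <p>

print(bold_text)
print(para_string)
print(para_text)
```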

To get a tag's attributes, use bs.link.attrs:

from bs4 import BeautifulSoup

with open('test_baidu.html', 'rb') as file:
    html = file.read()

bs = BeautifulSoup(html, "html.parser")
print(bs.link.attrs)  # attributes of the first <link> tag, as a dict

Output:

{'rel': ['shortcut', 'icon'], 'href': '/favicon.ico', 'type': 'image/x-icon'}
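
Besides reading the whole attrs dictionary, individual attributes can be looked up on the tag itself. The following is a small sketch on a trimmed-down <a> tag. The snippet and variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of a link tag
sample_html = '<a class="mnav c-color-t" href="http://news.baidu.com" target="_blank">news</a>'
soup = BeautifulSoup(sample_html, "html.parser")

link = soup.a
href = link["href"]        # dict-style lookup; raises KeyError if absent
classes = link["class"]    # class is multi-valued, so this is a list
missing = link.get("id")   # .get() returns None when the attribute is absent

print(href)
print(classes)
print(missing)
```

Using .get() is the safer choice when an attribute may not be present on every tag.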

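What if we want the object that represents the whole document? That is the BeautifulSoup type:

The bs4.BeautifulSoup type

from bs4 import BeautifulSoup

with open('test_baidu.html', 'rb') as file:
    html = file.read()

bs = BeautifulSoup(html, "html.parser")
print(type(bs))
print(bs.name)   # the document object has the special name [document]
print(bs.attrs)  # and it has no attributes

Output: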

<class 'bs4.BeautifulSoup'>
[document]
{}

The Comment type

from bs4 import BeautifulSoup

with open('test_baidu.html', 'rb') as file:
    html = file.read()

bs = BeautifulSoup(html, "html.parser")
print(bs.a)               # the first <a> tag; it contains only a comment
print(bs.a.string)
print(type(bs.a.string))

Output:

<a class="mnav c-font-normal c-color-t" href="http://news.baidu.com" target="_blank"><!--新闻--></a>
新闻
<class 'bs4.element.Comment'>

As you can see, .string strips the comment markers by default and shows only the comment text.
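
Because a comment prints just like ordinary text, it can be worth checking the type before using the value. Here is a minimal sketch, assuming a made-up snippet with one commented-out link and one regular link.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical snippet: the first link holds a comment, the second real text
sample_html = '<a href="#"><!--news--></a><a href="#">news</a>'
soup = BeautifulSoup(sample_html, "html.parser")

visible_texts = []
for link in soup.find_all("a"):
    content = link.string
    # Comment is a subclass of NavigableString, so test for it explicitly
    if isinstance(content, Comment):
        continue
    visible_texts.append(str(content))

print(visible_texts)
```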

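Application: traversing the document tree

The following prints all of the child nodes of head as a list:

from bs4 import BeautifulSoup

with open('test_baidu.html', 'rb') as file:
    html = file.read()

bs = BeautifulSoup(html, "html.parser")
print(bs.head.contents)  # direct children of <head>, as a list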
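
For comparison, .contents has two close relatives: .children iterates over the same direct children, and .descendants walks every node below the tag. This sketch uses a made-up, whitespace-free head element. With a real file such as test_baidu.html, the lists would also contain the whitespace strings between tags.

```python
from bs4 import BeautifulSoup

# Hypothetical compact <head>, with no whitespace between tags
sample_html = "<head><title>Demo</title><meta charset='utf-8'></head>"
soup = BeautifulSoup(sample_html, "html.parser")

head = soup.head
direct_children = head.contents                       # a list
child_names = [child.name for child in head.children]  # an iterator
descendant_count = len(list(head.descendants))         # includes nested strings

print(direct_children)
print(child_names)
print(descendant_count)
```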

Next, several ways to search the document.

The find_all method

import re

from bs4 import BeautifulSoup

with open('test_baidu.html', 'rb') as file:
    html = file.read()

bs = BeautifulSoup(html, "html.parser")

# Searching the document
# (1) String filter: find_all collects every <a> tag into a list.
#     The tag name must match exactly, so 'a' does not match <span>.
t_list = bs.find_all('a')
# print(t_list)

# (2) Regular expression: the pattern is applied with search(),
#     so every tag whose name contains "a" is returned.
t_list = bs.find_all(re.compile('a'))
# print(t_list)

# (3) Function filter: keep the tags for which the function returns True
def class_is_exist(tag):
    return tag.has_attr('class')


t_list = bs.find_all(class_is_exist)
print(t_list)

# (4) Keyword arguments: filter on attribute values directly
t_list = bs.find_all(target="_blank")
print(t_list)

# (5) string argument (named text before Beautiful Soup 4.4.0)
t_list = bs.find_all(string='百度一下，你就知道')
print(t_list)

t_list = bs.find_all(string=['新闻', '百度一下，你就知道'])
print(t_list)
# Use a regular expression to find strings (the text inside tags) that contain a digit
t_list = bs.find_all(string=re.compile(r"\d"))
print(t_list)

# (6) limit argument: stop after the first 2 matches
t_list = bs.find_all(string=re.compile(r"\d"), limit=2)
print(t_list)
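
Since the output of the searches above depends on the test file, here is a self-contained sketch of the string and limit filters on a small made-up snippet. The link texts are placeholders, not the contents of test_baidu.html. It also shows that combining a tag name with string returns the tags rather than the strings.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical links; three of the four texts contain digits
sample_html = """
<a href="#">news</a>
<a href="#">shanxi123</a>
<a href="#">changzhi456</a>
<a href="#">789</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")
digit_pattern = re.compile(r"\d")

# string= alone returns the matching strings
digit_strings = [str(s) for s in soup.find_all(string=digit_pattern)]
# limit= caps the number of results
first_two = [str(s) for s in soup.find_all(string=digit_pattern, limit=2)]
# A tag name plus string= returns the tags whose .string matches
digit_links = soup.find_all("a", string=digit_pattern)

print(digit_strings)
print(first_two)
print(len(digit_links))
```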

CSS selectors

from bs4 import BeautifulSoup

with open('test_baidu.html', 'rb') as file:
    html = file.read()

bs = BeautifulSoup(html, "html.parser")

# Find by tag name
t_list = bs.select('title')
print(t_list)

The output is shown below; note that select also returns a list:

[<title>百度一下，你就知道</title>]

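# Find by class name
t_list = bs.select('.show-weather')  # "." selects by class in CSS
print(t_list)

# Find by id
t_list = bs.select('#s_mod_weather')  # "#" selects by id in CSS
print(t_list)

# Find by attribute
t_list = bs.select("div[class='show-weather']")  # <div> tags whose class attribute is show-weather
print(t_list)

# Find by parent > child relationship
t_list = bs.select('div > div')  # <div> tags that are direct children of a <div>
print(t_list)

# Find by sibling relationship
t_list = bs.select('.city ~ .weather-mod-link')  # .weather-mod-link tags that follow a .city sibling
print(t_list)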
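
The sibling and child combinators are easier to see on a small, clean snippet. The sketch below uses made-up markup in which each tag has a single class attribute. As far as I know, when a tag repeats an attribute (as some <a> tags in the test file do with class), html.parser keeps only the last value by default, so a selector such as .city may match nothing there. The sketch also uses select_one, which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one class attribute per tag
sample_html = """
<div id="weather">
  <a class="city" href="#c">city</a>
  <a class="weather-mod-link" href="#w1">w1</a>
  <span>other</span>
  <a class="weather-mod-link" href="#w2">w2</a>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

first_child_link = soup.select_one("#weather > a")        # first direct <a> child
later_siblings = soup.select(".city ~ .weather-mod-link")  # all following siblings
sibling_hrefs = [tag["href"] for tag in later_siblings]
no_match = soup.select_one(".no-such-class")               # None when nothing matches

print(first_child_link)
print(sibling_hrefs)
print(no_match)
```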

Contents of test_baidu.html:

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <meta content="always" name="referrer">
    <meta name="theme-color" content="#2932e1">
    <meta name="description" content="全球最大的中文搜索引擎、致力于让网民更便捷地获取信息，找到所求。百度超过千亿的中文网页数据库，可以瞬间找到相关的搜索结果。">
    <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"/>
    <link rel="search" type="application/opensearchdescription+xml" href="/content-search.xml" title="百度搜索"/>
    <link rel="icon" sizes="any" mask href="//www.baidu.com/img/baidu_85beaf5496f291521eb75ba38eacbd87.svg">
    <link rel="dns-prefetch" href="//dss0.bdstatic.com"/>
    <link rel="dns-prefetch" href="//dss1.bdstatic.com"/>
    <link rel="dns-prefetch" href="//ss1.bdstatic.com"/>
    <link rel="dns-prefetch" href="//sp0.baidu.com"/>
    <link rel="dns-prefetch" href="//sp1.baidu.com"/>
    <link rel="dns-prefetch" href="//sp2.baidu.com"/>
    <title>百度一下，你就知道</title>
    <a href="http://news.baidu.com" target="_blank" class="mnav c-font-normal c-color-t"><!--新闻--></a>
    <a href="http://news.baidu.com" target="_blank" class="mnav c-font-normal c-color-t">新闻</a>
        <a href="http://news.baidu.com" target="_blank" class="mnav c-font-normal c-color-t">山西123</a>
    <a href="http://news.baidu.com" target="_blank" class="mnav c-font-normal c-color-t">长治456</a>
    <a href="http://news.baidu.com" target="_blank" class="mnav c-font-normal c-color-t">789</a>
        <a class="city" href="//www.baidu.com/s?tn=baidutop10&rsv_idx=2&wd=%E5%A4%A9%E6%B0%94%E9%A2%84%E6%8A%A5"
       target="_blank" class="weather-mod-link"></a>
    <a class="country" href="//www.baidu.com/s?tn=baidutop10&rsv_idx=2&wd=%E5%A4%A9%E6%B0%94%E9%A2%84%E6%8A%A5"
       target="_blank" class="weather-mod-link"></a>
    <a class="shanxi-wather" href="//www.baidu.com/s?tn=baidutop10&rsv_idx=2&wd=%E5%A4%A9%E6%B0%94%E9%A2%84%E6%8A%A5"
       target="_blank" class="weather-mod-link"> </a>
    <div id="s_mod_weather" class="s-mod-weather s-isindex-wrap hide-weater">
        <div class="weather-mod"><a class="city-wather"
                                    href="//www.baidu.com/s?tn=baidutop10&rsv_idx=2&wd=%E5%A4%A9%E6%B0%94%E9%A2%84%E6%8A%A5"
                                    target="_blank" class="weather-mod-link">
            <div class="show-weather"><span class="show-city"><span class="show-city-name c-font-normal c-color-t"
                                                                    data-key=""></span></span><span
                    class="show-icon"><img class="weather-icon"
                                           src="https://dss1.bdstatic.com/5aV1bjqh_Q23odCf/static/superman/img/weather/icons/.png"
                                           ;/>