Learning Web Scraping from Scratch: Beautiful Soup

Article link: https://blog.youkuaiyun.com/lxmanutd/article/details/53513103

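This article shows how to parse web page content with Beautiful Soup, covering installation, a pointer to the official documentation, creating a BeautifulSoup object, and using the find() and findAll() methods.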


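Earlier we covered regular expressions. If you are still not comfortable writing them, don't worry: there is a more powerful tool called Beautiful Soup. It works well together with Requests: once you have fetched a page's source, you can hand it to Beautiful Soup for parsing, which is very convenient. In this section we will learn how to use it.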

1. Installing Beautiful Soup

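Beautiful Soup 3 is no longer being developed, so Beautiful Soup 4 is recommended for current projects. Beautiful Soup 4 is distributed as the bs4 package, which means the import refers to bs4 (for example, from bs4 import BeautifulSoup).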

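You can install it with either pip or easy_install. Both commands below work, although pip is the usual choice today because easy_install is deprecated.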

 
        1
       
        easy_install 
        beautifulsoup4

  
         1
       
        pip 
        install 
        beautifulsoup4
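To confirm that the installation worked, a quick check like the following can help. It is only a minimal sketch: it imports the bs4 package and prints the version string it reports.

```python
import bs4

# bs4.__version__ holds the version string of the installed Beautiful Soup 4
version = bs4.__version__
print("Beautiful Soup version:", version)
```

If the import fails with ModuleNotFoundError, the package was most likely installed for a different Python interpreter than the one running the script.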

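2. Beautiful Soup official documentation

This article only gives a brief introduction to a few commonly used features. For full details, please refer to the official documentation.

Official documentation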

3. Creating a Beautiful Soup object

(1) A first example

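    import requests
    from bs4 import BeautifulSoup

    # Download the page, then parse its HTML source
    html = requests.get("http://www.pythonscraping.com/pages/page1.html")
    # Naming the parser explicitly avoids the "no parser specified" warning
    bsObj = BeautifulSoup(html.text, "html.parser")
    # Print the first <h1> tag in the document
    print(bsObj.h1)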
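The same idea can be tried without any network access. The sketch below parses a small HTML string instead of a downloaded page; sample_html and its contents are made up for illustration and are not the actual source of page1.html.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the source of a downloaded page
sample_html = """
<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Some body text.</div>
</body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

first_h1 = soup.h1            # first <h1> tag anywhere in the document
print(first_h1)               # the tag including its markup
print(first_h1.get_text())    # only the text inside the tag
print(soup.title.get_text())  # attribute access works for any tag name
print(soup.h2)                # a tag that does not exist gives None
```

Because a missing tag comes back as None rather than raising an error, it is worth checking the result before calling methods such as get_text() on it.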

(2) find() and findAll() in BeautifulSoup

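find_all( name , attrs , recursive , text , limit , **kwargs )

find( name , attrs , recursive , text , **kwargs )

find_all() searches all descendant tags of the current tag and returns every one that matches the filter. find() stops at the first match and returns it, or None if nothing matches. findAll() is the older Beautiful Soup 3 spelling of find_all(); Beautiful Soup 4 accepts both. In Beautiful Soup 4.4.0 and later, the text argument is also available under the name string.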
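The difference between the two methods, and the effect of the recursive and limit arguments, can be sketched on a small made-up document. The HTML and the variable names here are illustrative assumptions only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> is nested one level deeper than the others
sample_html = """
<div id="outer">
  <p>first</p>
  <div><p>second (nested)</p></div>
  <p>third</p>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")
outer = soup.find("div", {"id": "outer"})

all_p = outer.find_all("p")                      # every <p> below outer
direct_p = outer.find_all("p", recursive=False)  # direct children only
limited_p = outer.find_all("p", limit=2)         # stop after two matches
first_p = outer.find("p")                        # first match only
missing = outer.find("table")                    # no match: None

print(len(all_p), len(direct_p), len(limited_p))
print(first_p.get_text(), missing)
```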

tag: .findAll({"h1","h2","h3","h4"})

attributes: .findAll("span",{"class":{"green","red"}})

text: nameList=bsObj.findAll(text="the prince")

print(len(nameList))

keywords: allText=bsObj.findAll(id="text") # equivalent to allText=bsObj.findAll(attrs={"id":"text"})

print(allText[0].get_text())
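The four kinds of filter above can be tried together on a small made-up document. This is only a sketch: the HTML is invented for illustration, lists are used for the tag and class filters, and string= is used in place of text= (the name available since Beautiful Soup 4.4.0).

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on a page with colored <span> tags
sample_html = """
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<p><span class="red">Well, Prince,</span> said
<span class="green">Anna Pavlovna</span> to
<span class="green">the prince</span>.</p>
<div id="text">Full text goes here.</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# tag: a list of names matches any of them
headings = soup.find_all(["h1", "h2", "h3", "h4"])
# attributes: a list of values matches any of them
colored = soup.find_all("span", {"class": ["green", "red"]})
# text: matches strings whose content is exactly "the prince"
name_list = soup.find_all(string="the prince")
# keywords: id="text" is shorthand for attrs={"id": "text"}
by_keyword = soup.find_all(id="text")
by_attrs = soup.find_all(attrs={"id": "text"})

print(len(headings), len(colored), len(name_list))
print(by_keyword[0].get_text())
```

Note that class is a reserved word in Python, so it cannot be passed as a plain keyword argument. Beautiful Soup 4 accepts class_="green" for that case, or the attrs dictionary as shown here.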

(3) Working with children and other descendants

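    import requests
    from bs4 import BeautifulSoup

    # Assumes the giftList table is on page3.html of the sample site (page1.html has no table)
    html = requests.get("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html.text, "html.parser")
    # .children yields only the direct children of the table
    for child in bsObj.find("table", {"id": "giftList"}).children:
        print(child)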
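The distinction between children and descendants is easier to see on a small table. The sketch below uses a made-up giftList table written without whitespace between tags, so that no whitespace-only text nodes show up among the children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table, not the real page source
sample_html = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td></tr>"
    "<tr><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>"
    "</table>"
)
soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children: only nodes directly below the table (the three rows)
children = list(table.children)
# .descendants: every node at any depth; keep only the tags here
descendant_tags = [node for node in table.descendants if isinstance(node, Tag)]

print([child.name for child in children])
print(len(descendant_tags))
```

On a real page the children usually also include text nodes that hold the newlines between tags, which is why the isinstance(node, Tag) check is often useful.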

(4) Working with siblings

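    import requests
    from bs4 import BeautifulSoup

    # Assumes the giftList table is on page3.html of the sample site (page1.html has no table)
    html = requests.get("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html.text, "html.parser")
    # next_siblings yields every node after the first row, so the header row is skipped
    for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
        print(sibling)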
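A typical use of next_siblings is to skip a header row and collect the data rows. The following is a minimal sketch on a made-up table; the markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table with one header row and two data rows
sample_html = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td></tr>"
    "<tr><td>Russian Nesting Dolls</td><td>$10,000.52</td></tr>"
    "</table>"
)
soup = BeautifulSoup(sample_html, "html.parser")
header_row = soup.find("table", {"id": "giftList"}).tr

# Everything after the first <tr>; the header row itself is not included
data_rows = [
    [cell.get_text() for cell in row.find_all("td")]
    for row in header_row.next_siblings
    if isinstance(row, Tag)  # ignore text nodes between rows, if any
]
print(data_rows)
```

An element is never its own sibling, so starting from the first row naturally leaves the header out of the result.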

(5) Working with parents

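    import requests
    from bs4 import BeautifulSoup

    # Assumes the gift images are on page3.html of the sample site (page1.html has none)
    html = requests.get("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html.text, "html.parser")
    # From the image, go up to its parent cell, then back one cell and read its text
    img = bsObj.find("img", {"src": "../img/gifts/img1.jpg"})
    print(img.parent.previous_sibling.get_text())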
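The navigation chain can be followed step by step on a single made-up table row. In this sketch the image sits in the last cell and the price in the cell just before it; the markup is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical row: name cell, price cell, image cell
sample_html = (
    '<table id="giftList"><tr>'
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td>'
    "</tr></table>"
)
soup = BeautifulSoup(sample_html, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})
image_cell = img.parent                   # the <td> that contains the image
price_cell = image_cell.previous_sibling  # the <td> just before it
price = price_cell.get_text()
print(price)
```

On real pages, whitespace between tags can make previous_sibling return a text node instead of the expected tag. find_previous_sibling("td") is one way to step over such nodes.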

4. BeautifulSoup and regular expressions

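    import re
    import requests
    from bs4 import BeautifulSoup

    # Assumes the gift images are on page3.html of the sample site (page1.html has none)
    html = requests.get("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html.text, "html.parser")
    # A compiled pattern can be used as an attribute filter; it is matched against each src value
    images = bsObj.findAll("img", {"src": re.compile(r"\.\.\/img\/gifts\/img.*\.jpg")})
    for image in images:
        print(image["src"])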
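To see which images such a pattern keeps and which it rejects, the sketch below runs a similar filter on a few made-up img tags. The file names are invented for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: only two of the four images are gift photos
sample_html = """
<img src="../img/logo.jpg">
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img2.jpg">
<img src="../img/gifts/banner.png">
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A raw string keeps the backslashes intact for the regex engine
pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")
images = soup.find_all("img", {"src": pattern})
sources = [image["src"] for image in images]
print(sources)
```

The logo is rejected because its path is outside the gifts directory, and the banner because its file name neither starts with img nor ends in .jpg.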

5. Beautiful Soup and lambda

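findAll() also accepts a function as its filter. The function receives a tag and returns True if the tag should be kept. For example, the following returns every tag that has exactly two attributes:

    bsObj.findAll(lambda tag: len(tag.attrs) == 2)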
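A function filter can be checked on a handful of made-up tags with different numbers of attributes. The following is a minimal sketch; the markup and the second filter are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with zero, one, and two attributes per tag
sample_html = """
<div id="a" class="box">two attributes</div>
<div id="b">one attribute</div>
<span class="note" title="hint">two attributes</span>
<p>no attributes</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# The function is called once per tag and must return True or False
two_attrs = soup.find_all(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in two_attrs]
print(names)

# Functions can combine conditions that plain filters cannot express
div_without_class = soup.find_all(
    lambda tag: tag.name == "div" and not tag.has_attr("class")
)
print([tag["id"] for tag in div_without_class])
```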