Beautiful Soup

Denrusn

Categories: crawlers, web page parsing, beautiful soup. Tags: python, crawler

Article link: https://blog.youkuaiyun.com/weixin_43171448/article/details/108001400

Beautiful Soup supports several parsers:

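Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, ["lxml", "xml"]) or BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Most tolerant; parses documents the way a browser does; produces HTML5-format documents	Slow; requires an external Python package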
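
Since not every parser in the table is installed everywhere, one possible way to handle the choice is to try parser names in order and fall back when one is missing. The sketch below is only an illustration: the helper name make_soup and the default order are assumptions, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    Hypothetical helper: the name and the default order are illustrative only.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p class='title'><b>The Dormouse's story</b></p>")
print(soup.b.string)
```

Because html.parser ships with Python, keeping it last in the list means the helper should always find something to use.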

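Creating a BeautifulSoup object

There are two ways to create a BeautifulSoup object: pass in a string, or pass in an open file.

from bs4 import BeautifulSoup
# From a string; from_encoding only applies to bytes input, so it is omitted for a str
soup = BeautifulSoup(html_str, 'lxml')
# -------------------------------
# From an open file
soup = BeautifulSoup(open('index.html', encoding='utf-8'), 'lxml')

Either form accepts a parser name as the second argument. If you leave it out, Beautiful Soup picks the best parser that is installed and issues a warning, so it is safer to name the parser explicitly.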
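
To make the two construction styles concrete, here is a minimal runnable sketch. The temporary file stands in for index.html, and html.parser is chosen only because it needs no extra installation; both are assumptions for the example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html_str = "<html><head><title>The Dormouse's story</title></head><body></body></html>"

# 1) From a string already in memory.
soup_from_str = BeautifulSoup(html_str, "html.parser")

# 2) From a file handle; a temporary file stands in for index.html here.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_str)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

print(soup_from_str.title.string)
print(soup_from_file.title.string)
```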

The four kinds of objects

Tag

A Tag object corresponds to an individual tag in the HTML.

# Attribute access returns the first matching tag in the whole document
print(soup.title)
# <title>The Dormouse's story</title>
# Tag name and attributes
print(soup.title.name)
# title
print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}
# Read an attribute value
print(soup.p['class'])
# ['title']
# Modify an attribute value
soup.p['class'] = 'Myclass'
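
The attribute operations above can be tried on a one-line document. This is a small sketch rather than a full reference; the use of .get() and del is an addition for illustration, since a Tag's attributes behave like a dictionary.

```python
from bs4 import BeautifulSoup

html_str = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html_str, "html.parser")
p = soup.p

print(p.name)           # tag name
print(p["class"])       # class is multi-valued, so this is a list
print(p.get("id"))      # .get() returns None instead of raising KeyError

p["class"] = "Myclass"  # assign a new value
del p["name"]           # remove an attribute
print(p.attrs)
```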

NavigableString

NavigableString is the type of the text inside a tag.

print(soup.p.string) 
#The Dormouse's story 
print(type(soup.p.string)) 
#<class 'bs4.element.NavigableString'>

BeautifulSoup

BeautifulSoup is the type of the soup object itself, which represents the whole document.

print(type(soup.name))
# <class 'str'>
print(soup.name)
# [document]
print(soup.attrs)
# {} (an empty dict)

Comment

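A Comment object is a special kind of NavigableString. Its printed value does not include the comment delimiters, so unless you handle it explicitly, comment text can end up mixed into the text you extract.

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
# The next call prints the text inside the comment
print(soup.a.string)
#  Elsie
print(type(soup.a.string))
# <class 'bs4.element.Comment'>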

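So before printing a string, it is best to check whether it is a Comment.

import bs4

# Detect a Comment explicitly before deciding what to do with its text
if isinstance(soup.a.string, bs4.element.Comment):
    print(soup.a.string)
    print(type(soup.a.string))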
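
One way to apply this check when extracting text is to treat a Comment as "no visible text". The helper below is a sketch under that assumption; the name visible_string is hypothetical and not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html_str = (
    '<a id="link1"><!-- Elsie --></a>'
    '<a id="link2">Lacie</a>'
)
soup = BeautifulSoup(html_str, "html.parser")


def visible_string(tag):
    """Return tag.string unless it is a Comment (hypothetical helper)."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


for a in soup.find_all("a"):
    print(a["id"], visible_string(a))
```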

Traversing the document tree

Direct children

# The .contents and .children attributes
# .contents returns a list of the tag's children
print(soup.head.contents)
# [<title>The Dormouse's story</title>]
# .children returns an iterator over the same children
print(soup.head.children)
# an iterator object, e.g. <list_iterator object at 0x7f71457f5710> (the exact repr depends on the version)
for child in soup.body.children:
    print(child)
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
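
As a small self-contained illustration of the difference between the two attributes, the sketch below uses a two-paragraph document invented for the example. Note that the newline between the paragraphs is itself a child.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_str = "<body><p>one</p>\n<p>two</p></body>"
soup = BeautifulSoup(html_str, "html.parser")

# .contents is a plain list, so it supports len() and indexing.
print(len(soup.body.contents))

# .children is iterated lazily; keep only Tag children to skip the newline string.
child_tags = [child for child in soup.body.children if isinstance(child, Tag)]
print([tag.get_text() for tag in child_tags])
```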

All descendants

# .descendants returns a generator over every node below the tag, at any depth
for child in soup.descendants:
    print(child)
# <html><head><title>The Dormouse's story</title></head>
# <body>
# <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
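
To see how .descendants differs from .children, a minimal sketch on a single head element is enough: the title tag is the only child, but the string inside it also counts as a descendant.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")

children = list(soup.head.children)        # only the <title> tag
descendants = list(soup.head.descendants)  # the <title> tag and the string inside it
print(len(children), len(descendants))
```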

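Node content

# .string returns the tag's text when the tag has a single string child, or when it has a single child tag that itself has a .string; with more than one child it returns None
# head contains only the title node
print(soup.head.string)
# The Dormouse's story
print(soup.title.string)
# The Dormouse's story
# html contains several child nodes
print(soup.html.string)
# None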
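
The rule for .string can be checked on a tiny document made up for this purpose. The sketch also shows get_text() as one option when .string is None because a tag has several children.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>one</b></p><p>a<i>b</i></p></div>", "html.parser")
first_p, second_p = soup.find_all("p")

print(first_p.string)       # single child chain: <p> -> <b> -> "one"
print(second_p.string)      # two children ("a" and <i>), so None
print(second_p.get_text())  # concatenates every string under the tag
```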

Multiple strings

# .strings iterates over every string in the tag
# .stripped_strings does the same but strips whitespace and skips blank strings
for string in soup.strings:
    print(repr(string))
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # '...'
    # '\n'
for string in soup.stripped_strings:
    print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'
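
A compact way to compare the two generators is to collect both into lists. The short document here is invented for the example.

```python
from bs4 import BeautifulSoup

html_str = "<p>  Elsie </p>\n\n<p>Lacie</p>\n"
soup = BeautifulSoup(html_str, "html.parser")

raw = list(soup.strings)             # keeps whitespace-only strings
clean = list(soup.stripped_strings)  # strips each string and drops empty ones
print(raw)
print(clean)
```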

Parent nodes

# .parent returns a node's direct parent; a string also has a parent (the tag that contains it)
# .parents returns a generator over all of the node's ancestors
p = soup.p
print(p.parent.name)
# body
content = soup.head.title.string
print(content.parent.name)
# title
content = soup.head.title.string
for parent in content.parents:
    print(parent.name)
# title
# head
# html
# [document]
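
One practical use of .parents is to build the path from a node up to the document root. The following is a minimal sketch of that idea on a short document.

```python
from bs4 import BeautifulSoup

html_str = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_str, "html.parser")

title_string = soup.title.string
# Walk from the string up to the document root and record each ancestor's name.
path = [parent.name for parent in title_string.parents]
print(" > ".join(reversed(path)))
```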

Sibling nodes

Siblings are nodes at the same level as the current node.

# .next_sibling returns the node's next sibling, or None if there is none
# .previous_sibling returns the node's previous sibling, or None if there is none
# Note: in a real document, .next_sibling and .previous_sibling are often strings or whitespace, because whitespace and line breaks between tags also count as nodes
print(soup.p.next_sibling)
#       (this prints whitespace: the sibling is a blank string)
print(soup.p.previous_sibling)
# None  (there is no previous sibling)
print(soup.p.next_sibling.next_sibling)
# <p class="story">Once upon a time there were three little sisters; and their names were
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
# and they lived at the bottom of a well.</p>
# The sibling after the whitespace node is the tag we actually see
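
Because whitespace strings count as siblings, chaining .next_sibling twice is fragile. A possible alternative, sketched below on an invented two-paragraph document, is find_next_sibling(), which only returns tags.

```python
from bs4 import BeautifulSoup

html_str = '<body><p class="title">one</p>\n<p class="story">two</p></body>'
soup = BeautifulSoup(html_str, "html.parser")
first_p = soup.p

print(repr(first_p.next_sibling))          # the newline between the two tags
next_tag = first_p.find_next_sibling("p")  # skips strings, returns the next <p>
print(next_tag["class"])
```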

All siblings

# .next_siblings returns a generator over all siblings after the node
# .previous_siblings returns a generator over all siblings before the node
for sibling in soup.a.next_siblings:
    print(repr(sibling))
    # ',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # ';\nand they lived at the bottom of a well.'

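Next and previous elements

# .next_element and .previous_element give the node parsed immediately after or before the current one, regardless of nesting level
print(soup.head.next_element)
# <title>The Dormouse's story</title>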

All next and previous elements

# .next_elements and .previous_elements return iterators over every node after or before the current one
last_a_tag = soup.find("a", id="link3")
for element in last_a_tag.next_elements:
    print(repr(element))
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# <p class="story">...</p>
# '...'
# '\n'
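
The difference between siblings and elements is easiest to see side by side. In this sketch (the document is made up for the example), the next sibling of the first paragraph is the second paragraph, while its next element is the tag nested inside it.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b></p><p>two</p>", "html.parser")
first_p = soup.p

print(first_p.next_sibling)  # the second <p>: same level
print(first_p.next_element)  # <b>one</b>: next in document order

# Collect everything parsed after the first <p> starts.
following = [repr(el) for el in first_p.next_elements]
print(following)
```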

Searching the document tree

find_all(name, attrs, recursive, text, **kwargs)

The name argument finds every tag whose name matches. It accepts a string, a regular expression, a list, True, or a function.

# String:
soup.find_all('b')
# [<b>The Dormouse's story</b>]   (a list is returned)
# Regular expression:
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b
# List:
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
# True (matches every tag):
for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# Function (called with each tag; return True to keep it):
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]
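
The filter types listed above can be exercised together on a shortened copy of the sample document. This is just a runnable sketch; the document is trimmed to two links, so the results are smaller than in the listings above.

```python
import re

from bs4 import BeautifulSoup

html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a></p>
</body></html>
"""
soup = BeautifulSoup(html_str, "html.parser")


def has_class_but_no_id(tag):
    # Keep tags that define class but not id.
    return tag.has_attr("class") and not tag.has_attr("id")


by_string = [t.name for t in soup.find_all("b")]
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
by_list = [t.name for t in soup.find_all(["a", "b"])]
by_function = [t["class"] for t in soup.find_all(has_class_but_no_id)]
print(by_string, by_regex, by_list, by_function)
```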

Keyword arguments match against tag attributes.

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
# Note: class is a reserved word in Python, so use class_ instead
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
# data_soup.find_all(data-foo="value")
# SyntaxError: a keyword argument name cannot contain a hyphen
# Note: for attribute names like data-foo, pass a dictionary through the attrs argument
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
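
A short sketch of attribute matching on an invented two-div document follows. It also illustrates that class_ matches when the value is any one of a tag's classes.

```python
from bs4 import BeautifulSoup

html_str = (
    '<div data-foo="value" class="box big">foo!</div>'
    '<div data-foo="other" class="box">bar</div>'
)
soup = BeautifulSoup(html_str, "html.parser")

by_data = soup.find_all(attrs={"data-foo": "value"})  # hyphenated name via attrs
by_class = soup.find_all("div", class_="box")         # matches any one of the classes
print(len(by_data), len(by_class))
```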

The text argument searches the strings in the document. Like name, it accepts a string, a regular expression, a list, or True. Newer releases call this argument string; text is the older name.

soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]
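
The sketch below repeats the string search with the newer string= spelling, on a small document invented for the example. It also shows that combining a tag name with string= returns tags rather than strings.

```python
import re

from bs4 import BeautifulSoup

html_str = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html_str, "html.parser")

# string= is the newer name of the text= argument (Beautiful Soup 4.4.0+).
exact = soup.find_all(string="Elsie")
pattern = soup.find_all(string=re.compile("^L"))
# Combined with a tag name, it returns the tags whose .string matches.
tags = soup.find_all("a", string="Lacie")
print(exact, pattern, [t["id"] for t in tags])
```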

The limit argument caps the number of results returned.

soup.find_all("a", limit=2) 
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

To search only a tag's direct children, pass recursive=False.

soup.html.find_all("title") 
# [<title>The Dormouse's story</title>] 
  
soup.html.find_all("title", recursive=False) 
# []
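
Here is the same comparison as a self-contained sketch, plus one extra call that lists the direct child tags of html. The short document is an assumption for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><head><title>T</title></head><body></body></html>", "html.parser")

deep = soup.html.find_all("title")                      # searches all descendants
shallow = soup.html.find_all("title", recursive=False)  # direct children only
direct = [t.name for t in soup.html.find_all(True, recursive=False)]
print(len(deep), len(shallow), direct)
```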

CSS selectors

Selecting by tag name

print(soup.select('title'))
# [<title>The Dormouse's story</title>]

Selecting by class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Selecting by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Combined selectors

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]

Selecting by attribute

print(soup.select('a[class="sister"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
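
The selector forms above can be combined in one runnable sketch. It assumes a recent Beautiful Soup in which select() is backed by the soupsieve package, and it adds select_one(), which returns the first match or None.

```python
from bs4 import BeautifulSoup

html_str = (
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    "</p>"
)
soup = BeautifulSoup(html_str, "html.parser")

first = soup.select_one("a.sister")                         # first match or None
by_prefix = soup.select('a[href^="http://example.com/l"]')  # href starts with this prefix
children = soup.select("p.story > a")                       # direct children of <p class="story">
print(first["id"], [a["id"] for a in by_prefix], len(children))
```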

Getting the text inside a node with get_text()

alist = soup.select('a[class="sister"]') 
for a in alist: 
    print(a.get_text())
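
get_text() also accepts a separator and a strip flag, which can help when the markup contains line breaks. A minimal sketch on an invented snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>\n<a>Elsie</a>\n<a>Lacie</a>\n</p>", "html.parser")

plain = soup.p.get_text()                            # keeps the newlines between tags
joined = soup.p.get_text(separator="|", strip=True)  # strips each string, joins with "|"
print(repr(plain))
print(joined)
```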