BeautifulSoup

License: CC 4.0 BY-SA

Importing BeautifulSoup

from bs4 import BeautifulSoup

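Creating a BeautifulSoup object

1. From a string
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

2. From a file handle
with open('example.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

3. From a web page
from urllib import request
url = 'http://music.163.com'
with request.urlopen(url) as response:
    soup = BeautifulSoup(response, 'html.parser')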

Pretty-printing the markup

soup.prettify()
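
As a rough illustration of what prettify() returns, here is a minimal sketch. The markup is made up for the example and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Made-up markup, used only for this illustration
soup = BeautifulSoup("<ul><li>one</li></ul>", "html.parser")

# prettify() returns a str with each tag and each string on its own line,
# indented according to nesting depth
pretty = soup.prettify()
print(pretty)
```

prettify() changes whitespace, so it is best kept for inspecting a document rather than for producing output that must match the original.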

Getting the first li tag

tag = soup.li    # soup.li.a goes one step further and returns the first a tag inside that li

Getting a tag's name

tag.name

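Reading and modifying a tag's attributes

tag['class']   # class is a multi-valued attribute, so this returns a list, even for a single value
tag['class'] = 'ex'   # sets the class attribute to 'ex'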
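
The attribute access above can be sketched end to end as follows. The li markup and its attribute values are invented for the example, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Made-up markup with a two-valued class attribute
soup = BeautifulSoup('<li class="fst item" id="x">one</li>', "html.parser")
tag = soup.li

classes = tag["class"]      # multi-valued attribute, comes back as a list
tag_id = tag["id"]          # single-valued attribute, comes back as a str
missing = tag.get("href")   # get() returns None instead of raising KeyError

tag["class"] = "ex"         # overwrite the attribute
html = str(tag)
```

Using tag.get() is the safer choice when an attribute may be absent, since tag['href'] would raise KeyError here.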

Getting the string inside a tag

tag.string # the tag's only string, or the only string of its only child tag; None otherwise
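
A small sketch of when .string yields a value and when it yields None. The markup is made up and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Made-up markup: p has exactly one child, div has two
soup = BeautifulSoup("<p><b>only</b></p><div>a<i>b</i></div>", "html.parser")

nested = soup.p.string    # p has a single child tag whose only child is a string
mixed = soup.div.string   # div has more than one child, so .string is None
```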

.contents and .children

tag.contents # the tag's direct children, as a list
tag.children # an iterator over the same direct children

.descendants

tag.descendants  # a generator over all of the tag's descendants, in document order
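
To make the difference between direct children and descendants concrete, here is a minimal sketch on made-up markup, assuming html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b<i>c</i></li></ul>", "html.parser")
ul = soup.ul

direct = ul.contents                                 # the two li tags
child_names = [child.name for child in ul.children]  # same nodes, via an iterator

# descendants walks the whole subtree; strings have name None
descendant_names = [node.name for node in ul.descendants]
```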

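.strings and .stripped_strings

tag.strings  # a generator over every string inside tag
tag.stripped_strings # the same, with surrounding whitespace stripped and whitespace-only strings skipped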

.parent and .parents

tag.parent  # the tag's parent node
tag.parents  # a generator that walks up through every ancestor of tag
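
A quick sketch of walking upward from a tag, using made-up markup and assuming html.parser. Note that the last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")
a = soup.a

parent_name = a.parent.name
ancestor_names = [ancestor.name for ancestor in a.parents]
```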

.next_sibling and .previous_sibling

tag.next_sibling  # the sibling node immediately after tag
tag.previous_sibling # the sibling node immediately before tag

.next_siblings and .previous_siblings

tag.next_siblings  # a generator over all sibling nodes after tag
tag.previous_siblings # a generator over all sibling nodes before tag
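
One thing that often surprises people is that a sibling can be a whitespace string rather than a tag. The sketch below illustrates this on made-up markup, assuming html.parser.

```python
from bs4 import BeautifulSoup

# The newline between the two li tags becomes a sibling node of its own
soup = BeautifulSoup("<ul><li>a</li>\n<li>b</li></ul>", "html.parser")
first = soup.li

raw_next = first.next_sibling               # the "\n" string, not the second li
next_li = first.find_next_sibling("li")     # skips strings and returns the next li tag
following = list(first.next_siblings)       # the newline string, then the second li
```

When only tags are of interest, find_next_sibling() is usually the more convenient call.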

.next_element and .previous_element

tag.next_element  # the object that was parsed immediately after tag
tag.previous_element  # the object that was parsed immediately before tag

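.next_elements and .previous_elements

tag.next_elements  # a generator over every object parsed after tag
tag.previous_elements  # a generator over every object parsed before tag
# The sibling properties stay on the same level of the tree; the element properties follow parse order forwards or backwards, which can descend into children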
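
The sibling versus element distinction is easiest to see side by side. This is a minimal sketch on made-up markup, assuming html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><i>two</i></p>", "html.parser")
b = soup.b

sibling = b.next_sibling   # the i tag, which sits on the same level as b
element = b.next_element   # the string "one", parsed right after the opening b tag
```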

The find_all method

1. A string
tag.find_all('a')

2. A regular expression
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

3. A list
soup.find_all(["a", "b"])

4. True
soup.find_all(True)  # every tag, but no strings

5. A function
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)

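# Signature
# find_all(name, attrs, recursive, string, limit, **kwargs)
# name       match on the tag name; accepts any of the five filter types above
# attrs      match on attributes, e.g. soup.find_all('li', class_='fst')
# string     match on text content (named text before version 4.4.0); accepts the same filter types
# limit      cap the number of results, e.g. soup.find_all("a", limit=2)
# recursive  True (the default) searches all descendants; False searches direct children only
# soup.html.find_all("title", recursive=False)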
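
The filters and keyword arguments above can be tried together on one document. The sketch below uses a made-up page and assumes html.parser; the string argument assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

# Made-up document for the illustration
html = """
<html><head><title>Demo</title></head>
<body>
<p class="story" id="intro">Intro <b>bold</b></p>
<a class="sister" href="http://example.com/a" id="link1">A</a>
<a class="sister" href="http://example.com/b" id="link2">B</a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                                  # tag name as a string
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]  # names starting with b
by_list = [t.name for t in soup.find_all(["a", "b"])]         # any name in the list
by_class = soup.find_all("a", class_="sister")                # attribute filter
limited = soup.find_all("a", limit=1)                         # stop after one match
by_string = soup.find_all(string="bold")                      # match text, not tags

# title is a child of head, not of html, so a non-recursive search from html finds nothing
direct_from_html = soup.html.find_all("title", recursive=False)
direct_from_head = soup.head.find_all("title", recursive=False)
```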

The find method

Same as find_all, but returns only the first match, or None if nothing matches

Six pairs of related methods

1. find_all() and find()
2. find_parents() and find_parent()
3. find_next_siblings() and find_next_sibling()
4. find_previous_siblings() and find_previous_sibling()
5. find_all_next() and find_next()
6. find_all_previous() and find_previous()
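
Each pair accepts the same filters as find_all() but searches in a different direction. Here is a short sketch on made-up markup, assuming html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="box"><p>one</p><p>two</p><span>three</span></div>', "html.parser"
)
first_p = soup.p
span = soup.span

parent_id = first_p.find_parent("div")["id"]                      # search upward
later_names = [t.name for t in first_p.find_next_siblings()]      # later siblings
next_span = first_p.find_next("span")                             # forward in parse order
prev_p_text = span.find_previous_sibling("p").string              # nearest earlier sibling
earlier_ps = [t.string for t in span.find_all_previous("p")]      # backward in parse order
```

The results of the backward searches come nearest first, which is why the second p shows up before the first.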

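The select method

soup.select("title")  # every title tag in soup
soup.select("p:nth-of-type(3)")  # every p that is the third p among its siblings
soup.select("body a")  # a tags anywhere inside body
soup.select("html head title")  # title inside head inside html
soup.select("head > title")  # title tags that are direct children of head
soup.select("p > a:nth-of-type(2)") # an a that is a direct child of a p and the second a among its siblings
soup.select("p > #link1") # a direct child of a p whose id is link1
soup.select(".sister") # tags with the class sister
soup.select("#link1") # the tag whose id is link1
soup.select('a[href]') # a tags that have an href attribute
soup.select('a[href="http://triagen.cn"]') # a tags whose href equals the given value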
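
A few of these selectors can be checked against a small document. The sketch below uses made-up markup and assumes html.parser plus a Beautiful Soup version that ships CSS selector support (4.7 or later, via the soupsieve package).

```python
from bs4 import BeautifulSoup

# Made-up document for the illustration
html = (
    "<html><head><title>T</title></head><body>"
    '<p class="story">'
    '<a class="sister" href="http://example.com/1" id="link1">One</a>'
    '<a class="sister" href="http://example.com/2" id="link2">Two</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

titles = soup.select("head > title")                      # child combinator
second_a = soup.select("p > a:nth-of-type(2)")            # positional pseudo-class
sisters = soup.select(".sister")                          # class selector
first_href = soup.select_one("#link1")["href"]            # select_one returns one tag or None
by_value = soup.select('a[href="http://example.com/2"]')  # exact attribute value
```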

Deleting attributes with del

tag['class'] = 'verybold' # adds a class attribute to tag
del tag['class']  # deletes the class attribute from tag

Assigning to .string

tag.string = 'Triagen'  # replaces everything inside tag with the string 'Triagen'

The append method

tag.append('Triagen')  # adds 'Triagen' as the last child of tag

The new_string method

soup = BeautifulSoup("<b></b>", "html.parser")
tag = soup.b
tag.append("Hello")
new_string = soup.new_string(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# ['Hello', ' there']

The new_tag method

soup = BeautifulSoup("<b></b>", "html.parser")
original_tag = soup.b

new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>

new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>

The insert method

tag.insert(1, "but did not endorse ") # like append, but places the new node at the given index among tag's children
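
The insert() call above has no surrounding context, so here is a self-contained sketch. The markup is chosen for the example and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
tag = soup.a

# Index 1 sits between the leading string and the i tag
tag.insert(1, "but did not endorse ")
result = str(tag)
children = tag.contents
```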

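The insert_before method

Inserts content immediately before the current tag or text node

The insert_after method

Inserts content immediately after the current tag or text node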
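
Both methods can be sketched on one small tag. The markup and the inserted text are made up for the example, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>leave</b>", "html.parser")

# Build a new tag and place it before the existing string inside b
i_tag = soup.new_tag("i")
i_tag.string = "Do not"
soup.b.string.insert_before(i_tag)
before = str(soup.b)

# Place a plain string right after the new i tag
soup.b.i.insert_after(" ever ")
after = str(soup.b)
```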

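The clear method

Removes the contents of the current tag, leaving the tag itself in place

The extract method

Removes the current tag from the document tree and returns it

The decompose method

Removes the current node from the document tree and destroys it completely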
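
The three removal methods differ in what survives afterwards. This minimal sketch compares them on made-up markup, assuming html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><p id="a">keep <i>me</i></p><p id="b">drop</p><p id="c">empty me</p></div>',
    "html.parser",
)

# extract(): detached from the tree, but still a usable tag
removed = soup.find(id="a").extract()

# decompose(): detached and destroyed, nothing useful is returned
soup.find(id="b").decompose()

# clear(): the tag stays, its contents go
soup.find(id="c").clear()

remaining = str(soup)
```

extract() is the one to reach for when the removed fragment will be reused elsewhere.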

The replace_with method

Removes a piece of content from the document tree and puts a new tag or text node in its place
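
A short sketch of replace_with(), using made-up markup and assuming html.parser. The call returns the element that was replaced.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

# Build the replacement tag
new_tag = soup.new_tag("b")
new_tag.string = "example.net"

# Swap the i tag for the new b tag; the old tag is handed back
old = soup.a.i.replace_with(new_tag)
result = str(soup.a)
```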

The wrap method

# Wraps the element in the tag you pass in and returns the new wrapper
soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>

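The unwrap method

# The opposite of wrap(): replaces a tag with its contents, so the tag is stripped but what was inside it stays
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")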
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>

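The get_text method

# Returns all the text inside a tag, including text in its descendants, as a single Unicode string
# The first argument sets the separator placed between the pieces of text
# strip=True strips whitespace from the start and end of each piece
soup.get_text("|", strip=True)
# 'I linked to|example.com'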
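
The effect of the separator and strip arguments is easier to judge with all three variants side by side. The markup below is chosen for the example and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()                   # strings joined as they are
joined = soup.get_text("|")               # separator between each string
stripped = soup.get_text("|", strip=True) # each string stripped, empty ones dropped
```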

soup.original_encoding

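The encoding Beautiful Soup detected when converting the document to Unicode (None if the input was already a str)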
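
A minimal sketch of the attribute in use. The sample text is made up, html.parser is an assumed parser choice, and from_encoding is passed so the result does not depend on automatic detection.

```python
from bs4 import BeautifulSoup

# Bytes input: Beautiful Soup has to decode it, so an encoding is recorded
data = "<p>Sacré bleu!</p>".encode("utf-8")
from_bytes = BeautifulSoup(data, "html.parser", from_encoding="utf-8")
detected = from_bytes.original_encoding

# str input: nothing to decode, so there is no original encoding
from_text = BeautifulSoup("<p>Sacré bleu!</p>", "html.parser")
not_detected = from_text.original_encoding
```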

The encode method

# Serializes a node to bytes in the given encoding, which avoids errors from mismatched encodings
soup.p.encode("latin-1")
# b'<p>Sacr\xe9 bleu!</p>'

soup.p.encode("utf-8")
# b'<p>Sacr\xc3\xa9 bleu!</p>'