252day（find()，find_parents() 和 find_parent()，修改文档树）

最新推荐文章于 2023-03-05 14:11:00 发布

原创最新推荐文章于 2023-03-05 14:11:00 发布 · 1.1k 阅读

0 ·

CC 4.0 BY-SA版权

《2018年6月19日》【连续252天】

标题：find()，find_parents() 和 find_parent()，修改文档树；

内容：

find( name , attrs , recursive , text , **kwargs )

find_all() 方法将返回文档中符合条件的所有tag,尽管有时候我们只想得到一个结果.比如文档中只有一个<body>标签,那么使用 find_all() 方法来查找<body>标签就不太合适, 使用 find_all 方法并设置 limit=1 参数不如直接使用 find() 方法.下面两行代码是等价的:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

唯一的区别是 find_all() 方法的返回结果是值包含一个元素的列表,而 find() 方法直接返回结果.

find_all() 方法没有找到目标是返回空列表, find() 方法找不到目标时,返回 None .

soup.head.title 是 tag的名字方法的简写.这个简写的原理就是多次调用当前tag的 find() 方法:

soup.head.title
# <title>The Dormouse's story</title>

soup.find("head").find("title")
# <title>The Dormouse's story</title>

find_parents( name , attrs , recursive , text , **kwargs )

find_parent( name , attrs , recursive , text , **kwargs )

find_next_siblings( name , attrs , recursive , text , **kwargs )

find_next_sibling( name , attrs , recursive , text , **kwargs )

find_previous_siblings( name , attrs , recursive , text , **kwargs )
find_previous_sibling( name , attrs , recursive , text , **kwargs )

find_all_next( name , attrs , recursive , text , **kwargs )
find_next( name , attrs , recursive , text , **kwargs )

find_all_previous( name , attrs , recursive , text , **kwargs )

find_previous( name , attrs , recursive , text , **kwargs )

Beautiful Soup支持大部分的CSS选择器 ,在 Tag 或 BeautifulSoup 对象的 .select() 方法中传入字符串参数,即可使用CSS选择器的语法找到tag；

修改文档树：

修改 .string：

给tag的 .string 属性赋值,就相当于用当前的内容替代了原来的内容

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)

tag = soup.a
tag.string = "New link text."
tag
# <a href="http://example.com/">New link text.</a>

append()：

Tag.append() 方法想tag中添加内容,就好像Python的列表的 .append() 方法:

soup = BeautifulSoup("<a>Foo</a>")
soup.a.append("Bar")

soup
# <html><head></head><body><a>FooBar</a></body></html>
soup.a.contents
# [u'Foo', u'Bar']

BeautifulSoup.new_string() 和 .new_tag()：

如果想添加一段文本内容到文档中也没问题,可以调用Python的 append() 方法或调用工厂方法 BeautifulSoup.new_string() :

soup = BeautifulSoup("<b></b>")
tag = soup.b
tag.append("Hello")
new_string = soup.new_string(" there")
tag.append(new_string)
tag
# <b>Hello there.</b>
tag.contents
# [u'Hello', u' there']

如果想要创建一段注释,或 NavigableString 的任何子类,将子类作为 new_string() 方法的第二个参数传入:

from bs4 import Comment
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# [u'Hello', u' there', u'Nice to see you.']

insert()

Tag.insert() 方法与 Tag.append() 方法类似,区别是不会把新元素添加到父节点 .contents 属性的最后,而是把元素插入到指定的位置.与Python列表总的 .insert() 方法的用法下同:

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a

tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# [u'I linked to ', u'but did not endorse', <i>example.com</i>]

insert_before() 方法在当前tag或文本节点前插入内容

insert_after() 方法在当前tag或文本节点后插入内容

Tag.clear() 方法移除当前tag的内容

PageElement.extract() 方法将当前tag移除文档树,并作为方法结果返回

这个方法实际上产生了2个文档树: 一个是用来解析原始文档的 BeautifulSoup 对象,另一个是被移除并且返回的tag.被移除并返回的tag可以继续调用 extract 方法

Tag.decompose() 方法将当前节点移除文档树并完全销毁

PageElement.replace_with() 方法移除文档树中的某段内容,并用新tag或文本节点替代它