BeautifulSoup: Navigating the Document Tree

Latest recommended article published on 2024-04-26 03:23:26


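Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. Reproductions must include a link to the original source and this notice.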


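This article explains how to navigate an HTML document with Python's BeautifulSoup library: getting direct children, walking all descendants recursively with .descendants, extracting node content, and handling multiple strings. Examples show the contents, children, descendants, string and stripped_strings attributes in use, followed by the attributes for parents, siblings and adjacent elements.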

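(1) Direct children

Key points: the .contents and .children attributes

.contents: a tag's .contents attribute returns the tag's direct children as a list.

print(soup.head.contents)
# [<title>The Dormouse's story</title>]

Because the result is a list, we can use a list index to get a specific element.

print(soup.head.contents[0])
# <title>The Dormouse's story</title>

.children: this does not return a list, but we can iterate over it to get every direct child.

Printing .children shows that it is an iterator object rather than a list (the exact type name depends on the Python and Beautiful Soup versions).

print(soup.head.children)
# <list_iterator object at 0x7f71457f5710>

To get at its contents, simply loop over it:

for child in soup.body.children:
    print(child)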
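The snippets in this section use a `soup` object without showing how it is built. The following is a minimal sketch of one way to create it and compare the two attributes. The shortened sample document and the choice of the built-in "html.parser" parser are assumptions made for illustration, not the setup the article necessarily used.

```python
from bs4 import BeautifulSoup

# Assumed sample document (a shortened version of the "Dormouse" example).
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p><p class="story">...</p></body></html>"""

# "html.parser" ships with Python; other parsers may build a slightly different tree.
soup = BeautifulSoup(html_doc, "html.parser")

head_contents = soup.head.contents        # a real list, so indexing works
body_children = list(soup.body.children)  # an iterator, materialized into a list here

print(head_contents[0])
print([child.name for child in body_children])
```

Both attributes expose the same direct children; .contents is convenient when an index is needed, while .children suits a plain loop.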

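(2) All descendants

Key point: the .descendants attribute

.descendants

.contents and .children contain only a tag's direct children, whereas .descendants iterates recursively over all of the tag's descendants. As with .children, we have to loop over it to get the contents.

for child in soup.descendants:
    print(child)

Running this prints every node: first the outermost html tag, then the head tag inside it, then each node nested in head in turn, and so on.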
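To make the difference between direct children and descendants concrete, here is a small sketch on a one-line document. The document and the "html.parser" parser are assumptions chosen to keep the tree tiny.

```python
from bs4 import BeautifulSoup

# Assumed minimal document with no whitespace between tags.
html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, "html.parser")

direct = list(soup.head.children)     # only the <title> tag
nested = list(soup.head.descendants)  # the <title> tag, then the string inside it

print(len(direct), len(nested))
for node in nested:
    print(repr(node))
```

The string inside title is a descendant of head but not a child of it, which is why the second list is longer.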

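(3) Node content

Key point: the .string attribute

If a tag has exactly one child and that child is a NavigableString, .string returns that child.

If a tag's only child is another tag, .string on the outer tag gives the same result as .string on that single child.

Put simply: if a tag contains no other tags, .string returns the text inside it.

If a tag contains exactly one other tag, .string returns the innermost text as well. For example:

print(soup.head.string)
# The Dormouse's story
print(soup.title.string)
# The Dormouse's story

If a tag contains more than one child, it is ambiguous which child .string should refer to, so .string is None.

print(soup.html.string)
# None

Note: whitespace between tags is itself a text node, so it counts as a child.

Without line breaks, .string works:

# <head><title>The Dormouse's story</title></head>

With line breaks it returns None, because the newlines become additional children of head:

# <head>
# <title>The Dormouse's story</title>
# </head>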
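The effect of line breaks on .string can be checked directly. This sketch parses both variants with "html.parser" (an assumed parser choice) and also shows get_text(strip=True) as one possible way to read the text when .string is None.

```python
from bs4 import BeautifulSoup

compact = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser")
spaced = BeautifulSoup(
    "<head>\n<title>The Dormouse's story</title>\n</head>", "html.parser")

compact_string = compact.head.string  # head has one child, so .string drills down
spaced_string = spaced.head.string    # head has three children: '\n', <title>, '\n'
spaced_text = spaced.head.get_text(strip=True)  # collects text regardless of child count

print(repr(compact_string))
print(repr(spaced_string))
print(repr(spaced_text))
```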

(4) Multiple strings

Key points: the .strings and .stripped_strings attributes

.strings

This yields every string inside a tag, but we have to loop over it to get them, as in the example below.

for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'

.stripped_strings

The strings produced by .strings can include a lot of spaces and blank lines. .stripped_strings strips leading and trailing whitespace from each string and skips strings that consist only of whitespace.

for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'
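The two attributes can be compared side by side on a smaller document. This is only a sketch: the two-paragraph document and the "html.parser" parser are assumptions.

```python
from bs4 import BeautifulSoup

# Assumed document with a newline between every tag.
html_doc = "<body>\n<p>Elsie</p>\n<p>Lacie</p>\n</body>"
soup = BeautifulSoup(html_doc, "html.parser")

all_strings = list(soup.strings)        # includes the whitespace-only '\n' nodes
clean_strings = list(soup.stripped_strings)  # whitespace-only strings are skipped

print(all_strings)
print(clean_strings)
```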

(5) Parent node

Key point: the .parent attribute

p = soup.p
print(p.parent.name)
# body

content = soup.head.title.string
print(content.parent.name)
# title

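(6) All ancestors

Key point: the .parents attribute

An element's .parents attribute iterates over all of its ancestors, from the nearest parent outwards. For example:

content = soup.head.title.string
for parent in content.parents:
    print(parent.name)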
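The loop above is shown without its output. As a sketch of what to expect, the following collects the ancestor names for the title string in a minimal document (the document and the "html.parser" parser are assumptions). The last ancestor is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html_doc, "html.parser")

content = soup.head.title.string
ancestor_names = [parent.name for parent in content.parents]

print(ancestor_names)
# ['title', 'head', 'html', '[document]']
```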

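(7) Sibling nodes

Key points: the .next_sibling and .previous_sibling attributes

Siblings are nodes at the same level as the current node, that is, children of the same parent. .next_sibling returns the node's next sibling and .previous_sibling returns the one before it. If there is no such node, the result is None.

Note: in real documents, a tag's .next_sibling and .previous_sibling are usually strings or whitespace, because whitespace and line breaks between tags are nodes too. The result may therefore be a blank string or a newline.

print(soup.p.next_sibling)
# this is actually whitespace (a newline)

print(soup.p.previous_sibling)
# None if there is no previous sibling; if whitespace precedes the tag
# inside its parent, that whitespace string is returned instead

print(soup.p.next_sibling.next_sibling)
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1"></a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>
# the sibling after the whitespace node is the next visible tag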
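Chaining .next_sibling twice works only when exactly one whitespace node sits between the tags. One way around that is find_next_sibling() and find_previous_sibling(), which skip text nodes and return the nearest matching tag. The sketch below assumes a small body-only document and the "html.parser" parser.

```python
from bs4 import BeautifulSoup

# Assumed document: newlines separate the tags, as in typical hand-written HTML.
html_doc = """<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
</body>"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p
raw_next = first_p.next_sibling      # the '\n' text node after the first <p>
raw_prev = first_p.previous_sibling  # the '\n' text node between <body> and <p>
tag_next = first_p.find_next_sibling("p")      # skips the whitespace node
tag_prev = first_p.find_previous_sibling("p")  # None: no earlier <p> sibling

print(repr(raw_next), repr(raw_prev))
print(tag_next)
print(tag_prev)
```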

(8) All siblings

Key points: the .next_siblings and .previous_siblings attributes

The .next_siblings and .previous_siblings attributes iterate over the siblings after and before the current node. Iteration simply stops when there are no more siblings; None is not yielded.

for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ';\nand they lived at the bottom of a well.'
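Since the sibling iterators mix tags with text nodes, a common pattern is to keep only the Tag instances. The following sketch does that in both directions; the one-paragraph document, the id values and the "html.parser" parser are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed document: three links separated by text and newlines.
html_doc = ('<p><a id="link1">Elsie</a>,\n'
            '<a id="link2">Lacie</a> and\n'
            '<a id="link3">Tillie</a>;</p>')
soup = BeautifulSoup(html_doc, "html.parser")

# Siblings after the first link, tags only.
later_ids = [s["id"] for s in soup.a.next_siblings if isinstance(s, Tag)]

# Siblings before the last link, tags only (nearest first).
last_link = soup.find("a", id="link3")
earlier_ids = [s["id"] for s in last_link.previous_siblings if isinstance(s, Tag)]

print(later_ids)
print(earlier_ids)
```

Note that .previous_siblings walks backwards, so the nearest sibling comes first.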

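(9) Next and previous elements

Key points: the .next_element and .previous_element attributes

Unlike .next_sibling and .previous_sibling, these are not restricted to siblings. They point to the node that was parsed immediately after or before the current one, regardless of its level in the tree.

For example, given the head node

<head><title>The Dormouse's story</title></head>

its next element is title, even though title is a child of head rather than a sibling.

print(soup.head.next_element)
# <title>The Dormouse's story</title>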
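The contrast between the two kinds of navigation is easiest to see when both are read from the same node. This is a sketch on an assumed single-line document, parsed with "html.parser".

```python
from bs4 import BeautifulSoup

# Assumed document with no whitespace between tags.
html_doc = ("<html><head><title>The Dormouse's story</title></head>"
            "<body><p>...</p></body></html>")
soup = BeautifulSoup(html_doc, "html.parser")

after_head = soup.head.next_element       # <title>: parsed right after <head> opened
sibling_of_head = soup.head.next_sibling  # <body>: next node at the same level
after_title = soup.title.next_element     # the string inside <title>
sibling_of_title = soup.title.next_sibling  # None: <title> is head's only child

print(after_head.name, sibling_of_head.name)
print(repr(after_title), sibling_of_title)
```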

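(10) All next and previous elements

Key points: the .next_elements and .previous_elements attributes

The .next_elements and .previous_elements iterators move forward or backward through the document in the order it was parsed, as if the document were being parsed again. Iteration stops at the end of the document; None is not yielded.

# the third link in the sample document
last_a_tag = soup.find("a", id="link3")

for element in last_a_tag.next_elements:
    print(repr(element))
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# <p class="story">...</p>
# '...'
# '\n'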
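For a version of this loop that can be run on its own, the sketch below builds a tiny document first. The document and the "html.parser" parser are assumptions; `last_a_tag` is looked up by an id that exists only in this sample.

```python
from bs4 import BeautifulSoup

# Assumed document: one link followed by text and a second paragraph.
html_doc = '<p><a id="link3">Tillie</a>;</p><p class="story">...</p>'
soup = BeautifulSoup(html_doc, "html.parser")

last_a_tag = soup.find("a", id="link3")
following = list(last_a_tag.next_elements)

for element in following:
    print(repr(element))
```

The iterator first descends into the link's own text, then continues with everything parsed afterwards, crossing tag boundaries as it goes.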
