beautifulsoup_study

Using a concrete HTML document as the running example, this article shows how to parse web pages with the BeautifulSoup library: reading the document title, extracting links, and traversing the document tree.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id="1"><b>first</b><b>test<b><b>The Dormouse's story</b><b>two</b></b></b></p>

<p  id="1" class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "lxml")
# print(soup.prettify())  # pretty-print the parsed tree for a structured view
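A note on the parser argument: "lxml" is a third-party parser that must be installed separately (pip install lxml). The standard library's "html.parser" also works, but the two parsers can repair the malformed nested <b> tags above differently, so outputs may not match exactly. A minimal fallback sketch, as an alternative to the line above:
try:
    import lxml  # noqa: F401  # imported only to check availability
    parser = "lxml"
except ImportError:
    parser = "html.parser"  # stdlib fallback; tag repair may differ from lxml
soup = BeautifulSoup(html_doc, parser)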
# The following bare expressions display their values in a REPL or notebook:
soup.title.name          # 'title'
soup.title.string        # "The Dormouse's story"
soup.title.parent.name   # 'head'
soup.p                   # the first <p> tag in the document
soup.p['class']          # ['title']
soup.a                   # the first <a> tag
soup.find_all('a')       # every <a> tag, returned as a ResultSet
soup.find(id="link1")    # the first tag whose id attribute is "link1"
print(soup.p['class'][0])
print(type(soup.p['class']))
print(type(soup.a['id']))
title
<class 'list'>
<class 'str'>
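Note the types above: class is defined as a multi-valued attribute in HTML, so BeautifulSoup returns it as a list, while id comes back as a plain str. A small sketch, grounded in the document above, that collects the attributes of every sister link:
# Sketch: gather href, id, and class from each <a class="sister"> link.
for link in soup.find_all("a", class_="sister"):
    print(link.get("href"), link.get("id"), link.get("class"))
# http://example.com/elsie link1 ['sister']
# http://example.com/lacie link2 ['sister']
# http://example.com/tillie link3 ['sister']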
print(soup.p)
print(soup.find_all("b"))
print(type(soup.find_all("b")))
<p class="title" id="1"><b>first</b><b><b>The Dormouse's story</b><b>two</b></b></p>
[<b>first</b>, <b><b>The Dormouse's story</b><b>two</b></b>, <b>The Dormouse's story</b>, <b>two</b>]
<class 'bs4.element.ResultSet'>
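A ResultSet subclasses list, so the result of find_all() can be indexed, sliced, and measured like any list. A quick sketch:
# Sketch: a ResultSet behaves like a plain list of Tag objects.
links = soup.find_all("a")
print(len(links))         # 3
print(links[0]["href"])   # http://example.com/elsie
print(links[-1].string)   # Tillie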
print((soup.find("a"))['href'])#get('href')
print(type(soup))
http://example.com/elsie
<class 'bs4.BeautifulSoup'>
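Subscript access like tag['href'] raises a KeyError when the attribute is missing, whereas tag.get('href') returns None (or a supplied default), mirroring dict behavior. A short sketch:
# Sketch: subscripting a missing attribute raises; .get() returns a default.
first_a = soup.find("a")
print(first_a.get("href"))             # http://example.com/elsie
print(first_a.get("target"))           # None: this <a> has no target attribute
print(first_a.get("target", "_self"))  # _self: explicit default
try:
    first_a["target"]
except KeyError:
    print("subscript access raises KeyError for missing attributes")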
print(type(soup.p.contents))
print(soup.p.contents)
print(type(soup.p.contents[0]))
print(type(soup.p.contents[0].contents))
print(type(soup.p.children))
<class 'list'>
[<b>first</b>, <b>test<b><b>The Dormouse's story</b><b>two</b></b></b>]
<class 'bs4.element.Tag'>
<class 'list'>
<class 'list_iterator'>
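.contents materializes the direct children as a list, while .children is a lazy iterator over the same nodes; neither descends into nested tags. A short sketch showing the equivalence:
# Sketch: .children yields the same direct children that .contents holds.
assert list(soup.p.children) == soup.p.contents
for child in soup.p.children:  # only depth-one children, no recursion
    print(child.name)          # prints 'b' twice for the first <p>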
print(type(soup.title.descendants))
for i in soup.p.descendants:
    print(i)
<class 'generator'>
<b>first</b>
first
<b>test<b><b>The Dormouse's story</b><b>two</b></b></b>
test
<b><b>The Dormouse's story</b><b>two</b></b>
<b>The Dormouse's story</b>
The Dormouse's story
<b>two</b>
two
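As the output shows, .descendants walks the whole subtree recursively, yielding every nested tag and every NavigableString, which is why both <b>two</b> and the bare string two appear. When only the text matters, .stripped_strings skips the tags and the whitespace-only strings. A sketch:
# Sketch: extract just the text of the first <p>, no tags, no blank strings.
for text in soup.p.stripped_strings:
    print(text)  # first, test, The Dormouse's story, two (one per line)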
