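1. BeautifulSoup can parse HTML, XML, and similar documents, which makes it easy to extract information from web pages. Below, BeautifulSoup is used to parse a short HTML document.
The HTML and the parsing code are as follows: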
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class= "title"><b>The Dormouse's story</b></p>
<p class= "story">Once upon a time there were three little sisters;and their name were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
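from bs4 import BeautifulSoup
# Parse the document with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')
# Print the parse tree in a readable form, one tag or string per line
print(soup.prettify())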
2. After running the Python code above, you can work with the resulting soup variable in the Python shell to inspect how html_doc was parsed:
================ RESTART: E:/python_program/Dormouse_story.py ================
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their name were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
>>> soup.title()
[]
>>> soup.title
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>>
>>> soup.title.string
"The Dormouse's story"
>>>
>>> soup.title.parent.name
'head'
>>>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>>
>>> soup.class
SyntaxError: invalid syntax
>>> soup.p[class]
SyntaxError: invalid syntax
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']
['title']
>>> soup.p['class'][0]
'title'
>>> soup.a
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Find all <a> tags (links) in the document:
>>> soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
>>> soup.find(id='link3')
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
>>> print(soup.get_text())
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters;and their name were
Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
...
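3. From this we can see that, for a soup object built with BeautifulSoup, soup.tag_name returns the first tag with that name together with its contents, and soup.tag_name.name returns the name of that tag. In addition, get_text() extracts the text of an HTML document with all of the tags stripped out.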
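As a minimal sketch of the three behaviours in point 3, the snippet below parses a shortened copy of the document. The variable names first_p, tag_name, title_text and plain_text are illustrative only and do not appear in the original session.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document (an assumption for brevity)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p                # the first <p> tag, including its contents
tag_name = soup.p.name          # the name of that tag
title_text = soup.title.string  # the single string inside <title>
plain_text = soup.get_text()    # all text in the document, tags removed

print(first_p)
print(tag_name)
print(title_text)
print(plain_text)
```

Note that attribute access such as soup.p only ever returns the first match; find_all('p') is the call to use when every matching tag is needed.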
4. <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> turns the text Elsie into a hyperlink. Similar definitions can be seen on real web pages by using the browser's Inspect Element tool.
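To connect point 4 back to the parser, here is a hedged sketch of reading the link text and the link target from each <a> tag. It assumes a trimmed copy of the story paragraph, and the links variable is introduced here purely for illustration. Tag.get('href') returns the attribute value, or None when the tag has no href attribute.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the story paragraph (an assumption for brevity)
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Pair each link's visible text with the URL stored in its href attribute
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]

for text, url in links:
    print(text, url)
```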