Common BeautifulSoup Methods

douyunqian668

License: CC 4.0 BY-SA

Category: Python Automation Development

Article link: https://blog.youkuaiyun.com/douyunqian668/article/details/53308500

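This article shows how to use Python's BeautifulSoup library to parse an HTML document and extract the information you need. The examples demonstrate how to look up tags, attributes, and text.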
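As a minimal sketch of the tag, attribute, and text lookups described above, the snippet below collects the `id`, `href`, and text of every link in a small document. The markup in `SAMPLE_HTML` and the helper name `extract_links` are assumptions made for illustration and do not come from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
SAMPLE_HTML = """
<html><body>
<p class="intro">Two links: <a href="http://example.com/a" id="first">A</a>
and <a href="http://example.com/b" id="second">B</a>.</p>
</body></html>
"""


def extract_links(markup):
    """Return an (id, href, text) tuple for every <a> tag in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    # Tag.get() returns None instead of raising when an attribute is missing.
    return [(a.get("id"), a.get("href"), a.get_text()) for a in soup.find_all("a")]


links = extract_links(SAMPLE_HTML)
for link in links:
    print(link)
```

Using `a.get("href")` rather than `a["href"]` is a design choice here: indexing raises `KeyError` for a tag that lacks the attribute, while `get` returns `None`.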

#__author__ = 'DouYunQian'
#coding=utf-8
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story2">...</p>
"""
import re
from bs4 import BeautifulSoup

soup=BeautifulSoup(html,"html.parser")
print(soup.title)#<title>The Dormouse's story</title>
print(soup.title.string)#The Dormouse's story
print(soup.title.parent)#<head><title>The Dormouse's story</title></head>
print(soup.p)#first <p> tag: <p class="title"><b>The Dormouse's story</b></p>
print(soup.a)#first <a> tag: <a class="sister" href="http://example.com/elsie" id="link1"></a>
print(soup.p['class'])#['title']
print(soup.find_all("a"))#list of all <a> tags
print(soup.find(id="link2"))#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
print(soup.find(id="link2").string)#Lacie; .string is unreliable when the tag contains other tags
print(soup.find(id="link2").get_text())#Lacie
print(soup.find("p",class_="title"))#<p class="title"><b>The Dormouse's story</b></p>
print(soup.find("p",{"class":"story2"}))#<p class="story2">...</p>
print(soup.find("p",{"class":"story"}).get_text())#all text inside the tag, however many nested tags it contains
print("===================")
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)#name of each tag whose name starts with "b"

print("=============")#find every tag whose href attribute matches a pattern
all_href=soup.find_all(href=re.compile("http://example.com/.+"))
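
Two points from the script above can be shown in isolation: `.string` stops being useful once a tag holds more than one child, and the result of an `href` regex search is a list of tags whose attribute values still have to be read out. The sketch below uses a hypothetical `snippet` that is not part of the original script.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the <p> mixes plain text with a nested <a> tag,
# and a second link uses a relative URL that the pattern should skip.
snippet = (
    '<p class="story">Meet '
    '<a href="http://example.com/lacie" id="link2">Lacie</a> today.</p>'
    '<a href="/local">skip</a>'
)
soup = BeautifulSoup(snippet, "html.parser")

p = soup.find("p", class_="story")
print(p.string)      # None, because the <p> has several children
print(p.get_text())  # Meet Lacie today.

# find_all() with a compiled pattern keeps tags whose href matches it.
matches = soup.find_all(href=re.compile(r"^http://example\.com/.+"))
hrefs = [tag["href"] for tag in matches]
print(hrefs)         # ['http://example.com/lacie']
```

When the text of a tag with mixed content is needed, `get_text()` is the safer choice; `.string` suits tags known to contain a single string.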