1. BeautifulSoup (bs4)
1. Getting nodes:
Reference code:
html="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
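html is the target string; 'lxml' is the parser (any other parser supported by BeautifulSoup, such as 'html.parser', can be used instead)
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)  # get the text content of the tag
print(soup.head)  # get the tag itself
print(soup.p)  # when several tags share a name, only the first one is matched; the others are ignored
print(soup.title.name)  # get the name of the node
Output: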
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
title
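The behaviour of .string and of attribute-style access can be checked with a smaller document. This is only a sketch: the shortened HTML is made up for illustration, and 'html.parser' is used instead of 'lxml' so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Illustrative HTML (an assumption, not the document's original string).
html = '<p class="title"><b>Bold</b> and plain</p><p>second</p>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                 # attribute-style access returns only the first <p>
print(first_p.name)              # tag name
print(first_p.b.string)          # <b> has a single text child, so .string returns it
print(first_p.string)            # <p> has two children, so .string is None

all_p = soup.find_all("p")       # use find_all() when every match is needed
print(len(all_p))
```

When a tag has more than one child, .string returns None; get_text() is the usual way to collect all the text inside it.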
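2. Getting attributes (same HTML as above):
attrs: the dictionary of a tag's attributes
The attrs attribute returns all of a tag's attributes as a dictionary.
Whether a value is a list or a string depends on the nature of the attribute:
class, for example, can hold multiple values, so it is returned as a list.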
from bs4 import BeautifulSoup
# html: the same string as in the previous example
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs)
print(soup.p.attrs['name'])
print(soup.p.attrs['class'])
print(soup.p['class'])
print(soup.p['name'])
Output:
{'class': ['title'], 'name': 'dromouse'}
dromouse
['title']
['title']
dromouse
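Indexing a tag with a missing attribute raises KeyError, so a defensive lookup is often useful. The following is a minimal sketch under the assumption of a made-up one-line HTML string, again parsed with 'html.parser':

```python
from bs4 import BeautifulSoup

# Illustrative HTML (an assumption): class holds two values here.
html = '<p class="title big" name="dromouse" id="intro">Hi</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

classes = p["class"]        # multi-valued attribute, returned as a list
name_attr = p["name"]       # single-valued attribute, returned as a string
missing = p.get("href")     # .get() returns None instead of raising KeyError
has_id = p.has_attr("id")   # explicit existence check

print(classes, name_attr, missing, has_id)
```

Note that p["name"] is the HTML attribute called name, while p.name is the tag name ("p").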
3. Nested access
1. Nested access
Reference HTML:
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
You can access the children or the descendants of an element
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)
print(soup.p.children)
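Output:
The first print shows a list of all direct children of the first <p> tag (text nodes and <a> tags).
The second refers to the same direct children, but it prints only an iterator object, not the three <a> tag objects themselves.
2. contents or children
Both give the direct children. The former is a list of nodes (tags and strings); the latter is an iterator, which has to be traversed to get the actual contents.
for i, j in enumerate(soup.p.children):
    print(i, j)
Output: the index and node of each direct child (text nodes as well as <a> tags).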
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.
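As the output shows, the direct children include whitespace and text nodes. If only the tags are wanted, they can be filtered by type. A minimal sketch, assuming a shortened, made-up HTML string:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Illustrative HTML (an assumption), kept on one line to avoid whitespace nodes.
html = '<p>intro <a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

contents = soup.p.contents   # list: text nodes and tags, in document order
tag_children = [c for c in soup.p.children if isinstance(c, Tag)]  # tags only
ids = [t["id"] for t in tag_children]

print(len(contents), ids)
```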
3. Using descendants
Gets all descendants
for i, j in enumerate(soup.p.descendants):
    print(i, j)
Output: the index and object of every descendant.
0
Once upon a time there were three little sisters; and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2
3 <span>Elsie</span>
4 Elsie
5
6
7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9
and
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12
and they lived at the bottom of a well.
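The difference between children and descendants can be made concrete by counting both on a tiny document. This is a sketch with a made-up HTML string, not the reference HTML above:

```python
from bs4 import BeautifulSoup

# Illustrative HTML (an assumption): <span> is nested inside <a>.
html = '<p>one <a><span>Elsie</span></a></p>'
soup = BeautifulSoup(html, "html.parser")

children = list(soup.p.children)        # 'one ' and <a>
descendants = list(soup.p.descendants)  # 'one ', <a>, <span>, 'Elsie'

print(len(children), len(descendants))
```

descendants walks the tree depth-first, so the nested <span> and its text appear right after the <a> that contains them.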
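4. Getting parent nodes
parent returns the direct parent of the first a node; parents returns a generator over all of its ancestors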
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)
print(soup.a.parents)
Iterate over all ancestors:
for i, j in enumerate(soup.a.parents):
    print(i, j)
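Printing whole ancestor nodes is verbose, so it can help to print only their names. A minimal sketch, assuming a small made-up HTML string:

```python
from bs4 import BeautifulSoup

# Illustrative HTML (an assumption).
html = '<html><body><p><a id="link1">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

parent_name = a.parent.name                 # direct parent only
ancestor_names = [x.name for x in a.parents]  # every ancestor, innermost first

print(parent_name)
print(ancestor_names)
```

The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.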
5. Getting sibling nodes
1. Get the next sibling of the a tag
print(soup.a.next_sibling)
2. Get the previous sibling of the a tag
print(soup.a.previous_sibling)
3. Get all following siblings
print(soup.a.next_siblings)
for i, j in enumerate(soup.a.next_siblings):
    print(i, j)
4. Get all preceding siblings
print(soup.a.previous_siblings)
for i, j in enumerate(soup.a.previous_siblings):
    print(i, j)
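A sibling is often a text node rather than a tag. When the next tag is wanted, find_next_sibling() and find_next_siblings() skip the text in between. A minimal sketch with a made-up HTML string:

```python
from bs4 import BeautifulSoup

# Illustrative HTML (an assumption).
html = ('<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a>'
        ' and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

next_node = first.next_sibling             # the text node ', '
prev_node = first.previous_sibling         # None: nothing precedes the first <a>
next_tag = first.find_next_sibling("a")    # the next <a> tag, skipping text
later_ids = [s["id"] for s in first.find_next_siblings("a")]

print(repr(next_node), prev_node, next_tag["id"], later_ids)
```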
6. Method selectors (the examples in sections 6 and 7 assume HTML like the one in section 8)
print(soup.find_all(name='ul'))
print(soup.find_all(name='ul')[0])
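- The first line finds all <ul> tags in the document.
- The second line finds all <ul> tags, then accesses the first element of the resulting list.
for ul in soup.find_all(name='ul'):
    for li in ul.find_all(name='li'):
        print(li.string)
This loops over all <ul> tags, finds all <li> tags inside each <ul>, and prints the text content of each <li>.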
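Indexing find_all(...)[0] raises IndexError when nothing matches, whereas find() returns None. The sketch below shows both the nested find_all() loop and find(); the HTML string is a shortened, made-up version of a list document:

```python
from bs4 import BeautifulSoup

# Illustrative HTML (an assumption).
html = ('<ul id="list-1"><li>Foo</li><li>Bar</li></ul>'
        '<ul id="list-2"><li>Jay</li></ul>')
soup = BeautifulSoup(html, "html.parser")

# Nested find_all(): every <li> inside every <ul>.
items = [str(li.string) for ul in soup.find_all("ul") for li in ul.find_all("li")]

first_ul = soup.find("ul")     # first match only
missing = soup.find("table")   # no match: None instead of an exception

print(items, first_ul["id"], missing)
```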
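7. Attribute selectors
Get the elements whose id is list-1; a list is returned
print(soup.find_all(attrs={"id": "list-1"}))
Find all tags that have the attribute name="elements"
print(soup.find_all(attrs={"name": "elements"}))
Find all tags that have the attribute id="list-1"
print(soup.find_all(id='list-1'))
Find all tags whose class attribute includes "list"
Because class is a reserved keyword in Python, a trailing underscore is required
print(soup.find_all(class_='list'))
Regular expressions can be used to match text in the page; a list of the matching strings is returned
import re
print(soup.find_all(string=re.compile("Foo")))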
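The attribute filters can be tried together on one small document. This is a sketch under the assumption of a made-up HTML string with two lists; note that class_='list' also matches a tag whose class is "list list-small", because any one of the class values may match.

```python
import re
from bs4 import BeautifulSoup

# Illustrative HTML (an assumption).
html = ('<ul class="list" id="list-1"><li class="element">Foo</li>'
        '<li class="element">Bar</li></ul>'
        '<ul class="list list-small" id="list-2"><li class="element">Foo</li></ul>')
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all(attrs={"id": "list-1"})        # dictionary form
by_id = soup.find_all(id="list-1")                      # keyword form, same result
by_class = soup.find_all(class_="list")                 # matches both <ul> tags
foo_strings = soup.find_all(string=re.compile("Foo"))   # text nodes, not tags

print(len(by_id), len(by_class), foo_strings)
```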
8. CSS selectors
Note: test each print statement one at a time
from bs4 import BeautifulSoup
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
soup = BeautifulSoup(html,'lxml')
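# select() with a descendant selector: returns the elements matching the last selector that are nested inside the earlier ones
print(soup.select(".panel .panel-body"))
# comma-separated selectors: returns the elements matching either selector
print(soup.select(".panel, .panel-body"))
# element and id selectors
print(soup.select("ul li"))
print(soup.select("#list-2"))
# nested selection: get the data inside each element
for ul in soup.select("ul"):
    for li in ul.select("li"):
        print(li.string)
for ul in soup.select("ul"):
    print(ul['id'])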
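select() always returns a list; when only the first match is needed, select_one() returns a single tag or None. A minimal sketch, assuming a shortened, made-up version of the panel HTML and 'html.parser' as the parser:

```python
from bs4 import BeautifulSoup

# Illustrative HTML (an assumption), a cut-down version of the panel markup.
html = ('<div class="panel"><div class="panel-body">'
        '<ul class="list" id="list-1"><li>Foo</li><li>Bar</li></ul>'
        '<ul class="list list-small" id="list-2"><li>Jay</li></ul>'
        '</div></div>')
soup = BeautifulSoup(html, "html.parser")

nested = soup.select(".panel .panel-body")     # .panel-body inside .panel
grouped = soup.select(".panel, .panel-body")   # either selector
first_li = soup.select_one("#list-2 li")       # first match only
ids = [ul["id"] for ul in soup.select("ul")]
small_texts = [li.get_text() for li in soup.select("ul.list-small li")]

print(len(nested), len(grouped), first_li.get_text(), ids, small_texts)
```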