Python Web Scraping, Part 3: BeautifulSoup

1. BeautifulSoup (bs4)

1. Getting nodes:

Sample code:

html="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

html is the target string and 'lxml' is the parser (any other supported parser can be swapped in):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.title.string)  # get the text inside the tag
print(soup.head)          # get the tag itself
print(soup.p)             # when several identical tags exist, only the first one is matched
print(soup.title.name)    # get the tag name

Output:

The Dormouse's story

<head><title>The Dormouse's story</title></head>

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

title

2. Getting attributes (the HTML content is the same as above):

The attrs property returns all of a tag's attributes as a dictionary.

Whether an individual value comes back as a list or a string depends on the attribute itself:

class, for example, can hold multiple values, so it is returned as a list, while name is returned as a plain string.

from bs4 import BeautifulSoup

# html is the same Dormouse's story fragment used in section 1
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs)            # all attributes of the first <p>, as a dict
print(soup.p.attrs['name'])
print(soup.p.attrs['class'])
print(soup.p['class'])         # indexing the tag is a shortcut for attrs[...]
print(soup.p['name'])

Output:

{'class': ['title'], 'name': 'dromouse'}

dromouse

['title']

['title']

dromouse
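
A related convenience, shown here as a minimal sketch that is not in the original: tag.get() does the same lookup as indexing but returns None (or a supplied default) instead of raising a KeyError when the attribute is missing.

print(soup.p.get('name'))            # dromouse
print(soup.p.get('missing'))         # None -- no KeyError for absent attributes
print(soup.p.get('missing', 'n/a'))  # 'n/a' -- optional default value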

3. Nested calls

1. Nested calls

Sample HTML:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""

You can reach an element's direct children or its full set of descendants.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)  # list of the first <p>'s direct children
print(soup.p.children)  # iterator over the same direct children

Output:

The first print shows a list of every direct child of the first <p> tag (tags and text nodes alike).

The second print only shows an iterator object such as <list_iterator object at 0x...>; it has to be iterated to reach the same children.

2. contents vs. children

Both give you the direct children: contents returns them as a plain list, while children returns a list iterator that has to be traversed to get at the individual items.

for i,j in enumerate(soup.p.children):
    print(i,j)

Output: the index of each direct child together with the child itself. Note that the text and whitespace between tags appear as children too, not just the three <a> tags.

0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
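
If only the element children are wanted, without the whitespace/text nodes, a small sketch (not in the original) is to filter on .name, which is None for text nodes:

# keep only Tag children; text-node children have name == None
tag_children = [child for child in soup.p.children if child.name is not None]
print(tag_children)  # just the three <a> tags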

3. descendants

Gets the set of all descendants: children, grandchildren, and so on, including the text inside nested tags.

for i,j in enumerate(soup.p.descendants):
    print(i,j)

Output: the index and object of each of these descendants is printed.

0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.
        


4. Getting the parent node

Getting the parent of the <a> node: parent returns the single direct parent, while parents is a generator over all of its ancestors.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)   # the direct parent tag
print(soup.a.parents)  # a generator object; iterate it to see every ancestor

Iterating over all ancestor nodes:

for i,j in enumerate(soup.a.parents):
    print(i,j)
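
To keep the output readable, a minimal sketch (same HTML as above) prints only each ancestor's tag name; for the first <a> this gives p, body, html and finally the [document] object that wraps the whole soup:

for i, parent in enumerate(soup.a.parents):
    print(i, parent.name)  # p, body, html, [document]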

5. Getting sibling nodes

1. Get the next sibling of the <a> tag

print(soup.a.next_sibling)

2. Get the previous sibling of the <a> tag

print(soup.a.previous_sibling)

3. Get all following siblings

print(soup.a.next_siblings)
for i,j in enumerate(soup.a.next_siblings):
    print(i,j)

4. Get all preceding siblings

print(soup.a.previous_siblings)
for i,j in enumerate(soup.a.previous_siblings):
    print(i,j)
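
Note that next_siblings and previous_siblings are generators, and that the whitespace between tags counts as text-node siblings. A small sketch (same HTML as above) makes this visible:

siblings = list(soup.a.next_siblings)  # materialize the generator
for s in siblings:
    print(repr(s))                     # text nodes show up as plain strings such as '\n'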

6. Method selectors

Note: the find_all() examples in this section and the next use the <ul>/<li> HTML fragment shown in section 8.

print(soup.find_all(name='ul'))
print(soup.find_all(name='ul')[0])
  • Finds every <ul> tag in the document.
  • Finds all <ul> tags, then accesses the first element of the returned list.
for ul in soup.find_all(name='ul'):
    for li in ul.find_all(name='li'):
        print(li.string)

This loops over every <ul> tag, finds all the <li> tags inside each <ul>, and prints the text content of each <li>.
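
Because the Dormouse HTML has no <ul> tags, here is a self-contained sketch using the <ul>/<li> fragment from section 8 (an assumption about the HTML these examples were written against):

from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
    <li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))      # every <ul> tag, as a list
print(soup.find_all(name='ul')[0])   # the first <ul> tag
for ul in soup.find_all(name='ul'):
    for li in ul.find_all(name='li'):
        print(li.string)             # Foo, Bar, Jay, Foo, Bar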

7. Attribute selectors

Get the elements whose id attribute is list-1; a list is returned.

print(soup.find_all(attrs={"id":"list-1"}))

Find all tags that have the attribute name="elements".

print(soup.find_all(attrs={"name":"elements"}))

Find all tags that have id="list-1".

print(soup.find_all(id='list-1'))

Find all tags that have class="list".

Because class is a reserved keyword in Python, the parameter is written with a trailing underscore: class_.

print(soup.find_all(class_='list'))

Regular expressions can be used to match text content in the page; a list of matching strings is returned (this requires import re).

print(soup.find_all(string=re.compile("Foo")))
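
As in the previous section, these calls assume the section 8 fragment; below is a self-contained sketch, including the import re needed for the regex example:

import re
from bs4 import BeautifulSoup

html = '''
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={"id": "list-1"}))    # match by arbitrary attribute
print(soup.find_all(id='list-1'))               # keyword shorthand for the same query
print(soup.find_all(class_='list'))             # class_ because class is a keyword
print(soup.find_all(string=re.compile("Foo")))  # ['Foo'] -- matching text nodes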

8. CSS selectors

Note: test each print statement one at a time.

from bs4 import BeautifulSoup

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
soup = BeautifulSoup(html, 'lxml')
# ".panel .panel-body" is a descendant selector: .panel-body elements inside .panel;
# ".panel, .panel-body" (with a comma) is a union of both selectors
print(soup.select(".panel .panel-body"))
print(soup.select(".panel, .panel-body"))

# tag and id selectors
print(soup.select("ul li"))
print(soup.select("#list-2"))

# nested select: get the data inside each element
for ul in soup.select("ul"):
    for li in ul.select("li"):
        print(li.string)

for ul in soup.select("ul"):
    print(ul['id'])
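
A small related sketch, assuming Beautiful Soup 4.4 or later (select_one is not used in the original): it returns only the first match instead of a list, which saves indexing with [0] when a single element is expected.

print(soup.select_one("#list-1 .element").string)  # Foo
print(soup.select_one("#list-2")['id'])            # list-2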
