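[Data Collection Technology: Lab 02 (Data Acquisition and Analysis)]: Scraping data with XPath in lxml, and getting node attributes and text content with bs4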

达不溜001

Last modified 2024-11-26 17:22:32

Category: Python. Tags: beautifulsoup requests lxml XPath python

First published 2024-10-18 11:00:45

Article link: https://blog.youkuaiyun.com/Abraxs/article/details/142907226

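Contents

Assignment requirements
TEST-01 (Scrape movie information from Douban Movie Top 250 [using requests and XPath from the lxml module])
- test1-01: Code
- test1-02: Lab screenshot
TEST-02 (Get the code of HTML page nodes with BeautifulSoup)
- test2-01: Code
- test2-02: Lab screenshot
TEST-03 (Get HTML node attributes and text content with BeautifulSoup)
- test3-01: Code
- test3-02: Lab screenshot
before-test01
before-Using the BeautifulSoup module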

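Assignment requirements

Data Collection Technology, Lab 2 (Data Acquisition and Analysis)

1. Create a project folder named "student ID-2", for example 20220001-2. All later files go in this folder.

2. Create test1.py and use the requests module and XPath from the lxml module to scrape movie information from Douban Movie Top 250.

3. Create test2.py and use the BeautifulSoup module to get the code of nodes in an HTML page.

4. Create test3.py and use the BeautifulSoup module to get the attributes and text content of nodes in an HTML page.

# Using the requests module and XPath from the lxml module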

TEST-01 (Scrape movie information from Douban Movie Top 250 [using requests and XPath from the lxml module])

test1-01: Code

from lxml import etree
import time
import random
import requests

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/96.0.4664.45 Safari/537.36'
}


def processing(strs):
    # Strip every whitespace character from each text fragment, then join the fragments.
    return ''.join(''.join(n.split()) for n in strs)


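def get_movie_info(url):
    # Download one list page; fail early on a timeout or an HTTP error status.
    response = requests.get(url, headers=header, timeout=10)
    response.raise_for_status()
    html = etree.HTML(response.text)
    # Every movie entry is wrapped in <div class="info">.
    div_all = html.xpath('//div[@class="info"]')
    for div in div_all:
        names = div.xpath('./div[@class="hd"]/a//span/text()')
        name = processing(names)
        infos = div.xpath('./div[@class="bd"]/p/text()')
        info = processing(infos)
        # xpath() returns a list; join it so a plain string is printed,
        # and a missing field (e.g. no quote) becomes an empty string.
        score = ''.join(div.xpath('./div[@class="bd"]/div/span[2]/text()'))
        evaluation = ''.join(div.xpath('./div[@class="bd"]/div/span[4]/text()'))
        summary = ''.join(div.xpath('./div[@class="bd"]/p[@class="quote"]/span/text()'))
        print('Movie title:', name)
        print('Director and cast:', info)
        print('Rating:', score)
        print('Number of ratings:', evaluation)
        print('One-line summary:', summary)
        print('----------------------------------divider--------------------------------------------')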


if __name__ == '__main__':
    for i in range(0, 250, 25):
        url = 'https://movie.douban.com/top250?start={page}&filter='.format(page=i)
        get_movie_info(url)
        time.sleep(random.randint(1, 3))
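
The XPath expressions used in test1.py can be tried without touching the network. The following is a minimal sketch that runs the same relative paths against a hand-written snippet. `SAMPLE_HTML`, `first_text`, and `parse_entries` are hypothetical names introduced for illustration, and the snippet only imitates the structure those paths expect; it is not a copy of the real page.

```python
from lxml import etree

# Hypothetical snippet that imitates one list entry (assumption, not real page source).
SAMPLE_HTML = """
<div class="info">
  <div class="hd">
    <a href="#"><span class="title">Movie A</span><span class="title"> / Alias A</span></a>
  </div>
  <div class="bd">
    <p class="">Director: Someone
       1994 / Country / Genre</p>
    <div class="star">
      <span class="rating5-t"></span>
      <span class="rating_num">9.7</span>
      <span content="10.0"></span>
      <span>100 ratings</span>
    </div>
    <p class="quote"><span class="inq">A short tagline.</span></p>
  </div>
</div>
"""


def first_text(node, path, default=''):
    """Return the first XPath text result, stripped, or a default if nothing matches."""
    result = node.xpath(path)
    return result[0].strip() if result else default


def parse_entries(html_text):
    """Turn every <div class="info"> block into a dict of plain strings."""
    tree = etree.HTML(html_text)
    entries = []
    for div in tree.xpath('//div[@class="info"]'):
        # A leading "./" makes each path relative to the current <div class="info">.
        entries.append({
            'name': first_text(div, './div[@class="hd"]/a/span[1]/text()'),
            'score': first_text(div, './div[@class="bd"]/div/span[2]/text()'),
            'votes': first_text(div, './div[@class="bd"]/div/span[4]/text()'),
            'quote': first_text(div, './div[@class="bd"]/p[@class="quote"]/span/text()',
                                default='N/A'),
        })
    return entries


entries = parse_entries(SAMPLE_HTML)
print(entries)
```

Returning a default from `first_text` is one way to avoid an IndexError when an entry lacks a field, for example a movie without a quote.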

test1-02: Lab screenshot

[Screenshot of the test1.py output]

TEST-02 (Get the code of HTML page nodes with BeautifulSoup)

test2-01: Code

from bs4 import BeautifulSoup

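# Create a string that simulates HTML code
html_doc = """
<html>
    <head>
        <title>My first HTML page</title>
    </head>
    <body>
        <p>The content of the body element is shown in the browser.</p>
        <p>The content of the title element is shown in the browser's title bar.</p>
    </body>
</html>
"""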

soup = BeautifulSoup(html_doc, features='lxml')
print('head node:\n', soup.head)
print('*'*120)
print('body node:\n', soup.body)
print('*'*120)
print('title node:\n', soup.title)
print('*'*120)
# soup.p returns only the first <p> node in the document
print('p node:\n', soup.p)
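
Accessing a tag as an attribute, as in `soup.p`, gives only the first matching node. The sketch below shows two ways to collect every `<p>` node instead, `find_all` and a CSS selector through `select`. The sample HTML is a shortened, hypothetical stand-in for the page above, and `html.parser` is used here only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document with two paragraphs.
html_doc = """
<html>
    <head><title>My first HTML page</title></head>
    <body>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, features='html.parser')

first_p = soup.p                    # same as soup.find('p'): first match only
all_p = soup.find_all('p')          # every <p> node, in document order
via_css = soup.select('body > p')   # the same nodes through a CSS selector

print('first p node:', first_p)
for index, p in enumerate(all_p, start=1):
    print(f'p node {index}:', p)
```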

test2-02: Lab screenshot

[Screenshot of the test2.py output]

TEST-03 (Get HTML node attributes and text content with BeautifulSoup)

test3-01: Code

from bs4 import BeautifulSoup

# Create a string that simulates HTML code
html_doc = """
<html>
    <head>
        <title>Horizontal responsive login</title>
        <meta http-equiv="Content-Type" content="text/html" charset="utf-8"/>
        <meta name="viewport" content="width=device-width"/>
        <link href="font/css/bootstrap.min.css" type="text/css" rel="stylesheet">
        <link href="css/style.css" type="text/css" rel="stylesheet">
    </head>
    <body>
        <h3>Log in</h3>
        <div class="glyphicon glyphicon-envelope"><input type="text" placeholder="Enter your email"></div>
        <div class="glyphicon glyphicon-lock"><input type="password" placeholder="Enter your password"></div>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, features='lxml')
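# .attrs is a dict holding all attributes of the first matching node
print('attributes of the meta node:\n', soup.meta.attrs)
print('attributes of the link node:\n', soup.link.attrs)
print('attributes of the div node:\n', soup.div.attrs)
print('*' * 120)
# Index the dict to read a single attribute; class comes back as a list
print('http-equiv attribute of the meta node:\n', soup.meta.attrs['http-equiv'])
print('href attribute of the link node:\n', soup.link.attrs['href'])
print('class attribute of the div node:\n', soup.div.attrs['class'])
print('*' * 120)
print('text content of the title node:\n', soup.title.string)
print('text content of the h3 node:\n', soup.h3.string)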
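
Two details of the calls above are easy to trip over: indexing `attrs` with a missing key raises a KeyError, and `.string` returns None when a node has more than one child. The sketch below illustrates both on a small hypothetical snippet (the `id` attribute and the `<p>` node are assumptions added for the demonstration), using `Tag.get` and `get_text` as the more forgiving alternatives.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; only the class names are borrowed from the sample page.
html_doc = """
<div class="glyphicon glyphicon-lock" id="box">
    <input type="password" placeholder="Password">
    <p>Hello <b>world</b></p>
</div>
"""

soup = BeautifulSoup(html_doc, features='html.parser')

div = soup.div
classes = div['class']                   # multi-valued attribute: returned as a list
div_id = div.get('id')                   # single-valued attribute: returned as a string
missing = div.get('title')               # absent attribute: None instead of KeyError
fallback = div.get('title', 'no title')  # absent attribute with an explicit default

p = soup.p
p_string = p.string     # None, because <p> has two children (text and <b>)
p_text = p.get_text()   # concatenated text of all descendants
b_string = p.b.string   # <b> has a single text child, so .string works

print(classes, div_id, missing, fallback)
print(p_string, p_text, b_string)
```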

test3-02: Lab screenshot

[Screenshot of the test3.py output]

before-test01

import requests
from lxml import etree

url = 'https://movie.douban.com/top250'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}

response = requests.get(url, headers=headers)

if response.status_code == 200:

    tree = etree.HTML(response.text)
    movies = tree.xpath('//div[@class="info"]')

    for movie in movies:

        title = movie.xpath('div[@class="hd"]/a/span[1]/text()')[0]
        rating_num = movie.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()')[0]
        # The quote sits in <p class="quote"><span>; <p class=""> holds the director/cast text instead.
        quote = movie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()')
        quote = quote[0] if quote else "N/A"

        print(f"Movie title: {title}")
        print(f"Rating: {rating_num}")
        print(f"Quote: {quote}\n")

else:
    print("Request failed, status code:", response.status_code)
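
This draft requests only the first page, while test1.py walks through all ten pages by stepping the `start` query parameter from 0 to 225 in steps of 25. As a small sketch of that pagination step on its own, the helper below builds the ten page URLs with the standard library. `page_urls` is a hypothetical name, and the `start`/`filter` parameters are taken from the URL used in test1.py.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'


def page_urls(total=250, page_size=25):
    """Build one list-page URL per block of page_size entries."""
    urls = []
    for start in range(0, total, page_size):
        # urlencode turns the dict into "start=<n>&filter="
        query = urlencode({'start': start, 'filter': ''})
        urls.append(f'{BASE_URL}?{query}')
    return urls


urls = page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```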

before-Using the BeautifulSoup module

3. Create test2.py and use the BeautifulSoup module to get the code of nodes in an HTML page.

## test02
# test2.py
import requests
from bs4 import BeautifulSoup

# Example HTML page URL (a static page is used here as an example)
url = 'https://example.com'  # replace with the actual HTML page URL

# Send an HTTP GET request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}
response = requests.get(url, headers=headers)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find a specific node (<div class="example"> is used as an example)
    node = soup.find('div', class_='example')

    # Print the node's HTML code; find() returns None when nothing matches
    if node is not None:
        print("HTML code of the node:")
        print(str(node))
    else:
        print("No matching node found")
else:
    print("Request failed, status code:", response.status_code)

4. Create test3.py and use the BeautifulSoup module to get the attributes and text content of nodes in an HTML page.

## test03
# test3.py
import requests
from bs4 import BeautifulSoup

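# Example HTML page URL (a static page is used here as an example)
url = 'https://example.com'  # replace with the actual HTML page URL

# Send an HTTP GET request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}
response = requests.get(url, headers=headers)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find a specific node (the first <a> tag is used as an example)
    node = soup.find('a')

    # find() returns None when nothing matches, so guard before reading attributes
    if node is not None:
        print("Attributes of the node:")
        for attr, value in node.attrs.items():
            print(f"{attr}: {value}")

        print("\nText content of the node:")
        print(node.get_text())
    else:
        print("No matching node found")
else:
    print("Request failed, status code:", response.status_code)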