Scraping 优快云 Blog Ranking Data with Python and Selenium (Source Code Included)

Latest recommended article published on 2025-04-04 16:07:28

LIY若依

Tags: python, development languages

Article link: https://blog.youkuaiyun.com/m0_74972192/article/details/140896372

In this post, I share how to use Python, Selenium, and BeautifulSoup to scrape specific data from a 优快云 blog page. A worked code example shows how each step fits together.

Preparation

First, install the required libraries:

pip install selenium beautifulsoup4

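Step-by-Step Walkthrough and Code Explanation

1. Initialize the options

We use Options to configure Chrome to run in headless mode, and set a few additional arguments so that the browser runs properly in a server environment.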

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
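
As a rough guide to what each of these flags does, the sketch below collects the four arguments in one place. The names `CHROME_FLAGS` and `build_chrome_args`, and the `headless` parameter, are illustrative assumptions and not part of Selenium; the descriptions summarize commonly documented Chrome behavior.

```python
# Hypothetical helper (not part of Selenium): gathers the Chrome flags used
# in this step together with a short note on why each one is commonly set.
CHROME_FLAGS = {
    '--headless': 'run Chrome without opening a visible window',
    '--disable-gpu': 'turn off GPU hardware acceleration',
    '--no-sandbox': 'disable the Chrome sandbox (often needed when running as root in a container)',
    '--disable-dev-shm-usage': 'write shared memory files to /tmp instead of /dev/shm',
}


def build_chrome_args(headless=True):
    """Return the flags in order, optionally leaving out --headless for local debugging."""
    return [flag for flag in CHROME_FLAGS if headless or flag != '--headless']


for flag in build_chrome_args():
    print(f"{flag}: {CHROME_FLAGS[flag]}")
```

Each returned string can be passed to `chrome_options.add_argument(...)` as in the snippet above. Calling `build_chrome_args(headless=False)` keeps the browser window visible, which can help when checking selectors locally. Note that `--no-sandbox` weakens Chrome's isolation, so it is best reserved for environments that need it.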

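2. Open the page with Selenium

We use Selenium to open the target 优快云 blog page, then wait up to 20 seconds for the first element with the class hosetitem-dec to appear, which signals that the ranking entries have loaded.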

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(options=chrome_options)
url = 'https://blog.youkuaiyun.com/rank/list/content?type=python'
driver.get(url)

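try:
    # Wait up to 20 seconds for the first ranking entry to be present in the DOM
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'hosetitem-dec'))
    )
except Exception as e:
    print("Error: ", e)
    driver.quit()
    # Stop here: the browser is closed, so the following steps cannot run
    raise SystemExit(1)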
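
Conceptually, an explicit wait polls a condition until it returns a truthy value or a deadline passes. The sketch below imitates that idea in plain Python so it can run without a browser. `wait_until` and `ready_after_three_polls` are hypothetical names used only for illustration, and this is a simplified stand-in rather than Selenium's actual implementation.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "the element is present": becomes truthy on the third poll.
attempts = {'count': 0}


def ready_after_three_polls():
    attempts['count'] += 1
    return 'element' if attempts['count'] >= 3 else None


found = wait_until(ready_after_three_polls, timeout=2.0, poll_interval=0.01)
print(found, attempts['count'])
```

The practical point is that the wait returns as soon as the condition holds, so a generous timeout such as 20 seconds costs nothing when the page loads quickly.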

3. Get the page source

Once the page has loaded, we read its source and then close the browser.

html_content = driver.page_source
driver.quit()
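
Because the browser should be closed whether or not the earlier steps succeed, one option is to wrap the driver in a context manager so that `quit()` always runs exactly once. The sketch below is one possible way to organize this, not the original author's code. `quitting` and `FakeDriver` are hypothetical names, and `FakeDriver` only stands in for `webdriver.Chrome` so the example runs without a browser.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call driver.quit() on exit, even after an error."""
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Hypothetical stand-in exposing only the members used in this step."""

    def __init__(self):
        self.page_source = '<html><body>demo</body></html>'
        self.quit_calls = 0

    def quit(self):
        self.quit_calls += 1


fake = FakeDriver()
with quitting(fake) as driver:
    # Read the source while the (fake) browser is still open
    html_content = driver.page_source

print(html_content, fake.quit_calls)
```

With a real browser, `with quitting(webdriver.Chrome(options=chrome_options)) as driver:` would take the place of the separate `driver.quit()` calls.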

4. Parse the HTML with BeautifulSoup

We parse the page source with BeautifulSoup and collect every entry that matches the expected structure.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, 'html.parser')
data_list = []

for item in soup.find_all('div', class_='hosetitem-dec'):
    # The author link and popularity score come before the stats block in the document
    name_tag = item.find_previous('a', class_='name')
    num_tag = item.find_previous('span', class_='num')
    # Look up the stat spans once; indexing a short list directly would raise IndexError
    stats = item.find_all('span', style='margin-right: 4px;')

    # Skip entries that lack the author, the score, or any of the three stats
    if name_tag and num_tag and len(stats) >= 3:
        data_list.append({
            'name': name_tag.text,
            'num': num_tag.text,
            'views': stats[0].text,
            'comments': stats[1].text,
            'favorites': stats[2].text
        })
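
To see how `find_previous` and `find_all` behave here, the same extraction can be tried on a small hand-written fragment. The markup in `SAMPLE_HTML` is invented for illustration and only assumes the class names and inline style used above; the real page structure may differ. The second entry deliberately has a single stat, to show an incomplete entry being skipped.

```python
from bs4 import BeautifulSoup

# Invented sample markup (an assumption, not copied from the real page).
SAMPLE_HTML = """
<div class="entry">
  <span class="num">9876</span>
  <a class="name" href="#">alice</a>
  <div class="hosetitem-dec">
    <span style="margin-right: 4px;">120</span>
    <span style="margin-right: 4px;">8</span>
    <span style="margin-right: 4px;">15</span>
  </div>
</div>
<div class="entry">
  <span class="num">5432</span>
  <a class="name" href="#">bob</a>
  <div class="hosetitem-dec">
    <span style="margin-right: 4px;">77</span>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
records = []
for item in soup.find_all('div', class_='hosetitem-dec'):
    # find_previous searches backwards through the document from this div
    name_tag = item.find_previous('a', class_='name')
    num_tag = item.find_previous('span', class_='num')
    stats = item.find_all('span', style='margin-right: 4px;')
    # Keep only entries that have all three stats
    if name_tag and num_tag and len(stats) >= 3:
        records.append({
            'name': name_tag.text,
            'num': num_tag.text,
            'views': stats[0].text,
            'comments': stats[1].text,
            'favorites': stats[2].text,
        })

print(records)
```

One caveat: `find_previous` is not limited to the current entry, so if an entry had no author link of its own, the search would return the link from an earlier entry.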

5. Print the results

Iterate over the extracted records and print each one.

for data in data_list:
    print(f"Author: {data['name']}, Popularity: {data['num']}, Views: {data['views']}, Comments: {data['comments']}, Favorites: {data['favorites']}")

print("Scraping complete.")

Complete Code

The complete code follows:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Initialize the options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Open the page with Selenium
driver = webdriver.Chrome(options=chrome_options)
url = 'https://blog.youkuaiyun.com/rank/list/content?type=python'
driver.get(url)

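# Wait up to 20 seconds for the first ranking entry to be present
try:
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'hosetitem-dec'))
    )
except Exception as e:
    print("Error: ", e)
    driver.quit()
    # Stop here: the browser is closed, so the following steps cannot run
    raise SystemExit(1)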

# Get the page source
html_content = driver.page_source

# Close the browser
driver.quit()

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')

# Collect every entry that matches the expected structure
data_list = []
for item in soup.find_all('div', class_='hosetitem-dec'):
    # The author link and popularity score come before the stats block in the document
    name_tag = item.find_previous('a', class_='name')
    num_tag = item.find_previous('span', class_='num')
    # Look up the stat spans once; indexing a short list directly would raise IndexError
    stats = item.find_all('span', style='margin-right: 4px;')

    # Skip entries that lack the author, the score, or any of the three stats
    if name_tag and num_tag and len(stats) >= 3:
        data_list.append({
            'name': name_tag.text,
            'num': num_tag.text,
            'views': stats[0].text,
            'comments': stats[1].text,
            'favorites': stats[2].text
        })

# Print the results
for data in data_list:
    print(f"Author: {data['name']}, Popularity: {data['num']}, Views: {data['views']}, Comments: {data['comments']}, Favorites: {data['favorites']}")

print("Scraping complete.")