Scrapy: scraping data from an infinite-scroll page

In the previous post I kicked off the Scrapy series, but this time the target page loads its data as you scroll: there is no pagination, and you have to scroll all the way to the bottom before everything shows up, which is a real pain. There are two options:

  • Open the browser's F12 dev tools to trace the site's backend API, work out its pattern, and rebuild the requests locally.
  • Drive the page with selenium and grab the rendered data.

The backend API looks like this:
(screenshot of the backend API request omitted)
So that route is out, because I have no idea how to construct those requests! Falling back to selenium instead.

selenium

Selenium is a comprehensive project that provides a range of tools and libraries for automating web browsers. It is a tool! Besides the selenium package itself (pip install selenium), you also need Chrome and the matching chromedriver, i.e. the browser and its driver:

https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/linux64/chrome-linux64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-arm64/chrome-mac-arm64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-x64/chrome-mac-x64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win32/chrome-win32.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win64/chrome-win64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/linux64/chromedriver-linux64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-arm64/chromedriver-mac-arm64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-x64/chromedriver-mac-x64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win32/chromedriver-win32.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win64/chromedriver-win64.zip

Some people will say webdriver_manager works just as well. Then go ahead and use it…

I put the browser at E:\selenium\chrome\Application\chrome.exe and the driver at E:\selenium\driver\chromedriver.exe; both of these paths are used later.
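Before wiring anything up, a quick sanity check that selenium is installed and both paths actually exist can save some head-scratching later; a minimal sketch (the paths are just my local layout):

import os
import selenium

driver_path = r"E:\selenium\driver\chromedriver.exe"
chrome_path = r"E:\selenium\chrome\Application\chrome.exe"

print("selenium version:", selenium.__version__)        # e.g. 4.x
print("chromedriver found:", os.path.exists(driver_path))
print("chrome found:", os.path.exists(chrome_path))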

Configuration

When initializing the Service, you need to point it at the driver path:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from scrapy.http import HtmlResponse
driver_path = r"E:\selenium\driver\chromedriver.exe"

service = Service(executable_path=driver_path)

Then configure some options, such as the Chrome binary path:

driver_path = r"E:\selenium\driver\chromedriver.exe"
chrome_path = r'E:\selenium\chrome\Application\chrome.exe'
profile_path = r"E:\selenium\data\profiles"
cache_path = r"E:\selenium\data\caches"
options = webdriver.ChromeOptions()
options.binary_location = chrome_path
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--headless')  # headless mode
options.add_argument('--disable-gpu')  
options.add_argument(f'--user-data-dir={profile_path}')
options.add_argument(f'--disk-cache-dir={cache_path}')
  • options.binary_location: explicitly points to the Chrome executable.
  • options.add_argument('--no-sandbox'): disables Chrome's sandbox security mechanism. When running as root on Linux the sandbox causes permission conflicts and the browser refuses to start; this is common inside Docker containers.
  • options.add_argument('--disable-dev-shm-usage'): /dev/shm on Linux defaults to 64MB, and Chrome crashes when it needs more shared memory than that.
  • options.add_argument('--headless'): enables headless mode, i.e. no browser window. The window is annoying, so hide it.
  • options.add_argument('--disable-gpu'): GPU acceleration is pointless in headless mode and can cause problems.
  • options.add_argument(f'--user-data-dir={profile_path}'): where user data such as profiles, cache, cookies, and local storage is kept.
  • options.add_argument(f'--disk-cache-dir={cache_path}'): where page resources such as images, CSS, and JavaScript files are cached.

Creating a browser instance

Create and launch a Chrome browser instance:

service = Service(executable_path=driver_path)
driver = None
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.get('https://www.hao123.com/')
    print(driver.title)
except Exception as e:
    print("错误详情:", e)
finally:
    if driver:
        driver.quit()
  • Always remember to quit() at the end; otherwise the next run fails because the previous Chrome wasn't shut down cleanly, and you get an error like this:
错误详情: Message: session not created: Chrome failed to start: crashed.
  (chrome not reachable)
  (The process started from chrome location E:\selenium\chrome\Application\chrome.exe is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

This bug is really annoying and cost me a whole evening; the fix is simply to kill the leftover Chrome processes:

  • On Windows, use: taskkill /f /im chrome.exe
  • On Linux, use: pkill chrome

Give it a run and see~~~
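If you'd rather do the cleanup from inside the script, here is a minimal sketch using the standard library (the process name chrome.exe / chrome is an assumption that matches a default install):

import platform
import subprocess

def kill_leftover_chrome():
    """Best-effort cleanup of stray Chrome processes before launching a new driver."""
    if platform.system() == "Windows":
        # /f = force kill, /im = match by image name
        subprocess.run(["taskkill", "/f", "/im", "chrome.exe"], capture_output=True)
    else:
        # pkill matches by process name; a non-zero exit just means nothing was running
        subprocess.run(["pkill", "chrome"], capture_output=True)

kill_leftover_chrome()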

Embedding it in Scrapy

When Scrapy runs, it first calls start_requests on the BzmhSpider class; in this method we can decide which callback to use.

from typing import Iterable
import scrapy
from bs4 import BeautifulSoup
from scrapy import Request

from ..items import FirstpcItem

class BzmhSpider(scrapy.Spider):
    name = "bzmh"
    allowed_domains = ["www.baozimh.com"]
    start_urls = ["https://www.baozimh.com/classify"]

    def start_requests(self) -> Iterable[Request]:
        if '/api/' in self.start_urls[0]:
            yield scrapy.Request(self.start_urls[0], callback=self.parse_normal)
        else:
            yield scrapy.Request(self.start_urls[0], callback=self.parse_scroll, meta={'need_scroll':True})

The callback here is the function that the response gets handed to once it arrives, so we add two more methods to BzmhSpider: parse_normal and parse_scroll. The meta={'need_scroll': True} part tells the downloader middleware that this request needs scrolling (in other words, that selenium should be used).

# bzmh.py
class BzmhSpider(scrapy.Spider):
    ...

    def parse_scroll(self, response):
        items = FirstpcItem()
        soup = BeautifulSoup(response.text, "html.parser")
        paragraphs = soup.find_all("a", attrs={"href": True, "title": True})
        for paragraph in paragraphs:
            items['name'] = paragraph['title']
            items['url'] = "https://www.baozimh.com" + paragraph['href']
            yield items

    def parse_normal(self, response):
        print(response.status)  # Scrapy responses expose .status, not .status_code
        pass

I'm using BeautifulSoup here because it's easier to read. I never got used to XPath (and I won't admit that I'm just lazy).
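For anyone who does prefer Scrapy's built-in selectors, the same extraction can be written without BeautifulSoup; a rough equivalent, just as a sketch:

class BzmhSpider(scrapy.Spider):
    # ...everything else unchanged
    def parse_scroll(self, response):
        # same idea, using Scrapy's CSS selectors instead of BeautifulSoup
        for a in response.css('a[href][title]'):
            items = FirstpcItem()
            items['name'] = a.attrib['title']
            items['url'] = "https://www.baozimh.com" + a.attrib['href']
            yield items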

# settings.py
# Define the paths here in the settings file so they can be read from settings later
CHROME_DRIVER_PATH = r"E:\selenium\driver\chromedriver.exe"
CHROME_BINARY_PATH = r'E:\selenium\chrome\Application\chrome.exe'
# Uncomment this block
DOWNLOADER_MIDDLEWARES = {
    "firstpc.middlewares.FirstpcDownloaderMiddleware": 543,
}
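Side note: instead of calling get_project_settings() inside the middleware (which is what I do below), Scrapy can also hand the settings to the middleware through from_crawler. A minimal sketch of that pattern, purely as an alternative:

# middlewares.py (alternative constructor, not the version used below)
class FirstpcDownloaderMiddleware:
    def __init__(self, driver_path, binary_path):
        self.driver_path = driver_path
        self.binary_path = binary_path

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes the same CHROME_* values defined in settings.py
        return cls(
            driver_path=crawler.settings.get("CHROME_DRIVER_PATH"),
            binary_path=crawler.settings.get("CHROME_BINARY_PATH"),
        )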

Then comes the middleware file, middlewares.py:

import os
import gc
import time
from pathlib import Path
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from scrapy.utils.project import get_project_settings

class FirstpcSpiderMiddleware:
    ...

# Leave the class above untouched; only the one below changes
class FirstpcDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self):
        # Only store the driver path here; do not start the browser yet
        os.environ["WDM_LOCAL"] = r"E:\selenium\driver"
        Path(os.environ["WDM_LOCAL"]).mkdir(parents=True, exist_ok=True)
        self.base_path = Path(r"E:\selenium")
        self.settings = get_project_settings()
        self.driver_path = self.settings.get("CHROME_DRIVER_PATH")
        self.driver = None  # browser instance starts out empty
        self._create_directories()
    def _create_directories(self):
        """Create the storage directory layout automatically"""
        dirs = [
            self.base_path / "driver",
            self.base_path / "data",
            self.base_path / "caches",
            self.base_path / "downloads"
        ]

        for dir_path in dirs:
            dir_path.mkdir(parents=True, exist_ok=True)

    def process_request(self, request, spider):

        if '/api/' in request.url:
            return self.handle_api_request(request)
        elif request.meta.get('need_scroll'):
            try:
                # Create a fresh browser instance for every request
                options = webdriver.ChromeOptions()
                options.binary_location = self.settings.get("CHROME_BINARY_PATH")
                options.add_argument('--no-sandbox')
                options.add_argument('--disable-dev-shm-usage')
                options.add_argument('--headless')  # headless mode
                options.add_argument('--disable-gpu')  # may be needed for headless mode on Windows
                # User data storage paths
                profile_path = r"E:\selenium\data\profiles"
                cache_path = r"E:\selenium\data\caches"
                options.add_argument(f'--user-data-dir={profile_path}')
                options.add_argument(f'--disk-cache-dir={cache_path}')

                service = Service(executable_path=self.driver_path)
                self.driver = webdriver.Chrome(
                    service=service,
                    options=options
                )

                self.driver.get(request.url)
                self._auto_scroll()
                return HtmlResponse(
                    url=request.url,
                    body=self.driver.page_source.encode('utf-8'),
                    encoding='utf-8',  # matches the .encode('utf-8') above
                    request=request
                )
            finally:
                # Always close the browser
                if self.driver:
                    self.driver.quit()
                self.driver = None  # reset the instance
                gc.collect()  # garbage collection
        else:
            return None
    # ...other methods
    def _auto_scroll(self):
        """Core auto-scroll logic"""
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        request_times = 0

        while request_times < 3:
            request_times += 1
            # Scroll to the bottom
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)  # wait for new content to load
            # Get the new height
            new_height = self.driver.execute_script("return document.body.scrollHeight")
            # Stop once the height no longer changes
            if new_height == last_height:
                break
            last_height = new_height

    def handle_api_request(self, request):
        # handle API requests
        return None
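The fixed time.sleep(2) works, but it burns two seconds per round even when the new content arrives faster. As a variation, selenium's explicit waits can poll until the page height actually grows; a rough sketch of that idea (not what I'm running above, just an alternative):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

def _auto_scroll(self, max_rounds=3, timeout=10):
    """Scroll until the page height stops growing, using an explicit wait instead of sleep."""
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            # poll (up to `timeout` seconds) until the scroll height exceeds the previous value
            WebDriverWait(self.driver, timeout).until(
                lambda d: d.execute_script("return document.body.scrollHeight") > last_height
            )
        except TimeoutException:
            break  # height never grew, assume we reached the bottom
        last_height = self.driver.execute_script("return document.body.scrollHeight")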

debug

After several runs, I noticed the post-scroll results differ from the first request: entries like {{name}} show up, and some URLs come out as https://www.baozimh.comhttps://www.baozimh.com. Time to tweak the parser (mostly cosmetic, nothing critical):

{'name': '萬渣朝凰',
 'url': 'https://www.baozimh.com/comic/mozhazhaohuang-shidaimanwang'}
{'name': '總裁在上',
 'url': 'https://www.baozimh.com/comic/zongcaizaishang-iciyuandongman'}
{'name': '{{name}}', 'url': 'https://www.baozimh.com/comic/{{comic_id}}'}
{'name': '我是大神仙',
 'url': 'https://www.baozimh.comhttps://www.baozimh.com/comic/woshidashenxian-shengshiqiaman'}
{'name': '戒魔人',
 'url': 'https://www.baozimh.comhttps://www.baozimh.com/comic/jiemoren-zhangsanfeng'}
{'name': '修羅武神',
 'url': 'https://www.baozimh.comhttps://www.baozimh.com/comic/xiuluowushen-dianguangsheshanliangdemifengpikapi'}
# bzmh.py
class BzmhSpider(scrapy.Spider):
    # ...everything else unchanged
    def parse_scroll(self, response):
        items = FirstpcItem()
        soup = BeautifulSoup(response.text, "html.parser")
        paragraphs = soup.find_all("a", attrs={"href": True, "title": True})
        for paragraph in paragraphs:
            items['name'] = paragraph['title']
            items['url'] = "https://www.baozimh.com" + paragraph['href']
            yield items

    # changed to:
    def parse_scroll(self, response):
        items = FirstpcItem()
        soup = BeautifulSoup(response.text, "html.parser")
        paragraphs = soup.find_all("a", attrs={"href": True, "title": True})
        for paragraph in paragraphs:
            href = paragraph['href']
            # skip template placeholders such as {{comic_id}}
            if '{{' in href or '}}' in href:
                continue

            # some hrefs are already absolute; only prepend the domain when needed
            if href.startswith('https://'):
                url = href
            else:
                url = "https://www.baozimh.com" + href

            items['name'] = paragraph['title']
            items['url'] = url
            yield items
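An alternative that covers both cases without the manual startswith check is urllib.parse.urljoin, which leaves absolute URLs alone and resolves relative ones against the base; a small sketch of the same filtering with it:

from urllib.parse import urljoin

class BzmhSpider(scrapy.Spider):
    # ...everything else unchanged
    def parse_scroll(self, response):
        soup = BeautifulSoup(response.text, "html.parser")
        for paragraph in soup.find_all("a", attrs={"href": True, "title": True}):
            href = paragraph['href']
            if '{{' in href or '}}' in href:
                continue  # skip template placeholders

            items = FirstpcItem()
            items['name'] = paragraph['title']
            # urljoin keeps absolute hrefs as-is and prefixes relative ones with the domain
            items['url'] = urljoin("https://www.baozimh.com", href)
            yield items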