Scrapy: scraping data from an infinite-scroll page

In the previous post I kicked off the Scrapy series, but this time the target page loads its data as you scroll: there is no pagination, and you have to scroll all the way to the bottom before everything shows up, which is a real pain. There are two options:

  • Open the browser's F12 dev tools to trace the site's backend API, work out its pattern, and rebuild the requests locally.
  • Drive the page with selenium and grab the rendered data.

The backend API looks like this:
(screenshot of the backend API request omitted)
So that route is out, because I have no idea how to construct those requests! Falling back to selenium instead.

selenium

Selenium is a comprehensive project that provides a range of tools and libraries for automating web browsers. It is a tool! Besides the selenium package itself (pip install selenium), you also need Chrome and the matching chromedriver, i.e. the browser and its driver:

https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/linux64/chrome-linux64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-arm64/chrome-mac-arm64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-x64/chrome-mac-x64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win32/chrome-win32.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win64/chrome-win64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/linux64/chromedriver-linux64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-arm64/chromedriver-mac-arm64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-x64/chromedriver-mac-x64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win32/chromedriver-win32.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win64/chromedriver-win64.zip

Some people will say webdriver_manager works just as well. Then go ahead and use it…

I put the browser at E:\selenium\chrome\Application\chrome.exe and the driver at E:\selenium\driver\chromedriver.exe; both of these paths are used later.
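Before wiring anything up, a quick sanity check that selenium is installed and both paths actually exist can save some head-scratching later; a minimal sketch (the paths are just my local layout):

import os
import selenium

driver_path = r"E:\selenium\driver\chromedriver.exe"
chrome_path = r"E:\selenium\chrome\Application\chrome.exe"

print("selenium version:", selenium.__version__)        # e.g. 4.x
print("chromedriver found:", os.path.exists(driver_path))
print("chrome found:", os.path.exists(chrome_path))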

Configuration

When initializing the Service, you need to point it at the driver path:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from scrapy.http import HtmlResponse
driver_path = r"E:\selenium\driver\chromedriver.exe"

service = Service(executable_path=driver_path)

Then configure some options, such as the Chrome binary path:

driver_path = r"E:\selenium\driver\chromedriver.exe"
chrome_path = r'E:\selenium\chrome\Application\chrome.exe'
profile_path = r"E:\selenium\data\profiles"
cache_path = r"E:\selenium\data\caches"
options = webdriver.ChromeOptions()
options.binary_location = chrome_path
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--headless')  # headless mode
options.add_argument('--disable-gpu')  
options.add_argument(f'--user-data-dir={profile_path}')
options.add_argument(f'--disk-cache-dir={cache_path}')
  • options.binary_location: explicitly points to the Chrome executable.
  • options.add_argument('--no-sandbox'): disables Chrome's sandbox security mechanism. When running as root on Linux the sandbox causes permission conflicts and the browser refuses to start; this is common inside Docker containers.
  • options.add_argument('--disable-dev-shm-usage'): /dev/shm on Linux defaults to 64MB, and Chrome crashes when it needs more shared memory than that.
  • options.add_argument('--headless'): enables headless mode, i.e. no browser window. The window is annoying, so hide it.
  • options.add_argument('--disable-gpu'): GPU acceleration is pointless in headless mode and can cause problems.
  • options.add_argument(f'--user-data-dir={profile_path}'): where user data such as profiles, cache, cookies, and local storage is kept.
  • options.add_argument(f'--disk-cache-dir={cache_path}'): where page resources such as images, CSS, and JavaScript files are cached.

Creating a browser instance

Create and launch a Chrome browser instance:

service = Service(executable_path=driver_path)
driver = None
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.get('https://www.hao123.com/')
    print(driver.title)
except Exception as e:
    print("错误详情:", e)
finally:
    if driver:
        driver.quit()
  • Always remember to quit() at the end; otherwise the next run fails because the previous Chrome wasn't shut down cleanly, and you get an error like this:
错误详情: Message: session not created: Chrome failed to start: crashed.
  (chrome not reachable)
  (The process started from chrome location E:\selenium\chrome\Application\chrome.exe is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

This bug is really annoying and cost me a whole evening; the fix is simply to kill the leftover Chrome processes:

  • On Windows, use: taskkill /f /im chrome.exe
  • On Linux, use: pkill chrome

Give it a run and see~~~
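If you'd rather do the cleanup from inside the script, here is a minimal sketch using the standard library (the process name chrome.exe / chrome is an assumption that matches a default install):

import platform
import subprocess

def kill_leftover_chrome():
    """Best-effort cleanup of stray Chrome processes before launching a new driver."""
    if platform.system() == "Windows":
        # /f = force kill, /im = match by image name
        subprocess.run(["taskkill", "/f", "/im", "chrome.exe"], capture_output=True)
    else:
        # pkill matches by process name; a non-zero exit just means nothing was running
        subprocess.run(["pkill", "chrome"], capture_output=True)

kill_leftover_chrome()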

Embedding it in Scrapy

When Scrapy runs, it first calls start_requests on the BzmhSpider class; in this method we can decide which callback to use.

from typing import Iterable
import scrapy
from bs4 import BeautifulSoup
from scrapy import Request

from ..items import FirstpcItem

class BzmhSpider(scrapy.Spider):
    name = "bzmh"
    allowed_domains = ["www.baozimh.com"]
    start_urls = ["https://www.baozimh.com/classify"]

    def start_requests(self) -> Iterable[Request]:
        if '/api/' in self.start_urls[0]:
            yield scrapy.Request(self.start_urls[0], callback=self.parse_normal)
        else:
            yield scrapy.Request(self.start_urls[0], callback=self.parse_scroll, meta={'need_scroll':True})

The callback here is the function that the response gets handed to once it arrives, so we add two more methods to BzmhSpider: parse_normal and parse_scroll. The meta={'need_scroll': True} part tells the downloader middleware that this request needs scrolling (in other words, that selenium should be used).

# bzmh.py
class BzmhSpider(scrapy.Spider):
    ...

    def parse_scroll(self, response):
        items = FirstpcItem()
        soup = BeautifulSoup(response.text, "html.parser")
        paragraphs = soup.find_all("a", attrs={"href": True, "title": True})
        for paragraph in paragraphs:
            items['name'] = paragraph['title']
            items['url'] = "https://www.baozimh.com" + paragraph['href']
            yield items

    def parse_normal(self, response):
        print(response.status)  # Scrapy responses expose .status, not .status_code
        pass

I'm using BeautifulSoup here because it's easier to read. I never got used to XPath (and I won't admit that I'm just lazy).
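For anyone who does prefer Scrapy's built-in selectors, the same extraction can be written without BeautifulSoup; a rough equivalent, just as a sketch:

class BzmhSpider(scrapy.Spider):
    # ...everything else unchanged
    def parse_scroll(self, response):
        # same idea, using Scrapy's CSS selectors instead of BeautifulSoup
        for a in response.css('a[href][title]'):
            items = FirstpcItem()
            items['name'] = a.attrib['title']
            items['url'] = "https://www.baozimh.com" + a.attrib['href']
            yield items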

# settings.py
# Define the paths here in the settings file so they can be read from settings later
CHROME_DRIVER_PATH = r"E:\selenium\driver\chromedriver.exe"
CHROME_BINARY_PATH = r'E:\selenium\chrome\Application\chrome.exe'
# Uncomment this block
DOWNLOADER_MIDDLEWARES = {
    "firstpc.middlewares.FirstpcDownloaderMiddleware": 543,
}
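Side note: instead of calling get_project_settings() inside the middleware (which is what I do below), Scrapy can also hand the settings to the middleware through from_crawler. A minimal sketch of that pattern, purely as an alternative:

# middlewares.py (alternative constructor, not the version used below)
class FirstpcDownloaderMiddleware:
    def __init__(self, driver_path, binary_path):
        self.driver_path = driver_path
        self.binary_path = binary_path

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes the same CHROME_* values defined in settings.py
        return cls(
            driver_path=crawler.settings.get("CHROME_DRIVER_PATH"),
            binary_path=crawler.settings.get("CHROME_BINARY_PATH"),
        )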

Then comes the middleware file, middlewares.py:

import os
import gc
import time
from pathlib import Path
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from scrapy.utils.project import get_project_settings

class FirstpcSpiderMiddleware:
    ...

# Leave the class above untouched; only the one below changes
class FirstpcDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self):
        # Only store the driver path here; do not start the browser yet
        os.environ["WDM_LOCAL"] = r"E:\selenium\driver"
        Path(os.environ["WDM_LOCAL"]).mkdir(parents=True, exist_ok=True)
        self.base_path = Path(r"E:\selenium")
        self.settings = get_project_settings()
        self.driver_path = self.settings.get("CHROME_DRIVER_PATH")
        self.driver = None  # browser instance starts out empty
        self._create_directories()
    def _create_directories(self):
        """Create the storage directory layout automatically"""
        dirs = [
            self.base_path / "driver",
            self.base_path / "data",
            self.base_path / "caches",
            self.base_path / "downloads"
        ]

        for dir_path in dirs:
            dir_path.mkdir(parents=True, exist_ok=True)

    def process_request(self, request, spider):

        if '/api/' in request.url:
            return self.handle_api_request(request)
        elif request.meta.get('need_scroll'):
            try:
                # Create a fresh browser instance for every request
                options = webdriver.ChromeOptions()
                options.binary_location = self.settings.get("CHROME_BINARY_PATH")
                options.add_argument('--no-sandbox')
                options.add_argument('--disable-dev-shm-usage')
                options.add_argument('--headless')  # headless mode
                options.add_argument('--disable-gpu')  # may be needed for headless mode on Windows
                # User data storage paths
                profile_path = r"E:\selenium\data\profiles"
                cache_path = r"E:\selenium\data\caches"
                options.add_argument(f'--user-data-dir={profile_path}')
                options.add_argument(f'--disk-cache-dir={cache_path}')

                service = Service(executable_path=self.driver_path)
                self.driver = webdriver.Chrome(
                    service=service,
                    options=options
                )

                self.driver.get(request.url)
                self._auto_scroll()
                return HtmlResponse(
                    url=request.url,
                    body=self.driver.page_source.encode('utf-8'),
                    encoding='utf-8',  # matches the .encode('utf-8') above
                    request=request
                )
            finally:
                # Always close the browser
                if self.driver:
                    self.driver.quit()
                self.driver = None  # reset the instance
                gc.collect()  # garbage collection
        else:
            return None
    # ...other methods
    def _auto_scroll(self):
        """Core auto-scroll logic"""
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        request_times = 0

        while request_times < 3:
            request_times += 1
            # Scroll to the bottom
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)  # wait for new content to load
            # Get the new height
            new_height = self.driver.execute_script("return document.body.scrollHeight")
            # Stop once the height no longer changes
            if new_height == last_height:
                break
            last_height = new_height

    def handle_api_request(self, request):
        # handle API requests
        return None
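The fixed time.sleep(2) works, but it burns two seconds per round even when the new content arrives faster. As a variation, selenium's explicit waits can poll until the page height actually grows; a rough sketch of that idea (not what I'm running above, just an alternative):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

def _auto_scroll(self, max_rounds=3, timeout=10):
    """Scroll until the page height stops growing, using an explicit wait instead of sleep."""
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            # poll (up to `timeout` seconds) until the scroll height exceeds the previous value
            WebDriverWait(self.driver, timeout).until(
                lambda d: d.execute_script("return document.body.scrollHeight") > last_height
            )
        except TimeoutException:
            break  # height never grew, assume we reached the bottom
        last_height = self.driver.execute_script("return document.body.scrollHeight")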

debug

After several runs, I noticed the post-scroll results differ from the first request: entries like {{name}} show up, and some URLs come out as https://www.baozimh.comhttps://www.baozimh.com. Time to tweak the parser (mostly cosmetic, nothing critical):

{'name': '萬渣朝凰',
 'url': 'https://www.baozimh.com/comic/mozhazhaohuang-shidaimanwang'}
{'name': '總裁在上',
 'url': 'https://www.baozimh.com/comic/zongcaizaishang-iciyuandongman'}
{'name': '{{name}}', 'url': 'https://www.baozimh.com/comic/{{comic_id}}'}
{'name': '我是大神仙',
 'url': 'https://www.baozimh.comhttps://www.baozimh.com/comic/woshidashenxian-shengshiqiaman'}
{'name': '戒魔人',
 'url': 'https://www.baozimh.comhttps://www.baozimh.com/comic/jiemoren-zhangsanfeng'}
{'name': '修羅武神',
 'url': 'https://www.baozimh.comhttps://www.baozimh.com/comic/xiuluowushen-dianguangsheshanliangdemifengpikapi'}
# bzmh.py
class BzmhSpider(scrapy.Spider):
    # ...everything else unchanged
    def parse_scroll(self, response):
        items = FirstpcItem()
        soup = BeautifulSoup(response.text, "html.parser")
        paragraphs = soup.find_all("a", attrs={"href": True, "title": True})
        for paragraph in paragraphs:
            items['name'] = paragraph['title']
            items['url'] = "https://www.baozimh.com" + paragraph['href']
            yield items

    # changed to:
    def parse_scroll(self, response):
        items = FirstpcItem()
        soup = BeautifulSoup(response.text, "html.parser")
        paragraphs = soup.find_all("a", attrs={"href": True, "title": True})
        for paragraph in paragraphs:
            href = paragraph['href']
            # skip template placeholders such as {{comic_id}}
            if '{{' in href or '}}' in href:
                continue

            # some hrefs are already absolute; only prepend the domain when needed
            if href.startswith('https://'):
                url = href
            else:
                url = "https://www.baozimh.com" + href

            items['name'] = paragraph['title']
            items['url'] = url
            yield items
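An alternative that covers both cases without the manual startswith check is urllib.parse.urljoin, which leaves absolute URLs alone and resolves relative ones against the base; a small sketch of the same filtering with it:

from urllib.parse import urljoin

class BzmhSpider(scrapy.Spider):
    # ...everything else unchanged
    def parse_scroll(self, response):
        soup = BeautifulSoup(response.text, "html.parser")
        for paragraph in soup.find_all("a", attrs={"href": True, "title": True}):
            href = paragraph['href']
            if '{{' in href or '}}' in href:
                continue  # skip template placeholders

            items = FirstpcItem()
            items['name'] = paragraph['title']
            # urljoin keeps absolute hrefs as-is and prefixes relative ones with the domain
            items['url'] = urljoin("https://www.baozimh.com", href)
            yield items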