In the previous post I kicked off the scrapy series, but the target page loads its data through infinite scrolling rather than pagination: you have to scroll to the bottom before the data appears, which left me stuck.
Two ways around it came to mind:
- Open the browser's F12 DevTools, trace the site's backend api, work out the pattern of the requests, and rebuild them locally.
- Drive the page with selenium and grab the rendered data.
The backend api is this one:
So that option is out, because I have no idea how to construct the request. Settle for second best and use selenium.
selenium
Selenium is an umbrella project that provides a range of tools and libraries for automating web browsers. It is just a tool, though: you also need chrome and chrome_driver, that is, the browser plus the driver that matches it.
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/linux64/chrome-linux64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-arm64/chrome-mac-arm64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-x64/chrome-mac-x64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win32/chrome-win32.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win64/chrome-win64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/linux64/chromedriver-linux64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-arm64/chromedriver-mac-arm64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/mac-x64/chromedriver-mac-x64.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win32/chromedriver-win32.zip
https://storage.googleapis.com/chrome-for-testing-public/118.0.5962.0/win64/chromedriver-win64.zip
Some people will say that webdriver_manager works too. Well, go ahead and use it then…
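For completeness, that route would look roughly like this; a sketch only, and the rest of this post sticks with fixed local paths:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# webdriver_manager downloads and caches a chromedriver matching the installed Chrome,
# so no hard-coded driver path is needed
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service)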
I put the browser at E:\selenium\chrome\Application\chrome.exe and the driver at E:\selenium\driver\chromedriver.exe; both paths will be needed later.
Configuration
When initializing the Service, the driver path has to be specified:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from scrapy.http import HtmlResponse
driver_path = r"E:\selenium\driver\chromedriver.exe"
service = Service(executable_path=driver_path)
Then configure a few options, such as the chrome binary path:
driver_path = r"E:\selenium\driver\chromedriver.exe"
chrome_path = r'E:\selenium\chrome\Application\chrome.exe'
profile_path = r"E:\selenium\data\profiles"
cache_path = r"E:\selenium\data\caches"
options = webdriver.ChromeOptions()
options.binary_location = chrome_path
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--headless')  # headless mode
options.add_argument('--disable-gpu')
options.add_argument(f'--user-data-dir={profile_path}')
options.add_argument(f'--disk-cache-dir={cache_path}')
- options.binary_location: explicitly sets the path to the Chrome executable.
- options.add_argument('--no-sandbox'): disables Chrome's sandbox security mechanism. When Chrome runs as root on Linux (common inside Docker containers), the sandbox causes permission conflicts and the browser refuses to start.
- options.add_argument('--disable-dev-shm-usage'): /dev/shm on Linux defaults to only 64MB, and Chrome crashes when it needs more shared memory than that.
- options.add_argument('--headless'): enables headless mode so no browser window is shown. Watching the window pop up is annoying, so keep it hidden.
- options.add_argument('--disable-gpu'): GPU acceleration is pointless in headless mode and can cause problems.
- options.add_argument(f'--user-data-dir={profile_path}'): where user data such as the profile, cache, cookies, and local storage is kept.
- options.add_argument(f'--disk-cache-dir={cache_path}'): where page resources such as images, CSS, and JavaScript files are cached.
Creating a browser instance
Create and start a Chrome browser instance:
service = Service(executable_path=driver_path)
driver = None
try:
    driver = webdriver.Chrome(service=service, options=options)
    driver.get('https://www.hao123.com/')
    print(driver.title)
except Exception as e:
    print("Error details:", e)
finally:
    if driver:
        driver.quit()
- Remember to quit at the end. If a previous run did not shut down cleanly, the next one fails with this error:
Error details: Message: session not created: Chrome failed to start: crashed.
(chrome not reachable)
(The process started from chrome location E:\selenium\chrome\Application\chrome.exe is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
This bug is a real pain; it cost me a whole evening. All you have to do is kill the leftover chrome processes.
- On Windows, kill them with:
taskkill /f /im chrome.exe
- On Linux, use:
pkill chrome
Run it and give it a try.
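If you would rather automate that cleanup before launching the driver, a minimal sketch can wrap the same two commands (the kill_stale_chrome helper name is mine, not something from the project):
import platform
import subprocess

def kill_stale_chrome():
    # best-effort: quietly kill any chrome processes left over from a previous run
    if platform.system() == "Windows":
        subprocess.run(["taskkill", "/f", "/im", "chrome.exe"], capture_output=True)
    else:
        subprocess.run(["pkill", "chrome"], capture_output=True)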
Embedding it in scrapy
When scrapy runs, it first calls the start_requests method of the BzmhSpider class; inside that method we can decide which approach each request should use.
from typing import Iterable
import scrapy
from bs4 import BeautifulSoup
from scrapy import Request
from ..items import FirstpcItem
class BzmhSpider(scrapy.Spider):
    name = "bzmh"
    allowed_domains = ["www.baozimh.com"]
    start_urls = ["https://www.baozimh.com/classify"]

    def start_requests(self) -> Iterable[Request]:
        if '/api/' in self.start_urls[0]:
            yield scrapy.Request(self.start_urls[0], callback=self.parse_normal)
        else:
            yield scrapy.Request(self.start_urls[0], callback=self.parse_scroll, meta={'need_scroll': True})
Note the callback argument, i.e. the callback function: it decides who receives the response once it arrives. So we add two more methods to BzmhSpider, one called parse_normal and the other parse_scroll. The meta={'need_scroll': True} flag tells the downloader middleware that this request needs scrolling (in other words, that selenium should be used).
# bzmh.py
class BzmhSpider(scrapy.Spider):
    ...

    def parse_scroll(self, response):
        items = FirstpcItem()
        soup = BeautifulSoup(response.text, "html.parser")
        paragraphs = soup.find_all("a", attrs={"href": True, "title": True})
        for paragraph in paragraphs:
            items['name'] = paragraph['title']
            items['url'] = "https://www.baozimh.com" + paragraph['href']
            yield items

    def parse_normal(self, response):
        print(response.status)
BeautifulSoup is used here because it is easier to follow. I'm not used to xpath (and I'm definitely not telling you it's because I'm lazy).
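For comparison, the same anchors can be pulled out with Scrapy's built-in selectors; this is only a sketch of the equivalent, not what the project actually uses:
def parse_scroll(self, response):
    # //a[@href and @title] matches the same anchors that BeautifulSoup finds above
    for a in response.xpath('//a[@href and @title]'):
        items = FirstpcItem()
        items['name'] = a.attrib['title']
        items['url'] = "https://www.baozimh.com" + a.attrib['href']
        yield items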
# settings.py
# Point these at the local paths; they will be read from the settings later
CHROME_DRIVER_PATH = r"E:\selenium\driver\chromedriver.exe"
CHROME_BINARY_PATH = r'E:\selenium\chrome\Application\chrome.exe'
# Uncomment this block
DOWNLOADER_MIDDLEWARES = {
    "firstpc.middlewares.FirstpcDownloaderMiddleware": 543,
}
Then it is time to modify the middleware file, middlewares.py.
import os
import gc
import time
from pathlib import Path
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from scrapy.utils.project import get_project_settings
class FirstpcSpiderMiddleware:
    ...

# Leave the class above as generated; only the downloader middleware below changes
class FirstpcDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self):
        # Only store the driver path here; the browser is not started yet
        os.environ["WDM_LOCAL"] = r"E:\selenium\driver"
        Path(os.environ["WDM_LOCAL"]).mkdir(parents=True, exist_ok=True)
        self.base_path = Path(r"E:\selenium")
        self.settings = get_project_settings()
        self.driver_path = self.settings.get("CHROME_DRIVER_PATH")
        self.driver = None  # browser instance starts out empty
        self._create_directories()

    def _create_directories(self):
        """Create the storage directory structure automatically."""
        dirs = [
            self.base_path / "driver",
            self.base_path / "data",
            self.base_path / "caches",
            self.base_path / "downloads"
        ]
        for dir_path in dirs:
            dir_path.mkdir(parents=True, exist_ok=True)
    def process_request(self, request, spider):
        if '/api/' in request.url:
            return self.handle_api_request(request)
        elif request.meta.get('need_scroll'):
            try:
                # Create a fresh browser instance for every request
                options = webdriver.ChromeOptions()
                options.binary_location = self.settings.get("CHROME_BINARY_PATH")
                options.add_argument('--no-sandbox')
                options.add_argument('--disable-dev-shm-usage')
                options.add_argument('--headless')  # headless mode
                options.add_argument('--disable-gpu')  # may be needed for headless mode on Windows
                # User data and cache locations
                profile_path = r"E:\selenium\data\profiles"
                cache_path = r"E:\selenium\data\caches"
                options.add_argument(f'--user-data-dir={profile_path}')
                options.add_argument(f'--disk-cache-dir={cache_path}')
                service = Service(executable_path=self.driver_path)
                self.driver = webdriver.Chrome(
                    service=service,
                    options=options
                )
                self.driver.get(request.url)
                self._auto_scroll()
                return HtmlResponse(
                    url=request.url,
                    body=self.driver.page_source.encode('utf-8'),
                    encoding='utf-8',
                    request=request
                )
            finally:
                # Make sure the browser gets closed
                if self.driver:
                    self.driver.quit()
                    self.driver = None  # reset the instance
                    gc.collect()  # garbage collection
        else:
            return None

    # ...other methods
    def _auto_scroll(self):
        """Core scrolling logic."""
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        request_times = 0
        while request_times < 3:
            request_times += 1
            # Scroll to the bottom
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)  # wait for new content to load
            # Read the new page height
            new_height = self.driver.execute_script("return document.body.scrollHeight")
            # Stop once the height no longer grows
            if new_height == last_height:
                break
            last_height = new_height

    def handle_api_request(self, request):
        # Handle api requests here
        return None
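With the spider, the settings, and the middleware in place, the crawl can be started from the project directory with scrapy crawl bzmh; adding -o result.json writes the collected items to a file.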
debug
After a few runs you will notice that the scrolled result differs from the first request: entries like {{name}} show up, along with doubled URLs such as https://www.baozimh.comhttps://www.baozimh.com. Time to fix that (mostly to make the output look nicer; it doesn't change much otherwise).
{'name': '萬渣朝凰',
'url': 'https://www.baozimh.com/comic/mozhazhaohuang-shidaimanwang'}
{'name': '總裁在上',
'url': 'https://www.baozimh.com/comic/zongcaizaishang-iciyuandongman'}
{'name': '{{name}}', 'url': 'https://www.baozimh.com/comic/{{comic_id}}'}
{'name': '我是大神仙',
'url': 'https://www.baozimh.comhttps://www.baozimh.com/comic/woshidashenxian-shengshiqiaman'}
{'name': '戒魔人',
'url': 'https://www.baozimh.comhttps://www.baozimh.com/comic/jiemoren-zhangsanfeng'}
{'name': '修羅武神',
'url': 'https://www.baozimh.comhttps://www.baozimh.com/comic/xiuluowushen-dianguangsheshanliangdemifengpikapi'}
# bzmh.py
class BzmhSpider(scrapy.Spider):
    # ...everything else unchanged

    def parse_scroll(self, response):
        items = FirstpcItem()
        soup = BeautifulSoup(response.text, "html.parser")
        paragraphs = soup.find_all("a", attrs={"href": True, "title": True})
        for paragraph in paragraphs:
            items['name'] = paragraph['title']
            items['url'] = "https://www.baozimh.com" + paragraph['href']
            yield items
# change it to

    def parse_scroll(self, response):
        items = FirstpcItem()
        soup = BeautifulSoup(response.text, "html.parser")
        paragraphs = soup.find_all("a", attrs={"href": True, "title": True})
        for paragraph in paragraphs:
            href = paragraph['href']
            # skip un-rendered template placeholders such as {{comic_id}}
            if '{{' in href or '}}' in href:
                continue
            # some hrefs are already absolute, so only prepend the domain when needed
            if href.startswith('https://'):
                url = href
            else:
                url = "https://www.baozimh.com" + href
            items['name'] = paragraph['title']
            items['url'] = url
            yield items