# Scraping the Douban Top 250 Movie List with Selenium and the Edge Browser
Below is a complete solution that uses Selenium with Microsoft Edge to scrape the Douban Top 250 movie list:
```python
import time
import csv
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from bs4 import BeautifulSoup
def get_edge_driver_path():
    """Locate the Edge WebDriver executable."""
    # Common locations on Windows
    windows_paths = [
        r"C:\Program Files (x86)\Microsoft\Edge\Application\msedgedriver.exe",
        r"C:\Program Files\Microsoft\Edge\Application\msedgedriver.exe",
        r"C:\Windows\System32\msedgedriver.exe"
    ]
    # Common locations on macOS/Linux
    other_paths = [
        "/usr/local/bin/msedgedriver",
        "/usr/bin/msedgedriver"
    ]
    # Return the first path that exists
    for path in windows_paths + other_paths:
        if os.path.exists(path):
            return path
    # Fall back to webdriver-manager if no local driver was found
    try:
        from webdriver_manager.microsoft import EdgeChromiumDriverManager
        return EdgeChromiumDriverManager().install()
    except Exception:
        raise Exception("Edge WebDriver not found; please download it manually and set the path")
def init_edge_driver():
    """Initialize the Edge browser driver."""
    # Configure browser options
    options = Options()
    options.add_argument("--start-maximized")
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--disable-infobars")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)
    # Set a realistic user agent
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.203"
    options.add_argument(f"user-agent={user_agent}")
    # Locate the driver
    driver_path = get_edge_driver_path()
    service = Service(executable_path=driver_path)
    # Start the browser
    driver = webdriver.Edge(service=service, options=options)
    # Hide the navigator.webdriver flag that betrays Selenium
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    return driver
def scrape_movie_details(driver, movie_url):
    """Scrape the detail page of a single movie."""
    driver.get(movie_url)
    # Wait for the page to load
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h1 span"))
        )
    except TimeoutException:
        print(f"Page load timed out: {movie_url}")
        return None
    # Parse the page with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # Extract the movie fields
    movie_info = {}
    try:
        # Title
        movie_info['title'] = soup.select_one("h1 span").text.strip()
    except AttributeError:
        movie_info['title'] = "unknown"
    try:
        # Year
        movie_info['year'] = soup.select_one("h1 .year").text.strip('()')
    except AttributeError:
        movie_info['year'] = "unknown"
    try:
        # Directors
        directors = [a.text for a in soup.select('a[rel="v:directedBy"]')]
        movie_info['directors'] = " / ".join(directors)
    except AttributeError:
        movie_info['directors'] = "unknown"
    try:
        # Writers: find the "编剧" (writers) label, then the adjacent span of links
        writers = []
        writer_span = soup.find('span', string='编剧')
        if writer_span:
            writers = [a.text for a in writer_span.find_next_sibling('span').find_all('a')]
        movie_info['writers'] = " / ".join(writers)
    except AttributeError:
        movie_info['writers'] = "unknown"
    try:
        # Leading cast
        actors = [a.text for a in soup.select('a[rel="v:starring"]')]
        movie_info['actors'] = " / ".join(actors[:3])  # keep only the first three
    except AttributeError:
        movie_info['actors'] = "unknown"
    try:
        # Genres
        genres = [a.text for a in soup.select('span[property="v:genre"]')]
        movie_info['genres'] = " / ".join(genres)
    except AttributeError:
        movie_info['genres'] = "unknown"
    try:
        # Rating
        movie_info['rating'] = soup.select_one('strong[property="v:average"]').text
    except AttributeError:
        movie_info['rating'] = "0"
    try:
        # Number of ratings
        movie_info['votes'] = soup.select_one('span[property="v:votes"]').text
    except AttributeError:
        movie_info['votes'] = "0"
    try:
        # Runtime
        movie_info['duration'] = soup.select_one('span[property="v:runtime"]').text
    except AttributeError:
        movie_info['duration'] = "unknown"
    try:
        # Release date(s); keep the first
        release_dates = [span.text for span in soup.select('span[property="v:initialReleaseDate"]')]
        movie_info['release_date'] = release_dates[0] if release_dates else "unknown"
    except AttributeError:
        movie_info['release_date'] = "unknown"
    try:
        # Summary: prefer the expanded version if present
        summary_span = soup.find('span', class_='all', property='v:summary')
        if not summary_span:
            summary_span = soup.find('span', property='v:summary')
        movie_info['summary'] = summary_span.text.strip() if summary_span else "no summary"
    except AttributeError:
        movie_info['summary'] = "no summary"
    # Pause between detail pages to avoid hammering the site
    time.sleep(1.5)
    return movie_info
def scrape_douban_top250(driver):
    """Scrape the Douban Top 250 movie list."""
    base_url = "https://movie.douban.com/top250"
    movies_data = []
    # Walk all 10 pages (25 movies per page)
    for page in range(0, 250, 25):
        url = f"{base_url}?start={page}"
        print(f"Scraping list page: {url}")
        driver.get(url)
        # Wait for the list to load
        try:
            WebDriverWait(driver, 15).until(
                EC.presence_of_element_located((By.CLASS_NAME, "grid_view"))
            )
        except TimeoutException:
            print(f"Page load timed out: {url}")
            continue
        # Snapshot the list page; parsing a static string keeps the items
        # valid while the driver navigates away to detail pages
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        movie_items = soup.select('.grid_view .item')
        # Process every movie on the page
        for item in movie_items:
            try:
                # Basic fields from the list entry
                rank = item.select_one('em').text
                title = item.select_one('.title').text
                detail_url = item.select_one('.hd a')['href']
                # Follow the link for the full details
                movie_details = scrape_movie_details(driver, detail_url)
                if movie_details:
                    movie_details['rank'] = rank
                    movies_data.append(movie_details)
                    print(f"Scraped: {rank} - {title}")
                else:
                    print(f"Skipped: {rank} - {title} (scrape failed)")
            except Exception as e:
                print(f"Error while processing a movie: {str(e)}")
                continue
        # Pause between list pages
        time.sleep(2.5)
    return movies_data
def save_to_csv(data, filename='douban_top250.csv'):
    """Save the scraped data to a CSV file."""
    if not data:
        print("No data to save")
        return
    # CSV columns
    fieldnames = [
        'rank', 'title', 'year', 'directors', 'writers',
        'actors', 'genres', 'rating', 'votes', 'duration',
        'release_date', 'summary'
    ]
    # utf-8-sig adds a BOM so Excel opens the file with the right encoding
    with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for movie in data:
            writer.writerow(movie)
    print(f"Saved {len(data)} movies to {filename}")
def main():
    """Entry point."""
    print("Douban Top 250 scraper starting...")
    # Start the browser
    driver = init_edge_driver()
    print("Browser initialized")
    try:
        # Scrape
        movies_data = scrape_douban_top250(driver)
        # Save
        save_to_csv(movies_data)
    except Exception as e:
        print(f"Error: {str(e)}")
    finally:
        # Shut down the browser
        input("Press Enter to close the browser...")
        driver.quit()
        print("Browser closed")

if __name__ == "__main__":
    main()
```
## Code Walkthrough
### 1. Edge Browser Initialization
```python
def init_edge_driver():
    """Initialize the Edge browser driver."""
    # Configure browser options
    options = Options()
    options.add_argument("--start-maximized")
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--disable-infobars")
    # ... other options ...
    # Set a realistic user agent
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ..."
    options.add_argument(f"user-agent={user_agent}")
    # Locate the driver
    driver_path = get_edge_driver_path()
    service = Service(executable_path=driver_path)
    # Start the browser
    driver = webdriver.Edge(service=service, options=options)
    # Hide the navigator.webdriver flag that betrays Selenium
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    return driver
```
This function initializes the Edge browser with several anti-detection measures:
- disables the automation-controlled Blink feature
- sets a realistic user agent
- hides the Selenium automation flag
- maximizes the browser window
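One caveat: the `execute_script` call above only patches the page that is currently loaded, so the flag reappears after every navigation. A variant worth considering (a sketch, assuming Selenium 4 with a Chromium-based Edge driver) registers the patch through the DevTools Protocol so it runs before any page script on every navigation:
```python
from selenium import webdriver
from selenium.webdriver.edge.options import Options

options = Options()
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Edge(options=options)

# execute_cdp_cmd is available on Chromium-based drivers in Selenium 4.
# The injected script is evaluated on every new document, before page scripts run.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)
```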
### 2. Scraping the Top 250 List Pages
```python
def scrape_douban_top250(driver):
    """Scrape the Douban Top 250 movie list."""
    base_url = "https://movie.douban.com/top250"
    movies_data = []
    # Walk all 10 pages (25 movies per page)
    for page in range(0, 250, 25):
        url = f"{base_url}?start={page}"
        driver.get(url)
        # Wait for the list to load
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.CLASS_NAME, "grid_view"))
        )
        # Parse the page
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        movie_items = soup.select('.grid_view .item')
        # Process every movie
        for item in movie_items:
            # Basic fields from the list entry
            rank = item.select_one('em').text
            title = item.select_one('.title').text
            detail_url = item.select_one('.hd a')['href']
            # Follow the link for the full details
            movie_details = scrape_movie_details(driver, detail_url)
            # ... store the data ...
        # Pause between list pages
        time.sleep(2.5)
    return movies_data
```
This function:
1. Iterates over all 10 pages of the Top 250 list (25 movies per page)
2. Uses an explicit wait to make sure each page has loaded
3. Parses the page content with BeautifulSoup
4. Extracts each movie's basic fields (rank, title, detail-page URL)
5. Pauses between pages to reduce the risk of an IP ban
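Because the list parsing works on `driver.page_source` as a plain string, the selectors can be checked against a hand-written snippet without launching a browser. A minimal sketch (the HTML below is a simplified imitation of the list markup described above, not a verbatim copy of Douban's page):
```python
from bs4 import BeautifulSoup

# Simplified stand-in for one entry of a Top 250 list page.
sample_html = """
<ol class="grid_view">
  <li><div class="item">
    <em>1</em>
    <div class="hd">
      <a href="https://movie.douban.com/subject/1292052/">
        <span class="title">肖申克的救赎</span>
      </a>
    </div>
  </div></li>
</ol>
"""

soup = BeautifulSoup(sample_html, "html.parser")
for item in soup.select(".grid_view .item"):
    rank = item.select_one("em").text
    title = item.select_one(".title").text
    detail_url = item.select_one(".hd a")["href"]
    print(rank, title, detail_url)
```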
### 3. Scraping Movie Detail Pages
```python
def scrape_movie_details(driver, movie_url):
    """Scrape the detail page of a single movie."""
    driver.get(movie_url)
    # Wait for the page to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h1 span"))
    )
    # Parse the page with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # Extract the movie fields
    movie_info = {}
    movie_info['title'] = soup.select_one("h1 span").text.strip()
    movie_info['year'] = soup.select_one("h1 .year").text.strip('()')
    # ... extract the remaining fields ...
    # Pause between detail pages
    time.sleep(1.5)
    return movie_info
```
This function scrapes the detail page of a single movie, including:
- title and year
- directors, writers, and leading cast
- genres, rating, and number of ratings
- runtime and release date
- plot summary
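The least obvious extraction is the writers field, which locates the text label `编剧` and then walks to the sibling `span` holding the links. A standalone illustration of that pattern (the markup is a rough imitation of the label/value pairs in the info block, not a verbatim copy of the page):
```python
from bs4 import BeautifulSoup

# Rough stand-in for one label/value pair in the movie-info block.
info_html = """
<div id="info">
  <span><span class="pl">编剧</span>:
    <span class="attrs"><a href="#">Writer A</a> / <a href="#">Writer B</a></span>
  </span>
</div>
"""

soup = BeautifulSoup(info_html, "html.parser")
label = soup.find("span", string="编剧")   # the label span
value = label.find_next_sibling("span")    # the adjacent span of <a> links
writers = [a.text for a in value.find_all("a")]
print(" / ".join(writers))  # Writer A / Writer B
```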
### 4. Saving the Data
```python
def save_to_csv(data, filename='douban_top250.csv'):
    """Save the scraped data to a CSV file."""
    # CSV columns
    fieldnames = [
        'rank', 'title', 'year', 'directors', 'writers',
        'actors', 'genres', 'rating', 'votes', 'duration',
        'release_date', 'summary'
    ]
    # utf-8-sig adds a BOM so Excel opens the file with the right encoding
    with open(filename, 'w', newline='', encoding='utf-8-sig') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for movie in data:
            writer.writerow(movie)
```
This function writes the scraped data to a CSV file with the following columns:
- `rank`
- `title`
- `year`
- `directors`
- `writers`
- `actors` (first three leads)
- `genres`
- `rating`
- `votes` (number of ratings)
- `duration` (runtime)
- `release_date`
- `summary`
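A quick way to sanity-check the output after a run, assuming pandas is installed (it is not a dependency of the scraper itself):
```python
import pandas as pd

df = pd.read_csv("douban_top250.csv")
print(df.shape)  # a full run should yield (250, 12)
print(df[["rank", "title", "rating"]].head())
```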
## Anti-Scraping Countermeasures
1. **Inter-request delays** (fixed intervals in this script; a jittered variant is sketched below):
   - 2.5 s between list pages
   - 1.5 s between detail pages
   - keeps the request rate low enough to avoid tripping Douban's anti-scraping checks
2. **Imitating a real user**:
   - realistic user agent
   - maximized browser window
   - hidden automation flags
3. **Robust error handling**:
   - try/except around every extraction
   - timeouts handled explicitly
   - missing elements handled gracefully
4. **Explicit waits**:
```python
WebDriverWait(driver, 15).until(
EC.presence_of_element_located((By.CLASS_NAME, "grid_view"))
)
```
This makes sure the target element has loaded before parsing begins.
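Note that the script as written uses fixed delays. To get the jittered timing mentioned in point 1, a minimal sketch (`polite_sleep` is a hypothetical helper, not part of the script above):
```python
import random
import time

def polite_sleep(low: float, high: float) -> None:
    """Hypothetical helper: sleep a random duration in [low, high] seconds
    so request timing looks less mechanical than a fixed interval."""
    time.sleep(random.uniform(low, high))

# e.g. replace time.sleep(2.5) between list pages with:
polite_sleep(2.0, 4.0)
# and time.sleep(1.5) between detail pages with:
polite_sleep(1.0, 2.5)
```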
## How to Run
1. Install the dependencies:
```bash
pip install selenium beautifulsoup4 webdriver-manager
```
2. Make sure Microsoft Edge is installed
3. If you run into driver problems, download a matching Edge WebDriver manually (a path-override sketch follows this list):
   - visit https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
   - download the driver version that matches your browser
   - place it on the system PATH, or edit the path in the code
4. Run the script:
```bash
python douban_top250.py
```
5. When scraping finishes, the data is saved to `douban_top250.csv`
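If you download the driver manually, you can skip the path-probing helper and point the `Service` at the file directly; a minimal sketch (the path below is a placeholder for wherever you saved `msedgedriver`):
```python
from selenium import webdriver
from selenium.webdriver.edge.service import Service

# Placeholder path: substitute the location of your downloaded driver.
service = Service(executable_path=r"C:\tools\msedgedriver.exe")
driver = webdriver.Edge(service=service)
print(driver.capabilities["browserVersion"])  # confirm the browser/driver pairing works
driver.quit()
```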
## Sample Output
The CSV file will contain data in the following format:
| rank | title | year | directors | ... | summary |
|------|-------|------|-----------|-----|---------|
| 1 | 肖申克的救赎 | 1994 | 弗兰克·德拉邦特 | ... | 希望让人自由... |
| 2 | 霸王别姬 | 1993 | 陈凯歌 | ... | 风华绝代... |
| ... | ... | ... | ... | ... | ... |