Python Web Scraping for Beginners: A Step-by-Step Guide to Scraping the Douban Movie Top 250

Latest recommended article published on 2025-12-09 16:47:37

License: CC 4.0 BY-SA

Tags:

#python #web-scraping #programming-languages #other

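I. Environment Setup (a must-read for beginners!)

Install Python 3.6+ (the Anaconda distribution is a convenient option; the f-strings used later need at least 3.6)
Install the required libraries (run these in a terminal):

pip install requests
pip install beautifulsoup4

(Don't underestimate these two libraries: they are the legendary sword and saber of the scraping world!)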
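A quick way to confirm that both installs worked is to import the packages and print their versions. This is only a small sanity check, not part of the scraper itself; note that the package installed as beautifulsoup4 is imported as bs4.

```python
import bs4
import requests

# Both packages expose a version string; an ImportError here means
# the corresponding pip install did not succeed in this environment.
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
```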

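II. Crash Course in Scraping Basics

1. HTTP request basics

GET request: what happens when you type a URL into the browser's address bar
POST request: a request that submits data, such as a form
(Pay attention!) Important request headers:
- User-Agent: identifies the client; scrapers usually set it to a browser-like string
- Cookies: carry session data, which is how a login state is maintained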
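To see the difference between the two request types without sending anything over the network, requests can build a request and show what would go on the wire. The sketch below is illustrative only: the httpbin.org URL and the keyword form field are placeholders that do not come from this tutorial.

```python
import requests

# A browser-like User-Agent; the exact string is only an example.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# GET: parameters travel in the URL's query string.
get_req = requests.Request(
    "GET", "https://movie.douban.com/top250",
    params={"start": 25}, headers=headers,
).prepare()

# POST: form data travels in the request body.
# The URL and the form field below are placeholders for illustration.
post_req = requests.Request(
    "POST", "https://httpbin.org/post",
    data={"keyword": "python"}, headers=headers,
).prepare()

# Nothing has been sent yet; these are just the prepared requests.
print(get_req.method, get_req.url)
print(post_req.method, post_req.url, post_req.body)
print(post_req.headers["Content-Type"])
```

Preparing a request like this is handy for debugging, because you can inspect the final URL, headers and body before deciding to send it.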

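2. The HTML parsing workhorse

Commonly used BeautifulSoup methods:

soup.find('div')  # first <div> tag, or None if there is none
soup.find_all('a')  # list of all <a> (hyperlink) tags
soup.select('.class_name')  # all elements matching a CSS selector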
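Here is a small self-contained sketch of those three methods working on an inline HTML string. The markup is made up for the example and is not copied from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example.
html = """
<div class="item">
  <a href="https://example.com/movie/1"><span class="title">Movie A</span></a>
  <span class="rating_num">9.0</span>
</div>
<div class="item">
  <a href="https://example.com/movie/2"><span class="title">Movie B</span></a>
  <span class="rating_num">8.5</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

first_div = soup.find("div")                                  # first matching tag
links = [a["href"] for a in soup.find_all("a")]               # every <a> tag
titles = [t.get_text() for t in soup.select(".item .title")]  # CSS selector
missing = soup.find("table")                                  # no match gives None

print(first_div["class"], links, titles, missing)
```

Because find returns None when nothing matches, it is worth checking the result before reading .text from it, otherwise a small layout change on the site turns into an AttributeError.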

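III. Hands-On Example: Scraping the Douban Movie Top 250

1. Target analysis

We want to scrape:

Movie title
Rating
Classic quote
Link to the movie's detail page

(Take a look at Douban's robots.txt first: when I checked, it allowed crawlers to access this page, but the rules can change, so read the current file yourself.)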
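The robots.txt check can also be done in code with the standard library's urllib.robotparser. The rules in this sketch are hypothetical and are NOT Douban's actual robots.txt; they only show how the parser answers "may I fetch this URL?".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; they are NOT Douban's actual robots.txt.
sample_rules = """
User-agent: *
Disallow: /search
Crawl-delay: 5
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# Ask whether a given user agent may fetch a given URL.
allowed = parser.can_fetch("*", "https://example.com/top250")
blocked = parser.can_fetch("*", "https://example.com/search?q=python")
delay = parser.crawl_delay("*")  # requested pause between requests, if any

print(allowed, blocked, delay)
```

For a live site you would call parser.set_url() with the address of its robots.txt and then parser.read() instead of parsing a hard-coded string.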

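2. Code walkthrough

import requests
from bs4 import BeautifulSoup
import time

# Browser-like User-Agent; many sites reject the default python-requests one
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

def get_movies(page):
    # Each page lists 25 movies, so page 0 starts at 0, page 1 at 25, ...
    url = f'https://movie.douban.com/top250?start={page*25}'
    try:
        response = requests.get(url, headers=headers, timeout=10)
    except requests.RequestException as e:
        print(f'Request failed: {e}')
        return

    # Important error handling (network requests can fail!)
    if response.status_code != 200:
        print(f'Request failed, status code: {response.status_code}')
        return

    soup = BeautifulSoup(response.text, 'html.parser')

    # Locate the movie list (inspect the structure with the browser's developer tools)
    items = soup.find_all('div', class_='item')

    for item in items:
        title = item.find('span', class_='title').text
        rating = item.find('span', class_='rating_num').text
        link = item.find('a')['href']

        # The one-line quote normally sits in the list item itself (span.inq),
        # so read it there and only fall back to the detail page if it is missing
        inq = item.find('span', class_='inq')
        quote = inq.text if inq else get_quote(link)

        print(f'[{title}] Rating: {rating} | Quote: {quote}')
        print('-'*50)

def get_quote(url):
    # Fallback: look for the quote on the movie's detail page
    try:
        time.sleep(1)  # pause before every extra request
        res = requests.get(url, headers=headers, timeout=5)
        soup = BeautifulSoup(res.text, 'html.parser')
        quote = soup.find('span', class_='inq')
        return quote.text if quote else 'No quote available'
    except Exception as e:
        print(f'Failed to fetch quote: {e}')
        return 'Fetch failed'

# Control pagination (never hammer the site!)
for page in range(10):  # 10 pages in total
    get_movies(page)
    time.sleep(3)  # pause 3 seconds per page (be a law-abiding citizen!)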
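If you want to test the selectors without touching the network, one option is to split parsing from fetching and return data instead of printing it. The sketch below is one possible way to do that: parse_movies is a hypothetical helper, and the sample markup is made up, imitating only the class names used above and assuming the quote is present in the list item.

```python
from bs4 import BeautifulSoup

def parse_movies(html):
    """Return one dict per div.item found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for item in soup.find_all("div", class_="item"):
        title = item.find("span", class_="title")
        rating = item.find("span", class_="rating_num")
        link = item.find("a")
        quote = item.find("span", class_="inq")
        if title is None or rating is None or link is None:
            continue  # skip entries that do not match the expected layout
        movies.append({
            "title": title.get_text(strip=True),
            "rating": float(rating.get_text(strip=True)),
            "link": link["href"],
            "quote": quote.get_text(strip=True) if quote else "",
        })
    return movies

# Made-up sample markup that only imitates the class names used in the tutorial.
sample_html = """
<div class="item">
  <a href="https://example.com/subject/1/"><span class="title">Movie A</span></a>
  <span class="rating_num">9.7</span>
  <span class="inq">A one-line quote.</span>
</div>
<div class="item">
  <a href="https://example.com/subject/2/"><span class="title">Movie B</span></a>
  <span class="rating_num">9.5</span>
</div>
"""

for movie in parse_movies(sample_html):
    print(movie)
```

Returning a list of dicts also makes the storage step easier later, since the same rows can be handed to a CSV writer or a database insert.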

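IV. Pitfall Guide (lessons learned the hard way!)

1. Dealing with anti-scraping measures

Rotate the User-Agent randomly (the fake_useragent library is recommended)
Use proxy IPs (free ones may be unreliable)
Throttle your request rate (important! important! important!)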
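As a rough illustration of the first and third points, here is a standard-library-only sketch. It swaps fake_useragent for a small hand-written pool so that it runs without extra installs; the User-Agent strings, the function names and the default timings are all assumptions made for this example.

```python
import random
import time

# Small hand-written pool; these strings are examples only.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def random_headers():
    """Build request headers with a User-Agent picked at random from the pool."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_delay(base=3.0, jitter=2.0):
    """Return a delay in seconds between base and base + jitter."""
    return base + random.uniform(0, jitter)

def sleep_politely(base=3.0, jitter=2.0):
    """Sleep for a randomized delay and return how long the pause was."""
    delay = polite_delay(base, jitter)
    time.sleep(delay)
    return delay

headers = random_headers()
delay = polite_delay()
print(headers["User-Agent"], round(delay, 2))
```

In a scraping loop you would call sleep_politely() between requests, so the pauses are not perfectly regular and never shorter than the base value.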

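2. Handling common errors

import requests

# Set a timeout (so the program never hangs forever)
requests.get(url, timeout=10)

# Retry mechanism (requests can retry failed connections through HTTPAdapter;
# an integer max_retries covers connection failures, not HTTP error statuses)
from requests.adapters import HTTPAdapter
session = requests.Session()
# Mount for both schemes: the Douban URLs used here are https
session.mount('http://', HTTPAdapter(max_retries=3))
session.mount('https://', HTTPAdapter(max_retries=3))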
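If you also want retries on error statuses such as 429 or 503, with a growing pause between attempts, HTTPAdapter accepts a urllib3 Retry object in place of an integer. The following is a minimal sketch: make_session is a hypothetical helper name, and the retry count, backoff factor and status list are example values, not recommendations from the original article.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    """Create a Session that retries with exponential backoff."""
    retry = Retry(
        total=3,                                     # at most 3 retries per request
        backoff_factor=1,                            # wait longer after each failed attempt
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

session = make_session()
# Look up which adapter would handle an https URL and inspect its retry policy.
adapter = session.get_adapter("https://movie.douban.com/top250")
print(adapter.max_retries.total, adapter.max_retries.status_forcelist)
```

You would then call session.get(url, headers=headers, timeout=10) instead of requests.get, still passing a timeout on every call because the retry policy does not set one for you.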

V. Data Storage Options

1. Saving to a CSV file

import csv

with open('movies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Rating', 'Classic quote'])
    # Write one row per movie inside the scraping loop...
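To fill in the "write one row per movie" step, here is a small sketch using csv.DictWriter. The save_movies_csv helper, the field names and the sample rows are assumptions for illustration; an in-memory buffer stands in for the file so that the example leaves nothing on disk.

```python
import csv
import io

FIELDS = ["title", "rating", "quote"]

def save_movies_csv(movies, fileobj):
    """Write a header row plus one row per movie dict to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(movies)

# Made-up rows standing in for scraped data.
movies = [
    {"title": "Movie A", "rating": "9.7", "quote": "A one-line quote."},
    {"title": "Movie B", "rating": "9.5", "quote": "Contains, a comma"},
]

buffer = io.StringIO(newline="")
save_movies_csv(movies, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file you would pass the object returned by open('movies.csv', 'w', newline='', encoding='utf-8-sig'); the utf-8-sig encoding adds a byte order mark, which helps Excel recognize Chinese text as UTF-8. The csv module quotes fields containing commas for you, as the second row shows.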

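2. MySQL database

import pymysql

# Placeholder credentials: replace them and never hard-code a real password.
# Assumes a movies table with title and rating columns already exists.
conn = pymysql.connect(host='localhost', user='root', password='123456',
                       database='test', charset='utf8mb4')
try:
    with conn.cursor() as cursor:
        # %s placeholders let the driver escape the values safely
        sql = "INSERT INTO movies (title, rating) VALUES (%s, %s)"
        cursor.execute(sql, ('肖申克的救赎', 9.7))
    conn.commit()
finally:
    conn.close()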
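pymysql needs a running MySQL server, so the sketch below swaps in the standard library's sqlite3 module to show the same pattern in a form you can run anywhere: create a table, insert many rows with a parameterized statement, commit, and read the rows back. This is a stand-in, not the MySQL code itself; the table layout and the rows are made up.

```python
import sqlite3

# Made-up rows standing in for scraped data.
rows = [("Movie A", 9.7), ("Movie B", 9.5)]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
try:
    conn.execute("CREATE TABLE movies (title TEXT PRIMARY KEY, rating REAL)")
    # sqlite3 uses ? placeholders where pymysql uses %s
    conn.executemany("INSERT INTO movies (title, rating) VALUES (?, ?)", rows)
    conn.commit()
    stored = conn.execute(
        "SELECT title, rating FROM movies ORDER BY rating DESC"
    ).fetchall()
finally:
    conn.close()

print(stored)
```

With pymysql the equivalent call is cursor.executemany(sql, rows) using %s placeholders; either way, passing the values separately from the SQL string avoids building queries by string concatenation.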

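VI. Preview of Advanced Techniques (learn these and you can start taking freelance jobs!)

Scraping dynamic pages (Selenium/Playwright)
Reverse-engineering APIs (Chrome DevTools)
Distributed crawling frameworks (Scrapy/Scrapy-Redis)

(If you've read this far, you're already ahead of most beginners!)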

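Summary and Advice

At its core, a scraper imitates what a browser does, but keep the following in mind:

Follow the site's rules (read robots.txt)
Don't disrupt the site's normal operation
Mind data copyright issues

One last heartfelt word: scraping isn't about having the most powerful technique. Knowing restraint is the mark of a real expert! When you hit a CAPTCHA, don't try to force your way through; know when to stop.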