Python Web Scraping for Beginners: Avoiding the Pitfalls While Scraping Web Data from Scratch (Practical Tips Only)

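I. Getting Ready: Your First "Shovel"

(Important!) Good work starts with sharp tools, so let's get these ready first:

Python 3.8+ (the latest release is recommended)
The requests library (the go-to tool for HTTP requests)
BeautifulSoup4 (the HTML parsing expert)
Chrome (its developer tools are essential)

Here's the install command: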

pip install requests beautifulsoup4

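Heads-up on common beginner pitfalls!

Some sites detect and block the default User-Agent that requests sends (it identifies itself as python-requests)
Always set the timeout parameter of requests.get(); without it a request can hang indefinitely
Remember to check the HTTP status code (200 means the request succeeded)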
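
The three pitfalls above can be folded into one small helper. This is only a minimal sketch: the name fetch_html, the 10-second default and the injectable get parameter are assumptions made for illustration, not anything requests prescribes.

```python
import requests

# Illustrative browser-style User-Agent; any realistic value works.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page body, or None if the request fails.

    `get` is injectable so the helper can be tried out without network access.
    """
    try:
        response = get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    except requests.RequestException as exc:
        # Covers timeouts, connection errors and similar failures
        print(f"Request failed: {exc}")
        return None
    if response.status_code != 200:
        print(f"Unexpected status {response.status_code} for {url}")
        return None
    return response.text
```

Returning None instead of raising keeps a beginner's loop simple; in a larger project you may prefer to let the exception propagate.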

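II. Your First Scraper: Douban Movie Top 250

Let's practice on Douban's public Top 250 page (note: check the site's robots.txt before doing any real development)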

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raise an exception on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')

# Extract each movie's title and rating
for item in soup.select('.item'):
    title = item.select_one('.title').text
    rating = item.select_one('.rating_num').text
    print(f"Movie: {title} | Rating: {rating}")

Sample output:

Movie: 肖申克的救赎 | Rating: 9.7
Movie: 霸王别姬 | Rating: 9.6
...
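
The request above only returns the first page of the list. Here is a rough sketch of how the remaining pages could be addressed, assuming the listing is paginated with a start offset in steps of 25. That parameter and the helper name page_urls are assumptions, so confirm the real pattern in your browser's address bar before relying on it.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, page_size=25):
    """Build one listing URL per page, assuming a `start` offset parameter."""
    return [
        f"{BASE_URL}?{urlencode({'start': offset})}"
        for offset in range(0, total, page_size)
    ]


# Show the first few generated URLs
for page_url in page_urls()[:3]:
    print(page_url)
```

Each URL can then go through the same request-and-parse steps, with a pause between pages.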

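III. Going Further: Getting Past Common Anti-Scraping Measures

1. Random User-Agent generation (a must for avoiding bans)

from fake_useragent import UserAgent  # third-party package: pip install fake-useragent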
ua = UserAgent()
headers = {'User-Agent': ua.random}
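
If you would rather not install another package, a hand-maintained pool and random.choice give a similar effect. This is a swapped-in alternative to fake_useragent, not how that library works internally, and the User-Agent strings below are illustrative samples that will go stale over time.

```python
import random

# Hand-maintained pool; these strings are illustrative and will age,
# so refresh them from a current browser now and then.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers():
    """Return request headers with a User-Agent picked at random from the pool."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```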

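2. IP proxy settings (getting around access limits)

# Placeholder addresses; substitute a proxy you are allowed to use
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get(url, proxies=proxies, timeout=10)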
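
With more than one proxy available, you might rotate through them in turn. A small sketch follows; the pool addresses are placeholders and next_proxies is a made-up helper name, so treat this as one possible arrangement rather than a recommended setup.

```python
from itertools import cycle

# Placeholder addresses; substitute proxies you are allowed to use.
PROXY_POOL = ["http://10.10.1.10:3128", "http://10.10.1.11:3128"]
_rotation = cycle(PROXY_POOL)


def next_proxies():
    """Return a requests-style proxies mapping for the next proxy in the pool."""
    proxy = next(_rotation)
    # Here the same HTTP proxy carries both plain and TLS traffic.
    return {"http": proxy, "https": proxy}
```

Each request would then pass proxies=next_proxies() to requests.get().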

3. Spacing out requests (to avoid getting blocked)

import time
import random

time.sleep(random.uniform(1, 3))  # wait a random 1-3 seconds
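
A fixed random pause is fine while everything works. When requests start failing, one common pattern is to wait longer on each retry. The sketch below shows that idea; backoff_delay and its default values are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Seconds to wait before retry number `attempt` (0-based).

    The upper bound doubles with each attempt, up to `cap`; the actual delay
    is drawn uniformly between `base` and that bound so retries are spread out.
    """
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(base, max(base, upper))
```

In a retry loop you would call time.sleep(backoff_delay(attempt)) before trying again.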

IV. Storing the Data: Keeping the "Gold" You Dug Up Safe

1. CSV storage (simple and easy to use)

import csv

with open('movies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Rating'])
    # Write the data rows in a loop...
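
To make the writing loop concrete, here is a small self-contained sketch using csv.DictWriter. The two sample rows stand in for scraped results, and save_movies is a hypothetical helper name.

```python
import csv

# Sample rows standing in for scraped results.
movies = [
    {"title": "肖申克的救赎", "rating": "9.7"},
    {"title": "霸王别姬", "rating": "9.6"},
]


def save_movies(rows, path="movies.csv"):
    """Write rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "rating"])
        writer.writeheader()
        writer.writerows(rows)


save_movies(movies)
```

DictWriter ties each value to a named column, which makes it harder to mix up the column order as you add more fields.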

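2. MySQL storage (suited to larger datasets)

import pymysql

# Example credentials; replace them with your own
conn = pymysql.connect(host='localhost', user='root', password='123456',
                       database='spider', charset='utf8mb4')
cursor = conn.cursor()
# Parameterized query: the driver escapes the values for you
sql = 'INSERT INTO movies(title, rating) VALUES(%s, %s)'
cursor.execute(sql, (title, rating))  # title and rating come from the scraping loop
conn.commit()
conn.close()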
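
If you don't have a MySQL server handy, the same parameterized-insert pattern can be tried with Python's built-in sqlite3. Note that this is a stand-in rather than MySQL itself: sqlite3 uses ? placeholders where PyMySQL uses %s, and the table layout here is assumed for illustration.

```python
import sqlite3

# Sample data standing in for scraped results.
rows = [("肖申克的救赎", 9.7), ("霸王别姬", 9.6)]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
try:
    conn.execute("CREATE TABLE movies (title TEXT, rating REAL)")
    # executemany runs the same parameterized statement once per row
    conn.executemany("INSERT INTO movies(title, rating) VALUES (?, ?)", rows)
    conn.commit()
    saved = conn.execute(
        "SELECT title, rating FROM movies ORDER BY rating DESC"
    ).fetchall()
finally:
    conn.close()

print(saved)
```

PyMySQL cursors offer executemany as well, so batching rows like this carries over to the MySQL version.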

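V. Legal Red Lines: Pitfalls You Must Never Step Into!

Never scrape sensitive data (users' personal information, state secrets)
Follow the site's robots.txt
Throttle your request rate (don't bring down someone else's server)
Get authorization before any commercial use
Mind the relevant provisions of the Cybersecurity Law (《网络安全法》)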
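
The robots.txt check can be automated with the standard library's urllib.robotparser. The sketch below parses a made-up rule set so that it runs offline; the rules, the example.com URLs and the "my-spider" agent name are all illustrative.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real check would load the site's own file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)
print(parser.can_fetch("my-spider", "https://example.com/public/page"))
print(parser.can_fetch("my-spider", "https://example.com/private/data"))
print(parser.crawl_delay("my-spider"))
```

Against a live site you would call set_url() with the robots.txt address and then read() instead of parsing a string. Passing robots.txt is the minimum courtesy, not a legal guarantee.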

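VI. Hands-On Practice: Scraping Weather Forecast Data

Here's a small exercise to reinforce what you've learned (using China Weather, weather.com.cn, as the example):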

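# Reuses the imports and headers from the Douban example
url = 'http://www.weather.com.cn/weather/101020100.shtml'
response = requests.get(url, headers=headers, timeout=10)
# Pass the raw bytes so the parser can detect the page encoding itself
soup = BeautifulSoup(response.content, 'html.parser')

# Parse the 7-day forecast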
for item in soup.select('.t li'):
    date = item.select_one('h1').text
    weather = item.select_one('.wea').text
    temp = item.select_one('.tem').text.strip()
    print(f"{date}: {weather} {temp}")

Sample output:

08日（今天）: 多云转晴 29/24℃
09日（明天）: 晴 31/25℃
...
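
Once you have the temperature text, you will usually want numbers rather than strings. The sketch below assumes the "high/low℃" format shown in the sample output and treats a lone value as the low; both are assumptions, and parse_temperature is a hypothetical helper name.

```python
import re


def parse_temperature(text):
    """Split a string such as '29/24℃' into (high, low) integers.

    A string with a single value, e.g. '24℃', is assumed to be the low
    and is returned as (None, low).
    """
    numbers = [int(n) for n in re.findall(r"-?\d+", text)]
    if len(numbers) >= 2:
        return numbers[0], numbers[1]
    if len(numbers) == 1:
        return None, numbers[0]
    return None, None
```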

VII. Troubleshooting Common Errors

1. What if you get a 403 error?

Check that a User-Agent is set
Add a Referer request header
Try a proxy IP
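
For the first two checks, a small helper can build both headers at once. This sketch assumes the site's home page is an acceptable Referer value, which is not true of every site (some expect a specific page); browser_like_headers is a made-up name.

```python
from urllib.parse import urlsplit


def browser_like_headers(url, user_agent):
    """Build headers that include a Referer pointing at the target site's home page."""
    parts = urlsplit(url)
    return {
        "User-Agent": user_agent,
        "Referer": f"{parts.scheme}://{parts.netloc}/",
    }
```

The result can be passed straight to requests.get() through the headers argument.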

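2. Page content incomplete?

The data may be loaded dynamically by JavaScript (in which case you need a browser-automation tool such as Selenium)
Check that your XPath/CSS selectors are correct
View the page source to confirm the element structure

3. What if you hit a CAPTCHA?

Lower your request rate
Use a CAPTCHA-solving service (for commercial projects)
Switch to an API to get the data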

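One last thing, straight from the heart: scraping is great, but don't overindulge! Beginners are better off starting with public APIs (for example the Douban open API or the Amap (高德地图) API) and taking on more complex sites once they know the rules. Remember, technology is a double-edged sword; let's use it to create value, not to cause damage!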