Python Web Scraping for Beginners: A Hands-On Guide to Scraping Weather Forecast Data (with a Pitfall Guide)

🔥 Get your scraping toolbox ready

A craftsman needs sharp tools, so start by installing these two libraries:

pip install requests beautifulsoup4

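(Just between us) Common ways beginners crash at this step:

If pip reports a permission error, install into a virtual environment or add the --user flag; running CMD/PowerShell as administrator also works but is rarely necessary
Use a stable network connection (avoid going through a corporate intranet proxy!)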

🕷️ The four steps of scraping (the golden rule)

Step 1: Send the request

import requests

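headers = {
    # Header values must be plain ASCII/latin-1, so use a real browser string here
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# timeout stops the call from hanging forever if the server never answers
response = requests.get('https://www.weather.com.cn/weather/101020100.shtml', headers=headers, timeout=10)
response.encoding = 'utf-8'  # set the encoding explicitly to avoid garbled Chinese text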

Step 2: Parse the data

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
# Never parse HTML with raw regular expressions (a lesson learned the hard way)

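Step 3: Extract the information

weather_data = []
for item in soup.select('.t li'):
    date_tag = item.select_one('h1')
    temp_tag = item.select_one('.tem')
    if date_tag is None or temp_tag is None:
        continue  # select_one returns None when nothing matches
    date = date_tag.text
    temp = temp_tag.text.strip()
    weather_data.append({'日期': date, '温度': temp})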
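
To try the selectors without hitting the site, here is a minimal sketch that runs the same extraction against a hand-written HTML fragment. The fragment and the helper name extract_weather are assumptions made for illustration: the markup only imitates the `.t li` / `h1` / `.tem` structure that the selectors expect, and the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure the selectors expect.
SAMPLE_HTML = """
<ul class="t clearfix">
  <li><h1>19日（今天）</h1><p class="tem"><span>34</span>/<i>27℃</i></p></li>
  <li><h1>20日（明天）</h1><p class="tem"><span>35</span>/<i>28℃</i></p></li>
  <li><p class="tem"><i>26℃</i></p></li>
</ul>
"""


def extract_weather(html):
    """Return a list of {'日期', '温度'} dicts, skipping incomplete items."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.select('.t li'):
        date_tag = item.select_one('h1')
        temp_tag = item.select_one('.tem')
        if date_tag is None or temp_tag is None:
            continue  # the third <li> has no <h1>, so it is skipped
        rows.append({
            '日期': date_tag.get_text(strip=True),
            '温度': temp_tag.get_text(strip=True),
        })
    return rows


rows = extract_weather(SAMPLE_HTML)
print(rows)
```

Keeping the parsing in a function that takes an HTML string makes it easy to test against saved pages, so a selector can be debugged without sending a new request each time.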

Step 4: Save the results

import csv

with open('weather.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['日期', '温度'])
    writer.writeheader()
    writer.writerows(weather_data)
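
The utf-8-sig encoding writes a byte order mark (BOM) at the start of the file, which is what lets Excel recognize the file as UTF-8 instead of showing garbled Chinese. The sketch below is one way to sanity-check that: it writes two made-up rows to a temporary file, reads them back, and inspects the first three bytes. The sample values and the temporary path are assumptions for the demo only.

```python
import csv
import os
import tempfile

# Made-up rows standing in for weather_data.
sample_rows = [
    {'日期': '19日（今天）', '温度': '34/27℃'},
    {'日期': '20日（明天）', '温度': '35/28℃'},
]

path = os.path.join(tempfile.mkdtemp(), 'weather.csv')

with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['日期', '温度'])
    writer.writeheader()
    writer.writerows(sample_rows)

# Reading with the same encoding strips the BOM again, so the first
# header comes back as '日期' rather than '\ufeff日期'.
with open(path, newline='', encoding='utf-8-sig') as f:
    loaded = list(csv.DictReader(f))

# The raw file starts with the 3-byte UTF-8 BOM.
with open(path, 'rb') as f:
    has_bom = f.read(3) == b'\xef\xbb\xbf'

print(loaded, has_bom)
```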

💥 Real-world example: scraping Shanghai's 7-day weather forecast

Full code (runs as-is):

import requests
from bs4 import BeautifulSoup
import csv
import time

def get_weather():
    url = 'https://www.weather.com.cn/weather/101020100.shtml'
    
    # Send browser-like request headers (important!)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise an exception on HTTP 4xx/5xx responses
        response.encoding = 'utf-8'
        
        soup = BeautifulSoup(response.text, 'html.parser')
        weather_list = soup.select('.t li')
        
        data = []
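        for item in weather_list[:7]:  # keep the first 7 days
            date_tag = item.select_one('h1')
            temp_tag = item.select_one('.tem')
            if date_tag is None or temp_tag is None:
                continue  # skip list items that lack a date or temperature
            date = date_tag.text
            temp = temp_tag.text.strip().replace('\n', '')
            data.append({'日期': date, '温度': temp})
            # No sleep here: this loop only parses and sends no requests.
            # Put delays between requests when fetching several pages.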
            
        save_to_csv(data)
        print("Scrape succeeded! Open weather.csv to see the results")

    except Exception as e:
        print(f"It crashed! Error message: {e}")

def save_to_csv(data):
    with open('weather.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=['日期', '温度'])
        writer.writeheader()
        writer.writerows(data)

if __name__ == '__main__':
    get_weather()

Sample output:

日期       温度
19日（今天）  27℃/34℃
20日（明天）  28℃/35℃
...

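🚨 Pitfall guide every beginner should read

Common error 1: 403 Forbidden

Symptom: the server refuses the request and no data comes back
Fix: check that the User-Agent header is actually being sent; requesting https://httpbin.org/user-agent with the same headers echoes back the value the server sees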

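Common error 2: garbled Chinese text

Symptom: the fetched content looks like gibberish
Fix: a three-step approach:
1. Check what response.encoding is set to
2. Look at the charset attribute in the page's meta tag
3. Detect the encoding automatically with the chardet library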
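
Step 2 above can also be done in code. The sketch below uses only the standard library to read the charset declared in a page's meta tag and then decodes the raw bytes with it. The names MetaCharsetFinder and sniff_charset are hypothetical helpers, and the GBK sample page is made up to show what a wrong decode looks like.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.charset:
            return
        attrs = dict(attrs)
        if attrs.get('charset'):
            # <meta charset="gbk">
            self.charset = attrs['charset'].strip().lower()
        elif 'charset=' in (attrs.get('content') or '').lower():
            # <meta http-equiv="Content-Type" content="text/html; charset=gbk">
            content = attrs['content'].lower()
            self.charset = content.split('charset=')[-1].split(';')[0].strip()


def sniff_charset(raw, default='utf-8'):
    """Return the charset named in the page's meta tag, or the default."""
    finder = MetaCharsetFinder()
    # latin-1 maps every byte to a character, so this decode never fails
    # and the ASCII-only <meta> tag stays readable.
    finder.feed(raw.decode('latin-1'))
    return finder.charset or default


# Made-up page encoded as GBK.
raw = '<html><head><meta charset="gbk"></head><body>上海 晴</body></html>'.encode('gbk')

garbled = raw.decode('latin-1')   # wrong encoding: the Chinese text is lost
charset = sniff_charset(raw)
fixed = raw.decode(charset)       # right encoding: the text is readable

print(charset, '上海' in garbled, '上海' in fixed)
```

With requests, the raw bytes are available as response.content, and the detected value can be assigned to response.encoding before reading response.text.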

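Common error 3: element lookup fails

Symptom: the selector returns an empty list
Fix:
1. Re-check the CSS selector in the browser's developer tools
2. Check whether the content is loaded dynamically by JavaScript (if so, use selenium)
3. Check whether the site has anti-scraping measures in place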

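⚡ Tips for speeding up your crawler

Use a Session object to keep the session alive and reuse connections (essential for login scenarios)
Set a timeout: requests.get(url, timeout=5)
Add random delays to avoid getting blocked: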

import random
import time

time.sleep(random.uniform(1, 3))  # pause 1 to 3 seconds between requests
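
When several pages are fetched in a row, the delay belongs between requests. Here is a minimal sketch of that pattern, assuming a hypothetical helper named fetch_politely. The fetch and sleep functions are passed in as parameters so the demo can run with stand-ins and never touch the network; the example.com URLs are placeholders.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Delay only between requests, not before the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demo with stand-ins: a fake fetch and a "sleep" that only records the delay.
recorded_delays = []
pages = fetch_politely(
    ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
    fetch=lambda url: f'fake body for {url}',
    sleep=recorded_delays.append,
)
print(len(pages), recorded_delays)
```

In a real crawl, one option is to create session = requests.Session() once and pass fetch=lambda url: session.get(url, headers=headers, timeout=5), which combines all three tips above.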

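🚩 Three principles of scraping ethics

Follow the robots.txt rules (append /robots.txt to the domain to view them)
Throttle your request rate (don't bring down someone else's server)
Don't scrape sensitive data (ID numbers, phone numbers, and the like)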
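
The robots.txt check can be automated with the standard library's urllib.robotparser. The sketch below parses a made-up robots.txt from a string so it runs offline; the rules, the MyWeatherBot user agent name, and the example.com URLs are all assumptions for illustration, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would load the site's own
# file, for example with parser.set_url(...) followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) reports whether the rules allow the request.
allowed = parser.can_fetch('MyWeatherBot', 'https://example.com/weather/101020100.shtml')
blocked = parser.can_fetch('MyWeatherBot', 'https://example.com/private/data.html')
print(allowed, blocked)
```

Calling can_fetch before each request and skipping disallowed URLs is a cheap way to keep a crawler within the site's stated rules.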

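🚀 What to learn next

Scraping dynamic pages: selenium / playwright
Countering anti-scraping measures: IP proxy pools, CAPTCHA recognition
Asynchronous crawlers: aiohttp + asyncio
Moving up to frameworks: Scrapy in practice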

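(True story) There was once a project that skipped the delays and knocked the target server offline (awkward)... so please, always be a well-mannered crawler engineer!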