Python Web Scraping in Practice: Fetching E-commerce Data Through Proxy IPs (Step-by-Step Tutorial)

Latest recommended article published on 2025-07-02 16:19:07

alphabuilder1

License: CC 4.0 BY-SA

Tags: python, web scraping, tcp/ip, other

Article link: https://blog.youkuaiyun.com/alphabuilder1/article/details/148053857

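1. Why Do You Need Proxy IPs? (The Hurdle You Can't Get Around)

Anyone who scrapes e-commerce data knows the pain: the platforms' anti-scraping defenses are maddeningly hard to second-guess. Last week I pulled 1,000 records from my own IP, and the very next day I was banned outright.

This is where proxy IPs come to the rescue. A proxy acts as:

A disguise that masks the scraper's identity
A way past per-IP access limits
A shield that keeps your own IP out of the firing line (extremely important)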

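2. Environment Setup (Beginners, Read This)

2.1 Install the required libraries

pip install requests
pip install beautifulsoup4
pip install fake-useragent
pip install lxml
pip install pymysql

lxml is the parser used in section 3.3, and pymysql is only needed for the MySQL storage option in section 5.2.

2.2 Recommended proxy service (tested in my own use)

This tutorial uses Qingguo Proxy (青果代理) as the example (not an ad!). After registering, you get an API endpoint:

API_URL = "http://api.qingguo.com/getip?count=10&type=json"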

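3. Walking Through the Code (With Pitfalls to Avoid)

3.1 Fetching the proxy IP pool

import requests

def get_proxy_pool():
    """Fetch a batch of proxies from the provider API as 'ip:port' strings."""
    try:
        response = requests.get(API_URL, timeout=10)
        response.raise_for_status()  # treat HTTP errors as failures
        # Assumes the API returns a JSON list of {"ip": ..., "port": ...} objects
        ips = [f"{item['ip']}:{item['port']}" for item in response.json()]
        print(f"Fetched {len(ips)} proxy IPs")
        return ips
    except Exception as e:
        print("Failed to fetch proxy IPs:", e)
        return []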

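3.2 A request function with retries (the key part!)

from fake_useragent import UserAgent
import random
import time

ua = UserAgent()

def smart_request(url, retry=3):
    proxies = get_proxy_pool()
    if not proxies:  # nothing to rotate through, so give up early
        return None
    for attempt in range(retry):
        try:
            address = "http://" + random.choice(proxies)
            # Map both schemes; with only "http", https:// URLs bypass the proxy
            proxy = {"http": address, "https": address}
            headers = {"User-Agent": ua.random}

            response = requests.get(url,
                                  headers=headers,
                                  proxies=proxy,
                                  timeout=10)

            if response.status_code == 200:
                return response
            print(f"Unexpected status code {response.status_code}; switching proxy and retrying...")
        except requests.RequestException as e:
            print(f"Attempt {attempt+1} failed: {e}")
        time.sleep(2**attempt)  # exponential backoff before the next attempt
    return None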
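
To see what the exponential backoff in the retry loop amounts to, here is a minimal sketch that computes the wait schedule on its own. The helper name `backoff_delays` and its `cap` and `jitter` parameters are my own additions for illustration, not part of the original code.

```python
import random

def backoff_delays(retries, base=1.0, cap=30.0, jitter=0.0, rng=random):
    """Return the wait (in seconds) before each retry: base * 2**attempt, capped at `cap`."""
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * (2 ** attempt))
        # Optional jitter is added on top of the capped value so that several
        # workers do not all retry at exactly the same moment.
        if jitter:
            delay += rng.uniform(0, jitter)
        delays.append(delay)
    return delays

print(backoff_delays(3))  # [1.0, 2.0, 4.0]
```

With `retry=3` the waits are 1, 2, and 4 seconds, which matches `time.sleep(2**attempt)`. A cap matters once the retry count grows, since the raw value doubles every attempt.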

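3.3 Parsing the data (using a Taobao-style product page as the example)

from bs4 import BeautifulSoup

def parse_product(html):
    soup = BeautifulSoup(html, 'lxml')  # the lxml parser requires: pip install lxml

    title_tag = soup.select_one('.product-title')   # product name
    price_tag = soup.select_one('.price-wrapper')   # price (may be loaded dynamically)
    sales_tag = soup.find('span', class_='sales')   # monthly sales (may be obfuscated)
    if title_tag is None or price_tag is None or sales_tag is None:
        return None  # layout changed, or the data is not in the static HTML

    # '月销' is the on-page "monthly sales" label; strip it to leave the number
    sales = sales_tag.text.replace('月销', '').strip()

    return {
        "title": title_tag.text.strip(),
        "price": float(price_tag['data-price']),
        "monthly_sales": int(sales)  # raises ValueError for labels like '1000+'
    }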
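
Sales labels on real pages are often not plain integers, which breaks a bare `int()` conversion. The sketch below normalizes a few common shapes. The label formats it handles ('1000+', '1,000', '1.2万+') are assumptions about what a page might show, and `parse_sales_count` is a hypothetical helper, not part of the original code.

```python
import re

def parse_sales_count(text):
    """Convert a sales label such as '月销 1.2万+' to an int, or None if no number is found."""
    cleaned = text.replace(',', '')  # drop thousands separators
    match = re.search(r'(\d+(?:\.\d+)?)\s*(万)?', cleaned)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2):  # '万' means ten thousand
        value *= 10_000
    return int(round(value))

print(parse_sales_count('月销 1.2万+'))  # 12000
print(parse_sales_count('月销1000+'))    # 1000
```

Returning `None` instead of raising lets the caller decide whether to skip the record or log it for review. Obfuscation done through custom fonts or images is out of reach for a text-level helper like this.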

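4. Countering Anti-Scraping Measures (Lessons Learned the Hard Way)

4.1 Randomize request headers (a must!)

# ua is the UserAgent() instance from section 3.2
# referer_list is your own list of plausible referring URLs (not defined in this snippet)
headers = {
    "User-Agent": ua.random,
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Referer": random.choice(referer_list)
}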
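
A self-contained version of the same idea might look like the sketch below. It swaps `fake_useragent` for a small hard-coded pool so it runs with the standard library alone. The `USER_AGENTS` and `REFERERS` values are placeholders I made up for illustration, and `build_headers` is a hypothetical helper name.

```python
import random

# Hypothetical pools for illustration; replace them with values suited to your target site.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]
REFERERS = [
    "https://www.baidu.com/",
    "https://www.google.com/",
]

def build_headers(user_agents=USER_AGENTS, referers=REFERERS, rng=random):
    """Build a fresh header dict, picking a random User-Agent and Referer each time."""
    return {
        "User-Agent": rng.choice(user_agents),
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Referer": rng.choice(referers),
    }

headers = build_headers()
```

Calling `build_headers()` once per request, rather than reusing one dict, is what actually makes the headers vary between requests.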

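4.2 Control your request rate (the line between life and death)

# Random delay between requests (0.5 to 3 seconds)
time.sleep(random.uniform(0.5, 3))

# Rest for 10 seconds after every 50 requests
# (count is a running total of requests sent; start it at 1 so the first request does not pause)
if count % 50 == 0:
    time.sleep(10)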
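
The two rules above can be bundled into one small object that keeps its own counter. This is only a sketch: the `Throttle` class and its parameter names are my own, and the `sleep` argument exists so the behavior can be checked without actually waiting.

```python
import random
import time

class Throttle:
    """Random per-request delay plus a longer pause after every `batch_size` requests."""

    def __init__(self, min_delay=0.5, max_delay=3.0,
                 batch_size=50, batch_pause=10.0, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.batch_size = batch_size
        self.batch_pause = batch_pause
        self._sleep = sleep
        self.count = 0  # number of requests seen so far

    def wait(self):
        """Call once after each request."""
        self.count += 1
        self._sleep(random.uniform(self.min_delay, self.max_delay))
        if self.count % self.batch_size == 0:
            self._sleep(self.batch_pause)
```

Typical use would be to create one `Throttle()` before the crawl loop and call `throttle.wait()` after every request, so the counter never has to be managed by hand.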

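4.3 Solving CAPTCHAs (for emergencies)

# Use a third-party CAPTCHA-solving service (sample code; the response
# fields 'success' and 'code' follow this example and may differ in practice)
def crack_captcha(image_data):
    captcha_api = "http://api.dama2.com:7766/app/dama2"
    result = requests.post(captcha_api, data=image_data, timeout=30).json()
    return result.get('code') if result.get('success') else None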

5. Storing the Data (How to Choose)

5.1 Lightweight option: CSV

import csv

def save_to_csv(data, filename):
    with open(filename, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=data.keys())
        if f.tell() == 0:  # file is empty, so write the header row first
            writer.writeheader()
        writer.writerow(data)

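5.2 Recommended option: MySQL

import pymysql

# Example credentials only; load real ones from configuration or the environment
conn = pymysql.connect(
    host='localhost',
    user='root',
    password='123456',
    database='ecommerce',
    charset='utf8mb4'  # needed to store Chinese titles and emoji correctly
)

def save_to_mysql(data):
    # Assumes a products table with title, price, and sales columns already exists
    sql = """INSERT INTO products 
             (title, price, sales) 
             VALUES (%s, %s, %s)"""
    with conn.cursor() as cursor:
        cursor.execute(sql, (data['title'], data['price'], data['monthly_sales']))
    conn.commit()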

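6. Legal Risks (Must Read!!!)

Strictly follow the site's robots.txt
As a rule of thumb, keep daily collection below 30% of the site's total data
Do not collect private user information (names, phone numbers, and so on)
Obtain the platform's authorization before any commercial use
Prefer collecting between 22:00 and 08:00 (to reduce load on the server)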
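
The first rule can be checked in code before a URL is ever requested. Here is a minimal sketch using the standard library's `urllib.robotparser`. The `is_allowed` helper and the sample rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if `robots_txt` (the file's text) permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up rules for demonstration
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /cart/
Allow: /
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/item/1"))        # True
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/cart/checkout")) # False
```

Against a live site you would normally call `parser.set_url(".../robots.txt")` followed by `parser.read()` instead of passing the text in. Passing the text keeps this example free of network access.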

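7. Frequently Asked Questions

Q: What if my proxy IPs die too quickly?
A: Choose high-anonymity (elite) proxies and add a liveness check for each IP.

Q: What about data that is loaded dynamically?
A: Use Selenium or Pyppeteer to drive a real browser.

Q: What if collection is too slow?
A: Use asynchronous requests (aiohttp) or the Scrapy framework.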
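
For the first question, a liveness check could be sketched roughly as follows. The test endpoint `http://httpbin.org/ip`, the helper names `check_proxy` and `filter_alive`, and the 5-second timeout are all assumptions of mine. Point it at any stable URL you are allowed to hit.

```python
from concurrent.futures import ThreadPoolExecutor

TEST_URL = "http://httpbin.org/ip"  # assumption: a stable endpoint used only for probing

def check_proxy(proxy, timeout=5):
    """Return True if a request through `proxy` ('ip:port') succeeds."""
    import requests  # imported lazily so the filtering logic works without it
    proxy_map = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        return requests.get(TEST_URL, proxies=proxy_map, timeout=timeout).ok
    except requests.RequestException:
        return False

def filter_alive(proxies, check=check_proxy, max_workers=10):
    """Probe proxies concurrently and keep the ones that pass, preserving order."""
    if not proxies:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, proxies))
    return [p for p, ok in zip(proxies, results) if ok]
```

One way to use it would be `proxies = filter_alive(get_proxy_pool())` before starting a crawl, then repeating the check periodically, since a proxy that was alive a few minutes ago may already be gone.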

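One last word from the heart: scraping is great, but don't overindulge. Using the technique reasonably and legally is what lets you go the distance. If this article helped, feel free to share it with others who might need it. (The source code has been tested; message me privately if you need the complete code.)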