RESTful API Development in Practice: A Data Collection Approach for Taobao Product Detail Pages

Originally published 2025-08-19 11:44:00 · 1.5k views

Licensed under CC 4.0 BY-SA

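In e-commerce scenarios such as data analysis, competitor monitoring, and price comparison, data collected from Taobao product detail pages is valuable. This article shows how to build a Taobao product detail collection service around RESTful API design principles and provides a working reference implementation.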

RESTful API Design Principles

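REST is an architectural style for networked applications; a RESTful API applies it over HTTP to expose a uniform interface. The core principles are:

Resource orientation: URIs identify resources, for example /api/products/{id}
HTTP method semantics: GET (read), POST (create), PUT (update), DELETE (delete)
Statelessness: every request carries all the information needed to process it, and the server keeps no session state
Standardized response format: usually JSON
Cacheability: set cache headers such as Cache-Control where appropriate to improve performance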
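
The principles above can be sketched without any web framework. The snippet below is a minimal illustration, not the service built later in this article: `handle`, the `PRODUCTS` store, and the sample data are hypothetical names introduced here. It maps a resource URI and an HTTP method to an action and attaches a `Cache-Control` header to read responses.

```python
import json

# In-memory stand-in for a product store (hypothetical sample data).
PRODUCTS = {"1001": {"id": "1001", "title": "Sample item"}}


def handle(method, path, body=None):
    """Dispatch one request; returns (status, headers, body_text)."""
    parts = [p for p in path.split("/") if p]
    # Resource-oriented URI: only /api/products/{id} is recognized.
    if len(parts) != 3 or parts[:2] != ["api", "products"]:
        return 404, {}, json.dumps({"error": "not found"})
    product_id = parts[2]

    if method == "GET":
        product = PRODUCTS.get(product_id)
        if product is None:
            return 404, {}, json.dumps({"error": "not found"})
        # Cacheability: clients may reuse this response for 10 minutes.
        headers = {"Content-Type": "application/json",
                   "Cache-Control": "max-age=600"}
        return 200, headers, json.dumps(product)
    if method == "PUT":
        # The ID always comes from the URI, never from the request body.
        PRODUCTS[product_id] = {**(body or {}), "id": product_id}
        return (200, {"Content-Type": "application/json"},
                json.dumps(PRODUCTS[product_id]))
    if method == "DELETE":
        existed = PRODUCTS.pop(product_id, None) is not None
        return (204 if existed else 404), {}, ""
    # Any other method is not supported on this resource.
    return 405, {"Allow": "GET, PUT, DELETE"}, ""
```

Because the function keeps no per-client state, each call can be understood from its arguments alone, which is the statelessness property in miniature.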

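Designing the Taobao Product Detail Collection Service

1. Requirements Analysis

The Taobao product detail data to collect includes:

Basic information: product ID, title, price, sales volume, stock
Media: main images, detail images
Specifications: color, size, SKU
Seller information: shop name, rating, location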
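
One way to pin these requirements down is a typed record. The following is a sketch under assumptions: the class names `ProductDetail` and `Seller` are hypothetical, the field names mirror the categories listed above, and modeling sales and stock as integers is a choice made here for illustration.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class Seller:
    name: str = ""
    rating: str = ""
    location: str = ""


@dataclass
class ProductDetail:
    product_id: str
    title: str = ""
    price: str = ""  # kept as text exactly as displayed on the page
    sales: int = 0
    stock: int = 0
    main_images: List[str] = field(default_factory=list)
    detail_images: List[str] = field(default_factory=list)
    skus: List[str] = field(default_factory=list)
    seller: Seller = field(default_factory=Seller)


item = ProductDetail(product_id="123456", title="Sample item", price="19.90")
record = asdict(item)  # nested dataclasses become plain dicts, ready for JSON
```

Defaults make missing fields explicit: a page that yields no seller block still produces a record with an empty `seller` entry instead of a missing key.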

2. API Endpoint Design

Following RESTful principles, the service exposes these endpoints:

plaintext

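GET /api/products/{product_id} - get the details of a single product
GET /api/products - get a batch of products (designed to support pagination and filtering; the implementation below accepts ?ids=id1,id2,id3 only)
GET /api/products/{product_id}/reviews - get product reviews (not implemented in the code below)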
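
The list endpoint mentions pagination. A minimal sketch of what that could look like follows; the helper name `paginate`, the parameter names `page` and `per_page`, and the cap of 100 items per page are assumptions for illustration.

```python
def paginate(items, page=1, per_page=20, max_per_page=100):
    """Return one page of items plus the metadata needed to fetch the rest."""
    page = max(1, page)  # treat page numbers below 1 as the first page
    per_page = max(1, min(per_page, max_per_page))  # clamp the page size
    total = len(items)
    start = (page - 1) * per_page
    return {
        "page": page,
        "per_page": per_page,
        "total": total,
        "total_pages": (total + per_page - 1) // per_page,  # ceiling division
        "data": items[start:start + per_page],
    }
```

Returning `total` and `total_pages` alongside the data lets a client decide whether another request is needed without guessing.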

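3. Data Collection Implementation

Taobao product data can be collected in two ways:

Official API (recommended: compliant and stable)
Web scraping (requires handling anti-scraping mechanisms; pay attention to compliance)

The code below takes the web scraping approach and is for learning and reference only. The CSS class names it relies on reflect one particular page layout and will need updating whenever the markup differs.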

from flask import Flask, jsonify, request
import requests
from bs4 import BeautifulSoup
import re
import time
from cachetools import TTLCache
from fake_useragent import UserAgent

app = Flask(__name__)
# In-memory cache: entries expire after 10 minutes; at most 1000 products
cache = TTLCache(maxsize=1000, ttl=600)

# Random User-Agent generator
ua = UserAgent()

def get_taobao_product(product_id):
    """Fetch and parse the detail data for one Taobao product."""
    # Serve from the cache when possible
    if product_id in cache:
        return cache[product_id]
    
    # Build the product detail page URL
    url = f"https://item.taobao.com/item.htm?id={product_id}"
    
    # Browser-like request headers
    headers = {
        "User-Agent": ua.random,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "Connection": "keep-alive",
        "Referer": "https://www.taobao.com/"
    }
    
    try:
        # Send the request; treat HTTP error statuses as failures
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        response.encoding = "gbk"  # Taobao pages are usually GBK-encoded
        soup = BeautifulSoup(response.text, "html.parser")
        
        # Extract the product fields
        product_data = {
            "product_id": product_id,
            "title": extract_title(soup),
            "price": extract_price(soup),
            "sales": extract_sales(soup),
            "stock": extract_stock(soup),
            "main_images": extract_main_images(soup),
            "detail_images": extract_detail_images(soup),
            "skus": extract_skus(soup),
            "seller": extract_seller_info(soup),
            "collected_at": time.strftime("%Y-%m-%d %H:%M:%S")
        }
        
        # Store in the cache
        cache[product_id] = product_data
        return product_data
        
    except Exception as e:
        app.logger.error(f"Failed to fetch product {product_id}: {e}")
        return None

def extract_title(soup):
    """Extract the product title."""
    title_tag = soup.find("h3", class_="tb-main-title")
    return title_tag.get_text(strip=True) if title_tag else ""

def extract_price(soup):
    """Extract the product price."""
    price_tag = soup.find("em", class_="tb-rmb-num")
    return price_tag.get_text() if price_tag else ""

def extract_sales(soup):
    """Extract the sales volume."""
    sales_tag = soup.find("div", class_="tb-sell-counter")
    if sales_tag:
        sales_text = sales_tag.get_text()
        # Pull the first run of digits out of the text
        match = re.search(r'(\d+)', sales_text)
        return match.group(1) if match else "0"
    return "0"

def extract_stock(soup):
    """Extract the stock count."""
    # Stock information usually lives in an inline script
    for script in soup.find_all("script"):
        match = re.search(r'"Stock":(\d+)', str(script))
        if match:
            return match.group(1)
    return "0"

def extract_main_images(soup):
    """Extract the main product images."""
    main_image_tags = soup.find_all("img", class_="J_ItemImg")
    images = []
    for img in main_image_tags:
        img_url = img.get("data-src") or img.get("src")
        if img_url:
            # Complete protocol-relative URLs
            if img_url.startswith("//"):
                img_url = "https:" + img_url
            images.append(img_url)
    return images

def extract_detail_images(soup):
    """Extract the images from the product description."""
    detail_div = soup.find("div", id="description")
    if not detail_div:
        return []
    
    img_tags = detail_div.find_all("img")
    images = []
    for img in img_tags:
        img_url = img.get("src")
        if img_url:
            if img_url.startswith("//"):
                img_url = "https:" + img_url
            images.append(img_url)
    return images

def extract_skus(soup):
    """Extract the product specifications."""
    # Simplified; real SKU extraction is more involved
    sku_info = []
    # Try to read values from the specification tags
    sku_labels = soup.find_all("dd", class_="item")
    for label in sku_labels:
        sku_name = label.get("data-value")
        if sku_name:
            sku_info.append(sku_name)
    return sku_info

def extract_seller_info(soup):
    """Extract the seller information."""
    seller_name_tag = soup.find("div", class_="tb-seller-name")
    seller_name = seller_name_tag.get_text(strip=True) if seller_name_tag else ""
    
    seller_rating_tag = soup.find("span", class_="rate")
    seller_rating = seller_rating_tag.get_text() if seller_rating_tag else ""
    
    location_tag = soup.find("div", class_="tb-p4p-location")
    location = location_tag.get_text() if location_tag else ""
    
    return {
        "name": seller_name,
        "rating": seller_rating,
        "location": location
    }

@app.route('/api/products/<string:product_id>', methods=['GET'])
def get_product(product_id):
    """Return the details of a single product."""
    product_data = get_taobao_product(product_id)
    
    if product_data:
        return jsonify({
            "status": "success",
            "data": product_data
        }), 200
    else:
        return jsonify({
            "status": "error",
            "message": f"Could not fetch information for product {product_id}"
        }), 404

@app.route('/api/products', methods=['GET'])
def get_products():
    """Return the details of several products at once."""
    product_ids = request.args.get('ids', '').split(',')
    if not product_ids or product_ids == ['']:
        return jsonify({
            "status": "error",
            "message": "Please provide product IDs in the form ?ids=id1,id2,id3"
        }), 400
    
    result = []
    for pid in product_ids:
        if pid:  # skip empty values
            data = get_taobao_product(pid)
            if data:
                result.append(data)
            # Pause between requests to avoid hitting the site too frequently
            time.sleep(1)
    
    return jsonify({
        "status": "success",
        "count": len(result),
        "data": result
    }), 200

if __name__ == '__main__':
    # Development settings only: debug mode bound to all interfaces is unsafe in production
    app.run(debug=True, host='0.0.0.0', port=5000)
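
The batch endpoint above splits the `ids` parameter on commas and passes each piece straight into the page URL. A stricter parser is sketched below under assumptions: `parse_product_ids` is a hypothetical helper, item IDs are assumed to be purely numeric, and the 20-digit and 20-item limits are arbitrary values chosen for illustration.

```python
import re

# Assumption: item IDs are ASCII digits only; the length cap is arbitrary.
_ID_PATTERN = re.compile(r"[0-9]{1,20}")
MAX_BATCH = 20  # hypothetical per-request limit


def parse_product_ids(raw):
    """Split a comma-separated `ids` value into unique, validated IDs."""
    ids = []
    for part in raw.split(","):
        pid = part.strip()
        if not pid:
            continue  # tolerate empty segments such as "1,,2"
        if not _ID_PATTERN.fullmatch(pid):
            raise ValueError(f"invalid product ID: {pid!r}")
        if pid not in ids:
            ids.append(pid)  # drop duplicates, keep first-seen order
    if len(ids) > MAX_BATCH:
        raise ValueError(f"at most {MAX_BATCH} IDs per request")
    return ids
```

A route could call this helper and turn a `ValueError` into a 400 response, so that malformed input never reaches the outbound request.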

4. Installing Dependencies

Running the code above requires the following dependencies:

bash

pip install flask requests beautifulsoup4 fake-useragent cachetools

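Optimization and Extensions

1. Dealing with Anti-Scraping Mechanisms

Taobao enforces strict anti-scraping mechanisms. The following measures can improve stability:

Use a proxy IP pool to avoid IP bans
Control the request rate to mimic human browsing behavior
Refresh the User-Agent list regularly
Handle CAPTCHAs (for example, by integrating a third-party solving service)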
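
Of these measures, request rate control is the simplest to show in code. The sketch below is one possible shape, not a prescribed design: the `Throttle` class and its parameters are hypothetical, and the clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import random
import time


class Throttle:
    """Enforce a minimum, slightly randomized gap between requests."""

    def __init__(self, min_interval=1.0, jitter=0.5,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next request may start

    def wait(self):
        """Block until a request is allowed, then reserve the next slot."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._clock()
        # Random jitter avoids a perfectly regular request rhythm.
        self._next_allowed = (now + self.min_interval
                              + random.uniform(0, self.jitter))
```

Calling `throttle.wait()` immediately before each outbound request spaces requests only when they would otherwise arrive too close together, unlike an unconditional fixed sleep.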

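2. Performance Optimization

Multi-level caching: in-memory cache (TTLCache) plus a persistent cache (Redis)
Asynchronous requests: replace requests with aiohttp to increase concurrency
Pagination: paginate batch requests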
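
The multi-level caching idea can be sketched as a read-through lookup: check process memory first, then a shared store. This is an illustration under assumptions: `TwoTierCache` and `DictStore` are hypothetical names, and `DictStore` is a plain dictionary standing in for Redis so the example runs without external services.

```python
import time


class TwoTierCache:
    """Check a small in-process cache first, then a slower shared store."""

    def __init__(self, backing_store, ttl=600, clock=time.monotonic):
        self._memory = {}  # key -> (expires_at, value)
        self._backing = backing_store  # needs get(key) and set(key, value)
        self._ttl = ttl
        self._clock = clock

    def get(self, key):
        entry = self._memory.get(key)
        if entry is not None:
            if entry[0] > self._clock():
                return entry[1]  # fresh hit in the memory tier
            del self._memory[key]  # expired: drop and fall through
        value = self._backing.get(key)
        if value is not None:
            # Promote to the memory tier so the next read is local.
            self._memory[key] = (self._clock() + self._ttl, value)
        return value

    def set(self, key, value):
        self._memory[key] = (self._clock() + self._ttl, value)
        self._backing.set(key, value)


class DictStore:
    """Stand-in for Redis in this sketch; it never expires entries."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value
```

In a real deployment the second tier would apply its own expiry and serialize values; both are left out here to keep the lookup order visible.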

3. Error Handling and Monitoring

Thorough logging
Request retries
Monitoring of API response time and success rate
Alerting on exceptions
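
A retry mechanism is commonly paired with exponential backoff so that repeated failures slow down instead of piling up. The helper below is a minimal sketch; the name `retry` and its parameters are hypothetical, and `sleep` is injectable so the delays can be inspected in a test.

```python
import time


def retry(func, attempts=3, base_delay=1.0, retry_on=(Exception,),
          sleep=time.sleep):
    """Call func(); on a listed exception wait base_delay * 2**n and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

For example, `retry(lambda: get_page(url), retry_on=(ConnectionError,))` would retry only connection failures, where `get_page` is a placeholder for whatever function performs the request.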

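Compliance Considerations

When collecting Taobao product data, pay particular attention to the following:

Respect robots.txt
Avoid high-frequency requests that could disrupt the site's normal operation
Do not use collected data for commercial purposes or in ways that infringe on others' rights
Prefer the official APIs provided by the Taobao Open Platform (such as the Taobao Union (淘宝联盟) API)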
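
The robots.txt check can be automated with Python's standard library. The sketch below parses rules from a string so it runs offline; `SAMPLE_ROBOTS` is invented content for illustration and does not reflect any real site's rules, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only to illustrate the check.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Against a live site, the same parser can load the rules itself through `set_url()` followed by `read()`; a collector would run the check before each request and skip disallowed URLs.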

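Summary

This article presented a Taobao product detail collection approach built on RESTful API design principles, with basic data extraction and an API service. Real-world use calls for further extension and optimization to fit specific needs, along with strict adherence to applicable laws, regulations, and site rules.

The approach can be extended into a full e-commerce data platform that supports multi-platform collection, data analysis, and visualization to inform e-commerce operating decisions.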