Python in Practice: A Step-by-Step Guide to Building a Complete Walmart Product Data Scraper

Hello everyone. As developers, we often face complex data-acquisition requirements. In this article I use Walmart, the global retail giant, as an example and walk through how to build an efficient, stable product data scraper in Python. We cover the project structure, the core scraping logic, anti-scraping countermeasures (proxy IPs, CAPTCHA handling), and data cleaning and storage, with complete code and technical commentary to help you master the core techniques of e-commerce data collection.

A Complete Guide to Building a Walmart Scraper: An Efficient Product Data Collection System in Python

A Walmart scraper is an important tool for e-commerce data collection: it helps sellers, analysts, and developers automatically gather product information, price data, and market trends from the Walmart platform. In a fiercely competitive e-commerce landscape, up-to-date product data is critical for marketing strategy, price optimization, and competitor analysis. This article explains in detail how to build a fully featured Walmart scraping system in Python, from basic setup through advanced optimization.

Why Build a Walmart Scraper

Before diving into the implementation, let's look at the core value of building a Walmart scraper. As one of the world's largest retailers, Walmart lists millions of products, with frequent price changes and constant promotions. For e-commerce practitioners, timely access to this data enables:

  • Competitor price monitoring: track competitors' pricing strategies in real time
  • Market trend analysis: understand best-selling products and consumer preferences
  • Inventory management: adjust purchasing plans based on supply and demand data
  • Marketing strategy: plan campaigns around promotional activity

Collecting this data manually is not only inefficient but also error-prone, which is exactly where Python Walmart data scraping comes in.

Technical Preparation and Environment Setup

1. Development Environment Configuration

First, make sure your system has Python 3.7 or later installed. We will use the following core libraries to build our Walmart product information crawler:

# requirements.txt
requests==2.31.0
beautifulsoup4==4.12.2
selenium==4.15.0
pandas==2.1.3
fake-useragent==1.4.0
python-dotenv==1.0.0

Install the dependencies:

pip install -r requirements.txt

2. Basic Project Structure

walmart_scraper/
├── config/
│   ├── __init__.py
│   └── settings.py
├── scrapers/
│   ├── __init__.py
│   ├── base_scraper.py
│   └── walmart_scraper.py
├── utils/
│   ├── __init__.py
│   ├── proxy_handler.py
│   └── data_processor.py
├── data/
│   └── output/
├── main.py
└── requirements.txt
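
The structure above references a config/settings.py that is never shown. As a minimal, hypothetical sketch (every value below is a placeholder to adjust to your own setup), it can centralize the constants shared by the other modules:

# config/settings.py -- hypothetical sketch, all values are placeholders
import os

BASE_URL = "https://www.walmart.com"

# Request pacing (seconds) used by the scrapers' random_delay calls
MIN_DELAY = 1
MAX_DELAY = 3

# Retry behaviour
MAX_RETRIES = 3
REQUEST_TIMEOUT = 10

# Output locations
OUTPUT_DIR = os.path.join("data", "output")
LOG_FILE = "scraper.log"

# Optional path to a newline-separated proxy list; None disables proxies
PROXY_FILE = os.environ.get("PROXY_FILE")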

Core Scraper Component Development

1. Base Scraper Class Design

Let's start by creating a base scraper class:

# scrapers/base_scraper.py
import requests
import time
import random
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import logging

class BaseScraper:
    def __init__(self):
        self.session = requests.Session()
        self.ua = UserAgent()
        self.setup_logging()
        
    def setup_logging(self):
        logging.basicConfig(
            level=logging.INFO,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler('scraper.log'),
                logging.StreamHandler()
            ]
        )
        self.logger = logging.getLogger(__name__)
    
    def get_headers(self):
        """Generate randomized request headers"""
        return {
            'User-Agent': self.ua.random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
        }
    
    def random_delay(self, min_delay=1, max_delay=3):
        """Random delay to reduce the chance of being detected"""
        delay = random.uniform(min_delay, max_delay)
        time.sleep(delay)
    
    def make_request(self, url, max_retries=3):
        """Send an HTTP request with a retry mechanism"""
        for attempt in range(max_retries):
            try:
                headers = self.get_headers()
                response = self.session.get(url, headers=headers, timeout=10)
                response.raise_for_status()
                return response
            except requests.RequestException as e:
                self.logger.warning(f"Request failed (attempt {attempt + 1}/{max_retries}): {e}")
                if attempt < max_retries - 1:
                    self.random_delay(2, 5)
                else:
                    self.logger.error(f"All request attempts failed: {url}")
                    raise

2. Walmart-Specific Scraper Implementation

Next, implement a scraper class dedicated to Walmart:

# scrapers/walmart_scraper.py
from .base_scraper import BaseScraper
from bs4 import BeautifulSoup
import json
import re
from urllib.parse import urljoin, urlparse, parse_qs

class WalmartScraper(BaseScraper):
    def __init__(self):
        super().__init__()
        self.base_url = "https://www.walmart.com"
        
    def search_products(self, keyword, page=1, max_results=50):
        """Search the product listing for a keyword"""
        search_url = f"{self.base_url}/search?q={keyword}&page={page}"
        
        try:
            response = self.make_request(search_url)
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Extract the product list
            products = self.extract_product_list(soup)
            self.logger.info(f"Successfully extracted {len(products)} products")
            
            return products[:max_results]
            
        except Exception as e:
            self.logger.error(f"Product search failed: {e}")
            return []
    
    def extract_product_list(self, soup):
        """Extract product information from a search results page"""
        products = []
        
        # Locate the product containers
        product_containers = soup.find_all('div', {'data-automation-id': 'product-tile'})
        
        for container in product_containers:
            try:
                product_data = self.extract_single_product(container)
                if product_data:
                    products.append(product_data)
            except Exception as e:
                self.logger.warning(f"Failed to extract a product: {e}")
                continue
                
        return products
    
    def extract_single_product(self, container):
        """Extract the details of a single product tile"""
        product = {}
        
        try:
            # Product title
            title_elem = container.find('span', {'data-automation-id': 'product-title'})
            product['title'] = title_elem.get_text(strip=True) if title_elem else ''
            
            # Price information
            price_elem = container.find('div', {'data-automation-id': 'product-price'})
            if price_elem:
                price_text = price_elem.get_text(strip=True)
                product['price'] = self.clean_price(price_text)
            
            # Product link
            link_elem = container.find('a', href=True)
            if link_elem:
                product['url'] = urljoin(self.base_url, link_elem['href'])
                # Extract the product ID from the URL
                product['product_id'] = self.extract_product_id(product['url'])
            
            # Rating information
            rating_elem = container.find('span', class_=re.compile(r'.*rating.*'))
            if rating_elem:
                rating_text = rating_elem.get('aria-label', '')
                product['rating'] = self.extract_rating(rating_text)
            
            # Image
            img_elem = container.find('img')
            if img_elem:
                product['image_url'] = img_elem.get('src', '')
            
            # Seller information
            seller_elem = container.find('span', string=re.compile(r'Sold by'))
            if seller_elem:
                product['seller'] = seller_elem.get_text(strip=True)
            
            return product if product.get('title') else None
            
        except Exception as e:
            self.logger.warning(f"Failed to parse product data: {e}")
            return None
    
    def get_product_details(self, product_url):
        """Fetch information from a product detail page"""
        try:
            response = self.make_request(product_url)
            soup = BeautifulSoup(response.content, 'html.parser')
            
            details = {}
            
            # Extract JSON-LD data from script tags
            script_tags = soup.find_all('script', {'type': 'application/ld+json'})
            for script in script_tags:
                try:
                    json_data = json.loads(script.string)
                    if '@type' in json_data and json_data['@type'] == 'Product':
                        details.update(self.parse_product_json(json_data))
                        break
                except json.JSONDecodeError:
                    continue
            
            # Product description
            desc_elem = soup.find('div', {'data-automation-id': 'product-highlights'})
            if desc_elem:
                details['description'] = desc_elem.get_text(strip=True)
            
            # Stock status
            stock_elem = soup.find('div', {'data-automation-id': 'fulfillment-section'})
            if stock_elem:
                details['in_stock'] = 'in stock' in stock_elem.get_text().lower()
            
            return details
            
        except Exception as e:
            self.logger.error(f"Failed to fetch product details: {e}")
            return {}
    
    def clean_price(self, price_text):
        """Clean up the price text"""
        if not price_text:
            return None
        
        # Extract digits and decimal point
        price_match = re.search(r'\$?(\d+\.?\d*)', price_text.replace(',', ''))
        return float(price_match.group(1)) if price_match else None
    
    def extract_product_id(self, url):
        """Extract the product ID from a URL"""
        try:
            parsed_url = urlparse(url)
            path_parts = parsed_url.path.split('/')
            for part in path_parts:
                if part.isdigit():
                    return part
        except:
            pass
        return None
    
    def extract_rating(self, rating_text):
        """Extract the numeric rating value"""
        rating_match = re.search(r'(\d+\.?\d*)', rating_text)
        return float(rating_match.group(1)) if rating_match else None
    
    def parse_product_json(self, json_data):
        """Parse the product JSON-LD data"""
        details = {}
        
        if 'name' in json_data:
            details['full_name'] = json_data['name']
        
        if 'offers' in json_data:
            offer = json_data['offers']
            if isinstance(offer, list):
                offer = offer[0]
            
            details['availability'] = offer.get('availability', '')
            details['currency'] = offer.get('priceCurrency', 'USD')
            
            if 'price' in offer:
                details['detailed_price'] = float(offer['price'])
        
        if 'aggregateRating' in json_data:
            rating_data = json_data['aggregateRating']
            details['average_rating'] = float(rating_data.get('ratingValue', 0))
            details['review_count'] = int(rating_data.get('reviewCount', 0))
        
        return details
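
Before layering on proxy support, the class can be exercised on its own. A quick usage sketch (the file name, keyword, and result count below are arbitrary examples):

# example_search.py -- quick manual test of WalmartScraper
from scrapers.walmart_scraper import WalmartScraper

scraper = WalmartScraper()

# Search the first results page for a sample keyword
products = scraper.search_products("wireless headphones", max_results=10)
for product in products:
    print(product.get("title"), product.get("price"))

# Fetch the detail page of the first hit, if any
if products and products[0].get("url"):
    details = scraper.get_product_details(products[0]["url"])
    print(details)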

Countering Anti-Scraping Measures

1. IP Proxy Pool Integration

Modern e-commerce sites deploy sophisticated anti-scraping systems. To build a stable automated Walmart scraping system, we integrate an IP proxy pool:

# utils/proxy_handler.py
import requests
import random
import threading
from queue import Queue
import time

class ProxyHandler:
    def __init__(self, proxy_list=None):
        self.proxy_queue = Queue()
        self.failed_proxies = set()
        self.proxy_stats = {}
        self.lock = threading.Lock()
        
        if proxy_list:
            self.load_proxies(proxy_list)
    
    def load_proxies(self, proxy_list):
        """Load the proxy list"""
        for proxy in proxy_list:
            self.proxy_queue.put(proxy)
            self.proxy_stats[proxy] = {'success': 0, 'failed': 0}
    
    def get_proxy(self):
        """Get an available proxy"""
        with self.lock:
            while not self.proxy_queue.empty():
                proxy = self.proxy_queue.get()
                if proxy not in self.failed_proxies:
                    return proxy
        return None
    
    def test_proxy(self, proxy, test_url="http://httpbin.org/ip"):
        """Test whether a proxy is usable"""
        try:
            proxies = {
                'http': f'http://{proxy}',
                'https': f'http://{proxy}'  # most pool proxies expect the http:// scheme for both
            }
            
            response = requests.get(
                test_url, 
                proxies=proxies, 
                timeout=10,
                headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
            )
            
            if response.status_code == 200:
                self.mark_proxy_success(proxy)
                return True
                
        except requests.RequestException:
            pass
        
        self.mark_proxy_failed(proxy)
        return False
    
    def mark_proxy_success(self, proxy):
        """Mark a proxy as successful"""
        with self.lock:
            if proxy in self.proxy_stats:
                self.proxy_stats[proxy]['success'] += 1
            # Put successful proxies back into the queue
            self.proxy_queue.put(proxy)
    
    def mark_proxy_failed(self, proxy):
        """Mark a proxy as failed"""
        with self.lock:
            if proxy in self.proxy_stats:
                self.proxy_stats[proxy]['failed'] += 1
            
            # Blacklist proxies that fail too many times
            if self.proxy_stats.get(proxy, {}).get('failed', 0) > 3:
                self.failed_proxies.add(proxy)

# scrapers/walmart_scraper.py (continued) -- proxy-enabled scraper class
# Defined alongside WalmartScraper so that main.py can import it from scrapers.walmart_scraper
from utils.proxy_handler import ProxyHandler

class WalmartScraperWithProxy(WalmartScraper):
    def __init__(self, proxy_list=None):
        super().__init__()
        self.proxy_handler = ProxyHandler(proxy_list) if proxy_list else None
    
    def make_request(self, url, max_retries=3):
        """Send a request through the proxy pool (overrides BaseScraper.make_request
        so that search_products and get_product_details automatically use proxies)"""
        for attempt in range(max_retries):
            proxy = self.proxy_handler.get_proxy() if self.proxy_handler else None
            
            try:
                headers = self.get_headers()
                proxies = None
                
                if proxy:
                    proxies = {
                        'http': f'http://{proxy}',
                        'https': f'http://{proxy}'  # http:// scheme works for most pool proxies
                    }
                
                response = self.session.get(
                    url, 
                    headers=headers, 
                    proxies=proxies,
                    timeout=15
                )
                response.raise_for_status()
                
                if proxy and self.proxy_handler:
                    self.proxy_handler.mark_proxy_success(proxy)
                
                return response
                
            except requests.RequestException as e:
                if proxy and self.proxy_handler:
                    self.proxy_handler.mark_proxy_failed(proxy)
                
                self.logger.warning(f"Proxy request failed {proxy}: {e}")
                self.random_delay(3, 7)
        
        raise Exception(f"All proxy requests failed: {url}")
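
A short sketch of how the proxy-enabled scraper might be wired up (the proxy addresses below are placeholders, not working endpoints):

# example_proxy_usage.py -- hypothetical wiring of the proxy pool
from scrapers.walmart_scraper import WalmartScraperWithProxy

proxy_list = [
    "192.168.1.100:8080",  # placeholder, replace with real proxies
    "203.0.113.10:3128",   # placeholder
]

scraper = WalmartScraperWithProxy(proxy_list)
products = scraper.search_products("laptop", max_results=20)
print(f"Fetched {len(products)} products through the proxy pool")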

2. CAPTCHA Recognition and Handling

The Walmart site may present CAPTCHA challenges, so we integrate a CAPTCHA-solving service:

# utils/captcha_solver.py
import base64
import logging
import time
import requests
from PIL import Image
import io

class CaptchaSolver:
    def __init__(self, api_key=None, service='2captcha'):
        self.api_key = api_key
        self.service = service
        self.base_url = 'http://2captcha.com' if service == '2captcha' else None
        self.logger = logging.getLogger(__name__)
    
    def solve_image_captcha(self, image_data):
        """Solve an image CAPTCHA"""
        if not self.api_key:
            self.logger.warning("No CAPTCHA service API key configured")
            return None
        
        try:
            # Submit the CAPTCHA
            submit_url = f"{self.base_url}/in.php"
            
            files = {'file': ('captcha.png', image_data, 'image/png')}
            data = {
                'key': self.api_key,
                'method': 'post'
            }
            
            response = requests.post(submit_url, files=files, data=data)
            result = response.text
            
            if 'OK|' in result:
                captcha_id = result.split('|')[1]
                return self.get_captcha_result(captcha_id)
            
        except Exception as e:
            self.logger.error(f"CAPTCHA solving failed: {e}")
        
        return None
    
    def get_captcha_result(self, captcha_id, max_wait=120):
        """Poll for the CAPTCHA solving result"""
        result_url = f"{self.base_url}/res.php"
        
        for _ in range(max_wait // 5):
            try:
                response = requests.get(result_url, params={
                    'key': self.api_key,
                    'action': 'get',
                    'id': captcha_id
                })
                
                result = response.text
                
                if result == 'CAPCHA_NOT_READY':
                    time.sleep(5)
                    continue
                elif 'OK|' in result:
                    return result.split('|')[1]
                else:
                    break
                    
            except Exception as e:
                self.logger.error(f"Failed to fetch CAPTCHA result: {e}")
                break
        
        return None
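
How the solver is wired into the scraping flow depends on how the block page is detected. The sketch below is assumption-heavy: it presumes a Selenium session, a block page that mentions "Robot or human", and a challenge rendered as a plain image, which is only one of the challenge types Walmart may serve:

# Hypothetical integration sketch: detect a block page and hand the image to the solver
from selenium.webdriver.common.by import By
from utils.captcha_solver import CaptchaSolver

def handle_possible_captcha(driver, api_key):
    """Return the solved CAPTCHA text if a challenge image is present, otherwise None."""
    solver = CaptchaSolver(api_key=api_key)

    # Heuristic detection: Walmart block pages often mention "Robot or human"
    if "robot or human" not in driver.page_source.lower():
        return None

    # Assumes the challenge is a plain <img> we can screenshot; real pages may differ
    images = driver.find_elements(By.TAG_NAME, "img")
    if not images:
        return None

    image_bytes = images[0].screenshot_as_png
    return solver.solve_image_captcha(image_bytes)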

Data Processing and Storage

1. Data Cleaning and Normalization

# utils/data_processor.py
import pandas as pd
import re
import logging
from datetime import datetime
import json

class DataProcessor:
    def __init__(self):
        self.cleaned_data = []
        self.logger = logging.getLogger(__name__)
    
    def clean_product_data(self, raw_products):
        """Clean the raw product data"""
        cleaned_products = []
        
        for product in raw_products:
            cleaned_product = {}
            
            # Clean the title
            title = product.get('title', '').strip()
            cleaned_product['title'] = self.clean_title(title)
            
            # Normalize the price
            price = product.get('price')
            cleaned_product['price_usd'] = self.standardize_price(price)
            
            # Normalize the URL
            url = product.get('url', '')
            cleaned_product['product_url'] = self.clean_url(url)
            
            # Normalize the rating
            rating = product.get('rating')
            cleaned_product['rating_score'] = self.standardize_rating(rating)
            
            # Add a timestamp
            cleaned_product['scraped_at'] = datetime.now().isoformat()
            
            # Product ID
            cleaned_product['product_id'] = product.get('product_id', '')
            
            # Image URL
            cleaned_product['image_url'] = product.get('image_url', '')
            
            # Seller
            cleaned_product['seller'] = product.get('seller', 'Walmart')
            
            if cleaned_product['title']:  # keep only products with a title
                cleaned_products.append(cleaned_product)
        
        return cleaned_products
    
    def clean_title(self, title):
        """Clean a product title"""
        if not title:
            return ''
        
        # Remove extra whitespace
        title = re.sub(r'\s+', ' ', title).strip()
        
        # Remove special characters but keep basic punctuation
        title = re.sub(r'[^\w\s\-\(\)\[\]&,.]', '', title)
        
        return title[:200]  # cap the length
    
    def standardize_price(self, price):
        """Normalize the price"""
        if price is None:
            return None
        
        if isinstance(price, str):
            # Strip the currency symbol and commas
            price_clean = re.sub(r'[$,]', '', price)
            try:
                return float(price_clean)
            except ValueError:
                return None
        
        return float(price) if price else None
    
    def clean_url(self, url):
        """Clean the URL"""
        if not url:
            return ''
        
        # Strip tracking parameters
        if '?' in url:
            base_url = url.split('?')[0]
            return base_url
        
        return url
    
    def standardize_rating(self, rating):
        """Normalize the rating"""
        if rating is None:
            return None
        
        try:
            rating_float = float(rating)
            # Clamp the rating to the 0-5 range
            return max(0, min(5, rating_float))
        except (ValueError, TypeError):
            return None
    
    def save_to_excel(self, products, filename):
        """Save the products to an Excel file"""
        if not products:
            self.logger.warning("No data to save")
            return
        
        df = pd.DataFrame(products)
        
        # Reorder the columns
        column_order = [
            'product_id', 'title', 'price_usd', 'rating_score', 
            'seller', 'product_url', 'image_url', 'scraped_at'
        ]
        
        df = df.reindex(columns=column_order)
        
        # Write to Excel
        with pd.ExcelWriter(filename, engine='openpyxl') as writer:
            df.to_excel(writer, sheet_name='Products', index=False)
            
            # Add a statistics sheet
            stats_df = pd.DataFrame({
                'Metric': ['Total products', 'Average price', 'Max price', 'Min price', 'Average rating'],
                'Value': [
                    len(df),
                    df['price_usd'].mean() if df['price_usd'].notna().any() else 0,
                    df['price_usd'].max() if df['price_usd'].notna().any() else 0,
                    df['price_usd'].min() if df['price_usd'].notna().any() else 0,
                    df['rating_score'].mean() if df['rating_score'].notna().any() else 0
                ]
            })
            stats_df.to_excel(writer, sheet_name='Statistics', index=False)
        
        print(f"Data saved to {filename}")
    
    def save_to_json(self, products, filename):
        """Save the products to a JSON file"""
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(products, f, ensure_ascii=False, indent=2)
        
        print(f"JSON data saved to {filename}")

2. Complete Main Program

Now let's integrate all of the components into a complete Walmart product list scraping tool:

# main.py
import argparse
import sys
import os
from datetime import datetime
from scrapers.walmart_scraper import WalmartScraperWithProxy
from utils.data_processor import DataProcessor
from utils.captcha_solver import CaptchaSolver
import logging

class WalmartScrapingManager:
    def __init__(self, proxy_list=None, captcha_api_key=None):
        self.scraper = WalmartScraperWithProxy(proxy_list)
        self.data_processor = DataProcessor()
        self.captcha_solver = CaptchaSolver(captcha_api_key) if captcha_api_key else None
        self.logger = logging.getLogger(__name__)
    
    def scrape_products(self, keywords, max_products_per_keyword=50, output_format='excel'):
        """Scrape product data for a batch of keywords"""
        all_products = []
        
        for keyword in keywords:
            self.logger.info(f"Scraping keyword: {keyword}")
            
            try:
                # Search the product listing
                products = self.scraper.search_products(
                    keyword=keyword,
                    max_results=max_products_per_keyword
                )
                
                # Fetch detailed information
                detailed_products = []
                for i, product in enumerate(products):
                    if product.get('url'):
                        try:
                            details = self.scraper.get_product_details(product['url'])
                            product.update(details)
                            detailed_products.append(product)
                            
                            # Tag the record with the search keyword
                            product['search_keyword'] = keyword
                            
                            self.logger.info(f"Processed {i+1}/{len(products)} products")
                            
                            # Random delay
                            self.scraper.random_delay(1, 3)
                            
                        except Exception as e:
                            self.logger.warning(f"Failed to fetch product details: {e}")
                            continue
                
                all_products.extend(detailed_products)
                self.logger.info(f"Keyword '{keyword}' done, {len(detailed_products)} products collected")
                
            except Exception as e:
                self.logger.error(f"Scraping keyword '{keyword}' failed: {e}")
                continue
        
        # Clean the data
        cleaned_products = self.data_processor.clean_product_data(all_products)
        
        # Save the data
        self.save_results(cleaned_products, output_format)
        
        return cleaned_products
    
    def save_results(self, products, output_format):
        """Save the scraping results"""
        if not products:
            self.logger.warning("No data to save")
            return
        
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        
        if output_format.lower() == 'excel':
            filename = f"data/output/walmart_products_{timestamp}.xlsx"
            self.data_processor.save_to_excel(products, filename)
        elif output_format.lower() == 'json':
            filename = f"data/output/walmart_products_{timestamp}.json"
            self.data_processor.save_to_json(products, filename)
        else:
            # Save in both formats
            excel_filename = f"data/output/walmart_products_{timestamp}.xlsx"
            json_filename = f"data/output/walmart_products_{timestamp}.json"
            self.data_processor.save_to_excel(products, excel_filename)
            self.data_processor.save_to_json(products, json_filename)

def main():
    parser = argparse.ArgumentParser(description='Walmart product data scraping tool')
    parser.add_argument('--keywords', nargs='+', required=True, help='List of search keywords')
    parser.add_argument('--max-products', type=int, default=50, help='Maximum number of products per keyword')
    parser.add_argument('--output-format', choices=['excel', 'json', 'both'], default='excel', help='Output format')
    parser.add_argument('--proxy-file', help='Path to a proxy list file')
    parser.add_argument('--captcha-api-key', help='API key for the CAPTCHA solving service')
    
    args = parser.parse_args()
    
    # Make sure the output directory exists
    os.makedirs('data/output', exist_ok=True)
    
    # Load the proxy list
    proxy_list = None
    if args.proxy_file and os.path.exists(args.proxy_file):
        with open(args.proxy_file, 'r') as f:
            proxy_list = [line.strip() for line in f if line.strip()]
    
    # Create the scraping manager
    scraper_manager = WalmartScrapingManager(
        proxy_list=proxy_list,
        captcha_api_key=args.captcha_api_key
    )
    
    # Start scraping
    try:
        products = scraper_manager.scrape_products(
            keywords=args.keywords,
            max_products_per_keyword=args.max_products,
            output_format=args.output_format
        )
        
        print(f"\nScraping finished! Collected {len(products)} products in total")
        
        # Show summary statistics
        if products:
            prices = [p['price_usd'] for p in products if p.get('price_usd')]
            ratings = [p['rating_score'] for p in products if p.get('rating_score')]
            
            print(f"Price: average ${sum(prices)/len(prices):.2f}" if prices else "No price data")
            print(f"Rating: average {sum(ratings)/len(ratings):.2f}" if ratings else "No rating data")
    
    except KeyboardInterrupt:
        print("\nScraping interrupted by user")
    except Exception as e:
        print(f"An error occurred during scraping: {e}")
        sys.exit(1)

if __name__ == "__main__":
    main()

Common Challenges and Solutions

1. Dynamically Loaded Content

Modern e-commerce sites rely heavily on JavaScript to load content dynamically. In those cases we use Selenium to render the pages:

# scrapers/selenium_scraper.py
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
import undetected_chromedriver as uc

class SeleniumWalmartScraper:
    def __init__(self, headless=True, proxy=None):
        self.setup_driver(headless, proxy)
        
    def setup_driver(self, headless=True, proxy=None):
        """Configure the browser driver"""
        options = uc.ChromeOptions()
        
        if headless:
            options.add_argument('--headless')
        
        # Anti-detection settings
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        options.add_argument('--disable-blink-features=AutomationControlled')
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        options.add_experimental_option('useAutomationExtension', False)
        
        # Proxy settings
        if proxy:
            options.add_argument(f'--proxy-server={proxy}')
        
        # User agent
        options.add_argument('--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
        
        self.driver = uc.Chrome(options=options)
        
        # Run an anti-detection script
        self.driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    
    def scrape_with_javascript(self, url, wait_selector=None):
        """Scrape dynamically rendered content with Selenium"""
        try:
            self.driver.get(url)
            
            # Wait for a specific element to load
            if wait_selector:
                WebDriverWait(self.driver, 10).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
                )
            
            # Scroll the page to trigger lazy loading
            self.scroll_page()
            
            # Grab the page source
            html_content = self.driver.page_source
            return html_content
            
        except Exception as e:
            print(f"Selenium scraping failed: {e}")
            return None
    
    def scroll_page(self):
        """Scroll the page to trigger lazy loading"""
        last_height = self.driver.execute_script("return document.body.scrollHeight")
        
        while True:
            # Scroll to the bottom of the page
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            
            # Wait for new content to load
            time.sleep(2)
            
            # Measure the new page height
            new_height = self.driver.execute_script("return document.body.scrollHeight")
            
            if new_height == last_height:
                break
            
            last_height = new_height
    
    def close(self):
        """Close the browser"""
        if hasattr(self, 'driver'):
            self.driver.quit()
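
A usage sketch that hands the rendered HTML back to the BeautifulSoup-based extraction logic; the wait selector is an assumption and may need adjusting to Walmart's current markup:

# example_selenium_usage.py -- render a search page, then reuse the HTML parser
from bs4 import BeautifulSoup
from scrapers.selenium_scraper import SeleniumWalmartScraper
from scrapers.walmart_scraper import WalmartScraper

selenium_scraper = SeleniumWalmartScraper(headless=True)
try:
    html = selenium_scraper.scrape_with_javascript(
        "https://www.walmart.com/search?q=coffee+maker",
        wait_selector="[data-automation-id='product-tile']",  # assumed selector
    )
    if html:
        soup = BeautifulSoup(html, "html.parser")
        products = WalmartScraper().extract_product_list(soup)
        print(f"Extracted {len(products)} products from the rendered page")
finally:
    selenium_scraper.close()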

2. Distributed Scraper Architecture

For large-scale scraping we can build a distributed scraper:

# distributed/task_manager.py
import redis
import json
import uuid
from datetime import datetime, timedelta

class TaskManager:
    def __init__(self, redis_host='localhost', redis_port=6379, redis_db=0):
        self.redis_client = redis.Redis(host=redis_host, port=redis_port, db=redis_db)
        self.task_queue = 'walmart_scrape_tasks'
        self.result_queue = 'walmart_scrape_results'
        
    def add_task(self, keyword, max_products=50, priority=1):
        """Enqueue a scraping task"""
        task_id = str(uuid.uuid4())
        task_data = {
            'task_id': task_id,
            'keyword': keyword,
            'max_products': max_products,
            'priority': priority,
            'created_at': datetime.now().isoformat(),
            'status': 'pending'
        }
        
        # Use a priority queue (Redis sorted set)
        self.redis_client.zadd(self.task_queue, {json.dumps(task_data): priority})
        return task_id
    
    def get_task(self):
        """Fetch a pending task"""
        # Pop the highest-priority task
        task_data = self.redis_client.zpopmax(self.task_queue)
        
        if task_data:
            task_json = task_data[0][0].decode('utf-8')
            return json.loads(task_json)
        
        return None
    
    def save_result(self, task_id, products, status='completed'):
        """Store the scraping result"""
        result_data = {
            'task_id': task_id,
            'products': products,
            'status': status,
            'completed_at': datetime.now().isoformat(),
            'product_count': len(products)
        }
        
        self.redis_client.lpush(self.result_queue, json.dumps(result_data))
    
    def get_results(self, limit=10):
        """Fetch scraping results"""
        results = []
        for _ in range(limit):
            result_data = self.redis_client.rpop(self.result_queue)
            if result_data:
                results.append(json.loads(result_data.decode('utf-8')))
            else:
                break
        
        return results

# distributed/worker.py
import time
import logging
from task_manager import TaskManager
from scrapers.walmart_scraper import WalmartScraperWithProxy

class ScrapingWorker:
    def __init__(self, worker_id, proxy_list=None):
        self.worker_id = worker_id
        self.task_manager = TaskManager()
        self.scraper = WalmartScraperWithProxy(proxy_list)
        self.logger = logging.getLogger(f'Worker-{worker_id}')
        
    def run(self):
        """Main loop of the worker process"""
        self.logger.info(f"Worker {self.worker_id} started")
        
        while True:
            try:
                # Fetch a task
                task = self.task_manager.get_task()
                
                if task:
                    self.logger.info(f"Processing task: {task['task_id']}")
                    self.process_task(task)
                else:
                    # Sleep when there is no task
                    time.sleep(5)
                    
            except KeyboardInterrupt:
                self.logger.info("Worker stopped")
                break
            except Exception as e:
                self.logger.error(f"Worker error: {e}")
                time.sleep(10)
    
    def process_task(self, task):
        """Process a single scraping task"""
        try:
            keyword = task['keyword']
            max_products = task['max_products']
            
            # Run the scrape
            products = self.scraper.search_products(keyword, max_results=max_products)
            
            # Save the result
            self.task_manager.save_result(
                task['task_id'], 
                products, 
                'completed'
            )
            
            self.logger.info(f"Task {task['task_id']} completed, {len(products)} products scraped")
            
        except Exception as e:
            self.logger.error(f"Task processing failed: {e}")
            self.task_manager.save_result(
                task['task_id'], 
                [], 
                'failed'
            )
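
A minimal sketch of how the pieces fit together: a producer enqueues keyword tasks and a worker consumes them (the script is hypothetical; it assumes it is run from the distributed/ directory and that Redis is reachable with the default settings above):

# distributed/run_example.py -- hypothetical producer/worker wiring
from task_manager import TaskManager
from worker import ScrapingWorker

# Producer side: enqueue a few keyword tasks with different priorities
manager = TaskManager()
for keyword, priority in [("wireless earbuds", 2), ("air fryer", 1)]:
    task_id = manager.add_task(keyword, max_products=30, priority=priority)
    print(f"queued task {task_id} for '{keyword}'")

# Worker side: normally started as a separate process or on another machine
worker = ScrapingWorker(worker_id=1)
worker.run()  # loops until interrupted, pulling tasks from Redis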

3. Monitoring and Alerting

# monitoring/scraper_monitor.py
import psutil
import time
import smtplib
from email.mime.text import MIMEText
from datetime import datetime, timedelta

class ScraperMonitor:
    def __init__(self, email_config=None):
        self.email_config = email_config
        self.performance_log = []
        
    def monitor_performance(self):
        """Monitor system performance"""
        cpu_percent = psutil.cpu_percent(interval=1)
        memory_percent = psutil.virtual_memory().percent
        disk_percent = psutil.disk_usage('/').percent
        
        performance_data = {
            'timestamp': datetime.now(),
            'cpu_percent': cpu_percent,
            'memory_percent': memory_percent,
            'disk_percent': disk_percent
        }
        
        self.performance_log.append(performance_data)
        
        # Check whether an alert is needed
        if cpu_percent > 80 or memory_percent > 80:
            self.send_alert(f"High system resource usage: CPU {cpu_percent}%, memory {memory_percent}%")
        
        return performance_data
    
    def send_alert(self, message):
        """Send an alert email"""
        if not self.email_config:
            print(f"Alert: {message}")
            return
        
        try:
            msg = MIMEText(f"Walmart scraper system alert\n\n{message}\n\nTime: {datetime.now()}")
            msg['Subject'] = 'Scraper system alert'
            msg['From'] = self.email_config['from']
            msg['To'] = self.email_config['to']
            
            server = smtplib.SMTP(self.email_config['smtp_server'], self.email_config['smtp_port'])
            server.starttls()
            server.login(self.email_config['username'], self.email_config['password'])
            server.send_message(msg)
            server.quit()
            
            print(f"Alert email sent: {message}")
            
        except Exception as e:
            print(f"Failed to send alert email: {e}")
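
The monitor is meant to run alongside the scraping process, for example in a small loop in a background thread; the 60-second sampling interval below is an arbitrary choice:

# Hypothetical monitoring loop, run in a separate thread or process
import time
from monitoring.scraper_monitor import ScraperMonitor

monitor = ScraperMonitor()  # pass email_config to enable mail alerts

while True:
    stats = monitor.monitor_performance()
    print(f"CPU {stats['cpu_percent']}%  MEM {stats['memory_percent']}%")
    time.sleep(60)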

Advanced Optimization Techniques

1. Smart Retry Mechanism

# utils/retry_handler.py
import time
import random
from functools import wraps

def smart_retry(max_retries=3, base_delay=1, backoff_factor=2, jitter=True):
    """Smart retry decorator with exponential backoff and jitter"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    
                    if attempt < max_retries - 1:
                        # Compute the delay
                        delay = base_delay * (backoff_factor ** attempt)
                        
                        # Add random jitter
                        if jitter:
                            delay += random.uniform(0, delay * 0.1)
                        
                        print(f"Retry {attempt + 1}/{max_retries}, retrying in {delay:.2f}s")
                        time.sleep(delay)
                    else:
                        print(f"All retries failed, last exception: {e}")
            
            raise last_exception
        
        return wrapper
    return decorator
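
The decorator can wrap any flaky call. A hedged usage sketch (fetch_search_page is an illustrative helper, not part of the classes above):

# Hypothetical usage of the smart_retry decorator
import requests
from utils.retry_handler import smart_retry

@smart_retry(max_retries=4, base_delay=2, backoff_factor=2)
def fetch_search_page(url):
    """Fetch a page, raising on HTTP errors so the decorator retries."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

html = fetch_search_page("https://www.walmart.com/search?q=keyboard")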

2. Data Deduplication and Caching

# utils/cache_manager.py
import hashlib
import json
import os
from datetime import datetime, timedelta

class CacheManager:
    def __init__(self, cache_dir='cache', expire_hours=24):
        self.cache_dir = cache_dir
        self.expire_hours = expire_hours
        os.makedirs(cache_dir, exist_ok=True)
    
    def get_cache_key(self, url):
        """Generate a cache key"""
        return hashlib.md5(url.encode()).hexdigest()
    
    def get_cache_file(self, cache_key):
        """Get the cache file path"""
        return os.path.join(self.cache_dir, f"{cache_key}.json")
    
    def is_cache_valid(self, cache_file):
        """Check whether the cache entry is still valid"""
        if not os.path.exists(cache_file):
            return False
        
        file_time = datetime.fromtimestamp(os.path.getmtime(cache_file))
        expire_time = datetime.now() - timedelta(hours=self.expire_hours)
        
        return file_time > expire_time
    
    def get_cached_data(self, url):
        """Fetch cached data"""
        cache_key = self.get_cache_key(url)
        cache_file = self.get_cache_file(cache_key)
        
        if self.is_cache_valid(cache_file):
            try:
                with open(cache_file, 'r', encoding='utf-8') as f:
                    return json.load(f)
            except Exception:
                pass
        
        return None
    
    def save_to_cache(self, url, data):
        """Save data to the cache"""
        cache_key = self.get_cache_key(url)
        cache_file = self.get_cache_file(cache_key)
        
        try:
            with open(cache_file, 'w', encoding='utf-8') as f:
                json.dump(data, f, ensure_ascii=False, indent=2)
        except Exception as e:
            print(f"Failed to write cache: {e}")

class DataDeduplicator:
    def __init__(self):
        self.seen_products = set()
    
    def is_duplicate(self, product):
        """Check whether a product is a duplicate"""
        # Build a unique identifier from the product ID and title
        identifier = f"{product.get('product_id', '')}-{product.get('title', '')}"
        identifier_hash = hashlib.md5(identifier.encode()).hexdigest()
        
        if identifier_hash in self.seen_products:
            return True
        
        self.seen_products.add(identifier_hash)
        return False
    
    def deduplicate_products(self, products):
        """Deduplicate the product list"""
        unique_products = []
        
        for product in products:
            if not self.is_duplicate(product):
                unique_products.append(product)
        
        print(f"Before dedup: {len(products)} products, after dedup: {len(unique_products)} products")
        return unique_products
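
Both utilities slot in around the request/parse cycle: the cache short-circuits repeat detail-page fetches and the deduplicator runs before data cleaning. A sketch under those assumptions:

# Hypothetical wiring of CacheManager and DataDeduplicator into the scrape flow
from utils.cache_manager import CacheManager, DataDeduplicator
from scrapers.walmart_scraper import WalmartScraper

cache = CacheManager(cache_dir="cache", expire_hours=12)
deduper = DataDeduplicator()
scraper = WalmartScraper()

products = scraper.search_products("standing desk", max_results=20)

for product in products:
    url = product.get("url")
    if not url:
        continue
    details = cache.get_cached_data(url)  # reuse a recent result if present
    if details is None:
        details = scraper.get_product_details(url)
        cache.save_to_cache(url, details)
    product.update(details)

unique_products = deduper.deduplicate_products(products)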

Performance Optimization and Scaling

1. Asynchronous Concurrent Processing

# async_scraper.py
import asyncio
import aiohttp
from aiohttp import ClientTimeout

class AsyncWalmartScraper:
    def __init__(self, max_concurrent=10):
        self.max_concurrent = max_concurrent
        self.semaphore = asyncio.Semaphore(max_concurrent)
        
    async def fetch_page(self, session, url):
        """Fetch a page asynchronously"""
        async with self.semaphore:
            try:
                timeout = ClientTimeout(total=30)
                async with session.get(url, timeout=timeout) as response:
                    if response.status == 200:
                        return await response.text()
                    else:
                        print(f"HTTP error {response.status}: {url}")
            except Exception as e:
                print(f"Request failed: {e}")
            
            return None
    
    async def scrape_multiple_urls(self, urls):
        """Scrape multiple URLs concurrently"""
        async with aiohttp.ClientSession() as session:
            tasks = [self.fetch_page(session, url) for url in urls]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            
            # Keep only successful results
            successful_results = [r for r in results if isinstance(r, str)]
            print(f"Successfully scraped {len(successful_results)}/{len(urls)} pages")
            
            return successful_results
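
Driving the class is a thin asyncio wrapper; the product URLs below are placeholders:

# Hypothetical driver for AsyncWalmartScraper (URLs are placeholders)
import asyncio
from async_scraper import AsyncWalmartScraper

async def main():
    scraper = AsyncWalmartScraper(max_concurrent=5)
    urls = [
        "https://www.walmart.com/ip/example-product-1",
        "https://www.walmart.com/ip/example-product-2",
    ]
    pages = await scraper.scrape_multiple_urls(urls)
    # Each entry in pages is raw HTML that can be parsed with BeautifulSoup
    return pages

if __name__ == "__main__":
    asyncio.run(main())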

Real-World Usage Scenarios

Usage Examples

# Basic usage
python main.py --keywords "wireless headphones" "bluetooth speaker" --max-products 30

# With proxies
python main.py --keywords "laptop" --proxy-file proxies.txt --output-format both

# Large-scale scraping
python main.py --keywords "electronics" "home garden" "sports" --max-products 100 --output-format json

Example Proxy Configuration File (proxies.txt)

192.168.1.100:8080
203.123.45.67:3128
104.248.63.15:30588
167.172.180.46:41258

Why Choose a Professional API Service

Although we have covered how to build a fully featured Walmart scraping system, building and maintaining your own scraper for real business use comes with significant challenges:

High maintenance cost: e-commerce sites update their anti-scraping defenses frequently, requiring ongoing engineering effort to adapt and optimize.

Legal and compliance risk: improper scraping can carry legal exposure and calls for professional compliance guidance.

Heavy infrastructure investment: stable proxy services, CAPTCHA solving, and a distributed architecture all require substantial spending.

Data quality assurance: guaranteeing the accuracy, completeness, and timeliness of the data requires a dedicated quality-control process.

Pangolin Scrape API: A Professional E-commerce Data Solution

If you focus on Walmart operations and product selection and would rather hand the data collection work to a specialized team, a product such as the Scrape API is a good fit.

Key Features

Maintenance-free intelligent parsing: products like the Scrape API use intelligent recognition algorithms that automatically adapt to page structure changes on Walmart and other e-commerce platforms, so developers do not need to track DOM updates.

Rich data fields: supports scraping comprehensive product information such as product ID, images, title, rating, review count, size, color, description, price, and stock status.

Multiple invocation modes: offers both synchronous and asynchronous API calls to suit different business scenarios.

Quick Integration Example

Taking the Pangolin Scrape API as an example, scraping Walmart product information is straightforward:

import requests
import json

# Authenticate and obtain a token
auth_url = "http://scrapeapi.pangolinfo.com/api/v1/auth"
auth_data = {
    "email": "your_email@gmail.com",
    "password": "your_password"
}

response = requests.post(auth_url, json=auth_data)
token = response.json()['data']

# Scrape Walmart product details
scrape_url = "http://scrapeapi.pangolinfo.com/api/v1"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {token}"
}

scrape_data = {
    "url": "https://www.walmart.com/ip/your-product-url",
    "parserName": "walmProductDetail",
    "formats": ["json"]
}

result = requests.post(scrape_url, headers=headers, json=scrape_data)
product_data = result.json()

Summary

This article has walked through building a professional-grade Walmart scraping system in Python, from environment setup to advanced optimization. We covered the key techniques for countering anti-scraping measures, processing data, and scaling out with a distributed architecture, along with extensive code examples.

Building your own scraper allows deep customization, but it also brings challenges in maintenance, compliance, and cost. For companies that want to focus on the business itself, a professional service such as the Scrape API can deliver the required data more efficiently while avoiding the technical pitfalls.

Whether you build your own system or use a professional service, the key is to make an informed decision based on your business needs, technical capability, and available resources. In a data-driven e-commerce era, accurate and timely market information is what gives you the competitive edge.

As the old saying goes, "To do a good job, one must first sharpen one's tools." Choosing the right data collection approach will take you much further on your e-commerce journey.
