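Real-Time Monitoring System for Amazon Best Seller Rankings: A Complete Technical Implementation from Data Collection to Intelligent Analysis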

Technical Implementation of an Amazon Rankings Monitoring System

Preface

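In e-commerce data analytics, collecting Amazon ranking data is both technically demanding and commercially valuable. This article walks through building an efficient, stable monitoring system for Amazon Best Seller rankings, covering the full stack: data collection, storage, analysis, and alerting.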

Architecture Overview

Overall System Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Collection    │───▶│   Processing    │───▶│ Analysis/Alerts │
│  Data Collector │    │ Data Processor  │    │  Alert System   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Proxy Pool    │    │     Storage     │    │     Web UI      │
│  Proxy Manager  │    │  Data Storage   │    │  Web Dashboard  │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Core Technology Stack

  • Data collection: aiohttp + asyncio (asynchronous crawling)
  • Data parsing: BeautifulSoup4 + lxml
  • Data storage: MongoDB + Redis
  • Task scheduling: Celery + Redis
  • Data analysis: pandas + numpy + scikit-learn
  • Web framework: FastAPI
  • Monitoring and alerting: Prometheus + Grafana

Data Collection Module

Asynchronous HTTP Client

import asyncio
import aiohttp
import random
from typing import List, Dict, Optional
from dataclasses import dataclass
from datetime import datetime
import logging

@dataclass
class ProxyConfig:
    host: str
    port: int
    username: Optional[str] = None
    password: Optional[str] = None

class AmazonRankingCollector:
    def __init__(self, proxy_list: List[ProxyConfig] = None):
        self.proxy_list = proxy_list or []
        self.session = None
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        ]
        
    async def __aenter__(self):
        await self.init_session()
        return self
        
    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.session:
            await self.session.close()
    
    async def init_session(self):
        """Initialize the HTTP session."""
        connector = aiohttp.TCPConnector(
            limit=100,
            limit_per_host=10,
            ttl_dns_cache=300,
            use_dns_cache=True,
        )
        
        timeout = aiohttp.ClientTimeout(
            total=30,
            connect=10,
            sock_read=10
        )
        
        self.session = aiohttp.ClientSession(
            connector=connector,
            timeout=timeout,
            headers={
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                'Accept-Language': 'en-US,en;q=0.5',
                'Accept-Encoding': 'gzip, deflate, br',
                'DNT': '1',
                'Connection': 'keep-alive',
                'Upgrade-Insecure-Requests': '1',
            }
        )
    
    def get_random_headers(self) -> Dict[str, str]:
        """Build randomized request headers."""
        return {
            'User-Agent': random.choice(self.user_agents),
            'X-Forwarded-For': f"{random.randint(1,255)}.{random.randint(1,255)}.{random.randint(1,255)}.{random.randint(1,255)}",
        }
    
    def get_proxy_url(self) -> Optional[str]:
        """Pick a random proxy and return its URL, or None if no proxies are configured."""
        if not self.proxy_list:
            return None
            
        proxy = random.choice(self.proxy_list)
        if proxy.username and proxy.password:
            return f"http://{proxy.username}:{proxy.password}@{proxy.host}:{proxy.port}"
        else:
            return f"http://{proxy.host}:{proxy.port}"
    
    async def fetch_page(self, url: str, max_retries: int = 3) -> Optional[str]:
        """Fetch the page content, retrying up to max_retries times."""
        for attempt in range(max_retries):
            try:
                headers = self.get_random_headers()
                proxy = self.get_proxy_url()
                
                async with self.session.get(
                    url, 
                    headers=headers, 
                    proxy=proxy,
                    ssl=False
                ) as response:
                    
                    if response.status == 200:
                        content = await response.text()
                        return content
                    elif response.status == 503:
                        # Service unavailable: wait, then retry
                        await asyncio.sleep(random.uniform(5, 10))
                        continue
                    else:
                        logging.warning(f"HTTP {response.status} for {url}")
                        
            except asyncio.TimeoutError:
                logging.warning(f"Timeout for {url}, attempt {attempt + 1}")
                await asyncio.sleep(random.uniform(2, 5))
            except Exception as e:
                logging.error(f"Error fetching {url}: {e}")
                await asyncio.sleep(random.uniform(1, 3))
        
        return None
    
    async def collect_bestseller_ranking(self, category: str, marketplace: str = 'com') -> Dict:
        """Collect Best Seller ranking data for one category."""
        url = f"https://www.amazon.{marketplace}/gp/bestsellers/{category}"
        
        html_content = await self.fetch_page(url)
        if not html_content:
            return {}
        
        return self.parse_bestseller_page(html_content, category, marketplace)
    
    def parse_bestseller_page(self, html_content: str, category: str, marketplace: str) -> Dict:
        """Parse a Best Seller page."""
        from bs4 import BeautifulSoup
        import re
        
        soup = BeautifulSoup(html_content, 'lxml')
        products = []
        
        # Find the product containers
        product_containers = soup.find_all('div', {'data-asin': True})
        
        for container in product_containers:
            try:
                asin = container.get('data-asin')
                if not asin:
                    continue
                
                # Extract the rank
                rank_elem = container.find('span', class_='zg-bdg-text')
                rank = None
                if rank_elem:
                    rank_text = rank_elem.get_text(strip=True)
                    rank_match = re.search(r'#(\d+)', rank_text)
                    if rank_match:
                        rank = int(rank_match.group(1))
                
                # Extract the title
                title_elem = container.find('div', class_='p13n-sc-truncate')
                title = title_elem.get_text(strip=True) if title_elem else ''
                
                # Extract the price (kept as raw text, e.g. "$19.99")
                price_elem = container.find('span', class_='p13n-sc-price')
                price = price_elem.get_text(strip=True) if price_elem else ''
                
                # Extract the rating
                rating_elem = container.find('span', class_='a-icon-alt')
                rating = None
                if rating_elem:
                    rating_text = rating_elem.get_text()
                    rating_match = re.search(r'(\d+\.?\d*)', rating_text)
                    if rating_match:
                        rating = float(rating_match.group(1))
                
                # Extract the review count
                review_elem = container.find('a', class_='a-size-small')
                review_count = None
                if review_elem:
                    review_text = review_elem.get_text()
                    review_match = re.search(r'([\d,]+)', review_text)
                    if review_match:
                        review_count = int(review_match.group(1).replace(',', ''))
                
                if rank and asin:
                    products.append({
                        'asin': asin,
                        'rank': rank,
                        'title': title,
                        'price': price,
                        'rating': rating,
                        'review_count': review_count,
                        'category': category,
                        'marketplace': marketplace,
                        'timestamp': datetime.now().isoformat()
                    })
                    
            except Exception as e:
                logging.error(f"Error parsing product: {e}")
                continue
        
        return {
            'category': category,
            'marketplace': marketplace,
            'timestamp': datetime.now().isoformat(),
            'products': sorted(products, key=lambda x: x['rank']),
            'total_products': len(products)
        }
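
The parser above stores the price as raw text. For later analysis it is usually easier to work with a numeric value. The sketch below shows one way to normalize it. Note that `parse_price` is a hypothetical helper, not part of the original collector, and it assumes "." is the decimal separator and "," the thousands separator, which holds for amazon.com and amazon.co.uk but not for every marketplace.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical helper (not in the original collector). Assumes "." is the
# decimal separator and "," is a thousands separator.
_PRICE_RE = re.compile(r'(\d[\d,]*(?:\.\d+)?)')

def parse_price(price_text: str) -> Optional[Decimal]:
    """Return the first numeric amount found in price_text, or None."""
    match = _PRICE_RE.search(price_text or '')
    if not match:
        return None
    # Drop thousands separators before converting
    return Decimal(match.group(1).replace(',', ''))

print(parse_price('$1,299.99'))
print(parse_price('£24.50'))
print(parse_price(''))
```

Using `Decimal` rather than `float` avoids rounding surprises when prices are compared or summed later.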

Batch Collection Manager

class BatchCollectionManager:
    def __init__(self, collector: AmazonRankingCollector):
        self.collector = collector
        self.semaphore = asyncio.Semaphore(10)  # cap the number of concurrent requests
        
    async def collect_multiple_categories(self, 
                                        categories: List[str], 
                                        marketplaces: List[str]) -> Dict:
        """Collect data for multiple categories across multiple marketplaces."""
        tasks = []
        
        for marketplace in marketplaces:
            for category in categories:
                task = self.collect_with_semaphore(category, marketplace)
                tasks.append(task)
        
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        # Aggregate the results
        collection_results = {
            'timestamp': datetime.now().isoformat(),
            'total_tasks': len(tasks),
            'successful_tasks': 0,
            'failed_tasks': 0,
            'data': {}
        }
        
        for i, result in enumerate(results):
            marketplace = marketplaces[i // len(categories)]
            category = categories[i % len(categories)]
            
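            if isinstance(result, Exception):
                collection_results['failed_tasks'] += 1
                logging.error(f"Failed to collect {marketplace}/{category}: {result}")
            elif not result:
                # An empty dict means the page could not be fetched
                collection_results['failed_tasks'] += 1
                logging.error(f"No data returned for {marketplace}/{category}")
            else: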
                collection_results['successful_tasks'] += 1
                if marketplace not in collection_results['data']:
                    collection_results['data'][marketplace] = {}
                collection_results['data'][marketplace][category] = result
        
        return collection_results
    
    async def collect_with_semaphore(self, category: str, marketplace: str) -> Dict:
        """Collect one category while holding the semaphore."""
        async with self.semaphore:
            # Random delay so that requests are not bunched together
            await asyncio.sleep(random.uniform(0.5, 2.0))
            return await self.collector.collect_bestseller_ranking(category, marketplace)
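
To see how the semaphore caps concurrency, here is a minimal, self-contained sketch that replaces the network call with a short sleep and records the peak number of jobs in flight. The function name `run_limited` and the stub job are illustrative assumptions, not part of the system above.

```python
import asyncio

async def run_limited(n_tasks: int, limit: int) -> int:
    """Run n_tasks stub jobs with at most `limit` in flight.

    Returns the peak concurrency that was observed.
    """
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def job() -> None:
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for a network request
            active -= 1

    # All jobs are scheduled at once; the semaphore decides who runs
    await asyncio.gather(*(job() for _ in range(n_tasks)))
    return peak

if __name__ == '__main__':
    print(asyncio.run(run_limited(n_tasks=25, limit=10)))
```

The sketch creates the semaphore inside the running coroutine. On Python versions before 3.10, asyncio primitives bind to an event loop at construction time, so creating them before `asyncio.run` starts can lead to "attached to a different loop" errors.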

Data Storage and Management

MongoDB Data Model

import logging
from datetime import datetime, timedelta
from typing import Dict

import pandas as pd
from pymongo import MongoClient

class RankingDataManager:
    def __init__(self, mongo_uri: str = "mongodb://localhost:27017/", db_name: str = "amazon_rankings"):
        self.client = MongoClient(mongo_uri)
        self.db = self.client[db_name]
        self.rankings_collection = self.db.rankings
        self.products_collection = self.db.products
        
        # Create indexes
        self.create_indexes()
    
    def create_indexes(self):
        """Create database indexes."""
        # Compound index for ranking lookups
        self.rankings_collection.create_index([
            ("asin", 1),
            ("category", 1),
            ("marketplace", 1),
            ("timestamp", -1)
        ])
        
        # Unique index for product info
        self.products_collection.create_index([("asin", 1)], unique=True)
        
        # Index for time-ordered queries
        self.rankings_collection.create_index([("timestamp", -1)])
    
    def save_ranking_data(self, ranking_data: Dict):
        """Save ranking data."""
        try:
            # Save one ranking record per product
            for product in ranking_data.get('products', []):
                ranking_record = {
                    'asin': product['asin'],
                    'rank': product['rank'],
                    'category': ranking_data['category'],
                    'marketplace': ranking_data['marketplace'],
                    'timestamp': datetime.fromisoformat(product['timestamp']),
                    'price': product.get('price'),
                    'rating': product.get('rating'),
                    'review_count': product.get('review_count')
                }
                
                self.rankings_collection.insert_one(ranking_record)
                
                # Upsert the basic product info
                product_info = {
                    'asin': product['asin'],
                    'title': product.get('title'),
                    'last_seen': datetime.fromisoformat(product['timestamp']),
                    'categories': [ranking_data['category']],
                    'marketplaces': [ranking_data['marketplace']]
                }
                
                self.products_collection.update_one(
                    {'asin': product['asin']},
                    {
                        '$set': {
                            'title': product_info['title'],
                            'last_seen': product_info['last_seen']
                        },
                        '$addToSet': {
                            'categories': ranking_data['category'],
                            'marketplaces': ranking_data['marketplace']
                        }
                    },
                    upsert=True
                )
                
        except Exception as e:
            logging.error(f"Error saving ranking data: {e}")
    
    def get_product_ranking_history(self, asin: str, 
                                  category: str, 
                                  marketplace: str, 
                                  days: int = 30) -> pd.DataFrame:
        """Get a product's ranking history."""
        start_date = datetime.now() - timedelta(days=days)
        
        cursor = self.rankings_collection.find(
            {
                'asin': asin,
                'category': category,
                'marketplace': marketplace,
                'timestamp': {'$gte': start_date}
            },
            {'_id': 0}  # drop the ObjectId so records stay JSON-serializable
        ).sort('timestamp', 1)
        
        data = list(cursor)
        if not data:
            return pd.DataFrame()
        
        df = pd.DataFrame(data)
        df['timestamp'] = pd.to_datetime(df['timestamp'])
        return df
    
    def get_category_snapshot(self, category: str, 
                            marketplace: str, 
                            timestamp: datetime = None) -> pd.DataFrame:
        """Get a snapshot of a category at (or shortly before) a given time."""
        if timestamp is None:
            timestamp = datetime.now()
        
        # Take the latest record per ASIN within the two hours before the given time
        pipeline = [
            {
                '$match': {
                    'category': category,
                    'marketplace': marketplace,
                    'timestamp': {
                        '$lte': timestamp,
                        '$gte': timestamp - timedelta(hours=2)
                    }
                }
            },
            {
                '$sort': {'timestamp': -1}
            },
            {
                '$group': {
                    '_id': '$asin',
                    'latest_record': {'$first': '$$ROOT'}
                }
            },
            {
                '$replaceRoot': {'newRoot': '$latest_record'}
            },
            {
                '$sort': {'rank': 1}
            }
        ]
        
        cursor = self.rankings_collection.aggregate(pipeline)
        data = list(cursor)
        
        if not data:
            return pd.DataFrame()
        
        return pd.DataFrame(data)
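
The aggregation pipeline in `get_category_snapshot` sorts by time, keeps the newest record per ASIN, and then orders by rank. As a rough illustration of that logic outside MongoDB, the pure-Python sketch below does the same on an in-memory list. The function `latest_per_asin` and the sample ASINs are made up for the example.

```python
from datetime import datetime
from typing import Dict, List

def latest_per_asin(records: List[Dict]) -> List[Dict]:
    """Keep the newest record for each ASIN, then order the result by rank."""
    latest: Dict[str, Dict] = {}
    for record in records:
        current = latest.get(record['asin'])
        if current is None or record['timestamp'] > current['timestamp']:
            latest[record['asin']] = record
    return sorted(latest.values(), key=lambda r: r['rank'])

# Made-up sample data for illustration only
records = [
    {'asin': 'A1', 'rank': 3, 'timestamp': datetime(2024, 1, 1, 10, 0)},
    {'asin': 'A1', 'rank': 1, 'timestamp': datetime(2024, 1, 1, 10, 30)},
    {'asin': 'B2', 'rank': 2, 'timestamp': datetime(2024, 1, 1, 10, 30)},
]
snapshot = latest_per_asin(records)
print([(r['asin'], r['rank']) for r in snapshot])
```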

Trend Analysis Algorithms

Rank Change Detection

from datetime import datetime
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

class RankingTrendAnalyzer:
    def __init__(self):
        self.trend_thresholds = {
            'stable_variance': 5,
            'rising_slope': -0.5,  # a falling rank number means the product is rising
            'falling_slope': 0.5,
            'volatile_variance': 20
        }
    
    def analyze_product_trend(self, ranking_history: pd.DataFrame) -> Dict:
        """Analyze a product's ranking trend."""
        if len(ranking_history) < 10:
            return {'trend': 'insufficient_data', 'confidence': 0}
        
        rankings = ranking_history['rank'].values
        timestamps = np.arange(len(rankings))
        
        # Compute the basic statistics
        trend_slope = self.calculate_trend_slope(rankings, timestamps)
        variance = np.var(rankings)
        momentum = self.calculate_momentum(rankings)
        volatility = self.calculate_volatility(rankings)
        
        # Classify the trend type
        trend_type = self.classify_trend(trend_slope, variance, momentum)
        confidence = self.calculate_confidence(rankings, trend_type)
        
        # Predict the next rank
        next_rank_prediction = self.predict_next_rank(rankings, trend_slope)
        
        return {
            'trend': trend_type,
            'slope': trend_slope,
            'variance': variance,
            'momentum': momentum,
            'volatility': volatility,
            'confidence': confidence,
            'next_rank_prediction': next_rank_prediction,
            'analysis_timestamp': datetime.now().isoformat()
        }
    
    def calculate_trend_slope(self, rankings: np.ndarray, timestamps: np.ndarray) -> float:
        """Compute the trend slope."""
        if len(rankings) < 2:
            return 0
        
        # Fit a linear regression and use its coefficient as the trend
        model = LinearRegression()
        model.fit(timestamps.reshape(-1, 1), rankings)
        return model.coef_[0]
    
    def calculate_momentum(self, rankings: np.ndarray, window: int = 5) -> float:
        """Compute ranking momentum."""
        if len(rankings) < window * 2:
            return 0
        
        recent_avg = np.mean(rankings[-window:])
        previous_avg = np.mean(rankings[-window*2:-window])
        
        # Lower rank numbers are better, so the sign is inverted: positive momentum means improvement
        momentum = (previous_avg - recent_avg) / previous_avg if previous_avg != 0 else 0
        return momentum
    
    def calculate_volatility(self, rankings: np.ndarray, window: int = 10) -> float:
        """Compute ranking volatility."""
        if len(rankings) < window:
            return np.std(rankings)
        
        # Compute the rolling standard deviation
        rolling_std = []
        for i in range(window, len(rankings) + 1):
            rolling_std.append(np.std(rankings[i-window:i]))
        
        return np.mean(rolling_std) if rolling_std else np.std(rankings)
    
    def classify_trend(self, slope: float, variance: float, momentum: float) -> str:
        """Classify the trend type."""
        # Check directional trends first: a steady climb or decline also has
        # high variance and would otherwise always be labeled 'volatile'.
        if slope <= self.trend_thresholds['rising_slope'] and momentum > 0.1:
            return 'rising'
        elif slope >= self.trend_thresholds['falling_slope'] and momentum < -0.1:
            return 'falling'
        elif variance > self.trend_thresholds['volatile_variance']:
            return 'volatile'
        elif variance <= self.trend_thresholds['stable_variance']:
            return 'stable'
        else:
            return 'neutral'
    
    def calculate_confidence(self, rankings: np.ndarray, trend_type: str) -> float:
        """Compute the confidence of the detected trend."""
        if len(rankings) < 5:
            return 0.0
        
        # Base confidence from the number of data points
        base_confidence = min(len(rankings) / 50, 1.0)
        
        # Adjust by how consistent the trend is
        if trend_type == 'rising':
            # Share of steps where the rank number decreased (improved)
            rising_points = sum(1 for i in range(1, len(rankings)) if rankings[i] < rankings[i-1])
            consistency = rising_points / (len(rankings) - 1)
        elif trend_type == 'falling':
            # Share of steps where the rank number increased (worsened)
            falling_points = sum(1 for i in range(1, len(rankings)) if rankings[i] > rankings[i-1])
            consistency = falling_points / (len(rankings) - 1)
        elif trend_type == 'stable':
            # Share of points within 5 positions of the mean rank
            mean_rank = np.mean(rankings)
            stable_points = sum(1 for rank in rankings if abs(rank - mean_rank) <= 5)
            consistency = stable_points / len(rankings)
        else:
            consistency = 0.5
        
        return base_confidence * consistency
    
    def predict_next_rank(self, rankings: np.ndarray, slope: float) -> int:
        """Predict the next rank."""
        if len(rankings) == 0:
            return 0
        
        current_rank = rankings[-1]
        predicted_rank = current_rank + slope
        
        # Clamp the prediction to a reasonable range
        predicted_rank = max(1, min(predicted_rank, 1000))
        
        return int(round(predicted_rank))
    
    def detect_ranking_anomalies(self, ranking_history: pd.DataFrame) -> List[Dict]:
        """Detect ranking anomalies."""
        if len(ranking_history) < 20:
            return []
        
        rankings = ranking_history['rank'].values
        timestamps = ranking_history['timestamp'].values
        
        # Compute the moving average and standard deviation
        window = 10
        anomalies = []
        
        for i in range(window, len(rankings)):
            window_data = rankings[i-window:i]
            mean_rank = np.mean(window_data)
            std_rank = np.std(window_data)
            
            current_rank = rankings[i]
            z_score = abs(current_rank - mean_rank) / std_rank if std_rank > 0 else 0
            
            # Flag an anomaly when the Z-score exceeds 2
            if z_score > 2:
                anomalies.append({
                    'timestamp': timestamps[i],
                    'rank': current_rank,
                    'expected_rank': mean_rank,
                    'z_score': z_score,
                    'anomaly_type': 'sudden_jump' if current_rank > mean_rank else 'sudden_rise'
                })
        
        return anomalies
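
As a worked example of the slope and momentum calculations, the sketch below recomputes both in plain Python for a product whose rank improves by one position per sample. It is an illustrative reimplementation under the same definitions (least-squares slope over sample indices, momentum over two adjacent windows), not the analyzer class itself. The function names are assumptions.

```python
from typing import Sequence

def least_squares_slope(ranks: Sequence[float]) -> float:
    """Slope of the best-fit line through (0, r0), (1, r1), ..."""
    n = len(ranks)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(ranks) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ranks))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

def momentum(ranks: Sequence[float], window: int = 5) -> float:
    """Relative improvement of the recent window over the previous one."""
    if len(ranks) < window * 2:
        return 0.0
    recent = sum(ranks[-window:]) / window
    previous = sum(ranks[-window * 2:-window]) / window
    # Lower rank numbers are better, so improvement is previous - recent
    return (previous - recent) / previous if previous else 0.0

ranks = [20, 19, 18, 17, 16, 15, 14, 13, 12, 11]
print(least_squares_slope(ranks))  # negative slope: the product is rising
print(momentum(ranks))             # positive momentum: recent ranks are better
```

Here the slope is -1 (one position gained per sample) and the momentum is (18 - 13) / 18, roughly 0.28, so the series clears the `rising_slope` threshold of -0.5 and the 0.1 momentum bar used in `classify_trend`.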

Market Opportunity Detection

class MarketOpportunityDetector:
    def __init__(self, data_manager: RankingDataManager):
        self.data_manager = data_manager
        self.analyzer = RankingTrendAnalyzer()
    
    def detect_breakout_products(self, category: str, marketplace: str, days: int = 7) -> List[Dict]:
        """Detect breakout products."""
        # Get the most recent category snapshot
        recent_snapshot = self.data_manager.get_category_snapshot(category, marketplace)
        
        if recent_snapshot.empty:
            return []
        
        breakout_products = []
        
        for _, product in recent_snapshot.iterrows():
            asin = product['asin']
            
            # Get the product's ranking history
            history = self.data_manager.get_product_ranking_history(
                asin, category, marketplace, days=days*2
            )
            
            if len(history) < days:
                continue
            
            # Analyze the trend
            trend_analysis = self.analyzer.analyze_product_trend(history)
            
            # Check the breakout conditions
            if (trend_analysis['trend'] == 'rising' and 
                trend_analysis['confidence'] > 0.7 and
                trend_analysis['momentum'] > 0.3):
                
                # Measure how much the rank improved
                recent_ranks = history.tail(days)['rank'].values
                previous_ranks = history.head(days)['rank'].values
                
                if len(previous_ranks) > 0 and len(recent_ranks) > 0:
                    improvement = np.mean(previous_ranks) - np.mean(recent_ranks)
                    improvement_pct = improvement / np.mean(previous_ranks) * 100
                    
                    breakout_products.append({
                        'asin': asin,
                        'title': product.get('title', ''),
                        'current_rank': int(product['rank']),  # plain int: numpy.int64 cannot be encoded by BSON/JSON
                        'trend': trend_analysis['trend'],
                        'momentum': trend_analysis['momentum'],
                        'confidence': trend_analysis['confidence'],
                        'rank_improvement': improvement,
                        'improvement_percentage': improvement_pct,
                        'volatility': trend_analysis['volatility']
                    })
        
        # Sort by momentum, strongest first
        breakout_products.sort(key=lambda x: x['momentum'], reverse=True)
        return breakout_products[:20]  # return the top 20
    
    def analyze_category_dynamics(self, category: str, marketplace: str, days: int = 30) -> Dict:
        """Analyze category dynamics."""
        # Collect one historical snapshot per day
        snapshots = []
        for i in range(days):
            date = datetime.now() - timedelta(days=i)
            snapshot = self.data_manager.get_category_snapshot(category, marketplace, date)
            if not snapshot.empty:
                snapshots.append({
                    'date': date,
                    'data': snapshot
                })
        
        if len(snapshots) < 7:
            return {'error': 'Insufficient data'}
        
        # Analyze category metrics
        analysis = {
            'category': category,
            'marketplace': marketplace,
            'analysis_period': days,
            'total_snapshots': len(snapshots),
            'metrics': {}
        }
        
        # Compute the average rank change
        rank_changes = []
        new_entries = set()
        disappeared_products = set()
        
        for i in range(1, len(snapshots)):
            current = snapshots[i-1]['data']
            previous = snapshots[i]['data']
            
            current_asins = set(current['asin'].values)
            previous_asins = set(previous['asin'].values)
            
            # Products that newly entered the list
            new_entries.update(current_asins - previous_asins)
            
            # Products that dropped off the list
            disappeared_products.update(previous_asins - current_asins)
            
            # Rank changes for products present in both snapshots
            for asin in current_asins & previous_asins:
                current_rank = current[current['asin'] == asin]['rank'].iloc[0]
                previous_rank = previous[previous['asin'] == asin]['rank'].iloc[0]
                rank_changes.append(current_rank - previous_rank)
        
        analysis['metrics'] = {
            'average_rank_change': np.mean(rank_changes) if rank_changes else 0,
            'rank_volatility': np.std(rank_changes) if rank_changes else 0,
            'new_entries_count': len(new_entries),
            'disappeared_products_count': len(disappeared_products),
            'turnover_rate': (len(new_entries) + len(disappeared_products)) / (len(snapshots[0]['data']) * 2) * 100
        }
        
        return analysis
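
The turnover rate is easiest to follow on a small example. The sketch below applies the same formula to a single pair of snapshots represented as sets of ASINs. In the method above the new and dropped sets are accumulated over the whole period and divided by the size of the most recent snapshot. The function name and the sample ASINs are made up for illustration.

```python
from typing import Set

def turnover_rate(previous: Set[str], current: Set[str], list_size: int) -> float:
    """Percentage of list positions that changed hands between two snapshots."""
    new_entries = current - previous   # on the list now, absent before
    dropped = previous - current       # on the list before, absent now
    return (len(new_entries) + len(dropped)) / (list_size * 2) * 100

# Made-up sample data for illustration only
previous = {'A1', 'B2', 'C3', 'D4'}
current = {'A1', 'B2', 'E5', 'F6'}
print(turnover_rate(previous, current, list_size=4))
```

Two products entered and two left a four-item list, so half of the list turned over.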

Real-Time Monitoring and Alerting

Celery Task Scheduling

import asyncio
from datetime import datetime, timedelta
from typing import Dict, List

from celery import Celery
from celery.schedules import crontab
import redis

# Celery configuration
app = Celery('amazon_ranking_monitor')
app.config_from_object({
    'broker_url': 'redis://localhost:6379/0',
    'result_backend': 'redis://localhost:6379/0',
    'task_serializer': 'json',
    'accept_content': ['json'],
    'result_serializer': 'json',
    'timezone': 'UTC',
    'enable_utc': True,
})

# Periodic task schedule
app.conf.beat_schedule = {
    'collect-bestseller-rankings': {
        'task': 'tasks.collect_rankings',
        'schedule': crontab(minute='*/30'),  # run every 30 minutes
        'args': (['electronics', 'home-garden', 'sports-outdoors'], ['com', 'co.uk'])
    },
    'analyze-trends': {
        'task': 'tasks.analyze_trends',
        'schedule': crontab(minute=0),  # analyze once an hour, on the hour
    },
    'detect-opportunities': {
        'task': 'tasks.detect_opportunities',
        'schedule': crontab(minute=0, hour='*/6'),  # every 6 hours; without minute=0 it would fire every minute of those hours
    }
}

@app.task(bind=True, max_retries=3)
def collect_rankings(self, categories: List[str], marketplaces: List[str]):
    """Task: collect ranking data."""
    try:
        # Initialize the collector
        collector = AmazonRankingCollector()
        manager = BatchCollectionManager(collector)
        data_manager = RankingDataManager()
        
        # Run the collection
        async def run_collection():
            async with collector:
                results = await manager.collect_multiple_categories(categories, marketplaces)
                
                # Save the data
                for marketplace, marketplace_data in results['data'].items():
                    for category, category_data in marketplace_data.items():
                        if category_data:
                            data_manager.save_ranking_data(category_data)
                
                return results
        
        results = asyncio.run(run_collection())
        
        return {
            'status': 'success',
            'collected_categories': len(categories),
            'collected_marketplaces': len(marketplaces),
            'successful_tasks': results['successful_tasks'],
            'failed_tasks': results['failed_tasks']
        }
        
    except Exception as exc:
        # Retry with exponential backoff
        if self.request.retries < self.max_retries:
            raise self.retry(exc=exc, countdown=60 * (2 ** self.request.retries))
        else:
            return {'status': 'failed', 'error': str(exc)}

@app.task
def analyze_trends():
    """Task: analyze trends."""
    data_manager = RankingDataManager()
    analyzer = RankingTrendAnalyzer()
    
    # Get the products to analyze (seen within the last day)
    products = data_manager.products_collection.find({
        'last_seen': {'$gte': datetime.now() - timedelta(days=1)}
    })
    
    analysis_results = []
    
    for product in products:
        for category in product.get('categories', []):
            for marketplace in product.get('marketplaces', []):
                history = data_manager.get_product_ranking_history(
                    product['asin'], category, marketplace, days=30
                )
                
                if not history.empty:
                    trend_analysis = analyzer.analyze_product_trend(history)
                    
                    # Collect the analysis result
                    analysis_results.append({
                        'asin': product['asin'],
                        'category': category,
                        'marketplace': marketplace,
                        'analysis': trend_analysis,
                        'timestamp': datetime.now()
                    })
    
    # Save the analysis results to the database
    if analysis_results:
        data_manager.db.trend_analysis.insert_many(analysis_results)
    
    return {'analyzed_products': len(analysis_results)}

@app.task
def detect_opportunities():
    """Task: detect market opportunities."""
    data_manager = RankingDataManager()
    detector = MarketOpportunityDetector(data_manager)
    
    categories = ['electronics', 'home-garden', 'sports-outdoors']
    marketplaces = ['com', 'co.uk']
    
    all_opportunities = []
    
    for category in categories:
        for marketplace in marketplaces:
            opportunities = detector.detect_breakout_products(category, marketplace)
            
            for opp in opportunities:
                opp.update({
                    'category': category,
                    'marketplace': marketplace,
                    'detected_at': datetime.now()
                })
                all_opportunities.append(opp)
    
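    # Save the opportunities to the database
    if all_opportunities:
        data_manager.db.opportunities.insert_many(all_opportunities)
        
        # Send alerts for high-priority opportunities. insert_many adds an
        # ObjectId '_id' to each dict, which the JSON task serializer cannot
        # encode, so strip it before queuing the alert task.
        high_priority_opportunities = [
            {k: v for k, v in opp.items() if k != '_id'}
            for opp in all_opportunities 
            if opp['confidence'] > 0.8 and opp['momentum'] > 0.5
        ]
        
        if high_priority_opportunities:
            send_opportunity_alerts.delay(high_priority_opportunities)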
    
    return {'total_opportunities': len(all_opportunities)}

@app.task
def send_opportunity_alerts(opportunities: List[Dict]):
    """Send opportunity alerts."""
    # Email, SMS, DingTalk, or other notification channels can be integrated here
    for opp in opportunities:
        message = f"""
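        High-potential product opportunity detected!
        
        ASIN: {opp['asin']}
        Category: {opp['category']}
        Marketplace: {opp['marketplace']}
        Current rank: {opp['current_rank']}
        Trend: {opp['trend']}
        Confidence: {opp['confidence']:.2f}
        Momentum: {opp['momentum']:.2f}
        
        Consider reviewing this product promptly.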
        """
        
        # Actual notification delivery goes here
        print(message)  # replace with real notification logic
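
The retry logic in `collect_rankings` uses a countdown of `60 * (2 ** retries)` seconds. The short sketch below simply tabulates the resulting delays for the configured `max_retries=3`. The helper name `retry_countdowns` is an assumption introduced for illustration.

```python
from typing import List

def retry_countdowns(max_retries: int = 3, base_seconds: int = 60) -> List[int]:
    """Delay before each retry, doubling every time (exponential backoff)."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]

print(retry_countdowns())
```

With the defaults, a failing task waits 60, 120, and then 240 seconds before its three retries, so a persistent failure is reported roughly seven minutes after the first attempt.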

Web API Endpoints

from fastapi import FastAPI, HTTPException, Query
from fastapi.responses import JSONResponse
from typing import Optional, List
import uvicorn

app = FastAPI(title="Amazon Ranking Monitor API", version="1.0.0")

data_manager = RankingDataManager()
analyzer = RankingTrendAnalyzer()
detector = MarketOpportunityDetector(data_manager)

@app.get("/api/rankings/{asin}")
async def get_product_rankings(
    asin: str,
    category: str = Query(..., description="Product category"),
    marketplace: str = Query(..., description="Amazon marketplace"),
    days: int = Query(30, description="Number of days of history")
):
    """Get a product's ranking history."""
    try:
        history = data_manager.get_product_ranking_history(asin, category, marketplace, days)
        
        if history.empty:
            raise HTTPException(status_code=404, detail="No ranking data found")
        
        # 转换为JSON格式
        history_json = history.to_dict('records')
        
        # 添加趋势分析
        trend_analysis = analyzer.analyze_product_trend(history)
        
        return {
            'asin': asin,
            'category': category,
            'marketplace': marketplace,
            'history': history_json,
            'trend_analysis': trend_analysis
        }
        
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/api/opportunities")
async def get_market_opportunities(
    category: str = Query(..., description="Product category"),
    marketplace: str = Query(..., description="Amazon marketplace"),
    days: int = Query(7, description="Analysis period in days")
):
    """Return detected market opportunities."""
    try:
        opportunities = detector.detect_breakout_products(category, marketplace, days)
        
        return {
            'category': category,
            'marketplace': marketplace,
            'analysis_period': days,
            'opportunities': opportunities,
            'total_count': len(opportunities)
        }
        
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/api/category/analysis")
async def get_category_analysis(
    category: str = Query(..., description="Product category"),
    marketplace: str = Query(..., description="Amazon marketplace"),
    days: int = Query(30, description="Analysis period in days")
):
    """Return the category analysis."""
    try:
        analysis = detector.analyze_category_dynamics(category, marketplace, days)
        
        if 'error' in analysis:
            raise HTTPException(status_code=404, detail=analysis['error'])
        
        return analysis
        
    except HTTPException:
        # Re-raise the 404 above instead of converting it to a 500
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/api/rankings/current/{category}")
async def get_current_rankings(
    category: str,
    marketplace: str = Query(..., description="Amazon marketplace"),
    limit: int = Query(50, description="Number of products to return")
):
    """Return the current rankings for a category."""
    try:
        snapshot = data_manager.get_category_snapshot(category, marketplace)
        
        if snapshot.empty:
            raise HTTPException(status_code=404, detail="No current ranking data found")
        
        # Limit the number of products returned
        snapshot_limited = snapshot.head(limit)
        
        return {
            'category': category,
            'marketplace': marketplace,
            'timestamp': snapshot_limited.iloc[0]['timestamp'].isoformat(),
            'rankings': snapshot_limited.to_dict('records'),
            'total_products': len(snapshot_limited)
        }
        
    except HTTPException:
        # Re-raise the 404 above instead of converting it to a 500
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
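
To show how a client might call these endpoints, here is a small standard-library sketch that builds request URLs for the ranking-history and opportunities routes defined above. The helper names and the `http://localhost:8000` base URL are assumptions for illustration; the paths and query parameters mirror the route definitions.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode


def build_rankings_url(base_url: str, asin: str, category: str,
                       marketplace: str, days: int = 30) -> str:
    """Build the URL for GET /api/rankings/{asin}."""
    path = f"/api/rankings/{quote(asin, safe='')}"
    query = urlencode({"category": category, "marketplace": marketplace, "days": days})
    return f"{base_url.rstrip('/')}{path}?{query}"


def build_opportunities_url(base_url: str, category: str,
                            marketplace: str, days: int = 7) -> str:
    """Build the URL for GET /api/opportunities."""
    query = urlencode({"category": category, "marketplace": marketplace, "days": days})
    return f"{base_url.rstrip('/')}/api/opportunities?{query}"


def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `fetch_json(build_rankings_url("http://localhost:8000", "<asin>", "<category>", "US"))` would return the `history` and `trend_analysis` fields produced by `get_product_rankings`. Using `urlencode` keeps category names that contain spaces or `&` from corrupting the query string.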

Performance Optimization and Monitoring

System Performance Monitoring

import psutil
import time
from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Prometheus metrics
REQUEST_COUNT = Counter('amazon_scraper_requests_total', 'Total requests', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('amazon_scraper_request_duration_seconds', 'Request latency')
ACTIVE_CONNECTIONS = Gauge('amazon_scraper_active_connections', 'Active connections')
MEMORY_USAGE = Gauge('amazon_scraper_memory_usage_bytes', 'Memory usage')
CPU_USAGE = Gauge('amazon_scraper_cpu_usage_percent', 'CPU usage')

class SystemMonitor:
    def __init__(self):
        self.start_time = time.time()
        
    def start_metrics_server(self, port: int = 8001):
        """Start the metrics HTTP server."""
        start_http_server(port)
        
    def update_system_metrics(self):
        """Refresh the system-level gauges."""
        # CPU utilization
        cpu_percent = psutil.cpu_percent()
        CPU_USAGE.set(cpu_percent)
        
        # Memory usage
        memory = psutil.virtual_memory()
        MEMORY_USAGE.set(memory.used)
        
        # Number of network connections
        connections = len(psutil.net_connections())
        ACTIVE_CONNECTIONS.set(connections)
    
    def log_request(self, method: str, endpoint: str, duration: float):
        """Record request metrics."""
        REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
        REQUEST_LATENCY.observe(duration)

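# Performance-monitoring decorator
from functools import wraps

def monitor_performance(func):
    """Log how long the wrapped (synchronous) function takes to run."""
    @wraps(func)  # preserve the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start_time = time.time()
        try:
            result = func(*args, **kwargs)
            return result
        finally:
            duration = time.time() - start_time
            print(f"{func.__name__} executed in {duration:.2f} seconds")
    return wrapper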
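
Because the collector in this article is built on `asyncio`, a decorator that simply calls `func(*args, **kwargs)` would only time the creation of the coroutine, not its execution. The sketch below is one possible variant that handles both regular and coroutine functions; the name `monitor_performance_any` and the sample `fetch_page` coroutine are hypothetical.

```python
import asyncio
import functools
import inspect
import time


def monitor_performance_any(func):
    """Time both regular functions and coroutine functions."""
    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                # Await inside the timed region so the real work is measured
                return await func(*args, **kwargs)
            finally:
                duration = time.perf_counter() - start
                print(f"{func.__name__} executed in {duration:.2f} seconds")
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            duration = time.perf_counter() - start
            print(f"{func.__name__} executed in {duration:.2f} seconds")
    return sync_wrapper


@monitor_performance_any
async def fetch_page(delay: float) -> str:
    """Stand-in for an async scraping call."""
    await asyncio.sleep(delay)
    return "done"


@monitor_performance_any
def parse_page(html: str) -> int:
    """Stand-in for a synchronous parsing step."""
    return len(html)
```

`time.perf_counter()` is used here instead of `time.time()` because it is monotonic and intended for measuring short intervals.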

Deployment and Operations

Docker Configuration

# Dockerfile
FROM python:3.9-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    gcc \
    g++ \
    && rm -rf /var/lib/apt/lists/*

# Copy the dependency manifest
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY . .

# Expose the API port (8000) and the metrics port (8001)
EXPOSE 8000 8001

# Startup command
CMD ["python", "main.py"]

# docker-compose.yml
version: '3.8'

services:
  mongodb:
    image: mongo:4.4
    ports:
      - "27017:27017"
    volumes:
      - mongodb_data:/data/db
    environment:
      MONGO_INITDB_ROOT_USERNAME: admin
      MONGO_INITDB_ROOT_PASSWORD: password

  redis:
    image: redis:6-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  amazon-scraper:
    build: .
    ports:
      - "8000:8000"
      - "8001:8001"
    depends_on:
      - mongodb
      - redis
    environment:
      MONGO_URI: mongodb://admin:password@mongodb:27017/
      REDIS_URL: redis://redis:6379/0
    volumes:
      - ./logs:/app/logs

  celery-worker:
    build: .
    command: celery -A tasks worker --loglevel=info
    depends_on:
      - mongodb
      - redis
    environment:
      MONGO_URI: mongodb://admin:password@mongodb:27017/
      REDIS_URL: redis://redis:6379/0

  celery-beat:
    build: .
    command: celery -A tasks beat --loglevel=info
    depends_on:
      - mongodb
      - redis
    environment:
      MONGO_URI: mongodb://admin:password@mongodb:27017/
      REDIS_URL: redis://redis:6379/0

volumes:
  mongodb_data:
  redis_data:
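
The compose file passes `MONGO_URI` and `REDIS_URL` to each service as environment variables. As a minimal sketch of how the application side might read them, the snippet below loads both into an immutable settings object. The `Settings` class, the `load_settings` helper, and the localhost fallback values are assumptions for illustration, not part of the original code.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    mongo_uri: str
    redis_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read connection settings from the environment.

    The fallback values are assumed defaults for local development.
    """
    env = os.environ if env is None else env
    return Settings(
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017/"),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
    )
```

Accepting an optional mapping makes the loader easy to exercise in tests without touching the real process environment; in the containers above, calling `load_settings()` with no arguments would pick up the values set in docker-compose.yml.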

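Summary

This article walked through a complete technical approach to building an Amazon Best Seller ranking monitoring system, from data collection to intelligent analysis and from system architecture to deployment and operations.

The key technical points are:

  1. Asynchronous collection architecture: high-concurrency data collection with aiohttp
  2. Intelligent trend analysis: machine-learning-based prediction of ranking changes
  3. Real-time monitoring and alerting: Celery task scheduling and multi-channel notifications
  4. Scalable storage: time-series data in MongoDB with index optimization
  5. System monitoring: Prometheus metrics collection and performance monitoring

For enterprise use, consider pairing this system with a professional API service such as Pangolin Scrape API, which can help maintain data quality while lowering technical maintenance costs.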

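About the author: a senior Python engineer focused on the design of large-scale data collection and analysis systems, with extensive experience in e-commerce data processing.

Get in touch: follow my technical blog for more posts on e-commerce data collection and analysis.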
