Scraping Taobao Reviews with Scrapy and Selenium+PhantomJS

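This article shows how to combine Scrapy with Selenium+PhantomJS to scrape Taobao product IDs and their review lists. Because the request URLs carry a ua parameter, Selenium is used to drive a simulated browser for the crawl. Changing the JSONP callback makes it possible to fetch large amounts of data without parsing the DOM. In testing, no anti-scraping measures were triggered even without proxy IPs or sleep delays, though it is unclear whether this holds in every case.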
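As a rough illustration of why a JSONP response is easier to consume than the DOM, here is a minimal sketch of unwrapping one. It assumes the body has the form callback({...}); the helper name parse_jsonp, the callback name, and the sample fields are all hypothetical and do not reflect Taobao's actual response format.

```python
import json
import re

# Matches `callbackName( ... )` with an optional trailing semicolon.
_JSONP_RE = re.compile(r'^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def parse_jsonp(text):
    """Strip a JSONP wrapper such as `cb({...});` and return the decoded payload."""
    match = _JSONP_RE.match(text)
    if match is None:
        raise ValueError('response does not look like JSONP')
    return json.loads(match.group(1))


# Hypothetical response body; the field names are made up for illustration.
sample = 'jsonp_callback({"total": 2, "comments": [{"content": "good"}, {"content": "ok"}]});'
data = parse_jsonp(sample)
print(data['total'])
```

Once the wrapper is removed, the payload is ordinary JSON, so the review fields can be read directly from the resulting dict.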


Scraping product IDs with Scrapy

First, set ROBOTSTXT_OBEY = False in the project's settings.py; otherwise Scrapy skips any page that robots.txt disallows.
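For reference, a minimal settings.py fragment might look like the following. This is a sketch that assumes the default project layout created by scrapy startproject; every other setting is left at its default.

```python
# settings.py (fragment)

# Do not fetch or honor robots.txt before crawling.
ROBOTSTXT_OBEY = False
```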

base.py

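import scrapy


class BaseSpider(scrapy.Spider):
    """Shared base class: each spider writes its results to <name>.txt."""

    allowed_domains = ["taobao.com"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Subclasses must define `name`; it determines the output file name.
        self.file = open(self.name + '.txt', 'w', encoding='utf-8')

    def closed(self, reason):
        # Scrapy calls this when the spider finishes; more reliable than __del__.
        self.file.close()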

tce_id.py crawls the IDs of the sub-categories

# -*- coding: utf-8 -*-

from .base import BaseSpider
import json


CATEGORY_URLS = [
    'https://www.taobao.com/markets/nvzhuang/taobaonvzhuang', 
    'https://www.taobao.com/markets/nanzhuang/2017new', 
    'https://neiyi.taobao.com', 
    'https://www.taobao.com/markets/xie/nvxie/index', 
    'https://www.taobao.com/markets/bao/xiangbao', 
    'https://pei.taobao.com', 
    'https://www.taobao.com/markets/qbb/index?spm=a21bo.50862.201879-item-1008.5.YrbXb6&pvid=b9f2df4c-6d60-4af4-b500-c5168009831f&scm=1007.12802.34660.100200300000000', 
    'https://www.taobao.com/markets/qbb/index?spm=a21bo.50862.201867-main.8.mL7cax&pvid=b9f2df4c-6d60-4af4-b500-c5168009831f&scm=1007.12802.34660.100200300000000', 
    'https://www.taobao.com/markets/qbb/index?spm=a21bo.50862.201867-main.8&pvid=b9f2df4c-6d60-4af4-b500-c5168009831f&scm=1007.12802.34660.100200300000000', 
    'https://www.taobao.com/markets/jiadian/index', 
    'https://www.taobao.com/markets/3c/shuma', 
    'https://www.taobao.com/markets/3c/sj', 
    'https://mei.taobao.com/', 
    'https://www.taobao.com/market/baihuo/xihuyongpin.php?spm=a217u.7383845.a214d5z-static.49.e8DQmz', 
    'https://g.taobao.com/brand_detail.htm?navigator=all&_input_charset=utf-8&q=%E8%90%A5%E5%85%BB%E5%93%81&spm=a21bo.50862.201867-links-4.54.oMw9IU', 
    'https://www.taobao.com/market/peishi/zhubao.php', 
    'https://www.taobao.com/market/peishi/yanjing.php?spm=a219r.lm5630.a214d69.14.CkLAJ7', 
    'https://www.taobao.com/market/peishi/shoubiao.php', 
    'https://www.taobao.com/markets/coolcity/coolcityHome', 
    'https://www.taobao.com/markets/coolcity/coolcityHome', 
    'https://www.taobao.com/markets/amusement/home', 
    'https://game.taobao.com', 
    'https://www.taobao.com/markets/acg/dongman', 
    'https://www.taobao.com/markets/acg/yingshi', 
    'https://chi.taobao.com', 
    'https://chi.taobao.com', 
    'https://chi.taobao.com', 
    'https://s.taobao.com/search?q=%E5%9B%AD%E8%89%BA&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170419', 
    