Scraping JD.com ice-cream listings with Scrapy + Splash

This post shows how to combine Scrapy with Splash to scrape ice-cream product listings from JD.com. By running the Splash service in Docker and using its JavaScript-rendering capability, we work around the problem of dynamically loaded page content.


1. Start Splash:

Start the Splash service with Docker:

sudo docker run -p 5023:5023 -p 8050:8050 -p 8051:8051 scrapinghub/splash
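Before wiring Splash into Scrapy, it is worth checking that the container actually answers on port 8050. A minimal standard-library sketch, assuming a default local setup and using Splash's `/_ping` health-check endpoint (the function name `splash_is_up` is ours, not part of any library):

```python
import json
from urllib.request import urlopen
from urllib.error import URLError

SPLASH_URL = 'http://localhost:8050'


def splash_is_up(base_url=SPLASH_URL, timeout=3):
    """Return True if the Splash /_ping health endpoint answers 'ok'."""
    try:
        with urlopen('%s/_ping' % base_url, timeout=timeout) as resp:
            return json.load(resp).get('status') == 'ok'
    except (URLError, OSError, ValueError):
        return False


if __name__ == '__main__':
    print('Splash reachable:', splash_is_up())
```

If this prints `False`, fix the Docker service before moving on; Scrapy errors further down the pipeline are harder to diagnose.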

2. Create a new project, icecreamJD, and generate a new spider, icecream:

scrapy startproject icecreamJD

scrapy genspider icecream jd.com

3. Edit settings.py:

# Splash server address
SPLASH_URL = 'http://localhost:8050'

# Enable the two Splash downloader middlewares and re-order HttpCompressionMiddleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Use a Splash-aware duplicate-request filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Spider middleware needed to support cache_args (optional)
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
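Scrapy runs downloader middlewares in ascending priority order on the request path, which is why the two Splash middlewares (723, 725) are slotted just before the re-ordered HttpCompressionMiddleware (810). A quick sketch of how the numbers translate into run order:

```python
# Priorities copied from the DOWNLOADER_MIDDLEWARES setting above
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Scrapy sorts middlewares by their numeric value: lower numbers run
# first when processing outgoing requests.
request_order = sorted(DOWNLOADER_MIDDLEWARES, key=DOWNLOADER_MIDDLEWARES.get)
for name in request_order:
    print(DOWNLOADER_MIDDLEWARES[name], name)
```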

4. Edit the spider:

# -*- coding: utf-8 -*-
import scrapy
# send Splash-rendered requests with scrapy_splash.SplashRequest
from scrapy_splash import SplashRequest

# custom Lua script that simulates the user scrolling to the page
# footer, so the lazily loaded product list is fully rendered
lua = '''
    function main(splash)
        splash:go(splash.args.url)
        splash:wait(3)
        splash:runjs("document.getElementById('footer-2017').scrollIntoView(true)")
        splash:wait(3)
        return splash:html()
    end
'''


class IcecreamSpider(scrapy.Spider):
    name = 'icecream'
    allowed_domains = ['search.jd.com']
    start_urls = ['https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8']
    base_url = 'https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8'

    def parse(self, response):
        page_num = int(response.css('span.fp-text i::text').extract_first())
        for i in range(page_num):
            # the page URLs follow a pattern: JD numbers its search pages 1, 3, 5, ...
            # base_url already carries a query string, so append with '&', not '?'
            url = '%s&page=%s' % (self.base_url, 2 * i + 1)
            yield SplashRequest(url, endpoint='execute', args={'lua_source': lua}, callback=self.parse_item)

    def parse_item(self, response):  # item-parsing callback
        for sel in response.css('div.gl-i-wrap'):
            yield {
                'name': sel.css('div.p-name em').extract_first(),
                'price': sel.css('div.p-price i::text').extract_first(),
            }
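The pagination in parse() relies on JD numbering its search-result pages with odd values (1, 3, 5, ...), and on the fact that base_url already contains a query string, so extra parameters must be joined with `&`. Pulled out as a standalone helper (`page_urls` is a hypothetical name, not part of the spider):

```python
def page_urls(base_url, page_count):
    """Yield JD-style search-result URLs with odd page numbers.

    base_url already contains a query string ('?keyword=...'), so
    additional parameters are appended with '&'.
    """
    for i in range(page_count):
        yield '%s&page=%d' % (base_url, 2 * i + 1)


base = 'https://search.jd.com/Search?keyword=%E5%86%B0%E6%B7%87%E6%B7%8B&enc=utf-8'
for url in page_urls(base, 3):
    print(url)  # pages 1, 3, 5
```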
