Scrapy Framework: A CrawlSpider Example for the WeChat Mini Program Community

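This article walks through building a Scrapy crawler for the mini program site www.wxapp-union.com: setting start_urls, defining link rules, parsing article detail pages, and defining the Item. The focus is on extracting data with XPath, storing it in a WxappItem, and exporting the crawl results with JsonLinesItemExporter.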


Creating the spider

scrapy startproject wxapp

cd wxapp

scrapy genspider -t crawl wxapp_spider "www.wxapp-union.com"

Modifying settings.py

(Screenshot of the settings.py changes; the image is not available.)
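
The screenshot for this step did not survive, so the exact changes are unknown. As a rough sketch only, a project like this one usually needs settings along the following lines. Every value here is an assumption rather than the author's actual configuration, and the User-Agent string is a placeholder. The one entry the rest of the article clearly depends on is ITEM_PIPELINES, since the pipeline only runs when it is registered there.

```python
# settings.py (excerpt): assumed values, not taken from the original screenshot

# Assumption: robots.txt checking is switched off, as many Scrapy
# walkthroughs do. Check the site's terms before doing this yourself.
ROBOTSTXT_OBEY = False

# Assumption: wait one second between requests to keep the load on the site low.
DOWNLOAD_DELAY = 1

# Assumption: send browser-like default headers; the User-Agent is a placeholder.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (placeholder user agent)',
}

# Register the pipeline from pipelines.py; the number sets its run order
# relative to other pipelines (lower runs first).
ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
}
```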

Spider code

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['www.wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

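    rules = (   # rules define which URLs the spider follows and how each one is handled
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),   # list pages: follow them to discover more links
        Rule(LinkExtractor(allow=r'.+article-.+html'), callback='parse_detail', follow=False)   # detail pages: parse each one with the parse_detail callback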
    )

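    '''
    Rule and LinkExtractor together decide where the spider goes:
    1. allow: the pattern must match only the URLs you want to crawl, and must not also match unrelated URLs
    2. follow: set follow=True if links on a matched page that satisfy the rules should be followed in turn; with follow=False those links are not crawled any further (when omitted, follow defaults to True if no callback is given, otherwise False)
    3. callback: omit it if the page is only needed for the URLs it contains;
                 pass a parsing callback if data must be extracted from the page itself
    '''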

    def parse_detail(self, response):
        title = response.xpath('//h1[@class="ph"]/text()').get()
        authors = response.xpath('//p[@class="authors"]/a/text()').get()
        time = response.xpath('//p[@class="authors"]/span/text()').get()
        article = response.xpath('//td[@id="article_content"]//text()').getall()
        article = ''.join(article).strip()

        item = WxappItem(title=title, authors=authors, time=time, article=article)
        yield item
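
To see which URLs each rule would pick up, the two allow patterns can be tried out with plain `re`. This is only an approximation of what LinkExtractor does (it runs a regex search against each absolute URL found on the page), and the sample URLs, including the article ID, are hypothetical.

```python
import re

# The two allow patterns copied from the spider's rules.
LIST_PATTERN = re.compile(r'.+mod=list&catid=2&page=\d')
DETAIL_PATTERN = re.compile(r'.+article-.+html')


def classify(url):
    """Return which rule a URL would fall under: 'list', 'detail', or None."""
    if LIST_PATTERN.search(url):
        return 'list'
    if DETAIL_PATTERN.search(url):
        return 'detail'
    return None


# Hypothetical sample URLs, for illustration only.
samples = [
    'http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2',
    'http://www.wxapp-union.com/article-1234-1.html',
    'http://www.wxapp-union.com/portal.php?mod=list&catid=3&page=1',
]

results = {url: classify(url) for url in samples}
for url, kind in results.items():
    print(kind, url)
```

The third URL belongs to a different category (catid=3), so neither pattern matches it and the spider would ignore it. This is the point of keeping each allow pattern narrow.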

items.py code

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class WxappItem(scrapy.Item):
    # define the fields for your item here like:

    title = scrapy.Field()
    authors = scrapy.Field()
    time = scrapy.Field()
    article = scrapy.Field()

Pipeline code

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exporters import JsonLinesItemExporter


'''
Export items with JsonLinesItemExporter from scrapy.exporters'''


class WxappPipeline(object):
    def __init__(self):
        self.f = open('wxsqjc.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.f, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.f.close()
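
JsonLinesItemExporter writes one JSON object per line, so the resulting file is not a single JSON document and has to be read back line by line. The sketch below uses only the standard library and imitates that layout with two made-up records standing in for scraped items. It does not call Scrapy itself, and the field values are invented for illustration.

```python
import json
import os
import tempfile

# Two made-up records standing in for scraped WxappItem objects.
records = [
    {'title': '示例标题', 'authors': 'someone', 'time': '2018-01-01', 'article': 'body text'},
    {'title': 'Second title', 'authors': 'another', 'time': '2018-01-02', 'article': 'more text'},
]

# Write into a temporary directory so nothing in the project is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'wxsqjc.json')

# Write one JSON object per line; ensure_ascii=False keeps non-ASCII
# characters readable instead of escaping them as \uXXXX sequences.
with open(path, 'wb') as f:
    for record in records:
        f.write((json.dumps(record, ensure_ascii=False) + '\n').encode('utf-8'))

# Read the file back line by line; calling json.load on the whole file
# would fail because it holds several JSON documents, not one.
with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]['title'])
```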