A classmate and I want to build a website for searching arxiv.org papers; this is a demo.
GitHub repo: https://github.com/Joovo/Arxiv
After putting this off for far too long, I'm finally catching the blog up. The Scrapy operations covered (a short shell session follows the list):
- use scrapy shell to check that an XPath expression is correct
- response.xpath().extract() converts the selection into a list of strings
- clean up the data with str.strip()
- get all the text of an XPath node's child nodes
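For example, here are those operations in one scrapy shell session; the selectors are illustrative, and the real class names depend on the arxiv listing markup:

# inside: scrapy shell https://arxiv.org/list/cs.CV/1903
# extract() turns the XPath selection into a list of strings
titles = response.xpath("//div[contains(@class, 'list-title')]/text()").extract()
# str.strip() removes surrounding whitespace and newlines
titles = [t.strip() for t in titles]
# a relative .//text() query collects all text under a node's children
first_entry = response.xpath('//dl/dd')[0]
all_text = first_entry.xpath('.//text()').extract()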
arxiv.org itself is fairly easy to crawl by constructing URLs: you build in a year/month stamp and the number of entries to show per page.
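A minimal sketch of that URL construction, assuming the /list/&lt;category&gt;/&lt;YYMM&gt;?skip=N&show=M pattern that arxiv listing pages use for pagination (treat the exact query parameters as an assumption):

# generate listing URLs for one category and month, paged per_page entries at a time
def listing_urls(category, year, month, total, per_page=25):
    stamp = '{:02d}{:02d}'.format(year % 100, month)  # e.g. '1903' for 2019-03
    for skip in range(0, total, per_page):
        yield 'https://arxiv.org/list/{}/{}?skip={}&show={}'.format(
            category, stamp, skip, per_page)

urls = list(listing_urls('cs.CV', 2019, 3, total=100))

With the URL scheme settled, bootstrap the Scrapy project: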
python3 -m scrapy startproject Arxiv
cd Arxiv
# generate a spider skeleton named "arxiv" for the arxiv.org domain
scrapy genspider arxiv arxiv.org
# run the spider
scrapy crawl arxiv
With the basic skeleton in place, edit items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ArxivItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    authors = scrapy.Field()
    comments = scrapy.Field()
    subjects = scrapy.Field()
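The post doesn't show the spider itself; here is a hypothetical sketch of how these fields could be filled from a listing page (the XPath expressions are placeholders, not the repo's exact ones):

import scrapy
from Arxiv.items import ArxivItem

class ArxivSpider(scrapy.Spider):
    name = 'arxiv'
    allowed_domains = ['arxiv.org']
    start_urls = ['https://arxiv.org/list/cs.CV/1903?skip=0&show=25']

    def parse(self, response):
        # each <dd> on a listing page holds one paper's metadata block
        for entry in response.xpath('//dl/dd'):
            item = ArxivItem()
            item['title'] = entry.xpath(
                ".//div[contains(@class, 'list-title')]/text()").extract()
            item['authors'] = entry.xpath(
                ".//div[contains(@class, 'list-authors')]//a/text()").extract()
            # comments and subjects are pulled the same way from their divs
            yield item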
Then edit pipelines.py, which saves the scraped data:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class ArxivPipeline(object):
    def __init__(self):
        # append items to a local JSON Lines file
        self.file = open('./items.json', 'a+')

    def process_item(self, item, spider):
        # serialize each item as one JSON object per line
        content = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(content)
        return item
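As the template comment reminds us, the pipeline only runs once it is registered in settings.py (300 is an arbitrary priority between 0 and 1000):

ITEM_PIPELINES = {
    'Arxiv.pipelines.ArxivPipeline': 300,
}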