Learning the Scrapy Crawler Framework (Part 2): Crawling Multi-Level Web Page Information with Scrapy
1. Crawling goals:
1.1 From the first-level pages, obtain the links to the patent detail pages

1.2 From each patent detail page, extract the detailed patent information (see the sketch below for the two-level pattern)
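Before looking at the project files, here is a minimal sketch of how Scrapy usually chains two levels of pages: the first-level callback yields requests for the detail links, and a second callback parses each detail page. The spider name, query string, and CSS selector below are placeholder assumptions for illustration only, not the real CNKI page structure; the actual implementation is patentSpider.py in section 2.2.

import scrapy


class TwoLevelSketchSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the two-level crawl pattern.
    name = 'twoLevelSketch'
    start_urls = ['http://search.cnki.net/search.aspx?q=example']

    def parse(self, response):
        # Level 1: collect the links that point to the detail pages.
        # The selector is a placeholder, not the real result-list markup.
        for href in response.css('div.wz_content h3 a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # Level 2: extract the detailed fields from one detail page.
        yield {'detail_url': response.url}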

2. Project code implementation
2.1 items.py: define the data structure for the detail fields to be collected
import scrapy


class CnkipatentItem(scrapy.Item):
    # One item holds every field scraped from a single patent detail page.
    application_number = scrapy.Field()
    application_date = scrapy.Field()
    public_number = scrapy.Field()
    publication_date = scrapy.Field()
    applicant = scrapy.Field()
    address = scrapy.Field()
    common_applicants = scrapy.Field()
    inventor = scrapy.Field()
    international_application = scrapy.Field()
    international_publishing = scrapy.Field()
    into_the_country_date = scrapy.Field()
    patent_agencies = scrapy.Field()
    agents = scrapy.Field()
    original_application_number = scrapy.Field()
    province_code = scrapy.Field()
    summary = scrapy.Field()
    sovereignty_item = scrapy.Field()
    page = scrapy.Field()
    main_classification_number = scrapy.Field()
    patent_classification_Numbers = scrapy.Field()
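For orientation (this is not part of the project code): a CnkipatentItem instance behaves like a dictionary that only accepts the declared fields, which is how the spider in section 2.2 will hand scraped values to the pipeline. The values below are made-up placeholders.

from cnkiPatent.items import CnkipatentItem

item = CnkipatentItem()
item['application_number'] = 'CN000000000.0'   # placeholder value, not real data
item['applicant'] = 'Example Applicant'        # placeholder value, not real data
print(dict(item))
# Assigning an undeclared key raises KeyError, which catches field-name typos early:
# item['aplicant'] = '...'  -> KeyError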
2.2 patentSpider.py: implement the crawling and parsing strategy
import scrapy
import re
from cnkiPatent.items import CnkipatentItem


class PatentspiderSpider(scrapy.Spider):
    name = 'patentSpider'
    allowed_domains = ['search.cnki.net']
    # The URL-encoded query %e5%b0%8f%e6%a0%b8%e9%85%b8 decodes to 小核酸 (small nucleic acid).
    start_urls = ['http://search.cnki.net/search.aspx?q=%e5%b0%8f%e6%a0%b8%e9%85%b8