1. Create a Scrapy project
In a cmd console, change into the target folder and run:
scrapy startproject example
- example
  - example
    - spiders
      - __init__.py
    - __init__.py
    - items.py
    - middlewares.py
    - pipelines.py
    - settings.py
  - scrapy.cfg
2. Write the items file, defining the fields to scrape based on the content you need
import scrapy


class ExampleItem(scrapy.Item):
    title = scrapy.Field()         # article title
    publish_time = scrapy.Field()  # publish time
    source = scrapy.Field()        # source
    url = scrapy.Field()           # page URL
    content = scrapy.Field()       # body text
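An Item works like a dict restricted to the declared fields. As a quick sanity check you can try it in a Python shell started from the project root (purely illustrative, not part of the project files):

from example.items import ExampleItem

item = ExampleItem()
item['title'] = 'demo title'
item['url'] = 'http://www.xinhuanet.com/'
print(dict(item))   # {'title': 'demo title', 'url': 'http://www.xinhuanet.com/'}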
3. Write the spider file
Change into the example directory and create a basic spider class with:
# xinhua is the spider name, xinhuanet.com is the domain the spider is limited to
scrapy genspider xinhua "xinhuanet.com"
After the command runs, a file named xinhua.py is created in the spiders folder.
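The generated file contains a skeleton roughly like the one below (the exact template depends on your Scrapy version, so treat this as an approximation):

import scrapy


class XinhuaSpider(scrapy.Spider):
    name = 'xinhua'
    allowed_domains = ['xinhuanet.com']
    start_urls = ['http://xinhuanet.com/']

    def parse(self, response):
        pass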
Now edit it as follows:
import scrapy
from scrapy.selector import Selector

from ..items import ExampleItem


class XinhuaSpider(scrapy.Spider):
    name = 'xinhua'
    allowed_domains = ['xinhuanet.com']
    start_urls = [
        'http://www.xinhuanet.com/video/sjxw/2019-10/24/c_1210323069.htm',
        'http://www.xinhuanet.com/video/sjxw/2021-05/03/c_1211136650.htm',
        'http://www.xinhuanet.com/video/sjxw/2021-07/14/c_1211240405.htm',
    ]

    def start_requests(self):
        # Request each article page and hand the response to parse_model
        # (re-yielding the start_urls from parse would fetch every page twice)
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse_model)

    def parse_model(self, response):
        item = ExampleItem()
        selector = Selector(response)
        item['title'] = selector.xpath('/html/body/div[@class="main"]/div[@class="header"]/div[@class="title"]/h1/text()').extract_first()
        item['publish_time'] = selector.xpath('/html/body/div[@class="main"]/div[@class="header"]/div[@class="title"]/span[@class="time"]/text()').extract_first()
        item['source'] = selector.xpath('//*[@id="source"]/text()').extract_first()
        item['url'] = response.url
        # The body is split across several <p> tags, so join them instead of taking only the first
        item['content'] = ''.join(selector.xpath('/html/body/div[@class="main"]/div[@class="article"]/p/text()').extract())
        yield item
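Before committing to the XPath expressions, it helps to verify them interactively with scrapy shell. The session below is only a sketch using the first of the start_urls; the selectors depend on the page layout and may need adjusting:

scrapy shell "http://www.xinhuanet.com/video/sjxw/2019-10/24/c_1210323069.htm"
>>> response.xpath('//div[@class="title"]/h1/text()').extract_first()
>>> response.xpath('//*[@id="source"]/text()').extract_first()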
4. Write the pipelines file
import csv


class ExamplePipeline:
    def open_spider(self, spider):
        # Open the CSV file once when the spider starts and write the header row
        # ('w' mode gives a fresh file with a single header row on every run)
        self.fp = open("./example.csv", 'w', encoding='utf-8', newline='')
        self.csv_writer = csv.writer(self.fp)
        self.csv_writer.writerow(["标题", "时间", "来源", "url", "正文"])

    def process_item(self, item, spider):
        # Write one CSV row per scraped item
        self.csv_writer.writerow([item['title'], item['publish_time'],
                                  item['source'], item['url'], item['content']])
        return item

    def close_spider(self, spider):
        # Close the file when the spider finishes
        self.fp.close()
5. Configure the settings file (the key settings)
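A minimal sketch of the settings typically adjusted for this project; the user-agent string and download delay are placeholder values, and the pipeline path follows the class defined in step 4:

# settings.py (key entries only)
BOT_NAME = 'example'

# Do not obey robots.txt for this demo crawl
ROBOTSTXT_OBEY = False

# Send a browser-like User-Agent (placeholder value, replace as needed)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'

# Wait a little between requests to be polite (optional)
DOWNLOAD_DELAY = 1

# Enable the CSV pipeline from step 4; 300 is the execution priority
ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 300,
}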
6. Run the program with the crawl command
# xinhua is the spider name
scrapy crawl xinhua
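If you prefer launching the crawl from an IDE rather than the console, a small start script placed next to scrapy.cfg does the same thing (the file name start.py is just a convention, not required):

# start.py (hypothetical helper script at the project root)
from scrapy import cmdline

# Equivalent to typing "scrapy crawl xinhua" in the console
cmdline.execute("scrapy crawl xinhua".split())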