I have just started learning Python, and when it comes to scraping data from the web I am still at the stage of mechanically copying code. Without further ado, here is my first crawl.
1. Create the project
1) The command to create the project
scrapy startproject wooyun
This command creates a wooyun folder under the current directory.
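For reference, the generated project skeleton looks roughly like this (the exact set of files varies slightly between Scrapy versions):

wooyun/
    scrapy.cfg            # deployment configuration
    wooyun/
        __init__.py
        items.py          # Item definitions (step 2)
        pipelines.py      # item pipelines (step 4)
        settings.py       # project settings (step 3)
        spiders/          # spider modules live here (step 5)
            __init__.py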
2) Define items.py
Scrapy provides the Item class to hold the data scraped from a page. It is a bit like deserialization in Java, except that deserialization turns a byte stream into a Java object, whereas Item is a generic container that stores data as key/value pairs. Every field of an Item class is declared with scrapy.Field(), and a declared field can hold any type, such as an integer, string, or list.
import scrapy

class WooyunItem(scrapy.Item):
    # The field names must match the keys assigned in the spider (step 5)
    news_url = scrapy.Field()
    news_title = scrapy.Field()
    news_date = scrapy.Field()
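As a quick, hypothetical illustration of the key/value access described above (not part of the project files), an Item behaves much like a dict, but only accepts the declared fields:

from wooyun.items import WooyunItem

item = WooyunItem()
item['news_title'] = 'example title'   # set a declared field
print(item['news_title'])              # read it back
print(dict(item))                      # convert to a plain dict, e.g. before writing to MongoDB
# item['unknown'] = 1 would raise KeyError, because the field was never declared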
3) I store the scraped data in MongoDB, so add the following to settings.py:
# Disable cookies to reduce the chance of being banned
COOKIES_ENABLED = False
ITEM_PIPELINES = {
    'wooyun.pipelines.WooyunPipeline': 300,  # pipeline order, 0-1000; lower numbers run first
}
MONGO_URI = "mongodb://localhost:27017/"
MONGO_DATABASE = "local"
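Before running the crawl, it can help to confirm that MongoDB is reachable with the same URI. A minimal check, assuming a local mongod is running and pymongo 3.6 or newer is installed:

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["local"]
print(db.list_collection_names())  # returns without raising if the server is reachable
client.close()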
4) Set up the pipeline in pipelines.py
# -*- coding: utf-8 -*-
import datetime

import pymongo

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

class WooyunPipeline(object):
    # One collection per day, e.g. wooyun_20160101
    now = datetime.datetime.now()
    collection_name = "wooyun_" + now.strftime('%Y%m%d')

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB settings defined in settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert_one() replaces the deprecated insert()
        self.db[self.collection_name].insert_one(dict(item))
        return item
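As a rough sanity check (hypothetical, not part of the tutorial), the pipeline can also be exercised by hand without Scrapy, since it only needs a URI and a database name; this assumes a local mongod is running:

from wooyun.pipelines import WooyunPipeline

pipeline = WooyunPipeline("mongodb://localhost:27017/", "local")
pipeline.open_spider(None)                              # the spider argument is unused here
pipeline.process_item({'news_title': 'test entry'}, None)  # any dict-like item works
pipeline.close_spider(None)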
5) Finally, write the spider, which defines the data you want to scrape
# -*- coding: utf-8 -*-
import scrapy

from wooyun.items import WooyunItem

class WooyunSpider(scrapy.Spider):
    name = "wooyun"
    allowed_domains = ["wooyun.org"]
    start_urls = [
        "http://www.wooyun.org/bugs/page/1",
    ]

    def parse(self, response):
        news_page_num = 20  # rows per listing page
        if response.status == 200:
            for j in range(1, news_page_num + 1):
                item = WooyunItem()
                item['news_url'] = response.xpath("//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/@href").extract()
                item['news_title'] = response.xpath("//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/text()").extract()
                item['news_date'] = response.xpath("//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/th[1]/text()").extract()
                yield item
            # Queue pages 2-19; each one is handled by parse_news below
            for i in range(2, 20):
                next_page_url = "http://www.wooyun.org/bugs/page/" + str(i)
                yield scrapy.Request(next_page_url, callback=self.parse_news)

    def parse_news(self, response):
        news_page_num = 20
        if response.status == 200:
            for j in range(1, news_page_num + 1):
                item = WooyunItem()
                item['news_url'] = response.xpath("//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/@href").extract()
                item['news_title'] = response.xpath("//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/td[1]/a/text()").extract()
                item['news_date'] = response.xpath("//div[@class='content']/table[3]/tbody/tr[" + str(j) + "]/th[1]/text()").extract()
                yield item
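The XPath expressions are the part most likely to break when the page layout changes. A quick way to test them is the interactive scrapy shell (the exact output depends on the live page):

scrapy shell "http://www.wooyun.org/bugs/page/1"
>>> response.xpath("//div[@class='content']/table[3]/tbody/tr[1]/td[1]/a/text()").extract()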
6) Run the crawl command
scrapy crawl wooyun
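If you just want to eyeball the results without MongoDB, Scrapy's built-in feed export can dump the items straight to a file instead:

scrapy crawl wooyun -o items.json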
Done!
Reposted from: https://blog.51cto.com/727229447/1744509