I. Decide what to crawl and create the MySQL table
1. Determine the URLs to crawl.
Observing the site shows that the listing URL has the form
https://search.51job.com/list/000000,000000,0000,32,9,99,+,2,xxxx.html
where xxxx is the page number; changing it is enough to crawl multiple pages.
2. 51job (前程无忧) loads its listing data dynamically as JSON, assigns it to a JavaScript variable, and then renders it into the page. The crawler therefore has to parse the variable inside the script tag rather than the rendered HTML.
3. Decide on the fields to scrape, then create the MySQL table.
The MySQL table structure is as follows:
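The original table screenshot is not reproduced here; as a sketch, a plausible DDL might look like the following (the column list is an assumption based on typical job-listing fields, not the author's exact schema):

```sql
-- Hypothetical schema; adjust column names and types to the fields you scrape.
CREATE TABLE IF NOT EXISTS qcwy (
    id INT AUTO_INCREMENT PRIMARY KEY,
    company VARCHAR(255),    -- company name
    position VARCHAR(255),   -- job title
    salary VARCHAR(64),      -- salary range, kept as raw text
    place VARCHAR(128),      -- work location
    crawl_time DATETIME DEFAULT CURRENT_TIMESTAMP
) DEFAULT CHARSET = utf8mb4;
```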
II. Crawling with a Scrapy project
(1) Preparation:
1. Run scrapy startproject qcwy to create the Scrapy project.
2. Run scrapy genspider qcwyCrawler www.xxx.com to create the spider file.
(2) Edit the project configuration file settings.py:
# Scrapy settings for qcwy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from fake_useragent import UserAgent
BOT_NAME = 'qcwy'
SPIDER_MODULES = ['qcwy.spiders']
NEWSPIDER_MODULE = 'qcwy.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = UserAgent().random # pick a random User-Agent (evaluated once at startup, not per request)
# Obey robots.txt rules
ROBOTSTXT_OBEY = False # do not obey robots.txt
LOG_LEVEL = 'ERROR' # log only ERROR-level messages
ITEM_PIPELINES = {
    'qcwy.pipelines.QcwyPipeline': 300,
} # enable the item pipeline
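The QcwyPipeline enabled above lives in pipelines.py, which is not shown in this section. A minimal sketch of what it might look like, persisting items to MySQL with pymysql (the connection parameters, table name, and column names are placeholders, not the author's code):

```python
# pipelines.py -- sketch of a pipeline that writes each item to MySQL.
# Table/column names and credentials below are assumptions; adjust to your schema.

INSERT_SQL = (
    "INSERT INTO qcwy (company, position, salary, place) "
    "VALUES (%s, %s, %s, %s)"
)

class QcwyPipeline:
    def open_spider(self, spider):
        # Imported here so the module can be loaded without pymysql installed.
        import pymysql
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='password', db='qcwy',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert one scraped row; commit per item for simplicity.
        self.cursor.execute(INSERT_SQL, (item.get('company'),
                                         item.get('position'),
                                         item.get('salary'),
                                         item.get('place')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```

Committing on every item is the simplest correct choice; batching commits in close_spider would be faster for large crawls.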
(3) Edit items.py to define the fields to scrape
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class QcwyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    company = scrapy.Field()
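As noted in section I, the listing data lives in a JavaScript variable inside a script tag, so the spider's parse callback has to extract and decode it. A sketch of that extraction (the variable name window.__SEARCH_RESULT__ and the engine_search_result key are what 51job used at the time of writing and may have changed; verify against the live page source):

```python
# Sketch: pull the job list out of the JS variable embedded in the page.
import json
import re

# Simplified stand-in for a real 51job response body.
html = '''
<script type="text/javascript">
window.__SEARCH_RESULT__ = {"engine_search_result":
    [{"company_name": "ACME", "job_name": "Python Dev"}]}
</script>
'''

def parse_jobs(page_text):
    # Capture the JSON object assigned to the variable, up to </script>.
    match = re.search(
        r'window\.__SEARCH_RESULT__\s*=\s*(\{.*?\})\s*</script>',
        page_text, re.S)
    if not match:
        return []
    data = json.loads(match.group(1))
    return data.get('engine_search_result', [])

jobs = parse_jobs(html)
```

Each dict in jobs can then be copied field by field into a QcwyItem inside the spider's parse method.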