I. Decide what to crawl and create the MySQL table
1. Determine the URLs to crawl.
Observing the site shows that the listing URL has the form
https://search.51job.com/list/000000,000000,0000,32,9,99,+,2,xxxx.html
where xxxx is the page number; changing it is enough to crawl multiple pages.
2. 51job (前程无忧) loads its listing data dynamically as JSON, assigns it to a JavaScript variable, and then renders it into the page. The crawler therefore has to parse the variable inside the script tag rather than the rendered HTML.
3. Decide on the fields to scrape, then create the MySQL table.
The MySQL table structure is as follows:
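The original table screenshot is not reproduced here; as a sketch, a plausible DDL might look like the following (the column list is an assumption based on typical job-listing fields, not the author's exact schema):

```sql
-- Hypothetical schema; adjust column names and types to the fields you scrape.
CREATE TABLE IF NOT EXISTS qcwy (
    id INT AUTO_INCREMENT PRIMARY KEY,
    company VARCHAR(255),    -- company name
    position VARCHAR(255),   -- job title
    salary VARCHAR(64),      -- salary range, kept as raw text
    place VARCHAR(128),      -- work location
    crawl_time DATETIME DEFAULT CURRENT_TIMESTAMP
) DEFAULT CHARSET = utf8mb4;
```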
II. Crawling with a Scrapy project
(1) Preparation:
1. Run scrapy startproject qcwy to create the Scrapy project.
2. Run scrapy genspider qcwyCrawler www.xxx.com to create the spider file.
(2) Edit the project configuration file settings.py:
# Scrapy settings for qcwy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
from fake_useragent import UserAgent
BOT_NAME = 'qcwy'
SPIDER_MODULES = ['qcwy.spiders']
NEWSPIDER_MODULE = 'qcwy.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = UserAgent().random # pick a random User-Agent (evaluated once at startup, not per request)
# Obey robots.txt rules
ROBOTSTXT_OBEY = False # do not obey robots.txt
LOG_LEVEL = 'ERROR' # log only ERROR-level messages
ITEM_PIPELINES = {
    'qcwy.pipelines.QcwyPipeline': 300,
} # enable the item pipeline
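The QcwyPipeline enabled above lives in pipelines.py, which is not shown in this section. A minimal sketch of what it might look like, persisting items to MySQL with pymysql (the connection parameters, table name, and column names are placeholders, not the author's code):

```python
# pipelines.py -- sketch of a pipeline that writes each item to MySQL.
# Table/column names and credentials below are assumptions; adjust to your schema.

INSERT_SQL = (
    "INSERT INTO qcwy (company, position, salary, place) "
    "VALUES (%s, %s, %s, %s)"
)

class QcwyPipeline:
    def open_spider(self, spider):
        # Imported here so the module can be loaded without pymysql installed.
        import pymysql
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='password', db='qcwy',
                                    charset='utf8mb4')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert one scraped row; commit per item for simplicity.
        self.cursor.execute(INSERT_SQL, (item.get('company'),
                                         item.get('position'),
                                         item.get('salary'),
                                         item.get('place')))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```

Committing on every item is the simplest correct choice; batching commits in close_spider would be faster for large crawls.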
(3) Edit items.py to define the fields to scrape
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class QcwyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    company = scrapy.Field()
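As noted in section I, the listing data lives in a JavaScript variable inside a script tag, so the spider's parse callback has to extract and decode it. A sketch of that extraction (the variable name window.__SEARCH_RESULT__ and the engine_search_result key are what 51job used at the time of writing and may have changed; verify against the live page source):

```python
# Sketch: pull the job list out of the JS variable embedded in the page.
import json
import re

# Simplified stand-in for a real 51job response body.
html = '''
<script type="text/javascript">
window.__SEARCH_RESULT__ = {"engine_search_result":
    [{"company_name": "ACME", "job_name": "Python Dev"}]}
</script>
'''

def parse_jobs(page_text):
    # Capture the JSON object assigned to the variable, up to </script>.
    match = re.search(
        r'window\.__SEARCH_RESULT__\s*=\s*(\{.*?\})\s*</script>',
        page_text, re.S)
    if not match:
        return []
    data = json.loads(match.group(1))
    return data.get('engine_search_result', [])

jobs = parse_jobs(html)
```

Each dict in jobs can then be copied field by field into a QcwyItem inside the spider's parse method.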