Create the project:
scrapy startproject ScrapyDemo
cd ScrapyDemo
scrapy genspider bigqcwy msearch.51job.com
Add the fields to scrape in items.py:
class ScrapydemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # job title
    name = scrapy.Field()
    # salary
    salary = scrapy.Field()
    # hiring company
    company = scrapy.Field()
    # job location
    jobPlace = scrapy.Field()
    # work experience required
    jobExperience = scrapy.Field()
    # education required
    education = scrapy.Field()
    # job description (responsibilities)
    # jobContent = scrapy.Field()
    # job requirements (skills)
    jobRequirement = scrapy.Field()
Edit the spider file bigqcwy.py (the salary field gets some simple cleaning):
# -*- coding: utf-8 -*-
import scrapy
import time
from ScrapyDemo.items import ScrapydemoItem
import re

class BigqcwySpider(scrapy.Spider):
    name = 'bigqcwy'
    allowed_domains = ['msearch.51job.com']
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            'Cookie': 'set your cookie here',
        },
        "AUTOTHROTTLE_ENABLED": True,
        # "DOWNLOAD_DELAY": 1,
        # "ScrapyDemo.pipelines.ScrapydemoPipeline": 300,
    }
    start_urls = ['https://msearch.51job.com/']

    def start_requests(self):
        # list of URL-encoded search category codes
        list = ['0100%2C7700%2C7200%2C7300%2C7800', '7400%2C2700%2C7900%2C7500%2C6600', '800
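The spider's salary cleaning is mentioned above, but that code is cut off in this excerpt. As a minimal sketch of what such cleaning might look like, assuming common 51job salary formats such as "1-1.5万/月" and "8-10千/月" (the function name and handled formats are assumptions, not the article's original code):

```python
import re

def clean_salary(raw):
    """Normalize a 51job salary string like '1-1.5万/月' or '8-10千/月'
    into a (min, max) tuple in yuan per month.
    Returns None for unrecognized formats (e.g. '面议' / negotiable)."""
    m = re.match(r'^([\d.]+)-([\d.]+)(万|千)/月$', raw)
    if not m:
        return None
    low, high, unit = m.groups()
    scale = 10000 if unit == '万' else 1000  # 万 = 10,000; 千 = 1,000
    return (float(low) * scale, float(high) * scale)
```

The spider's parse callback could call this on the raw salary text before assigning it to `item['salary']`.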

This article showed how to use the Scrapy framework to crawl job listings from 51job (前程无忧): defining the fields to scrape in items.py, writing the spider to fetch the pages, cleaning the salary data, configuring pipelines.py to store the results in MongoDB, and adjusting the related options in settings.py. With these steps the job postings were successfully fetched and stored.
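The pipelines.py that writes items to MongoDB does not survive in this excerpt. A minimal sketch of such a pipeline, assuming pymongo as the driver (the `MONGO_URI`/`MONGO_DATABASE` settings keys, database name, and `jobs` collection name are assumptions):

```python
class ScrapydemoPipeline:
    """Store scraped job items in a MongoDB collection (sketch)."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # MONGO_URI and MONGO_DATABASE are assumed settings.py keys
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI', 'mongodb://localhost:27017'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'job51'),
        )

    def open_spider(self, spider):
        # imported here so the sketch loads even without pymongo installed
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['jobs'].insert_one(dict(item))
        return item
```

Enable it in settings.py with `ITEM_PIPELINES = {'ScrapyDemo.pipelines.ScrapydemoPipeline': 300}`, which matches the priority in the commented-out line of the spider's `custom_settings`.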