Scraping job postings, companies, and interview reviews from a recruitment site with Scrapy + Selenium

This post walks through scraping job, company, and interview-review data from Lagou with Python: technology choices, crawler logic, data processing, and visualization, focusing on Scrapy, Selenium, WordCloud, and PyEcharts.


A while back I found myself writing crawlers again, this time scraping data from Lagou, generating word clouds with wordcloud, and producing some statistics charts with pyecharts.

Before crawling, it pays to study Lagou's page structure; the pages generally follow predictable patterns.

First, look at the job requirements section.

To pull out specific fields I use XPath to locate the elements. scrapy shell is handy for debugging the selectors, but you need to set a USER_AGENT when debugging:

scrapy shell -s USER_AGENT="xx" <url>

Once the page is fetched, response.xpath() selects the matching elements, and extract() / extract_first() pull out the values for further processing.
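For example, inside the shell, with selectors taken from the job spider later in this post:

# inside `scrapy shell -s USER_AGENT="..." <job page url>`
title = response.xpath("//div[@class='job-name']//h1/text()").extract_first('')
# salary, city, experience, degree and job type all sit in the job_request spans
job_request = response.xpath("//dd[@class='job_request']//span/text()").extract()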

However, some pages only reveal more content after a click, or need to be paged through.

That is where Selenium comes in to simulate the user's clicks. My handling logic: check whether the clickable element is present; if it is, perform the click and rebuild the response object from the rendered page so the extra content can be extracted. Note that the chromedriver binary must match your Chrome version, and it must either live in the Python directory or be pointed to explicitly. Matching versions can be downloaded from http://chromedriver.storage.googleapis.com/index.html

# You can point this at a specific chromedriver binary and configure timeouts here
driver = webdriver.Chrome()
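If chromedriver is not on the path, you can pass its location explicitly; a minimal sketch (the path and timeout values below are placeholders, not from the original project):

from selenium import webdriver

# executable_path and the timeouts are illustrative values
driver = webdriver.Chrome(executable_path="C:/tools/chromedriver.exe")
driver.set_page_load_timeout(30)  # fail if a page takes longer than 30s to load
driver.implicitly_wait(5)         # wait up to 5s when locating elements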

Another thing to note is pages like the review list, where a single page holds several records. My approach is to grab the list of record nodes first and then build each record from its node; be careful with the XPath syntax for child elements here (the expressions must be relative to the node, as shown below).
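This is the pattern handle_review_list in the spider below uses: select the node list with an absolute XPath, then query each node with relative XPaths (no leading //) so the expressions only match inside that node:

review_list = response.xpath("//div[@class='review-right']")  # one selector per review block
for review in review_list:
    # relative XPath, scoped to this review node
    review_id = review.xpath("div[@class='review-action']//a//@data-id").extract_first('')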

That covers locating elements. The site-wide crawl uses CrawlSpider, but since my crawl starts from job postings, moves on to the hiring companies, and then to each company's interview reviews, the link extraction has to be constrained: restrict_xpaths limits which part of the page links are taken from, and callback names the method that processes each matched page.

 rules = (
        Rule(LinkExtractor(allow=("zhaopin/.*")), callback='parse_zhaopin', follow=True),
        Rule(LinkExtractor(allow=(r'jobs/\d+.html')), callback='parse_job', follow=True),
        Rule(LinkExtractor(allow=r'gongsi/\d+.html', restrict_xpaths="//dl[@id='job_company']"),
             callback='parse_company', follow=True),
        Rule(LinkExtractor(allow=r'gongsi/i\d+.html', restrict_xpaths="//a[@class='view-more']"),
             callback='parse_review', follow=True)
    )

Once the data is extracted, it is wrapped into the corresponding item and handed to the pipeline, as usual. The items use an ItemLoader so the field assignments stay readable, and the pipeline uses Twisted to make the MySQL inserts asynchronous. That is the crawler logic in a nutshell. On top of it sit two downloader middlewares: one rotates the proxy IP and the other rotates the User-Agent. The User-Agent comes from the fake_useragent package; the proxies come from the paid Jiguang (极光) proxy service (it also hands out a small free quota every day), fetched in advance and stored in the database. The IP handling is admittedly wasteful: whenever a response comes back with a status other than 200, that proxy is simply deleted. (A sketch of the proxy helper follows the settings snippet below.)

# Classes from the middlewares file
class RandomUserAgentMiddlware(object):
    # Rotate the User-Agent header on every request
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)

        request.headers.setdefault('User-Agent', get_ua())


class RandomProxyMiddleware(object):
    # Attach a random proxy ip to every request
    def process_request(self, request, spider):
        get_ip = GetIP()
        request.meta["proxy"] = get_ip.get_random_ip()

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # pick a random User-Agent for each request
    'ArticleSpider.middlewares.RandomUserAgentMiddlware': 400,
    'ArticleSpider.middlewares.RandomProxyMiddleware': 410,
    # SeleniumMiddleware (not enabled here)
    # 'ArticleSpider.middlewares.SeleniumMiddleware': 543,
    # disable Scrapy's built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
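The GetIP helper (ArticleSpider.tool.crawl_jiguang) is not shown in this post; only its get_random_ip() and delete_ip() methods are used by the proxy middleware and the spider. A minimal sketch, assuming proxies are stored in a MySQL table named proxy_ip with host/port columns (the table and column names are my assumption, not the project's actual schema):

import MySQLdb


class GetIP(object):
    # Sketch of the proxy-pool helper; the schema below is assumed, not the project's real one
    def __init__(self):
        self.conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="123456",
                                    db="mptest", charset="utf8")
        self.cursor = self.conn.cursor()

    def get_random_ip(self):
        # pick one stored proxy at random, in the form scrapy expects for request.meta["proxy"]
        self.cursor.execute("SELECT host, port FROM proxy_ip ORDER BY RAND() LIMIT 1")
        host, port = self.cursor.fetchone()
        return "http://{0}:{1}".format(host, port)

    def delete_ip(self, host):
        # drop a proxy that produced a non-200 response
        self.cursor.execute("DELETE FROM proxy_ip WHERE host = %s", (host,))
        self.conn.commit()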

The spider code follows. Each review is stored as a dict; the review item holds just the list of dicts, from which a single batch INSERT statement is generated.

# -*- coding: utf-8 -*-
from functools import reduce
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ArticleSpider.items import LagouJob, LagouCompany, LagouJobItemLoader, LagouReview
from ArticleSpider.utils.common import get_md5, clear_str, get_now, get_max_min_salary, get_city
import re
import MySQLdb
from ArticleSpider.tool.crawl_jiguang import GetIP
from selenium import webdriver
import time
from scrapy.http import HtmlResponse
from selenium.common.exceptions import WebDriverException, ElementNotVisibleException, NoSuchElementException

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="123456", db="mptest", charset="utf8")
cursor = conn.cursor()
driver = webdriver.Chrome()

# Build one dict per review from a list of review selectors; review ids and dicts are appended to the lists passed in
def handle_review_list(review_list, id_list, review_data_list, company_url, company_url_id, company_name):
    for review in review_list:
        review_data = dict()
        id = review.xpath("div[@class='review-action']//a//@data-id").extract_first('')
        review_data['id'] = id
        id_list.append("'" + id + "'")
        review_data['review_comment'] = review.xpath(
            "div[@class='review-content']//div[@class='interview-process']//text()").extract_first('')
        review_data['company_url'] = company_url
        review_data['company_url_id'] = company_url_id
        review_data['company_name'] = company_name
        review_data['review_tags'] = ','.join(
            review.xpath("div[@class='review-tags clearfix']//div//text()").extract())
        review_data['useful_count'] = review.xpath(
            "div[@class='review-action']/a/span/text()").extract_first('0')
        scores = review.xpath(
            "div[@class='review-stars clearfix']//span[@class='score']//text()").extract()
        score = round(reduce(lambda x, y: float(x) + float(y), scores) / len(scores), 1)
        review_data['score'] = score
        review_data['review_job'] = review.xpath(
            "div[@class='review-stars clearfix']//a[@class='job-name']//text()").extract_first('')
        review_data['comment_time'] = review.xpath(
            "div[@class='review-stars clearfix']//span[@class='review-date']//text()").extract_first('')
        review_data_list.append(review_data)

# Selenium handling: if the page has a collapsed description, load it in the browser, click the expand element and return a response built from the rendered page
def return_new_company_response(request, response):
    if len(response.xpath("//span[@class='text_over']").extract()) > 0:
        time.sleep(1)
        driver.get(request.url)
        time.sleep(2)
        try:
            driver.find_element_by_xpath("//span[@class='text_over']").click()
        except ElementNotVisibleException as e:
            return response
        except WebDriverException as e:
            return response
        except NoSuchElementException as e:
            return response
        return HtmlResponse(url=driver.current_url, body=driver.page_source,
                            encoding="utf-8", request=request)
    else:
        return response


# Check whether the url is already stored in the given table
def check_table_url(table, url):
    check_sql = "SELECT * FROM {0} where url = '{1}'".format(table, url)
    cursor.execute(check_sql)
    return len(list(cursor)) == 0

# Check whether all of these review ids are already stored
def check_comment_in(param):
    check_sql = "select id from lagou_review where id in ({0})".format(",".join(param))
    cursor.execute(check_sql)
    return len(list(cursor)) == len(param)

# If the response status is not 200, remove the proxy ip that produced it
def delete_ip(response):
    if response.status != 200:
        request = response.request
        ip = request.meta["proxy"]
        ip = ip.split('//')[1]
        get_ip.delete_ip(ip.split(':')[0])
        return False
    return True


# Proxy pool helper
get_ip = GetIP()


class LagouSpider(CrawlSpider):
    handle_httpstatus_list = [302]
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com']
    # Crawl rules: restrict_xpaths limits which parts of a page links are extracted from
    rules = (
        Rule(LinkExtractor(allow=("zhaopin/.*")), callback='parse_zhaopin', follow=True),
        Rule(LinkExtractor(allow=(r'jobs/\d+.html')), callback='parse_job', follow=True),
        Rule(LinkExtractor(allow=r'gongsi/\d+.html', restrict_xpaths="//dl[@id='job_company']"),
             callback='parse_company', follow=True),
        Rule(LinkExtractor(allow=r'gongsi/i\d+.html', restrict_xpaths="//a[@class='view-more']"),
             callback='parse_review', follow=True)
    )

    def parse_zhaopin(self, response):
        delete_ip(response)

    # Parse a company page: extract the fields with xpath and load them into an item for the pipeline; the collapsed description is expanded with a selenium click
    def parse_company(self, response):
        if delete_ip(response):
            url = response.url
            match_re = r'(https://www.lagou.com/gongsi/?\d+.html).*$'
            match = re.match(match_re, url)
            if match:
                url = match.group(1)
                if check_table_url('lagou_company', url):
                    response = return_new_company_response(response.request, response)
                    companyItemLoader = LagouJobItemLoader(item=LagouCompany(), response=response)
                    companyItemLoader.add_value('url', url)
                    companyItemLoader.add_value('url_object_id', get_md5(url))
                    tags = response.xpath("//div[@id='tags_container']//li//text()").extract()
                    if len(tags) != 0:
                        tags = clear_str((',').join(tags))
                    else:
                        tags = ''
                    companyItemLoader.add_value('tags', tags)
                    company_name = clear_str(
                        ''.join(response.xpath("//h1[@class='company_main_title']//text()").extract()))
                    companyItemLoader.add_value('company_name', company_name)
                    companyItemLoader.add_value('industry', response.xpath(
                        "//div[@id='basic_container']//li//i[@class='type']/following-sibling::span[1]//text()").extract_first(
                        ""))
                    companyItemLoader.add_value('finance', response.xpath(
                        "//div[@id='basic_container']//li//i[@class='process']/following-sibling::span[1]//text()").extract_first(
                        ""))
                    companyItemLoader.add_value('people_count', response.xpath(
                        "//div[@id='basic_container']//li//i[@class='number']/following-sibling::span[1]//text()").extract_first(
                        ""))
                    companyItemLoader.add_value('city', response.xpath(
                        "//div[@id='basic_container']//li//i[@class='address']/following-sibling::span[1]//text()").extract_first(
                        ""))
                    score = response.xpath("//span[@class='score']//text()").extract_first("0")
                    companyItemLoader.add_value('score', score)
                    create_date = response.xpath(
                        r"//div[@class='company_bussiness_info_container']//div[@class='content']//text()").extract()
                    if len(create_date) != 0:
                        create_date = create_date[1]
                    else:
                        create_date = ''
                    companyItemLoader.add_value('create_date', create_date)
                    company_desc = response.xpath(
                        "//div[@id='company_intro']/div[@class='item_content']/div[@class='company_intro_text']//text()").extract()
                    company_desc = clear_str(('').join(company_desc))
                    companyItemLoader.add_value('company_desc', company_desc.strip())
                    companyItemLoader.add_value('crawl_time', get_now())
                    company_data = response.xpath("//div[@class='company_data']//li//strong//text()").extract()
                    companyItemLoader.add_value('review_count', company_data[3].strip())
                    companyItemLoader.add_value('job_count', company_data[0].strip())
                    company_item = companyItemLoader.load_item()
                    return company_item

    # Parse a Lagou job posting
    def parse_job(self, response):
        # If the response status is not 200, drop the proxy ip
        if delete_ip(response):
            url = response.url
            match_re = r'(https://www.lagou.com/jobs/?\d+.html).*$'
            match = re.match(match_re, url)
            # As before, the fields are extracted with xpath and cleaned up with regexes and string helpers
            if match:
                url = match.group(1)
                # Skip urls that are already in the database
                if check_table_url('lagou_job', url):
                    jobItemLoader = LagouJobItemLoader(item=LagouJob(), response=response)
                    jobItemLoader.add_xpath('title', "//div[@class='job-name']//h1/text()")
                    jobItemLoader.add_value('url', url)
                    url_object_id = get_md5(url)
                    jobItemLoader.add_value('url_object_id', url_object_id)
                    job_request = response.xpath("//dd[@class='job_request']//span/text()").extract()
                    salary = job_request[0].strip()
                    jobItemLoader.add_value('max_salary', get_max_min_salary(salary, True))
                    jobItemLoader.add_value('min_salary', get_max_min_salary(salary, False))
                    job_city = get_city(job_request[1])
                    jobItemLoader.add_value('job_city', job_city)
                    work_years = job_request[2]
                    jobItemLoader.add_value('work_years', work_years)
                    degree_need = job_request[3]
                    jobItemLoader.add_value('degree_need', degree_need)
                    job_type = job_request[4]
                    jobItemLoader.add_value('job_type', job_type)
                    jobItemLoader.add_xpath('publish_time', "//p[@class='publish_time']/text()")
                    jobItemLoader.add_xpath('job_advantage', "//dd[@class='job-advantage']//p/text()")
                    jobItemLoader.add_xpath('job_desc', "//div[@class='job-detail']//text()")
                    job_addr = ''.join(response.xpath("//div[@class='work_addr']//text()").extract())
                    jobItemLoader.add_value('job_addr', clear_str(job_addr))
                    jobItemLoader.add_value("company_name",
                                            response.xpath("//h3[@class='fl']/em/text()").extract()[0].strip())
                    jobItemLoader.add_xpath("company_url", "//dl[@class='job_company']//a/@href")
                    jobItemLoader.add_xpath("company_url_id", "//dl[@class='job_company']//a/@href")
                    tags = response.xpath("//ul[@class='position-label clearfix']//li/text()").extract()
                    if len(tags) != 0:
                        tags = clear_str((',').join(tags))
                    else:
                        tags = ''
                    jobItemLoader.add_value('tags', tags)
                    jobItemLoader.add_value('crawl_time', get_now())
                    job_item = jobItemLoader.load_item()
                    return job_item

    # Each page carries a list of reviews, so every review becomes a dict; the item holds the list of dicts and later produces a batch insert statement
    def parse_review(self, response):
        if delete_ip(response):
            url = response.url
            match_re = r'(https://www.lagou.com/gongsi/i?\d+.html).*$'
            match = re.match(match_re, url)
            if match:
                review_item = LagouReview()
                review_data_list = list()
                company_url = response.xpath("//div[@class='reviews-title']/a/@href").extract()[0]
                company_url_id = get_md5(company_url)
                company_name = response.xpath("//div[@class='reviews-title']/a/text()").extract_first('')
                id_list = list()
                count = 5
                review_list = response.xpath("//div[@class='review-right']")
                handle_review_list(review_list, id_list, review_data_list, company_url, company_url_id, company_name)
                time.sleep(1)
                driver.get(url)
                for x in range(1, count):
                    try:
                        time.sleep(1)
                        driver.find_element_by_xpath("//span[@class='next']").click()
                        time.sleep(2)
                    except ElementNotVisibleException as e:
                        break
                    except WebDriverException as e:
                        break
                    except NoSuchElementException as e:
                        break
                    response = HtmlResponse(url=driver.current_url, body=driver.page_source,
                                            encoding="utf-8", request=url)
                    review_list = response.xpath("//div[@class='review-right']")
                    handle_review_list(review_list, id_list, review_data_list, company_url, company_url_id,
                                       company_name)
                if not check_comment_in(id_list):
                    review_item['review_data_list'] = review_data_list
                    return review_item
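The helper functions imported from ArticleSpider.utils.common (get_md5, clear_str, get_now, and so on) are not reproduced here; they live in the repository linked at the end. For orientation, the simpler ones presumably look something like the sketch below (my approximation, not the project's actual code):

import hashlib
import re
from datetime import datetime


def get_md5(url):
    # stable md5 fingerprint of a url, used as url_object_id (approximation)
    if isinstance(url, str):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


def clear_str(value):
    # strip the whitespace and newlines left over from the page markup (approximation)
    return re.sub(r"\s+", "", value)


def get_now():
    # crawl timestamp in a MySQL-friendly format (approximation)
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")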

The corresponding items file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
from scrapy.loader import ItemLoader
import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst, Join
import re
from ArticleSpider.utils.common import get_publish_time, clear_str, get_md5, get_num


class LagouJobItemLoader(ItemLoader):
    # Custom ItemLoader: take the first extracted value by default
    default_output_processor = TakeFirst()


class LagouCompany(scrapy.Item):
    tags = scrapy.Field(input_processor=MapCompose(Join('')))
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    company_name = scrapy.Field()
    industry = scrapy.Field()
    finance = scrapy.Field()
    people_count = scrapy.Field()
    city = scrapy.Field()
    score = scrapy.Field()
    create_date = scrapy.Field()
    company_desc = scrapy.Field()
    crawl_time = scrapy.Field()
    review_count = scrapy.Field(input_processor=MapCompose(get_num))
    job_count = scrapy.Field(input_processor=MapCompose(get_num))

    def get_insert_sql(item):
        insert_sql = """
        insert into lagou_company values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
          """
        params = (
            item["url"], item["url_object_id"], item["company_name"], item["industry"], item["finance"]
            , item["people_count"], item["city"], item["score"], item["create_date"], item['tags'],
            item["company_desc"], item["crawl_time"]
            , item["review_count"], item["job_count"]
        )
        return insert_sql, params


class LagouJob(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    max_salary = scrapy.Field(
    )
    min_salary = scrapy.Field(
    )
    job_city = scrapy.Field(
        input_processor=MapCompose(clear_str)
    )
    work_years = scrapy.Field(
        input_processor=MapCompose(clear_str)
    )
    degree_need = scrapy.Field(input_processor=MapCompose(clear_str)
                               )
    job_type = scrapy.Field()
    publish_time = scrapy.Field(
        input_processor=MapCompose(get_publish_time)
    )
    job_advantage = scrapy.Field()
    job_desc = scrapy.Field(
        input_processor=Join('')
    )
    job_addr = scrapy.Field()
    company_name = scrapy.Field()
    company_url = scrapy.Field()
    company_url_id = scrapy.Field(
        input_processor=MapCompose(get_md5)
    )
    tags = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(item):
        insert_sql = """
        insert into lagou_job values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
          """
        params = (
            item["url"], item["url_object_id"], item["title"]
            , item["max_salary"], item["min_salary"], item["job_city"], item["work_years"],
            item["degree_need"], item["job_type"],
            item["publish_time"], item["tags"], item["job_advantage"], item["job_desc"],
            item["job_addr"], item["company_url"], item["company_url_id"], item["company_name"],
            item["crawl_time"]
        )
        return insert_sql, params


class LagouReview(scrapy.Item):
    review_data_list = scrapy.Field()

    def get_insert_sql(item):
        data_list = item['review_data_list']
        values = []
        for data in data_list:
            for x in data:
                data[x] = format_str(data[x])
            value = ','.join(data.values())
            value = '(' + value + ')'
            values.append(value)

        insert_sql = "insert into lagou_review values {0}".format(','.join(values))
        params = ()
        return insert_sql, params


def format_str(str):
    return "'{0}'".format(str)

The pipelines file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host = settings["MYSQL_HOST"],
            db = settings["MYSQL_DBNAME"],
            user = settings["MYSQL_USER"],
            passwd = settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)

        return cls(dbpool)

    def process_item(self, item, spider):
        # Use twisted's adbapi to run the MySQL insert asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle insert errors
        return item

    def handle_error(self, failure, item, spider):
        # Log exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Execute the insert; each item class builds its own SQL statement and params
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

 

Next, the word clouds. The processing: query a column of data from the database, segment the text with jieba (combined with a stop-word list and a keyword dictionary), count word frequencies with numpy and pandas, and finally render the cloud with wordcloud. PyEcharts and its usage are documented at https://pyecharts.org/

The code:

from pyecharts.charts import Bar
from pyecharts.charts import Pie
from pyecharts.charts import Geo
import pyecharts.options as opts
import MySQLdb
from wordcloud import WordCloud
import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import os
import pandas as pd
import re

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="123456", db="mptest", charset="utf8")
cursor = conn.cursor()
project_dir = os.path.abspath(os.path.dirname(__file__))
data_store = os.path.join(project_dir, 'data')


# Build the stop-word list
def stopwordslist():
    stopwords = [line.strip() for line in open('stop.txt', encoding='UTF-8').readlines()]
    return stopwords

# Make sure the output file exists; create it if it does not
def check_path(file):
    if not os.path.exists(file):
        f = open(file, 'w')
        f.close()


class GetData(object):

    # Query one column from the database, segment it with jieba (loading a custom keyword dictionary), count word frequencies, and render the word cloud
    def get_worldcloud(self, column, picture, table, where):
        content = ''
        # Concatenate all values of the selected column
        sql = "select {0} from {1} {2}".format(column, table, where)
        result = cursor.execute(sql)
        for x in cursor.fetchall():
            content = content + x[0]
        # Keep only letters and Chinese characters
        reg = "[^A-Za-z\u4e00-\u9fa5]"
        content = re.sub(reg, '', content)
        stopwords = stopwordslist()
        for word in stopwords:
            content = re.sub(word, '', content)
        # Segment with jieba, count with pandas/numpy
        jieba.load_userdict('keyword.txt')
        segment = jieba.cut(content)
        words_df = pd.DataFrame({'segment': segment})
        words_stat = words_df.groupby(by=['segment'])['segment'].agg(np.size)
        words_stat = words_stat.to_frame()
        words_stat.columns = ['count']
        words_stat = words_stat.reset_index().sort_values(by=["count"], ascending=False)
        image = np.array(Image.open(picture))
        word_frequence = {x[0]: x[1] for x in words_stat.values}
        wordcloud = WordCloud(font_path="msyh.ttc", width=800, height=500, mask=image, background_color="white")
        wordcloud.fit_words(word_frequence)
        # Display the cloud (optional: you can skip this and save the image directly)
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.show()
        path = data_store + "\\" + table + '-' + column + ".png"
        check_path(path)
        wordcloud.to_file(path)
    # Average salary per city, computed from the job table and rendered as a geo scatter chart
    def get_salary_city(self):
        sql = """
        select city,cast(sum((max_salary+min_salary)/2)/count(*) as signed) salary from lagou_job GROUP BY city order by salary desc
        """
        cursor.execute(sql)
        key = []
        values = []
        for info in cursor.fetchall():
            if not info[0]=='海外':
                key.append(info[0])
                values.append(info[1])



        max_salary = values[0]
        geo = Geo(init_opts=opts.InitOpts(width="1600px", height="1000px"))
        geo.add_schema(maptype="china",
                       itemstyle_opts=opts.ItemStyleOpts(color="#DDF8FF", border_color="#111"),
                       )
        geo.add("薪资水平", [list(z) for z in zip(key, values)], type_="effectScatter")
        geo.set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        geo.set_global_opts(
            visualmap_opts=opts.VisualMapOpts(max_=max_salary),
            title_opts=opts.TitleOpts(title="各城市工资水平,单位K"))
        file = data_store + "\\" + "salary.html"
        check_path(file)
        geo.render(file)

    # Heat map of how many companies each city has
    def get_company_city(self):
        sql = """
         select city,count(distinct url) as company_count from lagou_company GROUP BY city 
         """
        result = cursor.execute(sql)
        key = []
        values = []
        for info in cursor.fetchall():
            if info[0]!='海外' and re.match('[\u4e00-\u9fa5]',info[0]):
                key.append(info[0])
                values.append(info[1])
        geo = Geo(init_opts=opts.InitOpts(width="1600px", height="1000px"))
        geo.add_schema(maptype="china"
                       )
        geo.add("公司分布", [list(z) for z in zip(key, values)], type_="heatmap")
        geo.set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        geo.set_global_opts(
            visualmap_opts=opts.VisualMapOpts(),
            title_opts=opts.TitleOpts(title="各城市公司分布"))
        file = data_store + "\\" + "company.html"
        check_path(file)
        geo.render(file)
    # Group-by count of one column in a table, rendered as a bar chart
    def get_bar(self, table, column, title):
        sql = "select count(*),{0} from {1} group by {2}".format(column,table, column)
        result = cursor.execute(sql)
        key = []
        values = []
        for info in cursor.fetchall():
            key.append(info[1])
            values.append(info[0])
        bar = Bar()
        bar.add_xaxis(key)
        bar.add_yaxis(title, values)
        bar.set_global_opts(title_opts=opts.TitleOpts(title=title))
        file = data_store + "\\" + table + '-' + column + ".html"
        check_path(file)
        bar.render(file)
    # Count jobs whose titles match each role keyword (a UNION of queries) and render the demand distribution as a pie chart
    def get_work_data(self):
        work = """
        select * from 
        (select count(*) as count,'java' from lagou_job where title like '%java%'
        union
        select count(*) as count,'python' from lagou_job where title like '%python%'
        union
        select count(*) as count,'前端' from lagou_job where title like '%前端%'
        union
        select count(*) as count,'后台/后端' from lagou_job where title like '%后台%' or title like '%后端%'
        union
        select count(*) as count,'区块链' from lagou_job where title like '%区块链%'
        union
        select count(*) as count,'C++/C' from lagou_job where title like '%C++%' or title like '%C'
        union
        select count(*) as count,'产品' from lagou_job where title like '%产品%'
        union
        select count(*) as count,'运维' from lagou_job where title like '%运维%'
        union
        select count(*) as count,'测试' from lagou_job where title like '%测试%'
        union
        select count(*) as count,'网络' from lagou_job where title like '%网络%'
        union
        select count(*) as count,'安全' from lagou_job where title like '%安全%'
        union
        select count(*) as count,'.net' from lagou_job where title like '%.net%'
        union
        select count(*) as count,'php' from lagou_job where title like '%php%'
        union
        select count(*) as count,'大数据/数据相关' from lagou_job where title like '%大数据%' or title like '%数据%'
        union
        select count(*) as count,'算法/NLP/机器学习/ai/深度学习/自然语言/人工智能/图像' from lagou_job where title like '%机器学习%' or title like '%ai%' or title like'%深度学习%' or title like '%自然语言%' or title like '%智能%' or title like '%图像%' or title like '%.算法%' or title like '%NLP%'
        union
        select count(*) as count,'架构' from lagou_job where title like '%架构%'
        union
        select count(*) as count,'devops' from lagou_job where title like '%devops%'
        union
        select count(*) as count,'go' from lagou_job where title like '%go%')res order by count desc
        """
        result = cursor.execute(work)
        work_key = []
        work_value = []
        for info in cursor.fetchall():
            work_key.append(info[1])
            work_value.append(info[0])

        data_pair = [list(z) for z in zip(work_key, work_value)]
        pie = Pie(init_opts=opts.InitOpts(width="1600px", height="1000px"))
        pie.add(series_name="职位",
                data_pair=data_pair,
                )
        pie.set_global_opts(
            title_opts=opts.TitleOpts(
                title="职位需求分布",
                pos_left="center",
                pos_top="20",
                title_textstyle_opts=opts.TextStyleOpts(color="#fff"),
            ),
            legend_opts=opts.LegendOpts(is_show=False),

        )
        file = data_store + "\\" + "work.html"
        check_path(file)
        pie.render(file)
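The post does not show how these methods are invoked; a usage sketch (the column, mask image, and WHERE clause passed to get_worldcloud are illustrative):

if __name__ == '__main__':
    data = GetData()
    data.get_worldcloud('tags', 'mask.png', 'lagou_company', '')  # word cloud of company tags
    data.get_salary_city()   # salary geo chart
    data.get_company_city()  # company heat map
    data.get_work_data()     # job-demand pie chart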

Some results:

Jobs: 8,306 rows

Companies: 1,925 rows

Interview reviews: 17,312 rows

[Heat map of where hiring companies are located]

[Average salary by city]

[Job-demand distribution]

And so on; I won't show every chart here. The SQL files are also included in the project.

GitHub repository: https://github.com/97lele/lagouscrapydemo
