Scraping job postings, companies, and interview reviews from a recruitment site with Scrapy + Selenium

This post walks through the full process of crawling job postings, company profiles, and interview reviews from Lagou with Python: the choice of tools, the crawler logic, data processing, and visualization. The main pieces are the Scrapy framework, Selenium, WordCloud, and PyEcharts.


A while back I found myself writing crawlers again, this time to scrape some data from Lagou, build word clouds with wordcloud, and generate some statistics with pyecharts.

Before crawling, it pays to study Lagou's page structure; there is usually a pattern to follow.

First, look at the job requirements section of a posting.

To scrape the fields we want, I use XPath to locate the corresponding elements. scrapy shell is convenient for debugging the selectors, but a USER_AGENT has to be set when launching it:

scrapy shell -s USER_AGENT="xx" <url>

Once the page is fetched, response.xpath() returns the matching selectors, and extract() / extract_first() pull out the values for further processing.
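
As a quick sketch, the kind of probing I do in the shell looks roughly like this (the selectors are the same ones used later in the spider; the URL is just a placeholder):

scrapy shell -s USER_AGENT="Mozilla/5.0 ..." "https://www.lagou.com/jobs/xxxxxxx.html"

# inside the shell:
response.xpath("//div[@class='job-name']//h1/text()").extract_first('')   # job title, '' if missing
response.xpath("//dd[@class='job_request']//span/text()").extract()       # salary/city/experience/degree strings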

Some pages, however, need a click before more content becomes visible, or require paging through results.

This is where Selenium comes in to simulate the user's clicks. My logic is: check whether the clickable element exists; if it does, perform the click and reassign the response object from the rendered page so the extra content can be extracted. Note that the chromedriver Selenium uses must match your browser version, and it must either live in the Python directory or have its location specified manually. Matching versions can be downloaded from http://chromedriver.storage.googleapis.com/index.html

# The chromedriver location, timeouts and so on can be specified here
driver = webdriver.Chrome() 
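
If chromedriver is not on the PATH, or you want explicit timeouts, a fuller version looks something like this (the path is a made-up example; the keyword argument matches Selenium 3, which the rest of the code uses):

from selenium import webdriver

# point Selenium at a specific chromedriver and set some timeouts
driver = webdriver.Chrome(executable_path=r"D:\tools\chromedriver.exe")  # hypothetical path
driver.set_page_load_timeout(30)  # give up on a page after 30 seconds
driver.implicitly_wait(5)         # wait up to 5 seconds when locating elements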

Another thing to watch out for is pages like the one below, where a single page holds multiple records. My approach is to grab the list of records first and then build each record one by one; pay attention to how relative XPath is used to query the child elements, as in the sketch below.
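
A minimal sketch of that pattern, using the same selectors as the review handling further down:

# grab the list of record blocks first...
review_list = response.xpath("//div[@class='review-right']")
for review in review_list:
    # ...then query inside each block with a relative XPath (no leading //)
    review_id = review.xpath("div[@class='review-action']//a//@data-id").extract_first('')
    comment = review.xpath("div[@class='review-content']//div[@class='interview-process']//text()").extract_first('')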

That covers locating the elements. I use CrawlSpider for the site-wide crawl, but since my flow starts from job postings, follows them to the companies, and then goes from each company to its interview reviews, the link extraction has to be constrained: restrict_xpaths limits where on the page links may be taken from, and callback names the method that handles each crawled page.

 rules = (
        Rule(LinkExtractor(allow=("zhaopin/.*")), callback='parse_zhaopin', follow=True),
        Rule(LinkExtractor(allow=(r'jobs/\d+.html')), callback='parse_job', follow=True),
        Rule(LinkExtractor(allow=r'gongsi/\d+.html', restrict_xpaths="//dl[@id='job_company']"),
             callback='parse_company', follow=True),
        Rule(LinkExtractor(allow=r'gongsi/i\d+.html', restrict_xpaths="//a[@class='view-more']"),
             callback='parse_review', follow=True)
    )

Once the data has been extracted, it is, as usual, wrapped into the corresponding item and handed to the pipeline. The items are populated through an ItemLoader, which keeps the field assignments cleaner to define, and the pipeline uses Twisted to make the MySQL inserts asynchronous. That is roughly the whole crawler logic. On top of it I added two middlewares: one handles rotating proxy IPs, the other rotates the User-Agent. The User-Agent is picked with the fake_useragent package; for proxies I went with the paid 极光代理 service (it also offers a small free daily quota), fetching IPs and storing them in the database (a sketch of what that GetIP helper needs to provide follows the settings snippet below). The IP handling here is admittedly wasteful: whenever a response comes back with a status other than 200, the proxy IP is simply deleted.

# Classes in the middlewares file
from fake_useragent import UserAgent
from ArticleSpider.tool.crawl_jiguang import GetIP


class RandomUserAgentMiddlware(object):
    # Swap in a random User-Agent for every request
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)

        request.headers.setdefault('User-Agent', get_ua())


class RandomProxyMiddleware(object):
    # Dynamically set a proxy IP on each request
    def process_request(self, request, spider):
        get_ip = GetIP()
        request.meta["proxy"] = get_ip.get_random_ip()

# Settings file
DOWNLOADER_MIDDLEWARES = {
    # Pick a random User-Agent header for each request
    'ArticleSpider.middlewares.RandomUserAgentMiddlware': 400,
    'ArticleSpider.middlewares.RandomProxyMiddleware': 410,
    # SeleniumMiddleware (not enabled here)
    # 'ArticleSpider.middlewares.SeleniumMiddleware': 543,
    # Disable Scrapy's built-in User-Agent middleware
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
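
The GetIP helper used by RandomProxyMiddleware is not shown in this post; as a rough sketch, all it needs to provide is a random proxy and a way to drop dead ones, assuming the fetched IPs sit in a MySQL table (the table and column names below are assumptions):

import MySQLdb


class GetIP(object):
    # Minimal sketch only; the real class also crawls the proxy vendor and fills the table
    def __init__(self):
        self.conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="123456",
                                    db="mptest", charset="utf8")
        self.cursor = self.conn.cursor()

    def get_random_ip(self):
        # assumed table: proxy_ip(ip, port)
        self.cursor.execute("select ip, port from proxy_ip order by rand() limit 1")
        ip, port = self.cursor.fetchone()
        return "https://{0}:{1}".format(ip, port)

    def delete_ip(self, ip):
        # called by the spider whenever a response comes back with a non-200 status
        self.cursor.execute("delete from proxy_ip where ip = %s", (ip,))
        self.conn.commit()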

The spider code is below. The review list is stored as dictionaries: the item holds nothing but a list of those dicts, from which a single batch insert statement is generated.

# -*- coding: utf-8 -*-
from functools import reduce
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ArticleSpider.items import LagouJob, LagouCompany, LagouJobItemLoader, LagouReview
from ArticleSpider.utils.common import get_md5, clear_str, get_now, get_max_min_salary, get_city
import re
import MySQLdb
from ArticleSpider.tool.crawl_jiguang import GetIP
from selenium import webdriver
import time
from scrapy.http import HtmlResponse
from selenium.common.exceptions import WebDriverException, ElementNotVisibleException, NoSuchElementException

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="123456", db="mptest", charset="utf8")
cursor = conn.cursor()
driver = webdriver.Chrome()

# Parse each review block on the page and append its fields, as a dict, to review_data_list
def handle_review_list(review_list, id_list, review_data_list, company_url, company_url_id, company_name):
    for review in review_list:
        review_data = dict()
        id = review.xpath("div[@class='review-action']//a//@data-id").extract_first('')
        review_data['id'] = id
        id_list.append("'" + id + "'")
        review_data['review_comment'] = review.xpath(
            "div[@class='review-content']//div[@class='interview-process']//text()").extract_first('')
        review_data['company_url'] = company_url
        review_data['company_url_id'] = company_url_id
        review_data['company_name'] = company_name
        review_data['review_tags'] = ','.join(
            review.xpath("div[@class='review-tags clearfix']//div//text()").extract())
        review_data['useful_count'] = review.xpath(
            "div[@class='review-action']/a/span/text()").extract_first('0')
        scores = review.xpath(
            "div[@class='review-stars clearfix']//span[@class='score']//text()").extract()
        score = round(reduce(lambda x, y: float(x) + float(y), scores) / len(scores), 1)
        review_data['score'] = score
        review_data['review_job'] = review.xpath(
            "div[@class='review-stars clearfix']//a[@class='job-name']//text()").extract_first('')
        review_data['comment_time'] = review.xpath(
            "div[@class='review-stars clearfix']//span[@class='review-date']//text()").extract_first('')
        review_data_list.append(review_data)

# Selenium helper: if the expandable element is present, open the page, click it and rebuild the response from the rendered HTML
def return_new_company_response(request, response):
    if len(response.xpath("//span[@class='text_over']").extract()) > 0:
        time.sleep(1)
        driver.get(request.url)
        time.sleep(2)
        try:
            driver.find_element_by_xpath("//span[@class='text_over']").click()
        except ElementNotVisibleException as e:
            return response
        except WebDriverException as e:
            return response
        except NoSuchElementException as e:
            return response
        return HtmlResponse(url=driver.current_url, body=driver.page_source,
                            encoding="utf-8", request=request)
    else:
        return response


# Return True if the given table does not yet contain this url
def check_table_url(table, url):
    check_sql = "SELECT * FROM {0} where url = '{1}'".format(table, url)
    cursor.execute(check_sql)
    return len(list(cursor)) == 0

# Return True if every one of these review ids is already in the table
def check_comment_in(param):
    check_sql = "select id from lagou_review where id in ({0})".format(",".join(param))
    cursor.execute(check_sql)
    return len(list(cursor)) == len(param)

# If the response status is not 200, remove the proxy IP and return False
def delete_ip(response):
    if response.status != 200:
        request = response.request
        ip = request.meta["proxy"]
        ip = ip.split('//')[1]
        get_ip.delete_ip(ip.split(':')[0])
        return False
    return True


# Proxy IP helper
get_ip = GetIP()


class LagouSpider(CrawlSpider):
    handle_httpstatus_list = [302]
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['https://www.lagou.com']
    # Crawl rules; restrict_xpaths limits which parts of a page urls may be extracted from
    rules = (
        Rule(LinkExtractor(allow=("zhaopin/.*")), callback='parse_zhaopin', follow=True),
        Rule(LinkExtractor(allow=(r'jobs/\d+.html')), callback='parse_job', follow=True),
        Rule(LinkExtractor(allow=r'gongsi/\d+.html', restrict_xpaths="//dl[@id='job_company']"),
             callback='parse_company', follow=True),
        Rule(LinkExtractor(allow=r'gongsi/i\d+.html', restrict_xpaths="//a[@class='view-more']"),
             callback='parse_review', follow=True)
    )

    def parse_zhaopin(self, response):
        delete_ip(response)

    # Parse a company page: extract the fields with XPath and wrap them into an item for the pipeline.
    # The company description may be collapsed behind an "expand" element, which is clicked via Selenium.
    def parse_company(self, response):
        if delete_ip(response):
            url = response.url
            match_re = r'(https://www.lagou.com/gongsi/?\d+.html).*$'
            match = re.match(match_re, url)
            if match:
                url = match.group(1)
                if check_table_url('lagou_company', url):
                    response = return_new_company_response(response.request, response)
                    companyItemLoader = LagouJobItemLoader(item=LagouCompany(), response=response)
                    companyItemLoader.add_value('url', url)
                    companyItemLoader.add_value('url_object_id', get_md5(url))
                    tags = response.xpath("//div[@id='tags_container']//li//text()").extract()
                    if len(tags) != 0:
                        tags = clear_str((',').join(tags))
                    else:
                        tags = ''
                    companyItemLoader.add_value('tags', tags)
                    company_name = clear_str(
                        ''.join(response.xpath("//h1[@class='company_main_title']//text()").extract()))
                    companyItemLoader.add_value('company_name', company_name)
                    companyItemLoader.add_value('industry', response.xpath(
                        "//div[@id='basic_container']//li//i[@class='type']/following-sibling::span[1]//text()").extract_first(
                        ""))
                    companyItemLoader.add_value('finance', response.xpath(
                        "//div[@id='basic_container']//li//i[@class='process']/following-sibling::span[1]//text()").extract_first(
                        ""))
                    companyItemLoader.add_value('people_count', response.xpath(
                        "//div[@id='basic_container']//li//i[@class='number']/following-sibling::span[1]//text()").extract_first(
                        ""))
                    companyItemLoader.add_value('city', response.xpath(
                        "//div[@id='basic_container']//li//i[@class='address']/following-sibling::span[1]//text()").extract_first(
                        ""))
                    score = response.xpath("//span[@class='score']//text()").extract_first("0")
                    companyItemLoader.add_value('score', score)
                    create_date = response.xpath(
                        r"//div[@class='company_bussiness_info_container']//div[@class='content']//text()").extract()
                    if len(create_date) != 0:
                        create_date = create_date[1]
                    else:
                        create_date = ''
                    companyItemLoader.add_value('create_date', create_date)
                    company_desc = response.xpath(
                        "//div[@id='company_intro']/div[@class='item_content']/div[@class='company_intro_text']//text()").extract()
                    company_desc = clear_str(('').join(company_desc))
                    companyItemLoader.add_value('company_desc', company_desc.strip())
                    companyItemLoader.add_value('crawl_time', get_now())
                    company_data = response.xpath("//div[@class='company_data']//li//strong//text()").extract()
                    companyItemLoader.add_value('review_count', company_data[3].strip())
                    companyItemLoader.add_value('job_count', company_data[0].strip())
                    company_item = companyItemLoader.load_item()
                    return company_item

    # Parse a Lagou job posting
    def parse_job(self, response):
        # Drop the proxy IP if the response status is not 200
        if delete_ip(response):
            url = response.url
            match_re = r'(https://www.lagou.com/jobs/?\d+.html).*$'
            match = re.match(match_re, url)
            # Same idea as above: extract with XPath and clean up with regexes and string helpers
            if match:
                url = match.group(1)
                # Skip the url if it is already in the database
                if check_table_url('lagou_job', url):
                    jobItemLoader = LagouJobItemLoader(item=LagouJob(), response=response)
                    jobItemLoader.add_xpath('title', "//div[@class='job-name']//h1/text()")
                    jobItemLoader.add_value('url', url)
                    url_object_id = get_md5(url)
                    jobItemLoader.add_value('url_object_id', url_object_id)
                    job_request = response.xpath("//dd[@class='job_request']//span/text()").extract()
                    salary = job_request[0].strip()
                    jobItemLoader.add_value('max_salary', get_max_min_salary(salary, True))
                    jobItemLoader.add_value('min_salary', get_max_min_salary(salary, False))
                    job_city = get_city(job_request[1])
                    jobItemLoader.add_value('job_city', job_city)
                    work_years = job_request[2]
                    jobItemLoader.add_value('work_years', work_years)
                    degree_need = job_request[3]
                    jobItemLoader.add_value('degree_need', degree_need)
                    job_type = job_request[4]
                    jobItemLoader.add_value('job_type', job_type)
                    jobItemLoader.add_xpath('publish_time', "//p[@class='publish_time']/text()")
                    jobItemLoader.add_xpath('job_advantage', "//dd[@class='job-advantage']//p/text()")
                    jobItemLoader.add_xpath('job_desc', "//div[@class='job-detail']//text()")
                    job_addr = ''.join(response.xpath("//div[@class='work_addr']//text()").extract())
                    jobItemLoader.add_value('job_addr', clear_str(job_addr))
                    jobItemLoader.add_value("company_name",
                                            response.xpath("//h3[@class='fl']/em/text()").extract()[0].strip())
                    jobItemLoader.add_xpath("company_url", "//dl[@class='job_company']//a/@href")
                    jobItemLoader.add_xpath("company_url_id", "//dl[@class='job_company']//a/@href")
                    tags = response.xpath("//ul[@class='position-label clearfix']//li/text()").extract()
                    if len(tags) != 0:
                        tags = clear_str((',').join(tags))
                    else:
                        tags = ''
                    jobItemLoader.add_value('tags', tags)
                    jobItemLoader.add_value('crawl_time', get_now())
                    job_item = jobItemLoader.load_item()
                    return job_item

    # Each page returns a list of reviews, so every review becomes a dict; the item stores the list
    # of dicts and later turns it into a single batch insert statement
    def parse_review(self, response):
        if delete_ip(response):
            url = response.url
            match_re = r'(https://www.lagou.com/gongsi/i?\d+.html).*$'
            match = re.match(match_re, url)
            if match:
                review_item = LagouReview()
                review_data_list = list()
                company_url = response.xpath("//div[@class='reviews-title']/a/@href").extract()[0]
                company_url_id = get_md5(company_url)
                company_name = response.xpath("//div[@class='reviews-title']/a/text()").extract_first('')
                id_list = list()
                count = 5
                review_list = response.xpath("//div[@class='review-right']")
                handle_review_list(review_list, id_list, review_data_list, company_url, company_url_id, company_name)
                time.sleep(1)
                driver.get(url)
                for x in range(1, count):
                    try:
                        time.sleep(1)
                        driver.find_element_by_xpath("//span[@class='next']").click()
                        time.sleep(2)
                    except ElementNotVisibleException as e:
                        break
                    except WebDriverException as e:
                        break
                    except NoSuchElementException as e:
                        break
                    response = HtmlResponse(url=driver.current_url, body=driver.page_source,
                                            encoding="utf-8", request=url)
                    review_list = response.xpath("//div[@class='review-right']")
                    handle_review_list(review_list, id_list, review_data_list, company_url, company_url_id,
                                       company_name)
                if not check_comment_in(id_list):
                    review_item['review_data_list'] = review_data_list
                    return review_item

The corresponding items file:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
from scrapy.loader import ItemLoader
import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst, Join
import re
from ArticleSpider.utils.common import get_publish_time, clear_str, get_md5, get_num


class LagouJobItemLoader(ItemLoader):
    # Custom ItemLoader: take the first extracted value by default
    default_output_processor = TakeFirst()


class LagouCompany(scrapy.Item):
    tags = scrapy.Field(input_processor=MapCompose(Join('')))
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    company_name = scrapy.Field()
    industry = scrapy.Field()
    finance = scrapy.Field()
    people_count = scrapy.Field()
    city = scrapy.Field()
    score = scrapy.Field()
    create_date = scrapy.Field()
    company_desc = scrapy.Field()
    crawl_time = scrapy.Field()
    review_count = scrapy.Field(input_processor=MapCompose(get_num))
    job_count = scrapy.Field(input_processor=MapCompose(get_num))

    def get_insert_sql(item):
        insert_sql = """
        insert into lagou_company values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
          """
        params = (
            item["url"], item["url_object_id"], item["company_name"], item["industry"], item["finance"]
            , item["people_count"], item["city"], item["score"], item["create_date"], item['tags'],
            item["company_desc"], item["crawl_time"]
            , item["review_count"], item["job_count"]
        )
        return insert_sql, params


class LagouJob(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    max_salary = scrapy.Field(
    )
    min_salary = scrapy.Field(
    )
    job_city = scrapy.Field(
        input_processor=MapCompose(clear_str)
    )
    work_years = scrapy.Field(
        input_processor=MapCompose(clear_str)
    )
    degree_need = scrapy.Field(input_processor=MapCompose(clear_str)
                               )
    job_type = scrapy.Field()
    publish_time = scrapy.Field(
        input_processor=MapCompose(get_publish_time)
    )
    job_advantage = scrapy.Field()
    job_desc = scrapy.Field(
        input_processor=Join('')
    )
    job_addr = scrapy.Field()
    company_name = scrapy.Field()
    company_url = scrapy.Field()
    company_url_id = scrapy.Field(
        input_processor=MapCompose(get_md5)
    )
    tags = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(item):
        insert_sql = """
        insert into lagou_job values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
          """
        params = (
            item["url"], item["url_object_id"], item["title"]
            , item["max_salary"], item["min_salary"], item["job_city"], item["work_years"],
            item["degree_need"], item["job_type"],
            item["publish_time"], item["tags"], item["job_advantage"], item["job_desc"],
            item["job_addr"], item["company_url"], item["company_url_id"], item["company_name"],
            item["crawl_time"]
        )
        return insert_sql, params


class LagouReview(scrapy.Item):
    review_data_list = scrapy.Field()

    def get_insert_sql(item):
        data_list = item['review_data_list']
        values = []
        for data in data_list:
            for x in data:
                data[x] = format_str(data[x])
            value = ','.join(data.values())
            value = '(' + value + ')'
            values.append(value)

        insert_sql = "insert into lagou_review values {0}".format(','.join(values))
        params = ()
        return insert_sql, params


def format_str(value):
    return "'{0}'".format(value)
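
The string concatenation above keeps the item code short, but the same batch insert can also be written with placeholders, which sidesteps quoting problems; a sketch assuming the table's column order matches the dict keys built in handle_review_list:

def get_insert_sql_params(item):
    # hypothetical alternative to LagouReview.get_insert_sql using parameterised values
    data_list = item['review_data_list']
    keys = ['id', 'review_comment', 'company_url', 'company_url_id', 'company_name',
            'review_tags', 'useful_count', 'score', 'review_job', 'comment_time']
    row = "(" + ",".join(["%s"] * len(keys)) + ")"
    insert_sql = "insert into lagou_review values " + ",".join([row] * len(data_list))
    params = [data[k] for data in data_list for k in keys]
    return insert_sql, params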

The pipeline file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host = settings["MYSQL_HOST"],
            db = settings["MYSQL_DBNAME"],
            user = settings["MYSQL_USER"],
            passwd = settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)

        return cls(dbpool)

    def process_item(self, item, spider):
        # Use Twisted to run the MySQL insert asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle insert errors
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Run the actual insert:
        # each item type builds its own SQL statement, which is executed against MySQL
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
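
With the spider, items, pipeline and middlewares in place, the crawl is started from the project root as usual:

scrapy crawl lagou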

 

Next, the word clouds. The processing is: query one column from the database, segment the text with jieba using a stop-word list and a keyword dictionary, count the word frequencies with numpy and pandas, and finally render the cloud with wordcloud. pyecharts usage and docs are at https://pyecharts.org/

The code is as follows:

from pyecharts.charts import Bar
from pyecharts.charts import Pie
from pyecharts.charts import Geo
import pyecharts.options as opts
import MySQLdb
from wordcloud import WordCloud
import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
import os
import pandas as pd
import re

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="123456", db="mptest", charset="utf8")
cursor = conn.cursor()
project_dir = os.path.abspath(os.path.dirname(__file__))
data_store = os.path.join(project_dir, 'data')


# Build the stop-word list
def stopwordslist():
    stopwords = [line.strip() for line in open('stop.txt', encoding='UTF-8').readlines()]
    return stopwords

# Make sure the output file exists before writing to it; create it if it does not
def check_path(file):
    if not os.path.exists(file):
        f = open(file, 'w')
        f.close()


class GetData(object):

    # Query one column from the database, load the keyword dictionary, segment with jieba,
    # count the word frequencies and render the word cloud
    def get_worldcloud(self, column, picture, table, where):
        content = ''
        # Concatenate all rows of the selected column (e.g. every company's benefits blurb)
        sql = "select {0} from {1} {2}".format(column, table, where)
        result = cursor.execute(sql)
        for x in cursor.fetchall():
            content = content + x[0]
        # Strip everything except letters and Chinese characters
        reg = "[^A-Za-z\u4e00-\u9fa5]"
        content = re.sub(reg, '', content)
        stopwords = stopwordslist()
        for word in stopwords:
            content = re.sub(word, '', content)
        # Segment with jieba, count with pandas/numpy
        jieba.load_userdict('keyword.txt')
        segment = jieba.cut(content)
        words_df = pd.DataFrame({'segment': segment})
        words_stat = words_df.groupby(by=['segment'])['segment'].agg(np.size)
        words_stat = words_stat.to_frame()
        words_stat.columns = ['count']
        words_stat = words_stat.reset_index().sort_values(by=["count"], ascending=False)
        image = np.array(Image.open(picture))
        word_frequence = {x[0]: x[1] for x in words_stat.values}
        wordcloud = WordCloud(font_path="msyh.ttc", width=800, height=500, mask=image, background_color="white")
        wordcloud.fit_words(word_frequence)
        # Show the plot; this step is optional, the image can just be saved straight to file
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.show()
        path = data_store + "\\" + table + '-' + column + ".png"
        check_path(path)
        wordcloud.to_file(path)
    # Average salary per city (from the job table), rendered as a geo chart with a visual map
    def get_salary_city(self):
        sql = """
        select city,cast(sum((max_salary+min_salary)/2)/count(*) as signed) salary from lagou_job GROUP BY city order by salary desc
        """
        cursor.execute(sql)
        key = []
        values = []
        for info in cursor.fetchall():
            if info[0] != '海外':
                key.append(info[0])
                values.append(info[1])



        max_salary = values[0]
        geo = Geo(init_opts=opts.InitOpts(width="1600px", height="1000px"))
        geo.add_schema(maptype="china",
                       itemstyle_opts=opts.ItemStyleOpts(color="#DDF8FF", border_color="#111"),
                       )
        geo.add("薪资水平", [list(z) for z in zip(key, values)], type_="effectScatter")
        geo.set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        geo.set_global_opts(
            visualmap_opts=opts.VisualMapOpts(max_=max_salary),
            title_opts=opts.TitleOpts(title="各城市工资水平,单位K"))
        file = data_store + "\\" + "salary.html"
        check_path(file)
        geo.render(file)

    # Heat map of how many companies each city has
    def get_company_city(self):
        sql = """
         select city,count(distinct url) as company_count from lagou_company GROUP BY city 
         """
        result = cursor.execute(sql)
        key = []
        values = []
        for info in cursor.fetchall():
            if info[0]!='海外' and re.match('[\u4e00-\u9fa5]',info[0]):
                key.append(info[0])
                values.append(info[1])
        geo = Geo(init_opts=opts.InitOpts(width="1600px", height="1000px"))
        geo.add_schema(maptype="china"
                       )
        geo.add("公司分布", [list(z) for z in zip(key, values)], type_="heatmap")
        geo.set_series_opts(label_opts=opts.LabelOpts(is_show=False))
        geo.set_global_opts(
            visualmap_opts=opts.VisualMapOpts(),
            title_opts=opts.TitleOpts(title="各城市公司分布"))
        file = data_store + "\\" + "company.html"
        check_path(file)
        geo.render(file)
    # Group a table by one column and render the counts as a bar chart
    def get_bar(self, table, column, title):
        sql = "select count(*),{0} from {1} group by {2}".format(column,table, column)
        result = cursor.execute(sql)
        key = []
        values = []
        for info in cursor.fetchall():
            key.append(info[1])
            values.append(info[0])
        bar = Bar()
        bar.add_xaxis(key)
        bar.add_yaxis(title, values)
        bar.set_global_opts(title_opts=opts.TitleOpts(title=title))
        file = data_store + "\\" + table + '-' + column + ".html"
        check_path(file)
        bar.render(file)
    # Count jobs by keyword in the title, UNION the results and render the demand distribution as a pie chart
    def get_work_data(self):
        work = """
        select * from 
        (select count(*) as count,'java' from lagou_job where title like '%java%'
        union
        select count(*) as count,'python' from lagou_job where title like '%python%'
        union
        select count(*) as count,'前端' from lagou_job where title like '%前端%'
        union
        select count(*) as count,'后台/后端' from lagou_job where title like '%后台%' or title like '%后端%'
        union
        select count(*) as count,'区块链' from lagou_job where title like '%区块链%'
        union
        select count(*) as count,'C++/C' from lagou_job where title like '%C++%' or title like '%C'
        union
        select count(*) as count,'产品' from lagou_job where title like '%产品%'
        union
        select count(*) as count,'运维' from lagou_job where title like '%运维%'
        union
        select count(*) as count,'测试' from lagou_job where title like '%测试%'
        union
        select count(*) as count,'网络' from lagou_job where title like '%网络%'
        union
        select count(*) as count,'安全' from lagou_job where title like '%安全%'
        union
        select count(*) as count,'.net' from lagou_job where title like '%.net%'
        union
        select count(*) as count,'php' from lagou_job where title like '%php%'
        union
        select count(*) as count,'大数据/数据相关' from lagou_job where title like '%大数据%' or title like '%数据%'
        union
        select count(*) as count,'算法/NLP/机器学习/ai/深度学习/自然语言/人工智能/图像' from lagou_job where title like '%机器学习%' or title like '%ai%' or title like'%深度学习%' or title like '%自然语言%' or title like '%智能%' or title like '%图像%' or title like '%.算法%' or title like '%NLP%'
        union
        select count(*) as count,'架构' from lagou_job where title like '%架构%'
        union
        select count(*) as count,'devops' from lagou_job where title like '%devops%'
        union
        select count(*) as count,'go' from lagou_job where title like '%go%')res order by count desc
        """
        result = cursor.execute(work)
        work_key = []
        work_value = []
        for info in cursor.fetchall():
            work_key.append(info[1])
            work_value.append(info[0])

        data_pair = [list(z) for z in zip(work_key, work_value)]
        pie = Pie(init_opts=opts.InitOpts(width="1600px", height="1000px"))
        pie.add(series_name="职位",
                data_pair=data_pair,
                )
        pie.set_global_opts(
            title_opts=opts.TitleOpts(
                title="职位需求分布",
                pos_left="center",
                pos_top="20",
                title_textstyle_opts=opts.TextStyleOpts(color="#fff"),
            ),
            legend_opts=opts.LegendOpts(is_show=False),

        )
        file = data_store + "\\" + "work.html"
        check_path(file)
        pie.render(file)
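
A usage sketch for the class above (the mask image file name and the chosen columns are just examples; stop.txt and keyword.txt are expected to sit next to the script):

if __name__ == '__main__':
    data = GetData()
    # word cloud of all job descriptions; mask.png is a placeholder mask image
    data.get_worldcloud('job_desc', 'mask.png', 'lagou_job', '')
    data.get_salary_city()    # average salary per city -> geo scatter chart
    data.get_company_city()   # company count per city -> heat map
    data.get_bar('lagou_job', 'degree_need', '学历要求')  # bar chart grouped by degree requirement
    data.get_work_data()      # job-demand pie chart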

Here are the results:

Jobs: 8,306 rows

Companies: 1,925 rows

Interview reviews: 17,312 rows

Heat map of hiring-company distribution

City salary map

Job distribution

And so on; I won't show every chart here. The SQL files are also included in the project.

GitHub repo: https://github.com/97lele/lagouscrapydemo
