Python Web Scraping: Crawling Job Listings from Lagou.com

Student ID: 16310320326  Name: Zhang Lin
Basic Approach
The crawler should provide the following features:
1. Store the scraped company names, job titles, and related information in an Excel sheet.
2. Visualize the scraped information as charts.
3. Build a word cloud from the scraped job postings.
Studying the Target Site
Opening Lagou reveals the target URL: https://www.lagou.com/jobs/list_python?labelWords=sug&fromSearch=true&suginput=py.
During scraping, the data returned from this URL turns out to differ drastically from what the page shows in a browser, so the site clearly has an anti-crawling mechanism. Press F12 to open the developer tools and inspect the source:
The site loads its data through asynchronous AJAX requests: the real data lives in positionAjax.json, so we can mimic the browser's request to obtain Lagou's actual data. Inspect the request headers of positionAjax.json:
[Screenshot: request headers of positionAjax.json]
This yields the real URL: https://www.lagou.com/jobs/positionAjax.json?city=成都&needAddtionalResult=false
Beyond that, we also need the Cookie, Referer, X-Anit-Forge-Code, X-Anit-Forge-Token, and X-Requested-With request headers, plus the fields under Form Data. The pn field in Form Data is the current page number, so we can loop over it to fetch every page of results.
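Because the city name is Chinese, it must be percent-encoded before it can appear in the URL; urllib.parse.quote handles this. A quick illustration (not part of the original post):

from urllib.parse import quote

print(quote("成都"))  # -> %E6%88%90%E9%83%BD (UTF-8 percent-encoding of the city name)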
The full code:

import requests
import time
import xlwt
from urllib.parse import quote  # quote lives in urllib.parse, not urllib.request

# Build the search-results page URL from user input
URL1 = "https://www.lagou.com/jobs/list_"
URL3 = "?city="
URL5 = "&cl=false&fromSearch=true&labelWords=&suginput="
url1 = input("请输入您要查的职位:")   # job keyword to search for
url2 = input("请输入您要查询的城市:")  # city to search in
url4 = quote(url2)                      # percent-encode the Chinese city name
url_job = URL1 + url1 + URL3 + url4 + URL5
print(url_job)
def getJoblist(page):
    """Fetch one page of job postings from the positionAjax.json endpoint."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
        "Referer": url_job,
        "Host": "www.lagou.com",
        "Origin": "https://www.lagou.com",
        "X-Anit-Forge-Code": "0",
        "X-Anit-Forge-Token": "None",
        "X-Requested-With": "XMLHttpRequest"
    }
    # Cookie string copied from a logged-in browser session; it expires and
    # must be refreshed before each run. It is sent as a request header.
    headers["Cookie"] = "__guid=237742470.2364411580900169700.1542356654561.3325; _ga=GA1.2.408294538.1542356655; user_trace_token=20181116162414-00e0f37d-e979-11e8-8906-5254005c3644; LGUID=20181116162414-00e0f7d4-e979-11e8-8906-5254005c3644; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221671b9ef5711c1-0b6fd440eb80bb-3c604504-2073600-1671b9ef5722dd%22%2C%22%24device_id%22%3A%221671b9ef5711c1-0b6fd440eb80bb-3c604504-2073600-1671b9ef5722dd%22%2C%22props%22%3A%7B%22%24latest_utm_source%22%3A%22m_cf_cpc_360_pc%22%7D%7D; hasDeliver=0; index_location_city=%E6%88%90%E9%83%BD; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; _gid=GA1.2.54237655.1543300679; WEBTJ-ID=20181127194950-16755010ad93ea-09952e8975b7a6-3c604504-2073600-16755010ada2ca; LGSID=20181127194949-8b4c1667-f23a-11e8-8164-525400f775ce; _putrc=15778E7119CDB0FD123F89F2B170EADC; JSESSIONID=ABAAABAAAIAACBI48E45C02FA95B7DA20CDAC162370D9DA; login=true; unick=%E6%8B%89%E5%8B%BE%E7%94%A8%E6%88%B75625; X_HTTP_TOKEN=7ac3d4ce2e2ce593846e8a2ed4c943ce; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1543300679,1543303247,1543319392,1543321120; gate_login_token=6615e763a3dfee2572a73a463be5fc40d7d50ece3b8fca81e7bc50e46fe29b01; _gat=1; LGRID=20181127211518-7c7a8038-f246-11e8-81a0-525400f775ce; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1543324521; TG-TRACK-CODE=search_code; SEARCH_ID=ee99b37671be4fee8f8098ee7502c741; monitor_count=106"
    data = {
        "first": "false",  # not the first search
        "pn": page,        # current page number
        "kd": url1         # search keyword
    }
    ajax_url = "https://www.lagou.com/jobs/positionAjax.json?city=" + url4 + "&needAddtionalResult=false"
    result = requests.post(ajax_url, headers=headers, data=data)
    if not result.content:  # empty response: no jobs to return
        return []
    json_data = result.json()
    jobs = json_data['content']['positionResult']['result']
    return jobs
def main():
    excel = xlwt.Workbook()
    sheet1 = excel.add_sheet("lagou")
    # Column layout of the output sheet
    fields = ['companyId', 'positionName', 'workYear', 'education',
              'jobNature', 'city', 'positionAdvantage', 'salary',
              'financeStage', 'companySize', 'district', 'companyFullName']
    for col, field in enumerate(fields):
        sheet1.write(0, col, field)           # header row
    n = 1
    for page in range(1, 31):                 # crawl pages 1-30
        print(page)
        for job in getJoblist(page):
            print(job)
            for col, field in enumerate(fields):
                sheet1.write(n, col, job[field])
            n += 1
        time.sleep(5)                         # pause between pages to avoid being blocked
    excel.save("lagou_spider_" + url1 + url2 + ".xls")

main()

This code scrapes the job postings from Lagou and, via the xlwt library, writes them to an Excel file named lagou_spider_<keyword><city>.xls. A sample run:

[Screenshot: console output of the crawler]

It also produces an Excel sheet:
[Screenshot: the generated Excel sheet]
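One weakness of the code above is the hard-coded Cookie string, which expires quickly. A minimal sketch of an alternative (not in the original code): let a requests.Session visit the search page first so the server issues fresh cookies before the AJAX call.

import requests

def get_session(list_url, headers):
    # Hypothetical helper: the GET collects the cookies the server sets,
    # and the returned session reuses them for the positionAjax.json POST.
    s = requests.Session()
    s.get(list_url, headers=headers, timeout=10)
    return s

# usage sketch:
# session = get_session(url_job, headers)
# result = session.post(ajax_url, headers=headers, data=data)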
Word cloud and chart visualization (note: the code below uses the pre-1.0 pyecharts API, where Pie and WordCloud are imported directly from the pyecharts package):

import pandas as pd
from collections import Counter
from pyecharts import Pie, WordCloud  # pre-1.0 pyecharts API

def getdata(field):
    """Read one column of the scraped Excel file and count its values."""
    wb = pd.read_excel(r'G:\Python\lagou\lagou_spider_java成都.xls')  # raw string avoids backslash escapes
    count = dict(Counter(wb[field]))
    data_list = list(count.keys())
    data_count = [count[i] for i in data_list]
    return data_list, data_count
education_list, education_count = getdata('education')                    # education level
district_list, district_count = getdata('district')                       # district
positionName_list, positionName_count = getdata('positionName')           # job title
workYear_list, workYear_count = getdata('workYear')                       # years of experience
salary_list, salary_count = getdata('salary')                             # salary
companyFullName_list, companyFullName_count = getdata('companyFullName')  # company name
pie = Pie('工作区域',title_pos='left',width=1200,height=1200)
pie.add('',district_list,district_count,is_label_show=True,legend_orient='vertical',legend_pos='right',is_random=True)
pie.show_config()
pie.render('district.html')
pie = Pie('教育程度',title_pos='center')
pie.add('',education_list,education_count, radius=[40,75],is_label_show=True,label_text_color=None,legend_orient='vertical',legend_pos='left',is_random=True)
# pie.show_config()
pie.use_theme("dark")
pie.render('education.html')

pie = Pie('薪资',title_pos='right')
pie.add('',salary_list,salary_count,is_label_show=True,legend_pos='left',legend_orient='vertical')
pie.show_config()
pie.render('salary.html')
pie = Pie('工作年限','')
pie.add('',workYear_list,workYear_count,is_label_show=True)
pie.show_config()
pie.render('workYear.html')
wordcloud = WordCloud('公司名称',title_pos='right')
wordcloud.add('',companyFullName_list,companyFullName_count,word_size_range=[20,60],shape='circle')
wordcloud.show_config()
wordcloud.render('companyFullName.html')

The rendered charts:
[Screenshots: the pie charts and the company word cloud]
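As a possible next step (not in the original post), the salary strings could be reduced to numbers before charting. A small sketch, assuming salaries follow the 'Nk-Mk' range format Lagou returns:

import re

def salary_midpoint(s):
    # Hypothetical helper: turn a range like '10k-20k' into its midpoint
    # in thousands of RMB; returns None if no numbers are found.
    nums = [int(x) for x in re.findall(r'\d+', s)]
    return sum(nums) / len(nums) if nums else None

print(salary_midpoint('10k-20k'))  # 15.0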
