requests + Beautiful 爬取boss直聘

本文介绍了一种使用Python的requests库和BeautifulSoup库抓取智联招聘网站职位信息的方法。通过构造请求头增强反反爬能力,解析网页源代码获取职位标题、薪资、教育背景等详细信息,并将数据保存为JSON格式。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

import requests
from bs4 import BeautifulSoup
import json
import codecs

def GetHtmlText(url):
    try:
        headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3493.3 Safari/537.36"} #随机构造请求头,增加一定的反反爬能力。
        r = requests.get(url, timeout=5, headers=headers)
        r.raise_for_status() #判断r若果不是200,产生异常requests.HTTPError异常
        r.encoding = r.apparent_encoding
        return r.text  #http响应内容的字符串形式,即URL返回的页面内容
    except:
        return None

def fillJobList(html):
    soup = BeautifulSoup(html,'html.parser') #解析网页源代码
    dic = {}
    test_list = []
    for job in soup.findAll('div',{'class':'info-primary'}):
        try:
            position = job.find('div',{'class':'job-title'}).text
            pay = job.find('span',{'class':'red'}).text
            edu = job.find('p').text
            dic['position'] = position
            dic['pay'] = pay
            dic['edu'] = edu
            detail_url = job.find('a').attrs['href']
            detail_html = GetHtmlText('https://www.zhipin.com'+ detail_url)
            detail_soup = BeautifulSoup(detail_html, 'html.parser')
            detail = detail_soup.find('div', {'class': 'text'}).text.strip()
            dic['detail'] = detail
            # print(dic)
            test_list.append(dic)
        except:
            continue
    return test_list

def write(dic):

    with open("data.json", "w", encoding='utf-8') as f:
        f.write(json.dumps(dic, ensure_ascii=False))
    # with open('jobs.json', 'w', encoding='utf-8') as f:
    #     fp = json.dumps(dic,  ensure_ascii=False)
    #     f.write(fp)

def main():
    keywords = input('输入职位:')
    pages = int(input('获取页数:'))
    for i in range(1, pages + 1):
        url = 'https://www.zhipin.com/c101270100-p100109/?query=' + keywords + '&page=' + str(i)
        html = GetHtmlText(url)
        print(fillJobList(html))
        print(len(fillJobList(html)))
        #for i in fillJobList(html):
        #     write(i)



if __name__ == '__main__':
    main()复制代码


转载于:https://juejin.im/post/5be04aadf265da614c4c4276

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值