Page URL:
https://careers.tencent.com/search.html?pcid=40001
Goal:
Output the scraped job title, responsibilities, requirements, and publish date in dictionary form.
Scrapy project structure:

Approach:
Use the browser's developer tools to capture and analyze the request URLs (the most important part of any crawl); once the URL pattern is found, simply extract the fields from the returned data.
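Before writing the spider, it is worth verifying the captured pattern with a one-off request. Below is a minimal sketch using the requests library; the parameter names and the Data/Posts/RecruitPostName fields match the captures discussed later in this post, but treat them as assumptions to confirm against your own capture:

import time
import requests

params = {
    "timestamp": int(time.time() * 1000),  # 13-digit millisecond timestamp
    "parentCategoryId": "40001",
    "pageIndex": 1,
    "pageSize": 10,
    "language": "zh-cn",
    "area": "cn",
}
resp = requests.get(
    "https://careers.tencent.com/tencentcareer/api/post/Query",
    params=params,
    timeout=10,
)
data = resp.json()
# Print one line per posting to confirm the response layout
for post in data["Data"]["Posts"]:
    print(post["RecruitPostName"], post["PostId"])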

Figure 1
As Figure 1 shows, there are 187 pages of postings in total, so the crawl must loop over every page;
The actual request URL captured in the browser is:
https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1575882949947&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
Pattern: pageIndex=1 is the page number; the other parameters can stay unchanged for now.
(After finishing the crawl I noticed this URL also carries a timestamp; it seems to have no effect, since the data can still be fetched with the timestamp held constant.)
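Hardcoding 187 pages works, but the total can also be derived from the first response. Here is a sketch, assuming the Query response exposes the total posting count in a Data.Count field (an assumption to verify against your own capture):

import math

def total_pages(query_json, page_size=10):
    # Data.Count is assumed to hold the total number of postings;
    # confirm the field name against your own packet capture.
    return math.ceil(query_json["Data"]["Count"] / page_size)

# Illustrative value only: 1867 postings at 10 per page gives 187 pages.
print(total_pages({"Data": {"Count": 1867}}))  # 187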

Figure 2

Figure 3
The PostId field returned by the list request is the trailing part of each job's detail page URL:
Example detail page:
https://careers.tencent.com/jobdesc.html?postId=1203886892391600128
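In other words, PostId and the detail page link differ only by a query parameter; a quick sketch of going in both directions:

from urllib.parse import urlsplit, parse_qs

def detail_page_url(post_id):
    # Build the human-facing detail page link from a PostId.
    return "https://careers.tencent.com/jobdesc.html?postId=" + str(post_id)

# And the reverse: pull the postId back out of a detail link.
link = detail_page_url("1203886892391600128")
post_id = parse_qs(urlsplit(link).query)["postId"][0]
print(post_id)  # 1203886892391600128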

Figure 4
From the request captured in the browser, the detail-page API URL follows this pattern:
"https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=" is the fixed part, followed by the current timestamp, then the postId and language parameters.
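Combining the fixed prefix with a fresh timestamp and the PostId from the list response yields the detail API URL; the spider below performs exactly this concatenation. A standalone sketch:

import time

DETAIL_API = "https://careers.tencent.com/tencentcareer/api/post/ByPostId"

def detail_api_url(post_id):
    # Fixed prefix + current millisecond timestamp + postId + language.
    ts = int(time.time() * 1000)
    return f"{DETAIL_API}?timestamp={ts}&postId={post_id}&language=zh-cn"

print(detail_api_url("1203886892391600128"))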
The Python code is as follows:
For background on the timestamp generation, see: https://blog.youkuaiyun.com/qq_31603575/article/details/83343791
# -*- coding: utf-8 -*-
import scrapy
import json
import datetime
import time


def get_time_stamp13():
    """Build a 13-digit millisecond timestamp, e.g. 1540281250399."""
    datetime_now = datetime.datetime.now()
    # 10 digits: seconds since the Unix epoch
    date_stamp = str(int(time.mktime(datetime_now.timetuple())))
    # 3 digits: milliseconds (the first three digits of the microsecond field)
    date_millisecond = str("%06d" % datetime_now.microsecond)[0:3]
    return int(date_stamp + date_millisecond)


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1575855782891&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=40001&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn']

    def parse_detail(self, response):
        # The detail API returns JSON; convert it to a dict and pick the fields
        detail_json = json.loads(response.text)
        job_dic = {}
        job_dic['job_title'] = detail_json['Data']['RecruitPostName']
        job_dic['responsibilities'] = detail_json['Data']['Responsibility']
        job_dic['publish_date'] = detail_json['Data']['LastUpdateTime']
        job_dic['requirements'] = detail_json['Data']['Requirement']
        print(job_dic)

    def parse(self, response):
        html_json = json.loads(response.text)
        # Each entry in Data.Posts carries the PostId of its detail page
        for post in html_json['Data']['Posts']:
            time_format = str(get_time_stamp13())
            # Assemble the detail API URL: fixed prefix + timestamp + postId
            desc_url = ('https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp='
                        + time_format + '&postId=' + post['PostId'] + '&language=zh-cn')
            # Submit the request; parse_detail handles the response
            yield scrapy.Request(url=desc_url, callback=self.parse_detail, dont_filter=True)
        # Schedule the remaining list pages (2 through 187) from page 1 only,
        # otherwise every page response would re-enqueue all pages
        if 'pageIndex=1&' in response.url:
            for i in range(2, 188):
                url = ('https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1575855782891&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=40001&attrId=&keyword=&pageIndex='
                       + str(i) + '&pageSize=10&language=zh-cn&area=cn')
                yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
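As an aside, the get_time_stamp13 helper can be collapsed to one line: time.time() already returns fractional seconds, so multiplying by 1000 and truncating yields the same 13-digit millisecond value:

import time

def get_time_stamp13():
    # Milliseconds since the Unix epoch, as a 13-digit integer.
    return int(time.time() * 1000)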

The execution result is shown below:

This post has shown how to crawl job listings from the Tencent careers site with Scrapy, covering the job title, responsibilities, requirements, and publish date, and has walked through the request URL patterns and the Python implementation.