Scrapy CrawlSpider: scraping Tencent job listings

Latest recommended article published on 2022-12-08 17:04:27

日出2133

License: CC 4.0 BY-SA

Category: spider

Article link: https://blog.youkuaiyun.com/qwe1110/article/details/79599258

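This article shows how to scrape Tencent job listings with the Scrapy framework. The key steps are matching URLs with regular expressions, following pagination automatically, and extracting the content of each listing.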

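CrawlSpider is a specialized spider class in Scrapy. Its main idea is to decide which links to follow by matching URLs against regular expressions.

The first rule handles automatic pagination.

The second rule extracts the content.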
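To see what the two rules select, the patterns can be tried on a few URLs with the standard `re` module. This is only a rough sketch: it assumes the link extractor applies each `allow` pattern as a regular-expression search against the full URL, and the sample URLs and the `classify` helper are made up for illustration.

```python
import re

# The same two patterns that the spider's rules use.
PAGINATION = re.compile(r'&start=\d+')   # list pages: ...&start=10
DETAIL = re.compile(r'\?id=\d+')         # job detail pages: ...?id=12345


def classify(url):
    """Return 'pagination', 'detail', or None for a URL (illustrative helper)."""
    # Rules are checked in the order they are declared in the spider.
    if PAGINATION.search(url):
        return 'pagination'
    if DETAIL.search(url):
        return 'detail'
    return None


# Hypothetical URLs, shaped like the start URL used by the spider.
urls = [
    'https://hr.tencent.com/position.php?&start=10#a',
    'https://hr.tencent.com/position_detail.php?id=12345',
    'https://hr.tencent.com/about.php',
]

for url in urls:
    print(url, '->', classify(url))
```

Links that match neither pattern are ignored by the spider, which keeps the crawl confined to list pages and job detail pages.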

May everyone improve a little every day; you will find that life is wonderful~

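If you are reading this, you have probably been working with crawlers for a while already. For anyone who is thinking about getting into web scraping,

I recommend a book: 《Python爬虫开发与项目实战》 (Python Crawler Development and Project Practice).

There is no need to buy it; it is available online. It covers the fundamentals. Sharpen your fundamentals through interviews and build experience through projects.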

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tencent_job.items import TencentJobItem


class TxjobSpider(CrawlSpider):
    name = 'txjob'
    allowed_domains = ['tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?&start=0#a']
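    rules = (
        # Rule 1: follow pagination links (&start=N); no callback, so these pages are only mined for more links.
        Rule(LinkExtractor(allow=r'&start=\d+'), follow=True),
        # Rule 2: send job detail pages (?id=N) to parse_item.
        Rule(LinkExtractor(allow=r'\?id=\d+'), callback='parse_item'),
    )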

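    def parse_item(self, response):
        # Build a fresh item for every job detail page.
        item = TencentJobItem()
        item['title'] = response.xpath('//td[@id="sharetitle"]/text()').extract_first()
        # The "c bottomline" row holds location, job type, and headcount in its first three cells.
        item['address'] = response.xpath('//tr[@class="c bottomline"]/td[1]/text()').extract_first()
        item['type'] = response.xpath('//tr[@class="c bottomline"]/td[2]/text()').extract_first()
        item['count'] = response.xpath('//tr[@class="c bottomline"]/td[3]/text()').extract_first()
        # Collect every text node under the bulleted lists as a list of strings.
        item['responsibility'] = response.xpath('//ul[@class="squareli"]//text()').extract()
        # Hand the populated item to the item pipelines.
        yield item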
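The `responsibility` field is a list of raw text nodes, so it will usually contain whitespace-only fragments between the list entries. A minimal sketch of one way to tidy it up before storing it, assuming plain string cleanup is enough; `clean_fragments` and the sample data are hypothetical and not part of the original project.

```python
def clean_fragments(fragments):
    """Strip surrounding whitespace from each fragment and drop empty ones."""
    return [text.strip() for text in fragments if text and text.strip()]


# Hypothetical raw output of .extract() on the description lists.
raw = ['\n    ', 'Design backend services', '\r\n', '  Write unit tests  ']

cleaned = clean_fragments(raw)
print(cleaned)  # ['Design backend services', 'Write unit tests']
```

A helper like this could be called inside `parse_item` or moved into an item pipeline, depending on where the rest of the cleanup lives.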