Web Scraping in Practice: Scraping Lagou with Selenium (2)

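This post presents a crawler written in Python with Selenium that scrapes the details of Python job postings on Lagou (拉勾网): salary, city, required experience, education, and job description.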
import re
import time

from lxml import etree
from selenium import webdriver

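class LagouSpider(object):
    # Local path to chromedriver.exe; adjust for your machine.
    driver_path = r"D:\Python_pycharm\PyCharm Community Edition 2018.3.5\chromedriver.exe"

    def __init__(self):
        # `executable_path` is the Selenium 3 way of locating the driver.
        self.driver = webdriver.Chrome(executable_path=LagouSpider.driver_path)
        # Search results page for the keyword "python".
        self.url = "https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput="
        # Parsed job postings, one dict per position.
        self.positions = []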

    def run(self):
        self.driver.get(self.url)
        while True:
            source=self.driver.page_source
            self.parse_list_page(source)
            # The last <span> in the pager is the "next page" button.
            next_btn = self.driver.find_element_by_xpath("//div[@class='pager_container']/span[last()]")
            if "pager_next_disabled" in next_btn.get_attribute('class'):
                break
            next_btn.click()
            time.sleep(1)

    def parse_list_page(self,source):
        html=etree.HTML(source)
        links=html.xpath("//a[@class='position_link']/@href")
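        for link in links:
            # Open each detail page in a new tab and switch to it.
            self.driver.execute_script("window.open('%s')" % link)
            self.driver.switch_to.window(self.driver.window_handles[1])
            # Fixed pause so the detail page can render before its source is read.
            time.sleep(1)
            source = self.driver.page_source
            self.parse_detail_links(source)
            # Close the detail tab and return to the listing tab.
            self.driver.close()
            self.driver.switch_to.window(self.driver.window_handles[0])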


    def parse_detail_links(self,source):
        """Extract the job fields from a detail page's HTML and store them."""
        html = etree.HTML(source)
        position_name = html.xpath("//h1[@class='name']/text()")[0]
        job_request_spans = html.xpath("//dd[@class='job_request']//span")
        salary = job_request_spans[0].xpath('.//text()')[0].strip()
        city = job_request_spans[1].xpath('.//text()')[0].strip()
        city = re.sub(r"[\s/]", '', city)
        experience = job_request_spans[2].xpath('.//text()')[0].strip()
        experience = re.sub(r"[\s/]", '', experience)
        education = job_request_spans[3].xpath('.//text()')[0].strip()
        education = re.sub(r"[\s/]", '', education)
        desc = "".join(html.xpath("//dd[@class='job_bt']//text()")).strip()
        position={
            "position_name":position_name,
            "salary":salary,
            "city":city,
            "experience":experience,
            "education":education,
            "desc":desc
        }
        self.positions.append(position)



if __name__=='__main__':
    a=LagouSpider()
    a.run()
    print(a.positions)
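
The detail parser above repeats the same `re.sub(r"[\s/]", '', ...)` cleanup for city, experience, and education. As a minimal sketch, that step could be factored into a helper that needs no browser to try out. The names `clean_field` and `parse_job_request` are hypothetical and not part of the original script, and the sample strings in the usage note are made-up illustrations rather than real scraped data.

```python
import re


def clean_field(text):
    """Remove all whitespace and '/' separators from a scraped field."""
    return re.sub(r"[\s/]", "", text)


def parse_job_request(fields):
    """Map the first four job_request span texts to named values.

    Assumes the order used in the script above:
    salary, city, experience, education.
    """
    salary, city, experience, education = fields[:4]
    return {
        # Salary is only stripped, matching the original script.
        "salary": salary.strip(),
        "city": clean_field(city),
        "experience": clean_field(experience),
        "education": clean_field(education),
    }
```

For example, `parse_job_request(["15k-25k ", "/北京 /", "经验3-5年 /", "本科及以上 /"])` produces a dict whose values no longer contain slashes or surrounding spaces.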

 
