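Scraping customer-service job listings from Lagou (拉勾网) and saving them to a database (company names only; the next post filters by company name)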

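This post shows how to use a Selenium-based crawler to scrape customer-service job listings from Lagou and store the results in a MySQL database. For now only the company name of each listing is saved. The next post will filter and analyze the data based on those company names.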
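
The start URL used by the scraper hard-codes the search keyword 客服 (customer service) and the city 深圳 (Shenzhen) as percent-encoded UTF-8. As a minimal sketch, assuming the URL layout stays the same for other searches, a small helper can rebuild it from plain strings. The name `build_search_url` is hypothetical and not part of the original script.

```python
from urllib.parse import quote, urlencode


def build_search_url(keyword, city):
    """Build a Lagou job-list URL (hypothetical helper, not in the original script)."""
    # The keyword is part of the path, so it is percent-encoded with quote().
    path = "https://www.lagou.com/jobs/list_" + quote(keyword)
    # The remaining parameters mirror the query string of the hard-coded URL.
    query = urlencode({
        "city": city,
        "cl": "false",
        "fromSearch": "true",
        "labelWords": "",
        "suginput": "",
    })
    return path + "?" + query


url = build_search_url("客服", "深圳")
print(url)
```

Passing a different keyword or city should give the matching listing URL, provided the site still uses this URL scheme.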


from selenium import webdriver
from lxml import etree
import re
import time
from selenium.webdriver.common.by import By
import csv
import requests
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pymysql


class LagouSpider(object):

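    def __init__(self):
        self.driver_path = r'D:\cd\chromedriver.exe'
        # Selenium 3 style; Selenium 4.10+ removed executable_path and expects
        # webdriver.Chrome(service=Service(self.driver_path)) instead
        self.driver = webdriver.Chrome(executable_path=self.driver_path)
        # Search results for "客服" (customer service) jobs in 深圳 (Shenzhen)
        self.url = 'https://www.lagou.com/jobs/list_%E5%AE%A2%E6%9C%8D?city=%E6%B7%B1%E5%9C%B3&cl=false&fromSearch=true&labelWords=&suginput='
        self.positions = []
        self.stauts = 0  # 0 until the CSV header has been written
        self.cursor = None
        self.db = None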

    def run(self):
        # Connect to the database (PyMySQL 1.0+ requires keyword arguments)
        self.db = pymysql.connect(host="127.0.0.1", user="root", password="111111", database="kedou")
        self.cursor = self.db.cursor(pymysql.cursors.DictCursor)
        self.driver.get(self.url)
        while True:
            WebDriverWait(driver=self.driver, timeout=10).until(
                EC.presence_of_element_located((By.XPATH, "//div[@class='pager_container']/span[last()]"))
            )
            # print(self.driver.page_source)
            source = self.driver.page_source
            self.get_company(source)
            #self.page_list_page(source)
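            try:
                # Find the "next page" button
                next_btn = self.driver.find_element(By.XPATH, "//div[@class='pager_container']/span[last()]")
                if "pager_next_disabled" in next_btn.get_attribute("class"):
                    break
                else:
                    next_btn.click()
            except Exception:
                # Dump the page source to help debug a failed page turn
                print(source)
            time.sleep(10)
        # Release the database connection and the browser once paging ends
        self.cursor.close()
        self.db.close()
        self.driver.quit()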

    def get_company(self, source):
        # Extract every company name on the listing page and store it
        html = etree.HTML(source)
        company_names = html.xpath("//div[@class='company_name']/a/text()")
        # Parameterized query: the driver escapes quotes inside company names
        sql = "insert into kedou (`kedou`) values (%s)"
        for company_name in company_names:
            ok = self.cursor.execute(sql, (company_name,))
            self.db.commit()
            print(company_name)
            print(ok)

    def page_list_page(self, source):
        html = etree.HTML(source)
        links = html.xpath("//div[@class='p_top']//a/@href")
        for link in links:
            self.request_detail_page(link)
            time.sleep(3)

    def request_detail_page(self, url):
        # Open the detail page in a new tab instead of navigating away
        # self.driver.get(url)
        self.driver.execute_script("window.open('%s')" % url)
        self.driver.switch_to.window(self.driver.window_handles[1])
        WebDriverWait(driver=self.driver, timeout=10).until(
            # Do not append /text() here: unlike lxml, a Selenium locator
            # must resolve to an element node, not a text node
            EC.presence_of_element_located((By.XPATH, "//div[@class='job-name']/span[@class='name']"))
        )
        source = self.driver.page_source
        self.parse_detail_page(source)
        # Close the detail tab so that at most two tabs are open
        self.driver.close()
        # Switch back to the listing page
        self.driver.switch_to.window(self.driver.window_handles[0])

    def parse_detail_page(self, source):
        html = etree.HTML(source)
        position_name = html.xpath("//div[@class='job-name']//span[@class='name']/text()")[0]
        # print(position_name)
        job_request_spans = html.xpath("//dd[@class='job_request']//span")
        salary = job_request_spans[0].xpath('.//text()')[0].strip()
        # print(salary)
        city = job_request_spans[1].xpath('.//text()')[0].strip()
        city = re.sub(r"[\s/]", "", city)
        # print(city)
        work_years = job_request_spans[2].xpath('.//text()')[0].strip()
        work_years = re.sub(r"[\s/]", "", work_years)
        # print(work_years)
        education = job_request_spans[3].xpath('.//text()')[0].strip()
        education = re.sub(r"[\s/]", "", education)
        # print(education)
        desc = "".join(html.xpath("//dd[@class='job_bt']//text()")).strip()
        # print(desc)
        company_name = html.xpath("//h2[@class='fl']/em/text()")[0].strip()
        position = {
            'name': position_name,
            'salary': salary,
            'city': city,
            'work_years': work_years,
            'education': education,
            'desc': desc,
            'company_name': company_name
        }
        self.positions.append(position)
        print("*" * 40)
        # Write to CSV: the first row creates the file with a header,
        # later rows are appended
        print(position)
        if self.stauts == 0:
            self.stauts = 1
            self.save_csv(position)
        else:
            print('appending row')
            self.save_csv1(position)

    def save_csv(self, data):
        headers = ['name', 'salary', 'city', 'work_years', 'education', 'desc', 'company_name']
        values = []
        values.append(data)
        with open('job.csv', 'w', encoding='utf-8', newline='') as fp:
            writer = csv.DictWriter(fp, headers)
            writer.writeheader()
            writer.writerows(values)

    def save_csv1(self, data):
        headers = ['name', 'salary', 'city', 'work_years', 'education', 'desc', 'company_name']
        values = []
        values.append(data)
        with open('job.csv', 'a', encoding='utf-8', newline='') as fp:
            writer = csv.DictWriter(fp, headers)
            writer.writerows(values)

    def read_csv(self, path='job.csv'):
        with open(path, 'r', encoding='utf-8') as fp:
            readers = csv.DictReader(fp)
            for reader in readers:
                print(reader)
                print(reader['name'])
                print(reader['desc'])


if __name__ == "__main__":
    spider = LagouSpider()
    spider.run()
    # print(spider.positions)

 
