Scraping the ITJuzi (IT桔子) site with Python

This post walks through a Python crawler that fetches company listings from the ITJuzi website and stores them in a MongoDB database. The crawler uses the requests library for HTTP requests, pymongo to talk to MongoDB, and fake_useragent to pick a random User-Agent, which improves the success rate of the crawl.



import requests
import pymongo
import time
import json
import numpy as np
import pandas as pd
from fake_useragent import UserAgent
import socket  # used to catch low-level timeouts for retry

# random User-Agent generator
ua = UserAgent()

client = pymongo.MongoClient('localhost', 27017)
# select the database and collection
db = client.ITJUZI
mongodb_collection_company = db.itjuzi_company


class ITJUZI(object):
    def __init__(self):
        self.headers = {
            'User-Agent': ua.random,
            'X-Requested-With': 'XMLHttpRequest',
            # session cookie taken from the site's homepage
            'Cookie': '76b20f6015442399597225100e18094750f575673b045da6ec0b77984422f6',
        }
        self.url = 'https://www.itjuzi.com/api/companys'  # company
        self.session = requests.Session()

    def get_table(self, page):
        # The page number goes into the request payload; the other fields
        # mirror the site's own filter parameters.
        company_payload = {"page": page, "pagetotal": 121292, "total": 0, "per_page": 30,
                           "scope": "", "sub_scope": "", "round": "", "location": "",
                           "prov": "", "city": "", "status": "", "sort": "",
                           "selected": ""}
        retrytimes = 3
        while retrytimes:
            try:
                response = self.session.get(
                    self.url, params=company_payload, headers=self.headers,
                    timeout=(5, 20)).json()
                print(response)
                self.save_to_mongo(response)
                break
            except (socket.timeout, requests.exceptions.Timeout):
                # requests wraps socket timeouts in its own Timeout exception,
                # so catch both before retrying
                print('Page {}: request timed out, {} attempts left'.format(page, retrytimes))
                retrytimes -= 1

    def save_to_mongo(self, response):
        try:
            data = response['data']['data']
            df = pd.DataFrame(data)
            # Convert each DataFrame row into a MongoDB document
            table = list(json.loads(df.T.to_json()).values())
            if mongodb_collection_company.insert_many(table):  # company
                print('Saved to MongoDB')
                # Pause 3-6 seconds between requests to avoid being blocked
                sleep = np.random.randint(3, 7)
                time.sleep(sleep)
        except Exception as e:
            print('Failed to save to MongoDB:', e)

    def spider_itjuzi(self, start_page, end_page):
        for page in range(start_page, end_page):
            print('Downloading page %s:' % page)
            self.get_table(page)

        print('Download finished')


if __name__ == '__main__':
    spider = ITJUZI()
    spider.spider_itjuzi(398, 4045)
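The row-to-document conversion in save_to_mongo (transpose, serialize to JSON, parse back, take the row dicts) is equivalent to pandas' built-in `DataFrame.to_dict("records")`. A minimal sketch with made-up sample data (the field names here are illustrative, not the real API schema):

```python
import json
import pandas as pd

data = [{"name": "ExampleCo", "round": "A"},
        {"name": "DemoInc", "round": "Seed"}]
df = pd.DataFrame(data)

# The blog's conversion: transpose, round-trip through JSON, keep row dicts
table = list(json.loads(df.T.to_json()).values())

# Direct pandas equivalent, one call and no JSON round-trip
records = df.to_dict("records")
```

Both produce the same list of per-row dicts, so `to_dict("records")` is the simpler choice when the data is already JSON-safe.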

  完整代码下载:https://github.com/tanjunchen/SpiderProject/tree/master/ITOrange
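The fixed three-attempt retry loop in get_table can be generalized to exponential backoff, which spaces retries out instead of hammering the server. A minimal sketch; `fetch_with_retry` and `flaky` are illustrative names, not part of the original project:

```python
import time


def fetch_with_retry(fetch, retries=3, backoff=1.0):
    """Call fetch(), retrying on timeout with exponentially growing waits."""
    for attempt in range(retries):
        try:
            return fetch()
        except TimeoutError:
            if attempt == retries - 1:
                raise  # out of retries, surface the error
            time.sleep(backoff * (2 ** attempt))


# Demo: a callable that times out twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"

result = fetch_with_retry(flaky, retries=3, backoff=0.01)
```

Wrapping the real `session.get(...)` call in such a helper would let get_table share one retry policy across all endpoints instead of duplicating the while loop.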
