Mozilla Location Service-10

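This article shows how Celery provides task scheduling and distributed processing in a Python application, covering basic Celery configuration, task definitions, periodic task setup, and troubleshooting of common problems.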


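The previous post ended on the question of syncing the buffer database to MySQL. It turns out that no separate program or command is needed to run that sync. The nice part is that once Celery is configured and the corresponding services are running, this work is done automatically.
A partially translated Chinese version of the Celery documentation (the all-English one is hard going for me, with so much jargon):
http://docs.jinkan.org/docs/celery/
Its introduction reads:
Celery - Distributed Task Queue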

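Celery is a simple, flexible, and reliable distributed system for processing large volumes of messages, and it provides the tools needed to maintain such a system.

It is a task queue focused on real-time processing that also supports task scheduling.

Celery has a large and diverse community of users and contributors; you can join us on IRC or the mailing list.

Celery is open source and licensed under the BSD License.
File layout in this project:

ichnaea.async (note: this is not the copy under the egg; I do not know why the egg's files are not the ones being run)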
- app.py
- config.py
- settings.py
- task.py

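It looks a lot like the contents of the webapp folder.
A closer read of the documentation confirms that many parts are indeed the same.
Start a worker with the following command: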

ICHNAEA_CFG=location.ini bin/celery -A ichnaea.async.app:celery_app worker \
    -Ofair --no-execv --without-mingle --without-gossip

The meaning of each option is explained in detail in the documentation.

start...
redis_uri is: redis://localhost:6379/0

 -------------- celery@sa-VirtualBox v3.1.23 (Cipater)
---- **** ----- 
--- * ***  * -- Linux-4.4.0-31-generic-x86_64-with-Ubuntu-16.04-xenial
-- * - **** --- 
- ** ---------- [config]
- ** ---------- .> app:         ichnaea.async.app:0x7fdce337aa10
- ** ---------- .> transport:   redis://localhost:6379/0
- ** ---------- .> results:     redis://localhost:6379/0
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- 
--- ***** ----- [queues]
 -------------- .> celery_blue      exchange=celery(direct) key=celery_blue
                .> celery_cell      exchange=celery(direct) key=celery_cell
                .> celery_content   exchange=celery(direct) key=celery_content
                .> celery_default   exchange=celery(direct) key=celery_default
                .> celery_export    exchange=celery(direct) key=celery_export
                .> celery_incoming  exchange=celery(direct) key=celery_incoming
                .> celery_monitor   exchange=celery(direct) key=celery_monitor
                .> celery_ocid      exchange=celery(direct) key=celery_ocid
                .> celery_reports   exchange=celery(direct) key=celery_reports
                .> celery_wifi      exchange=celery(direct) key=celery_wifi

[2016-08-19 11:53:23,870: WARNING/MainProcess] celery@sa-VirtualBox ready.

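Once this output appears, the worker has started successfully.
However, nothing in this run configures or executes the "periodic actions" I had heard about.
The database still shows no activity at all.

Then I finally discovered that Celery has a handy component called beat: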
http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html
celery beat is a scheduler. It kicks off tasks at regular intervals, which are then executed by the worker nodes available in the cluster.

By default the entries are taken from the CELERYBEAT_SCHEDULE setting, but custom stores can also be used, like storing the entries in an SQL database.

You have to ensure only a single scheduler is running for a schedule at a time, otherwise you would end up with duplicate tasks. Using a centralized approach means the schedule does not have to be synchronized, and the service can operate without using locks.

Make the add task run every 30 seconds:

from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    # Executes tasks.add every 30 seconds
    'add-every-30-seconds': {
        'task': 'tasks.add',
        'schedule': timedelta(seconds=30),
        'args': (16, 16),
    },
}
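To make concrete what a scheduler does with an entry like this, here is a minimal sketch that uses only the standard library. It is not Celery's implementation: the due_entries helper and the last_run dictionary are hypothetical names introduced for illustration, and treating a never-run entry as immediately due is a simplifying assumption.

```python
from datetime import datetime, timedelta

# An entry in the same shape as a CELERYBEAT_SCHEDULE item.
# 'tasks.add' is only a name here; nothing is imported or executed.
SCHEDULE = {
    'add-every-30-seconds': {
        'task': 'tasks.add',
        'schedule': timedelta(seconds=30),
        'args': (16, 16),
    },
}


def due_entries(schedule, last_run, now):
    """Return the names of entries whose interval has elapsed since their last run."""
    due = []
    for name, entry in sorted(schedule.items()):
        previous = last_run.get(name)
        # Assumption: an entry that has never run is due right away.
        if previous is None or now - previous >= entry['schedule']:
            due.append(name)
    return due


start = datetime(2016, 8, 19, 14, 0, 0)
last_run = {'add-every-30-seconds': start}

# Ten seconds after the last run the entry is not due; after thirty it is.
print(due_entries(SCHEDULE, last_run, start + timedelta(seconds=10)))
print(due_entries(SCHEDULE, last_run, start + timedelta(seconds=30)))
```

The real beat process only decides when to send a task message; the worker is what actually runs the task, which is why both have to be running.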

So where are the tasks in our project, and where are their intervals configured?
/ProgFile/ichnaea-for-liuqiao/ichnaea/ichnaea/async/task.py:

        if enabled and cls._schedule:
            app.conf.CELERYBEAT_SCHEDULE.update(cls.beat_config())

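Following the pattern from the documentation example, a search of the code base turned up these lines.
beat_config looks like the one to follow up on: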

    @classmethod
    def beat_config(cls):
        """
        Returns the beat schedule for this task, taking into account
        the optional shard_model to create multiple schedule entries.
        """
        if cls._shard_model is None:
            return {cls.shortname(): {
                'task': cls.name,
                'schedule': cls._schedule,
            }}

        result = {}
        for shard_id in cls._shard_model.shards().keys():
            result[cls.shortname() + '_' + shard_id] = {
                'task': cls.name,
                'schedule': cls._schedule,
                'kwargs': {'shard_id': shard_id},
            }
        return result
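To see what this method returns in each branch, the sketch below copies its logic into a stand-alone class. FakeTask, ShardedTask, FakeShardModel, the task names, and the way shortname is derived are all hypothetical stand-ins, not the project's real classes; only the body of beat_config follows the code above.

```python
from datetime import timedelta


class FakeShardModel(object):
    """Hypothetical stand-in for a sharded model; only shards() is needed."""

    @classmethod
    def shards(cls):
        # Maps shard id -> model class; the values do not matter in this sketch.
        return {'0': None, '1': None}


class FakeTask(object):
    """Hypothetical task carrying the attributes that beat_config reads."""
    name = 'demo.update_incoming'
    _schedule = timedelta(seconds=32)
    _shard_model = None

    @classmethod
    def shortname(cls):
        # Assumption: the short name is the last dotted part of the task name.
        return cls.name.split('.')[-1]

    @classmethod
    def beat_config(cls):
        # Unsharded task: a single schedule entry.
        if cls._shard_model is None:
            return {cls.shortname(): {
                'task': cls.name,
                'schedule': cls._schedule,
            }}
        # Sharded task: one entry per shard, with the shard id passed as a kwarg.
        result = {}
        for shard_id in cls._shard_model.shards().keys():
            result[cls.shortname() + '_' + shard_id] = {
                'task': cls.name,
                'schedule': cls._schedule,
                'kwargs': {'shard_id': shard_id},
            }
        return result


class ShardedTask(FakeTask):
    name = 'demo.update_wifi'
    _shard_model = FakeShardModel


# Mirrors app.conf.CELERYBEAT_SCHEDULE.update(cls.beat_config()) with a plain dict.
beat_schedule = {}
beat_schedule.update(FakeTask.beat_config())
beat_schedule.update(ShardedTask.beat_config())
print(sorted(beat_schedule))
```

With two shards, the sharded task contributes two schedule entries that point at the same task name and differ only in their shard_id keyword argument.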

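Next, trace the task and the schedule.
It turns out the real tasks live under data/tasks. Every task uses the decorator to take this class as its base and also specifies its interval there, for example: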

@celery_app.task(base=BaseTask, bind=True, queue='celery_reports',
                 _countdown=2, expires=20, _schedule=timedelta(seconds=32))
def update_incoming(self):
    print 'update_incoming'
    export.IncomingQueue(self)(export_reports)

It is not yet clear to me what this task actually does.

Never mind that for now; let's see how to start beat.

Starting the Scheduler

To start the celery beat service:

$ celery -A proj beat
Here proj is ichnaea.async.app:celery_app; nothing works without this app.

You can also embed beat inside the worker by enabling the worker's -B option. This is convenient if you will never run more than one worker node, but it is not commonly used and for that reason is not recommended for production use:

$ celery -A proj worker -B    # this approach is not recommended

Beat needs to store the last run times of the tasks in a local database file (named celerybeat-schedule by default), so it needs access to write in the current directory, or alternatively you can specify a custom location for this file:

$ celery -A proj beat -s /home/celery/var/run/celerybeat-schedule    # this form raised an error for me
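The idea of keeping last run times in a local database file can be sketched with the standard library's shelve module, which is what Celery's default PersistentScheduler builds on. This is only an illustration of the idea, not beat's actual file format: record_run, last_run, and the demo file name are hypothetical, and a temporary directory is used so that nothing is written next to a real celerybeat-schedule file.

```python
import os
import shelve
import tempfile
from datetime import datetime


def record_run(db_path, task_name, when):
    """Store the time a task last ran in a shelve database file."""
    db = shelve.open(db_path)
    try:
        db[task_name] = when
    finally:
        db.close()


def last_run(db_path, task_name):
    """Return the stored last run time, or None if the task never ran."""
    db = shelve.open(db_path)
    try:
        return db.get(task_name)
    finally:
        db.close()


# The directory must exist and be writable, just like beat's working directory.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, 'celerybeat-schedule-demo')

record_run(db_path, 'update_incoming', datetime(2016, 8, 19, 14, 5, 37))
print(last_run(db_path, 'update_incoming'))
```

Opening the file fails when the target directory is missing or not writable. That is one possible cause of an error with a custom -s path, though this is a guess, since the error text is not shown here.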

Starting it the first way gives:

a@sa-VirtualBox:/ProgFile/ichnaea-for-liuqiao/ichnaea$ ICHNAEA_CFG=location.ini bin/celery -A ichnaea.async.app:celery_app beat
start...
redis_uri is: redis://localhost:6379/0
celery beat v3.1.23 (Cipater) is starting.
__    -    ... __   -        _
Configuration ->
    . broker -> redis://localhost:6379/0
    . loader -> celery.loaders.app.AppLoader
    . scheduler -> celery.beat.PersistentScheduler
    . db -> celerybeat-schedule
    . logfile -> [stderr]@%WARNING
    . maxinterval -> now (0s)

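That started successfully. As soon as beat is running, every task that has a schedule starts executing, as if on command.
At this point rows of yellow text appear on their own in the terminal where the worker was started, like: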
[2016-08-19 14:03:31,521: WARNING/Worker-1] **
I added a print at the start of each task myself to output the task's name:

[2016-08-19 14:05:37,375: WARNING/Worker-1] update_incoming
[2016-08-19 14:05:37,377: WARNING/Worker-1] query in session.py entities:
[2016-08-19 14:05:37,378: WARNING/Worker-1] (<class 'ichnaea.models.config.ExportConfig'>,)
[2016-08-19 14:05:37,380: WARNING/Worker-1] sqlalchemy.orm.query
[2016-08-19 14:05:40,971: WARNING/Worker-1] update_cellarea
[2016-08-19 14:05:49,969: WARNING/Worker-1] update_datamap
[2016-08-19 14:05:50,091: WARNING/Worker-1] update_datamap
[2016-08-19 14:05:50,142: WARNING/Worker-1] update_datamap
[2016-08-19 14:05:50,168: WARNING/Worker-1] update_datamap
[2016-08-19 14:05:52,997: WARNING/Worker-1] update_blue
[2016-08-19 14:05:53,021: WARNING/Worker-1] update_blue
[2016-08-19 14:05:53,058: WARNING/Worker-1] update_blue
[2016-08-19 14:05:53,118: WARNING/Worker-1] update_blue
[2016-08-19 14:05:53,162: WARNING/Worker-1] update_blue
[2016-08-19 14:05:53,205: WARNING/Worker-1] update_blue
[2016-08-19 14:05:53,233: WARNING/Worker-1] update_blue
[2016-08-19 14:05:53,248: WARNING/Worker-1] update_blue
[2016-08-19 14:05:53,252: WARNING/Worker-1] update_blue
[2016-08-19 14:05:53,275: WARNING/Worker-1] update_blue
[2016-08-19 14:06:09,111: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,136: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,175: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,202: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,239: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,256: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,294: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,307: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,330: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,336: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,355: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,369: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,373: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,392: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,413: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,436: WARNING/Worker-1] update_wifi
[2016-08-19 14:06:09,573: WARNING/Worker-1] update_incoming
[2016-08-19 14:06:09,575: WARNING/Worker-1] query in session.py entities:

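I expected that by this point the contents of Redis would make their way into MySQL, but no such luck.
All I found was:
1. The stat table now has some records: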

mysql> select * from stat;
+-----+------------+-------+
| key | time       | value |
+-----+------------+-------+
|   1 | 2016-08-18 |     0 |
|   1 | 2016-08-19 |     0 |
|   2 | 2016-08-18 |     0 |
|   2 | 2016-08-19 |     0 |
|   3 | 2016-08-18 |     0 |
|   3 | 2016-08-19 |     0 |
|   4 | 2016-08-18 |     0 |
|   4 | 2016-08-19 |     0 |
|   7 | 2016-08-18 |     0 |
|   7 | 2016-08-19 |     0 |
|   8 | 2016-08-18 |     0 |
|   8 | 2016-08-19 |     0 |
|   9 | 2016-08-18 |     0 |
|   9 | 2016-08-19 |     0 |
+-----+------------+-------+
14 rows in set (0.00 sec)

2. Annoyingly, the keys in Redis are now completely different from before:

127.0.0.1:6379> keys *
 1) "statcounter_unique_wifi_20160819"
 2) "statcounter_unique_wifi_20160818"
 3) "statcounter_unique_blue_20160819"
 4) "statcounter_blue_20160818"
 5) "statcounter_unique_cell_20160818"
 6) "statcounter_unique_cell_ocid_20160818"
 7) "statcounter_unique_cell_20160819"
 8) "statcounter_wifi_20160818"
 9) "statcounter_unique_blue_20160818"
10) "_kombu.binding.celeryev"
11) "_kombu.binding.celery.pidbox"
12) "statcounter_blue_20160819"
13) "statcounter_unique_cell_ocid_20160819"
14) "_kombu.binding.celery"
15) "statcounter_cell_20160818"
16) "statcounter_wifi_20160819"
17) "statcounter_cell_20160819"
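The statcounter key names appear to follow a statcounter_<name>_<YYYYMMDD> pattern, which suggests per-day counters. That reading is inferred from the listing alone, not from the project's code. Under that assumption, a small sketch to group a few of the keys by day could look like this (split_key and KEYS are names made up for the example):

```python
from collections import defaultdict
from datetime import datetime

# A few keys copied from the listing above, including one unrelated kombu key.
KEYS = [
    'statcounter_unique_wifi_20160819',
    'statcounter_blue_20160818',
    'statcounter_unique_cell_ocid_20160818',
    '_kombu.binding.celery',
    'statcounter_cell_20160819',
]


def split_key(key):
    """Split 'statcounter_<name>_<YYYYMMDD>' into (name, date), else None."""
    prefix = 'statcounter_'
    if not key.startswith(prefix):
        return None
    # The date is the part after the last underscore; the name may contain underscores.
    name, _, day = key[len(prefix):].rpartition('_')
    return name, datetime.strptime(day, '%Y%m%d').date()


by_day = defaultdict(list)
for key in KEYS:
    parsed = split_key(key)
    if parsed is not None:
        name, day = parsed
        by_day[day].append(name)

for day in sorted(by_day):
    print(day, sorted(by_day[day]))
```

Seen this way, the seventeen keys are fourteen counters (seven names for each of the two days) plus three _kombu.binding entries that belong to the Celery broker transport.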

Also, every so often the terminal running the worker reports errors that I cannot make sense of.
