Python library `clickhouse-driver` error: "clickhouse-cityhash is required to use compression"


Problem description

When using the Python library clickhouse-driver, the following error is raised: clickhouse-cityhash is required to use compression

Cause analysis

When compressed communication is enabled, the underlying library (here, clickhouse-driver) tries to load the clickhouse-cityhash package to compute internal hash checksums.

clickhouse-cityhash is a Python package (a binding library) that implements the CityHash hash functions; ClickHouse clients and drivers use it to compute hashes or checksums when compression or hashing features are enabled. The error means your program needs this package to support compression, but it is not installed in the current environment.

To enable compression, first install the package: python3 -m pip install clickhouse-cityhash
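For reference, a minimal sketch of how compression is typically enabled with clickhouse-driver (the host is a placeholder; the `compression` argument is what pulls in clickhouse-cityhash). The import is wrapped so the sketch degrades gracefully when the driver itself is absent:

```python
try:
    from clickhouse_driver import Client
except ImportError:  # clickhouse-driver not installed in this environment
    Client = None

def compressed_client(host="localhost"):
    """Build a client with compressed transport enabled.

    Passing compression='lz4' (or True, 'zstd', ...) makes the driver
    compress blocks on the wire; that code path is what requires the
    clickhouse-cityhash package for block checksums.
    """
    if Client is None:
        raise RuntimeError("clickhouse-driver is not installed")
    return Client(host=host, compression="lz4")
```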

CityHash is a family of high-performance non-cryptographic hash functions from Google, suited to fast hashing of both short and long strings.


pip install of clickhouse-cityhash fails

  • Running Python 3.13
  • The build fails with multiple errors like the following:
error: no member named 'curexc_type' in '_ts'
error: no member named 'ob_digit' in '_longobject'
error: no matching function for call to '_PyLong_AsByteArray'

These errors occur because the package's C/C++ extension code is incompatible with the C API of Python 3.13+.

💡 Fix: downgrade to Python <= 3.12

Personally tested and working.
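The version constraint above can be encoded as a small guard in your own tooling. This is a sketch: the 3.12 cutoff reflects the build failures described here and may change once the package ships releases compatible with Python 3.13+.

```python
import sys

def cityhash_build_supported(version_info=None):
    """Return True if clickhouse-cityhash's C extension is expected to build.

    The released source distributions fail to compile against the CPython
    3.13 C API (the curexc_type / _PyLong_AsByteArray errors above), so
    anything newer than 3.12 is rejected.
    """
    vi = sys.version_info if version_info is None else version_info
    return (vi[0], vi[1]) <= (3, 12)
```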
