OceanBase Cluster Operations and Maintenance Manual
Overview
This document targets OceanBase 4.2.x Enterprise Edition and provides a comprehensive guide to cluster operations, covering monitoring, parameter tuning, backup and recovery, fault handling, tenant management, performance optimization, security auditing, and emergency response plans.
1. Monitoring Cluster Status
1.1 Monitoring Dimensions
1.1.1 Cluster-Level Monitoring
-- Check cluster server list
SELECT * FROM oceanbase.__all_server;
-- Check zone status
SELECT zone, status, region FROM oceanbase.__all_zone;
-- Check cluster parameters
SHOW PARAMETERS LIKE '%cluster%';
-- Check cluster resource usage
SELECT
svr_ip,
svr_port,
cpu_capacity,
cpu_assigned,
mem_capacity,
mem_assigned,
disk_total,
disk_assigned
FROM oceanbase.__all_virtual_server_stat;
1.1.2 Node-Level Monitoring
-- OBServer status monitoring
SELECT
svr_ip,
svr_port,
zone,
status,
start_service_time,
stop_time
FROM oceanbase.DBA_OB_SERVERS;
-- Node resource usage
SELECT
svr_ip,
cpu_total,
cpu_assigned,
cpu_max_assigned,
mem_total,
mem_assigned,
mem_max_assigned,
disk_total,
disk_assigned
FROM oceanbase.GV$OB_SERVERS;
-- Per-node session counts
SELECT
svr_ip,
COUNT(CASE WHEN state = 'ACTIVE' THEN 1 END) AS active_session_num,
COUNT(*) AS all_session_num
FROM oceanbase.__all_virtual_processlist
GROUP BY svr_ip;
1.1.3 Tenant-Level Monitoring
-- Tenant resource usage overview
SELECT
tenant_id,
tenant_name,
primary_zone,
locality,
resource_pool_count
FROM oceanbase.DBA_OB_TENANTS;
-- Tenant resource pool monitoring
SELECT
name,
unit_count,
unit_config_id,
zone_list
FROM oceanbase.DBA_OB_RESOURCE_POOLS;
-- Tenant unit configurations
SELECT
name,
max_cpu,
min_cpu,
max_memory,
min_memory,
max_iops,
min_iops,
max_disk_size
FROM oceanbase.DBA_OB_UNIT_CONFIGS;
1.2 Performance Metrics
1.2.1 Key Performance Indicators
-- TPS/QPS monitoring
SELECT
tenant_id,
svr_ip,
tps,
qps,
rt
FROM oceanbase.GV$SQL_AUDIT_STAT
WHERE request_time > DATE_SUB(NOW(), INTERVAL 5 MINUTE);
-- SQL response-time distribution
SELECT
tenant_id,
CASE
WHEN elapsed_time < 1000 THEN '<1ms'
WHEN elapsed_time < 10000 THEN '1-10ms'
WHEN elapsed_time < 100000 THEN '10-100ms'
WHEN elapsed_time < 1000000 THEN '100ms-1s'
ELSE '>1s'
END AS rt_range,
COUNT(*) AS count
FROM oceanbase.GV$SQL_AUDIT
WHERE request_time > DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY tenant_id, rt_range;
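The CASE expression above buckets `elapsed_time` (in microseconds) into latency ranges. The same bucketing can be sketched in Python for post-processing exported audit rows; the thresholds mirror the SQL, while the function names are ours:

```python
from collections import Counter

def rt_bucket(elapsed_time_us: int) -> str:
    """Map an elapsed time in microseconds to the latency bucket
    used by the SQL above."""
    if elapsed_time_us < 1_000:
        return '<1ms'
    if elapsed_time_us < 10_000:
        return '1-10ms'
    if elapsed_time_us < 100_000:
        return '10-100ms'
    if elapsed_time_us < 1_000_000:
        return '100ms-1s'
    return '>1s'

def bucket_counts(rows):
    """Aggregate (tenant_id, elapsed_time_us) rows into bucket counts."""
    return Counter((tenant_id, rt_bucket(us)) for tenant_id, us in rows)
```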
-- Slow query monitoring
SELECT
query_sql,
elapsed_time,
execute_time,
queue_time,
get_plan_time
FROM oceanbase.GV$SQL_AUDIT
WHERE elapsed_time > 100000 -- longer than 100ms (elapsed_time is in microseconds)
ORDER BY elapsed_time DESC
LIMIT 20;
1.2.2 Storage Monitoring
-- SSTable monitoring
SELECT
tenant_id,
tablet_id,
sstable_count,
sstable_size,
row_count
FROM oceanbase.__all_virtual_tablet_sstable_macro_info;
-- Compaction (minor/major freeze) monitoring
SELECT
tenant_id,
frozen_scn,
frozen_time,
status
FROM oceanbase.CDB_OB_MAJOR_COMPACTION;
-- Disk usage monitoring
SELECT
svr_ip,
total_size,
used_size,
free_size,
used_percent
FROM oceanbase.__all_virtual_disk_stat;
1.3 Monitoring Tool Integration
1.3.1 OCP Monitoring Configuration
# OCP monitoring configuration
ocp_monitor:
  # Collection interval
  collect_interval: 60s
  # Alert rules
  alert_rules:
    - name: high_cpu_usage
      metric: cpu_usage_percent
      threshold: 80
      duration: 5m
      severity: warning
    - name: memory_shortage
      metric: memory_available
      threshold: 10GB
      duration: 5m
      severity: critical
    - name: disk_full
      metric: disk_usage_percent
      threshold: 90
      duration: 5m
      severity: critical
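Each alert rule above fires only when the metric stays beyond its threshold for the whole `duration`. A minimal sketch of that threshold-plus-duration evaluation (the rule names come from the YAML; treating the duration as a count of consecutive samples is our simplifying assumption):

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    threshold: float
    duration_samples: int  # consecutive samples over threshold before firing
    severity: str

def evaluate(rule: AlertRule, samples: list) -> bool:
    """Fire only if the last `duration_samples` readings all exceed
    the threshold, mirroring the threshold + duration semantics above."""
    if len(samples) < rule.duration_samples:
        return False
    return all(s > rule.threshold for s in samples[-rule.duration_samples:])
```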
1.3.2 Prometheus Integration
# prometheus configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'oceanbase'
    static_configs:
      - targets: ['ob1:2883', 'ob2:2883', 'ob3:2883']
    metrics_path: '/metrics'
    params:
      format: ['prometheus']
1.3.3 Grafana Dashboard Configuration
{
"dashboard": {
"title": "OceanBase Cluster Monitor",
"panels": [
{
"title": "QPS/TPS",
"targets": [
{
"expr": "rate(oceanbase_sql_qps[5m])"
}
]
},
{
"title": "Response Time",
"targets": [
{
"expr": "histogram_quantile(0.99, oceanbase_sql_rt)"
}
]
}
]
}
}
2. Adjusting System Parameters
2.1 Parameter Classification and Management
2.1.1 Cluster-Level Parameters
-- View all cluster parameters
SHOW PARAMETERS;
-- View a specific parameter
SHOW PARAMETERS LIKE 'enable_sql_audit';
-- Modify cluster parameters
ALTER SYSTEM SET enable_sql_audit = TRUE;
ALTER SYSTEM SET syslog_level = 'INFO';
ALTER SYSTEM SET max_syslog_file_count = 100;
-- Key cluster parameter settings
ALTER SYSTEM SET memory_limit = '100G';
ALTER SYSTEM SET memory_limit_percentage = 80;
ALTER SYSTEM SET system_memory = '30G';
ALTER SYSTEM SET cpu_count = 64;
2.1.2 Tenant-Level Parameters
-- View tenant parameters
SHOW PARAMETERS LIKE '%tenant%' TENANT = tenant_name;
-- Modify tenant parameters
ALTER SYSTEM SET ob_query_timeout = 10000000 TENANT = tenant_name;
ALTER SYSTEM SET ob_trx_timeout = 100000000 TENANT = tenant_name;
ALTER SYSTEM SET ob_sql_work_area_percentage = 30 TENANT = tenant_name;
-- Set tenant variables
ALTER TENANT tenant_name SET VARIABLES
ob_query_timeout = 10000000,
ob_trx_timeout = 100000000,
ob_read_consistency = 'STRONG';
2.2 Performance Tuning Parameters
2.2.1 SQL Tuning Parameters
-- Optimizer parameters
ALTER SYSTEM SET optimizer_use_sql_plan_baselines = TRUE;
ALTER SYSTEM SET optimizer_capture_sql_plan_baselines = TRUE;
ALTER SYSTEM SET enable_sql_audit = TRUE;
-- Parallel execution parameters
ALTER SYSTEM SET parallel_servers_target = 64;
ALTER SYSTEM SET parallel_max_servers = 128;
-- Query cache parameters
ALTER SYSTEM SET query_cache_type = 1;
ALTER SYSTEM SET query_cache_size = '2G';
2.2.2 Storage Tuning Parameters
-- Memory management parameters
ALTER SYSTEM SET memstore_limit_percentage = 50;
ALTER SYSTEM SET freeze_trigger_percentage = 60;
-- Compaction parameters
ALTER SYSTEM SET major_freeze_duty_time = '02:00';
ALTER SYSTEM SET minor_freeze_times = 5;
ALTER SYSTEM SET merge_thread_count = 20;
-- Cache parameters
ALTER SYSTEM SET cache_wash_threshold = '4G';
ALTER SYSTEM SET user_block_cache_priority = 5;
ALTER SYSTEM SET index_block_cache_priority = 10;
2.3 Parameter Tuning Best Practices
2.3.1 Memory Tuning Templates
-- Memory settings for OLTP workloads
ALTER SYSTEM SET memory_limit_percentage = 80;
ALTER SYSTEM SET memstore_limit_percentage = 50;
ALTER SYSTEM SET freeze_trigger_percentage = 60;
ALTER SYSTEM SET cache_wash_threshold = '4G';
-- Memory settings for OLAP workloads
ALTER SYSTEM SET memory_limit_percentage = 85;
ALTER SYSTEM SET memstore_limit_percentage = 30;
ALTER SYSTEM SET ob_sql_work_area_percentage = 40;
ALTER SYSTEM SET cache_wash_threshold = '8G';
2.3.2 Concurrency Control Parameters
-- High-concurrency settings
ALTER SYSTEM SET max_parallel_degree = 64;
ALTER SYSTEM SET parallel_servers_target = 128;
ALTER SYSTEM SET ob_query_timeout = 30000000;
ALTER SYSTEM SET ob_trx_timeout = 100000000;
-- Transaction isolation settings
ALTER SYSTEM SET tx_isolation = 'READ-COMMITTED';
ALTER SYSTEM SET ob_read_consistency = 'WEAK';
3. Backup and Recovery
3.1 Backup Strategy Design
3.1.1 Full Backup
-- Create a backup destination
ALTER SYSTEM ADD BACKUP DESTINATION 'oss://backup-bucket/ob-backup';
-- Set backup parameters
ALTER SYSTEM SET backup_dest = 'oss://backup-bucket/ob-backup';
ALTER SYSTEM SET backup_data_file_size = '4G';
ALTER SYSTEM SET backup_log_archive_option = 'MANDATORY';
-- Start a full backup
ALTER SYSTEM BACKUP DATABASE;
-- Back up a specific tenant
ALTER SYSTEM BACKUP TENANT tenant_name TO 'oss://backup-bucket/tenant-backup';
3.1.2 Incremental Backup
-- Enable log archiving
ALTER SYSTEM ARCHIVELOG;
-- Configure the archive destination
ALTER SYSTEM SET log_archive_dest = 'oss://backup-bucket/archive';
ALTER SYSTEM SET log_archive_dest_state = 'ENABLE';
-- Start log archiving
ALTER SYSTEM SET log_archive_config = 'ENABLE';
-- Incremental backup task
ALTER SYSTEM BACKUP INCREMENTAL DATABASE;
3.2 Automated Backup Configuration
3.2.1 Scheduled Backup Jobs
-- Create a backup schedule job
BEGIN
DBMS_SCHEDULER.CREATE_JOB(
job_name => 'daily_backup_job',
job_type => 'PLSQL_BLOCK',
job_action => 'BEGIN ALTER SYSTEM BACKUP DATABASE; END;',
start_date => SYSTIMESTAMP,
repeat_interval => 'FREQ=DAILY; BYHOUR=2; BYMINUTE=0',
enabled => TRUE
);
END;
-- Create a cleanup job
BEGIN
DBMS_SCHEDULER.CREATE_JOB(
job_name => 'backup_cleanup_job',
job_type => 'PLSQL_BLOCK',
job_action => 'BEGIN ALTER SYSTEM DELETE OBSOLETE BACKUP; END;',
start_date => SYSTIMESTAMP,
repeat_interval => 'FREQ=WEEKLY; BYDAY=SUN; BYHOUR=3',
enabled => TRUE
);
END;
3.2.2 Backup Verification Script
#!/bin/bash
# backup_verify.sh
# Configuration
BACKUP_PATH="oss://backup-bucket/ob-backup"
LOG_FILE="/var/log/ob_backup_verify.log"
# Verify backup integrity
function verify_backup() {
echo "[$(date)] Starting backup verification..." >> $LOG_FILE
# Check backup sets
obclient -h127.0.0.1 -P2883 -uroot@sys -p -e "
SELECT
backup_set_id,
backup_type,
status,
start_time,
end_time
FROM oceanbase.CDB_OB_BACKUP_SET_DETAILS
WHERE status = 'COMPLETED'
ORDER BY start_time DESC
LIMIT 1;
" >> $LOG_FILE
# Verify backup data
ob_admin verify_backup \
--backup_path=$BACKUP_PATH \
--check_data=true \
--check_log=true >> $LOG_FILE 2>&1
if [ $? -eq 0 ]; then
echo "[$(date)] Backup verification successful" >> $LOG_FILE
else
echo "[$(date)] Backup verification failed" >> $LOG_FILE
# Send an alert
send_alert "Backup verification failed"
fi
}
# Run the verification
verify_backup
3.3 Recovery Operations
3.3.1 Physical Recovery
-- List available backup sets
SELECT
backup_set_id,
backup_type,
start_time,
end_time,
status
FROM oceanbase.CDB_OB_BACKUP_SET_DETAILS
WHERE status = 'COMPLETED';
-- Restore to a point in time
ALTER SYSTEM RESTORE TENANT tenant_name
FROM 'oss://backup-bucket/ob-backup'
UNTIL TIME '2024-01-01 12:00:00';
-- Restore a specific backup set
ALTER SYSTEM RESTORE TENANT tenant_name
FROM 'oss://backup-bucket/ob-backup'
BACKUP SET 1;
3.3.2 Logical Recovery
# Export data with ob_loader
ob_loader -h127.0.0.1 -P2883 -uroot@tenant -p \
--mode=export \
--table='*' \
--file-path=/backup/logical_backup \
--thread=16
# Import data with ob_loader
ob_loader -h127.0.0.1 -P2883 -uroot@tenant -p \
--mode=import \
--table='*' \
--file-path=/backup/logical_backup \
--thread=16 \
--replace
3.4 Backup and Recovery Monitoring
3.4.1 Backup Job Monitoring
-- Monitor backup progress
SELECT
job_id,
tenant_id,
job_type,
job_status,
progress,
start_time,
estimated_end_time
FROM oceanbase.DBA_OB_BACKUP_JOBS
WHERE job_status = 'DOING';
-- Query backup history
SELECT
backup_set_id,
tenant_id,
backup_type,
encryption_mode,
start_time,
end_time,
input_bytes,
output_bytes,
compression_ratio
FROM oceanbase.CDB_OB_BACKUP_SET_DETAILS
ORDER BY start_time DESC;
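The history query exposes `input_bytes`, `output_bytes`, and the job's start/end times, from which compression ratio and write throughput can be derived. A small sketch of that calculation; interpreting `input_bytes` as the raw data size and `output_bytes` as the size written to the backup destination is our assumption:

```python
def backup_metrics(input_bytes: int, output_bytes: int,
                   elapsed_seconds: float) -> dict:
    """Derive compression ratio and write throughput from a backup-set row.
    Assumes input_bytes is the raw data size and output_bytes the size
    actually written to the backup destination."""
    if output_bytes <= 0 or elapsed_seconds <= 0:
        raise ValueError("output_bytes and elapsed_seconds must be positive")
    return {
        'compression_ratio': round(input_bytes / output_bytes, 2),
        'write_mb_per_s': round(output_bytes / elapsed_seconds / 1024 / 1024, 2),
    }
```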
4. Handling Node Failures
4.1 Failure Detection
4.1.1 Node Status Monitoring
-- Real-time node status check
SELECT
svr_ip,
svr_port,
zone,
status,
stop_time,
start_service_time,
CASE
WHEN status = 'ACTIVE' THEN 'NORMAL'
WHEN status = 'INACTIVE' THEN 'WARNING'
ELSE 'CRITICAL'
END AS health_status
FROM oceanbase.DBA_OB_SERVERS;
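The CASE expression above maps each server's status to a health level. The same mapping can be kept in a small helper when the check runs from a script rather than SQL (function names are ours):

```python
def health_status(status: str) -> str:
    """Mirror the SQL CASE: ACTIVE -> NORMAL, INACTIVE -> WARNING,
    anything else -> CRITICAL."""
    return {'ACTIVE': 'NORMAL', 'INACTIVE': 'WARNING'}.get(status, 'CRITICAL')

def unhealthy(servers):
    """Return (ip, health) pairs for every server that is not NORMAL."""
    return [(ip, health_status(st)) for ip, st in servers
            if health_status(st) != 'NORMAL']
```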
-- Node heartbeat check
SELECT
svr_ip,
svr_port,
last_hb_time,
TIMESTAMPDIFF(SECOND, last_hb_time, NOW()) AS hb_delay_seconds
FROM oceanbase.__all_server_event_history
WHERE event_type = 'HEARTBEAT';
4.1.2 Automated Failure Detection Script
#!/bin/bash
# node_health_check.sh
# Configuration
ALERT_THRESHOLD=30 # heartbeat timeout threshold (seconds)
CHECK_INTERVAL=10 # check interval (seconds)
function check_node_health() {
# Fetch the status of all nodes
nodes=$(mysql -h127.0.0.1 -P2883 -uroot@sys -p -e "
SELECT svr_ip, svr_port, status
FROM oceanbase.DBA_OB_SERVERS;
" | tail -n +2)
while IFS=$'\t' read -r ip port status; do
if [ "$status" != "ACTIVE" ]; then
echo "Alert: Node $ip:$port is $status"
handle_node_failure $ip $port $status
fi
done <<< "$nodes"
}
function handle_node_failure() {
local ip=$1
local port=$2
local status=$3
case $status in
"INACTIVE")
echo "Node $ip:$port is inactive, attempting recovery..."
# Try to restart the node
ssh $ip "cd /home/admin/oceanbase && ./bin/observer -r $ip:$port"
;;
"STOPPED")
echo "Node $ip:$port is stopped, initiating failover..."
# Perform failover
initiate_failover $ip $port
;;
esac
}
# Main loop
while true; do
check_node_health
sleep $CHECK_INTERVAL
done
4.2 Failure Handling Procedures
4.2.1 Single-Node Failure
-- 1. Check the failed node's status
SELECT * FROM oceanbase.DBA_OB_SERVERS
WHERE svr_ip = '192.168.1.101';
-- 2. Stop the failed node (if it is still running)
ALTER SYSTEM STOP SERVER '192.168.1.101:2882';
-- 3. Remove the failed node from the cluster
ALTER SYSTEM DELETE SERVER '192.168.1.101:2882';
-- 4. Add a replacement node
ALTER SYSTEM ADD SERVER '192.168.1.104:2882' ZONE 'zone1';
-- 5. Start the new node
ALTER SYSTEM START SERVER '192.168.1.104:2882';
4.2.2 Zone-Level Failure
-- 1. Check zone status
SELECT zone, status, region FROM oceanbase.DBA_OB_ZONES;
-- 2. Stop the failed zone
ALTER SYSTEM STOP ZONE zone1;
-- 3. Adjust the replica distribution
ALTER TENANT tenant_name LOCALITY = 'F@zone2,F@zone3';
-- 4. Start the zone again (after repair)
ALTER SYSTEM START ZONE zone1;
-- 5. Restore the replica distribution
ALTER TENANT tenant_name LOCALITY = 'F@zone1,F@zone2,F@zone3';
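Locality strings like 'F@zone1,F@zone2,F@zone3' encode a replica type per zone, so a small parser is handy when a script has to drop a failed zone and later restore it. A sketch, assuming the simple `TYPE@zone` comma-separated format shown in the statements above:

```python
def parse_locality(locality: str) -> dict:
    """Parse 'F@zone1,F@zone2' into {'zone1': 'F', 'zone2': 'F'}."""
    out = {}
    for part in locality.split(','):
        replica_type, zone = part.strip().split('@')
        out[zone] = replica_type
    return out

def drop_zone(locality: str, zone: str) -> str:
    """Build the reduced locality string used while a zone is down."""
    mapping = parse_locality(locality)
    mapping.pop(zone, None)
    return ','.join(f"{t}@{z}" for z, t in mapping.items())
```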
4.3 Data Recovery and Rebuild
4.3.1 Replica Recovery
-- Check replica status
SELECT
tenant_id,
table_id,
partition_id,
svr_ip,
role,
replica_status
FROM oceanbase.__all_virtual_partition_info
WHERE replica_status != 'NORMAL';
-- Manually trigger replica recovery
ALTER SYSTEM RECOVER REPLICA;
-- Recover replicas of a specific table
ALTER TABLE table_name RECOVER PARTITION;
4.3.2 Replica Rebuild Script
#!/bin/bash
# rebuild_replica.sh
function rebuild_replica() {
local tenant=$1
local table=$2
local partition=$3
echo "Starting replica rebuild for $tenant.$table partition $partition"
# 1. Check the current replica distribution
mysql -h127.0.0.1 -P2883 -uroot@sys -p -e "
SELECT
svr_ip,
role,
replica_status
FROM oceanbase.__all_virtual_partition_info
WHERE tenant_id = (SELECT tenant_id FROM oceanbase.DBA_OB_TENANTS WHERE tenant_name = '$tenant')
AND table_id = (SELECT table_id FROM oceanbase.DBA_OB_TABLE_LOCATIONS WHERE table_name = '$table')
AND partition_id = $partition;
"
# 2. Trigger the replica rebuild
mysql -h127.0.0.1 -P2883 -uroot@sys -p -e "
ALTER SYSTEM REBUILD REPLICA
TENANT = '$tenant'
TABLE = '$table'
PARTITION = $partition;
"
# 3. Monitor rebuild progress
while true; do
status=$(mysql -h127.0.0.1 -P2883 -uroot@sys -p -e "
SELECT replica_status
FROM oceanbase.__all_virtual_partition_info
WHERE tenant_id = (SELECT tenant_id FROM oceanbase.DBA_OB_TENANTS WHERE tenant_name = '$tenant')
AND table_id = (SELECT table_id FROM oceanbase.DBA_OB_TABLE_LOCATIONS WHERE table_name = '$table')
AND partition_id = $partition
LIMIT 1;
" | tail -n 1)
if [ "$status" = "NORMAL" ]; then
echo "Replica rebuild completed successfully"
break
fi
echo "Rebuild in progress... Current status: $status"
sleep 10
done
}
# Run the rebuild
rebuild_replica "test_tenant" "test_table" 0
4.4 Failover and High Availability
4.4.1 Leader Switchover
-- Manual leader switchover
ALTER SYSTEM SWITCH REPLICA LEADER
PARTITION_ID = 1
SERVER = '192.168.1.102:2882';
-- Switch leaders in bulk
ALTER SYSTEM SWITCH REPLICA LEADER
TENANT = tenant_name
SERVER = '192.168.1.102:2882';
4.4.2 Automatic Failover Configuration
-- Failover parameters
ALTER SYSTEM SET enable_auto_leader_switch = TRUE;
ALTER SYSTEM SET leader_switch_threshold = 10;
ALTER SYSTEM SET replica_safe_remove_time = 600;
-- Set election priorities
ALTER SYSTEM SET election_priority = 'zone1:100,zone2:50,zone3:30';
5. Managing Tenant Resources
5.1 Tenant Creation and Configuration
5.1.1 Creating a Tenant
-- Create a resource unit spec
CREATE RESOURCE UNIT unit_4c8g
MAX_CPU = 4,
MIN_CPU = 4,
MAX_MEMORY = '8G',
MIN_MEMORY = '8G',
MAX_IOPS = 10000,
MIN_IOPS = 1000,
MAX_SESSION_NUM = 1000,
MAX_DISK_SIZE = '100G';
-- Create a resource pool
CREATE RESOURCE POOL pool_tenant1
UNIT = 'unit_4c8g',
UNIT_NUM = 1,
ZONE_LIST = ('zone1', 'zone2', 'zone3');
-- Create the tenant
CREATE TENANT tenant1
RESOURCE_POOL_LIST = ('pool_tenant1')
SET ob_tcp_invited_nodes = '%',
ob_compatibility_mode = 'MYSQL',
lower_case_table_names = 1;
5.1.2 Tenant Parameter Configuration
-- Set tenant variables
ALTER TENANT tenant1 SET VARIABLES
max_connections = 1000,
ob_query_timeout = 30000000,
ob_trx_timeout = 100000000,
ob_sql_work_area_percentage = 30,
parallel_servers_target = 64;
-- Set tenant system parameters
ALTER SYSTEM SET recyclebin = 'ON' TENANT = tenant1;
ALTER SYSTEM SET undo_retention = 1800 TENANT = tenant1;
5.2 Resource Management and Adjustment
5.2.1 Dynamic Resource Adjustment
-- Resize a resource unit spec
ALTER RESOURCE UNIT unit_4c8g
MAX_CPU = 8,
MAX_MEMORY = '16G';
-- Scale out the resource pool
ALTER RESOURCE POOL pool_tenant1 UNIT_NUM = 2;
-- Scale in the resource pool
ALTER RESOURCE POOL pool_tenant1 UNIT_NUM = 1;
-- Migrate the resource pool
ALTER RESOURCE POOL pool_tenant1
ZONE_LIST = ('zone1', 'zone2', 'zone4');
5.2.2 Resource Usage Monitoring
-- Tenant resource usage statistics
SELECT
t.tenant_name,
r.name AS resource_pool,
u.name AS unit_config,
u.max_cpu,
u.max_memory,
s.cpu_total,
s.cpu_assigned,
s.mem_total,
s.mem_assigned,
ROUND(s.cpu_assigned/s.cpu_total * 100, 2) AS cpu_usage_percent,
ROUND(s.mem_assigned/s.mem_total * 100, 2) AS mem_usage_percent
FROM oceanbase.DBA_OB_TENANTS t
JOIN oceanbase.DBA_OB_RESOURCE_POOLS r ON t.tenant_id = r.tenant_id
JOIN oceanbase.DBA_OB_UNIT_CONFIGS u ON r.unit_config_id = u.unit_config_id
JOIN oceanbase.__all_virtual_server_stat s ON t.tenant_id = s.tenant_id;
-- Tenant session statistics
SELECT
tenant_id,
COUNT(*) AS session_count,
COUNT(CASE WHEN state = 'ACTIVE' THEN 1 END) AS active_sessions,
COUNT(CASE WHEN state = 'IDLE' THEN 1 END) AS idle_sessions
FROM oceanbase.GV$OB_PROCESSLIST
GROUP BY tenant_id;
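The ROUND expressions above compute assignment ratios from capacity columns. The same calculation, plus a headroom check useful before scaling a pool, can be sketched as (function names are ours):

```python
def usage_percent(assigned: float, total: float) -> float:
    """Same calculation as ROUND(assigned/total * 100, 2) in the SQL above."""
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return round(assigned / total * 100, 2)

def has_headroom(assigned: float, total: float, want: float) -> bool:
    """Check whether `want` additional units fit under the capacity,
    e.g. before increasing UNIT_NUM on a resource pool."""
    return assigned + want <= total
```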
5.3 Tenant Isolation and Limits
5.3.1 Resource Isolation
-- CPU isolation
ALTER RESOURCE UNIT unit_4c8g
MAX_CPU = 4,
MIN_CPU = 2; -- guarantee a minimum CPU allocation
-- Memory isolation
ALTER RESOURCE UNIT unit_4c8g
MAX_MEMORY = '8G',
MIN_MEMORY = '4G'; -- guarantee a minimum memory allocation
-- IOPS limits
ALTER RESOURCE UNIT unit_4c8g
MAX_IOPS = 10000,
MIN_IOPS = 1000,
MAX_DISK_SIZE = '100G';
5.3.2 Network Isolation
-- Set the IP allowlist
ALTER TENANT tenant1 SET ob_tcp_invited_nodes = '192.168.1.%,192.168.2.%';
-- Limit the connection count
ALTER TENANT tenant1 SET max_connections = 500;
-- Limit concurrency
ALTER TENANT tenant1 SET VARIABLES
max_user_connections = 100,
parallel_servers_target = 32;
5.4 Tenant Backup and Migration
5.4.1 Tenant-Level Backup
-- Tenant data backup
ALTER SYSTEM BACKUP TENANT tenant1
TO 'oss://backup-bucket/tenant1-backup';
-- Tenant incremental backup
ALTER SYSTEM BACKUP INCREMENTAL TENANT tenant1
TO 'oss://backup-bucket/tenant1-backup';
-- Check tenant backup progress
SELECT
tenant_id,
job_id,
backup_type,
status,
progress,
start_time
FROM oceanbase.CDB_OB_BACKUP_JOBS
WHERE tenant_id = (SELECT tenant_id FROM oceanbase.DBA_OB_TENANTS WHERE tenant_name = 'tenant1');
5.4.2 Tenant Migration Script
#!/bin/bash
# tenant_migration.sh
SOURCE_CLUSTER="192.168.1.101:2883"
TARGET_CLUSTER="192.168.2.101:2883"
TENANT_NAME="tenant1"
# 1. Export the tenant's data
echo "Exporting tenant data..."
ob_loader -h${SOURCE_CLUSTER%:*} -P${SOURCE_CLUSTER#*:} \
-uroot@${TENANT_NAME} -p \
--mode=export \
--table='*.*' \
--file-path=/tmp/tenant_export \
--thread=16
# 2. Create the tenant on the target cluster
echo "Creating tenant on target cluster..."
mysql -h${TARGET_CLUSTER%:*} -P${TARGET_CLUSTER#*:} -uroot@sys -p -e "
CREATE RESOURCE UNIT unit_migrate
MAX_CPU = 4, MAX_MEMORY = '8G', MAX_DISK_SIZE = '100G';
CREATE RESOURCE POOL pool_migrate
UNIT = 'unit_migrate', UNIT_NUM = 1;
CREATE TENANT ${TENANT_NAME}
RESOURCE_POOL_LIST = ('pool_migrate');
"
# 3. Import the data into the target cluster
echo "Importing data to target cluster..."
ob_loader -h${TARGET_CLUSTER%:*} -P${TARGET_CLUSTER#*:} \
-uroot@${TENANT_NAME} -p \
--mode=import \
--table='*.*' \
--file-path=/tmp/tenant_export \
--thread=16
echo "Tenant migration completed"
5.5 Multi-Tenant Management Best Practices
5.5.1 Resource Planning Templates
-- OLTP tenant templates
CREATE RESOURCE UNIT unit_oltp_small MAX_CPU = 4, MAX_MEMORY = '8G', MAX_IOPS = 10000;
CREATE RESOURCE UNIT unit_oltp_medium MAX_CPU = 8, MAX_MEMORY = '16G', MAX_IOPS = 20000;
CREATE RESOURCE UNIT unit_oltp_large MAX_CPU = 16, MAX_MEMORY = '32G', MAX_IOPS = 40000;
-- OLAP tenant templates
CREATE RESOURCE UNIT unit_olap_small MAX_CPU = 8, MAX_MEMORY = '32G', MAX_IOPS = 5000;
CREATE RESOURCE UNIT unit_olap_medium MAX_CPU = 16, MAX_MEMORY = '64G', MAX_IOPS = 10000;
CREATE RESOURCE UNIT unit_olap_large MAX_CPU = 32, MAX_MEMORY = '128G', MAX_IOPS = 20000;
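The templates above scale CPU, memory, and IOPS per workload tier. When many tenants are provisioned from scripts, keeping the tiers in one lookup table and rendering the DDL from it avoids drift; a sketch (the table below is a hypothetical mirror of the statements above, not an official catalog):

```python
# Hypothetical lookup table mirroring the CREATE RESOURCE UNIT templates above
UNIT_TEMPLATES = {
    ('oltp', 'small'):  {'max_cpu': 4,  'max_memory_gb': 8,   'max_iops': 10000},
    ('oltp', 'medium'): {'max_cpu': 8,  'max_memory_gb': 16,  'max_iops': 20000},
    ('oltp', 'large'):  {'max_cpu': 16, 'max_memory_gb': 32,  'max_iops': 40000},
    ('olap', 'small'):  {'max_cpu': 8,  'max_memory_gb': 32,  'max_iops': 5000},
    ('olap', 'medium'): {'max_cpu': 16, 'max_memory_gb': 64,  'max_iops': 10000},
    ('olap', 'large'):  {'max_cpu': 32, 'max_memory_gb': 128, 'max_iops': 20000},
}

def unit_ddl(workload: str, size: str) -> str:
    """Render the CREATE RESOURCE UNIT statement for a template tier."""
    spec = UNIT_TEMPLATES[(workload, size)]
    return (f"CREATE RESOURCE UNIT unit_{workload}_{size} "
            f"MAX_CPU = {spec['max_cpu']}, "
            f"MAX_MEMORY = '{spec['max_memory_gb']}G', "
            f"MAX_IOPS = {spec['max_iops']};")
```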
6. Optimizing System Performance
6.1 SQL Performance Optimization
6.1.1 SQL Diagnosis and Analysis
-- Slow SQL analysis
SELECT
query_sql,
db_name,
user_name,
elapsed_time/1000 AS elapsed_ms,
cpu_time/1000 AS cpu_ms,
queue_time/1000 AS queue_ms,
get_plan_time/1000 AS plan_ms,
execute_time/1000 AS exec_ms,
return_rows,
affected_rows
FROM oceanbase.GV$SQL_AUDIT
WHERE elapsed_time > 1000000 -- longer than 1 second (elapsed_time is in microseconds)
AND request_time > DATE_SUB(NOW(), INTERVAL 1 HOUR)
ORDER BY elapsed_time DESC
LIMIT 20;
-- Top SQL analysis
SELECT
sql_id,
COUNT(*) AS exec_count,
AVG(elapsed_time)/1000 AS avg_elapsed_ms,
MAX(elapsed_time)/1000 AS max_elapsed_ms,
MIN(elapsed_time)/1000 AS min_elapsed_ms,
SUM(elapsed_time)/1000000 AS total_elapsed_s
FROM oceanbase.GV$SQL_AUDIT
WHERE request_time > DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY sql_id
ORDER BY total_elapsed_s DESC
LIMIT 20;
6.1.2 Execution Plan Optimization
-- Show the execution plan
EXPLAIN SELECT * FROM orders WHERE order_date > '2024-01-01';
-- Optimize with hints
SELECT /*+ INDEX(orders idx_order_date) PARALLEL(4) */
* FROM orders
WHERE order_date > '2024-01-01';
-- Create SQL plan baselines
ALTER SYSTEM SET optimizer_capture_sql_plan_baselines = TRUE;
ALTER SYSTEM SET optimizer_use_sql_plan_baselines = TRUE;
-- Pin an execution plan
CREATE OUTLINE outline_orders
ON SELECT * FROM orders WHERE order_date > '2024-01-01'
USING HINT /*+ INDEX(orders idx_order_date) */;
6.2 Index Optimization
6.2.1 Index Analysis and Creation
-- Gather table statistics
ANALYZE TABLE orders COMPUTE STATISTICS;
-- Check index usage
SELECT
table_name,
index_name,
cardinality,
avg_row_length,
data_length
FROM information_schema.statistics
WHERE table_schema = 'test'
ORDER BY cardinality DESC;
-- Create a composite index
CREATE INDEX idx_order_composite
ON orders(customer_id, order_date, status);
-- Create a function-based index
CREATE INDEX idx_upper_name
ON customers((UPPER(customer_name)));
-- Create a global index (partitioned table)
CREATE GLOBAL INDEX gidx_order_customer
ON orders(customer_id)
PARTITION BY HASH(customer_id) PARTITIONS 8;
6.2.2 Index Maintenance Script
#!/bin/bash
# index_maintenance.sh
function analyze_index_usage() {
local tenant=$1
local database=$2
echo "Analyzing index usage for $tenant.$database"
# Find unused indexes
mysql -h127.0.0.1 -P2883 -u$tenant@$database -p -e "
SELECT
t.table_name,
i.index_name,
i.cardinality,
i.avg_row_length
FROM information_schema.tables t
JOIN information_schema.statistics i
ON t.table_name = i.table_name
LEFT JOIN (
SELECT DISTINCT
table_name,
index_name
FROM oceanbase.GV\$PLAN_CACHE_PLAN_EXPLAIN
WHERE tenant_id = (SELECT tenant_id FROM oceanbase.DBA_OB_TENANTS WHERE tenant_name = '$tenant')
) used
ON i.table_name = used.table_name
AND i.index_name = used.index_name
WHERE t.table_schema = '$database'
AND i.index_name != 'PRIMARY'
AND used.index_name IS NULL;
" > unused_indexes.txt
# Find duplicate indexes
mysql -h127.0.0.1 -P2883 -u$tenant@$database -p -e "
SELECT
table_name,
GROUP_CONCAT(index_name) AS duplicate_indexes,
column_name
FROM information_schema.statistics
WHERE table_schema = '$database'
GROUP BY table_name, column_name
HAVING COUNT(DISTINCT index_name) > 1;
" > duplicate_indexes.txt
}
# Run the analysis
analyze_index_usage "test_tenant" "testdb"
6.3 Memory Optimization
6.3.1 Memory Usage Analysis
-- Tenant memory usage details
SELECT
tenant_id,
ctx_name,
hold_bytes/1024/1024 AS hold_mb,
used_bytes/1024/1024 AS used_mb,
ROUND(used_bytes/hold_bytes * 100, 2) AS usage_percent
FROM oceanbase.GV$OB_MEMORY
WHERE tenant_id > 1000
ORDER BY hold_bytes DESC;
-- MemTable memory usage
SELECT
tenant_id,
active_memstore_used/1024/1024 AS active_mb,
total_memstore_used/1024/1024 AS total_mb,
major_freeze_trigger/1024/1024 AS freeze_trigger_mb,
memstore_limit/1024/1024 AS limit_mb
FROM oceanbase.GV$OB_MEMSTORE;
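The GV$OB_MEMSTORE columns above relate as follows: a minor freeze fires once active usage crosses `freeze_trigger_percentage` of `memstore_limit`, and writes are throttled when the limit itself is reached. A sketch of that classification (the classification labels are ours):

```python
def memstore_state(active_bytes: int, limit_bytes: int,
                   freeze_trigger_percentage: int = 60) -> str:
    """Classify memstore pressure the way the freeze trigger does:
    a minor freeze fires once active usage crosses
    freeze_trigger_percentage of memstore_limit."""
    trigger = limit_bytes * freeze_trigger_percentage / 100
    if active_bytes >= limit_bytes:
        return 'LIMIT_REACHED'   # writes will be throttled or blocked
    if active_bytes >= trigger:
        return 'FREEZE_TRIGGERED'
    return 'NORMAL'
```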
6.3.2 Memory Tuning
-- Adjust SQL work area memory
ALTER SYSTEM SET ob_sql_work_area_percentage = 30 TENANT = tenant1;
-- Adjust MemTable memory
ALTER SYSTEM SET memstore_limit_percentage = 50 TENANT = tenant1;
ALTER SYSTEM SET freeze_trigger_percentage = 60 TENANT = tenant1;
-- Adjust cache sizes
ALTER SYSTEM SET cache_wash_threshold = '4G';
ALTER SYSTEM SET plan_cache_threshold = '500M';
6.4 Concurrency Optimization
6.4.1 Concurrency Settings
-- Set parallelism
ALTER SYSTEM SET parallel_servers_target = 64;
ALTER SYSTEM SET parallel_max_servers = 128;
-- Session-level parallelism
SET SESSION parallel_degree = 8;
SET SESSION parallel_min_time_threshold = 10;
-- Table-level parallelism
ALTER TABLE large_table PARALLEL 16;
6.4.2 Lock Wait Analysis
-- Check lock waits
SELECT
waiting_session_id,
waiting_query,
blocking_session_id,
blocking_query,
wait_time
FROM oceanbase.GV$OB_LOCKS
WHERE wait_time > 0;
-- Deadlock detection
SELECT
deadlock_id,
cycle_size,
cycle_info,
create_time
FROM oceanbase.V$OB_DEADLOCK_EVENT_HISTORY
ORDER BY create_time DESC;
6.5 Performance Tuning Toolkit
6.5.1 Auto-Tuning Script
#!/usr/bin/env python3
# auto_tuning.py
import pymysql
import json
from datetime import datetime, timedelta
class OBAutoTuner:
def __init__(self, host, port, user, password):
self.conn = pymysql.connect(
host=host,
port=port,
user=user,
password=password
)
self.cursor = self.conn.cursor()
def analyze_slow_queries(self):
"""分析慢查询并提供优化建议"""
sql = """
SELECT
sql_id,
query_sql,
elapsed_time,
cpu_time,
return_rows
FROM oceanbase.GV$SQL_AUDIT
WHERE elapsed_time > 1000000
AND request_time > DATE_SUB(NOW(), INTERVAL 1 HOUR)
ORDER BY elapsed_time DESC
LIMIT 10
"""
self.cursor.execute(sql)
slow_queries = self.cursor.fetchall()
recommendations = []
for query in slow_queries:
sql_id, query_sql, elapsed_time, cpu_time, return_rows = query
# Inspect query characteristics
if 'SELECT' in query_sql.upper():
if return_rows > 10000:
recommendations.append({
'sql_id': sql_id,
'type': 'LARGE_RESULT_SET',
'suggestion': 'Consider adding LIMIT clause or pagination'
})
if 'JOIN' in query_sql.upper():
recommendations.append({
'sql_id': sql_id,
'type': 'JOIN_OPTIMIZATION',
'suggestion': 'Check join conditions and consider adding indexes'
})
return recommendations
def optimize_memory_settings(self):
"""根据使用情况优化内存设置"""
sql = """
SELECT
AVG(mem_used/mem_total) AS avg_usage,
MAX(mem_used/mem_total) AS max_usage
FROM oceanbase.__all_virtual_server_stat
"""
self.cursor.execute(sql)
result = self.cursor.fetchone()
avg_usage, max_usage = result
if max_usage > 0.9:
# Memory usage is high; recommend an adjustment
return {
'action': 'INCREASE_MEMORY',
'current_usage': f"{max_usage*100:.2f}%",
'recommendation': 'Consider increasing memory_limit_percentage'
}
elif avg_usage < 0.5:
# Memory usage is low; allocation can be reduced
return {
'action': 'DECREASE_MEMORY',
'current_usage': f"{avg_usage*100:.2f}%",
'recommendation': 'Consider decreasing memory allocation'
}
return {'action': 'NO_CHANGE', 'status': 'Memory usage is optimal'}
def generate_report(self):
"""生成性能优化报告"""
report = {
'timestamp': datetime.now().isoformat(),
'slow_queries': self.analyze_slow_queries(),
'memory_optimization': self.optimize_memory_settings()
}
with open('performance_report.json', 'w') as f:
json.dump(report, f, indent=2)
return report
# Usage example
if __name__ == "__main__":
tuner = OBAutoTuner('127.0.0.1', 2883, 'root@sys', 'password')
report = tuner.generate_report()
print(json.dumps(report, indent=2))
7. Security Auditing and Hardening
7.1 Access Control Management
7.1.1 User and Privilege Management
-- Create a user
CREATE USER 'app_user'@'%' IDENTIFIED BY 'Strong_Pass123!';
-- Grant privileges
GRANT SELECT, INSERT, UPDATE, DELETE ON testdb.* TO 'app_user'@'%';
GRANT SELECT ON oceanbase.* TO 'monitor_user'@'192.168.1.%';
-- Create a role
CREATE ROLE app_role;
GRANT SELECT, INSERT, UPDATE ON testdb.* TO app_role;
GRANT app_role TO 'app_user'@'%';
-- Show user privileges
SHOW GRANTS FOR 'app_user'@'%';
-- Revoke privileges
REVOKE DELETE ON testdb.* FROM 'app_user'@'%';
7.1.2 Password Policy Configuration
-- Set password complexity requirements
ALTER SYSTEM SET validate_password_length = 12;
ALTER SYSTEM SET validate_password_mixed_case_count = 2;
ALTER SYSTEM SET validate_password_number_count = 2;
ALTER SYSTEM SET validate_password_special_char_count = 2;
-- Set the password expiry policy
ALTER USER 'app_user'@'%' PASSWORD EXPIRE INTERVAL 90 DAY;
-- Lock accounts after failed logins
ALTER USER 'app_user'@'%'
FAILED_LOGIN_ATTEMPTS 5
PASSWORD_LOCK_TIME 30;
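The `validate_password_*` parameters above require a minimum length plus counts of mixed-case letters, digits, and special characters. A client-side pre-check with the same defaults can reject weak passwords before the server does (a sketch; the function and rule names are ours, not an OceanBase API):

```python
import re

def validate_password(pw: str,
                      min_length: int = 12,
                      mixed_case_count: int = 2,
                      number_count: int = 2,
                      special_char_count: int = 2) -> list:
    """Return a list of violated rules (empty means the password passes),
    mirroring the validate_password_* parameters set above."""
    problems = []
    if len(pw) < min_length:
        problems.append('length')
    if sum(c.islower() for c in pw) < mixed_case_count or \
       sum(c.isupper() for c in pw) < mixed_case_count:
        problems.append('mixed_case')
    if sum(c.isdigit() for c in pw) < number_count:
        problems.append('number')
    if len(re.findall(r'[^A-Za-z0-9]', pw)) < special_char_count:
        problems.append('special_char')
    return problems
```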
7.2 Audit Configuration
7.2.1 Enabling Auditing
-- Enable SQL audit
ALTER SYSTEM SET enable_sql_audit = TRUE;
ALTER SYSTEM SET sql_audit_percentage = 100;
-- Configure audit policies
CREATE AUDIT POLICY sensitive_data_access
PRIVILEGES SELECT
ON testdb.sensitive_table
BY 'app_user'
WHENEVER SUCCESSFUL;
-- Enable system log auditing
ALTER SYSTEM SET enable_syslog_recycle = TRUE;
ALTER SYSTEM SET max_syslog_file_count = 100;
ALTER SYSTEM SET syslog_level = 'INFO';
7.2.2 Audit Log Analysis
-- Query the audit log
SELECT
user_name,
db_name,
query_sql,
return_rows,
affected_rows,
request_time,
client_ip
FROM oceanbase.GV$SQL_AUDIT
WHERE db_name = 'sensitive_db'
AND request_time > DATE_SUB(NOW(), INTERVAL 24 HOUR)
ORDER BY request_time DESC;
-- Detect anomalous access
SELECT
user_name,
client_ip,
COUNT(*) AS access_count,
COUNT(DISTINCT db_name) AS db_count,
COUNT(DISTINCT table_name) AS table_count
FROM oceanbase.GV$SQL_AUDIT
WHERE request_time > DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY user_name, client_ip
HAVING access_count > 1000 OR db_count > 5;
7.3 Data Encryption
7.3.1 Transparent Data Encryption (TDE)
-- Create the master key
CREATE MASTER KEY;
-- Enable encryption for a tablespace
ALTER TABLESPACE tablespace1 ENCRYPTION = 'Y';
-- Create an encrypted table
CREATE TABLE encrypted_table (
id INT PRIMARY KEY,
sensitive_data VARCHAR(100)
) ENCRYPTION = 'Y';
-- Check encryption status
SELECT
table_name,
encryption
FROM information_schema.tables
WHERE table_schema = 'testdb';
7.3.2 Transport Encryption
# Configure SSL/TLS
# Generate a self-signed certificate (for testing)
openssl req -x509 -newkey rsa:2048 -keyout server-key.pem -out server-cert.pem -days 365 -nodes
# Enable OceanBase SSL. Note: observer.config.bin is a binary file and must
# not be edited directly; set the parameter via SQL instead, and deploy the
# certificate files (server-cert.pem, server-key.pem, ca-cert.pem) to the
# observer's certificate directory.
obclient -h127.0.0.1 -P2883 -uroot@sys -p -e "
ALTER SYSTEM SET ssl_client_authentication = TRUE;
"
# Restart the observer processes for the change to take effect
7.4 Security Hardening
7.4.1 Network Security
-- Configure the IP allowlist
ALTER TENANT tenant1 SET ob_tcp_invited_nodes = '192.168.1.0/24,10.0.0.0/8';
-- Limit connection counts
ALTER SYSTEM SET max_connections = 1000;
ALTER TENANT tenant1 SET max_connections = 500;
Configure firewall rules at the OS level:
#!/bin/bash
# firewall_config.sh
# Allow OceanBase ports
firewall-cmd --permanent --add-port=2881/tcp # SQL port
firewall-cmd --permanent --add-port=2882/tcp # RPC port
firewall-cmd --permanent --add-port=2883/tcp # MySQL protocol port
# Restrict source IPs
firewall-cmd --permanent --add-rich-rule='
rule family="ipv4"
source address="192.168.1.0/24"
port protocol="tcp" port="2883" accept'
# Reload firewall rules
firewall-cmd --reload
7.4.2 Security Check Script
#!/usr/bin/env python3
# security_check.py
import pymysql
import hashlib
from datetime import datetime
class SecurityAuditor:
def __init__(self, host, port, user, password):
self.conn = pymysql.connect(
host=host,
port=port,
user=user,
password=password
)
self.cursor = self.conn.cursor()
self.issues = []
def check_weak_passwords(self):
"""检查弱密码"""
weak_passwords = ['123456', 'password', 'admin', 'root']
sql = "SELECT user, host FROM mysql.user"
self.cursor.execute(sql)
users = self.cursor.fetchall()
for user, host in users:
# Illustrative only; verify passwords through an out-of-band mechanism
self.issues.append({
'type': 'PASSWORD_CHECK',
'user': f"{user}@{host}",
'message': 'Ensure strong password is set'
})
def check_privileges(self):
"""检查过度授权"""
sql = """
SELECT
user,
host,
COUNT(*) AS priv_count
FROM information_schema.user_privileges
GROUP BY user, host
HAVING priv_count > 10
"""
self.cursor.execute(sql)
over_privileged = self.cursor.fetchall()
for user, host, count in over_privileged:
self.issues.append({
'type': 'OVER_PRIVILEGED',
'user': f"{user}@{host}",
'message': f'User has {count} privileges, consider reviewing'
})
def check_audit_settings(self):
"""检查审计设置"""
sql = "SHOW PARAMETERS LIKE '%audit%'"
self.cursor.execute(sql)
audit_params = self.cursor.fetchall()
required_settings = {
'enable_sql_audit': 'TRUE',
'sql_audit_percentage': '100'
}
for param, value, *_ in audit_params:
if param in required_settings:
if value != required_settings[param]:
self.issues.append({
'type': 'AUDIT_CONFIG',
'parameter': param,
'current': value,
'recommended': required_settings[param]
})
def generate_report(self):
"""生成安全报告"""
report = {
'timestamp': datetime.now().isoformat(),
'total_issues': len(self.issues),
'issues': self.issues
}
with open('security_audit_report.json', 'w') as f:
import json
json.dump(report, f, indent=2)
return report
# Run the security checks
if __name__ == "__main__":
auditor = SecurityAuditor('127.0.0.1', 2883, 'root@sys', 'password')
auditor.check_weak_passwords()
auditor.check_privileges()
auditor.check_audit_settings()
report = auditor.generate_report()
print(f"Security audit completed. Found {report['total_issues']} issues.")
7.5 Compliance Management
7.5.1 Data Masking
-- Create a masking function
DELIMITER //
CREATE FUNCTION mask_sensitive_data(input VARCHAR(100))
RETURNS VARCHAR(100)
DETERMINISTIC
BEGIN
DECLARE masked VARCHAR(100);
IF LENGTH(input) <= 4 THEN
SET masked = REPEAT('*', LENGTH(input));
ELSE
SET masked = CONCAT(
LEFT(input, 2),
REPEAT('*', LENGTH(input) - 4),
RIGHT(input, 2)
);
END IF;
RETURN masked;
END//
DELIMITER ;
-- Create a masked view
CREATE VIEW customer_masked AS
SELECT
customer_id,
mask_sensitive_data(customer_name) AS customer_name,
mask_sensitive_data(phone) AS phone,
mask_sensitive_data(email) AS email
FROM customers;
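The SQL function above keeps the first and last two characters and masks the middle; strings of four characters or fewer are fully masked. The same logic in Python is useful for application-side masking before data ever reaches logs or exports:

```python
def mask_sensitive_data(value: str) -> str:
    """Python equivalent of the SQL mask_sensitive_data() function:
    strings of length <= 4 are fully masked; longer strings keep the
    first and last two characters."""
    if len(value) <= 4:
        return '*' * len(value)
    return value[:2] + '*' * (len(value) - 4) + value[-2:]
```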
8. Emergency Response Plans
8.1 Emergency Response Framework
8.1.1 Emergency Organization
# emergency_team.yaml
emergency_response_team:
  leader:
    role: "Emergency commander"
    responsibilities:
      - "Decide on the emergency response plan"
      - "Coordinate resource allocation"
      - "Handle external communication"
  technical_team:
    dba_lead:
      role: "DBA lead"
      responsibilities:
        - "Database fault diagnosis"
        - "Data recovery operations"
        - "Emergency performance tuning"
    network_lead:
      role: "Network lead"
      responsibilities:
        - "Network troubleshooting"
        - "Network traffic control"
        - "Security hardening"
    app_lead:
      role: "Application lead"
      responsibilities:
        - "Application service switchover"
        - "Service degradation handling"
        - "User notification coordination"
  support_team:
    monitoring:
      role: "Monitoring on-call"
      responsibilities:
        - "24x7 monitoring"
        - "First response to incidents"
        - "Triggering emergency notifications"
8.1.2 Emergency Response Workflow
8.2 Failure Scenarios and Handling
8.2.1 Cluster Failure Emergency Handling
#!/bin/bash
# cluster_emergency.sh
# Full cluster recovery procedure
function cluster_recovery() {
echo "========== 集群应急恢复开始 =========="
# 1. 检查集群状态
echo "[Step 1] 检查集群状态..."
obclient -h127.0.0.1 -P2883 -uroot@sys -p -e "
SELECT svr_ip, svr_port, zone, status
FROM oceanbase.DBA_OB_SERVERS;
" || echo "集群无法连接"
# 2. 检查各节点进程
echo "[Step 2] 检查OBServer进程..."
for node in 192.168.1.101 192.168.1.102 192.168.1.103; do
echo "检查节点 $node..."
ssh $node "ps aux | grep observer | grep -v grep" || {
echo "节点 $node 的observer进程异常"
# 尝试启动observer
ssh $node "cd /home/admin/oceanbase && ./bin/observer &"
}
done
# 3. 检查多数派
echo "[Step 3] 检查Paxos多数派..."
available_nodes=$(obclient -h127.0.0.1 -P2883 -uroot@sys -p -e "
SELECT COUNT(*) FROM oceanbase.DBA_OB_SERVERS WHERE status = 'ACTIVE';
" | tail -1)
if [ $available_nodes -lt 2 ]; then
echo "警告:可用节点少于多数派要求,需要强制恢复"
# 执行强制恢复流程
force_recovery
fi
# 4. 检查租户状态
echo "[Step 4] 检查租户服务状态..."
obclient -h127.0.0.1 -P2883 -uroot@sys -p -e "
SELECT tenant_name, tenant_status
FROM oceanbase.DBA_OB_TENANTS;
"
echo "========== 集群应急恢复完成 =========="
}
# 强制恢复流程
function force_recovery() {
echo "执行强制恢复流程..."
# 1. 停止所有节点
for node in 192.168.1.101 192.168.1.102 192.168.1.103; do
ssh $node "pkill -9 observer"
done
# 2. 清理状态
for node in 192.168.1.101 192.168.1.102 192.168.1.103; do
ssh $node "rm -f /home/admin/oceanbase/store/clog/*"
done
# 3. 重新启动bootstrap节点
ssh 192.168.1.101 "cd /home/admin/oceanbase && ./bin/observer -r '192.168.1.101:2882' -p 2881 -P 2882 -n ob_cluster -c 1 -d /home/admin/oceanbase/store -i eth0 -o 'memory_limit=8G,system_memory=2G,datafile_size=50G,config_additional_dir=/home/admin/oceanbase/etc'"
sleep 30
# 4. 加入其他节点
for node in 192.168.1.102 192.168.1.103; do
ssh $node "cd /home/admin/oceanbase && ./bin/observer -r '192.168.1.101:2882,192.168.1.102:2882,192.168.1.103:2882' -p 2881 -P 2882 -n ob_cluster -c 1 -d /home/admin/oceanbase/store -i eth0"
done
}
# 执行恢复
cluster_recovery
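Step 3 of the script above hardcodes the majority threshold (`-lt 2`) for a 3-node cluster. The Paxos majority for n replicas is ⌊n/2⌋ + 1, which a small helper can generalize; this is a standalone Python sketch of the check, not part of the shell script:

```python
def paxos_majority(replica_count: int) -> int:
    """Minimum number of live replicas needed to form a Paxos majority."""
    return replica_count // 2 + 1

def needs_force_recovery(active_nodes: int, replica_count: int) -> bool:
    """True when the surviving nodes can no longer form a majority."""
    return active_nodes < paxos_majority(replica_count)

# For a 3-replica cluster, losing 2 nodes loses the majority:
print(needs_force_recovery(1, 3))  # True
print(needs_force_recovery(2, 3))  # False
```

With this, the same check scales to 5-replica deployments (majority 3) without editing the threshold by hand.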
8.2.2 Data Corruption Emergency Recovery
-- Detect and repair corrupted data pages
-- 1. Check for corrupted data pages
SELECT
    tenant_id,
    svr_ip,
    table_id,
    partition_id,
    macro_block_id,
    error_type,
    error_msg
FROM oceanbase.__all_virtual_macro_block_marker
WHERE error_type IS NOT NULL;
-- 2. Mark the corrupted macro block
ALTER SYSTEM MARK CORRUPTED MACRO BLOCK
    tenant_id = 1001,
    table_id = 500001,
    partition_id = 0,
    macro_block_id = 12345;
-- 3. Trigger replica repair
ALTER SYSTEM RECOVER PARTITION
    tenant = 'tenant1',
    table = 'corrupted_table',
    partition = 'p1';
-- 4. Verify the repair result
SELECT
    replica_status,
    data_checksum,
    row_count
FROM oceanbase.__all_virtual_partition_info
WHERE table_id = 500001;
8.3 Disaster Recovery Plan
8.3.1 Primary/Standby Switchover Plan
#!/bin/bash
# dr_switchover.sh

PRIMARY_CLUSTER="192.168.1.101:2883"
STANDBY_CLUSTER="192.168.2.101:2883"

function switchover_to_standby() {
    echo "Starting primary/standby switchover..."

    # 1. Stop writes on the primary cluster
    echo "[1] Stopping writes on the primary cluster..."
    mysql -h${PRIMARY_CLUSTER%:*} -P${PRIMARY_CLUSTER#*:} -uroot@sys -p -e "
        ALTER SYSTEM SET read_only = ON;
    "

    # 2. Wait for data synchronization to finish
    echo "[2] Waiting for data synchronization..."
    while true; do
        lag=$(mysql -h${STANDBY_CLUSTER%:*} -P${STANDBY_CLUSTER#*:} -uroot@sys -p -e "
            SELECT MAX(lag_seconds) FROM oceanbase.__all_virtual_tenant_replication_info;
        " | tail -1)
        if [ "$lag" -eq 0 ]; then
            echo "Data synchronization complete"
            break
        fi
        echo "Replication lag: ${lag}s, waiting..."
        sleep 5
    done

    # 3. Promote the standby cluster to primary
    echo "[3] Promoting the standby cluster to primary..."
    mysql -h${STANDBY_CLUSTER%:*} -P${STANDBY_CLUSTER#*:} -uroot@sys -p -e "
        ALTER SYSTEM SET read_only = OFF;
        ALTER SYSTEM SET cluster_role = 'PRIMARY';
    "

    # 4. Update the application connection configuration
    echo "[4] Updating application configuration..."
    update_app_config $STANDBY_CLUSTER

    # 5. Verify the switchover result
    echo "[5] Verifying switchover result..."
    verify_switchover
    echo "Primary/standby switchover complete!"
}

function verify_switchover() {
    # Verify the new primary cluster's status
    mysql -h${STANDBY_CLUSTER%:*} -P${STANDBY_CLUSTER#*:} -uroot@sys -p -e "
        SELECT
            cluster_role,
            cluster_status,
            read_only
        FROM oceanbase.__all_cluster;
    "
}

# Run the switchover
switchover_to_standby
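The synchronization wait in step 2 above loops forever if the standby never catches up. The same wait can be written with a timeout; a Python sketch, where `fetch_lag` is a hypothetical stand-in for the `__all_virtual_tenant_replication_info` query in the shell script:

```python
import time

def wait_for_sync(fetch_lag, timeout_s=600, interval_s=5,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll replication lag until it reaches 0 or the timeout expires.

    fetch_lag: callable returning the current lag in seconds (stand-in for
    the SQL query in the script above).
    Returns True once the standby has caught up, False on timeout.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if fetch_lag() == 0:
            return True
        sleep(interval_s)
    return False

# Simulated lag that drains to zero over a few polls:
lags = iter([6, 4, 2, 0])
print(wait_for_sync(lambda: next(lags), timeout_s=60, interval_s=0))  # True
```

On timeout the operator can abort the switchover and re-enable writes on the primary, instead of sitting in an unbounded loop with the primary frozen read-only.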
8.3.2 Cross-Site Disaster Recovery Drill
# disaster_recovery_drill.yaml
dr_drill_plan:
  schedule:
    frequency: "quarterly"
    duration: "4 hours"
  participants:
    - "DBA team"
    - "Operations team"
    - "Application team"
  scenarios:
    - name: "Primary data hall power loss"
      steps:
        - "Simulate a network outage in the primary data hall"
        - "Trigger automatic failover"
        - "Verify the standby data hall takes over service"
        - "Test data consistency"
        - "Perform the switchback"
    - name: "Data-center-level failure"
      steps:
        - "Stop all services in the primary data center"
        - "Activate the DR data center"
        - "Verify RPO and RTO targets"
        - "Test business functionality end to end"
    - name: "Data corruption recovery"
      steps:
        - "Simulate data file corruption"
        - "Restore data from backup"
        - "Verify data integrity"
        - "Test business continuity"
  success_criteria:
    rpo: "< 1 minute"
    rto: "< 30 minutes"
    data_integrity: "100%"
    service_availability: "> 99.9%"
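After each drill, the measured metrics can be checked against the `success_criteria` block mechanically rather than by eye. A minimal sketch, assuming the metrics have been collected as seconds and percentages; the threshold constants mirror the YAML above, and the metric key names are illustrative:

```python
# Thresholds mirroring the success_criteria block above.
CRITERIA = {
    "rpo_seconds": 60,           # "< 1 minute"
    "rto_seconds": 30 * 60,      # "< 30 minutes"
    "data_integrity_pct": 100,   # "100%"
    "availability_pct": 99.9,    # "> 99.9%"
}

def evaluate_drill(metrics: dict) -> dict:
    """Return a pass/fail verdict per criterion for one drill's metrics."""
    return {
        "rpo": metrics["rpo_seconds"] < CRITERIA["rpo_seconds"],
        "rto": metrics["rto_seconds"] < CRITERIA["rto_seconds"],
        "data_integrity": metrics["data_integrity_pct"] >= CRITERIA["data_integrity_pct"],
        "availability": metrics["availability_pct"] > CRITERIA["availability_pct"],
    }

result = evaluate_drill({"rpo_seconds": 20, "rto_seconds": 25 * 60,
                         "data_integrity_pct": 100, "availability_pct": 99.95})
print(all(result.values()))  # True
```

Recording the per-criterion verdicts (rather than a single overall pass) makes the quarterly drill reports directly comparable across scenarios.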
8.4 Emergency Toolbox
8.4.1 Quick Diagnosis Tool
#!/usr/bin/env python3
# quick_diagnosis.py
import pymysql
import subprocess
import json
from datetime import datetime

class EmergencyDiagnostic:
    def __init__(self, cluster_endpoints):
        self.endpoints = cluster_endpoints
        self.diagnosis_result = {
            'timestamp': datetime.now().isoformat(),
            'cluster_health': {},
            'issues': [],
            'recommendations': []
        }

    def check_cluster_connectivity(self):
        """Check connectivity to each cluster endpoint."""
        for endpoint in self.endpoints:
            host, port = endpoint.split(':')
            try:
                conn = pymysql.connect(
                    host=host,
                    port=int(port),
                    user='root@sys',
                    password='password',
                    connect_timeout=5
                )
                conn.close()
                self.diagnosis_result['cluster_health'][endpoint] = 'ONLINE'
            except Exception as e:
                self.diagnosis_result['cluster_health'][endpoint] = 'OFFLINE'
                self.diagnosis_result['issues'].append({
                    'type': 'CONNECTIVITY',
                    'endpoint': endpoint,
                    'error': str(e)
                })

    def check_system_resources(self):
        """Check local system resource usage."""
        commands = {
            'cpu': "top -bn1 | grep 'Cpu(s)' | awk '{print $2}'",
            'memory': "free -m | grep Mem | awk '{print ($3/$2)*100}'",
            'disk': "df -h | grep -E '^/dev/' | awk '{print $5}'"
        }
        for resource, cmd in commands.items():
            try:
                result = subprocess.check_output(cmd, shell=True).decode().strip()
                if resource == 'disk':
                    usage = max([int(x.rstrip('%')) for x in result.split('\n')])
                else:
                    usage = float(result.rstrip('%'))
                if usage > 80:
                    self.diagnosis_result['issues'].append({
                        'type': 'RESOURCE',
                        'resource': resource,
                        'usage': f"{usage}%",
                        'severity': 'HIGH'
                    })
            except Exception as e:
                print(f"Failed to check {resource}: {e}")

    def check_critical_services(self):
        """Check the status of critical cluster services."""
        critical_queries = [
            ("SELECT COUNT(*) FROM oceanbase.DBA_OB_SERVERS WHERE status != 'ACTIVE'",
             "inactive_servers"),
            ("SELECT COUNT(*) FROM oceanbase.DBA_OB_TENANTS WHERE tenant_status != 'NORMAL'",
             "abnormal_tenants"),
            ("SELECT COUNT(*) FROM oceanbase.__all_virtual_partition_info WHERE replica_status != 'NORMAL'",
             "abnormal_replicas")
        ]
        for endpoint in self.endpoints:
            if self.diagnosis_result['cluster_health'].get(endpoint) == 'ONLINE':
                host, port = endpoint.split(':')
                try:
                    conn = pymysql.connect(
                        host=host,
                        port=int(port),
                        user='root@sys',
                        password='password'
                    )
                    cursor = conn.cursor()
                    for query, check_name in critical_queries:
                        cursor.execute(query)
                        count = cursor.fetchone()[0]
                        if count > 0:
                            self.diagnosis_result['issues'].append({
                                'type': 'SERVICE',
                                'check': check_name,
                                'count': count,
                                'endpoint': endpoint
                            })
                    conn.close()
                    break  # Checking from one reachable node is enough
                except Exception as e:
                    print(f"Failed to check services on {endpoint}: {e}")

    def generate_recommendations(self):
        """Generate emergency recommendations from the collected issues."""
        for issue in self.diagnosis_result['issues']:
            if issue['type'] == 'CONNECTIVITY':
                self.diagnosis_result['recommendations'].append({
                    'issue': issue,
                    'action': 'Check network connectivity and observer process',
                    'command': f"ssh {issue['endpoint'].split(':')[0]} 'ps aux | grep observer'"
                })
            elif issue['type'] == 'RESOURCE':
                self.diagnosis_result['recommendations'].append({
                    'issue': issue,
                    'action': f"Clear {issue['resource']} resources",
                    'priority': 'HIGH' if float(issue['usage'].rstrip('%')) > 90 else 'MEDIUM'
                })
            elif issue['type'] == 'SERVICE':
                self.diagnosis_result['recommendations'].append({
                    'issue': issue,
                    'action': f"Investigate and fix {issue['check']}",
                    'query': f"Check details for {issue['check']} on {issue['endpoint']}"
                })

    def run_diagnosis(self):
        """Run the full diagnosis."""
        print("Starting emergency diagnosis...")
        self.check_cluster_connectivity()
        self.check_system_resources()
        self.check_critical_services()
        self.generate_recommendations()
        # Save the diagnosis result
        with open('emergency_diagnosis.json', 'w') as f:
            json.dump(self.diagnosis_result, f, indent=2)
        # Print the key information
        print(f"\nDiagnosis completed at {self.diagnosis_result['timestamp']}")
        print(f"Found {len(self.diagnosis_result['issues'])} issues")
        print(f"Generated {len(self.diagnosis_result['recommendations'])} recommendations")
        if self.diagnosis_result['issues']:
            print("\nCritical Issues:")
            for issue in self.diagnosis_result['issues'][:5]:  # show the first 5 issues
                print(f"  - {issue['type']}: {issue}")
        return self.diagnosis_result

# Usage example
if __name__ == "__main__":
    clusters = [
        "192.168.1.101:2883",
        "192.168.1.102:2883",
        "192.168.1.103:2883"
    ]
    diagnostic = EmergencyDiagnostic(clusters)
    result = diagnostic.run_diagnosis()
8.5 Emergency Communication and Reporting
8.5.1 Emergency Notification Template
# emergency_notification.py
from datetime import datetime

def generate_emergency_notification(incident_type, severity, description):
    """Generate an emergency notification message."""
    template = f"""
[OceanBase Cluster Emergency Notification]
Incident type: {incident_type}
Severity: {severity}
Occurred at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

Description:
{description}

Impact scope:
- Affected services: [list affected services]
- Affected users: [estimated number of affected users]
- Business impact: [describe the business impact]

Current status:
- Emergency team activated
- Recovery operations in progress
- Estimated recovery time: [ETA]

Contacts:
- Incident commander: 13800138000
- Technical support: 13800138001
- Status updates: https://status.company.com

All relevant personnel please respond immediately!
"""
    return template

def send_notification(message, channels=('email', 'sms', 'dingtalk')):
    """Send the emergency notification over the given channels."""
    for channel in channels:
        if channel == 'email':
            # Send email notification
            pass
        elif channel == 'sms':
            # Send SMS notification
            pass
        elif channel == 'dingtalk':
            # Send DingTalk notification
            pass
8.5.2 Fault Report Template
# OceanBase Cluster Fault Report
## 1. Overview
- **Incident ID**: INC-20240315-001
- **Severity**: P0
- **Start time**: 2024-03-15 14:30:00
- **Recovery time**: 2024-03-15 16:45:00
- **Duration**: 2 hours 15 minutes
- **Scope**: Production primary cluster
## 2. Symptoms
- Cluster node 192.168.1.101 stopped responding
- 30% of SQL requests timed out
- Some tenants could not be accessed
## 3. Cause Analysis
### Direct cause
- The OBServer process crashed due to out-of-memory
### Root cause
- A large query pushed memory usage over the limit
- Memory limit parameters were misconfigured
## 4. Handling Timeline
| Time | Action | Result |
|------|--------|--------|
| 14:30 | Monitoring alert fired | Fault confirmed |
| 14:35 | Emergency plan activated | Response team assembled |
| 14:45 | Faulty node restarted | Node recovered |
| 15:00 | Memory parameters adjusted | Parameters took effect |
| 16:00 | Full functional verification | Verification passed |
| 16:45 | Recovery confirmed | Service back to normal |
## 5. Remediation
1. **Short term**
- Tune the memory parameter configuration
- Add memory monitoring alerts
2. **Long term**
- Upgrade the cluster hardware
- Introduce an SQL review process
- Improve capacity planning
## 6. Lessons Learned
- Resource control for large queries must be strengthened
- The emergency response workflow needs further refinement
- Monitoring alert thresholds need tuning
## 7. Attachments
- Logs from the incident window: [link]
- Monitoring screenshots: [link]
- Detailed analysis report: [link]
Summary
This operations manual covers the key aspects of running an OceanBase 4.2.x Enterprise Edition cluster:
- Monitoring - build a comprehensive monitoring system to detect problems early
- Parameter tuning - optimize system parameters for the workload's characteristics
- Backup and recovery - define a solid backup strategy to keep data safe
- Fault handling - respond to and resolve fault scenarios quickly
- Tenant management - allocate and manage multi-tenant resources sensibly
- Performance optimization - continuously improve system performance and user experience
- Security hardening - strengthen protection to keep data secure and compliant
- Emergency plans - establish a complete emergency response mechanism
By putting this manual into practice, you can materially improve the stability, availability, and performance of an OceanBase cluster and provide reliable database support for the business.