1. RegionServer-level monitoring
Metric | Type (GAUGE, COUNTER) | Unit | Description | Category |
---|---|---|---|---|
regionCount | GAUGE | | The number of regions hosted by the RegionServer | Objects hosted by the RegionServer |
storeCount | GAUGE | | | |
storeFileCount | GAUGE | | The number of store files on disk currently managed by the RegionServer | |
storeFileSize | GAUGE | | Aggregate size of the store files on disk | |
hlogFileCount | GAUGE | | The number of write-ahead logs not yet archived | |
totalRequestCount | COUNTER | | The total number of requests received | Load |
readRequestCount | COUNTER | | The number of read requests received | |
writeRequestCount | COUNTER | | The number of write requests received | |
numOpenConnections | GAUGE | | The number of open connections at the RPC layer | Connections and queues |
numActiveHandler | GAUGE | | The number of RPC handlers actively servicing requests | |
numCallsInGeneralQueue | GAUGE | | The number of currently enqueued user requests | |
numCallsInReplicationQueue | GAUGE | | The number of currently enqueued operations received from replication | |
numCallsInPriorityQueue | GAUGE | | The number of currently enqueued priority (internal housekeeping) requests | |
flushQueueLength | GAUGE | | Current depth of the memstore flush queue. If it keeps increasing, flushing memstores out to HDFS is falling behind. | |
compactionQueueLength | GAUGE | | Current depth of the compaction request queue. If it keeps increasing, store file compaction is falling behind. | |
updatesBlockedTime | COUNTER | ms | Number of milliseconds updates have been blocked so the memstore can be flushed | |
blockCacheHitCount | COUNTER | | The number of block cache hits | Block cache usage |
blockCacheMissCount | COUNTER | | The number of block cache misses | |
blockCacheExpressHitPercent | GAUGE | percent | The percentage of requests with caching turned on that hit the cache | |
percentFilesLocal | GAUGE | percent | Percentage of store file data that can be read from the local DataNode, 0-100 | File locality ratio |
<op>_<measure> | GAUGE | | Operation latencies, where <op> is one of Append, Delete, Mutate, Get, Replay, Increment; and where <measure> is one of min, max, mean, median, 75th_percentile, 95th_percentile, 99th_percentile | Detailed counters for each operation type |
slow<op>Count | COUNTER | | The number of operations considered slow, where <op> is one of the operations listed above | |
GcTimeMillis | COUNTER | ms | Time spent in garbage collection | GC time |
GcTimeMillisParNew | COUNTER | ms | Time spent in garbage collection of the young generation | |
GcTimeMillisConcurrentMarkSweep | COUNTER | ms | Time spent in garbage collection of the old generation | |
authenticationSuccesses | COUNTER | | Number of client connections where authentication succeeded | ACL module statistics |
authenticationFailures | COUNTER | | Number of client connection authentication failures | |
mutationsWithoutWALCount | COUNTER | | Count of writes submitted with a flag indicating they should bypass the write-ahead log | |
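
The metrics below are non-core and have not been implemented yet.

Metric | Type (GAUGE, COUNTER) | Unit | Description | Category |
---|---|---|---|---|
compactedCellsCount | COUNTER | | Number of cells compacted | Cell statistics |
majorCompactedCellsCount | COUNTER | | Number of cells compacted in major compactions | |
flushedCellsSize | COUNTER | | Size of the data flushed to disk | |
blockedRequestCount | COUNTER | | Number of flushes triggered because the memstore exceeded its threshold | |
splitRequestCount | COUNTER | | Number of region split requests | Region splits |
splitSuccessCounnt | COUNTER | | Number of successful region splits | |
receivedBytes | COUNTER | bytes | Volume of data received | Bandwidth |
sentBytes | COUNTER | bytes | Volume of data sent | |
compactionQueueSize | GAUGE | | Size of the compaction queue | Compaction statistics |
compactionSize_avg_time | GAUGE | ms | Data size of a single compaction | |
compactionSize_num_ops | COUNTER | | Number of compactions performed | |
compactionTime_avg_time | GAUGE | ms | Average duration of a single compaction | |
compactionTime_num_ops | COUNTER | | Number of compactions performed | |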
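
The Type column affects how a value should be read: a GAUGE is a point-in-time reading, while a COUNTER is cumulative and is usually turned into a rate or a delta before it is graphed or alerted on. The following is a minimal Python sketch of that conversion, not part of any HBase tooling. The helper names `counter_rate` and `block_cache_hit_ratio`, the sample numbers, and the choice to treat a decreasing counter as a process restart are all assumptions made for illustration.

```python
from typing import Mapping, Optional


def counter_rate(prev: float, curr: float, interval_s: float) -> float:
    """Per-second rate of a COUNTER metric between two samples.

    Assumption: if the value went down, the process restarted and the
    counter began again from zero, so the current value is the increase.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr - prev if curr >= prev else curr
    return delta / interval_s


def block_cache_hit_ratio(prev: Mapping[str, float],
                          curr: Mapping[str, float]) -> Optional[float]:
    """Hit ratio (0-100) over one sampling interval, from the two COUNTERs.

    Returns None when no block cache lookups happened in the interval.
    """
    hits = curr["blockCacheHitCount"] - prev["blockCacheHitCount"]
    misses = curr["blockCacheMissCount"] - prev["blockCacheMissCount"]
    total = hits + misses
    if total <= 0:
        return None
    return 100.0 * hits / total


# Two hypothetical samples taken 60 seconds apart.
prev = {"totalRequestCount": 1_000_000,
        "blockCacheHitCount": 9_000, "blockCacheMissCount": 1_000}
curr = {"totalRequestCount": 1_030_000,
        "blockCacheHitCount": 9_950, "blockCacheMissCount": 1_050}

print(counter_rate(prev["totalRequestCount"], curr["totalRequestCount"], 60))  # 500.0
print(block_cache_hit_ratio(prev, curr))  # 95.0
```

Computing the hit ratio from counter deltas reflects only the most recent interval, whereas a ratio of the raw cumulative counters would be dominated by history since the process started.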

2. RegionServer alert settings
Metric | Alert rule | Alert level | Notes |
---|---|---|---|
totalRequestCount | all(#3) > 50000 | P1 | Load too high |
compactionQueueLength | all(#3) > 100 | P1 | Compaction queue too long |
percentFilesLocal | all(#3) <= 90 | P1 | File locality at or below 90% |
blockCacheExpressHitPercent | all(#3) <= 90 | P1 | Block cache hit rate at or below 90% |
GcTimeMillisConcurrentMarkSweep | all(#3) > 200 | P1 | GC time too long |
storeFileCount | all(#3) > 1000 | P1 | Too many StoreFiles; consider running a compaction |
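
The alert rules above can be read as "fire when every one of the last N samples satisfies the comparison". The sketch below evaluates such a rule in Python under that assumption; the text does not define the `all(#N)` syntax, so this interpretation, the function name `rule_fires`, and the sample values are all hypothetical. It also assumes that a rule does not fire until at least N samples are available.

```python
import operator
import re
from typing import Sequence

# Comparison operators the sketch understands.
_OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}

# Matches rules such as "all(#3) > 50000" or "all(#3) <= 90".
_RULE = re.compile(
    r"^\s*all\(#(\d+)\)\s*(>=|<=|==|!=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$"
)


def rule_fires(rule: str, samples: Sequence[float]) -> bool:
    """Return True if the last N samples all satisfy the rule's comparison."""
    match = _RULE.match(rule)
    if match is None:
        raise ValueError(f"unsupported rule: {rule!r}")
    n = int(match.group(1))
    if n < 1:
        raise ValueError("the sample count must be at least 1")
    compare = _OPS[match.group(2)]
    threshold = float(match.group(3))
    if len(samples) < n:
        return False  # assumption: not enough history means no alert
    return all(compare(value, threshold) for value in samples[-n:])


# Hypothetical samples, oldest first.
print(rule_fires("all(#3) > 50000", [100, 60000, 70000, 51000]))  # True
print(rule_fires("all(#3) > 50000", [60000, 40000, 70000]))       # False
print(rule_fires("all(#3) <= 90", [95, 88, 90, 85]))              # True
```

Requiring several consecutive samples keeps a single spike from paging anyone, at the cost of a delay of N sampling intervals before a sustained problem is reported.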

3. Table (region)-level monitoring on the RegionServer
Metric | Type (GAUGE, COUNTER) | Unit | Description | Category |
---|---|---|---|---|
appendCount | COUNTER | | | Region-level counters for each operation type |
deleteCount | COUNTER | | | |
mutateCount | COUNTER | | | |
incrementCount | COUNTER | | | |
scanNext_num_ops | COUNTER | | | |
get_num_ops | COUNTER | | | |
numBytesCompactedCount | COUNTER | bytes | Total size of files whose compaction has completed | Compaction |
numFilesCompactedCount | COUNTER | | Number of files whose compaction has completed | |
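
Because these counters are reported per region, a table-level view needs them summed across the regions of each table. The Python sketch below shows one way to do that. It assumes the per-region values arrive as a flat mapping whose keys look like `Namespace_<namespace>_table_<table>_region_<encodedRegionName>_metric_<metric>`; that key layout is an assumption not stated in the text and should be checked against the actual metrics output. The function name `sum_by_table` and the sample data are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Mapping

# Assumed key layout (verify against the real metrics output):
#   Namespace_<ns>_table_<table>_region_<encodedRegionName>_metric_<metric>
_KEY = re.compile(
    r"^Namespace_(?P<ns>.+?)_table_(?P<table>.+)"
    r"_region_(?P<region>[0-9a-f]+)_metric_(?P<metric>.+)$"
)


def sum_by_table(bean: Mapping[str, object]) -> Dict[str, Dict[str, float]]:
    """Sum every numeric region-level metric across the regions of a table."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for key, value in bean.items():
        match = _KEY.match(key)
        if match is None:
            continue  # not a region-level metric key
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue  # skip non-numeric attributes
        table = f"{match.group('ns')}:{match.group('table')}"
        totals[table][match.group("metric")] += value
    return {table: dict(metrics) for table, metrics in totals.items()}


# Hypothetical sample with two regions of one table and one region of another.
sample = {
    "name": "example-bean",
    "Namespace_default_table_orders_region_1a2b3c_metric_appendCount": 5,
    "Namespace_default_table_orders_region_4d5e6f_metric_appendCount": 7,
    "Namespace_default_table_orders_region_1a2b3c_metric_get_num_ops": 10,
    "Namespace_default_table_user_events_region_aa11_metric_deleteCount": 3,
}

print(sum_by_table(sample))
```

Summing is appropriate here only because every metric in this table is a COUNTER; a per-region GAUGE would need a different aggregation chosen per metric.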