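1 Preface
As the data volume of our HBase cluster keeps growing, failures are becoming more frequent. To monitor how the cluster is being used, this article starts with Nagios; integration with Ganglia or JMX is planned for later to round out the monitoring.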
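2 Introduction to Nagios
Nagios monitors local or remote hosts and services. More importantly, it sends alert notifications, so operations staff learn about abnormal system conditions right away.
The Nagios architecture is shown in the diagram below:
Key points:
- Nagios itself does not perform any checks; the checks are carried out by Nagios Plugins.
- There are many ways to monitor remote hosts or services; NRPE is only one of them.
- NRPE uses a client/server (C/S) architecture: the monitoring host is the client (C) and the monitored host is the server (S).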
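Because the actual checks are done by plugins, it helps to see how small a plugin can be. By convention, a Nagios plugin prints one line of status text and reports its result through its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. The sketch below is a hypothetical helper (the name check_threshold and its arguments are illustrative and not part of Nagios Plugins) that applies this convention to an integer value:

```shell
#!/bin/bash
# Hypothetical helper illustrating the Nagios plugin convention.
# Usage: check_threshold LABEL VALUE WARN CRIT   (non-negative integers)
# Prints one status line and returns 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
check_threshold() {
    local label=$1 value=$2 warn=$3 crit=$4 n
    # Anything that is not a non-negative integer is reported as UNKNOWN.
    for n in "$value" "$warn" "$crit"; do
        case "$n" in
            ''|*[!0-9]*) echo "$label UNKNOWN - invalid input"; return 3 ;;
        esac
    done
    if [ "$value" -ge "$crit" ]; then
        echo "$label CRITICAL - value is $value|$label=$value;$warn;$crit"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "$label WARNING - value is $value|$label=$value;$warn;$crit"
        return 1
    fi
    echo "$label OK - value is $value|$label=$value;$warn;$crit"
    return 0
}
```

For example, check_threshold users 1 5 10 reports OK, while a value of 7 would report WARNING and 12 would report CRITICAL.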
3 Environment Preparation
4 Deployment Plan
Monitoring node:
- hadoop-ehp0 (192.168.137.100)
Monitored nodes:
- hadoop-ehp1 (192.168.137.101)
- hadoop-ehp2 (192.168.137.102)
- hadoop-ehp3 (192.168.137.103)
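5 Deploying the Monitoring Node
Disable the firewall and SELinux first.
5.1 Add the user and group
groupadd nagcmd
useradd -G nagcmd nagios
passwd nagios
5.2 Install dependency packages
rpm -q gcc glibc glibc-common gd gd-devel openssl-devel httpd php
Install any packages that are missing with yum install.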
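The check-then-install step can be scripted. The following is a minimal sketch, assuming an RPM-based system; the function name missing_packages is hypothetical. It relies on rpm -q exiting with a non-zero status when a package is not installed:

```shell
#!/bin/bash
# Hypothetical helper: print each package from the argument list
# that rpm reports as not installed, one per line.
missing_packages() {
    local pkg
    for pkg in "$@"; do
        # rpm -q exits non-zero when the package is not installed.
        rpm -q "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}

# Usage (as root), with the package list from this section:
#   missing=$(missing_packages gcc glibc glibc-common gd gd-devel openssl-devel httpd php)
#   [ -n "$missing" ] && yum install -y $missing
```

The same helper works on the monitored nodes with their shorter package list.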
5.3 Install Nagios
./configure --exec-prefix=/usr/local/nagios --with-command-group=nagcmd --enable-event-broker
make all
make install
make install-init
make install-commandmode
make install-config
make install-webconf # creates /etc/httpd/conf.d/nagios.conf
5.4 Create the Nagios web user and password
htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin
5.5 Start the Nagios service
service httpd start
service nagios start
Open http://192.168.137.100/nagios in a browser to view the web interface.
5.6 Install Nagios Plugins
./configure --prefix=/usr/local/nagios --with-nagios-user=nagios --with-nagios-group=nagios
make
make install
5.7 Install NRPE (client side)
./configure --prefix=/usr/local/nagios \
--with-nrpe-user=nagios \
--with-nrpe-group=nagios \
--with-nagios-user=nagios \
--with-nagios-group=nagios \
--enable-command-args \
--enable-ssl
make all
make install-plugin
6 Deploying the Monitored Nodes
6.1 Add the user
useradd -s /sbin/nologin nagios
6.2 Install dependency packages
rpm -q gcc glibc glibc-common gd gd-devel openssl-devel
Install any packages that are missing with yum install.
6.3 Install Nagios Plugins
./configure --prefix=/usr/local/nagios --with-nagios-user=nagios --with-nagios-group=nagios
make
make install
6.4 Install NRPE (server side)
./configure --prefix=/usr/local/nagios \
--with-nrpe-user=nagios \
--with-nrpe-group=nagios \
--with-nagios-user=nagios \
--with-nagios-group=nagios \
--enable-command-args \
--enable-ssl
make all
make install-plugin
make install-daemon
make install-daemon-config
6.5 Create the nrped service
Write /etc/init.d/nrped as follows:
#!/bin/bash
NRPE=/usr/local/nagios/bin/nrpe
NRPECONF=/usr/local/nagios/etc/nrpe.cfg
case "$1" in
start)
echo -n "Start NRPE daemon..."
$NRPE -c $NRPECONF -d
echo " done."
;;
stop)
echo -n "Stop NRPE daemon..."
pkill -u nagios nrpe
echo " done."
;;
restart)
$0 stop
sleep 2
$0 start
;;
*)
echo "Usage: $0 start|stop|restart."
;;
esac
exit 0
When done, make the script executable:
chmod u+x /etc/init.d/nrped
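6.6 Edit the NRPE configuration
Edit etc/nrpe.cfg (under /usr/local/nagios) as follows:
# Change to this host's own IP (192.168.137.101, .102 or .103)
server_address=192.168.137.10x
# Change to the monitoring host's IP
allowed_hosts=192.168.137.100
# Allow plugins to receive arguments from check_nrpe
dont_blame_nrpe=1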
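After editing, it is worth confirming which values are actually set. nrpe.cfg consists of name=value lines, and lines that start with # are comments. A small sketch for printing the value of one directive (the helper name nrpe_cfg_get is hypothetical, and it is meant for plain directive names, not for command[...] entries):

```shell
#!/bin/bash
# Hypothetical helper: print the value(s) set for a directive
# in an nrpe.cfg-style file.
# Usage: nrpe_cfg_get FILE NAME
nrpe_cfg_get() {
    # Keep only lines of the form NAME=value and strip the "NAME=" prefix.
    # Lines starting with '#' never match, so commented-out settings are ignored.
    sed -n "s/^$2=//p" "$1"
}

# Usage on a monitored node:
#   nrpe_cfg_get /usr/local/nagios/etc/nrpe.cfg allowed_hosts
```

If the directive appears more than once, every value is printed, which makes accidental duplicates easy to spot.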
6.7 Start the NRPE service
service nrped start
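Once the service has been started, a quick way to confirm that the daemon is reachable is to test its TCP port. NRPE listens on port 5666 by default (the server_port setting in nrpe.cfg). The sketch below uses bash's /dev/tcp redirection; the function name port_open is hypothetical, and it assumes that bash and the coreutils timeout command are available:

```shell
#!/bin/bash
# Hypothetical helper: succeed if HOST accepts TCP connections on PORT.
# Usage: port_open HOST PORT
port_open() {
    # bash treats /dev/tcp/HOST/PORT as a TCP connection attempt;
    # timeout gives up after 3 seconds if the host does not answer.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage:
#   port_open 192.168.137.101 5666 && echo "nrpe is listening"
```

A successful connection only shows that something is listening on the port; the check_nrpe test in the next section confirms that it is really NRPE and that allowed_hosts permits the monitoring host.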
7 Verifying Monitoring
Log in to the monitoring host and verify the NRPE service on each monitored node:
[root@hadoop-ehp0 nagios]# libexec/check_nrpe -H 192.168.137.101
NRPE v2.15
[root@hadoop-ehp0 nagios]# libexec/check_nrpe -H 192.168.137.102
NRPE v2.15
[root@hadoop-ehp0 nagios]# libexec/check_nrpe -H 192.168.137.103
NRPE v2.15
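With more than a few nodes, the per-host check is easier to run as a loop. This is a minimal sketch; the function name check_nrpe_hosts and the CHECK_NRPE variable are hypothetical, and the default path assumes the /usr/local/nagios prefix used in this article:

```shell
#!/bin/bash
# Path to check_nrpe on the monitoring host; override CHECK_NRPE if it lives elsewhere.
CHECK_NRPE=${CHECK_NRPE:-/usr/local/nagios/libexec/check_nrpe}

# Run check_nrpe against each host given as an argument.
# Prints one line per host and returns the number of hosts that failed.
check_nrpe_hosts() {
    local host out failed=0
    for host in "$@"; do
        if out=$("$CHECK_NRPE" -H "$host" 2>&1); then
            echo "$host: $out"
        else
            echo "$host: FAILED ($out)"
            failed=$((failed + 1))
        fi
    done
    return "$failed"
}

# Usage:
#   check_nrpe_hosts 192.168.137.101 192.168.137.102 192.168.137.103
```

A return status of 0 means every node answered; any other value is the count of nodes that did not.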
Look at nrpe.cfg on a monitored node to see which commands are defined by default:
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_hda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/hda1
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
Run some of these commands from the monitoring host to verify them:
[root@hadoop-ehp0 nagios]# libexec/check_nrpe -H 192.168.137.101 -c check_users
USERS OK - 1 users currently logged in |users=1;5;10;0
[root@hadoop-ehp0 nagios]# libexec/check_nrpe -H 192.168.137.102 -c check_load
OK - load average: 0.00, 0.00, 0.01|load1=0.000;15.000;30.000;0; load5=0.000;10.000;25.000;0; load15=0.010;5.000;20.000;0;
[root@hadoop-ehp0 nagios]# libexec/check_nrpe -H 192.168.137.103 -c check_total_procs
PROCS OK: 124 processes | procs=124;150;200;0;
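Each result line above has two parts separated by |: human-readable status text, followed by performance data in the form label=value;warn;crit;min. A minimal sketch for pulling one value out of such a line (the function name perfdata_value is hypothetical):

```shell
#!/bin/bash
# Hypothetical helper: print the value of one performance-data label.
# Usage: perfdata_value "PLUGIN OUTPUT" LABEL
# Returns 1 if the line has no performance data or the label is absent.
perfdata_value() {
    local perf item
    case "$1" in
        *"|"*) perf=${1#*|} ;;   # everything after the first '|'
        *) return 1 ;;           # no performance data present
    esac
    # Items are separated by spaces; each looks like label=value;warn;crit;min
    for item in $perf; do
        case "$item" in
            "$2="*)
                item=${item#*=}      # drop "label="
                echo "${item%%;*}"   # keep the part before the first ';'
                return 0
                ;;
        esac
    done
    return 1
}

# Usage:
#   out=$(libexec/check_nrpe -H 192.168.137.102 -c check_load)
#   perfdata_value "$out" load15
```

The value is printed exactly as the plugin reported it, including any unit suffix.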
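8 Summary
This article walked through installing and deploying Nagios in detail. For reasons of length, how to use Nagios to monitor a cluster and its services, how to send email notifications, and what the key configuration files do are left to the next article.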