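Before working through the configuration in this article, please read http://blog.youkuaiyun.com/u010257584/article/details/56278009 first. More Nagios material will follow in later posts.
1. Overview of Nagios configuration files
First, a word on "private" versus "public" services, applications, and protocols. "Private" services are host-internal attributes such as CPU load, memory usage, disk usage, logged-in users, and running processes. "Public" services are those reachable over the local network or the Internet, such as HTTP, POP3, IMAP, FTP, and SSH, and many more such basic services are in everyday use. Nagios can monitor these services and applications, along with the protocols they rely on, directly with its standard plugins and without an agent on the monitored host. "Private" services, by contrast, cannot be monitored unless some intermediary agent acts as a proxy. For monitoring "private" services on remote Linux/UNIX hosts, see the NRPE material covered in a later post.
This section briefly describes the monitoring configuration for some common services.
1. Nagios directory structure
The contents of each directory are as follows: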
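Directory | Contents |
bin | Nagios executable programs |
etc | Nagios configuration files |
sbin | Nagios CGI programs, including the files needed to submit external commands |
share | Nagios web pages |
libexec | Nagios external plugins |
var | Nagios log files, lock files, and similar runtime files |
var/archives | Automatically archived Nagios logs |
var/rw | External command file |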
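2. Relationship between the Nagios configuration files
Nagios uses four kinds of configuration files: the main configuration file, resource files, object definition files, and the CGI configuration file. The main configuration file contains many directives that affect how the Nagios Core daemon operates, and it is read by both the daemon and the CGIs. Resource files store user-defined macros; their main purpose is to hold sensitive configuration information (such as passwords) without making it available to the CGIs. Object definition files define hosts, services, host groups, contacts, contact groups, commands, and so on. The CGI configuration file contains directives that affect how the CGIs operate, and it also references the main configuration file so that the CGIs know how Nagios is configured and where the object definitions are stored.
[Figure: relationship between the Nagios configuration files]
3. Overview of Nagios configuration files
After a successful installation, Nagios generates configuration files for hosts, services, commands, templates, and so on under /usr/local/nagios/etc. The htpasswd.users file created earlier for web authentication is also found there. The objects directory holds configuration file templates that are mainly used to define Nagios objects.
[Figure: Nagios configuration directory and files]
[Figure: Nagios object template files]
The configuration files are summarized below: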
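Configuration file | Description |
cgi.cfg | Controls access to the CGIs |
nagios.cfg | Nagios main configuration file |
resource.cfg | Variable definition file, also called the resource file; variables defined here, such as $USER1$, can be referenced from other configuration files |
objects | A directory containing many configuration file templates used to define Nagios objects |
objects/commands.cfg | Command definitions, which other configuration files can reference |
objects/contacts.cfg | Definitions of contacts and contact groups |
objects/localhost.cfg | Definitions for monitoring the local host |
objects/printer.cfg | Template for monitoring printers; not enabled by default |
objects/switch.cfg | Template for monitoring routers and switches; not enabled by default |
objects/templates.cfg | Host and service templates that other configuration files can reference |
objects/timeperiods.cfg | Definitions of Nagios monitoring time periods |
objects/windows.cfg | Template for monitoring Windows hosts; not enabled by default |
Note: | Nagios configuration is very flexible, and the default configuration files are not mandatory. You can use them as they are, or create your own files and reference them from the main configuration file nagios.cfg. |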
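2. Overview of Nagios configuration templates
Configuring Nagios can be broken into five steps, covered in the next five subsections:
2.1. Define the contacts and contact groups to notify when a host or service has a problem
1. A contact identifies a person who should be notified when a problem occurs on the network.
Definition format (directives marked (*) are required; the others are optional):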
define contact{
contact_name contact_name(*)
alias alias(*)
contactgroups contactgroup_names
host_notifications_enabled [0/1](*)
service_notifications_enabled [0/1](*)
host_notification_period timeperiod_name(*)
service_notification_period timeperiod_name(*)
host_notification_options [d,u,r,f,s,n](*)
service_notification_options [w,u,c,r,f,s,n](*)
host_notification_commands command_name(*)
service_notification_commands command_name(*)
email email_address
pager pager_number or pager_email_gateway
addressx additional_contact_address
can_submit_commands [0/1]
retain_status_information [0/1]
retain_nonstatus_information [0/1]
...
}
Example definition:
define contact{
contact_name jdoe
alias John Doe
host_notifications_enabled 1
service_notifications_enabled 1
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,u,r
service_notification_commands notify-by-email
host_notification_commands host-notify-by-email
email jdoe@localhost.localdomain
pager 555-5555@pagergateway.localhost.localdomain
address1 xxxxx.xyyy@icq.com
address2 555-555-5555
can_submit_commands 1
}
A brief explanation of host_notification_options and service_notification_options:
1) host_notification_options:
- d = notify on DOWN host states
- u = notify on UNREACHABLE host states
- r = notify on host recoveries (UP states)
- f = notify when the host starts and stops flapping
- s = notify when host scheduled downtime starts and ends
- n = (none): the contact will not receive any type of host notifications
2) service_notification_options:
- w = notify on WARNING service states
- u = notify on UNKNOWN service states
- c = notify on CRITICAL service states
- r = notify on service recoveries (OK states)
- f = notify when the service starts and stops flapping
- s = notify when service scheduled downtime starts and ends
- n = (none): the contact will not receive any type of service notifications
Common settings:
- host_notification_options: d,u,r
- service_notification_options: w,u,c,r
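As a rough model of how these option letters act as a filter, here is a short Python sketch. It is for illustration only and is not how Nagios Core is implemented; the function name `contact_wants` and the event labels are made up for this example.

```python
# Illustrative model only: maps notification option letters to the events they enable.
HOST_EVENTS = {"d": "DOWN", "u": "UNREACHABLE", "r": "RECOVERY",
               "f": "FLAPPING", "s": "DOWNTIME"}
SERVICE_EVENTS = {"w": "WARNING", "u": "UNKNOWN", "c": "CRITICAL",
                  "r": "RECOVERY", "f": "FLAPPING", "s": "DOWNTIME"}


def contact_wants(options: str, event: str, table: dict) -> bool:
    """Return True if the comma-separated option string enables the given event."""
    letters = [o.strip() for o in options.split(",") if o.strip()]
    if "n" in letters:  # "n" (none) disables every notification of this type
        return False
    return any(table.get(letter) == event for letter in letters)


print(contact_wants("d,u,r", "DOWN", HOST_EVENTS))           # True
print(contact_wants("d,u,r", "FLAPPING", HOST_EVENTS))       # False
print(contact_wants("w,u,c,r", "CRITICAL", SERVICE_EVENTS))  # True
```

With the common settings, a contact hears about problems and recoveries but not about flapping or scheduled downtime.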
2. A contactgroup groups one or more contacts together for sending alert and recovery notifications. Definition format:
define contactgroup{
contactgroup_name contactgroup_name(*)
alias alias(*)
members contacts(*)
contactgroup_members contactgroups
...
}
Example definition:
define contactgroup{
contactgroup_name novell-admins
alias Novell Administrators
members jdoe,rtobert,tzach
}
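2.2. Define hosts, host groups, services, and service groups
This step involves the configuration files /usr/local/nagios/etc/objects/hosts.cfg and /usr/local/nagios/etc/objects/services.cfg. For basic host, hostgroup, service, and servicegroup configuration, see the definitions in /usr/local/nagios/etc/objects/localhost.cfg.
1. hosts.cfg configuration
1) hosts.cfg configures hosts and host groups. For the format, refer to the host and hostgroup definitions in localhost.cfg.
A host is a physical server, workstation, device, or similar entity that resides on the network. Full format (directives marked (*) are required; the others are optional):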
define host{
host_name host_name(*)
alias alias(*)
display_name display_name
address address(*)
parents host_names
hostgroups hostgroup_names
check_command command_name
initial_state [o,d,u]
max_check_attempts #(*)
check_interval #
retry_interval #
active_checks_enabled [0/1]
passive_checks_enabled [0/1]
check_period timeperiod_name(*)
obsess_over_host [0/1]
check_freshness [0/1]
freshness_threshold #
event_handler command_name
event_handler_enabled [0/1]
low_flap_threshold #
high_flap_threshold #
flap_detection_enabled [0/1]
flap_detection_options [o,d,u]
process_perf_data [0/1]
retain_status_information [0/1]
retain_nonstatus_information [0/1]
contacts contacts(*)
contact_groups contact_groups(*)
notification_interval #(*)
first_notification_delay #
notification_period timeperiod_name(*)
notification_options [d,u,r,f,s]
notifications_enabled [0/1]
stalking_options [o,d,u]
notes note_string
notes_url url
action_url url
icon_image image_file
icon_image_alt alt_string
vrml_image image_file
statusmap_image image_file
2d_coords x_coord,y_coord
3d_coords x_coord,y_coord,z_coord
...
}
Example definition:
define host{
host_name bogus-router
alias Bogus Router #1
address 192.168.1.254
parents server-backbone
check_command check-host-alive
check_interval 5
retry_interval 1
max_check_attempts 5
check_period 24x7
process_perf_data 0
retain_nonstatus_information 0
contact_groups router-admins
notification_interval 30
notification_period 24x7
notification_options d,u,r
}
2) A host group (hostgroup) is a group of one or more hosts. It simplifies configuration and can also be used for display purposes in the CGIs. Format:
define hostgroup{
hostgroup_name hostgroup_name(*)
alias alias(*)
members hosts
hostgroup_members hostgroups
notes note_string
notes_url url
action_url url
...
}
Example definition:
define hostgroup{
hostgroup_name novell-servers
alias Novell Servers
members netware1,netware2,netware3,netware4
}
2. services.cfg configuration
1) A service definition identifies an "application service" that runs on a host. Definition format:
define service{
host_name host_name(*)
hostgroup_name hostgroup_name
service_description service_description(*)
display_name display_name
servicegroups servicegroup_names
is_volatile [0/1]
check_command command_name(*)
initial_state [o,w,u,c]
max_check_attempts #(*)
check_interval #(*)
retry_interval #(*)
active_checks_enabled [0/1]
passive_checks_enabled [0/1]
check_period timeperiod_name(*)
obsess_over_service [0/1]
check_freshness [0/1]
freshness_threshold #
event_handler command_name
event_handler_enabled [0/1]
low_flap_threshold #
high_flap_threshold #
flap_detection_enabled [0/1]
flap_detection_options [o,w,c,u]
process_perf_data [0/1]
retain_status_information [0/1]
retain_nonstatus_information [0/1]
notification_interval #(*)
first_notification_delay #
notification_period timeperiod_name(*)
notification_options [w,u,c,r,f,s]
notifications_enabled [0/1]
contacts contacts(*)
contact_groups contact_groups(*)
stalking_options [o,w,u,c]
notes note_string
notes_url url
action_url url
icon_image image_file
icon_image_alt alt_string
...
}
Example definition:
define service{
host_name linux-server
service_description check-disk-sda1
check_command check-disk!/dev/sda1
max_check_attempts 5
check_interval 5
retry_interval 3
check_period 24x7
notification_interval 30
notification_period 24x7
notification_options w,c,r
contact_groups linux-admins
}
2) A servicegroup groups one or more services together to simplify service configuration. Definition format:
define servicegroup{
servicegroup_name servicegroup_name(*)
alias alias(*)
members services
servicegroup_members servicegroups
notes note_string
notes_url url
action_url url
...
}
Example definition:
define servicegroup{
servicegroup_name dbservices
alias Database Services
members ms1,SQL Server,ms1,SQL Server Agent,ms1,SQL DTC
}
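The members directive of a servicegroup is a flat list of alternating host names and service descriptions. A small Python sketch (the helper name `parse_servicegroup_members` is hypothetical) shows how the sample value breaks down into pairs:

```python
from typing import List, Tuple


def parse_servicegroup_members(value: str) -> List[Tuple[str, str]]:
    """Split 'host1,svc1,host2,svc2,...' into (host, service) pairs."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) % 2 != 0:
        raise ValueError("members must contain host,service pairs")
    # Even positions are host names, odd positions are service descriptions.
    return list(zip(parts[0::2], parts[1::2]))


pairs = parse_servicegroup_members("ms1,SQL Server,ms1,SQL Server Agent,ms1,SQL DTC")
for host, service in pairs:
    print(f"{host}: {service}")
```

Read this way, the sample group contains three services, all on the host ms1.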
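2.3. Define monitoring commands
A command definition covers service checks, service notifications, service event handlers, host checks, host notifications, and host event handlers. The configuration file is /usr/local/nagios/etc/objects/commands.cfg.
Definition format: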
define command{
command_name command_name(*)
command_line command_line(*)
...
}
Example definition:
define command{
command_name check_pop
command_line /usr/local/nagios/libexec/check_pop -H $HOSTADDRESS$
}
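To see how a check_command value such as check_ping!100.0,20%!500.0,60% meets a command_line, here is a simplified Python model of macro substitution. It is only a sketch: the check_ping command_line is assumed to match the stock sample commands.cfg, $USER1$ is assumed to point at the plugin directory, the host address is a placeholder, and real Nagios handles many more macros and escaping rules.

```python
import re


def expand_command(command_line: str, check_command: str, macros: dict) -> str:
    """Substitute $ARGn$ and other $NAME$ macros into a command_line template."""
    # Everything after the command name, separated by "!", becomes $ARG1$, $ARG2$, ...
    _name, *args = check_command.split("!")
    values = dict(macros)
    for index, arg in enumerate(args, start=1):
        values[f"ARG{index}"] = arg
    # Unknown macros are left untouched in this sketch.
    return re.sub(r"\$(\w+)\$", lambda m: values.get(m.group(1), m.group(0)), command_line)


# Assumed command_line and macro values, for illustration only.
template = "$USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5"
macros = {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "192.0.2.10"}
print(expand_command(template, "check_ping!100.0,20%!500.0,60%", macros))
```

The part before the first "!" selects the command definition, and each following field fills the next $ARGn$ macro in order.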
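2.4. Define monitoring time periods
A timeperiod is a list of times during which notifications and service checks are considered "valid". It is built from time ranges for each day of the week, which repeat weekly. The configuration file is /usr/local/nagios/etc/objects/timeperiods.cfg.
Definition format: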
define timeperiod{
timeperiod_name timeperiod_name(*)
alias alias(*)
[weekday] timeranges
[exception] timeranges
exclude [timeperiod1,timeperiod2,...,timeperiodn]
...
}
Example definitions:
define timeperiod{
timeperiod_name nonworkhours
alias Non-Work Hours
sunday 00:00-24:00 ; Every Sunday of every week
monday 00:00-09:00,17:00-24:00 ; Every Monday of every week
tuesday 00:00-09:00,17:00-24:00 ; Every Tuesday of every week
wednesday 00:00-09:00,17:00-24:00 ; Every Wednesday of every week
thursday 00:00-09:00,17:00-24:00 ; Every Thursday of every week
friday 00:00-09:00,17:00-24:00 ; Every Friday of every week
saturday 00:00-24:00 ; Every Saturday of every week
}
define timeperiod{
timeperiod_name misc-single-days
alias Misc Single Days
1999-01-28 00:00-24:00 ; January 28th, 1999
monday 3 00:00-24:00 ; 3rd Monday of every month
day 2 00:00-24:00 ; 2nd day of every month
february 10 00:00-24:00 ; February 10th of every year
february -1 00:00-24:00 ; Last day in February of every year
friday -2 00:00-24:00 ; 2nd to last Friday of every month
thursday -1 november 00:00-24:00 ; Last Thursday in November of every year
}
define timeperiod{
timeperiod_name misc-date-ranges
alias Misc Date Ranges
2007-01-01 - 2008-02-01 00:00-24:00 ; January 1st, 2007 to February 1st, 2008
monday 3 - thursday 4 00:00-24:00 ; 3rd Monday to 4th Thursday of every month
day 1 - 15 00:00-24:00 ; 1st to 15th day of every month
day 20 - -1 00:00-24:00 ; 20th to the last day of every month
july 10 - 15 00:00-24:00 ; July 10th to July 15th of every year
april 10 - may 15 00:00-24:00 ; April 10th to May 15th of every year
tuesday 1 april - friday 2 may 00:00-24:00 ; 1st Tuesday in April to 2nd Friday in May of every year
}
define timeperiod{
timeperiod_name misc-skip-ranges
alias Misc Skip Ranges
2007-01-01 - 2008-02-01 / 3 00:00-24:00 ; Every 3 days from January 1st, 2007 to February 1st, 2008
2008-04-01 / 7 00:00-24:00 ; Every 7 days from April 1st, 2008 (continuing forever)
monday 3 - thursday 4 / 2 00:00-24:00 ; Every other day from 3rd Monday to 4th Thursday of every month
day 1 - 15 / 5 00:00-24:00 ; Every 5 days from the 1st to the 15th day of every month
july 10 - 15 / 2 00:00-24:00 ; Every other day from July 10th to July 15th of every year
tuesday 1 april - friday 2 may / 6 00:00-24:00 ; Every 6 days from the 1st Tuesday in April to the 2nd Friday in May of every year
}
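The weekday lines in the nonworkhours sample pair a day with one or more HH:MM-HH:MM ranges. The following Python sketch checks a clock time against such a range list. The helper names are hypothetical, and treating each range as start-inclusive and end-exclusive is an assumption of this sketch rather than a statement about Nagios itself.

```python
def to_minutes(hhmm: str) -> int:
    """Convert 'HH:MM' to minutes since midnight ('24:00' -> 1440)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_timeranges(timeranges: str, clock: str) -> bool:
    """Return True if clock ('HH:MM') falls inside any 'HH:MM-HH:MM' range."""
    now = to_minutes(clock)
    for timerange in timeranges.split(","):
        start, end = timerange.strip().split("-")
        if to_minutes(start) <= now < to_minutes(end):
            return True
    return False


monday = "00:00-09:00,17:00-24:00"     # the "nonworkhours" value for Monday
print(in_timeranges(monday, "08:30"))  # True
print(in_timeranges(monday, "12:00"))  # False
print(in_timeranges(monday, "23:59"))  # True
```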
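2.5. Configure the main configuration file nagios.cfg
Add the files configured in the four steps above to /usr/local/nagios/etc/nagios.cfg using cfg_file and cfg_dir directives. The entries already present in that file serve as a reference, so the details are not repeated here. Once everything above is configured, restart Nagios and any related plugin services, and the result will be visible in the Nagios web interface.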
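As an illustration of what those entries look like, the Python sketch below scans the text of a main configuration file and collects its cfg_file and cfg_dir values. The sample text and the servers directory are assumptions made for the example, and the parsing is deliberately simplified.

```python
def object_sources(main_cfg_text: str) -> dict:
    """Collect cfg_file and cfg_dir values from nagios.cfg-style text."""
    sources = {"cfg_file": [], "cfg_dir": []}
    for raw in main_cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() in sources:
            sources[key.strip()].append(value.strip())
    return sources


sample = """
# object definitions added in the previous steps
cfg_file=/usr/local/nagios/etc/objects/hosts.cfg
cfg_file=/usr/local/nagios/etc/objects/services.cfg
# every *.cfg file under this directory is loaded (hypothetical directory)
cfg_dir=/usr/local/nagios/etc/servers
"""
found = object_sources(sample)
print(found["cfg_file"])
print(found["cfg_dir"])
```

A cfg_file line names a single object file, while cfg_dir pulls in every .cfg file under a directory, which is convenient once there are many hosts.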
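3. Nagios configuration for monitoring a Linux/UNIX host
This section uses the monitoring of common services on the local host as an example and covers only the simplest configuration. More comprehensive monitoring requires further study of the configuration files.
1. Configure contacts.cfg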
[root@monitors objects]# vi contacts.cfg
define contact{
contact_name nagiosadminnn
use generic-contact
alias Nagios Admin
host_notifications_enabled 1
service_notifications_enabled 1
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,u,r
service_notification_commands notify-service-by-email
host_notification_commands notify-host-by-email
email 88414341@qq.com
}
define contact{
contact_name nagiosadminkk
use generic-contact
alias Nagios Admin
host_notifications_enabled 1
service_notifications_enabled 1
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,u,r
service_notification_commands notify-service-by-email
host_notification_commands notify-host-by-email
email nnwan0110@163.com
}
# CONTACT GROUPS
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members nagiosadminnn,nagiosadminkk
}
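The e-mail notifications mentioned later rely on the notify-service-by-email and notify-host-by-email commands configured in the contact definitions. Their command format can be found in commands.cfg.
2. Configure hosts.cfg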
[root@monitors ~]# vi /usr/local/nagios/etc/objects/hosts.cfg
# Define a host for the remote machine
define host{
host_name monitors
alias monitor-server
use linux-server
address 172.16.56.131
check_period 24x7
check_interval 5
retry_interval 1
max_check_attempts 10
check_command check-host-alive
notification_period 24x7
notification_interval 30
notification_options d,r
contact_groups admins
}
# Define an optional hostgroup for Linux machines
define hostgroup{
hostgroup_name local-linux-servers ; The name of the hostgroup
alias Linux Servers ; Long name of the group
members * ; Comma separated list of hosts that belong to this group
}
3. Configure the services file (linuxserver.cfg in this example)
[root@monitors ~]# vi /usr/local/nagios/etc/objects/linuxserver.cfg
# Define a service to "ping" the local machine
define service{
use local-service ; Name of service template to use
host_name monitors
service_description PING
check_command check_ping!100.0,20%!500.0,60%
contact_groups admins
}
# Define a service to check the disk space of the root partition on the local machine. Warning if < 20% free, critical if < 10% free space on partition.
define service{
use local-service ; Name of service template to use
host_name monitors
service_description Root Partition
check_command check_local_disk!20%!10%!/
contact_groups admins
}
# Define a service to check the number of currently logged in users on the local machine. Warning if > 20 users, critical if > 50 users.
define service{
use local-service ; Name of service template to use
host_name monitors
service_description Current Users
check_command check_local_users!20!50
contact_groups admins
}
# Define a service to check the number of currently running procs on the local machine. Warning if > 250 processes, critical if > 400 processes.
define service{
use local-service ; Name of service template to use
host_name monitors
service_description Total Processes
check_command check_local_procs!250!400!RSZDT
contact_groups admins
}
# Define a service to check the load on the local machine.
define service{
use local-service ; Name of service template to use
host_name monitors
service_description Current Load
check_command check_local_load!5.0,4.0,3.0!10.0,6.0,4.0
contact_groups admins
}
# Define a service to check the swap usage the local machine. Critical if less than 10% of swap is free, warning if less than 20% is free
define service{
use local-service ; Name of service template to use
host_name monitors
service_description Swap Usage
check_command check_local_swap!20!10
contact_groups admins
}
# Define a service to check SSH on the local machine.
# Disable notifications for this service by default, as not all users may have SSH enabled.
define service{
use local-service ; Name of service template to use
host_name monitors
service_description SSH
check_command check_ssh
notifications_enabled 0
contact_groups admins
}
# Define a service to check HTTP on the local machine.
# Disable notifications for this service by default, as not all users may have HTTP enabled.
define service{
use local-service ; Name of service template to use
host_name monitors
service_description HTTP
check_command check_http
notifications_enabled 0
contact_groups admins
}
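The warning and critical thresholds in these services map onto the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal Python sketch of the Root Partition rule (warning below 20% free, critical below 10% free) follows. The function name is hypothetical and this is not the check_disk implementation.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def free_space_state(free_pct: float, warn_pct: float = 20.0, crit_pct: float = 10.0) -> int:
    """Map a free-space percentage to a plugin-style exit code."""
    if not 0.0 <= free_pct <= 100.0:
        return UNKNOWN  # a percentage outside 0-100 cannot be evaluated
    if free_pct < crit_pct:
        return CRITICAL
    if free_pct < warn_pct:
        return WARNING
    return OK


for pct in (35.0, 15.0, 5.0):
    code = free_space_state(pct)
    print(f"{pct:.0f}% free -> {STATE_NAMES[code]} (exit code {code})")
```

The critical test comes first because a value below the critical threshold is also below the warning threshold.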
4. Configure the main configuration file
[root@monitors ~]# vi /usr/local/nagios/etc/nagios.cfg
# Definitions for monitoring the remote (Linux/UNIX) host
cfg_file=/usr/local/nagios/etc/objects/hosts.cfg
# Definitions for monitoring the remote (Linux/UNIX) host services
cfg_file=/usr/local/nagios/etc/objects/linuxserver.cfg
5. Verify the configuration
[root@monitors ~]# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
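If the check reports errors for the notification commands referenced by a contact, the cause is that those two commands are not defined in commands.cfg. The fix is straightforward: commands.cfg already ships with default host and service notification commands (this article uses them unchanged), so there is no need to define the two commands again. Simply change the contact definitions in contacts.cfg to use the defaults:
service_notification_commands notify-service-by-email
host_notification_commands notify-host-by-email
6. View the result
If the previous step reports no errors, restart the nagios and httpd services: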
[root@monitors ~]# /etc/init.d/nagios restart
Running configuration check...
Stopping nagios:. done.
Starting nagios: done.
[root@monitors ~]# /etc/init.d/httpd restart
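Stopping httpd: [ OK ]
Starting httpd: httpd: Could not reliably determine the server's fully qualified domain name, using 127.0.0.1 for ServerName [ OK ]
Open http://172.16.56.131/nagios/ and log in to see the status of the host.
If warnings show up, start tracking down the problems right away.
Note: this article was written with reference to the official Nagios documentation and will be improved over time. Corrections and criticism are welcome.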