Nagios Plugin Development: Monitoring the Resources Used by a Program

License: CC 4.0 BY-SA

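This article presents a Nagios plugin that monitors the resources used by application processes. The plugin is a shell script that accepts CPU and memory thresholds and raises an alert when either is exceeded. It also checks whether each process is still running.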

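Usually it is enough to monitor whether a program's process is still running. This time, though, we hit a different failure: the process of an in-house application was still alive but deadlocked. The impact was widespread and, worse, we had no idea where the problem was. Colleagues from the testing department were the ones who spotted it, which was thoroughly embarrassing for the ops team.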

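To avoid a repeat, I analyzed the deadlock and found that the deadlocked process consumed 100% of the CPU, whereas it normally stays under 10%. I therefore decided to write a Nagios plugin that monitors the resources a program uses, including CPU and memory.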

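1. Shell script requirements:

Allow thresholds to be set for CPU and memory, and alert when usage exceeds a threshold.

Detect whether each monitored process exists, and alert if any one of them is missing.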
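The process-existence requirement can be sketched as a small helper. This is an illustration only, not the article's script: the function name `list_missing` is hypothetical, and it assumes bash plus a procps-style `ps` that supports `-eo comm=`. Unlike a substring `grep`, it matches command names exactly.

```shell
#!/bin/bash
# list_missing REQUIRED...: read running command names on stdin (one per line)
# and print the required names that are absent, separated by spaces.
# (Hypothetical helper, not part of the original plugin.)
list_missing() {
    local running missing="" name
    running=$(cat)
    for name in "$@"; do
        # -x: whole-line match, -F: fixed string, -q: no output
        if ! grep -qxF -- "$name" <<<"$running"; then
            missing="$missing $name"
        fi
    done
    echo "${missing# }"
}
```

A typical call would be `ps -eo comm= | list_missing VueDaemon VueCenter`; an empty result means every listed process is running.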

2. Sample runs of the shell script:

1. If the arguments are malformed, the script prints a help message

[root@center230 libexec]# sh component_resource.sh

Usage parament:

component_resource.sh [--cpu] [--mem]

Example:

component_resource.sh --cpu 50 --mem 50

2. If no threshold is exceeded, the script prints the resource usage and exits with status 0

[root@center230 libexec]# sh component_resource.sh --cpu 50 --mem 50

VueSERVER_cpu_use=5.6% VueCache_cpu_use=1.9% VueAgent_cpu_use=0.0% VueCenter_cpu_use=0.0% VueDaemon_cpu_use=0.0%;VueSERVER_mem_use=0.2% VueCache_mem_use=7.4% VueAgent_mem_use=0.5% VueCenter_mem_use=0.1% VueDaemon_mem_use=0.0%

[root@center230 libexec]# echo $?

3. If a threshold is exceeded, the script prints the resource usage and exits with status 2

[root@center230 libexec]# sh component_resource.sh --cpu 5 --mem 5

VueSERVER_cpu_use=9.4% VueCache_cpu_use=0.0% VueAgent_cpu_use=0.0% VueCenter_cpu_use=0.0% VueDaemon_cpu_use=0.0%;VueSERVER_mem_use=0.2% VueCache_mem_use=7.4% VueAgent_mem_use=0.5% VueCenter_mem_use=0.1% VueDaemon_mem_use=0.0%

[root@center230 libexec]# echo $?

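4. If a process does not exist, the script prints the processes that are down, along with the resource usage of those still running, and exits with status 2

[root@yckj scripts]# sh component_resource.sh --cpu 50 --mem 50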

Current VueDaemon VueCenter VueAgent VueCache VueSERVER is down.

[root@yckj scripts]# echo $?
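The exit values in these cases follow the Nagios plugin convention, where 0 means OK and 2 means CRITICAL. The threshold decision behind them can be sketched as follows. The function names `exceeds` and `check_usage` are hypothetical, and `awk` is used for the floating-point comparison here instead of the `bc` call in the article's script.

```shell
#!/bin/bash
# Nagios plugin states used by this sketch (standard convention).
STATE_OK=0
STATE_CRITICAL=2

# exceeds VALUE THRESHOLD: succeed when VALUE is strictly greater than THRESHOLD.
# Adding 0 forces awk to compare the operands as numbers, not strings.
exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !((v + 0) > (t + 0)) }'
}

# check_usage CPU MEM CPU_CRIT MEM_CRIT: print the resulting Nagios state code.
check_usage() {
    if exceeds "$1" "$3" || exceeds "$2" "$4"; then
        echo "$STATE_CRITICAL"
    else
        echo "$STATE_OK"
    fi
}
```

A plugin built on this would end with something like `exit "$(check_usage "$cpu" "$mem" "$cpu_crit" "$mem_crit")"`, so that Nagios sees the state in the exit status.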

3. The shell script code:

[root@center230 libexec]# cat component_resource.sh
#!/bin/bash
# author: yangrong
# date: 2014-05-20
# mail: 10286460@qq.com
# Arrays are a bash feature, so the script needs bash rather than a plain POSIX sh.
#pragrom_list=(VueDaemon VueCenter VueAgent VueCache VueSERVER VUEConnector Myswitch Slirpvde)
pragrom_list=(VueDaemon VueCenter VueAgent VueCache VueSERVER)
#### Read the CPU and memory thresholds ####
case "$1" in
--cpu)
    cpu_crit=$2
    ;;
--mem)
    mem_crit=$2
    ;;
esac
case "$3" in
--cpu)
    cpu_crit=$4
    ;;
--mem)
    mem_crit=$4
    ;;
esac
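### Validate the arguments: var=1 means invalid, var=0 means valid ###
if [[ $1 == "$3" ]]; then
    # The same option was given twice (or no arguments at all)
    var=1
elif [ $# -ne 4 ]; then
    var=1
elif [[ -z $cpu_crit || -z $mem_crit ]]; then
    # One of the two options was neither --cpu nor --mem
    var=1
else
    var=0
fi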
### Print the usage message
if [ $var -eq 1 ]; then
    echo "Usage parament:"
    echo "    $0 [--cpu] [--mem]"
    echo ""
    echo "Example:"
    echo "    $0 --cpu 50 --mem 50"
    # 3 is the Nagios UNKNOWN state; a bare exit would report 0 (OK)
    exit 3
fi
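### Collect the processes that are not running into a variable
num=$(( ${#pragrom_list[@]} - 1 ))
NotExist=""
for digit in $(seq 0 $num)
do
    # Count the ps lines that mention the process name
    a=$(ps -ef | grep -v grep | grep -c "${pragrom_list[$digit]}")
    if [ "$a" -eq 0 ]; then
        NotExist="$NotExist ${pragrom_list[$digit]}"
        # Drop the missing process so the usage loop skips it
        unset "pragrom_list[$digit]"
    fi
done
#echo "pragrom_list=${pragrom_list[@]}"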
#### Compare each process's resource usage with the thresholds
cpu_use_all=""
mem_use_all=""
compare_cpu_temp=0
compare_mem_temp=0
for n in "${pragrom_list[@]}"
do
    # In top's batch output, field 9 is %CPU and field 10 is %MEM.
    # Only the first matching row is used, so bc always receives a single number.
    cpu_use=$(top -b -n1 | grep "$n" | awk '{print $9; exit}')
    mem_use=$(top -b -n1 | grep "$n" | awk '{print $10; exit}')
    if [[ $cpu_use == "" ]]; then
        cpu_use=0
    fi
    if [[ $mem_use == "" ]]; then
        mem_use=0
    fi
    # bc prints 1 when the comparison is true, 0 otherwise
    compare_cpu=$(echo "$cpu_use > $cpu_crit" | bc)
    compare_mem=$(echo "$mem_use > $mem_crit" | bc)
    if [[ $compare_cpu == 1 ]]; then
        compare_cpu_temp=1
    fi
    if [[ $compare_mem == 1 ]]; then
        compare_mem_temp=1
    fi
    cpu_use_all="${n}_cpu_use=${cpu_use}% ${cpu_use_all}"
    mem_use_all="${n}_mem_use=${mem_use}% ${mem_use_all}"
done
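### A non-empty NotExist means at least one process is down: exit with status 2
if [[ "$NotExist" != "" ]]; then
    echo -e "Current ${NotExist} is down.$cpu_use_all;$mem_use_all"
    exit 2
### compare_cpu_temp equal to 1 means a process exceeded the CPU threshold: exit with status 2
elif [[ "$compare_cpu_temp" == 1 ]]; then
    echo -e "$cpu_use_all;$mem_use_all"
    exit 2
## compare_mem_temp equal to 1 means a process exceeded the memory threshold: exit with status 2
elif [[ $compare_mem_temp == 1 ]]; then
    echo -e "$cpu_use_all;$mem_use_all"
    exit 2
## Otherwise everything is normal: print the CPU and memory percentages and exit with status 0
else
    echo -e "$cpu_use_all;$mem_use_all"
    exit 0
fi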
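
The script above runs top twice for every process. A possible refinement, sketched here under assumptions, is to take one snapshot and extract both figures from it. The function name `usage_of` is hypothetical, and the field positions (%CPU in field 9, %MEM in field 10, COMMAND in field 12) assume the default procps top layout. The sketch matches the COMMAND column exactly and sums all instances of a process, which differs from the substring grep used in the script.

```shell
#!/bin/bash
# usage_of NAME: read `top -b -n1` output on stdin and print "<cpu> <mem>",
# summed over every row whose COMMAND column equals NAME exactly.
# Prints "0.0 0.0" when no row matches.
# (Hypothetical helper; assumes the default procps top column order.)
usage_of() {
    awk -v name="$1" '
        $12 == name { cpu += $9; mem += $10 }
        END { printf "%.1f %.1f\n", cpu, mem }
    '
}
```

One way to use it is to store the snapshot once, for example `snapshot=$(top -b -n1)`, and then call `usage_of VueSERVER <<<"$snapshot"` for each monitored name.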