47. Process Control Scripts Explained

linux6sysadmin

Published 2025-12-13 12:54:57

License: CC 4.0 BY-SA

Column: The Art of Shell Script Programming
Tags: Friar Tuck, HA Monitor, process monitoring

Article link: https://blog.youkuaiyun.com/linux6sysadmin/article/details/155960828

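1. System Overview

Two scripts do the work in this process monitoring and control setup: Friar Tuck and HA Monitor. Friar Tuck is the simpler of the two and shows the basic mechanics of monitoring; HA Monitor is more complex and contains the more detailed process monitoring and control code.

Friar Tuck is also the communication hub of the system. It is started by running friartuck.sh, and it is sent signals to stop the service or to force a configuration reload. Without this mechanism it would be hard to stop the framework completely: both processes would have to be killed within a short interval, otherwise the survivor restarts the one that was killed.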

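2. Potential Problems

Simultaneous termination: both processes could be killed at the same moment, and at this level of clustering nothing can be done about it. More advanced cluster systems hook directly into the operating system kernel to handle the critical aspects of clustering; this system only tries to keep processes alive.
Restart balance: one configuration question is how to balance giving up on a failing process against restarting it. If a process keeps failing because of a configuration problem, a memory leak, or a code fault, a service that is occasionally available to outside users but keeps going offline may do more harm than good. In that case it may be better to let the service fail and fix the underlying problem by hand.
Failure diagnosis: storing the timestamps of recent failures in an array would allow a more accurate diagnosis. If there have been two failures in the past 3 minutes but the one before that was 3 months ago, restarting the service is reasonable; if the service fails continually, it is better to disable it altogether. hamonitor.sh as listed keeps only the most recent failure time per process, but updating an array of timestamps on every failure would not be hard to implement.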
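
The timestamp-array idea could be sketched roughly as follows. This is an illustration only and is not part of hamonitor.sh: the function names record_failure and should_disable, the MAXFAILS and HISTORY settings, and the space-separated history string are all assumptions made for the example. Timestamps are passed in explicitly so that the run is repeatable.

```bash
#!/bin/bash
# Hypothetical sketch: keep a short history of failure times per service and
# disable the service only when failures cluster inside the time window.
declare -A failhistory      # service name -> space-separated epoch seconds
FAILWINDOW=180              # window in seconds (3 minutes, as in the text)
MAXFAILS=3                  # assumed threshold of recent failures
HISTORY=5                   # assumed cap on stored timestamps per service

# record_failure NAME [TIMESTAMP]: append a timestamp, keep the newest $HISTORY
function record_failure
{
  local name=$1 now=${2:-$(date +%s)}
  local -a stamps=(${failhistory[$name]} "$now")
  if [ "${#stamps[@]}" -gt "$HISTORY" ]; then
    stamps=("${stamps[@]: -$HISTORY}")
  fi
  failhistory[$name]="${stamps[*]}"
}

# should_disable NAME [NOW]: succeed if $MAXFAILS or more failures are recent
function should_disable
{
  local name=$1 now=${2:-$(date +%s)} recent=0 stamp
  for stamp in ${failhistory[$name]}
  do
    if [ $((now - stamp)) -lt "$FAILWINDOW" ]; then
      recent=$((recent + 1))
    fi
  done
  [ "$recent" -ge "$MAXFAILS" ]
}

# One old failure, then two failures within the last three minutes.
record_failure web 1000
record_failure web 8000000
record_failure web 8000060
if should_disable web 8000090; then
  echo "disable web"
else
  echo "restart web"
fi
```

With only two recent failures the sketch chooses to restart; a third failure inside the window would tip it over to disabling the service.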

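3. System Structure

The basic structure of the system is set by HA Monitor, and the Friar Tuck process embeds a simplified version of it. The data structures are the other key element and are described next.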

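4. Data Structures

HA Monitor script: hamonitor.sh makes extensive use of arrays to track each aspect of the processes it monitors; the number of processes is not limited. Because the shell has no multidimensional arrays, each attribute of a process lives in its own array, and the same index is used in all of them. The stopcmd, min, and max arrays are never used by the script; they are listed for completeness and as a hint that simple modifications could add further features. The pid array tracks the PID(s) in use by each process, and is set to a negative value when a delay is needed before the PID is collected.
Tag variable: both scripts set a tag variable, which logger uses to identify the running process. logger -t "$tag" identifies the script precisely. The quotes around "$tag" matter: the tag contains a space, so without them the shell passes it as two words and only the first becomes the tag. With quotes the log line reads friartuck (8357): friartuck.sh FINISHING; without them it reads friartuck: (8357) ./friartuck.sh FINISHING. This is useful when debugging and diagnosing the scripts.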
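
The effect of the quotes can be seen without touching syslog. In the sketch below, show_args is a hypothetical stand-in for logger that simply reports which argument it would treat as the tag and which words would become the message; the PID 8357 is taken from the example above.

```bash
#!/bin/bash
# show_args stands in for logger here: it reports how its arguments arrive.
function show_args
{
  # $1 is the option (-t), $2 is what logger would treat as the tag,
  # everything from $3 onwards would be the message.
  echo "tag=[$2] message=[${*:3}]"
}

tag="friartuck (8357)"
quoted=$(show_args -t "$tag" "FINISHING")
unquoted=$(show_args -t $tag "FINISHING")
echo "quoted:   $quoted"
echo "unquoted: $unquoted"
```

Unquoted, the shell splits the tag at the space, so the PID ends up at the front of the message instead of in the tag.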

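5. The Friar Tuck Script

Friar Tuck controls everything, including starting the monitoring system and signalling HA Monitor. It traps signals 3 and 4 and creates files in /tmp to communicate with HA Monitor. On SIGQUIT (3) it coordinates the shutdown of both processes; on SIGILL (4) it tells HA Monitor to re-read its configuration files.

If pgrep finds no hamonitor.sh running as root, Friar Tuck starts the script; if it finds one whose PID differs from the expected PID, it logs the fact and updates the PID. friartuck.sh tries to restart hamonitor.sh no matter how many times it fails. This is cruder than what hamonitor.sh does, but it is exactly the simple, blunt approach friartuck.sh needs to make sure hamonitor.sh is always running.

The code of friartuck.sh follows: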

#!/bin/bash
# Ask hamonitor.sh to stop, wait for it to acknowledge, then exit.
function bailout
{
  logger -t "$tag" "$0 FINISHING. Signalling $pid to do the same."
  touch /tmp/hastop.$pid
  # hamonitor.sh removes the file once it has seen the request
  while [ -f /tmp/hastop.$pid ]
  do
    sleep 5
  done
  logger -t "$tag" "$0 FINISHED."
  exit 0
}
# Ask hamonitor.sh to re-read its configuration.
function reread
{
  logger -t "$tag" "$0 signalling $pid to reread config."
  touch /tmp/haread.$pid
}
trap bailout 3    # SIGQUIT
trap reread 4     # SIGILL
tag="friartuck ($$)"
debug=9
DELAY=10
pid=0
cd "$(dirname "$0")"
logger -t "$tag" "Starting HA Monitor Monitoring"
while :
do
  sleep $DELAY
  [ "$debug" -gt "2" ] && logger -t "$tag" "Checking hamonitor.sh"
  NewPID=$(pgrep -u root hamonitor.sh)
  if [ -z "$NewPID" ]; then
    # No process found; child is dead.
    logger -t "$tag" "No HA process found!"
    logger -t "$tag" "Starting \"$(pwd)/hamonitor.sh\""
    nohup "$(pwd)/hamonitor.sh" >/dev/null 2>&1 &
    pid=0
  elif [ "$NewPID" != "$pid" ]; then
    logger -t "$tag" "HA Process rediscovered as $NewPID (was $pid)"
    pid=$NewPID
  else
    # All is well.
    [ "$debug" -gt "3" ] && logger -t "$tag" "hamonitor.sh is running"
  fi
done
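
The stop-file handshake that bailout relies on can be tried in isolation. The sketch below is a stand-alone demonstration, not a replacement for either script: a background subshell plays the part of hamonitor.sh, the file lives in a private temporary directory instead of /tmp/hastop.$pid, and the polling interval is shortened to one second. The names workdir, stopfile, worker, and result are made up for the demo, and $BASHPID (bash 4 or later) is used so the shell signals itself.

```bash
#!/bin/bash
# Hypothetical demo of the stop-file handshake between the two scripts.
workdir=$(mktemp -d)
stopfile=$workdir/hastop.demo

# Stand-in for hamonitor.sh: poll for the stop file, remove it, then exit.
(
  while [ ! -f "$stopfile" ]; do sleep 1; done
  rm -f "$stopfile"
) &
worker=$!

# Stand-in for bailout: request the stop, then wait for the acknowledgement.
function bailout
{
  touch "$stopfile"
  while [ -f "$stopfile" ]; do sleep 1; done
  wait "$worker"
  result="worker stopped"
}
trap bailout 3                 # SIGQUIT, as in friartuck.sh

kill -3 "$BASHPID"             # what the stop script does from outside
echo "${result:-still running}"
rmdir "$workdir"
```

Removing the file is the acknowledgement: the requester knows the other side has seen the request only once the file has disappeared.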

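6. The HA Monitor Script

hamonitor.sh is about 200 lines long, which makes it a relatively complex script. That is close to the reasonable limit for a well-structured shell script; anything longer would probably be better split into separate functions and libraries.

The script contains a while loop that starts a little past the middle and runs to the end of the script, nearly 100 lines in all. Most of its body is a for loop, which is a more manageable length.

The script uses associative arrays, which are available from bash 4. Where they are not available, the configuration files must be named 1.conf, 2.conf and so on, and the declare -A statements must be changed to declare -a, because -A declares an associative array and does not exist before bash 4.

The code of hamonitor.sh follows: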
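
The difference between the two kinds of array explains the file-naming rule. The sketch below (the keys web and nap and the array names are invented for illustration) shows parallel associative arrays keyed by name, and then what happens when the same names are used with an indexed array: the subscript is evaluated as arithmetic, an unset name counts as 0, and every entry lands in slot 0.

```bash
#!/bin/bash
# Bash 4+: associative arrays let the configuration file name act as the key.
declare -A process user
process[web]=apache2;  user[web]=root
process[nap]=sleep;    user[nap]=steve
echo "web: ${process[web]} runs as ${user[web]}"
echo "nap: ${process[nap]} runs as ${user[nap]}"

# Indexed arrays (declare -a) treat the subscript as an arithmetic expression.
# The unset names web and nap both evaluate to 0, so the entries collide.
declare -a bad
bad[web]=apache2
bad[nap]=sleep
echo "indexed array holds ${#bad[@]} element(s): ${bad[0]}"

# Hence numeric file names (1.conf, 2.conf, ...) when only declare -a exists.
declare -a oldprocess
oldprocess[1]=apache2
oldprocess[2]=sleep
echo "indexed by number: ${oldprocess[1]} ${oldprocess[2]}"
```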

#!/bin/bash
function readconfig
{
  # Read Configuration
  logger -t "$tag" "Reading Configuration"
  for proc in ${CONFDIR}/*.conf
  do
    # This filename can be web.conf if Bash4, otherwise 1.conf, 2.conf etc
    unset ENABLED START STOP PROCESS MIN MAX STARTDELAY USER STOPPABLE
    index=$(basename $proc .conf)
    echo "Reading $index configuration"
    . $proc
    startcmd[$index]=$START
    stopcmd[$index]=$STOP
    process[$index]=$PROCESS
    min[$index]=$MIN
    max[$index]=$MAX
    startdelay[$index]=$STARTDELAY
    user[$index]=$USER
    enabled[$index]=$ENABLED
    idx[$index]=$index
    lastfailure[$index]=0
    stoppable[$index]=${STOPPABLE:-1}
    PID=$(pgrep -d ' ' -u ${user[$index]} $PROCESS)
    if [ -n "$PID" ]; then
      # Already running
      logger -t "$tag" "${PROCESS} is already running;"\
         " will monitor ${USER}'s PID(s) $PID"
      pid[$index]=$PID
    else
      pid[$index]=-1
      if [ "$ENABLED" ]; then
        startproc $ENABLED $USER $START
      fi
    fi
  done
  logger -t "$tag" "Monitoring ${idx[*]}"
  # Set defaults
  DELAY=10
  FAILWINDOW=180
  debug=9
  . ${CONFDIR}/ha.cfg
}
# If Bash prior to version 4, use declare -a to declare an array
declare -A process
declare -A startcmd
declare -A stopcmd
declare -A min
declare -A max
declare -A pid
declare -A user
declare -A startdelay
declare -A enabled
declare -A lastfailure
declare -A stoppable
# Need to keep an array of indices for Bash prior to v4 (no associative arrays)
declare -A idx
function failurecount
{
  index=$1
  interval=$(( $(date +%s) - ${lastfailure[$index]} ))
  lastfailure[$index]=$(date +%s)
  if [ "$interval" -lt "$FAILWINDOW" ]; then
    if [ ${stoppable[$index]} -eq 1 ]; then
      logger -t "$tag" "${process[$index]} has failed twice within $interval"\
         " seconds. Disabling."
      enabled[$index]=0
    else
      logger -t "$tag" "${process[$index]} has failed twice within $interval"\
         " seconds but can not be disabled."
    fi
  fi
}
function startproc
{
  # $1 = enabled flag, $2 = user to run as, remaining words = command
  if [ "$1" -ne "1" ]; then
    shift 2
    logger -t "$tag" "Not starting \"$*\" as it is disabled."
    return
  fi
  # A local name avoids overwriting the global user array
  local runas=$2
  shift 2
  logger -t "$tag" "Starting \"$*\" as \"$runas\""
  nohup sudo -u $runas $@ >/dev/null 2>&1 &
}
CONFDIR=/etc/ha
tag="hamonitor ($$)"
STOPFILE=/tmp/hastop.$$
READFILE=/tmp/haread.$$
cd "$(dirname "$0")"
logger -t "$tag" "Starting HA Monitoring"
readconfig
while :
do
  if [ -f $STOPFILE ]; then
    # Only honour a stop request created by root (uid 0)
    case $(stat -c %u $STOPFILE) in
      0)
        logger -t "$tag" "$0 FINISHING"
        rm -f $STOPFILE
        exit 0
        ;;
      *)
        logger -t "$tag" "$0 ignoring non-root $STOPFILE"
        ;;
    esac
  fi
  if [ -f $READFILE ]; then
    case $(stat -c %u $READFILE) in
      0)
        readconfig
        rm -f $READFILE
        ;;
      *)
        logger -t "$tag" "$0 ignoring non-root $READFILE"
        ;;
    esac
  fi
  sleep $DELAY
  for index in ${idx[@]}
  do
    if [ ${enabled[$index]} -eq 0 ]; then
      [ "$debug" -gt "3" ] && logger -t "$tag" "Skipping ${process[$index]}"\
           " as it is disabled."
      continue
    fi
    # Check daemon running; start it if not.
    # pid may hold several PIDs, so match it as a string rather than a number.
    if [[ ${pid[$index]} == -* && ${pid[$index]} != -1 ]]; then
      # still waiting for it to start up; skip.
      logger -t "$tag" "Not checking ${process[$index]} yet."
      pid[$index]=$((${pid[$index]} + 1))
      continue
    elif [[ ${pid[$index]} == -1 ]]; then
      pid[$index]=$(pgrep -d ' ' -u ${user[$index]} ${process[$index]})
      if [ -z "${pid[$index]}" ]; then
        logger -t "$tag" "${process[$index]} didn't start in the allowed timespan."
        failurecount $index
      fi
      logger -t "$tag" "PID of ${process[$index]} is ${pid[$index]}."
      continue
    fi
    [ "$debug" -gt "2" ] && logger -t "$tag" "Checking ${process[$index]}"
    NewPID=$(pgrep -d ' ' -u ${user[$index]} ${process[$index]})
    if [ -z "$NewPID" ]; then
      # No process found; child is dead.
      logger -t "$tag" "No process for ${process[$index]} found!"
      failurecount $index
      startproc ${enabled[$index]} ${user[$index]} ${startcmd[$index]}
      if [ ${startdelay[$index]} -eq 0 ]; then
        pid[$index]=$(pgrep -d ' ' -u ${user[$index]} ${process[$index]})
      else
        pid[$index]=$((0 - ${startdelay[$index]}))
      fi
      [ "$debug" -gt "4" ] && logger -t "$tag" "Start Delay for "\
          "${process[$index]} is ${startdelay[$index]}."
    elif [ "$NewPID" != "${pid[$index]}" ]; then
      # The PID has changed. Is it just new processes?
      failed=0
      for thispid in ${pid[$index]}
      do
        if ! echo "$NewPID" | grep -qw "$thispid"; then
          # one of our PIDs is missing
          ((failed++))
        fi
      done
      if [ "$failed" -gt "0" ]; then
        failurecount $index
        logger -t "$tag" "PID changed for ${process[$index]}; was \""\
            "${pid[$index]}\" now \"$NewPID\""
        if [ ${startdelay[$index]} -eq 0 ]; then
          pid[$index]=$NewPID
        else
          pid[$index]=$((0 - ${startdelay[$index]}))
        fi
      fi
    else
      # All is well.
      [ "$debug" -gt "3" ] && logger -t "$tag" "${process[$index]} is running"
    fi
  done
done
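
The check for a changed PID list is the least obvious part of the loop: new PIDs appearing is harmless, but an old PID disappearing counts as a failure. It can be isolated as a small function. The sketch below uses a hypothetical helper named count_missing and swaps the grep -w test for a case pattern match on space-padded strings; it is a variation on the script's logic, not the code the script uses.

```bash
#!/bin/bash
# count_missing OLD NEW: print how many PIDs listed in OLD are absent from NEW.
function count_missing
{
  local old=$1 new=$2 missing=0 thispid
  for thispid in $old
  do
    # Pad both sides with spaces so that 12 does not match inside 123
    case " $new " in
      *" $thispid "*) ;;
      *) missing=$((missing + 1)) ;;
    esac
  done
  echo "$missing"
}

count_missing "100 101" "100 101 102"   # an extra worker appeared
count_missing "100 101" "101 205"       # PID 100 has gone
```

The first call prints 0, so nothing would be logged; the second prints 1, which in hamonitor.sh would lead to a call to failurecount.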

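7. The Stop Script

stoph.sh stops the whole framework. It sends signal 3 (SIGQUIT) to the friartuck.sh process; Friar Tuck traps the signal and coordinates the shutdown of both processes.

The code of stoph.sh follows:

#!/bin/bash
# Use the PID given as $1, otherwise look up friartuck.sh running as root.
pid=${1:-$(pgrep -u root friartuck.sh)}
# Signal 3 (SIGQUIT) asks Friar Tuck to shut the framework down.
kill -3 $pid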

8. Example Configuration Files

Here are some example configuration files:

# apache.conf
START="/usr/sbin/apachectl start"
STOP="/usr/sbin/apachectl stop"
PROCESS=apache2
MIN=1
MAX=10
STARTDELAY=2
ENABLED=1
USER=root

# friartuck.conf
START="nohup ./friartuck.sh >/dev/null 2>&1"
STOP=/bin/false
PROCESS="friartuck.sh"
MIN=1
MAX=1
STARTDELAY=0
ENABLED=1
USER=root
STOPPABLE=0

# sleep.conf
START="sleep 600"
STOP=
PROCESS=sleep
MIN=1
MAX=10
STARTDELAY=0
ENABLED=1
USER=steve
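
Because hamonitor.sh sources these files directly, a mistyped value only shows up later as a shell error inside the monitoring loop. A pre-flight check along the following lines could catch the obvious cases first. It is a sketch under assumptions: validate_conf is a hypothetical helper, and the rules it applies (START, PROCESS, and USER non-empty, ENABLED 0 or 1, STARTDELAY a whole number) are inferred from how the script uses the values rather than from any stated specification.

```bash
#!/bin/bash
# validate_conf FILE: hypothetical pre-flight check for one *.conf file.
# The file is sourced in a subshell so that nothing leaks into the caller.
function validate_conf
{
  (
    unset ENABLED START PROCESS STARTDELAY USER
    . "$1" || exit 1
    for key in START PROCESS USER
    do
      # ${!key} is indirect expansion: the value of the variable named by $key
      [ -n "${!key}" ] || { echo "$1: $key is not set"; exit 1; }
    done
    case "$ENABLED" in
      0|1) ;;
      *) echo "$1: ENABLED must be 0 or 1"; exit 1 ;;
    esac
    case "$STARTDELAY" in
      ''|*[!0-9]*) echo "$1: STARTDELAY must be a whole number"; exit 1 ;;
    esac
    echo "$1: ok"
  )
}

# Try it on two throwaway files: a copy of sleep.conf and a broken variant.
confdir=$(mktemp -d)
printf '%s\n' 'START="sleep 600"' 'PROCESS=sleep' 'STARTDELAY=0' \
  'ENABLED=1' 'USER=steve' > "$confdir/sleep.conf"
printf '%s\n' 'START="sleep 600"' 'PROCESS=sleep' 'STARTDELAY=soon' \
  'ENABLED=1' 'USER=steve' > "$confdir/broken.conf"
validate_conf "$confdir/sleep.conf";  good=$?
validate_conf "$confdir/broken.conf"; bad=$?
rm -r "$confdir"
```

The first file passes and the second is rejected because of its STARTDELAY value.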

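9. Invoking the System and a Sample Run

Starting the framework only requires running friartuck.sh. Events are logged to /var/log/messages (the exact file depends on the syslog configuration). Here is a sample run: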

Apr 20 11:03:36 goldie friartuck (10521): Starting HA Monitor Monitoring
Apr 20 11:03:46 goldie friartuck (10521): Checking hamonitor.sh
Apr 20 11:03:46 goldie friartuck (10521): No HA process found!
Apr 20 11:03:46 goldie friartuck (10521): Starting "/etc/ha/hamonitor.sh"
Apr 20 11:03:46 goldie hamonitor (10531): Starting HA Monitoring
Apr 20 11:03:46 goldie hamonitor (10531): Reading Configuration
Apr 20 11:03:46 goldie hamonitor (10531): apache2 is already running;  will monitor root's PID(s) 7663
Apr 20 11:03:46 goldie hamonitor (10531): friartuck.sh is already running;  will monitor root's PID(s) 10521
Apr 20 11:03:46 goldie hamonitor (10531): sleep is already running;  will monitor steve's PID(s) 10273
Apr 20 11:03:46 goldie hamonitor (10531): Monitoring friartuck sleep apache
Apr 20 11:03:56 goldie friartuck (10521): Checking hamonitor.sh
Apr 20 11:03:56 goldie friartuck (10521): HA Process rediscovered as 10531 (was 0)
Apr 20 11:03:56 goldie hamonitor (10531): Checking friartuck.sh
Apr 20 11:03:56 goldie hamonitor (10531): friartuck.sh is running
Apr 20 11:03:56 goldie hamonitor (10531): Checking sleep
Apr 20 11:03:56 goldie hamonitor (10531): sleep is running
Apr 20 11:03:56 goldie hamonitor (10531): Checking apache2
Apr 20 11:03:56 goldie hamonitor (10531): apache2 is running

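10. Analysis of the System in Operation

Process restart: friartuck.sh and hamonitor.sh restart each other. Kill friartuck.sh and hamonitor.sh starts it again; kill hamonitor.sh and friartuck.sh starts it again.
Failure handling: different processes are treated differently. The sleep process, for example, is marked as disabled if it fails twice within 3 minutes, and no restart is attempted until the configuration is re-read. Friar Tuck has STOPPABLE=0 in its configuration file, so it is restarted automatically however many times it is killed.
Apache handling: when Apache is stopped, HA Monitor restarts it just as it does sleep and Friar Tuck. If apachectl restart is run independently, Apache comes back with different PIDs; as long as this falls outside the 3-minute window since the last failure, the system logs the change and carries on monitoring the new PIDs.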

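11. Operating the System

Starting the system: run friartuck.sh to start the whole framework.
Re-reading the configuration: send signal 4 to the friartuck.sh process with kill -4. Friar Tuck creates /tmp/haread.$pid, and hamonitor.sh notices the file on its next pass through the loop and re-reads its configuration files.
Stopping the system: send signal 3 to the friartuck.sh process with kill -3, or run stoph.sh. Friar Tuck creates /tmp/hastop.$pid and waits for hamonitor.sh to remove it, after which both scripts exit cleanly and are not restarted.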

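12. Summary

Working together, the Friar Tuck and HA Monitor scripts form a simple but effective process monitoring and control system. It restarts failed processes automatically, adapts its response to the pattern of failures, and offers straightforward configuration files and operating commands. In practice the scripts and configuration files can be adjusted to suit particular monitoring and control requirements.

The main operating flows are shown in the following mermaid flowchart:

graph TD;
    A[Start the system] --> B[Run friartuck.sh];
    B --> C{Is hamonitor.sh running};
    C -- Not running --> D[Start hamonitor.sh];
    C -- Running --> E[Keep monitoring];
    F[Send kill -4] --> G[Friar Tuck creates /tmp/haread.$pid];
    G --> H[hamonitor.sh re-reads its configuration];
    I[Send kill -3 or run stoph.sh] --> J[Friar Tuck creates /tmp/hastop.$pid];
    J --> K[hamonitor.sh removes /tmp/hastop.$pid and exits];
    K --> L[Friar Tuck exits];

The sections above cover how the system works, how the scripts implement it, and how to operate it.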

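13. Key Functions Explained

To make the workings of the system clearer, the key functions are examined in more detail below.
- readconfig function: reads the configuration files and stores their settings in the corresponding arrays. The steps are:
1. Loop over every .conf file in the CONFDIR directory.
2. Before reading each file, unset the variables defined by the previous one so that settings do not leak from one file to the next.
3. Source the file and store its values in the arrays, such as startcmd, stopcmd, and process.
4. Check whether the process is already running. If it is, record its PID(s); otherwise set the pid array element to -1 and start the process if the ENABLED flag allows it.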

function readconfig
{
  # Read Configuration
  logger -t "$tag" "Reading Configuration"
  for proc in ${CONFDIR}/*.conf
  do
    # This filename can be web.conf if Bash4, otherwise 1.conf, 2.conf etc
    unset ENABLED START STOP PROCESS MIN MAX STARTDELAY USER STOPPABLE
    index=$(basename $proc .conf)
    echo "Reading $index configuration"
    . $proc
    startcmd[$index]=$START
    stopcmd[$index]=$STOP
    process[$index]=$PROCESS
    min[$index]=$MIN
    max[$index]=$MAX
    startdelay[$index]=$STARTDELAY
    user[$index]=$USER
    enabled[$index]=$ENABLED
    idx[$index]=$index
    lastfailure[$index]=0
    stoppable[$index]=${STOPPABLE:-1}
    PID=$(pgrep -d ' ' -u ${user[$index]} $PROCESS)
    if [ -n "$PID" ]; then
      # Already running
      logger -t "$tag" "${PROCESS} is already running;"\
         " will monitor ${USER}'s PID(s) $PID"
      pid[$index]=$PID
    else
      pid[$index]=-1
      if [ "$ENABLED" ]; then
        startproc $ENABLED $USER $START
      fi
    fi
  done
  logger -t "$tag" "Monitoring ${idx[*]}"
  # Set defaults
  DELAY=10
  FAILWINDOW=180
  debug=9
  . ${CONFDIR}/ha.cfg
}

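- failurecount function: compares the time of the process's last failure with the current time to decide whether it has failed again inside the allowed failure window. If it has, the stoppable flag determines whether the process is disabled.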

function failurecount
{
  index=$1
  # Seconds since the previous recorded failure of this process
  interval=$(( $(date +%s) - ${lastfailure[$index]} ))
  lastfailure[$index]=$(date +%s)
  if [ "$interval" -lt "$FAILWINDOW" ]; then
    if [ ${stoppable[$index]} -eq 1 ]; then
      logger -t "$tag" "${process[$index]} has failed twice within $interval"\
         " seconds. Disabling."
      enabled[$index]=0
    else
      logger -t "$tag" "${process[$index]} has failed twice within $interval"\
         " seconds but can not be disabled."
    fi
  fi
}
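
To see how the window test behaves over several failures, the same arithmetic can be traced with fixed timestamps. The variant below, failurecount_at, is a hypothetical helper written for this trace: it takes the current time as a second argument instead of calling date, prints instead of logging, and leaves out the stoppable check.

```bash
#!/bin/bash
declare -A lastfailure=([sleep]=0) enabled=([sleep]=1)
FAILWINDOW=180

# failurecount_at INDEX NOW: the window test from failurecount, with the
# clock passed in so that the trace is repeatable.
function failurecount_at
{
  local index=$1 now=$2
  local interval=$((now - ${lastfailure[$index]}))
  lastfailure[$index]=$now
  if [ "$interval" -lt "$FAILWINDOW" ]; then
    enabled[$index]=0
  fi
  echo "t=$now interval=$interval enabled=${enabled[$index]}"
}

failurecount_at sleep 1700000000   # first failure: lastfailure was 0
failurecount_at sleep 1700000600   # ten minutes later: outside the window
failurecount_at sleep 1700000650   # 50 seconds later: inside the window
```

Because lastfailure starts at 0, the first failure always produces a very large interval and never disables the process; only the third call, 50 seconds after the second, sets enabled to 0.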

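- startproc function: starts a process. It first checks the ENABLED flag passed as its first argument; if the flag is not 1 the process is not started, otherwise the command is started as the specified user.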

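function startproc
{
  # $1 = enabled flag, $2 = user to run as, remaining words = command
  if [ "$1" -ne "1" ]; then
    shift 2
    logger -t "$tag" "Not starting \"$*\" as it is disabled."
    return
  fi
  # A local name avoids overwriting the global user array
  local runas=$2
  shift 2
  logger -t "$tag" "Starting \"$*\" as \"$runas\""
  nohup sudo -u $runas $@ >/dev/null 2>&1 &
}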
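
The argument handling is easier to follow with a dry run that prints what would happen instead of calling sudo and logger. startproc_dryrun below is a hypothetical stand-in written for that purpose; the two calls use values from the sample configuration files.

```bash
#!/bin/bash
# startproc_dryrun ENABLED USER COMMAND...: show what startproc would do.
function startproc_dryrun
{
  if [ "$1" -ne "1" ]; then
    shift 2                      # drop the flag and the user, keep the command
    echo "skip: $*"
    return
  fi
  local runas=$2
  shift 2
  echo "would run: sudo -u $runas $*"
}

startproc_dryrun 1 steve sleep 600
startproc_dryrun 0 root /usr/sbin/apachectl start
```

The first call reports the command it would run as steve; the second is skipped because its flag is 0.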

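14. Monitoring the System's Run State

15. Fault Handling

The system responds to different kinds of fault in different ways:
- Frequent failures: a process that fails repeatedly inside the allowed failure window is marked as disabled, which avoids pointless restarts. The sleep process, for example, is disabled if it fails twice within 3 minutes.
- Both scripts killed at once: there is a risk that both scripts are killed together, leaving nothing to restart them. Because each script monitors and restarts the other, this only happens if both die before either has noticed the other is missing, which makes it unlikely.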

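16. Suggested Improvements

The following changes could make the system more stable and reliable:
- State persistence: store process state (such as the disabled flag) in a file so that it survives a restart of the script. For example, hamonitor.sh could write updated state back to the configuration file or to a separate state-tracking file.
- Log analysis: review the system logs regularly to spot potential problems and failure patterns early, and tune the system accordingly.
- Configuration management: keep the configuration files under version control so that changes can be tracked and rolled back, and add a validation step to make sure the settings are correct.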
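
State persistence could start as small as the sketch below, which writes the enabled array to a file as "key value" lines and reads it back. Everything here is an assumption made for illustration: the helper names save_state and load_state, the file format, and the use of mktemp for the path (a real deployment would want a fixed location writable only by root).

```bash
#!/bin/bash
# Hypothetical state file for the demo.
statefile=$(mktemp)

declare -A enabled=([apache]=1 [sleep]=0)   # sleep was disabled after failing

# save_state: write one "key value" line per monitored process
function save_state
{
  local key
  for key in "${!enabled[@]}"
  do
    echo "$key ${enabled[$key]}"
  done > "$statefile"
}

# load_state: read the lines back into the existing enabled array
function load_state
{
  local key value
  while read -r key value
  do
    enabled[$key]=$value
  done < "$statefile"
}

save_state
enabled[sleep]=1        # simulate a restart that re-read sleep.conf
load_state
echo "sleep enabled after reload: ${enabled[sleep]}"
rm -f "$statefile"
```

Plain "key value" lines are read with read rather than sourced, so the state file cannot inject commands into the monitor.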

17. Operation Examples

The following concrete commands show how the system is used:

17.1 Starting the system

./friartuck.sh

17.2 Re-reading the configuration

kill -4 `pgrep -u root friartuck.sh`

17.3 Stopping the system

kill -3 `pgrep -u root friartuck.sh`
# or
./stoph.sh

18. Operation Flowchart

graph LR;
    A[Start the system] --> B[Run friartuck.sh];
    B --> C{Check hamonitor.sh status};
    C -- Not running --> D[Start hamonitor.sh];
    C -- Running --> E[Monitor process status];
    E --> F{Has the process failed};
    F -- Yes --> G[Call failurecount];
    G --> H{Restart allowed};
    H -- Yes --> I[Restart the process];
    H -- No --> J[Mark as disabled];
    F -- No --> K[Keep monitoring];
    L[Re-read configuration] --> M[Send kill -4];
    M --> N[Friar Tuck creates /tmp/haread.$pid];
    N --> O[hamonitor.sh re-reads its configuration];
    P[Stop the system] --> Q[Send kill -3];
    Q --> R[Friar Tuck creates /tmp/hastop.$pid];
    R --> S[hamonitor.sh removes the file and exits];
    S --> T[Friar Tuck exits];

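19. Conclusion

This walk through the Friar Tuck and HA Monitor scripts has covered how a complete process monitoring and control system is built and operated. The two scripts cooperate to start, monitor, and restart processes and to handle their failures automatically, which helps keep the services they watch running. The suggested improvements above are starting points for refining the system, and it can be customised and extended to fit different requirements.

If you run into problems while using a setup like this, feel free to get in touch.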