[Building a Crawler Data-Scraping Platform] [3-5-02] Killing Timed-Out Processes on CentOS

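0. Series Table of Contents

I am not writing this series strictly in the order of the table of contents, so please use the table of contents to go straight to the chapter you want to read.

Click here to jump to the table of contents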

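1. Background

When running crawlers, some tasks will inevitably run past their time limit. Our strategy for such tasks is simply to kill the task's process so that it cannot block other tasks.

2. The Script

Here is the script up front. If that is all you need, feel free to take it and go.

Full code: click to jump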

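#!/bin/bash

# Kill the process with the given pid if it has been running
# for at least max_time seconds.
kill_timeout_process()
{
    local pid=$1
    local max_time=$2
    local stat start_time user_hz sys_uptime runs_time

    # The process may already have exited, so ignore read errors.
    stat=$(cat "/proc/$pid/stat" 2>/dev/null)

    if [[ -z "$stat" ]]
    then
        echo "[$pid] does not exist."
        return 0
    fi

    # Field 2 (comm) may contain spaces. Drop everything up to the last ")"
    # so that starttime (field 22) is always the 20th remaining field.
    stat=${stat##*) }
    start_time=$(echo "$stat" | cut -d" " -f20)

    user_hz=$(getconf CLK_TCK)
    sys_uptime=$(cut -d" " -f1 /proc/uptime)

    # starttime is in clock ticks since boot; convert it to seconds.
    runs_time=$(( ${sys_uptime%.*} - start_time / user_hz ))

    echo "[$pid] has been running for $runs_time seconds."

    if [ "$runs_time" -ge "$max_time" ]
    then
        echo "[$pid] killed."
        kill -9 "$pid"
    fi

    return 0
}

process_name="scrapy"
timeout=7200

for pid in $(pgrep "$process_name")
do
    kill_timeout_process "$pid" "$timeout"
done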

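3. Walkthrough

If you are curious about how the script above works, read on.

Steps

1. Find the processes to check
2. Get each process's running time
3. Check whether the running time exceeds the configured limit
4. Kill the timed-out process

Code

Let's go through the steps above one by one.

Getting the process pid

To get a process ID you can use either ps or pgrep.

In the examples below, scrapy is the command that the processes I am looking for contain.

With ps:

ps -ef | grep "scrapy" | grep -v grep | awk '{print $2}'

Output: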

[root@docker]# ps -ef | grep "scrapy" | grep -v grep | awk '{print $2}'
5369
5371
14628
14634
16514
16516
17383
17384

With pgrep:

pgrep scrapy

Output:

[root@docker]# pgrep scrapy
5371
14634
17384
17857
27193
27587
27592
27599
27604

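As you can see, if all you need is the pid, pgrep is the simpler option, so that is what we use. Keep in mind that the two commands do not match the same thing: ps -ef | grep searches the full command line, while pgrep by default matches only the process name (pgrep -f matches the full command line).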
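As a small illustration of the pgrep approach, here is a minimal sketch. The helper name list_pids is made up for this example, and it assumes the pgrep from procps is installed. It uses -x so that only an exact process-name match is returned.

```shell
#!/bin/bash

# list_pids NAME (hypothetical helper, not part of the article's script)
# Prints the pids of all processes whose name is exactly NAME, one per line.
list_pids() {
    # Without -x, pgrep treats NAME as a pattern, so "scrapy" would also
    # match a process named, say, "scrapyd".
    pgrep -x "$1"
}
```

Calling list_pids scrapy could replace the plain pgrep call in the loop if partial matches turn out to be a problem.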

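Getting the process's running time

The running time is computed as follows:

process running time = current system uptime - system uptime at the moment the process started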
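The formula can be sketched as a tiny function working on plain numbers. The function name elapsed_seconds and the sample inputs are illustrative only. The start time is passed in clock ticks, which is how the kernel reports it, so it is divided by the tick rate first.

```shell
#!/bin/bash

# elapsed_seconds UPTIME START_TICKS HZ (hypothetical helper)
# Prints how long a process has been running, in whole seconds.
#   UPTIME       system uptime in seconds (may have a fractional part)
#   START_TICKS  process start time in clock ticks since boot
#   HZ           clock ticks per second
elapsed_seconds() {
    local uptime=$1 start_ticks=$2 hz=$3
    # Shell arithmetic is integer-only, so drop the fractional part of uptime.
    echo $(( ${uptime%.*} - start_ticks / hz ))
}

# Illustrative inputs: 5047164 - 504000000 / 100 = 7164
elapsed_seconds 5047164.56 504000000 100
```

With these inputs the function prints 7164, just under the 7200-second limit used in the script.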

How do we get the current system uptime?

The system uptime is stored in the file /proc/uptime, for example:

[root@docker]# cat /proc/uptime
5047164.56 34461345.27

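As you can see, the file contains two numbers. The first is how long the system has been running so far, in seconds. The second is the idle time, summed over all CPUs, which we do not need here.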
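One way to read that first number without spawning cat and cut is the shell's built-in read. This is a minimal sketch; the function name uptime_seconds is made up for this example.

```shell
#!/bin/bash

# uptime_seconds (hypothetical helper)
# Prints the system uptime in whole seconds.
uptime_seconds() {
    local up idle
    # read splits the single line of /proc/uptime into its two numbers.
    read -r up idle < /proc/uptime
    # Strip the fractional part so the value can be used in shell arithmetic.
    echo "${up%.*}"
}

uptime_seconds
```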

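How do we get the system uptime at the moment the process started?

For this we can turn to the file /proc/[pid]/stat, which holds status information about the process with that pid. The relevant excerpt from the proc(5) man page follows: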

/proc/[pid]/stat
          Status information about the process.  This is used by ps(1).
          It is defined in the kernel source file fs/proc/array.c.

          The fields, in order, with their proper scanf(3) format speci‐
          fiers, are listed below.  Whether or not certain of these
          fields display valid information is governed by a ptrace
          access mode PTRACE_MODE_READ_FSCREDS | PTRACE_MODE_NOAUDIT
          check (refer to ptrace(2)).  If the check denies access, then
          the field value is displayed as 0.  The affected fields are
          indicated with the marking [PT].

          (1) pid  %d
                    The process ID.

          (2) comm  %s
                    The filename of the executable, in parentheses.
                    This is visible whether or not the executable is
                    swapped out.

          (3) state  %c
                    One of the following characters, indicating process
                    state:

                    R  Running

                    S  Sleeping in an interruptible wait

                    D  Waiting in uninterruptible disk sleep

                    Z  Zombie

                    T  Stopped (on a signal) or (before Linux 2.6.33)
                       trace stopped

                    t  Tracing stop (Linux 2.6.33 onward)

                    W  Paging (only before Linux 2.6.0)

                    X  Dead (from Linux 2.6.0 onward)

                    x  Dead (Linux 2.6.33 to 3.13 only)

                    K  Wakekill (Linux 2.6.33 to 3.13 only)

                    W  Waking (Linux 2.6.33 to 3.13 only)

                    P  Parked (Linux 3.9 to 3.13 only)

          (4) ppid  %d
                    The PID of the parent of this process.

          (5) pgrp  %d
                    The process group ID of the process.

          (6) session  %d
                    The session ID of the process.

          (7) tty_nr  %d
                    The controlling terminal of the process.  (The minor
                    device number is contained in the combination of
                    bits 31 to 20 and 7 to 0; the major device number is
                    in bits 15 to 8.)

          (8) tpgid  %d
                    The ID of the foreground process group of the con‐
                    trolling terminal of the process.

          (9) flags  %u
                    The kernel flags word of the process.  For bit mean‐
                    ings, see the PF_* defines in the Linux kernel
                    source file include/linux/sched.h.  Details depend
                    on the kernel version.

                    The format for this field was %lu before Linux 2.6.

          (10) minflt  %lu
                    The number of minor faults the process has made
                    which have not required loading a memory page from
                    disk.

          (11) cminflt  %lu
                    The number of minor faults that the process's
                    waited-for children have made.

          (12) majflt  %lu
                    The number of major faults the process has made
                    which have required loading a memory page from disk.

          (13) cmajflt  %lu
                    The number of major faults that the process's
                    waited-for children have made.

          (14) utime  %lu
                    Amount of time that this process has been scheduled
                    in user mode, measured in clock ticks (divide by
                    sysconf(_SC_CLK_TCK)).  This includes guest time,
                    guest_time (time spent running a virtual CPU, see
                    below), so that applications that are not aware of
                    the guest time field do not lose that time from
                    their calculations.

          (15) stime  %lu
                    Amount of time that this process has been scheduled
                    in kernel mode, measured in clock ticks (divide by
                    sysconf(_SC_CLK_TCK)).

          (16) cutime  %ld
                    Amount of time that this process's waited-for chil‐
                    dren have been scheduled in user mode, measured in
                    clock ticks (divide by sysconf(_SC_CLK_TCK)).  (See
                    also times(2).)  This includes guest time,
                    cguest_time (time spent running a virtual CPU, see
                    below).

          (17) cstime  %ld
                    Amount of time that this process's waited-for chil‐
                    dren have been scheduled in kernel mode, measured in
                    clock ticks (divide by sysconf(_SC_CLK_TCK)).

          (18) priority  %ld
                    (Explanation for Linux 2.6) For processes running a
                    real-time scheduling policy (policy below; see
                    sched_setscheduler(2)), this is the negated schedul‐
                    ing priority, minus one; that is, a number in the
                    range -2 to -100, corresponding to real-time priori‐
                    ties 1 to 99.  For processes running under a non-
                    real-time scheduling policy, this is the raw nice
                    value (setpriority(2)) as represented in the kernel.
                    The kernel stores nice values as numbers in the
                    range 0 (high) to 39 (low), corresponding to the
                    user-visible nice range of -20 to 19.

                    Before Linux 2.6, this was a scaled value based on
                    the scheduler weighting given to this process.

          (19) nice  %ld
                    The nice value (see setpriority(2)), a value in the
                    range 19 (low priority) to -20 (high priority).

          (20) num_threads  %ld
                    Number of threads in this process (since Linux 2.6).
                    Before kernel 2.6, this field was hard coded to 0 as
                    a placeholder for an earlier removed field.

          (21) itrealvalue  %ld
                    The time in jiffies before the next SIGALRM is sent
                    to the process due to an interval timer.  Since ker‐
                    nel 2.6.17, this field is no longer maintained, and
                    is hard coded as 0.

          (22) starttime  %llu
                    The time the process started after system boot.  In
                    kernels before Linux 2.6, this value was expressed
                    in jiffies.  Since Linux 2.6, the value is expressed
                    in clock ticks (divide by sysconf(_SC_CLK_TCK)).

                    The format for this field was %lu before Linux 2.6.

          (23) vsize  %lu
                    Virtual memory size in bytes.

          (24) rss  %ld
                    Resident Set Size: number of pages the process has
                    in real memory.  This is just the pages which count
                    toward text, data, or stack space.  This does not
                    include pages which have not been demand-loaded in,
                    or which are swapped out.

          (25) rsslim  %lu
                    Current soft limit in bytes on the rss of the
                    process; see the description of RLIMIT_RSS in
                    getrlimit(2).

          (26) startcode  %lu  [PT]
                    The address above which program text can run.

          (27) endcode  %lu  [PT]
                    The address below which program text can run.

          (28) startstack  %lu  [PT]
                    The address of the start (i.e., bottom) of the
                    stack.

          (29) kstkesp  %lu  [PT]
                    The current value of ESP (stack pointer), as found
                    in the kernel stack page for the process.

          (30) kstkeip  %lu  [PT]
                    The current EIP (instruction pointer).

          (31) signal  %lu
                    The bitmap of pending signals, displayed as a deci‐
                    mal number.  Obsolete, because it does not provide
                    information on real-time signals; use
                    /proc/[pid]/status instead.

          (32) blocked  %lu
                    The bitmap of blocked signals, displayed as a deci‐
                    mal number.  Obsolete, because it does not provide
                    information on real-time signals; use
                    /proc/[pid]/status instead.

          (33) sigignore  %lu
                    The bitmap of ignored signals, displayed as a deci‐
                    mal number.  Obsolete, because it does not provide
                    information on real-time signals; use
                    /proc/[pid]/status instead.

          (34) sigcatch  %lu
                    The bitmap of caught signals, displayed as a decimal
                    number.  Obsolete, because it does not provide
                    information on real-time signals; use
                    /proc/[pid]/status instead.

          (35) wchan  %lu  [PT]
                    This is the "channel" in which the process is wait‐
                    ing.  It is the address of a location in the kernel
                    where the process is sleeping.  The corresponding
                    symbolic name can be found in /proc/[pid]/wchan.

          (36) nswap  %lu
                    Number of pages swapped (not maintained).

          (37) cnswap  %lu
                    Cumulative nswap for child processes (not main‐
                    tained).

          (38) exit_signal  %d  (since Linux 2.1.22)
                    Signal to be sent to parent when we die.

          (39) processor  %d  (since Linux 2.2.8)
                    CPU number last executed on.

          (40) rt_priority  %u  (since Linux 2.5.19)
                    Real-time scheduling priority, a number in the range
                    1 to 99 for processes scheduled under a real-time
                    policy, or 0, for non-real-time processes (see
                    sched_setscheduler(2)).

          (41) policy  %u  (since Linux 2.5.19)
                    Scheduling policy (see sched_setscheduler(2)).
                    Decode using the SCHED_* constants in linux/sched.h.

                    The format for this field was %lu before Linux
                    2.6.22.

          (42) delayacct_blkio_ticks  %llu  (since Linux 2.6.18)
                    Aggregated block I/O delays, measured in clock ticks
                    (centiseconds).

          (43) guest_time  %lu  (since Linux 2.6.24)
                    Guest time of the process (time spent running a vir‐
                    tual CPU for a guest operating system), measured in
                    clock ticks (divide by sysconf(_SC_CLK_TCK)).

          (44) cguest_time  %ld  (since Linux 2.6.24)
                    Guest time of the process's children, measured in
                    clock ticks (divide by sysconf(_SC_CLK_TCK)).

          (45) start_data  %lu  (since Linux 3.3)  [PT]
                    Address above which program initialized and unini‐
                    tialized (BSS) data are placed.

          (46) end_data  %lu  (since Linux 3.3)  [PT]
                    Address below which program initialized and unini‐
                    tialized (BSS) data are placed.

          (47) start_brk  %lu  (since Linux 3.3)  [PT]
                    Address above which program heap can be expanded
                    with brk(2).

          (48) arg_start  %lu  (since Linux 3.5)  [PT]
                    Address above which program command-line arguments
                    (argv) are placed.

          (49) arg_end  %lu  (since Linux 3.5)  [PT]
                    Address below program command-line arguments (argv)
                    are placed.

          (50) env_start  %lu  (since Linux 3.5)  [PT]
                    Address above which program environment is placed.

          (51) env_end  %lu  (since Linux 3.5)  [PT]
                    Address below which program environment is placed.

          (52) exit_code  %d  (since Linux 3.5)  [PT]
                    The thread's exit status in the form reported by
                    waitpid(2).

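As you can see, the script above reads starttime, field 22 of /proc/$pid/stat. Note that this "start time" is not a wall-clock timestamp. It is the number of clock ticks that had elapsed since boot when the process started. Dividing it by the clock-tick rate (CLK_TCK, obtained with getconf CLK_TCK) gives the process's start time in seconds after boot.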
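One caveat when picking out field 22: field 2 (comm) is wrapped in parentheses and may itself contain spaces, which shifts every later field when the line is split naively on spaces. The sketch below shows the effect on a shortened, made-up stat line. The helper name starttime_from_stat is hypothetical.

```shell
#!/bin/bash

# A shortened, made-up stat line whose command name contains a space.
sample='1234 (my app) S 1 1234 1234 0 -1 4194304 10 0 0 0 5 3 0 0 20 0 1 0 504000000 1000 200'

# starttime_from_stat LINE (hypothetical helper)
# Prints field 22 (starttime) of a /proc/[pid]/stat line.
starttime_from_stat() {
    local rest
    # Drop everything up to the last ")" so the line now starts at field 3.
    rest=${1##*) }
    # Split the remainder on whitespace; field 22 is now the 20th word.
    set -- $rest
    echo "${20}"
}

# Naive split: the space inside "(my app)" shifts the fields by one.
echo "$sample" | cut -d" " -f22

# Split after the closing parenthesis: picks the real starttime.
starttime_from_stat "$sample"
```

For this sample the naive cut prints 0 (it lands on itrealvalue), while starttime_from_stat prints 504000000. For a process simply named scrapy the two approaches agree.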

The remaining pieces (the comparison, the check, and killing the process) are fairly basic, so I won't go through them one by one.
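The script kills with kill -9 (SIGKILL), which gives the process no chance to clean up. If that matters for your crawler, a gentler variant is to send SIGTERM first and fall back to SIGKILL only after a grace period. The following is a sketch of that alternative, not what the script above does. The function name stop_process and the default grace period of 5 seconds are assumptions.

```shell
#!/bin/bash

# stop_process PID [GRACE_SECONDS] (hypothetical helper)
# Sends SIGTERM, waits up to GRACE_SECONDS (default 5) for the process
# to go away, then sends SIGKILL if it is still there.
stop_process() {
    local pid=$1 grace=${2:-5} waited=0

    # kill -0 sends no signal; it only checks that the pid can be signalled.
    kill -0 "$pid" 2>/dev/null || return 0

    kill -TERM "$pid" 2>/dev/null
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]
    do
        sleep 1
        waited=$(( waited + 1 ))
    done

    # Still present after the grace period: force it.
    if kill -0 "$pid" 2>/dev/null
    then
        kill -KILL "$pid" 2>/dev/null
    fi
    return 0
}
```

In the script, the kill -9 line could then be replaced with a call such as stop_process "$pid" 10.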

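4. Summary

The idea behind killing timed-out processes is fairly simple. What it mainly takes is some familiarity with Linux: knowing where the relevant information (system uptime, process start time, and so on) is stored. Everything else is routine.

I hope this helps you level up.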
