Managing Jobs

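This article describes in detail the LSF workflow, task management, job states, scheduling policies, exception handling, and resource control, including how to submit, view, and control jobs, adjust their priority, suspend, resume, and force them to run, and limit and delete them.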

Contents

Understanding Job States

View Job Information

Changing Job Order Within Queues

Switch Jobs from One Queue to Another

Forcing Job Execution

Suspending and Resuming Jobs

Killing Jobs

Sending a Signal to a Job

Using Job Groups

Handling Job Exceptions

Understanding Job States

The bjobs command displays the current state of the job.

Normal job states

Most jobs enter only three states:

Job state   Description
PEND        Waiting in a queue for scheduling and dispatch
RUN         Dispatched to a host and running
DONE        Finished normally with a zero exit value

Suspended job states

A suspended job is in one of three states:

Job state   Description
PSUSP       Suspended by its owner or the LSF administrator while in PEND state
USUSP       Suspended by its owner or the LSF administrator after being dispatched
SSUSP       Suspended by the LSF system after being dispatched

State transitions

A job goes through a series of state transitions until it eventually completes its task, fails, or is terminated. The possible states of a job during its life cycle are shown in the diagram.

Pending jobs

A job remains pending until all conditions for its execution are met. Some of the conditions are:

Start time specified by the user when the job is submitted

Load conditions on qualified hosts

Dispatch windows during which the queue can dispatch and qualified hosts can accept jobs

Run windows during which jobs from the queue can run

Limits on the number of job slots configured for a queue, a host, or a user

Relative priority to other users and jobs

Availability of the specified resources

Job dependency and pre-execution conditions

Maximum pending job threshold

If the user or user group submitting the job has reached the pending job threshold as specified by MAX_PEND_JOBS (either in the User section of lsb.users, or cluster-wide in lsb.params), LSF will reject any further job submission requests sent by that user or user group. The system will continue to send the job submission requests with the interval specified by SUB_TRY_INTERVAL in lsb.params until it has made a number of attempts equal to the LSB_NTRIES environment variable. If LSB_NTRIES is undefined and LSF rejects the job submission request, the system will continue to send the job submission requests indefinitely as the default behavior.
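
The retry behavior described above can be pictured with a small model. This is only a sketch of the client-side loop, not LSF code: submit_with_retry, the submit callable, and the injectable sleep are hypothetical names, interval stands in for SUB_TRY_INTERVAL, and ntries stands in for LSB_NTRIES (None meaning the variable is undefined).

```python
import time
from typing import Callable, Optional


def submit_with_retry(
    submit: Callable[[], bool],
    interval: float,
    ntries: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call submit() until it is accepted or the attempt budget runs out.

    submit   -- hypothetical callable; returns True if the request was accepted
    interval -- seconds between attempts (stand-in for SUB_TRY_INTERVAL)
    ntries   -- maximum attempts (stand-in for LSB_NTRIES); None retries forever
    """
    attempts = 0
    while ntries is None or attempts < ntries:
        attempts += 1
        if submit():
            return True
        # Wait before the next attempt, unless the budget is already spent.
        if ntries is None or attempts < ntries:
            sleep(interval)
    return False
```

With ntries=None the loop never gives up, which mirrors the default behavior when LSB_NTRIES is undefined.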

Suspended jobs

A job can be suspended at any time by its owner, by the LSF administrator, by the root user (superuser), or by LSF.

After a job has been dispatched and started on a host, it can be suspended by LSF. When a job is running, LSF periodically checks the load level on the execution host. If any load index is beyond either its per-host or its per-queue suspending conditions, the lowest priority batch job on that host is suspended.

If the load on the execution host or hosts becomes too high, batch jobs may be interfering with each other or with interactive jobs. In either case, some jobs should be suspended to maximize host performance or to guarantee interactive response time.

LSF suspends jobs according to the priority of the job's queue. When a host is busy, LSF suspends lower priority jobs first unless the scheduling policy associated with the job dictates otherwise.

Jobs are also suspended by the system if the job queue has a run window and the current time goes outside the run window.

A system-suspended job can later be resumed by LSF if the load condition on the execution hosts falls low enough or when the closed run window of the queue opens again.

WAIT state (chunk jobs)

If you have configured chunk job queues, members of a chunk job that are waiting to run are displayed as WAIT by bjobs. Any jobs in WAIT status are included in the count of pending jobs by bqueues and busers, even though the entire chunk job has been dispatched and occupies a job slot. The bhosts command shows the single job slot occupied by the entire chunk job in the number of jobs shown in the NJOBS column.

You can switch (bswitch) or migrate (bmig) a chunk job member in WAIT state to another queue.

See Chapter 32, "Chunk Job Dispatch" for more information about chunk jobs.

Exited jobs

An exited job ended with a non-zero exit status.

A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:

The job is cancelled by its owner or the LSF administrator while pending, or after being dispatched to a host.

The job is not able to be dispatched before it reaches its termination deadline set by bsub -t, and thus is terminated by LSF.

The job fails to start successfully. For example, the wrong executable is specified by the user when the job is submitted.

The application exits with a non-zero exit code.

You can configure hosts so that LSF detects an abnormally high rate of job exit from a host. See Handling Host-level Job Exceptions for more information.

Post-execution states

Some jobs may not be considered complete until some post-job processing is performed. For example, a job may need to exit from a post-execution job script, clean up job files, or transfer job output after the job completes.

The DONE or EXIT job states do not indicate whether post-processing is complete, so jobs that depend on the post-processing may start prematurely. Use the post_done and post_err keywords on the bsub -w command to specify job dependency conditions for job post-processing. The corresponding job states POST_DONE and POST_ERR indicate the state of the post-processing.

After the job completes, you cannot perform any job control on the post-processing. Post-processing exit codes are not reported to LSF.

See Chapter 38, "Pre-Execution and Post-Execution Commands" for more information.

View Job Information

The bjobs command is used to display job information. By default, bjobs displays information for the user who invoked the command. For more information about bjobs, see the LSF Reference and the bjobs(1) man page.

View all jobs for all users
Run bjobs -u all to display all jobs for all users.
Job information is displayed in the following order:
Running jobs

Pending jobs in the order in which they are scheduled

Jobs in high-priority queues are listed before those in lower-priority queues
For example:
bjobs -u all 
JOBID   USER    STAT    QUEUE     FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME 
1004    user1   RUN     short     hostA       hostA       job0       Dec 16 09:23 
1235    user3   PEND    priority  hostM                   job1       Dec 11 13:55 
1234    user2   SSUSP   normal    hostD       hostM       job3       Dec 11 10:09 
1250    user1   PEND    short     hostA                   job4       Dec 11 13:59 
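
When this listing has to be processed by a script, splitting each row on whitespace is unreliable because EXEC_HOST is blank for pending jobs. The sketch below slices each row at the column offsets taken from the header line instead. It assumes the fixed-width layout shown in the example above; parse_bjobs and SAMPLE are hypothetical names, and real bjobs output may truncate or widen columns.

```python
SAMPLE = """\
JOBID   USER    STAT    QUEUE     FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1004    user1   RUN     short     hostA       hostA       job0       Dec 16 09:23
1235    user3   PEND    priority  hostM                   job1       Dec 11 13:55
"""


def parse_bjobs(text):
    """Parse default-format bjobs output into a list of dicts keyed by column."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    names = header.split()
    # Column start offsets come from the header line, because a blank
    # EXEC_HOST field makes whitespace splitting unreliable.
    starts = [header.index(name) for name in names]
    ends = starts[1:] + [None]
    jobs = []
    for line in lines[1:]:
        row = {n: line[s:e].strip() for n, s, e in zip(names, starts, ends)}
        jobs.append(row)
    return jobs


if __name__ == "__main__":
    for job in parse_bjobs(SAMPLE):
        print(job["JOBID"], job["STAT"], job["EXEC_HOST"] or "-")
```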
View jobs for specific users
Run bjobs -u user_name to display jobs for a specific user:
bjobs -u user1 
JOBID   USER    STAT    QUEUE     FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME 
2225    user1   USUSP   normal    hostA                   job1       Nov 16 11:55 
2226    user1   PSUSP   normal    hostA                   job2       Nov 16 12:30 
2227    user1   PSUSP   normal    hostA                   job3       Nov 16 12:31 
View running jobs

Run bjobs -r to display running jobs.

View done jobs

Run bjobs -d to display recently completed jobs.

View pending job information

Run bjobs -p to display the reason why a job is pending.

Run busers -w all to see the maximum pending job threshold for all users.

View suspension reasons

Run bjobs -s to display the reason why a job was suspended.

View chunk job wait status and wait reason

Run bhist -l to display jobs in WAIT status. Jobs are shown as Waiting ...

The bjobs -l command does not display a WAIT reason in the list of pending jobs.

View post-execution states

Run bhist to display the POST_DONE and POST_ERR states.

The resource usage of post-processing is not included in the job resource usage.

View exception status for jobs (bjobs)
Run bjobs to display job exceptions. bjobs -l shows exception information for unfinished jobs, and bjobs -x -l shows finished as well as unfinished jobs.
For example, the following bjobs command shows that job 2 is running longer than the configured JOB_OVERRUN threshold, and is consuming no CPU time. bjobs displays the job idle factor, and both job overrun and job idle exceptions. Job 1 finished before the configured JOB_UNDERRUN threshold, so bjobs shows exception status of underrun:
bjobs -x -l -a 
Job <2>, User <user1>, Project <default>, Status <RUN>, Queue <normal>, Command 
                     <sleep 600> 
Wed Aug 13 14:23:35: Submitted from host <hostA>, CWD <$HOME>, Output File 
                     </dev/null>, Specified Hosts <hostB>; 
Wed Aug 13 14:23:43: Started on <hostB>, Execution Home </home/user1>, Execution  
                     CWD </home/user1>; 
Resource usage collected. 
                     IDLE_FACTOR(cputime/runtime):   0.00 
                     MEM: 3 Mbytes;  SWAP: 4 Mbytes;  NTHREAD: 3 
                     PGID: 5027;  PIDs: 5027 5028 5029  
 
 SCHEDULING PARAMETERS: 
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem 
 loadSched   -     -     -     -       -     -    -     -     -      -      -   
 loadStop    -     -     -     -       -     -    -     -     -      -      -   
 
                cpuspeed    bandwidth 
 loadSched          -            - 
 loadStop           -            - 
 
 EXCEPTION STATUS:  overrun  idle 
------------------------------------------------------------------------------ 
 
Job <1>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Command 
                     <sleep 20> 
Wed Aug 13 14:18:00: Submitted from host <hostA>, CWD <$HOME>, 
                     Output File </dev/null>, Specified Hosts < 
                     hostB>; 
Wed Aug 13 14:18:10: Started on <hostB>, Execution Home </home/user1>, Execution  
                     CWD </home/user1>; 
Wed Aug 13 14:18:50: Done successfully. The CPU time used is 0.2 seconds. 
 
 SCHEDULING PARAMETERS: 
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem 
 loadSched   -     -     -     -       -     -    -     -     -      -      -   
 loadStop    -     -     -     -       -     -    -     -     -      -      -   
 
                cpuspeed    bandwidth 
 loadSched          -            - 
 loadStop           -            - 
 
 EXCEPTION STATUS:  underrun 
 
Use bacct -l -x to trace the history of job exceptions.
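
The three exception types in the example can be summarized as simple comparisons. The sketch below is an illustration only: job_exceptions and its parameters are hypothetical, every value is expressed in seconds for simplicity (the configured thresholds may use other units), and the idle check uses the cputime/runtime ratio that bjobs reports as IDLE_FACTOR.

```python
def job_exceptions(run_seconds, cpu_seconds, finished,
                   overrun_seconds=None, underrun_seconds=None,
                   idle_factor=None):
    """Return the exception names that apply to one job.

    All thresholds are hypothetical inputs in seconds (or a ratio for
    idle_factor); None disables the corresponding check.
    """
    found = []
    # Overrun: the job has run longer than the overrun threshold.
    if overrun_seconds is not None and run_seconds > overrun_seconds:
        found.append("overrun")
    # Underrun: the job finished sooner than the underrun threshold.
    if (finished and underrun_seconds is not None
            and run_seconds < underrun_seconds):
        found.append("underrun")
    # Idle: a running job whose cputime/runtime ratio is below the factor.
    if (not finished and idle_factor is not None and run_seconds > 0
            and cpu_seconds / run_seconds < idle_factor):
        found.append("idle")
    return found
```

For instance, a running job with 700 seconds of run time, no CPU time, an assumed overrun threshold of 600 seconds, and an assumed idle factor of 0.1 is reported with both overrun and idle, as job 2 is in the listing above.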
Changing Job Order Within Queues

By default, LSF dispatches jobs in a queue in the order of arrival (that is, first-come, first-served), subject to availability of suitable server hosts.

Use the btop and bbot commands to change the position of pending jobs, or of pending job array elements, to affect the order in which jobs are considered for dispatch. Users can only change the relative position of their own jobs, and LSF administrators can change the position of any user's jobs.

bbot

Moves jobs relative to your last job in the queue.

If invoked by a regular user, bbot moves the selected job after the last job with the same priority submitted by the user to the queue.

If invoked by the LSF administrator, bbot moves the selected job after the last job with the same priority submitted to the queue.

btop

Moves jobs relative to your first job in the queue.

If invoked by a regular user, btop moves the selected job before the first job with the same priority submitted by the user to the queue.

If invoked by the LSF administrator, btop moves the selected job before the first job with the same priority submitted to the queue.

Moving a job to the top of the queue

In the following example, job 5311 is moved to the top of the queue. Since job 5308 is already running, job 5311 is placed in the queue after job 5308.

Note that user1's job is still in the same position in the queue. user2 cannot use btop to get extra jobs at the top of the queue; when one of user2's jobs moves up the queue, the rest of user2's jobs move down.
bjobs -u all 
JOBID USER  STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME 
5308  user2 RUN   normal   hostA      hostD      /s500     Oct 23 10:16 
5309  user2 PEND  night    hostA                 /s200     Oct 23 11:04 
5310  user1 PEND  night    hostB                 /myjob    Oct 23 13:45 
5311  user2 PEND  night    hostA                 /s700     Oct 23 18:17 
 
btop 5311 
Job <5311> has been moved to position 1 from top. 
 
bjobs -u all 
JOBID USER  STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME   SUBMIT_TIME 
5308  user2 RUN   normal   hostA      hostD      /s500     Oct 23 10:16 
5311  user2 PEND  night    hostA                 /s700     Oct 23 18:17 
5310  user1 PEND  night    hostB                 /myjob    Oct 23 13:45 
5309  user2 PEND  night    hostA                 /s200     Oct 23 11:04 
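
The reordering in this example can be modeled in a few lines. This is a sketch of the behavior described above for a regular user, not the scheduler's implementation: it assumes all pending jobs have the same priority, and the btop function and the (job_id, user) tuples are hypothetical.

```python
def btop(queue, job_id):
    """Model btop for a regular user on a list of pending (job_id, user) pairs.

    The slots held by other users stay where they are; the selected job takes
    the first slot held by its owner, and the owner's other jobs shift down.
    """
    owner = next(user for jid, user in queue if jid == job_id)
    slots = [i for i, (_, user) in enumerate(queue) if user == owner]
    mine = [queue[i] for i in slots]
    selected = next(job for job in mine if job[0] == job_id)
    reordered = [selected] + [job for job in mine if job[0] != job_id]
    result = list(queue)
    for slot, job in zip(slots, reordered):
        result[slot] = job
    return result


if __name__ == "__main__":
    pending = [(5309, "user2"), (5310, "user1"), (5311, "user2")]
    print(btop(pending, 5311))
```

Because only the slots already held by the job's owner are permuted, user1's job 5310 keeps its position.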
Switch Jobs from One Queue to Another

You can use the command bswitch to change jobs from one queue to another. This is useful if you submit a job to the wrong queue, or if the job is suspended because of queue thresholds or run windows and you would like to resume the job.

Switch a single job to a different queue
Run bswitch to move pending and running jobs from queue to queue.
In the following example, job 5309 is switched to the priority queue:
bswitch priority 5309
Job <5309> is switched to queue <priority> 
bjobs -u all
JOBID    USER   STAT   QUEUE    FROM_HOST  EXEC_HOST   JOB_NAME   SUBMIT_TIME
5308     user2   RUN   normal   hostA      hostD       /job500    Oct 23 10:16
5309     user2   RUN   priority hostA      hostB       /job200    Oct 23 11:04
5311     user2   PEND  night    hostA                  /job700    Oct 23 18:17
5310     user1   PEND  night    hostB                  /myjob     Oct 23 13:45 
Switch all jobs to a different queue
Run bswitch -q from_queue to_queue 0 to switch all the jobs in a queue to another queue.
The -q option operates on all jobs in a queue, and the job ID 0 selects every job in that queue. The example below switches all jobs from the night queue to the idle queue:
bswitch -q night idle 0
Job <5308> is switched to queue <idle>
Job <5310> is switched to queue <idle> 
Forcing Job Execution

A pending job can be forced to run with the brun command. This operation can only be performed by an LSF administrator.

You can force a job to run on a particular host or to run until completion, and you can apply other restrictions. For more information, see the brun command.

When a job is forced to run, any other constraints associated with the job such as resource requirements or dependency conditions are ignored.

In this situation you may see some job slot limits, such as the maximum number of jobs that can run on a host, being violated. A job that is forced to run cannot be preempted.

Force a pending job to run
Run brun -m hostname job_ID to force a pending job to run.
You must specify the host on which the job will run.

For example, the following command will force the sequential job 104 to run on hostA:
brun -m hostA 104 
Suspending and Resuming Jobs

A job can be suspended by its owner or the LSF administrator. These jobs are considered user-suspended and are displayed by bjobs as USUSP.

If a user suspends a high priority job from a non-preemptive queue, the load may become low enough for LSF to start a lower priority job in its place. The load created by the low priority job can prevent the high priority job from resuming. This can be avoided by configuring preemptive queues.

Suspend a job
Run bstop job_ID.
Your job goes into USUSP state if the job is already started, or into PSUSP state if it is pending.
bstop 3421
Job <3421> is being stopped 
 
The above example suspends job 3421.
UNIX

bstop sends the following signals to the job:

SIGTSTP for parallel or interactive jobs. SIGTSTP is caught by the master process and passed to all the slave processes running on other hosts.

SIGSTOP for sequential jobs. SIGSTOP cannot be caught by user programs. The SIGSTOP signal can be configured with the LSB_SIGSTOP parameter in lsf.conf.

Windows

bstop causes the job to be suspended.
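
On UNIX, the difference between the two signals can be checked directly: a process may install a handler for SIGTSTP, but the operating system refuses a handler for SIGSTOP. The sketch below uses only the Python standard library; can_install_handler is a hypothetical helper, and it must be called from the main thread of a UNIX process.

```python
import signal


def can_install_handler(signum):
    """Return True if this process is allowed to install a handler for signum."""
    try:
        previous = signal.signal(signum, lambda s, frame: None)
    except OSError:
        # The operating system refuses handlers for SIGSTOP and SIGKILL.
        return False
    if previous is not None:
        signal.signal(signum, previous)  # restore the original disposition
    return True


if __name__ == "__main__":
    print("SIGTSTP catchable:", can_install_handler(signal.SIGTSTP))
    print("SIGSTOP catchable:", can_install_handler(signal.SIGSTOP))
```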

Resume a job
Run bresume job_ID:
bresume 3421
Job <3421> is being resumed 
 
The above example resumes job 3421.

 
Resuming a user-suspended job does not put your job into RUN state immediately. If your job was running before the suspension, bresume first puts your job into SSUSP state and then waits for sbatchd to schedule it according to the load conditions.
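
The state changes caused by bstop and bresume can be written as two lookup tables. This is a simplified model: the PSUSP to PEND transition for a resumed pending job is an assumption not stated in the text above, and apply_command is a hypothetical helper that ignores every other state.

```python
# State changes caused by bstop, as described above.
BSTOP = {"PEND": "PSUSP", "RUN": "USUSP"}

# State changes caused by bresume. A job that was running goes to SSUSP
# first; PSUSP -> PEND is an assumption for jobs suspended while pending.
BRESUME = {"PSUSP": "PEND", "USUSP": "SSUSP"}


def apply_command(state, command):
    """Return the job state after bstop or bresume in this simplified model."""
    table = {"bstop": BSTOP, "bresume": BRESUME}[command]
    if state not in table:
        raise ValueError(f"{command} is not modeled for state {state}")
    return table[state]
```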
Killing Jobs

The bkill command cancels pending batch jobs and sends signals to running jobs. By default, on UNIX, bkill sends the SIGKILL signal to running jobs.

Before SIGKILL is sent, SIGINT and SIGTERM are sent to give the job a chance to catch the signals and clean up. The signals are forwarded from mbatchd to sbatchd. sbatchd waits for the job to exit before reporting the status. Because of these delays, for a short period of time after the bkill command has been issued, bjobs may still report that the job is running.

On Windows, job control messages replace the SIGINT and SIGTERM signals, and termination is implemented by the TerminateProcess() system call.
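
A job can use the SIGINT and SIGTERM signals to clean up before SIGKILL arrives. The following is a minimal sketch of such a job written in Python for a UNIX host; on_terminate and cleaned_up are hypothetical names, and signal.raise_signal (Python 3.8 or later) only simulates the first signal that bkill would cause to be delivered.

```python
import signal

cleaned_up = []


def on_terminate(signum, frame):
    """Record which signal arrived; a real job would release resources here."""
    cleaned_up.append(signal.Signals(signum).name)


# SIGINT and SIGTERM can be caught; SIGKILL cannot, so any cleanup has to
# finish before SIGKILL is delivered.
signal.signal(signal.SIGINT, on_terminate)
signal.signal(signal.SIGTERM, on_terminate)

# Simulate the first signal by raising SIGINT in this process.
signal.raise_signal(signal.SIGINT)
print("signals handled:", cleaned_up)
```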

Kill a job
Run bkill job_ID. For example, the following command kills job 3421:
bkill 3421
Job <3421> is being terminated 
Kill multiple jobs
Run bkill 0 to kill all pending jobs in the cluster or use bkill 0 with the -g, -J, -m, -q, or -u options to kill all jobs that satisfy these options.
The following command kills all jobs dispatched to the hostA host:
bkill -m hostA 0 
Job <267> is being terminated 
Job <268> is being terminated 
Job <271> is being terminated 
 
The following command kills all jobs in the groupA job group:

bkill -g groupA 0 
Job <2083> is being terminated 
Job <2085> is being terminated 
Kill a large number of jobs rapidly

Killing multiple jobs with bkill 0 and other commands is usually sufficient for moderate numbers of jobs. However, killing a large number of jobs (more than approximately 1000) can take a long time to finish.

Run bkill -b to kill a large number of jobs faster than with normal means. However, jobs killed in this manner are not logged to lsb.acct.

Local pending jobs are killed immediately and cleaned up as soon as possible, ignoring the time interval specified by CLEAN_PERIOD in lsb.params. Other jobs are killed as soon as possible but cleaned up normally (after the CLEAN_PERIOD time interval).

If the -b option is used with bkill 0, it kills all applicable jobs and silently skips the jobs that cannot be killed.

The -b option is ignored if used with -r or -s.

Force removal of a job from LSF

Run bkill -r to force the removal of the job from LSF. Use this option when a job cannot be killed in the operating system.

The bkill -r command removes a job from the LSF system without waiting for the job to terminate in the operating system. This sends the same series of signals as bkill without -r, except that the job is removed from the system immediately, the job is marked as EXIT, and job resources that LSF monitors are released as soon as LSF receives the first signal.

Sending a Signal to a Job

LSF uses signals to control jobs, to enforce scheduling policies, or in response to user requests. The principal signals LSF uses are SIGSTOP to suspend a job, SIGCONT to resume a job, and SIGKILL to terminate a job.

Occasionally, you may want to override the default actions. For example, instead of suspending a job, you might want to kill or checkpoint it. You can override the default job control actions by defining the JOB_CONTROLS parameter in your queue configuration. Each queue can have its separate job control actions.

You can also send a signal directly to a job. You cannot send arbitrary signals to a pending job; most signals are only valid for running jobs. However, LSF does allow you to kill, suspend and resume pending jobs.

You must be the owner of a job or an LSF administrator to send signals to a job.

You use the bkill -s command to send a signal to a job. If you issue bkill without the -s option, a SIGKILL signal is sent to the specified jobs to kill them. Twenty seconds before SIGKILL is sent, SIGTERM and SIGINT are sent to give the job a chance to catch the signals and clean up.

On Windows, job control messages replace the SIGINT and SIGTERM signals, but only customized applications are able to process them. Termination is implemented by the TerminateProcess() system call.

Signals on different platforms

LSF translates signal numbers across different platforms because different host types may have different signal numbering. The real meaning of a specific signal is interpreted by the machine from which the bkill command is issued.

For example, if you send signal 18 from a SunOS 4.x host, it means SIGTSTP. If the job is running on HP-UX and SIGTSTP is defined as signal number 25, LSF sends signal 25 to the job.
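
The translation amounts to a lookup by signal name. In the sketch below, the two tables hold only the values quoted in the example above and are not complete platform definitions; SUNOS4, HPUX, and translate_signal are hypothetical names.

```python
# Hypothetical per-platform tables, limited to the values used in the text.
SUNOS4 = {"SIGTSTP": 18}
HPUX = {"SIGTSTP": 25}


def translate_signal(number, source, target):
    """Map a signal number on the source platform to the target platform."""
    by_number = {num: name for name, num in source.items()}
    name = by_number[number]  # the meaning is taken from the issuing host
    return target[name]       # the number is taken from the execution host


if __name__ == "__main__":
    print(translate_signal(18, SUNOS4, HPUX))
```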

Send a signal to a job

On most versions of UNIX, signal names and numbers are listed in the kill(1) or signal(2) man pages. On Windows, only customized applications are able to process job control messages specified with the -s option.
Run bkill -s signal job_id, where signal is either the signal name or the signal number:
bkill -s TSTP 3421
Job <3421> is being signaled 
 
The above example sends the TSTP signal to job 3421.
Using Job Groups

A collection of jobs can be organized into job groups for easy management. A job group is a container for jobs in much the same way that a directory in a file system is a container for files. For example, a payroll application may have one group of jobs that calculates weekly payments, another job group for calculating monthly salaries, and a third job group that handles the salaries of part-time or contract employees. Users can submit, view, and control jobs according to their groups rather than looking at individual jobs.

How job groups are created

Job groups can be created explicitly or implicitly:

A job group is created explicitly with the bgadd command.

A job group is created implicitly by the bsub -g or bmod -g command when the specified group does not exist. Job groups are also created implicitly when a default job group is configured (DEFAULT_JOBGROUP in lsb.params or LSB_DEFAULT_JOBGROUP environment variable).

Job groups created when jobs are attached to an SLA service class at submission are implicit job groups (bsub -sla service_class_name -g job_group_name). Job groups attached to an SLA service class with bgadd are explicit job groups (bgadd -sla service_class_name job_group_name).

The GRP_ADD event in lsb.events indicates how the job group was created:

0x01 - job group was created explicitly

0x02 - job group was created implicitly

For example:
"GRP_ADD" "7.02" 1193032735 1285 1193032735 0 "/Z" "" "user1" "" "" 2 0 "" -1 1 
means job group /Z is an explicitly created job group.

Child groups can be created explicitly or implicitly under any job group.

Only an implicitly created job group which has no job group limit (bgadd -L) and is not attached to any SLA can be automatically deleted once it becomes empty. An empty job group is a job group that has no jobs associated with it (including finished jobs). NJOBS displayed by bjgroup is 0.

Job group hierarchy

Jobs in job groups are organized into a hierarchical tree similar to the directory structure of a file system. Like a file system, the tree contains groups (which are like directories) and jobs (which are like files). Each group can contain other groups or individual jobs. Job groups are created independently of jobs, and can have dependency conditions which control when jobs within the group are considered for scheduling.

Job group path

The job group path is the name and location of a job group within the job group hierarchy. Multiple levels of job groups can be defined to form a hierarchical tree. A job group can contain jobs and sub-groups.

Root job group

LSF maintains a single tree under which all jobs in the system are organized. The top-most level of the tree is represented by a top-level "root" job group, named "/". The root group is owned by the primary LSF Administrator and cannot be removed. Users and administrators create new groups under the root group. By default, if you do not specify a job group path name when submitting a job, the job is created under the top-level "root" job group, named "/".

The root job group is not displayed by job group query commands, and you cannot specify the root job in commands.

Job group owner

Each group is owned by the user who created it. The login name of the user who creates the job group is the job group owner. Users can add job groups into groups that are owned by other users, and they can submit jobs to groups owned by other users. Child job groups are owned by the creator of the job group and the creators of any parent groups.

Job control under job groups

Job owners can control their own jobs attached to job groups as usual. Job group owners can also control any job under the groups they own and below.

For example:

Job group /A is created by user1

Job group /A/B is created by user2

Job group /A/B/C is created by user3

All users can submit jobs to any job group, and control the jobs they own in all job groups. For jobs submitted by other users:

user1 can control jobs submitted by other users in all 3 job groups: /A, /A/B, and /A/B/C

user2 can control jobs submitted by other users only in 2 job groups: /A/B and /A/B/C

user3 can control jobs submitted by other users only in job group /A/B/C

The LSF administrator can control jobs in any job group.
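
The ownership rule in this example can be expressed as a walk up the job group path. The sketch below is a model of the rule as stated, not LSF code; ancestors, can_control, and the creators mapping are hypothetical.

```python
def ancestors(path):
    """Return the group path and every parent path, nearest first."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(len(parts), 0, -1)]


def can_control(user, job_owner, job_group, creators, admins=()):
    """Decide whether user may control a job attached to job_group.

    creators maps a job group path to the user who created it.
    """
    if user == job_owner or user in admins:
        return True
    # A group creator controls jobs in that group and in every group below it.
    return any(creators.get(group) == user for group in ancestors(job_group))


if __name__ == "__main__":
    creators = {"/A": "user1", "/A/B": "user2", "/A/B/C": "user3"}
    print(can_control("user2", "user9", "/A/B/C", creators))
    print(can_control("user2", "user9", "/A", creators))
```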

Default job group

You can specify a default job group for jobs submitted without explicitly specifying a job group. LSF associates the job with the job group specified with DEFAULT_JOBGROUP in lsb.params. The LSB_DEFAULT_JOBGROUP environment variable overrides the setting of DEFAULT_JOBGROUP. The bsub -g job_group_name option overrides both LSB_DEFAULT_JOBGROUP and DEFAULT_JOBGROUP.

Default job group specification supports macro substitution for project name (%p) and user name (%u). When you specify bsub -P project_name, the value of %p is the specified project name. If you do not specify a project name at job submission, %p is the project name defined by setting the environment variable LSB_DEFAULTPROJECT, or the project name specified by DEFAULT_PROJECT in lsb.params. The default project name is default.

For example, a default job group name specified by DEFAULT_JOBGROUP=/canada/%p/%u is expanded to the value for the LSF project name and the user name of the job submission user (for example, /canada/projects/user1).

Job group names must follow this format:

Job group names must start with a slash character (/). For example, DEFAULT_JOBGROUP=/A/B/C is correct, but DEFAULT_JOBGROUP=A/B/C is not correct.

Job group names cannot end with a slash character (/). For example, DEFAULT_JOBGROUP=/A/ is not correct.

Job group names cannot contain more than one slash character (/) in a row. For example, job group names like DEFAULT_JOBGROUP=/A//B or DEFAULT_JOBGROUP=A////B are not correct.

Job group names cannot contain spaces. For example, DEFAULT_JOBGROUP=/A/B C/D is not correct.

Project names and user names used for macro substitution with %p and %u cannot start or end with slash character (/).

Project names and user names used for macro substitution with %p and %u cannot contain spaces or more than one slash character (/) in a row.

Project names or user names containing slash character (/) will create separate job groups. For example, if the project name is canada/projects, DEFAULT_JOBGROUP=/%p results in a job group hierarchy /canada/projects.
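
The naming rules and the macro substitution can be checked with two small functions. This is a sketch of the rules as listed above, not the validation LSF itself performs; is_valid_jobgroup_name and expand_default_jobgroup are hypothetical names.

```python
def is_valid_jobgroup_name(name):
    """Check a job group name against the format rules listed above."""
    return (
        name.startswith("/")        # must start with a slash
        and not name.endswith("/")  # must not end with a slash
        and "//" not in name        # no consecutive slashes
        and " " not in name         # no spaces
    )


def expand_default_jobgroup(template, project, user):
    """Substitute %p and %u, rejecting values the rules above disallow."""
    for value in (project, user):
        if (value.startswith("/") or value.endswith("/")
                or "//" in value or " " in value):
            raise ValueError(f"invalid substitution value: {value!r}")
    return template.replace("%p", project).replace("%u", user)


if __name__ == "__main__":
    print(expand_default_jobgroup("/canada/%p/%u", "projects", "user1"))
    print(is_valid_jobgroup_name("/A//B"))
```

A project name that contains a single slash, such as canada/projects, passes the check and expands into two levels of the hierarchy.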

Job group limits

Job group limits specified with bgadd -L apply to the job group hierarchy. The job group limit is an integer greater than or equal to zero (0), specifying the maximum number of running and suspended jobs under the job group (including child groups). If limit is zero (0), no jobs under the job group can run.

By default, a job group has no limit. Limits persist across mbatchd restart and reconfiguration.

You cannot specify a limit for the root job group. The root job group has no job limit. Job groups added with no limits specified inherit any limits of existing parent job groups. The -L option only limits the lowest level job group created.

The maximum number of running and suspended jobs (including USUSP and SSUSP) in a job group cannot exceed the limit defined on the job group and its parent job group.

The job group limit is based on the number of running and suspended jobs in the job group. If you specify a job group limit as 2, at most 2 jobs can run under the group at any time, regardless of how many jobs or job slots are used. If the number of currently available job slots is zero (0), LSF cannot dispatch a job to the job group even if the job group limit is not exceeded.

The job group limit counts jobs, not job slots. A parallel job that requests 2 CPUs (bsub -n 2) counts as one job against the limit.

A job array may also be under a job group, so job arrays also support job group limits.

Job group limits are not supported at job submission for job groups created automatically with bsub -g. Use bgadd -L before job submission.

Jobs forwarded to the execution cluster in a MultiCluster environment are not counted towards the job group limit.

Examples
bgadd -L 6 /canada/projects/test 
If /canada is an existing job group, and /canada/projects and /canada/projects/test are new groups, only the job group /canada/projects/test is limited to 6 running and suspended jobs. Job group /canada/projects will have whatever limit is specified for its parent job group /canada. The limit of /canada does not change.

The limits on child job groups cannot exceed the parent job group limit. For example, if /canada/projects has a limit of 5:
bgadd -L 6 /canada/projects/test 
is rejected because /canada/projects/test attempts to increase the limit of its parent /canada/projects from 5 to 6.

Example job group hierarchy with limits

In this configuration:

Every node is a job group, including the root (/) job group

The root (/) job group cannot have any limit definition

By default, child groups have the same limit definition as their direct parent group, so /asia, /asia/projects, and /asia/projects/test all have no limit

The number of running and suspended jobs in a job group (including all of its child groups) cannot exceed the defined limit

If there are 7 running or suspended jobs in job group /canada/projects/test1, even though the job limit of group /canada/qa/auto is 6, /canada/qa/auto can only have a maximum of 5 running and suspended jobs (12-7=5)

When a job is submitted to a job group, LSF checks the limits for the entire job group. For example, for a job submitted to job group /canada/qa/auto, LSF checks the limits on groups /canada/qa/auto, /canada/qa and /canada. If any one limit in the branch of the hierarchy is exceeded, the job remains pending

The zero (0) job limit for job group /canada/qa/manual means no job in the job group can enter running status
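
The limit check along a branch of the hierarchy can be sketched as follows. The limits and job counts are illustrative values taken from the list above, and the limit of 12 on /canada is an assumption inferred from the 12-7=5 arithmetic; ancestors, jobs_under, and headroom are hypothetical names.

```python
def ancestors(path):
    """Return the group path and every parent path, nearest first."""
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(len(parts), 0, -1)]


def jobs_under(group, jobs):
    """Count running and suspended jobs in group and all of its child groups."""
    return sum(n for g, n in jobs.items()
               if g == group or g.startswith(group + "/"))


def headroom(group, limits, jobs):
    """Return how many more jobs may start in group, or None if unlimited."""
    room = [limits[g] - jobs_under(g, jobs)
            for g in ancestors(group) if g in limits]
    return max(0, min(room)) if room else None


if __name__ == "__main__":
    # Assumed limits: 12 on /canada is inferred, the others come from the text.
    limits = {"/canada": 12, "/canada/qa/auto": 6, "/canada/qa/manual": 0}
    jobs = {"/canada/projects/test1": 7}
    print(headroom("/canada/qa/auto", limits, jobs))
```

A job can be dispatched only while the headroom of its group is greater than zero; with a headroom of zero it remains pending.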

Create a job group

Use the bgadd command to create a new job group.

You must provide the full group path name for the new job group. The last component of the path is the name of the new group to be created:

bgadd /risk_group

The above example creates a job group named risk_group under the root group /.

bgadd /risk_group/portfolio1

The above example creates a job group named portfolio1 under job group /risk_group.

bgadd /risk_group/portfolio1/current

The above example creates a job group named current under job group /risk_group/portfolio1.

If the group hierarchy /risk_group/portfolio1/current does not exist, LSF checks its parent recursively, and if no groups in the hierarchy exist, all three job groups are created with the specified hierarchy.
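As a rough illustration of the recursive parent check described above, the following Python sketch lists which groups would have to be created for a given path. The function name and the set of existing groups are hypothetical, and the handling of a partially existing hierarchy is an assumption inferred from "checks its parent recursively".

```python
from typing import List, Set


def groups_to_create(path: str, existing: Set[str]) -> List[str]:
    """Return the missing groups along `path`, parents first."""
    parts = [p for p in path.split("/") if p]
    # Every prefix of the path is a group in the hierarchy.
    wanted = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return [g for g in wanted if g not in existing]


# No group exists yet, so all three levels are created.
print(groups_to_create("/risk_group/portfolio1/current", set()))
```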

Add a job group limit (bgadd)
Run bgadd -L limit /job_group_name to specify a job limit for a job group.
Here, limit is a number greater than or equal to zero (0) that specifies the maximum number of running and suspended jobs under the job group (including child groups). If limit is zero (0), no jobs under the job group can run.

For example:
bgadd -L 6 /canada/projects/test 
 
If /canada is an existing job group, and /canada/projects and /canada/projects/test are new groups, only the job group /canada/projects/test is limited to 6 running and suspended jobs. Job group /canada/projects will have whatever limit is specified for its parent job group /canada. The limit of /canada does not change.
Submit jobs under a job group
Use the -g option of bsub to submit a job into a job group.
The job group does not have to exist before submitting the job.
bsub -g /risk_group/portfolio1/current myjob 
Job <105> is submitted to default queue. 
 
Submits myjob to the job group /risk_group/portfolio1/current. 

 
If group /risk_group/portfolio1/current exists, job 105 is attached to the job group.

 
If group /risk_group/portfolio1/current does not exist, LSF checks its parent recursively, and if no groups in the hierarchy exist, all three job groups are created with the specified hierarchy and the job is attached to the group.
-g and -sla options

tip:

Use -sla with -g to attach all jobs in a job group to a service class and have them scheduled as SLA jobs. Multiple job groups can be created under the same SLA. You can submit additional jobs to the job group without specifying the service class name again.

MultiCluster

In MultiCluster job forwarding mode, job groups only apply on the submission cluster, not on the execution cluster. LSF treats the execution cluster as an execution engine, and only enforces job group policies at the submission cluster.

Jobs forwarded to the execution cluster in a MultiCluster environment are not counted towards job group limits.

View jobs in job groups

View job group information, and jobs running in specific job groups.

View information about job groups (bjgroup)
Use the bjgroup command to see information about jobs in job groups.
bjgroup 
GROUP_NAME         NJOBS   PEND    RUN    SSUSP  USUSP  FINISH  SLA   JLIMIT  OWNER 
/A                 0       0       0      0      0      0        ()    0/10  user1 
/X                 0       0       0      0      0      0        ()     0/-  user2 
/A/B               0       0       0      0      0      0        ()     0/5  user1 
/X/Y               0       0       0      0      0      0        ()     0/5  user2 
Use bjgroup -s to sort job groups by group hierarchy.
For example, for job groups named /A, /A/B, /X and /X/Y, bjgroup -s displays:
bjgroup -s 
GROUP_NAME         NJOBS   PEND    RUN    SSUSP  USUSP  FINISH  SLA   JLIMIT  OWNER 
/A                 0       0       0      0      0      0       ()       0/10  user1 
/A/B               0       0       0      0      0      0       ()       0/5  user1 
/X                 0       0       0      0      0      0       ()       0/-  user2 
/X/Y               0       0       0      0      0      0       ()       0/5  user2 
Specify a job group name to show the hierarchy of a single job group:
bjgroup -s /X 
GROUP_NAME   NJOBS  PEND   RUN   SSUSP  USUSP  FINISH       SLA   JLIMIT  OWNER 
/X              25     0    25       0      0       0   puccini  25/100   user1 
/X/Y            20     0    20       0      0       0   puccini   20/30   user1 
/X/Z             5     0     5       0      0       0   puccini    5/10   user2 
Specify a job group name with a trailing slash character (/) to show only the root job group:
bjgroup -s /X/ 
GROUP_NAME   NJOBS  PEND   RUN   SSUSP  USUSP  FINISH      SLA   JLIMIT  OWNER 
/X               25    0    25       0      0       0   puccini  25/100  user1 
Use bjgroup -N to display job group information by job slots instead of number of jobs. NSLOTS, PEND, RUN, SSUSP, USUSP, RSV are all counted in slots rather than number of jobs:
bjgroup -N 
GROUP_NAME NSLOTS PEND   RUN   SSUSP  USUSP   RSV      SLA     OWNER 
/X             25    0    25       0      0     0  puccini     user1 
/A/B           20    0    20       0      0     0   wagner     batch 
 
-N by itself shows job slot info for all job groups, and can combine with -s to sort the job groups by hierarchy:

bjgroup -N -s 
GROUP_NAME NSLOTS PEND   RUN   SSUSP   USUSP  RSV      SLA     OWNER 
/A              0    0     0       0       0    0   wagner      batch 
/A/B            0    0     0       0       0    0   wagner      user1 
/X             25    0    25       0       0    0   puccini     user1 
/X/Y           20    0    20       0       0    0   puccini     batch 
/X/Z            5     0    5       0       0    0   puccini     batch 
View jobs for a specific job group (bjobs)
Run bjobs -g and specify a job group path to view jobs attached to the specified group.
bjobs -g /risk_group
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
113     user1   PEND  normal     hostA                   myjob     Jun 17 16:15
111     user2   RUN   normal     hostA       hostA       myjob     Jun 14 15:13
110     user1   RUN   normal     hostB       hostA       myjob     Jun 12 05:03
104     user3   RUN   normal     hostA       hostC       myjob     Jun 11 13:18 
 
bjobs -l displays the full path to the group to which a job is attached:

bjobs -l -g /risk_group

Job <101>, User <user1>, Project <default>, Job Group 
</risk_group>, Status <RUN>, Queue <normal>, Command <myjob>
Tue Jun 17 16:21:49: Submitted from host <hostA>, CWD 
</home/user1>;
Tue Jun 17 16:22:01: Started on <hostA>;
... 
Control jobs in job groups

Suspend and resume jobs in job groups, move jobs to different job groups, terminate jobs in job groups, and delete job groups.

Suspend jobs (bstop)
Use the -g option of bstop and specify a job group path to suspend jobs in a job group:
bstop -g /risk_group 106
Job <106> is being stopped 
Use job ID 0 (zero) to suspend all jobs in a job group:
bstop -g /risk_group/consolidate 0
Job <107> is being stopped
Job <108> is being stopped
Job <109> is being stopped 
Resume suspended jobs (bresume)
Use the -g option of bresume and specify a job group path to resume suspended jobs in a job group:
bresume -g /risk_group 106
Job <106> is being resumed 
Use job ID 0 (zero) to resume all jobs in a job group:
bresume -g /risk_group 0
Job <109> is being resumed
Job <110> is being resumed
Job <112> is being resumed 
Move jobs to a different job group (bmod)
Use the -g option of bmod and specify a job group path to move a job or a job array from one job group to another.
bmod -g /risk_group/portfolio2/monthly 105 
 
This command moves job 105 to job group /risk_group/portfolio2/monthly.

 
Like bsub -g, if the job group does not exist, LSF creates it.

 
bmod -g cannot be combined with other bmod options. It can only operate on pending jobs. It cannot operate on running or finished jobs.

 
You can modify your own job groups and job groups that other users create under your job groups. The LSF administrator can modify job groups of all users.

 
You cannot move job array elements from one job group to another, only entire job arrays. If any job array elements in a job array are running, you cannot move the job array to another group. A job array can only belong to one job group at a time.

 
You cannot modify the job group of a job attached to a service class.

 
bhist -l shows job group modification information:

bhist -l 105

Job <105>, User <user1>, Project <default>, Job Group </risk_group>, Command <myjob>
                     
Wed May 14 15:24:07: Submitted from host <hostA>, to Queue <normal>, CWD
<$HOME/lsf51/5.1/sparc-sol7-64/bin>;
Wed May 14 15:24:10: Parameters of Job are changed:
                         Job group changes to: /risk_group/portfolio2/monthly;
Wed May 14 15:24:17: Dispatched to <hostA>;
Wed May 14 15:24:17: Starting (Pid 8602);
... 
Terminate jobs (bkill)
Use the -g option of bkill and specify a job group path to terminate jobs in a job group.
bkill -g /risk_group 106
Job <106> is being terminated 
Use job ID 0 (zero) to terminate all jobs in a job group:
bkill -g /risk_group 0
Job <1413> is being terminated
Job <1414> is being terminated
Job <1415> is being terminated
Job <1416> is being terminated 
 
bkill only kills jobs in the job group you specify. It does not kill jobs in lower level job groups in the path. For example, jobs are attached to job groups /risk_group and /risk_group/consolidate:

bsub -g /risk_group  myjob
Job <115> is submitted to default queue <normal>. 
bsub -g /risk_group/consolidate myjob2
Job <116> is submitted to default queue <normal>. 
 
The following bkill command only kills jobs in /risk_group, not the subgroup /risk_group/consolidate:

bkill -g /risk_group 0
Job <115> is being terminated 
 
To kill jobs in /risk_group/consolidate, specify the path to the consolidate job group explicitly:

bkill -g /risk_group/consolidate 0
Job <116> is being terminated 
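The scoping rule can be pictured with a small Python sketch: a job matches only when it is attached to exactly the group named, never to a subgroup. The function name and the job-to-group mapping are hypothetical and simply mirror the two jobs submitted above.

```python
from typing import Dict, List


def jobs_selected(group: str, job_groups: Dict[int, str]) -> List[int]:
    """Job IDs attached directly to `group`; jobs in subgroups are not matched."""
    target = group.rstrip("/") or "/"
    return sorted(job for job, g in job_groups.items() if g == target)


job_groups = {115: "/risk_group", 116: "/risk_group/consolidate"}

print(jobs_selected("/risk_group", job_groups))              # [115]
print(jobs_selected("/risk_group/consolidate", job_groups))  # [116]
```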
Delete a job group manually (bgdel)
Use the bgdel command to manually remove a job group. The job group cannot contain any jobs.
bgdel /risk_group
Job group /risk_group is deleted. 
 
This command deletes the job group /risk_group and all its subgroups.

 
Normal users can only delete the empty groups they own that are specified by the requested job_group_name. These groups can be explicit or implicit. 
Run bgdel 0 to delete all empty job groups you own. These groups can be explicit or implicit.

LSF administrators can use bgdel -u user_name 0 to delete all empty job groups created by specific users. These groups can be explicit or implicit.

Run bgdel -u all 0 to delete all the users' empty job groups and their sub groups. LSF administrators can delete empty job groups created by any user. These groups can be explicit or implicit.

Run bgdel -c job_group_name to delete all empty groups below the requested job_group_name including job_group_name itself.
Modify a job group limit (bgmod)
Run bgmod to change a job group limit.
bgmod [-L limit | -Ln] /job_group_name 
 
-L limit changes the limit of job_group_name to the specified value. If the job group has parent job groups, the new limit cannot exceed the limits of any higher level job groups. Similarly, if the job group has child job groups, the new value must be greater than any limits on the lower level job groups.

 
-Ln removes the existing job limit for the job group. If the job group has parent job groups, the modified job group automatically inherits any limits from its direct parent job group.

 
You must provide the full group path name for the modified job group. The last component of the path is the name of the job group to be modified.

 
Only root, LSF administrators, the job group creator, or the creator of the parent job groups can use bgmod to modify a job group limit.

 
The following command only modifies the limit of group /canada/projects/test1. It does not modify limits of /canada or /canada/projects.

bgmod -L 6 /canada/projects/test1 
 
To modify limits of /canada or /canada/projects, you must specify the exact group name:

bgmod -L 6 /canada 
 
or 

bgmod -L 6 /canada/projects 
Automatic job group cleanup

When an implicitly created job group becomes empty, it can be automatically deleted by LSF. Job groups that can be automatically deleted cannot:

Have limits specified including their child groups

Have explicitly created child job groups

Be attached to any SLA

Configure JOB_GROUP_CLEAN=Y in lsb.params to enable automatic job group deletion.

For example, for the following job groups:

When automatic job group deletion is enabled, LSF only deletes job groups /X/Y/Z/W and /X/Y/Z. Job group /X/Y is not deleted because it is an explicitly created job group. Job group /X is also not deleted because it has an explicitly created child job group /X/Y.

Automatic job group deletion does not delete job groups attached to SLA service classes. Use bgdel to manually delete job groups attached to SLAs.
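The cleanup conditions can be sketched in Python as follows. This is a model of the rules as stated in this section, not LSF's implementation: the Group fields and function names are hypothetical, and the sketch assumes that "child job groups" covers every group below the one being checked. The sample hierarchy follows the example described above, assuming /X, /X/Y/Z and /X/Y/Z/W were created implicitly.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Group:
    explicit: bool = False   # True if created with bgadd, False if created implicitly
    has_limit: bool = False  # True if a job limit is defined on the group
    sla: str = ""            # attached service class name, empty if none
    jobs: int = 0            # jobs attached directly to this group


def descendants(path: str, groups: Dict[str, Group]) -> List[str]:
    """All groups below `path` in the hierarchy."""
    prefix = path.rstrip("/") + "/"
    return [p for p in groups if p.startswith(prefix)]


def auto_deletable(path: str, groups: Dict[str, Group]) -> bool:
    """Apply the conditions listed above to one job group."""
    group = groups[path]
    below = descendants(path, groups)
    family = [path] + below
    if group.explicit or group.sla:
        return False  # only implicit groups without an SLA qualify
    if any(groups[p].jobs for p in family):
        return False  # the group must be empty
    if any(groups[p].has_limit for p in family):
        return False  # no limits on the group or its child groups
    return not any(groups[p].explicit for p in below)


groups = {
    "/X": Group(),
    "/X/Y": Group(explicit=True),
    "/X/Y/Z": Group(),
    "/X/Y/Z/W": Group(),
}
print(sorted(p for p in groups if auto_deletable(p, groups)))
# ['/X/Y/Z', '/X/Y/Z/W']
```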

Handling Job Exceptions

You can configure hosts and queues so that LSF detects exceptional conditions while jobs are running, and takes appropriate action automatically. You can customize what exceptions are detected and their corresponding actions. By default, LSF does not detect any exceptions.

Run bjobs -d -m host_name to see exited jobs for a particular host.

Job exceptions LSF can detect

If you configure job exception handling in your queues, LSF detects the following job exceptions:

Job underrun - job ends too soon (run time is less than expected). Underrun jobs are detected when a job exits abnormally.

Job overrun - job runs too long (run time is longer than expected). By default, LSF checks for overrun jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for job overrun.

Job estimated run time exceeded - the job's actual run time has exceeded the estimated run time.

Idle job - running job consumes less CPU time than expected (in terms of CPU time/runtime). By default, LSF checks for idle jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for idle jobs.
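To make the four conditions concrete, here is a minimal Python sketch that classifies a single job. The function and parameter names are hypothetical stand-ins for whatever thresholds are configured; a threshold left as None is not checked, mirroring the statement that LSF detects no exceptions by default. The idle test uses the CPU time/run time ratio mentioned above.

```python
from typing import List, Optional


def job_exceptions(run_min: float, cpu_min: float, exited_abnormally: bool, *,
                   underrun_min: Optional[float] = None,
                   overrun_min: Optional[float] = None,
                   estimate_min: Optional[float] = None,
                   idle_factor: Optional[float] = None) -> List[str]:
    """Return the exception labels that apply to one job (times in minutes)."""
    found = []
    if exited_abnormally:
        # Underrun is only evaluated for jobs that exited abnormally.
        if underrun_min is not None and run_min < underrun_min:
            found.append("underrun")
        return found
    if overrun_min is not None and run_min > overrun_min:
        found.append("overrun")
    if estimate_min is not None and run_min > estimate_min:
        found.append("estimate_exceeded")
    if idle_factor is not None and run_min > 0 and cpu_min / run_min < idle_factor:
        found.append("idle")
    return found


# A job that has run 90 minutes but used only 1 minute of CPU time.
print(job_exceptions(90, 1, False, overrun_min=60, idle_factor=0.1))
# ['overrun', 'idle']
```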

Host exceptions LSF can detect

If you configure host exception handling, LSF can detect jobs that exit repeatedly on a host. The host can still be available to accept jobs, but some other problem prevents the jobs from running. Typically jobs dispatched to such "black hole", or "job-eating" hosts exit abnormally. By default, LSF monitors the job exit rate for hosts, and closes the host if the rate exceeds a threshold you configure (EXIT_RATE in lsb.hosts).

If EXIT_RATE is not specified for the host, LSF invokes eadmin if the job exit rate for a host remains above the configured threshold for longer than 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the job exit rate.

Use GLOBAL_EXIT_RATE in lsb.params to set a cluster-wide threshold in minutes for exited jobs. If EXIT_RATE is not specified for the host in lsb.hosts, GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster. Host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.

Customize job exception actions with the eadmin script

When an exception is detected, LSF takes appropriate action by running the script LSF_SERVERDIR/eadmin on the master host.

You can customize eadmin to suit the requirements of your site. For example, eadmin could find out the owner of the problem jobs and use bstop -u to stop all jobs that belong to the user.

In some environments, a job running 1 hour would be an overrun job, while this may be a normal job in other environments. If your configuration considers jobs running longer than 1 hour to be overrun jobs, you may want to close the queue when LSF detects a job that has run longer than 1 hour and invokes eadmin.

Email job exception details

Set LSF to send you an email about job exceptions with details including JOB_ID, RUN_TIME, IDLE_FACTOR (if job has been idle), USER, QUEUE, EXEC_HOST, and JOB_NAME.

In lsb.params, set EXTEND_JOB_EXCEPTION_NOTIFY=Y.

Set the format option in the eadmin script (LSF_SERVERDIR/eadmin on the master host).

Uncomment the JOB_EXCEPTION_EMAIL_FORMAT line and add a value for the format:

JOB_EXCEPTION_EMAIL_FORMAT=fixed: The eadmin shell generates an exception email with a fixed length for the job exception information. For any given field, the characters truncate when the maximum is reached (between 10-19).

JOB_EXCEPTION_EMAIL_FORMAT=full: The eadmin shell generates an exception email without a fixed length for the job exception information.

Default eadmin actions

For host-level exceptions, LSF closes the host and sends email to the LSF administrator. The email contains the host name, job exit rate for the host, and other host information. The message eadmin: JOB EXIT THRESHOLD EXCEEDED is attached to the closed host event in lsb.events, and displayed by badmin hist and badmin hhist.

For job exceptions, LSF sends email to the LSF administrator. The email contains the job ID, exception type (overrun, underrun, idle job), and other job information.

An email is sent for all detected job exceptions according to the frequency configured by EADMIN_TRIGGER_DURATION in lsb.params. For example, if EADMIN_TRIGGER_DURATION is set to 5 minutes, and 1 overrun job and 2 idle jobs are detected, after 5 minutes, eadmin is invoked and only one email is sent. If another overrun job is detected in the next 5 minutes, another email is sent.
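The batching behavior in the example above can be sketched in Python. This is only a model: the function name is hypothetical, detection times are given in minutes, and the sketch assumes trigger windows are aligned from minute zero, which the text does not specify.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def emails_sent(detections: List[Tuple[float, str]],
                trigger_minutes: float = 1) -> List[List[str]]:
    """Group detected exceptions into one email per trigger window."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for minute, kind in detections:
        batches[int(minute // trigger_minutes)].append(kind)
    return [batches[window] for window in sorted(batches)]


# One overrun and two idle jobs in the first 5 minutes, one overrun in the next 5.
detected = [(1, "overrun"), (2, "idle"), (4, "idle"), (7, "overrun")]
print(emails_sent(detected, trigger_minutes=5))
# [['overrun', 'idle', 'idle'], ['overrun']]
```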

Handling job initialization failures

By default, LSF handles job exceptions for jobs that exit after they have started running. You can also configure LSF to handle jobs that exit during initialization because of an execution environment problem, or because of a user action or LSF policy.

LSF detects that the jobs are exiting before they actually start running, and takes appropriate action when the job exit rate exceeds the threshold for specific hosts (EXIT_RATE in lsb.hosts) or for all hosts (GLOBAL_EXIT_RATE in lsb.params).

Use EXIT_RATE_TYPE in lsb.params to include job initialization failures in the exit rate calculation. The following table summarizes the exit rate types you can configure:

Table 1: Exit rate types you can configure

Exit rate type	Includes
JOBEXIT	Local exited jobs; remote job initialization failures; parallel job initialization failures on hosts other than the first execution host; jobs exited by user action (e.g., bkill, bstop, etc.) or LSF policy (e.g., load threshold exceeded, job control action, advance reservation expired, etc.)
JOBEXIT_NONLSF (the default when EXIT_RATE_TYPE is not set)	Local exited jobs; remote job initialization failures; parallel job initialization failures on hosts other than the first execution host
JOBINIT	Local job initialization failures; parallel job initialization failures on the first execution host
HPCINIT	Job initialization failures for Platform LSF HPC jobs

Job exits excluded from exit rate calculation

By default, jobs that are exited for non-host related reasons (user actions and LSF policies) are not counted in the exit rate calculation. Only jobs that are exited for what LSF considers host-related problems are used to calculate a host exit rate.

The following cases are not included in the exit rate calculations:

bkill, bkill -r

brequeue

RERUNNABLE jobs killed when a host is unavailable

Resource usage limit exceeded (for example, PROCESSLIMIT, CPULIMIT, etc.)

Queue-level job control action TERMINATE and TERMINATE_WHEN

Checkpointing a job with the kill option (bchkpnt -k)

Rerunnable job migration

Job killed when an advance reservation has expired

Remote lease job start fails

Any jobs with an exit code found in SUCCESS_EXIT_VALUES, where a particular exit value is deemed as successful.

Excluding LSF and user-related job exits

To explicitly exclude jobs exited because of user actions or LSF-related policies from the job exit calculation, set EXIT_RATE_TYPE = JOBEXIT_NONLSF in lsb.params. JOBEXIT_NONLSF tells LSF to include all job exits except those that are related to user action or LSF policy. This is the default value for EXIT_RATE_TYPE.

To include all job exit cases in the exit rate count, you must set EXIT_RATE_TYPE = JOBEXIT in lsb.params. JOBEXIT considers all job exits.

Jobs killed by signal external to LSF will still be counted towards exit rate

Jobs killed because of job control SUSPEND action and RESUME action are still counted towards the exit rate. This is because LSF cannot distinguish between jobs killed from SUSPEND action and jobs killed by external signals.

If both JOBEXIT and JOBEXIT_NONLSF are defined, JOBEXIT_NONLSF is used.
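The exit rate types and the precedence rule above can be summarized as a lookup in Python. The category labels (such as "user_or_policy_exit") are hypothetical names invented for this sketch; only the EXIT_RATE_TYPE keywords come from the text. Combining types with a space follows the form EXIT_RATE_TYPE=JOBEXIT_NONLSF JOBINIT used elsewhere in this section.

```python
from typing import Set

# Which exit categories each EXIT_RATE_TYPE keyword counts (labels are illustrative).
COUNTED = {
    "JOBEXIT": {"local_exit", "remote_init_failure",
                "parallel_init_failure_other_hosts", "user_or_policy_exit"},
    "JOBEXIT_NONLSF": {"local_exit", "remote_init_failure",
                       "parallel_init_failure_other_hosts"},
    "JOBINIT": {"local_init_failure", "parallel_init_failure_first_host"},
    "HPCINIT": {"hpc_init_failure"},
}


def counted_categories(exit_rate_type: str = "") -> Set[str]:
    """Return the exit categories included in the exit rate calculation."""
    types = exit_rate_type.split() or ["JOBEXIT_NONLSF"]  # default when not set
    if "JOBEXIT" in types and "JOBEXIT_NONLSF" in types:
        # When both are defined, JOBEXIT_NONLSF is used.
        types = [t for t in types if t != "JOBEXIT"]
    counted: Set[str] = set()
    for t in types:
        counted |= COUNTED[t]
    return counted


print("user_or_policy_exit" in counted_categories())           # False
print("user_or_policy_exit" in counted_categories("JOBEXIT"))  # True
```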

Local jobs

When EXIT_RATE_TYPE=JOBINIT, various job initialization failures are included in the exit rate calculation, including:

Host-related failures; for example, incorrect user account, user permissions, incorrect directories for checkpointable jobs, host name resolution failed, or other execution environment problems

Job-related failures; for example, pre-execution or setup problem, job file not created, etc.

Parallel jobs

By default, or when EXIT_RATE_TYPE=JOBEXIT_NONLSF, job initialization failures on the first execution host do not count in the job exit rate calculation. Job initialization failures on hosts other than the first execution host are counted in the exit rate calculation.

When EXIT_RATE_TYPE=JOBINIT, job initialization failures on the first execution host are counted in the job exit rate calculation. Job initialization failures on hosts other than the first execution host are not counted in the exit rate calculation.

tip:

For parallel job exit exceptions to be counted for all hosts, specify EXIT_RATE_TYPE=HPCINIT or EXIT_RATE_TYPE=JOBEXIT_NONLSF JOBINIT.

Remote jobs

By default, or when EXIT_RATE_TYPE=JOBEXIT_NONLSF, job initialization failures are counted as exited jobs on the remote execution host and are included in the exit rate calculation for that host. To include only local job initialization failures on the execution cluster in the exit rate calculation, set EXIT_RATE_TYPE to include only JOBINIT or HPCINIT.

Scaling and tuning job exit rate by number of slots

On large, multiprocessor hosts, use ENABLE_EXIT_RATE_PER_SLOT=Y in lsb.params to scale the job exit rate so that the host is only closed when the job exit rate is high enough in proportion to the number of processors on the host. This avoids having a relatively low exit rate close a host inappropriately.

Use a float value for GLOBAL_EXIT_RATE in lsb.params to tune the exit rate on multislot hosts. The actual calculated exit rate value is never less than 1.

Example: exit rate of 5 on single processor and multiprocessor hosts

On a single-processor host, a job exit rate of 5 is much more severe than on a 20-processor host. If a stream of jobs to a single-processor host is consistently failing, it is reasonable to close the host or take some other action after 5 failures.

On the other hand, for the same stream of jobs on a 20-processor host, it is possible that 19 of the processors are busy doing other work that is running fine. To close this host after only 5 failures would be wrong because effectively less than 5% of the jobs on that host are actually failing.

Example: float value for GLOBAL_EXIT_RATE on multislot hosts

Using a float value for GLOBAL_EXIT_RATE allows the exit rate to be less than the number of slots on the host. For example, on a host with 4 slots, GLOBAL_EXIT_RATE=0.25 gives an exit rate of 1. The same value on an 8 slot machine would be 2 and so on. On a single-slot host, the value is never less than 1.
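The arithmetic in this example can be written as a one-line Python function. It is a sketch under the assumption that the per-slot threshold is the configured value multiplied by the slot count, floored at 1; this matches the three figures quoted above, but any rounding LSF applies is not stated in the text. The function name is hypothetical.

```python
def scaled_exit_rate(global_exit_rate: float, slots: int) -> float:
    """Per-host exit rate threshold when scaling by slots is enabled."""
    # The calculated value is never less than 1.
    return max(1.0, global_exit_rate * slots)


for slots in (1, 4, 8):
    print(slots, scaled_exit_rate(0.25, slots))
```

For GLOBAL_EXIT_RATE=0.25 this gives 1 on a single-slot host, 1 on a 4-slot host, and 2 on an 8-slot host.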

For more information

See Handling Host-level Job Exceptions for information about configuring host-level job exceptions.

See Handling Job Exceptions in Queues for information about configuring job exceptions in queues.
