Managing Jobs
Contents
- Understanding Job States
- View Job Information
- Changing Job Order Within Queues
- Switch Jobs from One Queue to Another
- Forcing Job Execution
- Suspending and Resuming Jobs
- Killing Jobs
- Sending a Signal to a Job
- Using Job Groups
- Handling Job Exceptions
Understanding Job States
The bjobs command displays the current state of the job.
Normal job states
Most jobs enter only three states:
| Job state | Description |
| --- | --- |
| PEND | Waiting in a queue for scheduling and dispatch |
| RUN | Dispatched to a host and running |
| DONE | Finished normally with a zero exit value |
Suspended job states
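If a job is suspended, it is in one of three states:
| Job state | Description |
| --- | --- |
| PSUSP | Suspended by its owner or the LSF administrator while pending |
| USUSP | Suspended by its owner or the LSF administrator after being dispatched |
| SSUSP | Suspended by LSF after being dispatched |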
State transitions
A job goes through a series of state transitions until it eventually completes its task, fails, or is terminated. The possible states of a job during its life cycle are shown in the diagram.
![]()
Pending jobs
A job remains pending until all conditions for its execution are met. Some of the conditions are:
- Start time specified by the user when the job is submitted
- Load conditions on qualified hosts
- Dispatch windows during which the queue can dispatch and qualified hosts can accept jobs
- Run windows during which jobs from the queue can run
- Limits on the number of job slots configured for a queue, a host, or a user
- Relative priority to other users and jobs
- Availability of the specified resources
- Job dependency and pre-execution conditions
Maximum pending job threshold
If the user or user group submitting the job has reached the pending job threshold specified by MAX_PEND_JOBS (either in the User section of lsb.users, or cluster-wide in lsb.params), LSF rejects any further job submission requests sent by that user or user group. The system continues to send the job submission requests at the interval specified by SUB_TRY_INTERVAL in lsb.params until it has made a number of attempts equal to the LSB_NTRIES environment variable. If LSB_NTRIES is undefined and LSF rejects the job submission request, the system continues to send the job submission requests indefinitely by default.
Suspended jobs
A job can be suspended at any time. A job can be suspended by its owner, by the LSF administrator, by the root user (superuser), or by LSF.
After a job has been dispatched and started on a host, it can be suspended by LSF. When a job is running, LSF periodically checks the load level on the execution host. If any load index is beyond either its per-host or its per-queue suspending conditions, the lowest priority batch job on that host is suspended.
If the load on the execution host or hosts becomes too high, batch jobs could be interfering among themselves or could be interfering with interactive jobs. In either case, some jobs should be suspended to maximize host performance or to guarantee interactive response time.
LSF suspends jobs according to the priority of the job's queue. When a host is busy, LSF suspends lower priority jobs first unless the scheduling policy associated with the job dictates otherwise.
Jobs are also suspended by the system if the job queue has a run window and the current time goes outside the run window.
A system-suspended job can later be resumed by LSF if the load condition on the execution hosts falls low enough or when the closed run window of the queue opens again.
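The load-based suspension rule described above can be sketched in Python. This is a simplified illustration, not LSF's scheduler: RunningJob, pick_job_to_suspend, and the dictionaries of load values are hypothetical names, and the sketch assumes that a higher load value always means more load and that a larger queue_priority number means a higher-priority queue.

```python
from dataclasses import dataclass


@dataclass
class RunningJob:
    job_id: int
    queue_priority: int  # assumption: larger number = higher-priority queue


def pick_job_to_suspend(load, load_stop, jobs):
    """Return the job to suspend, or None if no suspending threshold is exceeded.

    load:      current load indices on the host, e.g. {"r1m": 3.5}
    load_stop: suspending thresholds for the same indices (hypothetical values)
    jobs:      batch jobs currently running on the host
    """
    exceeded = any(
        name in load and load[name] > limit
        for name, limit in load_stop.items()
    )
    if not exceeded or not jobs:
        return None
    # The lowest-priority job on the host is the one that gets suspended.
    return min(jobs, key=lambda job: job.queue_priority)


jobs = [RunningJob(101, 40), RunningJob(102, 20), RunningJob(103, 30)]
victim = pick_job_to_suspend({"r1m": 3.5}, {"r1m": 2.0}, jobs)
print(victim.job_id if victim else "nothing to suspend")
```

Real load indices do not all move in the same direction (for some, a lower value indicates a busier host), so a faithful model would need a per-index comparison.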
WAIT state (chunk jobs)
If you have configured chunk job queues, members of a chunk job that are waiting to run are displayed as WAIT by bjobs. Any jobs in WAIT status are included in the count of pending jobs by bqueues and busers, even though the entire chunk job has been dispatched and occupies a job slot. The bhosts command shows the single job slot occupied by the entire chunk job in the number of jobs shown in the NJOBS column.
You can switch (bswitch) or migrate (bmig) a chunk job member in WAIT state to another queue.
See Chapter 32, "Chunk Job Dispatch" for more information about chunk jobs.
Exited jobs
An exited job ended with a non-zero exit status.
A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:
- The job is cancelled by its owner or the LSF administrator while pending, or after being dispatched to a host.
- The job is not able to be dispatched before it reaches its termination deadline set by bsub -t, and thus is terminated by LSF.
- The job fails to start successfully. For example, the wrong executable is specified by the user when the job is submitted.
- The application exits with a non-zero exit code.
You can configure hosts so that LSF detects an abnormally high rate of job exit from a host. See Handling Host-level Job Exceptions for more information.
Post-execution states
Some jobs may not be considered complete until some post-job processing is performed. For example, a job may need to exit from a post-execution job script, clean up job files, or transfer job output after the job completes.
The DONE or EXIT job states do not indicate whether post-processing is complete, so jobs that depend on post-processing may start prematurely. Use the post_done and post_err keywords on the bsub -w command to specify job dependency conditions for job post-processing. The corresponding job states POST_DONE and POST_ERR indicate the state of the post-processing.
After the job completes, you cannot perform any job control on the post-processing. Post-processing exit codes are not reported to LSF.
See Chapter 38, "Pre-Execution and Post-Execution Commands" for more information.
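As an illustration, a follow-up job can be made to wait for post-processing by putting a post_done or post_err condition in the bsub -w dependency expression. The Python helper below only builds the command line and does not run it; the function name, the job ID 312, and the script name archive_results.sh are hypothetical.

```python
import shlex


def bsub_after_post_processing(job_id, command, on_error=False):
    """Build a bsub command that waits for another job's post-processing.

    on_error=False waits for the POST_DONE state of job_id;
    on_error=True waits for the POST_ERR state instead.
    """
    keyword = "post_err" if on_error else "post_done"
    return ["bsub", "-w", f"{keyword}({job_id})", *command]


argv = bsub_after_post_processing(312, ["./archive_results.sh"])
# shlex.join quotes the dependency expression so a shell passes it through intact.
print(shlex.join(argv))
```

Returning an argument list rather than a single string keeps the dependency expression as one argument, which avoids shell quoting mistakes if the list is later handed to a process launcher.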
View Job Information
The bjobs command is used to display job information. By default, bjobs displays information for the user who invoked the command. For more information about bjobs, see the LSF Reference and the bjobs(1) man page.
View all jobs for all users
- Run bjobs -u all to display all jobs for all users.
Job information is displayed in the following order:
- Running jobs
- Pending jobs in the order in which they are scheduled
- Jobs in high-priority queues are listed before those in lower-priority queues
For example:
    bjobs -u all
    JOBID USER  STAT  QUEUE    FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
    1004  user1 RUN   short    hostA     hostA     job0     Dec 16 09:23
    1235  user3 PEND  priority hostM               job1     Dec 11 13:55
    1234  user2 SSUSP normal   hostD     hostM     job3     Dec 11 10:09
    1250  user1 PEND  short    hostA               job4     Dec 11 13:59
View jobs for specific users
- Run bjobs -u user_name to display jobs for a specific user:
    bjobs -u user1
    JOBID USER  STAT  QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
    2225  user1 USUSP normal hostA               job1     Nov 16 11:55
    2226  user1 PSUSP normal hostA               job2     Nov 16 12:30
    2227  user1 PSUSP normal hostA               job3     Nov 16 12:31
View running jobs
View done jobs
View pending job information
- Run bjobs -p to display the reason why a job is pending.
- Run busers -w all to see the maximum pending job threshold for all users.
View suspension reasons
View chunk job wait status and wait reason
- Run bhist -l to display jobs in WAIT status. Jobs are shown as Waiting ...
The bjobs -l command does not display a WAIT reason in the list of pending jobs.
View post-execution states
- Run bhist to display the POST_DONE and POST_ERR states.
The resource usage of post-processing is not included in the job resource usage.
View exception status for jobs (bjobs)
- Run bjobs to display job exceptions. bjobs -l shows exception information for unfinished jobs, and bjobs -x -l shows finished as well as unfinished jobs.
For example, the following bjobs command shows that job 2 is running longer than the configured JOB_OVERRUN threshold, and is consuming no CPU time. bjobs displays the job idle factor, and both job overrun and job idle exceptions. Job 1 finished before the configured JOB_UNDERRUN threshold, so bjobs shows exception status of underrun:
    bjobs -x -l -a
    Job <2>, User <user1>, Project <default>, Status <RUN>, Queue <normal>, Command <sleep 600>
    Wed Aug 13 14:23:35: Submitted from host <hostA>, CWD <$HOME>, Output File </dev/null>, Specified Hosts <hostB>;
    Wed Aug 13 14:23:43: Started on <hostB>, Execution Home </home/user1>, Execution CWD </home/user1>;
    Resource usage collected.
    IDLE_FACTOR(cputime/runtime): 0.00
    MEM: 3 Mbytes; SWAP: 4 Mbytes; NTHREAD: 3
    PGID: 5027; PIDs: 5027 5028 5029

    SCHEDULING PARAMETERS:
              r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched   -    -    -    -   -   -   -   -    -    -    -
    loadStop    -    -    -    -   -   -   -   -    -    -    -

              cpuspeed  bandwidth
    loadSched     -         -
    loadStop      -         -

    EXCEPTION STATUS: overrun idle
    ------------------------------------------------------------------------------
    Job <1>, User <user1>, Project <default>, Status <DONE>, Queue <normal>, Command <sleep 20>
    Wed Aug 13 14:18:00: Submitted from host <hostA>, CWD <$HOME>, Output File </dev/null>, Specified Hosts <hostB>;
    Wed Aug 13 14:18:10: Started on <hostB>, Execution Home </home/user1>, Execution CWD </home/user1>;
    Wed Aug 13 14:18:50: Done successfully. The CPU time used is 0.2 seconds.

    SCHEDULING PARAMETERS:
              r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
    loadSched   -    -    -    -   -   -   -   -    -    -    -
    loadStop    -    -    -    -   -   -   -   -    -    -    -

              cpuspeed  bandwidth
    loadSched     -         -
    loadStop      -         -

    EXCEPTION STATUS: underrun
Use bacct -l -x to trace the history of job exceptions.
Changing Job Order Within Queues
By default, LSF dispatches jobs in a queue in the order of arrival (that is, first-come, first-served), subject to availability of suitable server hosts.
Use the btop and bbot commands to change the position of pending jobs, or of pending job array elements, to affect the order in which jobs are considered for dispatch. Users can only change the relative position of their own jobs, and LSF administrators can change the position of any user's jobs.
bbot
Moves jobs relative to your last job in the queue.
If invoked by a regular user, bbot moves the selected job after the last job with the same priority submitted by the user to the queue.
If invoked by the LSF administrator, bbot moves the selected job after the last job with the same priority submitted to the queue.
btop
Moves jobs relative to your first job in the queue.
If invoked by a regular user, btop moves the selected job before the first job with the same priority submitted by the user to the queue.
If invoked by the LSF administrator, btop moves the selected job before the first job with the same priority submitted to the queue.
Moving a job to the top of the queue
In the following example, job 5311 is moved to the top of the queue. Since job 5308 is already running, job 5311 is placed in the queue after job 5308.
Note that user1's job is still in the same position on the queue. user2 cannot use btop to get extra jobs at the top of the queue; when one of user2's jobs moves up the queue, the rest of user2's jobs move down.
    bjobs -u all
    JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
    5308  user2 RUN  normal hostA     hostD     /s500    Oct 23 10:16
    5309  user2 PEND night  hostA               /s200    Oct 23 11:04
    5310  user1 PEND night  hostB               /myjob   Oct 23 13:45
    5311  user2 PEND night  hostA               /s700    Oct 23 18:17
    btop 5311
    Job <5311> has been moved to position 1 from top.
    bjobs -u all
    JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
    5308  user2 RUN  normal hostA     hostD     /s500    Oct 23 10:16
    5311  user2 PEND night  hostA               /s700    Oct 23 18:17
    5310  user1 PEND night  hostB               /myjob   Oct 23 13:45
    5309  user2 PEND night  hostA               /s200    Oct 23 11:04
Switch Jobs from One Queue to Another
You can use the bswitch command to move jobs from one queue to another. This is useful if you submit a job to the wrong queue, or if the job is suspended because of queue thresholds or run windows and you would like to resume the job.
Switch a single job to a different queue
- Run bswitch to move pending and running jobs from queue to queue.
In the following example, job 5309 is switched to the priority queue:
    bswitch priority 5309
    Job <5309> is switched to queue <priority>
    bjobs -u all
    JOBID USER  STAT QUEUE    FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
    5308  user2 RUN  normal   hostA     hostD     /job500  Oct 23 10:16
    5309  user2 RUN  priority hostA     hostB     /job200  Oct 23 11:04
    5311  user2 PEND night    hostA               /job700  Oct 23 18:17
    5310  user1 PEND night    hostB               /myjob   Oct 23 13:45
Switch all jobs to a different queue
- Run bswitch -q from_queue to_queue 0 to switch all the jobs in a queue to another queue.
The -q option is used to operate on all jobs in a queue. The job ID number 0 specifies that all jobs in the selected queue should be switched.
The example below selects jobs from the night queue and switches them to the idle queue:
    bswitch -q night idle 0
    Job <5308> is switched to queue <idle>
    Job <5310> is switched to queue <idle>
Forcing Job Execution
A pending job can be forced to run with the brun command. This operation can only be performed by an LSF administrator.
You can force a job to run on a particular host, force it to run until completion, and apply other restrictions. For more information, see the brun command.
When a job is forced to run, any other constraints associated with the job such as resource requirements or dependency conditions are ignored.
In this situation you may see some job slot limits, such as the maximum number of jobs that can run on a host, being violated. A job that is forced to run cannot be preempted.
Force a pending job to run
- Run brun -m hostname job_ID to force a pending job to run.
You must specify the host on which the job will run.
For example, the following command will force the sequential job 104 to run on hostA:
    brun -m hostA 104
Suspending and Resuming Jobs
A job can be suspended by its owner or the LSF administrator. These jobs are considered user-suspended and are displayed by bjobs as USUSP.
If a user suspends a high priority job from a non-preemptive queue, the load may become low enough for LSF to start a lower priority job in its place. The load created by the low priority job can prevent the high priority job from resuming. This can be avoided by configuring preemptive queues.
Suspend a job
- Run bstop job_ID.
Your job goes into USUSP state if the job is already started, or into PSUSP state if it is pending.
    bstop 3421
    Job <3421> is being stopped
The above example suspends job 3421.
UNIX
bstop sends the following signals to the job:
- SIGTSTP for parallel or interactive jobs: SIGTSTP is caught by the master process and passed to all the slave processes running on other hosts.
- SIGSTOP for sequential jobs: SIGSTOP cannot be caught by user programs. The SIGSTOP signal can be configured with the LSB_SIGSTOP parameter in lsf.conf.
Windows
bstop causes the job to be suspended.
Resume a job
- Run bresume job_ID:
    bresume 3421
    Job <3421> is being resumed
The above example resumes job 3421.
Resuming a user-suspended job does not put your job into RUN state immediately. If your job was running before the suspension, bresume first puts your job into SSUSP state and then waits for sbatchd to schedule it according to the load conditions.
Killing Jobs
The bkill command cancels pending batch jobs and sends signals to running jobs. By default, on UNIX, bkill sends the SIGKILL signal to running jobs.
Before SIGKILL is sent, SIGINT and SIGTERM are sent to give the job a chance to catch the signals and clean up. The signals are forwarded from mbatchd to sbatchd. sbatchd waits for the job to exit before reporting the status. Because of these delays, for a short period of time after the bkill command has been issued, bjobs may still report that the job is running.
On Windows, job control messages replace the SIGINT and SIGTERM signals, and termination is implemented by the TerminateProcess() system call.
Kill a job
- Run bkill job_ID.
For example, the following command kills job 3421:
    bkill 3421
    Job <3421> is being terminated
Kill multiple jobs
- Run bkill 0 to kill all pending jobs in the cluster, or use bkill 0 with the -g, -J, -m, -q, or -u options to kill all jobs that satisfy these options.
The following command kills all jobs dispatched to the hostA host:
    bkill -m hostA 0
    Job <267> is being terminated
    Job <268> is being terminated
    Job <271> is being terminated
The following command kills all jobs in the groupA job group:
    bkill -g groupA 0
    Job <2083> is being terminated
    Job <2085> is being terminated
Kill a large number of jobs rapidly
Killing multiple jobs with bkill 0 and other commands is usually sufficient for moderate numbers of jobs. However, killing a large number of jobs (more than approximately 1000) can take a long time to finish.
- Run bkill -b to kill a large number of jobs faster than with normal means. However, jobs killed in this manner are not logged to lsb.acct.
Local pending jobs are killed immediately and cleaned up as soon as possible, ignoring the time interval specified by CLEAN_PERIOD in lsb.params. Other jobs are killed as soon as possible but cleaned up normally (after the CLEAN_PERIOD time interval).
If the -b option is used with bkill 0, it kills all applicable jobs and silently skips the jobs that cannot be killed.
The -b option is ignored if used with -r or -s.
Force removal of a job from LSF
- Run bkill -r to force the removal of the job from LSF. Use this option when a job cannot be killed in the operating system.
The bkill -r command removes a job from the LSF system without waiting for the job to terminate in the operating system. This sends the same series of signals as bkill without -r, except that the job is removed from the system immediately, the job is marked as EXIT, and job resources that LSF monitors are released as soon as LSF receives the first signal.
Sending a Signal to a Job
LSF uses signals to control jobs, to enforce scheduling policies, or in response to user requests. The principal signals LSF uses are SIGSTOP to suspend a job, SIGCONT to resume a job, and SIGKILL to terminate a job.
Occasionally, you may want to override the default actions. For example, instead of suspending a job, you might want to kill or checkpoint it. You can override the default job control actions by defining the JOB_CONTROLS parameter in your queue configuration. Each queue can have its own job control actions.
You can also send a signal directly to a job. You cannot send arbitrary signals to a pending job; most signals are only valid for running jobs. However, LSF does allow you to kill, suspend and resume pending jobs.
You must be the owner of a job or an LSF administrator to send signals to a job.
You use the bkill -s command to send a signal to a job. If you issue bkill without the -s option, a SIGKILL signal is sent to the specified jobs to kill them. Twenty seconds before SIGKILL is sent, SIGTERM and SIGINT are sent to give the job a chance to catch the signals and clean up.
On Windows, job control messages replace the SIGINT and SIGTERM signals, but only customized applications are able to process them. Termination is implemented by the TerminateProcess() system call.
Signals on different platforms
LSF translates signal numbers across different platforms because different host types may have different signal numbering. The real meaning of a specific signal is interpreted by the machine from which the bkill command is issued.
For example, if you send signal 18 from a SunOS 4.x host, it means SIGTSTP. If the job is running on HP-UX and SIGTSTP is defined as signal number 25, LSF sends signal 25 to the job.
Send a signal to a job
On most versions of UNIX, signal names and numbers are listed in the kill(1) or signal(2) man pages. On Windows, only customized applications are able to process job control messages specified with the -s option.
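Because signal numbers differ between platforms, it can help to look up the number that a signal name maps to on the host where you run bkill. A minimal Python sketch using the standard signal module follows; the helper names signal_number and signal_name are illustrative only and are not part of LSF.

```python
import signal


def signal_number(name_or_number):
    """Resolve 'TERM', 'SIGTERM', or a number to this host's signal number."""
    if isinstance(name_or_number, int):
        # Raises ValueError if the number is not a known signal on this host.
        return int(signal.Signals(name_or_number))
    name = name_or_number.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    # Raises KeyError if the name is not defined on this host.
    return int(signal.Signals[name])


def signal_name(number):
    """Return the symbolic name for a signal number on this host."""
    return signal.Signals(number).name


print(signal_number("TERM"), signal_name(signal_number("INT")))
```

The numbers printed are those of the local host only; on another host type the same name may map to a different number, which is why passing the name to bkill -s is usually the safer choice.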
- Run bkill -s signal job_id, where signal is either the signal name or the signal number:
    bkill -s TSTP 3421
    Job <3421> is being signaled
The above example sends the TSTP signal to job 3421.
Using Job Groups
A collection of jobs can be organized into job groups for easy management. A job group is a container for jobs in much the same way that a directory in a file system is a container for files. For example, a payroll application may have one group of jobs that calculates weekly payments, another job group for calculating monthly salaries, and a third job group that handles the salaries of part-time or contract employees. Users can submit, view, and control jobs according to their groups rather than looking at individual jobs.
How job groups are created
Job groups can be created explicitly or implicitly:
- A job group is created explicitly with the bgadd command.
- A job group is created implicitly by the bsub -g or bmod -g command when the specified group does not exist. Job groups are also created implicitly when a default job group is configured (DEFAULT_JOBGROUP in lsb.params or LSB_DEFAULT_JOBGROUP environment variable).
Job groups created when jobs are attached to an SLA service class at submission are implicit job groups (bsub -sla service_class_name -g job_group_name). Job groups attached to an SLA service class with bgadd are explicit job groups (bgadd -sla service_class_name job_group_name).
The GRP_ADD event in lsb.events indicates how the job group was created. For example:
    "GRP_ADD" "7.02" 1193032735 1285 1193032735 0 "/Z" "" "user1" "" "" 2 0 "" -1 1
means job group /Z is an explicitly created job group.
Child groups can be created explicitly or implicitly under any job group.
Only an implicitly created job group which has no job group limit (bgadd -L) and is not attached to any SLA can be automatically deleted once it becomes empty. An empty job group is a job group that has no jobs associated with it (including finished jobs). NJOBS displayed by bjgroup is 0.
Job group hierarchy
Jobs in job groups are organized into a hierarchical tree similar to the directory structure of a file system. Like a file system, the tree contains groups (which are like directories) and jobs (which are like files). Each group can contain other groups or individual jobs. Job groups are created independently of jobs, and can have dependency conditions which control when jobs within the group are considered for scheduling.
Job group path
The job group path is the name and location of a job group within the job group hierarchy. Multiple levels of job groups can be defined to form a hierarchical tree. A job group can contain jobs and sub-groups.
Root job group
LSF maintains a single tree under which all jobs in the system are organized. The top-most level of the tree is represented by a top-level "root" job group, named "/". The root group is owned by the primary LSF Administrator and cannot be removed. Users and administrators create new groups under the root group. By default, if you do not specify a job group path name when submitting a job, the job is created under the top-level "root" job group, named "/".
The root job group is not displayed by job group query commands, and you cannot specify the root job group in commands.
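Because every job group sits under the root group /, a job group path can be read as the chain of groups from the top level down to the group itself. The Python sketch below splits a path into that chain; group_chain is a hypothetical helper for illustration, not an LSF command or API.

```python
def group_chain(path):
    """Return every job group along a path, from the top level to the leaf.

    The root group "/" is implicit and is not included in the result.
    """
    parts = [part for part in path.split("/") if part]
    return ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]


print(group_chain("/risk_group/portfolio1/current"))
# ['/risk_group', '/risk_group/portfolio1', '/risk_group/portfolio1/current']
```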
Job group owner
Each group is owned by the user who created it. The login name of the user who creates the job group is the job group owner. Users can add job groups into groups that are owned by other users, and they can submit jobs to groups owned by other users. Child job groups are owned by the creator of the job group and the creators of any parent groups.
Job control under job groups
Job owners can control their own jobs attached to job groups as usual. Job group owners can also control any job under the groups they own and below.
For example:
- Job group /A is created by user1
- Job group /A/B is created by user2
- Job group /A/B/C is created by user3
All users can submit jobs to any job group, and control the jobs they own in all job groups. For jobs submitted by other users:
- user1 can control jobs submitted by other users in all 3 job groups: /A, /A/B, and /A/B/C
- user2 can control jobs submitted by other users only in 2 job groups: /A/B and /A/B/C
- user3 can control jobs submitted by other users only in job group /A/B/C
The LSF administrator can control jobs in any job group.
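The ownership rule in this example can be written as a small check. The Python sketch below is an illustration of the rule as described here, not LSF's implementation: can_control, group_owners, and admins are hypothetical names, and user9 and lsfadmin are made-up accounts.

```python
def can_control(user, job_owner, job_group, group_owners, admins=()):
    """Return True if user may control a job attached to job_group.

    A user may control a job if they own the job, are an LSF administrator,
    or created the job's group or any group above it in the hierarchy.
    """
    if user == job_owner or user in admins:
        return True
    parts = [part for part in job_group.split("/") if part]
    for depth in range(1, len(parts) + 1):
        ancestor = "/" + "/".join(parts[:depth])
        if group_owners.get(ancestor) == user:
            return True
    return False


owners = {"/A": "user1", "/A/B": "user2", "/A/B/C": "user3"}
print(can_control("user1", "user9", "/A/B/C", owners))  # True
print(can_control("user3", "user9", "/A/B", owners))    # False
```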
Default job group
You can specify a default job group for jobs submitted without explicitly specifying a job group. LSF associates the job with the job group specified with DEFAULT_JOBGROUP in lsb.params. The LSB_DEFAULT_JOBGROUP environment variable overrides the setting of DEFAULT_JOBGROUP. The bsub -g job_group_name option overrides both LSB_DEFAULT_JOBGROUP and DEFAULT_JOBGROUP.
Default job group specification supports macro substitution for project name (%p) and user name (%u). When you specify bsub -P project_name, the value of %p is the specified project name. If you do not specify a project name at job submission, %p is the project name defined by setting the environment variable LSB_DEFAULTPROJECT, or the project name specified by DEFAULT_PROJECT in lsb.params. The default project name is default.
For example, a default job group name specified by DEFAULT_JOBGROUP=/canada/%p/%u is expanded to the value for the LSF project name and the user name of the job submission user (for example, /canada/projects/user1).
Job group names must follow this format:
- Job group names must start with a slash character (/). For example, DEFAULT_JOBGROUP=/A/B/C is correct, but DEFAULT_JOBGROUP=A/B/C is not correct.
- Job group names cannot end with a slash character (/). For example, DEFAULT_JOBGROUP=/A/ is not correct.
- Job group names cannot contain more than one slash character (/) in a row. For example, job group names like DEFAULT_JOBGROUP=/A//B or DEFAULT_JOBGROUP=A////B are not correct.
- Job group names cannot contain spaces. For example, DEFAULT_JOBGROUP=/A/B C/D is not correct.
- Project names and user names used for macro substitution with %p and %u cannot start or end with a slash character (/).
- Project names and user names used for macro substitution with %p and %u cannot contain spaces or more than one slash character (/) in a row.
- Project names or user names containing a slash character (/) will create separate job groups. For example, if the project name is canada/projects, DEFAULT_JOBGROUP=/%p results in a job group hierarchy /canada/projects.
Job group limits
Job group limits specified with bgadd -L apply to the job group hierarchy. The job group limit is an integer greater than or equal to zero (0), specifying the maximum number of running and suspended jobs under the job group (including child groups). If the limit is zero (0), no jobs under the job group can run.
By default, a job group has no limit. Limits persist across mbatchd restart and reconfiguration.
You cannot specify a limit for the root job group. The root job group has no job limit. Job groups added with no limits specified inherit any limits of existing parent job groups. The -L option only limits the lowest level job group created.
The maximum number of running and suspended jobs (including USUSP and SSUSP) in a job group cannot exceed the limit defined on the job group and its parent job group.
The job group limit is based on the number of running and suspended jobs in the job group. If you specify a job group limit as 2, at most 2 jobs can run under the group at any time, regardless of how many job slots those jobs use. If the number of currently available job slots is zero (0), even if the job group job limit is not exceeded, LSF cannot dispatch a job to the job group.
If a parallel job requests 2 CPUs (bsub -n 2), it counts as one job against the job group limit, because the limit is per job, not per slot used by the job.
A job array may also be under a job group, so job arrays also support job group limits.
Job group limits are not supported at job submission for job groups created automatically with bsub -g. Use bgadd -L before job submission.
Jobs forwarded to the execution cluster in a MultiCluster environment are not counted towards the job group limit.
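The way a limit applies to a whole branch of the hierarchy can be sketched in Python. This is an illustration under stated assumptions, not LSF's implementation: blocking_group and both dictionaries are hypothetical, the group names and numbers are made up, and a group with no entry in limits is treated as having no limit of its own.

```python
def blocking_group(job_group, limits, active):
    """Return the first group in the branch whose limit is reached, else None.

    limits: group path -> maximum running and suspended jobs (hypothetical data)
    active: group path -> running and suspended jobs attached directly to it
    Jobs are counted per job, not per slot, and a group's count includes
    the jobs of all of its child groups.
    """
    def subtree_count(group):
        prefix = group + "/"
        return sum(
            count for path, count in active.items()
            if path == group or path.startswith(prefix)
        )

    parts = [part for part in job_group.split("/") if part]
    for depth in range(1, len(parts) + 1):
        group = "/" + "/".join(parts[:depth])
        limit = limits.get(group)
        if limit is not None and subtree_count(group) >= limit:
            return group
    return None


limits = {"/canada": 12, "/canada/qa/auto": 6, "/canada/qa/manual": 0}
active = {"/canada/projects/test1": 7, "/canada/qa/auto": 5}
print(blocking_group("/canada/qa/auto", limits, active))  # /canada
```

In this made-up data, /canada/qa/auto is below its own limit of 6, but the 12 jobs counted under /canada already fill the parent limit, so a new job in that branch would stay pending.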
Examples
    bgadd -L 6 /canada/projects/test
If /canada is an existing job group, and /canada/projects and /canada/projects/test are new groups, only the job group /canada/projects/test is limited to 6 running and suspended jobs. Job group /canada/projects will have whatever limit is specified for its parent job group /canada. The limit of /canada does not change.
The limits on child job groups cannot exceed the parent job group limit. For example, if /canada/projects has a limit of 5:
    bgadd -L 6 /canada/projects/test
is rejected because /canada/projects/test attempts to increase the limit of its parent /canada/projects from 5 to 6.
Example job group hierarchy with limits
![]()
In this configuration:
- Every node is a job group, including the root (/) job group
- The root (/) job group cannot have any limit definition
- By default, child groups have the same limit definition as their direct parent group, so /asia, /asia/projects, and /asia/projects/test all have no limit
- The number of running and suspended jobs in a job group (including all of its child groups) cannot exceed the defined limit
- If there are 7 running or suspended jobs in job group /canada/projects/test1, even though the job limit of group /canada/qa/auto is 6, /canada/qa/auto can only have a maximum of 5 running and suspended jobs (12-7=5)
- When a job is submitted to a job group, LSF checks the limits for the entire job group. For example, when a job is submitted to job group /canada/qa/auto, LSF checks the limits on groups /canada/qa/auto, /canada/qa and /canada. If any one limit in the branch of the hierarchy is exceeded, the job remains pending
- The zero (0) job limit for job group /canada/qa/manual means no job in the job group can enter running status
Create a job group
- Use the bgadd command to create a new job group.
You must provide the full group path name for the new job group. The last component of the path is the name of the new group to be created:
    bgadd /risk_group
The above example creates a job group named risk_group under the root group /.
    bgadd /risk_group/portfolio1
The above example creates a job group named portfolio1 under job group /risk_group.
    bgadd /risk_group/portfolio1/current
The above example creates a job group named current under job group /risk_group/portfolio1.
If the group hierarchy /risk_group/portfolio1/current does not exist, LSF checks its parent recursively, and if no groups in the hierarchy exist, all three job groups are created with the specified hierarchy.
Add a job group limit (bgadd)
- Run bgadd -L limit /job_group_name to specify a job limit for a job group.
Where limit is an integer greater than or equal to zero (0), specifying the maximum number of running and suspended jobs under the job group (including child groups). If limit is zero (0), no jobs under the job group can run.
For example:
    bgadd -L 6 /canada/projects/test
If /canada is an existing job group, and /canada/projects and /canada/projects/test are new groups, only the job group /canada/projects/test is limited to 6 running and suspended jobs. Job group /canada/projects will have whatever limit is specified for its parent job group /canada. The limit of /canada does not change.
Submit jobs under a job group
- Use the -g option of bsub to submit a job into a job group.
The job group does not have to exist before submitting the job.
    bsub -g /risk_group/portfolio1/current myjob
    Job <105> is submitted to default queue.
Submits myjob to the job group /risk_group/portfolio1/current.
If group /risk_group/portfolio1/current exists, job 105 is attached to the job group.
If group /risk_group/portfolio1/current does not exist, LSF checks its parent recursively, and if no groups in the hierarchy exist, all three job groups are created with the specified hierarchy and the job is attached to the group.
-g and -sla options
Tip: Use -sla with -g to attach all jobs in a job group to a service class and have them scheduled as SLA jobs. Multiple job groups can be created under the same SLA. You can submit additional jobs to the job group without specifying the service class name again.
MultiCluster
In a MultiCluster job forwarding mode, job groups only apply on the submission cluster, not on the execution cluster. LSF treats the execution cluster as an execution engine, and only enforces job group policies at the submission cluster.
Jobs forwarded to the execution cluster in a MultiCluster environment are not counted towards job group limits.
View jobs in job groups
View job group information, and jobs running in specific job groups.
View information about job groups (bjgroup)
- Use the bjgroup command to see information about jobs in job groups.
    bjgroup
    GROUP_NAME NJOBS PEND RUN SSUSP USUSP FINISH SLA JLIMIT OWNER
    /A         0     0    0   0     0     0      ()  0/10   user1
    /X         0     0    0   0     0     0      ()  0/-    user2
    /A/B       0     0    0   0     0     0      ()  0/5    user1
    /X/Y       0     0    0   0     0     0      ()  0/5    user2
- Use bjgroup -s to sort job groups by group hierarchy.
For example, for job groups named /A, /A/B, /X and /X/Y, bjgroup -s displays:
    bjgroup -s
    GROUP_NAME NJOBS PEND RUN SSUSP USUSP FINISH SLA JLIMIT OWNER
    /A         0     0    0   0     0     0      ()  0/10   user1
    /A/B       0     0    0   0     0     0      ()  0/5    user1
    /X         0     0    0   0     0     0      ()  0/-    user2
    /X/Y       0     0    0   0     0     0      ()  0/5    user2
- Specify a job group name to show the hierarchy of a single job group:
    bjgroup -s /X
    GROUP_NAME NJOBS PEND RUN SSUSP USUSP FINISH SLA     JLIMIT OWNER
    /X         25    0    25  0     0     0      puccini 25/100 user1
    /X/Y       20    0    20  0     0     0      puccini 20/30  user1
    /X/Z       5     0    5   0     0     0      puccini 5/10   user2
- Specify a job group name with a trailing slash character (/) to show only the root job group:
    bjgroup -s /X/
    GROUP_NAME NJOBS PEND RUN SSUSP USUSP FINISH SLA     JLIMIT OWNER
    /X         25    0    25  0     0     0      puccini 25/100 user1
- Use bjgroup -N to display job group information by job slots instead of number of jobs. NSLOTS, PEND, RUN, SSUSP, USUSP, RSV are all counted in slots rather than number of jobs:
    bjgroup -N
    GROUP_NAME NSLOTS PEND RUN SSUSP USUSP RSV SLA     OWNER
    /X         25     0    25  0     0     0   puccini user1
    /A/B       20     0    20  0     0     0   wagner  batch
-N by itself shows job slot information for all job groups, and can be combined with -s to sort the job groups by hierarchy:
    bjgroup -N -s
    GROUP_NAME NSLOTS PEND RUN SSUSP USUSP RSV SLA     OWNER
    /A         0      0    0   0     0     0   wagner  batch
    /A/B       0      0    0   0     0     0   wagner  user1
    /X         25     0    25  0     0     0   puccini user1
    /X/Y       20     0    20  0     0     0   puccini batch
    /X/Z       5      0    5   0     0     0   puccini batch
View jobs for a specific job group (bjobs)
- Run bjobs -g and specify a job group path to view jobs attached to the specified group.
bjobs -g /risk_group
JOBID  USER   STAT  QUEUE   FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
113    user1  PEND  normal  hostA                 myjob     Jun 17 16:15
111    user2  RUN   normal  hostA      hostA      myjob     Jun 14 15:13
110    user1  RUN   normal  hostB      hostA      myjob     Jun 12 05:03
104    user3  RUN   normal  hostA      hostC      myjob     Jun 11 13:18
bjobs -l displays the full path to the group to which a job is attached:
bjobs -l -g /risk_group
Job <101>, User <user1>, Project <default>, Job Group </risk_group>, Status <RUN>, Queue <normal>, Command <myjob>
Tue Jun 17 16:21:49: Submitted from host <hostA>, CWD </home/user1>;
Tue Jun 17 16:22:01: Started on <hostA>;
...
Control jobs in job groups
Suspend and resume jobs in job groups, move jobs to different job groups, terminate jobs in job groups, and delete job groups.
Suspend jobs (bstop)
- Use the -g option of bstop and specify a job group path to suspend jobs in a job group:
bstop -g /risk_group 106
Job <106> is being stopped
- Use job ID 0 (zero) to suspend all jobs in a job group:
bstop -g /risk_group/consolidate 0
Job <107> is being stopped
Job <108> is being stopped
Job <109> is being stopped
Resume suspended jobs (bresume)
- Use the -g option of bresume and specify a job group path to resume suspended jobs in a job group:
bresume -g /risk_group 106
Job <106> is being resumed
- Use job ID 0 (zero) to resume all jobs in a job group:
bresume -g /risk_group 0
Job <109> is being resumed
Job <110> is being resumed
Job <112> is being resumed
Move jobs to a different job group (bmod)
- Use the -g option of bmod and specify a job group path to move a job or a job array from one job group to another.
bmod -g /risk_group/portfolio2/monthly 105
moves job 105 to job group /risk_group/portfolio2/monthly.
Like bsub -g, if the job group does not exist, LSF creates it.
bmod -g cannot be combined with other bmod options. It can only operate on pending jobs. It cannot operate on running or finished jobs.
You can modify your own job groups and job groups that other users create under your job groups. The LSF administrator can modify job groups of all users.
You cannot move job array elements from one job group to another, only entire job arrays. If any job array elements in a job array are running, you cannot move the job array to another group. A job array can only belong to one job group at a time.
You cannot modify the job group of a job attached to a service class.
bhist -l shows job group modification information:
bhist -l 105
Job <105>, User <user1>, Project <default>, Job Group </risk_group>, Command <myjob>
Wed May 14 15:24:07: Submitted from host <hostA>, to Queue <normal>, CWD <$HOME/lsf51/5.1/sparc-sol7-64/bin>;
Wed May 14 15:24:10: Parameters of Job are changed: Job group changes to: /risk_group/portfolio2/monthly;
Wed May 14 15:24:17: Dispatched to <hostA>;
Wed May 14 15:24:17: Starting (Pid 8602);
...
Terminate jobs (bkill)
- Use the -g option of bkill and specify a job group path to terminate jobs in a job group.
bkill -g /risk_group 106
Job <106> is being terminated
- Use job ID 0 (zero) to terminate all jobs in a job group:
bkill -g /risk_group 0
Job <1413> is being terminated
Job <1414> is being terminated
Job <1415> is being terminated
Job <1416> is being terminated
bkill only kills jobs in the job group you specify. It does not kill jobs in lower level job groups in the path. For example, jobs are attached to job groups /risk_group and /risk_group/consolidate:
bsub -g /risk_group myjob
Job <115> is submitted to default queue <normal>.
bsub -g /risk_group/consolidate myjob2
Job <116> is submitted to default queue <normal>.
The following bkill command only kills jobs in /risk_group, not the subgroup /risk_group/consolidate:
bkill -g /risk_group 0
Job <115> is being terminated
To kill jobs in /risk_group/consolidate, specify the path to the consolidate job group explicitly:
bkill -g /risk_group/consolidate 0
Job <116> is being terminated
Delete job groups manually (bgdel)
- Use the bgdel command to manually remove a job group. The job group cannot contain any jobs.
bgdel /risk_group
Job group /risk_group is deleted.
deletes the job group /risk_group and all its subgroups.
Normal users can only delete the empty groups they own that are specified by the requested job_group_name. These groups can be explicit or implicit.
- Run bgdel 0 to delete all empty job groups you own. These groups can be explicit or implicit.
- LSF administrators can use bgdel -u user_name 0 to delete all empty job groups created by specific users. These groups can be explicit or implicit.
Run bgdel -u all 0 to delete all the users' empty job groups and their sub groups. LSF administrators can delete empty job groups created by any user. These groups can be explicit or implicit.
- Run bgdel -c job_group_name to delete all empty groups below the requested job_group_name including job_group_name itself.
Modify a job group limit (bgmod)
- Run bgmod to change a job group limit.
bgmod [-L limit | -Ln] /job_group_name
-L limit changes the limit of job_group_name to the specified value. If the job group has parent job groups, the new limit cannot exceed the limits of any higher level job groups. Similarly, if the job group has child job groups, the new value must be greater than any limits on the lower level job groups.
-Ln removes the existing job limit for the job group. If the job group has parent job groups, the modified job group automatically inherits any limits from its direct parent job group.
You must provide the full group path name for the modified job group. The last component of the path is the name of the job group to be modified.
Only root, LSF administrators, the job group creator, or the creator of the parent job groups can use bgmod to modify a job group limit.
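The constraints on a new limit (no higher than any parent's limit, no lower than any child's limit) can be checked before running bgmod with a sketch like the following. This is only an illustration under stated assumptions: check_new_limit and the limits dictionary are hypothetical, LSF performs its own validation, and the sketch assumes a new limit equal to a child's limit is accepted, which the text above does not state explicitly.

```python
def check_new_limit(group, new_limit, limits):
    """Return a list of reasons why `new_limit` would violate the hierarchy.

    Hypothetical helper: `limits` maps job group paths to their current
    integer limits; groups without a limit are simply absent.
    Assumption: a limit equal to a child's limit is allowed.
    """
    problems = []
    for other, limit in limits.items():
        if other == group:
            continue
        if group.startswith(other.rstrip("/") + "/") and new_limit > limit:
            # `other` is a higher level group of `group`.
            problems.append(f"exceeds limit {limit} of parent {other}")
        elif other.startswith(group.rstrip("/") + "/") and new_limit < limit:
            # `other` is a lower level group of `group`.
            problems.append(f"below limit {limit} of child {other}")
    return problems


limits = {"/canada": 10, "/canada/projects/test1": 4}
print(check_new_limit("/canada/projects", 6, limits))   # no conflicts
print(check_new_limit("/canada/projects", 12, limits))  # conflicts with /canada
print(check_new_limit("/canada/projects", 3, limits))   # conflicts with test1
```

An empty list means the proposed value fits between the limits of the surrounding groups in this simplified model.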
The following command only modifies the limit of group /canada/projects/test1. It does not modify limits of /canada or /canada/projects.
bgmod -L 6 /canada/projects/test1
To modify limits of /canada or /canada/projects, you must specify the exact group name:
bgmod -L 6 /canada
or
bgmod -L 6 /canada/projects
Automatic job group cleanup
When an implicitly created job group becomes empty, it can be automatically deleted by LSF. Job groups that can be automatically deleted cannot:
- Have limits specified including their child groups
- Have explicitly created child job groups
- Be attached to any SLA
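The conditions in this list can be read as a predicate over the job group hierarchy. The sketch below is an interpretation, not LSF's implementation: the JobGroup class and its fields are hypothetical, "child groups" is assumed to mean any lower level group, and a group is treated as empty only when neither it nor its lower level groups hold jobs. The sample data uses a hierarchy /X, /X/Y, /X/Y/Z, /X/Y/Z/W in which only /X/Y was created explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobGroup:
    path: str
    explicit: bool = False       # assumed: True if created explicitly (bgadd)
    limit: Optional[int] = None  # job limit, if one is set
    sla: Optional[str] = None    # attached service class, if any
    njobs: int = 0               # jobs attached directly to this group


def is_under(child_path, parent_path):
    """True if child_path lies strictly below parent_path."""
    return child_path.startswith(parent_path + "/")


def auto_deletable(group, all_groups):
    """Apply the listed conditions to one group (simplified model)."""
    if group.explicit or group.njobs > 0:
        return False
    if group.limit is not None or group.sla is not None:
        return False
    for other in all_groups:
        if is_under(other.path, group.path):
            if other.explicit or other.limit is not None or other.njobs > 0:
                return False
    return True


groups = [
    JobGroup("/X"),
    JobGroup("/X/Y", explicit=True),
    JobGroup("/X/Y/Z"),
    JobGroup("/X/Y/Z/W"),
]
deletable = [g.path for g in groups if auto_deletable(g, groups)]
print(deletable)
```

In this model /X is kept because of its explicitly created child, and /X/Y is kept because it is itself explicit.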
Configure JOB_GROUP_CLEAN=Y in lsb.params to enable automatic job group deletion.
For example, for the following job groups:
![]()
When automatic job group deletion is enabled, LSF only deletes job groups /X/Y/Z/W and /X/Y/Z. Job group /X/Y is not deleted because it is an explicitly created job group. Job group /X is also not deleted because it has an explicitly created child job group /X/Y.
Automatic job group deletion does not delete job groups attached to SLA service classes. Use bgdel to manually delete job groups attached to SLAs.
Handling Job Exceptions
You can configure hosts and queues so that LSF detects exceptional conditions while jobs are running, and takes appropriate action automatically. You can customize what exceptions are detected and their corresponding actions. By default, LSF does not detect any exceptions.
Run bjobs -d -m host_name to see exited jobs for a particular host.
Job exceptions LSF can detect
If you configure job exception handling in your queues, LSF detects the following job exceptions:
- Job underrun - jobs end too soon (run time is less than expected). Underrun jobs are detected when a job exits abnormally.
- Job overrun - job runs too long (run time is longer than expected). By default, LSF checks for overrun jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for job overrun.
- Job estimated run time exceeded - the job's actual run time has exceeded the estimated run time.
- Idle job - running job consumes less CPU time than expected (in terms of CPU time/runtime). By default, LSF checks for idle jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for idle jobs.
Host exceptions LSF can detect
If you configure host exception handling, LSF can detect jobs that exit repeatedly on a host. The host can still be available to accept jobs, but some other problem prevents the jobs from running. Typically jobs dispatched to such "black hole", or "job-eating" hosts exit abnormally. By default, LSF monitors the job exit rate for hosts, and closes the host if the rate exceeds a threshold you configure (EXIT_RATE in lsb.hosts).
If EXIT_RATE is not specified for the host, LSF invokes eadmin if the job exit rate for a host remains above the configured threshold for longer than 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the job exit rate.
Use GLOBAL_EXIT_RATE in lsb.params to set a cluster-wide threshold in minutes for exited jobs. If EXIT_RATE is not specified for the host in lsb.hosts, GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster. Host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.
Customize job exception actions with the eadmin script
When an exception is detected, LSF takes appropriate action by running the script LSF_SERVERDIR/eadmin on the master host.
You can customize eadmin to suit the requirements of your site. For example, eadmin could find out the owner of the problem jobs and use bstop -u to stop all jobs that belong to the user.
In some environments, a job running 1 hour would be an overrun job, while this may be a normal job in other environments. If your configuration considers jobs running longer than 1 hour to be overrun jobs, you may want to close the queue when LSF detects a job that has run longer than 1 hour and invokes eadmin.
Email job exception details
Set LSF to send you an email about job exceptions with details including JOB_ID, RUN_TIME, IDLE_FACTOR (if the job has been idle), USER, QUEUE, EXEC_HOST, and JOB_NAME.
- In lsb.params, set EXTEND_JOB_EXCEPTION_NOTIFY=Y.
- Set the format option in the eadmin script (LSF_SERVERDIR/eadmin on the master host).
- Uncomment the JOB_EXCEPTION_EMAIL_FORMAT line and add a value for the format:
JOB_EXCEPTION_EMAIL_FORMAT=fixed: The eadmin shell generates an exception email with a fixed length for the job exception information. For any given field, the characters truncate when the maximum is reached (between 10-19).
JOB_EXCEPTION_EMAIL_FORMAT=full: The eadmin shell generates an exception email without a fixed length for the job exception information.
Default eadmin actions
For host-level exceptions, LSF closes the host and sends email to the LSF administrator. The email contains the host name, job exit rate for the host, and other host information. The message eadmin: JOB EXIT THRESHOLD EXCEEDED is attached to the closed host event in lsb.events, and displayed by badmin hist and badmin hhist.
For job exceptions, LSF sends email to the LSF administrator. The email contains the job ID, exception type (overrun, underrun, idle job), and other job information.
An email is sent for all detected job exceptions according to the frequency configured by EADMIN_TRIGGER_DURATION in lsb.params. For example, if EADMIN_TRIGGER_DURATION is set to 5 minutes, and 1 overrun job and 2 idle jobs are detected, after 5 minutes, eadmin is invoked and only one email is sent. If another overrun job is detected in the next 5 minutes, another email is sent.
Handling job initialization failures
By default, LSF handles job exceptions for jobs that exit after they have started running. You can also configure LSF to handle jobs that exit during initialization because of an execution environment problem, or because of a user action or LSF policy.
LSF detects that the jobs are exiting before they actually start running, and takes appropriate action when the job exit rate exceeds the threshold for specific hosts (EXIT_RATE in lsb.hosts) or for all hosts (GLOBAL_EXIT_RATE in lsb.params).
Use EXIT_RATE_TYPE in lsb.params to include job initialization failures in the exit rate calculation. The following table summarizes the exit rate types you can configure:
Table 1: Exit rate types you can configure
Job exits excluded from exit rate calculation
By default, jobs that exit for non-host related reasons (user actions and LSF policies) are not counted in the exit rate calculation. Only jobs that exit for what LSF considers host-related problems are used to calculate a host exit rate.
The following cases are not included in the exit rate calculations:
- bkill, bkill -r
- brequeue
- RERUNNABLE jobs killed when a host is unavailable
- Resource usage limit exceeded (for example, PROCESSLIMIT, CPULIMIT, etc.)
- Queue-level job control action TERMINATE and TERMINATE_WHEN
- Checkpointing a job with the kill option (bchkpnt -k)
- Rerunnable job migration
- Job killed when an advance reservation has expired
- Remote lease job start fails
- Any jobs with an exit code found in SUCCESS_EXIT_VALUES, where a particular exit value is deemed as successful.
Excluding LSF and user-related job exits
To explicitly exclude jobs exited because of user actions or LSF-related policies from the job exit calculation, set EXIT_RATE_TYPE = JOBEXIT_NONLSF in lsb.params. JOBEXIT_NONLSF tells LSF to include all job exits except those that are related to user action or LSF policy. This is the default value for EXIT_RATE_TYPE.
To include all job exit cases in the exit rate count, you must set EXIT_RATE_TYPE = JOBEXIT in lsb.params. JOBEXIT considers all job exits.
Jobs killed by signal external to LSF will still be counted towards exit rate
Jobs killed because of job control SUSPEND action and RESUME action are still counted towards the exit rate. This is because LSF cannot distinguish between jobs killed from SUSPEND action and jobs killed by external signals.
If both JOBEXIT and JOBEXIT_NONLSF are defined, JOBEXIT_NONLSF is used.
Local jobs
When EXIT_RATE_TYPE=JOBINIT, various job initialization failures are included in the exit rate calculation, including:
- Host-related failures; for example, incorrect user account, user permissions, incorrect directories for checkpointable jobs, host name resolution failed, or other execution environment problems
- Job-related failures; for example, pre-execution or setup problem, job file not created, etc.
Parallel jobs
By default, or when EXIT_RATE_TYPE=JOBEXIT_NONLSF, job initialization failure on the first execution host does not count in the job exit rate calculation. Job initialization failures for hosts other than the first execution host are counted in the exit rate calculation.
When EXIT_RATE_TYPE=JOBINIT, job initialization failures on the first execution host are counted in the job exit rate calculation. Job initialization failures for hosts other than the first execution host are not counted in the exit rate calculation.
tip: For parallel job exit exceptions to be counted for all hosts, specify EXIT_RATE_TYPE=HPCINIT or EXIT_RATE_TYPE=JOBEXIT_NONLSF JOBINIT.
Remote jobs
By default, or when EXIT_RATE_TYPE=JOBEXIT_NONLSF, job initialization failures are counted as exited jobs on the remote execution host and are included in the exit rate calculation for that host. To include only local job initialization failures on the execution cluster in the exit rate calculation, set EXIT_RATE_TYPE to include only JOBINIT or HPCINIT.
Scaling and tuning job exit rate by number of slots
On large, multiprocessor hosts, use ENABLE_EXIT_RATE_PER_SLOT=Y in lsb.params to scale the job exit rate so that the host is only closed when the job exit rate is high enough in proportion to the number of processors on the host. This avoids having a relatively low exit rate close a host inappropriately.
Use a float value for GLOBAL_EXIT_RATE in lsb.params to tune the exit rate on multislot hosts. The actual calculated exit rate value is never less than 1.
Example: exit rate of 5 on single processor and multiprocessor hosts
On a single-processor host, a job exit rate of 5 is much more severe than on a 20-processor host. If a stream of jobs to a single-processor host is consistently failing, it is reasonable to close the host or take some other action after 5 failures.
On the other hand, for the same stream of jobs on a 20-processor host, it is possible that 19 of the processors are busy doing other work that is running fine. To close this host after only 5 failures would be wrong because effectively less than 5% of the jobs on that host are actually failing.
Example: float value for GLOBAL_EXIT_RATE on multislot hosts
Using a float value for GLOBAL_EXIT_RATE allows the exit rate to be less than the number of slots on the host. For example, on a host with 4 slots, GLOBAL_EXIT_RATE=0.25 gives an exit rate of 1. The same value on an 8 slot machine would be 2 and so on. On a single-slot host, the value is never less than 1.
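The arithmetic in this example can be reproduced with a short sketch. It assumes the scaled rate is simply GLOBAL_EXIT_RATE multiplied by the number of slots, with a floor of 1; the function name scaled_exit_rate is hypothetical, and how LSF rounds products that are not whole numbers is not stated here, so the sketch leaves them unrounded.

```python
def scaled_exit_rate(global_exit_rate, slots):
    """Per-host exit rate threshold under the assumed scaling rule.

    Assumption: threshold = GLOBAL_EXIT_RATE * slots, never less than 1.
    """
    return max(1.0, global_exit_rate * slots)


for slots in (1, 4, 8):
    print(slots, scaled_exit_rate(0.25, slots))
```

For GLOBAL_EXIT_RATE=0.25 this gives 1 on a single-slot host, 1 on a 4-slot host, and 2 on an 8-slot host, matching the values in the example.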
For more information
- See Handling Host-level Job Exceptions for information about configuring host-level job exceptions.
- See Handling Job Exceptions in Queues for information about configuring job exceptions in queues.
| Platform Computing Inc. www.platform.com |