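Small jobs run on the fat-node queues, configured as follows:
All three queues use identical nodes: 16 sockets x 4 cores (64 cores in total), Xeon X7350 2.93 GHz, 512 GB of memory.
x64_small: 2 nodes, 1-8 cores per job, 6-hour limit
x64_3950: 5 nodes, 1-64 cores per job, 6-hour limit
x64_3950_long: 11 nodes, 1-64 cores per job, 144-hour limit
x64_small is meant for small jobs.
For somewhat larger jobs, try x64_3950 or x64_3950_long; in practice these two queues tend to be less busy than x64_small.
Medium and large jobs run on the blade queue, which requires at least 64 cores per job.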
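The fat-node limits listed above can be turned into a small lookup. This is only a sketch: `pick_queue` is a hypothetical helper name, it takes whole hours, and it covers just the three fat-node queues because the blade queue's time limit is not given here.

```shell
#!/bin/bash
# Hypothetical helper: choose a fat-node queue from the limits in this post.
# Usage: pick_queue CORES HOURS   (whole hours only)
pick_queue() {
    local cores=$1 hours=$2
    if (( cores >= 1 && cores <= 8 && hours <= 6 )); then
        echo x64_small
    elif (( cores >= 1 && cores <= 64 && hours <= 6 )); then
        echo x64_3950
    elif (( cores >= 1 && cores <= 64 && hours <= 144 )); then
        echo x64_3950_long
    else
        echo "no fat-node queue fits: $cores cores, $hours h" >&2
        return 1
    fi
}

pick_queue 2 1      # prints x64_small
pick_queue 32 6     # prints x64_3950
pick_queue 32 48    # prints x64_3950_long
```

The helper prefers the smallest queue that fits. Since the two larger queues are often less busy, a small job may still start sooner there.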
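Example: a job that runs for 1 minute and needs 2 CPU cores, 1 core per node, submitted to the x64_small queue, with standard output in zlt.out, standard error in zlt.err, and an executable named comm: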
[scwangj@LB270108 zjl]$ bsub -W 1 -a intelmpi -n 2 -R "span[ptile=1]" -q x64_small -o zlt.out -e zlt.err mpirun.lsf ./comm
Job <78607> is submitted to queue <x64_small>.
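When a script needs the job number later, it can be pulled out of the confirmation message that bsub prints. A minimal sketch, assuming the message keeps the `Job <N> is submitted to ...` shape shown above; `job_id_from_bsub` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: read bsub's confirmation line on stdin
# and print only the numeric job ID.
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted to .*/\1/p'
}

# Demonstrated on the message from this post rather than a live submission.
echo 'Job <78607> is submitted to queue <x64_small>.' | job_id_from_bsub   # prints 78607
```

In a real script the bsub command would be piped into the function, and the resulting number could then be passed to bjobs.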
To submit several jobs at once, write a bash script submit.sh (not strictly necessary; chaining the commands on one line works too):
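#!/bin/bash
# Submit one job per core count; -W 6 caps each run at 6 minutes.
for i in 50 60 70 80 90 100
do
    # Quote the span string so the shell cannot glob-expand the brackets.
    bsub -W 6 -a intelmpi -n "$i" -R "span[ptile=1]" -q x64_blades -o "$i.out" -e "$i.err" ./matrix
done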
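Before sending a whole batch to the scheduler, it can help to print the commands first. The sketch below wraps the same loop in a hypothetical `submit_all` function with a `DRY_RUN` switch (both names are assumptions, not part of LSF). The queue, wall time and program name are taken from the script above, and the `intelmpi` spelling follows the later scripts in this post.

```shell
#!/bin/bash
# Hypothetical wrapper: print (DRY_RUN=1, the default here) or run
# one bsub command per core count given on the command line.
submit_all() {
    local n
    for n in "$@"; do
        local cmd=(bsub -W 6 -a intelmpi -n "$n" -R "span[ptile=1]"
                   -q x64_blades -o "$n.out" -e "$n.err" ./matrix)
        if [[ ${DRY_RUN:-1} == 1 ]]; then
            printf '%s\n' "${cmd[*]}"     # show the command only
        else
            "${cmd[@]}"                   # really submit
        fi
    done
}

submit_all 70 80
```

Once the printed commands look right, `DRY_RUN=0 submit_all 70 80 90 100` would submit them for real.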
Check the jobs:
[scwangj@LB270210 zl]$ bjobs -u scwangj
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
82504   scwangj PEND  x64_blades lb270210                ./matrix   Jul 4 12:47
82505   scwangj PEND  x64_blades lb270210                ./matrix   Jul 4 12:47
82506   scwangj PEND  x64_blades lb270210                ./matrix   Jul 4 12:47
82507   scwangj PEND  x64_blades lb270210                ./matrix   Jul 4 12:47
82487   scwangj PEND  x64_small  lb270210                ./matrix   Jul 4 12:35
[scwangj@LB270210 zl]$
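With many jobs queued, a per-queue tally is easier to read than the raw list. A rough sketch, assuming the default bjobs column order shown above (STAT in column 3, QUEUE in column 4); `count_by_state` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: read bjobs output on stdin and count jobs
# per (status, queue) pair. Only lines starting with a numeric job ID
# are counted, so the header and wrapped host lines are skipped.
count_by_state() {
    awk '$1 ~ /^[0-9]+$/ { n[$3 " " $4]++ }
         END { for (k in n) print k, n[k] }' | sort
}

# Demonstrated on rows from the listing above.
count_by_state <<'EOF'
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
82504 scwangj PEND x64_blades lb270210 ./matrix Jul 4 12:47
82505 scwangj PEND x64_blades lb270210 ./matrix Jul 4 12:47
82506 scwangj PEND x64_blades lb270210 ./matrix Jul 4 12:47
82507 scwangj PEND x64_blades lb270210 ./matrix Jul 4 12:47
82487 scwangj PEND x64_small lb270210 ./matrix Jul 4 12:35
EOF
# prints:
# PEND x64_blades 4
# PEND x64_small 1
```

On the cluster the input would come from `bjobs -u scwangj | count_by_state`.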
Appendix:
[scwangj@v3903 20x20x100]$ cat submit.sh
#!/bin/bash
for i in 1 2 3 4 5 6 7 8
do
bsub -W 5:40 -a intelmpi -n $i -R span[ptile=2] -q x64_small -o $i.out -e $i.err mpirun.lsf ./simple
done
[scwangj@v3903 20x20x100]$ cd ..
[scwangj@v3903 ddm]$ ls
10.err 18.err 1.err 20x20x100 2.out 9.out bsubmpi ddm.sh fluid.grd serial solveuss.F solvewss.F stagsimple.F submit.log tdma.F uc.fun variable.mod
10.out 18.out 1.out 2.err 9.err a.sh bsub.txt del.sh ppoisson.F simple solvevss.F s.sh submit2.sh submit.sh time.dat uc.nam
[scwangj@v3903 ddm]$ cat submit.sh
#!/bin/bash
for i in 1 2 3 4 5 6 7 8
do
bsub -W 5:40 -a intelmpi -n $i -R span[ptile=2] -q x64_small -o $i.out -e $i.err mpirun.lsf ./simple
done
[scwangj@v3903 ddm]$ cat submit2.sh
#!/bin/bash
for i in 9 10 11 12 13 14 15 16 17 18
do
bsub -W 5:40 -a intelmpi -n $i -R span[ptile=9] -q x64_3950 -o $i.out -e $i.err mpirun.lsf ./simple
done
[scwangj@v3903 ddm]$ bjobs -u scwangj
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
82725   scwangj RUN   x64_3950   v3903       9*t3701     * ./simple Jul 4 20:39
                                             2*t3601
82726   scwangj RUN   x64_3950   v3903       9*t3802     * ./simple Jul 4 20:39
                                             3*t4102
82727   scwangj RUN   x64_3950   v3903       9*t3701     * ./simple Jul 4 20:39
                                             4*t3802
82728   scwangj RUN   x64_3950   v3903       9*t3601     * ./simple Jul 4 20:39
                                             5*t4102
82729   scwangj RUN   x64_3950   v3903       9*t3701     * ./simple Jul 4 20:39
                                             6*t3802
82730   scwangj RUN   x64_3950   v3903       9*t3601     * ./simple Jul 4 20:39
                                             7*t4102
82731   scwangj RUN   x64_3950   v3903       9*t3701     * ./simple Jul 4 20:39
                                             8*t3802
82745   scwangj RUN   x64_3950   v3903       9*t3701     * ./simple Jul 4 20:41
82634   scwangj RUN   x64_small  v3903       t4601       * ./simple Jul 4 16:59
82635   scwangj RUN   x64_small  v3903       1*t4601     * ./simple Jul 4 16:59
                                             1*t3701
82710   scwangj RUN   x64_small  v3903       t4601       * ./simple Jul 4 20:37
82711   scwangj RUN   x64_small  v3903       2*t3701     * ./simple Jul 4 20:37
82746   scwangj PEND  x64_3950   v3903                   * ./simple Jul 4 20:41
82619   scwangj PEND  x64_small  lb270210                * ./matrix Jul 4 16:53
[scwangj@v3903 ddm]$
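The EXEC_HOST column for the x64_3950 jobs shows how `span[ptile=9]` spreads processes: 9 on the first host and the remainder on a second one. The arithmetic can be sketched as below, assuming each host is filled up to ptile before the next one is used, which is what those rows suggest. `slot_layout` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: split N slots across hosts, at most PTILE per host.
# Assumes each host is filled up to PTILE before the next one is used.
slot_layout() {
    local n=$1 ptile=$2
    local out=()
    (( ptile > 0 )) || { echo "ptile must be positive" >&2; return 1; }
    while (( n > 0 )); do
        local take=$(( n < ptile ? n : ptile ))   # slots placed on this host
        out+=("$take")
        n=$(( n - take ))
    done
    echo "${out[*]}"
}

slot_layout 11 9   # prints "9 2"
slot_layout 17 9   # prints "9 8"
```

The "9 2" result is consistent with a row such as job 82725 (9*t3701 plus 2*t3601), although the listing does not state which -n value each job used.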




