Installing and Restarting Slurm

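This post covers installing openssl and munge, the prerequisites of the Slurm cluster management software, and then building Slurm itself. It goes on to the changes made to the slurm.conf configuration file and the points to watch for: whenever the configuration file is updated, the slurmctld and slurmd services must be restarted for the change to take effect.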


1. Install

Install openssl and munge first, then build and install Slurm from source (as user caoj7):

./configure --prefix=/usr/local --sysconfdir=/usr/local/etc --enable-debug
make
sudo make install
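
Before running ./configure it can help to confirm that the prerequisite tools are actually on the PATH. The following is a minimal sketch: the helper name missing_tools is hypothetical, and the command names checked (openssl, munge, unmunge) are assumed to be what the openssl and munge packages provide on this system.

```shell
# missing_tools: print every command name from the arguments that is
# not found on the PATH (hypothetical helper, not part of Slurm).
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Report anything that still needs to be installed before building.
missing=$(missing_tools openssl munge unmunge)
if [ -n "$missing" ]; then
    echo "Install these first:" $missing
else
    echo "Prerequisites found"
fi
```

Once the munge daemon is running, `munge -n | unmunge` is a quick way to check that a credential can be encoded and decoded locally.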

2. slurm.conf (if it is revised, slurmctld and slurmd must be restarted)

Use doc/html/configurator.html to create slurm.conf:
# slurm.conf file generated by configurator.easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=vm1
#ControlAddr=
# 
#MailProg=/bin/mail 
MpiDefault=none
#MpiParams=ports=#-# 
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818 
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=caoj7
SlurmdUser=caoj7 
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
# 
# 
# TIMERS 
#KillWait=30 
#MinJobAge=300 
#SlurmctldTimeout=120 
#SlurmdTimeout=300 
# 
# 
# SCHEDULING 
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321 
SelectType=select/linear
# 
# 
# LOGGING AND ACCOUNTING 
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3 
#SlurmctldLogFile=
#SlurmdDebug=3 
#SlurmdLogFile=
# 
# 
# COMPUTE NODES 
NodeName=vm[2-5] CPUs=4 State=UNKNOWN 
PartitionName=compute Nodes=vm[2-5] Default=YES MaxTime=INFINITE State=UP
Save the file as /usr/local/etc/slurm.conf (revised: SlurmUser=caoj7, SlurmdUser=caoj7), then copy it to every node and fix its ownership:
sudo scp /usr/local/etc/slurm.conf vm2:/usr/local/etc/   (etc.)
sudo chown caoj7:caoj7 /usr/local/etc/slurm.conf   (etc.)
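
The "(etc.)" stands for repeating the copy on every compute node. As a rough sketch, the loop below prints the scp commands for nodes vm2 to vm5 (the range from the NodeName line) instead of running them. The function name print_copy_commands is hypothetical, and the use of seq to expand the range is an illustrative choice.

```shell
CONF=/usr/local/etc/slurm.conf

# print_copy_commands: dry run that prints one scp command per node
# (hypothetical helper; it does not copy anything itself).
print_copy_commands() {
    local node
    for node in "$@"; do
        printf 'sudo scp %s %s:%s/\n' "$CONF" "$node" "${CONF%/*}"
    done
}

# seq expands the 2-5 range used in the NodeName=vm[2-5] line.
for i in $(seq 2 5); do
    print_copy_commands "vm$i"
done
```

On a machine where Slurm is already installed, `scontrol show hostnames 'vm[2-5]'` expands the same range into one host name per line.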

3. Create files and directories
sudo touch /var/run/slurmctld.pid
sudo chown caoj7:caoj7 /var/run/slurmctld.pid
sudo touch /var/run/slurmd.pid
sudo chown caoj7:caoj7 /var/run/slurmd.pid
touch /var/run/slurmd.pid
sudo mkdir /var/spool/slurmd
sudo chown -R caoj7:caoj7 /var/spool/slurmd
sudo touch /var/spool/job_state
sudo chown caoj7:caoj7 /var/spool/job_state
sudo touch /var/spool/resv_state
sudo chown caoj7:caoj7 /var/spool/resv_state
sudo touch /var/spool/node_state
sudo chown caoj7:caoj7 /var/spool/node_state
sudo touch /var/spool/trigger_state
sudo chown caoj7:caoj7 /var/spool/trigger_state
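
The touch/chown pairs for the state files follow one pattern, so they can be collapsed into a loop. This is a sketch under assumptions: create_state_files is a hypothetical helper, the four file names are the ones listed above, and the demonstration runs against a temporary directory rather than /var/spool so that it needs no root privileges.

```shell
# create_state_files DIR [OWNER]: create the four state files in DIR and,
# if OWNER is given, chown them (hypothetical helper).
create_state_files() {
    local dir="$1" owner="${2:-}" name
    for name in job_state resv_state node_state trigger_state; do
        touch "$dir/$name" || return 1
        if [ -n "$owner" ]; then
            chown "$owner" "$dir/$name" || return 1
        fi
    done
}

# Demonstration in a scratch directory; on the real control node this
# would be run as root: create_state_files /var/spool caoj7:caoj7
scratch=$(mktemp -d)
create_state_files "$scratch"
ls "$scratch"
rm -rf "$scratch"
```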

4. Startup
Master:
slurmctld -Dvvvvvv
If /var/run/slurmctld.pid has been removed, re-create it with vi.
Slave:
slurmd -Dvvvvvv
If /var/run/slurmd.pid has been removed, re-create it with vi.

5. Error

slurmctld error: authentication: expired credential
The clocks on the nodes are not synchronized. Set the time, e.g.:
date -s "2012-9-3 14:27:00"
Then restart munge and slurm.
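
Since the expired-credential error comes from clocks that disagree, it can be useful to measure the difference before resetting the time by hand. A minimal sketch follows. abs_skew is a hypothetical helper, and the node name vm2 and the 60-second warning threshold in the commented usage are illustrative assumptions, not values taken from munge or Slurm.

```shell
# abs_skew A B: print the absolute difference between two epoch
# timestamps in seconds (hypothetical helper).
abs_skew() {
    local diff=$(( $1 - $2 ))
    printf '%s\n' "${diff#-}"
}

# Compare the local clock with itself as a self-contained demonstration.
now=$(date +%s)
echo "skew against self: $(abs_skew "$now" "$now") s"

# Against a real node (assumes passwordless ssh to vm2):
#   remote=$(ssh vm2 date +%s)
#   [ "$(abs_skew "$(date +%s)" "$remote")" -gt 60 ] && echo "clocks differ"
```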

If node002 can't register with the master
This might be caused by ssh.
Try to ssh to the master node (e.g., node001) from node002.

salloc error
[caoj7@vm2 mpi]$ salloc -N2
-bash: ./salloc: /lib/ld-linux.so.2: bad ELF interpreter: No such file or directory
[caoj7@vm1 mpi]$ ldd /usr/local/bin/salloc
  linux-vdso.so.1 =>  (0x00007fff0ebff000)
  libdl.so.2 => /lib64/libdl.so.2 (0x0000003d3f000000)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003d6e000000)
  libc.so.6 => /lib64/libc.so.6 (0x0000003d6dc00000)
  /lib64/ld-linux-x86-64.so.2 (0x0000003d6d400000)

[caoj7@vm1 mpi]$ cd /lib
[caoj7@vm1 lib]$ ln -s /lib64/ld-linux-x86-64.so.2 ld-linux.so.2
This failed again later, however; after removing the symlink with unlink, it worked correctly.
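
/lib/ld-linux.so.2 is the 32-bit loader, while the ldd output shows /usr/local/bin/salloc using the 64-bit one, so a likely reading is that the ./salloc being run was a different, 32-bit binary. One way to check is to read the class byte in the ELF header. The sketch below assumes od and tr are available; elf_class is a hypothetical helper.

```shell
# elf_class FILE: print 32 or 64 according to byte 4 (EI_CLASS) of the
# ELF header, or "unknown" for anything else (hypothetical helper).
elf_class() {
    local class
    class=$(od -An -tu1 -j4 -N1 "$1" | tr -d ' \n')
    case "$class" in
        1) echo 32 ;;
        2) echo 64 ;;
        *) echo unknown ;;
    esac
}

# Demonstration on a fake header so the block runs anywhere; on the
# cluster this would be: elf_class ./salloc
sample=$(mktemp)
printf '\177ELF\002' > "$sample"
elf_class "$sample"
rm -f "$sample"
```

Where the file utility is installed, `file ./salloc` reports the same information in readable form.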

------------------------------------------------------------------
Restart
1. Start munge
[caoj7@vm5 ~]$ sudo /etc/init.d/munge start
2. Start slurmctld or slurmd

[caoj7@vm5 ~]$ slurmd -D vvvvvv
slurmd: slurmd version 2.4.4 started
slurmd: error: Unable to open pidfile `/var/run/slurmd.pid': Permission denied
slurmd: slurmd started on Fri 30 Nov 2012 09:57:55 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=846
^Cslurmd: error: Unable to remove pidfile `/var/run/slurmd.pid': No such file or directory
slurmd: Slurmd shutdown completing

[caoj7@vm5 ~]$ sudo touch /var/run/slurmd.pid
[caoj7@vm5 ~]$ sudo chown caoj7:caoj7 /var/run/slurmd.pid

[caoj7@vm5 ~]$ slurmd -D vvvvvv

slurmd: slurmd version 2.4.4 started
slurmd: error: Possible corrupt pidfile `/var/run/slurmd.pid'
slurmd: slurmd started on Fri 30 Nov 2012 09:58:48 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=899
^Cslurmd: error: Unable to remove pidfile `/var/run/slurmd.pid': Permission denied
slurmd: Slurmd shutdown completing

[caoj7@vm5 ~]$ touch /var/run/slurmd.pid 

[caoj7@vm5 ~]$ slurmd -D vvvvvv

slurmd: slurmd version 2.4.4 started
slurmd: slurmd started on Fri 30 Nov 2012 09:59:14 +0000
slurmd: CPUs=4 Sockets=4 Cores=1 Threads=1 Memory=15949 TmpDisk=21851 Uptime=925
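
The three runs above differ only in the state of /var/run/slurmd.pid, so a quick pre-flight check can save a restart cycle. This is a sketch only: pidfile_ready is a hypothetical helper, and "exists and is writable by the current user" is an assumption about what slurmd needs, inferred from the log messages rather than from the Slurm source.

```shell
# pidfile_ready FILE: succeed if FILE exists as a regular file and the
# current user can write to it (hypothetical helper).
pidfile_ready() {
    [ -f "$1" ] && [ -w "$1" ]
}

PIDFILE=/var/run/slurmd.pid
if pidfile_ready "$PIDFILE"; then
    echo "$PIDFILE is ready"
else
    echo "create it first: sudo touch $PIDFILE && sudo chown $(id -un) $PIDFILE"
fi
```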