SLURM Cluster Installation

This document describes how to set up a SLURM cluster on three Ubuntu 18.04 LTS machines. The steps cover installing the munge and slurm packages, configuring the slurm.conf file, setting up the hosts file, starting the slurmctld and slurmd services, synchronizing munge.key across nodes, and starting the munge service. After these steps the machines form a compute cluster that can be tested.


Environment: three physical machines, all running Ubuntu 18.04 LTS, with hostnames tian-609-06, tian-609-07, and tian-609-08. tian-609-06 serves as both the control node and a compute node; the other two are compute nodes only.
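
So that the nodes can resolve each other by hostname, each machine's /etc/hosts should map every hostname to its IP. A minimal sketch; the IP addresses below are placeholders, substitute the real addresses on your network:

```plaintext
# /etc/hosts on every node (IPs are assumptions; use your own)
192.168.1.6  tian-609-06
192.168.1.7  tian-609-07
192.168.1.8  tian-609-08
```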

1. Install munge and slurm (all machines)

```bash
sudo apt install munge slurm-wlm
```
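
As a quick sanity check after installing, munge should be able to encode and decode a credential locally, and the Slurm binaries should report their version:

```bash
munge -n | unmunge   # should end with STATUS: Success (0)
sinfo -V             # prints the installed Slurm version
slurmd -V
```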

2. Configure the /etc/slurm-llnl/slurm.conf file (all machines, identical contents)

```plaintext
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=tian-609-06   # <YOUR-HOST-NAME>
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
```
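
The configurator-easy template that generates this file normally continues with scheduling, logging, and compute-node sections; a sketch of that remainder is below. The CPUs value and the partition name are assumptions; run `slurmd -C` on each node to get its real hardware line:

```plaintext
#
# SCHEDULING (typical continuation of the configurator-easy template)
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
JobAcctGatherType=jobacct_gather/none
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
# COMPUTE NODES
# CPUs=4 is an assumption; take each node's real line from `slurmd -C`
NodeName=tian-609-0[6-8] CPUs=4 State=UNKNOWN
PartitionName=debug Nodes=tian-609-0[6-8] Default=YES MaxTime=INFINITE State=UP
```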

### Alternative: building a SLURM cluster from source

#### Install dependencies

To install and configure SLURM this way, first update the package lists and install the build dependencies on all nodes (control node and compute nodes):

```bash
sudo apt update && sudo apt upgrade -y
sudo apt install wget munge libmunge-dev gcc make perl python3 tmux vim git -y
```

#### Download and compile the source

Download the SLURM source code and build it following the official instructions, pointing `--prefix` at the directory where SLURM should be installed:

```bash
wget https://download.schedmd.com/slurm/slurm-21.08.7.tar.bz2
tar xf slurm-*.tar.bz2
cd slurm-*/
./configure --prefix=/opt/slurm --sysconfdir=/etc/slurm
make -j$(nproc)
sudo make install
```

#### Set up the MUNGE authentication service

Make sure a dedicated `munge` system user exists (the Ubuntu package creates one automatically; create it manually only for a from-scratch setup):

```bash
sudo adduser --system --group --no-create-home munge
```

Create the key file and set its permissions so that only the munge user can read it; the identical key must end up on every node:

```bash
sudo mkdir -p /etc/munge
sudo chown munge:munge /etc/munge/
sudo chmod 700 /etc/munge/
# Generate the key, then copy the same file to all nodes.
sudo dd if=/dev/urandom bs=1 count=1024 of=/tmp/munge.key
sudo mv /tmp/munge.key /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
```

Enable and start the MUNGE service:

```bash
sudo systemctl enable munge.service
sudo systemctl start munge.service
```

#### Configure the SLURM controller (control node)

Edit `/etc/slurm/slurm.conf` and add entries like the following to describe the controller and the compute nodes:

```plaintext
# Example configuration for a small Slurm cluster with one controller
# and two compute nodes, each with one 8-core socket.
ClusterName=clustername
SlurmctldHost=localhost
AuthType=auth/munge
SlurmUser=slurm
StateSaveLocation=/var/spool/slurm
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
ProctrackType=proctrack/cgroup
ReturnToService=2
TaskPlugin=task/affinity
SlurmctldPort=6817
SlurmdPort=6818
SelectType=select/cons_res
SelectTypeParameters=CR_Core
KillWait=30
MinJobAge=300
InactiveLimit=0
Waittime=0
MailProg=/bin/mail
JobAcctGatherFrequency=30
AccountingStorageEnforce=none
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
SlurmdSpoolDir=/var/spool/slurm-llnl/d
NodeName=node[01-02] RealMemory=8192 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=node[01-02] Default=YES MaxTime=INFINITE State=UP
```

Replace `ClusterName`, `SlurmctldHost`, the `NodeName`/`PartitionName` lines, and any other environment-specific values to match your actual setup.
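
Before starting the daemons, it helps to confirm that the `NodeName` definitions match the real hardware; `slurmd -C` prints the line Slurm detects for the current machine:

```bash
# Print this machine's detected hardware in slurm.conf NodeName syntax
slurmd -C
```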

#### Start the SLURM control daemon

Enable and start the `slurmctld` service on the control node, then confirm that it picked up the configuration:

```bash
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
scontrol show config | grep ^SlurmctldHost
```

#### Deploy the SLURM stack on the compute nodes

Copy the binaries and the configuration from the control node to each compute node, then restart MUNGE there so that every node authenticates with the same key:

```bash
rsync -avz --delete /opt/slurm/ user@compute-node:/opt/slurm/
rsync -avz --delete /etc/slurm/ user@compute-node:/etc/slurm/
ssh user@compute-node 'sudo systemctl restart munge'
```

Finally, enable and start `slurmd` on every compute node so that each one registers itself with the controller under its own hostname:

```bash
sudo systemctl enable slurmd
sudo systemctl start slurmd
```

After these steps you should have a basic, working small SLURM cluster. Submit a test job to verify that everything functions correctly.
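
A quick functional check: the partition name `debug` below comes from the example configuration above; adjust it if yours differs.

```bash
sinfo                          # all nodes should eventually show state "idle"
srun -N2 -p debug hostname     # should print each compute node's hostname
sbatch -p debug --wrap="sleep 10 && hostname"
squeue                         # watch the batch job run and complete
```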