Slurm and MPI example

 

Overview

RCC supports these MPI implementations:

  • IntelMPI
  • MVAPICH2
  • OpenMPI

Each MPI implementation usually has a module available for use with GCC, the Intel Compiler Suite, and PGI. For example, at the time of this writing these MPI modules were available:

openmpi/1.6(default)
openmpi/1.6+intel-12.1
openmpi/1.6+pgi-2012
mvapich2/1.8(default)
mvapich2/1.8+intel-12.1
mvapich2/1.8+pgi-2012
mvapich2/1.8-gpudirect
mvapich2/1.8-gpudirect+intel-12.1
intelmpi/4.0
intelmpi/4.0+intel-12.1(default)
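The list above reflects the time of writing and will drift as versions change; the currently installed MPI modules can always be listed with `module avail`:

```shell
# Show all installed builds of the supported MPI implementations
module avail openmpi mvapich2 intelmpi
```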

MPI Implementation Notes

Each MPI implementation has its own options and features. Notable differences are described below.

IntelMPI

IntelMPI uses an environment variable to affect the network communication fabric it uses:

I_MPI_FABRICS

During job launch, the Slurm TaskProlog detects the network hardware and sets this variable appropriately. It is typically set to shm:ofa, which makes IntelMPI use shared-memory communication within a node and ibverbs between nodes. If a job runs on a node without Infiniband, the variable is set to shm, which uses shared memory only and limits IntelMPI to a single-node job; this is usually the desired behavior on nodes without a high-speed interconnect. The variable can be overridden in the submission script if desired.
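For example, to override the TaskProlog's choice and force shared-memory-only communication, the variable can be exported in the submission script before the program is launched (a minimal sketch):

```shell
# Override the fabric chosen by the Slurm TaskProlog; shm restricts
# IntelMPI to shared-memory (single-node) communication.
export I_MPI_FABRICS=shm
```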

MVAPICH2

MVAPICH2 is compiled with the OFA-IB-CH3 interface. There is no support for running programs compiled with MVAPICH2 on loosely coupled nodes.

GPUDirect builds of MVAPICH2 with CUDA enabled are available for use on the GPU nodes. These builds are otherwise identical to the standard MVAPICH2 build.
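A minimal sketch of a GPU-node submission script using the GPUDirect build follows. The partition name and GRES request are assumptions, not values taken from this document; check the cluster documentation for the actual GPU partition and GPU-request syntax.

```shell
#!/bin/bash
#SBATCH --job-name=cuda-mpi
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --constraint=ib
#SBATCH --partition=gpu      # hypothetical partition name
#SBATCH --gres=gpu:1         # request one GPU per node (syntax may vary)

# load the CUDA-enabled GPUDirect build of MVAPICH2
module load mvapich2/1.8-gpudirect

mpirun ./my-cuda-mpi-program # placeholder for your CUDA-aware MPI binary
```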

OpenMPI

Nothing at this time.

Example

Let’s look at an example MPI hello world program and walk through the steps needed to compile it and submit it to the queue. Here is the program, hello-mpi.c:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
  int numprocs, rank, namelen;
  char processor_name[MPI_MAX_PROCESSOR_NAME];

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

  MPI_Finalize();
  return 0;
}

Place hello-mpi.c in your home directory, then compile it interactively by entering the following commands into the terminal:

module load openmpi
mpicc hello-mpi.c -o hello-mpi

In this case we are using the default version of the openmpi module, which uses the GCC compiler. Any of the available MPI/compiler combinations should work for this example.
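For instance, to build the same program with the Intel-compiled OpenMPI from the module list above, only the module changes; the mpicc invocation stays the same:

```shell
# Swap in the Intel-compiled OpenMPI build; mpicc now wraps the Intel compiler
module load openmpi/1.6+intel-12.1
mpicc hello-mpi.c -o hello-mpi
```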

hello-mpi.sbatch is a submission script that can be used to submit a job to the queue to run this program.

#!/bin/bash

# set the job name to hello-mpi
#SBATCH --job-name=hello-mpi

# send output to hello-mpi.out
#SBATCH --output=hello-mpi.out

# this job requests 2 nodes
#SBATCH --nodes=2

# this job requests exclusive access to the nodes it is given
# this means it will be the only job running on those nodes
#SBATCH --exclusive

# --constraint=ib must be given to guarantee the job is allocated
# nodes with Infiniband
#SBATCH --constraint=ib

# load the openmpi module
module load openmpi

# Run the program with mpirun. Note that -n is not required: mpirun
# automatically determines how many processes to run from the Slurm options
mpirun ./hello-mpi

The inline comments describe what each line does, but it is important to point out three things that almost all MPI jobs have in common:

  • --constraint=ib is given to guarantee a node with Infiniband is allocated
  • --exclusive is given to guarantee this job will be the only job on the node
  • mpirun does not need to be given -n. All supported MPI environments automatically determine the proper layout from the Slurm options

You can submit this job with this command:

sbatch hello-mpi.sbatch
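Once submitted, the job can be monitored and its output inspected with standard Slurm commands, for example:

```shell
squeue -u $USER        # show the job while it is pending or running
cat hello-mpi.out      # read the collected output after it completes
```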

Here is example output of this program:

Process 4 on midway123 out of 32
Process 0 on midway123 out of 32
Process 1 on midway123 out of 32
Process 2 on midway123 out of 32
Process 5 on midway123 out of 32
Process 15 on midway123 out of 32
Process 12 on midway123 out of 32
Process 7 on midway123 out of 32
Process 9 on midway123 out of 32
Process 14 on midway123 out of 32
Process 8 on midway123 out of 32
Process 24 on midway124 out of 32
Process 10 on midway123 out of 32
Process 11 on midway123 out of 32
Process 3 on midway123 out of 32
Process 6 on midway123 out of 32
Process 13 on midway123 out of 32
Process 17 on midway124 out of 32
Process 20 on midway124 out of 32
Process 19 on midway124 out of 32
Process 25 on midway124 out of 32
Process 27 on midway124 out of 32
Process 26 on midway124 out of 32
Process 29 on midway124 out of 32
Process 28 on midway124 out of 32
Process 31 on midway124 out of 32
Process 30 on midway124 out of 32
Process 18 on midway124 out of 32
Process 22 on midway124 out of 32
Process 21 on midway124 out of 32
Process 23 on midway124 out of 32
Process 16 on midway124 out of 32

It is possible to control the number of tasks run per node with the --ntasks-per-node option. Submitting the job like this:

sbatch --ntasks-per-node=1 hello-mpi.sbatch

results in output like this:

Process 0 on midway123 out of 2
Process 1 on midway124 out of 2
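One rank per node is also the usual starting point for hybrid MPI+OpenMP jobs, with the remaining cores given to threads. A hedged sketch (the 16-core count is an assumption about the node size, not a value from this document):

```shell
# One MPI rank per node, 16 OpenMP threads per rank (assumes 16-core nodes)
sbatch --ntasks-per-node=1 --cpus-per-task=16 \
       --export=ALL,OMP_NUM_THREADS=16 hello-mpi.sbatch
```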

Advanced Usage

Both OpenMPI and IntelMPI can launch MPI programs directly with the Slurm command srun. This mode is not necessary for most jobs, but it may allow launch options that would not otherwise be possible. For example, from a login node the hello-mpi program built above with OpenMPI can be launched directly on a compute node with this command:

srun --constraint=ib -n16 --exclusive hello-mpi

For IntelMPI, it is necessary to set an environment variable for this to work:

export I_MPI_PMI_LIBRARY=/software/slurm-current-$DISTARCH/lib/libpmi.so
srun --constraint=ib -n16 --exclusive hello-mpi