Open MPI fails to run across multiple nodes

grasshoper97

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/grasshoper97/article/details/90737605

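Our machine room has several HP servers. One of our jobs is compute-heavy and highly parallel, which makes it a good fit for parallel computing, so I decided to set up Open MPI to speed it up.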

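Setting up passwordless ssh login and installing Open MPI 1.6.5 both went smoothly, and the examples ran fine on each node individually. The strange part: as soon as I added --hostfile hosts to the command to run on multiple nodes, I got this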

bash: orted: command not found
--------------------------------------------------------------------------
A daemon (pid 8793) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.

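I searched through many web pages and the advice was all over the place: some said to add the PATH and LD_LIBRARY_PATH directories to .bashrc, some said /etc/profile, and others said /root/.bashrc.

The catch was that env showed the correct paths on every node, and which mpirun found the binary everywhere. Single-node examples also ran correctly, so it did not look like a path problem.

It seemed the only place left to look was the original documentation.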

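First, man openmpi

It says that mpirun, mpiexec and orterun are exact synonyms, and points to the mpirun documentation for the full list of options.

So, man mpirun

The first page shows a --prefix option that takes an absolute directory. The more detailed Remote Execution section explains that it points at the Open MPI installation directory on the remote nodes, so that the remote nodes can find Open MPI's executables and libraries and run the program.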

   Remote Execution
       Open MPI requires that the PATH environment variable be set to find executables on remote nodes (this is typically only necessary in rsh- or
       ssh-based environments -- batch/scheduled environments typically copy the current environment to the execution of remote jobs, so if the
       current environment has PATH and/or LD_LIBRARY_PATH set properly, the remote nodes will also have it set properly). If Open MPI was compiled
       with shared library support, it may also be necessary to have the LD_LIBRARY_PATH environment variable set on remote nodes as well
       (especially to find the shared libraries required to run user MPI applications).

       However, it is not always desirable or possible to edit shell startup files to set PATH and/or LD_LIBRARY_PATH. The --prefix option is
       provided for some simple configurations where this is not possible.

       The --prefix option takes a single argument: the base directory on the remote node where Open MPI is installed. Open MPI will use this
       directory to set the remote PATH and LD_LIBRARY_PATH before executing any Open MPI or user applications. This allows running Open MPI jobs
       without having pre-configured the PATH and LD_LIBRARY_PATH on the remote nodes.

       Open MPI adds the basename of the current node's "bindir" (the directory where Open MPI's executables are installed) to the prefix and uses
       that to set the PATH on the remote node. Similarly, Open MPI adds the basename of the current node's "libdir" (the directory where Open
       MPI's libraries are installed) to the prefix and uses that to set the LD_LIBRARY_PATH on the remote node. For example:

       Local bindir:  /local/node/directory/bin

       Local libdir:  /local/node/directory/lib64
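
To make the rule in that last paragraph concrete, here is a small sketch in shell. It only imitates the behaviour the man page describes and is not Open MPI's own code; the helper name remote_dirs and the /opt/openmpi paths in the usage line are made up for illustration.

```shell
# Illustration only: combine a --prefix value with the basename of the local
# bindir/libdir, as the man page describes. remote_dirs is a hypothetical helper.
remote_dirs() {
    prefix=$1    # value that would be passed to --prefix
    bindir=$2    # bindir of the Open MPI install on the local node
    libdir=$3    # libdir of the Open MPI install on the local node
    echo "PATH += $prefix/$(basename "$bindir")"
    echo "LD_LIBRARY_PATH += $prefix/$(basename "$libdir")"
}
```

For instance, remote_dirs /usr/local/my_mpi /opt/openmpi/bin /opt/openmpi/lib64 prints /usr/local/my_mpi/bin for PATH and /usr/local/my_mpi/lib64 for LD_LIBRARY_PATH: only the last path component of the local directories is reused.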

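So I tried adding that option, and every node worked.

Syntax: [mpirun | mpiexec | orterun] -np <number of processes> --hostfile <hostfile> --prefix <Open MPI install directory on the remote nodes> <program to run>

Example: mpirun -np 20 --hostfile myhosts --prefix /usr/local/my_mpi a.out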
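
The post does not show the hostfile itself, so the following is only a sketch of what myhosts might contain. Open MPI hostfiles list one node per line, optionally followed by slots=N; the names node1 and node2 and the slot counts are placeholders, and total_slots is a hypothetical helper for checking the total against -np.

```shell
# Write a hypothetical hostfile: one node per line, "slots" = processes to place there.
# node1/node2 and the counts are placeholders, not the author's real machines.
cat > myhosts <<'EOF'
node1 slots=10
node2 slots=10
EOF

# Sum the slots= values declared in a hostfile (lines without slots= add nothing here).
total_slots() {
    awk '{ for (i = 2; i <= NF; i++)
               if ($i ~ /^slots=/) { sub(/^slots=/, "", $i); n += $i } }
         END { print n + 0 }' "$1"
}
```

With this file, total_slots myhosts reports 20, which matches the -np 20 used in the example command.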

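Cause: the PATH and LD_LIBRARY_PATH set in .bashrc only took effect for users who log in directly (interactive shells). They were not applied to the non-interactive shell that mpirun opens over ssh. That shell therefore could not find orted, the daemon mpirun starts on each remote node, which is why it reported "orted: command not found" and the program never ran.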
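
One way to see this difference directly is to ask a non-interactive ssh shell what it can find, rather than checking in a login session. The helper below is a sketch: check_remote is a hypothetical name, and REMOTE_SHELL is an assumed override added only so the function can be tried without a cluster (it defaults to plain ssh).

```shell
# check_remote HOST: report whether a non-interactive shell on HOST can find orted.
# REMOTE_SHELL is a hypothetical override for trying this without real nodes.
check_remote() {
    host=$1
    # The quoted command runs on the remote side, so $PATH is the remote PATH.
    ${REMOTE_SHELL:-ssh} "$host" 'command -v orted || echo "orted not found in PATH=$PATH"'
}
```

Running check_remote node1 (node1 being a placeholder host name) and comparing the result with which orted typed in an interactive session on the same node should show whether the two shells see different PATH values.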

Fix: when launching from the head node, pass --prefix with the Open MPI installation directory on the remote nodes, so the remote hosts can find these commands.

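Also, a few notes from the installation:

(1) Open MPI does not have to be run as root. An ordinary user works too, as long as it belongs to groups with suitable permissions (I added mine to adm, root and sudo, which is close to root anyway).

(2) When building Open MPI, it is worth passing --prefix to choose the installation directory and using the same directory on every node, which makes later administration easier. I built with ./configure --prefix=/usr/local/my_mpi

(3) Before running, copy the executable to every worker node with scp, again into the same directory on each node.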
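
Point (3) can be scripted from the same hostfile that mpirun uses. This is a sketch under assumptions: push_binary is a hypothetical helper, the RUN switch is an invented convention so the function only prints the commands by default, and the destination directory in the usage line is a placeholder.

```shell
# push_binary HOSTFILE FILE DEST_DIR
# Prints one scp command per host in HOSTFILE; set RUN=1 to execute them instead.
push_binary() {
    hostfile=$1; file=$2; dest=$3
    # The first field of each non-empty, non-comment line is taken as the host name.
    awk 'NF && $1 !~ /^#/ { print $1 }' "$hostfile" | while read -r host; do
        if [ "${RUN:-0}" = 1 ]; then
            # </dev/null keeps scp from touching the loop's input
            scp "$file" "$host:$dest/" < /dev/null
        else
            echo "scp $file $host:$dest/"
        fi
    done
}
```

For example, push_binary myhosts a.out /home/user/mpi lists the copies it would make, and RUN=1 push_binary myhosts a.out /home/user/mpi performs them.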

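Conclusion: at least for people working in computing, learning English is well worth it. When you hit a problem, the most authoritative explanation is often not on the web but in the documentation on your own machine.

While I am at it, here is the test code. It is actually sample code from mpich (another MPI implementation). Besides the rank of the current process, it also prints the host name of each node, which is handy for testing.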

//HelloMpi.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
	int myrank, nprocs;
	/* the buffer must hold at least MPI_MAX_PROCESSOR_NAME characters */
	char name[MPI_MAX_PROCESSOR_NAME];
	int name_len;
	int i,j,k,sum=0;	/* only needed by the optional loop below */
	MPI_Init(&argc, &argv);
	MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
	MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
	MPI_Get_processor_name(name, &name_len);
	/* print this process's rank, the total process count and the host name */
	printf("core[%3d] of [%3d] in {%s}\n", myrank, nprocs, name);
	/* optional busy loop to give each rank some work: */
	/* for(i =0 ; i<10000; i++) */
		/* for(j=0 ;j< myrank ; j++) */
			/* sum +=j; */
	/* printf("core[%3d], sum= %12d\n", myrank,sum ); */

	MPI_Finalize();

	return 0;
}