Getting started with LAM

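This article covers the basic operations of the LAM environment: booting LAM, compiling and running MPI programs, monitoring MPI applications, and cleaning up the environment. It walks through the whole process by example, from booting a cluster to running a simple MPI program.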

Reposted from: http://www.lam-mpi.org/tutorials/one-step/lam.php

LAM is a simple yet powerful environment for running and monitoring MPI applications on clusters. The few essential steps in LAM operations are covered below.

Booting LAM

The user creates a file listing the participating machines in the cluster.

shell$ cat lamhosts
# a 2-node LAM
node1.cluster.example.com
node2.cluster.example.com

Each machine will be given a node identifier (nodeid) starting with 0 for the first listed machine, 1 for the second, etc.
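The numbering rule above can be sketched in a few lines of Python. This is only an illustration, not part of LAM: the function name assign_node_ids is hypothetical, and the parsing assumes that `#` starts a comment and that the hostname is the first token on each remaining line.

```python
def assign_node_ids(boot_schema_text):
    """Map nodeid -> hostname, numbering hosts in the order they are listed."""
    hosts = []
    for raw_line in boot_schema_text.splitlines():
        # Drop comments (everything after '#') and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # Assumption: the hostname is the first whitespace-separated token.
        hosts.append(line.split()[0])
    # The first listed machine becomes n0, the second n1, and so on.
    return {f"n{i}": host for i, host in enumerate(hosts)}


lamhosts = """\
# a 2-node LAM
node1.cluster.example.com
node2.cluster.example.com
"""

for nodeid, host in assign_node_ids(lamhosts).items():
    print(nodeid, host)
```

The nodeids it produces (n0, n1) are the same labels that appear in the recon and lamboot output.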

The recon tool verifies that the cluster is bootable:

shell$ recon -v lamhosts
recon: -- testing n0 (node1.cluster.example.com)
recon: -- testing n1 (node2.cluster.example.com)

The lamboot tool actually starts LAM on the specified cluster.

shell$ lamboot -v lamhosts

LAM 7.1.4 - Indiana University

Executing hboot on n0 (node1.cluster.example.com - 1 CPU)...
Executing hboot on n1 (node2.cluster.example.com - 1 CPU)...

lamboot returns to the UNIX shell prompt. LAM does not force a canned environment or a "LAM shell". The tping command confirms that the cluster and LAM are running.

shell$ tping -c1 N
  1 byte from 1 remote node and 1 local node: 0.008 secs

1 message, 1 byte (0.001K), 0.008 secs (0.246K/sec)
roundtrip min/avg/max: 0.008/0.008/0.008

Compiling MPI Programs

Refer to "MPI: It's Easy to Get Started" to see a simple MPI program. mpicc, mpiCC, and mpif77 are wrappers for the C, C++, and Fortran 77 compilers, respectively. Each wrapper passes the command line switches that the underlying compiler needs to find the LAM include files, the relevant LAM libraries, etc.

shell$ mpicc -o foo foo.c
shell$ mpif77 -o foo foo.f

Executing MPI Programs

An MPI application is started by one invocation of the mpirun command. An SPMD (single program, multiple data) application can be started directly on the mpirun command line.

shell$ mpirun -v -np 2 foo
2445 foo running on n0 (o)
361 foo running on n1

An application with multiple programs must be described in an application schema, a file that lists each program and its target node(s).

shell$ cat appfile
# 1 master, 2 slaves
n0 master 
n0-1 slave 

shell$ mpirun -v appfile
3292 master running on n0 (o)
3296 slave running on n0 (o)
412 slave running on n1
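To make the mapping from schema lines to processes concrete, here is a minimal Python sketch that expands the appfile above into one (program, node) pair per process. It is not LAM's own parser: the function name expand_app_schema is hypothetical, and it handles only the two node forms shown in this example (a single node such as n0 and a range such as n0-1).

```python
import re

# Matches "n0" (single node) or "n0-1" (inclusive range of nodes).
NODE_SPEC = re.compile(r"^n(\d+)(?:-(\d+))?$")


def expand_app_schema(schema_text):
    """Return (program, nodeid) pairs, one per process to be started."""
    processes = []
    for raw_line in schema_text.splitlines():
        # Drop comments and blank lines.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        node_spec, program = line.split()[:2]
        match = NODE_SPEC.match(node_spec)
        if match is None:
            raise ValueError(f"unsupported node spec: {node_spec!r}")
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) is not None else first
        # One process per node in the (inclusive) range.
        for nodeid in range(first, last + 1):
            processes.append((program, f"n{nodeid}"))
    return processes


appfile = """\
# 1 master, 2 slaves
n0 master
n0-1 slave
"""

for program, nodeid in expand_app_schema(appfile):
    print(program, "on", nodeid)
```

The three pairs it prints correspond to the three processes that mpirun reports: master on n0, and slave on n0 and n1.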

Monitoring MPI Applications

The full MPI synchronization status of all processes and messages can be displayed at any time. This includes the source and destination ranks, the message tag, count and datatype, the communicator, and the function invoked.

shell$ mpitask
TASK (G/L)    FUNCTION      PEER|ROOT  TAG    COMM   COUNT   DATATYPE
0/0 master    Recv          ANY        ANY    WORLD  1       INT
1 slave       <running>
2 slave       <running>

Process rank 0 is blocked receiving a message consisting of a single integer from any source rank and any message tag, using the MPI_COMM_WORLD communicator. The other processes are running.

shell$ mpimsg
SRC (G/L)   DEST (G/L)   TAG   COMM    COUNT   DATATYPE    MSG
0/0         1/1          7     WORLD   4       INT         n0,#0

Later, we see that a message sent by process rank 0 to process rank 1 is buffered and waiting to be received. It was sent with tag 7 using the MPI_COMM_WORLD communicator and contains 4 integers.
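For readers who want to post-process this output, the following Python sketch turns one mpimsg data row into a dictionary keyed by column. It assumes the seven whitespace-separated columns shown in the header above; the field names and the function parse_mpimsg_row are hypothetical and not part of LAM.

```python
# Column names chosen to mirror the mpimsg header shown above.
MPIMSG_FIELDS = ("src", "dest", "tag", "comm", "count", "datatype", "msg")


def parse_mpimsg_row(row):
    """Split one mpimsg data row into a dict keyed by column name."""
    values = row.split()
    if len(values) != len(MPIMSG_FIELDS):
        raise ValueError(f"expected {len(MPIMSG_FIELDS)} columns, got {len(values)}")
    record = dict(zip(MPIMSG_FIELDS, values))
    # COUNT is the number of elements of DATATYPE in the message.
    record["count"] = int(record["count"])
    return record


row = "0/0         1/1          7     WORLD   4       INT         n0,#0"
message = parse_mpimsg_row(row)
print(message["src"], "->", message["dest"],
      "tag", message["tag"], message["count"], message["datatype"])
```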

Cleaning LAM

All user processes and messages can be removed, without rebooting.

shell$ lamclean -v
killing processes, done      
sweeping messages, done      
closing files, done      
sweeping traces, done

It is typical for users to mpirun a program, lamclean when it finishes, and then mpirun another program. It is not necessary to lamboot to run each user MPI program.
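The cycle described above (one lamboot, then mpirun and lamclean for each program) can be written out as a command plan. The Python sketch below only builds and prints the commands and does not execute anything; the function name session_commands and the second program name bar are hypothetical, and -np 2 simply mirrors the earlier example.

```python
def session_commands(boot_schema, programs, np=2):
    """Build the command sequence for one LAM session running several programs."""
    # LAM is booted once for the whole session.
    commands = [["lamboot", "-v", boot_schema]]
    for program in programs:
        # Run the program, then clear leftover processes and messages
        # so the next program starts from a clean state.
        commands.append(["mpirun", "-np", str(np), program])
        commands.append(["lamclean", "-v"])
    return commands


plan = session_commands("lamhosts", ["foo", "bar"])
for argv in plan:
    print(" ".join(argv))
```

To actually run the plan, each argv list could be passed to subprocess.run on a machine where LAM is installed.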

Terminating LAM

The lamhalt tool removes all traces of the LAM session on the network. This is only performed when LAM/MPI is no longer needed (i.e., no more mpirun/lamclean commands will be issued).

shell$ lamhalt

In the case of a catastrophic failure (e.g., one or more LAM nodes crash), the lamhalt utility will hang. In this case, the wipe tool is necessary. The same boot schema that was used with lamboot is necessary to list each node where the LAM run-time environment is running:

shell$ wipe -v lamhosts
Executing tkill on n0 (node1.cluster.example.com)...
Executing tkill on n1 (node2.cluster.example.com)...
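
The fallback from lamhalt to wipe can be automated. The Python sketch below is one possible approach rather than an official LAM recipe: it treats a lamhalt that does not finish within a timeout as hung and then runs wipe with the original boot schema. The function name shut_down_lam, the 30-second default, and the injectable run parameter (used for testing) are all assumptions of this example.

```python
import subprocess


def shut_down_lam(boot_schema, timeout_secs=30.0, run=subprocess.run):
    """Try lamhalt first; if it appears to hang, fall back to wipe.

    Returns the name of the tool that completed the shutdown.
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if lamhalt has not exited after timeout_secs.
        run(["lamhalt"], timeout=timeout_secs, check=True)
        return "lamhalt"
    except subprocess.TimeoutExpired:
        # wipe needs the same boot schema that was given to lamboot.
        run(["wipe", "-v", boot_schema], check=True)
        return "wipe"
```

On a machine with LAM installed, calling shut_down_lam("lamhosts") would attempt the normal shutdown and resort to wipe only after the timeout expires.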