[padb-users] padb stalls with no output.

Ashley Pittman ashley at pittman.co.uk
Fri Aug 20 19:46:13 BST 2010


On 20 Aug 2010, at 19:38, Ashley Pittman wrote:

> The problem is with OpenMPI, and with its state directories in /tmp in particular. It could be that you've had a parallel job that crashed and left files around, or it could be that you are running multiple versions of OMPI and they don't like talking to each other.

Oh - I forgot to say how to "fix" this issue: wait until there are no jobs running, then remove all OMPI-related files and directories in /tmp on the node where the mpirun process was running.
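
The cleanup step can be sketched roughly as below. This is only an illustration: the "openmpi-sessions-*" name pattern is an assumption about how your OMPI version names its state directories in /tmp, and the function names are made up for this example. Check what is actually in /tmp first, and only run it with dry_run=False once no jobs are running.

```python
import glob
import os
import shutil


def find_ompi_leftovers(tmp_root="/tmp", pattern="openmpi-sessions-*"):
    """List entries under tmp_root matching the (assumed) OMPI name pattern."""
    return sorted(glob.glob(os.path.join(tmp_root, pattern)))


def remove_ompi_leftovers(tmp_root="/tmp", pattern="openmpi-sessions-*",
                          dry_run=True):
    """Return the matching paths; delete them only when dry_run is False."""
    matched = find_ompi_leftovers(tmp_root, pattern)
    for path in matched:
        if dry_run:
            continue  # report only, touch nothing
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
    return matched


if __name__ == "__main__":
    # Dry run by default: just show what would be removed.
    for path in remove_ompi_leftovers():
        print("would remove", path)
```

Running it as a script only prints the candidates, so you can check the list before deleting anything.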

Alternatively, you could try setting the "resource manager" in padb to mpirun instead of orte. This won't use ompi-ps; it uses the MPIR interface to extract the information from the mpirun process directly. You are limited to 32 nodes if you do this, however, as it uses pdsh as a backend. You can try raising that limit by setting FANOUT=128 in your environment, but you risk running out of file descriptors if you do; see the pdsh man page for details.
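
To get a rough idea of whether a larger FANOUT is safe, you can compare it against the open-file limit of your shell before launching anything. The sketch below is only a guess at the arithmetic: fds_per_node and reserve are assumed figures, not numbers taken from pdsh, and the function names are invented for this example.

```python
import os
import resource


def fanout_fits(fanout, fds_per_node=3, reserve=16):
    """Rough check that `fanout` parallel connections fit under the soft
    open-file limit. fds_per_node and reserve are assumptions."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    return fanout * fds_per_node + reserve <= soft


def pdsh_env(fanout, base_env=None):
    """Copy an environment and set FANOUT, the variable pdsh reads."""
    env = dict(os.environ if base_env is None else base_env)
    env["FANOUT"] = str(fanout)
    return env


if __name__ == "__main__":
    wanted = 128
    if fanout_fits(wanted):
        print("FANOUT=%d looks OK for this shell" % wanted)
    else:
        print("FANOUT=%d may exhaust file descriptors" % wanted)
```

The dictionary returned by pdsh_env() can be passed as env= to subprocess.run() when starting padb, so the setting applies to that one run rather than to your whole session.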

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
