[padb-users] padb not finding remote ranks

Ashley Pittman ashley at pittman.co.uk
Wed Sep 1 22:23:40 BST 2010


On 30 Aug 2010, at 20:59, Steve Wise wrote:

> Hey,
> 
> I have an openmpi-1.4.1 8 node 64NP cluster, which is running jobs via orte/mpirun.  I start a job that is hanging, and then run padb to get the stack traces, yet padb displays this error and only shows the local process stacks (see output below).
> 
> Any ideas?

This looks like a hostname issue. I note that ompi-ps is giving an FQDN for n0 but short names for n[1-7]; padb is then finding every process on n0 but none on any of the other nodes. It looks like I can probably make padb work in this case by changing the logic around slightly: if the reported hostname is a fully qualified domain name and nothing is found for that name, then also check the shorter unqualified hostname. This shouldn't be too hard to add. It is far from a "proper" solution, but from the looks of it, it should make padb behave correctly in this case.
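
As a rough sketch of that fallback (written in Python purely for illustration; padb itself is a Perl script, and the function names, the ranks_by_host mapping and the example.com domain are all made up here rather than taken from padb's code):

```python
def short_hostname(hostname):
    """Return the part of a hostname before the first dot."""
    return hostname.split(".", 1)[0]


def find_ranks(reported_host, ranks_by_host):
    """Look up the ranks for a host, falling back to its short name.

    ranks_by_host is a hypothetical mapping from the hostname a node
    reports for itself to the list of ranks found on that node.
    """
    ranks = ranks_by_host.get(reported_host)
    if ranks:
        return ranks
    # Only fall back when the reported name is fully qualified,
    # i.e. it contains a domain part after the first dot.
    if "." in reported_host:
        return ranks_by_host.get(short_hostname(reported_host), [])
    return []


# The resource manager reports an FQDN, the node reports a short name.
print(find_ranks("n0.example.com", {"n0": [0, 1]}))
```

An exact match is always tried first, so hosts that already agree on their names are unaffected; the short name is only consulted when the fully qualified lookup comes back empty.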

I'm on holiday this week, having moved house, and don't have regular internet access or an office right now, but I can take a look at this early next week for you.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
