[padb-users] Problems running padb on large processor counts

Duncan Harris harris.duncan at gmail.com
Wed Sep 15 11:47:09 BST 2010


Hi.
One of our users is trying to run padb to find out why his 576-PE job
is hanging. However, when he does, he gets this:

[node195]> padb -x -t -a -Ormgr=mpirun
Waiting for signon from 16 hosts.
Waiting for signon from 16 hosts.
Waiting for signon from 16 hosts.
Waiting for signon from 16 hosts.
Waiting for signon from 16 hosts.
einner: node197: Failed to connect to child (node204:53314)
at ~/bin/padb line 4414.
einner: pdsh at node195: node197: ssh exited with exit code 111
einner: node197: Failed to connect to child (node210:36682)
at ~/bin/padb line 4414.
einner: node209: Use of uninitialized value in numeric eq (==)
at ~/bin/padb line 9886.
einner: node208: Use of uninitialized value in numeric eq (==)
at ~/bin/padb line 9886.
einner: pdsh at node195: node198: ssh exited with exit code 111
einner: node196: Failed to connect to child (node201:39642)
at ~/bin/padb line 4414.
einner: node200: Use of uninitialized value in numeric eq (==)
at ~/bin/padb line 9886.
einner: pdsh at node195: node196: ssh exited with exit code 111
einner: node199: Failed to connect to child (node212:54985)
at ~/bin/padb line 4414.
einner: pdsh at node195: node199: ssh exited with exit code 111



This is running padb-3.2-beta0 over InfiniBand with Open MPI 1.4.1 on
Red Hat Enterprise Linux 5.3.

Any suggestions as to what the problem could be, please?

Cheers.
Duncan



