[padb-users] problem with minfo

Dave Love d.love at liverpool.ac.uk
Tue Dec 10 16:25:19 GMT 2013


Ashley Pittman <ashley at pittman.co.uk> writes:

> Dave,
>
> Thank you for your query.  I’ve just tried this with the current tip of
> openmpi and I’m seeing something very similar, although it includes
> more helpful information.  This looks like an issue with openmpi
> itself.  I know there have been problems with the topology code in the
> past, but I thought these had all been resolved now.

I was confused because I was looking at the raw code while actually
running with a version of your groups patch.  The patch removed the
reporting of the symbol involved, and I thought that the general message
printed was from padb rather than minfo -- is there a good reason for
dropping the symbol from the report?

It turns out my problem with 1.6.5 was due to having bad debug info.
Maybe it's worth a caution somewhere that debug info has to be present
and correct?  --deadlock does basically work with a patched version of
openmpi 1.6.5, and it isn't necessary to comment out the relevant topo
symbol, as was done previously.  It is necessary to do that with openmpi
1.7.3, so there's been a regression in the 1.7 series.

Having got it working, and running larger jobs than previously, I find it
takes a long time to start up and often gives an error (the same for
--deadlock and --mpi-watch, of the options I've tried), which stops
--watch working.  This --deadlock run against 144 processes on 9 hosts
took over two minutes:

  Waiting for signon from 9 hosts.
  Waiting for signon from 9 hosts.
  Total: 234 communicators of which 0 are in use.
  No data was recorded for 3490 communicators
  Unexpected EOF from child socket (live)
  Unexpected EOF from Inner stdout (live)
  Unexpected EOF from Inner stderr (live)
  Unexpected exit from parallel command (state=live)

Any suggestions on debugging the errors or the long startup?  (--debug
and --verbose didn't help.)

For what it's worth, the Open MPI ticket for the problem with the topo
symbol is <https://svn.open-mpi.org/trac/ompi/ticket/3958>.