[padb-users] problem with minfo

Ashley Pittman ashley at pittman.co.uk
Fri Dec 13 01:06:19 GMT 2013


On 10 Dec 2013, at 16:25, Dave Love <d.love at liverpool.ac.uk> wrote:

> Ashley Pittman <ashley at pittman.co.uk> writes:
> 
>> Dave,
>> 
>> Thank you for your query, I’ve just tried this with the current tip of
>> openmpi and I’m seeing something very similar although it includes
>> more helpful information.  This looks like an issue with openmpi
>> itself, I know there have been problems with the topology code in the
>> past but I thought these had all been resolved now.
> 
> I was confused because I was looking at the raw code and actually
> running with a version of your groups patch.  The patch removed the
> reporting of the symbol involved, and I thought that the general message
> printed was from padb rather than minfo -- is there a good reason for
> that?

Yes and no.  There was a bug in Open MPI that caused it to always return the error message, even on success.  My patch was a little heavy-handed here, and that part of it is no longer needed.

> It turns out my problem with 1.6.5 was due to having bad debug info.
> Maybe it's worth a caution somewhere that it has to be present and
> correct?

Does the image_has_queues function fail without it?  I can change the minfo error message, but this string is something the DLL itself could report.

>  --deadlock does basically work with a patched version of
> openmpi 1.6.5, and it isn't necessary to comment out the relevant topo
> symbol, as was done previously.  It is necessary to do that with openmpi
> 1.7.3, so there's been a regression in the 1.7 series.

That makes sense; I didn’t see how message queues could be totally broken in 1.6.5.

> Having got it working, and running larger jobs than previously, it takes
> a long time to start up and often gives an error (the same for
> --deadlock and -mpi-watch of the options I've tried) which stops --watch
> working.  This --deadlock run against 144 processes on 9 hosts took over
> two minutes:
> 
>  Waiting for signon from 9 hosts.
>  Waiting for signon from 9 hosts.

This can happen occasionally: every five seconds the code prints how many clients it’s still waiting for.  The message was printed twice, so all 9 clients responded between 10 and 15 seconds after the connection was started.  With so few clients, this sounds like it could be DNS lookups by the remote SSH server.
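
One rough way to check the DNS theory is to time forward and reverse lookups on the nodes involved.  The sketch below is not part of padb; the function names and the one-second threshold are made up for illustration, and it only measures resolver latency, not the SSH connection itself.

```python
import socket
import sys
import time


def time_lookup(host, resolve=socket.gethostbyname,
                reverse=socket.gethostbyaddr, clock=time.monotonic):
    """Time a forward then a reverse lookup for one host.

    Returns (forward_seconds, reverse_seconds, error).  reverse_seconds is
    None if the forward lookup failed; error is None if both succeeded.
    The resolver and clock arguments exist only so they can be swapped out.
    """
    start = clock()
    try:
        addr = resolve(host)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return clock() - start, None, str(exc)
    forward = clock() - start

    start = clock()
    try:
        reverse(addr)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return forward, clock() - start, str(exc)
    return forward, clock() - start, None


def slow_hosts(hosts, threshold=1.0, **kwargs):
    """Return hosts whose lookups failed or took longer than threshold seconds."""
    slow = []
    for host in hosts:
        forward, rev, err = time_lookup(host, **kwargs)
        if err is not None or max(forward, rev or 0.0) > threshold:
            slow.append(host)
    return slow


if __name__ == "__main__":
    # Host names are taken from the command line.
    for name in sys.argv[1:]:
        print(name, time_lookup(name))
```

Run it on one of the compute nodes with the launch host's name as an argument (and the other way round).  Lookups that take several seconds or fail outright would point at the resolver rather than at padb.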

>  Total: 234 communicators of which 0 are in use.
>  No data was recorded for 3490 communicators
>  Unexpected EOF from child socket (live)
>  Unexpected EOF from Inner stdout (live)
>  Unexpected EOF from Inner stderr (live)
>  Unexpected exit from parallel command (state=live)

These EOF messages probably represent a bug, but there isn’t enough information here for me to say what it might be.

> Any suggestions on debugging the errors or the long startup?  (--debug
> and --verbose didn't help.)

If you’re using rmgr=mpirun then padb will not be able to use orte to launch the job and will fall back to using pdsh.  This might be the cause or, as above, it could be ssh/DNS timeouts that you’d get with orterun anyway.

> For what it's worth, the issue for the problem with the topo symbol is
> <https://svn.open-mpi.org/trac/ompi/ticket/3958>.

I’ve attached a patch to padb to re-enable the deadlock code, which suffered slightly during some code reorganisation that I made.  The interaction between minfo and padb has changed slightly, and this code never got updated.  I’ve only tested this on a simple program, but it’s working correctly for me here against a git checkout of master from the 2nd Dec.

I’ve also attached the open-mpi patch I’m working with currently.

Let me know how you get on.

Ashley

-------------- next part --------------
A non-text attachment was scrubbed...
Name: padb.patch
Type: application/octet-stream
Size: 4112 bytes
Desc: not available
URL: <http://pittman.org.uk/pipermail/padb-users_pittman.org.uk/attachments/20131213/63c3b8b1/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: open-mpi.patch
Type: application/octet-stream
Size: 18012 bytes
Desc: not available
URL: <http://pittman.org.uk/pipermail/padb-users_pittman.org.uk/attachments/20131213/63c3b8b1/attachment-0001.obj>

