[padb-users] problem with minfo
Dave Love
d.love at liverpool.ac.uk
Mon Dec 16 11:03:16 GMT 2013
Ashley Pittman <ashley at pittman.co.uk> writes:
>> It turns out my problem with 1.6.5 was due to having bad debug info.
>> Maybe it's worth a caution somewhere that it has to be present and
>> correct?
>
> Does the image_has_queues function fail without it?
Yes. The symbols involved are types, and so exist only in the debug info.
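(As a quick sanity check before blaming padb, one can verify the MPI library actually carries DWARF debug info; the library path below is just a placeholder, not the real location:)

```shell
lib=/usr/lib/libmpi.so   # placeholder: substitute the libmpi.so your job links

# The message-queue types minfo needs live in the DWARF .debug_info
# section; if readelf finds no such section, the types are stripped and
# image_has_queues will fail.
if readelf -S "$lib" 2>/dev/null | grep -q '\.debug_info'; then
    status="debug info present"
else
    status="debug info missing"
fi
echo "$status"
```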
> I can change the
> minfo error message but this string is something the dll itself could
> report.
Good point. I'll send a patch.
>> Having got it working, and running larger jobs than previously, it takes
>> a long time to start up and often gives an error (the same for
>> --deadlock and -mpi-watch of the options I've tried) which stops --watch
>> working. This --deadlock run against 144 processes on 9 hosts took over
>> two minutes:
>>
>> Waiting for signon from 9 hosts.
>> Waiting for signon from 9 hosts.
>
> This error can happen occasionally, every five seconds the code prints
> how many clients it’s waiting for. As there are only 9 clients and
> they all respond between 10 and 15 seconds after the connection is
> started this sounds like it could be DNS lookups by the remote SSH
> server?
I think the few seconds' delay is just slow ssh into busy stateless
nodes; DNS isn't slow here.
Modifying the "DEBUG (full_duplex)" output to print what it's running,
and using "base64 -d|strings" on the slow part, shows:
deadlock
mode
minfo
show_group_members
show_all_groups
mpi_dll
cargs
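(The decode step itself was nothing clever; roughly the following, with $chunk standing in for one base64 blob captured from the patched debug output:)

```shell
# Stand-in payload: in practice $chunk is a blob captured from the
# patched "DEBUG (full_duplex)" output, not built by hand like this.
chunk=$(printf 'deadlock\0mode\0minfo\0mpi_dll' | base64)

# Decode and keep only the printable strings, as described above.
decoded=$(printf '%s' "$chunk" | base64 -d | strings)
echo "$decoded"
```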
and then there's another hiatus after the first lot of "Target output"
before the EOF errors, so it looks as if it's held up in minfo and/or
comms with it. I'll try to find a way to debug it at some stage.
>> Total: 234 communicators of which 0 are in use.
>> No data was recorded for 3490 communicators
>> Unexpected EOF from child socket (live)
>> Unexpected EOF from Inner stdout (live)
>> Unexpected EOF from Inner stderr (live)
>> Unexpected exit from parallel command (state=live)
>
> These EOF messages probably represent a bug, there isn’t enough information here for me to say what it might be however.
I couldn't see a way of getting any more with --debug, for instance, and
assume it will be tricky to track down.
>> Any suggestions on debugging the errors or the long startup? (--debug
>> and --verbose didn't help.)
>
> If you’re using rmgr=mpirun then padb will not be able to use orte to
> launch the job and will fall back to using pdsh, this might be the
> cause or as above it could be ssh/DNS timeouts you’d get with orterun
> anyway.
No, the above was with orte, which works with ompi 1.6, our production
version.
>> For what it's worth, the issue for the problem with the topo symbol is
>> <https://svn.open-mpi.org/trac/ompi/ticket/3958>.
>
> I’ve attached a patch to padb to re-enable the deadlock code, it suffered slightly during some code reorganisation that I made. The interaction between minfo and padb has changed slightly and this code never got updated. I’ve only tested this on a simple program but it’s working correctly for me here against a git checkout of master from the 2nd Dec.
> I’ve also attached the open-mpi patch I’m working with currently.
>
> Let me know how you get on.
Ah. I'm now getting useful-looking info from --deadlock, but clearly
the counters need to be longs:
Warning: errors reported by some ranks
========
[9]: UNPARSEABLE MINFO: col: id:1 count:-1236416616 active:1
[4]: UNPARSEABLE MINFO: col: id:1 count:-1640605800 active:1
[11]: UNPARSEABLE MINFO: col: id:1 count:-503367784 active:1
I assume that's a simple change, but I haven't tried yet.
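(To check that reading: if the count really is a 64-bit value being squeezed through a signed 32-bit field, then a raw value with low 32 bits of 3058550680, chosen here purely to match the rank 9 line, wraps to exactly the number reported:)

```shell
# Hypothetical raw count whose low 32 bits exceed 2^31; the value is
# picked so the wrap reproduces the rank 9 report above.
raw=3058550680

# Reinterpret the low 32 bits as a signed 32-bit integer, i.e. what a
# plain int counter would end up holding.
low32=$(( raw & 0xFFFFFFFF ))
signed=$(( low32 >= 2147483648 ? low32 - 4294967296 : low32 ))
echo "$signed"    # -1236416616, matching "[9]: ... count:-1236416616"
```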