[padb-users] problem with minfo
Dave Love
d.love at liverpool.ac.uk
Mon Dec 16 11:03:16 GMT 2013
Ashley Pittman <ashley at pittman.co.uk> writes:
>> It turns out my problem with 1.6.5 was due to having bad debug info.
>> Maybe it's worth a caution somewhere that it has to be present and
>> correct?
>
> Does the image_has_queues function fail without it?
Yes. The symbols involved are types, and so exist only in the debug info.
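(As a quick sanity check before blaming padb, one can verify the MPI library actually carries DWARF debug info; the library path below is just a placeholder, not the real location:)

```shell
lib=/usr/lib/libmpi.so   # placeholder: substitute the libmpi.so your job links

# The message-queue types minfo needs live in the DWARF .debug_info
# section; if readelf finds no such section, the types are stripped and
# image_has_queues will fail.
if readelf -S "$lib" 2>/dev/null | grep -q '\.debug_info'; then
    status="debug info present"
else
    status="debug info missing"
fi
echo "$status"
```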
> I can change the
> minfo error message but this string is something the dll itself could
> report.
Good point. I'll send a patch.
>> Having got it working, and running larger jobs than previously, it takes
>> a long time to start up and often gives an error (the same for
>> --deadlock and -mpi-watch of the options I've tried) which stops --watch
>> working. This --deadlock run against 144 processes on 9 hosts took over
>> two minutes:
>>
>> Waiting for signon from 9 hosts.
>> Waiting for signon from 9 hosts.
>
> This error can happen occasionally, every five seconds the code prints
> how many clients it’s waiting for. As there are only 9 clients and
> they all respond between 10 and 15 seconds after the connection is
> started this sounds like it could be DNS lookups by the remote SSH
> server?
I think the few seconds' delay is just slow ssh into busy stateless
nodes; DNS isn't slow here.
Modifying the "DEBUG (full_duplex)" output to print what it's running,
and using "base64 -d|strings" on the slow part, shows:
deadlock
mode
minfo
show_group_members
show_all_groups
mpi_dll
cargs
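(The decode step itself was nothing clever; roughly the following, with $chunk standing in for one base64 blob captured from the patched debug output:)

```shell
# Stand-in payload: in practice $chunk is a blob captured from the
# patched "DEBUG (full_duplex)" output, not built by hand like this.
chunk=$(printf 'deadlock\0mode\0minfo\0mpi_dll' | base64)

# Decode and keep only the printable strings, as described above.
decoded=$(printf '%s' "$chunk" | base64 -d | strings)
echo "$decoded"
```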
and then there's another hiatus after the first lot of "Target output"
before the EOF errors, so it looks as if it's held up in minfo and/or
comms with it. I'll try to find a way to debug it at some stage.
>> Total: 234 communicators of which 0 are in use.
>> No data was recorded for 3490 communicators
>> Unexpected EOF from child socket (live)
>> Unexpected EOF from Inner stdout (live)
>> Unexpected EOF from Inner stderr (live)
>> Unexpected exit from parallel command (state=live)
>
> These EOF messages probably represent a bug, there isn’t enough information here for me to say what it might be however.
I couldn't see a way of getting any more with --debug, for instance, and
assume it will be tricky to track down.
>> Any suggestions on debugging the errors or the long startup? (--debug
>> and --verbose didn't help.)
>
> If you’re using rmgr=mpirun then padb will not be able to use orte to
> launch the job and will fall back to using pdsh, this might be the
> cause or as above it could be ssh/DNS timeouts you’d get with orterun
> anyway.
No, the above was with orte, which works with ompi 1.6, our production
version.
>> For what it's worth, the issue for the problem with the topo symbol is
>> <https://svn.open-mpi.org/trac/ompi/ticket/3958>.
>
> I’ve attached a patch to padb to re-enable the deadlock code, it suffered slightly during some code reorganisation that I made. The interaction between minfo and padb has changed slightly and this code never got updated. I’ve only tested this on a simple program but it’s working correctly for me here against a git checkout of master from the 2nd Dec.
> I’ve also attached the open-mpi patch I’m working with currently.
>
> Let me know how you get on.
Ah. I'm now getting useful-looking info from --deadlock, but clearly
the counters need to be longs:
Warning: errors reported by some ranks
========
[9]: UNPARSEABLE MINFO: col: id:1 count:-1236416616 active:1
[4]: UNPARSEABLE MINFO: col: id:1 count:-1640605800 active:1
[11]: UNPARSEABLE MINFO: col: id:1 count:-503367784 active:1
I assume that's a simple change, but I haven't tried yet.
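(To check that reading: if the count really is a 64-bit value being squeezed through a signed 32-bit field, then a raw value with low 32 bits of 3058550680, chosen here purely to match the rank 9 line, wraps to exactly the number reported:)

```shell
# Hypothetical raw count whose low 32 bits exceed 2^31; the value is
# picked so the wrap reproduces the rank 9 report above.
raw=3058550680

# Reinterpret the low 32 bits as a signed 32-bit integer, i.e. what a
# plain int counter would end up holding.
low32=$(( raw & 0xFFFFFFFF ))
signed=$(( low32 >= 2147483648 ? low32 - 4294967296 : low32 ))
echo "$signed"    # -1236416616, matching "[9]: ... count:-1236416616"
```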