From d.love at liverpool.ac.uk Mon Dec 2 12:33:51 2013
From: d.love at liverpool.ac.uk (Dave Love)
Date: Mon, 02 Dec 2013 12:33:51 +0000
Subject: [padb-users] problem with minfo
Message-ID: <87haarfn4g.fsf@pc102091.liv.ac.uk>

I've been trying to run padb (current repo head) against openmpi 1.6.5
and I see this:

Warning: errors reported by some ranks
========
[0-31]: Error message from /usr/libexec/minfo: image_has_queues() failed
========

whereas it seems to be happy with openmpi 1.4.5 on another system.

I don't understand the debugging support, and there's an "Are we supposed
to ignore this ?" comment in ompi where the error return is done.  Any
suggestions for debugging the debugging?

Thanks.

From ashley at pittman.co.uk Mon Dec 2 23:10:28 2013
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 2 Dec 2013 23:10:28 +0000
Subject: [padb-users] problem with minfo
In-Reply-To: <87haarfn4g.fsf@pc102091.liv.ac.uk>
References: <87haarfn4g.fsf@pc102091.liv.ac.uk>
Message-ID: 

Dave,

Thank you for your query.  I've just tried this with the current tip of
openmpi and I'm seeing something very similar, although it includes
more helpful information.  This looks like an issue with openmpi
itself; I know there have been problems with the topology code in the
past but I thought these had all been resolved now.

[22:46] [0] ashley at cloud2:~/code/padb $ ./src/padb -aq -Ormgr=mpirun
Warning: errors reported by some ranks
========
[0-1]: Error message from /home/ashley/code/padb/./src/minfo: image_has_queues() failed
========
----------------
[0-1]: Error string from DLL
----------------
Failed to find some type
----------------
[0-1]: Message from DLL
----------------
mca_topo_base_module_2_1_0_t

I'll try some more tests with openmpi 1.6.5 tomorrow but it looks to
me like openmpi itself is broken rather than anything with padb
itself.

In detail, what I think is happening here is that the MPIR_Ignore_queues
symbol isn't being found (as it shouldn't be) but rather the
ompi_fill_in_type_info() function is failing, and in particular it is
unable to find the mca_topo_base_module_2_1_0_t symbol, which it treats
as a critical error.  I've grepped the source for this string and the
only matches are from the debugger directory itself, which makes me
think that either the code is plain wrong or that the "2_1_0" part of
the type is somehow auto-generated and the debugger code hasn't been
updated to reflect a change elsewhere.

Ashley,

On 2 Dec 2013, at 12:33, Dave Love wrote:

> I've been trying to run padb (current repo head) against openmpi 1.6.5
> and I see this:
>
> Warning: errors reported by some ranks
> ========
> [0-31]: Error message from /usr/libexec/minfo: image_has_queues() failed
> ========
>
> whereas it seems to be happy with openmpi 1.4.5 on another system.
>
> I don't understand the debugging support, and there's an "Are we supposed
> to ignore this ?" comment in ompi where the error return is done.  Any
> suggestions for debugging the debugging?
>
> Thanks.
>
> _______________________________________________
> padb-users mailing list
> padb-users at pittman.org.uk
> http://pittman.org.uk/mailman/listinfo/padb-users_pittman.org.uk
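[Aside: the sketch below is not the real Open MPI debugger-DLL code; the
function name and the list of type names are hypothetical stand-ins.  It
only illustrates the failure mode Ashley describes above: a
fill-in-type-info pass looks a series of type names up in the image's
debug information, and the first name it cannot resolve -- here the
versioned topo module type -- is treated as fatal, which padb then
reports as "image_has_queues() failed" / "Failed to find some type".]

/* type_lookup_sketch.c -- hypothetical illustration, not Open MPI code. */
#include <stdio.h>
#include <string.h>

/* Stand-in for the debugger callback that resolves a type name from the
 * image's debug information.  Here it pretends the topo module type is
 * missing, as happens when the "2_1_0" suffix changes elsewhere or when
 * the library was built without usable debug info. */
static int find_type_in_image(const char *name)
{
    return strcmp(name, "mca_topo_base_module_2_1_0_t") != 0;
}

static const char *last_missing;    /* the string minfo/padb would report */

/* Rough analogue of a fill-in-type-info pass where every lookup is fatal. */
static int fill_in_type_info(void)
{
    static const char *needed[] = {
        "ompi_communicator_t",
        "opal_list_item_t",
        "mca_topo_base_module_2_1_0_t",   /* arguably should be optional */
    };
    for (size_t i = 0; i < sizeof(needed) / sizeof(needed[0]); i++) {
        if (!find_type_in_image(needed[i])) {
            last_missing = needed[i];
            return -1;                    /* -> "image_has_queues() failed" */
        }
    }
    return 0;
}

int main(void)
{
    if (fill_in_type_info() != 0)
        printf("Failed to find some type: %s\n", last_missing);
    return 0;
}

[The workaround mentioned later in the thread, "commenting out the
relevant topo symbol", presumably amounts to dropping that entry from
the fatal list, or demoting it to a non-fatal warning.]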
From d.love at liverpool.ac.uk Thu Dec 5 13:09:49 2013
From: d.love at liverpool.ac.uk (Dave Love)
Date: Thu, 05 Dec 2013 13:09:49 +0000
Subject: [padb-users] problem with minfo
In-Reply-To: (Ashley Pittman's message of "Mon, 2 Dec 2013 23:10:28 +0000")
References: <87haarfn4g.fsf@pc102091.liv.ac.uk>
Message-ID: <871u1re95u.fsf@pc102091.liv.ac.uk>

Ashley Pittman writes:

> I'll try some more tests with openmpi 1.6.5 tomorrow but it looks to
> me like openmpi itself is broken rather than anything with padb
> itself.

Yes, I assumed it was, but thought it was best to ask here first.
Should I raise an openmpi issue?

From ashley at pittman.co.uk Thu Dec 5 14:46:57 2013
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Thu, 5 Dec 2013 14:46:57 +0000
Subject: [padb-users] problem with minfo
In-Reply-To: <871u1re95u.fsf@pc102091.liv.ac.uk>
References: <87haarfn4g.fsf@pc102091.liv.ac.uk> <871u1re95u.fsf@pc102091.liv.ac.uk>
Message-ID: 

On 5 Dec 2013, at 13:09, Dave Love wrote:

> Ashley Pittman writes:
>
>> I'll try some more tests with openmpi 1.6.5 tomorrow but it looks to
>> me like openmpi itself is broken rather than anything with padb
>> itself.
>
> Yes, I assumed it was, but thought it was best to ask here first.
> Should I raise an openmpi issue?

I've already emailed Jeff but I know he's swamped at the minute.  Yes,
it's best to raise an openmpi issue.  I'm fairly familiar with their
code, so feel free to cc me and I can fill in any technical details
that are required.

Ashley.

From d.love at liverpool.ac.uk Tue Dec 10 16:25:19 2013
From: d.love at liverpool.ac.uk (Dave Love)
Date: Tue, 10 Dec 2013 16:25:19 +0000
Subject: [padb-users] problem with minfo
References: <87haarfn4g.fsf@pc102091.liv.ac.uk>
Message-ID: <8738m07jww.fsf@pc102091.liv.ac.uk>

Ashley Pittman writes:

> Dave,
>
> Thank you for your query.  I've just tried this with the current tip of
> openmpi and I'm seeing something very similar, although it includes
> more helpful information.  This looks like an issue with openmpi
> itself; I know there have been problems with the topology code in the
> past but I thought these had all been resolved now.

I was confused because I was looking at the raw code and actually
running with a version of your groups patch.  The patch removed the
reporting of the symbol involved, and I thought that the general message
printed was from padb rather than minfo -- is there a good reason for
that?

It turns out my problem with 1.6.5 was due to having bad debug info.
Maybe it's worth a caution somewhere that it has to be present and
correct?

--deadlock does basically work with a patched version of
openmpi 1.6.5, and it isn't necessary to comment out the relevant topo
symbol, as was done previously.  It is necessary to do that with openmpi
1.7.3, so there's been a regression in the 1.7 series.

Having got it working, and running larger jobs than previously, it takes
a long time to start up and often gives an error (the same for
--deadlock and -mpi-watch of the options I've tried) which stops --watch
working.  This --deadlock run against 144 processes on 9 hosts took over
two minutes:

Waiting for signon from 9 hosts.
Waiting for signon from 9 hosts.
Total: 234 communicators of which 0 are in use.
No data was recorded for 3490 communicators
Unexpected EOF from child socket (live)
Unexpected EOF from Inner stdout (live)
Unexpected EOF from Inner stderr (live)
Unexpected exit from parallel command (state=live)

Any suggestions on debugging the errors or the long startup? (--debug
and --verbose didn't help.)

For what it's worth, the issue for the problem with the topo symbol is
.

From d.love at liverpool.ac.uk Tue Dec 10 16:30:50 2013
From: d.love at liverpool.ac.uk (Dave Love)
Date: Tue, 10 Dec 2013 16:30:50 +0000
Subject: [padb-users] rmgr=orte is broken with openmpi 1.7
Message-ID: <87y53s6539.fsf@pc102091.liv.ac.uk>

In case it saves someone some time: While checking padb with openmpi
1.7.3 I found it reported no jobs because orte-ps appears to be broken.
Use mpirun instead.  I opened .

Is there actually any advantage to using orte over mpirun, assuming it's
working?

From ashley at pittman.co.uk Fri Dec 13 01:06:19 2013
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Fri, 13 Dec 2013 01:06:19 +0000
Subject: [padb-users] problem with minfo
In-Reply-To: <8738m07jww.fsf@pc102091.liv.ac.uk>
References: <87haarfn4g.fsf@pc102091.liv.ac.uk> <8738m07jww.fsf@pc102091.liv.ac.uk>
Message-ID: 

On 10 Dec 2013, at 16:25, Dave Love wrote:

> Ashley Pittman writes:
>
>> Dave,
>>
>> Thank you for your query.  I've just tried this with the current tip of
>> openmpi and I'm seeing something very similar, although it includes
>> more helpful information.  This looks like an issue with openmpi
>> itself; I know there have been problems with the topology code in the
>> past but I thought these had all been resolved now.
>
> I was confused because I was looking at the raw code and actually
> running with a version of your groups patch.  The patch removed the
> reporting of the symbol involved, and I thought that the general message
> printed was from padb rather than minfo -- is there a good reason for
> that?

Yes and no: there was a bug in Open MPI where it always returned the
error message, even on success.  My patch was a little heavy-handed
here and that part of it is no longer needed.

> It turns out my problem with 1.6.5 was due to having bad debug info.
> Maybe it's worth a caution somewhere that it has to be present and
> correct?

Does the image_has_queues function fail without it?  I can change the
minfo error message but this string is something the dll itself could
report.

> --deadlock does basically work with a patched version of
> openmpi 1.6.5, and it isn't necessary to comment out the relevant topo
> symbol, as was done previously.  It is necessary to do that with openmpi
> 1.7.3, so there's been a regression in the 1.7 series.

That makes sense; I didn't see how message queues could be totally
broken in 1.6.5.

> Having got it working, and running larger jobs than previously, it takes
> a long time to start up and often gives an error (the same for
> --deadlock and -mpi-watch of the options I've tried) which stops --watch
> working.  This --deadlock run against 144 processes on 9 hosts took over
> two minutes:
>
> Waiting for signon from 9 hosts.
> Waiting for signon from 9 hosts.

This error can happen occasionally; every five seconds the code prints
how many clients it's waiting for.  As there are only 9 clients and
they all respond between 10 and 15 seconds after the connection is
started, this sounds like it could be DNS lookups by the remote SSH
server?

> Total: 234 communicators of which 0 are in use.
> No data was recorded for 3490 communicators
> Unexpected EOF from child socket (live)
> Unexpected EOF from Inner stdout (live)
> Unexpected EOF from Inner stderr (live)
> Unexpected exit from parallel command (state=live)

These EOF messages probably represent a bug; there isn't enough
information here for me to say what it might be, however.

> Any suggestions on debugging the errors or the long startup? (--debug
> and --verbose didn't help.)

If you're using rmgr=mpirun then padb will not be able to use orte to
launch the job and will fall back to using pdsh; this might be the
cause, or as above it could be ssh/DNS timeouts you'd get with orterun
anyway.

> For what it's worth, the issue for the problem with the topo symbol is
> .

I've attached a patch to padb to re-enable the deadlock code; it
suffered slightly during some code reorganisation that I made.  The
interaction between minfo and padb has changed slightly and this code
never got updated.  I've only tested this on a simple program but it's
working correctly for me here against a git checkout of master from the
2nd Dec.

I've also attached the open-mpi patch I'm working with currently.

Let me know how you get on.

Ashley,

-------------- next part --------------
A non-text attachment was scrubbed...
Name: padb.patch
Type: application/octet-stream
Size: 4112 bytes
Desc: not available
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: open-mpi.patch
Type: application/octet-stream
Size: 18012 bytes
Desc: not available
URL: 

From ashley at pittman.co.uk Fri Dec 13 01:09:37 2013
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Fri, 13 Dec 2013 01:09:37 +0000
Subject: [padb-users] rmgr=orte is broken with openmpi 1.7
In-Reply-To: <87y53s6539.fsf@pc102091.liv.ac.uk>
References: <87y53s6539.fsf@pc102091.liv.ac.uk>
Message-ID: 

On 10 Dec 2013, at 16:30, Dave Love wrote:

> In case it saves someone some time: While checking padb with openmpi
> 1.7.3 I found it reported no jobs because orte-ps appears to be broken.
> Use mpirun instead.  I opened .

I found this as well; orte-ps is crashing for me, so padb cannot detect
any jobs.

> Is there actually any advantage to using orte over mpirun, assuming it's
> working?

Yes.  orte should work from any node where the job is running
(depending on orte-ps working), whereas mpirun will only work on the
node where the mpirun process itself is.  Additionally, if you use orte
the backend will be launched with orterun (by default), whereas with
mpirun pdsh will be used, so there may well be different startup
performance.

Ashley.

From d.love at liverpool.ac.uk Mon Dec 16 11:03:16 2013
From: d.love at liverpool.ac.uk (Dave Love)
Date: Mon, 16 Dec 2013 11:03:16 +0000
Subject: [padb-users] problem with minfo
References: <87haarfn4g.fsf@pc102091.liv.ac.uk> <8738m07jww.fsf@pc102091.liv.ac.uk>
Message-ID: <87d2kx3vnv.fsf@pc102091.liv.ac.uk>

Ashley Pittman writes:

>> It turns out my problem with 1.6.5 was due to having bad debug info.
>> Maybe it's worth a caution somewhere that it has to be present and
>> correct?
>
> Does the image_has_queues function fail without it?

Yes.  The symbols involved are types, and so debug only.
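[Aside: a minimal illustration of that point, under the assumption that
the lookups are made purely against DWARF type information.  The file
and identifier names below are hypothetical; nothing here is Open MPI
code.  A type name never appears in the ELF symbol table, so it is only
visible to the debugger DLL when the library was built with -g and the
debug info hasn't been stripped.]

/* types_vs_symbols.c -- hypothetical example, not Open MPI code. */

/* An analogue of mca_topo_base_module_2_1_0_t: with -g this typedef is
 * described in the DWARF debug info; without -g it leaves no trace in
 * the built library at all. */
typedef struct topo_module {
    int version;
} topo_module_t;

/* A function, by contrast, gets an ELF symbol table entry whether or
 * not debug info is generated. */
int exported_function(void)
{
    topo_module_t m = { 210 };
    return m.version;
}

[Building this with "gcc -g -fPIC -shared types_vs_symbols.c -o
libtypes.so" and dumping the debug info with "readelf --debug-dump=info
libtypes.so" shows the typedef; rebuilding without -g, or running
"strip --strip-debug", removes it while exported_function stays in the
symbol table -- which is the sense in which the debug info has to be
"present and correct" for image_has_queues to succeed.]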
> I can change the minfo error message but this string is something the
> dll itself could report.

Good point.  I'll send a patch.

>> Having got it working, and running larger jobs than previously, it takes
>> a long time to start up and often gives an error (the same for
>> --deadlock and -mpi-watch of the options I've tried) which stops --watch
>> working.  This --deadlock run against 144 processes on 9 hosts took over
>> two minutes:
>>
>> Waiting for signon from 9 hosts.
>> Waiting for signon from 9 hosts.
>
> This error can happen occasionally; every five seconds the code prints
> how many clients it's waiting for.  As there are only 9 clients and
> they all respond between 10 and 15 seconds after the connection is
> started, this sounds like it could be DNS lookups by the remote SSH
> server?

I think the few seconds is just slow ssh into busy stateless nodes; DNS
isn't slow.

Modifying the "DEBUG (full_duplex)" output to print what it's running,
and using "base64 -d|strings" on the slow part shows

  deadlock mode minfo show_group_members show_all_groups mpi_dll cargs

and then there's another hiatus after the first lot of "Target output"
before the EOF errors, so it looks as if it's held up in minfo and/or
comms with it.  I'll try to find a way to debug it at some stage.

>> Total: 234 communicators of which 0 are in use.
>> No data was recorded for 3490 communicators
>> Unexpected EOF from child socket (live)
>> Unexpected EOF from Inner stdout (live)
>> Unexpected EOF from Inner stderr (live)
>> Unexpected exit from parallel command (state=live)
>
> These EOF messages probably represent a bug; there isn't enough
> information here for me to say what it might be, however.

I couldn't see a way of getting any more with --debug, for instance, and
assume it will be tricky to track down.

>> Any suggestions on debugging the errors or the long startup? (--debug
>> and --verbose didn't help.)
>
> If you're using rmgr=mpirun then padb will not be able to use orte to
> launch the job and will fall back to using pdsh; this might be the
> cause, or as above it could be ssh/DNS timeouts you'd get with orterun
> anyway.

No, the above was with orte, which works with ompi 1.6, our production
version.

>> For what it's worth, the issue for the problem with the topo symbol is
>> .
>
> I've attached a patch to padb to re-enable the deadlock code; it
> suffered slightly during some code reorganisation that I made.  The
> interaction between minfo and padb has changed slightly and this code
> never got updated.  I've only tested this on a simple program but it's
> working correctly for me here against a git checkout of master from
> the 2nd Dec.
>
> I've also attached the open-mpi patch I'm working with currently.
>
> Let me know how you get on.

Ah.  I'm now getting useful-looking info from --deadlock, but clearly
the counters need to be longs:

Warning: errors reported by some ranks
========
[9]: UNPARSEABLE MINFO: col: id:1 count:-1236416616 active:1
[4]: UNPARSEABLE MINFO: col: id:1 count:-1640605800 active:1
[11]: UNPARSEABLE MINFO: col: id:1 count:-503367784 active:1

I assume that's a simple change, but I haven't tried yet.
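[Aside: the negative counts are consistent with a 32-bit signed counter
wrapping: -1236416616 + 2^32 = 3058550680, i.e. a little over three
billion messages.  The sketch below is illustrative only -- the variable
names and output format are stand-ins, not the actual minfo/DLL code --
but it shows how the wrap produces exactly the value padb then refuses
to parse, and how a 64-bit counter (int64_t, or long on an LP64 system,
as suggested above) avoids it.]

/* counter_wrap.c -- illustrative only, not the actual minfo counters. */
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    /* The true accumulated message count, larger than INT32_MAX. */
    int64_t true_count = 3058550680LL;

    /* Truncating it to a 32-bit signed counter wraps on the usual
     * two's-complement platforms and prints as a negative number. */
    int32_t wrapped = (int32_t)(uint32_t)true_count;
    printf("col: id:1 count:%" PRId32 " active:1\n", wrapped);
    /* -> col: id:1 count:-1236416616 active:1 */

    /* Keeping the counter 64-bit end to end, and printing it with a
     * 64-bit format, gives a value padb can parse. */
    printf("col: id:1 count:%" PRId64 " active:1\n", true_count);
    return 0;
}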