[padb-users] Fwd: Error message from /opt/sbin/libexec/minfo: No DLL to load
Ashley Pittman
ashley at pittman.co.uk
Thu Aug 19 12:18:07 BST 2010
On 19 Aug 2010, at 10:51, Daniel Kidger wrote:
> > As a final point, debugging collectives can be hard: in a deadlock it is difficult to tell whether all ranks are on the same iteration or whether some are ahead and others behind. I have a patch to Open-MPI that adds a counter to all collective calls so that this situation can be detected and reported correctly. If you're still stuck even with the stack trace then you might find it of use. It'll mean patching your MPI build and fixing the above problem with the DLL.
>
> I would be particularly interested in this patch.
> It is often further complicated in that the code I am working on calls collectives like MPI_Allgather from various subsets of MPI_COMM_WORLD, so I do not expect all processes to have called it the same number of times. Does your patch allow for this?
Yes it does. To be clear, the "collective debugger" functionality is a proposal for extending the specification between the tool (in this case padb) and the MPI library. The patch is an implementation of the proposal for Open-MPI, so you will need to re-compile your MPI library to use it. Unfortunately it's looking like the proposal might not be formally adopted, purely due to a lack of time on my part, but I'm hoping that it can be made to work somehow.
The patch and its background are on-line, although unfortunately there is no sample output from padb using it.
http://padb.pittman.org.uk/extensions.html
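
To illustrate the idea of per-communicator counters, here is a minimal sketch in Python. It is not the Open-MPI patch and not padb's code: the function name, the data layout and the communicator labels ("world", "row", "col") are all made up for the example. It assumes each rank reports one call count per communicator it belongs to, and it only compares ranks within the same communicator, so ranks in different subsets of MPI_COMM_WORLD are never compared against each other.

```python
from collections import defaultdict


def find_lagging_ranks(counters):
    """Report, per communicator, which member ranks are behind.

    counters maps rank -> {communicator label: collective calls made}.
    A rank lists only the communicators it is a member of (hypothetical
    layout, chosen for this sketch).
    """
    # Regroup the data so that each communicator holds its members' counts.
    by_comm = defaultdict(dict)
    for rank, per_comm in counters.items():
        for comm, count in per_comm.items():
            by_comm[comm][rank] = count

    report = {}
    for comm, counts in by_comm.items():
        furthest = max(counts.values())
        # Ranks whose count is below the furthest member have not yet
        # entered the collective the others are waiting in.
        behind = sorted(r for r, c in counts.items() if c < furthest)
        if behind:
            report[comm] = {"furthest": furthest, "behind": behind}
    return report


# Made-up sample: four ranks, one world communicator and two sub-communicators.
sample = {
    0: {"world": 12, "row": 5},
    1: {"world": 12, "row": 4},
    2: {"world": 12, "col": 7},
    3: {"world": 11, "col": 7},
}

for comm, info in sorted(find_lagging_ranks(sample).items()):
    print(comm, "furthest call:", info["furthest"], "ranks behind:", info["behind"])
```

In the sample, ranks 0 and 1 differ from ranks 2 and 3 in which sub-communicator they use, but that difference is never flagged; only a mismatch within "world" or within "row" is reported.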
Ashley.
--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk