[padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load

Ashley Pittman ashley at pittman.co.uk
Thu Aug 19 00:03:40 BST 2010


On 18 Aug 2010, at 21:35, Rahul Nabar wrote:
> which mpirun
> /opt/ompi_new/bin/mpirun
> 
> cd /opt/ompi_new/lib/openmpi
> ls
> libompi_dbg_msgq.a  libompi_dbg_msgq.la

That's odd. I have 158 files in that directory, including libompi_dbg_msgq.la but not libompi_dbg_msgq.a. Are you using a static build of OpenMPI, or building it with --disable-dlopen?  It looks like you've found a bug in the OpenMPI build though: the .so should exist even for static builds.
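
A quick way to check what a given install actually ships is to list the message-queue library files and see whether any of them is a shared object. This is only a sketch: the helper names are made up for illustration, the "libompi_dbg_msgq" stem is taken from the listing above, and the ".so" / ".so.N" naming is an assumption about how the library would be installed.

```python
import os


def find_msgq_libs(libdir, stem="libompi_dbg_msgq"):
    """Return the sorted file names in libdir of the form '<stem>.<anything>'."""
    return sorted(n for n in os.listdir(libdir) if n.startswith(stem + "."))


def has_shared_object(names):
    """True if any name looks like a shared object (.so or a versioned .so.N)."""
    return any(n.endswith(".so") or ".so." in n for n in names)


if __name__ == "__main__":
    # Path taken from the listing quoted above; adjust for your install.
    libdir = "/opt/ompi_new/lib/openmpi"
    if os.path.isdir(libdir):
        names = find_msgq_libs(libdir)
        print(names)
        print("shared object present:", has_shared_object(names))
```

With only the .a and .la files present, as in the listing above, has_shared_object() returns False, which matches the "No DLL to load" message.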

> Some background on where I'm going with this:
> I've a problem on our 10GigE network where the OFED broadcast tests
> don't run when called with larger number of cores. The test just
> stalls. I've an OFED developer helping me debug this and he wanted the
> stacks  from each compute node once the bcast test is in this strange
> stalled state. That's exactly what I am trying to use padb for.

I would send him what you sent me in your previous mail.  It's a lot of information, and the long lines make it hard to parse, so it's best to redirect it to a file and attach that to avoid line-wrap.

> Just to confirm:
> 
> Even though I don;t have the *.so file and padb is throwing the error
> about not finding the dll yet the stack seen in the output I pasted is
> still relevant?

Yes, it's still relevant. It's the complete stack trace of the application; it's just missing some information that could have been included.  What is interesting, and potentially important, is that the stack trace for six processes isn't present. It appears the processes were found (otherwise padb would have complained), and they give warnings about the DLL, but no stack trace is shown for them. Did you truncate the output in the previous email?

> i.e. is there something that's missing because of the
> absence of the mpi dll?

The "MPI message queues" are missing; these show the contents of the send and receive queues for every rank.  They're possibly not relevant to what you are looking for if you are only interested in the collectives.  An example is shown on the web page, and it can also be seen in the OMPI testing results.

http://padb.pittman.org.uk/modes.html#mpi-queue

As a final point, debugging collectives can be hard. In a deadlock it can be difficult to tell whether all ranks are on the same iteration or whether some are ahead and others behind. I have a patch to Open-MPI that adds a counter to all collective calls so that this situation can be detected and reported correctly. If you're still stuck even with the stack trace then you might find it of use.  It'll mean patching your MPI build and fixing the above problem with the DLL.

http://padb.pittman.org.uk/extensions.html
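
To illustrate why a per-rank collective counter helps, here is a small sketch of the comparison it makes possible. It is not the patch itself and not how padb reports the result; the function names and the example counter values are hypothetical, and it simply assumes you already have one integer count of completed collective calls per rank.

```python
from collections import defaultdict


def group_by_counter(counters):
    """Map each counter value to the sorted list of ranks reporting it."""
    groups = defaultdict(list)
    for rank, count in counters.items():
        groups[count].append(rank)
    return {count: sorted(ranks) for count, ranks in groups.items()}


def lagging_ranks(counters):
    """Return the ranks whose counter is below the highest counter seen."""
    if not counters:
        return []
    ahead = max(counters.values())
    return sorted(rank for rank, count in counters.items() if count < ahead)


# Hypothetical counters: rank -> number of collective calls made so far.
counters = {0: 12, 1: 12, 2: 11, 3: 12}
print(group_by_counter(counters))  # {12: [0, 1, 3], 11: [2]}
print(lagging_ranks(counters))     # [2]
```

If every rank reports the same value, the ranks are all in the same collective call and the stall lies inside it; if some ranks are behind, those are the ones to look at first.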

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




