[padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load

Rahul Nabar rpnabar at gmail.com
Wed Aug 18 20:35:01 BST 2010


On Wed, Aug 18, 2010 at 2:10 PM, Ashley Pittman <ashley at pittman.co.uk> wrote:
>
> On 18 Aug 2010, at 20:50, Rahul Nabar wrote:
> I've checked with Jeff already and it's enabled by default.

Great! Thanks for checking.


>
> Can you send the list of the files that you do have there?  I've just checked on a system here and the correct location is
> $OPAL_PREFIX/lib/openmpi/libompi_dbg_msgq.so (note the extra openmpi in there).

which mpirun
/opt/ompi_new/bin/mpirun

cd /opt/ompi_new/lib/openmpi
ls
libompi_dbg_msgq.a  libompi_dbg_msgq.la

> If you do have that file then could you take a peek inside a single
> MPI process and print out the value of "MPIR_dll_name"?  Just attach
> with gdb and type "p MPIR_dll_name"; that should tell you.  I assume you are
> using a recent version of OpenMPI?

That's confusing. I don't seem to have that file!
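
For anyone repeating this check, here is a minimal sketch of it as a script. It assumes the library lives at <prefix>/lib/openmpi/libompi_dbg_msgq.so, as quoted above, and that the prefix is two directories up from mpirun. The name check_msgq_dll is a hypothetical helper, not part of padb or Open MPI.

```shell
# Hypothetical helper: report whether the message-queue debug DLL
# exists under the given Open MPI install prefix.
check_msgq_dll() {
    prefix=$1
    dll="$prefix/lib/openmpi/libompi_dbg_msgq.so"
    if [ -f "$dll" ]; then
        echo "found: $dll"
    else
        echo "missing: $dll"
        return 1
    fi
}

# Assumed usage: derive the prefix from mpirun's location,
# e.g. /opt/ompi_new/bin/mpirun -> /opt/ompi_new
# prefix=$(dirname "$(dirname "$(command -v mpirun)")")
# check_msgq_dll "$prefix"
```

On the install shown above, where only the .a and .la files are present, this would report the .so as missing.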

mpirun --version
mpirun (Open MPI) 1.4.1

> Mainly that the stack trace is from your application and not from say /bin/sh.  Some people like to run "mpirun sh -c /path/to/my-app" and whilst padb should cope with this if you use a non-standard shell it might not.

Ok. My shell's bash and the invocation was indeed pretty similar to
what you show:

NP=256; mpirun -np $NP --host \
eu001,eu002,eu003,eu004,eu005,eu006,eu007,eu008,eu009,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu020,eu021,eu022,eu023,eu024,eu025,eu026,eu027,eu028,eu029,eu030,eu031,eu032 \
  -mca btl openib,sm,self /opt/src/mpitests/imb/src/IMB-MPI1 -npmin $NP bcast
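
As an aside, the host list above can be generated rather than typed out. A minimal sketch, assuming the nodes really are named with a fixed prefix and a three-digit, 1-based index (eu001 through eu032); make_hosts is a hypothetical helper name.

```shell
# Hypothetical helper: print PREFIX001..PREFIXNNN as a comma-separated list.
make_hosts() {
    prefix=$1
    count=$2
    list=""
    for i in $(seq 1 "$count"); do
        # Zero-pad the index to three digits and append a comma.
        list="${list}$(printf '%s%03d' "$prefix" "$i"),"
    done
    # Drop the trailing comma left by the loop.
    echo "${list%,}"
}

NP=256
HOSTS=$(make_hosts eu 32)
echo "$HOSTS"
# The run itself would then be (not executed here):
# mpirun -np $NP --host "$HOSTS" -mca btl openib,sm,self \
#     /opt/src/mpitests/imb/src/IMB-MPI1 -npmin $NP bcast
```

This keeps the command short enough that mail clients are less likely to wrap it mid-argument.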


> No.  35883 is the orte job number, run the command ompi-ps and it'll all become clear.

Ah! Got it! :)
>
> This is absolutely correct.  The extended output with variables like you have can be overwhelming; it's included in --full-report for off-line diagnostics, but if you have a reproducer and can experiment it's often easier to see problems without it, so just specify -xt rather than --full-report.

Thanks for the tip!

Some background on where I'm going with this:
I've a problem on our 10GigE network where the OFED broadcast tests
don't run when called with a larger number of cores. The test just
stalls. I've an OFED developer helping me debug this and he wanted the
stacks from each compute node once the bcast test is in this strange
stalled state. That's exactly what I am trying to use padb for.

Just to confirm:

Even though I don't have the *.so file and padb is throwing the error
about not finding the DLL, is the stack seen in the output I pasted
still relevant? I.e., is anything missing from it because of the
absence of the MPI DLL?

Thanks again for all your help!

-- 
Rahul



