[padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load

Thu Aug 19 00:58:31 BST 2010

On 19 Aug 2010, at 01:20, Rahul Nabar wrote:

> On Wed, Aug 18, 2010 at 6:03 PM, Ashley Pittman <ashley at pittman.co.uk> wrote:
> 
> Alternatively, some snooping in the sources does reveal that
> config.log has this:
> 
> $ ./configure --prefix=/opt/ompi_new --with-tm=/opt/torque FC=ifort
> CC=icc F77=ifort CXX=icpc CFLAGS=-g -O3 -mp FFLAGS=-mp -recursive -O3
> CXXFLAGS=-g CPPFLAGS=-DPgiFortran --disable-shared --enable-static
> --with-memory-manager --disable-dlopen --enable-openib-rdmacm
> --with-openib=/usr

I'd say that was fairly conclusive. I'm happy to take this up with the Open-MPI team myself, unless you want to?

> http://dl.dropbox.com/u/118481/padb.log.new.new.txt

That's good to see. Regarding your problem, the process that jumps out at me is rank 255. All processes are in ompi_coll_tuned_bcast_intra_generic() but at three different line numbers, and rank 255 is unique in being at coll_tuned_bcast.c:232. This is later in the code, which suggests to me that this process is stuck on one iteration for whatever reason, while all the other processes have continued into the next iteration of MPI_Bcast() and are blocked waiting for rank 255 to complete the previous broadcast and start the next one. This is where the collective state patch really comes into its own.
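
To make that reasoning concrete, here is a minimal Python sketch of the grouping I am doing by eye on the stack trace: bucket ranks by the source location they are stopped at, then look for any rank that is alone in its bucket. This is not padb code. The function names are made up for illustration, and so is the sample data: only rank 255 at coll_tuned_bcast.c:232 comes from your log, the other two line numbers are placeholders.

```python
from collections import defaultdict


def group_ranks_by_location(locations):
    """Map each source location to the sorted list of ranks stopped there."""
    groups = defaultdict(list)
    for rank, location in locations.items():
        groups[location].append(rank)
    return {loc: sorted(ranks) for loc, ranks in groups.items()}


def lone_ranks(groups):
    """Return the ranks that are the only occupant of their location."""
    return sorted(ranks[0] for ranks in groups.values() if len(ranks) == 1)


# Hypothetical sample: only rank 255 at line 232 is taken from the trace;
# lines 100 and 150 are invented placeholders for the other two locations.
sample = {rank: "coll_tuned_bcast.c:100" for rank in range(0, 128)}
sample.update({rank: "coll_tuned_bcast.c:150" for rank in range(128, 255)})
sample[255] = "coll_tuned_bcast.c:232"

groups = group_ranks_by_location(sample)
print(lone_ranks(groups))
```

A rank that sits alone like this is only a suspect, not proof: the stack location cannot say which iteration of the collective each rank is on.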

>> As a final point, debugging collectives can be hard. In a deadlock situation it can be hard to tell if all ranks are on the same iteration or if some are ahead of others and some are behind. I have a patch to Open-MPI that adds a counter to all collective calls to allow this situation to be detected and reported correctly; if you're still stuck even with the stack trace then you might find this of use.  It'll mean patching your MPI build and fixing the above problem with the DLL.
> 
> That would be my next line of attack, thanks! :)
> 
> BTW, out of curiosity, is padb an alternative to things like vampir,
> totalview etc. or are those a different niche with a different goal?

It's both an alternative and also a different niche.  There is certainly overlap with TotalView, but I try to target a different work-flow. TotalView is good if you are local to the machine, have a working reproducer and are sitting down for an afternoon to work at a problem; padb is more lightweight and is more for taking snapshots of state and then moving on.  padb can be automated and its results emailed around, which I think are both big plus points. It's non-interactive though, so it can't go into anything like the same level of detail.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk