[padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load

Rahul Nabar rpnabar at gmail.com
Thu Aug 19 01:34:34 BST 2010


On Wed, Aug 18, 2010 at 6:58 PM, Ashley Pittman <ashley at pittman.co.uk> wrote:

> I'd say that was fairly conclusive, I'm happy to take this up with the Open-MPI team myself unless you want to?

Please, do go ahead. I'd be happy to assist in any way required, but
you likely understand the meat of the matter better than I do; if I
posted there myself I'd mostly just confuse the Open-MPI developers.


>> http://dl.dropbox.com/u/118481/padb.log.new.new.txt
>
> That's good to see. Re your problem, the process that jumps out at me is rank 255: all processes are in ompi_coll_tuned_bcast_intra_generic(), but at three different line numbers, and rank 255 is unique in being at coll_tuned_bcast.c:232.  That is later in the code, which suggests to me that this process is stuck on one iteration for whatever reason, while all the other processes have continued on into the next iteration of MPI_Bcast() and are blocked waiting for rank 255 to complete the previous broadcast and start the next one.  This is where the collective state patch comes into its own, really.

Thanks for this analysis! First, I'm passing it on to the OFED
developer working with me on this. Second, I'm going to try to apply
your patch to see whether it makes Steve's debugging life any easier.
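
For reference, here is a minimal sketch of the hang pattern Ashley
describes above: MPI_Bcast() called in a loop, with one rank lagging
behind the rest. The straggler rank, the iteration count, the buffer
size and the artificial delay are all made up for illustration and are
not taken from our actual application.

#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int buf[1024] = {0};
    const int slow_rank = size - 1;   /* stand-in for rank 255 in a 256-process job */

    for (int iter = 0; iter < 100; iter++) {
        if (rank == slow_rank)
            sleep(5);                 /* stand-in for whatever is delaying that rank */

        /* While the slow rank sleeps, the other ranks are already waiting
         * inside (or, depending on the broadcast algorithm, past) this
         * broadcast, so a stack snapshot taken at that moment shows one
         * rank at a different point in the collective than all the rest. */
        MPI_Bcast(buf, 1024, MPI_INT, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}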

>
> It's both an alternative and also a different niche.  There is overlap with TotalView certainly, but I try to target a different work-flow: TotalView is good if you are local to the machine, have a working reproducer and are sitting down for an afternoon to work at a problem, whereas padb is more lightweight and is more for taking snapshots of state and then moving on.  padb can be automated and its results can be emailed around, which I think are both big plus points; it's non-interactive, though, so it can't go into anything like the same level of detail.

Thanks for the clarification. padb is clearly very useful! :)

-- 
Rahul



