From brockp at umich.edu Wed Mar 16 14:03:13 2011
From: brockp at umich.edu (Brock Palen)
Date: Wed, 16 Mar 2011 10:03:13 -0400
Subject: [padb-users] finding who is not taking part in a collective
Message-ID: <4B765074-F751-4181-9989-88F648D6A866@umich.edu>

I am using padb with openmpi/1.4.3 and my result looks like this:

[grbala at nyx5033 ~]$ padb --all -Ormgr=orte --mpi-watch --watch -Owatch-clears-screen=no
Waiting for signon from 16 hosts.
Warning, remote process state differs across ranks
state : ranks
R (running) : [0-1,3,5,8,10,16-17,19,21-22,25-28,30-32,34,36-37,39,41-42,45]
S (sleeping) : [2,4,6-7,9,11-15,18,20,23-24,29,33,35,38,40,43-44,46-49]
u: unexpected messages U: unexpected and other messages
s: sending messages r: receiving messages m: sending and receiving
b: Barrier B: Broadcast g: Gather G: AllGather r: reduce: R: AllReduce
a: alltoall A: alltoalls w: waiting
.: consuming CPU cycles ,: using CPU but no queue data -: sleeping *: error
0....5....1....5....2....5....3....5....4....5....
,,---,-,-,----,--,--,,-,RRRRRRRR,---,----,,--,-,-,
,,-,-,,,-,,--,-,,-,-,-,-RRRRRRRR-,-,---,,,--,,---,
,,---,-,,,,-,-,,-,-,----RRRRRRRR,----,-,--,,-----,
--,,-,-,,,,-,,------,,--RRRRRRRR,,----,,--,------,
---,,,,------,-,,-,--,,,RRRRRRRR,-,,,-,-------,---

Is there a way I can figure out why that AllReduce is stuck? Like which other ranks are supposed to be involved in it?

The mpi-queue option is a bit impenetrable for me.

Thanks

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985


From ashley at pittman.co.uk Wed Mar 16 22:48:44 2011
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 16 Mar 2011 22:48:44 +0000
Subject: [padb-users] finding who is not taking part in a collective
In-Reply-To: <4B765074-F751-4181-9989-88F648D6A866@umich.edu>
References: <4B765074-F751-4181-9989-88F648D6A866@umich.edu>
Message-ID: <5C64036C-0A65-4341-8ECC-318AAE368E05@pittman.co.uk>

On 16 Mar 2011, at 14:03, Brock Palen wrote:

> I am using padb with openmpi/1.4.3 and my result looks like this:
>
> [grbala at nyx5033 ~]$ padb --all -Ormgr=orte --mpi-watch --watch -Owatch-clears-screen=no
> Waiting for signon from 16 hosts.
> Warning, remote process state differs across ranks
> state : ranks
> R (running) : [0-1,3,5,8,10,16-17,19,21-22,25-28,30-32,34,36-37,39,41-42,45]
> S (sleeping) : [2,4,6-7,9,11-15,18,20,23-24,29,33,35,38,40,43-44,46-49]
> u: unexpected messages U: unexpected and other messages
> s: sending messages r: receiving messages m: sending and receiving
> b: Barrier B: Broadcast g: Gather G: AllGather r: reduce: R: AllReduce
> a: alltoall A: alltoalls w: waiting
> .: consuming CPU cycles ,: using CPU but no queue data -: sleeping *: error
> 0....5....1....5....2....5....3....5....4....5....
> ,,---,-,-,----,--,--,,-,RRRRRRRR,---,----,,--,-,-,
> ,,-,-,,,-,,--,-,,-,-,-,-RRRRRRRR-,-,---,,,--,,---,
> ,,---,-,,,,-,-,,-,-,----RRRRRRRR,----,-,--,,-----,
> --,,-,-,,,,-,,------,,--RRRRRRRR,,----,,--,------,
> ---,,,,------,-,,-,--,,,RRRRRRRR,-,,,-,-------,---

That's an interesting view: it shows that most of the processes (42 of them?) are spending most of their time sleeping, while the 8 right in the middle are performing MPI_Allreduce. What it can't show you, however, is whether those processes are "stuck" in the same reduce somehow, either because of a bug or because of missing processes (remember that an all-reduce is a barrier collective), or whether the reduce is working fine and is simply being called many times in a tight loop. The view you see here is produced entirely by string compares on stack traces.
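To make that distinction concrete, here is a minimal sketch (nothing to do with your application - the counts, types and loop bound are made up) of the two situations the watch view can't tell apart: a healthy all-reduce called in a tight loop, and one that hangs because a rank never joins it.

/* allreduce_sketch.c - illustrative only */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, in, out, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    in = rank;

    /* Case 1: every rank calls the collective, it just runs in a tight
     * loop.  The watch view shows 'R' almost all of the time but
     * nothing is actually stuck. */
    for (i = 0; i < 1000000; i++)
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Case 2: one rank skips the call.  Because an all-reduce only
     * completes once every rank in the communicator has entered it,
     * all the other ranks block here forever while rank 0 moves on
     * (and eventually blocks in MPI_Finalize instead). */
    if (rank != 0)
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d of %d done\n", rank, size);
    MPI_Finalize();
    return 0;
}

In both cases the watch view just shows a block of ranks busy in MPI_Allreduce; you need the stack parameters or the collective data described below to tell them apart.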
> Is there a way I can figure out why that AllReduce is stuck? Like which other ranks are supposed to be involved in it?

Yes - or at least I can get you more information.

The quick way would be to look at stack traces for the job (using -axt) but with locals and parameter display turned on; for Open MPI this should dip inside the communicator and print out its rank and size whenever possible. That will be enough to tell you the size of the communicator: if it's of size 8 and the IDs match then it's a deadlock inside Open MPI; if the communicator is larger then you have processes which should have called the reduce but hadn't at the point in time the job was sampled, so most likely a bug in your application. If you do this, take the time to check that all processes calling the collective are providing the same values for the operation and count, as a mismatch there can also lead to deadlock.

The best option is unfortunately the most involved, as it requires patching Open MPI. The patch adds data about collective calls to the debugging interface which padb can query; padb can then tell you which communicators have "active" collective operations and, for those that do, which ranks within the communicator have and have not called the function. The patch to Open MPI is on-line here; I think the line numbers are off for the latest release, but it does apply and work:

http://padb.pittman.org.uk/extensions.html

Once you have that added you'll be able to run in this mode, which will tell you not only the size of the communicator but also which ranks have made collective calls and whether any are ahead of or behind the others:

http://padb.pittman.org.uk/modes.html#deadlock

> The mpi-queue option is a bit impenetrable for me.

It is a bit verbose, and it probably doesn't do what you want in this case anyway, although it would allow you to look up the ranks involved in the collective. If you send me the output I'll take a look and see what I can tell you. One quick thing to check is whether there are actually any communicators of size 8 that cover these exact ranks; if there aren't, then it's definitely a case of some ranks not having made the correct collective calls.

Ashley.

-- 

Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
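Concretely, the two invocations described above would look something like this (a sketch only: the -Ostack-shows-params / -Ostack-shows-locals option names and the --deadlock mode come from the padb documentation rather than from this thread, so check padb --help on your installation if they differ):

[grbala at nyx5033 ~]$ padb -axt -Ormgr=orte -Ostack-shows-params=1 -Ostack-shows-locals=1

and, once the Open MPI patch from extensions.html has been applied:

[grbala at nyx5033 ~]$ padb --all -Ormgr=orte --deadlock

The first prints parameters and locals alongside the merged stack traces, which is where the communicator rank and size should appear; the second only reports useful collective state after Open MPI has been rebuilt with the patch.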