[padb-users] finding who is not taking part in a collective

Brock Palen brockp at umich.edu
Wed Mar 16 14:03:13 GMT 2011


I am using padb with openmpi/1.4.3, and the output looks like this:

[grbala at nyx5033 ~]$ padb --all -Ormgr=orte --mpi-watch --watch -Owatch-clears-screen=no
Waiting for signon from 16 hosts.
Warning, remote process state differs across ranks
state : ranks
R (running) : [0-1,3,5,8,10,16-17,19,21-22,25-28,30-32,34,36-37,39,41-42,45]
S (sleeping) : [2,4,6-7,9,11-15,18,20,23-24,29,33,35,38,40,43-44,46-49]
u: unexpected messages U: unexpected and other messages
s: sending messages r: receiving messages m: sending and receiving
b: Barrier B: Broadcast g: Gather G: AllGather r: reduce: R: AllReduce
a: alltoall A: alltoalls w: waiting
.: consuming CPU cycles ,: using CPU but no queue data -: sleeping *: error
0....5....1....5....2....5....3....5....4....5....
,,---,-,-,----,--,--,,-,RRRRRRRR,---,----,,--,-,-,
,,-,-,,,-,,--,-,,-,-,-,-RRRRRRRR-,-,---,,,--,,---,
,,---,-,,,,-,-,,-,-,----RRRRRRRR,----,-,--,,-----,
--,,-,-,,,,-,,------,,--RRRRRRRR,,----,,--,------,
---,,,,------,-,,-,--,,,RRRRRRRR,-,,,-,-------,---


Is there a way to figure out why that AllReduce is stuck, for example which other ranks are supposed to be taking part in it? I find the output of the mpi-queue option a bit impenetrable.
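
One way to read the watch rows above mechanically is sketched below in Python. It only parses the text that padb printed; it does not call padb or use any padb API. It assumes that column i of a watch row describes rank i (as the ruler line suggests), and the helper names ranks_with_symbol and compress_ranks are made up for this illustration.

```python
# Sketch only: parses the text printed by padb --mpi-watch above.
# Assumption: column i of a watch row corresponds to rank i.

def ranks_with_symbol(row, symbol):
    """Return the column indices (assumed to be ranks) showing the given symbol."""
    return [rank for rank, char in enumerate(row) if char == symbol]


def compress_ranks(ranks):
    """Format a sorted list of ranks as a range string, e.g. [0, 1, 3] gives "0-1,3"."""
    parts = []
    start = prev = None
    for rank in ranks:
        if start is None:
            start = prev = rank
        elif rank == prev + 1:
            prev = rank
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = rank
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


# First watch row from the output above, copied verbatim.
ROW = ",,---,-,-,----,--,--,,-,RRRRRRRR,---,----,,--,-,-,"

in_allreduce = ranks_with_symbol(ROW, "R")
not_in_allreduce = [r for r in range(len(ROW)) if r not in in_allreduce]

print("in AllReduce:    ", compress_ranks(in_allreduce))
print("not in AllReduce:", compress_ranks(not_in_allreduce))
```

For the first row this reports ranks 24-31 in the AllReduce and ranks 0-23,32-49 outside it. It says nothing about why the other ranks have not entered the collective; that still needs the message queue data.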

Thanks

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985






