[padb] Réf. : Re: Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
thipadin.seng-long at bull.net
Thu Dec 3 12:20:47 GMT 2009
Hi, and enjoy your time off.
I have applied the patch below.
It works now:
padbr345P -O rmgr=slurm --proc-summary -a
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,3-7]
rank hostname   pid    vmsize    vmrss S uptime %cpu lcore command
   0 vb8      24595 133440 kB 47296 kB S   0.01    0     0 pp_sndrcv_spbl
   1 vb9      17406 111616 kB 25536 kB S   0.01    0     0 pp_sndrcv_spbl
   2 vb10     12521 133440 kB 47296 kB R   0.93   99     1 pp_sndrcv_spbl
   3 vb8      24588 111616 kB 25728 kB S   0.01    0     2 pp_sndrcv_spbl
   4 vb9      17411 111616 kB 25600 kB S   0.01    0     5 pp_sndrcv_spbl
   5 vb10     12522 111616 kB 25600 kB S   0.93    0     0 pp_sndrcv_spbl
   6 vb8      24589 111616 kB 25600 kB S   0.01    0     3 pp_sndrcv_spbl
   7 vb9      17407 112640 kB 25728 kB S   0.01    0     0 pp_sndrcv_spbl
[thipa at vb0 openmpi]$
Thipadin.
Ashley Pittman <ashley at pittman.co.uk>
12/03/2009 12:08 PM
To: thipadin.seng-long at bull.net
cc: florence.vallee at bull.net, francois.wellenreiter at bull.net,
padb-devel at pittman.org.uk, Sylvain.JEAUGEY at bull.net
Subject: Re: Réf. : Re: [padb] Patch of support of Slurm + Openmpi Orte manager
I'm just running out of the door myself and will be away until Sunday
now.
On Thu, 2009-12-03 at 11:45 +0100, thipadin.seng-long at bull.net wrote:
> You have mpirun which has rank0, this shouldn't, and you miss 3,6.
Ranks 3 and 6 are on the same node as rank 0. Can you try the following
additional patch, which should cause it to skip over the mpirun process
and look for local ones based on their environment?
If this patch doesn't work, take a look at the contents
of /proc/$pid/status for the process it's erroneously reporting as rank
0 to see what Name is set to. In the example you sent it's pid 22210.
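
For anyone who wants to script that check, here is a minimal sketch of pulling the Name field out of /proc/$pid/status. It is Python written only for illustration and is not padb code; the function names name_from_status and read_process_name are hypothetical, and it assumes a Linux /proc filesystem.

```python
def name_from_status(status_text):
    """Return the 'Name' field from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        # Each line of the status file has the form "Key:<tab>value".
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Name":
            return value.strip()
    return None


def read_process_name(pid):
    """Read /proc/<pid>/status (Linux only) and return its Name field.

    Returns None if the process has gone away or the file is unreadable.
    """
    try:
        with open(f"/proc/{pid}/status") as handle:
            return name_from_status(handle.read())
    except OSError:
        return None
```

Calling read_process_name(22210) on the node in question would show whether the process reported as rank 0 is really named mpirun.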
--- padb-slurm-open-3 2009-12-03 11:03:08.500044734 +0000
+++ padb 2009-12-03 11:03:15.333036493 +0000
@@ -8187,6 +8187,7 @@
         next unless ( $job eq $jobid );
         next unless ( $step == $inner_conf{slurm_job_step} );
         next if( find_from_status( $pid, 'Name' ) eq 'orted');
+        next if( find_from_status( $pid, 'Name' ) eq 'mpirun');
         maybe_show_pid( $global, $pid );
         $found_target = 1;
     }
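
The effect of the patched loop can be sketched as follows. This is an illustrative Python rendering of the filter, not the padb source: select_target_pids, LAUNCHER_NAMES and the name_of callback are hypothetical names, and the candidate tuples stand in for whatever padb reads for each pid.

```python
# Process names that belong to the launcher rather than to an MPI rank.
# 'orted' was already skipped before this patch; 'mpirun' is the addition.
LAUNCHER_NAMES = {"orted", "mpirun"}


def select_target_pids(candidates, jobid, step, name_of):
    """Return the pids that should be treated as ranks of the job step.

    candidates: iterable of (pid, job, job_step) tuples (assumed shape).
    name_of:    callable mapping a pid to its process name.
    """
    targets = []
    for pid, job, job_step in candidates:
        if job != jobid:
            continue  # belongs to a different job
        if job_step != step:
            continue  # belongs to a different step of this job
        if name_of(pid) in LAUNCHER_NAMES:
            continue  # launcher process, not a rank
        targets.append(pid)
    return targets
```

With made-up data in which pid 22210 is named mpirun and two other pids in the same job step run the application, only the two application pids come back.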
--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk