[padb] Réf. : Re: Réf. : Re: Patch of support of Slurm + Openmpi Orte manager

thipadin.seng-long at bull.net thipadin.seng-long at bull.net
Thu Dec 3 12:20:47 GMT 2009


Hi, have a good break!
I have applied the patch below.
It works now:

padbr345P -O rmgr=slurm  --proc-summary  -a
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,3-7]
rank  hostname    pid     vmsize     vmrss  S  uptime  %cpu  lcore  command
   0       vb8  24595  133440 kB  47296 kB  S    0.01     0      0  pp_sndrcv_spbl
   1       vb9  17406  111616 kB  25536 kB  S    0.01     0      0  pp_sndrcv_spbl
   2      vb10  12521  133440 kB  47296 kB  R    0.93    99      1  pp_sndrcv_spbl
   3       vb8  24588  111616 kB  25728 kB  S    0.01     0      2  pp_sndrcv_spbl
   4       vb9  17411  111616 kB  25600 kB  S    0.01     0      5  pp_sndrcv_spbl
   5      vb10  12522  111616 kB  25600 kB  S    0.93     0      0  pp_sndrcv_spbl
   6       vb8  24589  111616 kB  25600 kB  S    0.01     0      3  pp_sndrcv_spbl
   7       vb9  17407  112640 kB  25728 kB  S    0.01     0      0  pp_sndrcv_spbl
[thipa at vb0 openmpi]$ 

Thipadin.





Ashley Pittman <ashley at pittman.co.uk>
12/03/2009 12:08 PM

 
        To:       thipadin.seng-long at bull.net
        Cc:       florence.vallee at bull.net, francois.wellenreiter at bull.net, padb-devel at pittman.org.uk, Sylvain.JEAUGEY at bull.net
        Subject:  Re: Réf. : Re: [padb] Patch of support of Slurm + Openmpi Orte manager


I'm just running out of the door myself and will be away until Sunday
now.

On Thu, 2009-12-03 at 11:45 +0100, thipadin.seng-long at bull.net wrote:
> mpirun is shown as rank 0, which it shouldn't be, and ranks 3 and 6 are missing.

Ranks 3 and 6 are on the same node as rank 0.  Can you try the following
additional patch?  It should cause padb to skip over the mpirun process
and look for the local ranks based on their environment.

If this patch doesn't work, take a look at the contents
of /proc/$pid/status for the process it's erroneously reporting as rank
0 to see what Name is set to.  In the example you sent it's pid 22210.
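
The check described above can be sketched roughly as follows. This is an
illustration in Python rather than padb's Perl, and it is not padb's own
find_from_status(); the function names name_from_status and process_name
are made up for this example. It assumes a Linux /proc layout where
/proc/<pid>/status contains a "Name:" line.

```python
def name_from_status(status_text):
    """Return the 'Name' field from the text of a /proc/<pid>/status file.

    Returns None if no 'Name:' line is present.
    """
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key == "Name":
            return value.strip()
    return None


def process_name(pid):
    """Read /proc/<pid>/status (Linux only) and return its Name field."""
    with open(f"/proc/{pid}/status") as handle:
        return name_from_status(handle.read())
```

Calling process_name(22210) on the node in question would then show
whether the process reported as rank 0 is named mpirun, orted, or
something else.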

--- padb-slurm-open-3            2009-12-03 11:03:08.500044734 +0000
+++ padb                 2009-12-03 11:03:15.333036493 +0000
@@ -8187,6 +8187,7 @@
         next unless ( $job eq $jobid );
         next unless ( $step == $inner_conf{slurm_job_step} );
         next if( find_from_status( $pid, 'Name' ) eq 'orted');
+        next if( find_from_status( $pid, 'Name' ) eq 'mpirun');
         maybe_show_pid( $global, $pid );
         $found_target = 1;
     }
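
For readers who don't have the padb source to hand, the filter this hunk
extends could be sketched like this. It is a Python illustration, not
padb's Perl, and several things in it are assumptions: that the job and
step ids come from the SLURM_JOB_ID and SLURM_STEP_ID variables in the
process environment, that the environment is read from the NUL-separated
/proc/<pid>/environ file, and the helper names parse_environ and
is_target, which are invented here.

```python
# Process names that launch or manage ranks but are not ranks themselves.
LAUNCHERS = {"orted", "mpirun"}


def parse_environ(raw):
    """Turn NUL-separated KEY=VALUE bytes (the /proc/<pid>/environ layout) into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def is_target(name, env, jobid, step):
    """Apply the same three checks as the hunk: job matches, step matches, not a launcher."""
    if env.get("SLURM_JOB_ID") != str(jobid):
        return False
    if env.get("SLURM_STEP_ID") != str(step):
        return False
    return name not in LAUNCHERS
```

With the extra line from the patch, a process named mpirun is rejected in
the same way orted already was, so only the application processes on the
node are reported as ranks.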


-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


