[padb] Ref.: Re: Patch of support of Slurm + OpenmpiOrte manager

thipadin.seng-long at bull.net
Thu Dec 3 10:45:37 GMT 2009


Hi,
I was off yesterday; in the meantime you have made some new versions.
I am only testing the last one (padb-slurm-open-3.patch).
I understand you want to handle automatically whether or not the user runs
the slurm/openmpi combination.

I am starting with something that goes wrong; I think it needs more handling.
So the combination is:
salloc
srun -n 1 mpirun -bynode -n 8 my_prog
this combination should be equivalent to
salloc
mpirun -bynode -n 8 my_prog
so in all my tests this is the combination I used; a consolidated sketch follows.
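Consolidated, the sequence I am comparing looks like the sketch below (my_prog
is just a placeholder; the actual runs use ./pp_sndrcv_spbl on partition jlg
with nodes vb8, vb9 and vb10):

  # Allocate three nodes for the test.
  salloc -p jlg -w vb8,vb9,vb10

  # Combination 1: a single-task srun wrapping mpirun inside the allocation.
  srun -n 1 mpirun -bynode -n 8 my_prog

  # Combination 2: mpirun run directly inside the allocation.
  mpirun -bynode -n 8 my_prog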
The result is a little confusing; let's have a look:

The test:

[thipa at vb0 openmpi]$ salloc -p jlg -w vb8,vb9,vb10
salloc: Granted job allocation 27834
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ srun -n1 mpirun -bynode -n 8 ./pp_sndrcv_spbl
srun: Warning: can't run 1 processes on 3 nodes, setting nnodes to 1
I am, process 3 starting on vb8, total by srun  8
I am, process 6 starting on vb8, total by srun  8
I am, process 0 starting on vb8, total by srun  8
I am, process 7 starting on vb9, total by srun  8
I am, process 4 starting on vb9, total by srun  8
I am, process 2 starting on vb10, total by srun  8
I am, process 5 starting on vb10, total by srun  8
I am, process 1 starting on vb9, total by srun  8
Me, process 0, send  1000 to process 2


Padb Test:

[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose -tx 27834
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'

Collecting information for job '27834'

Attaching to job 27834
Job has 1 process(es)
Job spans 3 host(s)
Warning, failed to locate ranks [3,6]
Warning, remote process name differs across ranks
name : ranks
mpirun : [0]
pp_sndrcv_spbl : [1-2,4-5,7]
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,4-5,7]
Mode 'stack' mode specific options:
     gdb_retry_count : '3'
 max_distinct_values : '3'
  stack_shows_locals : '0'
  stack_shows_params : '0'
   stack_strip_above : 'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress'
   stack_strip_below : 'main,__libc_start_main,start_thread'
    strip_above_wait : '1'
    strip_below_main : '1'
-----------------
[0] (1 processes)
-----------------
main() at main.c:13
  orterun() at orterun.c:686
    opal_event_dispatch() at ?:?
      opal_event_base_loop() at ?:?
        poll_dispatch() at ?:?
          poll() at ?:?
            ??() at ?:?
-----------------
[1-2,4-5,7] (5 processes)
-----------------
ThreadId: 1
  -----------------
  [1,4-5,7] (4 processes)
  -----------------
  main() at pp_sndrcv_spbl.c:53
    PMPI_Finalize() at ?:?
      ompi_mpi_finalize() at ?:?
        barrier() at ?:?
          opal_progress() at ?:?
            ThreadId: 2
              start_thread() at ?:?
                btl_openib_async_thread() at ?:?
                  poll() at ?:?
                    ??() at ?:?
                      ThreadId: 3
                        start_thread() at ?:?
                          service_thread_start() at ?:?
                            __GC___select() at ?:?
                              ??() at ?:?
  -----------------
  [2] (1 processes)
  -----------------
  main() at pp_sndrcv_spbl.c:49
    PMPI_Recv() at ?:?
      mca_pml_ob1_recv() at ?:?
        opal_progress() at ?:?
          ThreadId: 2
            start_thread() at ?:?
              btl_openib_async_thread() at ?:?
                poll() at ?:?
                  ??() at ?:?
                    ThreadId: 3
                      start_thread() at ?:?
                        service_thread_start() at ?:?
                          __GC___select() at ?:?
                            ??() at ?:?
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose --proc-summary
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'
padbr345P: Error: no jobs specified, use --all or jobids
[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose --proc-summary -a
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'
Active jobs (1) are 27834

Collecting information for job '27834'

Attaching to job 27834
Job has 1 process(es)
Job spans 3 host(s)
Warning, failed to locate ranks [3,6]
Warning, remote process name differs across ranks
name : ranks
mpirun : [0]
pp_sndrcv_spbl : [1-2,4-5,7]
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,4-5,7]
Mode 'proc_summary' mode specific options:
    column_seperator : '  '
       nprocs_output : undef
         proc_format : 'rank,hostname,pid,vmsize,vmrss,stat.state=S,load1=uptime,pcpu=%cpu,stat.processor=lcore,name=command'
    proc_show_header : '1'
      proc_shows_fds : '0'
     proc_shows_maps : '0'
     proc_shows_proc : '1'
     proc_shows_stat : '1'
       proc_sort_key : 'rank'
  reverse_sort_order : '0'
rank  hostname  pid    vmsize     vmrss     S  uptime  %cpu  lcore  command
   0       vb8  22210   16320 kB  13952 kB  S    0.00     0      3  mpirun
   1       vb9  14985  112384 kB  25600 kB  S    0.08     0      5  pp_sndrcv_spbl
   2      vb10   9540  133440 kB  47296 kB  R    1.15    99      1  pp_sndrcv_spbl
   4       vb9  14986  111616 kB  25600 kB  S    0.08     0      5  pp_sndrcv_spbl
   5      vb10   9544  111616 kB  25600 kB  S    1.15     0      0  pp_sndrcv_spbl
   7       vb9  14987  112640 kB  25728 kB  S    0.08     0      0  pp_sndrcv_spbl
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$

All the processes are alive:

ssh vb8
[thipa at vb8 ~]$ psu
  PID  PPID CMD
22210 22206 /home_nfs/thipa/openMPI/install/bin/mpirun -bynode -n 8 ./pp_sndrcv_
22213 22210 srun --nodes=2 --ntasks=2 --kill-on-bad-exit --nodelist=vb9,vb10 ort
22218 22210 ./pp_sndrcv_spbl
22219 22210 ./pp_sndrcv_spbl
22220 22210 ./pp_sndrcv_spbl
22990 22986 sshd: thipa at pts/6
22991 22990 -bash
23021 22991 ps -o pid,ppid,cmd -u thipa
[thipa at vb8 ~]$ ssh vb9
[thipa at vb9 ~]$ psu
  PID  PPID CMD
14982 14978 /home_nfs/thipa/openMPI/install/bin/orted -mca ess slurm -mca orte_e
14985 14982 ./pp_sndrcv_spbl
14986 14982 ./pp_sndrcv_spbl
14987 14982 ./pp_sndrcv_spbl
15776 15772 sshd: thipa at pts/6
15777 15776 -bash
15807 15777 ps -o pid,ppid,cmd -u thipa
[thipa at vb9 ~]$ ssh vb10
[thipa at vb10 ~]$ psu
  PID  PPID CMD
 9531  9527 /home_nfs/thipa/openMPI/install/bin/orted -mca ess slurm -mca orte_e
 9534  9531 ./pp_sndrcv_spbl
 9535  9531 ./pp_sndrcv_spbl
10513 10509 sshd: thipa at pts/4
10514 10513 -bash
10544 10514 ps -o pid,ppid,cmd -u thipa
[thipa at vb10 ~]$ 

Here mpirun shows up as rank 0, which it shouldn't, and ranks 3 and 6 are missing.
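As a rough cross-check (not what padb does internally), it can help to compare
Slurm's own view of the job with the full process tree on vb8. This is only a
sketch: scontrol listpids depends on the proctrack plugin in use, and psu is
simply the ps alias shown above.

  # From the login node: list the job steps Slurm knows about.
  squeue -s

  # On vb8: which PIDs does Slurm itself track for job 27834?
  scontrol listpids 27834

  # Compare with the full tree (the psu alias above): on vb8 the three
  # pp_sndrcv_spbl processes are children of mpirun, not of an srun task.
  ps -o pid,ppid,cmd -u thipa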


Now the other test that works:
Combination:
salloc 
mpirun  -bynode -n 8 my_prog

The test:

[thipa at vb0 openmpi]$ salloc -p jlg -w vb8,vb9,vb10
salloc: Granted job allocation 27835
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ mpirun -bynode -n 8 ./pp_sndrcv_spbl
I am, process 1 starting on vb9, total by srun  8
I am, process 4 starting on vb9, total by srun  8
I am, process 0 starting on vb8, total by srun  8
I am, process 6 starting on vb8, total by srun  8
I am, process 7 starting on vb9, total by srun  8
I am, process 2 starting on vb10, total by srun  8
I am, process 5 starting on vb10, total by srun  8
I am, process 3 starting on vb8, total by srun  8
Me, process 0, send  1000 to process 2

Padb test:

[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm --verbose --proc-summary -a
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Active jobs (1) are 27835

Collecting information for job '27835'

Attaching to job 27835
Job has 3 process(es)
Job spans 3 host(s)
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,3-7]
Mode 'proc_summary' mode specific options:
    column_seperator : '  '
       nprocs_output : undef
         proc_format : 'rank,hostname,pid,vmsize,vmrss,stat.state=S,load1=uptime,pcpu=%cpu,stat.processor=lcore,name=command'
    proc_show_header : '1'
      proc_shows_fds : '0'
     proc_shows_maps : '0'
     proc_shows_proc : '1'
     proc_shows_stat : '1'
       proc_sort_key : 'rank'
  reverse_sort_order : '0'
rank  hostname  pid    vmsize     vmrss     S  uptime  %cpu  lcore  command
   0       vb8  23049  133440 kB  47104 kB  S    0.00     0      5  pp_sndrcv_spbl
   1       vb9  15828  112640 kB  25408 kB  S    0.00     0      0  pp_sndrcv_spbl
   2      vb10  10571  134464 kB  47168 kB  R    0.92   100      0  pp_sndrcv_spbl
   3       vb8  23058  111616 kB  25536 kB  S    0.00     0      2  pp_sndrcv_spbl
   4       vb9  15845  111616 kB  25408 kB  S    0.00     0      0  pp_sndrcv_spbl
   5      vb10  10575  111616 kB  25408 kB  S    0.92     0      1  pp_sndrcv_spbl
   6       vb8  23054  111616 kB  25408 kB  S    0.00     0      0  pp_sndrcv_spbl
   7       vb9  15830  111616 kB  25408 kB  S    0.00     0      0  pp_sndrcv_spbl
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose -tx 27835
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'

Collecting information for job '27835'

Attaching to job 27835
Job has 3 process(es)
Job spans 3 host(s)
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,3-7]
Mode 'stack' mode specific options:
     gdb_retry_count : '3'
 max_distinct_values : '3'
  stack_shows_locals : '0'
  stack_shows_params : '0'
   stack_strip_above : 'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress'
   stack_strip_below : 'main,__libc_start_main,start_thread'
    strip_above_wait : '1'
    strip_below_main : '1'
-----------------
[0-7] (8 processes)
-----------------
ThreadId: 1
  -----------------
  [0-1,3-7] (7 processes)
  -----------------
  main() at pp_sndrcv_spbl.c:53
    PMPI_Finalize() at ?:?
      ompi_mpi_finalize() at ?:?
        barrier() at ?:?
          opal_progress() at ?:?
            ThreadId: 2
              start_thread() at ?:?
                btl_openib_async_thread() at ?:?
                  poll() at ?:?
                    ??() at ?:?
                      ThreadId: 3
                        start_thread() at ?:?
                          service_thread_start() at ?:?
                            __GC___select() at ?:?
                              ??() at ?:?
  -----------------
  [2] (1 processes)
  -----------------
  main() at pp_sndrcv_spbl.c:49
    PMPI_Recv() at ?:?
      mca_pml_ob1_recv() at ?:?
        ThreadId: 2
          start_thread() at ?:?
            btl_openib_async_thread() at ?:?
              poll() at ?:?
                ??() at ?:?
                  ThreadId: 3
                    start_thread() at ?:?
                      service_thread_start() at ?:?
                        __GC___select() at ?:?
                          ??() at ?:?
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$


Thipadin.
