[padb] Réf. : Re: Patch of support of Slurm + OpenmpiOrte manager
thipadin.seng-long at bull.net
Thu Dec 3 10:45:37 GMT 2009
Hi,
I was off yesterday; in the meantime you have made several versions.
I am only testing the latest one (padb-slurm-open-3.patch).
I understand you want to detect automatically whether or not the user is
running the slurm/openmpi combination.
I am starting with a case that goes wrong; I think it needs more handling:
So the combination is:
salloc
srun -n 1 mpirun -bynode -n 8 my_prog
This combination should be equivalent to
salloc
mpirun -bynode -n 8 my_prog
so that is what I have in all my tests.
The result is a little confusing; let's have a look:
The test:
[thipa at vb0 openmpi]$ salloc -p jlg -w vb8,vb9,vb10
salloc: Granted job allocation 27834
[thipa at vb0 openmpi]$
[thipa at vb0 openmpi]$ srun -n1 mpirun -bynode -n 8 ./pp_sndrcv_spbl
srun: Warning: can't run 1 processes on 3 nodes, setting nnodes to 1
I am, process 3 starting on vb8, total by srun 8
I am, process 6 starting on vb8, total by srun 8
I am, process 0 starting on vb8, total by srun 8
I am, process 7 starting on vb9, total by srun 8
I am, process 4 starting on vb9, total by srun 8
I am, process 2 starting on vb10, total by srun 8
I am, process 5 starting on vb10, total by srun 8
I am, process 1 starting on vb9, total by srun 8
Me, process 0, send 1000 to process 2
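For reference, the placement printed above is what round-robin placement by node (the -bynode option) would give: rank i lands on node i mod 3 of the allocation. A minimal Python sketch of that mapping follows; the function name is only illustrative, and it assumes the nodes are used in the order vb8, vb9, vb10.

```python
def bynode_placement(nodes, nprocs):
    """Assign ranks to nodes round-robin, one rank per node per pass."""
    placement = {node: [] for node in nodes}
    for rank in range(nprocs):
        placement[nodes[rank % len(nodes)]].append(rank)
    return placement


placement = bynode_placement(["vb8", "vb9", "vb10"], 8)
for node, ranks in placement.items():
    print(node, ranks)
```

This puts ranks 0, 3 and 6 on vb8, ranks 1, 4 and 7 on vb9, and ranks 2 and 5 on vb10, which agrees with the "I am, process N starting on ..." lines.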
Padb Test:
[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose -tx 27834
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'
Collecting information for job '27834'
Attaching to job 27834
Job has 1 process(es)
Job spans 3 host(s)
Warning, failed to locate ranks [3,6]
Warning, remote process name differs across ranks
name : ranks
mpirun : [0]
pp_sndrcv_spbl : [1-2,4-5,7]
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,4-5,7]
Mode 'stack' mode specific options:
gdb_retry_count : '3'
max_distinct_values : '3'
stack_shows_locals : '0'
stack_shows_params : '0'
stack_strip_above : 'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress'
stack_strip_below : 'main,__libc_start_main,start_thread'
strip_above_wait : '1'
strip_below_main : '1'
-----------------
[0] (1 processes)
-----------------
main() at main.c:13
orterun() at orterun.c:686
opal_event_dispatch() at ?:?
opal_event_base_loop() at ?:?
poll_dispatch() at ?:?
poll() at ?:?
??() at ?:?
-----------------
[1-2,4-5,7] (5 processes)
-----------------
ThreadId: 1
-----------------
[1,4-5,7] (4 processes)
-----------------
main() at pp_sndrcv_spbl.c:53
PMPI_Finalize() at ?:?
ompi_mpi_finalize() at ?:?
barrier() at ?:?
opal_progress() at ?:?
ThreadId: 2
start_thread() at ?:?
btl_openib_async_thread() at ?:?
poll() at ?:?
??() at ?:?
ThreadId: 3
start_thread() at ?:?
service_thread_start() at ?:?
__GC___select() at ?:?
??() at ?:?
-----------------
[2] (1 processes)
-----------------
main() at pp_sndrcv_spbl.c:49
PMPI_Recv() at ?:?
mca_pml_ob1_recv() at ?:?
opal_progress() at ?:?
ThreadId: 2
start_thread() at ?:?
btl_openib_async_thread() at ?:?
poll() at ?:?
??() at ?:?
ThreadId: 3
start_thread() at ?:?
service_thread_start() at ?:?
__GC___select() at ?:?
??() at ?:?
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose --proc-summary
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'
padbr345P: Error: no jobs specified, use --all or jobids
[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose --proc-summary -a
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'
Active jobs (1) are 27834
Collecting information for job '27834'
Attaching to job 27834
Job has 1 process(es)
Job spans 3 host(s)
Warning, failed to locate ranks [3,6]
Warning, remote process name differs across ranks
name : ranks
mpirun : [0]
pp_sndrcv_spbl : [1-2,4-5,7]
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,4-5,7]
Mode 'proc_summary' mode specific options:
column_seperator : ' '
nprocs_output : undef
proc_format : 'rank,hostname,pid,vmsize,vmrss,stat.state=S,load1=uptime,pcpu=%cpu,stat.processor=lcore,name=command'
proc_show_header : '1'
proc_shows_fds : '0'
proc_shows_maps : '0'
proc_shows_proc : '1'
proc_shows_stat : '1'
proc_sort_key : 'rank'
reverse_sort_order : '0'
rank hostname pid   vmsize    vmrss    S uptime %cpu lcore command
0    vb8      22210 16320 kB  13952 kB S 0.00   0    3     mpirun
1    vb9      14985 112384 kB 25600 kB S 0.08   0    5     pp_sndrcv_spbl
2    vb10     9540  133440 kB 47296 kB R 1.15   99   1     pp_sndrcv_spbl
4    vb9      14986 111616 kB 25600 kB S 0.08   0    5     pp_sndrcv_spbl
5    vb10     9544  111616 kB 25600 kB S 1.15   0    0     pp_sndrcv_spbl
7    vb9      14987 112640 kB 25728 kB S 0.08   0    0     pp_sndrcv_spbl
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$
All processes alive:
ssh vb8
[thipa at vb8 ~]$ psu
PID PPID CMD
22210 22206 /home_nfs/thipa/openMPI/install/bin/mpirun -bynode -n 8 ./pp_sndrcv_
22213 22210 srun --nodes=2 --ntasks=2 --kill-on-bad-exit --nodelist=vb9,vb10 ort
22218 22210 ./pp_sndrcv_spbl
22219 22210 ./pp_sndrcv_spbl
22220 22210 ./pp_sndrcv_spbl
22990 22986 sshd: thipa at pts/6
22991 22990 -bash
23021 22991 ps -o pid,ppid,cmd -u thipa
[thipa at vb8 ~]$ ssh vb9
[thipa at vb9 ~]$ psu
PID PPID CMD
14982 14978 /home_nfs/thipa/openMPI/install/bin/orted -mca ess slurm -mca orte_e
14985 14982 ./pp_sndrcv_spbl
14986 14982 ./pp_sndrcv_spbl
14987 14982 ./pp_sndrcv_spbl
15776 15772 sshd: thipa at pts/6
15777 15776 -bash
15807 15777 ps -o pid,ppid,cmd -u thipa
[thipa at vb9 ~]$ ssh vb10
[thipa at vb10 ~]$ psu
PID PPID CMD
9531 9527 /home_nfs/thipa/openMPI/install/bin/orted -mca ess slurm -mca orte_e
9534 9531 ./pp_sndrcv_spbl
9535 9531 ./pp_sndrcv_spbl
10513 10509 sshd: thipa at pts/4
10514 10513 -bash
10544 10514 ps -o pid,ppid,cmd -u thipa
[thipa at vb10 ~]$
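The listings above suggest one way to tell the application processes from the launchers: the pp_sndrcv_spbl processes are children of mpirun on vb8 and of orted on vb9 and vb10. The Python sketch below applies that rule to the vb8 listing. It is only an illustration under that assumption, not how padb does it, and the helper names are hypothetical.

```python
LAUNCHERS = {"mpirun", "orted"}  # launcher names as seen in the listings (assumption)


def parse_ps(text):
    """Parse 'ps -o pid,ppid,cmd' rows into {pid: (ppid, cmd)}; skips the header."""
    procs = {}
    for line in text.strip().splitlines():
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[0].isdigit() and fields[1].isdigit():
            procs[int(fields[0])] = (int(fields[1]), fields[2])
    return procs


def basename(cmd):
    """Return the executable name without its directory or arguments."""
    return cmd.split()[0].rsplit("/", 1)[-1]


def app_processes(procs):
    """Return PIDs whose parent is a launcher and which are not srun helpers."""
    found = []
    for pid, (ppid, cmd) in sorted(procs.items()):
        parent = procs.get(ppid)
        if parent and basename(parent[1]) in LAUNCHERS and basename(cmd) != "srun":
            found.append(pid)
    return found


# Rows copied from the vb8 listing; the truncated commands are left as ps printed them.
VB8 = """
PID PPID CMD
22210 22206 /home_nfs/thipa/openMPI/install/bin/mpirun -bynode -n 8 ./pp_sndrcv_
22213 22210 srun --nodes=2 --ntasks=2 --kill-on-bad-exit --nodelist=vb9,vb10 ort
22218 22210 ./pp_sndrcv_spbl
22219 22210 ./pp_sndrcv_spbl
22220 22210 ./pp_sndrcv_spbl
"""

print(app_processes(parse_ps(VB8)))
```

On the vb8 rows this selects the three pp_sndrcv_spbl PIDs (22218, 22219, 22220) and leaves out both mpirun and the srun helper.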
padb reports mpirun as rank 0, which it should not, and ranks 3 and 6 are missing.
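A short Python sketch of the arithmetic behind this remark, using only the numbers from the first test. The helper name is illustrative and not part of padb.

```python
def parse_ranks(spec):
    """Expand a rank list such as '[1-2,4-5,7]' into a set of ints."""
    ranks = set()
    for part in spec.strip("[]").split(","):
        lo, _, hi = part.partition("-")
        ranks.update(range(int(lo), int(hi or lo) + 1))
    return ranks


# Host of each rank, copied from the "I am, process N starting on ..." lines.
HOST_OF = {0: "vb8", 3: "vb8", 6: "vb8",
           1: "vb9", 4: "vb9", 7: "vb9",
           2: "vb10", 5: "vb10"}

# Process names per rank set, as reported by padb for job 27834.
by_name = {"mpirun": parse_ranks("[0]"),
           "pp_sndrcv_spbl": parse_ranks("[1-2,4-5,7]")}

# Application ranks that padb did not attach to.
missing = sorted(set(range(8)) - by_name["pp_sndrcv_spbl"])
print(missing)
print({HOST_OF[r] for r in missing})
```

The application ranks that were not found are 0, 3 and 6, and all three run on vb8, the node where mpirun itself was started.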
Now the other test that works:
Combination:
salloc
mpirun -bynode -n 8 my_prog
The test:
[thipa at vb0 openmpi]$ salloc -p jlg -w vb8,vb9,vb10
salloc: Granted job allocation 27835
[thipa at vb0 openmpi]$
[thipa at vb0 openmpi]$
[thipa at vb0 openmpi]$
[thipa at vb0 openmpi]$ mpirun -bynode -n 8 ./pp_sndrcv_spbl
I am, process 1 starting on vb9, total by srun 8
I am, process 4 starting on vb9, total by srun 8
I am, process 0 starting on vb8, total by srun 8
I am, process 6 starting on vb8, total by srun 8
I am, process 7 starting on vb9, total by srun 8
I am, process 2 starting on vb10, total by srun 8
I am, process 5 starting on vb10, total by srun 8
I am, process 3 starting on vb8, total by srun 8
Me, process 0, send 1000 to process 2
Padb test:
[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm --verbose --proc-summary -a
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Active jobs (1) are 27835
Collecting information for job '27835'
Attaching to job 27835
Job has 3 process(es)
Job spans 3 host(s)
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,3-7]
Mode 'proc_summary' mode specific options:
column_seperator : ' '
nprocs_output : undef
proc_format : 'rank,hostname,pid,vmsize,vmrss,stat.state=S,load1=uptime,pcpu=%cpu,stat.processor=lcore,name=command'
proc_show_header : '1'
proc_shows_fds : '0'
proc_shows_maps : '0'
proc_shows_proc : '1'
proc_shows_stat : '1'
proc_sort_key : 'rank'
reverse_sort_order : '0'
rank hostname pid   vmsize    vmrss    S uptime %cpu lcore command
0    vb8      23049 133440 kB 47104 kB S 0.00   0    5     pp_sndrcv_spbl
1    vb9      15828 112640 kB 25408 kB S 0.00   0    0     pp_sndrcv_spbl
2    vb10     10571 134464 kB 47168 kB R 0.92   100  0     pp_sndrcv_spbl
3    vb8      23058 111616 kB 25536 kB S 0.00   0    2     pp_sndrcv_spbl
4    vb9      15845 111616 kB 25408 kB S 0.00   0    0     pp_sndrcv_spbl
5    vb10     10575 111616 kB 25408 kB S 0.92   0    1     pp_sndrcv_spbl
6    vb8      23054 111616 kB 25408 kB S 0.00   0    0     pp_sndrcv_spbl
7    vb9      15830 111616 kB 25408 kB S 0.00   0    0     pp_sndrcv_spbl
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$
[thipa at vb0 openmpi]$
[thipa at vb0 openmpi]$
[thipa at vb0 openmpi]$
[thipa at vb0 openmpi]$
[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose -tx 27835
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'
Collecting information for job '27835'
Attaching to job 27835
Job has 3 process(es)
Job spans 3 host(s)
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,3-7]
Mode 'stack' mode specific options:
gdb_retry_count : '3'
max_distinct_values : '3'
stack_shows_locals : '0'
stack_shows_params : '0'
stack_strip_above : 'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress'
stack_strip_below : 'main,__libc_start_main,start_thread'
strip_above_wait : '1'
strip_below_main : '1'
-----------------
[0-7] (8 processes)
-----------------
ThreadId: 1
-----------------
[0-1,3-7] (7 processes)
-----------------
main() at pp_sndrcv_spbl.c:53
PMPI_Finalize() at ?:?
ompi_mpi_finalize() at ?:?
barrier() at ?:?
opal_progress() at ?:?
ThreadId: 2
start_thread() at ?:?
btl_openib_async_thread() at ?:?
poll() at ?:?
??() at ?:?
ThreadId: 3
start_thread() at ?:?
service_thread_start() at ?:?
__GC___select() at ?:?
??() at ?:?
-----------------
[2] (1 processes)
-----------------
main() at pp_sndrcv_spbl.c:49
PMPI_Recv() at ?:?
mca_pml_ob1_recv() at ?:?
ThreadId: 2
start_thread() at ?:?
btl_openib_async_thread() at ?:?
poll() at ?:?
??() at ?:?
ThreadId: 3
start_thread() at ?:?
service_thread_start() at ?:?
__GC___select() at ?:?
??() at ?:?
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$
Thipadin.