[padb] Patch of support of Slurm + Openmpi Orte manager
thipadin.seng-long at bull.net
Mon Nov 30 14:36:00 GMT 2009
Hi Ashley,
Here is my patch against padb r341, which adds support for Slurm
combined with the Open MPI ORTE manager.
The key point is that we use salloc to get resources from Slurm and then run
Open MPI's mpirun inside that allocation to start jobs.
The current padb does not yet support this combination. When
we start padb with
rmgr=slurm in such a job environment, we see only the stack of orted
(see below).
My patch aims to remedy this.
Here is what happens:
salloc -p tsl -w machu139,machu140,machu141
[thipa at machu0 padb_open]$
salloc: Granted job allocation 8324
[thipa at machu0 padb_open]$
[thipa at machu0 padb_open]$ srun -n1 mpirun -n 9 pp_sndrcv_spbl
srun: Warning: can't run 1 processes on 3 nodes, setting nnodes to 1
I am, process 0 starting on machu139, total by srun 9
I am, process 3 starting on machu139, total by srun 9
I am, process 6 starting on machu139, total by srun 9
I am, process 1 starting on machu140, total by srun 9
I am, process 8 starting on machu141, total by srun 9
I am, process 4 starting on machu140, total by srun 9
I am, process 2 starting on machu141, total by srun 9
I am, process 7 starting on machu140, total by srun 9
I am, process 5 starting on machu141, total by srun 9
Me, process 0, send 1000 to process 2
...........
[thipa at machu0 padb_open]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8324 tsl bash thipa R 36:33 3 machu[139-141]
[thipa at machu0 padb_open]$
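The NODELIST column above uses Slurm's compressed notation (machu[139-141]). As a minimal sketch of what that notation stands for, the hypothetical helper below expands the simple single-bracket form into individual host names. It is only an illustration and is not taken from padb or from the patch; on a real system `scontrol show hostnames` does this expansion and handles the full syntax.

```python
import re
from typing import List


def expand_nodelist(nodelist: str) -> List[str]:
    """Expand a simple Slurm-style node list such as 'machu[139-141]'.

    Simplified on purpose: supports one bracket group holding
    comma-separated numbers or ranges. Anything without brackets is
    treated as a plain comma-separated list of host names.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", nodelist)
    if match is None:
        return nodelist.split(",")
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            low, high = part.split("-")
            width = len(low)  # keep zero padding, e.g. n[001-003]
            for number in range(int(low), int(high) + 1):
                hosts.append(f"{prefix}{number:0{width}d}")
        else:
            hosts.append(prefix + part)
    return hosts


print(expand_nodelist("machu[139-141]"))
```

For the allocation shown above this yields the three hosts passed to salloc with -w.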
padb with rmgr=slurm
[thipa at machu0 padb_open]$ ./padb -O rmgr="slurm" -O stack-shows-locals=no -O stack-shows-params=no --debug=verbose=all -tx 8324
DEBUG (verbose): 0: There are 1 processes over 3 hosts
-----------------
[0] (1 processes)
-----------------
main() at ?:?
orterun() at ?:?
opal_event_dispatch() at event.c:682
opal_event_loop() at event.c:746
poll_dispatch() at poll.c:167
poll() at ?:?
DEBUG (verbose): 0: Completed command
[thipa at machu0 padb_open]$
padb with rmgr="sl-orte" (my patch)
[thipa at machu0 padb_open]$ ./padb -O rmgr="sl-orte" -O stack-shows-locals=no -O stack-shows-params=no --debug=verbose=all -tx 8324
DEBUG (verbose): 0: There are 1 processes over 3 hosts
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,3-8]
-----------------
[0-8] (9 processes)
-----------------
ThreadId: 1
-----------------
[0-1,3-8] (8 processes)
-----------------
main() at pp_sndrcv_spbl.c:55
PMPI_Finalize() at pfinalize.c:46
ompi_mpi_finalize() at runtime/ompi_mpi_finalize.c:224
barrier() at grpcomm_bad_module.c:277
opal_progress() at runtime/opal_progress.c:189
ThreadId: 2
start_thread() at ?:?
btl_openib_async_thread() at btl_openib_async.c:346
poll() at ?:?
ThreadId: 3
start_thread() at ?:?
service_thread_start() at btl_openib_fd.c:427
select() at ?:?
-----------------
[2] (1 processes)
-----------------
main() at pp_sndrcv_spbl.c:50
PMPI_Recv() at precv.c:78
mca_pml_ob1_recv() at pml_ob1_irecv.c:104
opal_progress() at runtime/opal_progress.c:207
ThreadId: 2
start_thread() at ?:?
btl_openib_async_thread() at btl_openib_async.c:346
poll() at ?:?
ThreadId: 3
start_thread() at ?:?
service_thread_start() at btl_openib_fd.c:427
select() at ?:?
DEBUG (verbose): 1: Completed command
[thipa at machu0 padb_open]$
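The output above groups ranks that share a process state or a stack, and prints each group as a compressed rank set such as [0-1,3-8]. The sketch below shows one way such a grouping could be produced. The function names are hypothetical and this is not padb's own code (padb is written in Perl); it only illustrates the notation.

```python
def format_ranks(ranks):
    """Render a collection of ranks, e.g. [0, 1, 3, 4, 5], as '[0-1,3-5]'."""
    ordered = sorted(set(ranks))
    parts = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current run while the next rank is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return "[" + ",".join(parts) + "]"


def group_by_state(states):
    """Map each state to the compressed set of ranks reporting it."""
    groups = {}
    for rank, state in states.items():
        groups.setdefault(state, []).append(rank)
    return {state: format_ranks(members) for state, members in groups.items()}


# States as reported in the run above: rank 2 running, all others sleeping.
states = {rank: "S (sleeping)" for rank in range(9)}
states[2] = "R (running)"
for state, ranks in sorted(group_by_state(states).items()):
    print(f"{state} : {ranks}")
```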
Jobs can be started in any of the following ways:
1-salloc ... mpirun -n 6 openmpi_appli
2-salloc ....
bash: mpirun -n 6 openmpi_appli
3-salloc ...
bash: srun -n 1 mpirun -n 6 openmpi_appli
Here is the patch, which you may commit as is or rework. The patch supports all
of the possibilities above.
I don't use scontrol listpids, because I found that this command is not
universally available (some versions don't have it),
and it may issue an error message such as:
slurmd[machu139]: proctrack/pgid does not implement
slurm_container_get_pids
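To illustrate what a lookup that does not depend on scontrol listpids might look like, here is a minimal sketch that picks the launcher process out of ps output by owner and command name. This is an assumption made for illustration only and not necessarily what slorte.patch does; the function name, the sample PIDs and the choice of ps are all hypothetical.

```python
def find_launchers(ps_output, user, names=("mpirun", "orterun")):
    """Return the PIDs of launcher processes owned by `user`.

    `ps_output` is expected to hold one 'PID USER COMMAND' line per
    process, for example from: ps -e -o pid= -o user= -o comm=
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, owner, command = fields
        if owner == user and command.strip() in names:
            pids.append(int(pid))
    return pids


# Hypothetical sample; real PIDs and process lists will differ.
sample = """\
 4711 thipa    mpirun
 4712 thipa    pp_sndrcv_spbl
 4800 root     slurmd
"""
print(find_launchers(sample, "thipa"))
```

With the third launch method (srun -n 1 mpirun ...), mpirun runs on a node of the allocation rather than on the submit host, so a lookup of this kind would have to run on the allocated nodes.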
Thipadin.
More later.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slorte.patch
Type: application/octet-stream
Size: 4057 bytes
Desc: not available
URL: <http://pittman.org.uk/pipermail/padb-devel_pittman.org.uk/attachments/20091130/e7bf1d14/attachment.obj>