[padb] Re: [padb-devel] Patch for Support of PBS Pro resource manager
Sylvain Jeaugey
sylvain.jeaugey at bull.net
Wed Nov 25 10:37:21 GMT 2009
Hi all.
That's great news. For future developments, we will need to support a
large number of combinations, and in this respect I'm wondering whether the
rmgr approach is fine-grained enough. Maybe a multi-step approach would be better.
Here is how I would love to configure padb:
jobmgr = slurm / pbs / lsf / local / ...
relay  = none / orte / mpd / ...
mpi    = openmpi / mpich / mpich2
The jobmgr step would convert a jobid into a list of host/pid pairs, the relay
step would follow children (or not, e.g. for a pure srun launch), and the mpi
step would determine the MPI rank depending on the MPI library.
Maybe "relay" would be better named "launch system" to cope with, e.g.,
blaunch under lsf. I don't know the internals of padb, but maybe this
approach would remove the need to create a new rmgr each time a new
combination is used.
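To make the idea concrete, here is a minimal sketch in Python (padb itself is Perl). Every name below is hypothetical and the stage bodies are only placeholders, not padb internals; the point is just that the three stages could be selected and composed independently:

```python
# Hypothetical sketch of the proposed three-stage pipeline. None of these
# names come from padb; they only illustrate how independently selectable
# jobmgr / relay / mpi stages could compose.

def pbs_jobmgr(jobid):
    """jobmgr stage: turn a job id into a list of (host, pid) pairs."""
    # A real implementation would query the resource manager here,
    # e.g. by parsing `qstat -an` output for PBS. Dummy data below.
    return [("xn19", 1145), ("xn20", 1146), ("xn21", 1147)]

def orte_relay(procs):
    """relay stage: follow launcher children down to the app processes."""
    # For a pure srun launch this stage would be the identity function;
    # the pid arithmetic below is only a stand-in for walking children.
    return [(host, pid + 1) for host, pid in procs]

def mpich2_mpi(procs):
    """mpi stage: map each (host, pid) process to its MPI rank."""
    # Placeholder: real code would read the rank from the MPI library.
    return {proc: rank for rank, proc in enumerate(procs)}

def resolve(jobid, jobmgr, relay, mpi):
    """Compose the three stages; each one is swappable on its own."""
    return mpi(relay(jobmgr(jobid)))

ranks = resolve("27617.xn0", pbs_jobmgr, orte_relay, mpich2_mpi)
for (host, pid), rank in sorted(ranks.items(), key=lambda kv: kv[1]):
    print(f"rank {rank}: {host}/{pid}")
```

With this split, supporting a new combination would mean picking one function per stage rather than writing a whole new rmgr.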
Ashley, what do you think about it? Do you see how it could be done
inside padb?
Sylvain
On Wed, 25 Nov 2009, thipadin.seng-long at bull.net wrote:
>
> Hi,
>
> I have tried your last commit (r331), which includes PBS support.
> It seems to work now. You have made all the corrections I sent (in your own way).
> Thank you for accepting to commit them.
> Now let's take a look at what padb outputs for a PBS job:
>
> qstat -an:
> xn0:
> Req'd Req'd Elap
> Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
> --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
> 27617.xn0 thipa workq STDIN 1145 3 9 -- -- R 00:15
> xn19/0*3+xn20/0*3+xn21/0*3
> [thipa at xn5 padb_open]$
>
> [thipa at xn5 padb_open]$ DirTest/padb -O rmgr=pbs -O stack-shows-locals=no -O stack-shows-params=no -O check-signon=none
> -tx 27617
> -----------------
> [0,3,6] (3 processes)
> -----------------
> ThreadId: 1
> main() at pp_sndrcv_spbl.c:50
> PMPI_Finalize() at ?:?
> MPID_Finalize() at ?:?
> MPIDI_CH3_Progress_wait() at ?:?
> MPIDU_Sock_wait() at ?:?
> poll() at ?:?
> ThreadId: 2
> start_thread() at ?:?
> fd_server() at server.c:354
> select() at ?:?
> -----------------
> [1,4,7] (3 processes)
> -----------------
> main() at pp_sndrcv_spbl.c:50
> PMPI_Finalize() at ?:?
> MPID_Finalize() at ?:?
> MPIDI_CH3_Progress_wait() at ?:?
> MPIDU_Sock_wait() at ?:?
> poll() at ?:?
> -----------------
> [2] (1 processes)
> -----------------
> main() at pp_sndrcv_spbl.c:46
> PMPI_Recv() at ?:?
> MPID_Progress_wait() at ?:?
> MPIDI_CH3_Progress_wait() at ?:?
> MPIDU_Sock_wait() at ?:?
> poll() at ?:?
> [thipa at xn5 padb_open]$
>
>
> Next, I'll send you a version that supports slurm combined with openmpi's mpirun (orte).
> This case is used in our company; it's not a pure slurm job, nor a pure orte job,
> but a combination of the two.
> I will be working against r331.
>
> Thanks for everything.
> Regards, Thipadin.
>