[padb] Re: [padb-devel] Patch for Support of PBS Pro resource manager

Sylvain Jeaugey sylvain.jeaugey at bull.net
Wed Nov 25 10:37:21 GMT 2009


Hi all.

That's great news. For future developments we will need to support a 
large number of combinations, and in this respect I'm wondering whether the 
rmgr approach is fine-grained enough. Maybe a multi-step approach would be better.

Here is how I would love to configure padb:
jobmgr = slurm / pbs / lsf / local / ...
relay = none / orte / mpd / ...
mpi = openmpi / mpich / mpich2

The jobmgr step would convert a jobid into a list of host/pid pairs, the relay 
step would follow children (or not, e.g. for a pure srun launch), and the mpi 
step would determine the MPI rank depending on the MPI library.
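
For example (purely hypothetical: none of these options exist in padb today, 
the names are only there to illustrate the idea), a mixed job could then be 
debugged with something like:

  padb -O jobmgr=slurm -O relay=orte -O mpi=openmpi -tx <jobid>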

Maybe "relay" would be better defined as "launch system" to cope with e.g. 
blaunch under lsf. I don't know the internals of padb, but maybe this 
approach would remove the need to create a new rmgr each time a new 
combination is used.
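
Very roughly, here is the kind of layering I have in mind, written as a small 
Python-style sketch (purely illustrative: padb itself is written in Perl, and 
these function names are made up, not actual padb internals):

  def resolve_ranks(jobid, job_to_pids, follow_children, pid_to_rank):
      # jobmgr step: turn a jobid into a list of (host, pid) pairs.
      pairs = job_to_pids(jobid)
      # relay step: follow the launcher's children (a no-op for a pure
      # srun launch, walk the orted/mpd children otherwise).
      pairs = follow_children(pairs)
      # mpi step: map each (host, pid) to its MPI rank.
      return {pid_to_rank(host, pid): (host, pid) for (host, pid) in pairs}

Each supported jobmgr / relay / mpi value would then only have to provide its 
own implementation of the corresponding callback.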

Ashley, what do you think about it? Do you see how it could be done 
inside padb?

Sylvain

On Wed, 25 Nov 2009, thipadin.seng-long at bull.net wrote:

> 
> Hi,
> 
> I have tried your last commit (r331), which includes the PBS support.
> It seems to work now. You have applied (in your own way) all the corrections I sent.
> Thank you for accepting and committing them.
> Now let's take a look at what padb outputs for a PBS job:
> 
> qstat -an:
> xn0:
>                                                             Req'd  Req'd   Elap
> Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
> --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
> 27617.xn0       thipa    workq    STDIN        1145   3   9    --    --  R 00:15
>    xn19/0*3+xn20/0*3+xn21/0*3
> [thipa@xn5 padb_open]$
> 
> [thipa@xn5 padb_open]$ DirTest/padb -O rmgr=pbs -O stack-shows-locals=no -O stack-shows-params=no -O check-signon=none -tx 27617
> -----------------
> [0,3,6] (3 processes)
> -----------------
> ThreadId: 1
>   main() at pp_sndrcv_spbl.c:50
>     PMPI_Finalize() at ?:?
>       MPID_Finalize() at ?:?
>         MPIDI_CH3_Progress_wait() at ?:?
>           MPIDU_Sock_wait() at ?:?
>             poll() at ?:?
>               ThreadId: 2
>                 start_thread() at ?:?
>                   fd_server() at server.c:354
>                     select() at ?:?
> -----------------
> [1,4,7] (3 processes)
> -----------------
> main() at pp_sndrcv_spbl.c:50
>   PMPI_Finalize() at ?:?
>     MPID_Finalize() at ?:?
>       MPIDI_CH3_Progress_wait() at ?:?
>         MPIDU_Sock_wait() at ?:?
>           poll() at ?:?
> -----------------
> [2] (1 processes)
> -----------------
> main() at pp_sndrcv_spbl.c:46
>   PMPI_Recv() at ?:?
>     MPID_Progress_wait() at ?:?
>       MPIDI_CH3_Progress_wait() at ?:?
>         MPIDU_Sock_wait() at ?:?
>           poll() at ?:?
> [thipa@xn5 padb_open]$
> 
> 
> Next, I'll send you a version that supports slurm combined with Open MPI's mpirun (orte).
> This case is used in our company. I mean it's neither a pure slurm job nor a pure orte job;
> it's a combination.
> I will be working against r331.
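> 
> For illustration (the exact launch command depends on the site), the kind of
> job I mean is an Open MPI mpirun started inside a SLURM allocation, e.g.:
> 
>   salloc -N 3 mpirun -np 9 ./a.out
> 
> so slurm owns the allocation, but the orte daemons actually start the ranks.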
> 
> Thanks for everything.
> Regards,
> Thipadin.
>


More information about the padb-devel mailing list