[padb] Réf. : Re: Réf. : Re: Réf. : Re: Réf. : Re: [padb-devel] Patch for Support of PBS Pro resource manager

Wed Nov 25 11:09:05 GMT 2009

On Wed, 2009-11-25 at 11:37 +0100, Sylvain Jeaugey wrote:
> That's great news. For future development, we will need to support a 
> large number of combinations, and in this respect I'm wondering whether 
> the rmgr approach is fine-grained enough. Maybe a two-step approach 
> would be better.
> 
> Here is how I would love to configure padb:
> jobmgr = slurm / pbs / lsf / local / ...
> relay = none / orte / mpd / ...
> mpi = openmpi / mpich / mpich2
> 
> The jobmgr step would convert a jobid into a list of host/pid pairs, the 
> relay step would follow child processes (or not, e.g. for a pure srun 
> launch), and the mpi step would get the MPI rank depending on the MPI 
> library.
> 
> Maybe "relay" would be better defined as "launch system" to cope with e.g. 
> blaunch under lsf. I don't know the internals of padb, but maybe this 
> approach would remove the need to create a new rmgr each time a new 
> combination is used.
> 
> Ashley, what do you think about it? Do you see how it could be done 
> inside padb?

I've been thinking the very same thing; see this commit, which documents
my thoughts:

http://code.google.com/p/padb/source/detail?r=329

Specifically on your comments: the mpi step is really the same as the
jobmgr step. Some resource managers let you find the target pids on the
front end (orte/mpd/slurm) and some let you find them once you are on
the destination node (slurm/rms/mpd). Even when both are provided it's
not easy to pick which one to use: slurm will tell you local pids only,
but mpd will tell you all pids for all nodes. Currently only orte and
mpirun do the pid=>rank conversion on the front end, although mpd would
benefit from it as well.
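
To make the front-end versus destination-node distinction concrete, here
is a rough sketch in Python (padb itself is not written this way). Every
name in it is hypothetical and nothing here is taken from the padb
source. The rule of preferring the front-end lookup when both exist is
purely an assumption for illustration; as noted above, the real choice
is not that simple.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical type: maps a hostname to the pids of the job on that host.
HostPids = Dict[str, List[int]]


@dataclass
class ResourceManager:
    """Illustrative stand-in for a resource manager, not padb's own structure."""
    name: str
    # Runs on the front end and returns pids for every node of the job.
    frontend_lookup: Optional[Callable[[str], HostPids]] = None
    # Stands in for a query executed on one node: (host, jobid) -> local pids.
    node_lookup: Optional[Callable[[str, str], List[int]]] = None


def discover_pids(rmgr: ResourceManager, jobid: str, hosts: List[str]) -> HostPids:
    """Find the pids of a job, using whichever lookup the manager offers."""
    # Assumption: prefer the front-end lookup when it is available.
    if rmgr.frontend_lookup is not None:
        return rmgr.frontend_lookup(jobid)
    if rmgr.node_lookup is None:
        raise ValueError(f"{rmgr.name}: no way to find pids")
    # Otherwise ask each node in turn for its own local pids.
    return {host: rmgr.node_lookup(host, jobid) for host in hosts}
```

A manager that only answers on the destination node would be declared
with node_lookup alone, and discover_pids would then visit each host.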

There are really also (at least) three strata of resource managers:
the normal full-featured ones (orte/slurm/mpd/rms/pbs), a middle layer
which is the generic "mpirun", and finally the local ones: local,
local-fd and probably local-exe soon as well.

The launch system is the most critical aspect of scalability for large
jobs. It may be that I concentrate work on a single relay (probably
orte) for this case and insist that it's the only relay used for jobs
of, say, 4k processes or more.

Finally there is lsf-rms, which is really rms but allows lsf job
identifiers to be specified and converted to rms identifiers
transparently by padb.
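
The lsf-rms case can be pictured as a thin wrapper that translates the
job identifier and then defers to the underlying lookup. The sketch
below is Python with made-up names and toy data; it is not how padb
implements it, only an illustration of the idea.

```python
from typing import Callable, Dict, List

# Hypothetical type: maps a hostname to the pids of the job on that host.
HostPids = Dict[str, List[int]]


def with_jobid_translation(
    translate: Callable[[str], str],
    inner_lookup: Callable[[str], HostPids],
) -> Callable[[str], HostPids]:
    """Wrap a pid lookup so it accepts identifiers from another job manager."""
    def lookup(outer_jobid: str) -> HostPids:
        # Convert the outer identifier, then reuse the inner lookup unchanged.
        return inner_lookup(translate(outer_jobid))
    return lookup


# Toy tables standing in for real queries; the identifiers are invented.
LSF_TO_RMS = {"lsf-1001": "rms-7"}
RMS_JOBS = {"rms-7": {"node0": [4321, 4322]}}

lsf_rms_lookup = with_jobid_translation(LSF_TO_RMS.__getitem__, RMS_JOBS.__getitem__)
```

With this shape the wrapped manager needs no knowledge of the outer
identifiers, which is one possible way to avoid writing a whole new rmgr
for each combination.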

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk