[padb] Réf. : Re: Patch of support of Slurm + Openmpi Orte manager

Ashley Pittman ashley at pittman.co.uk
Tue Dec 1 17:17:03 GMT 2009


On Tue, 2009-12-01 at 16:30 +0100, thipadin.seng-long at bull.net wrote:
> >I also notice the /proc/version in the patch, does this mean the patch
> >works on an OS other than Linux?
> 
> It is not complete. To run on an OS other than Linux you must have two
> branches:
> 1 - with /proc/version, using readdir on /proc and /proc/$pid/cmdline
> 2 - with something like "ps -edf | grep slurmstepd".

I'd view slurmstepd arguments as liable to change over time, so if there
is a way to get this information without relying on slurm internals I'd
prefer it.  If this is the only way to get the information, however, then
we'll use it.
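
For illustration only, the first branch described in the quoted text
(readdir on /proc plus /proc/$pid/cmdline) might look roughly like the
sketch below.  This is not padb code: the function name and the proc_root
parameter are hypothetical, and it simply matches on a substring of the
command line, which is exactly the kind of dependence on slurmstepd
details that could break over time.

```python
import os


def find_pids_by_name(name, proc_root="/proc"):
    """Return sorted PIDs whose command line mentions `name`.

    Hypothetical helper: `proc_root` exists only so the function can be
    pointed at a fake directory tree for testing.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/version
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited meanwhile, or is not readable
        # Arguments in cmdline are separated by NUL bytes.
        args = raw.decode(errors="replace").split("\0")
        if any(name in arg for arg in args):
            pids.append(int(entry))
    return sorted(pids)
```

Calling find_pids_by_name("slurmstepd") on a Linux compute node would
list candidate step daemons; on a system without /proc the ps-based
branch would still be needed.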

> >What happens if you run salloc... srun?  Does this work with the
> >existing support and how should users know which resource manager plugin
> >to pick (Ideally padb could do the right thing).
> 
> You mean salloc ... srun ... mpirun prog?
> This is what I have tried:

I was thinking of salloc...srun with, say, an MPICH2 program.  Does that
work with the existing slurm support?  How would we document for users
which resource manager (or mode) to use in this case?


> >> [thipa at machu0 padb_open]$ ./padb -O rmgr="sl-orte" -O
> >> stack-shows-locals=no  -O stack-shows-params=no --debug=verbose=all
> >> -tx 8324 
> >> DEBUG (verbose):   0: There are 1 processes over 3 hosts 
> 
> >This isn't great, the number of processes expected is so far only used
> >to check for missing processes but there are other potential uses for it
> >so I'd rather it was correct.
> 
> I will dig into it more; I don't know what you actually do with the
> process count.

The expected process count is returned by setup_job and is only used to
ensure that all processes are present.  padb could live without this,
however being able to warn users about missing processes is useful and was
something that was requested from the 2.0 series.  Perhaps I could arrange
for nprocs to be returned from the find_pids function on the inner
process and passed back up the tree somehow.
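
To make the idea concrete, here is a minimal sketch of one possible
reading of it, not how padb does it today.  It assumes each inner process
reports the ranks it found together with the job size it believes in;
the report layout ("ranks", "nprocs") and both function names are
hypothetical.

```python
def merge_reports(reports):
    """Combine per-host reports into (ranks_found, expected_nprocs).

    Each report is assumed to be a dict with "ranks" (the ranks found on
    that host) and "nprocs" (the job size as seen from that host).
    """
    found = set()
    expected = 0
    for report in reports:
        found.update(report["ranks"])
        expected = max(expected, report["nprocs"])
    return found, expected


def missing_ranks(reports):
    """Return the sorted list of ranks that no host reported."""
    found, expected = merge_reports(reports)
    return sorted(set(range(expected)) - found)
```

With reports of ranks [0, 1] and [3] from two hosts, each giving nprocs
as 4, missing_ranks returns [2], which is the case the missing-process
warning is meant to catch.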

> >> I don't use scontrol listpids, because I found this command is not a
> >> universal method (some versions don't have it),
> >> and it may issue error messages such as:
> >> slurmd[machu139]: proctrack/pgid does not implement
> >> slurm_container_get_pids 
> 
> >I'd prefer to use this if at all possible, this option was added at a
> >request by me several years ago so I'd have thought most versions have
> >it by now, can you be clearer on the versions where it doesn't work?
> 
> It work only for slurm upper from 1.2, may be some clients have it
> still ?

At some point we have to drop support for old versions.  The current
slurm code won't work without it, so requiring it for the
openmpi/slurm combination doesn't seem like too much of a hardship to
me.

> If you can get rid of messages above (slurmd[hostnn]: proctrack/pgid
> does not implement) 

I'll raise this on the slurm list.  I get these warning messages too but
I'd assumed that was because I'm using a debug build.  The listpids
command still works even though these warnings are issued.
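
As a rough sketch of why the warnings need not get in the way, a parser
for the listpids output can simply skip any line that does not look like
a process row.  This assumes the output is a header line followed by
five whitespace-separated columns (PID, JOBID, STEPID, LOCALID,
GLOBALID); that layout, the function name and the field names are
assumptions for illustration and should be checked against the slurm
version in use.

```python
def parse_listpids(output):
    """Return one dict per process row, ignoring every other line.

    Assumes five whitespace-separated columns per row.  Values are kept
    as strings because not every column is guaranteed to be numeric.
    """
    fields = ("pid", "jobid", "stepid", "localid", "globalid")
    rows = []
    for line in output.splitlines():
        parts = line.split()
        # Header and warning lines fail one of these two checks.
        if len(parts) != len(fields) or not parts[0].isdigit():
            continue
        rows.append(dict(zip(fields, parts)))
    return rows
```

A warning such as the proctrack/pgid message quoted above is dropped
because it neither starts with a numeric PID nor has five columns.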

Ashley,
-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




