[padb-users] running with SGE/OMPI

Daniel Kidger daniel.kidger at googlemail.com
Mon Jul 12 17:58:27 BST 2010


> You are right, padb will use the "jobid" that orte had allocated the job
> rather than the id that Gridengine has given it, but the tight integration
> might have changed the orte behaviour.  I see this with mpd (Mpich2) and PBS
> as well, where PBS sets an environment variable which causes mpd to store
> its temporary files under a different filename.  Unfortunately this is very
> hard to get around.

In particular, I traced this to the following lines in mpirun (from Intel MPI
4.0.0):
---------------
if [ -n "$PBS_ENVIRONMENT" ] ; then
    export MPD_CON_EXT="${PBS_JOBID}_$$" # PBS Pro and Torque
(lines deleted)
elif [ -n "$MP_JOBID" ] ; then
    export MPD_CON_EXT="${MP_JOBID}_$$" # SGE
---------------
The environment variable MPD_CON_EXT is used by mpdboot to add an extension
to the names of both the socket /tmp/mpd2.console_<username> and the logfile
/tmp/mpd2.logfile_<username>.
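
To make the naming concrete, here is a minimal sketch of how the two paths
change once MPD_CON_EXT is set. It assumes the extension is appended after a
separating underscore, which is an assumption worth checking against your own
/tmp; the function name mpd_paths is made up for illustration.

```shell
#!/bin/sh
# mpd_paths USER [EXT]
# Print the console socket and logfile names mpd would use for USER.
# Assumption: a non-empty EXT (the value of MPD_CON_EXT) is appended as "_EXT".
mpd_paths() {
    suffix=${2:+_$2}    # empty when no extension is given
    printf '/tmp/mpd2.console_%s%s\n' "$1" "$suffix"
    printf '/tmp/mpd2.logfile_%s%s\n' "$1" "$suffix"
}
```

Calling it as mpd_paths "$(id -un)" "${PBS_JOBID}_<pid>" shows where to look
for a given job; without a second argument it prints the default names.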

For padb I use my own wrapper that sets MPD_CON_EXT from the (known) PBS_JOBID.
(The process ID, though, has to be found by inspection.)
padb appears to call mpdlistjobs, which itself honours MPD_CON_EXT.
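
A wrapper along these lines could look like the sketch below. It is not the
exact script I use: the function names are hypothetical, and it assumes the
console socket is named /tmp/mpd2.console_<username>_<jobid>_<pid>, so that the
"inspection" step is just a glob over /tmp for the known job id.

```shell
#!/bin/sh
# find_mpd_con_ext DIR USER JOBID
# Print the extension ("<jobid>_<pid>") of the first console socket in DIR
# that belongs to USER and JOBID; return non-zero if none is found.
# Assumption: sockets are named DIR/mpd2.console_<user>_<jobid>_<pid>.
find_mpd_con_ext() {
    for f in "$1/mpd2.console_${2}_${3}"_*; do
        [ -e "$f" ] || continue          # the glob matched nothing
        base=${f##*/}                    # strip the directory part
        printf '%s\n' "${base#mpd2.console_${2}_}"
        return 0
    done
    return 1
}

# run_with_mpd_ext COMMAND [ARGS...]
# Run COMMAND with MPD_CON_EXT set for the job named by $PBS_JOBID.
run_with_mpd_ext() {
    ext=$(find_mpd_con_ext /tmp "$(id -un)" "${PBS_JOBID:-}") || {
        echo "no mpd console socket found for job ${PBS_JOBID:-}" >&2
        return 1
    }
    env MPD_CON_EXT="$ext" "$@"
}
```

Something like run_with_mpd_ext mpdlistjobs is a quick way to check that the
right mpd ring is found before running padb the same way with your usual
arguments. If several sockets match the same job id, the sketch simply takes
the first one.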

Hope this helps,

Daniel



On 12 July 2010 14:01, Ashley Pittman <ashley at pittman.co.uk> wrote:

>
> On 9 Jul 2010, at 15:32, Dave Love wrote:
>
> > Ashley Pittman <ashley at pittman.co.uk> writes:
> >
> > I assumed Gridengine is relevant (a) in referring to `jobs', and (b) in
> > that I think the OpenMPI tight integration is relevant, at least because
> > ompi-ps appears to be looking in the wrong place for files.
>
> You are right, padb will use the "jobid" that orte had allocated the job
> rather than the id that Gridengine has given it, but the tight integration
> might have changed the orte behaviour.  I see this with mpd (Mpich2) and PBS
> as well, where PBS sets an environment variable which causes mpd to store
> its temporary files under a different filename.  Unfortunately this is very
> hard to get around.
>
> > That's easy, but neither mpirun nor orte work.  With mpirun I get
> >
> > Error, resource manager "mpirun" not supported
>
> You need to use the 3.2 beta release for this; I keep forgetting it's not
> in 3.0.  When using this method of attaching to jobs you have to run padb on
> the host where the "mpirun" process is running, and the jobid will be the pid
> of that process.  padb uses pdsh to launch itself on the nodes, so you'll need
> to have this installed if you haven't already.
>
> > and orte doesn't find any jobs because ompi-ps doesn't.  I'll try to
> > figure out what's going on when I get some time.
>
> Unfortunately, without a working ompi-ps padb has no way of collecting the
> information it needs, so the orte resource manager won't work for you in this
> case.  You could ask on the Open MPI users list to see if there is anything
> they recommend.  As above, we managed to get this working on MPICH2 recently
> by asking users to unset PBS_JOBID in their job script.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
>
>