[padb-users] start using padb on TORQUE

Ashley Pittman ashley at pittman.co.uk
Fri Nov 26 20:26:37 GMT 2010


On 26 Nov 2010, at 02:45, Jie Cai wrote:

> 
> On 26/11/10 07:53, Ashley Pittman wrote:
>> 
>>> * A common variant of MPI jobs are those launched like
>>>      mpirun wrapper_script mpi_executable
>>>  so that the parent of the MPI tasks is not orted/mpid/mpirun. We are
>>>  interested in ways to support such jobs.  Since job processes are
>>>  contained in cpusets (cgroups) on our system, we can easily get the
>>>  relevant process list and then use environment to find ranks. Will it
>>>  matter if a non-MPI process with OMPI_COMM_WORLD_RANK set is queried
>>>  for message queue info?  Does it matter that two processes have the same
>>>  rank?
>>>     
>> Padb *should* handle this case. It only allows one process per rank and, depending on the resource manager, it will pick either the direct child of the resource manager or, if that process is deemed to be a wrapper script and has any children, the first of those children.  The definition of a wrapper script can be configured with the "scripts" configuration option; it defaults to "bash,sh,dash,ash,perl,xterm", so it should cover most bases.
>> 
>> The code for this is in convert_pids_to_child_pids(), which is called once per node. It is passed a list of potential processes that are direct descendants of the resource manager and decides which to use based on what processes are active.
>> 
>> Ashley.
>> 
>>   
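
The selection rule described above can be sketched roughly as follows. This is only an illustration in Python, not padb's actual code (padb itself is written in Perl). The function name pick_target() and the two dictionaries are hypothetical; only the default "scripts" value is taken from the message.

```python
# Default value of the "scripts" option, as quoted in the message above.
DEFAULT_SCRIPTS = "bash,sh,dash,ash,perl,xterm"


def pick_target(pid, names, children, scripts=DEFAULT_SCRIPTS):
    """Return the pid to inspect for one rank (hypothetical helper).

    names    maps pid -> process name
    children maps pid -> list of child pids, in discovery order
    """
    wrappers = set(scripts.split(","))
    # A wrapper script that has children: use its first child instead.
    if names[pid] in wrappers and children.get(pid):
        return children[pid][0]
    # Otherwise keep the direct child of the resource manager.
    return pid
```

With names = {100: "bash", 101: "mpi_app"} and children = {100: [101]}, pick_target(100, names, children) selects 101; a process whose name is not in the list is returned unchanged.
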
> I found out that pid_to_name() does not pick the correct script process. I have the following debug information:
> 
> inner: x153: rmpid is 4376, name is lu.sh.
> inner: x153: notscripts pid is 4376.
> 
> I guess this is because find_from_status() does not give the correct script name. I think it might be better to read /proc/$pid/cmdline, take its first element, and compare that against the preset "scripts" keys.
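
A minimal sketch of the /proc/$pid/cmdline check suggested above, in Python for illustration (padb itself is Perl). It assumes find_from_status() reads the Name field of /proc/$pid/status, which on Linux is normally the script's own file name (here lu.sh) when a script is started through a #! line, whereas the first element of cmdline is the interpreter. All function names below are hypothetical.

```python
import os

# Default value of the "scripts" option, as quoted earlier in the thread.
DEFAULT_SCRIPTS = "bash,sh,dash,ash,perl,xterm"


def first_arg_name(raw):
    """Base name of argv[0] from the raw bytes of /proc/<pid>/cmdline."""
    # Arguments in cmdline are separated by NUL bytes.
    argv0 = raw.split(b"\0", 1)[0]
    return os.path.basename(argv0.decode(errors="replace"))


def is_wrapper_cmdline(raw, scripts=DEFAULT_SCRIPTS):
    """True if argv[0] names one of the configured wrapper interpreters."""
    return first_arg_name(raw) in scripts.split(",")


def is_wrapper_pid(pid, scripts=DEFAULT_SCRIPTS):
    """Same check for a live process (Linux only, reads /proc)."""
    with open(f"/proc/{pid}/cmdline", "rb") as fh:
        return is_wrapper_cmdline(fh.read(), scripts)
```

For a script run as "/bin/sh ./lu.sh" the raw cmdline is b"/bin/sh\0./lu.sh\0", so the first element reduces to "sh" and matches the default list. An empty cmdline, as seen for kernel threads, is treated as not a wrapper.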

I took this 

> I will hack on it later today or early next week, and will keep you updated on whether it is working.
> 
> Kind Regards,
> Jie
> 
> -- 
> Jie Cai                         Jie.Cai at anu.edu.au
> ANU Supercomputer Facility      NCI National Facility
> Leonard Huxley, Mills Road      Ph:  +61 2 6125 7965
> Australian National University  Fax: +61 2 6125 8199
> Canberra, ACT 0200, Australia   http://nf.nci.org.au
> -----------------------------------------------------
> 

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




