[padb] Re: Re: Bull changes (with LSF -mpich2wrapper patch)

Ashley Pittman ashley at pittman.co.uk
Wed Jan 27 17:30:15 GMT 2010


On 21 Jan 2010, at 14:20, thipadin.seng-long at bull.net wrote:
> 
> I'm getting back to you after a short break, as I've been doing some validation of the openmpi spawn functionality. 
> Now that I've finished what you asked me above, I am sending both patches. 
> One is for the lsf-mpich2 wrapper and the other for the lsf-openmpi wrapper. I made both against the r386 version. 
> Both are similar and share many common subroutines. As the patches are separate, some routines 
> appear in both patches. I would prefer that you integrate them once, so that you can factorize the shared code. 
> If you need some 'ps' or 'bjobs' command layouts to understand the code, please ask and I'll send them. 

As they are dependent on each other, could you send them as a single, combined patch, please?

I don't have any systems I can test this on, as I don't have LSF, but I would like to understand the code. Could you put together a paragraph for each rmgr describing how the underlying resource manager lays out processes and how padb finds its information? I'm particularly interested in why it has to ssh around to different nodes to get the information it needs.

With the ps command you can suppress the header line by using the option "-o pid=,ppid=,cmd=", which avoids the special case for removing it later on. Stripping the leading spaces from ps output is already done in get_extended_process_list(); could you use the same regexp in get_line_ppid() for clarity, please?
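A minimal Python sketch of the header-free approach, assuming a Linux (procps) ps that understands the cmd keyword. parse_ps_output() and list_processes() are hypothetical helpers, not padb functions (padb itself is Perl); the point is that with empty column names there is no header line to skip, and one shared regexp handles the leading whitespace on every line. The sketch passes each column as its own -o option, which gives the same header-free output.

```python
import re
import subprocess

# Matches the leading spaces ps uses to right-align its numeric columns.
LEADING_WS = re.compile(r"^\s+")


def parse_ps_output(text):
    """Parse header-free ps output (pid, ppid, cmd) into (pid, ppid, cmd) tuples.

    Every column name ends in '=', so ps prints no header line and every
    non-empty line is a process record: no header special case is needed.
    """
    procs = []
    for line in text.splitlines():
        line = LEADING_WS.sub("", line)  # one shared regexp for stripping
        if not line:
            continue
        # The command line may contain spaces, so split at most twice.
        parts = line.split(None, 2)
        cmd = parts[2] if len(parts) > 2 else ""
        procs.append((int(parts[0]), int(parts[1]), cmd))
    return procs


def list_processes():
    """Run ps on the local node (assumes a procps-style ps that knows 'cmd')."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "cmd="],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps_output(out)
```

Splitting with a limit of two keeps the command line, which usually contains spaces, in a single field.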

I'm not sure that your loop over @chaps in lsfmpich2wr_get_mpiproc() is correct: should the if ($found_app != 0) test be outside the main loop? Again, a comment explaining what the code is trying to extract would be useful here.
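The question about $found_app concerns a common search pattern: examine every candidate inside the loop and act on the result only once the loop has finished. A hypothetical Python illustration follows; collect_app_pids(), its (pid, ppid, cmd) input and the basename match on the command's first word are assumptions made for the example, not a description of what the patch actually does.

```python
import os


def collect_app_pids(children, app_name):
    """Return the PIDs of children running app_name, or None if there are none.

    children: list of (pid, ppid, cmd) tuples for the processes to inspect.
    """
    app_pids = []
    found_app = 0
    for pid, _ppid, cmd in children:
        words = cmd.split()
        # Compare the executable's basename so "./a.out" matches "a.out".
        if words and os.path.basename(words[0]) == app_name:
            app_pids.append(pid)
            found_app += 1
    # Tested once, after every child has been examined. Placing this test
    # inside the loop would return early, before the remaining children
    # had been looked at.
    if found_app != 0:
        return app_pids
    return None
```

Keeping the decision after the loop means the result does not depend on the order in which ps happens to list the children.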

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk





More information about the padb-devel mailing list