[padb-users] start using padb on TORQUE

Ashley Pittman ashley at pittman.co.uk
Thu Nov 25 18:46:11 GMT 2010


On 10 Nov 2010, at 23:48, Jie Cai wrote:

> 
> On 11/11/10 06:41, Ashley Pittman wrote:
>> 
>>> (2) In the PBS interactive mode of a job, I get the following information and warning. Please note that no PBS job is detected; I was expecting a PBS job to be detected.
>>>     
>> pbs_pro support has been included for a while. PBS and Torque support are slightly different and were only added very recently. In fact the current HEAD will detect jobs and launch itself on the remote nodes, but it will not find the individual processes. It is almost certainly looking for the wrong environment variable, so it should be easy to fix once I get more feedback from people who are testing it (I don't have access to a PBS system, which makes it difficult).
>> 
>>   
> I am pretty happy to help with this. Our PBS system is built on OpenPBS. I am not sure whether there is a major difference in the interface between the old OpenPBS and Torque or PBS Pro.

Have you made any headway on this?  I'm back from SC now, so I can devote some time to it myself if you have questions or can get me access to a PBS system.  As I said, it should really just be a case of finding out which environment variables are set by PBS and what the parent process of the parallel processes is called.
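
To make that concrete, here is a rough sketch of the kind of information that would help, meant to be run from inside an interactive PBS job.  It is not part of padb; it assumes a Linux /proc filesystem and assumes that the interesting variables all start with "PBS_".  The function names are made up for illustration.

```python
import os


def pbs_variables(environ):
    """Return the entries of an environment mapping whose names start with PBS_."""
    # Assumption: the batch system prefixes its variables with "PBS_".
    return {k: environ[k] for k in sorted(environ) if k.startswith("PBS_")}


def parse_stat(stat_text):
    """Split the contents of /proc/<pid>/stat into (command, state, parent_pid).

    The command is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    command = stat_text[open_idx + 1:close_idx]
    fields = stat_text[close_idx + 1:].split()
    return command, fields[0], int(fields[1])


def parent_command(pid):
    """Return the command name of the parent of pid (Linux /proc only)."""
    with open("/proc/%d/stat" % pid) as f:
        _, _, ppid = parse_stat(f.read())
    with open("/proc/%d/stat" % ppid) as f:
        command, _, _ = parse_stat(f.read())
    return command


if __name__ == "__main__":
    for name, value in pbs_variables(os.environ).items():
        print("%s=%s" % (name, value))
    print("parent of this process: %s" % parent_command(os.getpid()))
```

Running this from one of the parallel processes (or a shell started the same way) and posting the output would show both the variables and the parent process name.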

> BTW: do you have any documents that explain how padb works, e.g. its workflow? They would help us significantly in understanding your code and design ideas, and then we could give you more useful feedback.

The common use case really is "what is my parallel program doing right now?", and the motivation could be debugging, monitoring, or verifying that the system is functioning correctly after a previous problem.  Unlike a "full featured" parallel debugger, padb is very easy to use and gives you information very quickly, with no setup cost and no extra steps needed when launching the job in the first place.  I know sites which run padb automatically for every job every hour, checking for processes in D state, and use this to notify admins and users of possible problems.  At the other end of the scale, you can do in-depth debugging by using padb to look at individual ranks within a parallel job in great detail and to compare state across the job, looking for outliers.
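
As a rough single-node illustration of the "processes in D state" check (this is not padb's own code, and padb does the equivalent across every node of a job), the sketch below assumes a Linux /proc filesystem; the function names are hypothetical.

```python
import os


def state_of(stat_text):
    """Return (command, state) from the contents of /proc/<pid>/stat."""
    # The command may contain spaces or ')' so split on the last ')'.
    close_idx = stat_text.rindex(")")
    command = stat_text[stat_text.index("(") + 1:close_idx]
    return command, stat_text[close_idx + 1:].split()[0]


def processes_in_d_state(proc_root="/proc"):
    """List (pid, command) for local processes in uninterruptible sleep."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                command, state = state_of(f.read())
        except (IOError, OSError):
            continue  # the process exited while we were looking
        if state == "D":
            stuck.append((int(entry), command))
    return stuck


if __name__ == "__main__":
    for pid, command in processes_in_d_state():
        print("%d %s is in D state" % (pid, command))
```

A process that shows up here once is not necessarily a problem; it is one that stays in D state across repeated checks that is worth reporting.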

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




