[padb-users] start using padb on TORQUE

Ashley Pittman ashley at pittman.co.uk
Sat Nov 13 16:37:32 GMT 2010


On 11 Nov 2010, at 00:48, Jie Cai wrote:
> I am pretty happy to help with this. Our PBS system is built on OpenPBS. I am not sure whether there is major difference in the interface between old OpenPBS and torque or PBS pro.
> 
> Is the "HEAD" you mentioned means padb? or PBS mom? I am a little bit confused.

As Dan says, this means the Subversion HEAD of padb.  The instructions for getting the source are at:

http://code.google.com/p/padb/source/checkout

The most recent PBS additions have gone in since the last beta release was cut.  If you are prepared to, could you run two parallel jobs, "env" and "ps auwxf", and send me the output?  From that I should have the information I need to make it work in your case.  Single CPU jobs will be fine.
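For reference, the part of the "env" output most likely to matter is the set of variables the batch system exports into the job.  A minimal Python sketch of pulling those out follows.  The PBS_ prefix is an assumption based on the names TORQUE commonly sets (for example PBS_JOBID and PBS_NODEFILE), the function name is made up for illustration, and none of this is padb code.  The full, unfiltered output is still what is being asked for.

```python
import os


def pbs_environment(environ=None, prefix="PBS_"):
    """Return the variables whose names start with `prefix`, sorted by name.

    `prefix` defaults to "PBS_", which is an assumption about how the
    resource manager names the variables it exports into a job.
    """
    source = os.environ if environ is None else environ
    return {name: source[name] for name in sorted(source) if name.startswith(prefix)}


if __name__ == "__main__":
    # Run inside a job, this prints only the batch-system variables.
    for name, value in pbs_environment().items():
        print(f"{name}={value}")
```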

> I have tested a number of OMPI versions installed on our system, from 1.3.3 to 1.4.2. All shows the warning message.
> 
> I have tested padb on another cluster with 1.3.3, while no warning messages turned up.

The WARNING error is coming from OpenMPI, which in this case is then returning false from the "setup_communicator_iterator" callback.  The code in question is looking for struct offsets in the running program, so it is likely to do with exporting symbols correctly, or possibly static builds?

See this mail for more details:

http://pittman.org.uk/pipermail/padb-users_pittman.org.uk/2010-October/000043.html

>> That shouldn't happen, can you send the output of "gdb -p 10782" in this case?
>>   
> I similar information that complaining "ptrace: operation not permitted"
> I did sudo for both gdb and padb:

Does this mean that your resource manager is installed suid?  I don't think padb can support this case as it would need to run as one user to read the process list and a different user for everything else.

> $ sudo padb --config-option rmgr=mpirun --full-report=5226
> padb version 3.2 (Revision 399)
> full job report for job 5226
> 
> Warning, failed to locate any ranks

In this case it will have discovered the process list but be looking for root-owned processes only, and so not discover the processes on the remote node(s).
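The effect can be sketched in a few lines of Python.  This is an illustration of filtering by owner only, not padb's actual code; the function name, the user name "alice" and the process list are all made up.

```python
def visible_ranks(processes, user):
    """Keep only the processes owned by `user`."""
    return [proc for proc in processes if proc["user"] == user]


# Hypothetical ranks of one job, all owned by the user who submitted it.
job_processes = [
    {"pid": 101, "user": "alice", "command": "./a.out"},
    {"pid": 102, "user": "alice", "command": "./a.out"},
]

# Looking as the job owner finds both ranks.
print(len(visible_ranks(job_processes, "alice")))

# Looking as root (for example under sudo) finds none of them.
print(len(visible_ranks(job_processes, "root")))
```

With this data the first call finds both ranks and the second finds none, which matches the "failed to locate any ranks" warning.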

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




