[padb-users] start using padb on TORQUE

Ashley Pittman ashley at pittman.co.uk
Wed Nov 10 19:41:38 GMT 2010


On 10 Nov 2010, at 00:54, Jie Cai wrote:

> I have looked into "padb" perl script. It seems that information resource managers are collected through rmsquery or prun. Neither is available on our system. (I don't really know what rmsquery or prun does. This is just a guess.. :-)).

rmsquery and prun are part of the "rms" resource manager. padb will use any resource manager it can find installed (the binaries have to be on $PATH).  "mpirun" is a special case because it looks for processes called "mpirun" and doesn't need anything installed; it can work with multiple resource managers.
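
To see which resource manager binaries are visible on $PATH, a lookup along these lines may help. This is only a Python sketch of the idea, not padb's own (Perl) code; the function name binaries_on_path and the candidate list are assumptions made for the example.

```python
import shutil

# Binaries mentioned in this thread; illustrative only, not padb's real table.
CANDIDATES = ["rmsquery", "prun", "mpirun", "orterun"]


def binaries_on_path(names, path=None):
    """Return {name: full path} for each name found as an executable.

    path=None means "search the current $PATH"; a directory string can be
    passed instead to search somewhere else.
    """
    found = {}
    for name in names:
        location = shutil.which(name, path=path)
        if location is not None:
            found[name] = location
    return found


if __name__ == "__main__":
    for name, location in binaries_on_path(CANDIDATES).items():
        print(f"{name}: {location}")
```

Anything missing from the output is not reachable through $PATH in that shell, which would be consistent with padb not offering that resource manager.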

> (2) in the PBS interactive mode of a job, I have following information and warning, please noted that no PBS job detected. I am actually expecting a pbs job detected.

pbs_pro support has been included for a while; pbs and Torque support are slightly different and have only been added very recently. In fact the current HEAD will detect jobs and launch itself on the remote nodes but not find the individual processes. It is almost certainly looking for the wrong environment variable, so it should be easy to fix when I get some more feedback from people who are testing it (I don't have access to a pbs system and that makes it difficult).

> $ padb --full-report=53259 --config-option rmgr=orte --rank 0
> padb version 3.2 (Revision 399)
> 
> ========
> Warning: errors reported by some ranks
> ========
> [0]: Error message from /short/z00/jxc900/PADB/padb_build/libexec/minfo: setup_communicator_iterator() failed
> [0]: Stderr from minfo:
> WARNING: Field opal_list_next of type opal_list_item_t not found!
> WARNING: Field opal_list_sentinel of type opal_list_t not found!
> WARNING: Field fl_mpool of type ompi_free_list_t not found!
> WARNING: Field fl_allocations of type ompi_free_list_t not found!

These are errors from the MPI library. padb has done the right thing here: it's discovered the job, launched itself and found the processes, but the MPI debugger callback DLL is unable to extract the information it needs.  This is the second time this has been reported in as many weeks, so I'm wondering if this is something that they have broken recently. The best place to take this up would be the Ompi developers list, or if you can wait until after SC next week I can test it for you and report it myself.

> (3) when I used batch mode of TORQUE, I have failed to attach to the running MPI program. Also, this time, no active job is detected for orte, which is different from the interactive PBS mode.

orte will only report jobs on the host where the orterun process is running; this is a limitation of OMPI and there is nothing I can do about it.
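
A quick way to check whether the launcher is running on the current host is to scan the process table. The following is a rough Python sketch, assuming a Linux /proc filesystem; it is not how padb itself does the search, and find_processes is a name invented for the example.

```python
import os


def find_processes(names, proc_root="/proc"):
    """Return sorted pids whose command name (read from <pid>/comm) is in names."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # process has exited or cannot be read
        if comm in names:
            pids.append(int(entry))
    return sorted(pids)


if __name__ == "__main__":
    print(find_processes({"orterun", "mpirun"}))
```

An empty list on the host where padb is run would suggest the launcher lives on another node, in which case orte would have nothing to report there.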

> $ padb --config-option rmgr=mpirun --full-report=10782
> padb version 3.2 (Revision 399)
> full job report for job 10782
> 
> Failed to attach to process
> Fatal problem setting up the resource manager: mpirun

That shouldn't happen. Can you send the output of "gdb -p 10782" in this case?

> In the OpenMPI mailing list, I saw you actually have some other patch for PBS system. Will those patches solve this problem?

All the patches I have are committed now; I'm still waiting for feedback on the environment variables that pbs sets for parallel jobs, however.  You should try with HEAD code as at least one of your issues is fixed (although it'll still fail further on, I'm afraid).
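
For anyone able to send that feedback, the environment of one of the application processes can be dumped directly. This is a minimal Python sketch, assuming Linux (/proc/<pid>/environ) and assuming the variables of interest start with "PBS_"; both function names are made up for the example and the pid is supplied by you.

```python
import sys


def parse_environ(raw):
    """Decode the NUL-separated bytes of a /proc/<pid>/environ file into a dict."""
    env = {}
    for item in raw.split(b"\0"):
        if b"=" not in item:
            continue  # skip empty trailing fields
        key, _, value = item.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def pbs_variables(pid, prefix="PBS_"):
    """Return the variables of process pid whose names start with prefix."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        env = parse_environ(f.read())
    return {k: v for k, v in env.items() if k.startswith(prefix)}


if __name__ == "__main__":
    for key, value in sorted(pbs_variables(int(sys.argv[1])).items()):
        print(f"{key}={value}")
```

Run it as the job owner on a compute node with the pid of one MPI rank; the resulting list is the kind of feedback that helps pin down which variable to look for.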

Ashley.

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




