[padb-users] start using padb on TORQUE

Jie Cai Jie.Cai at anu.edu.au
Wed Nov 10 23:48:50 GMT 2010


On 11/11/10 06:41, Ashley Pittman wrote:
>
>> (2) In the PBS interactive mode of a job, I get the following information and warning. Please note that no PBS job is detected; I was actually expecting a PBS job to be detected.
>>      
> pbs_pro support has been included for a while. pbs and Torque support are slightly different and have only been added very recently; in fact the current HEAD will detect jobs and launch itself on the remote nodes but not find the individual processes. It is almost certainly looking for the wrong environment variable, so it should be easy to fix when I get some more feedback from people who are testing it (I don't have access to a pbs system and that makes it difficult).
>
>    
I am happy to help with this. Our PBS system is built on OpenPBS. 
I am not sure whether there is a major difference in the interface between 
old OpenPBS and Torque or PBS Pro.

By "HEAD", do you mean padb or the PBS MOM? I am a little 
confused.
>> $ padb --full-report=53259 --config-option rmgr=orte --rank 0
>> padb version 3.2 (Revision 399)
>>
>> ========
>> Warning: errors reported by some ranks
>> ========
>> [0]: Error message from /short/z00/jxc900/PADB/padb_build/libexec/minfo: setup_communicator_iterator() failed
>> [0]: Stderr from minfo:
>> WARNING: Field opal_list_next of type opal_list_item_t not found!
>> WARNING: Field opal_list_sentinel of type opal_list_t not found!
>> WARNING: Field fl_mpool of type ompi_free_list_t not found!
>> WARNING: Field fl_allocations of type ompi_free_list_t not found!
>>      
> These are errors from the MPI library. padb has done the right thing here: it's discovered the job, launched itself and found the processes, but the MPI debugger callback DLL is unable to extract the information it needs.  This is the second time this has been reported in as many weeks, so I'm wondering if this is something that they have broken recently. The best place to take this up would be the Ompi developers list, or if you can wait until after SC next week I can test it for you and report it myself.
>
>    
The symbol is loaded successfully. The error occurs in the following code 
in minfo.c:

     res = dll_ep.setup_communicator_iterator(target_process);
     if ( res != mqs_ok ) {
         die_with_code(res,"setup_communicator_iterator() failed");
     }

I have tested a number of OMPI versions installed on our system, from 
1.3.3 to 1.4.2. All of them show the warning messages.

I have also tested padb on another cluster with 1.3.3, where no warning 
messages turned up.

I don't know which configuration flags were used to compile those 
OpenMPI libraries, so I don't know whether the problem is that we 
haven't used the flag needed to turn the debug callback on 
(although, judging from the list of available OMPI configuration 
options, that doesn't seem to be the case).
> That shouldn't happen, can you send the output of "gdb -p 10782" in this case?
>    
I get similar output, complaining "ptrace: operation not permitted", 
so I ran both gdb and padb with sudo:
b/libopen-pal.so.0...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/libopen-pal.so.0
Reading symbols from /lib64/libdl.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libnsl.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libutil.so.1...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols 
found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols 
found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_paffinity_linux.so...(no debugging 
symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_paffinity_linux.so
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_carto_auto_detect.so...(no debugging 
symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_carto_auto_detect.so
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_ess_hnp.so...(no debugging symbols 
found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_ess_hnp.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_plm_tm.so...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_plm_tm.so
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_rml_oob.so...(no debugging symbols 
found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_rml_oob.so
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_oob_tcp.so...(no debugging symbols 
found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_oob_tcp.so
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_routed_binomial.so...(no debugging 
symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_routed_binomial.so
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_grpcomm_bad.so...(no debugging 
symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_grpcomm_bad.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_ras_tm.so...(no 
debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_ras_tm.so
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_rmaps_round_robin.so...(no debugging 
symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_rmaps_round_robin.so
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_errmgr_default.so...(no debugging 
symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_errmgr_default.so
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_odls_default.so...(no debugging 
symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_odls_default.so
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_iof_hnp.so...(no debugging symbols 
found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_iof_hnp.so
Reading symbols from 
/apps/openmpi/1.4.2/lib/openmpi/mca_filem_rsh.so...(no debugging symbols 
found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_filem_rsh.so
0x00007ffff6bfb14f in poll () from /lib64/libc.so.6

$ sudo padb --config-option rmgr=mpirun --full-report=5226
padb version 3.2 (Revision 399)
full job report for job 5226

Warning, failed to locate any ranks
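
As a side note on the "ptrace: operation not permitted" message mentioned above, here is a minimal sketch of a check that can be run before attaching. The function name ptrace_precheck is made up for illustration, and the sketch assumes a Linux /proc filesystem; it only reports who owns a process and whether something is already tracing it, which are two common reasons for an attach to be refused.

```shell
#!/bin/sh
# Hypothetical helper (not part of padb): print the owner of a process and
# the pid of any tracer already attached to it. Assumes Linux /proc.
ptrace_precheck() {
    pid=$1
    status="/proc/$pid/status"
    if [ ! -r "$status" ]; then
        echo "pid=$pid: cannot read $status"
        return 1
    fi
    # Second field of the "Uid:" line is the real uid of the process.
    owner_uid=$(awk '/^Uid:/ { print $2 }' "$status")
    # "TracerPid:" is 0 when no debugger is attached.
    tracer_pid=$(awk '/^TracerPid:/ { print $2 }' "$status")
    echo "pid=$pid owner_uid=$owner_uid my_uid=$(id -u) tracer_pid=$tracer_pid"
}

# Example: inspect the current shell; replace $$ with the pid of a rank.
ptrace_precheck $$
```

If owner_uid differs from my_uid, attaching will normally need root; a non-zero tracer_pid suggests another debugger is already attached to that process.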

> All the patches I have are committed now; I'm still waiting for feedback on the environment variables that pbs sets for parallel jobs, however.  You should try with HEAD code, as at least one of your issues is fixed (although it'll still fail further on I'm afraid).
>    
You have mentioned HEAD code a few times. Again, do you mean the OMPI 
header files, or the padb Perl script? I would really appreciate it if you 
could clarify this a little more.
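
On the environment variable question, a minimal sketch of how the variables could be collected from inside an interactive job. The function name list_pbs_env is made up for illustration. PBS_JOBID and PBS_NODEFILE are the names I would expect OpenPBS or Torque to set, but the full list on any given site is an assumption until someone actually runs this inside a job:

```shell
#!/bin/sh
# Hypothetical helper: print every environment variable whose name starts
# with PBS_, sorted, so the output can be pasted into a reply.
list_pbs_env() {
    env | grep '^PBS_' | sort
}

# Run inside the job (for example after "qsub -I").
# Outside a job this is expected to print nothing.
list_pbs_env
```

Running the same function from a process started by mpirun on a remote node would show whether the variables are also visible to the individual ranks.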
> Ashley.
>
>    

BTW: do you have any documents that explain how padb works, e.g. its 
work flow? That would help us significantly in understanding your code and 
design ideas, so that we can feed back more useful information.

Jie



