[padb-users] start using padb on TORQUE

Daniel Kidger daniel.kidger at googlemail.com
Thu Nov 11 09:31:34 GMT 2010


Jie,
I know Ashley is away at the moment, so I will reply.

By "HEAD code" he means the latest version of the padb source code.

You could try 3.2-beta1 from   http://padb.pittman.org.uk/  dated 23-10-10
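
Ashley also says below that he is still waiting for feedback on the environment variables that PBS sets for parallel jobs. A minimal sketch of how you might collect them from inside a job follows. It assumes the variables of interest all share the PBS_ prefix (PBS_JOBID and PBS_NODEFILE are the usual ones under Torque; OpenPBS may differ), and the function name list_pbs_env is only illustrative, not part of padb.

```shell
#!/bin/sh
# Print every PBS_* variable in the current environment, sorted by name.
# Assumption: the variables padb needs to see all start with "PBS_".
list_pbs_env() {
    env | grep '^PBS_' | sort
}

# Run this inside the job (interactive session or job script)
# and post the output to the list.
list_pbs_env
```

Running it both in the interactive session and on a compute node where a rank is running would show whether the two environments differ.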

Hope this helps,
Daniel

On 10 November 2010 23:48, Jie Cai <Jie.Cai at anu.edu.au> wrote:

>
> On 11/11/10 06:41, Ashley Pittman wrote:
>
>>
>>> (2) In the interactive mode of a PBS job, I get the following information
>>> and warning. Please note that no PBS job is detected, although I was
>>> expecting one to be.
>>>
>>>
>> pbs_pro support has been included for a while; pbs and Torque support are
>> slightly different and have only been added very recently. In fact the
>> current HEAD will detect jobs and launch itself on the remote nodes but
>> not find the individual processes. It is almost certainly looking for the
>> wrong environment variable, so it should be easy to fix when I get some more
>> feedback from people who are testing it (I don't have access to a pbs system
>> and that makes it difficult).
>>
>>
>>
> I am happy to help with this. Our PBS system is built on OpenPBS. I
> am not sure whether there is a major difference in the interface between the
> old OpenPBS and Torque or PBS Pro.
>
> Does the "HEAD" you mentioned refer to padb or to the PBS mom? I am a little
> confused.
>
>>> $ padb --full-report=53259 --config-option rmgr=orte --rank 0
>>> padb version 3.2 (Revision 399)
>>>
>>> ========
>>> Warning: errors reported by some ranks
>>> ========
>>> [0]: Error message from /short/z00/jxc900/PADB/padb_build/libexec/minfo:
>>> setup_communicator_iterator() failed
>>> [0]: Stderr from minfo:
>>> WARNING: Field opal_list_next of type opal_list_item_t not found!
>>> WARNING: Field opal_list_sentinel of type opal_list_t not found!
>>> WARNING: Field fl_mpool of type ompi_free_list_t not found!
>>> WARNING: Field fl_allocations of type ompi_free_list_t not found!
>>>
>>>
>> These are errors from the MPI library. padb has done the right thing here:
>> it has discovered the job, launched itself and found the processes, but the
>> MPI debugger callback DLL is unable to extract the information it needs.
>> This is the second time this has been reported in as many weeks, so I'm
>> wondering if this is something that they have broken recently. The best
>> place to take this up would be the Ompi developers list, or if you can wait
>> until after SC next week I can test it for you and report it myself.
>>
>>
>>
> The symbol is loaded successfully. The error happens in the following code
> in minfo.c:
>
>    res = dll_ep.setup_communicator_iterator(target_process);
>    if ( res != mqs_ok ) {
>        die_with_code(res,"setup_communicator_iterator() failed");
>    }
>
> I have tested a number of OMPI versions installed on our system, from 1.3.3
> to 1.4.2. All of them show the warning messages.
>
> I have also tested padb on another cluster with 1.3.3, where no warning
> messages turned up.
>
> I don't know which configuration flags were used to compile those OpenMPI
> libraries, so I don't know whether the problem could be that we
> haven't used the flag needed to turn the debug callback on (although it
> doesn't look like that from the list of available OMPI configuration options).
>
>  That shouldn't happen, can you send the output of "gdb -p 10782" in this
>> case?
>>
>>
> I get similar output complaining "ptrace: operation not permitted".
> I used sudo for both gdb and padb:
> b/libopen-pal.so.0...(no debugging symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/libopen-pal.so.0
> Reading symbols from /lib64/libdl.so.2...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libdl.so.2
> Reading symbols from /lib64/libnsl.so.1...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libnsl.so.1
> Reading symbols from /lib64/libutil.so.1...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libutil.so.1
> Reading symbols from /lib64/libm.so.6...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libm.so.6
> Reading symbols from /lib64/libpthread.so.0...(no debugging symbols
> found)...done.
> [Thread debugging using libthread_db enabled]
> Loaded symbols for /lib64/libpthread.so.0
> Reading symbols from /lib64/libc.so.6...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_paffinity_linux.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_paffinity_linux.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_carto_auto_detect.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_carto_auto_detect.so
> Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_ess_hnp.so...(no
> debugging symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_ess_hnp.so
> Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_plm_tm.so...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_plm_tm.so
> Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_rml_oob.so...(no
> debugging symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_rml_oob.so
> Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_oob_tcp.so...(no
> debugging symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_oob_tcp.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_routed_binomial.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_routed_binomial.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_grpcomm_bad.so...(no debugging symbols
> found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_grpcomm_bad.so
> Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_ras_tm.so...(no
> debugging symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_ras_tm.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_rmaps_round_robin.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_rmaps_round_robin.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_errmgr_default.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_errmgr_default.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_odls_default.so...(no debugging symbols
> found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_odls_default.so
> Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_iof_hnp.so...(no
> debugging symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_iof_hnp.so
> Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_filem_rsh.so...(no
> debugging symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_filem_rsh.so
> 0x00007ffff6bfb14f in poll () from /lib64/libc.so.6
>
> $ sudo padb --config-option rmgr=mpirun --full-report=5226
>
> padb version 3.2 (Revision 399)
> full job report for job 5226
>
> Warning, failed to locate any ranks
>
>
>  All the patches I have are committed now; I'm still waiting for feedback
>> on the environment variables that pbs sets for parallel jobs, however.
>>  You should try with HEAD code, as at least one of your issues is fixed
>> (although it'll still fail further on, I'm afraid).
>>
>>
> You have mentioned HEAD code a few times. Again, do you mean the OMPI
> header files or the padb perl script? I would really appreciate it if you
> could clarify this a little more.
>
>> Ashley.
>>
>>
>>
>
> BTW: do you have any documents that explain how padb works, e.g. its work
> flow? They would help us significantly in understanding your code and design
> ideas, so that we can feed back more useful information.
>
> Jie
>
>