[padb-users] start using padb on TORQUE

Jie Cai Jie.Cai at anu.edu.au
Thu Nov 11 11:05:51 GMT 2010


Hi Daniel,

Thanks for the reply. Yes, I do have the trunk checked out. Please keep 
us updated if you have answers to the other questions.

Kind Regards,
Jie


Daniel Kidger wrote:
>
> Jie,
> I know Ashley is away at the moment, so I will reply.
>
> By HEAD code he means the latest version of the padb source code.
>
> You could try 3.2-beta1 from http://padb.pittman.org.uk/ dated 23-10-10.
>
> Hope this helps,
> Daniel
>
> On 10 November 2010 23:48, Jie Cai <Jie.Cai at anu.edu.au 
> <mailto:Jie.Cai at anu.edu.au>> wrote:
>
>
>     On 11/11/10 06:41, Ashley Pittman wrote:
>
>
>             (2) in the PBS interactive mode of a job, I get the following
>             information and warning. Please note that no PBS job was
>             detected; I was actually expecting one to be detected.
>                
>
>         pbs_pro support has been included for a while. pbs and Torque
>         support are slightly different and have only been added very
>         recently. In fact the current HEAD will detect jobs and
>         launch itself on the remote nodes but not find the individual
>         processes. It is almost certainly looking for the wrong
>         environment variable, so it should be easy to fix when I get some
>         more feedback from people who are testing it (I don't have
>         access to a pbs system and that makes it difficult).
>
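To gather the kind of feedback requested above, the environment of one rank can be dumped and filtered for resource-manager variables. The following is a minimal sketch, assuming a Linux `/proc` filesystem; the `PBS_` prefix and the helper names `parse_environ`, `rmgr_variables` and `read_process_environ` are assumptions for illustration and are not taken from padb itself.

```python
import os
import sys


def parse_environ(raw: bytes) -> dict:
    """Split a NUL-separated /proc/<pid>/environ blob into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skip the empty trailing entry and malformed items
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def rmgr_variables(env: dict, prefix: str = "PBS_") -> dict:
    """Keep only variables whose name starts with the given prefix.

    The default prefix is an assumption; adjust it for the local system.
    """
    return {k: v for k, v in sorted(env.items()) if k.startswith(prefix)}


def read_process_environ(pid: int) -> dict:
    """Read the environment of a running process (needs permission to do so)."""
    with open(f"/proc/{pid}/environ", "rb") as handle:
        return parse_environ(handle.read())


if __name__ == "__main__":
    # Default to the current process if no pid is given on the command line.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for key, value in rmgr_variables(read_process_environ(pid)).items():
        print(f"{key}={value}")
```

Running it with the pid of one MPI process inside the job prints the matching variables, which could then be posted to the list.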
>          
>
>     I am happy to help with this. Our PBS system is built on
>     OpenPBS. I am not sure whether there is a major difference in the
>     interface between the old OpenPBS and Torque or PBS Pro.
>
>     Does the "HEAD" you mentioned mean padb, or PBS mom? I am a little
>     bit confused.
>
>             $ padb --full-report=53259 --config-option rmgr=orte --rank 0
>             padb version 3.2 (Revision 399)
>
>             ========
>             Warning: errors reported by some ranks
>             ========
>             [0]: Error message from
>             /short/z00/jxc900/PADB/padb_build/libexec/minfo:
>             setup_communicator_iterator() failed
>             [0]: Stderr from minfo:
>             WARNING: Field opal_list_next of type opal_list_item_t not
>             found!
>             WARNING: Field opal_list_sentinel of type opal_list_t not
>             found!
>             WARNING: Field fl_mpool of type ompi_free_list_t not found!
>             WARNING: Field fl_allocations of type ompi_free_list_t not
>             found!
>                
>
>         These are errors from the MPI library. padb has done the right
>         thing here: it's discovered the job, launched itself and found
>         the processes, but the MPI debugger callback DLL is unable to
>         extract the information it needs.  This is the second time
>         this has been reported in as many weeks, so I'm wondering if
>         this is something that they have broken recently. The best
>         place to take this up would be the Ompi developers list, or if
>         you can wait until after SC next week I can test it for you
>         and report it myself.
>
>          
>
>     The symbol is loaded successfully. The error happened in the
>     following code in minfo.c.
>
>        res = dll_ep.setup_communicator_iterator(target_process);
>        if ( res != mqs_ok ) {
>            die_with_code(res,"setup_communicator_iterator() failed");
>        }
>
>     I have tested a number of OMPI versions installed on our system,
>     from 1.3.3 to 1.4.2. All of them show the warning message.
>
>     I have also tested padb on another cluster with 1.3.3, where no
>     warning messages turned up.
>
>     I don't know what configuration flags were used to compile
>     those OpenMPI libraries, and thus I don't know whether there's a
>     potential problem that we haven't used the necessary flag to turn
>     the debug callback on (although it doesn't seem like that from the
>     OMPI available configuration list).
>
>         That shouldn't happen, can you send the output of "gdb -p
>         10782" in this case?
>          
>
>     I get similar output, complaining "ptrace: operation not
>     permitted".
>     I used sudo for both gdb and padb:
>     b/libopen-pal.so.0...(no debugging symbols found)...done.
>     Loaded symbols for /apps/openmpi/1.4.2/lib/libopen-pal.so.0
>     Reading symbols from /lib64/libdl.so.2...(no debugging symbols
>     found)...done.
>     Loaded symbols for /lib64/libdl.so.2
>     Reading symbols from /lib64/libnsl.so.1...(no debugging symbols
>     found)...done.
>     Loaded symbols for /lib64/libnsl.so.1
>     Reading symbols from /lib64/libutil.so.1...(no debugging symbols
>     found)...done.
>     Loaded symbols for /lib64/libutil.so.1
>     Reading symbols from /lib64/libm.so.6...(no debugging symbols
>     found)...done.
>     Loaded symbols for /lib64/libm.so.6
>     Reading symbols from /lib64/libpthread.so.0...(no debugging
>     symbols found)...done.
>     [Thread debugging using libthread_db enabled]
>     Loaded symbols for /lib64/libpthread.so.0
>     Reading symbols from /lib64/libc.so.6...(no debugging symbols
>     found)...done.
>     Loaded symbols for /lib64/libc.so.6
>     Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging
>     symbols found)...done.
>     Loaded symbols for /lib64/ld-linux-x86-64.so.2
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_paffinity_linux.so...(no
>     debugging symbols found)...done.
>     Loaded symbols for
>     /apps/openmpi/1.4.2/lib/openmpi/mca_paffinity_linux.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_carto_auto_detect.so...(no
>     debugging symbols found)...done.
>     Loaded symbols for
>     /apps/openmpi/1.4.2/lib/openmpi/mca_carto_auto_detect.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_ess_hnp.so...(no debugging
>     symbols found)...done.
>     Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_ess_hnp.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_plm_tm.so...done.
>     Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_plm_tm.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_rml_oob.so...(no debugging
>     symbols found)...done.
>     Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_rml_oob.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_oob_tcp.so...(no debugging
>     symbols found)...done.
>     Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_oob_tcp.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_routed_binomial.so...(no
>     debugging symbols found)...done.
>     Loaded symbols for
>     /apps/openmpi/1.4.2/lib/openmpi/mca_routed_binomial.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_grpcomm_bad.so...(no debugging
>     symbols found)...done.
>     Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_grpcomm_bad.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_ras_tm.so...(no debugging
>     symbols found)...done.
>     Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_ras_tm.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_rmaps_round_robin.so...(no
>     debugging symbols found)...done.
>     Loaded symbols for
>     /apps/openmpi/1.4.2/lib/openmpi/mca_rmaps_round_robin.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_errmgr_default.so...(no
>     debugging symbols found)...done.
>     Loaded symbols for
>     /apps/openmpi/1.4.2/lib/openmpi/mca_errmgr_default.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_odls_default.so...(no
>     debugging symbols found)...done.
>     Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_odls_default.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_iof_hnp.so...(no debugging
>     symbols found)...done.
>     Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_iof_hnp.so
>     Reading symbols from
>     /apps/openmpi/1.4.2/lib/openmpi/mca_filem_rsh.so...(no debugging
>     symbols found)...done.
>     Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_filem_rsh.so
>     0x00007ffff6bfb14f in poll () from /lib64/libc.so.6
>
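The "ptrace: operation not permitted" message mentioned above can occur when the caller is not allowed to attach, for example because the process belongs to another user or already has a tracer attached. The following is a minimal sketch of how those two conditions could be checked from `/proc/<pid>/status` on Linux; the helper names `parse_proc_status` and `attach_hints` are hypothetical, and other causes (such as kernel security settings) are not covered.

```python
import os
import sys


def parse_proc_status(text: str) -> dict:
    """Turn the 'Key:\\tvalue' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def attach_hints(status: dict, my_uid: int) -> list:
    """Return human-readable reasons why attaching a debugger might fail."""
    hints = []
    real_uid = int(status["Uid"].split()[0])  # first entry is the real uid
    tracer = int(status.get("TracerPid", "0"))  # 0 means no tracer attached
    if my_uid != 0 and real_uid != my_uid:
        hints.append(f"process is owned by uid {real_uid}, not {my_uid}")
    if tracer != 0:
        hints.append(f"process is already traced by pid {tracer}")
    return hints


if __name__ == "__main__":
    # Default to the current process if no pid is given on the command line.
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    with open(f"/proc/{pid}/status") as handle:
        status = parse_proc_status(handle.read())
    for hint in attach_hints(status, os.getuid()) or ["no obvious obstacle found"]:
        print(hint)
```

If a tracer pid is reported, another debugger (or another padb instance) may still be attached to the process.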
>     $ sudo padb --config-option rmgr=mpirun --full-report=5226
>
>     padb version 3.2 (Revision 399)
>     full job report for job 5226
>
>     Warning, failed to locate any ranks
>
>
>         All the patches I have are committed now; I'm still waiting
>         for feedback on the environment variables that pbs sets
>         for parallel jobs, however.  You should try with HEAD code, as
>         at least one of your issues is fixed (although it'll
>         still fail further on, I'm afraid).
>          
>
>     You have mentioned HEAD code a few times. Again, do you mean
>     OMPI header files, or the padb perl script? I would really
>     appreciate it if you could clarify this a little more.
>
>         Ashley.
>
>          
>
>
>     BTW: do you have any documents that explain how padb works, e.g.
>     its work flow? They would help us significantly in understanding
>     your code and design ideas. Then we could feed back more useful
>     information.
>
>     Jie
>
>
>
>



