[padb-users] start using padb on TORQUE
Jie Cai
Jie.Cai at anu.edu.au
Thu Nov 11 11:05:51 GMT 2010
Hi Daniel,
Thanks for the reply. Yes, I do have the trunk checked out. Please keep
us updated if you have answers to the other questions.
Kind Regards,
Jie
Daniel Kidger wrote:
>
> Jie,
> I know Ashley is away at the moment, so I will reply.
>
> By HEAD code he means the latest version of the padb source code.
>
> You could try 3.2-beta1 from http://padb.pittman.org.uk/ dated 23-10-10
>
> Hope this helps,
> Daniel
>
> On 10 November 2010 23:48, Jie Cai <Jie.Cai at anu.edu.au
> <mailto:Jie.Cai at anu.edu.au>> wrote:
>
>
> On 11/11/10 06:41, Ashley Pittman wrote:
>
>
> (2) In the PBS interactive mode of a job, I get the following
> information and warning. Please note that no PBS job was
> detected; I was actually expecting a PBS job to be detected.
>
>
> pbs_pro support has been included for a while; pbs and Torque
> support are slightly different and have only been added very
> recently. In fact the current HEAD will detect jobs and
> launch itself on the remote nodes but not find the individual
> processes. It is almost certainly looking for the wrong
> environment variable, so it should be easy to fix when I get some
> more feedback from people who are testing it (I don't have
> access to a pbs system and that makes it difficult).
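
One way to gather the feedback asked for here is to dump the scheduler-related environment from inside a job. The following is a minimal Python sketch, not part of padb: the helper name `scheduler_variables` and the assumption that the relevant variables all start with `PBS_` (as `PBS_JOBID` and `PBS_NODEFILE` do) are illustrative only, and which variable padb actually needs is not stated in this thread.

```python
import os


def scheduler_variables(environ=None, prefixes=("PBS_",)):
    """Return the variables whose names start with one of `prefixes`.

    `prefixes` is an assumption for illustration; widen it if the
    batch system on your cluster uses other names.
    """
    if environ is None:
        environ = os.environ
    wanted = tuple(prefixes)
    return {
        name: value
        for name, value in sorted(environ.items())
        if name.startswith(wanted)
    }


if __name__ == "__main__":
    # Print one NAME=value pair per line, sorted by name.
    for name, value in scheduler_variables().items():
        print(f"{name}={value}")
```

Running it once from the job script and once as a process started by mpirun would show whether the MPI processes see the same variables as the job shell, which is the kind of detail that could be reported back to the list.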
>
>
>
> I am pretty happy to help with this. Our PBS system is built on
> OpenPBS. I am not sure whether there is a major difference in the
> interface between the old OpenPBS and Torque or PBS Pro.
>
> Does the "HEAD" you mentioned mean padb, or PBS mom? I am a little
> bit confused.
>
> $ padb --full-report=53259 --config-option rmgr=orte --rank 0
> padb version 3.2 (Revision 399)
>
> ========
> Warning: errors reported by some ranks
> ========
> [0]: Error message from
> /short/z00/jxc900/PADB/padb_build/libexec/minfo:
> setup_communicator_iterator() failed
> [0]: Stderr from minfo:
> WARNING: Field opal_list_next of type opal_list_item_t not
> found!
> WARNING: Field opal_list_sentinel of type opal_list_t not
> found!
> WARNING: Field fl_mpool of type ompi_free_list_t not found!
> WARNING: Field fl_allocations of type ompi_free_list_t not
> found!
>
>
> These are errors from the MPI library. padb has done the right
> thing here: it has discovered the job, launched itself and found
> the processes, but the MPI debugger callback DLL is unable to
> extract the information it needs. This is the second time
> this has been reported in as many weeks, so I'm wondering if
> this is something that they have broken recently. The best
> place to take this up would be the Ompi developers list, or if
> you can wait until after SC next week I can test it for you
> and report it myself.
>
>
>
> The symbol is loaded successfully. The error happened in the following
> code in minfo.c.
>
>     res = dll_ep.setup_communicator_iterator(target_process);
>     if ( res != mqs_ok ) {
>         die_with_code(res,"setup_communicator_iterator() failed");
>     }
>
> I have tested a number of OMPI versions installed on our system,
> from 1.3.3 to 1.4.2. All show the warning messages.
>
> I have also tested padb on another cluster with 1.3.3, where no
> warning messages turned up.
>
> I don't know which configuration flags were used to compile
> those OpenMPI libraries, so I don't know whether the problem
> is that we haven't used the flag necessary to turn the
> debug callback on (although it doesn't look like that from the
> list of available OMPI configuration options).
>
> That shouldn't happen. Can you send the output of "gdb -p
> 10782" in this case?
>
>
> I got similar information, complaining "ptrace: operation not
> permitted".
> I used sudo for both gdb and padb:
> b/libopen-pal.so.0...(no debugging symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/libopen-pal.so.0
> Reading symbols from /lib64/libdl.so.2...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libdl.so.2
> Reading symbols from /lib64/libnsl.so.1...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libnsl.so.1
> Reading symbols from /lib64/libutil.so.1...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libutil.so.1
> Reading symbols from /lib64/libm.so.6...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libm.so.6
> Reading symbols from /lib64/libpthread.so.0...(no debugging
> symbols found)...done.
> [Thread debugging using libthread_db enabled]
> Loaded symbols for /lib64/libpthread.so.0
> Reading symbols from /lib64/libc.so.6...(no debugging symbols
> found)...done.
> Loaded symbols for /lib64/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging
> symbols found)...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_paffinity_linux.so...(no
> debugging symbols found)...done.
> Loaded symbols for
> /apps/openmpi/1.4.2/lib/openmpi/mca_paffinity_linux.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_carto_auto_detect.so...(no
> debugging symbols found)...done.
> Loaded symbols for
> /apps/openmpi/1.4.2/lib/openmpi/mca_carto_auto_detect.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_ess_hnp.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_ess_hnp.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_plm_tm.so...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_plm_tm.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_rml_oob.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_rml_oob.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_oob_tcp.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_oob_tcp.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_routed_binomial.so...(no
> debugging symbols found)...done.
> Loaded symbols for
> /apps/openmpi/1.4.2/lib/openmpi/mca_routed_binomial.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_grpcomm_bad.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_grpcomm_bad.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_ras_tm.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_ras_tm.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_rmaps_round_robin.so...(no
> debugging symbols found)...done.
> Loaded symbols for
> /apps/openmpi/1.4.2/lib/openmpi/mca_rmaps_round_robin.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_errmgr_default.so...(no
> debugging symbols found)...done.
> Loaded symbols for
> /apps/openmpi/1.4.2/lib/openmpi/mca_errmgr_default.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_odls_default.so...(no
> debugging symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_odls_default.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_iof_hnp.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_iof_hnp.so
> Reading symbols from
> /apps/openmpi/1.4.2/lib/openmpi/mca_filem_rsh.so...(no debugging
> symbols found)...done.
> Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_filem_rsh.so
> 0x00007ffff6bfb14f in poll () from /lib64/libc.so.6
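
The gdb output above is long, and most of it repeats the same two outcomes. As a reading aid, here is a small Python sketch that sorts the listed libraries by whether gdb found debugging symbols. The function name `split_by_debug_info` is hypothetical, and the regular expression assumes the "Reading symbols from PATH...done." wording shown in this log, including lines wrapped and quoted by the mail client.

```python
import re

# Matches one "Reading symbols from PATH...[(no debugging symbols found)...]done."
# entry after the wrapped lines have been joined back together.
_READ = re.compile(
    r"Reading symbols from\s+(\S+?)\.\.\."
    r"(\(no\s+debugging\s+symbols\s+found\)\.\.\.)?done\."
)


def split_by_debug_info(log_text):
    """Return (with_symbols, without_symbols) lists of library paths."""
    # Drop leading "> " quote markers, then undo the line wrapping.
    lines = [re.sub(r"^\s*>\s?", "", line).strip()
             for line in log_text.splitlines()]
    joined = " ".join(line for line in lines if line)

    with_symbols, without_symbols = [], []
    for match in _READ.finditer(joined):
        path, missing = match.group(1), match.group(2)
        (without_symbols if missing else with_symbols).append(path)
    return with_symbols, without_symbols
```

In the portion of the log shown above, only mca_plm_tm.so is reported without the "(no debugging symbols found)" note. Whether missing debug information is related to the minfo warnings is not established in this thread.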
>
> $ sudo padb --config-option rmgr=mpirun --full-report=5226
>
> padb version 3.2 (Revision 399)
> full job report for job 5226
>
> Warning, failed to locate any ranks
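
The "ptrace: operation not permitted" message usually means the caller is not allowed to attach to the target process. A minimal Python sketch of the most common rule on Linux (attaching is permitted to root or to the user who owns the target) follows. The helper names are hypothetical, and other conditions, such as the process already being traced, can also produce this error, so a positive answer here is only a first check.

```python
import os


def same_owner(target_uid, my_euid):
    """True if the caller is root or owns the target process."""
    return my_euid == 0 or my_euid == target_uid


def process_owner(pid):
    """Return the uid that owns /proc/<pid> (Linux only)."""
    return os.stat(f"/proc/{pid}").st_uid


def may_attach(pid):
    """Rough pre-check before running a debugger against `pid`."""
    return same_owner(process_owner(pid), os.geteuid())
```

For example, `may_attach(10782)` run as the job owner on the node where the process lives would be expected to return True, in which case the permission error would need another explanation.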
>
>
> All the patches I have are committed now, although I'm still waiting
> for feedback on the environment variables that pbs sets
> for parallel jobs. You should try with the HEAD code, as
> at least one of your issues is fixed (although it'll
> still fail further on, I'm afraid).
>
>
> You have mentioned HEAD code a few times. Again, do you mean the
> OMPI header files, or the padb perl script? I would really appreciate
> it if you could clarify this a little bit more.
>
> Ashley.
>
>
>
>
> BTW: do you have any documents that explain how padb works, e.g.
> its work flow? They would help us significantly in understanding your
> code and design ideas, so that we can feed back more useful
> information.
>
> Jie
>
>