[padb-users] start using padb on TORQUE

Jie Cai Jie.Cai at anu.edu.au
Tue Nov 9 23:54:48 GMT 2010


Hi Ashley,

I saw your response on the Open MPI mailing list that referred to padb as 
a parallel program inspection tool which can also provide stack traces 
for debugging purposes.

I have installed padb 3.2 beta1 on our compute system and found a few 
issues, listed below.

(1) --list-rmgrs shows multiple resource managers running on our system, 
even though we only use TORQUE (pbs).

$ padb --list-rmgrs
local: 1194 1214 1949 22812 22829 22840 22857 23216 23247
local-fd: No active jobs.
local-qsnet: Not detected on system.
lsf: Not detected on system.
lsf-rms: Not detected on system.
mpd: Not detected on system.
mpirun: No active jobs.
orte: Not detected on system.
pbs: No active jobs.
rms: Not detected on system.
slurm: Not detected on system.
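
If it helps, I assume the explicit way to query just the pbs manager 
would be something like:

$ padb --config-option rmgr=pbs --show-jobs

but given the listing above I would expect that to report no active jobs 
as well.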

I have looked into the "padb" perl script. It seems that resource-manager 
information is collected through rmsquery or prun, neither of which is 
available on our system. (I don't really know what rmsquery or prun do; 
this is just a guess. :-))
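
For reference, a quick way to confirm what is actually installed on a 
compute node would be something along these lines:

$ command -v rmsquery         # not found on our nodes
$ command -v prun             # not found either
$ command -v qstat            # TORQUE's qstat is present
$ qstat -u $USER              # lists my queued/running PBS jobs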

(2) In an interactive PBS job, I get the following information and 
warnings. Please note that no PBS job is detected, although I would 
actually expect one to be.

$ padb --list-rmgrs
local: 24413 25123 25174 25175 25176 25177 25178 25179 25180 25181 25389
local-fd: 25175 25176 25177 25178 25179 25180 25181
local-qsnet: Not detected on system.
lsf: Not detected on system.
lsf-rms: Not detected on system.
mpd: Not detected on system.
mpirun: 25123
orte: 53259
pbs: No active jobs.
rms: Not detected on system.
slurm: Not detected on system.


$ padb --full-report=53259 --config-option rmgr=orte --rank 0
padb version 3.2 (Revision 399)

========
Warning: errors reported by some ranks
========
[0]: Error message from /short/z00/jxc900/PADB/padb_build/libexec/minfo: 
setup_communicator_iterator() failed
[0]: Stderr from minfo:
WARNING: Field opal_list_next of type opal_list_item_t not found!
WARNING: Field opal_list_sentinel of type opal_list_t not found!
WARNING: Field fl_mpool of type ompi_free_list_t not found!
WARNING: Field fl_allocations of type ompi_free_list_t not found!
WARNING: Field fl_frag_class of type ompi_free_list_t not found!
WARNING: Field fl_frag_size of type ompi_free_list_t not found!
WARNING: Field fl_frag_alignment of type ompi_free_list_t not found!
WARNING: Field fl_max_to_alloc of type ompi_free_list_t not found!
WARNING: Field fl_num_per_alloc of type ompi_free_list_t not found!
WARNING: Field fl_num_allocated of type ompi_free_list_t not found!
WARNING: Field ht_table of type opal_hash_table_t not found!
WARNING: Field ht_table_size of type opal_hash_table_t not found!
WARNING: Field ht_size of type opal_hash_table_t not found!
WARNING: Field ht_mask of type opal_hash_table_t not found!
WARNING: Field req_type of type ompi_request_t not found!
WARNING: Field number_free of type opal_pointer_array_t not found!
WARNING: Field size of type opal_pointer_array_t not found!
WARNING: Field addr of type opal_pointer_array_t not found!
WARNING: Field c_name of type ompi_communicator_t not found!
WARNING: Field c_contextid of type ompi_communicator_t not found!
WARNING: Field c_my_rank of type ompi_communicator_t not found!
WARNING: Field c_local_group of type ompi_communicator_t not found!
WARNING: Field c_remote_group of type ompi_communicator_t not found!
WARNING: Field c_flags of type ompi_communicator_t not found!
WARNING: Field c_f_to_c_index of type ompi_communicator_t not found!
WARNING: Field c_topo_comm of type ompi_communicator_t not found!
WARNING: Field c_keyhash of type ompi_communicator_t not found!
WARNING: Field mtc_ndims_or_nnodes of type mca_topo_base_comm_1_0_0_t not found!
WARNING: Field mtc_dims_or_index of type mca_topo_base_comm_1_0_0_t not found!
WARNING: Field mtc_periods_or_edges of type mca_topo_base_comm_1_0_0_t not found!
WARNING: Field mtc_reorder of type mca_topo_base_comm_1_0_0_t not found!
WARNING: Field grp_proc_count of type ompi_group_t not found!
WARNING: Field grp_proc_pointers of type ompi_group_t not found!
WARNING: Field grp_my_rank of type ompi_group_t not found!
WARNING: Field grp_flags of type ompi_group_t not found!
WARNING: Field MPI_SOURCE of type ompi_status_public_t not found!
WARNING: Field MPI_TAG of type ompi_status_public_t not found!
WARNING: Field MPI_ERROR of type ompi_status_public_t not found!
WARNING: Field _count of type ompi_status_public_t not found!
WARNING: Field _cancelled of type ompi_status_public_t not found!
WARNING: Field size of type ompi_datatype_t not found!
WARNING: Field name of type ompi_datatype_t not found!

========
Total: 0 communicators, no communication data recorded.
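
In case these warnings come from a mismatch between minfo and our Open 
MPI installation (that is only a guess on my part), this is how I would 
check which Open MPI the job is actually using, assuming mpirun --version 
prints the Open MPI version banner here:

$ which mpirun
$ mpirun --version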

(3) When I used TORQUE's batch mode, I failed to attach to the running 
MPI program. Also, this time no active job is detected for orte, unlike 
in the interactive PBS mode.

$ padb --list-rmgrs
local: 10722 10723 10767 10781 10782 10825 10826 10827 10828 10829 10832 
10838 10844 10921 10922 10923 10925 10928 10930 10931 10934 14039 14046 
14098
local-fd: No active jobs.
local-qsnet: Not detected on system.
lsf: Not detected on system.
lsf-rms: Not detected on system.
mpd: Not detected on system.
mpirun: 10782
orte: Not detected on system.
pbs: No active jobs.
rms: Not detected on system.
slurm: Not detected on system.

$ padb --show-jobs --config-option rmgr=mpirun
10782

$ padb --config-option rmgr=mpirun --full-report=10782
padb version 3.2 (Revision 399)
full job report for job 10782

Failed to attach to process
Fatal problem setting up the resource manager: mpirun
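
If it is useful, I can also try the plain stack-trace mode from inside 
the batch job, e.g.

$ padb --config-option rmgr=mpirun --all --stack-trace

assuming --all and --stack-trace are still the right options in 3.2.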

On the Open MPI mailing list, I saw that you have some other patches for 
PBS systems. Would those patches solve this problem?

Looking forward to hearing from you soon.

Kind Regards,
Jie

-- 
Jie Cai                         Jie.Cai at anu.edu.au
ANU Supercomputer Facility      NCI National Facility
Leonard Huxley, Mills Road      Ph:  +61 2 6125 7965
Australian National University  Fax: +61 2 6125 8199
Canberra, ACT 0200, Australia   http://nf.nci.org.au
-----------------------------------------------------




