From ashley at pittman.co.uk Mon Nov 1 19:57:33 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 1 Nov 2010 19:57:33 +0000
Subject: [padb-users] Upcoming release.
Message-ID:

All,

I'd like to make a formal release in the coming weeks based on the current SVN code. The 3.2 beta has been through an extended testing period and I'm happy that it's ready to move to formal release status. On this basis I propose making a 3.3 release in the next two weeks, probably on Monday the 8th.

Please test the latest 3.2 beta or trunk and let me know of any problems you have. Unless any new issues are reported by the 5th I'll go ahead as planned.

Ashley Pittman.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From ashley at pittman.co.uk Mon Nov 1 20:04:13 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 1 Nov 2010 20:04:13 +0000
Subject: [padb-users] Padb featured on RCE podcast
Message-ID: <43363DC8-10B8-4F1B-8D95-2853E098564D@pittman.co.uk>

The Padb project was featured in last week's RCE podcast, an HPC-oriented podcast hosted by Jeff Squyres and Brock Palen. I did the interview a few weeks ago and the show went live last week.

You can listen to the show online or subscribe to the podcast using your favourite music app and keep up to date with the rest of the HPC field at the same time.

http://www.rce-cast.com/Podcast/rce-43-padb.html

Ashley Pittman.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From ashley at pittman.co.uk Mon Nov 8 18:54:40 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 8 Nov 2010 18:54:40 +0000
Subject: [padb-users] [padb] Upcoming release. (delayed)
In-Reply-To:
References:
Message-ID:

On 1 Nov 2010, at 19:57, Ashley Pittman wrote:

> I'd like to make a formal release in the coming weeks based on the current SVN code. The 3.2 beta has been through an extended testing period and I'm happy that it's ready to move to formal release status. On this basis I propose making a 3.3 release in the next two weeks, probably on Monday the 8th.

I've had rather more feedback on this than I expected, so a large number of small fixes have gone in over the last week. As such I'll have to extend the window to allow more time to process them and test/stabilise the code.

Other than the changes from Vega and further testing of the pbs/Torque patch I'm not currently planning any changes beyond what turns up in testing.

Ashley.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From Jie.Cai at anu.edu.au Tue Nov 9 23:54:48 2010
From: Jie.Cai at anu.edu.au (Jie Cai)
Date: Wed, 10 Nov 2010 10:54:48 +1100
Subject: [padb-users] start using padb on TORQUE
Message-ID: <4CD9DF48.8050002@anu.edu.au>

Hi Ashley,

I saw your response on the OpenMPI mailing list, which referred to padb as a parallel program inspection tool that can also provide stack traces for debugging purposes.

I have installed padb 3.2 beta1 on our compute system. I found a few issues, listed below.

(1) --list-rmgrs shows that we have multiple resource managers running on our system. However, we only use TORQUE (pbs).

$ padb --list-rmgrs
local: 1194 1214 1949 22812 22829 22840 22857 23216 23247
local-fd: No active jobs.
local-qsnet: Not detected on system.
lsf: Not detected on system.
lsf-rms: Not detected on system.
mpd: Not detected on system.
mpirun: No active jobs.
orte: Not detected on system.
pbs: No active jobs.
rms: Not detected on system.
slurm: Not detected on system.

I have looked into the "padb" perl script. It seems that information about resource managers is collected through rmsquery or prun.
Neither is available on our system. (I don't really know what rmsquery or prun does. This is just a guess. :-))

(2) In the PBS interactive mode of a job, I get the following information and warning. Please note that no PBS job is detected; I was actually expecting a pbs job to be detected.

$ padb --list-rmgrs
local: 24413 25123 25174 25175 25176 25177 25178 25179 25180 25181 25389
local-fd: 25175 25176 25177 25178 25179 25180 25181
local-qsnet: Not detected on system.
lsf: Not detected on system.
lsf-rms: Not detected on system.
mpd: Not detected on system.
mpirun: 25123
orte: 53259
pbs: No active jobs.
rms: Not detected on system.
slurm: Not detected on system.

$ padb --full-report=53259 --config-option rmgr=orte --rank 0
padb version 3.2 (Revision 399)

========
Warning: errors reported by some ranks
========
[0]: Error message from /short/z00/jxc900/PADB/padb_build/libexec/minfo: setup_communicator_iterator() failed
[0]: Stderr from minfo:
WARNING: Field opal_list_next of type opal_list_item_t not found!
WARNING: Field opal_list_sentinel of type opal_list_t not found!
WARNING: Field fl_mpool of type ompi_free_list_t not found!
WARNING: Field fl_allocations of type ompi_free_list_t not found!
WARNING: Field fl_frag_class of type ompi_free_list_t not found!
WARNING: Field fl_frag_size of type ompi_free_list_t not found!
WARNING: Field fl_frag_alignment of type ompi_free_list_t not found!
WARNING: Field fl_max_to_alloc of type ompi_free_list_t not found!
WARNING: Field fl_num_per_alloc of type ompi_free_list_t not found!
WARNING: Field fl_num_allocated of type ompi_free_list_t not found!
WARNING: Field ht_table of type opal_hash_table_t not found!
WARNING: Field ht_table_size of type opal_hash_table_t not found!
WARNING: Field ht_size of type opal_hash_table_t not found!
WARNING: Field ht_mask of type opal_hash_table_t not found!
WARNING: Field req_type of type ompi_request_t not found!
ield number_free of type opal_pointer_array_t not found!
WARNING: Field size of type opal_pointer_array_t not found!
WARNING: Field addr of type opal_pointer_array_t not found!
WARNING: Field c_name of type ompi_communicator_t not found!
WARNING: Field c_contextid of type ompi_communicator_t not found!
WARNING: Field c_my_rank of type ompi_communicator_t not found!
WARNING: Field c_local_group of type ompi_communicator_t not found!
WARNING: Field c_remote_group of type ompi_communicator_t not found!
WARNING: Field c_flags of type ompi_communicator_t not found!
WARNING: Field c_f_to_c_index of type ompi_communicator_t not found!
WARNING: Field c_topo_comm of type ompi_communicator_t not found!
WARNING: Field c_keyhash of type ompi_communicator_t not found!
WARNING: Field mtc_ndims_or_nnodes of type mca_topo_base_comm_1_0_0_t not found!
WARNING: Field mtc_dims_or_index of type mca_topo_base_comm_1_0_0_t not found!
WARNING: Field mtc_periods_or_edges of type mca_topo_base_comm_1_0_0_t not found!
WARNING: Field mtc_reorder of type mca_topo_base_comm_1_0_0_t not found!
WARNING: Field grp_proc_count of type ompi_group_t not found!
WARNING: Field grp_proc_pointers of type ompi_group_t not found!
WARNING: Field grp_my_rank of type ompi_group_t not found!
WARNING: Field grp_flags of type ompi_group_t not found!
WARNING: Field MPI_SOURCE of type ompi_status_public_t not found!
WARNING: Field MPI_TAG of type ompi_status_public_t not found!
WARNING: Field MPI_ERROR of type ompi_status_public_t not found!
WARNING: Field _count of type ompi_status_public_t not found!
WARNING: Field _cancelled of type ompi_status_public_t not found!
WARNING: Field size of type ompi_datatype_t not found!
WARNING: Field name of type ompi_datatype_t not found!
========
Total: 0 communicators, no communication data recorded.

(3) When I used batch mode of TORQUE, I failed to attach to the running MPI program. Also, this time, no active job is detected for orte, which is different from the interactive PBS mode.
$ padb --list-rmgrs
local: 10722 10723 10767 10781 10782 10825 10826 10827 10828 10829 10832 10838 10844 10921 10922 10923 10925 10928 10930 10931 10934 14039 14046 14098
local-fd: No active jobs.
local-qsnet: Not detected on system.
lsf: Not detected on system.
lsf-rms: Not detected on system.
mpd: Not detected on system.
mpirun: 10782
orte: Not detected on system.
pbs: No active jobs.
rms: Not detected on system.
slurm: Not detected on system.

$ padb --show-jobs --config-option rmgr=mpirun
10782

$ padb --config-option rmgr=mpirun --full-report=10782
padb version 3.2 (Revision 399)
full job report for job 10782

Failed to attach to process
Fatal problem setting up the resource manager: mpirun

In the OpenMPI mailing list, I saw you actually have some other patches for PBS systems. Will those patches solve this problem?

Looking forward to hearing from you soon.

Kind Regards,
Jie

--
Jie Cai                          Jie.Cai at anu.edu.au
ANU Supercomputer Facility       NCI National Facility
Leonard Huxley, Mills Road       Ph:  +61 2 6125 7965
Australian National University   Fax: +61 2 6125 8199
Canberra, ACT 0200, Australia    http://nf.nci.org.au
-----------------------------------------------------

From ashley at pittman.co.uk Wed Nov 10 19:41:38 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 10 Nov 2010 20:41:38 +0100
Subject: [padb-users] start using padb on TORQUE
In-Reply-To: <4CD9DF48.8050002@anu.edu.au>
References: <4CD9DF48.8050002@anu.edu.au>
Message-ID:

On 10 Nov 2010, at 00:54, Jie Cai wrote:

> I have looked into the "padb" perl script. It seems that information about resource managers is collected through rmsquery or prun. Neither is available on our system. (I don't really know what rmsquery or prun does. This is just a guess. :-))

rmsquery is part of the "rms" resource manager; padb will use any resource manager it can find installed (the binaries have to be on $PATH).
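
To illustrate the $PATH check described above, here is a minimal Python sketch. It is not padb's actual Perl code: the probe table is hypothetical, and only the rms/rmsquery pairing comes from this thread (squeue for slurm and qstat for pbs are assumptions made for the example).

```python
import shutil

# Hypothetical table: resource manager name -> a binary whose presence on
# $PATH is taken as a sign that the resource manager is installed.
# Only "rms" -> "rmsquery" is taken from the thread; the rest are assumptions.
PROBE_BINARIES = {
    "rms": "rmsquery",
    "slurm": "squeue",
    "pbs": "qstat",
}


def detect_rmgrs(probes=PROBE_BINARIES, which=shutil.which):
    """Return {name: True/False} depending on whether the probe binary is on $PATH."""
    return {name: which(binary) is not None for name, binary in probes.items()}


def report(status):
    """Format the detection result as one line per resource manager, sorted by name."""
    lines = []
    for name in sorted(status):
        if status[name]:
            lines.append(f"{name}: detected")
        else:
            lines.append(f"{name}: Not detected on system.")
    return lines


if __name__ == "__main__":
    for line in report(detect_rmgrs()):
        print(line)
```

The `which` argument is only there so the lookup can be replaced in tests; by default the sketch searches the real $PATH, which is why a resource manager installed outside $PATH would show up as not detected.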
"mpirun" is a special case because it looks for processes called "mpirun" and doesn't need anything installed, it can with with multiple resource managers. > (2) in the PBS interactive mode of a job, I have following information and warning, please noted that no PBS job detected. I am actually expecting a pbs job detected. pbs_pro support has been included for a while, pbs and Torque support are slightly different and have only been added very recently, in fact the current HEAD will detect jobs but and launch itself on the remote nodes but not find the individual processes, it is almost certainly looking for the wrong environment variable so should be easy to fix when I get some more feedback from people who are testing it (I don't have access to a pbs system and that makes it difficult). > $ padb --full-report=53259 --config-option rmgr=orte --rank 0padb version 3.2 (Revision 399) > > ======== > Warning: errors reported by some ranks > ======== > [0]: Error message from /short/z00/jxc900/PADB/padb_build/libexec/minfo: setup_communicator_iterator() failed > [0]: Stderr from minfo: > WARNING: Field opal_list_next of type opal_list_item_t not found! > WARNING: Field opal_list_sentinel of type opal_list_t not found! > WARNING: Field fl_mpool of type ompi_free_list_t not found! > WARNING: Field fl_allocations of type ompi_free_list_t not found! These are errors from the MPI library, padb has done the right thing here, it's discovered the job, launched itself, found the processes but the MPI debugger callback DLL is unable to extract the information it needs. This is the second time this has been reported in as many weeks so I'm wondering if this is something that they have broken recently, the best place to take this up would be the Ompi developers list or if you can wait until after SC next week I can test if for you and report it myself. > (3) when I used batch mode of TORQUE, I have failed to attach to the running MPI program. 
> Also, this time, no active job is detected for orte, which is different from the interactive PBS mode.

orte will only report jobs on the host where the orterun process is running; this is a limitation of OMPI and there is nothing I can do about it.

> $ padb --config-option rmgr=mpirun --full-report=10782
> padb version 3.2 (Revision 399)
> full job report for job 10782
>
> Failed to attach to process
> Fatal problem setting up the resource manager: mpirun

That shouldn't happen, can you send the output of "gdb -p 10782" in this case?

> In the OpenMPI mailing list, I saw you actually have some other patches for PBS systems. Will those patches solve this problem?

All the patches I have are committed now; however, I'm still waiting for feedback on the environment variables that pbs sets for parallel jobs. You should try with HEAD code as at least one of your issues is fixed (although it'll still fail further on, I'm afraid).

Ashley.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From Jie.Cai at anu.edu.au Wed Nov 10 23:48:50 2010
From: Jie.Cai at anu.edu.au (Jie Cai)
Date: Thu, 11 Nov 2010 10:48:50 +1100
Subject: [padb-users] start using padb on TORQUE
In-Reply-To:
References: <4CD9DF48.8050002@anu.edu.au>
Message-ID: <4CDB2F62.8010808@anu.edu.au>

On 11/11/10 06:41, Ashley Pittman wrote:
>
>> (2) In the PBS interactive mode of a job, I get the following information and warning. Please note that no PBS job is detected; I was actually expecting a pbs job to be detected.
>>
> pbs_pro support has been included for a while; pbs and Torque support are slightly different and have only been added very recently. In fact the current HEAD will detect jobs and launch itself on the remote nodes but not find the individual processes. It is almost certainly looking for the wrong environment variable, so it should be easy to fix when I get some more feedback from people who are testing it (I don't have access to a pbs system and that makes it difficult).
>
>

I am pretty happy to help with this. Our PBS system is built on OpenPBS. I am not sure whether there is any major difference in the interface between old OpenPBS and Torque or PBS Pro.

Does the "HEAD" you mentioned refer to padb or to the PBS mom? I am a little bit confused.

>> $ padb --full-report=53259 --config-option rmgr=orte --rank 0
>> padb version 3.2 (Revision 399)
>>
>> ========
>> Warning: errors reported by some ranks
>> ========
>> [0]: Error message from /short/z00/jxc900/PADB/padb_build/libexec/minfo: setup_communicator_iterator() failed
>> [0]: Stderr from minfo:
>> WARNING: Field opal_list_next of type opal_list_item_t not found!
>> WARNING: Field opal_list_sentinel of type opal_list_t not found!
>> WARNING: Field fl_mpool of type ompi_free_list_t not found!
>> WARNING: Field fl_allocations of type ompi_free_list_t not found!
>>
> These are errors from the MPI library. padb has done the right thing here: it has discovered the job, launched itself and found the processes, but the MPI debugger callback DLL is unable to extract the information it needs. This is the second time this has been reported in as many weeks, so I'm wondering if this is something that they have broken recently. The best place to take this up would be the Ompi developers list, or if you can wait until after SC next week I can test it for you and report it myself.
>
>

The symbol is loaded successfully. The error happened in the following code in minfo.c.

    res = dll_ep.setup_communicator_iterator(target_process);
    if ( res != mqs_ok ) {
        die_with_code(res,"setup_communicator_iterator() failed");
    }

I have tested a number of OMPI versions installed on our system, from 1.3.3 to 1.4.2. All show the warning message.

I have tested padb on another cluster with 1.3.3, where no warning messages turned up.

I don't know what configuration flags were used to compile those OpenMPI libraries, and thus I don't know whether the problem is that we haven't used the flag needed to turn the debug callback on (although it doesn't seem like that from the list of available OMPI configuration options).

> That shouldn't happen, can you send the output of "gdb -p 10782" in this case?
>

I got a similar message, complaining "ptrace: operation not permitted".
I used sudo for both gdb and padb:

b/libopen-pal.so.0...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/libopen-pal.so.0
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libutil.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_paffinity_linux.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_paffinity_linux.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_carto_auto_detect.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_carto_auto_detect.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_ess_hnp.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_ess_hnp.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_plm_tm.so...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_plm_tm.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_rml_oob.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_rml_oob.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_oob_tcp.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_oob_tcp.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_routed_binomial.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_routed_binomial.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_grpcomm_bad.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_grpcomm_bad.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_ras_tm.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_ras_tm.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_rmaps_round_robin.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_rmaps_round_robin.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_errmgr_default.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_errmgr_default.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_odls_default.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_odls_default.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_iof_hnp.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_iof_hnp.so
Reading symbols from /apps/openmpi/1.4.2/lib/openmpi/mca_filem_rsh.so...(no debugging symbols found)...done.
Loaded symbols for /apps/openmpi/1.4.2/lib/openmpi/mca_filem_rsh.so
0x00007ffff6bfb14f in poll () from /lib64/libc.so.6

$ sudo padb --config-option rmgr=mpirun --full-report=5226
padb version 3.2 (Revision 399)
full job report for job 5226

Warning, failed to locate any ranks

> All the patches I have are committed now; however, I'm still waiting for feedback on the environment variables that pbs sets for parallel jobs. You should try with HEAD code as at least one of your issues is fixed (although it'll still fail further on, I'm afraid).
>

You have mentioned HEAD code a few times. Again, do you mean the OMPI header files or the padb perl script? I would really appreciate it if you could clarify this a little more.

> Ashley.
>
>

BTW: do you have any documents which explain how padb works, e.g. the workflow? It would help us significantly in understanding your code and design ideas. Then we can give you more useful feedback.

Jie

From daniel.kidger at googlemail.com Thu Nov 11 09:31:34 2010
From: daniel.kidger at googlemail.com (Daniel Kidger)
Date: Thu, 11 Nov 2010 09:31:34 +0000
Subject: [padb-users] start using padb on TORQUE
In-Reply-To: <4CDB2F62.8010808@anu.edu.au>
References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au>
Message-ID:

Jie,

I know Ashley is away at the moment, so I will reply.

By HEAD code he means the latest version of the padb source code.

You could try 3.2-beta1 from http://padb.pittman.org.uk/ dated 23-10-10

Hope this helps,
Daniel

From Jie.Cai at anu.edu.au Thu Nov 11 11:05:51 2010
From: Jie.Cai at anu.edu.au (Jie Cai)
Date: Thu, 11 Nov 2010 22:05:51 +1100
Subject: [padb-users] start using padb on TORQUE
In-Reply-To:
References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au>
Message-ID: <4CDBCE0F.5040509@anu.edu.au>

Hi Daniel,

Thanks for the reply. Yes, I do have the trunk checked out.

Please keep us updated if you have answers to the other questions.

Kind Regards,
Jie
Again, do you mean > OMPI header files? or padb perl script? I will really appreciate > if you could clarify this a little bit more. > > Ashley. > > > > > BTW: do you have any documents, which explain how padb works, e.g > work flow. It can help us significantly with understanding your > code and design idea. Then we can feedback some more useful > information. > > Jie > > > _______________________________________________ > padb-users mailing list > padb-users at pittman.org.uk > http://pittman.org.uk/mailman/listinfo/padb-users_pittman.org.uk > > From ashley at pittman.co.uk Sat Nov 13 16:37:32 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Sat, 13 Nov 2010 10:37:32 -0600 Subject: [padb-users] start using padb on TORQUE In-Reply-To: <4CDB2F62.8010808@anu.edu.au> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> Message-ID: <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> On 11 Nov 2010, at 00:48, Jie Cai wrote: > I am pretty happy to help with this. Our PBS system is built on OpenPBS. I am not sure whether there is major difference in the interface between old OpenPBS and torque or PBS pro. > > Is the "HEAD" you mentioned means padb? or PBS mom? I am a little bit confused. As Dan says Subversion HEAD of padb. The instructions for getting the source are at: http://code.google.com/p/padb/source/checkout The most recent pbs additions have gone in since the last beta release was cut. If you are prepared to could you run two parallel jobs and send me the output, "env" and "ps auwxf". From that I should have the information I need to make it work in your case. Single CPU jobs will be fine. > I have tested a number of OMPI versions installed on our system, from 1.3.3 to 1.4.2. All shows the warning message. > > I have tested padb on another cluster with 1.3.3, while no warning messages turned up. The WARNING error is coming from OpenMPI which in this case is then returning false from the "setup_communicator_iterator" callback. 
The code in question is looking for struct offsets in the running program, so it is likely to do with exporting symbols correctly or possibly static builds?

See this mail for more details:

http://pittman.org.uk/pipermail/padb-users_pittman.org.uk/2010-October/000043.html

>> That shouldn't happen, can you send the output of "gdb -p 10782" in this case?
>>
> I similar information that complaining "ptrace: operation not permitted"
> I did sudo for both gdb and padb:

Does this mean that your resource manager is installed suid? I don't think padb can support this case, as it would need to run as one user to read the process list and as a different user for everything else.

> $ sudo padb --config-option rmgr=mpirun --full-report=5226
> padb version 3.2 (Revision 399)
> full job report for job 5226
>
> Warning, failed to locate any ranks

In this case it will have discovered the process list but will be looking for root-owned processes only, and so will not discover the processes on the remote node(s).

Ashley.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From Jie.Cai at anu.edu.au Tue Nov 16 05:00:20 2010
From: Jie.Cai at anu.edu.au (Jie Cai)
Date: Tue, 16 Nov 2010 16:00:20 +1100
Subject: [padb-users] start using padb on TORQUE
In-Reply-To: <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk>
References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk>
Message-ID: <4CE20FE4.6060807@anu.edu.au>

Dear Ashley,

On 14/11/10 03:37, Ashley Pittman wrote:
> The WARNING error is coming from OpenMPI which in this case is then returning false from the "setup_communicator_iterator" callback. The code in question is looking for struct offsets in the running program so it likely to do with exporting symbols correctly or possibly static builds?
>
> See this mail for more details:
>
> http://pittman.org.uk/pipermail/padb-users_pittman.org.uk/2010-October/000043.html

We have recently found that this kind of warning message only occurs for Fortran programs. Looking into the debug information, we found that in Fortran programs there is no "void" type in the current context. Therefore, gdb commands such as "-data-evaluate-expression "(void *)&((opal_list_item_t *)0)->opal_list_next"" fail with "No symbol \"void\" in current context."

We have made a small patch to fix this issue. Simply inserting "-gdb-set language c" before "-data-evaluate-expression" solves it.

We haven't had time to look into the PBS issue yet. Once we have any information we will keep you updated.

Kind Regards,
Jie

--
Jie Cai
Jie.Cai at anu.edu.au

ANU Supercomputer Facility          NCI National Facility
Leonard Huxley, Mills Road          Ph: +61 2 6125 7965
Australian National University      Fax: +61 2 6125 8199
Canberra, ACT 0200, Australia       http://nf.nci.org.au

-----------------------------------------------------

From ashley at pittman.co.uk Tue Nov 16 18:33:59 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Tue, 16 Nov 2010 12:33:59 -0600
Subject: [padb-users] start using padb on TORQUE
In-Reply-To: <4CE20FE4.6060807@anu.edu.au>
References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au>
Message-ID: <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk>

On 15 Nov 2010, at 23:00, Jie Cai wrote:
> Therefore, some gdb command like "-data-evaluate-expression "(void *)&((opal_list_item_t *)0)->opal_list_next"" failed because of "No symbol \"void\" in current context."

This was added for some versions of gdb which reported differently for offsets of known types; initially I only saw this on Solaris but I have seen it on Linux as well.

http://code.google.com/p/padb/source/detail?r=322

> We have made a little patch into it to fix this issue.
> Simple insert "-gdb-set language c" prior actually to "-data-evaluate-expression" will solve it.

That sounds great, I look forward to seeing it. Another option might be to move gdb up the stack to main so that the language is C anyway?

> We haven't got time to look into PBS issue yet. Once we have any information will let you updated.

The change will most likely be adding a process name to the list in is_resmgr_process() and/or adding checks for a second environment variable in pbs_find_pids().

Ashley.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From Jie.Cai at anu.edu.au Mon Nov 22 00:17:16 2010
From: Jie.Cai at anu.edu.au (Jie Cai)
Date: Mon, 22 Nov 2010 11:17:16 +1100
Subject: [padb-users] Make "gdb set language c" for Fortran program to get correct value and offset of symbols.
In-Reply-To: <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk>
References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk>
Message-ID: <4CE9B68C.4040706@anu.edu.au>

Hi Ashley,

This is the patch file that makes gdb set the language to C for Fortran programs. It is based on the current svn HEAD of padb.

It attempts to solve the problem with the warning messages shown below:

========
Warning: errors reported by some ranks
========
[0]: Error message from /short/z00/jxc900/PADB/padb_build/libexec/minfo: setup_communicator_iterator() failed
[0]: Stderr from minfo:
WARNING: Field opal_list_next of type opal_list_item_t not found!
WARNING: Field opal_list_sentinel of type opal_list_t not found!
WARNING: Field fl_mpool of type ompi_free_list_t not found!
WARNING: Field fl_allocations of type ompi_free_list_t not found!
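
For illustration only, the command ordering discussed in this thread can be sketched as below. This is a minimal Python sketch, not padb's Perl code and not the attached patch; the function name offset_query_commands is hypothetical. It only builds the two gdb/MI command strings quoted earlier, with "-gdb-set language c" placed ahead of the "-data-evaluate-expression" request that fails in a Fortran context.

```python
from typing import List


def offset_query_commands(struct_type: str, field: str) -> List[str]:
    """Build gdb/MI commands that ask for the offset of `field` in `struct_type`.

    Hypothetical helper for illustration; padb itself is written in Perl.
    """
    # Taking the address of a member of a struct placed at address 0
    # yields the member's byte offset; the (void *) cast needs C syntax.
    expression = f"(void *)&(({struct_type} *)0)->{field}"
    return [
        # Switch the expression language first so "void" is understood
        # even when the current frame belongs to a Fortran routine.
        "-gdb-set language c",
        f'-data-evaluate-expression "{expression}"',
    ]


commands = offset_query_commands("opal_list_item_t", "opal_list_next")
for command in commands:
    print(command)
```

The sketch sends the language command on every query, which is the simplest form of the idea; it makes no attempt to restore the previous language afterwards.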
Kind Regards, Jie -- Jie Cai Jie.Cai at anu.edu.au ANU Supercomputer Facility NCI National Facility Leonard Huxley, Mills Road Ph: +61 2 6125 7965 Australian National University Fax: +61 2 6125 8199 Canberra, ACT 0200, Australia http://nf.nci.org.au ----------------------------------------------------- On 17/11/10 05:33, Ashley Pittman wrote: > On 15 Nov 2010, at 23:00, Jie Cai wrote: > >> Therefore, some gdb command like "-data-evaluate-expression "(void *)&((opal_list_item_t *)0)->opal_list_next"" failed because of "No symbol \"void\" in current context." >> > This was added for some versions of gdb which reported differently for offsets of known types, initially I only saw this on Solaris but I have seen it on Linux as well. > > http://code.google.com/p/padb/source/detail?r=322 > > >> We have made a little patch into it to fix this issue. Simple insert "-gdb-set language c" prior actually to "-data-evaluate-expression" will solve it. >> > That sounds great, I'll look forward to seeing it. The other option might be to move gdb up the stack to main so that the language is c anyway? > > >> We haven't got time to look into PBS issue yet. Once we have any information will let you updated. >> > It's likely that the change will be possibly adding a process name to the list in is_resmgr_process() and or adding checks for a second environment variable in pbs_find_pids(). > > Ashley, > > -------------- next part -------------- A non-text attachment was scrubbed... Name: padb_fortran.patch Type: text/x-patch Size: 4803 bytes Desc: not available URL: From Jie.Cai at anu.edu.au Tue Nov 23 01:52:08 2010 From: Jie.Cai at anu.edu.au (Jie Cai) Date: Tue, 23 Nov 2010 12:52:08 +1100 Subject: [padb-users] configuration file format. Message-ID: <4CEB1E48.20607@anu.edu.au> Hi Ashley, I am trying to use configuration file padb.conf. $ cat padb.conf PADB_RMGR=pbs However, I end up with an error message: Warning, unknown config option 'PADB_RMGR' value 'pbs'. 
I have seen this key on padb web page for env variables, I am not sure whether it is the same for the configuration files. Thanks in advance. Kind Regards, Jie -- Jie Cai Jie.Cai at anu.edu.au ANU Supercomputer Facility NCI National Facility Leonard Huxley, Mills Road Ph: +61 2 6125 7965 Australian National University Fax: +61 2 6125 8199 Canberra, ACT 0200, Australia http://nf.nci.org.au ----------------------------------------------------- From Jie.Cai at anu.edu.au Tue Nov 23 02:00:11 2010 From: Jie.Cai at anu.edu.au (Jie Cai) Date: Tue, 23 Nov 2010 13:00:11 +1100 Subject: [padb-users] configuration file format. In-Reply-To: <30161_1290477207_4CEB1E96_30161_184309_1_4CEB1E48.20607@anu.edu.au> References: <30161_1290477207_4CEB1E96_30161_184309_1_4CEB1E48.20607@anu.edu.au> Message-ID: <4CEB202B.30301@anu.edu.au> I have just figured out. Instead of PADB_RMGR=pbs, I should use rmgr=pbs. BTW: Is there a list of configurable options? Can I setup the launch mode in the configuration options as well? Kind Regards, Jie -- Jie Cai Jie.Cai at anu.edu.au ANU Supercomputer Facility NCI National Facility Leonard Huxley, Mills Road Ph: +61 2 6125 7965 Australian National University Fax: +61 2 6125 8199 Canberra, ACT 0200, Australia http://nf.nci.org.au ----------------------------------------------------- On 23/11/10 12:52, Jie Cai wrote: > Hi Ashley, > > I am trying to use configuration file padb.conf. > > $ cat padb.conf > PADB_RMGR=pbs > > However, I end up with an error message: > Warning, unknown config option 'PADB_RMGR' value 'pbs'. > > I have seen this key on padb web page for env variables, I am not sure > whether it is the same for the configuration files. > > Thanks in advance. > > Kind Regards, > Jie > From ashley at pittman.co.uk Tue Nov 23 07:50:00 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Tue, 23 Nov 2010 07:50:00 +0000 Subject: [padb-users] configuration file format. 
In-Reply-To: <4CEB202B.30301@anu.edu.au>
References: <30161_1290477207_4CEB1E96_30161_184309_1_4CEB1E48.20607@anu.edu.au> <4CEB202B.30301@anu.edu.au>
Message-ID: <7ACC4803-3BEC-471D-9A8F-E410BC156F89@pittman.co.uk>

On 23 Nov 2010, at 02:00, Jie Cai wrote:
> I have just figured out.
>
> Instead of PADB_RMGR=pbs, I should use rmgr=pbs.

Correct, you can set this in either ~/.padbrc or /etc/padb.conf. Note that these files are only loaded on the host where padb is invoked and not on the client nodes. Dash and underscore are treated as the same character for the purpose of specifying configuration options.

> BTW: Is there a list of configurable options?

The -v flag will show you the mode-specific options currently in use. For global options there is a way, but it isn't quite so easy: if you try to set an invalid option then padb will show you all options (for all modes if you haven't selected one).

# padb -Ohelp=yes

> Can I setup the launch mode in the configuration options as well?

This depends on what you mean by launch mode.

There is the mode that padb runs in (mode)

http://padb.pittman.org.uk/modes.html

And the way it launches the backend (launch_mode)

http://code.google.com/p/padb/source/detail?r=407

You can set the launch_mode but not the mode in the configuration file. I only added the launch_mode option a few weeks ago and hadn't spotted the naming conflict, so I could rename this to backend_launch_mode if it would clear up any confusion.

Ashley.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From ashley at pittman.co.uk Tue Nov 23 07:55:13 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Tue, 23 Nov 2010 07:55:13 +0000
Subject: [padb-users] Make "gdb set language c" for Fortran program to get correct value and offset of symbols.
In-Reply-To: <4CE9B68C.4040706@anu.edu.au> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> <4CE9B68C.4040706@anu.edu.au> Message-ID: <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk> Thank you for the patch, I'm wondering if it's enough to simply set the mode on attach or does that break the stack traces or local variables in Fortran code? Padb already tracks the value of "-gdb-set print address" and changes it as required, perhaps it needs to do the same for "-gdb-set language"? Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From Jie.Cai at anu.edu.au Tue Nov 23 08:02:52 2010 From: Jie.Cai at anu.edu.au (Jie Cai) Date: Tue, 23 Nov 2010 19:02:52 +1100 Subject: [padb-users] configuration file format. In-Reply-To: <7ACC4803-3BEC-471D-9A8F-E410BC156F89@pittman.co.uk> References: <30161_1290477207_4CEB1E96_30161_184309_1_4CEB1E48.20607@anu.edu.au> <4CEB202B.30301@anu.edu.au> <7ACC4803-3BEC-471D-9A8F-E410BC156F89@pittman.co.uk> Message-ID: <4CEB752C.3020309@anu.edu.au> Hi Ashley, Thanks for the quick reply. Ashley Pittman wrote: > On 23 Nov 2010, at 02:00, Jie Cai wrote: > > > This depends on what do you mean by launch mode? > > There is the mode that padb runs in (mode) > > http://padb.pittman.org.uk/modes.html > > And the way it launches the backend (launch_mode) > > Yes, I do mean backend launch mode. > http://code.google.com/p/padb/source/detail?r=407 > > You can set the launch_mode but not the mode in the configuration file. I only added the launch_mode option a few weeks ago and hadn't spotted the naming conflict so I could re-name this to backend_launch_mode if it would clear up any confusion. > I saw in the latest HEAD, that pdsh only works for number of hosts < 128. How to actually switch on rmgr launch_mode? > Ashley. 
> > Kind Regards, Jie From ashley at pittman.co.uk Tue Nov 23 08:03:02 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Tue, 23 Nov 2010 08:03:02 +0000 Subject: [padb-users] configuration file format. In-Reply-To: <4CEB752C.3020309@anu.edu.au> References: <30161_1290477207_4CEB1E96_30161_184309_1_4CEB1E48.20607@anu.edu.au> <4CEB202B.30301@anu.edu.au> <7ACC4803-3BEC-471D-9A8F-E410BC156F89@pittman.co.uk> <4CEB752C.3020309@anu.edu.au> Message-ID: <8A1FC921-C460-40E1-8004-8475E5C9F58F@pittman.co.uk> On 23 Nov 2010, at 08:02, Jie Cai wrote: > Yes, I do mean backend launch mode. You can set this as any other configuration option, file-based, environment based or command line. >> http://code.google.com/p/padb/source/detail?r=407 >> >> You can set the launch_mode but not the mode in the configuration file. I only added the launch_mode option a few weeks ago and hadn't spotted the naming conflict so I could re-name this to backend_launch_mode if it would clear up any confusion. >> > I saw in the latest HEAD, that pdsh only works for number of hosts < 128. This is based on the pdsh documentation that claims to run out of FDs for large node counts, this could be a hangover from when pdsh used rsh connections rather than spawning ssh processes though. > How to actually switch on rmgr launch_mode? This should be on and in preference to pdsh by default, if you are using pbs though then the resource manager doesn't provide it's own launch mechanism so pdsh is the only option currently. An option I'm looking at is clustershell which starts the remote processes in a tree formation so avoids this problem, the new launch_mode should allow easy integration of new remote shell tools but I've not had time to experiment yet. http://sourceforge.net/projects/clustershell/ Ashley. -- Ashley Pittman, Bath, UK. 
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From daniel.kidger at googlemail.com Tue Nov 23 15:38:49 2010
From: daniel.kidger at googlemail.com (Daniel Kidger)
Date: Tue, 23 Nov 2010 15:38:49 +0000
Subject: [padb-users] configuration file format.
In-Reply-To: <8A1FC921-C460-40E1-8004-8475E5C9F58F@pittman.co.uk>
References: <30161_1290477207_4CEB1E96_30161_184309_1_4CEB1E48.20607@anu.edu.au> <4CEB202B.30301@anu.edu.au> <7ACC4803-3BEC-471D-9A8F-E410BC156F89@pittman.co.uk> <4CEB752C.3020309@anu.edu.au> <8A1FC921-C460-40E1-8004-8475E5C9F58F@pittman.co.uk>
Message-ID:

Ashley,
Is Clustershell quite new?
- and how does it compare / contrast with LLNL's pdsh ? perhaps on extreme scalability?

I understand that Clustershell is by Stéphane Thiell of the CEA.
(which have mostly Bull hardware so slightly surprised I haven't heard of clustershell before).

CEA wrote Shine for Lustre management - I think Clustershell underpins it?

Daniel

On 23 November 2010 08:03, Ashley Pittman wrote:

>
> On 23 Nov 2010, at 08:02, Jie Cai wrote:
> > Yes, I do mean backend launch mode.
>
> You can set this as any other configuration option, file-based, environment
> based or command line.
>
> >> http://code.google.com/p/padb/source/detail?r=407
> >>
> >> You can set the launch_mode but not the mode in the configuration file.
> I only added the launch_mode option a few weeks ago and hadn't spotted the
> naming conflict so I could re-name this to backend_launch_mode if it would
> clear up any confusion.
> >>
> > I saw in the latest HEAD, that pdsh only works for number of hosts < 128.
>
> This is based on the pdsh documentation that claims to run out of FDs for
> large node counts, this could be a hangover from when pdsh used rsh
> connections rather than spawning ssh processes though.
>
> > How to actually switch on rmgr launch_mode?
> > This should be on and in preference to pdsh by default, if you are using > pbs though then the resource manager doesn't provide it's own launch > mechanism so pdsh is the only option currently. > > An option I'm looking at is clustershell which starts the remote processes > in a tree formation so avoids this problem, the new launch_mode should allow > easy integration of new remote shell tools but I've not had time to > experiment yet. > > http://sourceforge.net/projects/clustershell/ > > Ashley. > > -- > > Ashley Pittman, Bath, UK. > > Padb - A parallel job inspection tool for cluster computing > http://padb.pittman.org.uk > > > _______________________________________________ > padb-users mailing list > padb-users at pittman.org.uk > http://pittman.org.uk/mailman/listinfo/padb-users_pittman.org.uk > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ashley at pittman.co.uk Tue Nov 23 20:32:20 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Tue, 23 Nov 2010 20:32:20 +0000 Subject: [padb-users] configuration file format. In-Reply-To: References: <30161_1290477207_4CEB1E96_30161_184309_1_4CEB1E48.20607@anu.edu.au> <4CEB202B.30301@anu.edu.au> <7ACC4803-3BEC-471D-9A8F-E410BC156F89@pittman.co.uk> <4CEB752C.3020309@anu.edu.au> <8A1FC921-C460-40E1-8004-8475E5C9F58F@pittman.co.uk> Message-ID: <0448FA5B-8F7D-47DD-852B-549F17541347@pittman.co.uk> On 23 Nov 2010, at 15:38, Daniel Kidger wrote: > Ashley, > Is Clustershell quite new? It's been around for at least a year or so but I've yet to use it. > and how does it compare / contrast with LLNL's pdsh ? perhaps on extreme scalability? Mainly the scalability I believe, pdsh has a "sliding window" of hosts it targets at any one time and as one completes it moves onto the next, the size of this window is the "fanout" parameter. 
This worked well for padb in the 2.x days, but now that there is full-duplex communication between the inner and outer processes, all inner processes have to run simultaneously. I'm told that clustershell works by calling itself recursively on remote nodes, so it can scale to much larger host counts and still run all commands simultaneously.

> I understand that Clustershell is by Stéphane Thiell of the CEA.
> (which have mostly Bull hardware so slightly surprised I haven't heard of clustershell before).
>
> CEA wrote Shine for Lustre management - I think Clustershell underpins it?

It's a part of this project, yes.

Ashley.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From Jie.Cai at anu.edu.au Tue Nov 23 23:44:05 2010
From: Jie.Cai at anu.edu.au (Jie Cai)
Date: Wed, 24 Nov 2010 10:44:05 +1100
Subject: [padb-users] Make "gdb set language c" for Fortran program to get correct value and offset of symbols.
In-Reply-To: <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk>
References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> <4CE9B68C.4040706@anu.edu.au> <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk>
Message-ID: <4CEC51C5.3030701@anu.edu.au>

On 23/11/10 18:55, Ashley Pittman wrote:
> Thank you for the patch, I'm wondering if it's enough to simply set the mode on attach or does that break the stack traces or local variables in Fortran code?

In my testing, this was not sufficient.

> Padb already tracks the value of "-gdb-set print address" and changes it as required, perhaps it needs to do the same for "-gdb-set language"?
>

Yes, you are right. The patch can be simplified to send '-gdb-set language c' only when '-gdb-set print address' is sent.

The simplified new patch is attached.
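
As a rough illustration of tracking the language setting the same way as "-gdb-set print address", here is a minimal Python sketch. It is not the attached patch and not padb's Perl code; the class GdbLanguageTracker, its method names, and the use of "auto" as the value to return to are all assumptions made for the example.

```python
class GdbLanguageTracker:
    """Send "-gdb-set language" only when the wanted value changes.

    Hypothetical illustration; `send` is any callable that accepts one
    gdb/MI command string (for example a function writing to gdb's stdin).
    """

    def __init__(self, send):
        self._send = send
        # Assumption: gdb starts with its language setting on "auto".
        self._current = "auto"

    def set_language(self, language):
        # The setting is a string value, so compare it with the last
        # value sent instead of treating it as an on/off flag.
        if language != self._current:
            self._send(f"-gdb-set language {language}")
            self._current = language

    def evaluate_in_c(self, expression):
        # Make sure C syntax is active before evaluating the expression.
        self.set_language("c")
        self._send(f'-data-evaluate-expression "{expression}"')

    def restore(self):
        # Hand language selection back to gdb, e.g. before stack traces.
        self.set_language("auto")


sent = []
tracker = GdbLanguageTracker(sent.append)
tracker.evaluate_in_c("(void *)&((opal_list_item_t *)0)->opal_list_next")
tracker.evaluate_in_c("(void *)&((opal_list_t *)0)->opal_list_sentinel")
tracker.restore()
for command in sent:
    print(command)
```

With this shape the language command is issued once for a run of offset queries and once more when switching back, so Fortran stack traces and local variables would still be read with gdb's own language choice.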
Kind Regards, Jie -- Jie Cai Jie.Cai at anu.edu.au ANU Supercomputer Facility NCI National Facility Leonard Huxley, Mills Road Ph: +61 2 6125 7965 Australian National University Fax: +61 2 6125 8199 Canberra, ACT 0200, Australia http://nf.nci.org.au ----------------------------------------------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: padb_fortran_new.patch Type: text/x-patch Size: 415 bytes Desc: not available URL: From ashley at pittman.co.uk Wed Nov 24 22:55:27 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Wed, 24 Nov 2010 22:55:27 +0000 Subject: [padb-users] Make "gdb set language c" for Fortran program to get correct value and offset of symbols. In-Reply-To: <4CEC51C5.3030701@anu.edu.au> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> <4CE9B68C.4040706@anu.edu.au> <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk> <4CEC51C5.3030701@anu.edu.au> Message-ID: On 23 Nov 2010, at 23:44, Jie Cai wrote: > > On 23/11/10 18:55, Ashley Pittman wrote: >> Thank you for the patch, I'm wondering if it's enough to simply set the mode on attach or does that break the stack traces or local variables in Fortran code? > As I tested, this is not sufficient. >> Padb already tracks the value of "-gdb-set print address" and changes it as required, perhaps it needs to do the same for "-gdb-set language"? >> > Yes, you are right. The patch can be simplified as put '-gdb-set languance c' only when '-gdb-set print address' is sent. I was thinking of something like the attached which sets the value to c when required but reverts it to it's normal state when not. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: gdb-language.patch Type: application/octet-stream Size: 1708 bytes Desc: not available URL: -------------- next part -------------- -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From Jie.Cai at anu.edu.au Thu Nov 25 01:07:34 2010 From: Jie.Cai at anu.edu.au (Jie Cai) Date: Thu, 25 Nov 2010 12:07:34 +1100 Subject: [padb-users] Make "gdb set language c" for Fortran program to get correct value and offset of symbols. In-Reply-To: References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> <4CE9B68C.4040706@anu.edu.au> <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk> <4CEC51C5.3030701@anu.edu.au> Message-ID: <4CEDB6D6.5040304@anu.edu.au> Hi Ashley, Agreed, it makes more sense to set the language only when it is needed. There is a little bug in line 6323. I have fixed it with the attached patch. Other than that works fine. Kind Regards, Jie -- Jie Cai Jie.Cai at anu.edu.au ANU Supercomputer Facility NCI National Facility Leonard Huxley, Mills Road Ph: +61 2 6125 7965 Australian National University Fax: +61 2 6125 8199 Canberra, ACT 0200, Australia http://nf.nci.org.au ----------------------------------------------------- On 25/11/10 09:55, Ashley Pittman wrote: > On 23 Nov 2010, at 23:44, Jie Cai wrote: > > >> On 23/11/10 18:55, Ashley Pittman wrote: >> >>> Thank you for the patch, I'm wondering if it's enough to simply set the mode on attach or does that break the stack traces or local variables in Fortran code? >>> >> As I tested, this is not sufficient. >> >>> Padb already tracks the value of "-gdb-set print address" and changes it as required, perhaps it needs to do the same for "-gdb-set language"? >>> >>> >> Yes, you are right. The patch can be simplified as put '-gdb-set languance c' only when '-gdb-set print address' is sent. 
>> > I was thinking of something like the attached which sets the value to c when required but reverts it to it's normal state when not. > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: bug-fix-gdb.patch Type: text/x-patch Size: 374 bytes Desc: not available URL: From ashley at pittman.co.uk Thu Nov 25 18:36:04 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Thu, 25 Nov 2010 18:36:04 +0000 Subject: [padb-users] Make "gdb set language c" for Fortran program to get correct value and offset of symbols. In-Reply-To: <4CEDB6D6.5040304@anu.edu.au> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> <4CE9B68C.4040706@anu.edu.au> <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk> <4CEC51C5.3030701@anu.edu.au> <4CEDB6D6.5040304@anu.edu.au> Message-ID: I've committed the patch with your change. I had copied the logic from set_print_address but of course that deals with a boolean flag rather than discrete values. Ashley. On 25 Nov 2010, at 01:07, Jie Cai wrote: > Hi Ashley, > > Agreed, it makes more sense to set the language only when it is needed. > > There is a little bug in line 6323. I have fixed it with the attached patch. Other than that works fine. 
> Kind Regards, > Jie > > -- > Jie Cai > Jie.Cai at anu.edu.au > > ANU Supercomputer Facility NCI National Facility > Leonard Huxley, Mills Road Ph: +61 2 6125 7965 > Australian National University Fax: +61 2 6125 8199 > Canberra, ACT 0200, Australia > http://nf.nci.org.au > > ----------------------------------------------------- > > > On 25/11/10 09:55, Ashley Pittman wrote: >> On 23 Nov 2010, at 23:44, Jie Cai wrote: >> >> >> >>> On 23/11/10 18:55, Ashley Pittman wrote: >>> >>> >>>> Thank you for the patch, I'm wondering if it's enough to simply set the mode on attach or does that break the stack traces or local variables in Fortran code? >>>> >>>> >>> As I tested, this is not sufficient. >>> >>> >>>> Padb already tracks the value of "-gdb-set print address" and changes it as required, perhaps it needs to do the same for "-gdb-set language"? >>>> >>>> >>>> >>> Yes, you are right. The patch can be simplified as put '-gdb-set languance c' only when '-gdb-set print address' is sent. >>> >>> >> I was thinking of something like the attached which sets the value to c when required but reverts it to it's normal state when not. >> >> >> >> >> >> >> >> > -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From ashley at pittman.co.uk Thu Nov 25 18:46:11 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Thu, 25 Nov 2010 18:46:11 +0000 Subject: [padb-users] start using padb on TORQUE In-Reply-To: <4CDB2F62.8010808@anu.edu.au> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> Message-ID: On 10 Nov 2010, at 23:48, Jie Cai wrote: > > On 11/11/10 06:41, Ashley Pittman wrote: >> >>> (2) in the PBS interactive mode of a job, I have following information and warning, please noted that no PBS job detected. I am actually expecting a pbs job detected. 
>>> >> pbs_pro support has been included for a while, pbs and Torque support are slightly different and have only been added very recently, in fact the current HEAD will detect jobs but and launch itself on the remote nodes but not find the individual processes, it is almost certainly looking for the wrong environment variable so should be easy to fix when I get some more feedback from people who are testing it (I don't have access to a pbs system and that makes it difficult). >> >> > I am pretty happy to help with this. Our PBS system is built on OpenPBS. I am not sure whether there is major difference in the interface between old OpenPBS and torque or PBS pro. Have you made any headway on this? I'm back from SC now so can devote some time to it myself if you have questions or can get me access to a PBS system. As I said it should really just be a case of finding out what environment variables are set by pbs and what the parent process of the parallel processes is called. > BTW: do you have any documents, which explain how padb works, e.g work flow. It can help us significantly with understanding your code and design idea. Then we can feedback some more useful information. The common use-case really is "what is my parallel program doing right now" and the drive for this could be for debugging, monitoring or verifying the system is functioning correctly after a previous problem. Unlike a "full featured" parallel debugger padb is really very easy to use and gives you information very quickly with no setup cost or steps needed when launching the job in the first place. I know sites which run padb automatically for every job every hour checking for processes in D state and use this to notify admins and users of possible problems, at the other end of the scale you can do in-depth debugging by using padb to look at individual ranks within a parallel job in great detail and for comparing state across the job looking for outliers. Ashley. -- Ashley Pittman, Bath, UK. 
Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From David.Singleton at anu.edu.au Thu Nov 25 20:01:44 2010 From: David.Singleton at anu.edu.au (David Singleton) Date: Fri, 26 Nov 2010 07:01:44 +1100 Subject: [padb-users] start using padb on TORQUE In-Reply-To: References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> Message-ID: <4CEEC0A8.7010005@anu.edu.au> On 11/26/2010 05:46 AM, Ashley Pittman wrote: > > On 10 Nov 2010, at 23:48, Jie Cai wrote: > >> >> On 11/11/10 06:41, Ashley Pittman wrote: >>> >>>> (2) in the PBS interactive mode of a job, I have following information and warning, please noted that no PBS job detected. I am actually expecting a pbs job detected. >>>> >>> pbs_pro support has been included for a while, pbs and Torque support are slightly different and have only been added very recently, in fact the current HEAD will detect jobs but and launch itself on the remote nodes but not find the individual processes, it is almost certainly looking for the wrong environment variable so should be easy to fix when I get some more feedback from people who are testing it (I don't have access to a pbs system and that makes it difficult). >>> >>> >> I am pretty happy to help with this. Our PBS system is built on OpenPBS. I am not sure whether there is major difference in the interface between old OpenPBS and torque or PBS pro. > > Have you made any headway on this? I'm back from SC now so can devote some time to it myself if you have questions or can get me access to a PBS system. As I said it should really just be a case of finding out what environment variables are set by pbs and what the parent process of the parallel processes is called. > Hi Ashley, Jie now has padb working in our PBS/OpenMPI environment. We would be happy to give you access to our system but since our environment is fairly "unique", it may not help you much in terms of supporting Torque or other PBS sites. 
We have our own version of PBS based on OpenPBS, started well before Torque came into existence. I'll comment on a couple of the issues we have come across and you can decide what is interesting to you. Jie might have other points to add (he is the one who has had to hack the perl).

* Long ago, we changed the format of the PBS exechost string (qstat -n
  output) to be more compact. The relevant parts are like pdsh hostlist
  format, eg.
     v[5-6,15-18,30-31]/cpus=0-7/mems=0-1
  We have avoided working out how to get padb to use this format by
  adding an "old exechost format" option to our qstat. For very large
  jobs, I think our format makes more sense and should be easier to use
  with pdsh. We haven't looked at clustershell yet (we use c3 for cluster
  management).

* We run all jobs under project groups which are not user login groups.
  That causes grief for "rsh node gdb ..." type debugging because of
  insufficient privileges. Since this is a common problem for us, we
  have a variant of newgrp that we can insert in remote commands to
  overcome this, eg
     rsh node nfnewgrp projgroup gdb ...

  Note that all variants of PBS support users nominating their job's
  execution group (the group_list/egroup job attributes) but I don't know
  how commonly this is exercised.

* A common variant of MPI jobs are those launched like
     mpirun wrapper_script mpi_executable
  so that the parent of the MPI tasks is not orted/mpid/mpirun. We are
  interested in ways to support such jobs. Since job processes are
  contained in cpusets (cgroups) on our system, we can easily get the
  relevant process list and then use environment to find ranks. Will it
  matter if a non-MPI process with OMPI_COMM_WORLD_RANK set is queried
  for message queue info? Does it matter that two processes have the same
  rank?

Thanks for the very relevant and useful tool.
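For readers unfamiliar with the compact exechost format in the first point above, here is a minimal sketch of how such a string could be expanded into individual host names. It is written in Python for illustration only and is not padb code (padb is Perl). The function name `expand_exechost` is hypothetical, and the sketch assumes a single prefix followed by one bracketed list of ranges, as in the example `v[5-6,15-18,30-31]/cpus=0-7/mems=0-1`; the real format may allow more than that.

```python
import re


def expand_exechost(spec):
    """Expand a compact exechost string into a list of host names.

    Assumes one prefix plus one bracketed range list, followed by
    optional "/..." attributes that are ignored here.
    """
    # Drop everything after the first "/" (cpus=..., mems=...).
    host_part = spec.split("/", 1)[0]
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", host_part)
    if match is None:
        # No brackets: treat it as a single plain host name.
        return [host_part]
    prefix, ranges = match.groups()
    hosts = []
    for item in ranges.split(","):
        low, _, high = item.partition("-")
        high = high or low  # a lone number is a range of one
        for number in range(int(low), int(high) + 1):
            # Preserve any zero padding used in the lower bound.
            hosts.append(prefix + str(number).zfill(len(low)))
    return hosts


if __name__ == "__main__":
    print(",".join(expand_exechost("v[5-6,15-18,30-31]/cpus=0-7/mems=0-1")))
```

The comma-joined result is in the form that `pdsh -w` accepts, although pdsh can also take the bracketed form directly.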
Cheers,
David

--
--------------------------------------------------------------------------
Dr David Singleton               ANU Supercomputer Facility
HPC Systems Manager              and NCI National Facility
David.Singleton at anu.edu.au    Leonard Huxley Bldg (No. 56)
Phone: +61 2 6125 4389           Australian National University
Fax: +61 2 6125 8199             Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------

From ashley at pittman.co.uk Thu Nov 25 20:53:41 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Thu, 25 Nov 2010 20:53:41 +0000
Subject: [padb-users] start using padb on TORQUE
In-Reply-To: <4CEEC0A8.7010005@anu.edu.au>
References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <4CEEC0A8.7010005@anu.edu.au>
Message-ID:

On 25 Nov 2010, at 20:01, David Singleton wrote:

> * Long ago, we changed the format of the PBS exechost string (qstat -n
> output) to be more compact. The relevant parts are like pdsh hostlist
> format, eg.
> v[5-6,15-18,30-31]/cpus=0-7/mems=0-1
> We have avoided working out how to get padb to use this format by
> adding an "old exechost format" option to our qstat. For very large
> jobs, I think our format makes more sense and should be easier to use
> with pdsh. We haven't looked at clustershell yet (we use c3 for cluster
> management).

I'm not sure if padb has code to extract hostlists from strings like that but I suspect it does; it's one piece of code I've written at least four or five times over the years. It sounds like this would be useful upstream in pbs and if that becomes the case then I'd gladly add this functionality to padb. As you have something working I'd be reluctant to add custom code for a single site.

Having checked, there is an implementation of this logic in rms_job_to_nhosts() which could easily be factored out to support the above.

> * We run all jobs under project groups which are not user login groups.
> That causes grief for "rsh node gdb ..."
type debugging because of
> insufficient privileges. Since this is a common problem for us, we
> have a variant of newgrp that we can insert in remote commands to
> overcome this, eg
> rsh node nfnewgrp projgroup gdb ...
>
> Note that all variants of PBS support users nominating their job's
> execution group (the group_list/egroup job attributes) but I don't know
> how commonly this is exercised.

In this context I'm assuming by group you mean a pbs concept and not a linux group (from /etc/group); if it was the latter and usernames were the same this would be a non-issue. I'm curious to know how you've made this work with the current setup then?

It would be possible for me to add this, at least for the pdsh and ssh launch-modes, and I assume that any rmgr launch-mode would handle this itself. I assume there is a way of discovering which remote user a job is running as? Currently I'm trying to stabilise a release but this is something I could look at once that process is complete if there is demand for it.

> * A common variant of MPI jobs are those launched like
> mpirun wrapper_script mpi_executable
> so that the parent of the MPI tasks is not orted/mpid/mpirun. We are
> interested in ways to support such jobs. Since job processes are
> contained in cpusets (cgroups) on our system, we can easily get the
> relevant process list and then use environment to find ranks. Will it
> matter if a non-MPI process with OMPI_COMM_WORLD_RANK set is queried
> for message queue info? Does it matter that two processes have the same
> rank?

Padb *should* handle this case: it only allows one process per rank and, depending on the resource manager, it'll either pick the direct child of the resource manager or, if that process is deemed to be a wrapper script and has any children, it will pick the first one. The definition of wrapper script can be configured by the "scripts" configuration option; it defaults to "bash,sh,dash,ash,perl,xterm" so should cover most bases.
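The selection rule described above can be sketched roughly as follows. This is an illustration in Python, not padb's Perl implementation. The function `pick_target` and the `names`/`children` dictionaries are hypothetical stand-ins for whatever padb reads from /proc, and the sketch assumes a single level of indirection and that "first" means first in list order.

```python
# Default taken from the "scripts" configuration option quoted above.
DEFAULT_SCRIPTS = "bash,sh,dash,ash,perl,xterm"


def pick_target(pid, names, children, scripts=DEFAULT_SCRIPTS):
    """Return the pid to inspect for a direct child of the resource manager.

    names:    dict mapping pid -> executable name (hypothetical input)
    children: dict mapping pid -> list of child pids (hypothetical input)
    """
    wrappers = set(scripts.split(","))
    kids = children.get(pid, [])
    # A wrapper script with at least one child: inspect the first child.
    if names[pid] in wrappers and kids:
        return kids[0]
    # Otherwise inspect the direct child itself.
    return pid


if __name__ == "__main__":
    names = {100: "sh", 101: "mpi_executable", 200: "mpi_executable"}
    children = {100: [101]}
    print(pick_target(100, names, children))  # the wrapper's child
    print(pick_target(200, names, children))  # the process itself
```

Note that the lookup is keyed on the interpreter name (for example `sh`), not on the file name of the wrapper script.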
The code for this is in convert_pids_to_child_pids() and is called once per node and passed a list of potential process which are direct descendants of the resource manager and makes a decision based on what processes are active. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From David.Singleton at anu.edu.au Thu Nov 25 21:31:04 2010 From: David.Singleton at anu.edu.au (David Singleton) Date: Fri, 26 Nov 2010 08:31:04 +1100 Subject: [padb-users] start using padb on TORQUE In-Reply-To: References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <4CEEC0A8.7010005@anu.edu.au> Message-ID: <4CEED598.9050703@anu.edu.au> On 11/26/2010 07:53 AM, Ashley Pittman wrote: >> * We run all jobs under project groups which are not user login groups. >> That causes grief for "rsh node gdb ..." type debugging because of >> insufficient privileges. Since this is a common problem for us, we >> have a variant of newgrp that we can insert in remote commands to >> overcome this, eg >> rsh node nfnewgrp projgroup gdb ... >> >> Note that all variants of PBS support users nominating their jobs >> execution group (the group_list/egroup job attributes) but I dont know >> how commonly this is exercised. > > In this context I'm assuming by group you mean a pbs concept and not a linux group (from /etc/groups), if it was the latter and usernames were the same this would be a non-issue. > I do mean Linux groups. ptrace appears to require matching gids. With matching gid: vayu2:~ > id uid=478(dbs900) gid=1090(z00) groups=900(ANUSF),998(quotasu),999(rashadm),1000(sysadmin),1090(z00),1094(z10),1438(s55),3040(z07),3136(c23),3191(c25),4004(abaqus),4167(vasp4),4285(z15),5050(libgoto),5146(z28),5501(gpu) vayu2:~ > gdb /bin/tcsh 16677 GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2) ... Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. 
Loaded symbols for /lib64/ld-linux-x86-64.so.2 0x00007ffff7716240 in __read_nocancel () from /lib64/libc.so.6 (gdb) Change group: vayu2:~ > newgrp c23 vayu2:~ > id uid=478(dbs900) gid=3136(c23) groups=900(ANUSF),998(quotasu),999(rashadm),1000(sysadmin),1090(z00),1094(z10),1438(s55),3040(z07),3136(c23),3191(c25),4004(abaqus),4167(vasp4),4285(z15),5050(libgoto),5146(z28),5501(gpu) vayu2:~ > gdb /bin/tcsh 16677 GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.2) ... Reading symbols from /bin/tcsh...(no debugging symbols found)...done. Attaching to program: /bin/tcsh, process 16677 ptrace: Operation not permitted. /home/900/dbs900/16677: No such file or directory. (gdb) Cheers, David From ashley at pittman.co.uk Thu Nov 25 21:54:46 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Thu, 25 Nov 2010 21:54:46 +0000 Subject: [padb-users] start using padb on TORQUE In-Reply-To: <4CEED598.9050703@anu.edu.au> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <4CEEC0A8.7010005@anu.edu.au> <4CEED598.9050703@anu.edu.au> Message-ID: <6992F8C8-D948-47FA-A614-CC20822D0CB2@pittman.co.uk> On 25 Nov 2010, at 21:31, David Singleton wrote: > On 11/26/2010 07:53 AM, Ashley Pittman wrote: > >>> * We run all jobs under project groups which are not user login groups. >>> That causes grief for "rsh node gdb ..." type debugging because of >>> insufficient privileges. Since this is a common problem for us, we >>> have a variant of newgrp that we can insert in remote commands to >>> overcome this, eg >>> rsh node nfnewgrp projgroup gdb ... >>> >>> Note that all variants of PBS support users nominating their jobs >>> execution group (the group_list/egroup job attributes) but I dont know >>> how commonly this is exercised. >> >> In this context I'm assuming by group you mean a pbs concept and not a linux group (from /etc/groups), if it was the latter and usernames were the same this would be a non-issue. >> > > I do mean Linux groups. 
ptrace appears to require matching gids.

I didn't know that! Then I'm doubly intrigued to find out how you made it work: do you change the group of the inner padb processes or do you change the group of the spawned gdb process?

Ashley,

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From Jie.Cai at anu.edu.au Thu Nov 25 23:04:06 2010
From: Jie.Cai at anu.edu.au (Jie Cai)
Date: Fri, 26 Nov 2010 10:04:06 +1100
Subject: [padb-users] start using padb on TORQUE
In-Reply-To: <4CEEC0A8.7010005@anu.edu.au>
References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <4CEEC0A8.7010005@anu.edu.au>
Message-ID: <4CEEEB66.7020004@anu.edu.au>

On 26/11/10 07:01, David Singleton wrote:
> * Long ago, we changed the format of the PBS exechost string (qstat -n
> output) to be more compact. The relevant parts are like pdsh hostlist
> format, eg.
> v[5-6,15-18,30-31]/cpus=0-7/mems=0-1
> We have avoided working out how to get padb to use this format by
> adding an "old exechost format" option to our qstat. For very large
> jobs, I think our format makes more sense and should be easier to use
> with pdsh. We haven't looked at clustershell yet (we use c3 for
> cluster management).
>

Here is a little bit extra. With David's current changes on our PBS system, the old host format looks like the following.

$ qstat -w -n -u gec651

xepbs:
                                                     Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- --- --- ------ ----- - -----
98856.xepbs     gec651   normal   adfrun      16  16   20gb 12:00 R 01:10
   x146/0+x146/1+x146/2+x146/3+x146/4+x146/5+x146/6+x146/7+x147/0+x147/1+x147/2+x147/3+x147/4+x147/5+x147/6+x147/7

PADB will actually push x146,x146,...x146,x147,x147,....,x147 into $pbs_tabjobs{$job}{hosts} (in function pbs_get_lqsub()), then spawn remote processes by 'pdsh -w $pbs_tabjobs{$job}{hosts}'. I have also changed the pbs_get_lqsub() function to filter out the redundant host names.
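The filtering step can be illustrated with a short sketch. It is in Python rather than padb's Perl and is not the actual change made to pbs_get_lqsub(). The function name `unique_hosts` is hypothetical, and the input is assumed to be the `host/cpu+host/cpu+...` string shown in the qstat output above.

```python
def unique_hosts(exec_host):
    """Return each host once, in first-seen order, from 'h/0+h/1+...'."""
    hosts = []
    for entry in exec_host.split("+"):
        # Keep only the host name, dropping the "/<cpu index>" suffix.
        host = entry.strip().split("/", 1)[0]
        if host and host not in hosts:
            hosts.append(host)
    return hosts


if __name__ == "__main__":
    exec_host = "x146/0+x146/1+x146/2+x147/0+x147/1+x147/2"
    # Comma-joined form suitable for "pdsh -w".
    print(",".join(unique_hosts(exec_host)))
```

Keeping first-seen order rather than sorting means the host list still follows the order in which PBS reported the nodes.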
I am not sure whether this is a common problem for other sites.

Kind Regards,
Jie

--
Jie Cai                          Jie.Cai at anu.edu.au
ANU Supercomputer Facility       NCI National Facility
Leonard Huxley, Mills Road       Ph: +61 2 6125 7965
Australian National University   Fax: +61 2 6125 8199
Canberra, ACT 0200, Australia    http://nf.nci.org.au
-----------------------------------------------------

From Jie.Cai at anu.edu.au Thu Nov 25 23:27:00 2010
From: Jie.Cai at anu.edu.au (Jie Cai)
Date: Fri, 26 Nov 2010 10:27:00 +1100
Subject: [padb-users] start using padb on TORQUE
In-Reply-To: <6992F8C8-D948-47FA-A614-CC20822D0CB2@pittman.co.uk>
References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <4CEEC0A8.7010005@anu.edu.au> <4CEED598.9050703@anu.edu.au> <6992F8C8-D948-47FA-A614-CC20822D0CB2@pittman.co.uk>
Message-ID: <4CEEF0C4.4010706@anu.edu.au>

Hi Ashley,

On 26/11/10 08:54, Ashley Pittman wrote:
>
> I didn't know that! Then I'm doubly intrigued to find out how you made it work, do you change the group of the inner padb processes or do you change the group of the spawned gdb process?
>
> Ashley,
>
>

We change the group of the spawned processes rather than the inner padb processes.

(1) In get_remote_env(), we use the following code to obtain $pbs_gid and open the 'environ' file:

    my $project = `stat /proc/$pid|grep Gid|cut -d' ' -f 18|cut -d')' -f 1`;
    $pbs_gid = substr($project, 0, 3);

    my @env_tmp = `newgrp $pbs_gid cat /proc/$pid/environ`;

(2) In gdb_start(), change the original line
    'my $cmd = "gdb --interpreter=mi -q";'
to
    '$cmd = "newgrp $pbs_gid gdb --interpreter=mi -q";'

Hope this helps, and please let us know if you need more information.
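As a side note on step (1), the group that owns /proc/<pid> can also be looked up without parsing the column layout of `stat` output. The following is a small Python sketch of that idea, not a drop-in replacement for the Perl above. The function name `group_name_for` is hypothetical, and unlike the `substr` call above it returns the whole group name rather than its first three characters.

```python
import grp
import os


def group_name_for(path):
    """Return the name of the group owning path, e.g. "/proc/<pid>".

    Falls back to the numeric gid as a string if the gid has no
    entry in the group database.
    """
    gid = os.stat(path).st_gid
    try:
        return grp.getgrgid(gid).gr_name
    except KeyError:
        return str(gid)


if __name__ == "__main__":
    # On Linux, "/proc/<pid>" would be passed here; "." keeps the
    # demonstration runnable on any Unix-like system.
    print(group_name_for("."))
```

The result would then be passed to the site-specific newgrp variant in the same way as $pbs_gid.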
Kind Regards, Jie -- Jie Cai Jie.Cai at anu.edu.au ANU Supercomputer Facility NCI National Facility Leonard Huxley, Mills Road Ph: +61 2 6125 7965 Australian National University Fax: +61 2 6125 8199 Canberra, ACT 0200, Australia http://nf.nci.org.au ----------------------------------------------------- From Jie.Cai at anu.edu.au Thu Nov 25 23:33:59 2010 From: Jie.Cai at anu.edu.au (Jie Cai) Date: Fri, 26 Nov 2010 10:33:59 +1100 Subject: [padb-users] start using padb on TORQUE In-Reply-To: <21316_1290727736_4CEEF138_21316_23871_1_4CEEF0C4.4010706@anu.edu.au> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <4CEEC0A8.7010005@anu.edu.au> <4CEED598.9050703@anu.edu.au> <6992F8C8-D948-47FA-A614-CC20822D0CB2@pittman.co.uk> <21316_1290727736_4CEEF138_21316_23871_1_4CEEF0C4.4010706@anu.edu.au> Message-ID: <4CEEF267.1050806@anu.edu.au> BTW, newgrp is our own implementation, which takes the second arguments in the command line as the actual command will be executed. Kind Regards, Jie -- Jie Cai Jie.Cai at anu.edu.au ANU Supercomputer Facility NCI National Facility Leonard Huxley, Mills Road Ph: +61 2 6125 7965 Australian National University Fax: +61 2 6125 8199 Canberra, ACT 0200, Australia http://nf.nci.org.au ----------------------------------------------------- On 26/11/10 10:27, Jie Cai wrote: > Hi Ashley, > > On 26/11/10 08:54, Ashley Pittman wrote: >> >> I didn't know that! Then I'm doubly intrigued to find out how you >> made it work, do you change the group of the inner padb processes or >> do you change the group of the spawned gdb process? >> >> Ashley, >> > > We change the group of spawned processes rather than inner padb > processes. 
> > (1) in get_remote_env() using following code to obtain $pbs_gid, and > open 'environ' file: > > my $project = `stat /proc/$pid|grep Gid|cut -d' ' -f 18|cut -d')' -f 1`; > $pbs_gid = substr($project, 0, 3); > > my @env_tmp = `newgrp $pbs_gid cat /proc/$pid/environ`; > > (2) In gdb_start(), change the original line 'my $cmd = "gdb > --interpreter=mi -q";' to > '$cmd = "newgrp $pbs_gid gdb --interpreter=mi -q";' > > Hope this helps, and please let us know if you need more information. > > Kind Regards, > Jie > From Jie.Cai at anu.edu.au Fri Nov 26 02:45:49 2010 From: Jie.Cai at anu.edu.au (Jie Cai) Date: Fri, 26 Nov 2010 13:45:49 +1100 Subject: [padb-users] start using padb on TORQUE In-Reply-To: References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <4CEEC0A8.7010005@anu.edu.au> Message-ID: <4CEF1F5D.1020406@anu.edu.au> On 26/11/10 07:53, Ashley Pittman wrote: > >> * A common variant of MPI jobs are those launched like >> mpirun wrapper_script mpi_executable >> so that the parent of the MPI tasks is not orted/mpid/mpirun. We are >> interested in ways to support such jobs. Since job processes are >> contained in cpusets (cgroups) on our system, we can easily get the >> relevant process list and then use environment to find ranks. Will it >> matter if a non-MPI process with OMPI_COMM_WORLD_RANK set is queried >> for message queue info? Does it matter that two process have the same >> rank? >> > Padb *should* handle this case, it only allows one process per rank and, depending on the resource manager, it'll either pick the direct child of the resource manager or if that process is deemed to be a wrapper script and has any children then it will pick the first one. The definition of wrapper script can be configured by the "scripts" configuration option, it defaults to "bash,sh,dash,ash,perl,xterm" so should cover most bases. 
> > The code for this is in convert_pids_to_child_pids() and is called once per node and passed a list of potential process which are direct descendants of the resource manager and makes a decision based on what processes are active. > > Ashley. > > I found out that pid_to_name() does not pick the correct script process. I have following debug information: inner: x153: rmpid is 4376, name is lu.sh. inner: x153: notscripts pid is 4376. I guess this is due to find_from_status() does not give the correct script name. I think it might be better to check /proc/$pid/cmdline to find the first element and to compare them with preset scripts keys. I will hack it later today or early next week, and will let you updated whether it is working. Kind Regards, Jie -- Jie Cai Jie.Cai at anu.edu.au ANU Supercomputer Facility NCI National Facility Leonard Huxley, Mills Road Ph: +61 2 6125 7965 Australian National University Fax: +61 2 6125 8199 Canberra, ACT 0200, Australia http://nf.nci.org.au ----------------------------------------------------- From ashley at pittman.co.uk Fri Nov 26 20:26:37 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Fri, 26 Nov 2010 20:26:37 +0000 Subject: [padb-users] start using padb on TORQUE In-Reply-To: <4CEF1F5D.1020406@anu.edu.au> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <4CEEC0A8.7010005@anu.edu.au> <4CEF1F5D.1020406@anu.edu.au> Message-ID: <79BA9921-5C4D-464E-8C0F-329379721A16@pittman.co.uk> On 26 Nov 2010, at 02:45, Jie Cai wrote: > > On 26/11/10 07:53, Ashley Pittman wrote: >> >>> * A common variant of MPI jobs are those launched like >>> mpirun wrapper_script mpi_executable >>> so that the parent of the MPI tasks is not orted/mpid/mpirun. We are >>> interested in ways to support such jobs. Since job processes are >>> contained in cpusets (cgroups) on our system, we can easily get the >>> relevant process list and then use environment to find ranks. 
Will it >>> matter if a non-MPI process with OMPI_COMM_WORLD_RANK set is queried >>> for message queue info? Does it matter that two process have the same >>> rank? >>> >> Padb *should* handle this case, it only allows one process per rank and, depending on the resource manager, it'll either pick the direct child of the resource manager or if that process is deemed to be a wrapper script and has any children then it will pick the first one. The definition of wrapper script can be configured by the "scripts" configuration option, it defaults to "bash,sh,dash,ash,perl,xterm" so should cover most bases. >> >> The code for this is in convert_pids_to_child_pids() and is called once per node and passed a list of potential process which are direct descendants of the resource manager and makes a decision based on what processes are active. >> >> Ashley. >> >> > I found out that pid_to_name() does not pick the correct script process. I have following debug information: > > inner: x153: rmpid is 4376, name is lu.sh. > inner: x153: notscripts pid is 4376. > > I guess this is due to find_from_status() does not give the correct script name. I think it might be better to check /proc/$pid/cmdline to find the first element and to compare them with preset scripts keys. I took this > I will hack it later today or early next week, and will let you updated whether it is working. > > Kind Regards, > Jie > > -- > Jie Cai Jie.Cai at anu.edu.au > ANU Supercomputer Facility NCI National Facility > Leonard Huxley, Mills Road Ph: +61 2 6125 7965 > Australian National University Fax: +61 2 6125 8199 > Canberra, ACT 0200, Australia http://nf.nci.org.au > ----------------------------------------------------- > -- Ashley Pittman, Bath, UK. 
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From ashley at pittman.co.uk Fri Nov 26 20:39:30 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Fri, 26 Nov 2010 20:39:30 +0000
Subject: [padb-users] start using padb on TORQUE
In-Reply-To: <4CEF1F5D.1020406@anu.edu.au>
References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <4CEEC0A8.7010005@anu.edu.au> <4CEF1F5D.1020406@anu.edu.au>
Message-ID:

Sorry about the last email, I clicked the wrong button! See real response below.

On 26 Nov 2010, at 02:45, Jie Cai wrote:
>>
> I found out that pid_to_name() does not pick the correct script process. I have following debug information:
>
> inner: x153: rmpid is 4376, name is lu.sh.
> inner: x153: notscripts pid is 4376.
>
> I guess this is due to find_from_status() does not give the correct script name. I think it might be better to check /proc/$pid/cmdline to find the first element and to compare them with preset scripts keys.

The lookup is done on the target of the "/proc/<pid>/exe" link if this is readable and only falls back to find_from_status() if it isn't (normally for SUID or other users' processes), so if you have a shell script it should pick up /bin/sh which is then shortened to sh and looked up in the preset keys. This is independent of the name of the wrapper script; I'd expect looking for ".sh" extensions on wrapper scripts to only catch a tiny minority of scripts, which should all be caught by the previous mechanism anyway.

Could it be a permissions issue with the groups again, so that you are not able to read the destination of this sym-link and hence it's reverting back to find_from_status(), which doesn't do the indirection for interpreted languages?

Ashley.

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From Jie.Cai at anu.edu.au Sun Nov 28 23:57:47 2010
From: Jie.Cai at anu.edu.au (Jie Cai)
Date: Mon, 29 Nov 2010 10:57:47 +1100
Subject: [padb-users] start using padb on TORQUE
In-Reply-To:
References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <4CEEC0A8.7010005@anu.edu.au> <4CEF1F5D.1020406@anu.edu.au>
Message-ID: <4CF2EC7B.3070102@anu.edu.au>

Thanks Ashley,

Yes, I have solved this problem by changing the group for "readlink".

Kind Regards,
Jie

--
Jie Cai                          Jie.Cai at anu.edu.au
ANU Supercomputer Facility       NCI National Facility
Leonard Huxley, Mills Road       Ph: +61 2 6125 7965
Australian National University   Fax: +61 2 6125 8199
Canberra, ACT 0200, Australia    http://nf.nci.org.au
-----------------------------------------------------

On 27/11/10 07:39, Ashley Pittman wrote:
> Sorry about the last email, I clicked the wrong button! See real response below.
>
> On 26 Nov 2010, at 02:45, Jie Cai wrote:
>
>>>
>> I found out that pid_to_name() does not pick the correct script process. I have following debug information:
>>
>> inner: x153: rmpid is 4376, name is lu.sh.
>> inner: x153: notscripts pid is 4376.
>>
>> I guess this is due to find_from_status() does not give the correct script name. I think it might be better to check /proc/$pid/cmdline to find the first element and to compare them with preset scripts keys.
>>
> The lookup is done on the target of the "/proc/<pid>/exe" link if this is readable and only falls back to find_from_status() if it isn't (normally for SUID or other users' processes) so if you have a shell script it should pick up /bin/sh which is then shortened to sh and looked up in the preset keys. This is independent of the name of the wrapper script, I'd expect looking for ".sh" extensions on wrapper scripts to only catch a tiny minority of scripts which should all be caught by the previous mechanism anyway.
>
> It could be a permissions issue with the groups again that you are not able to read the destination of this sym-link and hence it's reverting back to the find_from_status() which doesn't do the indirection for interpreted languages?
>
> Ashley.
>
>