From thipadin.seng-long at bull.net Wed Jan 6 15:50:39 2010
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Wed, 6 Jan 2010 16:50:39 +0100
Subject: [padb] Réf. : Re: Réf. : Re: Réf. : Better handling of threads in stack traces.
Message-ID:

On Mon Dec 21, 2009 at 10:03 PM Ashley Pittman wrote:
> I've done the simple thing with threads for now, added a --list-threads
> option to show detected threads and an optional thread-id configuration
> option for -x to restrict which thread is reported on. Let me know how
> you get on with this and if it's an acceptable half-way house for your
> customer, anything more complex is going to require changes to the padb
> core which we can look at after 3.1 is out.

Sorry for the late reply.
In which release can I get what you've done above?

Thipadin.

From ashley at pittman.co.uk Wed Jan 6 18:43:51 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 06 Jan 2010 18:43:51 +0000
Subject: [padb] Réf. : Re: Réf. : Re: Réf. : Better handling of threads in stack traces.
In-Reply-To:
References:
Message-ID: <1262803431.3392.10.camel@alpha>

On Wed, 2010-01-06 at 16:50 +0100, thipadin.seng-long at bull.net wrote:
> Sorry for the late reply.
> In which release can I get what you've done above?

You want at least r367 for this. I changed the build to use autoconf
over xmas so it'll be a little different to what you are used to but
should be a step forward.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From padb at googlecode.com Sun Jan 10 14:58:28 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Sun, 10 Jan 2010 14:58:28 +0000
Subject: [padb] r382 committed - Add mpiexec to the list of resource manager process names for use...
Message-ID: <0016362841e44f62d0047cd0a78f@google.com>

Revision: 382
Author: apittman
Date: Sun Jan 10 06:58:11 2010
Log: Add mpiexec to the list of resource manager process names for use in
the "mpirun" resource manager code. This is because Hydra (MPICH2) now
supports the MPIR_proctable interface.

http://code.google.com/p/padb/source/detail?r=382

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Tue Dec 22 13:31:11 2009
+++ /trunk/src/padb Sun Jan 10 06:58:11 2010
@@ -537,7 +537,7 @@
 $conf{interval} = '10s';
 $conf{watch_clears_screen} = 'enabled';
 $conf{scripts} = 'bash,sh,dash,ash,perl,xterm';
-$conf{mpirun} = 'mpirun,orterun,srun,mpdrun,prun';
+$conf{mpirun} = 'mpirun,orterun,srun,mpdrun,prun,mpiexec';
 $conf{lsf_job_offset} = 1;
 $conf{local_fd_name} = '/dev/null';
 $conf{inner_callback} = 'disabled';

From padb at googlecode.com Sun Jan 10 15:02:31 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Sun, 10 Jan 2010 15:02:31 +0000
Subject: [padb] r383 committed - Fix a typo in the help output from padb. A couple of the...
Message-ID: <001636ed6c2ec25d38047cd0b5b9@google.com>

Revision: 383
Author: apittman
Date: Sun Jan 10 06:59:28 2010
Log: Fix a typo in the help output from padb. A couple of the options were
displaying wrongly :(

http://code.google.com/p/padb/source/detail?r=383

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Sun Jan 10 06:58:11 2010
+++ /trunk/src/padb Sun Jan 10 06:59:28 2010
@@ -741,8 +741,8 @@
 Stack trace options:
  gdb-retry-count Number of times to try getting a 'good' stack trace from gdb.
- stack-show-params Show function parameters in stack traces.
- stack-show-locals Show locals in stack traces.
+ stack-shows-params Show function parameters in stack traces.
+ stack-shows-locals Show locals in stack traces.
 
 Statistics options:
  stats-short Turn on "one process per line" stats reporting code.

From padb at googlecode.com Sun Jan 10 15:06:41 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Sun, 10 Jan 2010 15:06:41 +0000
Subject: [padb] r384 committed - Allow a value of 'localhost' in the MPIR_proctable interface....
Message-ID: <001636b2b0caa72e86047cd0c42e@google.com>

Revision: 384
Author: apittman
Date: Sun Jan 10 07:02:48 2010
Log: Allow a value of 'localhost' in the MPIR_proctable interface.

I'd prefer it if this entry used hostnames but it's also allowed to include
resolvable names which may not actually be hostnames. This probably means
there are cases where padb doesn't work, this commit adds in a hack for
localhost which is probably the most common of these cases.

http://code.google.com/p/padb/source/detail?r=384

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Sun Jan 10 06:59:28 2010
+++ /trunk/src/padb Sun Jan 10 07:02:48 2010
@@ -3108,6 +3108,13 @@
         my $struct_base = $base + ( $table_size * $proc );
         my $hostp = gdb_read_pointer( $gdb, $struct_base + $host_offset );
         my $host = gdb_string( $gdb, 1024, $hostp );
+
+        # Ideally this won't happen but it can and does, for example if
+        # the user supplies a hostsfile with localhost specified the
+        # resource manager can leave this value unmodified.
+        if ( defined $host and $host eq 'localhost' ) {
+            $host = hostname();
+        }
         my $pid = gdb_read_int( $gdb, $struct_base + $pid_offset );
 
         if ( defined $host and defined $pid ) {

From padb at googlecode.com Sun Jan 10 15:10:53 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Sun, 10 Jan 2010 15:10:53 +0000
Subject: [padb] r385 committed - Change the default setting for showing values of locals in stack trace...
Message-ID: <0016e6475bbaaf4169047cd0d34f@google.com>

Revision: 385
Author: apittman
Date: Sun Jan 10 07:09:49 2010
Log: Change the default setting for showing values of locals in stack traces.

Change the default to not show values and add an over-ride for full-report
mode to enable this. Showing variables in this way is really useful and is
a big step forward, it does however greatly increase the output and make it
harder to spot patterns in the stack traces themselves so disable it by
default and leave it up to the user to enable it on a per-run basis if they
want.

http://code.google.com/p/padb/source/detail?r=385

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Sun Jan 10 07:02:48 2010
+++ /trunk/src/padb Sun Jan 10 07:09:49 2010
@@ -3109,10 +3109,10 @@
         my $hostp = gdb_read_pointer( $gdb, $struct_base + $host_offset );
         my $host = gdb_string( $gdb, 1024, $hostp );
 
-        # Ideally this won't happen but it can and does, for example if
-        # the user supplies a hostsfile with localhost specified the
-        # resource manager can leave this value unmodified.
-        if ( defined $host and $host eq 'localhost' ) {
+        # Ideally this won't happen but it can and does, for example if
+        # the user supplies a hostsfile with localhost specified the
+        # resource manager can leave this value unmodified.
+        if ( defined $host and $host eq 'localhost' ) {
             $host = hostname();
         }
 
@@ -5339,7 +5339,9 @@
     push_command('deadlock');
 
     my $c = $conf{mode_options}{stack};
-    $c->{strip_above_wait} = 0;
+    $c->{strip_above_wait} = 0;
+    $c->{stack_shows_params} = 1;
+    $c->{stack_shows_locals} = 1;
     push_command( 'stack', 'tree', $c );
 
     go_job($full_report);
@@ -9544,8 +9546,8 @@
             thread_id => undef,
         },
         options_bool => {
-            stack_shows_params => 'yes',
-            stack_shows_locals => 'yes',
+            stack_shows_params => 'no',
+            stack_shows_locals => 'no',
         },
         secondary => [
             {

From padb at googlecode.com Sun Jan 10 15:18:26 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Sun, 10 Jan 2010 15:18:26 +0000
Subject: [padb] r386 committed - If possible verify the type information for MPIR_proctable...
Message-ID: <0016e68dec06ac2b79047cd0ee23@google.com>

Revision: 386
Author: apittman
Date: Sun Jan 10 07:17:18 2010
Log: If possible verify the type information for MPIR_proctable
before we read it, if it's readable and not as we expect give the
user a warning and try to carry on.

This adds a gdb_load_type() function which will no doubt become used
elsewhere over time.

http://code.google.com/p/padb/source/detail?r=386

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Sun Jan 10 07:09:49 2010
+++ /trunk/src/padb Sun Jan 10 07:17:18 2010
@@ -3086,6 +3086,21 @@
     # I've left the old code here for now as I suspect this is going to be
     # something that causes trouble in the future.
+
+    # If the datatype is readable then verify it's as expected, no
+    # issue if we can't read it however, then we just have to trust
+    # the resource manager.
+    my $proctable_type = gdb_load_type( $gdb, 'MPIR_proctable' );
+
+    if ( defined $proctable_type ) {
+        if ( defined $proctable_type->{pid}
+            and $proctable_type->{pid} ne 'int' )
+        {
+            print
+"MPIR_proctable.pid is of wrong type: \'$proctable_type->{pid}\'.";
+            print " Attempting to continue anyway...\n";
+        }
+    }
 
     if (1) {
         my $word_size = gdb_type_size( $gdb, 'void *' );
@@ -6228,6 +6243,35 @@
     }
     return;
 }
+
+# For a given type load it's type! Takes a type name and returns (if
+# possible) a hash representing the underlying type in the target.
+# Works best if calls on structs where it returns the names and types
+# if the struct entries.
+sub gdb_load_type {
+    my ( $gdb, $type ) = @_;
+
+    my %t = gdb_n_send( $gdb, "-var-create - * $type" );
+
+    return unless ( defined $t{status} and $t{status} eq 'done' );
+
+    my $reason = gdb_parse_reason( $t{reason} );
+
+    my %v = gdb_n_send( $gdb, "-var-list-children 1 $reason->{name}" );
+    return unless ( defined $v{status} and $v{status} eq 'done' );
+
+    my $z = gdb_parse_reason( $v{reason} );
+
+    my %type;
+
+    foreach my $c ( @{ $z->{children} } ) {
+        my $child = $c->{child};
+
+        $type{ $child->{exp} } = $child->{type};
+    }
+
+    return \%type;
+}
 
 sub minfo_handle_query {
     my ( $gdb, $vp, $query, $stats ) = @_;

From thipadin.seng-long at bull.net Thu Jan 21 14:20:21 2010
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Thu, 21 Jan 2010 15:20:21 +0100
Subject: [padb] Réf. : Re: Réf. : Bull changes ( with LSF-mpich2 wrapper patch )
Message-ID:

On Wed, 2010-01-06 at 19:53 +0100, Ashley Pittman wrote:
> On Wed, 2010-01-06 at 16:44 +0100, thipadin.seng-long at bull.net wrote:
>> Yes, I have two on the queue, this one supports lsf-mpich2 wrapper.
>> I have another one in the queue that supports lsf with
>> openmpi-wrapper
>> If you can deal with both tell me I'll send the patch.

> Can you send both patches separately to the developers list and I'll
> look at them when I'm home next week. If you can send them as diffs
> against the svn trunk it's easiest for me, i.e. make the changes and
> then run 'svn diff' and attach the output to an email. I tend to have
> three or four copies checked out with different un-committed work in
> each.

I get back to you after a short break, as I've been doing some validation
on an openmpi spawn functionality.
Now I've finished what you've asked me above, I am just sending both patches.
One for lsf-mpich2 wrapper, and the other one with lsf-openmpi wrapper.
I did it against r386 version.
Both are alike and have many common sub routines.
As the patches are separated some routines are in both patches.
I prefer you integrate once as you can factorize.
If you need some 'ps' or 'bjobs' command layouts to understand the coding,
please ask, I'll send you.

Regards.
Thipadin.

Ashley Pittman
01/06/2010 07:53 PM

To: thipadin.seng-long at bull.net
cc: florence.vallee at bull.net, Sylvain Jeaugey, Andry.Razafinjatovo at bull.net
Subject: Re: Réf. : Bull changes ( with LSF-mpich2 wrapper patch )

On Wed, 2010-01-06 at 16:44 +0100, thipadin.seng-long at bull.net wrote:
> Yes, I have two on the queue, this one supports lsf-mpich2 wrapper.
> I have another one in the queue that supports lsf with
> openmpi-wrapper
> If you can deal with both tell me I'll send the patch.

Can you send both patches separately to the developers list and I'll
look at them when I'm home next week. If you can send them as diffs
against the svn trunk it's easiest for me, i.e. make the changes and
then run 'svn diff' and attach the output to an email. I tend to have
three or four copies checked out with different un-committed work in
each.

> I would like to include these 2 patches in your next major release,
> so we can use your version in our official delivery and
> forget what I've done in version 263 in the early release.

That sounds ideal, I'm mostly happy with the code now and would like to
make a release as a lot has changed since the last one but let's finish
syncing first.

> > Finally, if you've got access to a 2,000 process job would you mind
> > running some elementary benchmarks for me and sending me the results,
> > I've not got access to that scale myself and most of the people I deal
> > with are shy about giving out results.
>
> Our customer has this, we don't have access to their cluster.
> By March 2010 we could get access to a cluster with more than 1000
> cores.
> So when we have this cluster we will test and send you results.

That would be great.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lsf-mpich2.patch
Type: application/octet-stream
Size: 8834 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lsf-openmpi.patch
Type: application/octet-stream
Size: 11683 bytes
Desc: not available
URL:

From ashley at pittman.co.uk Wed Jan 27 17:30:15 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 27 Jan 2010 17:30:15 +0000
Subject: [padb] Réf. : Re: Réf. : Bull changes ( with LSF-mpich2 wrapper patch )
In-Reply-To:
References:
Message-ID: <1ECF3207-FF56-438D-843C-678C673FDA41@pittman.co.uk>

On 21 Jan 2010, at 14:20, thipadin.seng-long at bull.net wrote:
>
> I get back to you after a short break, as I've been doing some validation
> on an openmpi spawn functionality.
> Now I've finished what you've asked me above, I am just sending both patches.
> One for lsf-mpich2 wrapper, and the other one with lsf-openmpi wrapper.
> I did it against r386 version.
> Both are alike and have many common sub routines. As the patches are
> separated some routines are in both patches. I prefer you integrate once
> as you can factorize.
> If you need some 'ps' or 'bjobs' command layouts to understand the coding,
> please ask, I'll send you.

As they are dependent on each other could you send them as a single,
combined patch please.

I don't have systems I can test this on as I don't have lsf but I would
like to understand the code, could you put together a paragraph for each
rmgr describing how the underlying resource manager lays out processes and
how padb finds its information. I'm particularly interested in why it has
to ssh around to different nodes to see the information it needs.

With the ps command you can prevent the printing of headers by using the
option "-o pid=,ppid=,cmd=" which will avoid the special case for removing
these later on.

Stripping the leading spaces from ps output is already done in
get_extended_process_list(), can you use the same regexp in
get_line_ppid() for clarity please.

I'm not sure that your loop over @chaps in lsfmpich2wr_get_mpiproc() is
correct, should the if ($found_app != 0) test be outside of the main loop?
Again a comment explaining what the code is trying to extract would be
useful here.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
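
The ps suggestion in the message above can be illustrated with a small shell sketch. This is not padb code: the function names list_procs, strip_leading_space and ppid_from_list are hypothetical, chosen only for this example. It assumes a ps that accepts POSIX-style "-o name=" columns. The message writes the option as one comma-separated string with "cmd"; the sketch uses separate -o options and the POSIX column name "args", since some ps implementations may read "pid=,ppid=" as a single column whose header is ",ppid=".

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of padb.

# List every process as "pid ppid command" with no header row.
# Giving every column an empty header (the trailing '=') suppresses
# the header line, so no later special case is needed to skip it.
list_procs() {
    ps -e -o pid= -o ppid= -o args=
}

# ps right-aligns numeric columns, so lines can start with spaces.
# Strip them so that every line begins with the pid.
strip_leading_space() {
    sed 's/^[[:space:]]*//'
}

# Read "pid ppid command" lines on stdin and print the parent pid of
# the process whose pid is $1. Prints nothing if the pid is not listed.
ppid_from_list() {
    strip_leading_space | awk -v pid="$1" '$1 == pid { print $2; exit }'
}
```

With these definitions, `list_procs | ppid_from_list "$$"` would print the parent pid of the current shell. Keeping the whitespace stripping in one helper mirrors the request in the message to use the same regexp in both places.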