From ashley at pittman.co.uk Mon Nov 1 19:57:33 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Mon, 1 Nov 2010 19:57:33 +0000 Subject: [padb] Upcoming release. Message-ID: All, I'd like to make a formal release in the coming weeks based on the current SVN code; the 3.2 beta has been through an extended testing period and I'm happy that it's ready to move to formal release status. On this basis I propose making a 3.3 release in the next two weeks, probably on Monday the 8th. Please test the latest 3.2 beta or trunk and let me know of any problems you have; unless any new issues are reported by the 5th I'll go ahead as planned. Ashley Pittman. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From padb at googlecode.com Sat Nov 6 00:03:55 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Sat, 06 Nov 2010 00:03:55 +0000 Subject: [padb] r407 committed - Add a "launch_mode" to control how the backend processes are launched.... Message-ID: <0015175cb53a7fad2904945720f5@google.com> Revision: 407 Author: apittman at gmail.com Date: Fri Nov 5 17:03:08 2010 Log: Add a "launch_mode" to control how the backend processes are launched. The four options are local, ssh, pdsh and rmgr and are set as a comma-separated list which is walked until one that is able to launch the job is found. local: launch the inner process locally. Only works for single node jobs on the local host. ssh: launch the inner process using ssh. Only works for single node jobs. pdsh: launch using pdsh. rmgr: launch using the selected resource manager. The old behaviour was equivalent to a setting of "rmgr,local,pdsh", the new default is "local,rmgr,ssh,pdsh". 
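The fallback walk described in this log can be sketched roughly as follows. This is a standalone illustration, not the committed code: the function name `pick_launch_mode` and its arguments are hypothetical, and treating "rmgr" as usable only when the resource manager can actually launch, and "pdsh" only when it is installed, are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of padb: return the first mode in the
# comma-separated list that is able to launch the job, or undef if none is.
sub pick_launch_mode {
    my ( $mode_list, $hosts, $local_host, $rmgr_can_launch, $have_pdsh ) = @_;

    foreach my $mode ( split /,/, $mode_list ) {
        if ( $mode eq 'local' ) {

            # Only a single-node job on the local host can run locally.
            return $mode if @{$hosts} == 1 and $hosts->[0] eq $local_host;
        } elsif ( $mode eq 'ssh' ) {

            # ssh is limited to single-node jobs.
            return $mode if @{$hosts} == 1;
        } elsif ( $mode eq 'pdsh' ) {

            # Assumption: pdsh needs to be installed and needs a host list.
            return $mode if $have_pdsh and @{$hosts} > 0;
        } elsif ( $mode eq 'rmgr' ) {

            # Assumption: the caller tells us whether rmgr can launch.
            return $mode if $rmgr_can_launch;
        }
    }
    return;
}
```

Calling `pick_launch_mode( 'local,rmgr,ssh,pdsh', ['node7'], 'head', 0, 1 )` walks past "local" and "rmgr" and settles on "ssh", because the job has a single remote host.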
http://code.google.com/p/padb/source/detail?r=407 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Sun Oct 31 14:00:13 2010 +++ /trunk/src/padb Fri Nov 5 17:03:08 2010 @@ -575,6 +575,8 @@ $conf{tree_width} = '4'; +$conf{launch_mode} = 'local,rmgr,ssh,pdsh'; + # Config options which take boolean values. my @conf_bool = qw(watch_clears_screen inner_callback); @@ -3711,30 +3713,66 @@ # Otherwise call the more flexible setup_job function. my %pcmd = $rmgr{ $conf{rmgr} }{setup_job}($job); - # If the resource manager interface is able to give a hostlist but - # not able or willing to launch a shadow job natively then use - # pdsh to launch the inner processes. This allows us to be less - # dependant on the resource manager and work in a wider variety of - # cases. Using pdsh like this limits us to 32 hosts (More if we - # set the FANOUT pdsh environment variable) so perhaps a better - # way can be found in the future. - if ( defined $pcmd{host_list} and not defined $pcmd{command} ) { - - my @hosts = @{ $pcmd{host_list} }; - if ( $hosts[0] ne hostname() or @hosts > 1 ) { - - if ( not find_exe('pdsh') ) { - print -"$conf{rmgr} resource manager on multiple or remote hosts requires pdsh to be installed\n"; - return; - } +# Now we have a either a command capable of launching using the selected resource +# manager, a list of hosts or both. At this point we can pick the best way to +# launch the job by walkin the list given in the configuration until we find one +# that works. Note this allows users to prevent the use of the resource manager +# to launch shadow jobs and also to force the use of pdsh. + + my $mode_list = $conf{launch_mode}; + +# The other three launchers require a host list so in the absence of one force it +# to use rmgr. 
+ if ( not defined $pcmd{host_list} ) { + $mode_list = 'rmgr'; + } + + my @modes = split $COMMA, $mode_list; + + my @hosts = @{ $pcmd{host_list} }; + + my $have_pdsh = find_exe('pdsh'); + + foreach my $mode (@modes) { + if ( $mode eq 'local' ) { + my $hc = @hosts; + my $h = hostname(); + if ( @hosts == 1 and $hosts[0] eq hostname() ) { + $pcmd{command} = ''; + return %pcmd; + } + } elsif ( $mode eq 'rmgr' ) { + return %pcmd; + } elsif ( $mode eq 'ssh' ) { + if ( @hosts == 1 ) { + $pcmd{command} = "ssh $hosts[0]"; + return %pcmd; + } + } elsif ( $mode eq 'pdsh' and $have_pdsh ) { $pcmd{require_inner_callback} = 1; my $hlist = join q{,}, @hosts; $pcmd{command} = "pdsh -w $hlist"; + + if ( @hosts > 20 ) { + my $fanout = @hosts + 5; + $ENV{FANOUT} = $fanout; + + if ( @hosts > 128 ) { + print "Pdsh backend not recommended for such large jobs\n"; + } + } + + return %pcmd; + + } else { + print "Backend invalid: $mode\n"; } } - return %pcmd; + + print "No suitable backend found (perhaps try installing pdsh?)!\n"; + return; + } ############################################################################### From padb at googlecode.com Sat Nov 6 21:17:04 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Sat, 06 Nov 2010 21:17:04 +0000 Subject: [padb] r408 committed - Call global_detach with the right options in watch mode.... Message-ID: <0015175ce0c6a84cb7049468e91d@google.com> Revision: 408 Author: apittman at gmail.com Date: Sat Nov 6 14:16:22 2010 Log: Call global_detach with the right options in watch mode. Without this fix "watch" was crashing for some modes. http://code.google.com/p/padb/source/detail?r=408 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Fri Nov 5 17:03:08 2010 +++ /trunk/src/padb Sat Nov 6 14:16:22 2010 @@ -9878,7 +9878,7 @@ # Detach from all processes if the outer requested us to. 
if ( defined $cmd->{detach_after_callback} ) { - global_detach( $cmd->{mode}, $pid_list ); + global_detach($pid_list); } return; From padb at googlecode.com Sat Nov 6 23:16:50 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Sat, 06 Nov 2010 23:16:50 +0000 Subject: [padb] r409 committed - Add a timeout to the inner loop to allow the code to exit silently ... Message-ID: <20cf300fb1a9fb9a2604946a955d@google.com> Revision: 409 Author: apittman at gmail.com Date: Sat Nov 6 16:16:23 2010 Log: Add a timeout to the inner loop to allow the code to exit silently if the outer process is killed for any reason. Pass through "interval" from the outer process to the inner processes so they can set an appropriate value for this timeout; do this on the command line so it can be set once at startup rather than at signon. http://code.google.com/p/padb/source/detail?r=409 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Sat Nov 6 14:16:22 2010 +++ /trunk/src/padb Sat Nov 6 16:16:23 2010 @@ -373,9 +373,9 @@ my @inner_conf = qw(edb edbopt rmgr scripts slurm_job_step pbs_server lsf_mode lsfmpi_server lsfmpi_mpirpid lsfmpi_port); -# More config options the inner knows about, these are forwarded on the +# More options the inner knows about, these are forwarded on the # command line rather than over the sockets. 
-my @inner_conf_cmd = qw(port_range outer); +my @inner_conf_cmd = qw(port_range outer interval); ############################################################################### # @@ -5697,11 +5697,11 @@ } foreach my $co (@conf_bool) { - $conf{$co} = check_and_convert_bool( $conf{$co} ); + config_set_internal( $co, check_and_convert_bool( $conf{$co} ) ); } foreach my $co (@conf_time) { - $conf{$co} = check_and_convert_time( $conf{$co} ); + config_set_internal( $co, check_and_convert_time( $conf{$co} ) ); } foreach my $co (@conf_int) { @@ -9984,6 +9984,8 @@ my $hostname = $inner_conf{hostname}; my $key = rand; + my $outer_timeout = $inner_conf{interval} * 2; + if ( defined $outerloc ) { my ( $ohost, $oport ) = split $COLON, $outerloc; my $os = IO::Socket::INET->new( @@ -10019,8 +10021,14 @@ my $stime = time; + # "Last seen time" of another process. This is the time we last had any + # communication from the outer, if it becomes too far in the past then + # we should probably exit. + my $ltime = $stime; + while ( $sel->count() > 0 ) { while ( my @data = $sel->can_read(5) ) { + $ltime = time; foreach my $s (@data) { if ( $s == $server ) { my $new = $server->accept() or confess('Failed accept'); @@ -10071,6 +10079,16 @@ if ( ( $sel->count() == 1 ) and ( ( $time - $stime ) > 30 ) ) { exit 0; } + + # If we are (were) connected but haven't heard anything for a while then + # the outer process has likely died so we should also exit cleanly. + # There doesn't seem to be another way to detect this so just abort + # if we haven't heard anything for a while. This value needs to be + # greater than the maximum reasonable value for 'interval' in the + # outer process. 
+ if ( ( $time - $ltime ) > $outer_timeout ) { + exit 0; + } } my $count = $sel->count(); print "Thats not supposed to happen count=($count)\n"; From padb at googlecode.com Sat Nov 6 23:22:51 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Sat, 06 Nov 2010 23:22:51 +0000 Subject: [padb] r410 committed - Increase the inner timeout to be 2N+10 to allow for correct... Message-ID: <20cf300faa8d82e37404946aab31@google.com> Revision: 410 Author: apittman at gmail.com Date: Sat Nov 6 16:21:44 2010 Log: Increase the inner timeout to be 2N+10 to allow for correct operation for small values of N. http://code.google.com/p/padb/source/detail?r=410 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Sat Nov 6 16:16:23 2010 +++ /trunk/src/padb Sat Nov 6 16:21:44 2010 @@ -9984,7 +9984,7 @@ my $hostname = $inner_conf{hostname}; my $key = rand; - my $outer_timeout = $inner_conf{interval} * 2; + my $outer_timeout = ( $inner_conf{interval} * 2 ) + 10; if ( defined $outerloc ) { my ( $ohost, $oport ) = split $COLON, $outerloc; From Duncan.Thomas at vega.co.uk Mon Nov 8 17:57:42 2010 From: Duncan.Thomas at vega.co.uk (Duncan Thomas) Date: Mon, 8 Nov 2010 17:57:42 +0000 Subject: [padb] edb build fixes Message-ID: <7DB8F4B6C722CD479E0EAB7B9AE421DF042E351A@mercury.vegagroup.net> Hi The below patch enable edb to build in your tree. The code for edb otherwise matches the qsnet release. I'll send some more patches as I get qsnet functionality tested and fixed up. 
Index: ptrace.c =================================================================== --- ptrace.c (revision 410) +++ ptrace.c (working copy) @@ -7,7 +7,7 @@ /* /cvs/master/quadrics/elan4lib/edb/ptrace.c,v */ -#include +#include "edb.h" #define MACRO_BEGIN { #define MACRO_END } Index: edb.h =================================================================== --- edb.h (revision 410) +++ edb.h (working copy) @@ -140,7 +140,7 @@ extern int dump_stats_eagle(void *pages, size_t size, size_t pagesize); /* New, can do lots of things with it */ -#include +#include "sf.h" /* elf.c */ extern void fetch_data_dead (char *cname, char *ename, int trap_dump); Index: stats_falcon.c =================================================================== --- stats_falcon.c (revision 410) +++ stats_falcon.c (working copy) @@ -11,7 +11,7 @@ #include #include -#include +#include "edb.h" #include Index: xml.c =================================================================== --- xml.c (revision 410) +++ xml.c (working copy) @@ -5,7 +5,7 @@ #ident "xml.c,v 1.3 2005/02/03 15:26:08 ashley Exp" /* /cvs/master/quadrics/elan4lib/edb/xml.c,v */ -#include +#include "edb.h" /*********************************************************** * * Index: elfN.c =================================================================== --- elfN.c (revision 410) +++ elfN.c (working copy) @@ -27,7 +27,7 @@ #include -#include +#include "edb.h" #include Index: elf.c =================================================================== --- elf.c (revision 410) +++ elf.c (working copy) @@ -10,7 +10,7 @@ #include #include -#include +#include "edb.h" #include #include #include Index: Makefile =================================================================== --- Makefile (revision 410) +++ Makefile (working copy) @@ -1,18 +1,20 @@ - - - %.o: %.c edb.h cc -c -o $@ -pthread $< -edb: edb.o - cc -o edb edb.o xml.o +edb: edb.o xml.o ptrace.o stats_eagle.o stats_falcon.o parallel.o elf.o elf64.o elf32.o + cc -o edb edb.o xml.o 
ptrace.o stats_eagle.o stats_falcon.o parallel.o elf.o elf64.o elf32.o -lelan -lpthread xml.o: xml.c cc -c xml.c -o xml.o -pthread elf32.c: elfN.c - sed s/TSIZE/32/g edb/elfN.c > $@ + sed s/TSIZE/32/g elfN.c > $@ elf64.c: elfN.c - sed s/TSIZE/64/g edb/elfN.c > $@ + sed s/TSIZE/64/g elfN.c > $@ +tags: elf32.c elf64.c *.c *.h + ctags *.c *.h + +clean: + rm -f *.o edb elf32.c elf64.c tags -- Duncan Thomas HPC Consultant VEGA Consulting Services Ltd 360 Bristol Business Park Coldharbour Lane Bristol BS16 1EJ United Kingdom Tel : +44 (0)117 988 0033 Mob : +44 (0)7968 111 883 Fax : +44 (0)117 988 0034 Email : Duncan.Thomas at vega.co.uk Web : www.vega.co.uk Registered company details: VEGA Consulting Services Ltd, 2 Falcon Way, Shire Park, Welwyn Garden City, AL7 1TW, Registered in England, Number - 1393778 Notice of Confidentiality This transmission is intended for the named addressee only. It contains information which may be confidential and which may also be privileged. Unless you are the named addressee (or authorised to receive it for the addressee) you may not copy or use it, or disclose it to anyone else. If you have received this transmission in error please notify the sender immediately. . -------------- next part -------------- An HTML attachment was scrubbed... URL: From padb at googlecode.com Mon Nov 8 18:14:28 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 08 Nov 2010 18:14:28 +0000 Subject: [padb] r411 committed - Add "cmdline" as an option for the proc-summary data. Differs from... Message-ID: <0022150473df4afa2004948e98a2@google.com> Revision: 411 Author: apittman at gmail.com Date: Mon Nov 8 10:13:19 2010 Log: Add "cmdline" as an option for the proc-summary data. Differs from command in that it shows command line options as well. 
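On Linux the file /proc/<pid>/cmdline holds the process arguments separated by NUL bytes, so showing the full command line mostly comes down to splitting on "\0". A minimal sketch, assuming the raw file contents have already been read into a string; the helper name `join_cmdline` is hypothetical and not taken from padb:

```perl
use strict;
use warnings;

# Hypothetical helper: turn the raw contents of a /proc/<pid>/cmdline file
# into one space-separated string.
sub join_cmdline {
    my ($raw) = @_;

    # Arguments are NUL-separated; split drops the trailing empty field
    # left by the final NUL.
    my @args = split /\0/, $raw;
    return join q{ }, @args;
}
```

Note that joining with spaces loses the distinction between one argument containing a space and two separate arguments, which is usually acceptable for a summary display.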
http://code.google.com/p/padb/source/detail?r=411 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Sat Nov 6 16:21:44 2010 +++ /trunk/src/padb Mon Nov 8 10:13:19 2010 @@ -8017,6 +8017,20 @@ } return; } + +sub show_task_cmdline { + my ( $vp, $dir ) = @_; + + local $/ = "\0"; + open my $FD, '<', "$dir/cmdline" or return; + my @args; + while (<$FD>) { + chomp; + push @args, $_; + } + proc_output( $vp, 'cmdline', "@args" ); + return; +} sub show_task_dir { my ( $carg, $vp, $pid, $dir ) = @_; @@ -8030,6 +8044,9 @@ show_task_file( $vp, "$dir/status" ); show_task_file( $vp, "$dir/wchan", 'wchan' ); show_task_file( $vp, "$dir/stat", 'stat' ); + + show_task_cmdline( $vp, $dir ); + if ( $carg->{proc_shows_stat} ) { show_task_stat_file( $vp, "$dir/stat" ); } From padb at googlecode.com Mon Nov 8 18:18:32 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 08 Nov 2010 18:18:32 +0000 Subject: [padb] r412 committed - When sorting entries for the proc-summary report check if all... Message-ID: <20cf3005df0cde1ba204948ea66b@google.com> Revision: 412 Author: apittman at gmail.com Date: Mon Nov 8 10:16:16 2010 Log: When sorting entries for the proc-summary report check if all values are numeric, if so do a numeric sort, if they aren't then do a text sort. 
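The numeric-or-text decision in this log can be illustrated with a small standalone sketch. The function name `sort_rows` and its argument order are hypothetical; the rule is the one the log states: sort numerically only when every value is made up purely of digits, otherwise fall back to a string comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: sort hash refs by $key, numerically when every value
# is an unsigned integer and as text otherwise.
sub sort_rows {
    my ( $key, $reverse, @rows ) = @_;

    # A single non-numeric value switches the whole sort to text mode.
    my $numeric = 1;
    foreach my $row (@rows) {
        if ( $row->{$key} !~ m{\A\d+\z} ) {
            $numeric = 0;
            last;
        }
    }

    my @sorted;
    if ($numeric) {
        @sorted = sort { $a->{$key} <=> $b->{$key} } @rows;
    } else {
        @sorted = sort { $a->{$key} cmp $b->{$key} } @rows;
    }

    if ($reverse) {
        @sorted = reverse @sorted;
    }
    return @sorted;
}
```

With values 10, 9 and 100 the numeric branch is taken and the order is 9, 10, 100; a text sort of the same values would put 10 and 100 before 9, which is the behaviour the check is there to avoid.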
http://code.google.com/p/padb/source/detail?r=412 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Nov 8 10:13:19 2010 +++ /trunk/src/padb Mon Nov 8 10:16:16 2010 @@ -3834,11 +3834,29 @@ my $carg = shift; my $key = shift; my @all = (@_); + + my $numeric = 1; + + foreach my $a (@all) { + if ( not $a->{$key} =~ m{\A\d+\z} ) { + $numeric = 0; + last; + } + } + + my @sorted; + + if ($numeric) { + @sorted = sort { $a->{$key} <=> $b->{$key} } @all; + } else { + @sorted = sort { $a->{$key} cmp $b->{$key} } @all; + + } if ( $carg->{reverse_sort_order} ) { - return ( reverse sort { $a->{$key} <=> $b->{$key} } @all ); + return ( reverse @sorted ); } else { - return ( sort { $a->{$key} <=> $b->{$key} } @all ); + return (@sorted); } } From padb at googlecode.com Mon Nov 8 18:23:36 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 08 Nov 2010 18:23:36 +0000 Subject: [padb] r413 committed - Don't supply the -w option to qstat. Torque and pbs (not pro) should... Message-ID: <20cf300fb42dfe5b1b04948eb8fe@google.com> Revision: 413 Author: apittman at gmail.com Date: Mon Nov 8 10:23:03 2010 Log: Don't supply the -w option to qstat. Torque and pbs (not pro) should work with this commit, rather than matching %d.$server we now just match %d and verify $server if possible. This prevents problems with the 15 character limit of pbs output truncating the server name and causing jobs not to be recognised. http://code.google.com/p/padb/source/detail?r=413 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Nov 8 10:16:16 2010 +++ /trunk/src/padb Mon Nov 8 10:23:03 2010 @@ -2763,34 +2763,53 @@ # for each one. 
sub pbs_get_lqsub { my ( $user, $server ) = @_; - my $job; - my $cmd = "qstat -w -n -u $user \@$server"; + my $cmd = "qstat -n -u $user \@$server"; + + my $job = undef; my @output = slurp_cmd($cmd); - foreach (@output) { - if (/\d+\.$server/i) { - $_ =~ s/^ +//; # suppress leading space(for sure) - my @champs = split(/\s+/); # split by space - if ( $champs[9] eq 'R' ) { # take only Running - ($job) = split qr{\.}, $champs[0]; - $pbs_tabjobs{$job}{nproc} = $champs[6]; - } else { - $job = undef; - } - } elsif ( defined $job ) { - $_ =~ s/^ +//; # suppress blank in front of line - $_ =~ s/^\+//; # suppress first + sign - my @champs = split(/\+/); # split by '+' - if ( defined $pbs_tabjobs{$job}{server} ) { - printf("Warning, job $job exists on multiple servers\n"); + foreach my $line (@output) { + +# If we have previously matched a job (see below) then extract the hostlist. +# This line of outpuas has the form: +# " xn8/0*2+xn9/0*2+xn10/0*2" +# all we care about from this is the hostname (xn[8-10]) so split on '+' and then +# strip everything after the first '/' + if ( defined $job ) { + + $line =~ s/^ +//; # suppress blank in front of line + $line =~ s/^\+//; # suppress first + sign + my @champs = split( /\+/, $line ); # split by '+' + foreach my $word (@champs) { + my ($host) = split "/", $word; + push( @{ $pbs_tabjobs{$job}{hosts} }, $host ); + } + $job = undef; + next; + } + +# See if this line of output matches a job id, it has to be of the correct user and running. +# If this is a running job set $job to it's identifier so the code above can match the hostlist +# which will be on the next line of output. 
+ my @parts = split $SPACE, $line; + if ( $#parts == 10 and $parts[1] eq $user and $parts[9] eq 'R' ) { + my ( $job_id, $job_server ) = split $PERIOD, $parts[0]; + + if ( defined $pbs_tabjobs{$job_id}{server} ) { + printf("Warning, job $job_id exists on multiple servers\n"); next; } - $pbs_tabjobs{$job}{server} = $server; - foreach my $word (@champs) { - chomp($word); - $word =~ s/\/.*//; # take all from / - push( @{ $pbs_tabjobs{$job}{hosts} }, $word ); - } + $job = $job_id; + +# This test is perfectly "safe" and should never fail apart from the case where the job id +# and the server name don't fit inside the 15 characters allowed. This can be worked around +# by setting the -w flag which tells pbs_pro to print up to 30 characters but that doesn't +# work on Torque or plain pbs so only check this value if the string is shorter than that. + if ( length $parts[0] < 15 and $job_server ne $server ) { + printf("Warning, job is listed with unexpected server\n"); + } + $pbs_tabjobs{$job_id}{nproc} = $parts[6]; + $pbs_tabjobs{$job_id}{server} = $server; } } return; From padb at googlecode.com Mon Nov 8 18:27:44 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 08 Nov 2010 18:27:44 +0000 Subject: [padb] r414 committed - Better error checking in the proc-summary mode.... Message-ID: <0022150473dfc107a804948ec7f9@google.com> Revision: 414 Author: apittman at gmail.com Date: Mon Nov 8 10:27:13 2010 Log: Better error checking in the proc-summary mode. If the selected samples are quick to measure, the elapsed time between them might be zero; if this is the case then don't attempt to use this value for division or bad things will happen. Ideally if any of these stats are being reported padb should sample them, wait and then sample them again but right now it doesn't do the wait and the two samples can happen in apparently zero time. 
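The division-by-zero guard described in this log can be shown in isolation. The sketch below is not padb's pcpu_total(), whose internals are not shown in this thread; `cpu_percent`, its arguments and the ticks-per-second parameter are all hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: percentage of one CPU used between two samples of an
# accumulated tick counter. Returns undef when no time has elapsed so the
# caller can skip the value instead of dividing by zero.
sub cpu_percent {
    my ( $ticks_start, $ticks_end, $elapsed, $ticks_per_sec ) = @_;

    return if not defined $elapsed or $elapsed <= 0;

    my $cpu_seconds = ( $ticks_end - $ticks_start ) / $ticks_per_sec;
    return 100 * $cpu_seconds / $elapsed;
}
```

A caller would only report the figure when the function returns a defined value, which is equivalent to wrapping the output in an elapsed-time check.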
http://code.google.com/p/padb/source/detail?r=414 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Nov 8 10:23:03 2010 +++ /trunk/src/padb Mon Nov 8 10:27:13 2010 @@ -3996,7 +3996,8 @@ my $value = $lines->{$tag}{$key}; next unless defined $proc_format_lengths{$key} or $show_fields; - if ( length $value > $proc_format_lengths{$key} ) { + if ( defined $value and length $value > $proc_format_lengths{$key} ) + { $proc_format_lengths{$key} = length $value; } @@ -8338,27 +8339,29 @@ next unless defined $proc->{stat_end}; - proc_output( - $vp, 'pcpu', - pcpu_total( - $cpucount, $elapsed, - $proc->{stat_start}, $proc->{stat_end} - ) - ); - proc_output( - $vp, 'pucpu', - pcpu_user( - $cpucount, $elapsed, - $proc->{stat_start}, $proc->{stat_end} - ) - ); - proc_output( - $vp, 'pscpu', - pcpu_sys( - $cpucount, $elapsed, - $proc->{stat_start}, $proc->{stat_end} - ) - ); + if ( $elapsed > 0 ) { + proc_output( + $vp, 'pcpu', + pcpu_total( + $cpucount, $elapsed, + $proc->{stat_start}, $proc->{stat_end} + ) + ); + proc_output( + $vp, 'pucpu', + pcpu_user( + $cpucount, $elapsed, + $proc->{stat_start}, $proc->{stat_end} + ) + ); + proc_output( + $vp, 'pscpu', + pcpu_sys( + $cpucount, $elapsed, + $proc->{stat_start}, $proc->{stat_end} + ) + ); + } } } From padb at googlecode.com Mon Nov 8 18:31:50 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 08 Nov 2010 18:31:50 +0000 Subject: [padb] r415 committed - Fix a couple of problems with the new launch_mode code:... Message-ID: <0015175ce0c472d59604948ed613@google.com> Revision: 415 Author: apittman at gmail.com Date: Mon Nov 8 10:31:23 2010 Log: Fix a couple of problems with the new launch_mode code: Only use "rmgr" if it provided a launch command, some of them don't which led to the wrong thing happening. Check for "pdsh" being requested separately from pdsh being installed so that the correct error message is reported to the user. 
http://code.google.com/p/padb/source/detail?r=415 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Nov 8 10:27:13 2010 +++ /trunk/src/padb Mon Nov 8 10:31:23 2010 @@ -3761,13 +3761,16 @@ return %pcmd; } } elsif ( $mode eq 'rmgr' ) { - return %pcmd; + if ( defined $pcmd{command} ) { + return %pcmd; + } } elsif ( $mode eq 'ssh' ) { if ( @hosts == 1 ) { $pcmd{command} = "ssh $hosts[0]"; return %pcmd; } - } elsif ( $mode eq 'pdsh' and $have_pdsh ) { + } elsif ( $mode eq 'pdsh' ) { + next unless ($have_pdsh); $pcmd{require_inner_callback} = 1; my $hlist = join q{,}, @hosts; From ashley at pittman.co.uk Mon Nov 8 18:54:40 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Mon, 8 Nov 2010 18:54:40 +0000 Subject: [padb] Upcoming release. (delayed) In-Reply-To: References: Message-ID: On 1 Nov 2010, at 19:57, Ashley Pittman wrote: > I'd like to make a formal release in the coming weeks based on the current SVN code, the 3.2 beta has been through an extended testing period and I'm happy that it's ready to move to formal release status. On this basis I propose making a 3.3 release in the next two weeks, probably on Monday the 8th. I've had rather more feedback to this than I expected so a large number of small fixes have gone in over the last week, as such I'll have to extend the window to allow more time to process them and test/stabilise the code. Other than the changes from Vega and further testing of the pbs/Torque patch I'm not currently planing any changes beyond what turns up in testing. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From padb at googlecode.com Tue Nov 9 16:10:36 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Tue, 09 Nov 2010 16:10:36 +0000 Subject: [padb] r416 committed - Be more careful reading hte host_list for a job, if it's not... 
Message-ID: <0015175cb5aa33b96e0494a0fbc6@google.com> Revision: 416 Author: apittman at gmail.com Date: Tue Nov 9 08:09:47 2010 Log: Be more careful reading hte host_list for a job, if it's not defined then don't try and copy it. http://code.google.com/p/padb/source/detail?r=416 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Nov 8 10:31:23 2010 +++ /trunk/src/padb Tue Nov 9 08:09:47 2010 @@ -3748,7 +3748,10 @@ my @modes = split $COMMA, $mode_list; - my @hosts = @{ $pcmd{host_list} }; + my @hosts; + if ( defined $pcmd{host_list} ) { + @hosts = @{ $pcmd{host_list} }; + } my $have_pdsh = find_exe('pdsh'); @@ -3771,6 +3774,7 @@ } } elsif ( $mode eq 'pdsh' ) { next unless ($have_pdsh); + next if ( @hosts == 0 ); $pcmd{require_inner_callback} = 1; my $hlist = join q{,}, @hosts; From Duncan.Thomas at vega.co.uk Tue Nov 9 17:38:20 2010 From: Duncan.Thomas at vega.co.uk (Duncan Thomas) Date: Tue, 9 Nov 2010 17:38:20 +0000 Subject: [padb] Resending edb build fixes Message-ID: <7DB8F4B6C722CD479E0EAB7B9AE421DF04329426@mercury.vegagroup.net> Please find attached, hopefully my mailer can't chew up attachments as badly as it does text <> -- Duncan Thomas HPC Consultant VEGA Consulting Services Ltd 
-------------- next part -------------- A non-text attachment was scrubbed... Name: build_fixes_2010_11_09.diff Type: application/octet-stream Size: 3023 bytes Desc: build_fixes_2010_11_09.diff URL: From Duncan.Thomas at vega.co.uk Tue Nov 9 17:42:54 2010 From: Duncan.Thomas at vega.co.uk (Duncan Thomas) Date: Tue, 9 Nov 2010 17:42:54 +0000 Subject: [padb] edb fixes Message-ID: <7DB8F4B6C722CD479E0EAB7B9AE421DF04329427@mercury.vegagroup.net> The version of edb in padb, which matches the one currently shipped by Vega, doesn't work (specifically, queue extraction doesn't work since the elf reading code can no longer find elan_base). The issue seems to be that libelan4.so no longer has a DT_HASH, so you have to walk the symbol table in a linear manner. It was also running into a DT_HASH it couldn't work with in an unnamed dynamic section. The new code just skips unnamed sections since they never contain what we want. I'm not sure the new code is 100% correct, but it seems to work unlike what was there before. Comments off people who understand elf structures better very welcome. These fixes should find their way into the next Vega release unless problems with them are found. <> -- Duncan Thomas HPC Consultant VEGA Consulting Services Ltd 
-------------- next part -------------- A non-text attachment was scrubbed... Name: edb_fixes_2010_11_09.diff Type: application/octet-stream Size: 8372 bytes Desc: edb_fixes_2010_11_09.diff URL: From padb at googlecode.com Tue Nov 9 17:59:29 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Tue, 09 Nov 2010 17:59:29 +0000 Subject: [padb] r417 committed - Build fixes for edb.... Message-ID: <20cf300fb3ff9979eb0494a28044@google.com> Revision: 417 Author: apittman at gmail.com Date: Tue Nov 9 09:58:57 2010 Log: Build fixes for edb. Import a patch provided by Duncan Thomas http://code.google.com/p/padb/source/detail?r=417 Modified: /trunk/src/edb/Makefile /trunk/src/edb/edb.h /trunk/src/edb/elf.c /trunk/src/edb/elfN.c /trunk/src/edb/ptrace.c /trunk/src/edb/stats_falcon.c /trunk/src/edb/xml.c ======================================= --- /trunk/src/edb/Makefile Wed May 20 04:23:26 2009 +++ /trunk/src/edb/Makefile Tue Nov 9 09:58:57 2010 @@ -1,18 +1,22 @@ - - +all: edb %.o: %.c edb.h - cc -c -o $@ -pthread $< - -edb: edb.o - cc -o edb edb.o xml.o + cc -g -c -o $@ -pthread $< + +edb: edb.o xml.o ptrace.o stats_eagle.o stats_falcon.o parallel.o elf.o elf64.o elf32.o + cc -g -o edb edb.o xml.o ptrace.o stats_eagle.o stats_falcon.o parallel.o elf.o elf64.o elf32.o -lelan -lpthread xml.o: xml.c cc -c xml.c -o xml.o -pthread elf32.c: elfN.c - sed s/TSIZE/32/g edb/elfN.c > $@ + sed s/TSIZE/32/g elfN.c > $@ elf64.c: elfN.c - sed s/TSIZE/64/g edb/elfN.c > $@ - + sed s/TSIZE/64/g elfN.c > $@ + +tags: elf32.c elf64.c *.c *.h + ctags *.c *.h + +clean: + rm -f *.o edb elf32.c elf64.c ======================================= --- /trunk/src/edb/edb.h Wed May 20 04:23:26 
2009 +++ /trunk/src/edb/edb.h Tue Nov 9 09:58:57 2010 @@ -140,7 +140,7 @@ extern int dump_stats_eagle(void *pages, size_t size, size_t pagesize); /* New, can do lots of things with it */ -#include +#include "sf.h" /* elf.c */ extern void fetch_data_dead (char *cname, char *ename, int trap_dump); ======================================= --- /trunk/src/edb/elf.c Wed May 20 04:23:26 2009 +++ /trunk/src/edb/elf.c Tue Nov 9 09:58:57 2010 @@ -10,7 +10,7 @@ #include #include -#include +#include "edb.h" #include #include #include ======================================= --- /trunk/src/edb/elfN.c Wed May 20 04:23:26 2009 +++ /trunk/src/edb/elfN.c Tue Nov 9 09:58:57 2010 @@ -27,7 +27,7 @@ #include -#include +#include "edb.h" #include ======================================= --- /trunk/src/edb/ptrace.c Wed May 20 04:23:26 2009 +++ /trunk/src/edb/ptrace.c Tue Nov 9 09:58:57 2010 @@ -7,7 +7,7 @@ /* /cvs/master/quadrics/elan4lib/edb/ptrace.c,v */ -#include +#include "edb.h" #define MACRO_BEGIN { #define MACRO_END } ======================================= --- /trunk/src/edb/stats_falcon.c Wed May 20 04:23:26 2009 +++ /trunk/src/edb/stats_falcon.c Tue Nov 9 09:58:57 2010 @@ -11,7 +11,7 @@ #include #include -#include +#include "edb.h" #include ======================================= --- /trunk/src/edb/xml.c Wed May 20 04:23:26 2009 +++ /trunk/src/edb/xml.c Tue Nov 9 09:58:57 2010 @@ -5,7 +5,7 @@ #ident "xml.c,v 1.3 2005/02/03 15:26:08 ashley Exp" /* /cvs/master/quadrics/elan4lib/edb/xml.c,v */ -#include +#include "edb.h" /*********************************************************** * * From ashley at pittman.co.uk Tue Nov 9 18:00:02 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Tue, 9 Nov 2010 19:00:02 +0100 Subject: [padb] edb fixes In-Reply-To: <7DB8F4B6C722CD479E0EAB7B9AE421DF04329427@mercury.vegagroup.net> References: <7DB8F4B6C722CD479E0EAB7B9AE421DF04329427@mercury.vegagroup.net> Message-ID: <8835FCFA-90F1-4E47-A848-3E87BD55DCAA@pittman.co.uk> On 9 Nov 2010, at 18:42, 
Duncan Thomas wrote: > The version of edb in padb, which matches the one currently shipped by Vega, doesn't work (specifically, queue extraction doesn't work since the elf reading code can no longer find elan_base). > > The issue seems to be that libelan4.so no longer has a DT_HASH, so you have to walk the symbol table in a linear manner. It was also running into a DT_HASH it couldn't work with in an unnamed dynamic section. The new code just skips unnamed sections since they never contain what we want. > > I'm not sure the new code is 100% correct, but it seems to work unlike what was there before. Comments off people who understand elf structures better very welcome. > > These fixes should find their way into the next Vega release unless problems with them are found. Is this to fix a problem with reading the message queues? Given that padb used to work have you been able to test this on a "before" system and verify it still functions correctly? Ultimately the edb code isn't widely used and if it works for you then I'm happy with it, we could sit down and go over the elf parsing stuff if it suits you. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From Duncan.Thomas at vega.co.uk Wed Nov 10 15:09:32 2010 From: Duncan.Thomas at vega.co.uk (Duncan Thomas) Date: Wed, 10 Nov 2010 15:09:32 +0000 Subject: [padb] edb fixes In-Reply-To: <8835FCFA-90F1-4E47-A848-3E87BD55DCAA@pittman.co.uk> Message-ID: <7DB8F4B6C722CD479E0EAB7B9AE421DF043296A4@mercury.vegagroup.net> I've tested it on a working (REL 4 x86-64) system and it works as before, so I don't think there are any regressions, though I haven't tested 32 bit yet - nobody uses it AFAIK. -- Duncan Thomas HPC Consultant Mob : +44 (0)7968 111 883 Notice of Confidentiality This transmission is intended for the named addressee only. It contains information which may be confidential and which may also be privileged. 
Unless you are the named addressee (or authorised to receive it for the addressee) you may not copy or use it, or disclose it to anyone else. If you have received this transmission in error please notify the sender immediately. . > -----Original Message----- > From: padb-devel-bounces at pittman.org.uk > [mailto:padb-devel-bounces at pittman.org.uk] On Behalf Of Ashley Pittman > Sent: 09 November 2010 18:00 > To: Duncan Thomas > Cc: padb-devel at pittman.org.uk > Subject: Re: [padb] edb fixes > > > On 9 Nov 2010, at 18:42, Duncan Thomas wrote: > > > The version of edb in padb, which matches the one currently > shipped by Vega, doesn't work (specifically, queue extraction > doesn't work since the elf reading code can no longer find elan_base). > > > > The issue seems to be that libelan4.so no longer has a > DT_HASH, so you have to walk the symbol table in a linear > manner. It was also running into a DT_HASH it couldn't work > with in an unnamed dynamic section. The new code just skips > unnamed sections since they never contain what we want. > > > > I'm not sure the new code is 100% correct, but it seems to > work unlike what was there before. Comments off people who > understand elf structures better very welcome. > > > > These fixes should find their way into the next Vega > release unless problems with them are found. > > Is this to fix a problem with reading the message queues? > Given that padb used to work have you been able to test this > on a "before" system and verify it still functions correctly? > > Ultimately the edb code isn't widely used and if it works for > you then I'm happy with it, we could sit down and go over the > elf parsing stuff if it suits you. > > Ashley. > > -- > > Ashley Pittman, Bath, UK. 
> > Padb - A parallel job inspection tool for cluster computing > http://padb.pittman.org.uk > > > _______________________________________________ > padb-devel mailing list > padb-devel at pittman.org.uk > http://pittman.org.uk/mailman/listinfo/padb-devel_pittman.org.uk > From padb at googlecode.com Sun Nov 14 18:54:24 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Sun, 14 Nov 2010 18:54:24 +0000 Subject: [padb] r418 committed - Accept a patch to un-break edb on modern systems. Message-ID: <20cf300fb3ff291879049507da52@google.com> Revision: 418 Author: apittman at gmail.com Date: Sun Nov 14 10:54:02 2010 Log: Accept a patch to un-break edb on modern systems. http://code.google.com/p/padb/source/detail?r=418 Modified: /trunk/src/edb/elfN.c ======================================= --- /trunk/src/edb/elfN.c Tue Nov 9 09:58:57 2010 +++ /trunk/src/edb/elfN.c Sun Nov 14 10:54:02 2010 @@ -73,85 +73,155 @@ static uint64_t find_sym_in_tablesTSIZE (struct etrace_ops *ops, struct link_map *map, + uint64_t hash, int nchains, uint64_t symtab, uint64_t strtab, char *sym_name) { ElfTSIZE_Sym sym; + ElfTSIZE_Sym *sym_p; char str[128]; int i = 0; int check; if ( verbose > 3 ) printf("New table mapped at %#"PRIx64" containing %d chains\n",strtab,nchains); - - while (i < nchains) { - - if ( ops->rcopy(ops->handle, - (uint64_t)(uintptr_t)((unsigned long)symtab + (i * sizeof(ElfTSIZE_Sym))), - &sym, - sizeof(ElfTSIZE_Sym)) == -1 ) { - printf("Failed to read from process\n"); - return 0; - } - - i++; - check = 0; - - if ( sym.st_info == (unsigned char)ELFTSIZE_ST_INFO(STB_GLOBAL,STT_OBJECT)) - check = 1; - - if ( sym.st_value == 6 ) - check = 0; - - if ( ELFTSIZE_ST_TYPE(sym.st_info) == (unsigned char)STT_FILE ) - check = 0; - - if ( check ) { - /* read symbol name from the string table */ + if (hash) { + while (i < nchains) { + + if ( ops->rcopy(ops->handle, + (uint64_t)(uintptr_t)((unsigned long)symtab + (i * sizeof(ElfTSIZE_Sym))), + &sym, + sizeof(ElfTSIZE_Sym)) == 
-1 ) { + printf("Failed to read from process\n"); + return 0; + } + + i++; + check = 0; + + if ( sym.st_info == (unsigned char)ELFTSIZE_ST_INFO(STB_GLOBAL,STT_OBJECT)) + check = 1; + + if ( sym.st_value == 6 ) + check = 0; + + if ( ELFTSIZE_ST_TYPE(sym.st_info) == (unsigned char)STT_FILE ) + check = 0; + + if ( check ) { + /* read symbol name from the string table */ // str = (char *) strtab + sym.st_name; - if ( fetch_string(ops,&str[0],strtab + sym.st_name,128) == -1 ) { - if ( verbose > 2 ) - printf("Failed to find string, returning nothing\n"); - return 0; - } - - if ( verbose > 2 ) - printf("String is type %x bind %x other %x size %x value "TARGET_PTR" shndx %x name %lx '%s'\n", - (int) ELFTSIZE_ST_TYPE(sym.st_info), - (int) ELFTSIZE_ST_BIND(sym.st_info), - (int) sym.st_other, - (int) sym.st_size, - sym.st_value, - (int) sym.st_shndx, - (long) sym.st_name, - str); - - /* compare it with our symbol*/ - if (strcmp(&str[0], sym_name) == 0) { - + if ( fetch_string(ops,&str[0],strtab + sym.st_name,128) == -1 ) { + if ( verbose > 2 ) + printf("Failed to find string, returning nothing\n"); + return 0; + } + if ( verbose > 2 ) - printf("\nSuccess: got it %s "TARGET_PTR" "TARGET_PTR" "TARGET_PTR"\n", - &str[0], - (TARGET_TYPE)map->l_addr, + printf("String is type %x bind %x other %x size %x value "TARGET_PTR" shndx %x name %lx '%s'\n", + (int) ELFTSIZE_ST_TYPE(sym.st_info), + (int) ELFTSIZE_ST_BIND(sym.st_info), + (int) sym.st_other, + (int) sym.st_size, sym.st_value, - (TARGET_TYPE)map->l_addr + sym.st_value); - - if ( sym.st_value ) { - return (uint64_t)(map->l_addr + sym.st_value); + (int) sym.st_shndx, + (long) sym.st_name, + str); + + /* compare it with our symbol*/ + if (strcmp(&str[0], sym_name) == 0) { + + if ( verbose > 2 ) + printf("\nSuccess: got it %s "TARGET_PTR" "TARGET_PTR" "TARGET_PTR"\n", + &str[0], + (TARGET_TYPE)map->l_addr, + sym.st_value, + (TARGET_TYPE)map->l_addr + sym.st_value); + + if ( sym.st_value ) { + return (uint64_t)(map->l_addr + 
sym.st_value); + } } } } - } - - if ( verbose > 2 ) - printf("Found nothing\n"); - - /* no symbol found, return 0 */ - return 0; + + if ( verbose > 2 ) + printf("Found nothing\n"); + + /* no symbol found, return 0 */ + return 0; + } else { + /* If there is no DT_HASH, we can still walk the symbol table linearly until we find what we're after + or we get an invalid memory reference */ + /* First entry is always blank, second one always has no name */ + sym_p = (void *)symtab; + sym_p += 2; + + if ( ops->rcopy(ops->handle, (uint64_t)(uintptr_t)sym_p, + &sym, sizeof(ElfTSIZE_Sym)) == -1 ) { + printf("Failed to read from process\n"); + return 0; + } + + while (sym.st_name != 0) { + int check = 0; + if ( sym.st_info == (unsigned char)ELFTSIZE_ST_INFO(STB_GLOBAL,STT_OBJECT)) + check = 1; + + if ( sym.st_value == 6 ) + check = 0; + + if ( ELFTSIZE_ST_TYPE(sym.st_info) == (unsigned char)STT_FILE ) + check = 0; + + if ( check ) { + if ( fetch_string(ops,&str[0],strtab + sym.st_name,128) == -1 ) { + if ( verbose > 2 ) + printf("Failed to find string, returning nothing\n"); + return 0; + } + + if ( verbose > 2 ) + printf("String is type %x bind %x other %x size %x value "TARGET_PTR" shndx %x name %lx '%s'\n", + (int) ELFTSIZE_ST_TYPE(sym.st_info), + (int) ELFTSIZE_ST_BIND(sym.st_info), + (int) sym.st_other, + (int) sym.st_size, + sym.st_value, + (int) sym.st_shndx, + (long) sym.st_name, + str); + + /* compare it with our symbol*/ + if (strcmp(&str[0], sym_name) == 0) { + + if ( verbose > 2 ) + printf("\nSuccess: got it %s "TARGET_PTR" "TARGET_PTR" "TARGET_PTR"\n", + &str[0], + (TARGET_TYPE)map->l_addr, + sym.st_value, + (TARGET_TYPE)map->l_addr + sym.st_value); + + if ( sym.st_value ) { + return (uint64_t)(map->l_addr + sym.st_value); + } + } + } + + sym_p++; + if ( ops->rcopy(ops->handle, (uint64_t)(uintptr_t)sym_p, + &sym, sizeof(ElfTSIZE_Sym)) == -1 ) { + printf("Failed to read from process\n"); + return 0; + } + } + + return 0; + } } static void @@ -338,6 +408,7 @@ { 
ElfTSIZE_Dyn dyn; uint64_t addr; + uint64_t hash = 0; int nchains; uint64_t symtab = 0; uint64_t strtab = 0; @@ -356,22 +427,18 @@ if ( verbose > 2 ) printf("Read nchains from "TARGET_PTR" %p\n",dyn.d_un.d_ptr, (void *)map->l_addr); - - /* - * This smacks of being wrong but having looked at how RHAS3.0 - * handles things I don't see any other way it can work. - * - * Maybe it isn't so bad because the d_ptr is always 0x120 and - * l_addr is page aligned so logical or won't ever require bit - * carry and it should always get the right answer. - * - * ashley at quadrics.com 03th March 2004 - */ + + if (ops->rcopy(ops->handle, + (uint64_t)(uintptr_t)dyn.d_un.d_ptr, &hash, sizeof(hash)) == -1) { + return 0; + } + if (ops->rcopy(ops->handle, - (uint64_t)(uintptr_t)((dyn.d_un.d_ptr | map->l_addr) + 4), + (uint64_t)(uintptr_t)(dyn.d_un.d_ptr + 4), &nchains, - sizeof(nchains)) == -1 ) + sizeof(nchains)) == -1 ) { return 0; + } break; case DT_STRTAB: strtab = dyn.d_un.d_ptr; @@ -383,10 +450,11 @@ break; } addr += sizeof(ElfTSIZE_Dyn); - if (ops->rcopy(ops->handle,addr, &dyn, sizeof(ElfTSIZE_Dyn))==-1) + if (ops->rcopy(ops->handle,addr, &dyn, sizeof(ElfTSIZE_Dyn))==-1) { return 0; - } - return (find_sym_in_tablesTSIZE(ops,map,nchains,symtab,strtab,"elan_base")); + } + } + return (find_sym_in_tablesTSIZE(ops,map,hash,nchains,symtab,strtab,"elan_base")); } @@ -482,6 +550,7 @@ do { uint64_t b; + char name[4]; if ( verbose > 2 ) printf("The link_map looks like %p %p %p %p %p\n", @@ -491,7 +560,16 @@ link->l_next, link->l_prev); - b = (uint64_t)resolv_tablesTSIZE(ops,link); + + /* Grab 4 bytes because ptrace won't allow less */ + ops->rcopy(ops->handle, (uint64_t)(uintptr_t)link->l_name, &name, 4); + if ( *name == 0 ) { + if ( verbose > 2 ) + printf("Skipping anonymous map\n"); + b = 0; + } else { + b = (uint64_t)resolv_tablesTSIZE(ops,link); + } if ( b ) { free(link); return b; From Jie.Cai at anu.edu.au Mon Nov 22 00:17:16 2010 From: Jie.Cai at anu.edu.au (Jie Cai) Date: Mon, 22 
Nov 2010 11:17:16 +1100 Subject: [padb] Make "gdb set language c" for Fortran program to get correct value and offset of symbols. In-Reply-To: <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> Message-ID: <4CE9B68C.4040706@anu.edu.au> Hi Ashley, This is the patch file that makes gdb set the language to c for Fortran programs. It was created against the current svn head of padb. It is an attempt to solve the problem that produces the warning messages shown below: ======== Warning: errors reported by some ranks ======== [0]: Error message from /short/z00/jxc900/PADB/padb_build/libexec/minfo: setup_communicator_iterator() failed [0]: Stderr from minfo: WARNING: Field opal_list_next of type opal_list_item_t not found! WARNING: Field opal_list_sentinel of type opal_list_t not found! WARNING: Field fl_mpool of type ompi_free_list_t not found! WARNING: Field fl_allocations of type ompi_free_list_t not found! Kind Regards, Jie -- Jie Cai Jie.Cai at anu.edu.au ANU Supercomputer Facility NCI National Facility Leonard Huxley, Mills Road Ph: +61 2 6125 7965 Australian National University Fax: +61 2 6125 8199 Canberra, ACT 0200, Australia http://nf.nci.org.au ----------------------------------------------------- On 17/11/10 05:33, Ashley Pittman wrote: > On 15 Nov 2010, at 23:00, Jie Cai wrote: > >> Therefore, some gdb commands like "-data-evaluate-expression "(void *)&((opal_list_item_t *)0)->opal_list_next"" failed because of "No symbol \"void\" in current context." >> > This was added for some versions of gdb which reported differently for offsets of known types, initially I only saw this on Solaris but I have seen it on Linux as well. > > http://code.google.com/p/padb/source/detail?r=322 > > >> We have made a small patch to fix this issue. 
Simply inserting "-gdb-set language c" just before "-data-evaluate-expression" will solve it. >> > That sounds great, I'll look forward to seeing it. The other option might be to move gdb up the stack to main so that the language is c anyway? > > >> We haven't had time to look into the PBS issue yet. Once we have any information we will keep you updated. >> > It's likely that the change will involve adding a process name to the list in is_resmgr_process() and/or adding checks for a second environment variable in pbs_find_pids(). > > Ashley, > > -------------- next part -------------- A non-text attachment was scrubbed... Name: padb_fortran.patch Type: text/x-patch Size: 4803 bytes Desc: not available URL: From ashley at pittman.co.uk Tue Nov 23 07:55:13 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Tue, 23 Nov 2010 07:55:13 +0000 Subject: [padb] Make "gdb set language c" for Fortran program to get correct value and offset of symbols. In-Reply-To: <4CE9B68C.4040706@anu.edu.au> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> <4CE9B68C.4040706@anu.edu.au> Message-ID: <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk> Thank you for the patch, I'm wondering if it's enough to simply set the mode on attach or does that break the stack traces or local variables in Fortran code? Padb already tracks the value of "-gdb-set print address" and changes it as required, perhaps it needs to do the same for "-gdb-set language"? Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From Jie.Cai at anu.edu.au Tue Nov 23 23:44:05 2010 From: Jie.Cai at anu.edu.au (Jie Cai) Date: Wed, 24 Nov 2010 10:44:05 +1100 Subject: [padb] Make "gdb set language c" for Fortran program to get correct value and offset of symbols. 
In-Reply-To: <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> <4CE9B68C.4040706@anu.edu.au> <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk> Message-ID: <4CEC51C5.3030701@anu.edu.au> On 23/11/10 18:55, Ashley Pittman wrote: > Thank you for the patch, I'm wondering if it's enough to simply set the mode on attach or does that break the stack traces or local variables in Fortran code? I tested this, and it is not sufficient. > Padb already tracks the value of "-gdb-set print address" and changes it as required, perhaps it needs to do the same for "-gdb-set language"? > Yes, you are right. The patch can be simplified to send '-gdb-set language c' only when '-gdb-set print address' is sent. The simplified new patch is attached. Kind Regards, Jie -- Jie Cai Jie.Cai at anu.edu.au ANU Supercomputer Facility NCI National Facility Leonard Huxley, Mills Road Ph: +61 2 6125 7965 Australian National University Fax: +61 2 6125 8199 Canberra, ACT 0200, Australia http://nf.nci.org.au ----------------------------------------------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: padb_fortran_new.patch Type: text/x-patch Size: 415 bytes Desc: not available URL: From ashley at pittman.co.uk Wed Nov 24 22:55:27 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Wed, 24 Nov 2010 22:55:27 +0000 Subject: [padb] Make "gdb set language c" for Fortran program to get correct value and offset of symbols. 
In-Reply-To: <4CEC51C5.3030701@anu.edu.au> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> <4CE9B68C.4040706@anu.edu.au> <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk> <4CEC51C5.3030701@anu.edu.au> Message-ID: On 23 Nov 2010, at 23:44, Jie Cai wrote: > > On 23/11/10 18:55, Ashley Pittman wrote: >> Thank you for the patch, I'm wondering if it's enough to simply set the mode on attach or does that break the stack traces or local variables in Fortran code? > I tested this, and it is not sufficient. >> Padb already tracks the value of "-gdb-set print address" and changes it as required, perhaps it needs to do the same for "-gdb-set language"? >> > Yes, you are right. The patch can be simplified to send '-gdb-set language c' only when '-gdb-set print address' is sent. I was thinking of something like the attached, which sets the value to c when required but reverts it to its normal state when not. -------------- next part -------------- A non-text attachment was scrubbed... Name: gdb-language.patch Type: application/octet-stream Size: 1708 bytes Desc: not available URL: -------------- next part -------------- -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From Jie.Cai at anu.edu.au Thu Nov 25 01:07:34 2010 From: Jie.Cai at anu.edu.au (Jie Cai) Date: Thu, 25 Nov 2010 12:07:34 +1100 Subject: [padb] Make "gdb set language c" for Fortran program to get correct value and offset of symbols. 
In-Reply-To: References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> <4CE9B68C.4040706@anu.edu.au> <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk> <4CEC51C5.3030701@anu.edu.au> Message-ID: <4CEDB6D6.5040304@anu.edu.au> Hi Ashley, Agreed, it makes more sense to set the language only when it is needed. There is a little bug in line 6323. I have fixed it with the attached patch. Other than that works fine. Kind Regards, Jie -- Jie Cai Jie.Cai at anu.edu.au ANU Supercomputer Facility NCI National Facility Leonard Huxley, Mills Road Ph: +61 2 6125 7965 Australian National University Fax: +61 2 6125 8199 Canberra, ACT 0200, Australia http://nf.nci.org.au ----------------------------------------------------- On 25/11/10 09:55, Ashley Pittman wrote: > On 23 Nov 2010, at 23:44, Jie Cai wrote: > > >> On 23/11/10 18:55, Ashley Pittman wrote: >> >>> Thank you for the patch, I'm wondering if it's enough to simply set the mode on attach or does that break the stack traces or local variables in Fortran code? >>> >> I tested this, and it is not sufficient. >> >>> Padb already tracks the value of "-gdb-set print address" and changes it as required, perhaps it needs to do the same for "-gdb-set language"? >>> >>> >> Yes, you are right. The patch can be simplified to send '-gdb-set language c' only when '-gdb-set print address' is sent. >> > I was thinking of something like the attached, which sets the value to c when required but reverts it to its normal state when not. > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: bug-fix-gdb.patch Type: text/x-patch Size: 374 bytes Desc: not available URL: From padb at googlecode.com Thu Nov 25 18:34:43 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Thu, 25 Nov 2010 18:34:43 +0000 Subject: [padb] r419 committed - Switch gdb into 'c' mode for expressions which are known to require... Message-ID: <0015175cb8ca0b971b0495e4dcae@google.com> Revision: 419 Author: apittman at gmail.com Date: Thu Nov 25 10:33:54 2010 Log: Switch gdb into 'c' mode for expressions which are known to require this. Without this patch then attempting to read the MPI message queues from fortran programs will fail due to padb being unable to correctly calculate the offset of members within structs. http://code.google.com/p/padb/source/detail?r=419 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Tue Nov 9 08:09:47 2010 +++ /trunk/src/padb Thu Nov 25 10:33:54 2010 @@ -6052,6 +6052,7 @@ tracepid => -1, attached => 0, pa => 0, + lang => 'auto', debug => 0, seq => 1, }; @@ -6163,6 +6164,7 @@ } gdb_n_send( $gdb, '-gdb-set print address off' ); + gdb_n_send( $gdb, '-gdb-set language auto' ); return; } @@ -6314,17 +6316,41 @@ } return; } + +sub _gdb_set_lang { + my ( $gdb, $lang ) = @_; + + if ( $lang eq $gdb->{lang} ) { + return; + } + + $gdb->{lang} = $lang; + + _gdb_send_real( $gdb, "-gdb-set language $lang" ); + + return; +} sub gdb_n_send { my ( $gdb, $cmd ) = @_; _gdb_set_print_address( $gdb, 0 ); + _gdb_set_lang( $gdb, 'auto' ); return _gdb_send_real( $gdb, $cmd ); } + +# Send a command in the c language +sub gdb_send_c { + my ( $gdb, $cmd ) = @_; + _gdb_set_print_address( $gdb, 1 ); + _gdb_set_lang( $gdb, 'c' ); + return _gdb_send_real( $gdb, $cmd ); +} # Send a command with print address enabled. 
sub gdb_send_addr { my ( $gdb, $cmd ) = @_; _gdb_set_print_address( $gdb, 1 ); + _gdb_set_lang( $gdb, 'auto' ); return _gdb_send_real( $gdb, $cmd ); } @@ -6662,7 +6688,7 @@ # adding extra text after the value which is causing hex to throw warnings. sub gdb_type_offset { my ( $gdb, $type, $field ) = @_; - my %p = gdb_send_addr( $gdb, + my %p = gdb_send_c( $gdb, "-data-evaluate-expression \"(void *)&(($type *)0)->$field\"" ); return unless ( $p{status} eq 'done' ); return hex gdb_strip_value( $p{reason} ); From ashley at pittman.co.uk Thu Nov 25 18:36:04 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Thu, 25 Nov 2010 18:36:04 +0000 Subject: [padb] Make "gdb set language c" for Fortran program to get correct value and offset of symbols. In-Reply-To: <4CEDB6D6.5040304@anu.edu.au> References: <4CD9DF48.8050002@anu.edu.au> <4CDB2F62.8010808@anu.edu.au> <3D83BB55-B0B4-480B-AC6D-DC0EF37CC5B1@pittman.co.uk> <4CE20FE4.6060807@anu.edu.au> <5F7FBEAB-5967-4195-8D0C-D8C87BF124C8@pittman.co.uk> <4CE9B68C.4040706@anu.edu.au> <02F7B375-1597-472B-A820-D354681F84AB@pittman.co.uk> <4CEC51C5.3030701@anu.edu.au> <4CEDB6D6.5040304@anu.edu.au> Message-ID: I've committed the patch with your change. I had copied the logic from set_print_address but of course that deals with a boolean flag rather than discrete values. Ashley. On 25 Nov 2010, at 01:07, Jie Cai wrote: > Hi Ashley, > > Agreed, it makes more sense to set the language only when it is needed. > > There is a little bug in line 6323. I have fixed it with the attached patch. Other than that works fine. 
> Kind Regards, > Jie > > -- > Jie Cai > Jie.Cai at anu.edu.au > > ANU Supercomputer Facility NCI National Facility > Leonard Huxley, Mills Road Ph: +61 2 6125 7965 > Australian National University Fax: +61 2 6125 8199 > Canberra, ACT 0200, Australia > http://nf.nci.org.au > > ----------------------------------------------------- > > > On 25/11/10 09:55, Ashley Pittman wrote: >> On 23 Nov 2010, at 23:44, Jie Cai wrote: >> >> >> >>> On 23/11/10 18:55, Ashley Pittman wrote: >>> >>> >>>> Thank you for the patch, I'm wondering if it's enough to simply set the mode on attach or does that break the stack traces or local variables in Fortran code? >>>> >>>> >>> I tested this, and it is not sufficient. >>> >>> >>>> Padb already tracks the value of "-gdb-set print address" and changes it as required, perhaps it needs to do the same for "-gdb-set language"? >>>> >>>> >>>> >>> Yes, you are right. The patch can be simplified to send '-gdb-set language c' only when '-gdb-set print address' is sent. >>> >>> >> I was thinking of something like the attached, which sets the value to c when required but reverts it to its normal state when not. >> >> >> >> >> >> >> >> > -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From padb at googlecode.com Sun Nov 28 21:44:13 2010 From: padb at googlecode.com (padb at googlecode.com) Date: Sun, 28 Nov 2010 21:44:13 +0000 Subject: [padb] r420 committed - Reduce the amount of data sent over the network for tree-based stack tr... Message-ID: <0015175cb8ca49b76f049623db40@google.com> Revision: 420 Author: apittman at gmail.com Date: Sun Nov 28 13:43:33 2010 Log: Reduce the amount of data sent over the network for tree-based stack traces. I spotted that each stack trace was being sent over the network twice for this case, once as namespace data which is used for the tree and once as ordinary output. 
Other than the outer process checking for its existence as a shortcut to know whether it can exit early, the ordinary output wasn't used for anything. This commit improves the check in the outer process to look for either ordinary or namespace output, and also stops the inner process from sending the unnecessary data over the network. http://code.google.com/p/padb/source/detail?r=420 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Thu Nov 25 10:33:54 2010 +++ /trunk/src/padb Sun Nov 28 13:43:33 2010 @@ -4107,8 +4107,15 @@ my $cargs = $req->{cargs}; - # Warn on missing output here... - return unless exists $d->{target_output}; + # Warn on missing output here, be sure to check both standard output and + # namespace output because if we are using a tree then there won't be any + # standard output. + if ( not exists $d->{target_output} and not exists $d->{target_ns_output} ) + { + return; + } + + #return unless exists $d->{target_output}; my $lines = $d->{target_output}; my $mode = $req->{mode}; @@ -8664,8 +8671,6 @@ ( $frame->{file} || '?' ), ( $frame->{line} || '?' ); - output( $vp, $l ); - if ( $carg->{out_format} eq 'tree' ) { output_namespace( $vp, $thread->{id}, $l ); @@ -8719,6 +8724,8 @@ } } } else { + output( $vp, $l ); + if ( $carg->{stack_shows_params} ) { show_stack_vars( $vp, $frame, 'params' ); }
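The check in the first hunk of r420 can be illustrated in isolation. The following is a minimal sketch only, not code from padb: the helper name has_any_output and the shape of the sample reply hashes are assumptions made for illustration, and only the two hash keys target_output and target_ns_output are taken from the diff. It shows why a reply that carries nothing but namespace data (the tree case after this commit) must no longer be treated as empty.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of padb: returns 1 if the reply hash holds
# either ordinary output or namespace output, 0 if it holds neither.
sub has_any_output {
    my ($d) = @_;
    return 1 if exists $d->{target_output};
    return 1 if exists $d->{target_ns_output};
    return 0;
}

# Sample replies; the inner layout of the values is invented for the example.
my %tree_reply  = ( target_ns_output => { 0 => ['main() at prog.c:10'] } );
my %plain_reply = ( target_output    => { 0 => ['main() at prog.c:10'] } );
my %empty_reply = ();

my @cases = (
    [ tree  => \%tree_reply ],
    [ plain => \%plain_reply ],
    [ empty => \%empty_reply ],
);

# The pre-r420 test looked at target_output only, so the tree reply
# would have been skipped once the inner stopped sending ordinary output.
for my $case (@cases) {
    my ( $name, $d ) = @{$case};
    my $old = exists $d->{target_output} ? 'process' : 'skip';
    my $new = has_any_output($d)         ? 'process' : 'skip';
    printf "%-5s old=%-7s new=%s\n", $name, $old, $new;
}
```

Both halves of the commit have to land together: once the inner process stops calling output() for tree-format traces, a check on target_output alone would skip every such reply.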