From thipadin.seng-long at bull.net  Tue Dec  1 14:18:08 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Tue, 1 Dec 2009 15:18:08 +0100
Subject: [padb] Pb with --create-secret-file
Message-ID: <OFBD408D77.D26B9915-ONC125767F.004C27DD@frcl.bull.fr>

Hi,
On one of our clusters I have a problem creating the secret file:

[thipa at vb0 openmpi]$ padb_r341 --create-secret-file
Failed to chmod secret file: No such file or directory
[thipa at vb0 openmpi]$ 

Our system is:
[thipa at vb0 openmpi]$ uname -a
Linux vb0 2.6.18-B64k.1.26 #1 SMP Wed Aug 26 17:15:29 CEST 2009 ia64 ia64 ia64 GNU/Linux

Before my patch the source code looked like this:

sub create_padb_secret {
    my $filename = "$ENV{HOME}/.padb-secret";
    my $FD;
    if ( not open $FD, '>', $filename ) {
        print "Failed to create secret file: $!\n";
        return;
    } 
    if ( chmod( 0600, $FD ) != 1 ) {
        print "Failed to chmod secret file: $!\n";
        return;
    }
    my $s = rand;
    print {$FD} "secret=$s\n";
    close $FD;
    print "Sucessfully created secret file ($filename)\n";
    return;
}

After searching on the web I changed the code to:

sub create_padb_secret {
    my $filename = "$ENV{HOME}/.padb-secret";
    my $FD;
    if ( not open $FD, '>', $filename ) {
        print "Failed to create secret file: $!\n";
        return;
    } 
    if ( chmod( 0600, $filename ) != 1 ) {
        print "Failed to chmod secret file: $!\n";
        return;
    }
    my $s = rand;
    print {$FD} "secret=$s\n";
    close $FD;
    print "Sucessfully created secret file ($filename)\n";
    return;
}

And it works:

[thipa at vb0 openmpi]$ padb_r341_secret --create-secret-file
Sucessfully created secret file (/home_nfs/thipa/.padb-secret)
[thipa at vb0 openmpi]$ 

This happens only on this cluster, which is IA64. According to the Perl
documentation, it depends on whether the system supports fchmod:
http://perldoc.perl.org/functions/chmod.html
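
A minimal sketch of the filename-based approach, for illustration only. It
makes one change that is not in the thread: the file is opened with sysopen
and an explicit 0600 mode, so it is never created with wider permissions. The
function name create_secret_file and the convention of returning an error
string are assumptions made for this example, not padb's code.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_TRUNC);

# Create a secret file readable only by its owner.
# Returns nothing on success, or an error string on failure.
sub create_secret_file {
    my ($filename) = @_;

    # Request mode 0600 at creation time (the umask can only narrow it).
    sysopen( my $fd, $filename, O_WRONLY | O_CREAT | O_TRUNC, 0600 )
      or return "Failed to create secret file: $!";

    # chmod by name rather than by handle: this does not depend on
    # fchmod support, and it also fixes the mode of a pre-existing file.
    if ( chmod( 0600, $filename ) != 1 ) {
        my $err = "Failed to chmod secret file: $!";
        close $fd;
        return $err;
    }

    my $s = rand;
    print {$fd} "secret=$s\n";
    close $fd or return "Failed to write secret file: $!";
    return;
}
```

A caller would print the returned string, if any, and otherwise report
success.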


Here is the patch:

-------------- next part --------------
A non-text attachment was scrubbed...
Name: secret.patch
Type: application/octet-stream
Size: 364 bytes
Desc: not available
URL: <http://pittman.org.uk/pipermail/padb-devel_pittman.org.uk/attachments/20091201/4150e720/attachment.obj>

From padb at googlecode.com  Tue Dec  1 14:38:05 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Tue, 01 Dec 2009 14:38:05 +0000
Subject: [padb] r342 committed - When creating the secret file chmod the
	filename not the file handle a...
Message-ID: <00504502b14cc17f9c0479abb4e7@google.com>

Revision: 342
Author: apittman
Date: Tue Dec  1 06:37:02 2009
Log: When creating the secret file chmod the filename not the file handle,
as this is more widely supported and doesn't result in errors on ia64 or
Solaris.

Previously padb could exit with errors like the following:
[thipa at vb0 openmpi]$ padb --create-secret-file
Failed to chmod secret file: No such file or directory


http://code.google.com/p/padb/source/detail?r=342

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Thu Nov 26 02:30:48 2009
+++ /trunk/src/padb	Tue Dec  1 06:37:02 2009
@@ -4726,7 +4726,7 @@
          print "Failed to create secret file: $!\n";
          return;
      }
-    if ( chmod( 0600, $FD ) != 1 ) {
+    if ( chmod( 0600, $filename ) != 1 ) {
          print "Failed to chmod secret file: $!\n";
          return;
      }


From ashley at pittman.co.uk  Tue Dec  1 14:38:47 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Tue, 01 Dec 2009 14:38:47 +0000
Subject: [padb] Pb with --create-secret-file
In-Reply-To: <OFBD408D77.D26B9915-ONC125767F.004C27DD@frcl.bull.fr>
References: <OFBD408D77.D26B9915-ONC125767F.004C27DD@frcl.bull.fr>
Message-ID: <1259678327.3532.263.camel@alpha>

On Tue, 2009-12-01 at 15:18 +0100, thipadin.seng-long at bull.net wrote:
> 
> Hi, 
> On one of our cluster I've go a problem to create secret file like
> this: 
> 
> [thipa at vb0 openmpi]$ padb_r341 --create-secret-file 
> Failed to chmod secret file: No such file or directory 
> [thipa at vb0 openmpi]$ 

Ah thanks, I'd briefly seen this on a Solaris system but didn't have
enough time on the system to diagnose it.  Committed as r342.

Ashley,
-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From padb at googlecode.com  Tue Dec  1 14:42:08 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Tue, 01 Dec 2009 14:42:08 +0000
Subject: [padb]  r343 committed - Backport r342 from the HEAD,
	allow creation 	of secret file on machines...
Message-ID: <001636e1fd8838fddd0479abc385@google.com>

Revision: 343
Author: apittman
Date: Tue Dec  1 06:38:06 2009
Log: Backport r342 from the HEAD, allow creation of secret file on machines
which don't support fchmod

http://code.google.com/p/padb/source/detail?r=343

Modified:
  /branches/3.0/src/padb

=======================================
--- /branches/3.0/src/padb	Thu Oct  8 04:25:23 2009
+++ /branches/3.0/src/padb	Tue Dec  1 06:38:06 2009
@@ -4154,7 +4154,7 @@
          printf("Failed to create secret file: $!\n");
          return;
      }
-    if ( chmod( 0600, $FD ) != 1 ) {
+    if ( chmod( 0600, $filename ) != 1 ) {
          printf("Failed to chmod secret file: $!\n");
          return;
      }


From thipadin.seng-long at bull.net  Tue Dec  1 15:30:22 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Tue, 1 Dec 2009 16:30:22 +0100
Subject: [padb] Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
Message-ID: <OF1455126B.6ABF12AA-ONC125767F.004F5359@frcl.bull.fr>

On Mon, 2009-11-30 at 17:31 <ashley at pittman.co.uk> wrote:

>I knew you had to do this when running OpenMPI with slurm however I'd
>never done it myself.  My test cluster has both installed so I should be
>able to try it, do you happen to know if you need and special configure
>options to either to allow this?

I used slurm 2.0.1 and openmpi_1.3.3; later versions should also work.
I don't know of any special configure options. The only thing I did was add
the directory where my openMPI_1.3.3 binaries and libs are installed to my
$PATH.
Check it with the "type mpirun" command; it should show the path to
openmpi.

>Does the mpirun job (i.e. the processes we want) have it's own slurm job
>step or does it share the job step with the allocation?

Just after salloc, the steps are:
[thipa at vb0 openmpi]$ salloc.sh
salloc: Granted job allocation 27828
[thipa at vb0 openmpi]$
[thipa at vb0 openmpi]$ squeue -s
STEPID         NAME PARTITION     USER      TIME NODELIST
[thipa at vb0 openmpi]$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  27828       jlg     bash    thipa   R       0:25      2 vb[8,10]

After mpirun, the steps are:
[thipa at vb0 openmpi]$ squeue -s
STEPID         NAME PARTITION     USER      TIME NODELIST
27828.0       orted       jlg    thipa      1:02 vb[8,10]
[thipa at vb0 openmpi]$

I believe it can't share a job step; each job step is its own.

>I also notice the /proc/version in the patch, does this mean the patch
>works on an OS other than Linux?

It is not complete. To run on an OS other than Linux you would need two
branches:
1 - with /proc/version: use readdir on /proc and read /proc/$pid/cmdline
2 - otherwise: something like "ps -edf | grep slurmstepd".
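
The first branch could look roughly like the sketch below, which walks a
/proc-style directory and returns the pids whose command line matches a
pattern. The function name find_pids_by_cmdline and the optional $proc_root
argument (there so the code can be pointed at a test directory) are
assumptions for illustration; this is not the code from the patch.

```perl
use strict;
use warnings;

# Return the numerically sorted list of pids under $proc_root whose
# cmdline matches $pattern (a compiled regex).
sub find_pids_by_cmdline {
    my ( $pattern, $proc_root ) = @_;
    $proc_root = '/proc' unless defined $proc_root;

    opendir( my $dh, $proc_root ) or return;
    my @pids;
    foreach my $entry ( readdir $dh ) {

        # Only numeric directory names are processes.
        next unless $entry =~ m{\A\d+\z}x;

        # The process may exit between readdir and open, so skip failures.
        open( my $fh, '<', "$proc_root/$entry/cmdline" ) or next;
        local $/ = undef;
        my $cmdline = <$fh>;
        close $fh;
        next unless defined $cmdline;

        # Arguments are NUL-separated; turn them into spaces for matching.
        $cmdline =~ tr/\0/ /;
        push @pids, $entry if $cmdline =~ $pattern;
    }
    closedir $dh;

    my @sorted = sort { $a <=> $b } @pids;
    return @sorted;
}
```

For example, find_pids_by_cmdline(qr{slurmstepd}x) would list candidate
slurmstepd processes on the local node.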

>What happens if you run salloc... srun?  Does this work with the
>existing support and how should users know which resource manager plugin
>to pick (Ideally padb could do the right thing).

Do you mean salloc ... srun ... mpirun prog?
This is what I tried:

[thipa at vb0 openmpi]$ salloc.sh
salloc: Granted job allocation 27830
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ srun -n1 mpirun -bynode -n 6 ./pp_sndrcv_spbl
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
I am, process 0 starting on vb8, total by srun  6
Me, process 0, send  1000 to process 2
I am, process 2 starting on vb8, total by srun  6
I am, process 4 starting on vb8, total by srun  6
I am, process 1 starting on vb10, total by srun  6
I am, process 5 starting on vb10, total by srun  6
I am, process 3 starting on vb10, total by srun  6

There are 2 steps:
[thipa at vb0 openmpi]$ squeue -s
STEPID         NAME PARTITION     USER      TIME NODELIST
27830.0      mpirun       jlg    thipa      0:22 vb8
27830.1       orted       jlg    thipa      0:22 vb10
[thipa at vb0 openmpi]$

And rmgr=slurm (the existing support) doesn't work.
You only catch one stack, that of orterun (mpirun):

[thipa at vb0 openmpi]$ padb_r341 -O stack-shows-locals=no -O stack-shows-params=no -O rmgr=slurm --verbose -tx 27830
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'

Collecting information for job '27830'

Attaching to job 27830
Job has 1 process(es)
Job spans 2 host(s)
Mode 'stack' mode specific options:
     gdb_retry_count : '3'
 max_distinct_values : '3'
  stack_shows_locals : '0'
  stack_shows_params : '0'
   stack_strip_above : 'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress'
   stack_strip_below : 'main,__libc_start_main,start_thread'
    strip_above_wait : '1'
    strip_below_main : '1'
-----------------
[0] (1 processes)
-----------------
main() at main.c:13
  orterun() at orterun.c:686
    opal_event_dispatch() at ?:?
      opal_event_base_loop() at ?:?
        poll_dispatch() at ?:?
          poll() at ?:?
            ??() at ?:?
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$

>> [thipa at machu0 padb_open]$ ./padb -O rmgr="sl-orte" -O
>> stack-shows-locals=no  -O stack-shows-params=no --debug=verbose=all
>> -tx 8324 
>> DEBUG (verbose):   0: There are 1 processes over 3 hosts 

>This isn't great, the number of processes expected is so far only used
>to check for missing processes but there are other potential uses for it
>so I'd rather it was correct.

I will dig into it more; I don't actually know what you use the process
count for.

>> I don't use scontrol listpids, because I found this command not a
>> universal method (some version doesn't have it), 
>> and may issued error message such as : 
>> slurmd[machu139]: proctrack/pgid does not implement
>> slurm_container_get_pids 

>I'd prefer to use this if at all possible, this option was added at a
>request my be several years ago so I'd have thought most versions have
>it by now, can you be clearer on the versions where it doesn't work?

It works only for slurm 1.2 and later; maybe some clients still have older
versions?
If you can get rid of the messages above (slurmd[hostnn]: proctrack/pgid does
not implement ...), I am OK with it.

Thipadin.


From ashley at pittman.co.uk  Tue Dec  1 15:31:20 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Tue, 01 Dec 2009 15:31:20 +0000
Subject: [padb] Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <OF1D21A8E6.DE2ED8FC-ONC125767E.004CF155@frcl.bull.fr>
References: <OF1D21A8E6.DE2ED8FC-ONC125767E.004CF155@frcl.bull.fr>
Message-ID: <1259681480.3532.271.camel@alpha>


Thipadin,

What do you think of leaving this under the umbrella of the slurm
resource manager but having an option to enable it as a mode?
Really it's a special use case of Slurm, so padb should reflect this by
keeping its resource manager code clean (or at least no dirtier than it
is) and having a different mode to run in which would select the target
processes differently.

This leaves open the option of padb being able to detect that it only
found one process (called orterun) and advise the user accordingly or
perhaps even re-try with the option enabled.

Attached is a patch that does this, note that I've not changed any of
the code in sl_orte_find_pids, merely changed the mechanism used to call
it.

Thoughts?

I'm away Thursday/Friday this week but should be able to take a closer
look at the actual code the beginning of next week, as I said I've got a
cluster I can run it on this time.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: padb-slurm-open.patch
Type: text/x-patch
Size: 4971 bytes
Desc: not available
URL: <http://pittman.org.uk/pipermail/padb-devel_pittman.org.uk/attachments/20091201/f1b94a79/attachment.bin>

From ashley at pittman.co.uk  Tue Dec  1 17:17:03 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Tue, 01 Dec 2009 17:17:03 +0000
Subject: [padb] Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <OF1455126B.6ABF12AA-ONC125767F.004F5359@frcl.bull.fr>
References: <OF1455126B.6ABF12AA-ONC125767F.004F5359@frcl.bull.fr>
Message-ID: <1259687823.3532.346.camel@alpha>

On Tue, 2009-12-01 at 16:30 +0100, thipadin.seng-long at bull.net wrote:
> >I also notice the /proc/version in the patch, does this mean the
> patch
> >works on an OS other than Linux? 
> 
> It is not complete, to run on other OS that linux you must have two
> branches: 
> 1 - with /proc/version using readdir /proc and /proc/$pid/cmdline 
> 2 - with "ps -edf | grep slurmstepd" something like this. 

I'd view slurmstepd arguments as liable to change over time, if there
was a way to get this information without relying on slurm internals I'd
prefer it.  If this is the only way to get the information then we'll
use it however.

> >What happens if you run salloc... srun?  Does this work with the
> >existing support and how should users know which resource manager
> plugin
> >to pick (Ideally padb could do the right thing). 
> 
> You mean salloc ... srun ...mpirun  prog ? 
> That's what I have experimented: 

I was thinking salloc...srun with, say, an MPICH2 program.  Does that work
with the existing slurm support?  How would we document for users which
resource manager (or mode) to use in this case?


> >> [thipa at machu0 padb_open]$ ./padb -O rmgr="sl-orte" -O
> >> stack-shows-locals=no  -O stack-shows-params=no --debug=verbose=all
> >> -tx 8324 
> >> DEBUG (verbose):   0: There are 1 processes over 3 hosts 
> 
> >This isn't great, the number of processes expected is so far only
> used
> >to check for missing processes but there are other potential uses for
> it
> >so I'd rather it was correct. 
> 
> I will dig it more, I don't know the meaning of processes number
> actually you do with.

The expected process count is returned by setup_job and is only used to
ensure that all processes are present. padb could live without this;
however, being able to warn users about missing processes is useful and was
something that was requested from the 2.0 series.  Perhaps I could make
it so that nprocs was returned from the find_pids function on the inner
process and passed back up the tree somehow.

> >> I don't use scontrol listpids, because I found this command not a
> >> universal method (some version doesn't have it), 
> >> and may issued error message such as : 
> >> slurmd[machu139]: proctrack/pgid does not implement
> >> slurm_container_get_pids 
> 
> >I'd prefer to use this if at all possible, this option was added at a
> >request my be several years ago so I'd have thought most versions
> have
> >it by now, can you be clearer on the versions where it doesn't work? 
> 
> It work only for slurm upper from 1.2, may be some clients have it
> still ?

At some point we have to drop support for old versions. The current
slurm code won't work without it, so requiring it for the
openmpi/slurm combination doesn't seem like too much of a hardship to
me.

> If you can get rid of messages above (slurmd[hostnn]: proctrack/pgid
> does not implement) 

I'll raise this on the slurm list, I get these warning messages too but
I'd assumed that was because I'm using a debug build.  The listpids
command still works even though these warnings are issued.

Ashley,
-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From sylvain.jeaugey at bull.net  Wed Dec  2 08:49:50 2009
From: sylvain.jeaugey at bull.net (Sylvain Jeaugey)
Date: Wed, 2 Dec 2009 09:49:50 +0100 (CET)
Subject: [padb] Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <1259687823.3532.346.camel@alpha>
References: <OF1455126B.6ABF12AA-ONC125767F.004F5359@frcl.bull.fr>
	<1259687823.3532.346.camel@alpha>
Message-ID: <alpine.DEB.2.00.0912020936300.3715@jeaugeys.frec.bull.fr>

On Tue, 1 Dec 2009, Ashley Pittman wrote:

>>> What happens if you run salloc... srun?  Does this work with the
>>> existing support and how should users know which resource manager
>> plugin
>>> to pick (Ideally padb could do the right thing).
>>
>> You mean salloc ... srun ...mpirun  prog ?
>> That's what I have experimented:
>
> I was thinking salloc...srun with say a MPICH2 program.  Does that work
> with the existing slurm support?  How would we document for users which
> resource manager (or mode) to use in this case?
srun inside salloc is just the same as a simple srun. For the record, we 
also use srun directly with Open MPI programs (still beta, but works). And 
padb works fine with the existing slurm support as it would with an 
MPICH2 program.

Sylvain


From ashley at pittman.co.uk  Wed Dec  2 11:40:06 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 02 Dec 2009 11:40:06 +0000
Subject: [padb] Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <alpine.DEB.2.00.0912020936300.3715@jeaugeys.frec.bull.fr>
References: <OF1455126B.6ABF12AA-ONC125767F.004F5359@frcl.bull.fr>
	<1259687823.3532.346.camel@alpha>
	<alpine.DEB.2.00.0912020936300.3715@jeaugeys.frec.bull.fr>
Message-ID: <1259754006.2676.1.camel@alpha>

On Wed, 2009-12-02 at 09:49 +0100, Sylvain Jeaugey wrote:
> On Tue, 1 Dec 2009, Ashley Pittman wrote:
> 
> >>> What happens if you run salloc... srun?  Does this work with the
> >>> existing support and how should users know which resource manager
> >> plugin
> >>> to pick (Ideally padb could do the right thing).
> >>
> >> You mean salloc ... srun ...mpirun  prog ?
> >> That's what I have experimented:
> >
> > I was thinking salloc...srun with say a MPICH2 program.  Does that work
> > with the existing slurm support?  How would we document for users which
> > resource manager (or mode) to use in this case?
> srun inside salloc is just the same as a simple srun. For the record, we 
> also use srun directly with Open MPI programs (still beta, but works). And 
> padb works fine with the existing slurm support as it would with an 
> MPICH2 program.

Ok, that's what I thought.  It's good news that it just works but it also
means it's probably not possible to detect from slurm what type of job
is running, that is, whether you need the existing code or the new code.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From padb at googlecode.com  Wed Dec  2 14:18:54 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Wed, 02 Dec 2009 14:18:54 +0000
Subject: [padb] r344 committed - Add a mechanism to allow the find_pids
	function on the inner processes...
Message-ID: <0016e64dde58f816e50479bf8dbb@google.com>

Revision: 344
Author: apittman
Date: Wed Dec  2 06:18:09 2009
Log: Add a mechanism to allow the find_pids function on the inner processes
to report back a different value of nprocesses to the outer process.
The find_pids function can do this by calling
target_key_pair($rank,"JOB_SIZE",$job_size) which the outer process
will spot and update its expectations accordingly.

http://code.google.com/p/padb/source/detail?r=344

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Tue Dec  1 06:37:02 2009
+++ /trunk/src/padb	Wed Dec  2 06:18:09 2009
@@ -4175,6 +4175,22 @@

      # The inner process has signed on.
      if ( $comm_data->{current_req}->{mode} eq 'signon' ) {
+
+        # Allow the find_pids function to report back a different job
+        # size to the one the resource manager spotted, potentially
+        # because there is a job running under an allocation and there
+        # may be a discrepancy between the two.
+        if ( defined $d->{target_data}{JOB_SIZE} ) {
+            my @size = keys %{ $d->{target_data}{JOB_SIZE} };
+            if ( @size == 1 ) {
+                $comm_data->{nprocesses} = $size[0];
+            } else {
+                print
+                  "More than one value reported for Job Size, using largest\n";
+                my @s = sort { $a <=> $b } @size;
+                $comm_data->{nprocesses} = $s[-1];
+            }
+        }

          # Check the signon messages, reporting minor errors to the user, if
          # no processes are found then don't bother processing any commands


From ashley at pittman.co.uk  Wed Dec  2 15:51:09 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 02 Dec 2009 15:51:09 +0000
Subject: [padb] Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <1259681480.3532.271.camel@alpha>
References: <OF1D21A8E6.DE2ED8FC-ONC125767E.004CF155@frcl.bull.fr>
	<1259681480.3532.271.camel@alpha>
Message-ID: <1259769070.6352.51.camel@alpha>

On Tue, 2009-12-01 at 15:31 +0000, Ashley Pittman wrote:
> I'm away Thursday/Friday this week but should be able to take a closer
> look at the actual code the beginning of next week, as I said I've got a
> cluster I can run it on this time.

The code almost works for me, all I've changed is as I sent before,
using a configuration option to turn it on and adding a call to
target_key_pair($rank,"JOB_SIZE"...), see r344 for details of this.

[ashley at cloud0 src]$ ./padb -a --proc-summary  -Oslurm_orte_alloc=true
Warning, failed to locate ranks [0,2]
rank  hostname  pid   vmsize    vmrss    S  uptime  %cpu  lcore  command 
   1    cloud1  1618  73504 kB  3928 kB  R    1.99    21      0  deadlock
   3    cloud1  1619  73504 kB  3932 kB  R    1.99    23      0  deadlock

As you can see it's missing the processes from cloud0, which is where
mpirun is executing.  The same job shows up as expected using the orte
resource manager; the limitation there is that it only works from
the node where mpirun is running.

[ashley at cloud0 src]$ ./padb -a --proc-summary -Ormgr=orte
rank  hostname  pid   vmsize    vmrss    S  uptime  %cpu  lcore  command 
   0    cloud0  3199  73380 kB  3900 kB  R    2.00    21      0  deadlock
   1    cloud1  1618  73504 kB  3928 kB  R    1.99    21      0  deadlock
   2    cloud0  3200  73384 kB  3908 kB  R    2.00    18      0  deadlock
   3    cloud1  1619  73504 kB  3932 kB  R    1.99    20      0  deadlock

These are the relevant parts of the process tree from cloud0; you can
trace deadlock back to the mpirun without any slurmstepd on this node at
all.

ps -o pid,ppid,user,cmd -xa
 2851  1219 ashley   salloc -N 2 -n3 -O
 2854  2851 ashley   /bin/bash
 3192  2854 ashley   mpirun -n 4 /home/ashley/general/mpi/deadlock
 3193  3192 ashley   srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=cloud1 orted -mca ess slurm -mca orte_ess_jobid 258146304 -mca orte_es
 3199  3192 ashley   /home/ashley/general/mpi/deadlock
 3200  3192 ashley   /home/ashley/general/mpi/deadlock

I'm wondering if it might be better to simply walk all processes in a
very similar way to pbs_find_pids and check for OMPI_COMM_WORLD_RANK,
OMPI_COMM_WORLD_SIZE, SLURM_JOB_ID and SLURM_STEP_ID.  This code could
then be used as a fallback in case scontrol listpids failed to return any
pids and hence wouldn't need any options twiddled to enable it.

Combined with some more intelligent setting of default values for
slurm_job_step, that could make this case fully automatic with the
user just specifying the jobid and nothing else.
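
A rough sketch of the environment check, assuming the raw contents of
/proc/$pid/environ (NUL-separated KEY=VALUE pairs) have already been read
into a string. The function names parse_environ and rank_for_job are
hypothetical and only the variable names come from the paragraph above; this
is not the attached patch.

```perl
use strict;
use warnings;

# Split a NUL-separated environment block into a hash.
sub parse_environ {
    my ($raw) = @_;
    my %env;
    foreach my $pair ( split /\0/, $raw ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        next unless defined $value;
        $env{$key} = $value;
    }
    return %env;
}

# Return (rank, size) if the process belongs to slurm job $job_id and
# looks like an Open MPI rank, or an empty list otherwise.
sub rank_for_job {
    my ( $raw, $job_id ) = @_;
    my %env = parse_environ($raw);
    return unless defined $env{SLURM_JOB_ID};
    return unless $env{SLURM_JOB_ID} eq $job_id;
    return unless defined $env{OMPI_COMM_WORLD_RANK};
    return ( $env{OMPI_COMM_WORLD_RANK}, $env{OMPI_COMM_WORLD_SIZE} );
}
```

The size returned here is the kind of value that could be reported back with
target_key_pair($rank,"JOB_SIZE",$job_size) as described in r344.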

Attached is the patch as I've been using it.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: padb-slurm-open-2.patch
Type: text/x-patch
Size: 5461 bytes
Desc: not available
URL: <http://pittman.org.uk/pipermail/padb-devel_pittman.org.uk/attachments/20091202/2af8b613/attachment.bin>

From padb at googlecode.com  Wed Dec  2 16:34:58 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Wed, 02 Dec 2009 16:34:58 +0000
Subject: [padb] r345 committed - Slurm: Try to pick a sensible (valid)
	default value for...
Message-ID: <0016e646922c8f983f0479c1745b@google.com>

Revision: 345
Author: apittman
Date: Wed Dec  2 08:34:27 2009
Log: Slurm:  Try to pick a sensible (valid) default value for
slurm_job_step rather than just using a value of zero.
Revert back to using zero if we can't find any trace of any
active steps.
Also convert slurm_setup_pcmd to slurm_setup_job.

http://code.google.com/p/padb/source/detail?r=345

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Wed Dec  2 06:18:09 2009
+++ /trunk/src/padb	Wed Dec  2 08:34:27 2009
@@ -441,7 +441,7 @@
      is_installed           => \&slurm_is_installed,
      get_active_jobs        => \&slurm_get_jobs,
      job_is_running         => \&slurm_job_is_running,
-    setup_pcmd             => \&slurm_setup_pcmd,
+    setup_job              => \&slurm_setup_job,
      find_pids              => \&slurm_find_pids,
      require_inner_callback => 1,
  };
@@ -519,7 +519,7 @@
  $conf{prun_exittimeout} = '2m';
  $conf{rmgr}             = undef;

-$conf{slurm_job_step} = 0;
+$conf{slurm_job_step} = undef;

  $conf{pbs_server} = undef;

@@ -552,6 +552,7 @@
  my $EQUALS = qr{=}x;
  my $SPACE  = qr{\s+}x;
  my $COLON  = qr{:}x;
+my $PERIOD = qr{\.}x;

  my $EMPTY_STRING = q{};

@@ -2472,11 +2473,41 @@
      return ( $status eq 'running' );
  }

-sub slurm_setup_pcmd {
-    my $job  = shift;
+sub slurm_setup_job {
+    my $job = shift;
+
+    # After we have selected a job id and decided to target it make a
+    # best-attempt effort to pick a sensible step_id.  List all the
+    # step ids slurm thinks are running and pick the first one.
+    # Previously this value just defaulted to zero.
+    if ( not defined $conf{slurm_job_step} ) {
+        my @all_steps = slurp_cmd("squeue -s -o %i");
+        my @valid_steps;
+        foreach my $step (@all_steps) {
+            chomp $step;
+            next if $step eq "STEPID";
+            my ( $job_id, $job_step ) = split $PERIOD, $step;
+            next unless $job_id == $job;
+            push @valid_steps, $job_step;
+        }
+        if (@valid_steps) {
+            config_set_internal( 'slurm_job_step', $valid_steps[0] );
+        } else {
+            print
+              "Unable to determine any valid job steps, assuming step id 0\n";
+            config_set_internal( 'slurm_job_step', 0 );
+        }
+    }
+
      my $cpus = slurm_job_to_ncpus($job);
      my $nc   = slurm_job_to_nodecount($job);
-    return ( "srun --jobid=$job", $cpus, $nc );
+
+    my %pcmd;
+    $pcmd{nprocesses} = $cpus;
+    $pcmd{nhosts}     = $nc;
+    $pcmd{command}    = "srun --jobid=$job";
+    return %pcmd;
+
  }

 ###############################################################################
@@ -5085,7 +5116,13 @@
      }

      foreach my $co (@conf_int) {
-        check_int( $conf{$co} );
+
+        # Only check for defined values here, for some options only
+        # intergers are valid but the default value is undef which means
+        # padb should attempt to do the right thing.
+        if ( defined $conf{$co} ) {
+            check_int( $conf{$co} );
+        }
      }

      # Now go through all the config options and both verify they are


From ashley at pittman.co.uk  Wed Dec  2 17:48:56 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 02 Dec 2009 17:48:56 +0000
Subject: [padb] Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <1259769070.6352.51.camel@alpha>
References: <OF1D21A8E6.DE2ED8FC-ONC125767E.004CF155@frcl.bull.fr>
	<1259681480.3532.271.camel@alpha>  <1259769070.6352.51.camel@alpha>
Message-ID: <1259776136.6352.56.camel@alpha>

On Wed, 2009-12-02 at 15:51 +0000, Ashley Pittman wrote:
> 
> I'm wondering if it might be better to simply walk all processes in a
> very similar way to pbs_find_pids and check for OMPI_COMM_WORLD_RANK
> OMPI_COMM_WORLD_SIZE, SLURM_JOB_ID and SLURM_STEP_ID.  This code could
> then be used as a fallback in case scontol listpids failed to return
> any
> pids and hence wouldn't need any options twiddled to enable it.
> 
> Combined with some more intelligent setting of default values for
> slurm_job_step and that could make this case full automatic with the
> user just specifying the jobid and nothing else. 

The attached patch implements just that, "padb -a --proc-summary
-Ormgr=slurm" works for me correctly in all cases I've tested.

Let me know if this works for you and if you're happy with this
approach.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: padb-slurm-open-3.patch
Type: text/x-patch
Size: 2716 bytes
Desc: not available
URL: <http://pittman.org.uk/pipermail/padb-devel_pittman.org.uk/attachments/20091202/5e2fdf17/attachment.bin>

From thipadin.seng-long at bull.net  Thu Dec  3 10:45:37 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Thu, 3 Dec 2009 11:45:37 +0100
Subject: [padb] Réf. : Re: Patch of support of Slurm + OpenmpiOrte manager
Message-ID: <OF9A83C0AC.FF6B43AD-ONC1257681.00381E56@frcl.bull.fr>

Hi,
I was off yesterday; in the meantime you have produced several versions.
I am only testing the last one (padb-slurm-open-3.patch).
I understand you want padb to handle automatically whether or not the user
is running the slurm/openmpi combination.

I am starting with a case that goes wrong; I think it needs more handling.
The combination is:
salloc
srun -n 1 mpirun -bynode -n 8 my_prog
This combination should be equivalent to
salloc
mpirun -bynode -n 8 my_prog
and it is what I used in all my tests.
The result is a little confusing; let's have a look:

The test:

[thipa at vb0 openmpi]$ salloc -p jlg -w vb8,vb9,vb10
salloc: Granted job allocation 27834
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ srun -n1 mpirun -bynode -n 8 ./pp_sndrcv_spbl
srun: Warning: can't run 1 processes on 3 nodes, setting nnodes to 1
I am, process 3 starting on vb8, total by srun  8
I am, process 6 starting on vb8, total by srun  8
I am, process 0 starting on vb8, total by srun  8
I am, process 7 starting on vb9, total by srun  8
I am, process 4 starting on vb9, total by srun  8
I am, process 2 starting on vb10, total by srun  8
I am, process 5 starting on vb10, total by srun  8
I am, process 1 starting on vb9, total by srun  8
Me, process 0, send  1000 to process 2


Padb Test:

[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose -tx 27834
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'

Collecting information for job '27834'

Attaching to job 27834
Job has 1 process(es)
Job spans 3 host(s)
Warning, failed to locate ranks [3,6]
Warning, remote process name differs across ranks
name : ranks
mpirun : [0]
pp_sndrcv_spbl : [1-2,4-5,7]
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,4-5,7]
Mode 'stack' mode specific options:
     gdb_retry_count : '3'
 max_distinct_values : '3'
  stack_shows_locals : '0'
  stack_shows_params : '0'
   stack_strip_above : 'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress'
   stack_strip_below : 'main,__libc_start_main,start_thread'
    strip_above_wait : '1'
    strip_below_main : '1'
-----------------
[0] (1 processes)
-----------------
main() at main.c:13
  orterun() at orterun.c:686
    opal_event_dispatch() at ?:?
      opal_event_base_loop() at ?:?
        poll_dispatch() at ?:?
          poll() at ?:?
            ??() at ?:?
-----------------
[1-2,4-5,7] (5 processes)
-----------------
ThreadId: 1
  -----------------
  [1,4-5,7] (4 processes)
  -----------------
  main() at pp_sndrcv_spbl.c:53
    PMPI_Finalize() at ?:?
      ompi_mpi_finalize() at ?:?
        barrier() at ?:?
          opal_progress() at ?:?
            ThreadId: 2
              start_thread() at ?:?
                btl_openib_async_thread() at ?:?
                  poll() at ?:?
                    ??() at ?:?
                      ThreadId: 3
                        start_thread() at ?:?
                          service_thread_start() at ?:?
                            __GC___select() at ?:?
                              ??() at ?:?
  -----------------
  [2] (1 processes)
  -----------------
  main() at pp_sndrcv_spbl.c:49
    PMPI_Recv() at ?:?
      mca_pml_ob1_recv() at ?:?
        opal_progress() at ?:?
          ThreadId: 2
            start_thread() at ?:?
              btl_openib_async_thread() at ?:?
                poll() at ?:?
                  ??() at ?:?
                    ThreadId: 3
                      start_thread() at ?:?
                        service_thread_start() at ?:?
                          __GC___select() at ?:?
                            ??() at ?:?
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose --proc-summary
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'
padbr345P: Error: no jobs specified, use --all or jobids
[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose --proc-summary -a
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'
Active jobs (1) are 27834

Collecting information for job '27834'

Attaching to job 27834
Job has 1 process(es)
Job spans 3 host(s)
Warning, failed to locate ranks [3,6]
Warning, remote process name differs across ranks
name : ranks
mpirun : [0]
pp_sndrcv_spbl : [1-2,4-5,7]
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,4-5,7]
Mode 'proc_summary' mode specific options:
    column_seperator : '  '
       nprocs_output : undef
         proc_format : 'rank,hostname,pid,vmsize,vmrss,stat.state=S,load1=uptime,pcpu=%cpu,stat.processor=lcore,name=command'
    proc_show_header : '1'
      proc_shows_fds : '0'
     proc_shows_maps : '0'
     proc_shows_proc : '1'
     proc_shows_stat : '1'
       proc_sort_key : 'rank'
  reverse_sort_order : '0'
rank  hostname  pid    vmsize     vmrss     S  uptime  %cpu  lcore  command
   0       vb8  22210   16320 kB  13952 kB  S    0.00     0      3  mpirun
   1       vb9  14985  112384 kB  25600 kB  S    0.08     0      5  pp_sndrcv_spbl
   2      vb10   9540  133440 kB  47296 kB  R    1.15    99      1  pp_sndrcv_spbl
   4       vb9  14986  111616 kB  25600 kB  S    0.08     0      5  pp_sndrcv_spbl
   5      vb10   9544  111616 kB  25600 kB  S    1.15     0      0  pp_sndrcv_spbl
   7       vb9  14987  112640 kB  25728 kB  S    0.08     0      0  pp_sndrcv_spbl
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$

All processes alive:

ssh vb8
[thipa at vb8 ~]$ psu
  PID  PPID CMD
22210 22206 /home_nfs/thipa/openMPI/install/bin/mpirun -bynode -n 8 ./pp_sndrcv_
22213 22210 srun --nodes=2 --ntasks=2 --kill-on-bad-exit --nodelist=vb9,vb10 ort
22218 22210 ./pp_sndrcv_spbl
22219 22210 ./pp_sndrcv_spbl
22220 22210 ./pp_sndrcv_spbl
22990 22986 sshd: thipa at pts/6
22991 22990 -bash
23021 22991 ps -o pid,ppid,cmd -u thipa
[thipa at vb8 ~]$ ssh vb9
[thipa at vb9 ~]$ psu
  PID  PPID CMD
14982 14978 /home_nfs/thipa/openMPI/install/bin/orted -mca ess slurm -mca orte_e
14985 14982 ./pp_sndrcv_spbl
14986 14982 ./pp_sndrcv_spbl
14987 14982 ./pp_sndrcv_spbl
15776 15772 sshd: thipa at pts/6
15777 15776 -bash
15807 15777 ps -o pid,ppid,cmd -u thipa
[thipa at vb9 ~]$ ssh vb10
[thipa at vb10 ~]$ psu
  PID  PPID CMD
 9531  9527 /home_nfs/thipa/openMPI/install/bin/orted -mca ess slurm -mca orte_e
 9534  9531 ./pp_sndrcv_spbl
 9535  9531 ./pp_sndrcv_spbl
10513 10509 sshd: thipa at pts/4
10514 10513 -bash
10544 10514 ps -o pid,ppid,cmd -u thipa
[thipa at vb10 ~]$ 

mpirun is reported as rank 0, which it should not be, and ranks 3 and 6 are missing.


Now the other test that works:
Combination:
salloc 
mpirun  -bynode -n 8 my_prog

The test:

[thipa at vb0 openmpi]$ salloc -p jlg -w vb8,vb9,vb10
salloc: Granted job allocation 27835
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ mpirun -bynode -n 8 ./pp_sndrcv_spbl
I am, process 1 starting on vb9, total by srun  8
I am, process 4 starting on vb9, total by srun  8
I am, process 0 starting on vb8, total by srun  8
I am, process 6 starting on vb8, total by srun  8
I am, process 7 starting on vb9, total by srun  8
I am, process 2 starting on vb10, total by srun  8
I am, process 5 starting on vb10, total by srun  8
I am, process 3 starting on vb8, total by srun  8
Me, process 0, send  1000 to process 2

Padb test:

[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm --verbose --proc-summary -a
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Active jobs (1) are 27835

Collecting information for job '27835'

Attaching to job 27835
Job has 3 process(es)
Job spans 3 host(s)
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,3-7]
Mode 'proc_summary' mode specific options:
    column_seperator : '  '
       nprocs_output : undef
         proc_format : 'rank,hostname,pid,vmsize,vmrss,stat.state=S,load1=uptime,pcpu=%cpu,stat.processor=lcore,name=command'
    proc_show_header : '1'
      proc_shows_fds : '0'
     proc_shows_maps : '0'
     proc_shows_proc : '1'
     proc_shows_stat : '1'
       proc_sort_key : 'rank'
  reverse_sort_order : '0'
rank  hostname  pid    vmsize     vmrss     S  uptime  %cpu  lcore  command
   0       vb8  23049  133440 kB  47104 kB  S    0.00     0      5  pp_sndrcv_spbl
   1       vb9  15828  112640 kB  25408 kB  S    0.00     0      0  pp_sndrcv_spbl
   2      vb10  10571  134464 kB  47168 kB  R    0.92   100      0  pp_sndrcv_spbl
   3       vb8  23058  111616 kB  25536 kB  S    0.00     0      2  pp_sndrcv_spbl
   4       vb9  15845  111616 kB  25408 kB  S    0.00     0      0  pp_sndrcv_spbl
   5      vb10  10575  111616 kB  25408 kB  S    0.92     0      1  pp_sndrcv_spbl
   6       vb8  23054  111616 kB  25408 kB  S    0.00     0      0  pp_sndrcv_spbl
   7       vb9  15830  111616 kB  25408 kB  S    0.00     0      0  pp_sndrcv_spbl
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ 
[thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose -tx 27835
Loading config from "/etc/padb.conf"
Loading config from "/home_nfs/thipa/.padbrc"
Loading config from environment
Loading config from command line
Setting 'rmgr' to 'slurm'
Setting 'stack_shows_locals' to 'no'
Setting 'stack_shows_params' to 'no'

Collecting information for job '27835'

Attaching to job 27835
Job has 3 process(es)
Job spans 3 host(s)
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,3-7]
Mode 'stack' mode specific options:
     gdb_retry_count : '3'
 max_distinct_values : '3'
  stack_shows_locals : '0'
  stack_shows_params : '0'
   stack_strip_above : 'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress'
   stack_strip_below : 'main,__libc_start_main,start_thread'
    strip_above_wait : '1'
    strip_below_main : '1'
-----------------
[0-7] (8 processes)
-----------------
ThreadId: 1
  -----------------
  [0-1,3-7] (7 processes)
  -----------------
  main() at pp_sndrcv_spbl.c:53
    PMPI_Finalize() at ?:?
      ompi_mpi_finalize() at ?:?
        barrier() at ?:?
          opal_progress() at ?:?
            ThreadId: 2
              start_thread() at ?:?
                btl_openib_async_thread() at ?:?
                  poll() at ?:?
                    ??() at ?:?
                      ThreadId: 3
                        start_thread() at ?:?
                          service_thread_start() at ?:?
                            __GC___select() at ?:?
                              ??() at ?:?
  -----------------
  [2] (1 processes)
  -----------------
  main() at pp_sndrcv_spbl.c:49
    PMPI_Recv() at ?:?
      mca_pml_ob1_recv() at ?:?
        ThreadId: 2
          start_thread() at ?:?
            btl_openib_async_thread() at ?:?
              poll() at ?:?
                ??() at ?:?
                  ThreadId: 3
                    start_thread() at ?:?
                      service_thread_start() at ?:?
                        __GC___select() at ?:?
                          ??() at ?:?
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$


Thipadin.


From ashley at pittman.co.uk  Thu Dec  3 11:08:31 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Thu, 03 Dec 2009 11:08:31 +0000
Subject: [padb] Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <OF9A83C0AC.FF6B43AD-ONC1257681.00381E56@frcl.bull.fr>
References: <OF9A83C0AC.FF6B43AD-ONC1257681.00381E56@frcl.bull.fr>
Message-ID: <1259838511.6352.111.camel@alpha>


I'm just running out of the door myself and will be away until Sunday
now.

On Thu, 2009-12-03 at 11:45 +0100, thipadin.seng-long at bull.net wrote:
> You have mpirun which has rank0, this shouldn't, and you miss 3,6.

Ranks 3 and 6 are on the same node as rank 0. Can you try the following
additional patch, which should cause it to skip over the mpirun process
and look for local ones based on their environment?

If this patch doesn't work, take a look at the contents
of /proc/$pid/status for the process it's erroneously reporting as rank
0 to see what Name is set to.  In the example you sent it's pid 22210.

--- padb-slurm-open-3	2009-12-03 11:03:08.500044734 +0000
+++ padb	2009-12-03 11:03:15.333036493 +0000
@@ -8187,6 +8187,7 @@
         next unless ( $job eq $jobid );
         next unless ( $step == $inner_conf{slurm_job_step} );
         next if( find_from_status( $pid, 'Name' ) eq 'orted');
+        next if( find_from_status( $pid, 'Name' ) eq 'mpirun');
         maybe_show_pid( $global, $pid );
         $found_target = 1;
     }
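
For anyone who wants to check the Name field by hand, here is a minimal
sketch of the lookup.  It is an illustration only, not padb's own
find_from_status(): the helper names status_field and
status_field_for_pid are made up for this example, and it assumes a
Linux /proc where each line of /proc/<pid>/status looks like
"Key:<tab>value".

```perl
use strict;
use warnings;

# Hypothetical helper: pull one field out of the text of a
# /proc/<pid>/status file.  Each line looks like "Name:\tmpirun".
sub status_field {
    my ( $text, $key ) = @_;
    foreach my $line ( split /\n/, $text ) {
        if ( $line =~ m{\A \Q$key\E : \s* (.*?) \s* \z}x ) {
            return $1;
        }
    }
    return;
}

# Hypothetical wrapper: read the status file of a live process.
# Returns undef if the process has gone away or /proc is unavailable.
sub status_field_for_pid {
    my ( $pid, $key ) = @_;
    open my $fh, '<', "/proc/$pid/status" or return;
    local $/ = undef;
    my $text = <$fh>;
    close $fh;
    return status_field( $text, $key );
}

# Example: report the kernel's name for the current process.
my $name = status_field_for_pid( $$, 'Name' );
print 'Name: ', ( defined $name ? $name : 'unknown' ), "\n";
```

Called with the pid that is being reported as rank 0 (22210 in the
example above), this would be expected to return the name of the
launcher rather than the application.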


-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From thipadin.seng-long at bull.net  Thu Dec  3 12:20:47 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Thu, 3 Dec 2009 13:20:47 +0100
Subject: [padb] Réf. : Re: Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
Message-ID: <OF087CAB5A.A1F2D41A-ONC1257681.00433E11@frcl.bull.fr>

Hi, good holidays, there ?
I have applied the patch below.
It works now:

padbr345P -O rmgr=slurm  --proc-summary  -a
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,3-7]
rank  hostname  pid    vmsize     vmrss     S  uptime  %cpu  lcore  command
   0       vb8  24595  133440 kB  47296 kB  S    0.01     0      0  pp_sndrcv_spbl
   1       vb9  17406  111616 kB  25536 kB  S    0.01     0      0  pp_sndrcv_spbl
   2      vb10  12521  133440 kB  47296 kB  R    0.93    99      1  pp_sndrcv_spbl
   3       vb8  24588  111616 kB  25728 kB  S    0.01     0      2  pp_sndrcv_spbl
   4       vb9  17411  111616 kB  25600 kB  S    0.01     0      5  pp_sndrcv_spbl
   5      vb10  12522  111616 kB  25600 kB  S    0.93     0      0  pp_sndrcv_spbl
   6       vb8  24589  111616 kB  25600 kB  S    0.01     0      3  pp_sndrcv_spbl
   7       vb9  17407  112640 kB  25728 kB  S    0.01     0      0  pp_sndrcv_spbl
[thipa at vb0 openmpi]$ 

Thipadin.


Ashley Pittman <ashley at pittman.co.uk>
12/03/2009 12:08 PM

 
        To:      thipadin.seng-long at bull.net
        cc:      florence.vallee at bull.net, francois.wellenreiter at bull.net, padb-devel at pittman.org.uk, Sylvain.JEAUGEY at bull.net
        Subject: Re: Réf. : Re: [padb] Patch of support of Slurm + Openmpi Orte manager


I'm just running out of the door myself and will be away until Sunday
now.

On Thu, 2009-12-03 at 11:45 +0100, thipadin.seng-long at bull.net wrote:
> You have mpirun which has rank0, this shouldn't, and you miss 3,6.

ranks 3 and 6 are on the same node as rank 0, can you try the following
additional patch which should cause it to skip over the mpirun process
and look for local ones based on their environment.

If this patch doesn't work take a look at the the contents
of /proc/$pid/status for the process it's erroneously reporting as rank
0 to see what Name is set to.  In the example you sent it's pid 22210

--- padb-slurm-open-3            2009-12-03 11:03:08.500044734 +0000
+++ padb                 2009-12-03 11:03:15.333036493 +0000
@@ -8187,6 +8187,7 @@
         next unless ( $job eq $jobid );
         next unless ( $step == $inner_conf{slurm_job_step} );
         next if( find_from_status( $pid, 'Name' ) eq 'orted');
+        next if( find_from_status( $pid, 'Name' ) eq 'mpirun');
         maybe_show_pid( $global, $pid );
         $found_target = 1;
     }


-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



From sylvain.jeaugey at bull.net  Thu Dec  3 12:53:14 2009
From: sylvain.jeaugey at bull.net (Sylvain Jeaugey)
Date: Thu, 3 Dec 2009 13:53:14 +0100 (CET)
Subject: [padb] Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <OF9A83C0AC.FF6B43AD-ONC1257681.00381E56@frcl.bull.fr>
References: <OF9A83C0AC.FF6B43AD-ONC1257681.00381E56@frcl.bull.fr>
Message-ID: <alpine.DEB.2.00.0912031344360.3715@jeaugeys.frec.bull.fr>

Thipadin,

I don't understand why this combination is being discussed ?

On Thu, 3 Dec 2009, thipadin.seng-long at bull.net wrote:
> salloc srun -n 1 mpirun -bynode -n 8 my_prog

The "srun -n 1" is useless and I doubt anyone would ever try to use it. Is 
there a reason behind this ?

Sylvain


From thipadin.seng-long at bull.net  Fri Dec  4 09:13:25 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Fri, 4 Dec 2009 10:13:25 +0100
Subject: [padb] Réf. : Re: Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
Message-ID: <OF81732923.DDDAAC86-ONC1257682.00314960@frcl.bull.fr>

On 12/03/2009 01:53 PM Sylvain Jeaugey <sylvain.jeaugey at bull.net> wrote:
> I don't understand why this combination is being discussed ?

> On Thu, 3 Dec 2009, thipadin.seng-long at bull.net wrote:
> > salloc srun -n 1 mpirun -bynode -n 8 my_prog

> The "srun -n 1" is useless and I doubt anyone would ever try to use it.
> Is there a reason behind this ?

There is no particular reason behind this, but padb should support all kinds of jobs.
I admit that "srun -n 1 mpirun" is equivalent to "mpirun".
But no one can prevent somebody from starting jobs like this,
since the syntax is correct.
So if someone starts jobs like this, padb should be able to support them.
That is the way to make padb richer; that's my point of view.
In terms of source code, Ashley added just one line and it
works.

Regards.
Thipadin. 





From sylvain.jeaugey at bull.net  Fri Dec  4 09:35:28 2009
From: sylvain.jeaugey at bull.net (Sylvain Jeaugey)
Date: Fri, 4 Dec 2009 10:35:28 +0100 (CET)
Subject: [padb] Réf. : Re: Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <OF81732923.DDDAAC86-ONC1257682.00314960@frcl.bull.fr>
References: <OF81732923.DDDAAC86-ONC1257682.00314960@frcl.bull.fr>
Message-ID: <alpine.DEB.2.00.0912041026300.3715@jeaugeys.frec.bull.fr>

On Fri, 4 Dec 2009, thipadin.seng-long at bull.net wrote:

> But no one can prevent somebody to start jobs like this since the syntax 
> is correct,
Actually we can. The documentation says : use salloc ... mpirun, not 
salloc srun -n 1 mpirun. And I wouldn't say that the syntax is correct. It 
just *happens* to work. With this command, you're launching this chain :
  salloc -> srun -> mpirun -> srun -> MPI processes
We're lucky it works !

> So if some one start jobs like this, padb should be able to support.
I disagree. Since this has no added value, I don't see why we should 
support it. But if that's only one extra line of code, then let it be ...

Sylvain


From ashley at pittman.co.uk  Fri Dec  4 10:05:37 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Fri, 04 Dec 2009 10:05:37 +0000
Subject: [padb] Réf. : Re: Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <alpine.DEB.2.00.0912041026300.3715@jeaugeys.frec.bull.fr>
References: <OF81732923.DDDAAC86-ONC1257682.00314960@frcl.bull.fr>
	<alpine.DEB.2.00.0912041026300.3715@jeaugeys.frec.bull.fr>
Message-ID: <1259921137.3655.6.camel@alpha>

On Fri, 2009-12-04 at 10:35 +0100, Sylvain Jeaugey wrote:
> On Fri, 4 Dec 2009, thipadin.seng-long at bull.net wrote:
> 
> > But no one can prevent somebody to start jobs like this since the syntax 
> > is correct,
> Actually we can. The documentation says : use salloc ... mpirun, not 
> salloc srun -n 1 mpirun. And I wouldn't say that the syntax is correct. It 
> just *happens* to work. With this command, you're launching this chain :
>   salloc -> srun -> mpirun -> srun -> MPI processes
> We're lucky it works !
> 
> > So if some one start jobs like this, padb should be able to support.
> I disagree. Since this has no added value, I don't see why we should 
> support it. But if that's only one extra line of code, then let it be ...

Given that the failure mode was to report the wrong information, I'm
much happier having this code in than not.  I'll probably change the
check to "next if is_resmgr_process($pid);" which is a superset of what
the code does now and means this case is handled no differently from the
normal salloc/mpirun case.

The examples given nicely demonstrate the benefit of having the signon
check for different executable names, I know some people do do this on
purpose but it's rare enough that it's worth warning about if padb
observes it.

Ashley,

-- 

Ashley Pittman, Brighton, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From padb at googlecode.com  Mon Dec  7 11:30:59 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 07 Dec 2009 11:30:59 +0000
Subject: [padb] r346 committed - Extend the slurm resource manager code to
	also work with Orte...
Message-ID: <001636283ae8a97c00047a21ca2f@google.com>

Revision: 346
Author: apittman
Date: Mon Dec  7 03:29:53 2009
Log: Extend the slurm resource manager code to also work with Orte
(OpenMPI) jobs launched under slurm.  To run these two together
you have to create a slurm allocation and then use the OMPI
mpirun from within this allocation to do the application launch
using ORTE.  In this case "scontrol listpids" can't tell you the
process identifiers.
If scontrol listpids gives no information, or claims to have launched
only further resource managers, walk the process tree looking for
OMPI processes that have slurm specific environment variables set
indicating they belong to this job.
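
The environment-based fallback described in this log can be sketched
roughly as follows.  This is a simplified illustration under stated
assumptions, not the committed code: the helper names parse_environ,
ompi_rank_for_job and find_job_ranks are invented for the example, it
reads /proc/<pid>/environ directly on Linux, and it leaves out the
job-step and parent-process checks that the real change performs.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the raw contents of /proc/<pid>/environ
# (NUL-separated KEY=VALUE strings) into a hash.
sub parse_environ {
    my ($raw) = @_;
    my %env;
    foreach my $entry ( split /\0/, $raw ) {
        my ( $key, $value ) = split /=/, $entry, 2;
        next unless defined $value;
        $env{$key} = $value;
    }
    return %env;
}

# Hypothetical helper: given an environment hash, return the MPI rank
# if the process belongs to slurm job $jobid, or undef otherwise.
sub ompi_rank_for_job {
    my ( $jobid, %env ) = @_;
    return unless defined $env{SLURM_JOB_ID};
    return unless $env{SLURM_JOB_ID} eq $jobid;
    return $env{OMPI_COMM_WORLD_RANK};
}

# Hypothetical driver: scan every numeric entry in /proc and build a
# rank => pid map from the processes we are allowed to inspect.
sub find_job_ranks {
    my ($jobid) = @_;
    my %rank_to_pid;
    foreach my $dir ( glob '/proc/[0-9]*' ) {
        open my $fh, '<', "$dir/environ" or next;
        local $/ = undef;
        my $raw = <$fh>;
        close $fh;
        next unless defined $raw;
        my $rank = ompi_rank_for_job( $jobid, parse_environ($raw) );
        next unless defined $rank;
        my ($pid) = $dir =~ m{(\d+)\z};
        $rank_to_pid{$rank} = $pid;
    }
    return %rank_to_pid;
}

# Example: list the ranks of job 27834 that are visible on this node.
my %ranks = find_job_ranks(27834);
print "rank $_ is pid $ranks{$_}\n" foreach sort { $a <=> $b } keys %ranks;
```

A launcher such as mpirun carries SLURM_JOB_ID but no
OMPI_COMM_WORLD_RANK, so in this sketch it is skipped without needing a
name check.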

http://code.google.com/p/padb/source/detail?r=346

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Wed Dec  2 08:34:27 2009
+++ /trunk/src/padb	Mon Dec  7 03:29:53 2009
@@ -8157,10 +8157,16 @@
  }

  sub is_resmgr_process {
-    my $pid = shift;
+    my $pid  = shift;
      my $name = find_from_status( $pid, 'Name' );
-    my $mgrs =
-      { rmsloader => 1, slurmd => 1, slurmstepd => 1, pbs_attach => 1 };
+    my $mgrs = {
+        rmsloader  => 1,
+        slurmd     => 1,
+        slurmstepd => 1,
+        pbs_attach => 1,
+        orted      => 1,
+        mpirun     => 1,
+    };
      return 1 if ( defined $mgrs->{$name} );
      return;
  }
@@ -8173,13 +8179,58 @@
      my @procs =
        slurp_cmd("scontrol listpids $jobid.$inner_conf{slurm_job_step}");

+    my $found_target;
+
      foreach my $proc (@procs) {
          my ( $pid, $job, $step, undef, $global ) = split $SPACE, $proc;
          next if ( $global eq '-' );
          next unless ( $job eq $jobid );
          next unless ( $step == $inner_conf{slurm_job_step} );
+        next if ( is_resmgr_process($pid) );
          maybe_show_pid( $global, $pid );
-    }
+        $found_target = 1;
+    }
+    return if $found_target;
+
+    # Either we didn't find any processes on this node or we only
+    # found processes named orted.  This could be for two reasons:
+    # The job step might not be running on this node.
+    # The job step might be a openmpi salloc/orterun combination.
+    # If it's the latter then this node could either be the "head"
+    # node where the mpirun is running or a "remote" node where the
+    # job will be launched by orted.
+
+    # Search the process list for processes which belong to this job
+    # and either belong to this job step or don't state which job step
+    # they belong to.
+    foreach my $pid ( get_process_list($target_user) ) {
+
+        # Skip over resource manager processes.
+        next if ( is_resmgr_process($pid) );
+
+        # Skip over ones which aren't direct descendants of a resource manager
+        next unless is_parent_resmgr($pid);
+
+        my $vp;
+        my %env = get_remote_env($pid);
+
+        next unless defined $env{SLURM_JOB_ID};
+        next if ( $env{SLURM_JOB_ID} != $jobid );
+
+        next unless defined $env{OMPI_COMM_WORLD_RANK};
+
+        # If this is defined check it's correct, it might be missing though.
+        if ( defined $env{SLURM_JOB_STEP} ) {
+            next if $env{SLURM_JOB_STEP} != $inner_conf{slurm_job_step};
+        }
+
+        if ( defined $env{OMPI_COMM_WORLD_SIZE} ) {
+            target_key_pair( $vp, "JOB_SIZE", $env{OMPI_COMM_WORLD_SIZE} );
+        }
+
+        maybe_show_pid( $env{OMPI_COMM_WORLD_RANK}, $pid );
+    }
+
      return;
  }


From ashley at pittman.co.uk  Mon Dec  7 11:32:33 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 07 Dec 2009 11:32:33 +0000
Subject: [padb] Réf. : Re: Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <OF087CAB5A.A1F2D41A-ONC1257681.00433E11@frcl.bull.fr>
References: <OF087CAB5A.A1F2D41A-ONC1257681.00433E11@frcl.bull.fr>
Message-ID: <1260185553.4449.2.camel@alpha>

On Thu, 2009-12-03 at 13:20 +0100, thipadin.seng-long at bull.net wrote:
> Hi, good holidays, there ?
> I have applied the patch below.
> It works now: 

This is committed as r346, slightly modified to call is_resmgr_process()
rather than checking the name from /proc/$pid/status as discussed.

Let me know if you have any further problems with this and once again
thanks for the patch.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From padb at googlecode.com  Mon Dec  7 12:22:52 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 07 Dec 2009 12:22:52 +0000
Subject: [padb] r347 committed - Slight improvement of error reporting when
	failing to open the message...
Message-ID: <0016e68ea0b7367071047a228427@google.com>

Revision: 347
Author: apittman
Date: Mon Dec  7 04:22:42 2009
Log: Slight improvement of error reporting when failing to open the message
queue DLL.

http://code.google.com/p/padb/source/detail?r=347

Modified:
  /trunk/src/minfo.c
  /trunk/src/padb

=======================================
--- /trunk/src/minfo.c	Thu Nov  5 14:07:46 2009
+++ /trunk/src/minfo.c	Mon Dec  7 04:22:42 2009
@@ -529,11 +529,14 @@

      dlhandle = dlopen(filename,RTLD_NOW);
      if ( ! dlhandle ) {
-	show_warning("Unable to dlopen dll with RTLD_NOW, trying LAZY...");
+	show_warning("Unable to dlopen dll with RTLD_NOW, trying RTLD_LAZY...");
  	show_warning(dlerror());
  	dlhandle = dlopen(filename,RTLD_LAZY);
-	if ( ! dlhandle )
+	if ( ! dlhandle ) {
+	    show_warning("Unable to dlopen dll with RTLD_LAZY, giving up...");
+	    show_warning(dlerror());
  	    return -1;
+	}
      }

      DLSYM(dll_ep,dlhandle,setup_basic_callbacks);
=======================================
--- /trunk/src/padb	Mon Dec  7 03:29:53 2009
+++ /trunk/src/padb	Mon Dec  7 04:22:42 2009
@@ -6140,6 +6140,7 @@
              }

          } elsif ( $cmd eq 'out:' ) {
+            $stats{out}++;
              if (
                  $r =~ m{\A
                          out:
@@ -6173,6 +6174,7 @@
                  target_key_pair( $vp, 'UNPARSEABLE MINFO', $r );
              }
          } elsif ( $cmd eq 'zzz:' ) {
+            $stats{zzz}++;
              if (
                  $r =~ m{\A
                          zzz:
@@ -6201,6 +6203,7 @@
              push @cd, dclone( \%cd );
              undef %cd;
          } else {
+            $stats{raw}++;
              push @{ $cd{raw} }, $r;
          }
      }


From padb at googlecode.com  Mon Dec  7 12:30:55 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 07 Dec 2009 12:30:55 +0000
Subject: [padb] r348 committed - Add new as-yet unused asynchronous gdb
	attaching code. Gdb can take a...
Message-ID: <001485f6d9f4019aae047a22a1fe@google.com>

Revision: 348
Author: apittman
Date: Mon Dec  7 04:30:03 2009
Log: Add new as-yet unused asynchronous gdb attaching code.  Gdb can take a
while to attach to a process, especially if it has lots of shared
libraries which need to be loaded off a shared filesystem.  This change
adds the option to attach asynchronously, meaning that padb can launch
an instance of gdb for every target process on a node, tell them all to
attach and allow them to proceed in parallel.  This should help
performance a lot, particularly on nodes with lots of cores.
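
The start-everything-then-wait pattern described in this log can be
illustrated with a small generic sketch.  It does not drive gdb and is
not the code in this commit: async_start, async_wait and
run_in_parallel are hypothetical names, and the children are plain
perl one-liners standing in for slow attach operations.

```perl
use strict;
use warnings;

# Hypothetical helper: start a command and return a handle to its
# output without waiting for it to finish.
sub async_start {
    my (@cmd) = @_;
    open my $fh, '-|', @cmd or die "Cannot start '@cmd': $!\n";
    return $fh;
}

# Hypothetical helper: block until the command has finished and
# return everything it printed.
sub async_wait {
    my ($fh) = @_;
    local $/ = undef;
    my $out = <$fh>;
    close $fh;
    return defined $out ? $out : q{};
}

# Start every command first, then collect the results, so the slow
# start-up phases overlap instead of running one after another.
sub run_in_parallel {
    my (@cmds) = @_;    # each element is an array ref: [ program, args... ]
    my @handles = map { async_start( @{$_} ) } @cmds;
    return map { async_wait($_) } @handles;
}

# Example: two child perl processes that each sleep for a second
# should finish in roughly one second overall rather than two.
my @out = run_in_parallel(
    [ $^X, '-e', 'sleep 1; print "first"' ],
    [ $^X, '-e', 'sleep 1; print "second"' ],
);
print join( ', ', @out ), "\n";
```

The two phases correspond to the start and end halves of the attach in
the change below: issue the attach to every gdb instance first, then
go back and collect each result.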

http://code.google.com/p/padb/source/detail?r=348

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Mon Dec  7 04:22:42 2009
+++ /trunk/src/padb	Mon Dec  7 04:30:03 2009
@@ -5473,6 +5473,14 @@
          }
          return;
      }
+
+    gdb_attach_post( $gdb, $pid );
+
+    return $pid;
+}
+
+sub gdb_attach_post {
+    my ( $gdb, $pid ) = @_;

      $gdb->{attached} = 1;
      $gdb->{tracepid} = $pid;
@@ -5493,6 +5501,52 @@

      gdb_n_send( $gdb, '-gdb-set print address off' );

+}
+
+sub gdb_attach_async_start {
+    my ( $gdb, $pid ) = @_;
+
+    if ($running_on_solaris) {
+        my $exe = readlink("/proc/$pid/path/a.out");
+        my %cs = gdb_n_send( $gdb, "file $exe" );
+        if ( $cs{status} ne 'done' ) {
+            croak("Gdb command file $exe failed");
+            return;
+        }
+    }
+
+    send_cont_signal($pid);
+
+    _gdb_send_real_async_start( $gdb, "attach $pid" );
+
+    return;
+}
+
+sub gdb_attach_async_end {
+    my ( $gdb, $pid ) = @_;
+
+    my %p = _gdb_send_real_async_wait( $gdb, "attach $pid" );
+
+    if ( not defined $p{status} ) {
+        $gdb->{error} = 'Failed to attach to process';
+        if ( not find_exe('gdb') ) {
+            $gdb->{error} = 'Failed to attach to process (gdb not installed?)';
+        }
+        return;
+    }
+
+    if ( $p{status} eq 'error' ) {
+        my $r = gdb_parse_reason( $p{reason} );
+        if ( defined $r->{msg} ) {
+            $gdb->{error} = "Failed to attach to process: $r->{msg}";
+        } else {
+            $gdb->{error} = 'Failed to attach to process';
+        }
+        return;
+    }
+
+    gdb_attach_post( $gdb, $pid );
+
      return $pid;
  }

@@ -5543,6 +5597,34 @@
      }
      return %r;
  }
+
+sub _gdb_send_real_async_start {
+    my ( $gdb, $cmd ) = @_;
+    gdb_wait_for_prompt($gdb);
+    my $handle = $gdb->{wtr};
+    my $seq    = $gdb->{seq}++;
+    print {$handle} "$seq$cmd\n";
+    if ( defined $gdb->{debugfd} ) {
+        print { $gdb->{debugfd} } "$seq$cmd\n";
+    }
+    return;
+}
+
+sub _gdb_send_real_async_wait {
+    my ( $gdb, $cmd ) = @_;
+    my $seq = $gdb->{seq};
+    my %r = gdb_n_next_result( $gdb, $seq );
+    if ( $gdb->{attached} and $r{seq} ne $seq ) {
+        croak(
+"Invalid sequence number from gdb, expecting $seq got $r{seq} cmd=\"$cmd\""
+        );
+    }
+    $r{cmd} = $cmd;
+    if ( $gdb->{debugfd} and defined $r{status} and $r{status} ne 'done' ) {
+        print Dumper \%r;
+    }
+    return %r;
+}

  sub _gdb_set_print_address {
      my ( $gdb, $flag ) = @_;


From padb at googlecode.com  Mon Dec  7 12:42:00 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 07 Dec 2009 12:42:00 +0000
Subject: [padb] r349 committed - Add a new type of callback for the mode to
	use, handler_one is...
Message-ID: <0016367657e6a0c07d047a22c8cb@google.com>

Revision: 349
Author: apittman
Date: Mon Dec  7 04:41:09 2009
Log: Add a new type of callback for the mode to use: handler_one is
intended to replace handler and is called once for each target
process.  It differs from handler in that it takes the same
options as the handler_all callback.  In time I'd like to drop
handler for simplicity.
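
A minimal sketch of the per-process dispatch in this commit's diff, assuming
a mode table shaped like padb's %allfns. The two mode names and their
callbacks are hypothetical and only illustrate the fallback from
handler_one to the older handler form.

```perl
use strict;
use warnings;

# Hypothetical mode table: 'new_mode' uses the per-process callback that
# receives the whole process record, 'old_mode' still uses the legacy form.
my %allfns = (
    new_mode => {
        handler_one => sub {
            my ( $cargs, $proc ) = @_;
            return "rank $proc->{vp} pid $proc->{pid} ($cargs->{out_format})";
        },
    },
    old_mode => {
        handler => sub {
            my ( $cargs, $vp, $pid ) = @_;
            return "rank $vp pid $pid";
        },
    },
);

# Call the mode callback once per process, preferring handler_one.
sub run_mode {
    my ( $mode, $cargs, $procs ) = @_;
    my %gres;
    foreach my $proc ( @{$procs} ) {
        my $res;
        if ( defined $allfns{$mode}{handler_one} ) {
            $res = $allfns{$mode}{handler_one}->( $cargs, $proc );
        } else {
            $res = $allfns{$mode}{handler}->( $cargs, $proc->{vp}, $proc->{pid} );
        }
        $gres{ $proc->{vp} } = $res if defined $res;
    }
    return \%gres;
}
```

Passing the whole process record means a callback can pick up extra
per-process state later without its argument list changing again.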

http://code.google.com/p/padb/source/detail?r=349

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Mon Dec  7 04:30:03 2009
+++ /trunk/src/padb	Mon Dec  7 04:41:09 2009
@@ -8791,18 +8791,20 @@
      # or any other rank on this node, we'll have to see if that causes
      # problems or if it's best to clear the target_key_pair() and output()
      # data for this node/rank.
+
+    # Bit of a hack here until I can fix it properly, pass on the
+    # output format so that the stack trace code knows when to do
+    # clever things in tree mode.
+    my $cargs = $cmd->{cargs};
+    if ( defined $cmd->{out_format} ) {
+        $cargs->{out_format} = $cmd->{out_format};
+    } else {
+        $cargs->{out_format} = 'raw';
+    }
+
      if ( defined $allfns{ $cmd->{mode} }{handler_all} ) {
          eval {

-            # Bit of a hack here until I can fix it properly, pass on the
-            # output format so that the stack trace code knows when to do
-            # clever things in tree mode.
-            my $cargs = $cmd->{cargs};
-            if ( defined $cmd->{out_format} ) {
-                $cargs->{out_format} = $cmd->{out_format};
-            } else {
-                $cargs->{out_format} = 'raw';
-            }
              $netdata->{target_response} =
                $allfns{ $cmd->{mode} }{handler_all}( $cargs, $pid_list );
              1;
@@ -8821,9 +8823,19 @@
              my $vp  = $proc->{vp};
              my $pid = $proc->{pid};
              eval {
-                my $res =
-                  $allfns{ $cmd->{mode} }{handler}( $cmd->{cargs}, $vp, $pid );
-                $gres{$vp} = $res if ( defined $res );
+
+                # The only difference here is the type of the first option,
+                # all functions should be converted to a single format here
+                if ( defined $allfns{ $cmd->{mode} }{handler_one} ) {
+                    my $res =
+                      $allfns{ $cmd->{mode} }{handler_one}( $cargs, $proc );
+                    $gres{$vp} = $res if ( defined $res );
+                } else {
+                    my $res =
+                      $allfns{ $cmd->{mode} }{handler}( $cmd->{cargs}, $vp,
+                        $pid );
+                    $gres{$vp} = $res if ( defined $res );
+                }
                  1;
              } or do {
                  my $error = $@;


From padb at googlecode.com  Mon Dec  7 12:58:20 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 07 Dec 2009 12:58:20 +0000
Subject: [padb] r350 committed - Add global_attach and global_detach
	functions for looking after gdb...
Message-ID: <001636c5bda31244ea047a2303ca@google.com>

Revision: 350
Author: apittman
Date: Mon Dec  7 04:57:39 2009
Log: Add global_attach and global_detach functions for looking after gdb
handles.  If a mode tells padb whether it needs to attach to the
target processes, padb can perform the attach for it in an optimal
way, keeping all the code common.  This also allows persistent
attachment across multiple mode calls.
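
The control flow described in this log can be sketched as follows. This is
an illustration only: attach_start, attach_end and detach are hypothetical
stand-ins for padb's gdb helpers, reduced to bookkeeping so that the sketch
runs without a debugger.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the gdb helpers; they only record what happened.
my @log;
sub attach_start { my ($pid) = @_; push @log, "start $pid";     return { pid => $pid }; }
sub attach_end   { my ($h)   = @_; push @log, "end $h->{pid}";  return $h->{pid} > 0; }
sub detach       { my ($h)   = @_; push @log, "detach $h->{pid}"; return; }

# Release every handle that is currently held.
sub detach_all {
    my ($procs) = @_;
    foreach my $proc ( @{$procs} ) {
        next unless defined $proc->{gdb_handle};
        detach( delete $proc->{gdb_handle} );
    }
    return;
}

# Phase one issues every attach request, phase two collects the results,
# so the attaches overlap instead of running one after another.
sub attach_all {
    my ( $needs_gdb, $procs ) = @_;
    if ( not $needs_gdb ) {
        detach_all($procs);
        return;
    }
    foreach my $proc ( @{$procs} ) {
        next if defined $proc->{gdb_handle};    # already attached, reuse it
        $proc->{gdb_tmp} = attach_start( $proc->{pid} );
    }
    foreach my $proc ( @{$procs} ) {
        next if defined $proc->{gdb_handle};
        my $h = delete $proc->{gdb_tmp};
        $proc->{gdb_handle} = $h if attach_end($h);
    }
    return;
}
```

Calling attach_all twice in a row with needs_gdb set does no extra work the
second time, which is what makes the attachment persistent across modes.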

http://code.google.com/p/padb/source/detail?r=350

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Mon Dec  7 04:41:09 2009
+++ /trunk/src/padb	Mon Dec  7 04:57:39 2009
@@ -4065,14 +4065,15 @@
      }

      my $cmd;
+    my $req;

      if ($watch) {
          $cmd = $commands[0];
+        $req->{detach_after_callback} = 1;
      } else {
          $cmd = shift @commands;
      }

-    my $req;
      $req->{mode} = $cmd->{mode};

      if ( defined $cmd->{args} ) {
@@ -7648,6 +7649,67 @@
      }
      return;
  }
+
+# Attach to all local processes in preparation for calling the mode
+# callback.  This function, along with the corresponding one below it
+# handles persistent attachment between modes: modes specify if they
+# want gdb handles or not, if they do then this function attaches for
+# it, if they don't and gdb is attached this function will detach.
+sub global_attach {
+    my ( $mode, $procs ) = @_;
+
+    if ( not $allfns{$mode}{needs_gdb} ) {
+        global_detach($procs);
+        return;
+    }
+
+    foreach my $proc ( @{$procs} ) {
+        my $vp  = $proc->{vp};
+        my $pid = $proc->{pid};
+
+        next if defined $proc->{gdb_handle};
+
+        $proc->{gdb_tmp} = gdb_start();
+        gdb_attach_async_start( $proc->{gdb_tmp}, $pid );
+    }
+
+    foreach my $proc ( @{$procs} ) {
+
+        next if defined $proc->{gdb_handle};
+
+        my $vp  = $proc->{vp};
+        my $pid = $proc->{pid};
+        my $gdb = $proc->{gdb_tmp};
+
+        delete $proc->{gdb_tmp};
+
+        if ( gdb_attach_async_end( $gdb, $pid ) ) {
+            $proc->{gdb_handle} = $gdb;
+        } else {
+            if ( defined $gdb->{error} ) {
+                target_error( $vp, $gdb->{error} );
+            } else {
+                target_error( $vp, 'Failed to attach to process' );
+            }
+            gdb_quit($gdb);
+        }
+    }
+
+    return;
+}
+
+# Detach from all local processes, this function is called from both
+# global_attach and also when padb is exiting.
+sub global_detach {
+    my ($procs) = @_;
+
+    foreach my $proc ( @{$procs} ) {
+        if ( defined $proc->{gdb_handle} ) {
+            gdb_detach( $proc->{gdb_handle} );
+            delete $proc->{gdb_handle};
+        }
+    }
+}

  # Try and be clever here, attach to each and every process on this node
  # first, then go back and query them each in turn, should mean that some
@@ -8742,6 +8804,7 @@
      }

      if ( $cmd->{mode} eq 'exit' ) {
+        global_detach( $inner_conf{all_pids} );
          $netdata->{shutdown} = 1;
          return;
      }
@@ -8801,6 +8864,10 @@
      } else {
          $cargs->{out_format} = 'raw';
      }
+
+    # Ensure that we are attached to the target processes if required
+    # and that we are not if not required.
+    global_attach( $cmd->{mode}, $pid_list );

      if ( defined $allfns{ $cmd->{mode} }{handler_all} ) {
          eval {
@@ -8849,6 +8916,11 @@
              $netdata->{target_response} = \%gres;
          }
      }
+
+    # Detach from all processes if the outer requested us to.
+    if ( defined $cmd->{detach_after_callback} ) {
+        global_detach($pid_list);
+    }

      return;
  }


From padb at googlecode.com  Mon Dec  7 13:20:42 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 07 Dec 2009 13:20:42 +0000
Subject: [padb] r351 committed - Convert a number of mode callbacks from
	using handler_all and having...
Message-ID: <001636c9274f118768047a235347@google.com>

Revision: 351
Author: apittman
Date: Mon Dec  7 05:19:40 2009
Log: Convert a number of mode callbacks from using handler_all and having
the handler attach to the target processes to using handler_one
and setting needs_gdb to have padb do the attach.  This means the
attach is both quicker (it's done asynchronously for all targets on
the local node at the same time) and the attachment is synchronous
across different modes.  The end result of this is that the
wall-clock performance of padb is improved, by up to 50% if
--full-report is being used.
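
From the mode's side, the conversion can be sketched like this. The callback
name, the 'depth' field and the %mode entry are hypothetical; only the
handler_one and needs_gdb keys and the early return on a missing
gdb_handle are taken from the diff below.

```perl
use strict;
use warnings;

# Hypothetical per-process callback: it relies on the caller having
# attached already and skips any process that has no handle.
sub queue_depth_one {
    my ( $carg, $proc ) = @_;
    my $gdb = $proc->{gdb_handle};
    return unless $gdb;
    return "rank $proc->{vp}: $gdb->{depth} queued";
}

# Hypothetical mode entry; needs_gdb asks the framework to do the attach.
my %mode = (
    handler_one => \&queue_depth_one,
    needs_gdb   => 1,
);
```

The callback no longer contains any attach, detach or quit calls, so a
failed attach is reported once by the framework rather than by each mode.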

http://code.google.com/p/padb/source/detail?r=351

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Mon Dec  7 04:57:39 2009
+++ /trunk/src/padb	Mon Dec  7 05:19:40 2009
@@ -6441,91 +6441,35 @@
      return;
  }

-sub show_mpi_queue_all {
-    my ( $carg, $list ) = @_;
-
-    my @all;
-
-    foreach my $proc ( @{$list} ) {
-        my $vp  = $proc->{vp};
-        my $pid = $proc->{pid};
-
-        my $gdb = gdb_start();
-        if ( gdb_attach( $gdb, $pid ) ) {
-            $proc->{gdb} = $gdb;
-            push @all, $proc;
-        } else {
-            if ( defined $gdb->{error} ) {
-                target_error( $vp, $gdb->{error} );
-            } else {
-                target_error( $vp, 'Failed to attach to process' );
-            }
-        }
-
+sub show_mpi_queue_one {
+    my ( $carg, $proc ) = @_;
+
+    my $vp  = $proc->{vp};
+    my $pid = $proc->{pid};
+    my $gdb = $proc->{gdb_handle};
+
+    return unless $gdb;
+
+    my @mq = fetch_mpi_queue_gdb( $carg, $vp, $pid, $gdb );
+
+    foreach my $o (@mq) {
+        output( $vp, $o );
      }

-    foreach my $proc (@all) {
-
-        my $vp  = $proc->{vp};
-        my $pid = $proc->{pid};
-        my $gdb = $proc->{gdb};
-
-        my @mq = fetch_mpi_queue_gdb( $carg, $vp, $pid, $gdb );
-        if ( $mq[0] ) {
-            foreach my $o (@mq) {
-                output( $vp, $o );
-            }
-        }
-    }
-
-    foreach my $proc (@all) {
-        my $gdb = $proc->{gdb};
-        gdb_detach($gdb);
-        gdb_quit($gdb);
-    }
      return;
  }

-# Ideally handle all this at a higher level...
-sub show_mpi_queue_for_deadlock_all {
-    my ( $carg, $list ) = @_;
-
-    my $ret;
-    my @all;
-
-    foreach my $proc ( @{$list} ) {
-        my $vp  = $proc->{vp};
-        my $pid = $proc->{pid};
-
-        my $gdb = gdb_start();
-        if ( gdb_attach( $gdb, $pid ) ) {
-            $proc->{gdb} = $gdb;
-            push @all, $proc;
-        } else {
-            output( $vp, 'Failed to attach to to process' );
-        }
-
-    }
-
-    foreach my $proc (@all) {
-        my $tries = 0;
-
-        my @threads;
-
-        my $vp  = $proc->{vp};
-        my $pid = $proc->{pid};
-        my $gdb = $proc->{gdb};
-
-        my @mq = fetch_mpi_queue_gdb( $carg, $vp, $pid, $gdb );
-        $ret->{$vp} = \@mq;
-    }
-
-    foreach my $proc (@all) {
-        my $gdb = $proc->{gdb};
-        gdb_detach($gdb);
-        gdb_quit($gdb);
-    }
-    return $ret;
+sub show_mpi_queue_for_deadlock_one {
+    my ( $carg, $proc ) = @_;
+
+    my $vp  = $proc->{vp};
+    my $pid = $proc->{pid};
+    my $gdb = $proc->{gdb_handle};
+
+    return unless $gdb;
+
+    my @mq = fetch_mpi_queue_gdb( $carg, $vp, $pid, $gdb );
+    return \@mq;
  }

  sub mpi_queue_output_handler {
@@ -6659,8 +6603,8 @@

          my $gstr = "Information for group '$gid' ($ad{$gid}{name})\n";

-        # Maybe show the group members, hope that the user doesn't turn
-        # this on unless also setting target_groups!
+        # Maybe show the group members, hope that the user doesn't
+        # turn this on unless also setting target_groups!
          if ( $carg->{show_group_members} ) {
              $gstr .= "group has $ad{$gid}{size} members\n";
              if ( defined $ad{$gid}{size} ) {
@@ -7726,8 +7670,8 @@
  # loop in but don't sleep every iteration.  This could be handled better by
  # checking for the presence of one of the stack_strip_below functions in
  # the stack trace.
-sub stack_trace_from_pids {
-    my ( $carg, $list ) = @_;
+sub stack_trace_from_pid {
+    my ( $carg, $proc ) = @_;

      my @all;

@@ -7747,202 +7691,181 @@
          $below{$_} = 1;
      }

-    foreach my $proc ( @{$list} ) {
-        my $vp  = $proc->{vp};
-        my $pid = $proc->{pid};
-
-        my $gdb = gdb_start();
-        if ( gdb_attach( $gdb, $pid ) ) {
-            $proc->{gdb} = $gdb;
-            push @all, $proc;
-        } else {
-            if ( defined $gdb->{error} ) {
-                target_error( $vp, $gdb->{error} );
-            } else {
-                target_error( $vp, 'Failed to attach to process' );
-            }
-        }
-
-    }
-
-    foreach my $proc (@all) {
-        my $tries = 0;
-
-        my @threads;
-
-        my $vp  = $proc->{vp};
-        my $pid = $proc->{pid};
-        my $gdb = $proc->{gdb};
-
-        my $ok;
-        do {
-
-            # The first time round the loop we will have a gdb handle from
-            # above, only re-attach if we have already failed on the first
-            # try and are here a second time.
-            if ( not defined $gdb ) {
-                send_cont_signal($pid);
-                my $g = gdb_start();
-                if ( gdb_attach( $g, $pid ) ) {
-                    $gdb = $g;
+    return unless defined $proc->{gdb_handle};
+
+    my $tries = 0;
+
+    my @threads;
+
+    my $vp  = $proc->{vp};
+    my $pid = $proc->{pid};
+    my $gdb = $proc->{gdb_handle};
+
+    my $ok;
+    do {
+
+        # The first time round the loop we will have a gdb handle from
+        # above, only re-attach if we have already failed on the first
+        # try and are here a second time.
+        if ( $tries > 0 ) {
+            gdb_detach($gdb);
+            gdb_quit($gdb);
+            delete $proc->{gdb_handle};
+            send_cont_signal($pid);
+            $gdb = gdb_start();
+            if ( gdb_attach( $gdb, $pid ) ) {
+                $proc->{gdb_handle} = $gdb;
+            } else {
+                if ( defined $gdb->{error} ) {
+                    target_error( $vp, $gdb->{error} );
                  } else {
-                    if ( defined $g->{error} ) {
-                        target_error( $vp, $g->{error} );
-                    } else {
-                        target_error( $vp, 'Failed to attach to process' );
-                    }
-                }
-            }
-
-            if ( defined $gdb ) {
-                if (   $carg->{stack_shows_params}
-                    or $carg->{stack_shows_locals} )
-                {
-                    @threads = gdb_dump_frames_per_thread( $gdb, 1 );
-                } else {
-                    @threads = gdb_dump_frames_per_thread($gdb);
-                }
-                gdb_detach($gdb);
-                gdb_quit($gdb);
-                $gdb = undef;
-                if ( defined $threads[0]->{frames} ) {
-                    my @frames = @{ $threads[0]->{frames} };
-                    foreach my $frame (@frames) {
-                        if (    defined $frame->{func}
-                            and defined $below{ $frame->{func} } )
-                        {
-                            $ok = 1;
-                            last;
-                        }
-                    }
+                    target_error( $vp, 'Failed to attach to process' );
+                }
+                gdb_quit($gdb);
+                return;
+            }
+        }
+
+        if (   $carg->{stack_shows_params}
+            or $carg->{stack_shows_locals} )
+        {
+            @threads = gdb_dump_frames_per_thread( $gdb, 1 );
+        } else {
+            @threads = gdb_dump_frames_per_thread($gdb);
+        }
+
+        if ( defined $threads[0]->{frames} ) {
+            my @frames = @{ $threads[0]->{frames} };
+            foreach my $frame (@frames) {
+                if (    defined $frame->{func}
+                    and defined $below{ $frame->{func} } )
+                {
+                    $ok = 1;
+                    last;
                  }
              }
-          } while ( ( not $ok )
-            and ( $tries++ < $carg->{gdb_retry_count} ) );
-
-        if ( not defined $threads[0]{id} ) {
-            target_error( $vp,
-                'Could not extract stack trace from application' );
-            next;
          }

-        if ( defined $threads[0]{error} ) {
-            target_error( $vp, $threads[0]{error} );
-            next;
-        }
-
-        foreach my $thread ( sort { $a->{id} <=> $b->{id} } @threads ) {
-            next unless defined $thread->{frames};
-            my @frames = @{ $thread->{frames} };
-
-            output( $vp, "ThreadId: $thread->{id}" ) if ( @threads != 1 );
-
-            my $strip_below;
-
-            # Find a function to strip above.  Only actually enable this if
-            # there is a function present which we are targeting or else no
-            # output will be generated!  Do this in reverse order so we
-            # strip as much as possible from the stack trace.
-            if ( $carg->{strip_below_main} ) {
-                foreach my $frame ( reverse @frames ) {
-                    next unless exists $frame->{func};
-                    if ( defined $below{ $frame->{func} } ) {
-                        $strip_below = $frame->{func};
-                    }
+      } while ( ( not $ok )
+        and ( $tries++ < $carg->{gdb_retry_count} ) );
+
+    if ( not defined $threads[0]{id} ) {
+        target_error( $vp, 'Could not extract stack trace from application' );
+        return;
+    }
+
+    if ( defined $threads[0]{error} ) {
+        target_error( $vp, $threads[0]{error} );
+        return;
+    }
+
+    foreach my $thread ( sort { $a->{id} <=> $b->{id} } @threads ) {
+        next unless defined $thread->{frames};
+        my @frames = @{ $thread->{frames} };
+
+        output( $vp, "ThreadId: $thread->{id}" ) if ( @threads != 1 );
+
+        my $strip_below;
+
+        # Find a function to strip above.  Only actually enable this if
+        # there is a function present which we are targeting or else no
+        # output will be generated!  Do this in reverse order so we
+        # strip as much as possible from the stack trace.
+        if ( $carg->{strip_below_main} ) {
+            foreach my $frame ( reverse @frames ) {
+                next unless exists $frame->{func};
+                if ( defined $below{ $frame->{func} } ) {
+                    $strip_below = $frame->{func};
                  }
              }
-
-            my @fl = $EMPTY_STRING;
-            foreach my $frame ( reverse @frames ) {
-
-                target_error( $vp, "error from gdb: $frame->{error}" )
-                  if exists $frame->{error};
-
-                next unless exists $frame->{level};
-                next unless exists $frame->{func};
-
-                # This seemingly always gets set by gdb even if it is
-                # sometimes set to '??'
-                my $function = $frame->{func};
-
-                next if ( defined $strip_below and $strip_below ne $function );
-
-                $strip_below = undef;
-
-                my $l = sprintf "%s() at %s:%s",
-                  $function,
-                  ( $frame->{file} || '?' ),
-                  ( $frame->{line} || '?' );
-
-                output( $vp, $l );
-
-                if ( $carg->{out_format} eq 'tree' ) {
-                    push @fl, $l;
-                    my $fl = join( ",", @fl );
-                    if ( $carg->{stack_shows_locals} ) {
-                        my @local_names;
-                        foreach my $loc ( @{ $frame->{locals} } ) {
-                            push @local_names, $loc->{name};
-                            target_key_pair( $vp, "$l|var_type|$loc->{name}",
-                                $loc->{type} );
-
-                            if ( length $loc->{value} > 70 ) {
-                                target_key_pair(
-                                    $vp,
-                                    $fl . '|var|' . $loc->{name},
-                                    pretify_variable(
-                                        'value too long to display')
-                                );
-                            } else {
-                                target_key_pair( $vp,
-                                    $fl . '|var|' . $loc->{name},
-                                    $loc->{value} );
-                            }
-                        }
-                        if ( @local_names > 0 ) {
-                            target_key_pair( $vp, "$l|locals",
-                                join( q{,}, sort @local_names ) );
+        }
+
+        my @fl = $EMPTY_STRING;
+        foreach my $frame ( reverse @frames ) {
+
+            target_error( $vp, "error from gdb: $frame->{error}" )
+              if exists $frame->{error};
+
+            next unless exists $frame->{level};
+            next unless exists $frame->{func};
+
+            # This seemingly always gets set by gdb even if it is
+            # sometimes set to '??'
+            my $function = $frame->{func};
+
+            next if ( defined $strip_below and $strip_below ne $function );
+
+            $strip_below = undef;
+
+            my $l = sprintf "%s() at %s:%s",
+              $function,
+              ( $frame->{file} || '?' ),
+              ( $frame->{line} || '?' );
+
+            output( $vp, $l );
+
+            if ( $carg->{out_format} eq 'tree' ) {
+                push @fl, $l;
+                my $fl = join( ",", @fl );
+                if ( $carg->{stack_shows_locals} ) {
+                    my @local_names;
+                    foreach my $loc ( @{ $frame->{locals} } ) {
+                        push @local_names, $loc->{name};
+                        target_key_pair( $vp, "$l|var_type|$loc->{name}",
+                            $loc->{type} );
+
+                        if ( length $loc->{value} > 70 ) {
+                            target_key_pair(
+                                $vp,
+                                $fl . '|var|' . $loc->{name},
+                                pretify_variable('value too long to display')
+                            );
+                        } else {
+                            target_key_pair( $vp, $fl . '|var|' . $loc->{name},
+                                $loc->{value} );
                          }
                      }
-                    if ( $carg->{stack_shows_params} ) {
-
-                        my @param_names;
-                        foreach my $par ( @{ $frame->{params} } ) {
-                            push @param_names, $par->{name};
-                            target_key_pair( $vp, "$l|var_type|$par->{name}",
-                                $par->{type} );
-                            if ( length $par->{value} > 70 ) {
-                                target_key_pair(
-                                    $vp,
-                                    $fl . '|var|' . $par->{name},
-                                    pretify_variable(
-                                        'value too long to display')
-                                );
-                            } else {
-                                target_key_pair( $vp,
-                                    $fl . '|var|' . $par->{name},
-                                    $par->{value} );
-                            }
-                        }
-                        if ( @param_names > 0 ) {
-                            target_key_pair( $vp, "$l|params",
-                                join( q{,}, @param_names ) );
+                    if ( @local_names > 0 ) {
+                        target_key_pair( $vp, "$l|locals",
+                            join( q{,}, sort @local_names ) );
+                    }
+                }
+                if ( $carg->{stack_shows_params} ) {
+
+                    my @param_names;
+                    foreach my $par ( @{ $frame->{params} } ) {
+                        push @param_names, $par->{name};
+                        target_key_pair( $vp, "$l|var_type|$par->{name}",
+                            $par->{type} );
+                        if ( length $par->{value} > 70 ) {
+                            target_key_pair(
+                                $vp,
+                                $fl . '|var|' . $par->{name},
+                                pretify_variable('value too long to display')
+                            );
+                        } else {
+                            target_key_pair( $vp, $fl . '|var|' . $par->{name},
+                                $par->{value} );
                          }
                      }
-                } else {
-                    if ( $carg->{stack_shows_params} ) {
-                        show_stack_vars( $vp, $frame, 'params' );
-                    }
-                    if ( $carg->{stack_shows_locals} ) {
-                        show_stack_vars( $vp, $frame, 'locals' );
+                    if ( @param_names > 0 ) {
+                        target_key_pair( $vp, "$l|params",
+                            join( q{,}, @param_names ) );
                      }
                  }
-
-                # Strip below this function if we need to.
-                if ( defined $above{$function} ) {
-                    last;
+            } else {
+                if ( $carg->{stack_shows_params} ) {
+                    show_stack_vars( $vp, $frame, 'params' );
+                }
+                if ( $carg->{stack_shows_locals} ) {
+                    show_stack_vars( $vp, $frame, 'locals' );
                  }
              }
+
+            # Strip below this function if we need to.
+            if ( defined $above{$function} ) {
+                last;
+            }
          }
      }
      return;
@@ -9229,7 +9152,8 @@
      };

      $allfns{mqueue} = {
-        handler_all => \&show_mpi_queue_all,
+        handler_one => \&show_mpi_queue_one,
+        needs_gdb   => 1,
          arg_long    => 'mpi-queue',
          arg_short   => 'Q',
          help        => 'Show MPI message queues',
@@ -9237,7 +9161,8 @@
      };

      $allfns{deadlock} = {
-        handler_all  => \&show_mpi_queue_for_deadlock_all,
+        handler_one  => \&show_mpi_queue_for_deadlock_one,
+        needs_gdb    => 1,
          arg_long     => 'deadlock',
          arg_short    => 'j',
          help         => 'Run deadlock detection algorithm',
@@ -9290,7 +9215,8 @@
      };

      $allfns{stack} = {
-        handler_all => \&stack_trace_from_pids,
+        handler_one => \&stack_trace_from_pid,
+        needs_gdb   => 1,
          arg_long    => 'stack-trace',
          arg_short   => 'x',
          help        => 'Show stack trace (see also -t)',


From padb at googlecode.com  Mon Dec  7 13:34:50 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 07 Dec 2009 13:34:50 +0000
Subject: [padb] r352 committed - Rename maybe_show_pid as
	register_target_process for clarity.
Message-ID: <001636c5bda39aac8c047a2385fd@google.com>

Revision: 352
Author: apittman
Date: Mon Dec  7 05:34:42 2009
Log: Rename maybe_show_pid as register_target_process for clarity.
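
A small sketch of the bookkeeping behind the renamed function, assuming a
plain %rmpids hash in place of padb's $inner_conf{rmpids}. The
find_pids_example caller is hypothetical and stands in for a resource
manager callback that already knows which pid belongs to which rank.

```perl
use strict;
use warnings;

my %rmpids;    # stand-in for $inner_conf{rmpids}

# Record that $pid is the process the resource manager started for $rank.
sub register_target_process {
    my ( $rank, $pid ) = @_;
    $rmpids{$pid}{rank} = $rank;
    return;
}

# Hypothetical caller: register every rank => pid pair it was given.
sub find_pids_example {
    my (%vps) = @_;
    foreach my $rank ( sort { $a <=> $b } keys %vps ) {
        register_target_process( $rank, $vps{$rank} );
    }
    return;
}
```

The table is keyed by pid rather than by rank, so several pids can be
registered for the same rank.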

http://code.google.com/p/padb/source/detail?r=352

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Mon Dec  7 05:19:40 2009
+++ /trunk/src/padb	Mon Dec  7 05:34:42 2009
@@ -8180,10 +8180,16 @@
      return;
  }

-sub maybe_show_pid {
-    my ( $vp, $pid ) = @_;
-
-    $inner_conf{rmpids}{$pid}{rank} = $vp;
+# To be called from the find_pids resource manager callback to say
+# that the specified pid is the specified rank.  This process should
+# be one spawned by the resource manager, if wrapper scripts are being
+# used, say "mpirun -n 2 sh -c myapp" then this function should be
+# called with the pid of 'sh', padb will then walk the process tree to
+# find the more interesting child process and target that one.
+sub register_target_process {
+    my ( $rank, $pid ) = @_;
+
+    $inner_conf{rmpids}{$pid}{rank} = $rank;
      return;
  }

@@ -8257,7 +8263,7 @@
          next unless ( $job eq $jobid );
          next unless ( $step == $inner_conf{slurm_job_step} );
          next if ( is_resmgr_process($pid) );
-        maybe_show_pid( $global, $pid );
+        register_target_process( $global, $pid );
          $found_target = 1;
      }
      return if $found_target;
@@ -8298,7 +8304,7 @@
              target_key_pair( $vp, "JOB_SIZE", $env{OMPI_COMM_WORLD_SIZE} );
          }

-        maybe_show_pid( $env{OMPI_COMM_WORLD_RANK}, $pid );
+        register_target_process( $env{OMPI_COMM_WORLD_RANK}, $pid );
      }

      return;
@@ -8340,7 +8346,7 @@
      }
      foreach my $vp ( keys %vps ) {
          my $pid = $vps{$vp};
-        maybe_show_pid( $vp, $pid );
+        register_target_process( $vp, $pid );
      }
      return;
  }
@@ -8390,11 +8396,11 @@
      foreach my $vp ( keys %vps ) {
          if ( defined $vps{$vp}{actual} ) {
              foreach my $pid ( @{ $vps{$vp}{actual} } ) {
-                maybe_show_pid( $vp, $pid );
+                register_target_process( $vp, $pid );
              }
          } else {
              foreach my $pid ( @{ $vps{$vp}{likely} } ) {
-                maybe_show_pid( $vp, $pid );
+                register_target_process( $vp, $pid );
              }
          }
      }
@@ -8644,7 +8650,7 @@
      if ( defined $cmd->{pd} ) {
          my $hostname = $inner_conf{hostname};
          foreach my $rank ( keys %{ $cmd->{pd}{$hostname} } ) {
-            maybe_show_pid( $rank, $cmd->{pd}{$hostname}{$rank} );
+            register_target_process( $rank, $cmd->{pd}{$hostname}{$rank} );
          }
      } else {


From thipadin.seng-long at bull.net  Mon Dec  7 15:57:37 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Mon, 7 Dec 2009 16:57:37 +0100
Subject: [padb] Réf. : Re: Réf. : Re: Réf. : Re: [padb] Patch of support of
 Slurm + Openmpi Orte manager
Message-ID: <OFF2852567.0A9C1C18-ONC1257685.0056E32D@frcl.bull.fr>

On 12/07/2009 12:32 PM Ashley Pittman <ashley at pittman.co.uk> wrote:

> This is committed as r346, slightly modified to call is_resmgr_process()
> rather than checking the name from /proc/$pid/status as discussed.

> Let me know if you have any further problems with this and once again
> thanks for the patch.

Yes, I see the change.
I have also tried this r346 version and it is all OK with both methods
of starting jobs.
Thipadin.
More later.


Ashley Pittman <ashley at pittman.co.uk>
12/07/2009 12:32 PM

 
        To:      thipadin.seng-long at bull.net
        cc:      florence.vallee at bull.net, francois.wellenreiter at bull.net, padb-devel at pittman.org.uk, Sylvain.JEAUGEY at bull.net
        Subject: Re: Réf. : Re: Réf. : Re: [padb] Patch of support of Slurm + Openmpi Orte manager

On Thu, 2009-12-03 at 13:20 +0100, thipadin.seng-long at bull.net wrote:
> Hi, good holidays, there ?
> I have applied the patch below.
> It works now: 

This is committed as r346, slightly modified to call is_resmgr_process()
rather than checking the name from /proc/$pid/status as discussed.

Let me know if you have any further problems with this and once again
thanks for the patch.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



From padb at googlecode.com  Mon Dec  7 19:12:11 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 07 Dec 2009 19:12:11 +0000
Subject: [padb] r353 committed - Clean up the list-rmgrs code a little,
	both in terms...
Message-ID: <0016367d6f360c96e6047a283c6c@google.com>

Revision: 353
Author: apittman
Date: Mon Dec  7 11:12:04 2009
Log: Clean up the list-rmgrs code a little, both in terms
of the code structure and also tidy up what it reports
to the user if a resource manager isn't detected.
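
The restructured loop can be sketched as below. The is_installed and
get_active_jobs callback names come from the diff; returning the report
lines instead of printing them is a change made here purely so the sketch
is easy to check, and the table passed in is supplied by the caller.

```perl
use strict;
use warnings;

# Build one report line per resource manager, skipping early for any
# manager whose is_installed callback says it is absent.
sub list_rmgrs {
    my ( $rmgr, $user ) = @_;
    my @lines;
    foreach my $res ( sort keys %{$rmgr} ) {
        if ( defined $rmgr->{$res}{is_installed}
            and not $rmgr->{$res}{is_installed}->() )
        {
            push @lines, "$res: Not detected on system.";
            next;
        }
        my @jobs = $rmgr->{$res}{get_active_jobs}->($user);
        if ( @jobs > 0 ) {
            push @lines, "$res: " . join( q{ }, sort { $a <=> $b } @jobs );
        } else {
            push @lines, "$res: No active jobs.";
        }
    }
    return @lines;
}
```

The early next removes the need for the intermediate $working flag that
the old code carried through the loop body.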

http://code.google.com/p/padb/source/detail?r=353

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Mon Dec  7 05:34:42 2009
+++ /trunk/src/padb	Mon Dec  7 11:12:04 2009
@@ -2956,6 +2956,7 @@
          if ( defined $link ) {
              if ( defined $mpirun{ basename($link) } ) {
                  push @jobs, $pid;
+                next;
              }
          }

@@ -5147,26 +5148,21 @@

      if ($list_rmgrs) {
          foreach my $res ( sort keys %rmgr ) {
-            my $working = 'yes';

              if ( defined $rmgr{$res}{is_installed}
                  and not $rmgr{$res}{is_installed}() )
              {
-                $working = 'no';
-            }
-            my $r = $res;
-
-            if ( $working eq 'yes' ) {
-                print "$r: ";
-                my @jobs = $rmgr{$res}{get_active_jobs}($user);
-                if ( @jobs > 0 ) {
-                    my $j = join q{ }, sort { $a <=> $b } @jobs;
-                    print "jobs($j)\n";
-                } else {
-                    print "No active jobs\n";
-                }
+                print "$res: Not detected on system.\n";
+                next;
+            }
+
+            print "$res: ";
+            my @jobs = $rmgr{$res}{get_active_jobs}($user);
+            if ( @jobs > 0 ) {
+                my $j = join q{ }, sort { $a <=> $b } @jobs;
+                print "$j\n";
              } else {
-                print "$r: not active\n";
+                print "No active jobs.\n";
              }
          }
          exit 0;


From padb at googlecode.com  Mon Dec  7 21:59:58 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 07 Dec 2009 21:59:58 +0000
Subject: [padb] r354 committed - Move the logic for finding dll finding from
	the C code to the perl...
Message-ID: <001485f6d9f40f3497047a2a9482@google.com>

Revision: 354
Author: apittman
Date: Mon Dec  7 13:59:10 2009
Log: Move the logic for finding the dll from the C code to the perl
code.  minfo now just calls fetch_dll_name() in a loop until it
returns null; all the complex string handling is done
in padb itself.
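
The loop described in this log can be sketched in isolation as follows.
This is a minimal sketch under stated assumptions, not the committed
code: next_candidate() and try_load() are hypothetical stand-ins for
fetch_dll_name() and load_msgq_dll(), the paths are invented, and the
candidate list is a fixed array rather than replies from padb.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical list standing in for the names padb would send back,
 * one per "dll_filename" request. */
static const char *candidates[] = {
    "/missing/libfoo.so", "/opt/mpi/libmsgq.so", NULL
};
static int next_idx = 0;

/* Stand-in for fetch_dll_name(): returns a heap copy of the next
 * candidate, or NULL once the list is exhausted. */
static char *next_candidate(void)
{
    const char *c = candidates[next_idx];
    char *copy;

    if (c == NULL)
        return NULL;
    next_idx++;
    copy = malloc(strlen(c) + 1);
    if (copy != NULL)
        strcpy(copy, c);
    return copy;
}

/* Stand-in for load_msgq_dll(): pretend only one path is loadable. */
static int try_load(const char *name)
{
    return strcmp(name, "/opt/mpi/libmsgq.so") == 0 ? 0 : -1;
}

/* Ask for candidates until one loads.  Returns the name that loaded
 * (the caller frees it) or NULL if every candidate failed. */
char *load_first_usable(void)
{
    char *name;

    while ((name = next_candidate()) != NULL) {
        if (try_load(name) == 0)
            return name;
        free(name);
    }
    return NULL;
}
```

Keeping the C side this small leaves the choice and ordering of
candidates entirely to the caller that supplies the names.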

http://code.google.com/p/padb/source/detail?r=354

Modified:
  /trunk/src/minfo.c
  /trunk/src/padb

=======================================
--- /trunk/src/minfo.c	Mon Dec  7 04:22:42 2009
+++ /trunk/src/minfo.c	Mon Dec  7 13:59:10 2009
@@ -370,6 +370,21 @@
      free(ans);
      return 0;
  }
+
+/* Ask padb for the next candidate dll filename.  Returns a heap-allocated
+ * copy which the caller must free, or NULL if the request fails. */
+void *fetch_dll_name ()
+{
+    char ans[1024];
+    int i;
+
+    i = ask("dll_filename",ans);
+    if ( i != 0 ) {
+
+	return NULL;
+    }
+    return strdup(ans);
+}

  int fetch_image (char *local)
  {
@@ -561,82 +576,28 @@
      return 0;
  }

-#define PATH_MAX 1024
-
-/* Try and load a valid dll from the locations array, loop over the array
- * trying each one in turn.  Return 0 if and when we managed to load one,
- * -1 otherwise
- */
-int find_and_load_dll_from_loc_array() {
-    void **remote_array;
-    char *dll_name;
-    void *locations = find_sym("sym","mpimsgq_dll_locations");
-
-    if ( locations == NULL )
-	return -1;
-
-    if ( find_data(NULL,(mqs_taddr_t)locations,sizeof(void *),&remote_array) != mqs_ok ) {
-	return -1;
-    }
-
-    if ( (dll_name = malloc(PATH_MAX)) == NULL )
-	return -1;
+void find_and_load_dll()
+{
+    char *dll_name = fetch_dll_name();
+
+    if ( ! dll_name ) {
+	die("No DLL to load");
+    }

      do {
-	void *remote_entry = NULL;
-
-    	if ( find_data(NULL,(mqs_taddr_t)remote_array,sizeof(void *),&remote_entry) != mqs_ok )
-	    goto error_out;
-
-	if ( remote_entry == NULL )
-	    goto error_out;
-
-	memset(dll_name,0,PATH_MAX);
-
-	if ( fetch_string(NULL,dll_name,(mqs_taddr_t)remote_entry,PATH_MAX) != mqs_ok ) {
-	    goto error_out;
-
-	} else {
-	    if ( load_msgq_dll(dll_name) == 0 ) {
-		free(dll_name);
-		return mqs_ok;
-	    }
-	}
-	remote_array++;
-    } while ( 1 );
-
-error_out:
-    free(dll_name);
-    return -1;
-}
-
-void find_and_load_dll()
-{
-    char *dll_name;
-
-    dll_name = getenv("MPINFO_DLL");
-    if ( dll_name != NULL ) {
-	if ( load_msgq_dll(dll_name) != 0 ) {
-	    die("Could not load symbols from dll");
-	}
-	return;
-    }
-
-    /* Try the new (proposed) dll specification mechanism */
-    if ( find_and_load_dll_from_loc_array() == mqs_ok )
-	return;
-
-    void *base = find_sym("sym","MPIR_dll_name");
-    if ( base == NULL ) {
-	die("Could not find MPIR_dll_name symbol");
-    }
-    dll_name = malloc(PATH_MAX);
-    if ( fetch_string(NULL,dll_name,(mqs_taddr_t)base,PATH_MAX) != 0 ) {
-	die("Could not read value of MPIR_dll_name");
-    }
-    if ( load_msgq_dll(dll_name) != 0 ) {
-	die("Could not load symbols from dll");
-    }
+
+	if ( load_msgq_dll(dll_name) == mqs_ok )
+	{
+	    free(dll_name);
+	    return;
+	}
+
+	free(dll_name);
+	dll_name = fetch_dll_name();
+
+    } while ( dll_name != NULL );
+
+    die("Could not find a loadable dll");
  }

  int
=======================================
--- /trunk/src/padb	Mon Dec  7 11:12:04 2009
+++ /trunk/src/padb	Mon Dec  7 13:59:10 2009
@@ -1087,7 +1087,7 @@
          return -1;
      } else {

-        if ( length $str < 10 ) {
+        if ( length $str < 9 ) {
              return hex $str;
          }

@@ -6129,16 +6129,62 @@
  }

  sub run_minfo {
-    my ( $gdb, $vp ) = @_;
+    my ( $carg, $gdb, $vp ) = @_;

      my $h = {
          hpid     => -1,
          tracepid => -1,
          attached => 0,
-        debug    => 0,
+        debug    => 1,
      };

      $h->{fd}{err} = *M_ERROR;
+
+    my @all_dll_filenames;
+
+    if ( defined $carg->{mpi_dll} ) {
+        push @all_dll_filenames, $carg->{mpi_dll};
+    } else {
+        my $loc = gdb_var_addr( $gdb, 'mpimsgq_dll_locations' );
+
+        if ($loc) {
+            my $psize = gdb_type_size( $gdb, 'void *' );
+            my $base = $loc;
+            my $filename;
+
+            $base = gdb_read_pointer( $gdb, $base );
+
+            do {
+                my $strp = gdb_read_pointer( $gdb, $base );
+                $filename = gdb_string( $gdb, 1024, $strp );
+                if ( defined $filename ) {
+                    push @all_dll_filenames, $filename;
+                }
+                $base = _hex($base) + $psize;
+            } while ( defined $filename );
+        }
+
+        my $base = gdb_var_addr( $gdb, 'MPIR_dll_name' );
+        if ( not defined $base ) {
+            target_error( $vp,
+'Process does not appear to be using MPI (No MPIR_dll_name symbol)'
+            );
+            return;
+        }
+        my $filename = gdb_string( $gdb, 1024, $base );
+        push @all_dll_filenames, $filename;
+    }
+
+    my @dll_filenames;
+
+    my %files;
+    foreach my $filename (@all_dll_filenames) {
+        next unless -f ($filename);
+        next if defined $files{$filename};
+
+        push @dll_filenames, $filename;
+        $files{$filename} = 1;
+    }

      my $cmd = $inner_conf{minfo};
      $h->{hpid} = open3( $h->{fd}{wtr}, $h->{fd}{rdr}, *M_ERROR, $cmd )
@@ -6206,7 +6252,21 @@

          chomp $r;
          my $cmd = substr $r, 0, 4;
-        if ( $cmd eq 'req:' ) {
+        if ( $r eq 'req: dll_filename' ) {
+            $stats{dll_files}++;
+            my $filename = shift @dll_filenames;
+            my $res      = 'fail';
+            if ( defined $filename ) {
+                $res = "ok $filename";
+            }
+
+            print {$out} "$res\n";
+
+            if ( defined $h->{debugfd} ) {
+                print { $h->{debugfd} } "$res\n";
+            }
+
+        } elsif ( $cmd eq 'req:' ) {
              my $res = minfo_handle_query( $gdb, $vp, $r, \%stats );

              # Some things *do* fail here, symbol lookups for example,
@@ -6380,24 +6440,8 @@
          return;
      }

-    my $base = gdb_var_addr( $g, 'MPIR_dll_name' );
-    if ( not defined $base ) {
-        target_error( $vp,
-            'Process does not appear to be using MPI (No MPIR_dll_name symbol)'
-        );
-    }
-
-    if ( defined $carg->{mpi_dll} ) {
-        $ENV{MPINFO_DLL} = $carg->{mpi_dll};
-    } else {
-        if ( not defined $base ) {
-            gdb_detach($g);
-            gdb_quit($g);
-            return;
-        }
-    }
-
-    my @mq = run_minfo( $g, $vp );
+    my @mq = fetch_mpi_queue_gdb( $carg, $vp, $pid, $g );
+
      gdb_detach($g);
      gdb_quit($g);
      return @mq;
@@ -6406,23 +6450,7 @@
  # As above but take a gdb handle
  sub fetch_mpi_queue_gdb {
      my ( $carg, $vp, $pid, $g ) = @_;
-
-    my $base = gdb_var_addr( $g, 'MPIR_dll_name' );
-    if ( not defined $base ) {
-        target_error( $vp,
-            'Process does not appear to be using MPI (No MPIR_dll_name symbol)'
-        );
-    }
-
-    if ( defined $carg->{mpi_dll} ) {
-        $ENV{MPINFO_DLL} = $carg->{mpi_dll};
-    } else {
-        if ( not defined $base ) {
-            return;
-        }
-    }
-
-    my @mq = run_minfo( $g, $vp );
+    my @mq = run_minfo( $carg, $g, $vp );
      return @mq;
  }

@@ -6430,7 +6458,7 @@
      my ( $carg, $vp, $pid ) = @_;

      my @mq = fetch_mpi_queue( $carg, $vp, $pid );
-    return unless $mq[0];
+
      foreach my $o (@mq) {
          output( $vp, $o );
      }
@@ -6451,7 +6479,6 @@
      foreach my $o (@mq) {
          output( $vp, $o );
      }
-
      return;
  }


From thipadin.seng-long at bull.net  Wed Dec  9 10:29:59 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Wed, 9 Dec 2009 11:29:59 +0100
Subject: [padb] tiny bug with--proc-summary
Message-ID: <OFE3DBE34D.51ADB208-ONC1257687.00301855@frcl.bull.fr>

Hi,
With the --proc-summary option, padb displays a pid that is actually a
thread ID (LWP) rather than the process ID when the process has several
threads, as shown:

[thipa at machu139 padb_open]$ ./padb -O rmgr=slurm --proc-summary 11091
rank  hostname  pid    vmsize     vmrss     S  uptime  %cpu  lcore command
   0  machu139  31719  151780 kB  26380 kB  R    0.97    99      2  concurrent_spaw
[thipa at machu139 padb_open]$ 

Here is the output of ps -eLf:

[thipa at machu139 31718]$ ps -eLf 
UID        PID   PPID   LWP  C NLWP STIME TTY          TIME CMD
.
.
thipa    31718 31717 31718 95    3 16:32 ?        00:23:28 ./concurrent_spawns
thipa    31718 31717 31719  0    3 16:32 ?        00:00:00 ./concurrent_spawns
thipa    31718 31717 31720  0    3 16:32 ?        00:00:00 ./concurrent_spawns
root     31724  2285 31724  0    1 16:33 ?        00:00:00 sshd: thipa [priv]

What do you think?

Thipadin.

From padb at googlecode.com  Wed Dec  9 11:08:01 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Wed, 09 Dec 2009 11:08:01 +0000
Subject: [padb] r355 committed - Modify the mpirun resource manager code to
	work in cases where debug...
Message-ID: <0016e6407ac23d417c047a49b42b@google.com>

Revision: 355
Author: apittman
Date: Wed Dec  9 03:07:11 2009
Log: Modify the mpirun resource manager code to work in cases where debug
information isn't available in the target binary, use hardcoded
values for struct offsets as defined by the standard rather than
rely on gdb being able to access the information and calculate
the offsets for us.
This fixes problems seen on at least two OMPI installations when
run in mpirun mode.
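
The offset arithmetic this log refers to can be checked against a real
compiler.  The sketch below assumes the proctable entry layout given by
the MPIR process acquisition interface (two pointers followed by an
int) and a 4-byte int; procdesc_size() and procdesc_pid_offset() are
illustrative names, not functions from padb.

```c
#include <stddef.h>

/* One proctable entry: two pointers followed by an int. */
typedef struct {
    char *host_name;
    char *executable_name;
    int   pid;
} MPIR_PROCDESC;

/* Size of one entry for a target with the given pointer size, assuming
 * a 4-byte int and padding up to pointer alignment. */
size_t procdesc_size(size_t word_size)
{
    size_t raw = (word_size * 2) + 4;
    size_t rem = raw % word_size;

    return rem == 0 ? raw : raw + (word_size - rem);
}

/* Offset of the pid member: it follows the two pointers. */
size_t procdesc_pid_offset(size_t word_size)
{
    return word_size * 2;
}
```

With 4-byte pointers this gives 12 bytes per entry and with 8-byte
pointers 24, matching the "20 bytes padded to 8 byte alignment" case
handled in the diff below.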

http://code.google.com/p/padb/source/detail?r=355

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Mon Dec  7 13:59:10 2009
+++ /trunk/src/padb	Wed Dec  9 03:07:11 2009
@@ -2994,15 +2994,58 @@
      }

      my %pt;
-    foreach my $proc ( 0 .. ( $nprocs - 1 ) ) {
-        my $hostp = gdb_read_value_addr( $gdb,
-            "(void *)MPIR_proctable[$proc].host_name" );
-        my $host = gdb_string( $gdb, 1024, $hostp );
-        my $pid = gdb_read_value( $gdb, "MPIR_proctable[$proc].pid" );
-        if ( defined $host and defined $pid ) {
-            $pt{$host}{$proc} = $pid;
-        } else {
-            print "Failed to extract process info for rank $proc\n";
+
+    # Whilst it's possible to dip inside the struct in the process to
+    # extract this information some builds don't associate a type with
+    # MPIR_proctable which means in those cases this method won't work.
+    # Instead use a set of hardcoded values for offset and size as defined
+    # by the interface and do the maths for finding each element ourselves.
+
+    # I've left the old code here for now as I suspect this is going to be
+    # something that causes trouble in the future.
+
+    if (1) {
+        my $word_size = gdb_type_size( $gdb, 'void *' );
+        my $table_size = ( $word_size * 2 ) + 4;
+
+        # On 64 bit systems the struct is 20 bytes in size but needs to be
+        # 8 byte aligned.
+        if ( $word_size == 8 ) {
+            $table_size += 4;
+        }
+
+        my $host_offset    = 0;
+        my $pid_offset     = $word_size * 2;
+        my $proctable_addr = gdb_var_addr( $gdb, 'MPIR_proctable' );
+        my $proctable      = gdb_read_pointer( $gdb, $proctable_addr );
+        my $base           = _hex($proctable);
+
+        foreach my $proc ( 0 .. ( $nprocs - 1 ) ) {
+
+            my $struct_base = $base + ( $table_size * $proc );
+            my $hostp = gdb_read_pointer( $gdb, $struct_base + $host_offset );
+            my $host = gdb_string( $gdb, 1024, $hostp );
+
+            my $pid = gdb_read_int( $gdb, $struct_base + $pid_offset );
+            if ( defined $host and defined $pid ) {
+                $pt{$host}{$proc} = $pid;
+            } else {
+                print "Failed to extract process info for rank $proc\n";
+            }
+        }
+    } else {
+
+        foreach my $proc ( 0 .. ( $nprocs - 1 ) ) {
+
+            my $hostp = gdb_read_value_addr( $gdb,
+                "(void *)MPIR_proctable[$proc].host_name" );
+            my $host = gdb_string( $gdb, 1024, $hostp );
+            my $pid = gdb_read_value( $gdb, "MPIR_proctable[$proc].pid" );
+            if ( defined $host and defined $pid ) {
+                $pt{$host}{$proc} = $pid;
+            } else {
+                print "Failed to extract process info for rank $proc\n";
+            }
          }
      }

@@ -6055,6 +6098,19 @@
      }
      return;
  }
+
+sub gdb_read_int {
+    my ( $gdb, $addr ) = @_;
+
+    # Quote the request in case it contains spaces.
+    my %t =
+      gdb_send_addr( $gdb, "-data-evaluate-expression \"*(int *)$addr\"" );
+    if ( $t{status} eq 'done' ) {
+        my $v = gdb_parse_reason( $t{reason} );
+        return $v->{value};
+    }
+    return;
+}

  sub gdb_read_value {
      my ( $gdb, $name ) = @_;
@@ -6135,7 +6191,7 @@
          hpid     => -1,
          tracepid => -1,
          attached => 0,
-        debug    => 1,
+        debug    => 0,
      };

      $h->{fd}{err} = *M_ERROR;


From ashley at pittman.co.uk  Wed Dec  9 11:41:32 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 09 Dec 2009 11:41:32 +0000
Subject: [padb] tiny bug with--proc-summary
In-Reply-To: <OFE3DBE34D.51ADB208-ONC1257687.00301855@frcl.bull.fr>
References: <OFE3DBE34D.51ADB208-ONC1257687.00301855@frcl.bull.fr>
Message-ID: <1260358892.21674.64.camel@alpha>

On Wed, 2009-12-09 at 11:29 +0100, thipadin.seng-long at bull.net wrote:
> 
> Hi, 
> With --proc-summary option, padb displays pid which is indeed a thread
> PID (LWP) 
> for a process that have some threads as shown: 

> What's do you think. 

I can confirm there's a bug here; I can see it locally when I target a
multi-threaded application on my laptop.

What is happening is that the show_proc function reports data for every
task (thread) in the program.  This is probably the right thing for
--proc-info, but for --proc-summary it's incorrect: many entries, pid
among them, are recorded more than once for the same process.  This
duplicate data is then passed back through the network to the outer
process.

At this point the tree_from_namespace function re-assembles the data on
the assumption that each key has only one value from a given rank.  When
that isn't true, as here, it picks one at random and reports it, which
is what you see.

Attached is a basic patch which fixes the issue by ensuring that only
data from the first thread is forwarded back; this makes padb
deterministic and causes it to show the pid you'd expect.

The wider issue here is how to handle multi-threaded programs.  For
example, I don't know how to calculate memory usage across threads.  I'd
assume they all have the same memory maps, with the possible exception
of TLS, which means the value is probably both common to all threads and
correct for the process as a whole.  The percent cpu usage calculation,
however, is almost certainly wrong: it would need to be calculated
for each thread and summed across threads to get the true value.
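
As a rough illustration of the per-thread summing described above, the
sketch below adds up utime and stime (fields 14 and 15 of a Linux
/proc/<pid>/task/<tid>/stat line) across threads.  It is not taken from
padb; the function names are hypothetical and the caller is assumed to
have already read one stat line per thread.

```c
#include <stdio.h>
#include <string.h>

/* Return utime + stime (in clock ticks) from one stat line, or -1 if
 * the line cannot be parsed.  Parsing starts after the last ')' because
 * the command name in field 2 may itself contain spaces. */
long thread_cpu_ticks(const char *stat_line)
{
    const char *p = strrchr(stat_line, ')');
    unsigned long utime, stime;

    if (p == NULL)
        return -1;
    p++;

    /* Skip fields 3 to 13: state ppid pgrp session tty_nr tpgid flags
     * minflt cminflt majflt cmajflt. */
    if (sscanf(p, " %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %lu %lu",
               &utime, &stime) != 2)
        return -1;

    return (long)(utime + stime);
}

/* Sum the ticks of every thread, or return -1 if any line is bad. */
long process_cpu_ticks(const char *const *stat_lines, int n)
{
    long total = 0;
    int i;

    for (i = 0; i < n; i++) {
        long t = thread_cpu_ticks(stat_lines[i]);
        if (t < 0)
            return -1;
        total += t;
    }
    return total;
}
```

Turning the tick total into a percentage would still need the elapsed
time and the clock tick rate, which this sketch leaves out.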

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: padb-proc-format-threads.patch
Type: text/x-patch
Size: 864 bytes
Desc: not available
URL: <http://pittman.org.uk/pipermail/padb-devel_pittman.org.uk/attachments/20091209/0861a849/attachment.bin>

From thipadin.seng-long at bull.net  Wed Dec  9 14:28:07 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Wed, 9 Dec 2009 15:28:07 +0100
Subject: [padb] =?iso-8859-1?q?R=E9f=2E_=3A_Re=3A_tiny_bug_with--proc-summ?=
	=?iso-8859-1?q?ary?=
Message-ID: <OFADF53227.9DF827BE-ONC1257687.004ED9E2@frcl.bull.fr>

Hi,
I have tested the patch and it is OK.
But the main thread could be the second one in the LWP list
(as in my previous example).
So it's hard to say. Consider that it works.
Thipadin.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: padb-proc-format-threads.patch
Type: application/octet-stream
Size: 892 bytes
Desc: not available
URL: <http://pittman.org.uk/pipermail/padb-devel_pittman.org.uk/attachments/20091209/c98ab63f/attachment.obj>

From ashley at pittman.co.uk  Wed Dec  9 15:00:33 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 09 Dec 2009 15:00:33 +0000
Subject: [padb]
	=?iso-8859-1?q?R=E9f=2E_=3A_Re=3A_tiny_bug_with--proc-summ?=
	=?iso-8859-1?q?ary?=
In-Reply-To: <OFADF53227.9DF827BE-ONC1257687.004ED9E2@frcl.bull.fr>
References: <OFADF53227.9DF827BE-ONC1257687.004ED9E2@frcl.bull.fr>
Message-ID: <1260370833.21674.88.camel@alpha>

On Wed, 2009-12-09 at 15:28 +0100, thipadin.seng-long at bull.net wrote:
> 
> Hi, 
> I have tested the patch and is OK. 

Ok, I'll commit it as is.  It's definitely a step forward as the current
code is non-deterministic.

> But I could have the main thread to be the second one from the LWP 
> (as my in previous example). 

You mean the thread with the LWP of 31719?  LWP 31718 is the one which
has consumed the most CPU cycles.  If you can suggest a good way for
padb to calculate which is the main thread then I'd be keen to hear it.
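
One possible answer, offered as a sketch rather than a tested padb
change: on Linux the initial thread of a process (the thread group
leader) has a thread id equal to the process id, so it is the entry in
/proc/<pid>/task whose name matches the pid.  That agrees with the
ps -eLf output quoted earlier in the thread, where PID 31718 has LWPs
31718, 31719 and 31720.  find_leader_entry() is a hypothetical helper
and takes the directory entries as an array.

```c
#include <stdlib.h>
#include <string.h>

/* Given the process id and the entry names found under
 * /proc/<pid>/task, return the index of the thread group leader (the
 * entry whose numeric value equals pid), or -1 if it is not listed. */
int find_leader_entry(long pid, const char *const *entries, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        char *end;
        long tid;

        if (strcmp(entries[i], ".") == 0 || strcmp(entries[i], "..") == 0)
            continue;

        tid = strtol(entries[i], &end, 10);
        if (end != entries[i] && *end == '\0' && tid == pid)
            return i;
    }
    return -1;
}
```

Selecting by value rather than by position does not depend on the order
in which the directory entries happen to be returned.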

One other option might be to loop over the threads as before and report
all their values with the keys suffixed with _%d.  In the example you
gave this would give you pid=31718 pid_1=31719 pid_2=31720.  If you knew
that you wanted to see the data for the second pid you could then use
--proc-format="rank,hostname,pid_1,vmsize_1,vmrss_1,..."

As you say it's hard and I'm open to ideas on how you'd like it to work.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From padb at googlecode.com  Wed Dec  9 17:51:33 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Wed, 09 Dec 2009 17:51:33 +0000
Subject: [padb] r356 committed - Refresh the lstopo mode and add a generic
	"command" mode....
Message-ID: <001636310361553379047a4f575b@google.com>

Revision: 356
Author: apittman
Date: Wed Dec  9 09:51:13 2009
Log: Refresh the lstopo mode and add a generic "command" mode.
Change the lstopo command to add a dash '-' which tells it to give
text based output rather than graphical, also extend this mode to
allow the precise lstopo command to be specified as an option.  The
command given is executed on the target node and %p is replaced with
the pid of the target process.  This should future proof padb as when
a --pid option gets added to lstopo padb will be able to use it by
just editing a configuration file rather than editing the code.

Also add a generic "command" mode which will run any command, again
substituting %p with the pid of the target process.  This is in effect
what lstopo is now, so add a mode to do this specifically; otherwise I'm
sure people would try and piggy-back this behaviour onto the lstopo
mode.  The default command is 'readlink /proc/%p/exe'.
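
For readers more at home in C than perl, the %p substitution described
in this log amounts to the following.  This is an illustrative sketch
only: padb itself does this with a perl regular expression, and
expand_pid() is a hypothetical name.

```c
#include <stdio.h>
#include <string.h>

/* Copy tmpl into out, replacing every "%p" with the decimal pid.
 * Returns 0 on success, -1 if out (of size outlen) is too small. */
int expand_pid(const char *tmpl, long pid, char *out, size_t outlen)
{
    char pidstr[32];
    size_t pidlen;
    size_t used = 0;

    if (outlen == 0)
        return -1;

    snprintf(pidstr, sizeof pidstr, "%ld", pid);
    pidlen = strlen(pidstr);

    while (*tmpl != '\0') {
        if (tmpl[0] == '%' && tmpl[1] == 'p') {
            /* Keep one byte free for the terminating NUL. */
            if (used + pidlen >= outlen)
                return -1;
            memcpy(out + used, pidstr, pidlen);
            used += pidlen;
            tmpl += 2;
        } else {
            if (used + 1 >= outlen)
                return -1;
            out[used++] = *tmpl++;
        }
    }
    out[used] = '\0';
    return 0;
}
```

Called with the default template 'readlink /proc/%p/exe' and pid 1234,
this fills the buffer with 'readlink /proc/1234/exe'.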

http://code.google.com/p/padb/source/detail?r=356

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Wed Dec  9 03:07:11 2009
+++ /trunk/src/padb	Wed Dec  9 09:51:13 2009
@@ -7957,17 +7957,23 @@
      return;
  }

-# Experimental, currently reports on what's on the node rather than what
-# the specific process is attached to, hopefully this functionality will be
-# added in the future.
+# Experimental, currently reports on what's on the node rather than
+# what the specific process is attached to, hopefully this
+# functionality will be added in the future.

  # https://svn.open-mpi.org/trac/hwloc/ticket/21
  sub lstopo {
      my ( $cargs, $vp, $pid ) = @_;

-    target_error( $vp, "Reporting per node rather than per process" );
-
-    my @output = slurp_cmd("lstopo --whole-system");
+    if ( $cargs->{lstopo_show_warning} ) {
+        target_error( $vp, "Reporting per node rather than per process" );
+    }
+
+    my $cmd = $cargs->{lstopo_command};
+
+    $cmd =~ s{%p}{$pid}g;
+
+    my @output = slurp_cmd($cmd);

      # Check the return code, if it's not found then there won't be any
      # output, if it was found but returned an error then do report the
@@ -7988,6 +7994,21 @@
      }
      return;
  }
+
+sub run_cmd_against_target {
+    my ( $cargs, $vp, $pid ) = @_;
+
+    my $cmd = $cargs->{command};
+
+    $cmd =~ s{%p}{$pid}g;
+
+    my @output = slurp_cmd($cmd);
+    chomp @output;
+    foreach my $line (@output) {
+        output( $vp, $line );
+    }
+    return;
+}

  sub ping_rank {
      my ( $cargs, $vp, $pid ) = @_;
@@ -9351,9 +9372,18 @@
      };

      $allfns{lstopo} = {
-        handler  => \&lstopo,
-        arg_long => 'lstopo',
-        help     => 'Show CPU topology',
+        handler      => \&lstopo,
+        arg_long     => 'lstopo',
+        help         => 'Show CPU topology using lstopo',
+        options_i    => { lstopo_command => 'lstopo --whole-system -', },
+        options_bool => { lstopo_show_warning => 'yes', },
+    };
+
+    $allfns{command} = {
+        handler   => \&run_cmd_against_target,
+        arg_long  => 'command',
+        help      => 'Run command on target node',
+        options_i => { command => 'readlink /proc/%p/exe', }
      };

      $allfns{ping} = {


From padb at googlecode.com  Wed Dec  9 18:44:34 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Wed, 09 Dec 2009 18:44:34 +0000
Subject: [padb]  r357 committed - Fix to the proc-format mode,
	ensure that the 	information we report...
Message-ID: <0016e68ea1b4f17d1d047a501453@google.com>

Revision: 357
Author: apittman
Date: Wed Dec  9 10:43:32 2009
Log: Fix to the proc-format mode, ensure that the information we report
comes from the lead pid in the process group.

http://code.google.com/p/padb/source/detail?r=357

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Wed Dec  9 09:51:13 2009
+++ /trunk/src/padb	Wed Dec  9 10:43:32 2009
@@ -7642,16 +7642,19 @@

      if ( -d "/proc/$pid/task" and $carg->{proc_shows_proc} ) {

-        my $threads = 0;
-
          # 2.6 kernel. (ntpl)
          my @tasks = slurp_dir("/proc/$pid/task");
          foreach my $task (@tasks) {
              next if ( $task eq '.' );
              next if ( $task eq '..' );
              show_task_dir( $carg, $vp, $pid, "/proc/$pid/task/$task" );
-            $threads++;
-        }
+            if ( defined $carg->{proc_format} ) {
+                last;
+            }
+        }
+
+        # We have to deduct 2 here to account for . and ..
+        my $threads = @tasks - 2;
          proc_output( $vp, 'threads', $threads );
      } else {
          show_task_dir( $carg, $vp, $pid, "/proc/$pid" );


From padb at googlecode.com  Wed Dec  9 21:15:07 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Wed, 09 Dec 2009 21:15:07 +0000
Subject: [padb]  r358 committed - Fix to the MPI dll discovery code,
	if the new 	mpimsq_dll_locations...
Message-ID: <001485f9126a5abf28047a522f0a@google.com>

Revision: 358
Author: apittman
Date: Wed Dec  9 13:14:36 2009
Log: Fix to the MPI dll discovery code, if the new mpimsgq_dll_locations
variable is present but its value is NULL then padb was giving a
warning.  Handle this case directly by not trying to follow the
pointer if its value is 0.
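
The same guard, written against local memory instead of gdb reads of
the target, looks roughly like this.  It is only an analogy for the
check added in this revision; count_dll_locations() is a hypothetical
helper and the array is assumed to be NULL-terminated.

```c
#include <stddef.h>

/* Count the entries in a NULL-terminated array of dll paths.  A NULL
 * array pointer (variable present but unset) is treated as "no
 * entries" instead of being dereferenced. */
size_t count_dll_locations(const char *const *locations)
{
    size_t n = 0;

    if (locations == NULL)
        return 0;

    while (locations[n] != NULL)
        n++;
    return n;
}
```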

http://code.google.com/p/padb/source/detail?r=358

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Wed Dec  9 10:43:32 2009
+++ /trunk/src/padb	Wed Dec  9 13:14:36 2009
@@ -6202,14 +6202,16 @@
          push @all_dll_filenames, $carg->{mpi_dll};
      } else {
          my $loc = gdb_var_addr( $gdb, 'mpimsgq_dll_locations' );
-
+        my $base;
          if ($loc) {
+            $base = gdb_read_pointer( $gdb, $loc );
+        }
+
+        if ( defined $base and $base ne '0x0' ) {
              my $psize = gdb_type_size( $gdb, 'void *' );
-            my $base = $loc;
+
              my $filename;

-            $base = gdb_read_pointer( $gdb, $base );
-
              do {
                  my $strp = gdb_read_pointer( $gdb, $base );
                  $filename = gdb_string( $gdb, 1024, $strp );
@@ -6220,7 +6222,7 @@
              } while ( defined $filename );
          }

-        my $base = gdb_var_addr( $gdb, 'MPIR_dll_name' );
+        $base = gdb_var_addr( $gdb, 'MPIR_dll_name' );
          if ( not defined $base ) {
              target_error( $vp,
  'Process does not appear to be using MPI (No MPIR_dll_name symbol)'


From padb at googlecode.com  Thu Dec 10 13:02:07 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Thu, 10 Dec 2009 13:02:07 +0000
Subject: [padb] r359 committed - Update the release nodes to be current....
Message-ID: <0016e6470c5822e195047a5f6ae9@google.com>

Revision: 359
Author: apittman
Date: Thu Dec 10 05:01:38 2009
Log: Update the release notes to be current.
Also add a --debug-file option to, surprisingly, send debug output to
a file.
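
The fallback behaviour of the new option (use the named file if it can
be opened, otherwise keep writing to stdout) can be sketched in C as
below.  This mirrors the idea of set_debug_file() in the diff but is
not a translation of it; open_debug_stream() is a hypothetical name,
and the error message goes to stderr here via perror().

```c
#include <stdio.h>

/* Return a stream for debug output: the named file if it can be opened
 * for writing, otherwise stdout. */
FILE *open_debug_stream(const char *filename)
{
    FILE *fp;

    if (filename == NULL)
        return stdout;

    fp = fopen(filename, "w");
    if (fp == NULL) {
        perror("Unable to open log file for writing");
        return stdout;
    }
    return fp;
}
```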

http://code.google.com/p/padb/source/detail?r=359

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Wed Dec  9 13:14:36 2009
+++ /trunk/src/padb	Thu Dec 10 05:01:38 2009
@@ -30,9 +30,15 @@

  # Version 3.?
  #  * Support of PBS Pro
-#  * Add variables to tree based stack traces.
+#  * Support for OpenMPI jobs run by mpirun under a slurm allocation.
+#  * Modify the Slurm resource manager code to automatically select a
+#    step_id based on what's running on the system currently.
+#  * Modify the mpd resource manager code to only call mpdlistjobs on
+#    the front-end, it provides all the information we need so record
+#    this and send it over the network to the inner processes.
  #  * Solaris port.  Limited functionality compared to running on Linux
  #    however stack trace mode works fully.
+#  * Add variables to tree based stack traces.
  #  * Add "mpirun" as a resource manager, this causes it walk the local
  #    process list looking for processes called mpirun and to get the
  #    pid and hostlist by reading data from Mpir_Proctable as specified
@@ -43,6 +49,7 @@
  #    reduction operations by name.
  #  * Add a --lstopo option to run the lstopo command for each rank.
  #    http://www.open-mpi.org/projects/hwloc/
+#  * Add a 'command' mode to run arbitrary commands on the target node.
  #  * Enhance the integration with gdb, use sequence numbers when
  #    talking to gdb and check that we get back what we give it.
  #    Correctly notice and raise an appropriate error if gdb dies
@@ -64,6 +71,26 @@
  #    automatically
  #  * Add SVN tags to the source file and the the revision id to the
  #    output of output of --version
+#  * Make proc-format report data for the first thread in a process
+#    rather than a random one.
+#  * Add support for the proposed new standard for finding the Message
+#    queue plugin in MPI programs.
+#  * Have padb handle attaching to programs rather than having the mode
+#    callback handle it.  This means that persistent attachments can
+#    be used in full-report mode.
+#  * Speed up attaching gdb to the target job greatly by attaching to
+#    all target processes on a node simultaneously rather than one at
+#    a time.
+#  * Better handling of jobs that disappear whilst we are monitoring them,
+#    there should be no perl errors shown if this happens.
+#  * Detect where padb is being run from and specify the full path to the
+#    inner processes.  This helps with resource managers which don't
+#    preserve $PWD and padb isn't on $PATH
+#  * Add proper two-pass argument handling, secondary options are only
+#    accepted if the mode they are relevant to is selected.
+#  * Widespread code cleanups to conform with stricter coding standards.
+#  * Enable type checking of command line options, all boolean flags can be
+#    set yes|no|1|0|true|false now.
  #
  # Version 3.0
  #  * Full-duplex communication between inner and outer processes, padb no
@@ -677,7 +704,10 @@
  -t --tree              Use tree based output for stack traces.
  -i --input-file=FILE   Read input from file.

-   --watch
+   --watch
+
+   --debug=<mode>,<mode1>  Enable debug for mode, use mode=all for all debugging.
+   --debug-file=file   Log debug information to file.

  -O [opt1=val,<opt2=val>] Set internal config options for padb, advanced use only.
    Options in this version (these are liable to change)
@@ -798,6 +828,22 @@
  # the ref as well.  Enable with --debug=type1,type2=all
  my %debug_modes;
  my $start_time = time;
+my $debug_fd;
+
+sub set_debug_file {
+    my ($filename) = @_;
+
+    if ( defined $filename ) {
+        if ( not open $debug_fd, '>', $filename ) {
+            print "Unable to open log file for writing: $!\n";
+            $debug_fd = *STDOUT;
+        }
+    } else {
+        $debug_fd = *STDOUT;
+    }
+
+    return;
+}

  sub debug_log {
      my ( $type, $handle, $str, @params ) = @_;
@@ -807,10 +853,10 @@
      }
      return unless $debug_modes{$type};
      my $time = time - $start_time;
-    printf "DEBUG ($type): %3d: $str\n", $time, @params;
+    printf {$debug_fd} "DEBUG ($type): %3d: $str\n", $time, @params;
      return if $debug_modes{$type} eq 'basic';
      return unless defined $handle;
-    print Data::Dumper->Dump( [$handle], [$type] );
+    print {$debug_fd} Data::Dumper->Dump( [$handle], [$type] );
      return;
  }

@@ -930,6 +976,7 @@

      Getopt::Long::Configure( 'bundling', 'pass_through' );
      my $debugflag;
+    my $debugfile;

      my @ranks;

@@ -957,6 +1004,7 @@
          'norc'                => \$norc,
          'config-file=s'       => \$configfile,
          'debug=s'             => \$debugflag,
+        'debug-file=s'        => \$debugfile,
          'create-secret-file'  => \$create_secret,
      );

@@ -973,6 +1021,8 @@
      # options which might be bundled with it.
      GetOptions(%optionhash);

+    set_debug_file($debugfile);
+
      Getopt::Long::Configure( 'default', 'bundling' );

      my $mode;
@@ -4491,7 +4541,9 @@

  # rng_user_verify()
  # is_value_in_range()
-# nvalues_in_range()
+# nvalues_in_range()  - Return the number of values in a range.
+# rng_min()           - Return the minimum value in a range.
+# rng_common()        - Take two ranges and return the common values.
  # rng_find_missing()
  #   Take two ranges and return all that are in the first but not in the
  #   second. (see check_signon).


From padb at googlecode.com  Thu Dec 10 15:22:22 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Thu, 10 Dec 2009 15:22:22 +0000
Subject: [padb] r360 committed - Clean up the way the message queue code is
	called on the inner process...
Message-ID: <001636310361b22b4a047a615f68@google.com>

Revision: 360
Author: apittman
Date: Thu Dec 10 07:21:32 2009
Log: Clean up the way the message queue code is called on the inner
processes: remove one function completely, rename another and add comments
about what is called from where.  Handle deadlock detection in the same
way that message queues are handled so that the code can use the same
function to read both of them.

http://code.google.com/p/padb/source/detail?r=360

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Thu Dec 10 05:01:38 2009
+++ /trunk/src/padb	Thu Dec 10 07:21:32 2009
@@ -6537,8 +6537,19 @@
      return;
  }

-sub fetch_mpi_queue {
+# Returns the MPI queues for this process given a gdb handle.
+sub fetch_mpi_queue_gdb {
+    my ( $carg, $vp, $pid, $g ) = @_;
+    my @mq = run_minfo( $carg, $g, $vp );
+    return @mq;
+}
+
+# Called as a backoff from the qsnet_show_tport_queue() function if it gets
+# called but can't show the queues for any reason - most likely because it
+# isn't actually a quadrics system.
+sub show_mpi_queue {
      my ( $carg, $vp, $pid ) = @_;
+
      my $g = gdb_start();
      my $p = gdb_attach( $g, $pid );
      if ( !$p ) {
@@ -6554,20 +6565,6 @@

      gdb_detach($g);
      gdb_quit($g);
-    return @mq;
-}
-
-# As above but take a gdb handle
-sub fetch_mpi_queue_gdb {
-    my ( $carg, $vp, $pid, $g ) = @_;
-    my @mq = run_minfo( $carg, $g, $vp );
-    return @mq;
-}
-
-sub show_mpi_queue {
-    my ( $carg, $vp, $pid ) = @_;
-
-    my @mq = fetch_mpi_queue( $carg, $vp, $pid );

      foreach my $o (@mq) {
          output( $vp, $o );
@@ -6575,6 +6572,7 @@
      return;
  }

+# The mode handler for message queue or deadlock detection mode.
  sub show_mpi_queue_one {
      my ( $carg, $proc ) = @_;

@@ -6591,19 +6589,6 @@
      }
      return;
  }
-
-sub show_mpi_queue_for_deadlock_one {
-    my ( $carg, $proc ) = @_;
-
-    my $vp  = $proc->{vp};
-    my $pid = $proc->{pid};
-    my $gdb = $proc->{gdb_handle};
-
-    return unless $gdb;
-
-    my @mq = fetch_mpi_queue_gdb( $carg, $vp, $pid, $gdb );
-    return \@mq;
-}

  sub mpi_queue_output_handler {
      my ( $carg, $lines, $three ) = @_;
@@ -6831,8 +6816,8 @@
      # XXX This is a bit of a hack to make the deadlock code work with input
      # files, the whole thing is due a tidy-up on the full-duplex branch
      # where this should be solved properly.
-    if ( defined $lines->{target_response} ) {
-        $data = $lines->{target_response};
+    if ( defined $lines->{target_output} ) {
+        $data = $lines->{target_output};
      } else {
          $data = $lines->{lines};
      }
@@ -8074,7 +8059,7 @@
      return;
  }

-sub show_queue {
+sub qsnet_show_tport_queue {
      my ( $carg, $vp, $pid ) = @_;

      # If edb isn't installed (this isn't a Quadrics system) don't even try
@@ -9295,7 +9280,7 @@
          arg_long    => 'message-queue',
          qsnet       => 1,
          arg_short   => 'q',
-        handler     => \&show_queue,
+        handler     => \&qsnet_show_tport_queue,
          help        => 'Show the message queues',
          options_i   => { mpi_dll => undef, }
      };
@@ -9324,7 +9309,7 @@
      };

      $allfns{deadlock} = {
-        handler_one  => \&show_mpi_queue_for_deadlock_one,
+        handler_one  => \&show_mpi_queue_one,
          needs_gdb    => 1,
          arg_long     => 'deadlock',
          arg_short    => 'j',


From padb at googlecode.com  Wed Dec 16 21:48:38 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Wed, 16 Dec 2009 21:48:38 +0000
Subject: [padb] r361 committed - Replace a few hard coded char arrays with
	mallocs of a size just...
Message-ID: <0016e6407ac21c9e20047adf7802@google.com>

Revision: 361
Author: apittman
Date: Wed Dec 16 13:48:14 2009
Log: Replace a few hard coded char arrays with mallocs of a size just
big enough for the string.  A slight performance penalty but
it'll help if anybody comes up with a 64 character type name.
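
The sizing technique described in this log can be sketched as below.  This
is only an illustration, not the code from minfo.c: the helper name
format_request() is hypothetical, and it measures the string with
snprintf() and a NULL buffer rather than summing strlen() results by hand
as the commit does.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of minfo.c): build "<type> <name>" in a
 * buffer that is exactly large enough, however long the names are.
 * Returns a malloc'd string the caller must free, or NULL on failure. */
char *format_request(const char *type, const char *name)
{
    /* With a NULL buffer and a size of 0, snprintf only reports the
     * length the formatted string needs, excluding the trailing NUL. */
    int len = snprintf(NULL, 0, "%s %s", type, name);
    if (len < 0)
        return NULL;

    char *req = malloc((size_t)len + 1);
    if (!req)
        return NULL;

    snprintf(req, (size_t)len + 1, "%s %s", type, name);
    return req;
}
```

One thing to watch with this pattern: if a function has already allocated
something else before the request buffer, an early return on a failed
malloc needs to free that earlier allocation as well.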

http://code.google.com/p/padb/source/detail?r=361

Modified:
  /trunk/src/minfo.c

=======================================
--- /trunk/src/minfo.c	Mon Dec  7 13:59:10 2009
+++ /trunk/src/minfo.c	Wed Dec 16 13:48:14 2009
@@ -131,13 +131,24 @@
  }

  void *find_sym (char *type, char *name) {
-    char req[128];
+    char *req;
      char ans[128];
      int i;
      void *addr = NULL;
+    size_t len = 2;
+    len += strlen(type);
+    len += strlen(name);
+
+    req = malloc(len);
+
+    if ( ! req ) {
+	return NULL;
+    }
+
      sprintf(req,"%s %s",type,name);

      i = ask(req,ans);
+    free(req);
      if ( i != 0 )
  	return NULL;

@@ -238,31 +249,48 @@

  mqs_type *find_type (mqs_image *image, char *name, mqs_lang_code lang)
  {
-    char req[128];
+    char *req;
      int i;
      struct type *t = malloc(sizeof(struct type));
      if ( ! t )
  	return NULL;

+    req = malloc (strlen(name)+6);
+    if ( ! req )
+	return NULL;
+
      strncpy(t->name,name,128);
      sprintf(req,"size %s",name);

      i = req_to_int(req,&t->size);
-    if ( i != 0 )
+    free(req);
+    if ( i != 0 ) {
+	free(t);
  	return NULL;
+    }

      return (mqs_type *)t;
  }

  int find_offset (mqs_type *type, char *name)
  {
-    char req[128];
+    char *req;
      int i,offset;
+    size_t len = 9;
      struct type *t = (struct type *)type;

+    len += strlen(t->name);
+    len += strlen(name);
+    req = malloc(len);
+
+    if ( ! req ) {
+	return -1;
+    }
+
      sprintf(req,"offset %s %s",t->name,name);

      i = req_to_int(req,&offset);
+    free(req);
      if ( i != 0 )
  	return -1;


From thipadin.seng-long at bull.net  Fri Dec 18 13:37:15 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Fri, 18 Dec 2009 14:37:15 +0100
Subject: [padb] Réf. : Better handling of threads in stack traces.
Message-ID: <OFC8F07976.94C4EB7E-ONC1257690.0048F46C@frcl.bull.fr>

On Nov 30th, 2009 <ashley at pittman.co.uk> wrote:

> I've been giving some thought to how padb can handle threaded
> applications better as the current scheme isn't ideal.
> 
> 
> First would be to report extra threads in the same tree as the primary
> thread, some magic would have to be applied to cover the fact that the
> first thread in a process starts with main and subsequent ones start
> with pthread_create() but this wouldn't be an insurmountable problem.
> The big problem with this approach would be how to report thread
> identifiers in the same rank-spec as rank identifiers, I could
> revert to just using a list here but that doesn't work so well on big
> systems.

> The second option would be to treat each thread as a different entity
> within the rank/process and have a number of trees displayed per job,
> each dealing with a different thread, e.g. there would be a tree per
> main thread and another tree for each extra thread encountered.  From a
> technical perspective implementing this would require adding a namespace
> to the {target_output} as it's passed back up the comms tree so is the
> hardest to add but would probably lead to the best solution.

> Finally there is the option of not showing all threads but allowing
> users to select a single thread per invocation of padb.  This is the
> simple but functional option although might be best viewed as a step
> along the way to fully supporting multiple threads in future.  Here the
> options are to be able to select threads by id (1,2,...) or perhaps by
> having a white/black list of function names that should appear in the
> stack for a thread before a thread is shown.

> I'd welcome ideas on which people would prefer or if anybody has any
> other thoughts on how to handle threads properly.


I have a support request from a Bull customer who would like to have
the padb report sorted by thread, as below:
   Thread: 1
   --------------------------
   [0-1999] (2000 processes)
   ---------
   main()
    PMPI_Finalize()
     ompi_mpi_finalize()
      barrier()
      ----------------
      ......(249 processes)
      ---------------
       orte_grpcomm_base_allgather()
        opal_progress()
         opal_event_loop()
          epoll_dispatch()
           epoll_wait()
       ---------------
       .....  (1751 processes)
       ----------------
        opal_progress()
         opal_event_loop()
          epoll_dispatch()
           epoll_wait()
   Thread: 2
   --------------------------
   [0-1999] (2000 processes)
   --------- 
      ....
   Thread: 3
   --------------------------
   [0-1999] (2000 processes)
   --------- 
      ....

This report should be produced per job. Would you accept it?

Thipadin.



From padb at googlecode.com  Fri Dec 18 21:26:25 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Fri, 18 Dec 2009 21:26:25 +0000
Subject: [padb] r362 committed - Update to the orte resource manager to
	support spawned jobs. Like slu...
Message-ID: <0016e646417e567486047b0764ca@google.com>

Revision: 362
Author: apittman
Date: Fri Dec 18 13:25:23 2009
Log: Update to the orte resource manager to support spawned jobs.  Like
slurm, each job, uniquely identified by its number, can have a number of
different steps within it.  This commit adds knowledge of these steps
to padb so it can do the right thing.  Allow targeting of different
steps via the orte-job-step configuration option, with the default
step being the lowest-numbered one detected.
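
The job/step handling described above can be sketched in C as below.
padb itself does this in Perl with regular expressions; the
"[[job,step],rank]" name format is taken from the pattern in this commit,
while the helper names parse_orte_name() and pick_step(), and the use of
-1 to mean "not configured", are assumptions made for the illustration.

```c
#include <stdio.h>

/* Hypothetical helper: split a process name of the form
 * "[[job,step],rank]" into its three numbers.
 * Returns 1 on success, 0 if the name does not match.
 * Note that sscanf is looser than the Perl pattern in the commit: it
 * tolerates signs and does not insist on the closing bracket. */
int parse_orte_name(const char *name, int *job, int *step, int *rank)
{
    return sscanf(name, "[[%d,%d],%d]", job, step, rank) == 3;
}

/* Hypothetical helper: return the configured step if one was given
 * (configured >= 0), otherwise the lowest step number detected.
 * Returns -1 when nothing is configured and no steps were detected. */
int pick_step(int configured, const int *steps, int nsteps)
{
    if (configured >= 0)
        return configured;
    if (nsteps <= 0)
        return -1;

    int lowest = steps[0];
    for (int i = 1; i < nsteps; i++) {
        if (steps[i] < lowest)
            lowest = steps[i];
    }
    return lowest;
}
```

As in the commit, the caller still has to check that the chosen step
actually exists for the job before using it.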

http://code.google.com/p/padb/source/detail?r=362

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Thu Dec 10 07:21:32 2009
+++ /trunk/src/padb	Fri Dec 18 13:25:23 2009
@@ -547,6 +547,7 @@
  $conf{rmgr}             = undef;

  $conf{slurm_job_step} = undef;
+$conf{orte_job_step}  = undef;

  $conf{pbs_server} = undef;

@@ -568,7 +569,7 @@
  my @conf_time = qw(prun_exittimeout prun_timeout interval);

  # Config options which take an integer.
-my @conf_int = qw(lsf_job_offset slurm_job_step tree_width);
+my @conf_int = qw(lsf_job_offset slurm_job_step orte_job_step tree_width);

  my $norc       = 0;
  my $configfile = '/etc/padb.conf';
@@ -2865,18 +2866,21 @@
          if ( @elems == 4 ) {
              my $nprocs = $elems[3];
              my $name   = $elems[0];
-            if ( $name =~ m{\A\[(\d+)\,\d+]\z}x ) {
-                $open_jobs{$1}{nprocs} = $nprocs;
+            if ( $name =~ m{\A\[(\d+)\,(\d+)]\z}x ) {
+                my $job  = $1;
+                my $step = $2;
+                $open_jobs{$job}{$step}{nprocs} = $nprocs;
              }
          } elsif ( @elems == 6 ) {
              my $name = $elems[1];
-            if ( $name =~ m{\A\[\[(\d+)\,\d+\]\,(\d+)\]}x ) {
+            if ( $name =~ m{\A\[\[(\d+)\,(\d+)\]\,(\d+)\]}x ) {
                  my $job  = $1;
-                my $rank = $2;
+                my $step = $2;
+                my $rank = $3;
                  my $pid  = $elems[3];
                  my $host = $elems[4];
-                $open_jobs{$job}{hosts}{$host}++;
-                $open_jobs{$job}{ranks}{$host}{$rank} = $pid;
+                $open_jobs{$job}{$step}{hosts}{$host}++;
+                $open_jobs{$job}{$step}{ranks}{$host}{$rank} = $pid;
              }
          }
      }
@@ -2895,7 +2899,22 @@

      open_get_data();

-    my @hosts = keys %{ $open_jobs{$job}{hosts} };
+    my $step = $conf{orte_job_step};
+    if ( not defined $step ) {
+        my @steps = keys %{ $open_jobs{$job} };
+
+        my @ordered = sort { $a <=> $b } @steps;
+
+        $step = $ordered[0];
+
+    }
+
+    if ( not defined $open_jobs{$job}{$step} ) {
+        printf("Job $job (step $step) does not exist\n");
+        return;
+    }
+
+    my @hosts = keys %{ $open_jobs{$job}{$step}{hosts} };
      my $i     = @hosts;

      my ( $fh, $fn ) = tempfile('/tmp/padb.XXXXXXXX');
@@ -2909,9 +2928,9 @@
      my $cmd    = "orterun -machinefile $fn -np $i $prefix";

      my %pcmd;
-    $pcmd{nprocesses}   = $open_jobs{$job}{nprocs};
+    $pcmd{nprocesses}   = $open_jobs{$job}{$step}{nprocs};
      $pcmd{nhosts}       = @hosts;
-    $pcmd{process_data} = $open_jobs{$job}{ranks};
+    $pcmd{process_data} = $open_jobs{$job}{$step}{ranks};
      $pcmd{command}      = $cmd;
      @{ $pcmd{host_list} } = @hosts;
      $pcmd{cleanup_cb}     = \&unlink_file;


From padb at googlecode.com  Fri Dec 18 21:36:47 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Fri, 18 Dec 2009 21:36:47 +0000
Subject: [padb] r363 committed - Add command substitution of %r to rank when
	running in command...
Message-ID: <0016e6d26d586fc22b047b0789e7@google.com>

Revision: 363
Author: apittman
Date: Fri Dec 18 13:35:42 2009
Log: Add command substitution of %r with the rank when running in command
mode.
This allows the following really useful command to work:
padb -a --command -Ocommand="xterm -T %r -e 'gdb -p %p'"
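
For illustration, the placeholder expansion can be sketched in C as
below.  padb does this in Perl with two global substitutions; the
function names here are hypothetical and the code is a sketch of the
idea, not part of padb.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the replacement for the placeholder starting at c, or NULL if
 * c does not start a "%p" or "%r" placeholder. */
static const char *placeholder(const char *c, const char *pid_s,
                               const char *rank_s)
{
    if (c[0] != '%')
        return NULL;
    if (c[1] == 'p')
        return pid_s;
    if (c[1] == 'r')
        return rank_s;
    return NULL;
}

/* Hypothetical helper: return a malloc'd copy of tmpl with every "%p"
 * replaced by pid and every "%r" replaced by rank.  All other text is
 * copied unchanged.  The caller frees the result; NULL means malloc
 * failed. */
char *expand_command(const char *tmpl, int rank, int pid)
{
    char rank_s[32], pid_s[32];
    snprintf(rank_s, sizeof rank_s, "%d", rank);
    snprintf(pid_s, sizeof pid_s, "%d", pid);

    /* First pass: work out the exact output length. */
    size_t len = 1; /* terminating NUL */
    for (const char *c = tmpl; *c; c++) {
        const char *sub = placeholder(c, pid_s, rank_s);
        if (sub) {
            len += strlen(sub);
            c++; /* skip the second placeholder character */
        } else {
            len++;
        }
    }

    char *out = malloc(len);
    if (!out)
        return NULL;

    /* Second pass: copy text and substitute placeholders. */
    char *w = out;
    for (const char *c = tmpl; *c; c++) {
        const char *sub = placeholder(c, pid_s, rank_s);
        if (sub) {
            size_t n = strlen(sub);
            memcpy(w, sub, n);
            w += n;
            c++;
        } else {
            *w++ = *c;
        }
    }
    *w = '\0';
    return out;
}
```

With the template from the log, rank 3 and pid 4242, this yields
xterm -T 3 -e 'gdb -p 4242'.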

http://code.google.com/p/padb/source/detail?r=363

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Fri Dec 18 13:25:23 2009
+++ /trunk/src/padb	Fri Dec 18 13:35:42 2009
@@ -8057,16 +8057,17 @@
  }

  sub run_cmd_against_target {
-    my ( $cargs, $vp, $pid ) = @_;
+    my ( $cargs, $rank, $pid ) = @_;

      my $cmd = $cargs->{command};

      $cmd =~ s{%p}{$pid}g;
+    $cmd =~ s{%r}{$rank}g;

      my @output = slurp_cmd($cmd);
      chomp @output;
      foreach my $line (@output) {
-        output( $vp, $line );
+        output( $rank, $line );
      }
      return;
  }


From ashley at pittman.co.uk  Mon Dec 21 12:04:05 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 21 Dec 2009 12:04:05 +0000
Subject: [padb] Réf. : Better handling of threads in stack traces.
In-Reply-To: <OFC8F07976.94C4EB7E-ONC1257690.0048F46C@frcl.bull.fr>
References: <OFC8F07976.94C4EB7E-ONC1257690.0048F46C@frcl.bull.fr>
Message-ID: <1261397045.3600.35.camel@alpha>

On Fri, 2009-12-18 at 14:37 +0100, thipadin.seng-long at bull.net wrote:

> I have a support request from Bull customer that would like to have 
> padb report sorted by threads as below: 

That's great.

> Would you accept it ? 

I'm not sure what you mean here, I would gladly accept a patch
implementing it.

I don't have a huge amount of time to be working on padb right now and
there are already a large number of features waiting for a release as
well as people who've found they need to use the SVN version for some
individual machine.

My priority currently is to finish off what is missing, package padb
properly into an rpm, convert it to use autoconf and fix whatever bugs
people find in the meantime.  Re: threads in stack traces, I don't see
myself having time to do anything other than the simple option of
allowing users to restrict which threads are shown for each invocation
of padb without it delaying the release.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From thipadin.seng-long at bull.net  Mon Dec 21 12:55:37 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Mon, 21 Dec 2009 13:55:37 +0100
Subject: [padb] Réf. : Re: Réf. : Better handling of threads in stack traces.
Message-ID: <OF9A6EEE5A.ECC96AA1-ONC1257693.00455DA0@frcl.bull.fr>

On Mon, 21 Dec 2009 at 13:04 +0100 ashley at pittman.co.uk wrote:

>On Fri, 2009-12-18 at 14:37 +0100, thipadin.seng-long at bull.net wrote:

>> I have a support request from Bull customer that would like to have 
>> padb report sorted by threads as below: 

>That's great.

>> Would you accept it ? 

> I'm not sure what you mean here, I would gladly accept a patch
> implementing it.

> I don't have a huge amount of time to be working on padb right now and
> there are already a large number of features waiting for a release as
> well as people who've found they need to use the SVN version for some
> individual machine.

> My priority currently is to finish off what is missing, package padb
> properly into an rpm, convert it to use autoconf and fix whatever bugs
> people find in the meantime.  Re: threads in stack traces, I don't see
> myself having time to do anything other than the simple option of
> allowing users to restrict which threads are shown for each invocation
> of padb without it delaying the release.

> Ashley,

OK, you are right: let's package padb with all the features that are
waiting for a release. I think that is a wise decision.
As for me, I will be off for 2 weeks, so I'll work on sorting by thread
at the beginning of the year.
Thipadin.

From padb at googlecode.com  Mon Dec 21 19:57:50 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 19:57:50 +0000
Subject: [padb]  r364 committed - Add svn properties to minfo,
	remove old 	#ident line as modern...
Message-ID: <0016361e7e700e3cdf047b428166@google.com>

Revision: 364
Author: apittman
Date: Mon Dec 21 11:57:07 2009
Log: Add svn properties to minfo, remove old #ident line as modern
compilers complain about it.

http://code.google.com/p/padb/source/detail?r=364

Modified:
  /trunk/src/minfo.c

=======================================
--- /trunk/src/minfo.c	Wed Dec 16 13:48:14 2009
+++ /trunk/src/minfo.c	Mon Dec 21 11:57:07 2009
@@ -3,8 +3,11 @@
   * Copyright (c) 2009, Ashley Pittman.
   */

-#ident "elfN.c,v 1.14 2005-11-03 11:23:04 ashley Exp"
-/*             /cvs/master/quadrics/elan4lib/edb/elfN.c,v */
+/*
+ * $URL$
+ * $Date$
+ * $Revision$
+ */

  #include <string.h>
  #include <stdlib.h>


From padb at googlecode.com  Mon Dec 21 20:22:00 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 20:22:00 +0000
Subject: [padb] r365 committed - Add a thread-list mode to show the user a
	list of threads that...
Message-ID: <0016e68ea1c37ec460047b42d7e9@google.com>

Revision: 365
Author: apittman
Date: Mon Dec 21 12:21:54 2009
Log: Add a thread-list mode to show the user a list of threads that
are running in the target process.

http://code.google.com/p/padb/source/detail?r=365

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Fri Dec 18 13:35:42 2009
+++ /trunk/src/padb	Mon Dec 21 12:21:54 2009
@@ -50,6 +50,8 @@
  #  * Add a --lstopo option to run the lstopo command for each rank.
  #    http://www.open-mpi.org/projects/hwloc/
  #  * Add a 'command' mode to run abritary commands on the target node.
+#  * Add a 'thread-list' mode to report a comma seperated list of threads
+#    for each target process.
  #  * Enhance the integration with gdb, use sequence numbers when
  #    talking to gdb and check that we get back what we give it.
  #    Correctly notice and raise an appropriate error if gdb dies
@@ -8010,6 +8012,33 @@
      }
      return;
  }
+
+sub thread_list_from_pid {
+    my ( $carg, $proc ) = @_;
+
+    return unless defined $proc->{gdb_handle};
+
+    my $gdb = $proc->{gdb_handle};
+
+    my %result = gdb_n_send( $gdb, '-thread-list-ids' );
+    if ( $result{status} ne 'done' ) {
+        return;
+    }
+    my $data = gdb_parse_reason( $result{reason}, 'thread-ids' );
+    if ( not defined $data->{'thread-ids'} ) {
+        return;
+    }
+
+    my @threads;
+    foreach my $thread ( @{ $data->{'thread-ids'} } ) {
+        my $id = $thread->{'thread-id'};
+        push @threads, $id;
+    }
+
+    my $thread_list = join q{,}, sort { $a <=> $b } @threads;
+
+    output( $proc->{vp}, $thread_list );
+}

  sub kill_proc {
      my ( $cargs, $vp, $pid ) = @_;
@@ -9354,6 +9383,13 @@
          }
      };

+    $allfns{threads} = {
+        handler_one => \&thread_list_from_pid,
+        needs_gdb   => 1,
+        arg_long    => 'thread-list',
+        help        => 'List threads in target processes',
+    };
+
      $allfns{proc_summary} = {
          handler_all => \&show_proc_all,
          out_handler => \&show_proc_format,


From padb at googlecode.com  Mon Dec 21 20:47:30 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 20:47:30 +0000
Subject: [padb] r366 committed - Add a thread-id configuration option to
	specify which threads are...
Message-ID: <001636c9277daccb8c047b43320f@google.com>

Revision: 366
Author: apittman
Date: Mon Dec 21 12:46:24 2009
Log: Add a thread-id configuration option to specify which threads are
shown in the stack trace view.  The default is to show all threads
but this option allows the user to restrict it to one thread.

http://code.google.com/p/padb/source/detail?r=366

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Mon Dec 21 12:21:54 2009
+++ /trunk/src/padb	Mon Dec 21 12:46:24 2009
@@ -52,6 +52,10 @@
  #  * Add a 'command' mode to run abritary commands on the target node.
  #  * Add a 'thread-list' mode to report a comma seperated list of threads
  #    for each target process.
+#  * Add a 'thread-id' configuration option for when collecting stack
+#    traces.  This isn't a complete solution which will have to wait
+#    for 3.2 but does allow the user to specify which thread within an
+#    application is reported on.
  #  * Enhance the integration with gdb, use sequence numbers when
  #    talking to gdb and check that we get back what we give it.
  #    Correctly notice and raise an appropriate error if gdb dies
@@ -7098,7 +7102,7 @@
  }

  sub gdb_dump_frames_per_thread {
-    my ( $gdb, $detail ) = @_;
+    my ( $gdb, $detail, $thread_id ) = @_;
      my @th = ();
      my %result = gdb_n_send( $gdb, '-thread-list-ids' );
      if ( $result{status} ne 'done' ) {
@@ -7108,6 +7112,11 @@
      if ( not defined $data->{'thread-ids'} ) {
          return;
      }
+
+    # I honestly don't know what this code is here for, presumably at
+    # some point in the past I've experienced a version of gdb which
+    # reports the number-of-threads as zero!  No harm in leaving the
+    # code here however.
      if ( $data->{'number-of-threads'} == 0 ) {
          my %t = ( id => 0 );
          @{ $t{frames} } = gdb_dump_frames( $gdb, $detail );
@@ -7128,6 +7137,9 @@
      }
      foreach my $thread ( @{ $data->{'thread-ids'} } ) {
          my $id = $thread->{'thread-id'};
+        if ( defined $thread_id ) {
+            next unless $thread_id eq $id;
+        }
          my %t = ( id => $id );
          gdb_send( $gdb, "-thread-select $id" );
          @{ $t{frames} } = gdb_dump_frames( $gdb, $detail );
@@ -7871,9 +7883,11 @@
          if (   $carg->{stack_shows_params}
              or $carg->{stack_shows_locals} )
          {
-            @threads = gdb_dump_frames_per_thread( $gdb, 1 );
+            @threads =
+              gdb_dump_frames_per_thread( $gdb, 1, $carg->{thread_id} );
          } else {
-            @threads = gdb_dump_frames_per_thread($gdb);
+            @threads =
+              gdb_dump_frames_per_thread( $gdb, undef, $carg->{thread_id} );
          }

          if ( defined $threads[0]->{frames} ) {
@@ -7892,7 +7906,13 @@
          and ( $tries++ < $carg->{gdb_retry_count} ) );

      if ( not defined $threads[0]{id} ) {
-        target_error( $vp, 'Could not extract stack trace from application' );
+        if ( $carg->{thread_id} ) {
+            target_error( $vp,
+                'Could not extract stack trace for specified thread' );
+        } else {
+            target_error( $vp,
+                'Could not extract stack trace from application' );
+        }
          return;
      }

@@ -9430,6 +9450,7 @@
              stack_strip_above =>
  'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress',
              stack_strip_below => 'main,__libc_start_main,start_thread',
+            thread_id         => undef,
          },
          options_bool => {
              stack_shows_params => 'yes',


From padb at googlecode.com  Mon Dec 21 20:53:37 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 20:53:37 +0000
Subject: [padb] r367 committed - Re-name --thread-list to --list-threads as
	this seems more natural.
Message-ID: <0016368e1b64944ad8047b43487b@google.com>

Revision: 367
Author: apittman
Date: Mon Dec 21 12:52:44 2009
Log: Re-name --thread-list to --list-threads as this seems more natural.

http://code.google.com/p/padb/source/detail?r=367

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Mon Dec 21 12:46:24 2009
+++ /trunk/src/padb	Mon Dec 21 12:52:44 2009
@@ -50,8 +50,8 @@
  #  * Add a --lstopo option to run the lstopo command for each rank.
  #    http://www.open-mpi.org/projects/hwloc/
  #  * Add a 'command' mode to run abritary commands on the target node.
-#  * Add a 'thread-list' mode to report a comma seperated list of threads
-#    for each target process.
+#  * Add a 'list-threads' mode to report a comma seperated list of
+#    threads for each target process.
  #  * Add a 'thread-id' configuration option for when collecting stack
  #    traces.  This isn't a complete solution which will have to wait
  #    for 3.2 but does allow the user to specify which thread within an
@@ -9406,7 +9406,7 @@
      $allfns{threads} = {
          handler_one => \&thread_list_from_pid,
          needs_gdb   => 1,
-        arg_long    => 'thread-list',
+        arg_long    => 'list-threads',
          help        => 'List threads in target processes',
      };


From ashley at pittman.co.uk  Mon Dec 21 21:03:32 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 21 Dec 2009 21:03:32 +0000
Subject: [padb] Réf. : Re: Réf. : Better handling of threads in stack traces.
In-Reply-To: <OF9A6EEE5A.ECC96AA1-ONC1257693.00455DA0@frcl.bull.fr>
References: <OF9A6EEE5A.ECC96AA1-ONC1257693.00455DA0@frcl.bull.fr>
Message-ID: <1261429412.3600.44.camel@alpha>

On Mon, 2009-12-21 at 13:55 +0100, thipadin.seng-long at bull.net wrote:

> OK, you are right: let's package padb with all the features that are
> waiting for a release. I think that is a wise decision.
> As for me, I will be off for 2 weeks, so I'll work on sorting by thread
> at the beginning of the year.

I've done the simple thing with threads for now: added a --list-threads
option to show detected threads and an optional thread-id configuration
option for -x to restrict which thread is reported on.  Let me know how
you get on with this and whether it's an acceptable half-way house for
your customer.  Anything more complex is going to require changes to the
padb core, which we can look at after 3.1 is out.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From padb at googlecode.com  Mon Dec 21 22:13:12 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 22:13:12 +0000
Subject: [padb] r368 committed - Rename Makefile to be Makefile-simple to
	make way for a new auto-conf...
Message-ID: <0016361e859428e66d047b4465e7@google.com>

Revision: 368
Author: apittman
Date: Mon Dec 21 14:12:41 2009
Log: Rename Makefile to be Makefile-simple to make way for a new autoconf
generated makefile.  Hopefully soon I can delete this file completely
but for now leave it lying around so that it can be invoked with make -f

http://code.google.com/p/padb/source/detail?r=368

Added:
  /trunk/src/Makefile-simple
Deleted:
  /trunk/src/Makefile

=======================================
--- /dev/null
+++ /trunk/src/Makefile-simple	Mon Dec 21 14:12:41 2009
@@ -0,0 +1,42 @@
+
+INSTALL_DIR=/usr/local/
+CONFIG_DIR=/etc
+VERSION=3.0-rc1
+CC=gcc
+CFLAGS=-Wall -g
+
+FILES = Makefile minfo.c mpi_interface.h padb
+
+minfo.x: minfo.c mpi_interface.h
+	$(CC) minfo.c -o minfo.x -ldl $(CFLAGS)
+
+install: minfo.x
+	/bin/mkdir -p ${INSTALL_DIR}/bin
+	/bin/cp minfo.x ${INSTALL_DIR}/bin/
+	/bin/cp padb ${INSTALL_DIR}/bin/
+
+make config_install:
+	/bin/mkdir -p ${CONFIG_DIR}
+	/bin/cp padb.conf ${CONFIG_DIR}/
+
+clean:
+	/bin/rm -f minfo.x
+
+tarfile:
+	/bin/rm -f padb-${VERSION}.tgz
+	/bin/rm -rf padb-${VERSION}
+	mkdir padb-${VERSION}
+	/bin/cp ${FILES} padb-${VERSION}
+	svnversion > padb-${VERSION}/svnversion
+	tar -czf padb-${VERSION}.tgz padb-${VERSION}
+
+tidy:
+	perltidy -b -ce -w -se padb
+
+pc:	padb
+	perlcritic --brutal --verbose "%l: (%s) %m\n" padb > .pc.tmp || true
+	/bin/mv .pc.tmp pc
+
+report: pc
+	./report.pl pc | tee report
+
=======================================
--- /trunk/src/Makefile	Wed Oct 28 15:31:42 2009
+++ /dev/null
@@ -1,42 +0,0 @@
-
-INSTALL_DIR=/usr/local/
-CONFIG_DIR=/etc
-VERSION=3.0-rc1
-CC=gcc
-CFLAGS=-Wall -g
-
-FILES = Makefile minfo.c mpi_interface.h padb
-
-minfo.x: minfo.c mpi_interface.h
-	$(CC) minfo.c -o minfo.x -ldl $(CFLAGS)
-
-install: minfo.x
-	/bin/mkdir -p ${INSTALL_DIR}/bin
-	/bin/cp minfo.x ${INSTALL_DIR}/bin/
-	/bin/cp padb ${INSTALL_DIR}/bin/
-
-make config_install:
-	/bin/mkdir -p ${CONFIG_DIR}
-	/bin/cp padb.conf ${CONFIG_DIR}/
-
-clean:
-	/bin/rm -f minfo.x
-
-tarfile:
-	/bin/rm -f padb-${VERSION}.tgz
-	/bin/rm -rf padb-${VERSION}
-	mkdir padb-${VERSION}
-	/bin/cp ${FILES} padb-${VERSION}
-	svnversion > padb-${VERSION}/svnversion
-	tar -czf padb-${VERSION}.tgz padb-${VERSION}
-
-tidy:
-	perltidy -b -ce -w -se padb
-
-pc:	padb
-	perlcritic --brutal --verbose "%l: (%s) %m\n" padb > .pc.tmp || true
-	/bin/mv .pc.tmp pc
-
-report: pc
-	./report.pl pc | tee report
-


From padb at googlecode.com  Mon Dec 21 22:19:21 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 22:19:21 +0000
Subject: [padb]  r369 committed - First cut at using automake to build padb,
	it's early days with...
Message-ID: <001636c9277d2a91c7047b447bc7@google.com>

Revision: 369
Author: apittman
Date: Mon Dec 21 14:19:11 2009
Log: First cut at using automake to build padb.  It's early days for
the code yet, but a basic make/install/dist is all working.  More
attention is needed to install the config file, and some code
changes to padb itself are needed as well to cope with the updated
paths.

http://code.google.com/p/padb/source/detail?r=369

Added:
  /trunk/Makefile.am
  /trunk/autogen.sh
  /trunk/configure.in
  /trunk/src/Makefile.am

=======================================
--- /dev/null
+++ /trunk/Makefile.am	Mon Dec 21 14:19:11 2009
@@ -0,0 +1,1 @@
+SUBDIRS = src
=======================================
--- /dev/null
+++ /trunk/autogen.sh	Mon Dec 21 14:19:11 2009
@@ -0,0 +1,8 @@
+#!/bin/sh
+
+set -e
+set -x
+
+aclocal
+autoconf
+automake -a
=======================================
--- /dev/null
+++ /trunk/configure.in	Mon Dec 21 14:19:11 2009
@@ -0,0 +1,6 @@
+AC_INIT(src/padb)
+AM_INIT_AUTOMAKE(padb,3.1)
+AC_PROG_CC
+AC_PROG_INSTALL
+AM_PROG_CC_C_O
+AC_OUTPUT(Makefile src/Makefile)
=======================================
--- /dev/null
+++ /trunk/src/Makefile.am	Mon Dec 21 14:19:11 2009
@@ -0,0 +1,5 @@
+bin_SCRIPTS = padb
+libexec_PROGRAMS = minfo
+minfo_CFLAGS = -ldl
+minfo_SOURCES = minfo.c mpi_interface.h
+EXTRA_DIST = padb


From padb at googlecode.com  Mon Dec 21 22:36:42 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 22:36:42 +0000
Subject: [padb]  r370 committed - Add boilerplate news, copying,
	changelog and 	authors files....
Message-ID: <001636e1f3ad3e6103047b44b9c6@google.com>

Revision: 370
Author: apittman
Date: Mon Dec 21 14:35:42 2009
Log: Add boilerplate news, copying, changelog and authors files.
I'll try and work out how to add a THANKS file shortly...

http://code.google.com/p/padb/source/detail?r=370

Added:
  /trunk/AUTHORS
  /trunk/COPYING
  /trunk/ChangeLog
  /trunk/NEWS

=======================================
--- /dev/null
+++ /trunk/AUTHORS	Mon Dec 21 14:35:42 2009
@@ -0,0 +1,4 @@
+
+Authors of padb
+
+Ashley Pittman.
=======================================
--- /dev/null
+++ /trunk/COPYING	Mon Dec 21 14:35:42 2009
@@ -0,0 +1,504 @@
+		  GNU LESSER GENERAL PUBLIC LICENSE
+		       Version 2.1, February 1999
+
+ Copyright (C) 1991, 1999 Free Software Foundation, Inc.
+ 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+[This is the first released version of the Lesser GPL.  It also counts
+ as the successor of the GNU Library Public License, version 2, hence
+ the version number 2.1.]
+
+			    Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+Licenses are intended to guarantee your freedom to share and change
+free software--to make sure the software is free for all its users.
+
+  This license, the Lesser General Public License, applies to some
+specially designated software packages--typically libraries--of the
+Free Software Foundation and other authors who decide to use it.  You
+can use it too, but we suggest you first think carefully about whether
+this license or the ordinary General Public License is the better
+strategy to use in any particular case, based on the explanations below.
+
+  When we speak of free software, we are referring to freedom of use,
+not price.  Our General Public Licenses are designed to make sure that
+you have the freedom to distribute copies of free software (and charge
+for this service if you wish); that you receive source code or can get
+it if you want it; that you can change the software and use pieces of
+it in new free programs; and that you are informed that you can do
+these things.
+
+  To protect your rights, we need to make restrictions that forbid
+distributors to deny you these rights or to ask you to surrender these
+rights.  These restrictions translate to certain responsibilities for
+you if you distribute copies of the library or if you modify it.
+
+  For example, if you distribute copies of the library, whether gratis
+or for a fee, you must give the recipients all the rights that we gave
+you.  You must make sure that they, too, receive or can get the source
+code.  If you link other code with the library, you must provide
+complete object files to the recipients, so that they can relink them
+with the library after making changes to the library and recompiling
+it.  And you must show them these terms so they know their rights.
+
+  We protect your rights with a two-step method: (1) we copyright the
+library, and (2) we offer you this license, which gives you legal
+permission to copy, distribute and/or modify the library.
+
+  To protect each distributor, we want to make it very clear that
+there is no warranty for the free library.  Also, if the library is
+modified by someone else and passed on, the recipients should know
+that what they have is not the original version, so that the original
+author's reputation will not be affected by problems that might be
+introduced by others.
+
+  Finally, software patents pose a constant threat to the existence of
+any free program.  We wish to make sure that a company cannot
+effectively restrict the users of a free program by obtaining a
+restrictive license from a patent holder.  Therefore, we insist that
+any patent license obtained for a version of the library must be
+consistent with the full freedom of use specified in this license.
+
+  Most GNU software, including some libraries, is covered by the
+ordinary GNU General Public License.  This license, the GNU Lesser
+General Public License, applies to certain designated libraries, and
+is quite different from the ordinary General Public License.  We use
+this license for certain libraries in order to permit linking those
+libraries into non-free programs.
+
+  When a program is linked with a library, whether statically or using
+a shared library, the combination of the two is legally speaking a
+combined work, a derivative of the original library.  The ordinary
+General Public License therefore permits such linking only if the
+entire combination fits its criteria of freedom.  The Lesser General
+Public License permits more lax criteria for linking other code with
+the library.
+
+  We call this license the "Lesser" General Public License because it
+does Less to protect the user's freedom than the ordinary General
+Public License.  It also provides other free software developers Less
+of an advantage over competing non-free programs.  These disadvantages
+are the reason we use the ordinary General Public License for many
+libraries.  However, the Lesser license provides advantages in certain
+special circumstances.
+
+  For example, on rare occasions, there may be a special need to
+encourage the widest possible use of a certain library, so that it becomes
+a de-facto standard.  To achieve this, non-free programs must be
+allowed to use the library.  A more frequent case is that a free
+library does the same job as widely used non-free libraries.  In this
+case, there is little to gain by limiting the free library to free
+software only, so we use the Lesser General Public License.
+
+  In other cases, permission to use a particular library in non-free
+programs enables a greater number of people to use a large body of
+free software.  For example, permission to use the GNU C Library in
+non-free programs enables many more people to use the whole GNU
+operating system, as well as its variant, the GNU/Linux operating
+system.
+
+  Although the Lesser General Public License is Less protective of the
+users' freedom, it does ensure that the user of a program that is
+linked with the Library has the freedom and the wherewithal to run
+that program using a modified version of the Library.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.  Pay close attention to the difference between a
+"work based on the library" and a "work that uses the library".  The
+former contains code derived from the library, whereas the latter must
+be combined with the library in order to run.
+
+		  GNU LESSER GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License Agreement applies to any software library or other
+program which contains a notice placed by the copyright holder or
+other authorized party saying it may be distributed under the terms of
+this Lesser General Public License (also called "this License").
+Each licensee is addressed as "you".
+
+  A "library" means a collection of software functions and/or data
+prepared so as to be conveniently linked with application programs
+(which use some of those functions and data) to form executables.
+
+  The "Library", below, refers to any such software library or work
+which has been distributed under these terms.  A "work based on the
+Library" means either the Library or any derivative work under
+copyright law: that is to say, a work containing the Library or a
+portion of it, either verbatim or with modifications and/or translated
+straightforwardly into another language.  (Hereinafter, translation is
+included without limitation in the term "modification".)
+
+  "Source code" for a work means the preferred form of the work for
+making modifications to it.  For a library, complete source code means
+all the source code for all modules it contains, plus any associated
+interface definition files, plus the scripts used to control compilation
+and installation of the library.
+
+  Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running a program using the Library is not restricted, and output from
+such a program is covered only if its contents constitute a work based
+on the Library (independent of the use of the Library in a tool for
+writing it).  Whether that is true depends on what the Library does
+and what the program that uses the Library does.
+
+  1. You may copy and distribute verbatim copies of the Library's
+complete source code as you receive it, in any medium, provided that
+you conspicuously and appropriately publish on each copy an
+appropriate copyright notice and disclaimer of warranty; keep intact
+all the notices that refer to this License and to the absence of any
+warranty; and distribute a copy of this License along with the
+Library.
+
+  You may charge a fee for the physical act of transferring a copy,
+and you may at your option offer warranty protection in exchange for a
+fee.
+
+  2. You may modify your copy or copies of the Library or any portion
+of it, thus forming a work based on the Library, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) The modified work must itself be a software library.
+
+    b) You must cause the files modified to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    c) You must cause the whole of the work to be licensed at no
+    charge to all third parties under the terms of this License.
+
+    d) If a facility in the modified Library refers to a function or a
+    table of data to be supplied by an application program that uses
+    the facility, other than as an argument passed when the facility
+    is invoked, then you must make a good faith effort to ensure that,
+    in the event an application does not supply such function or
+    table, the facility still operates, and performs whatever part of
+    its purpose remains meaningful.
+
+    (For example, a function in a library to compute square roots has
+    a purpose that is entirely well-defined independent of the
+    application.  Therefore, Subsection 2d requires that any
+    application-supplied function or table used by this function must
+    be optional: if the application does not supply it, the square
+    root function must still compute square roots.)
+
+These requirements apply to the modified work as a whole.  If
+identifiable sections of that work are not derived from the Library,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works.  But when you
+distribute the same sections as part of a whole which is a work based
+on the Library, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote
+it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Library.
+
+In addition, mere aggregation of another work not based on the Library
+with the Library (or with a work based on the Library) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may opt to apply the terms of the ordinary GNU General Public
+License instead of this License to a given copy of the Library.  To do
+this, you must alter all the notices that refer to this License, so
+that they refer to the ordinary GNU General Public License, version 2,
+instead of to this License.  (If a newer version than version 2 of the
+ordinary GNU General Public License has appeared, then you can specify
+that version instead if you wish.)  Do not make any other change in
+these notices.
+
+  Once this change is made in a given copy, it is irreversible for
+that copy, so the ordinary GNU General Public License applies to all
+subsequent copies and derivative works made from that copy.
+
+  This option is useful when you wish to copy part of the code of
+the Library into a program that is not a library.
+
+  4. You may copy and distribute the Library (or a portion or
+derivative of it, under Section 2) in object code or executable form
+under the terms of Sections 1 and 2 above provided that you accompany
+it with the complete corresponding machine-readable source code, which
+must be distributed under the terms of Sections 1 and 2 above on a
+medium customarily used for software interchange.
+
+  If distribution of object code is made by offering access to copy
+from a designated place, then offering equivalent access to copy the
+source code from the same place satisfies the requirement to
+distribute the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+  5. A program that contains no derivative of any portion of the
+Library, but is designed to work with the Library by being compiled or
+linked with it, is called a "work that uses the Library".  Such a
+work, in isolation, is not a derivative work of the Library, and
+therefore falls outside the scope of this License.
+
+  However, linking a "work that uses the Library" with the Library
+creates an executable that is a derivative of the Library (because it
+contains portions of the Library), rather than a "work that uses the
+library".  The executable is therefore covered by this License.
+Section 6 states terms for distribution of such executables.
+
+  When a "work that uses the Library" uses material from a header file
+that is part of the Library, the object code for the work may be a
+derivative work of the Library even though the source code is not.
+Whether this is true is especially significant if the work can be
+linked without the Library, or if the work is itself a library.  The
+threshold for this to be true is not precisely defined by law.
+
+  If such an object file uses only numerical parameters, data
+structure layouts and accessors, and small macros and small inline
+functions (ten lines or less in length), then the use of the object
+file is unrestricted, regardless of whether it is legally a derivative
+work.  (Executables containing this object code plus portions of the
+Library will still fall under Section 6.)
+
+  Otherwise, if the work is a derivative of the Library, you may
+distribute the object code for the work under the terms of Section 6.
+Any executables containing that work also fall under Section 6,
+whether or not they are linked directly with the Library itself.
+
+  6. As an exception to the Sections above, you may also combine or
+link a "work that uses the Library" with the Library to produce a
+work containing portions of the Library, and distribute that work
+under terms of your choice, provided that the terms permit
+modification of the work for the customer's own use and reverse
+engineering for debugging such modifications.
+
+  You must give prominent notice with each copy of the work that the
+Library is used in it and that the Library and its use are covered by
+this License.  You must supply a copy of this License.  If the work
+during execution displays copyright notices, you must include the
+copyright notice for the Library among them, as well as a reference
+directing the user to the copy of this License.  Also, you must do one
+of these things:
+
+    a) Accompany the work with the complete corresponding
+    machine-readable source code for the Library including whatever
+    changes were used in the work (which must be distributed under
+    Sections 1 and 2 above); and, if the work is an executable linked
+    with the Library, with the complete machine-readable "work that
+    uses the Library", as object code and/or source code, so that the
+    user can modify the Library and then relink to produce a modified
+    executable containing the modified Library.  (It is understood
+    that the user who changes the contents of definitions files in the
+    Library will not necessarily be able to recompile the application
+    to use the modified definitions.)
+
+    b) Use a suitable shared library mechanism for linking with the
+    Library.  A suitable mechanism is one that (1) uses at run time a
+    copy of the library already present on the user's computer system,
+    rather than copying library functions into the executable, and (2)
+    will operate properly with a modified version of the library, if
+    the user installs one, as long as the modified version is
+    interface-compatible with the version that the work was made with.
+
+    c) Accompany the work with a written offer, valid for at
+    least three years, to give the same user the materials
+    specified in Subsection 6a, above, for a charge no more
+    than the cost of performing this distribution.
+
+    d) If distribution of the work is made by offering access to copy
+    from a designated place, offer equivalent access to copy the above
+    specified materials from the same place.
+
+    e) Verify that the user has already received a copy of these
+    materials or that you have already sent this user a copy.
+
+  For an executable, the required form of the "work that uses the
+Library" must include any data and utility programs needed for
+reproducing the executable from it.  However, as a special exception,
+the materials to be distributed need not include anything that is
+normally distributed (in either source or binary form) with the major
+components (compiler, kernel, and so on) of the operating system on
+which the executable runs, unless that component itself accompanies
+the executable.
+
+  It may happen that this requirement contradicts the license
+restrictions of other proprietary libraries that do not normally
+accompany the operating system.  Such a contradiction means you cannot
+use both them and the Library together in an executable that you
+distribute.
+
+  7. You may place library facilities that are a work based on the
+Library side-by-side in a single library together with other library
+facilities not covered by this License, and distribute such a combined
+library, provided that the separate distribution of the work based on
+the Library and of the other library facilities is otherwise
+permitted, and provided that you do these two things:
+
+    a) Accompany the combined library with a copy of the same work
+    based on the Library, uncombined with any other library
+    facilities.  This must be distributed under the terms of the
+    Sections above.
+
+    b) Give prominent notice with the combined library of the fact
+    that part of it is a work based on the Library, and explaining
+    where to find the accompanying uncombined form of the same work.
+
+  8. You may not copy, modify, sublicense, link with, or distribute
+the Library except as expressly provided under this License.  Any
+attempt otherwise to copy, modify, sublicense, link with, or
+distribute the Library is void, and will automatically terminate your
+rights under this License.  However, parties who have received copies,
+or rights, from you under this License will not have their licenses
+terminated so long as such parties remain in full compliance.
+
+  9. You are not required to accept this License, since you have not
+signed it.  However, nothing else grants you permission to modify or
+distribute the Library or its derivative works.  These actions are
+prohibited by law if you do not accept this License.  Therefore, by
+modifying or distributing the Library (or any work based on the
+Library), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Library or works based on it.
+
+  10. Each time you redistribute the Library (or any work based on the
+Library), the recipient automatically receives a license from the
+original licensor to copy, distribute, link with or modify the Library
+subject to these terms and conditions.  You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties with
+this License.
+
+  11. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Library at all.  For example, if a patent
+license would not permit royalty-free redistribution of the Library by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Library.
+
+If any portion of this section is held invalid or unenforceable under any
+particular circumstance, the balance of the section is intended to apply,
+and the section as a whole is intended to apply in other circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system which is
+implemented by public license practices.  Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+  12. If the distribution and/or use of the Library is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Library under this License may add
+an explicit geographical distribution limitation excluding those countries,
+so that distribution is permitted only in or among countries not thus
+excluded.  In such case, this License incorporates the limitation as if
+written in the body of this License.
+
+  13. The Free Software Foundation may publish revised and/or new
+versions of the Lesser General Public License from time to time.
+Such new versions will be similar in spirit to the present version,
+but may differ in detail to address new problems or concerns.
+
+Each version is given a distinguishing version number.  If the Library
+specifies a version number of this License which applies to it and
+"any later version", you have the option of following the terms and
+conditions either of that version or of any later version published by
+the Free Software Foundation.  If the Library does not specify a
+license version number, you may choose any version ever published by
+the Free Software Foundation.
+
+  14. If you wish to incorporate parts of the Library into other free
+programs whose distribution conditions are incompatible with these,
+write to the author to ask for permission.  For software which is
+copyrighted by the Free Software Foundation, write to the Free
+Software Foundation; we sometimes make exceptions for this.  Our
+decision will be guided by the two goals of preserving the free status
+of all derivatives of our free software and of promoting the sharing
+and reuse of software generally.
+
+			    NO WARRANTY
+
+  15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO
+WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW.
+EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR
+OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY
+KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
+LIBRARY IS WITH YOU.  SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME
+THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+  16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN
+WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY
+AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU
+FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR
+CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
+LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
+RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
+FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
+SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
+DAMAGES.
+
+		     END OF TERMS AND CONDITIONS
+
+           How to Apply These Terms to Your New Libraries
+
+  If you develop a new library, and you want it to be of the greatest
+possible use to the public, we recommend making it free software that
+everyone can redistribute and change.  You can do so by permitting
+redistribution under these terms (or, alternatively, under the terms of the
+ordinary General Public License).
+
+  To apply these terms, attach the following notices to the library.  It is
+safest to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least the
+"copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the library's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This library is free software; you can redistribute it and/or
+    modify it under the terms of the GNU Lesser General Public
+    License as published by the Free Software Foundation; either
+    version 2.1 of the License, or (at your option) any later version.
+
+    This library is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+    Lesser General Public License for more details.
+
+    You should have received a copy of the GNU Lesser General Public
+    License along with this library; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
+
+Also add information on how to contact you by electronic and paper mail.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the library, if
+necessary.  Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the
+  library `Frob' (a library for tweaking knobs) written by James Random Hacker.
+
+  <signature of Ty Coon>, 1 April 1990
+  Ty Coon, President of Vice
+
+That's all there is to it!
+
+


From padb at googlecode.com  Mon Dec 21 22:50:04 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 22:50:04 +0000
Subject: [padb] r371 committed - Re-add the tidy and report targets to the
	auto-generated...
Message-ID: <0016e6d26d580d2a9f047b44e9c3@google.com>

Revision: 371
Author: apittman
Date: Mon Dec 21 14:49:14 2009
Log: Re-add the tidy and report targets to the auto-generated
Makefile.

http://code.google.com/p/padb/source/detail?r=371

Modified:
  /trunk/src/Makefile.am

=======================================
--- /trunk/src/Makefile.am	Mon Dec 21 14:19:11 2009
+++ /trunk/src/Makefile.am	Mon Dec 21 14:49:14 2009
@@ -3,3 +3,13 @@
  minfo_CFLAGS = -ldl
  minfo_SOURCES = minfo.c mpi_interface.h
  EXTRA_DIST = padb
+
+tidy:
+	perltidy -b -ce -w -se padb
+
+pc:	padb
+	perlcritic --brutal --verbose "%l: (%s) %m\n" padb > .pc.tmp || true
+	/bin/mv .pc.tmp pc
+
+report: pc
+	./report.pl pc | tee report


From padb at googlecode.com  Mon Dec 21 22:56:13 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 22:56:13 +0000
Subject: [padb] r372 committed - Update padb to look in the new location for
	minfo....
Message-ID: <005045015fd70b1425047b44ffde@google.com>

Revision: 372
Author: apittman
Date: Mon Dec 21 14:55:24 2009
Log: Update padb to look in the new location for minfo.
It's finally called minfo rather than minfo.x, and is installed
into $libexecprefix rather than somewhere on PATH.

http://code.google.com/p/padb/source/detail?r=372

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Mon Dec 21 12:52:44 2009
+++ /trunk/src/padb	Mon Dec 21 14:55:24 2009
@@ -561,7 +561,7 @@
  $conf{edbopt} = undef;

  $conf{edb}   = find_edb();
-$conf{minfo} = find_minfo();
+$conf{minfo} = undef;

  # Option to define a list of ports used by padb.
  $conf{port_range} = undef;
@@ -669,8 +669,13 @@

  # Look for minfo.x in the same directory as padb.
  sub find_minfo {
-    my $dir = dirname($0);
-    return "$dir/minfo.x";
+    my $self = $0;
+    if ( $self =~ m{\A(.+)/bin/padb\z} ) {
+        my $dir = $1;
+        return "$dir/libexec/minfo";
+    }
+    my $dir = dirname($self);
+    return "$dir/minfo";
  }

  ###############################################################################
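
The lookup rule introduced in r372 can be restated as a short shell
sketch.  This is only an illustration of the rule in the diff above,
not code from padb (padb implements it in Perl); the shell function
reuses the name find_minfo purely for readability, and the sample
paths are assumptions.

```sh
#!/bin/sh
# Illustrative restatement of the r372 lookup rule; not part of padb.
find_minfo() {
    self=$1
    case $self in
        ?*/bin/padb)
            # Installed layout: <prefix>/bin/padb -> <prefix>/libexec/minfo
            printf '%s\n' "${self%/bin/padb}/libexec/minfo"
            ;;
        *)
            # Fallback: expect minfo in the same directory as padb
            printf '%s\n' "$(dirname "$self")/minfo"
            ;;
    esac
}

find_minfo /usr/local/bin/padb   # prints /usr/local/libexec/minfo
find_minfo ./padb                # prints ./minfo
```

As in the Perl version, the installed-layout branch only applies when
something precedes /bin/padb, so a script invoked as /bin/padb falls
through to the same-directory case.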


From padb at googlecode.com  Mon Dec 21 23:12:33 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 23:12:33 +0000
Subject: [padb] r373 committed - Commit the new full-report page to svn.
	It's been up for a while but...
Message-ID: <0016361e7a92736b23047b453979@google.com>

Revision: 373
Author: apittman
Date: Mon Dec 21 15:11:50 2009
Log: Commit the new full-report page to svn.  It's been up for a while but
I forgot to commit the source when I uploaded it.

http://code.google.com/p/padb/source/detail?r=373

Added:
  /trunk/doc/full-report.html
Modified:
  /trunk/doc/build_website
  /trunk/doc/email.html
  /trunk/doc/header.html
  /trunk/doc/index.html
  /trunk/doc/layout.css
  /trunk/doc/upload_website
  /trunk/doc/usage.html

=======================================
--- /dev/null
+++ /trunk/doc/full-report.html	Mon Dec 21 15:11:50 2009
@@ -0,0 +1,355 @@
+<div id="content">
+<h1>Full-report mode</h1>
+
+<div class=mode>
+ <a name=full-report></a>
+ <h3>To target a specific job (full-report option)</h3>
+</div>
+
+<p>Whilst <i>padb</i> can be used to collect very specific information
+from an application, unless you know what you are looking for or know
+the application very well this may not be what you want.  For cases
+such as this <i>padb</i> has a "full report" mode in which it collects
+such information from a job as is likely to be useful.  This will
+create a full diagnostic report for a given job iterating over the
+more common <i>padb</i> modes and options.  If you are just starting
+out debugging with <i>padb</i> or are creating an error report for a
+third party then the full-report option is a good place to start.  For
+large jobs this can generate a lot of output so redirecting to a file
+is recommended.
+
+<p>To run in this mode simply invoke <i>padb</i> with the option
+<b>--full-report=&lt;jobid&gt;</b>.
+
+<p>The full-report mode is also very useful if you are automatically
+creating trace files for later inspection or collecting information
+for inspection by a third party.  End-users can be instructed to run
+it and mail a log back to a remote support team, for example or it can
+be integrated into automatic test suites.
+
+<p>More detailed information on using <i>padb</i> and about the type
+of information padb can collect about a job can be found on
+the <a href=modes.html>modes</a> page.
+
+<table class="example">
+<tr><td>
+<pre class="code">
+$ padb --show-jobs
+45882
+$ padb --full-report=45882
+</pre>
+</td></tr>
+<tr><td>
+<pre class="code">
+padb version 3.n (Revision 325)
+full job report for job 45882
+
+----------------
+[0]
+----------------
+comm0: name: 'MPI_COMM_WORLD'
+comm0: rank: '0'
+comm0: size: '4'
+comm0: id: '0'
+comm0: Rank: local 0 global 0
+comm0: Rank: local 1 global 1
+comm0: Rank: local 2 global 2
+comm0: Rank: local 3 global 3
+comm1: name: 'MPI_COMM_SELF'
+comm1: rank: '0'
+comm1: size: '1'
+comm1: id: '0x1'
+comm2: name: 'MPI_COMM_NULL'
+comm2: size: '0'
+comm2: id: '0x2'
+comm3: name: 'MPI COMMUNICATOR 3 DUP FROM 0'
+comm3: rank: '0'
+comm3: size: '4'
+comm3: id: '0x3'
+comm3: Rank: local 0 global 0
+comm3: Rank: local 1 global 1
+comm3: Rank: local 2 global 2
+comm3: Rank: local 3 global 3
+comm4: name: 'MPI COMMUNICATOR 4 DUP FROM 0'
+comm4: rank: '0'
+comm4: size: '4'
+comm4: id: '0x4'
+comm4: Rank: local 0 global 0
+comm4: Rank: local 1 global 1
+comm4: Rank: local 2 global 2
+comm4: Rank: local 3 global 3
+comm5: name: 'MPI COMMUNICATOR 5 SPLIT FROM 3'
+comm5: rank: '0'
+comm5: size: '2'
+comm5: id: '0x5'
+comm5: Rank: local 0 global 0
+comm5: Rank: local 1 global 2
+----------------
+[1]
+----------------
+comm0: name: 'MPI_COMM_WORLD'
+comm0: rank: '1'
+comm0: size: '4'
+comm0: id: '0'
+comm0: Rank: local 0 global 0
+comm0: Rank: local 1 global 1
+comm0: Rank: local 2 global 2
+comm0: Rank: local 3 global 3
+comm1: name: 'MPI_COMM_SELF'
+comm1: rank: '0'
+comm1: size: '1'
+comm1: id: '0x1'
+comm2: name: 'MPI_COMM_NULL'
+comm2: size: '0'
+comm2: id: '0x2'
+comm3: name: 'MPI COMMUNICATOR 3 DUP FROM 0'
+comm3: rank: '1'
+comm3: size: '4'
+comm3: id: '0x3'
+comm3: Rank: local 0 global 0
+comm3: Rank: local 1 global 1
+comm3: Rank: local 2 global 2
+comm3: Rank: local 3 global 3
+comm4: name: 'MPI COMMUNICATOR 4 DUP FROM 0'
+comm4: rank: '1'
+comm4: size: '4'
+comm4: id: '0x4'
+comm4: Rank: local 0 global 0
+comm4: Rank: local 1 global 1
+comm4: Rank: local 2 global 2
+comm4: Rank: local 3 global 3
+comm5: name: 'MPI COMMUNICATOR 5 SPLIT FROM 3'
+comm5: rank: '0'
+comm5: size: '2'
+comm5: id: '0x5'
+comm5: Rank: local 0 global 1
+comm5: Rank: local 1 global 3
+----------------
+[2]
+----------------
+comm0: name: 'MPI_COMM_WORLD'
+comm0: rank: '2'
+comm0: size: '4'
+comm0: id: '0'
+comm0: Rank: local 0 global 0
+comm0: Rank: local 1 global 1
+comm0: Rank: local 2 global 2
+comm0: Rank: local 3 global 3
+comm1: name: 'MPI_COMM_SELF'
+comm1: rank: '0'
+comm1: size: '1'
+comm1: id: '0x1'
+comm2: name: 'MPI_COMM_NULL'
+comm2: size: '0'
+comm2: id: '0x2'
+comm3: name: 'MPI COMMUNICATOR 3 DUP FROM 0'
+comm3: rank: '2'
+comm3: size: '4'
+comm3: id: '0x3'
+comm3: Rank: local 0 global 0
+comm3: Rank: local 1 global 1
+comm3: Rank: local 2 global 2
+comm3: Rank: local 3 global 3
+comm4: name: 'MPI COMMUNICATOR 4 DUP FROM 0'
+comm4: rank: '2'
+comm4: size: '4'
+comm4: id: '0x4'
+comm4: Rank: local 0 global 0
+comm4: Rank: local 1 global 1
+comm4: Rank: local 2 global 2
+comm4: Rank: local 3 global 3
+comm5: name: 'MPI COMMUNICATOR 5 SPLIT FROM 3'
+comm5: rank: '1'
+comm5: size: '2'
+comm5: id: '0x5'
+comm5: Rank: local 0 global 0
+comm5: Rank: local 1 global 2
+----------------
+[3]
+----------------
+comm0: name: 'MPI_COMM_WORLD'
+comm0: rank: '3'
+comm0: size: '4'
+comm0: id: '0'
+comm0: Rank: local 0 global 0
+comm0: Rank: local 1 global 1
+comm0: Rank: local 2 global 2
+comm0: Rank: local 3 global 3
+comm1: name: 'MPI_COMM_SELF'
+comm1: rank: '0'
+comm1: size: '1'
+comm1: id: '0x1'
+comm2: name: 'MPI_COMM_NULL'
+comm2: size: '0'
+comm2: id: '0x2'
+comm3: name: 'MPI COMMUNICATOR 3 DUP FROM 0'
+comm3: rank: '3'
+comm3: size: '4'
+comm3: id: '0x3'
+comm3: Rank: local 0 global 0
+comm3: Rank: local 1 global 1
+comm3: Rank: local 2 global 2
+comm3: Rank: local 3 global 3
+comm4: name: 'MPI COMMUNICATOR 4 DUP FROM 0'
+comm4: rank: '3'
+comm4: size: '4'
+comm4: id: '0x4'
+comm4: Rank: local 0 global 0
+comm4: Rank: local 1 global 1
+comm4: Rank: local 2 global 2
+comm4: Rank: local 3 global 3
+comm5: name: 'MPI COMMUNICATOR 5 SPLIT FROM 3'
+comm5: rank: '1'
+comm5: size: '2'
+comm5: id: '0x5'
+comm5: Rank: local 0 global 1
+comm5: Rank: local 1 global 3
+Total: 10 communicators of which 0 are in use.
+No data was recorded for 24 communicators
+-----------------
+[0-3] (4 processes)
+-----------------
+main() at deadlock.c:42
+      locals
+        MPI_Comm alpha = 'MPI COMMUNICATOR 3 DUP FROM 0' [0-3]
+        MPI_Comm  beta = 'MPI COMMUNICATOR 4 DUP FROM 0' [0-3]
+        MPI_Comm *  mb = '' [0-3]
+        char *       p = 'Address 0xffffffff out of bounds' [0-3]
+        MPI_Comm split = 'MPI COMMUNICATOR 5 SPLIT FROM 3' [0-3]
+  -----------------
+  [0-3] (4 processes)
+  -----------------
+  PMPI_Barrier() at pbarrier.c:62
+        params
+          MPI_Comm comm:
+              'MPI COMMUNICATOR 3 DUP FROM 0' [1-3]
+              'MPI COMMUNICATOR 4 DUP FROM 0' [0]
+        locals
+          int err = '0' [0-3]
+    -----------------
+    [0-3] (4 processes)
+    -----------------
+    ompi_coll_tuned_barrier_intra_dec_fixed() at coll_tuned_decision_fixed.c:206
+          params
+            struct ompi_communicator_t * comm:
+                'MPI COMMUNICATOR 3 DUP FROM 0' [1-3]
+                'MPI COMMUNICATOR 4 DUP FROM 0' [0]
+            mca_coll_base_module_t *   module = 'valid pointer perm=rw-p ([heap])' [0-3]
+          locals
+            int communicator_size = '0' [0-3]
+      -----------------
+      [0-3] (4 processes)
+      -----------------
+      ompi_coll_tuned_barrier_intra_recursivedoubling() at coll_tuned_barrier.c:172
+            params
+              struct ompi_communicator_t * comm:
+                  'MPI COMMUNICATOR 3 DUP FROM 0' [1-3]
+                  'MPI COMMUNICATOR 4 DUP FROM 0' [0]
+              mca_coll_base_module_t *   module = 'valid pointer perm=rw-p ([heap])' [0-3]
+            locals
+              int adjsize = '4' [0-3]
+              int     err = '0' [0-3]
+              int    line: more than 3 distinct values
+              int    mask:
+                  '2' [0-1]
+                  '4' [2-3]
+              int    rank: more than 3 distinct values
+              int  remote:
+                  '0' [1-2]
+                  '1' [0,3]
+              int    size = '4' [0-3]
+        -----------------
+        [0-3] (4 processes)
+        -----------------
+        ompi_coll_tuned_sendrecv_actual() at coll_tuned_util.c:54
+              params
+                void *                    sendbuf = 'null pointer' [0-3]
+                int                        scount = '0' [0-3]
+                ompi_datatype_t *       sdatatype = 'MPI_BYTE' [0-3]
+                int                          dest:
+                    '0' [1-2]
+                    '1' [0,3]
+                int                          stag = '-16' [0-3]
+                void *                    recvbuf = 'null pointer' [0-3]
+                int                        rcount = '0' [0-3]
+                ompi_datatype_t *       rdatatype = 'MPI_BYTE' [0-3]
+                int                        source:
+                    '0' [1-2]
+                    '1' [0,3]
+                int                          rtag = '-16' [0-3]
+                struct ompi_communicator_t * comm:
+                    'MPI COMMUNICATOR 3 DUP FROM 0' [1-3]
+                    'MPI COMMUNICATOR 4 DUP FROM 0' [0]
+                ompi_status_public_t *     status = 'null pointer' [0-3]
+              locals
+                int                           err = '0' [0-3]
+                int                          line = '0' [0-3]
+                ompi_request_t *[2]          reqs = '{, }' [0-3]
+                ompi_status_public_t [2] statuses = 'value too long to display' [0-3]
+          -----------------
+          [0-3] (4 processes)
+          -----------------
+          ompi_request_default_wait_all() at request/req_wait.c:262
+                params
+                  size_t                    count = '2' [0-3]
+                  ompi_request_t **      requests: more than 3 distinct values
+                  ompi_status_public_t * statuses = 'valid pointer perm=rw-p ([stack])' [0-3]
+                locals
+                  char [30] __PRETTY_FUNCTION__ = '"ompi_request_default_wait_all"' [0-3]
+                  size_t              completed = '1' [0-3]
+                  size_t                      i = '2' [0-3]
+                  int                 mpi_error = '0' [0-3]
+                  size_t                pending = '1' [0-3]
+                  ompi_request_t *      request = 'valid pointer perm=rw-p ([heap])' [0-3]
+                  ompi_request_t **        rptr = '' [0-3]
+                  size_t                  start:
+                      '53' [0-1]
+                      '55' [2-3]
+            -----------------
+            [0-3] (4 processes)
+            -----------------
+            opal_condition_wait() at ../opal/threads/condition.h:99
+                  params
+                    opal_condition_t * c = 'valid pointer perm=rw-p' [0-3]
+                    opal_mutex_t *     m = 'valid pointer perm=rw-p' [0-3]
+                  locals
+                    int rc = '0' [0-3]
+              -----------------
+              [0,3] (2 processes)
+              -----------------
+              opal_progress() at runtime/opal_progress.c:206
+                    locals
+                      int events = '0' [0,3]
+                      size_t   i = '0' [0,3]
+              -----------------
+              [1] (1 processes)
+              -----------------
+              opal_progress() at runtime/opal_progress.c:181
+                    locals
+                      int       events = '0' [1]
+                      size_t         i = '2' [1]
+                      opal_timer_t now = '135914459801112' [1]
+                -----------------
+                [1] (1 processes)
+                -----------------
+                opal_timer_base_get_cycles() at ../opal/mca/timer/linux/timer_linux.h:31
+                  opal_sys_timer_get_cycles() at ../opal/include/opal/sys/ia32/timer.h:33
+                        locals
+                          opal_timer_t ret = '135914459801112' [1]
+              -----------------
+              [2] (1 processes)
+              -----------------
+              opal_progress() at runtime/opal_progress.c:166
+                    locals
+                      int events = '0' [2]
+                      size_t   i = '2' [2]
+</pre>
+</td></tr>
+</table>
+
+
+</div>
+<div id="footer">
+ <hr>
+ <p>Page maintained by Ashley Pittman. $Date: 2009-11-09 15:45:01 +0000 (Mon, 09 Nov 2009) $ $Revision: 326 $</p>
+</div>
=======================================
--- /trunk/doc/build_website	Thu Sep 10 02:38:12 2009
+++ /trunk/doc/build_website	Mon Dec 21 15:11:50 2009
@@ -8,7 +8,7 @@

  echo Uploading website to http://padb.pittman.org.uk

-FILES="index usage download email extensions modes configuration"
+FILES="index usage download email extensions modes full-report configuration"

  TDIR=public

=======================================
--- /trunk/doc/email.html	Mon Nov  9 07:45:01 2009
+++ /trunk/doc/email.html	Mon Dec 21 15:11:50 2009
@@ -1,9 +1,7 @@
  <div id="content">
  <h2>Mailing Lists</h2>
  <a href="http://pittman.org.uk/mailman/listinfo">Mailing lists</a>
-for padb discussion and development are available for public use,
-if you are an existing user or are considering using <i>padb</i>
-for the first time I would advise you to join.
+for padb discussion and development are available and are archived on-line.

  <ul>
  <li><a href="http://pittman.org.uk/mailman/listinfo/padb-devel_pittman.org.uk">
=======================================
--- /trunk/doc/header.html	Mon Nov  9 07:45:01 2009
+++ /trunk/doc/header.html	Mon Dec 21 15:11:50 2009
@@ -13,14 +13,14 @@
  </div>
  <div id="navigation">
   <ul>
-  <li><span class="menu_heading"><a href="index.html" title="Main page">Home</a></span>
-  <br><span class="menu_subheading"><a href="index.html#news" title="Project news">News</a></span></li>
+  <li><span class="menu_heading"><a href="/" title="Main page">Home</a></span>
+  <br><span class="menu_subheading"><a href="index.html#news" title="Project news">News</a></span>
+  <li><span class="menu_heading"><a href="full-report.html" title="Generating a full report for a job">Full Report mode</a></span>
    <li><span class="menu_heading"><a href="usage.html" title="Command line options">Usage</a></span>
    <span class="menu_subheading">
       <br><a href="usage.html#rmgr" title="Selecting the resource manager">Resource mananger</a>
-     <br><a href="usage.html#job" title="Selecting a job">Jobs</a>
+     <br><a href="usage.html#job" title="Selecting a job">Job selection</a>
       <br><a href="usage.html#rank" title="Selecting ranks within a job">Ranks</a>
-     <br><a href="usage.html#full-report" title="Generating a full report for a job">Full Report mode</a>
       <br><a href="configuration.html" title="Setting configuration options">Configuration</a>
    </span>
    <li><span class="menu_heading"><a href="modes.html" title="Modes of operation">Modes of operation</a></span>
=======================================
--- /trunk/doc/index.html	Mon Nov  9 07:45:01 2009
+++ /trunk/doc/index.html	Mon Dec 21 15:11:50 2009
@@ -82,8 +82,10 @@
  run, it's main use is to assist in the debugging of parallel applications, it's
  therefore assumed that you have a working MPI stack or other parallel
  environment and that "Hello world" application runs to completion without
-error.<br> A <i>Linux</i> operating system is assumed and a working
-<i>gdb</i> is required for stack trace functionality.<br>
+error.<br>
+
+A <i>Linux</i> operating system is assumed and a working <i>gdb</i> is required
+for stack trace functionality.  Work on a <i>solaris</i> port is under way.
  </div>
  <div id="footer">
   <hr>
=======================================
--- /trunk/doc/layout.css	Mon Nov  9 07:45:01 2009
+++ /trunk/doc/layout.css	Mon Dec 21 15:11:50 2009
@@ -37,8 +37,6 @@
  }

  pre.code {
-#margin-left: 2em;
-#margin-right: 2em;
  background-color: #f0f0f0;
  padding: 10px;
  border: 1px solid;
=======================================
--- /trunk/doc/upload_website	Thu Sep 10 02:38:12 2009
+++ /trunk/doc/upload_website	Mon Dec 21 15:11:50 2009
@@ -11,7 +11,7 @@
  # Load the password from a non-public file ;)
  . ~/padb-website-password.txt

-FILES="index usage download email extensions modes configuration"
+FILES="index usage download email extensions modes full-report configuration"

  ftp-upload --host padb.pittman.org.uk -u padb at pittman.co.uk --password $PASSWORD layout.css

@@ -23,13 +23,13 @@
    cat $FILE.html >> $TFILE
    cat footer.html >> $TFILE
    ftp-upload --host padb.pittman.org.uk -u padb at pittman.co.uk --password $PASSWORD --as $FILE.html $TFILE
-  ftp-upload --host padb.pittman.org.uk -u padb at pittman.co.uk --password $PASSWORD --as $FILE/index.html $TFILE
-  ftp-upload --host padb.pittman.org.uk -u padb at pittman.co.uk --password $PASSWORD --as $FILE/layout.css layout.css
+  # ftp-upload --host padb.pittman.org.uk -u padb at pittman.co.uk --password $PASSWORD --as $FILE/index.html $TFILE
+  #ftp-upload --host padb.pittman.org.uk -u padb at pittman.co.uk --password $PASSWORD --as $FILE/layout.css layout.css
    rm $TFILE
  done

  ftp-upload --host padb.pittman.org.uk -u padb at pittman.co.uk --password $PASSWORD  OpenMPI-padb-groups.patch
-ftp-upload --host padb.pittman.org.uk -u padb at pittman.co.uk --password $PASSWORD --as extensions/OpenMPI-padb-groups.patch OpenMPI-padb-groups.patch
+#ftp-upload --host padb.pittman.org.uk -u padb at pittman.co.uk --password $PASSWORD --as extensions/OpenMPI-padb-groups.patch OpenMPI-padb-groups.patch

  echo All done.
  exit 0
=======================================
--- /trunk/doc/usage.html	Mon Nov  9 07:45:01 2009
+++ /trunk/doc/usage.html	Mon Dec 21 15:11:50 2009
@@ -6,13 +6,14 @@
   <h2>Selecting the Resource Manager (Job Launcher)</h2>
  </div>
  <i>Padb</i> supports many resource managers and should select the
-appropriate one for your cluster, if you have more than one resource
+appropriate one for your machine, if you have more than one resource
  manager installed or <i>padb</i> can't detect the correct one use
-the <b>rmgr</b> <a href=configuration.html>configuration option</a>.
-
-<p>If no resource manager is found you can use <b>-O rmgr=local</b>
-and process identifiers (pids) will be used instead of job ids.
-
+the <b>rmgr</b> <a href=configuration.html>configuration option</a> to
+set machine-wide defaults.
+
+<p>If your resource manager or scheduler is not supported you can also
+use <b>local</b> and process identifiers (pids) will be used instead
+of job ids.

  <center>
  <table class=rmgrs border=1 width="90%">
@@ -27,7 +28,7 @@
    <td>Works with any resource manager or software stack that is
    compliant with
    the <a href="http://www.mcs.anl.gov/research/projects/mpi/mpi-debug/">MPI
-  debugger interface</a>.  It is recommended to use support for your
+  debugger interface</a>.  It is preferable to use support for your
    specific resource manager if it exists.</td>
   </tr>
   <tr>
@@ -46,7 +47,7 @@
    <td>Fully supported</td>
   </tr>
   <tr>
-  <td>MPICH2/mpd</td>
+  <td>MPICH2 mpd</td>
    <td>mpd</td>
    <td>Fully supported in 3.0 and above</td>
   </tr>
@@ -67,7 +68,7 @@
    <td>None</td>
    <td>local-qsnet</td>
    <td>as local-fd with local-fd-name set to /proc/qsnet/elan/user to
-  automatially select network jobs on the local node.</td>
+  automatically select network jobs on the local node.</td>
   </tr>
  </table>
  </center>
@@ -75,6 +76,8 @@
  <p>The <b>--list-rmgrs</b> option can be used to show a list of
  detected resource managers and their active jobs.

+<hr>
+
  <div class=mode>
   <a name=job></a>
   <h2>Selecting the job(s) to target</h2>
@@ -86,6 +89,10 @@
  default is to target jobs of the current user, this can be over-ridden
  with the <b>--user</b> flag.

+<h2>To target a specific job</h2> To target a specific job specify the
+numeric jobid for the job on the command line, after all other
+options.
+
  <h3>Showing list of current jobs</h3>
  To show a list of currently running jobs for a given user use the
  <b>--show-jobs</b> option.  Alternatively the <b>--list-rmgrs</b>
@@ -95,55 +102,31 @@
  <h3>To target all jobs</h3> To target all jobs currently running for a
  given user use the <b>--all</b> (<b>-a</b>) flag.

-<h3>To target any jobs</h3> To target "any" job currently running for
-a given user use the <b>--any</b> (<b>-A</b>) flag.  This differs from
+<h3>To target any jobs</h3> To target any job currently running for a
+given user use the <b>--any</b> (<b>-A</b>) flag.  This differs from
  targeting all jobs as it will exit with an error if more than one job
  is running.

-<h3>To target a specific job</h3> To target a specific job specify the
-jobid for the job on the command line, after all other options.
-
-<div class=mode>
- <a name=full-report></a>
- <h3>To target a specific job (Full report option)</h3>
-</div>
-If trying to diagnose a problem or gather information there is another
-option, <b>--full-report=&lt;jobid&gt;</b>, this tells padb to target
-the job specified and to report all information about the job it knows
-how to collect.  This option is typically used when creating bug
-reports to send to third parties or to inspect a job for anomalies.
-
  <hr>

  <h1><a name="rank">Selecting ranks (Processes)</a></h1> In modes where
  data for each process is reported separately it is possible to
  restrict which ranks are queried, this is done via the <b>--rank</b>
  option.  Multiple ranks can be selected by specifying <b>--rank</b>
-multiple times.
+multiple times or by specifying a rank list using
+the <b>[&lt;low&gt;-&lt;high&gt;,&lt;value&gt;]</b> notation.  Eg, to specify ranks
+0,2 and 3 use <b>--rank [0,2-3]</b>

  <hr>

  <h1>Selecting which mode to run in.</h1>
  <i>Padb</i> can present an array of different information about your
  select jobs and it can present it in a number of different ways.  With
-the exception of Full Report only one mode can be selected, if you
-need more information about the program run <i>padb</i> more than
-once.
-
-<div class=mode>
- <a name=full-report></a>
- <h2>Full Report</h2>
-</div>
-If you are just starting with <i>padb</i> or are creating an error
-report for somebody else then the <b>--full-report=&lt;jobid&gt;</b>
-option is a good place to start, this will complete a full diagnostic
-report for the job iterating over the more common padb options.  For
-large jobs this can generate a lot of output so redirecting to a file
-is recommended.
-
-<p>
-
-A list of avaliable modes and their descriptions can be found on the <a href=modes.html>modes</a> page.
+the exception of <a href="full-report.html">full Report</a> only one
+mode can be selected, if you need more information about the program
+<i>padb</i> has to be run more than once.  A list of available modes
+and their descriptions can be found on
+the <a href="modes.html">modes</a> page.

  </div>
  <div id="footer">


From padb at googlecode.com  Mon Dec 21 23:16:41 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 23:16:41 +0000
Subject: [padb] r374 committed - Add a list of files that svn should ignore
	when doing a svn status.
Message-ID: <0016363b98ae3648b2047b4548d6@google.com>

Revision: 374
Author: apittman
Date: Mon Dec 21 15:12:51 2009
Log: Add a list of files that svn should ignore when doing a svn status.

http://code.google.com/p/padb/source/detail?r=374

Modified:
  /trunk
  /trunk/src


From padb at googlecode.com  Mon Dec 21 23:21:45 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 21 Dec 2009 23:21:45 +0000
Subject: [padb]  r375 committed - Add three new files to be ignored.
Message-ID: <001485f9a5545edfa7047b455a29@google.com>

Revision: 375
Author: apittman
Date: Mon Dec 21 15:21:07 2009
Log: Add three new files to be ignored.

http://code.google.com/p/padb/source/detail?r=375

Modified:
  /trunk/src


From padb at googlecode.com  Tue Dec 22 10:37:05 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Tue, 22 Dec 2009 10:37:05 +0000
Subject: [padb] r376 committed - Add a README.Developers file to describe
	how to build from SVN.
Message-ID: <001636ed6c7e85bd12047b4ec93e@google.com>

Revision: 376
Author: apittman
Date: Tue Dec 22 02:36:01 2009
Log: Add a README.Developers file to describe how to build from SVN.

http://code.google.com/p/padb/source/detail?r=376

Added:
  /trunk/README.Developers

=======================================
--- /dev/null
+++ /trunk/README.Developers	Tue Dec 22 02:36:01 2009
@@ -0,0 +1,19 @@
+
+Padb is managed by autoconf and all the normal build rules apply,
+"./configure && make && make install" being the normal build mechanism.
+
+A file called "autogen.sh" is provided to convert a raw Subversion tree
+into a tree which is buildable in the standard autoconf way.  After a
+checkout of the source all a developer needs to do is to execute the
+following command once and he can then build as normal.
+
+$ ./autogen.sh
+
+No dependency on any particular version of autoconf is known, I have tested
+with 2.61 and 2.65 at this time.
+
+
+It is still possible to build and run this package without using autoconf,
+padb itself is a stand-alone perl program contained in a single file.  The
+helper program, minfo, is a c program can be compiled with the command
+"make -f Makefile-simple" executed from the src directory.


From padb at googlecode.com  Tue Dec 22 10:41:07 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Tue, 22 Dec 2009 10:41:07 +0000
Subject: [padb] r377 committed - Rename minfo to remove the horrible .x
	suffix and install...
Message-ID: <0016361e7b94f09795047b4ed728@google.com>

Revision: 377
Author: apittman
Date: Tue Dec 22 02:38:06 2009
Log: Rename minfo to remove the horrible .x suffix and install
it into the same location as the auto-conf build.

http://code.google.com/p/padb/source/detail?r=377

Modified:
  /trunk/src/Makefile-simple

=======================================
--- /trunk/src/Makefile-simple	Mon Dec 21 14:12:41 2009
+++ /trunk/src/Makefile-simple	Tue Dec 22 02:38:06 2009
@@ -7,20 +7,22 @@

  FILES = Makefile minfo.c mpi_interface.h padb

-minfo.x: minfo.c mpi_interface.h
-	$(CC) minfo.c -o minfo.x -ldl $(CFLAGS)
-
-install: minfo.x
+minfo: minfo.c mpi_interface.h
+	$(CC) minfo.c -o minfo -ldl $(CFLAGS)
+
+install: minfo
  	/bin/mkdir -p ${INSTALL_DIR}/bin
-	/bin/cp minfo.x ${INSTALL_DIR}/bin/
  	/bin/cp padb ${INSTALL_DIR}/bin/
+	/bin/mkdir -p ${INSTALL_DIR}/libexec
+	/bin/cp minfo ${INSTALL_DIR}/libexec/
+

  make config_install:
  	/bin/mkdir -p ${CONFIG_DIR}
  	/bin/cp padb.conf ${CONFIG_DIR}/

  clean:
-	/bin/rm -f minfo.x
+	/bin/rm -f minfo

  tarfile:
  	/bin/rm -f padb-${VERSION}.tgz


From padb at googlecode.com  Tue Dec 22 10:46:14 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Tue, 22 Dec 2009 10:46:14 +0000
Subject: [padb] r378 committed - Only use -ldl when linking minfo and use
	-Wall when compiling it.
Message-ID: <001485f6d3f23efc66047b4eea19@google.com>

Revision: 378
Author: apittman
Date: Tue Dec 22 02:45:13 2009
Log: Only use -ldl when linking minfo and use -Wall when compiling it.

http://code.google.com/p/padb/source/detail?r=378

Modified:
  /trunk/src/Makefile.am

=======================================
--- /trunk/src/Makefile.am	Mon Dec 21 14:49:14 2009
+++ /trunk/src/Makefile.am	Tue Dec 22 02:45:13 2009
@@ -1,6 +1,7 @@
  bin_SCRIPTS = padb
  libexec_PROGRAMS = minfo
-minfo_CFLAGS = -ldl
+minfo_CFLAGS = -Wall
+minfo_LDFLAGS = -ldl
  minfo_SOURCES = minfo.c mpi_interface.h
  EXTRA_DIST = padb


From ashley at pittman.co.uk  Tue Dec 22 11:11:34 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Tue, 22 Dec 2009 11:11:34 +0000
Subject: [padb] Autoconf changes.
Message-ID: <1261480295.3600.53.camel@alpha>


All,

As you'll have seen, I've updated the padb sources to use autoconf for
building; hopefully this will be a more familiar environment for people
to work in.

The bulk of the changes should be in now, at least the ones relating to
building, but I still need to populate the NEWS and ChangeLog files.  The
new build procedure shouldn't change again though.

The build procedure is documented in this file.  It's fairly simple:
after an update or fresh checkout simply run ./autogen.sh once to prepare
the tree and then use configure/make as is normal for autoconf projects.

http://code.google.com/p/padb/source/browse/trunk/README.Developers
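
As a rough sketch of that sequence (not part of the padb tree): the
helper names, the DRY_RUN switch and the PREFIX location below are
assumptions made up for illustration, and --prefix is just the standard
autoconf option rather than anything padb-specific.

```shell
#!/bin/sh
# Sketch of the build sequence described in README.Developers.
# PREFIX is a hypothetical install location chosen for illustration.
PREFIX="${PREFIX:-$HOME/padb-install}"

# Echo a command, then run it unless DRY_RUN is set (hypothetical helper).
run() {
    echo "+ $*"
    [ -n "${DRY_RUN:-}" ] || "$@"
}

# Run from the top of a checkout; stops at the first step that fails.
padb_build() {
    run ./autogen.sh &&                    # once, after a checkout or update
    run ./configure --prefix="$PREFIX" &&  # standard autoconf configure step
    run make &&
    run make install
}
```

Setting DRY_RUN=1 before calling padb_build just prints the four steps,
which is a cheap way to check the order before touching a real tree.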

For the time being padb is still maintained as a single large source file,
so for development none of this is necessary; it can still be run
in-tree without any modifications at all.  Personally I like this
feature and have found it useful, so I'm hoping it'll stay, but I can't
guarantee it in the long term.

If you have any questions or problems, give me a shout.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


From padb at googlecode.com  Tue Dec 22 15:11:40 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Tue, 22 Dec 2009 15:11:40 +0000
Subject: [padb] r379 committed - Handle the minfo configuration option as a
	per-mode option rather than...
Message-ID: <001636c5bda388bd49047b529f8b@google.com>

Revision: 379
Author: apittman
Date: Tue Dec 22 07:10:41 2009
Log: Handle the minfo configuration option as a per-mode option rather than
a global one.  It should have been done like this anyway.
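
The resulting lookup order can be sketched roughly as follows.  This is a
shell rendering for illustration only: padb does this in Perl, and the
function name choose_minfo and its argument layout are assumptions, not
padb code.  An explicit per-mode minfo value wins, then the installed
libexec copy if it exists, then a minfo next to padb itself.

```shell
#!/bin/sh
# Illustrative only: padb implements this lookup in Perl (find_minfo).
# choose_minfo SELF [OVERRIDE]
#   SELF     - path padb was invoked as (Perl's $0)
#   OVERRIDE - optional per-mode "minfo" setting (hypothetical argument)
choose_minfo() {
    self=$1
    override=${2:-}

    # An explicitly supplied minfo always wins.
    if [ -n "$override" ]; then
        printf '%s\n' "$override"
        return
    fi

    # Installed layout: <prefix>/bin/padb with <prefix>/libexec/minfo.
    case $self in
        ?*/bin/padb)
            prefix=${self%/bin/padb}
            if [ -f "$prefix/libexec/minfo" ]; then
                printf '%s\n' "$prefix/libexec/minfo"
                return
            fi
            ;;
    esac

    # Otherwise assume minfo sits in the same directory as padb.
    printf '%s\n' "$(dirname "$self")/minfo"
}
```

For example, "choose_minfo /opt/padb/bin/padb" only reports the libexec
copy when that file is actually present, so running from an uninstalled
tree still falls through to the minfo beside padb.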

http://code.google.com/p/padb/source/detail?r=379

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Mon Dec 21 14:55:24 2009
+++ /trunk/src/padb	Tue Dec 22 07:10:41 2009
@@ -370,7 +370,7 @@

  # Config options the inner knows about, only forward options if they are in
  # this list.
-my @inner_conf = qw(edb edbopt minfo rmgr scripts slurm_job_step pbs_server);
+my @inner_conf = qw(edb edbopt rmgr scripts slurm_job_step pbs_server);

  # More config options the inner knows about, these are forwarded on the
  # command line rather than over the sockets.
@@ -560,8 +560,7 @@
  # These settings are passed onto inner only.
  $conf{edbopt} = undef;

-$conf{edb}   = find_edb();
-$conf{minfo} = undef;
+$conf{edb} = find_edb();

  # Option to define a list of ports used by padb.
  $conf{port_range} = undef;
@@ -667,12 +666,17 @@
      return 'edb';
  }

-# Look for minfo.x in the same directory as padb.
+# Look for minfo in the filesystem.  If it appears that padb has been
+# installed then look for minfo in the directory where it would have been
+# installed to.  If that's not the case or it's not there then look in the
+# same directory as padb is running from.
  sub find_minfo {
      my $self = $0;
      if ( $self =~ m{\A(.+)/bin/padb\z} ) {
          my $dir = $1;
-        return "$dir/libexec/minfo";
+        if ( -f "$dir/libexec/minfo" ) {
+            return "$dir/libexec/minfo";
+        }
      }
      my $dir = dirname($self);
      return "$dir/minfo";
@@ -6279,6 +6283,14 @@
      $h->{fd}{err} = *M_ERROR;

      my @all_dll_filenames;
+
+    # If supplied with a value of minfo then use it otherwise pick the
+    # version that was installed with padb.
+    my $minfo = $carg->{minfo};
+
+    if ( not defined $minfo ) {
+        $minfo = find_minfo();
+    }

      if ( defined $carg->{mpi_dll} ) {
          push @all_dll_filenames, $carg->{mpi_dll};
@@ -6326,8 +6338,7 @@
          $files{$filename} = 1;
      }

-    my $cmd = $inner_conf{minfo};
-    $h->{hpid} = open3( $h->{fd}{wtr}, $h->{fd}{rdr}, *M_ERROR, $cmd )
+    $h->{hpid} = open3( $h->{fd}{wtr}, $h->{fd}{rdr}, *M_ERROR, $minfo )
        or confess "Unable to popen() h: $!\n";

      if ( $h->{debug} ) {
@@ -6498,7 +6509,7 @@
          $stderr .= $_;
      }
      if ($have_error_messages) {
-        target_error( $vp, "Stderr from minfo: $stderr" );
+        target_error( $vp, "Stderr from minfo:\n$stderr" );
      }

      my $sc = keys %stats;
@@ -6511,18 +6522,16 @@
      if ( $sc == 0 ) {

          # No interaction was had with minfo, abort with nothing.
-        target_error( $vp, "Error running $inner_conf{minfo}: No contact" );
+        target_error( $vp, "Error running $minfo: No contact" );
          return;
      }

      if ( $global{exit} ne 'ok' ) {
          if ( $global{exit} eq 'die' ) {
-            target_error( $vp,
-                "Error message from $inner_conf{minfo}: $global{dmsg}" );
+            target_error( $vp, "Error message from $minfo: $global{dmsg}" );

          } else {
-            target_error( $vp,
-                "Error running $inner_conf{minfo}: Bad exit code $?" );
+            target_error( $vp, "Error running $minfo: Bad exit code $?" );
          }
      }

@@ -9276,8 +9285,7 @@

      # Over-ride the defaults for these two as minfo might not exist on the
      # front end.
-    $inner_conf{edb}   = find_edb();
-    $inner_conf{minfo} = find_minfo();
+    $inner_conf{edb} = find_edb();

      # Load the command line options.
      my %optionhash;
@@ -9356,7 +9364,10 @@
          arg_short   => 'q',
          handler     => \&qsnet_show_tport_queue,
          help        => 'Show the message queues',
-        options_i   => { mpi_dll => undef, }
+        options_i   => {
+            minfo   => undef,
+            mpi_dll => undef,
+        }
      };

      $allfns{kill} = {
@@ -9379,17 +9390,23 @@
          arg_long    => 'mpi-queue',
          arg_short   => 'Q',
          help        => 'Show MPI message queues',
-        options_i   => { mpi_dll => undef, }
+        options_i   => {
+            minfo   => undef,
+            mpi_dll => undef,
+        }
      };

      $allfns{deadlock} = {
-        handler_one  => \&show_mpi_queue_one,
-        needs_gdb    => 1,
-        arg_long     => 'deadlock',
-        arg_short    => 'j',
-        help         => 'Run deadlock detection algorithm',
-        out_handler  => \&mpi_deadlock_detect,
-        options_i    => { mpi_dll => undef, },
+        handler_one => \&show_mpi_queue_one,
+        needs_gdb   => 1,
+        arg_long    => 'deadlock',
+        arg_short   => 'j',
+        help        => 'Run deadlock detection algorithm',
+        out_handler => \&mpi_deadlock_detect,
+        options_i   => {
+            mpi_dll => undef,
+            minfo   => undef,
+        },
          options_bool => {
              show_group_members => 'no',
              show_all_groups    => 'no',
@@ -9490,8 +9507,9 @@
          pre_out_handler => \&pre_mpi_watch,
          out_handler     => \&show_mpi_watch,
          options_i       => {
+            minfo          => undef,
              mpi_dll        => undef,
-            mpi_watch_file => undef
+            mpi_watch_file => undef,
          }
      };
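
The lookup order described in the new find_minfo comment above can be
sketched in C as follows. This is an illustration only, not code from padb:
find_minfo_path and is_regular_file are hypothetical names, the stat() call
assumes a POSIX system, and the directory handling is a simplified stand-in
for Perl's dirname.

```c
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: non-zero if `path` names a regular file (like Perl's -f). */
static int is_regular_file(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISREG(st.st_mode);
}

/* Hypothetical equivalent of find_minfo: work out where minfo should be for a
 * padb started as `self`.  Writes the path into `out` (capacity `len`).
 * Returns 0 on success, -1 if the path does not fit in `out`. */
int find_minfo_path(const char *self, char *out, size_t len)
{
    static const char suffix[] = "/bin/padb";
    size_t self_len = strlen(self);
    size_t suf_len = sizeof(suffix) - 1;
    const char *slash;
    int n;

    /* Installed layout: <prefix>/bin/padb goes with <prefix>/libexec/minfo,
     * but only if that file is really there. */
    if (self_len > suf_len && strcmp(self + self_len - suf_len, suffix) == 0) {
        n = snprintf(out, len, "%.*s/libexec/minfo",
                     (int)(self_len - suf_len), self);
        if (n < 0 || (size_t)n >= len)
            return -1;
        if (is_regular_file(out))
            return 0;
    }

    /* Fallback: minfo in the same directory as the padb being run. */
    slash = strrchr(self, '/');
    if (slash == NULL)
        n = snprintf(out, len, "./minfo");
    else if (slash == self)
        n = snprintf(out, len, "/minfo");
    else
        n = snprintf(out, len, "%.*s/minfo", (int)(slash - self), self);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The point of the change is the extra existence check: before it, a padb run
from any .../bin/ directory would always report the libexec path, even when
minfo had only been built alongside the script.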


From padb at googlecode.com  Tue Dec 22 20:03:55 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Tue, 22 Dec 2009 20:03:55 +0000
Subject: [padb] r380 committed - Change to the message queue code, no longer have the output of minfo...
Message-ID: <00163623a777b29dcb047b56b4ea@google.com>

Revision: 380
Author: apittman
Date: Tue Dec 22 12:03:45 2009
Log: Change the message queue code so that the output of minfo is no
longer reported directly to the user. The interaction between minfo and
padb is now entirely private, and padb assembles the message shown to
the user. This allows me to change the protocol without affecting the
output shown, and because padb now holds the contents of the message
queues as structured data it opens the possibility of doing cleverer
things with the data.
Currently padb formats the output exactly as minfo did before, so there
should be no user-visible change from this commit.

http://code.google.com/p/padb/source/detail?r=380
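
As a rough illustration of the private protocol this commit introduces,
the C sketch below consumes the "Msg: key:value" lines that show_msg_op()
prints, using "start" and "done" to frame one record. It is not part of
padb, where the parsing is done in Perl: struct msg_record, msg_feed and
msg_get are hypothetical names, the fixed array sizes are arbitrary, and
sscanf's %d is slightly more lenient than the regular expression in the
diff below.

```c
#include <stdio.h>
#include <string.h>

#define MAX_FIELDS 16
#define MAX_KEY 32

/* One key/value pair taken from a "Msg: key:value" line. */
struct msg_field {
    char key[MAX_KEY];
    int value;
};

/* Hypothetical container for one message record. */
struct msg_record {
    struct msg_field fields[MAX_FIELDS];
    int nfields;
    int complete; /* set once "Msg: done:1" has been seen */
};

/* Feed one line (without its trailing newline) into the record.
 * Returns 1 if the line closed the record, 0 if it was consumed,
 * -1 if it is not a well-formed Msg line or the record is full. */
int msg_feed(struct msg_record *rec, const char *line)
{
    char key[MAX_KEY];
    int value;
    int end = 0;

    /* Same shape as the line written by show_msg_op(): "Msg: <word>:<int>". */
    if (strncmp(line, "Msg: ", 5) != 0)
        return -1;
    if (sscanf(line + 5, "%31[A-Za-z0-9_]:%d%n", key, &value, &end) != 2)
        return -1;
    if (line[5 + end] != '\0')
        return -1; /* trailing junk after the integer */

    if (strcmp(key, "start") == 0) {
        memset(rec, 0, sizeof *rec); /* begin a fresh record */
        return 0;
    }
    if (strcmp(key, "done") == 0) {
        rec->complete = 1;
        return 1;
    }
    if (rec->nfields >= MAX_FIELDS)
        return -1;

    strcpy(rec->fields[rec->nfields].key, key); /* key is at most 31 chars */
    rec->fields[rec->nfields].value = value;
    rec->nfields++;
    return 0;
}

/* Look up a field: returns 1 and stores its value if present, else 0. */
int msg_get(const struct msg_record *rec, const char *key, int *value)
{
    int i;

    for (i = 0; i < rec->nfields; i++) {
        if (strcmp(rec->fields[i].key, key) == 0) {
            *value = rec->fields[i].value;
            return 1;
        }
    }
    return 0;
}
```

A caller would pass each line of minfo output to msg_feed() and format the
record once it returns 1. Because the actual_* fields are only sent for
matched or completed operations, a lookup such as msg_get(&rec,
"actual_tag", &v) returning 0 plays the same role as the "defined" tests
in the Perl formatting code.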

Modified:
  /trunk/src/minfo.c
  /trunk/src/padb

=======================================
--- /trunk/src/minfo.c	Mon Dec 21 11:57:07 2009
+++ /trunk/src/minfo.c	Tue Dec 22 12:03:45 2009
@@ -486,6 +486,11 @@
  	}
      }
  }
+
+void show_msg_op (char *name, int value)
+{
+    printf("Msg: %s:%d\n",name,value);
+}

  void show_op (mqs_pending_operation *op, int msgid, int type)
  {
@@ -494,36 +499,37 @@

      if ( type == mqs_pending_sends || op->status == mqs_st_matched || op->status == mqs_st_complete )
  	all = 1;
-
-    printf("msg%d: Operation %d (%s) status %d (%s)\n",
-	   msgid,type,op_types[type],op->status,op_status[op->status]);
-    printf("msg%d: Rank local %d global %d\n",
-	   msgid,(int)op->desired_local_rank, (int)op->desired_global_rank);
-    if ( all )
-	printf("msg%d: Actual local %d global %d\n",
-	       msgid, (int)op->actual_local_rank, (int)op->actual_global_rank);
-    if ( all )
-	printf("msg%d: Size desired %d actual %d\n",
-	       msgid, (int)op->desired_length, (int)op->actual_length);
-    else
-	printf("msg%d: Size desired %d\n",
-	       msgid, (int)op->desired_length);
-    printf("msg%d: tag_wild %d\n",msgid,op->tag_wild);
-    if ( all )
-	printf("msg%d: Tag desired %d actual %d\n",
-	       msgid, (int)op->desired_tag, (int)op->actual_tag);
-    else
-	printf("msg%d: Tag desired %d\n",msgid, (int)op->desired_tag);
-    printf("msg%d: system_buffer %d\n",msgid,op->system_buffer);
-    printf("msg%d: Buffer 0x%lx\n",msgid,(long)op->buffer);
+
+    show_msg_op("start",1);
+
+    show_msg_op("op_type",type);
+    show_msg_op("status",op->status);
+
+    show_msg_op("desired_local_rank",op->desired_local_rank);
+    show_msg_op("desired_global_rank",op->desired_global_rank);
+    show_msg_op("desired_length",op->desired_length);
+    show_msg_op("desired_tag",op->desired_tag);
+    if ( all ) {
+	show_msg_op("actual_local_rank",op->actual_local_rank);
+	show_msg_op("actual_global_rank",op->actual_global_rank);
+	show_msg_op("actual_length",op->actual_length);
+	show_msg_op("actual_tag",op->actual_tag);
+    }
+
+    show_msg_op("tag_wild",op->tag_wild);
+    show_msg_op("system_buffer",op->system_buffer);
+
+    printf("Buffer 0x%lx\n",(long)op->buffer);

      i = 0;
      do {
  	if ( op->extra_text[i][0] )
-	    printf("msg%d: '%s'\n",msgid,op->extra_text[i]);
+	    printf("'%s'\n",op->extra_text[i]);
  	else
  	    i = 10;
      } while ( i++ < 5 );
+
+    show_msg_op("done",1);
  }

  void load_ops (mqs_process *target_process,int type)
=======================================
--- /trunk/src/padb	Tue Dec 22 07:10:41 2009
+++ /trunk/src/padb	Tue Dec 22 12:03:45 2009
@@ -6353,13 +6353,14 @@
      my %stats;

      # Communicator data.
-    my %cd;
+    my %communicator_descriptor;
+    my %message_descriptor;

      my %global;

      $global{exit} = 'unknown';

-    my @cd;
+    my @communicator_list;
      my $bytes_to_read;
      my $str_name;
      my $str_value = $EMPTY_STRING;
@@ -6393,7 +6394,7 @@
                          target_key_pair( $vp, 'minfo_msg', $str_name );
                      }
                  } else {
-                    $cd{$str_name} = $str_value;
+                    $communicator_descriptor{$str_name} = $str_value;
                  }
                  $bytes_to_read = undef;
                  $str_value     = "";
@@ -6455,13 +6456,40 @@
                      $str_name      = $name;
                      $str_global    = 0;
                  } elsif ( $key eq 'rt' ) {
-                    push @{ $cd{rt} }, $value;
+                    push @{ $communicator_descriptor{rt} }, $value;
                  } else {
-                    $cd{$key} = $value;
-                    $cd{mid} = $cid;
+                    $communicator_descriptor{$key} = $value;
+                    $communicator_descriptor{mid} = $cid;
                  }
              } else {
-                target_key_pair( $vp, 'UNPARSEABLE MINFO', $r );
+                target_error( $vp, "UNPARSEABLE MINFO: $r" );
+            }
+        } elsif ( $cmd eq 'Msg:' ) {
+            $stats{msg}++;
+            if (
+                $r =~ m{\A
+                        Msg:
+                        [ ]
+                        (\w+):
+                        (-?\d+)
+                        \z
+                        }x
+              )
+            {
+                my $key   = $1;
+                my $value = $2;
+
+                if ( $key eq 'start' ) {
+                    undef %message_descriptor;
+                } elsif ( $key eq 'done' ) {
+                    push @{ $communicator_descriptor{messages} },
+                      dclone( \%message_descriptor );
+                    undef %message_descriptor;
+                } else {
+                    $message_descriptor{$key} = $value;
+                }
+            } else {
+                target_error( $vp, "UNPARSEABLE MINFO: $r" );
              }
          } elsif ( $cmd eq 'zzz:' ) {
              $stats{zzz}++;
@@ -6487,14 +6515,14 @@
                      $str_global    = 1;
                  }
              } else {
-                target_key_pair( $vp, 'UNPARSEABLE MINFO', $r );
+                target_error( $vp, "UNPARSEABLE MINFO: $r" );
              }
          } elsif ( $cmd eq 'done' ) {
-            push @cd, dclone( \%cd );
-            undef %cd;
+            push @communicator_list, dclone( \%communicator_descriptor );
+            undef %communicator_descriptor;
          } else {
              $stats{raw}++;
-            push @{ $cd{raw} }, $r;
+            push @{ $message_descriptor{raw} }, $r;
          }
      }

@@ -6535,7 +6563,7 @@
          }
      }

-    return minfo_to_array( \@cd );
+    return minfo_to_array( \@communicator_list );

  }

@@ -6545,7 +6573,6 @@
      my @mq;
      foreach my $comm ( @{$cd} ) {

-        #print Dumper $comm;
          push @mq, "comm$comm->{mid}: name: '$comm->{name}'";
          if ( defined $comm->{rank} ) {
              push @mq, "comm$comm->{mid}: rank: '$comm->{rank}'";
@@ -6558,8 +6585,43 @@
              push @mq, "comm$comm->{mid}: Rank: local $i global $comm->{rt}[$i]";
          }

-        foreach my $l ( @{ $comm->{raw} } ) {
-            push @mq, $l;
+        my $mid = 0;
+
+        foreach my $m ( @{ $comm->{messages} } ) {
+            my @op_desc = qw(pending_send pending_receive unexpected_message);
+            my @status_desc = qw(pending matched complete);
+            my $op = "Operation $m->{op_type} ($op_desc[$m->{op_type}])";
+            $op .= " status $m->{status} ($status_desc[$m->{status}])";
+            push @mq, "msg$mid: $op";
+            push @mq,
+"msg$mid: Rank local $m->{desired_local_rank} global $m->{desired_global_rank}";
+            if ( defined $m->{actual_global_rank} ) {
+                push @mq,
+"msg$mid: Actual local $m->{actual_local_rank} global $m->{actual_global_rank}";
+            }
+
+            if ( defined $m->{actual_length} ) {
+                push @mq,
+"msg$mid: Size desired $m->{desired_length} actual $m->{actual_length}";
+            } else {
+                push @mq, "msg$mid: Size desired $m->{desired_length}";
+            }
+            push @mq, "msg$mid: tag_wild $m->{tag_wild}";
+
+            if ( defined $m->{actual_tag} ) {
+                push @mq,
+"msg$mid: Tag desired $m->{desired_tag} actual $m->{actual_tag}";
+            } else {
+                push @mq, "msg$mid: Tag desired $m->{desired_tag}";
+            }
+
+            push @mq, "msg$mid: system_buffer $m->{system_buffer}";
+
+            foreach my $l ( @{ $m->{raw} } ) {
+                push @mq, "msg$mid: $l";
+            }
+
+            $mid++;
          }
      }
      return @mq;


From padb at googlecode.com  Tue Dec 22 21:31:27 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Tue, 22 Dec 2009 21:31:27 +0000
Subject: [padb] r381 committed - Tidy up some errors reported by perltidy again, mostly missing calls...
Message-ID: <001636988e01b319da047b57ed61@google.com>

Revision: 381
Author: apittman
Date: Tue Dec 22 13:31:11 2009
Log: Tidy up some errors reported by perltidy again, mostly missing calls
to return at the end of a function.

http://code.google.com/p/padb/source/detail?r=381

Modified:
  /trunk/src/padb

=======================================
--- /trunk/src/padb	Tue Dec 22 12:03:45 2009
+++ /trunk/src/padb	Tue Dec 22 13:31:11 2009
@@ -2749,7 +2749,6 @@
  sub pbs_get_lqsub {
      my ( $user, $server ) = @_;
      my $job;
-    my $nprocess;
      my $cmd = "qstat -w -n -u $user \@$server";

      my @output = slurp_cmd($cmd);
@@ -5574,7 +5573,6 @@
          my $exe = readlink("/proc/$pid/path/a.out");
          my %cs = gdb_n_send( $gdb, "file $exe" );
          if ( $cs{status} ne 'done' ) {
-            croak("Gdb command file $exe failed");
              return;
          }
      }
@@ -5627,6 +5625,7 @@

      gdb_n_send( $gdb, '-gdb-set print address off' );

+    return;
  }

  sub gdb_attach_async_start {
@@ -5636,7 +5635,6 @@
          my $exe = readlink("/proc/$pid/path/a.out");
          my %cs = gdb_n_send( $gdb, "file $exe" );
          if ( $cs{status} ne 'done' ) {
-            croak("Gdb command file $exe failed");
              return;
          }
      }
@@ -7883,6 +7881,7 @@
              delete $proc->{gdb_handle};
          }
      }
+    return;
  }

  # Try and be clever here, attach to each and every process on this node
@@ -8134,6 +8133,7 @@
      my $thread_list = join q{,}, sort { $a <=> $b } @threads;

      output( $proc->{vp}, $thread_list );
+    return;
  }

  sub kill_proc {