From thipadin.seng-long at bull.net Tue Dec 1 14:18:08 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Tue, 1 Dec 2009 15:18:08 +0100
Subject: [padb] Pb with --create-secret-file
Message-ID:

Hi,

On one of our clusters I've got a problem creating the secret file, like this:

[thipa at vb0 openmpi]$ padb_r341 --create-secret-file
Failed to chmod secret file: No such file or directory
[thipa at vb0 openmpi]$

Our system is:

[thipa at vb0 openmpi]$ uname -a
Linux vb0 2.6.18-B64k.1.26 #1 SMP Wed Aug 26 17:15:29 CEST 2009 ia64 ia64 ia64 GNU/Linux

Before my patch the source code looked like this:

sub create_padb_secret {
    my $filename = "$ENV{HOME}/.padb-secret";
    my $FD;
    if ( not open $FD, '>', $filename ) {
        print "Failed to create secret file: $!\n";
        return;
    }
    if ( chmod( 0600, $FD ) != 1 ) {
        print "Failed to chmod secret file: $!\n";
        return;
    }
    my $s = rand;
    print {$FD} "secret=$s\n";
    close $FD;
    print "Sucessfully created secret file ($filename)\n";
    return;
}

After searching on the web I changed the code to:

sub create_padb_secret {
    my $filename = "$ENV{HOME}/.padb-secret";
    my $FD;
    if ( not open $FD, '>', $filename ) {
        print "Failed to create secret file: $!\n";
        return;
    }
    if ( chmod( 0600, $filename ) != 1 ) {
        print "Failed to chmod secret file: $!\n";
        return;
    }
    my $s = rand;
    print {$FD} "secret=$s\n";
    close $FD;
    print "Sucessfully created secret file ($filename)\n";
    return;
}

And it works:

[thipa at vb0 openmpi]$ padb_r341_secret --create-secret-file
Sucessfully created secret file (/home_nfs/thipa/.padb-secret)
[thipa at vb0 openmpi]$

This happens only on this cluster, which is IA64.
From what I found on the internet, it depends on whether the system supports
fchmod or not:
http://perldoc.perl.org/functions/chmod.html

Here is the patch:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: secret.patch
Type: application/octet-stream
Size: 364 bytes
Desc: not available
URL:

From padb at googlecode.com Tue Dec 1 14:38:05 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Tue, 01 Dec 2009 14:38:05 +0000
Subject: [padb] r342 committed - When creating the secret file chmod the filename not the file handle a...
Message-ID: <00504502b14cc17f9c0479abb4e7@google.com>

Revision: 342
Author: apittman
Date: Tue Dec 1 06:37:02 2009
Log: When creating the secret file chmod the filename not the file handle as
this is more widely supported and doesn't result in errors on ia64 or
Solaris. Previously padb could exit with errors like the following:

[thipa at vb0 openmpi]$ padb --create-secret-file
Failed to chmod secret file: No such file or directory

http://code.google.com/p/padb/source/detail?r=342

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb  Thu Nov 26 02:30:48 2009
+++ /trunk/src/padb  Tue Dec 1 06:37:02 2009
@@ -4726,7 +4726,7 @@
         print "Failed to create secret file: $!\n";
         return;
     }
-    if ( chmod( 0600, $FD ) != 1 ) {
+    if ( chmod( 0600, $filename ) != 1 ) {
         print "Failed to chmod secret file: $!\n";
         return;
     }

From ashley at pittman.co.uk Tue Dec 1 14:38:47 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Tue, 01 Dec 2009 14:38:47 +0000
Subject: [padb] Pb with --create-secret-file
In-Reply-To:
References:
Message-ID: <1259678327.3532.263.camel@alpha>

On Tue, 2009-12-01 at 15:18 +0100, thipadin.seng-long at bull.net wrote:
>
> Hi,
> On one of our clusters I've got a problem creating the secret file, like
> this:
>
> [thipa at vb0 openmpi]$ padb_r341 --create-secret-file
> Failed to chmod secret file: No such file or directory
> [thipa at vb0 openmpi]$

Ah thanks, I'd briefly seen this on a Solaris system but didn't have
enough time on the system to diagnose it.

Committed as r342.

Ashley,

-- 
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From padb at googlecode.com Tue Dec 1 14:42:08 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Tue, 01 Dec 2009 14:42:08 +0000 Subject: [padb] r343 committed - Backport r342 from the HEAD, allow creation of secret file on machines... Message-ID: <001636e1fd8838fddd0479abc385@google.com> Revision: 343 Author: apittman Date: Tue Dec 1 06:38:06 2009 Log: Backport r342 from the HEAD, allow creation of secret file on machines which don't support fchmod http://code.google.com/p/padb/source/detail?r=343 Modified: /branches/3.0/src/padb ======================================= --- /branches/3.0/src/padb Thu Oct 8 04:25:23 2009 +++ /branches/3.0/src/padb Tue Dec 1 06:38:06 2009 @@ -4154,7 +4154,7 @@ printf("Failed to create secret file: $!\n"); return; } - if ( chmod( 0600, $FD ) != 1 ) { + if ( chmod( 0600, $filename ) != 1 ) { printf("Failed to chmod secret file: $!\n"); return; } From thipadin.seng-long at bull.net Tue Dec 1 15:30:22 2009 From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net) Date: Tue, 1 Dec 2009 16:30:22 +0100 Subject: [padb] =?iso-8859-1?q?R=E9f=2E_=3A_Re=3A_Patch_of_support_of_Slur?= =?iso-8859-1?q?m_+_Openmpi__Orte_manager?= Message-ID: On Mon, 2009-11-30 at 17:31 wrote: >I knew you had to do this when running OpenMPI with slurm however I'd >never done it myself. My test cluster has both installed so I should be >able to try it, do you happen to know if you need and special configure >options to either to allow this? I used slurm 2.0.1 and openmpi_1.3.3, uppers versions should work also. I don't know the special configure except in my $PATH, I have added the PATH to where is installed my openMPI_1.3.3 binaries and libs. Check the path with "type mpirun" command,it should show the PATH to openmpi. >Does the mpirun job (i.e. the processes we want) have it's own slurm job >step or does it share the job step with the allocation? 
Just after salloc step is: [thipa at vb0 openmpi]$ salloc.sh salloc: Granted job allocation 27828 [thipa at vb0 openmpi]$ [thipa at vb0 openmpi]$ squeue -s STEPID NAME PARTITION USER TIME NODELIST [thipa at vb0 openmpi]$ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 27828 jlg bash thipa R 0:25 2 vb[8,10] After mpirun step is: [thipa at vb0 openmpi]$ squeue -s STEPID NAME PARTITION USER TIME NODELIST 27828.0 orted jlg thipa 1:02 vb[8,10] [thipa at vb0 openmpi]$ I believe it can't share job step, each job step is its own. >I also notice the /proc/version in the patch, does this mean the patch >works on an OS other than Linux? It is not complete, to run on other OS that linux you must have two branches: 1 - with /proc/version using readdir /proc and /proc/$pid/cmdline 2 - with "ps -edf | grep slurmstepd" something like this. >What happens if you run salloc... srun? Does this work with the >existing support and how should users know which resource manager plugin >to pick (Ideally padb could do the right thing). You mean salloc ... srun ...mpirun prog ? 
That's what I have experimented: [thipa at vb0 openmpi]$ salloc.sh salloc: Granted job allocation 27830 [thipa at vb0 openmpi]$ [thipa at vb0 openmpi]$ srun -n1 mpirun -bynode -n 6 ./pp_sndrcv_spbl srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1 I am, process 0 starting on vb8, total by srun 6 Me, process 0, send 1000 to process 2 I am, process 2 starting on vb8, total by srun 6 I am, process 4 starting on vb8, total by srun 6 I am, process 1 starting on vb10, total by srun 6 I am, process 5 starting on vb10, total by srun 6 I am, process 3 starting on vb10, total by srun 6 There are 2 steps: [thipa at vb0 openmpi]$ squeue -s STEPID NAME PARTITION USER TIME NODELIST 27830.0 mpirun jlg thipa 0:22 vb8 27830.1 orted jlg thipa 0:22 vb10 [thipa at vb0 openmpi]$ And rmgr=slurm doesn't work (existing support) You just catch the stack of orted: [thipa at vb0 openmpi]$ padb_r341 -O stack-shows-locals=no -O stack-shows-params=no -O rmgr=slurm --verbose -tx 27830 Loading config from "/etc/padb.conf" Loading config from "/home_nfs/thipa/.padbrc" Loading config from environment Loading config from command line Setting 'rmgr' to 'slurm' Setting 'stack_shows_locals' to 'no' Setting 'stack_shows_params' to 'no' Collecting information for job '27830' Attaching to job 27830 Job has 1 process(es) Job spans 2 host(s) Mode 'stack' mode specific options: gdb_retry_count : '3' max_distinct_values : '3' stack_shows_locals : '0' stack_shows_params : '0' stack_strip_above : 'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress' stack_strip_below : 'main,__libc_start_main,start_thread' strip_above_wait : '1' strip_below_main : '1' ----------------- [0] (1 processes) ----------------- main() at main.c:13 orterun() at orterun.c:686 opal_event_dispatch() at ?:? opal_event_base_loop() at ?:? poll_dispatch() at ?:? poll() at ?:? ??() at ?:? 
result from parallel command is 0 (state=shutdown) [thipa at vb0 openmpi]$ >> [thipa at machu0 padb_open]$ ./padb -O rmgr="sl-orte" -O >> stack-shows-locals=no -O stack-shows-params=no --debug=verbose=all >> -tx 8324 >> DEBUG (verbose): 0: There are 1 processes over 3 hosts >This isn't great, the number of processes expected is so far only used >to check for missing processes but there are other potential uses for it >so I'd rather it was correct. I will dig it more, I don't know the meaning of processes number actually you do with. >> I don't use scontrol listpids, because I found this command not a >> universal method (some version doesn't have it), >> and may issued error message such as : >> slurmd[machu139]: proctrack/pgid does not implement >> slurm_container_get_pids >I'd prefer to use this if at all possible, this option was added at a >request my be several years ago so I'd have thought most versions have >it by now, can you be clearer on the versions where it doesn't work? It work only for slurm upper from 1.2, may be some clients have it still ? If you can get rid of messages above (slurmd[hostnn]: proctrack/pgid does not implement) I am ok. Thipadin. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ashley at pittman.co.uk Tue Dec 1 15:31:20 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Tue, 01 Dec 2009 15:31:20 +0000 Subject: [padb] Patch of support of Slurm + Openmpi Orte manager In-Reply-To: References: Message-ID: <1259681480.3532.271.camel@alpha> Thipadin, What do you think of leaving this under the umbrella of the slurm resource manager but having it as a option to enable it as a mode? Really it's a special use case of Slurm so padb should reflect this in keeping it's resource manager code clean (or at least no dirtier than it is) and having a different mode to run in which would select the target processes differently. 
This leaves open the option of padb being able to detect that it only found one process (called orterun) and advise the user accordingly or perhaps even re-try with the option enabled. Attached is a patch that does this, note that I've not changed any of the code in sl_orte_find_pids, merely changed the mechanism used to call it. Thoughts? I'm away Thursday/Friday this week but should be able to take a closer look at the actual code the beginning of next week, as I said I've got a cluster I can run it on this time. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk -------------- next part -------------- A non-text attachment was scrubbed... Name: padb-slurm-open.patch Type: text/x-patch Size: 4971 bytes Desc: not available URL: From ashley at pittman.co.uk Tue Dec 1 17:17:03 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Tue, 01 Dec 2009 17:17:03 +0000 Subject: [padb] =?iso-8859-1?q?R=E9f=2E_=3A_Re=3A_Patch_of_support_of_Slur?= =?iso-8859-1?q?m_+_Openmpi_Orte_manager?= In-Reply-To: References: Message-ID: <1259687823.3532.346.camel@alpha> On Tue, 2009-12-01 at 16:30 +0100, thipadin.seng-long at bull.net wrote: > >I also notice the /proc/version in the patch, does this mean the > patch > >works on an OS other than Linux? > > It is not complete, to run on other OS that linux you must have two > branches: > 1 - with /proc/version using readdir /proc and /proc/$pid/cmdline > 2 - with "ps -edf | grep slurmstepd" something like this. I'd view slurmstepd arguments as liable to change over time, if there was a way to get this information without relying on slurm internals I'd prefer it. If this is the only way to get the information then we'll use it however. > >What happens if you run salloc... srun? Does this work with the > >existing support and how should users know which resource manager > plugin > >to pick (Ideally padb could do the right thing). > > You mean salloc ... 
srun ...mpirun prog ? > That's what I have experimented: I was thinking salloc...srun with say a MPICH2 program. Does that work with the existing slurm support? How would we document for users which resource manager (or mode) to use in this case? > >> [thipa at machu0 padb_open]$ ./padb -O rmgr="sl-orte" -O > >> stack-shows-locals=no -O stack-shows-params=no --debug=verbose=all > >> -tx 8324 > >> DEBUG (verbose): 0: There are 1 processes over 3 hosts > > >This isn't great, the number of processes expected is so far only > used > >to check for missing processes but there are other potential uses for > it > >so I'd rather it was correct. > > I will dig it more, I don't know the meaning of processes number > actually you do with. The expected process count is returned by setup_job and is only used to ensure that all processes are present, padb could live without this however being able to warn users on missing processes is useful and was something that was requested from the 2.0 series. Perhaps I could make it that nprocs was returned from the find_pids function on the inner process and passed back up the tree some how. > >> I don't use scontrol listpids, because I found this command not a > >> universal method (some version doesn't have it), > >> and may issued error message such as : > >> slurmd[machu139]: proctrack/pgid does not implement > >> slurm_container_get_pids > > >I'd prefer to use this if at all possible, this option was added at a > >request my be several years ago so I'd have thought most versions > have > >it by now, can you be clearer on the versions where it doesn't work? > > It work only for slurm upper from 1.2, may be some clients have it > still ? At some point we have to drop support for old versions, the current slurm code won't work without it so requiring it for the the openmpi/slurm combination doesn't seem like too much of a hardship to me. 
> If you can get rid of messages above (slurmd[hostnn]: proctrack/pgid > does not implement) I'll raise this on the slurm list, I get these warning messages too but I'd assumed that was because I'm using a debug build. The listpids command still works even though these warnings are issued. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From sylvain.jeaugey at bull.net Wed Dec 2 08:49:50 2009 From: sylvain.jeaugey at bull.net (Sylvain Jeaugey) Date: Wed, 2 Dec 2009 09:49:50 +0100 (CET) Subject: [padb] =?iso-8859-15?q?R=E9f=2E_=3A_Re=3A_Patch_of_support_of_Slu?= =?iso-8859-15?q?rm_+_Openmpi_Orte_manager?= In-Reply-To: <1259687823.3532.346.camel@alpha> References: <1259687823.3532.346.camel@alpha> Message-ID: On Tue, 1 Dec 2009, Ashley Pittman wrote: >>> What happens if you run salloc... srun? Does this work with the >>> existing support and how should users know which resource manager >> plugin >>> to pick (Ideally padb could do the right thing). >> >> You mean salloc ... srun ...mpirun prog ? >> That's what I have experimented: > > I was thinking salloc...srun with say a MPICH2 program. Does that work > with the existing slurm support? How would we document for users which > resource manager (or mode) to use in this case? srun inside salloc is just the same as a simple srun. For the record, we also use srun directly with Open MPI programs (still beta, but works). And padb works fine with the existing slurm support as it would with an MPICH2 program. 
Sylvain

From ashley at pittman.co.uk Wed Dec 2 11:40:06 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 02 Dec 2009 11:40:06 +0000
Subject: [padb] Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To:
References: <1259687823.3532.346.camel@alpha>
Message-ID: <1259754006.2676.1.camel@alpha>

On Wed, 2009-12-02 at 09:49 +0100, Sylvain Jeaugey wrote:
> On Tue, 1 Dec 2009, Ashley Pittman wrote:
>
> >>> What happens if you run salloc... srun? Does this work with the
> >>> existing support and how should users know which resource manager
> >>> plugin to pick (Ideally padb could do the right thing).
> >>
> >> You mean salloc ... srun ...mpirun prog ?
> >> That's what I have experimented:
> >
> > I was thinking salloc...srun with say a MPICH2 program. Does that work
> > with the existing slurm support? How would we document for users which
> > resource manager (or mode) to use in this case?
> srun inside salloc is just the same as a simple srun. For the record, we
> also use srun directly with Open MPI programs (still beta, but works). And
> padb works fine with the existing slurm support as it would with an
> MPICH2 program.

Ok, that's what I thought. It's good news that it just works, but it
also means it's probably not possible to detect from slurm what type of
job is running, i.e. whether you need the existing code or the new code.

Ashley,

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From padb at googlecode.com Wed Dec 2 14:18:54 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Wed, 02 Dec 2009 14:18:54 +0000
Subject: [padb] r344 committed - Add a mechanism to allow the find_pids function on the inner processes...
Message-ID: <0016e64dde58f816e50479bf8dbb@google.com>

Revision: 344
Author: apittman
Date: Wed Dec 2 06:18:09 2009
Log: Add a mechanism to allow the find_pids function on the inner processes
to report back a different value of nprocesses to the outer process.
The find_pids function can do this by calling
target_key_pair($rank,"JOB_SIZE",$job_size) which the outer process will
spot and update it's expectations accordingly.

http://code.google.com/p/padb/source/detail?r=344

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb  Tue Dec 1 06:37:02 2009
+++ /trunk/src/padb  Wed Dec 2 06:18:09 2009
@@ -4175,6 +4175,22 @@

     # The inner process has signed on.
     if ( $comm_data->{current_req}->{mode} eq 'signon' ) {
+
+        # Allow the find_pids function to report back a different job
+        # size to the one the resource manager spotted, potentially
+        # because there is a job running under an allocation and there
+        # may be a discrepancy between the two.
+        if ( defined $d->{target_data}{JOB_SIZE} ) {
+            my @size = keys %{ $d->{target_data}{JOB_SIZE} };
+            if ( @size == 1 ) {
+                $comm_data->{nprocesses} = $size[0];
+            } else {
+                print
+"More than one value reported for Job Size, using largest\n";
+                my @s = sort { $a <=> $b } @size;
+                $comm_data->{nprocesses} = $s[-1];
+            }
+        }

         # Check the signon messages, reporting minor errors to the user, if
         # no processes are found then don't bother processing any commands

From ashley at pittman.co.uk Wed Dec 2 15:51:09 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 02 Dec 2009 15:51:09 +0000
Subject: [padb] Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <1259681480.3532.271.camel@alpha>
References: <1259681480.3532.271.camel@alpha>
Message-ID: <1259769070.6352.51.camel@alpha>

On Tue, 2009-12-01 at 15:31 +0000, Ashley Pittman wrote:
> I'm away Thursday/Friday this week but should be able to take a closer
> look at the actual code the beginning of next week, as I said I've got a
> cluster I can
run it on this time.

The code almost works for me, all I've changed is as I sent before,
using a configuration option to turn it on and adding a call to
target_key_pair($rank,"JOB_SIZE"...), see r344 for details of this.

[ashley at cloud0 src]$ ./padb -a --proc-summary -Oslurm_orte_alloc=true
Warning, failed to locate ranks [0,2]
rank  hostname  pid   vmsize    vmrss    S  uptime  %cpu  lcore  command
   1  cloud1    1618  73504 kB  3928 kB  R    1.99    21      0  deadlock
   3  cloud1    1619  73504 kB  3932 kB  R    1.99    23      0  deadlock

As you can see it's missing the processes from cloud0 which is where the
mpirun is executing. The same job shows up as expected using the orte
resource manager however, the limitation here being it only works from
the node where this is running.

[ashley at cloud0 src]$ ./padb -a --proc-summary -Ormgr=orte
rank  hostname  pid   vmsize    vmrss    S  uptime  %cpu  lcore  command
   0  cloud0    3199  73380 kB  3900 kB  R    2.00    21      0  deadlock
   1  cloud1    1618  73504 kB  3928 kB  R    1.99    21      0  deadlock
   2  cloud0    3200  73384 kB  3908 kB  R    2.00    18      0  deadlock
   3  cloud1    1619  73504 kB  3932 kB  R    1.99    20      0  deadlock

This is the relevant parts of the process tree from cloud0, you can
trace deadlock back to the mpirun without any slurmstepd on this node at
all.

ps -o pid,ppid,user,cmd -xa
 2851  1219 ashley salloc -N 2 -n3 -O
 2854  2851 ashley /bin/bash
 3192  2854 ashley mpirun -n 4 /home/ashley/general/mpi/deadlock
 3193  3192 ashley srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=cloud1 orted -mca ess slurm -mca orte_ess_jobid 258146304 -mca orte_es
 3199  3192 ashley /home/ashley/general/mpi/deadlock
 3200  3192 ashley /home/ashley/general/mpi/deadlock

I'm wondering if it might be better to simply walk all processes in a
very similar way to pbs_find_pids and check for OMPI_COMM_WORLD_RANK
OMPI_COMM_WORLD_SIZE, SLURM_JOB_ID and SLURM_STEP_ID. This code could
then be used as a fallback in case scontol listpids failed to return any
pids and hence wouldn't need any options twiddled to enable it.
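
[Editorial sketch, not part of the original message.] The process walk
described above could look roughly like the following. This is only a
minimal illustration under stated assumptions: it assumes a Linux-style
/proc with NUL-separated /proc/<pid>/environ files, and the function names
parse_environ, ompi_rank_in_job and find_ompi_pids are hypothetical; they
are not taken from padb or from the attached patches. It matches on
SLURM_JOB_ID only; SLURM_STEP_ID could be compared in the same way.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the NUL-separated contents of
# /proc/<pid>/environ into a key => value hash.
sub parse_environ {
    my ($raw) = @_;
    my %env;
    foreach my $pair ( split /\0/, $raw ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        next if not defined $value;
        $env{$key} = $value;
    }
    return %env;
}

# Hypothetical helper: return ( rank, size ) if the environment belongs
# to an Open MPI process inside the given Slurm job, otherwise an empty
# list.
sub ompi_rank_in_job {
    my ( $job, %env ) = @_;
    return if not defined $env{SLURM_JOB_ID};
    return if $env{SLURM_JOB_ID} ne $job;
    return if not defined $env{OMPI_COMM_WORLD_RANK};
    return ( $env{OMPI_COMM_WORLD_RANK}, $env{OMPI_COMM_WORLD_SIZE} );
}

# Hypothetical helper: walk /proc on the local node and build a
# rank => pid map for every readable process that matches the job.
sub find_ompi_pids {
    my ($job) = @_;
    my %pid_of_rank;
    opendir my $dir, '/proc' or return %pid_of_rank;
    foreach my $pid ( grep { /\A\d+\z/ } readdir $dir ) {

        # Processes owned by other users, or that have already
        # exited, simply fail to open and are skipped.
        open my $fh, '<', "/proc/$pid/environ" or next;
        local $/ = undef;
        my $raw = <$fh>;
        close $fh;
        next if not defined $raw;
        my ($rank) = ompi_rank_in_job( $job, parse_environ($raw) );
        $pid_of_rank{$rank} = $pid if defined $rank;
    }
    closedir $dir;
    return %pid_of_rank;
}
```

Calling find_ompi_pids with a job id on each node of the allocation would
give the local rank-to-pid mapping without relying on slurmstepd being
present on that node.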
Combined with some more intelligent setting of default values for
slurm_job_step and that could make this case full automatic with the
user just specifying the jobid and nothing else.

Attached is the patch as I've been using it.

Ashley,

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: padb-slurm-open-2.patch
Type: text/x-patch
Size: 5461 bytes
Desc: not available
URL:

From padb at googlecode.com Wed Dec 2 16:34:58 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Wed, 02 Dec 2009 16:34:58 +0000
Subject: [padb] r345 committed - Slurm: Try to pick a sensible (valid) default value for...
Message-ID: <0016e646922c8f983f0479c1745b@google.com>

Revision: 345
Author: apittman
Date: Wed Dec 2 08:34:27 2009
Log: Slurm: Try to pick a sensible (valid) default value for
slurm_job_step rather than just using a value of zero. Revert back
to using zero if we can't find any trace of any active steps.
Also convert slurm_setup_pcmd to slurm_setup_job.
http://code.google.com/p/padb/source/detail?r=345

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb  Wed Dec 2 06:18:09 2009
+++ /trunk/src/padb  Wed Dec 2 08:34:27 2009
@@ -441,7 +441,7 @@
     is_installed           => \&slurm_is_installed,
     get_active_jobs        => \&slurm_get_jobs,
     job_is_running         => \&slurm_job_is_running,
-    setup_pcmd             => \&slurm_setup_pcmd,
+    setup_job              => \&slurm_setup_job,
     find_pids              => \&slurm_find_pids,
     require_inner_callback => 1,
 };
@@ -519,7 +519,7 @@
 $conf{prun_exittimeout} = '2m';

 $conf{rmgr} = undef;
-$conf{slurm_job_step} = 0;
+$conf{slurm_job_step} = undef;

 $conf{pbs_server} = undef;

@@ -552,6 +552,7 @@
 my $EQUALS = qr{=}x;
 my $SPACE  = qr{\s+}x;
 my $COLON  = qr{:}x;
+my $PERIOD = qr{\.}x;

 my $EMPTY_STRING = q{};

@@ -2472,11 +2473,41 @@
     return ( $status eq 'running' );
 }

-sub slurm_setup_pcmd {
-    my $job = shift;
+sub slurm_setup_job {
+    my $job = shift;
+
+    # After we have selected a job id and decided to target it make a
+    # best-attempt effort to pick a sensible step_id. List all the
+    # step ids slurm thinks are running and pick the first one.
+    # Previously this value just defaulted to zero.
+    if ( not defined $conf{slurm_job_step} ) {
+        my @all_steps = slurp_cmd("squeue -s -o %i");
+        my @valid_steps;
+        foreach my $step (@all_steps) {
+            chomp $step;
+            next if $step eq "STEPID";
+            my ( $job_id, $job_step ) = split $PERIOD, $step;
+            next unless $job_id == $job;
+            push @valid_steps, $job_step;
+        }
+        if (@valid_steps) {
+            config_set_internal( 'slurm_job_step', $valid_steps[0] );
+        } else {
+            print
+"Unable to determine any valid job steps, assuming step id 0\n";
+            config_set_internal( 'slurm_job_step', 0 );
+        }
+    }
+
     my $cpus = slurm_job_to_ncpus($job);
     my $nc   = slurm_job_to_nodecount($job);
-    return ( "srun --jobid=$job", $cpus, $nc );
+
+    my %pcmd;
+    $pcmd{nprocesses} = $cpus;
+    $pcmd{nhosts}     = $nc;
+    $pcmd{command}    = "srun --jobid=$job";
+    return %pcmd;
+
 }

 ###############################################################################
@@ -5085,7 +5116,13 @@
     }

     foreach my $co (@conf_int) {
-        check_int( $conf{$co} );
+
+        # Only check for defined values here, for some options only
+        # intergers are valid but the default value is undef which means
+        # padb should attempt to do the right thing.
+        if ( defined $conf{$co} ) {
+            check_int( $conf{$co} );
+        }
     }

     # Now go through all the config options and both verify they are

From ashley at pittman.co.uk Wed Dec 2 17:48:56 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 02 Dec 2009 17:48:56 +0000
Subject: [padb] Patch of support of Slurm + Openmpi Orte manager
In-Reply-To: <1259769070.6352.51.camel@alpha>
References: <1259681480.3532.271.camel@alpha> <1259769070.6352.51.camel@alpha>
Message-ID: <1259776136.6352.56.camel@alpha>

On Wed, 2009-12-02 at 15:51 +0000, Ashley Pittman wrote:
>
> I'm wondering if it might be better to simply walk all processes in a
> very similar way to pbs_find_pids and check for OMPI_COMM_WORLD_RANK
> OMPI_COMM_WORLD_SIZE, SLURM_JOB_ID and SLURM_STEP_ID.
This code could > then be used as a fallback in case scontol listpids failed to return > any > pids and hence wouldn't need any options twiddled to enable it. > > Combined with some more intelligent setting of default values for > slurm_job_step and that could make this case full automatic with the > user just specifying the jobid and nothing else. The attached patch implements just that, "padb -a --proc-summary -Ormgr=slurm" works for me correctly in all cases I've tested. Let me know if this works for you and if you're happy with this approach. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk -------------- next part -------------- A non-text attachment was scrubbed... Name: padb-slurm-open-3.patch Type: text/x-patch Size: 2716 bytes Desc: not available URL: From thipadin.seng-long at bull.net Thu Dec 3 10:45:37 2009 From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net) Date: Thu, 3 Dec 2009 11:45:37 +0100 Subject: [padb] =?iso-8859-1?q?R=E9f=2E_=3A_Re=3A__Patch_of_support_of_Slu?= =?iso-8859-1?q?rm_+_OpenmpiOrte_manager?= Message-ID: Hi, I was off yesterday, in the mean time you have made some versions. I am only trying the test the last one (padb-slurm-open-3.patch). I understand you want to handle automaticallly the fact that user do slurm/openmpi combination or not. I am starting with something wrong, i think it needs more handles: So the combination is: salloc srun -n 1 mpirun -bynode -n 8 my_prog this combination should be equivalent to salloc mprun -bynode -n 8 my_prog so in all my test I've got this. 
The result is a little confused, let 's have a look: The test: [thipa at vb0 openmpi]$ salloc -p jlg -w vb8,vb9,vb10 salloc: Granted job allocation 27834 [thipa at vb0 openmpi]$ [thipa at vb0 openmpi]$ srun -n1 mpirun -bynode -n 8 ./pp_sndrcv_spbl srun: Warning: can't run 1 processes on 3 nodes, setting nnodes to 1 I am, process 3 starting on vb8, total by srun 8 I am, process 6 starting on vb8, total by srun 8 I am, process 0 starting on vb8, total by srun 8 I am, process 7 starting on vb9, total by srun 8 I am, process 4 starting on vb9, total by srun 8 I am, process 2 starting on vb10, total by srun 8 I am, process 5 starting on vb10, total by srun 8 I am, process 1 starting on vb9, total by srun 8 Me, process 0, send 1000 to process 2 Padb Test: [thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose -tx 27834 Loading config from "/etc/padb.conf" Loading config from "/home_nfs/thipa/.padbrc" Loading config from environment Loading config from command line Setting 'rmgr' to 'slurm' Setting 'stack_shows_locals' to 'no' Setting 'stack_shows_params' to 'no' Collecting information for job '27834' Attaching to job 27834 Job has 1 process(es) Job spans 3 host(s) Warning, failed to locate ranks [3,6] Warning, remote process name differs across ranks name : ranks mpirun : [0] pp_sndrcv_spbl : [1-2,4-5,7] Warning, remote process state differs across ranks state : ranks R (running) : [2] S (sleeping) : [0-1,4-5,7] Mode 'stack' mode specific options: gdb_retry_count : '3' max_distinct_values : '3' stack_shows_locals : '0' stack_shows_params : '0' stack_strip_above : 'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress' stack_strip_below : 'main,__libc_start_main,start_thread' strip_above_wait : '1' strip_below_main : '1' ----------------- [0] (1 processes) ----------------- main() at main.c:13 orterun() at orterun.c:686 opal_event_dispatch() at ?:? opal_event_base_loop() at ?:? 
poll_dispatch() at ?:? poll() at ?:? ??() at ?:? ----------------- [1-2,4-5,7] (5 processes) ----------------- ThreadId: 1 ----------------- [1,4-5,7] (4 processes) ----------------- main() at pp_sndrcv_spbl.c:53 PMPI_Finalize() at ?:? ompi_mpi_finalize() at ?:? barrier() at ?:? opal_progress() at ?:? ThreadId: 2 start_thread() at ?:? btl_openib_async_thread() at ?:? poll() at ?:? ??() at ?:? ThreadId: 3 start_thread() at ?:? service_thread_start() at ?:? __GC___select() at ?:? ??() at ?:? ----------------- [2] (1 processes) ----------------- main() at pp_sndrcv_spbl.c:49 PMPI_Recv() at ?:? mca_pml_ob1_recv() at ?:? opal_progress() at ?:? ThreadId: 2 start_thread() at ?:? btl_openib_async_thread() at ?:? poll() at ?:? ??() at ?:? ThreadId: 3 start_thread() at ?:? service_thread_start() at ?:? __GC___select() at ?:? ??() at ?:? result from parallel command is 0 (state=shutdown) [thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose --proc-summary Loading config from "/etc/padb.conf" Loading config from "/home_nfs/thipa/.padbrc" Loading config from environment Loading config from command line Setting 'rmgr' to 'slurm' Setting 'stack_shows_locals' to 'no' Setting 'stack_shows_params' to 'no' padbr345P: Error: no jobs specified, use --all or jobids [thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose --proc-summary -a Loading config from "/etc/padb.conf" Loading config from "/home_nfs/thipa/.padbrc" Loading config from environment Loading config from command line Setting 'rmgr' to 'slurm' Setting 'stack_shows_locals' to 'no' Setting 'stack_shows_params' to 'no' Active jobs (1) are 27834 Collecting information for job '27834' Attaching to job 27834 Job has 1 process(es) Job spans 3 host(s) Warning, failed to locate ranks [3,6] Warning, remote process name differs across ranks name : ranks mpirun : [0] pp_sndrcv_spbl : [1-2,4-5,7] Warning, remote process state 
differs across ranks state : ranks R (running) : [2] S (sleeping) : [0-1,4-5,7] Mode 'proc_summary' mode specific options: column_seperator : ' ' nprocs_output : undef proc_format : 'rank,hostname,pid,vmsize,vmrss,stat.state=S,load1=uptime,pcpu=%cpu,stat.processor=lcore,name=command' proc_show_header : '1' proc_shows_fds : '0' proc_shows_maps : '0' proc_shows_proc : '1' proc_shows_stat : '1' proc_sort_key : 'rank' reverse_sort_order : '0' rank hostname pid vmsize vmrss S uptime %cpu lcore command 0 vb8 22210 16320 kB 13952 kB S 0.00 0 3 mpirun 1 vb9 14985 112384 kB 25600 kB S 0.08 0 5 pp_sndrcv_spbl 2 vb10 9540 133440 kB 47296 kB R 1.15 99 1 pp_sndrcv_spbl 4 vb9 14986 111616 kB 25600 kB S 0.08 0 5 pp_sndrcv_spbl 5 vb10 9544 111616 kB 25600 kB S 1.15 0 0 pp_sndrcv_spbl 7 vb9 14987 112640 kB 25728 kB S 0.08 0 0 pp_sndrcv_spbl result from parallel command is 0 (state=shutdown) [thipa at vb0 openmpi]$ All processes alive: ssh vb8 [thipa at vb8 ~]$ psu PID PPID CMD 22210 22206 /home_nfs/thipa/openMPI/install/bin/mpirun -bynode -n 8 ./pp_sndrcv_ 22213 22210 srun --nodes=2 --ntasks=2 --kill-on-bad-exit --nodelist=vb9,vb10 ort 22218 22210 ./pp_sndrcv_spbl 22219 22210 ./pp_sndrcv_spbl 22220 22210 ./pp_sndrcv_spbl 22990 22986 sshd: thipa at pts/6 22991 22990 -bash 23021 22991 ps -o pid,ppid,cmd -u thipa [thipa at vb8 ~]$ ssh vb9 [thipa at vb9 ~]$ psu PID PPID CMD 14982 14978 /home_nfs/thipa/openMPI/install/bin/orted -mca ess slurm -mca orte_e 14985 14982 ./pp_sndrcv_spbl 14986 14982 ./pp_sndrcv_spbl 14987 14982 ./pp_sndrcv_spbl 15776 15772 sshd: thipa at pts/6 15777 15776 -bash 15807 15777 ps -o pid,ppid,cmd -u thipa [thipa at vb9 ~]$ ssh vb10 [thipa at vb10 ~]$ psu PID PPID CMD 9531 9527 /home_nfs/thipa/openMPI/install/bin/orted -mca ess slurm -mca orte_e 9534 9531 ./pp_sndrcv_spbl 9535 9531 ./pp_sndrcv_spbl 10513 10509 sshd: thipa at pts/4 10514 10513 -bash 10544 10514 ps -o pid,ppid,cmd -u thipa [thipa at vb10 ~]$ You have mpirun which has rank0, this shouldn't, and you 
miss 3,6. Now the other test that works: Combination: salloc mpirun -bynode -n 8 my_prog The test: [thipa at vb0 openmpi]$ salloc -p jlg -w vb8,vb9,vb10 salloc: Granted job allocation 27835 [thipa at vb0 openmpi]$ [thipa at vb0 openmpi]$ [thipa at vb0 openmpi]$ [thipa at vb0 openmpi]$ mpirun -bynode -n 8 ./pp_sndrcv_spbl I am, process 1 starting on vb9, total by srun 8 I am, process 4 starting on vb9, total by srun 8 I am, process 0 starting on vb8, total by srun 8 I am, process 6 starting on vb8, total by srun 8 I am, process 7 starting on vb9, total by srun 8 I am, process 2 starting on vb10, total by srun 8 I am, process 5 starting on vb10, total by srun 8 I am, process 3 starting on vb8, total by srun 8 Me, process 0, send 1000 to process 2 Padb test: [thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm --verbose --proc-summary -a Loading config from "/etc/padb.conf" Loading config from "/home_nfs/thipa/.padbrc" Loading config from environment Loading config from command line Setting 'rmgr' to 'slurm' Active jobs (1) are 27835 Collecting information for job '27835' Attaching to job 27835 Job has 3 process(es) Job spans 3 host(s) Warning, remote process state differs across ranks state : ranks R (running) : [2] S (sleeping) : [0-1,3-7] Mode 'proc_summary' mode specific options: column_seperator : ' ' nprocs_output : undef proc_format : 'rank,hostname,pid,vmsize,vmrss,stat.state=S,load1=uptime,pcpu=%cpu,stat.processor=lcore,name=command' proc_show_header : '1' proc_shows_fds : '0' proc_shows_maps : '0' proc_shows_proc : '1' proc_shows_stat : '1' proc_sort_key : 'rank' reverse_sort_order : '0' rank hostname pid vmsize vmrss S uptime %cpu lcore command 0 vb8 23049 133440 kB 47104 kB S 0.00 0 5 pp_sndrcv_spbl 1 vb9 15828 112640 kB 25408 kB S 0.00 0 0 pp_sndrcv_spbl 2 vb10 10571 134464 kB 47168 kB R 0.92 100 0 pp_sndrcv_spbl 3 vb8 23058 111616 kB 25536 kB S 0.00 0 2 pp_sndrcv_spbl 4 vb9 15845 111616 kB 25408 kB S 0.00 0 0 pp_sndrcv_spbl 5 vb10 10575 111616 kB 25408 kB S 
0.92 0 1 pp_sndrcv_spbl 6 vb8 23054 111616 kB 25408 kB S 0.00 0 0 pp_sndrcv_spbl 7 vb9 15830 111616 kB 25408 kB S 0.00 0 0 pp_sndrcv_spbl result from parallel command is 0 (state=shutdown) [thipa at vb0 openmpi]$ [thipa at vb0 openmpi]$ [thipa at vb0 openmpi]$ [thipa at vb0 openmpi]$ [thipa at vb0 openmpi]$ [thipa at vb0 openmpi]$ padbr345P -O rmgr=slurm -O stack-shows-locals=no -O stack-shows-params=no --verbose -tx 27835 Loading config from "/etc/padb.conf" Loading config from "/home_nfs/thipa/.padbrc" Loading config from environment Loading config from command line Setting 'rmgr' to 'slurm' Setting 'stack_shows_locals' to 'no' Setting 'stack_shows_params' to 'no' Collecting information for job '27835' Attaching to job 27835 Job has 3 process(es) Job spans 3 host(s) Warning, remote process state differs across ranks state : ranks R (running) : [2] S (sleeping) : [0-1,3-7] Mode 'stack' mode specific options: gdb_retry_count : '3' max_distinct_values : '3' stack_shows_locals : '0' stack_shows_params : '0' stack_strip_above : 'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress' stack_strip_below : 'main,__libc_start_main,start_thread' strip_above_wait : '1' strip_below_main : '1' ----------------- [0-7] (8 processes) ----------------- ThreadId: 1 ----------------- [0-1,3-7] (7 processes) ----------------- main() at pp_sndrcv_spbl.c:53 PMPI_Finalize() at ?:? ompi_mpi_finalize() at ?:? barrier() at ?:? opal_progress() at ?:? ThreadId: 2 start_thread() at ?:? btl_openib_async_thread() at ?:? poll() at ?:? ??() at ?:? ThreadId: 3 start_thread() at ?:? service_thread_start() at ?:? __GC___select() at ?:? ??() at ?:? ----------------- [2] (1 processes) ----------------- main() at pp_sndrcv_spbl.c:49 PMPI_Recv() at ?:? mca_pml_ob1_recv() at ?:? ThreadId: 2 start_thread() at ?:? btl_openib_async_thread() at ?:? poll() at ?:? ??() at ?:? ThreadId: 3 start_thread() at ?:? service_thread_start() at ?:? __GC___select() at ?:? ??() at ?:? 
result from parallel command is 0 (state=shutdown)
[thipa at vb0 openmpi]$

Thipadin.

From ashley at pittman.co.uk Thu Dec 3 11:08:31 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Thu, 03 Dec 2009 11:08:31 +0000
Subject: [padb] Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To:
References:
Message-ID: <1259838511.6352.111.camel@alpha>

I'm just running out of the door myself and will be away until Sunday now.

On Thu, 2009-12-03 at 11:45 +0100, thipadin.seng-long at bull.net wrote:
> You have mpirun which has rank0, this shouldn't, and you miss 3,6.

Ranks 3 and 6 are on the same node as rank 0. Can you try the following
additional patch, which should cause it to skip over the mpirun process
and look for local ones based on their environment?

If this patch doesn't work, take a look at the contents of
/proc/$pid/status for the process it's erroneously reporting as rank 0
to see what Name is set to. In the example you sent it's pid 22210.

--- padb-slurm-open-3 2009-12-03 11:03:08.500044734 +0000
+++ padb 2009-12-03 11:03:15.333036493 +0000
@@ -8187,6 +8187,7 @@
         next unless ( $job eq $jobid );
         next unless ( $step == $inner_conf{slurm_job_step} );
         next if( find_from_status( $pid, 'Name' ) eq 'orted');
+        next if( find_from_status( $pid, 'Name' ) eq 'mpirun');
         maybe_show_pid( $global, $pid );
         $found_target = 1;
     }

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From thipadin.seng-long at bull.net Thu Dec 3 12:20:47 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Thu, 3 Dec 2009 13:20:47 +0100
Subject: [padb] Réf. : Re: Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
Message-ID:

Hi, good holidays there?

I have applied the patch below.
It works now:

padbr345P -O rmgr=slurm --proc-summary -a
Warning, remote process state differs across ranks
state : ranks
R (running) : [2]
S (sleeping) : [0-1,3-7]
rank hostname   pid    vmsize    vmrss S uptime %cpu lcore command
   0 vb8      24595 133440 kB 47296 kB S   0.01    0     0 pp_sndrcv_spbl
   1 vb9      17406 111616 kB 25536 kB S   0.01    0     0 pp_sndrcv_spbl
   2 vb10     12521 133440 kB 47296 kB R   0.93   99     1 pp_sndrcv_spbl
   3 vb8      24588 111616 kB 25728 kB S   0.01    0     2 pp_sndrcv_spbl
   4 vb9      17411 111616 kB 25600 kB S   0.01    0     5 pp_sndrcv_spbl
   5 vb10     12522 111616 kB 25600 kB S   0.93    0     0 pp_sndrcv_spbl
   6 vb8      24589 111616 kB 25600 kB S   0.01    0     3 pp_sndrcv_spbl
   7 vb9      17407 112640 kB 25728 kB S   0.01    0     0 pp_sndrcv_spbl
[thipa at vb0 openmpi]$

Thipadin.

Ashley Pittman
12/03/2009 12:08 PM

To: thipadin.seng-long at bull.net
cc: florence.vallee at bull.net, francois.wellenreiter at bull.net, padb-devel at pittman.org.uk, Sylvain.JEAUGEY at bull.net
Subject: Re: Réf. : Re: [padb] Patch of support of Slurm + Openmpi Orte manager

I'm just running out of the door myself and will be away until Sunday now.

On Thu, 2009-12-03 at 11:45 +0100, thipadin.seng-long at bull.net wrote:
> You have mpirun which has rank0, this shouldn't, and you miss 3,6.

Ranks 3 and 6 are on the same node as rank 0. Can you try the following
additional patch, which should cause it to skip over the mpirun process
and look for local ones based on their environment?

If this patch doesn't work, take a look at the contents of
/proc/$pid/status for the process it's erroneously reporting as rank 0
to see what Name is set to.
In the example you sent it's pid 22210.

--- padb-slurm-open-3 2009-12-03 11:03:08.500044734 +0000
+++ padb 2009-12-03 11:03:15.333036493 +0000
@@ -8187,6 +8187,7 @@
         next unless ( $job eq $jobid );
         next unless ( $step == $inner_conf{slurm_job_step} );
         next if( find_from_status( $pid, 'Name' ) eq 'orted');
+        next if( find_from_status( $pid, 'Name' ) eq 'mpirun');
         maybe_show_pid( $global, $pid );
         $found_target = 1;
     }

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From sylvain.jeaugey at bull.net Thu Dec 3 12:53:14 2009
From: sylvain.jeaugey at bull.net (Sylvain Jeaugey)
Date: Thu, 3 Dec 2009 13:53:14 +0100 (CET)
Subject: [padb] Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To:
References:
Message-ID:

Thipadin,

I don't understand why this combination is being discussed?

On Thu, 3 Dec 2009, thipadin.seng-long at bull.net wrote:
> salloc srun -n 1 mpirun -bynode -n 8 my_prog

The "srun -n 1" is useless and I doubt anyone would ever try to use it.
Is there a reason behind this?

Sylvain

From thipadin.seng-long at bull.net Fri Dec 4 09:13:25 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Fri, 4 Dec 2009 10:13:25 +0100
Subject: [padb] Réf. : Re: Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
Message-ID:

On 12/03/2009 01:53 PM Sylvain Jeaugey wrote:
> I don't understand why this combination is being discussed?
> On Thu, 3 Dec 2009, thipadin.seng-long at bull.net wrote:
> > salloc srun -n 1 mpirun -bynode -n 8 my_prog
> The "srun -n 1" is useless and I doubt anyone would ever try to use it. Is
> there a reason behind this?

There is no reason behind this, but padb should support all kinds of jobs.
"srun -n 1 mpirun" is equivalent to "mpirun", I do admit.
But no one can prevent somebody from starting jobs like this, since the
syntax is correct. So if someone starts jobs like this, padb should be
able to support it. That's the way to make padb rich. That's my point
of view. In terms of source code, there is just one line added by Ashley
and it works.

Regards.
Thipadin.

Sylvain Jeaugey
12/03/2009 01:53 PM

To: thipadin.seng-long at bull.net
cc: Ashley Pittman, florence.vallee at bull.net, francois.wellenreiter at bull.net, padb-devel at pittman.org.uk, Sylvain.JEAUGEY at bull.net
Subject: Re: Réf. : Re: [padb] Patch of support of Slurm + Openmpi Orte manager

Thipadin,

I don't understand why this combination is being discussed?

On Thu, 3 Dec 2009, thipadin.seng-long at bull.net wrote:
> salloc srun -n 1 mpirun -bynode -n 8 my_prog

The "srun -n 1" is useless and I doubt anyone would ever try to use it.
Is there a reason behind this?

Sylvain

From sylvain.jeaugey at bull.net Fri Dec 4 09:35:28 2009
From: sylvain.jeaugey at bull.net (Sylvain Jeaugey)
Date: Fri, 4 Dec 2009 10:35:28 +0100 (CET)
Subject: [padb] Réf. : Re: Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To:
References:
Message-ID:

On Fri, 4 Dec 2009, thipadin.seng-long at bull.net wrote:
> But no one can prevent somebody from starting jobs like this, since the
> syntax is correct.

Actually we can. The documentation says: use salloc ... mpirun, not
salloc srun -n 1 mpirun. And I wouldn't say that the syntax is correct.
It just *happens* to work. With this command, you're launching this chain:

salloc -> srun -> mpirun -> srun -> MPI processes

We're lucky it works!

> So if someone starts jobs like this, padb should be able to support it.

I disagree. Since this has no added value, I don't see why we should
support it. But if that's only one extra line of code, then let it be ...
Sylvain

From ashley at pittman.co.uk Fri Dec 4 10:05:37 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Fri, 04 Dec 2009 10:05:37 +0000
Subject: [padb] Réf. : Re: Réf. : Re: Patch of support of Slurm + Openmpi Orte manager
In-Reply-To:
References:
Message-ID: <1259921137.3655.6.camel@alpha>

On Fri, 2009-12-04 at 10:35 +0100, Sylvain Jeaugey wrote:
> On Fri, 4 Dec 2009, thipadin.seng-long at bull.net wrote:
>
> > But no one can prevent somebody from starting jobs like this, since the
> > syntax is correct.
> Actually we can. The documentation says: use salloc ... mpirun, not
> salloc srun -n 1 mpirun. And I wouldn't say that the syntax is correct. It
> just *happens* to work. With this command, you're launching this chain:
> salloc -> srun -> mpirun -> srun -> MPI processes
> We're lucky it works!
>
> > So if someone starts jobs like this, padb should be able to support it.
> I disagree. Since this has no added value, I don't see why we should
> support it. But if that's only one extra line of code, then let it be ...

Given that the failure mode was to report the wrong information, I'm
much happier having this code in than not. I'll probably change the
check to "next if is_resmgr_process($pid);", which is a superset of what
the code does now and means this case is handled no differently to the
normal salloc/mpirun case.

The examples given nicely demonstrate the benefit of having the signon
check for different executable names. I know some people do this on
purpose, but it's rare enough that it's worth warning about if padb
observes it.

Ashley,

--
Ashley Pittman, Brighton, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From padb at googlecode.com Mon Dec 7 11:30:59 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 07 Dec 2009 11:30:59 +0000
Subject: [padb] r346 committed - Extend the slurm resource manager code to also work with Orte...
Message-ID: <001636283ae8a97c00047a21ca2f@google.com> Revision: 346 Author: apittman Date: Mon Dec 7 03:29:53 2009 Log: Extend the slurm resource manager code to also work with Orte (OpenMPI) jobs launched under slurm. To run these two together you have to create a slurm allocation and then use the OMPI mpirun from within this allocation to do the application launch using ORTE. In this case "slurm listpids" can't tell you the process identifiers. If slurm listpids gives no information or claims to have launched only further resource managers walk the process tree looking for OMPI processes that have slurm specific environment variables set indicating they belong to this job. http://code.google.com/p/padb/source/detail?r=346 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Wed Dec 2 08:34:27 2009 +++ /trunk/src/padb Mon Dec 7 03:29:53 2009 @@ -8157,10 +8157,16 @@ } sub is_resmgr_process { - my $pid = shift; + my $pid = shift; my $name = find_from_status( $pid, 'Name' ); - my $mgrs = - { rmsloader => 1, slurmd => 1, slurmstepd => 1, pbs_attach => 1 }; + my $mgrs = { + rmsloader => 1, + slurmd => 1, + slurmstepd => 1, + pbs_attach => 1, + orted => 1, + mpirun => 1, + }; return 1 if ( defined $mgrs->{$name} ); return; } @@ -8173,13 +8179,58 @@ my @procs = slurp_cmd("scontrol listpids $jobid.$inner_conf{slurm_job_step}"); + my $found_target; + foreach my $proc (@procs) { my ( $pid, $job, $step, undef, $global ) = split $SPACE, $proc; next if ( $global eq '-' ); next unless ( $job eq $jobid ); next unless ( $step == $inner_conf{slurm_job_step} ); + next if ( is_resmgr_process($pid) ); maybe_show_pid( $global, $pid ); - } + $found_target = 1; + } + return if $found_target; + + # Either we didn't find any processes on this node or we only + # found processes named orted. This could be for two reasons: + # The job step might not be running on this node. + # The job step might be a openmpi salloc/orterun combination. 
+ # If it's the latter then this node could either be the "head" + # node where the mpirun is running or a "remote" node where the + # job will be launched by orted. + + # Search the process list for processes which belong to this job + # and either belong to this job step or don't state which job step + # they belong to. + foreach my $pid ( get_process_list($target_user) ) { + + # Skip over resource manager processes. + next if ( is_resmgr_process($pid) ); + + # Skip over ones which aren't direct descendants of a resource manager + next unless is_parent_resmgr($pid); + + my $vp; + my %env = get_remote_env($pid); + + next unless defined $env{SLURM_JOB_ID}; + next if ( $env{SLURM_JOB_ID} != $jobid ); + + next unless defined $env{OMPI_COMM_WORLD_RANK}; + + # If this is defined check it's correct, it might be missing though. + if ( defined $env{SLURM_JOB_STEP} ) { + next if $env{SLURM_JOB_STEP} != $inner_conf{slurm_job_step}; + } + + if ( defined $env{OMPI_COMM_WORLD_SIZE} ) { + target_key_pair( $vp, "JOB_SIZE", $env{OMPI_COMM_WORLD_SIZE} ); + } + + maybe_show_pid( $env{OMPI_COMM_WORLD_RANK}, $pid ); + } + return; } From ashley at pittman.co.uk Mon Dec 7 11:32:33 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Mon, 07 Dec 2009 11:32:33 +0000 Subject: [padb] =?iso-8859-1?q?R=E9f=2E_=3A_Re=3A_R=E9f=2E_=3A_Re=3A__Patc?= =?iso-8859-1?q?h_of_support_of_Slurm_+_Openmpi_Orte_manager?= In-Reply-To: References: Message-ID: <1260185553.4449.2.camel@alpha> On Thu, 2009-12-03 at 13:20 +0100, thipadin.seng-long at bull.net wrote: > Hi, good holidays, there ? > I have applied the patch below. > It works now: This is committed as r346, slightly modified to call is_resmgr_process() rather than checking the name from /proc/$pid/status as discussed. Let me know if you have any further problems with this and once again thanks for the patch. Ashley, -- Ashley Pittman, Bath, UK. 
Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From padb at googlecode.com Mon Dec 7 12:22:52 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 07 Dec 2009 12:22:52 +0000 Subject: [padb] r347 committed - Slight improvement of error reporting with failing to open the message... Message-ID: <0016e68ea0b7367071047a228427@google.com> Revision: 347 Author: apittman Date: Mon Dec 7 04:22:42 2009 Log: Slight improvement of error reporting with failing to open the message queue DLL. http://code.google.com/p/padb/source/detail?r=347 Modified: /trunk/src/minfo.c /trunk/src/padb ======================================= --- /trunk/src/minfo.c Thu Nov 5 14:07:46 2009 +++ /trunk/src/minfo.c Mon Dec 7 04:22:42 2009 @@ -529,11 +529,14 @@ dlhandle = dlopen(filename,RTLD_NOW); if ( ! dlhandle ) { - show_warning("Unable to dlopen dll with RTLD_NOW, trying LAZY..."); + show_warning("Unable to dlopen dll with RTLD_NOW, trying RTLD_LAZY..."); show_warning(dlerror()); dlhandle = dlopen(filename,RTLD_LAZY); - if ( ! dlhandle ) + if ( ! dlhandle ) { + show_warning("Unable to dlopen dll with RTLD_LAZY, giving up..."); + show_warning(dlerror()); return -1; + } } DLSYM(dll_ep,dlhandle,setup_basic_callbacks); ======================================= --- /trunk/src/padb Mon Dec 7 03:29:53 2009 +++ /trunk/src/padb Mon Dec 7 04:22:42 2009 @@ -6140,6 +6140,7 @@ } } elsif ( $cmd eq 'out:' ) { + $stats{out}++; if ( $r =~ m{\A out: @@ -6173,6 +6174,7 @@ target_key_pair( $vp, 'UNPARSEABLE MINFO', $r ); } } elsif ( $cmd eq 'zzz:' ) { + $stats{zzz}++; if ( $r =~ m{\A zzz: @@ -6201,6 +6203,7 @@ push @cd, dclone( \%cd ); undef %cd; } else { + $stats{raw}++; push @{ $cd{raw} }, $r; } } From padb at googlecode.com Mon Dec 7 12:30:55 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 07 Dec 2009 12:30:55 +0000 Subject: [padb] r348 committed - Add new as-yet unused asyncronous gdb attaching code. Gdb can take a... 
Message-ID: <001485f6d9f4019aae047a22a1fe@google.com>

Revision: 348
Author: apittman
Date: Mon Dec 7 04:30:03 2009
Log: Add new as-yet unused asynchronous gdb attaching code. Gdb can take a
while to attach to a process, especially if it has lots of shared
libraries which need to be loaded off a shared filesystem. This change
adds the option to attach asynchronously, meaning that padb can launch an
instance of gdb for every target process on a node, tell them all to
attach and allow them to proceed in parallel. This should help the
performance a lot, particularly on nodes with lots of cores.

http://code.google.com/p/padb/source/detail?r=348

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Mon Dec 7 04:22:42 2009
+++ /trunk/src/padb Mon Dec 7 04:30:03 2009
@@ -5473,6 +5473,14 @@
         }
         return;
     }
+
+    gdb_attach_post( $gdb, $pid );
+
+    return $pid;
+}
+
+sub gdb_attach_post {
+    my ( $gdb, $pid ) = @_;
     $gdb->{attached} = 1;
     $gdb->{tracepid} = $pid;
@@ -5493,6 +5501,52 @@
     gdb_n_send( $gdb, '-gdb-set print address off' );
+}
+
+sub gdb_attach_async_start {
+    my ( $gdb, $pid ) = @_;
+
+    if ($running_on_solaris) {
+        my $exe = readlink("/proc/$pid/path/a.out");
+        my %cs = gdb_n_send( $gdb, "file $exe" );
+        if ( $cs{status} ne 'done' ) {
+            croak("Gdb command file $exe failed");
+            return;
+        }
+    }
+
+    send_cont_signal($pid);
+
+    _gdb_send_real_async_start( $gdb, "attach $pid" );
+
+    return;
+}
+
+sub gdb_attach_async_end {
+    my ( $gdb, $pid ) = @_;
+
+    my %p = _gdb_send_real_async_wait( $gdb, "attach $pid" );
+
+    if ( not defined $p{status} ) {
+        $gdb->{error} = 'Failed to attach to process';
+        if ( not find_exe('gdb') ) {
+            $gdb->{error} = 'Failed to attach to process (gdb not installed?)';
+        }
+        return;
+    }
+
+    if ( $p{status} eq 'error' ) {
+        my $r = gdb_parse_reason( $p{reason} );
+        if ( defined $r->{msg} ) {
+            $gdb->{error} = "Failed to attach to process: $r->{msg}";
+        } else {
+            $gdb->{error} = 'Failed to attach to process';
+        }
+        return;
+    }
+ + gdb_attach_post( $gdb, $pid ); + return $pid; } @@ -5543,6 +5597,34 @@ } return %r; } + +sub _gdb_send_real_async_start { + my ( $gdb, $cmd ) = @_; + gdb_wait_for_prompt($gdb); + my $handle = $gdb->{wtr}; + my $seq = $gdb->{seq}++; + print {$handle} "$seq$cmd\n"; + if ( defined $gdb->{debugfd} ) { + print { $gdb->{debugfd} } "$seq$cmd\n"; + } + return; +} + +sub _gdb_send_real_async_wait { + my ( $gdb, $cmd ) = @_; + my $seq = $gdb->{seq}; + my %r = gdb_n_next_result( $gdb, $seq ); + if ( $gdb->{attached} and $r{seq} ne $seq ) { + croak( +"Invalid sequence number from gdb, expecting $seq got $r{seq} cmd=\"$cmd\"" + ); + } + $r{cmd} = $cmd; + if ( $gdb->{debugfd} and defined $r{status} and $r{status} ne 'done' ) { + print Dumper \%r; + } + return %r; +} sub _gdb_set_print_address { my ( $gdb, $flag ) = @_; From padb at googlecode.com Mon Dec 7 12:42:00 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 07 Dec 2009 12:42:00 +0000 Subject: [padb] r349 committed - Add a new type of callback for the mode to use, handler_all is... Message-ID: <0016367657e6a0c07d047a22c8cb@google.com> Revision: 349 Author: apittman Date: Mon Dec 7 04:41:09 2009 Log: Add a new type of callback for the mode to use, handler_all is intended to replace handler and is called once for each target process. It differs from handler in that it takes the same options as the handler_all command. In time I'd like to drop handler for simplicity. http://code.google.com/p/padb/source/detail?r=349 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Dec 7 04:30:03 2009 +++ /trunk/src/padb Mon Dec 7 04:41:09 2009 @@ -8791,18 +8791,20 @@ # or any other rank on this node, we'll have to see if that causes # problems or if it's best to clear the target_key_pair() and output() # data for this node/rank. 
+ + # Bit of a hack here until I can fix it properly, pass on the + # output format so that the stack trace code knows when to do + # clever things in tree mode. + my $cargs = $cmd->{cargs}; + if ( defined $cmd->{out_format} ) { + $cargs->{out_format} = $cmd->{out_format}; + } else { + $cargs->{out_format} = 'raw'; + } + if ( defined $allfns{ $cmd->{mode} }{handler_all} ) { eval { - # Bit of a hack here until I can fix it properly, pass on the - # output format so that the stack trace code knows when to do - # clever things in tree mode. - my $cargs = $cmd->{cargs}; - if ( defined $cmd->{out_format} ) { - $cargs->{out_format} = $cmd->{out_format}; - } else { - $cargs->{out_format} = 'raw'; - } $netdata->{target_response} = $allfns{ $cmd->{mode} }{handler_all}( $cargs, $pid_list ); 1; @@ -8821,9 +8823,19 @@ my $vp = $proc->{vp}; my $pid = $proc->{pid}; eval { - my $res = - $allfns{ $cmd->{mode} }{handler}( $cmd->{cargs}, $vp, $pid ); - $gres{$vp} = $res if ( defined $res ); + + # The only difference here is the type of the first option, + # all functions should be converted to a single format here + if ( defined $allfns{ $cmd->{mode} }{handler_one} ) { + my $res = + $allfns{ $cmd->{mode} }{handler_one}( $cargs, $proc ); + $gres{$vp} = $res if ( defined $res ); + } else { + my $res = + $allfns{ $cmd->{mode} }{handler}( $cmd->{cargs}, $vp, + $pid ); + $gres{$vp} = $res if ( defined $res ); + } 1; } or do { my $error = $@; From padb at googlecode.com Mon Dec 7 12:58:20 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 07 Dec 2009 12:58:20 +0000 Subject: [padb] r350 committed - Add global_attach and global_detach functions for looking after gdb... 
Message-ID: <001636c5bda31244ea047a2303ca@google.com>

Revision: 350
Author: apittman
Date: Mon Dec 7 04:57:39 2009
Log: Add global_attach and global_detach functions for looking after gdb
handles. If the mode can tell padb whether it needs to attach to the
target processes, then padb can perform this operation for it in an
optimal way, keeping all the code common. Allows persistent attachment
over multiple mode calls as well.

http://code.google.com/p/padb/source/detail?r=350

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Mon Dec 7 04:41:09 2009
+++ /trunk/src/padb Mon Dec 7 04:57:39 2009
@@ -4065,14 +4065,15 @@
     }
     my $cmd;
+    my $req;
     if ($watch) {
         $cmd = $commands[0];
+        $req->{detach_after_callback} = 1;
     } else {
         $cmd = shift @commands;
     }
-    my $req;
     $req->{mode} = $cmd->{mode};
     if ( defined $cmd->{args} ) {
@@ -7648,6 +7649,67 @@
     }
     return;
 }
+
+# Attach to all local processes in preparation for calling the mode
+# callback. This function, along with the corresponding one below it
+# handles persistent attachment between modes: modes specify if they
+# want gdb handles or not, if they do then this function attaches for
+# it, if they don't and gdb is attached this function will detach.
+sub global_attach { + my ( $mode, $procs ) = @_; + + if ( not $allfns{$mode}{needs_gdb} ) { + global_detach($procs); + return; + } + + foreach my $proc ( @{$procs} ) { + my $vp = $proc->{vp}; + my $pid = $proc->{pid}; + + next if defined $proc->{gdb_handle}; + + $proc->{gdb_tmp} = gdb_start(); + gdb_attach_async_start( $proc->{gdb_tmp}, $pid ); + } + + foreach my $proc ( @{$procs} ) { + + next if defined $proc->{gdb_handle}; + + my $vp = $proc->{vp}; + my $pid = $proc->{pid}; + my $gdb = $proc->{gdb_tmp}; + + delete $proc->{gdb_tmp}; + + if ( gdb_attach_async_end( $gdb, $pid ) ) { + $proc->{gdb_handle} = $gdb; + } else { + if ( defined $gdb->{error} ) { + target_error( $vp, $gdb->{error} ); + } else { + target_error( $vp, 'Failed to attach to process' ); + } + gdb_quit($gdb); + } + } + + return; +} + +# Detach from all local processes, this function is called from both +# global_attach and also when padb is exiting. +sub global_detach { + my ($procs) = @_; + + foreach my $proc ( @{$procs} ) { + if ( defined $proc->{gdb_handle} ) { + gdb_detach( $proc->{gdb_handle} ); + delete $proc->{gdb_handle}; + } + } +} # Try and be clever here, attach to each and every process on this node # first, then go back and query them each in turn, should mean that some @@ -8742,6 +8804,7 @@ } if ( $cmd->{mode} eq 'exit' ) { + global_detach( $inner_conf{all_pids} ); $netdata->{shutdown} = 1; return; } @@ -8801,6 +8864,10 @@ } else { $cargs->{out_format} = 'raw'; } + + # Ensure that we are attached to the target processes if required + # and that we are not if not required. + global_attach( $cmd->{mode}, $pid_list ); if ( defined $allfns{ $cmd->{mode} }{handler_all} ) { eval { @@ -8849,6 +8916,11 @@ $netdata->{target_response} = \%gres; } } + + # Detach from all processes if the outer requested us to. 
+    if ( defined $cmd->{detach_after_callback} ) {
+        global_detach( $cmd->{mode}, $pid_list );
+    }
     return;
 }

From padb at googlecode.com Mon Dec 7 13:20:42 2009
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 07 Dec 2009 13:20:42 +0000
Subject: [padb] r351 committed - Convert a number of mode callbacks from using handler_all and having...
Message-ID: <001636c9274f118768047a235347@google.com>

Revision: 351
Author: apittman
Date: Mon Dec 7 05:19:40 2009
Log: Convert a number of mode callbacks from using handler_all and having
the handler attach to the target processes to using handler_one and
setting needs_gdb to have padb do the attach. This means the attach is
both quicker (it's done asynchronously for all targets on the local node
at the same time) and the attachment is synchronous across different
modes. The end result of this is that the wall-clock performance of padb
is improved, by up to 50% if --full-report is being used.

http://code.google.com/p/padb/source/detail?r=351

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Mon Dec 7 04:57:39 2009
+++ /trunk/src/padb Mon Dec 7 05:19:40 2009
@@ -6441,91 +6441,35 @@
     return;
 }
-sub show_mpi_queue_all {
-    my ( $carg, $list ) = @_;
-
-    my @all;
-
-    foreach my $proc ( @{$list} ) {
-        my $vp = $proc->{vp};
-        my $pid = $proc->{pid};
-
-        my $gdb = gdb_start();
-        if ( gdb_attach( $gdb, $pid ) ) {
-            $proc->{gdb} = $gdb;
-            push @all, $proc;
-        } else {
-            if ( defined $gdb->{error} ) {
-                target_error( $vp, $gdb->{error} );
-            } else {
-                target_error( $vp, 'Failed to attach to process' );
-            }
-        }
-
+sub show_mpi_queue_one {
+    my ( $carg, $proc ) = @_;
+
+    my $vp = $proc->{vp};
+    my $pid = $proc->{pid};
+    my $gdb = $proc->{gdb_handle};
+
+    return unless $gdb;
+
+    my @mq = fetch_mpi_queue_gdb( $carg, $vp, $pid, $gdb );
+
+    foreach my $o (@mq) {
+        output( $vp, $o );
     }
-    foreach my $proc (@all) {
-
-        my $vp = $proc->{vp};
-        my $pid = $proc->{pid};
-        my $gdb = $proc->{gdb};
-
-        my @mq =
fetch_mpi_queue_gdb( $carg, $vp, $pid, $gdb ); - if ( $mq[0] ) { - foreach my $o (@mq) { - output( $vp, $o ); - } - } - } - - foreach my $proc (@all) { - my $gdb = $proc->{gdb}; - gdb_detach($gdb); - gdb_quit($gdb); - } return; } -# Ideally handle all this at a higher level... -sub show_mpi_queue_for_deadlock_all { - my ( $carg, $list ) = @_; - - my $ret; - my @all; - - foreach my $proc ( @{$list} ) { - my $vp = $proc->{vp}; - my $pid = $proc->{pid}; - - my $gdb = gdb_start(); - if ( gdb_attach( $gdb, $pid ) ) { - $proc->{gdb} = $gdb; - push @all, $proc; - } else { - output( $vp, 'Failed to attach to to process' ); - } - - } - - foreach my $proc (@all) { - my $tries = 0; - - my @threads; - - my $vp = $proc->{vp}; - my $pid = $proc->{pid}; - my $gdb = $proc->{gdb}; - - my @mq = fetch_mpi_queue_gdb( $carg, $vp, $pid, $gdb ); - $ret->{$vp} = \@mq; - } - - foreach my $proc (@all) { - my $gdb = $proc->{gdb}; - gdb_detach($gdb); - gdb_quit($gdb); - } - return $ret; +sub show_mpi_queue_for_deadlock_one { + my ( $carg, $proc ) = @_; + + my $vp = $proc->{vp}; + my $pid = $proc->{pid}; + my $gdb = $proc->{gdb_handle}; + + return unless $gdb; + + my @mq = fetch_mpi_queue_gdb( $carg, $vp, $pid, $gdb ); + return \@mq; } sub mpi_queue_output_handler { @@ -6659,8 +6603,8 @@ my $gstr = "Information for group '$gid' ($ad{$gid}{name})\n"; - # Maybe show the group members, hope that the user doesn't turn - # this on unless also setting target_groups! + # Maybe show the group members, hope that the user doesn't + # turn this on unless also setting target_groups! if ( $carg->{show_group_members} ) { $gstr .= "group has $ad{$gid}{size} members\n"; if ( defined $ad{$gid}{size} ) { @@ -7726,8 +7670,8 @@ # loop in but don't sleep every iteration. This could be handled better by # checking for the presence of one of the stack_strip_below functions in # the stack trace. 
-sub stack_trace_from_pids { - my ( $carg, $list ) = @_; +sub stack_trace_from_pid { + my ( $carg, $proc ) = @_; my @all; @@ -7747,202 +7691,181 @@ $below{$_} = 1; } - foreach my $proc ( @{$list} ) { - my $vp = $proc->{vp}; - my $pid = $proc->{pid}; - - my $gdb = gdb_start(); - if ( gdb_attach( $gdb, $pid ) ) { - $proc->{gdb} = $gdb; - push @all, $proc; - } else { - if ( defined $gdb->{error} ) { - target_error( $vp, $gdb->{error} ); - } else { - target_error( $vp, 'Failed to attach to process' ); - } - } - - } - - foreach my $proc (@all) { - my $tries = 0; - - my @threads; - - my $vp = $proc->{vp}; - my $pid = $proc->{pid}; - my $gdb = $proc->{gdb}; - - my $ok; - do { - - # The first time round the loop we will have a gdb handle from - # above, only re-attach if we have already failed on the first - # try and are here a second time. - if ( not defined $gdb ) { - send_cont_signal($pid); - my $g = gdb_start(); - if ( gdb_attach( $g, $pid ) ) { - $gdb = $g; + return unless defined $proc->{gdb_handle}; + + my $tries = 0; + + my @threads; + + my $vp = $proc->{vp}; + my $pid = $proc->{pid}; + my $gdb = $proc->{gdb_handle}; + + my $ok; + do { + + # The first time round the loop we will have a gdb handle from + # above, only re-attach if we have already failed on the first + # try and are here a second time. 
+ if ( $tries > 0 ) { + gdb_detach($gdb); + gdb_quit($gdb); + delete $proc->{gdb_handle}; + send_cont_signal($pid); + $gdb = gdb_start(); + if ( gdb_attach( $gdb, $pid ) ) { + $proc->{gdb_attach} = $gdb; + } else { + if ( defined $gdb->{error} ) { + target_error( $vp, $gdb->{error} ); } else { - if ( defined $g->{error} ) { - target_error( $vp, $g->{error} ); - } else { - target_error( $vp, 'Failed to attach to process' ); - } - } - } - - if ( defined $gdb ) { - if ( $carg->{stack_shows_params} - or $carg->{stack_shows_locals} ) - { - @threads = gdb_dump_frames_per_thread( $gdb, 1 ); - } else { - @threads = gdb_dump_frames_per_thread($gdb); - } - gdb_detach($gdb); - gdb_quit($gdb); - $gdb = undef; - if ( defined $threads[0]->{frames} ) { - my @frames = @{ $threads[0]->{frames} }; - foreach my $frame (@frames) { - if ( defined $frame->{func} - and defined $below{ $frame->{func} } ) - { - $ok = 1; - last; - } - } + target_error( $vp, 'Failed to attach to process' ); + } + gdb_quit($gdb); + return; + } + } + + if ( $carg->{stack_shows_params} + or $carg->{stack_shows_locals} ) + { + @threads = gdb_dump_frames_per_thread( $gdb, 1 ); + } else { + @threads = gdb_dump_frames_per_thread($gdb); + } + + if ( defined $threads[0]->{frames} ) { + my @frames = @{ $threads[0]->{frames} }; + foreach my $frame (@frames) { + if ( defined $frame->{func} + and defined $below{ $frame->{func} } ) + { + $ok = 1; + last; } } - } while ( ( not $ok ) - and ( $tries++ < $carg->{gdb_retry_count} ) ); - - if ( not defined $threads[0]{id} ) { - target_error( $vp, - 'Could not extract stack trace from application' ); - next; } - if ( defined $threads[0]{error} ) { - target_error( $vp, $threads[0]{error} ); - next; - } - - foreach my $thread ( sort { $a->{id} <=> $b->{id} } @threads ) { - next unless defined $thread->{frames}; - my @frames = @{ $thread->{frames} }; - - output( $vp, "ThreadId: $thread->{id}" ) if ( @threads != 1 ); - - my $strip_below; - - # Find a function to strip above. 
Only actually enable this if - # there is a function present which we are targeting or else no - # output will be generated! Do this in reverse order so we - # strip as much as possible from the stack trace. - if ( $carg->{strip_below_main} ) { - foreach my $frame ( reverse @frames ) { - next unless exists $frame->{func}; - if ( defined $below{ $frame->{func} } ) { - $strip_below = $frame->{func}; - } + } while ( ( not $ok ) + and ( $tries++ < $carg->{gdb_retry_count} ) ); + + if ( not defined $threads[0]{id} ) { + target_error( $vp, 'Could not extract stack trace from application' ); + return; + } + + if ( defined $threads[0]{error} ) { + target_error( $vp, $threads[0]{error} ); + return; + } + + foreach my $thread ( sort { $a->{id} <=> $b->{id} } @threads ) { + next unless defined $thread->{frames}; + my @frames = @{ $thread->{frames} }; + + output( $vp, "ThreadId: $thread->{id}" ) if ( @threads != 1 ); + + my $strip_below; + + # Find a function to strip above. Only actually enable this if + # there is a function present which we are targeting or else no + # output will be generated! Do this in reverse order so we + # strip as much as possible from the stack trace. + if ( $carg->{strip_below_main} ) { + foreach my $frame ( reverse @frames ) { + next unless exists $frame->{func}; + if ( defined $below{ $frame->{func} } ) { + $strip_below = $frame->{func}; } } - - my @fl = $EMPTY_STRING; - foreach my $frame ( reverse @frames ) { - - target_error( $vp, "error from gdb: $frame->{error}" ) - if exists $frame->{error}; - - next unless exists $frame->{level}; - next unless exists $frame->{func}; - - # This seemingly always gets set by gdb even if it is - # sometimes set to '??' - my $function = $frame->{func}; - - next if ( defined $strip_below and $strip_below ne $function ); - - $strip_below = undef; - - my $l = sprintf "%s() at %s:%s", - $function, - ( $frame->{file} || '?' ), - ( $frame->{line} || '?' 
); - - output( $vp, $l ); - - if ( $carg->{out_format} eq 'tree' ) { - push @fl, $l; - my $fl = join( ",", @fl ); - if ( $carg->{stack_shows_locals} ) { - my @local_names; - foreach my $loc ( @{ $frame->{locals} } ) { - push @local_names, $loc->{name}; - target_key_pair( $vp, "$l|var_type| $loc->{name}", - $loc->{type} ); - - if ( length $loc->{value} > 70 ) { - target_key_pair( - $vp, - $fl . '|var|' . $loc->{name}, - pretify_variable( - 'value too long to display') - ); - } else { - target_key_pair( $vp, - $fl . '|var|' . $loc->{name}, - $loc->{value} ); - } - } - if ( @local_names > 0 ) { - target_key_pair( $vp, "$l|locals", - join( q{,}, sort @local_names ) ); + } + + my @fl = $EMPTY_STRING; + foreach my $frame ( reverse @frames ) { + + target_error( $vp, "error from gdb: $frame->{error}" ) + if exists $frame->{error}; + + next unless exists $frame->{level}; + next unless exists $frame->{func}; + + # This seemingly always gets set by gdb even if it is + # sometimes set to '??' + my $function = $frame->{func}; + + next if ( defined $strip_below and $strip_below ne $function ); + + $strip_below = undef; + + my $l = sprintf "%s() at %s:%s", + $function, + ( $frame->{file} || '?' ), + ( $frame->{line} || '?' ); + + output( $vp, $l ); + + if ( $carg->{out_format} eq 'tree' ) { + push @fl, $l; + my $fl = join( ",", @fl ); + if ( $carg->{stack_shows_locals} ) { + my @local_names; + foreach my $loc ( @{ $frame->{locals} } ) { + push @local_names, $loc->{name}; + target_key_pair( $vp, "$l|var_type|$loc->{name}", + $loc->{type} ); + + if ( length $loc->{value} > 70 ) { + target_key_pair( + $vp, + $fl . '|var|' . $loc->{name}, + pretify_variable('value too long to display') + ); + } else { + target_key_pair( $vp, $fl . '|var|' . 
$loc->{name}, + $loc->{value} ); } } - if ( $carg->{stack_shows_params} ) { - - my @param_names; - foreach my $par ( @{ $frame->{params} } ) { - push @param_names, $par->{name}; - target_key_pair( $vp, "$l|var_type| $par->{name}", - $par->{type} ); - if ( length $par->{value} > 70 ) { - target_key_pair( - $vp, - $fl . '|var|' . $par->{name}, - pretify_variable( - 'value too long to display') - ); - } else { - target_key_pair( $vp, - $fl . '|var|' . $par->{name}, - $par->{value} ); - } - } - if ( @param_names > 0 ) { - target_key_pair( $vp, "$l|params", - join( q{,}, @param_names ) ); + if ( @local_names > 0 ) { + target_key_pair( $vp, "$l|locals", + join( q{,}, sort @local_names ) ); + } + } + if ( $carg->{stack_shows_params} ) { + + my @param_names; + foreach my $par ( @{ $frame->{params} } ) { + push @param_names, $par->{name}; + target_key_pair( $vp, "$l|var_type|$par->{name}", + $par->{type} ); + if ( length $par->{value} > 70 ) { + target_key_pair( + $vp, + $fl . '|var|' . $par->{name}, + pretify_variable('value too long to display') + ); + } else { + target_key_pair( $vp, $fl . '|var|' . $par->{name}, + $par->{value} ); } } - } else { - if ( $carg->{stack_shows_params} ) { - show_stack_vars( $vp, $frame, 'params' ); - } - if ( $carg->{stack_shows_locals} ) { - show_stack_vars( $vp, $frame, 'locals' ); + if ( @param_names > 0 ) { + target_key_pair( $vp, "$l|params", + join( q{,}, @param_names ) ); } } - - # Strip below this function if we need to. - if ( defined $above{$function} ) { - last; + } else { + if ( $carg->{stack_shows_params} ) { + show_stack_vars( $vp, $frame, 'params' ); + } + if ( $carg->{stack_shows_locals} ) { + show_stack_vars( $vp, $frame, 'locals' ); } } + + # Strip below this function if we need to. 
+ if ( defined $above{$function} ) { + last; + } } } return; @@ -9229,7 +9152,8 @@ }; $allfns{mqueue} = { - handler_all => \&show_mpi_queue_all, + handler_one => \&show_mpi_queue_one, + needs_gdb => 1, arg_long => 'mpi-queue', arg_short => 'Q', help => 'Show MPI message queues', @@ -9237,7 +9161,8 @@ }; $allfns{deadlock} = { - handler_all => \&show_mpi_queue_for_deadlock_all, + handler_one => \&show_mpi_queue_for_deadlock_one, + needs_gdb => 1, arg_long => 'deadlock', arg_short => 'j', help => 'Run deadlock detection algorithm', @@ -9290,7 +9215,8 @@ }; $allfns{stack} = { - handler_all => \&stack_trace_from_pids, + handler_one => \&stack_trace_from_pid, + needs_gdb => 1, arg_long => 'stack-trace', arg_short => 'x', help => 'Show stack trace (see also -t)', From padb at googlecode.com Mon Dec 7 13:34:50 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 07 Dec 2009 13:34:50 +0000 Subject: [padb] r352 committed - Rename maybe_show_pid as register_target_process for clarity. Message-ID: <001636c5bda39aac8c047a2385fd@google.com> Revision: 352 Author: apittman Date: Mon Dec 7 05:34:42 2009 Log: Rename maybe_show_pid as register_target_process for clarity. http://code.google.com/p/padb/source/detail?r=352 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Dec 7 05:19:40 2009 +++ /trunk/src/padb Mon Dec 7 05:34:42 2009 @@ -8180,10 +8180,16 @@ return; } -sub maybe_show_pid { - my ( $vp, $pid ) = @_; - - $inner_conf{rmpids}{$pid}{rank} = $vp; +# To be called from the find_pids resource manager callback to say +# that the specified pid is the specified rank. This process should +# be one spawned by the resource manager, if wrapper scripts are being +# used, say "mpirun -n 2 sh -c myapp" then this function should be +# called with the pid of 'sh', padb will then walk the process tree to +# find the more interesting child process and target that one. 
+sub register_target_process { + my ( $rank, $pid ) = @_; + + $inner_conf{rmpids}{$pid}{rank} = $rank; return; } @@ -8257,7 +8263,7 @@ next unless ( $job eq $jobid ); next unless ( $step == $inner_conf{slurm_job_step} ); next if ( is_resmgr_process($pid) ); - maybe_show_pid( $global, $pid ); + register_target_process( $global, $pid ); $found_target = 1; } return if $found_target; @@ -8298,7 +8304,7 @@ target_key_pair( $vp, "JOB_SIZE", $env{OMPI_COMM_WORLD_SIZE} ); } - maybe_show_pid( $env{OMPI_COMM_WORLD_RANK}, $pid ); + register_target_process( $env{OMPI_COMM_WORLD_RANK}, $pid ); } return; @@ -8340,7 +8346,7 @@ } foreach my $vp ( keys %vps ) { my $pid = $vps{$vp}; - maybe_show_pid( $vp, $pid ); + register_target_process( $vp, $pid ); } return; } @@ -8390,11 +8396,11 @@ foreach my $vp ( keys %vps ) { if ( defined $vps{$vp}{actual} ) { foreach my $pid ( @{ $vps{$vp}{actual} } ) { - maybe_show_pid( $vp, $pid ); + register_target_process( $vp, $pid ); } } else { foreach my $pid ( @{ $vps{$vp}{likely} } ) { - maybe_show_pid( $vp, $pid ); + register_target_process( $vp, $pid ); } } } @@ -8644,7 +8650,7 @@ if ( defined $cmd->{pd} ) { my $hostname = $inner_conf{hostname}; foreach my $rank ( keys %{ $cmd->{pd}{$hostname} } ) { - maybe_show_pid( $rank, $cmd->{pd}{$hostname}{$rank} ); + register_target_process( $rank, $cmd->{pd}{$hostname}{$rank} ); } } else { From thipadin.seng-long at bull.net Mon Dec 7 15:57:37 2009 From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net) Date: Mon, 7 Dec 2009 16:57:37 +0100 Subject: [padb] =?iso-8859-1?q?R=E9f=2E_=3A_Re=3A_R=E9f=2E_=3A_Re=3A_R=E9f?= =?iso-8859-1?q?=2E_=3A_Re=3A_=5B_padb=5D_Patchof_support_of_Slurm_+_Openm?= =?iso-8859-1?q?pi_Orte_manager?= Message-ID: On 12/07/2009 12:32 PM Ashley Pittman wrote: > This is committed as r346, slightly modified to call is_resmgr_process() > rather than checking the name from /proc/$pid/status as discussed. 
> Let me know if you have any further problems with this and once again > thanks for the patch. Yes, I see the change, By the way I have given a try with this r346 version, It's all Ok, with both methods of starting jobs. Thipadin. More later. Ashley Pittman 12/07/2009 12:32 PM Pour : thipadin.seng-long at bull.net cc : florence.vallee at bull.net, francois.wellenreiter at bull.net, padb-devel at pittman.org.uk, Sylvain.JEAUGEY at bull.net Objet : Re: R?f. : Re: R?f. : Re: [padb] Patch of support of Slurm + Openmpi Orte manager On Thu, 2009-12-03 at 13:20 +0100, thipadin.seng-long at bull.net wrote: > Hi, good holidays, there ? > I have applied the patch below. > It works now: This is committed as r346, slightly modified to call is_resmgr_process() rather than checking the name from /proc/$pid/status as discussed. Let me know if you have any further problems with this and once again thanks for the patch. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk -------------- next part -------------- An HTML attachment was scrubbed... URL: From padb at googlecode.com Mon Dec 7 19:12:11 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 07 Dec 2009 19:12:11 +0000 Subject: [padb] r353 committed - Clean up the list-rmgrs code a little, both in terms... Message-ID: <0016367d6f360c96e6047a283c6c@google.com> Revision: 353 Author: apittman Date: Mon Dec 7 11:12:04 2009 Log: Clean up the list-rmgrs code a little, both in terms of the code structure and also tidy up what it reports to the user if a resource manager isn't detected. 
http://code.google.com/p/padb/source/detail?r=353 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Dec 7 05:34:42 2009 +++ /trunk/src/padb Mon Dec 7 11:12:04 2009 @@ -2956,6 +2956,7 @@ if ( defined $link ) { if ( defined $mpirun{ basename($link) } ) { push @jobs, $pid; + next; } } @@ -5147,26 +5148,21 @@ if ($list_rmgrs) { foreach my $res ( sort keys %rmgr ) { - my $working = 'yes'; if ( defined $rmgr{$res}{is_installed} and not $rmgr{$res}{is_installed}() ) { - $working = 'no'; - } - my $r = $res; - - if ( $working eq 'yes' ) { - print "$r: "; - my @jobs = $rmgr{$res}{get_active_jobs}($user); - if ( @jobs > 0 ) { - my $j = join q{ }, sort { $a <=> $b } @jobs; - print "jobs($j)\n"; - } else { - print "No active jobs\n"; - } + print "$res: Not detected on system.\n"; + next; + } + + print "$res: "; + my @jobs = $rmgr{$res}{get_active_jobs}($user); + if ( @jobs > 0 ) { + my $j = join q{ }, sort { $a <=> $b } @jobs; + print "$j\n"; } else { - print "$r: not active\n"; + print "No active jobs.\n"; } } exit 0; From padb at googlecode.com Mon Dec 7 21:59:58 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 07 Dec 2009 21:59:58 +0000 Subject: [padb] r354 committed - Move the logic for finding dll finding from the C code to the perl... Message-ID: <001485f6d9f40f3497047a2a9482@google.com> Revision: 354 Author: apittman Date: Mon Dec 7 13:59:10 2009 Log: Move the logic for finding dll finding from the C code to the perl code, minfo now just calls fetch_dll_name() in a loop until it returns null, all the complex string handling code is handled in padb itself. 
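To illustrate the exchange described in the log above, here is a minimal Perl sketch of the padb side answering repeated "req: dll_filename" requests until it runs out of candidates. It is not the committed code: make_dll_responder, the predicate argument and the example paths are all hypothetical, and the predicate stands in for the file-existence check so the sketch can run anywhere.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of padb: build a closure that answers
# successive "req: dll_filename" requests from a list of candidate DLLs.
# $exists is a predicate standing in for a file-existence test.
sub make_dll_responder {
    my ( $exists, @candidates ) = @_;
    my %seen;

    # Keep usable candidates only, dropping duplicates but preserving order.
    my @queue = grep { $exists->($_) and not $seen{$_}++ } @candidates;

    return sub {
        my ($request) = @_;
        return unless $request eq 'req: dll_filename';
        my $filename = shift @queue;
        return defined $filename ? "ok $filename" : 'fail';
    };
}

# Made-up paths; only the second and third (identical) entries "exist".
my %present = ( '/opt/mpi/lib/libmsgq.so' => 1 );
my $reply = make_dll_responder(
    sub { $present{ $_[0] } },
    '/missing.so', '/opt/mpi/lib/libmsgq.so', '/opt/mpi/lib/libmsgq.so'
);

print $reply->('req: dll_filename'), "\n";    # ok /opt/mpi/lib/libmsgq.so
print $reply->('req: dll_filename'), "\n";    # fail
```

The caller keeps asking until it sees 'fail', which mirrors the loop-until-null behaviour the log describes for minfo.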
http://code.google.com/p/padb/source/detail?r=354 Modified: /trunk/src/minfo.c /trunk/src/padb ======================================= --- /trunk/src/minfo.c Mon Dec 7 04:22:42 2009 +++ /trunk/src/minfo.c Mon Dec 7 13:59:10 2009 @@ -370,6 +370,21 @@ free(ans); return 0; } + +/* Fetch a string from a remote memory location, making sure there is + * enough memory locally to store our copy. Return mqs_ok on success */ +void *fetch_dll_name () +{ + char ans[1024]; + int i; + + i = ask("dll_filename",ans); + if ( i != 0 ) { + + return NULL; + } + return strdup(ans); +} int fetch_image (char *local) { @@ -561,82 +576,28 @@ return 0; } -#define PATH_MAX 1024 - -/* Try and load a valid dll from the locations array, loop over the array - * trying each one in turn. Return 0 if and when we managed to load one, - * -1 otherwise - */ -int find_and_load_dll_from_loc_array() { - void **remote_array; - char *dll_name; - void *locations = find_sym("sym","mpimsgq_dll_locations"); - - if ( locations == NULL ) - return -1; - - if ( find_data(NULL,(mqs_taddr_t)locations,sizeof(void *),&remote_array) != mqs_ok ) { - return -1; - } - - if ( (dll_name = malloc(PATH_MAX)) == NULL ) - return -1; +void find_and_load_dll() +{ + char *dll_name = fetch_dll_name(); + + if ( ! 
dll_name ) { + die("No DLL to load"); + } do { - void *remote_entry = NULL; - - if ( find_data(NULL,(mqs_taddr_t)remote_array,sizeof(void *),&remote_entry) != mqs_ok ) - goto error_out; - - if ( remote_entry == NULL ) - goto error_out; - - memset(dll_name,0,PATH_MAX); - - if ( fetch_string(NULL,dll_name,(mqs_taddr_t)remote_entry,PATH_MAX) != mqs_ok ) { - goto error_out; - - } else { - if ( load_msgq_dll(dll_name) == 0 ) { - free(dll_name); - return mqs_ok; - } - } - remote_array++; - } while ( 1 ); - -error_out: - free(dll_name); - return -1; -} - -void find_and_load_dll() -{ - char *dll_name; - - dll_name = getenv("MPINFO_DLL"); - if ( dll_name != NULL ) { - if ( load_msgq_dll(dll_name) != 0 ) { - die("Could not load symbols from dll"); - } - return; - } - - /* Try the new (proposed) dll specification mechanism */ - if ( find_and_load_dll_from_loc_array() == mqs_ok ) - return; - - void *base = find_sym("sym","MPIR_dll_name"); - if ( base == NULL ) { - die("Could not find MPIR_dll_name symbol"); - } - dll_name = malloc(PATH_MAX); - if ( fetch_string(NULL,dll_name,(mqs_taddr_t)base,PATH_MAX) != 0 ) { - die("Could not read value of MPIR_dll_name"); - } - if ( load_msgq_dll(dll_name) != 0 ) { - die("Could not load symbols from dll"); - } + + if ( load_msgq_dll(dll_name) == mqs_ok ) + { + free(dll_name); + return; + } + + free(dll_name); + dll_name = fetch_dll_name(); + + } while ( dll_name != NULL ); + + die("Could not find a loadable dll"); } int ======================================= --- /trunk/src/padb Mon Dec 7 11:12:04 2009 +++ /trunk/src/padb Mon Dec 7 13:59:10 2009 @@ -1087,7 +1087,7 @@ return -1; } else { - if ( length $str < 10 ) { + if ( length $str < 9 ) { return hex $str; } @@ -6129,16 +6129,62 @@ } sub run_minfo { - my ( $gdb, $vp ) = @_; + my ( $carg, $gdb, $vp ) = @_; my $h = { hpid => -1, tracepid => -1, attached => 0, - debug => 0, + debug => 1, }; $h->{fd}{err} = *M_ERROR; + + my @all_dll_filenames; + + if ( defined $carg->{mpi_dll} ) { + push 
@all_dll_filenames, $carg->{mpi_dll}; + } else { + my $loc = gdb_var_addr( $gdb, 'mpimsgq_dll_locations' ); + + if ($loc) { + my $psize = gdb_type_size( $gdb, 'void *' ); + my $base = $loc; + my $filename; + + $base = gdb_read_pointer( $gdb, $base ); + + do { + my $strp = gdb_read_pointer( $gdb, $base ); + $filename = gdb_string( $gdb, 1024, $strp ); + if ( defined $filename ) { + push @all_dll_filenames, $filename; + } + $base = _hex($base) + $psize; + } while ( defined $filename ); + } + + my $base = gdb_var_addr( $gdb, 'MPIR_dll_name' ); + if ( not defined $base ) { + target_error( $vp, +'Process does not appear to be using MPI (No MPIR_dll_name symbol)' + ); + return; + } + my $filename = gdb_string( $gdb, 1024, $base ); + push @all_dll_filenames, $filename; + } + + my @dll_filenames; + + my %files; + foreach my $filename (@all_dll_filenames) { + next unless -f ($filename); + next if defined $files{$filename}; + + push @dll_filenames, $filename; + $files{$filename} = 1; + } my $cmd = $inner_conf{minfo}; $h->{hpid} = open3( $h->{fd}{wtr}, $h->{fd}{rdr}, *M_ERROR, $cmd ) @@ -6206,7 +6252,21 @@ chomp $r; my $cmd = substr $r, 0, 4; - if ( $cmd eq 'req:' ) { + if ( $r eq 'req: dll_filename' ) { + $stats{dll_files}++; + my $filename = shift @dll_filenames; + my $res = 'fail'; + if ( defined $filename ) { + $res = "ok $filename"; + } + + print {$out} "$res\n"; + + if ( defined $h->{debugfd} ) { + print { $h->{debugfd} } "$res\n"; + } + + } elsif ( $cmd eq 'req:' ) { my $res = minfo_handle_query( $gdb, $vp, $r, \%stats ); # Some things *do* fail here, symbol lookups for example, @@ -6380,24 +6440,8 @@ return; } - my $base = gdb_var_addr( $g, 'MPIR_dll_name' ); - if ( not defined $base ) { - target_error( $vp, - 'Process does not appear to be using MPI (No MPIR_dll_name symbol)' - ); - } - - if ( defined $carg->{mpi_dll} ) { - $ENV{MPINFO_DLL} = $carg->{mpi_dll}; - } else { - if ( not defined $base ) { - gdb_detach($g); - gdb_quit($g); - return; - } - } - - my @mq = 
run_minfo( $g, $vp ); + my @mq = fetch_mpi_queue_gdb( $carg, $vp, $pid, $g ); + gdb_detach($g); gdb_quit($g); return @mq; @@ -6406,23 +6450,7 @@ # As above but take a gdb handle sub fetch_mpi_queue_gdb { my ( $carg, $vp, $pid, $g ) = @_; - - my $base = gdb_var_addr( $g, 'MPIR_dll_name' ); - if ( not defined $base ) { - target_error( $vp, - 'Process does not appear to be using MPI (No MPIR_dll_name symbol)' - ); - } - - if ( defined $carg->{mpi_dll} ) { - $ENV{MPINFO_DLL} = $carg->{mpi_dll}; - } else { - if ( not defined $base ) { - return; - } - } - - my @mq = run_minfo( $g, $vp ); + my @mq = run_minfo( $carg, $g, $vp ); return @mq; } @@ -6430,7 +6458,7 @@ my ( $carg, $vp, $pid ) = @_; my @mq = fetch_mpi_queue( $carg, $vp, $pid ); - return unless $mq[0]; + foreach my $o (@mq) { output( $vp, $o ); } @@ -6451,7 +6479,6 @@ foreach my $o (@mq) { output( $vp, $o ); } - return; } From thipadin.seng-long at bull.net Wed Dec 9 10:29:59 2009 From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net) Date: Wed, 9 Dec 2009 11:29:59 +0100 Subject: [padb] tiny bug with--proc-summary Message-ID: Hi, With --proc-summary option, padb displays pid which is indeed a thread PID (LWP) for a process that have some threads as shown: [thipa at machu139 padb_open]$ ./padb -O rmgr=slurm --proc-summary 11091 rank hostname pid vmsize vmrss S uptime %cpu lcore command 0 machu139 31719 151780 kB 26380 kB R 0.97 99 2 concurrent_spaw [thipa at machu139 padb_open]$ This is a ps -eLf command [thipa at machu139 31718]$ ps -eLf UID PID PPID LWP C NLWP STIME TTY TIME CMD . . thipa 31718 31717 31718 95 3 16:32 ? 00:23:28 ./concurrent_spawns thipa 31718 31717 31719 0 3 16:32 ? 00:00:00 ./concurrent_spawns thipa 31718 31717 31720 0 3 16:32 ? 00:00:00 ./concurrent_spawns root 31724 2285 31724 0 1 16:33 ? 00:00:00 sshd: thipa [priv] What's do you think. Thipadin. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From padb at googlecode.com Wed Dec 9 11:08:01 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Wed, 09 Dec 2009 11:08:01 +0000 Subject: [padb] r355 committed - Modify the mpirun resource manager code to work in cases where debug... Message-ID: <0016e6407ac23d417c047a49b42b@google.com> Revision: 355 Author: apittman Date: Wed Dec 9 03:07:11 2009 Log: Modify the mpirun resource manager code to work in cases where debug information isn't available in the target binary, use hardcoded values for struct offsets as defined by the standard rather than rely on gdb being able to access the information and calculate the offsets for us. This fixes problems seen on at least two OMPI installations when run in mpirun mode. http://code.google.com/p/padb/source/detail?r=355 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Dec 7 13:59:10 2009 +++ /trunk/src/padb Wed Dec 9 03:07:11 2009 @@ -2994,15 +2994,58 @@ } my %pt; - foreach my $proc ( 0 .. ( $nprocs - 1 ) ) { - my $hostp = gdb_read_value_addr( $gdb, - "(void *)MPIR_proctable[$proc].host_name" ); - my $host = gdb_string( $gdb, 1024, $hostp ); - my $pid = gdb_read_value( $gdb, "MPIR_proctable[$proc].pid" ); - if ( defined $host and defined $pid ) { - $pt{$host}{$proc} = $pid; - } else { - print "Failed to extract process info for rank $proc\n"; + + # Whilst it's possible to dip inside the struct in the process to + # extract this information some builds don't associate a type with + # MPIR_proctable which means in those cases this methhod won't work. + # Instead use a set of hardcoded values for offset and size as defined + # by the interface and do the maths for finding each element ourselves. + + # I've left the old code here for now as I suspect this is going to be + # something that causes trouble in the future.
+ + if (1) { + my $word_size = gdb_type_size( $gdb, 'void *' ); + my $table_size = ( $word_size * 2 ) + 4; + + # On 64 bit systems the struct is 20 bytes in size but needs to be + # 8 byte alligned. + if ( $word_size == 8 ) { + $table_size += 4; + } + + my $host_offset = 0; + my $pid_offset = $word_size * 2; + my $proctable_addr = gdb_var_addr( $gdb, 'MPIR_proctable' ); + my $proctable = gdb_read_pointer( $gdb, $proctable_addr ); + my $base = _hex($proctable); + + foreach my $proc ( 0 .. ( $nprocs - 1 ) ) { + + my $struct_base = $base + ( $table_size * $proc ); + my $hostp = gdb_read_pointer( $gdb, $struct_base + $host_offset ); + my $host = gdb_string( $gdb, 1024, $hostp ); + + my $pid = gdb_read_int( $gdb, $struct_base + $pid_offset ); + if ( defined $host and defined $pid ) { + $pt{$host}{$proc} = $pid; + } else { + print "Failed to extract process info for rank $proc\n"; + } + } + } else { + + foreach my $proc ( 0 .. ( $nprocs - 1 ) ) { + + my $hostp = gdb_read_value_addr( $gdb, + "(void *)MPIR_proctable[$proc].host_name" ); + my $host = gdb_string( $gdb, 1024, $hostp ); + my $pid = gdb_read_value( $gdb, "MPIR_proctable[$proc].pid" ); + if ( defined $host and defined $pid ) { + $pt{$host}{$proc} = $pid; + } else { + print "Failed to extract process info for rank $proc\n"; + } } } @@ -6055,6 +6098,19 @@ } return; } + +sub gdb_read_int { + my ( $gdb, $addr ) = @_; + + # Quote the request in case it contains spaces. 
+ my %t = + gdb_send_addr( $gdb, "-data-evaluate-expression \"*(int *)$addr\"" ); + if ( $t{status} eq 'done' ) { + my $v = gdb_parse_reason( $t{reason} ); + return $v->{value}; + } + return; +} sub gdb_read_value { my ( $gdb, $name ) = @_; @@ -6135,7 +6191,7 @@ hpid => -1, tracepid => -1, attached => 0, - debug => 1, + debug => 0, }; $h->{fd}{err} = *M_ERROR; From ashley at pittman.co.uk Wed Dec 9 11:41:32 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Wed, 09 Dec 2009 11:41:32 +0000 Subject: [padb] tiny bug with--proc-summary In-Reply-To: References: Message-ID: <1260358892.21674.64.camel@alpha> On Wed, 2009-12-09 at 11:29 +0100, thipadin.seng-long at bull.net wrote: > > Hi, > With --proc-summary option, padb displays pid which is indeed a thread > PID (LWP) > for a process that have some threads as shown: > What's do you think. I can confirm there's a bug here, I can see it locally when I target a multi-threaded application on my laptop. What is happening is that the show_proc function is reporting data for all tasks in the program, this is probably the right thing for --proc-info however for --proc-summary it's incorrect in that it's recording a lot of entries twice for the same process, pid being one of these. This duplicate data is then passed back through the network to the outer process. At this point the tree_from_namespace function is re-assembling the data on the assumption that each key only has one value from a given rank, in the case here where this isn't true it's picking one at random and reporting that which is what you see. Attached is a basic patch which fixes the issue by ensuring that only data from the first thread is forwarded back, this makes padb deterministic and causes it to show the pid you'd expect. 
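As an illustration of the fix described above (the attached patch itself is not preserved in this archive), a minimal Perl sketch, assuming the per-thread data arrives as an ordered list of key/value pairs. first_value_per_key is a hypothetical name, not a padb function.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of padb: keep only the first value seen for
# each key, so data from the first thread wins and later threads are ignored.
sub first_value_per_key {
    my (@pairs) = @_;
    my ( %seen, @kept );
    foreach my $pair (@pairs) {
        my ( $key, $value ) = @{$pair};
        next if $seen{$key}++;    # a value for this key was already kept
        push @kept, [ $key, $value ];
    }
    return @kept;
}

# Thread ids taken from the ps -eLf listing earlier in the thread.
my @reported = ( [ pid => 31718 ], [ pid => 31719 ], [ pid => 31720 ] );
foreach my $pair ( first_value_per_key(@reported) ) {
    print "$pair->[0]=$pair->[1]\n";    # prints pid=31718
}
```

With one value per key the result no longer depends on which duplicate happens to be picked when the data is re-assembled.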
The wider issue here is how to handle multi-threaded programs. For example, I don't know how to calculate memory usage across threads; I'd assume they all have the same memory maps, with the possible exception of TLS, which means the value is probably both common to all threads and correct across the process as a whole. The percent CPU usage calculation, however, is almost certainly wrong: it would need to be calculated for each thread and summed across threads to get the true value. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk -------------- next part -------------- A non-text attachment was scrubbed... Name: padb-proc-format-threads.patch Type: text/x-patch Size: 864 bytes Desc: not available URL: From thipadin.seng-long at bull.net Wed Dec 9 14:28:07 2009 From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net) Date: Wed, 9 Dec 2009 15:28:07 +0100 Subject: [padb] Réf. : Re: tiny bug with--proc-summary Message-ID: Hi, I have tested the patch and it is OK. However, the main thread could be the second one in the LWP list (as in my previous example), so it's hard to say. Consider it working. Thipadin. Ashley Pittman 12/09/2009 12:41 PM To: thipadin.seng-long at bull.net cc: florence.vallee at bull.net, francois.wellenreiter at bull.net, padb-devel at pittman.org.uk, Sylvain.JEAUGEY at bull.net Subject: Re: tiny bug with--proc-summary On Wed, 2009-12-09 at 11:29 +0100, thipadin.seng-long at bull.net wrote: > > Hi, > With --proc-summary option, padb displays pid which is indeed a thread > PID (LWP) > for a process that have some threads as shown: > What's do you think. I can confirm there's a bug here, I can see it locally when I target a multi-threaded application on my laptop.
What is happening is that the show_proc function is reporting data for all tasks in the program, this is probably the right thing for --proc-info however for --proc-summary it's incorrect in that it's recording a lot of entries twice for the same process, pid being one of these. This duplicate data is then passed back through the network to the outer process. At this point the tree_from_namespace function is re-assembling the data on the assumption that each key only has one value from a given rank, in the case here where this isn't true it's picking one at random and reporting that which is what you see. Attached is a basic patch which fixes the issue by ensuring that only data from the first thread is forwarded back, this makes padb deterministic and causes it to show the pid you'd expect. The wider issue here is how to handle multi-threaded programs, for example I don't know how to calculate memory usage across threads, I'd assume they all have the same memory maps with the possible exception of TLS which means the value is probably both common to all threads and correct across the process as a whole but the percent cpu usage calculation is almost certainly wrong, this would need to be calculated for each thread and summed across threads to get the true value. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: padb-proc-format-threads.patch Type: application/octet-stream Size: 892 bytes Desc: not available URL: From ashley at pittman.co.uk Wed Dec 9 15:00:33 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Wed, 09 Dec 2009 15:00:33 +0000 Subject: [padb] =?iso-8859-1?q?R=E9f=2E_=3A_Re=3A_tiny_bug_with--proc-summ?= =?iso-8859-1?q?ary?= In-Reply-To: References: Message-ID: <1260370833.21674.88.camel@alpha> On Wed, 2009-12-09 at 15:28 +0100, thipadin.seng-long at bull.net wrote: > > Hi, > I have tested the patch and is OK. Ok, I'll commit it as is. It's definitely a step forward as the current code is non-deterministic. > But I could have the main thread to be the second one from the LWP > (as my in previous example). You mean the thread with the LWP of 31719? LWP 31718 is the one which has consumed the most CPU cycles. If you can suggest a good way for padb to calculate which is the main thread then I'd be keen to hear it. One other option might be to loop over the threads as before and report all their values with the keys suffixed with _%d. In the example you gave this would give you pid=31718 pid_1=31719 pid_2=31720. If you knew that you wanted to see the data for the second pid you could then use --proc-format="rank,hostname,pid_1,vmsize_1,vmrss_1,..." As you say it's hard and I'm open to ideas on how you'd like it to work. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From padb at googlecode.com Wed Dec 9 17:51:33 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Wed, 09 Dec 2009 17:51:33 +0000 Subject: [padb] r356 committed - Refresh the lstopo mode and add a generic "command" mode.... Message-ID: <001636310361553379047a4f575b@google.com> Revision: 356 Author: apittman Date: Wed Dec 9 09:51:13 2009 Log: Refresh the lstopo mode and add a generic "command" mode. 
Change the lstopo command to add a dash '-' which tells it to give text based output rather than graphical, also extend this mode to allow the precise lstopo command to be specified as an option. The command given is exected on the target node and %p is replaced with the pid of the target process. This should future proof padb as when a --pid option gets added to lstopo padb will be able to use it by just editing a configuration file rather than editing the code. Also add a generic "command" mode which will run any command, again substituting %p with the pid of the target process, this is in effect what lstopo is now so add a mode to do this specifically othewise I'm sure people would try and piggy-back this behaviour onto the lstopo mode. The default command is 'readlink /proc/%p/exe'. http://code.google.com/p/padb/source/detail?r=356 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Wed Dec 9 03:07:11 2009 +++ /trunk/src/padb Wed Dec 9 09:51:13 2009 @@ -7957,17 +7957,23 @@ return; } -# Experimental, currently reports on what's on the node rather than what -# the specific process is attached to, hopefully this functionality will be -# added in the future. +# Experimental, currently reports on what's on the node rather than +# what the specific process is attached to, hopefully this +# functionality will be added in the future. 
# https://svn.open-mpi.org/trac/hwloc/ticket/21 sub lstopo { my ( $cargs, $vp, $pid ) = @_; - target_error( $vp, "Reporting per node rather than per process" ); - - my @output = slurp_cmd("lstopo --whole-system"); + if ( $cargs->{lstopo_show_warning} ) { + target_error( $vp, "Reporting per node rather than per process" ); + } + + my $cmd = $cargs->{lstopo_command}; + + $cmd =~ s{%p}{$pid}g; + + my @output = slurp_cmd($cmd); # Check the return code, if it's not found then there won't be any # output, if it was found but returned an error then do report the @@ -7988,6 +7994,21 @@ } return; } + +sub run_cmd_against_target { + my ( $cargs, $vp, $pid ) = @_; + + my $cmd = $cargs->{command}; + + $cmd =~ s{%p}{$pid}g; + + my @output = slurp_cmd($cmd); + chomp @output; + foreach my $line (@output) { + output( $vp, $line ); + } + return; +} sub ping_rank { my ( $cargs, $vp, $pid ) = @_; @@ -9351,9 +9372,18 @@ }; $allfns{lstopo} = { - handler => \&lstopo, - arg_long => 'lstopo', - help => 'Show CPU topology', + handler => \&lstopo, + arg_long => 'lstopo', + help => 'Show CPU topology using lstopo', + options_i => { lstopo_command => 'lstopo --whole-system -', }, + options_bool => { lstopo_show_warning => 'yes', }, + }; + + $allfns{command} = { + handler => \&run_cmd_against_target, + arg_long => 'command', + help => 'Run command on target node', + options_i => { command => 'readlink /proc/%p/exe', } }; $allfns{ping} = { From padb at googlecode.com Wed Dec 9 18:44:34 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Wed, 09 Dec 2009 18:44:34 +0000 Subject: [padb] r357 committed - Fix to the proc-format mode, ensure that the information we report... Message-ID: <0016e68ea1b4f17d1d047a501453@google.com> Revision: 357 Author: apittman Date: Wed Dec 9 10:43:32 2009 Log: Fix to the proc-format mode, ensure that the information we report comes from the lead pid in the process group. 
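As a rough illustration of the behaviour this log message describes (this is not the padb code, which is Perl; the function name `summarise_tasks` and its signature are hypothetical), the C sketch below takes the names found in a `/proc/<pid>/task` listing, skips `.` and `..`, remembers the first remaining entry as the one to report on, and counts all of them as threads. It assumes the first entry returned is the lead task, which is what the change appears to rely on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of padb: given the names found in a
 * /proc/<pid>/task directory listing, count the thread entries and
 * remember the first one.  "." and ".." are skipped.
 *
 * Returns the number of threads; *first is set to the first thread
 * entry seen, or NULL if there were none.
 */
size_t summarise_tasks(const char *const *entries, size_t n,
                       const char **first)
{
    size_t threads = 0;
    size_t i;

    assert(first != NULL);
    *first = NULL;

    for (i = 0; i < n; i++) {
        if (strcmp(entries[i], ".") == 0 || strcmp(entries[i], "..") == 0)
            continue;             /* directory bookkeeping, not a thread */
        if (*first == NULL)
            *first = entries[i];  /* only this entry is reported on */
        threads++;
    }
    return threads;
}
```

With the listing from the earlier example in this thread (`31718`, `31719`, `31720`) this would report on `31718` and give a thread count of 3.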
http://code.google.com/p/padb/source/detail?r=357 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Wed Dec 9 09:51:13 2009 +++ /trunk/src/padb Wed Dec 9 10:43:32 2009 @@ -7642,16 +7642,19 @@ if ( -d "/proc/$pid/task" and $carg->{proc_shows_proc} ) { - my $threads = 0; - # 2.6 kernel. (ntpl) my @tasks = slurp_dir("/proc/$pid/task"); foreach my $task (@tasks) { next if ( $task eq '.' ); next if ( $task eq '..' ); show_task_dir( $carg, $vp, $pid, "/proc/$pid/task/$task" ); - $threads++; - } + if ( defined $carg->{proc_format} ) { + last; + } + } + + # We have to deduct 2 here to account for . and .. + my $threads = @tasks - 2; proc_output( $vp, 'threads', $threads ); } else { show_task_dir( $carg, $vp, $pid, "/proc/$pid" ); From padb at googlecode.com Wed Dec 9 21:15:07 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Wed, 09 Dec 2009 21:15:07 +0000 Subject: [padb] r358 committed - Fix to the MPI dll discovery code, if the new mpimsq_dll_locations... Message-ID: <001485f9126a5abf28047a522f0a@google.com> Revision: 358 Author: apittman Date: Wed Dec 9 13:14:36 2009 Log: Fix to the MPI dll discovery code, if the new mpimsq_dll_locations variable is present but it's value is NULL then padb was giving a warning. Handle this case directly by not trying to follow the pointer if it's value is 0. 
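To make the case in this log message concrete, here is a small C sketch of walking such a variable. It assumes, as the read loop in the diff suggests, that `mpimsgq_dll_locations` is a NULL-terminated array of path strings in the target process; the function name `collect_dll_paths` is hypothetical and this is not how padb itself does it (padb reads the memory through gdb). The point is only that a NULL base pointer is treated as "no entries" rather than followed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of padb or any MPI library: copy up to
 * `max` entries from a NULL-terminated array of path strings into `out`
 * and return how many were copied.  A NULL `base` models the "variable
 * present but value is NULL" case and yields zero entries.
 */
size_t collect_dll_paths(const char *const *base, const char **out,
                         size_t max)
{
    size_t n = 0;

    assert(out != NULL);

    if (base == NULL)
        return 0;   /* nothing to follow; not an error */

    while (n < max && base[n] != NULL) {
        out[n] = base[n];
        n++;
    }
    return n;
}
```

A caller that gets zero back would then fall through to the older `MPIR_dll_name` lookup, as the code following this check does.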
http://code.google.com/p/padb/source/detail?r=358 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Wed Dec 9 10:43:32 2009 +++ /trunk/src/padb Wed Dec 9 13:14:36 2009 @@ -6202,14 +6202,16 @@ push @all_dll_filenames, $carg->{mpi_dll}; } else { my $loc = gdb_var_addr( $gdb, 'mpimsgq_dll_locations' ); - + my $base; if ($loc) { + $base = gdb_read_pointer( $gdb, $loc ); + } + + if ( defined $base and $base ne '0x0' ) { my $psize = gdb_type_size( $gdb, 'void *' ); - my $base = $loc; + my $filename; - $base = gdb_read_pointer( $gdb, $base ); - do { my $strp = gdb_read_pointer( $gdb, $base ); $filename = gdb_string( $gdb, 1024, $strp ); @@ -6220,7 +6222,7 @@ } while ( defined $filename ); } - my $base = gdb_var_addr( $gdb, 'MPIR_dll_name' ); + $base = gdb_var_addr( $gdb, 'MPIR_dll_name' ); if ( not defined $base ) { target_error( $vp, 'Process does not appear to be using MPI (No MPIR_dll_name symbol)' From padb at googlecode.com Thu Dec 10 13:02:07 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Thu, 10 Dec 2009 13:02:07 +0000 Subject: [padb] r359 committed - Update the release nodes to be current.... Message-ID: <0016e6470c5822e195047a5f6ae9@google.com> Revision: 359 Author: apittman Date: Thu Dec 10 05:01:38 2009 Log: Update the release nodes to be current. Also add a --debug-file option to, surprisingly, send debug output to a file. http://code.google.com/p/padb/source/detail?r=359 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Wed Dec 9 13:14:36 2009 +++ /trunk/src/padb Thu Dec 10 05:01:38 2009 @@ -30,9 +30,15 @@ # Version 3.? # * Support of PBS Pro -# * Add variables to tree based stack traces. +# * Support for OpenMPI jobs run by mpirun under a slurm allocation. +# * Modify the Slurm resouce manager code to automatically select a +# step_id based on what's running on the system currently. 
+# * Modify the mpd resource manager code to only call mpdlistjobs on +# the front-end, it provides all the information we need so record +# this and send it over the network to the inner processes. # * Solaris port. Limited functionality compared to running on Linux # however stack trace mode works fully. +# * Add variables to tree based stack traces. # * Add "mpirun" as a resource manager, this causes it walk the local # process list looking for processes called mpirun and to get the # pid and hostlist by reading data from Mpir_Proctable as specified @@ -43,6 +49,7 @@ # reduction operations by name. # * Add a --lstopo option to run the lstopo command for each rank. # http://www.open-mpi.org/projects/hwloc/ +# * Add a 'command' mode to run abritary commands on the target node. # * Enhance the integration with gdb, use sequence numbers when # talking to gdb and check that we get back what we give it. # Correctly notice and raise an appropriate error if gdb dies @@ -64,6 +71,26 @@ # automatically # * Add SVN tags to the source file and the the revision id to the # output of output of --version +# * Make proc-format report data for the first thread in a process +# rather than a random one. +# * Add support for the proposed new standard for finding the Message +# queue plugin in MPI programs. +# * Have padb handle attaching to programs rather than having the mode +# callback handle it. This means that persistent attachments can +# be used in full-report mode. +# * Speed up attaching gdb to the target job greatly by attaching to +# all target processes on a not simultanously rather than one at +# a time. +# * Better handling of jobs that dissapear whilst we are monitoring them, +# there should be no perl errors shown if this happens. +# * Detect where padb is being run from and specify the full path to the +# inner processes. 
This helps with resource managers which don't +# preserve $PWD and padb isn't on $PATH +# * Add proper two-pass argunment handling, secondary options are only +# accepted if the mode they are relevent to is selected. +# * Widespread code cleanups to conform with stricter coding standards. +# * Enable type checking of command line options, all boolean flags can be +# set yes|no|1|0|true|false now. # # Version 3.0 # * Full-duplex communication between inner and outer processes, padb no @@ -677,7 +704,10 @@ -t --tree Use tree based output for stack traces. -i --input-file=FILE Read input from file. - --watch + --watch + + --debug=, Enable debug for mode, use mode=all for all debugging. + --debug-file=file Log debug information to file. -O [opt1=val,] Set internal config options for padb, advanced use only. Options in this version (these are liable to change) @@ -798,6 +828,22 @@ # the ref as well. Enable with --debug=type1,type2=all my %debug_modes; my $start_time = time; +my $debug_fd; + +sub set_debug_file { + my ($filename) = @_; + + if ( defined $filename ) { + if ( not open $debug_fd, '>', $filename ) { + print "Unable to open log file for writing: $!\n"; + $debug_fd = *STDOUT; + } + } else { + $debug_fd = *STDOUT; + } + + return; +} sub debug_log { my ( $type, $handle, $str, @params ) = @_; @@ -807,10 +853,10 @@ } return unless $debug_modes{$type}; my $time = time - $start_time; - printf "DEBUG ($type): %3d: $str\n", $time, @params; + printf {$debug_fd} "DEBUG ($type): %3d: $str\n", $time, @params; return if $debug_modes{$type} eq 'basic'; return unless defined $handle; - print Data::Dumper->Dump( [$handle], [$type] ); + print {$debug_fd} Data::Dumper->Dump( [$handle], [$type] ); return; } @@ -930,6 +976,7 @@ Getopt::Long::Configure( 'bundling', 'pass_through' ); my $debugflag; + my $debugfile; my @ranks; @@ -957,6 +1004,7 @@ 'norc' => \$norc, 'config-file=s' => \$configfile, 'debug=s' => \$debugflag, + 'debug-file=s' => \$debugfile, 'create-secret-file' => 
\$create_secret, ); @@ -973,6 +1021,8 @@ # options which might be bundled with it. GetOptions(%optionhash); + set_debug_file($debugfile); + Getopt::Long::Configure( 'default', 'bundling' ); my $mode; @@ -4491,7 +4541,9 @@ # rng_user_verify() # is_value_in_range() -# nvalues_in_range() +# nvalues_in_range() - Return the number of values in a range. +# rng_min() - Return the minimum value in a range. +# rng_common() - Take two ranges and return the common values. # rng_find_missing() # Take two ranges and return all that are in the first but not in the # second. (see check_signon). From padb at googlecode.com Thu Dec 10 15:22:22 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Thu, 10 Dec 2009 15:22:22 +0000 Subject: [padb] r360 committed - Clean up the way the message queue code is called on the inner process... Message-ID: <001636310361b22b4a047a615f68@google.com> Revision: 360 Author: apittman Date: Thu Dec 10 07:21:32 2009 Log: Clean up the way the message queue code is called on the inner processes, remove one function completely, rename other and add comments about what is called from where. Handle deadlock detection in the same way that message queues are handled so that the code can use the same function to read both of them. http://code.google.com/p/padb/source/detail?r=360 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Thu Dec 10 05:01:38 2009 +++ /trunk/src/padb Thu Dec 10 07:21:32 2009 @@ -6537,8 +6537,19 @@ return; } -sub fetch_mpi_queue { +# Returns the MPI queues for this process given a gdb handle. +sub fetch_mpi_queue_gdb { + my ( $carg, $vp, $pid, $g ) = @_; + my @mq = run_minfo( $carg, $g, $vp ); + return @mq; +} + +# Called as a backoff from the qsnet_show_tport_queue() function if it gets +# called but can't show the queues for any reason - most likely because it +# isn't actually a quadrics system. 
+sub show_mpi_queue { my ( $carg, $vp, $pid ) = @_; + my $g = gdb_start(); my $p = gdb_attach( $g, $pid ); if ( !$p ) { @@ -6554,20 +6565,6 @@ gdb_detach($g); gdb_quit($g); - return @mq; -} - -# As above but take a gdb handle -sub fetch_mpi_queue_gdb { - my ( $carg, $vp, $pid, $g ) = @_; - my @mq = run_minfo( $carg, $g, $vp ); - return @mq; -} - -sub show_mpi_queue { - my ( $carg, $vp, $pid ) = @_; - - my @mq = fetch_mpi_queue( $carg, $vp, $pid ); foreach my $o (@mq) { output( $vp, $o ); @@ -6575,6 +6572,7 @@ return; } +# The mode handler for message queue or deadlock detection mode. sub show_mpi_queue_one { my ( $carg, $proc ) = @_; @@ -6591,19 +6589,6 @@ } return; } - -sub show_mpi_queue_for_deadlock_one { - my ( $carg, $proc ) = @_; - - my $vp = $proc->{vp}; - my $pid = $proc->{pid}; - my $gdb = $proc->{gdb_handle}; - - return unless $gdb; - - my @mq = fetch_mpi_queue_gdb( $carg, $vp, $pid, $gdb ); - return \@mq; -} sub mpi_queue_output_handler { my ( $carg, $lines, $three ) = @_; @@ -6831,8 +6816,8 @@ # XXX This is a bit of a hack to make the deadlock code work with input # files, the whole thing is due a tidy-up on the full-duplex branch # where this should be solved properly. 
- if ( defined $lines->{target_response} ) { - $data = $lines->{target_response}; + if ( defined $lines->{target_output} ) { + $data = $lines->{target_output}; } else { $data = $lines->{lines}; } @@ -8074,7 +8059,7 @@ return; } -sub show_queue { +sub qsnet_show_tport_queue { my ( $carg, $vp, $pid ) = @_; # If edb isn't installed (this isn't a Quadrics system) don't even try @@ -9295,7 +9280,7 @@ arg_long => 'message-queue', qsnet => 1, arg_short => 'q', - handler => \&show_queue, + handler => \&qsnet_show_tport_queue, help => 'Show the message queues', options_i => { mpi_dll => undef, } }; @@ -9324,7 +9309,7 @@ }; $allfns{deadlock} = { - handler_one => \&show_mpi_queue_for_deadlock_one, + handler_one => \&show_mpi_queue_one, needs_gdb => 1, arg_long => 'deadlock', arg_short => 'j', From padb at googlecode.com Wed Dec 16 21:48:38 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Wed, 16 Dec 2009 21:48:38 +0000 Subject: [padb] r361 committed - Replace a few hard coded char arrays with mallocs of a size just... Message-ID: <0016e6407ac21c9e20047adf7802@google.com> Revision: 361 Author: apittman Date: Wed Dec 16 13:48:14 2009 Log: Replace a few hard coded char arrays with mallocs of a size just big enough for the string. A slight performance penalty but it'll help if anybody comes up with a 64 character type name. http://code.google.com/p/padb/source/detail?r=361 Modified: /trunk/src/minfo.c ======================================= --- /trunk/src/minfo.c Mon Dec 7 13:59:10 2009 +++ /trunk/src/minfo.c Wed Dec 16 13:48:14 2009 @@ -131,13 +131,24 @@ } void *find_sym (char *type, char *name) { - char req[128]; + char *req; char ans[128]; int i; void *addr = NULL; + size_t len = 2; + len += strlen(type); + len += strlen(name); + + req = malloc(len); + + if ( ! 
req ) { + return NULL; + } + sprintf(req,"%s %s",type,name); i = ask(req,ans); + free(req); if ( i != 0 ) return NULL; @@ -238,31 +249,48 @@ mqs_type *find_type (mqs_image *image, char *name, mqs_lang_code lang) { - char req[128]; + char *req; int i; struct type *t = malloc(sizeof(struct type)); if ( ! t ) return NULL; + req = malloc (strlen(name)+6); + if ( ! req ) + return NULL; + strncpy(t->name,name,128); sprintf(req,"size %s",name); i = req_to_int(req,&t->size); - if ( i != 0 ) + free(req); + if ( i != 0 ) { + free(t); return NULL; + } return (mqs_type *)t; } int find_offset (mqs_type *type, char *name) { - char req[128]; + char *req; int i,offset; + size_t len = 9; struct type *t = (struct type *)type; + len += strlen(t->name); + len += strlen(name); + req = malloc(len); + + if ( ! req ) { + return -1; + } + sprintf(req,"offset %s %s",t->name,name); i = req_to_int(req,&offset); + free(req); if ( i != 0 ) return -1; From thipadin.seng-long at bull.net Fri Dec 18 13:37:15 2009 From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net) Date: Fri, 18 Dec 2009 14:37:15 +0100 Subject: [padb] =?iso-8859-1?q?R=E9f=2E_=3A__Better_handling_of_threads_in?= =?iso-8859-1?q?_stack__traces=2E?= Message-ID: On Nov 30th, 2009 wrote: > I've been giving some thought to how to padb can handle threaded > applications better as the current scheme isn't ideal. > > > First would be to report extra threads in the same tree as the primary > thread, some magic would have to be applied to cover the fact that the > first thread in a process starts with main and subsequent ones start > with pthread_create() but this wouldn't be a insurmountable problem. > The big problem with this approach would be how to report thread > identifiers in the same rank-spec as rank rank identifiers, I could > revert to just using a list here but that doesn't work so well on big > systems. 
> The second option would be to treat each thread as a different entity
> within the rank/process and have a number of trees displayed per job,
> each dealing with a different thread, e.g. there would be a tree per
> main thread and another tree for each extra thread encountered. From a
> technical perspective implementing this would require adding a namespace
> to the {target_output} as it's passed back up the comms tree so is the
> hardest to add but would probably lead to the best solution.
> Finally there is the option of not showing all threads but allowing
> users to select a single thread per invocation of padb. This is the
> simple but functional option although might be best viewed as a step
> along the way to fully supporting multiple threads in future. Here the
> options are to be able to select threads by id (1,2,...) or perhaps by
> having a white/black list of function names that should appear in the
> stack for a thread before a thread is shown.
> I'd welcome ideas on which people would prefer or if anybody has any
> other thoughts on how to handle threads properly.

I have a support request from a Bull customer who would like to have the padb report sorted by threads, as below:

Thread: 1
--------------------------
[0-1999] (2000 processes)
---------
main()
PMPI_Finalyse()
ompi_mpi_finalyze()
barrier()
----------------
......(249 processes)
---------------
orte_grpcomm_base_allgather()
opal_progress()
opal_event_loop()
epoll_dispatch()
epoll_wait()
---------------
..... (1751 processes)
----------------
opal_progress()
opal_event_loop()
epoll_dispatch()
epoll_wait()

Thread: 2
--------------------------
[0-1999] (2000 processes)
---------
....

Thread: 3
--------------------------
[0-1999] (2000 processes)
---------
....

This report should be by job.
Would you accept it?

Thipadin.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From padb at googlecode.com Fri Dec 18 21:26:25 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Fri, 18 Dec 2009 21:26:25 +0000 Subject: [padb] r362 committed - Update to the orte resource manager to support spawned jobs. Like slu... Message-ID: <0016e646417e567486047b0764ca@google.com> Revision: 362 Author: apittman Date: Fri Dec 18 13:25:23 2009 Log: Update to the orte resource manager to support spawned jobs. Like slurm each job, uniquely identified by it's number can have a number of different steps within it. This commit adds knowedge of these steps to padb so it can do the right thing. Allow targetting of different steps via the orte-job-step configuration option with the default step being the lowest numbered one detected. http://code.google.com/p/padb/source/detail?r=362 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Thu Dec 10 07:21:32 2009 +++ /trunk/src/padb Fri Dec 18 13:25:23 2009 @@ -547,6 +547,7 @@ $conf{rmgr} = undef; $conf{slurm_job_step} = undef; +$conf{orte_job_step} = undef; $conf{pbs_server} = undef; @@ -568,7 +569,7 @@ my @conf_time = qw(prun_exittimeout prun_timeout interval); # Config options which take an integer. 
-my @conf_int = qw(lsf_job_offset slurm_job_step tree_width); +my @conf_int = qw(lsf_job_offset slurm_job_step orte_job_step tree_width); my $norc = 0; my $configfile = '/etc/padb.conf'; @@ -2865,18 +2866,21 @@ if ( @elems == 4 ) { my $nprocs = $elems[3]; my $name = $elems[0]; - if ( $name =~ m{\A\[(\d+)\,\d+]\z}x ) { - $open_jobs{$1}{nprocs} = $nprocs; + if ( $name =~ m{\A\[(\d+)\,(\d+)]\z}x ) { + my $job = $1; + my $step = $2; + $open_jobs{$job}{$step}{nprocs} = $nprocs; } } elsif ( @elems == 6 ) { my $name = $elems[1]; - if ( $name =~ m{\A\[\[(\d+)\,\d+\]\,(\d+)\]}x ) { + if ( $name =~ m{\A\[\[(\d+)\,(\d+)\]\,(\d+)\]}x ) { my $job = $1; - my $rank = $2; + my $step = $2; + my $rank = $3; my $pid = $elems[3]; my $host = $elems[4]; - $open_jobs{$job}{hosts}{$host}++; - $open_jobs{$job}{ranks}{$host}{$rank} = $pid; + $open_jobs{$job}{$step}{hosts}{$host}++; + $open_jobs{$job}{$step}{ranks}{$host}{$rank} = $pid; } } } @@ -2895,7 +2899,22 @@ open_get_data(); - my @hosts = keys %{ $open_jobs{$job}{hosts} }; + my $step = $conf{orte_job_step}; + if ( not defined $step ) { + my @steps = keys %{ $open_jobs{$job} }; + + my @ordered = sort { $a <=> $b } @steps; + + $step = $ordered[0]; + + } + + if ( not defined $open_jobs{$job}{$step} ) { + printf("Job $job (step $step) does not exist\n"); + return; + } + + my @hosts = keys %{ $open_jobs{$job}{$step}{hosts} }; my $i = @hosts; my ( $fh, $fn ) = tempfile('/tmp/padb.XXXXXXXX'); @@ -2909,9 +2928,9 @@ my $cmd = "orterun -machinefile $fn -np $i $prefix"; my %pcmd; - $pcmd{nprocesses} = $open_jobs{$job}{nprocs}; + $pcmd{nprocesses} = $open_jobs{$job}{$step}{nprocs}; $pcmd{nhosts} = @hosts; - $pcmd{process_data} = $open_jobs{$job}{ranks}; + $pcmd{process_data} = $open_jobs{$job}{$step}{ranks}; $pcmd{command} = $cmd; @{ $pcmd{host_list} } = @hosts; $pcmd{cleanup_cb} = \&unlink_file; From padb at googlecode.com Fri Dec 18 21:36:47 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Fri, 18 Dec 2009 21:36:47 +0000 
Subject: [padb] r363 committed - Add command substitution of %r to rank when running in command...
Message-ID: <0016e6d26d586fc22b047b0789e7@google.com>

Revision: 363
Author: apittman
Date: Fri Dec 18 13:35:42 2009
Log: Add command substitution of %r to rank when running in command mode.
This allows the following really useful command to work:

padb -a --command -Ocommand="xterm -T %r -e 'gdb -p %p'"

http://code.google.com/p/padb/source/detail?r=363

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Fri Dec 18 13:25:23 2009
+++ /trunk/src/padb Fri Dec 18 13:35:42 2009
@@ -8057,16 +8057,17 @@
 }
 
 sub run_cmd_against_target {
-    my ( $cargs, $vp, $pid ) = @_;
+    my ( $cargs, $rank, $pid ) = @_;
 
     my $cmd = $cargs->{command};
 
     $cmd =~ s{%p}{$pid}g;
+    $cmd =~ s{%r}{$rank}g;
 
     my @output = slurp_cmd($cmd);
     chomp @output;
     foreach my $line (@output) {
-        output( $vp, $line );
+        output( $rank, $line );
     }
     return;
 }

From ashley at pittman.co.uk Mon Dec 21 12:04:05 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 21 Dec 2009 12:04:05 +0000
Subject: [padb] Réf. : Better handling of threads in stack traces.
In-Reply-To:
References:
Message-ID: <1261397045.3600.35.camel@alpha>

On Fri, 2009-12-18 at 14:37 +0100, thipadin.seng-long at bull.net wrote:
> I have a support request from Bull customer that would like to have
> padb report sorted by threads as below:

That's great.

> Would you accept it ?

I'm not sure what you mean here; I would gladly accept a patch implementing it.

I don't have a huge amount of time to be working on padb right now and there are already a large number of features waiting for a release as well as people who've found they need to use the SVN version for some individual machine.

My priority currently is to finish off what is absent, package padb properly into an rpm, convert it to use autoconf and fix what bugs people find in the meantime.
Re: threads in stack traces, I don't see myself having time to do anything other than the simple option of allowing users to restrict which threads are shown for each invocation of padb without it delaying the release.

Ashley,

-- 
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From thipadin.seng-long at bull.net Mon Dec 21 12:55:37 2009
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Mon, 21 Dec 2009 13:55:37 +0100
Subject: [padb] Réf. : Re: Réf. : Better handling of threads in stack traces.
Message-ID:

On Mon, 21st Dec 2009 at 13:04 +0100 ashley at pittman.co.uk wrote:
>On Fri, 2009-12-18 at 14:37 +0100, thipadin.seng-long at bull.net wrote:
>> I have a support request from Bull customer that would like to have
>> padb report sorted by threads as below:
>That's great.
>> Would you accept it ?
> I'm not sure what you mean here, I would gladly accept a patch
> implementing it.
> I don't have a huge amount of time to be working on padb right now and
> there are already a large number of features waiting for a release as
> well as people who've found they need to use the SVN version for some
> individual machine.
> My priority currently is to finish off what is absent, package padb
> properly into a rpm, convert it to use autoconf and fix what bugs people
> find in the mean time. Re: threads in stack traces I don't see myself
> having time to do anything other than the simple option of allowing
> users to restrict which threads are shown for each invocation of padb
> without it delaying the release.
> Ashley,

OK, you are right: let's make a padb package out of all the features waiting for a release. I think it is a wise decision.
As for me, I will be off for 2 weeks, so I'll be working on the thread sorting at the beginning of the year.

Thipadin.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From padb at googlecode.com Mon Dec 21 19:57:50 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 21 Dec 2009 19:57:50 +0000 Subject: [padb] r364 committed - Add svn properties to minfo, remove old #ident line as modern... Message-ID: <0016361e7e700e3cdf047b428166@google.com> Revision: 364 Author: apittman Date: Mon Dec 21 11:57:07 2009 Log: Add svn properties to minfo, remove old #ident line as modern compilers complain about it. http://code.google.com/p/padb/source/detail?r=364 Modified: /trunk/src/minfo.c ======================================= --- /trunk/src/minfo.c Wed Dec 16 13:48:14 2009 +++ /trunk/src/minfo.c Mon Dec 21 11:57:07 2009 @@ -3,8 +3,11 @@ * Copyright (c) 2009, Ashley Pittman. */ -#ident "elfN.c,v 1.14 2005-11-03 11:23:04 ashley Exp" -/* /cvs/master/quadrics/elan4lib/edb/elfN.c,v */ +/* + * $URL$ + * $Date$ + * $Revision$ + */ #include #include From padb at googlecode.com Mon Dec 21 20:22:00 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 21 Dec 2009 20:22:00 +0000 Subject: [padb] r365 committed - Add a thread-list mode to show the user a list of threads that... Message-ID: <0016e68ea1c37ec460047b42d7e9@google.com> Revision: 365 Author: apittman Date: Mon Dec 21 12:21:54 2009 Log: Add a thread-list mode to show the user a list of threads that are running in the target process. http://code.google.com/p/padb/source/detail?r=365 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Fri Dec 18 13:35:42 2009 +++ /trunk/src/padb Mon Dec 21 12:21:54 2009 @@ -50,6 +50,8 @@ # * Add a --lstopo option to run the lstopo command for each rank. # http://www.open-mpi.org/projects/hwloc/ # * Add a 'command' mode to run abritary commands on the target node. +# * Add a 'thread-list' mode to report a comma seperated list of threads +# for each target process. # * Enhance the integration with gdb, use sequence numbers when # talking to gdb and check that we get back what we give it. 
# Correctly notice and raise an appropriate error if gdb dies @@ -8010,6 +8012,33 @@ } return; } + +sub thread_list_from_pid { + my ( $carg, $proc ) = @_; + + return unless defined $proc->{gdb_handle}; + + my $gdb = $proc->{gdb_handle}; + + my %result = gdb_n_send( $gdb, '-thread-list-ids' ); + if ( $result{status} ne 'done' ) { + return; + } + my $data = gdb_parse_reason( $result{reason}, 'thread-ids' ); + if ( not defined $data->{'thread-ids'} ) { + return; + } + + my @threads; + foreach my $thread ( @{ $data->{'thread-ids'} } ) { + my $id = $thread->{'thread-id'}; + push @threads, $id; + } + + my $thread_list = join q{,}, sort { $a <=> $b } @threads; + + output( $proc->{vp}, $thread_list ); +} sub kill_proc { my ( $cargs, $vp, $pid ) = @_; @@ -9354,6 +9383,13 @@ } }; + $allfns{threads} = { + handler_one => \&thread_list_from_pid, + needs_gdb => 1, + arg_long => 'thread-list', + help => 'List threads in target processes', + }; + $allfns{proc_summary} = { handler_all => \&show_proc_all, out_handler => \&show_proc_format, From padb at googlecode.com Mon Dec 21 20:47:30 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 21 Dec 2009 20:47:30 +0000 Subject: [padb] r366 committed - Add a thread-id configuration option to specify which threads are... Message-ID: <001636c9277daccb8c047b43320f@google.com> Revision: 366 Author: apittman Date: Mon Dec 21 12:46:24 2009 Log: Add a thread-id configuration option to specify which threads are shown in the stack trace view. The default is to show all threads but this option allows the user to restrict it to one thread. http://code.google.com/p/padb/source/detail?r=366 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Dec 21 12:21:54 2009 +++ /trunk/src/padb Mon Dec 21 12:46:24 2009 @@ -52,6 +52,10 @@ # * Add a 'command' mode to run abritary commands on the target node. # * Add a 'thread-list' mode to report a comma seperated list of threads # for each target process. 
+# * Add a 'thread-id' configuration option for when collecting stack +# traces. This isn't a complete solution which will have to wait +# for 3.2 but does allow the user to specify which thread within an +# application is reported on. # * Enhance the integration with gdb, use sequence numbers when # talking to gdb and check that we get back what we give it. # Correctly notice and raise an appropriate error if gdb dies @@ -7098,7 +7102,7 @@ } sub gdb_dump_frames_per_thread { - my ( $gdb, $detail ) = @_; + my ( $gdb, $detail, $thread_id ) = @_; my @th = (); my %result = gdb_n_send( $gdb, '-thread-list-ids' ); if ( $result{status} ne 'done' ) { @@ -7108,6 +7112,11 @@ if ( not defined $data->{'thread-ids'} ) { return; } + + # I honestly don't know what this code is here for, presumably at + # some point in the past I've experienced a version of gdb which + # reports the number-of-threads as zero! No harm in leaving the + # code here however. if ( $data->{'number-of-threads'} == 0 ) { my %t = ( id => 0 ); @{ $t{frames} } = gdb_dump_frames( $gdb, $detail ); @@ -7128,6 +7137,9 @@ } foreach my $thread ( @{ $data->{'thread-ids'} } ) { my $id = $thread->{'thread-id'}; + if ( defined $thread_id ) { + next unless $thread_id eq $id; + } my %t = ( id => $id ); gdb_send( $gdb, "-thread-select $id" ); @{ $t{frames} } = gdb_dump_frames( $gdb, $detail ); @@ -7871,9 +7883,11 @@ if ( $carg->{stack_shows_params} or $carg->{stack_shows_locals} ) { - @threads = gdb_dump_frames_per_thread( $gdb, 1 ); + @threads = + gdb_dump_frames_per_thread( $gdb, 1, $carg->{thread_id} ); } else { - @threads = gdb_dump_frames_per_thread($gdb); + @threads = + gdb_dump_frames_per_thread( $gdb, undef, $carg->{thread_id} ); } if ( defined $threads[0]->{frames} ) { @@ -7892,7 +7906,13 @@ and ( $tries++ < $carg->{gdb_retry_count} ) ); if ( not defined $threads[0]{id} ) { - target_error( $vp, 'Could not extract stack trace from application' ); + if ( $carg->{thread_id} ) { + target_error( $vp, + 'Could not 
extract stack trace for specified thread' ); + } else { + target_error( $vp, + 'Could not extract stack trace from application' ); + } return; } @@ -9430,6 +9450,7 @@ stack_strip_above => 'elan_waitWord,elan_pollWord,elan_deviceCheck,opal_condition_wait,opal_progress', stack_strip_below => 'main,__libc_start_main,start_thread', + thread_id => undef, }, options_bool => { stack_shows_params => 'yes', From padb at googlecode.com Mon Dec 21 20:53:37 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 21 Dec 2009 20:53:37 +0000 Subject: [padb] r367 committed - Re-name --thread-list to --list-threads as this seems more natural. Message-ID: <0016368e1b64944ad8047b43487b@google.com> Revision: 367 Author: apittman Date: Mon Dec 21 12:52:44 2009 Log: Re-name --thread-list to --list-threads as this seems more natural. http://code.google.com/p/padb/source/detail?r=367 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Dec 21 12:46:24 2009 +++ /trunk/src/padb Mon Dec 21 12:52:44 2009 @@ -50,8 +50,8 @@ # * Add a --lstopo option to run the lstopo command for each rank. # http://www.open-mpi.org/projects/hwloc/ # * Add a 'command' mode to run abritary commands on the target node. -# * Add a 'thread-list' mode to report a comma seperated list of threads -# for each target process. +# * Add a 'list-threads' mode to report a comma seperated list of +# threads for each target process. # * Add a 'thread-id' configuration option for when collecting stack # traces. 
This isn't a complete solution which will have to wait # for 3.2 but does allow the user to specify which thread within an @@ -9406,7 +9406,7 @@ $allfns{threads} = { handler_one => \&thread_list_from_pid, needs_gdb => 1, - arg_long => 'thread-list', + arg_long => 'list-threads', help => 'List threads in target processes', }; From ashley at pittman.co.uk Mon Dec 21 21:03:32 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Mon, 21 Dec 2009 21:03:32 +0000 Subject: [padb] Réf. : Re: Réf. : Better handling of threads in stack traces. In-Reply-To: References: Message-ID: <1261429412.3600.44.camel@alpha> On Mon, 2009-12-21 at 13:55 +0100, thipadin.seng-long at bull.net wrote: > Ok you are right, let's make a package padb out of all > features waiting for a release. I think it is a wise decision. > As to me I will be off for 2 weeks, so I'll be working on threads > sort > at the beginning of the year. I've done the simple thing with threads for now: added a --list-threads option to show detected threads and an optional thread-id configuration option for -x to restrict which thread is reported on. Let me know how you get on with this and if it's an acceptable half-way house for your customer. Anything more complex is going to require changes to the padb core, which we can look at after 3.1 is out. Ashley, -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From padb at googlecode.com Mon Dec 21 22:13:12 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 21 Dec 2009 22:13:12 +0000 Subject: [padb] r368 committed - Rename Makefile to be Makefile-simple to make way for a new auto-conf... Message-ID: <0016361e859428e66d047b4465e7@google.com> Revision: 368 Author: apittman Date: Mon Dec 21 14:12:41 2009 Log: Rename Makefile to be Makefile-simple to make way for a new auto-conf generated makefile.
Hopefully soon I can delete this file completly but for now leave it lying around so that it can be invoked with make -f http://code.google.com/p/padb/source/detail?r=368 Added: /trunk/src/Makefile-simple Deleted: /trunk/src/Makefile ======================================= --- /dev/null +++ /trunk/src/Makefile-simple Mon Dec 21 14:12:41 2009 @@ -0,0 +1,42 @@ + +INSTALL_DIR=/usr/local/ +CONFIG_DIR=/etc +VERSION=3.0-rc1 +CC=gcc +CFLAGS=-Wall -g + +FILES = Makefile minfo.c mpi_interface.h padb + +minfo.x: minfo.c mpi_interface.h + $(CC) minfo.c -o minfo.x -ldl $(CFLAGS) + +install: minfo.x + /bin/mkdir -p ${INSTALL_DIR}/bin + /bin/cp minfo.x ${INSTALL_DIR}/bin/ + /bin/cp padb ${INSTALL_DIR}/bin/ + +make config_install: + /bin/mkdir -p ${CONFIG_DIR} + /bin/cp padb.conf ${CONFIG_DIR}/ + +clean: + /bin/rm -f minfo.x + +tarfile: + /bin/rm -f padb-${VERSION}.tgz + /bin/rm -rf padb-${VERSION} + mkdir padb-${VERSION} + /bin/cp ${FILES} padb-${VERSION} + svnversion > padb-${VERSION}/svnversion + tar -czf padb-${VERSION}.tgz padb-${VERSION} + +tidy: + perltidy -b -ce -w -se padb + +pc: padb + perlcritic --brutal --verbose "%l: (%s) %m\n" padb > .pc.tmp || true + /bin/mv .pc.tmp pc + +report: pc + ./report.pl pc | tee report + ======================================= --- /trunk/src/Makefile Wed Oct 28 15:31:42 2009 +++ /dev/null @@ -1,42 +0,0 @@ - -INSTALL_DIR=/usr/local/ -CONFIG_DIR=/etc -VERSION=3.0-rc1 -CC=gcc -CFLAGS=-Wall -g - -FILES = Makefile minfo.c mpi_interface.h padb - -minfo.x: minfo.c mpi_interface.h - $(CC) minfo.c -o minfo.x -ldl $(CFLAGS) - -install: minfo.x - /bin/mkdir -p ${INSTALL_DIR}/bin - /bin/cp minfo.x ${INSTALL_DIR}/bin/ - /bin/cp padb ${INSTALL_DIR}/bin/ - -make config_install: - /bin/mkdir -p ${CONFIG_DIR} - /bin/cp padb.conf ${CONFIG_DIR}/ - -clean: - /bin/rm -f minfo.x - -tarfile: - /bin/rm -f padb-${VERSION}.tgz - /bin/rm -rf padb-${VERSION} - mkdir padb-${VERSION} - /bin/cp ${FILES} padb-${VERSION} - svnversion > padb-${VERSION}/svnversion - tar 
-czf padb-${VERSION}.tgz padb-${VERSION} - -tidy: - perltidy -b -ce -w -se padb - -pc: padb - perlcritic --brutal --verbose "%l: (%s) %m\n" padb > .pc.tmp || true - /bin/mv .pc.tmp pc - -report: pc - ./report.pl pc | tee report - From padb at googlecode.com Mon Dec 21 22:19:21 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 21 Dec 2009 22:19:21 +0000 Subject: [padb] r369 committed - First cut at using automake to build padb, it's early days with... Message-ID: <001636c9277d2a91c7047b447bc7@google.com> Revision: 369 Author: apittman Date: Mon Dec 21 14:19:11 2009 Log: First cut at using automake to build padb, it's early days with code yet but a basic make/install/dist is all working. More attention is needed to install the config file and some code changes to padb itself are needed as well to cope with updated paths. http://code.google.com/p/padb/source/detail?r=369 Added: /trunk/Makefile.am /trunk/autogen.sh /trunk/configure.in /trunk/src/Makefile.am ======================================= --- /dev/null +++ /trunk/Makefile.am Mon Dec 21 14:19:11 2009 @@ -0,0 +1,1 @@ +SUBDIRS = src ======================================= --- /dev/null +++ /trunk/autogen.sh Mon Dec 21 14:19:11 2009 @@ -0,0 +1,8 @@ +#!/bin/sh + +set -e +set -x + +aclocal +autoconf +automake -a ======================================= --- /dev/null +++ /trunk/configure.in Mon Dec 21 14:19:11 2009 @@ -0,0 +1,6 @@ +AC_INIT(src/padb) +AM_INIT_AUTOMAKE(padb,3.1) +AC_PROG_CC +AC_PROG_INSTALL +AM_PROG_CC_C_O +AC_OUTPUT(Makefile src/Makefile) ======================================= --- /dev/null +++ /trunk/src/Makefile.am Mon Dec 21 14:19:11 2009 @@ -0,0 +1,5 @@ +bin_SCRIPTS = padb +libexec_PROGRAMS = minfo +minfo_CFLAGS = -ldl +minfo_SOURCES = minfo.c mpi_interface.h +EXTRA_DIST = padb From padb at googlecode.com Mon Dec 21 22:36:42 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 21 Dec 2009 22:36:42 +0000 Subject: [padb] r370 committed - Add boilerplate new, 
copying, changelog and authors files.... Message-ID: <001636e1f3ad3e6103047b44b9c6@google.com> Revision: 370 Author: apittman Date: Mon Dec 21 14:35:42 2009 Log: Add boilerplate new, copying, changelog and authors files. I'll try and work out how to add a THANKS file shortly... http://code.google.com/p/padb/source/detail?r=370 Added: /trunk/AUTHORS /trunk/COPYING /trunk/ChangeLog /trunk/NEWS ======================================= --- /dev/null +++ /trunk/AUTHORS Mon Dec 21 14:35:42 2009 @@ -0,0 +1,4 @@ + +Authors of padb + +Ashley Pittman. ======================================= --- /dev/null +++ /trunk/COPYING Mon Dec 21 14:35:42 2009 @@ -0,0 +1,504 @@ + GNU LESSER GENERAL PUBLIC LICENSE + Version 2.1, February 1999 + + Copyright (C) 1991, 1999 Free Software Foundation, Inc. + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + +[This is the first released version of the Lesser GPL. It also counts + as the successor of the GNU Library Public License, version 2, hence + the version number 2.1.] + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +Licenses are intended to guarantee your freedom to share and change +free software--to make sure the software is free for all its users. + + This license, the Lesser General Public License, applies to some +specially designated software packages--typically libraries--of the +Free Software Foundation and other authors who decide to use it. You +can use it too, but we suggest you first think carefully about whether +this license or the ordinary General Public License is the better +strategy to use in any particular case, based on the explanations below. + + When we speak of free software, we are referring to freedom of use, +not price. 
Our General Public Licenses are designed to make sure that +you have the freedom to distribute copies of free software (and charge +for this service if you wish); that you receive source code or can get +it if you want it; that you can change the software and use pieces of +it in new free programs; and that you are informed that you can do +these things. + + To protect your rights, we need to make restrictions that forbid +distributors to deny you these rights or to ask you to surrender these +rights. These restrictions translate to certain responsibilities for +you if you distribute copies of the library or if you modify it. + + For example, if you distribute copies of the library, whether gratis +or for a fee, you must give the recipients all the rights that we gave +you. You must make sure that they, too, receive or can get the source +code. If you link other code with the library, you must provide +complete object files to the recipients, so that they can relink them +with the library after making changes to the library and recompiling +it. And you must show them these terms so they know their rights. + + We protect your rights with a two-step method: (1) we copyright the +library, and (2) we offer you this license, which gives you legal +permission to copy, distribute and/or modify the library. + + To protect each distributor, we want to make it very clear that +there is no warranty for the free library. Also, if the library is +modified by someone else and passed on, the recipients should know +that what they have is not the original version, so that the original +author's reputation will not be affected by problems that might be +introduced by others. + + Finally, software patents pose a constant threat to the existence of +any free program. We wish to make sure that a company cannot +effectively restrict the users of a free program by obtaining a +restrictive license from a patent holder. 
Therefore, we insist that +any patent license obtained for a version of the library must be +consistent with the full freedom of use specified in this license. + + Most GNU software, including some libraries, is covered by the +ordinary GNU General Public License. This license, the GNU Lesser +General Public License, applies to certain designated libraries, and +is quite different from the ordinary General Public License. We use +this license for certain libraries in order to permit linking those +libraries into non-free programs. + + When a program is linked with a library, whether statically or using +a shared library, the combination of the two is legally speaking a +combined work, a derivative of the original library. The ordinary +General Public License therefore permits such linking only if the +entire combination fits its criteria of freedom. The Lesser General +Public License permits more lax criteria for linking other code with +the library. + + We call this license the "Lesser" General Public License because it +does Less to protect the user's freedom than the ordinary General +Public License. It also provides other free software developers Less +of an advantage over competing non-free programs. These disadvantages +are the reason we use the ordinary General Public License for many +libraries. However, the Lesser license provides advantages in certain +special circumstances. + + For example, on rare occasions, there may be a special need to +encourage the widest possible use of a certain library, so that it becomes +a de-facto standard. To achieve this, non-free programs must be +allowed to use the library. A more frequent case is that a free +library does the same job as widely used non-free libraries. In this +case, there is little to gain by limiting the free library to free +software only, so we use the Lesser General Public License. 
+ + In other cases, permission to use a particular library in non-free +programs enables a greater number of people to use a large body of +free software. For example, permission to use the GNU C Library in +non-free programs enables many more people to use the whole GNU +operating system, as well as its variant, the GNU/Linux operating +system. + + Although the Lesser General Public License is Less protective of the +users' freedom, it does ensure that the user of a program that is +linked with the Library has the freedom and the wherewithal to run +that program using a modified version of the Library. + + The precise terms and conditions for copying, distribution and +modification follow. Pay close attention to the difference between a +"work based on the library" and a "work that uses the library". The +former contains code derived from the library, whereas the latter must +be combined with the library in order to run. + + GNU LESSER GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License Agreement applies to any software library or other +program which contains a notice placed by the copyright holder or +other authorized party saying it may be distributed under the terms of +this Lesser General Public License (also called "this License"). +Each licensee is addressed as "you". + + A "library" means a collection of software functions and/or data +prepared so as to be conveniently linked with application programs +(which use some of those functions and data) to form executables. + + The "Library", below, refers to any such software library or work +which has been distributed under these terms. A "work based on the +Library" means either the Library or any derivative work under +copyright law: that is to say, a work containing the Library or a +portion of it, either verbatim or with modifications and/or translated +straightforwardly into another language. 
(Hereinafter, translation is +included without limitation in the term "modification".) + + "Source code" for a work means the preferred form of the work for +making modifications to it. For a library, complete source code means +all the source code for all modules it contains, plus any associated +interface definition files, plus the scripts used to control compilation +and installation of the library. + + Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running a program using the Library is not restricted, and output from +such a program is covered only if its contents constitute a work based +on the Library (independent of the use of the Library in a tool for +writing it). Whether that is true depends on what the Library does +and what the program that uses the Library does. + + 1. You may copy and distribute verbatim copies of the Library's +complete source code as you receive it, in any medium, provided that +you conspicuously and appropriately publish on each copy an +appropriate copyright notice and disclaimer of warranty; keep intact +all the notices that refer to this License and to the absence of any +warranty; and distribute a copy of this License along with the +Library. + + You may charge a fee for the physical act of transferring a copy, +and you may at your option offer warranty protection in exchange for a +fee. + + 2. You may modify your copy or copies of the Library or any portion +of it, thus forming a work based on the Library, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) The modified work must itself be a software library. + + b) You must cause the files modified to carry prominent notices + stating that you changed the files and the date of any change. 
+ + c) You must cause the whole of the work to be licensed at no + charge to all third parties under the terms of this License. + + d) If a facility in the modified Library refers to a function or a + table of data to be supplied by an application program that uses + the facility, other than as an argument passed when the facility + is invoked, then you must make a good faith effort to ensure that, + in the event an application does not supply such function or + table, the facility still operates, and performs whatever part of + its purpose remains meaningful. + + (For example, a function in a library to compute square roots has + a purpose that is entirely well-defined independent of the + application. Therefore, Subsection 2d requires that any + application-supplied function or table used by this function must + be optional: if the application does not supply it, the square + root function must still compute square roots.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Library, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Library, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote +it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Library. 
+ +In addition, mere aggregation of another work not based on the Library +with the Library (or with a work based on the Library) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may opt to apply the terms of the ordinary GNU General Public +License instead of this License to a given copy of the Library. To do +this, you must alter all the notices that refer to this License, so +that they refer to the ordinary GNU General Public License, version 2, +instead of to this License. (If a newer version than version 2 of the +ordinary GNU General Public License has appeared, then you can specify +that version instead if you wish.) Do not make any other change in +these notices. + + Once this change is made in a given copy, it is irreversible for +that copy, so the ordinary GNU General Public License applies to all +subsequent copies and derivative works made from that copy. + + This option is useful when you wish to copy part of the code of +the Library into a program that is not a library. + + 4. You may copy and distribute the Library (or a portion or +derivative of it, under Section 2) in object code or executable form +under the terms of Sections 1 and 2 above provided that you accompany +it with the complete corresponding machine-readable source code, which +must be distributed under the terms of Sections 1 and 2 above on a +medium customarily used for software interchange. + + If distribution of object code is made by offering access to copy +from a designated place, then offering equivalent access to copy the +source code from the same place satisfies the requirement to +distribute the source code, even though third parties are not +compelled to copy the source along with the object code. + + 5. A program that contains no derivative of any portion of the +Library, but is designed to work with the Library by being compiled or +linked with it, is called a "work that uses the Library". 
Such a +work, in isolation, is not a derivative work of the Library, and +therefore falls outside the scope of this License. + + However, linking a "work that uses the Library" with the Library +creates an executable that is a derivative of the Library (because it +contains portions of the Library), rather than a "work that uses the +library". The executable is therefore covered by this License. +Section 6 states terms for distribution of such executables. + + When a "work that uses the Library" uses material from a header file +that is part of the Library, the object code for the work may be a +derivative work of the Library even though the source code is not. +Whether this is true is especially significant if the work can be +linked without the Library, or if the work is itself a library. The +threshold for this to be true is not precisely defined by law. + + If such an object file uses only numerical parameters, data +structure layouts and accessors, and small macros and small inline +functions (ten lines or less in length), then the use of the object +file is unrestricted, regardless of whether it is legally a derivative +work. (Executables containing this object code plus portions of the +Library will still fall under Section 6.) + + Otherwise, if the work is a derivative of the Library, you may +distribute the object code for the work under the terms of Section 6. +Any executables containing that work also fall under Section 6, +whether or not they are linked directly with the Library itself. + + 6. As an exception to the Sections above, you may also combine or +link a "work that uses the Library" with the Library to produce a +work containing portions of the Library, and distribute that work +under terms of your choice, provided that the terms permit +modification of the work for the customer's own use and reverse +engineering for debugging such modifications. 
+ + You must give prominent notice with each copy of the work that the +Library is used in it and that the Library and its use are covered by +this License. You must supply a copy of this License. If the work +during execution displays copyright notices, you must include the +copyright notice for the Library among them, as well as a reference +directing the user to the copy of this License. Also, you must do one +of these things: + + a) Accompany the work with the complete corresponding + machine-readable source code for the Library including whatever + changes were used in the work (which must be distributed under + Sections 1 and 2 above); and, if the work is an executable linked + with the Library, with the complete machine-readable "work that + uses the Library", as object code and/or source code, so that the + user can modify the Library and then relink to produce a modified + executable containing the modified Library. (It is understood + that the user who changes the contents of definitions files in the + Library will not necessarily be able to recompile the application + to use the modified definitions.) + + b) Use a suitable shared library mechanism for linking with the + Library. A suitable mechanism is one that (1) uses at run time a + copy of the library already present on the user's computer system, + rather than copying library functions into the executable, and (2) + will operate properly with a modified version of the library, if + the user installs one, as long as the modified version is + interface-compatible with the version that the work was made with. + + c) Accompany the work with a written offer, valid for at + least three years, to give the same user the materials + specified in Subsection 6a, above, for a charge no more + than the cost of performing this distribution. 
+ + d) If distribution of the work is made by offering access to copy + from a designated place, offer equivalent access to copy the above + specified materials from the same place. + + e) Verify that the user has already received a copy of these + materials or that you have already sent this user a copy. + + For an executable, the required form of the "work that uses the +Library" must include any data and utility programs needed for +reproducing the executable from it. However, as a special exception, +the materials to be distributed need not include anything that is +normally distributed (in either source or binary form) with the major +components (compiler, kernel, and so on) of the operating system on +which the executable runs, unless that component itself accompanies +the executable. + + It may happen that this requirement contradicts the license +restrictions of other proprietary libraries that do not normally +accompany the operating system. Such a contradiction means you cannot +use both them and the Library together in an executable that you +distribute. + + 7. You may place library facilities that are a work based on the +Library side-by-side in a single library together with other library +facilities not covered by this License, and distribute such a combined +library, provided that the separate distribution of the work based on +the Library and of the other library facilities is otherwise +permitted, and provided that you do these two things: + + a) Accompany the combined library with a copy of the same work + based on the Library, uncombined with any other library + facilities. This must be distributed under the terms of the + Sections above. + + b) Give prominent notice with the combined library of the fact + that part of it is a work based on the Library, and explaining + where to find the accompanying uncombined form of the same work. + + 8. 
You may not copy, modify, sublicense, link with, or distribute +the Library except as expressly provided under this License. Any +attempt otherwise to copy, modify, sublicense, link with, or +distribute the Library is void, and will automatically terminate your +rights under this License. However, parties who have received copies, +or rights, from you under this License will not have their licenses +terminated so long as such parties remain in full compliance. + + 9. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Library or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Library (or any work based on the +Library), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Library or works based on it. + + 10. Each time you redistribute the Library (or any work based on the +Library), the recipient automatically receives a license from the +original licensor to copy, distribute, link with or modify the Library +subject to these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties with +this License. + + 11. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Library at all. 
For example, if a patent +license would not permit royalty-free redistribution of the Library by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Library. + +If any portion of this section is held invalid or unenforceable under any +particular circumstance, the balance of the section is intended to apply, +and the section as a whole is intended to apply in other circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 12. If the distribution and/or use of the Library is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Library under this License may add +an explicit geographical distribution limitation excluding those countries, +so that distribution is permitted only in or among countries not thus +excluded. In such case, this License incorporates the limitation as if +written in the body of this License. + + 13. The Free Software Foundation may publish revised and/or new +versions of the Lesser General Public License from time to time. 
+Such new versions will be similar in spirit to the present version, +but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library +specifies a version number of this License which applies to it and +"any later version", you have the option of following the terms and +conditions either of that version or of any later version published by +the Free Software Foundation. If the Library does not specify a +license version number, you may choose any version ever published by +the Free Software Foundation. + + 14. If you wish to incorporate parts of the Library into other free +programs whose distribution conditions are incompatible with these, +write to the author to ask for permission. For software which is +copyrighted by the Free Software Foundation, write to the Free +Software Foundation; we sometimes make exceptions for this. Our +decision will be guided by the two goals of preserving the free status +of all derivatives of our free software and of promoting the sharing +and reuse of software generally. + + NO WARRANTY + + 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO +WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. +EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR +OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY +KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE +LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME +THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN +WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY +AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU +FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR +CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE +LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING +RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A +FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF +SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH +DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Libraries + + If you develop a new library, and you want it to be of the greatest +possible use to the public, we recommend making it free software that +everyone can redistribute and change. You can do so by permitting +redistribution under these terms (or, alternatively, under the terms of the +ordinary General Public License). + + To apply these terms, attach the following notices to the library. It is +safest to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least the +"copyright" line and a pointer to where the full notice is found. + + + Copyright (C) + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +Also add information on how to contact you by electronic and paper mail. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the library, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the + library `Frob' (a library for tweaking knobs) written by James Random Hacker. + + , 1 April 1990 + Ty Coon, President of Vice + +That's all there is to it! + + From padb at googlecode.com Mon Dec 21 22:50:04 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 21 Dec 2009 22:50:04 +0000 Subject: [padb] r371 committed - Re-add the tidy and report targets to the auto-generated... Message-ID: <0016e6d26d580d2a9f047b44e9c3@google.com> Revision: 371 Author: apittman Date: Mon Dec 21 14:49:14 2009 Log: Re-add the tidy and report targets to the auto-generated Makefile. http://code.google.com/p/padb/source/detail?r=371 Modified: /trunk/src/Makefile.am ======================================= --- /trunk/src/Makefile.am Mon Dec 21 14:19:11 2009 +++ /trunk/src/Makefile.am Mon Dec 21 14:49:14 2009 @@ -3,3 +3,13 @@ minfo_CFLAGS = -ldl minfo_SOURCES = minfo.c mpi_interface.h EXTRA_DIST = padb + +tidy: + perltidy -b -ce -w -se padb + +pc: padb + perlcritic --brutal --verbose "%l: (%s) %m\n" padb > .pc.tmp || true + /bin/mv .pc.tmp pc + +report: pc + ./report.pl pc | tee report From padb at googlecode.com Mon Dec 21 22:56:13 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 21 Dec 2009 22:56:13 +0000 Subject: [padb] r372 committed - Update padb to look in the new location for minfo.... 
Message-ID: <005045015fd70b1425047b44ffde@google.com> Revision: 372 Author: apittman Date: Mon Dec 21 14:55:24 2009 Log: Update padb to look in the new location for minfo. It's now called minfo rather than minfo.x at last and is installed into $libexecprefix rather than anywhere on PATH. http://code.google.com/p/padb/source/detail?r=372 Modified: /trunk/src/padb ======================================= --- /trunk/src/padb Mon Dec 21 12:52:44 2009 +++ /trunk/src/padb Mon Dec 21 14:55:24 2009 @@ -561,7 +561,7 @@ $conf{edbopt} = undef; $conf{edb} = find_edb(); -$conf{minfo} = find_minfo(); +$conf{minfo} = undef; # Option to define a list of ports used by padb. $conf{port_range} = undef; @@ -669,8 +669,13 @@ # Look for minfo.x in the same directory as padb. sub find_minfo { - my $dir = dirname($0); - return "$dir/minfo.x"; + my $self = $0; + if ( $self =~ m{\A(.+)/bin/padb\z} ) { + my $dir = $1; + return "$dir/libexec/minfo"; + } + my $dir = dirname($self); + return "$dir/minfo"; } ############################################################################### From padb at googlecode.com Mon Dec 21 23:12:33 2009 From: padb at googlecode.com (padb at googlecode.com) Date: Mon, 21 Dec 2009 23:12:33 +0000 Subject: [padb] r373 committed - Commit the new full-report page to svn. It's been up for a while but... Message-ID: <0016361e7a92736b23047b453979@google.com> Revision: 373 Author: apittman Date: Mon Dec 21 15:11:50 2009 Log: Commit the new full-report page to svn. It's been up for a while but I forgot to commit the source when I uploaded it. http://code.google.com/p/padb/source/detail?r=373 Added: /trunk/doc/full-report.html Modified: /trunk/doc/build_website /trunk/doc/email.html /trunk/doc/header.html /trunk/doc/index.html /trunk/doc/layout.css /trunk/doc/upload_website /trunk/doc/usage.html ======================================= --- /dev/null +++ /trunk/doc/full-report.html Mon Dec 21 15:11:50 2009 @@ -0,0 +1,355 @@ +
+

Full-report mode

+ +
+ +

To target a specific job (full-report option)

+
+ +

Whilst padb can be used to collect very specific information +from an application, unless you know what you are looking for or know +the application very well this may not be what you want. For cases +such as this padb has a "full report" mode in which it collects +such information from a job as is likely to be useful. This will +create a full diagnostic report for a given job iterating over the +more common padb modes and options. If you are just starting +out debugging with padb or are creating an error report for a +third party then the full-report option is a good place to start. For +large jobs this can generate a lot of output so redirecting to a file +is recommended. + +

To run in this mode simply invoke padb with the option +--full-report=<jobid>. + +

The full-report mode is also very useful if you are automatically +creating trace files for later inspection or collecting information +for inspection by a third party. End-users can be instructed to run +it and mail a log back to a remote support team, for example or it can +be integrated into automatic test suites. + +

More detailed information on using padb and about the type +of information padb can collect about a job can be found on +the modes page. + + + + +
+
+$ padb --show-jobs
+45882
+$ padb --full-report=45882
+
+
+
+padb version 3.n (Revision 325)
+full job report for job 45882
+
+----------------
+[0]
+----------------
+comm0: name: 'MPI_COMM_WORLD'
+comm0: rank: '0'
+comm0: size: '4'
+comm0: id: '0'
+comm0: Rank: local 0 global 0
+comm0: Rank: local 1 global 1
+comm0: Rank: local 2 global 2
+comm0: Rank: local 3 global 3
+comm1: name: 'MPI_COMM_SELF'
+comm1: rank: '0'
+comm1: size: '1'
+comm1: id: '0x1'
+comm2: name: 'MPI_COMM_NULL'
+comm2: size: '0'
+comm2: id: '0x2'
+comm3: name: 'MPI COMMUNICATOR 3 DUP FROM 0'
+comm3: rank: '0'
+comm3: size: '4'
+comm3: id: '0x3'
+comm3: Rank: local 0 global 0
+comm3: Rank: local 1 global 1
+comm3: Rank: local 2 global 2
+comm3: Rank: local 3 global 3
+comm4: name: 'MPI COMMUNICATOR 4 DUP FROM 0'
+comm4: rank: '0'
+comm4: size: '4'
+comm4: id: '0x4'
+comm4: Rank: local 0 global 0
+comm4: Rank: local 1 global 1
+comm4: Rank: local 2 global 2
+comm4: Rank: local 3 global 3
+comm5: name: 'MPI COMMUNICATOR 5 SPLIT FROM 3'
+comm5: rank: '0'
+comm5: size: '2'
+comm5: id: '0x5'
+comm5: Rank: local 0 global 0
+comm5: Rank: local 1 global 2
+----------------
+[1]
+----------------
+comm0: name: 'MPI_COMM_WORLD'
+comm0: rank: '1'
+comm0: size: '4'
+comm0: id: '0'
+comm0: Rank: local 0 global 0
+comm0: Rank: local 1 global 1
+comm0: Rank: local 2 global 2
+comm0: Rank: local 3 global 3
+comm1: name: 'MPI_COMM_SELF'
+comm1: rank: '0'
+comm1: size: '1'
+comm1: id: '0x1'
+comm2: name: 'MPI_COMM_NULL'
+comm2: size: '0'
+comm2: id: '0x2'
+comm3: name: 'MPI COMMUNICATOR 3 DUP FROM 0'
+comm3: rank: '1'
+comm3: size: '4'
+comm3: id: '0x3'
+comm3: Rank: local 0 global 0
+comm3: Rank: local 1 global 1
+comm3: Rank: local 2 global 2
+comm3: Rank: local 3 global 3
+comm4: name: 'MPI COMMUNICATOR 4 DUP FROM 0'
+comm4: rank: '1'
+comm4: size: '4'
+comm4: id: '0x4'
+comm4: Rank: local 0 global 0
+comm4: Rank: local 1 global 1
+comm4: Rank: local 2 global 2
+comm4: Rank: local 3 global 3
+comm5: name: 'MPI COMMUNICATOR 5 SPLIT FROM 3'
+comm5: rank: '0'
+comm5: size: '2'
+comm5: id: '0x5'
+comm5: Rank: local 0 global 1
+comm5: Rank: local 1 global 3
+----------------
+[2]
+----------------
+comm0: name: 'MPI_COMM_WORLD'
+comm0: rank: '2'
+comm0: size: '4'
+comm0: id: '0'
+comm0: Rank: local 0 global 0
+comm0: Rank: local 1 global 1
+comm0: Rank: local 2 global 2
+comm0: Rank: local 3 global 3
+comm1: name: 'MPI_COMM_SELF'
+comm1: rank: '0'
+comm1: size: '1'
+comm1: id: '0x1'
+comm2: name: 'MPI_COMM_NULL'
+comm2: size: '0'
+comm2: id: '0x2'
+comm3: name: 'MPI COMMUNICATOR 3 DUP FROM 0'
+comm3: rank: '2'
+comm3: size: '4'
+comm3: id: '0x3'
+comm3: Rank: local 0 global 0
+comm3: Rank: local 1 global 1
+comm3: Rank: local 2 global 2
+comm3: Rank: local 3 global 3
+comm4: name: 'MPI COMMUNICATOR 4 DUP FROM 0'
+comm4: rank: '2'
+comm4: size: '4'
+comm4: id: '0x4'
+comm4: Rank: local 0 global 0
+comm4: Rank: local 1 global 1
+comm4: Rank: local 2 global 2
+comm4: Rank: local 3 global 3
+comm5: name: 'MPI COMMUNICATOR 5 SPLIT FROM 3'
+comm5: rank: '1'
+comm5: size: '2'
+comm5: id: '0x5'
+comm5: Rank: local 0 global 0
+comm5: Rank: local 1 global 2
+----------------
+[3]
+----------------
+comm0: name: 'MPI_COMM_WORLD'
+comm0: rank: '3'
+comm0: size: '4'
+comm0: id: '0'
+comm0: Rank: local 0 global 0
+comm0: Rank: local 1 global 1
+comm0: Rank: local 2 global 2
+comm0: Rank: local 3 global 3
+comm1: name: 'MPI_COMM_SELF'
+comm1: rank: '0'
+comm1: size: '1'
+comm1: id: '0x1'
+comm2: name: 'MPI_COMM_NULL'
+comm2: size: '0'
+comm2: id: '0x2'
+comm3: name: 'MPI COMMUNICATOR 3 DUP FROM 0'
+comm3: rank: '3'
+comm3: size: '4'
+comm3: id: '0x3'
+comm3: Rank: local 0 global 0
+comm3: Rank: local 1 global 1
+comm3: Rank: local 2 global 2
+comm3: Rank: local 3 global 3
+comm4: name: 'MPI COMMUNICATOR 4 DUP FROM 0'
+comm4: rank: '3'
+comm4: size: '4'
+comm4: id: '0x4'
+comm4: Rank: local 0 global 0
+comm4: Rank: local 1 global 1
+comm4: Rank: local 2 global 2
+comm4: Rank: local 3 global 3
+comm5: name: 'MPI COMMUNICATOR 5 SPLIT FROM 3'
+comm5: rank: '1'
+comm5: size: '2'
+comm5: id: '0x5'
+comm5: Rank: local 0 global 1
+comm5: Rank: local 1 global 3
+Total: 10 communicators of which 0 are in use.
+No data was recorded for 24 communicators
+-----------------
+[0-3] (4 processes)
+-----------------
+main() at deadlock.c:42
+      locals
+        MPI_Comm alpha = 'MPI COMMUNICATOR 3 DUP FROM 0' [0-3]
+        MPI_Comm  beta = 'MPI COMMUNICATOR 4 DUP FROM 0' [0-3]
+        MPI_Comm *  mb = '' [0-3]
+        char *       p = 'Address 0xffffffff out of bounds' [0-3]
+        MPI_Comm split = 'MPI COMMUNICATOR 5 SPLIT FROM 3' [0-3]
+  -----------------
+  [0-3] (4 processes)
+  -----------------
+  PMPI_Barrier() at pbarrier.c:62
+        params
+          MPI_Comm comm:
+              'MPI COMMUNICATOR 3 DUP FROM 0' [1-3]
+              'MPI COMMUNICATOR 4 DUP FROM 0' [0]
+        locals
+          int err = '0' [0-3]
+    -----------------
+    [0-3] (4 processes)
+    -----------------
+    ompi_coll_tuned_barrier_intra_dec_fixed() at coll_tuned_decision_fixed.c:206
+          params
+            struct ompi_communicator_t * comm:
+                'MPI COMMUNICATOR 3 DUP FROM 0' [1-3]
+                'MPI COMMUNICATOR 4 DUP FROM 0' [0]
+            mca_coll_base_module_t *   module = 'valid pointer perm=rw-p ([heap])' [0-3]
+          locals
+            int communicator_size = '0' [0-3]
+      -----------------
+      [0-3] (4 processes)
+      -----------------
+      ompi_coll_tuned_barrier_intra_recursivedoubling() at coll_tuned_barrier.c:172
+            params
+              struct ompi_communicator_t * comm:
+                  'MPI COMMUNICATOR 3 DUP FROM 0' [1-3]
+                  'MPI COMMUNICATOR 4 DUP FROM 0' [0]
+              mca_coll_base_module_t *   module = 'valid pointer perm=rw-p ([heap])' [0-3]
+            locals
+              int adjsize = '4' [0-3]
+              int     err = '0' [0-3]
+              int    line: more than 3 distinct values
+              int    mask:
+                  '2' [0-1]
+                  '4' [2-3]
+              int    rank: more than 3 distinct values
+              int  remote:
+                  '0' [1-2]
+                  '1' [0,3]
+              int    size = '4' [0-3]
+        -----------------
+        [0-3] (4 processes)
+        -----------------
+        ompi_coll_tuned_sendrecv_actual() at coll_tuned_util.c:54
+              params
+                void *                    sendbuf = 'null pointer' [0-3]
+                int                        scount = '0' [0-3]
+                ompi_datatype_t *       sdatatype = 'MPI_BYTE' [0-3]
+                int                          dest:
+                    '0' [1-2]
+                    '1' [0,3]
+                int                          stag = '-16' [0-3]
+                void *                    recvbuf = 'null pointer' [0-3]
+                int                        rcount = '0' [0-3]
+                ompi_datatype_t *       rdatatype = 'MPI_BYTE' [0-3]
+                int                        source:
+                    '0' [1-2]
+                    '1' [0,3]
+                int                          rtag = '-16' [0-3]
+                struct ompi_communicator_t * comm:
+                    'MPI COMMUNICATOR 3 DUP FROM 0' [1-3]
+                    'MPI COMMUNICATOR 4 DUP FROM 0' [0]
+                ompi_status_public_t *     status = 'null pointer' [0-3]
+              locals
+                int                           err = '0' [0-3]
+                int                          line = '0' [0-3]
+                ompi_request_t *[2]          reqs = '{, }' [0-3]
+                ompi_status_public_t [2] statuses = 'value too long to display' [0-3]
+          -----------------
+          [0-3] (4 processes)
+          -----------------
+          ompi_request_default_wait_all() at request/req_wait.c:262
+                params
+                  size_t                    count = '2' [0-3]
+                  ompi_request_t **      requests: more than 3 distinct values
+                  ompi_status_public_t * statuses = 'valid pointer perm=rw-p ([stack])' [0-3]
+                locals
+                  char [30] __PRETTY_FUNCTION__ = '"ompi_request_default_wait_all"' [0-3]
+                  size_t              completed = '1' [0-3]
+                  size_t                      i = '2' [0-3]
+                  int                 mpi_error = '0' [0-3]
+                  size_t                pending = '1' [0-3]
+                  ompi_request_t *      request = 'valid pointer perm=rw-p ([heap])' [0-3]
+                  ompi_request_t **        rptr = '' [0-3]
+                  size_t                  start:
+                      '53' [0-1]
+                      '55' [2-3]
+            -----------------
+            [0-3] (4 processes)
+            -----------------
+            opal_condition_wait() at ../opal/threads/condition.h:99
+                  params
+                    opal_condition_t * c = 'valid pointer perm=rw-p' [0-3]
+                    opal_mutex_t *     m = 'valid pointer perm=rw-p' [0-3]
+                  locals
+                    int rc = '0' [0-3]
+              -----------------
+              [0,3] (2 processes)
+              -----------------
+              opal_progress() at runtime/opal_progress.c:206
+                    locals
+                      int events = '0' [0,3]
+                      size_t   i = '0' [0,3]
+              -----------------
+              [1] (1 processes)
+              -----------------
+              opal_progress() at runtime/opal_progress.c:181
+                    locals
+                      int       events = '0' [1]
+                      size_t         i = '2' [1]
+                      opal_timer_t now = '135914459801112' [1]
+                -----------------
+                [1] (1 processes)
+                -----------------
+                opal_timer_base_get_cycles() at ../opal/mca/timer/linux/timer_linux.h:31
+                  opal_sys_timer_get_cycles() at ../opal/include/opal/sys/ia32/timer.h:33
+                        locals
+                          opal_timer_t ret = '135914459801112' [1]
+              -----------------
+              [2] (1 processes)
+              -----------------
+              opal_progress() at runtime/opal_progress.c:166
+                    locals
+                      int events = '0' [2]
+                      size_t   i = '2' [2]
+
+
+ + +

+ ======================================= --- /trunk/doc/build_website Thu Sep 10 02:38:12 2009 +++ /trunk/doc/build_website Mon Dec 21 15:11:50 2009 @@ -8,7 +8,7 @@ echo Uploading website to http://padb.pittman.org.uk -FILES="index usage download email extensions modes configuration" +FILES="index usage download email extensions modes full-report configuration" TDIR=public ======================================= --- /trunk/doc/email.html Mon Nov 9 07:45:01 2009 +++ /trunk/doc/email.html Mon Dec 21 15:11:50 2009 @@ -1,9 +1,7 @@

Mailing Lists

Mailing lists -for padb discussion and development are available for public use, -if you are an existing user or are considering using padb -for the first time I would advise you to join. +for padb discussion and development are available and are archived on-line.
Padb supports many resource managers and should select the -appropriate one for your cluster, if you have more than one resource +appropriate one for your machine, if you have more than one resource manager installed or padb can't detect the correct one use -the rmgr configuration option. - -

If no resource manager is found you can use -O rmgr=local -and process identifiers (pids) will be used instead of job ids. - +the rmgr configuration option to +set machine-wide defaults. + +

If your resource manager or scheduler is not supported you can also +use local and process identifiers (pids) will be used instead +of job ids.

@@ -27,7 +28,7 @@ @@ -46,7 +47,7 @@ - + @@ -67,7 +68,7 @@ + automatically select network jobs on the local node.
Works with any resource manager or software stack that is compliant with the MPI - debugger interface. It is recommended to use support for your + debugger interface. It is preferable to use support for your specific resource manager if it exists.
Fully supported
MPICH2/mpdMPICH2 mpd mpd Fully supported in 3.0 and above
None local-qsnet as local-fd with local-fd-name set to /proc/qsnet/elan/user to - automatially select network jobs on the local node.
@@ -75,6 +76,8 @@

The --list-rmgrs option can be used to show a list of detected resource managers and their active jobs. +


+

Selecting the job(s) to target

@@ -86,6 +89,10 @@ default is to target jobs of the current user, this can be over-ridden with the --user flag. +

To target a specific job

To target a specific job specify the +numeric jobid for the job on the command line, after all other +options. +

Showing list of current jobs

To show a list of currently running jobs for a given user use the --show-jobs option. Alternatively the --list-rmgrs @@ -95,55 +102,31 @@

To target all jobs

To target all jobs currently running for a given user use the --all (-a) flag. -

To target any jobs

To target "any" job currently running for -a given user use the --any (-A) flag. This differs from +

To target any jobs

To target any job currently running for a +given user use the --any (-A) flag. This differs from targeting all jobs as it will exit with an error if more than one job is running. -

To target a specific job

To target a specific job specify the -jobid for the job on the command line, after all other options. - -
- -

To target a specific job (Full report option)

-
-If trying to diagnose a problem or gather information there is another -option, --full-report=<jobid>, this tells padb to target -the job specified and to report all information about the job it knows -how to collect. This option is typically used when creating bug -reports to send to third parties or to inspect a job for anomalies. -

Selecting ranks (Processes)

In modes where data for each process is reported separately it is possible to restrict which ranks are queried, this is done via the --rank option. Multiple ranks can be selected by specifying --rank -multiple times. +multiple times or by specifying a rank list using +the [<low>-<high>,<value>] notation. Eg, to specify ranks +0,2 and 3 use --rank [0,2-3]

Selecting which mode to run in.

Padb can present an array of different information about your select jobs and it can present it in a number of different ways. With -the exception of Full Report only one mode can be selected, if you -need more information about the program run padb more than -once. - -
- -

Full Report

-
-If you are just starting with padb or are creating an error -report for somebody else then the --full-report=<jobid> -option is a good place to start, this will complete a full diagnostic -report for the job iterating over the more common padb options. For -large jobs this can generate a lot of output so redirecting to a file -is recommended. - -

- -A list of avaliable modes and their descriptions can be found on the modes page. +the exception of full Report only one +mode can be selected, if you need more information about the program +padb has to be run more than once. A list of available modes +and their descriptions can be found on +the modes page.