From thipadin.seng-long at bull.net Tue Feb 2 16:03:08 2010
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Tue, 2 Feb 2010 17:03:08 +0100
Subject: [padb] Réf. : Re: Réf. : Re: Réf. : Bullchanges ( with LSF -mpich2wrapper patch )
Message-ID:

On Jan 27 2010 at 18:30 Ashley Pittman wrote:

> As they are dependent on each other could you send them as a single, combined patch please.

I'm sending the combined patch against r386.

> I'm not sure that your loop over @chaps in lsfmpich2wr_get_mpiproc() is correct, should the if ($found_app != 0) test be outside of the main loop? Again a comment explaining what the code is trying to extract would be
> useful here.

This subroutine extracts the file path which comes just after the --app parameter, e.g. in "--app foo ..." foo is the file. So the test is made within the loop over every field (word) of this line.
You are free to optimize my code; I just had to get it working.

Regards,
Thipadin.


Ashley Pittman
01/27/2010 06:30 PM

To: thipadin.seng-long at bull.net
cc: padb-devel at pittman.org.uk, Andry.Razafinjatovo at bull.net, florence.vallee at bull.net, Sylvain Jeaugey
Subject: Re: [padb] Réf. : Re: Réf. : Bull changes ( with LSF -mpich2wrapper patch )

On 21 Jan 2010, at 14:20, thipadin.seng-long at bull.net wrote:
>
> I get back to you after a short break, as I've been doing some validation on an openmpi spawn functionality.
> Now that I've finished what you asked me above, I am sending both patches.
> One for the lsf-mpich2 wrapper, and the other one for the lsf-openmpi wrapper. I did it against the r386 version.
> Both are alike and have many common subroutines. As the patches are separated some routines
> are in both patches. I'd prefer that you integrate them in one go so that you can factor out the common code.
> If you need some 'ps' or 'bjobs' command layouts to understand the code, please ask and I'll send them to you.
As they are dependent on each other could you send them as a single, combined patch please.

I don't have systems I can test this on as I don't have lsf but I would like to understand the code, could you put together a paragraph for each rmgr describing how the underlying resource manager lays out processes and how padb finds its information. I'm particularly interested in why it has to ssh around to different nodes to see the information it needs.

With the ps command you can prevent the printing of headers by using the option "-o pid=,ppid=,cmd=" which will avoid the special case for removing these later on.

Stripping the leading spaces from ps output is already done in get_extended_process_list(), can you use the same regexp in get_line_ppid() for clarity please.

I'm not sure that your loop over @chaps in lsfmpich2wr_get_mpiproc() is correct, should the if ($found_app != 0) test be outside of the main loop? Again a comment explaining what the code is trying to extract would be useful here.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lsf_mpich2_ompi.patch
Type: application/octet-stream
Size: 17901 bytes
Desc: not available

From padb at googlecode.com Tue Feb 2 16:18:43 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Tue, 02 Feb 2010 16:18:43 +0000
Subject: [padb] r387 committed - Add better support for reporting theads in stack traces....
Message-ID: <0016e6d27cc1a3d71b047ea07459@google.com>

Revision: 387
Author: apittman
Date: Tue Feb 2 08:17:46 2010
Log: Add better support for reporting threads in stack traces. A tree of stacks is shown for each thread, with as many trees as there are thread identifiers in the entire job.
Gdb numbers threads from 0 upwards so typically each process will have a thread[0] and then a number of extra threads as well. This patch adds a new output_namespace() function to the core code and some corresponding code to gather output not just over ranks but over namespaces as well.
http://code.google.com/p/padb/source/detail?r=387

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Sun Jan 10 07:17:18 2010
+++ /trunk/src/padb Tue Feb 2 08:17:46 2010
@@ -3861,9 +3861,13 @@
 }

 sub add_data_to_tree {
-    my ( $tree, $d ) = @_;
+    my ( $tree, $ns, $d ) = @_;
+    my $prefix = $EMPTY_STRING;
+    if ( defined $ns ) {
+        $prefix = "$ns|";
+    }
     if ( defined $d->{target_data} ) {
-        _add_data_to_tree( $tree, $d, $EMPTY_STRING );
+        _add_data_to_tree( $tree, $d, $prefix );
     }
     return;
 }
@@ -3913,12 +3917,12 @@
 sub new_tree {
     my ( $lines, $d ) = @_;
     my %tree;
-    debug_log( 'tree', undef, 'Making the tree' );
+    debug_log( 'tree', $d, 'Making the tree' );
     foreach my $tag ( sort { $a <=> $b } keys %{$lines} ) {
         add_tag_to_tree( \%tree, $tag, $lines->{$tag} );
     }
     debug_log( 'tree', \%tree, 'Enhancing the tree' );
-    add_data_to_tree( \%tree, $d );
+    add_data_to_tree( \%tree, undef, $d );
     debug_log( 'tree', \%tree, 'Formatting the tree' );
     my $t = display_tree( \%tree, );
     debug_log( 'tree', undef, 'Displaying the tree' );
@@ -3926,12 +3930,48 @@
     debug_log( 'tree', undef, 'Done' );
     return;
 }
+
+# An experimental new tree format.
+sub new_ns_tree { + my ( $d, $ns ) = @_; + + my %tree; + debug_log( 'tree', undef, 'Making the tree' ); + my @tags; + foreach my $tag ( keys %{ $d->{target_ns_output} } ) { + if ( defined $d->{target_ns_output}->{$tag}->{$ns} ) { + push @tags, $tag; + } + } + foreach my $tag ( sort { $a <=> $b } @tags ) { + add_tag_to_tree( \%tree, $tag, $d->{target_ns_output}->{$tag}->{$ns} ); + } + debug_log( 'tree', \%tree, 'Enhancing the tree' ); + add_data_to_tree( \%tree, $ns, $d ); + debug_log( 'tree', \%tree, 'Formatting the tree' ); + my $t = display_tree( \%tree, ); + debug_log( 'tree', undef, 'Displaying the tree' ); + print "Stack trace(s) for thread: $ns\n"; + print $t; + debug_log( 'tree', undef, 'Done' ); + return; +} sub complex_output_handler { my ( $output, $lines, $d ) = @_; if ( $output eq 'tree' ) { - new_tree( $lines, $d ); + if ( not defined $d->{target_ns_output} ) { + new_tree( $lines, $d ); + return; + } + + foreach + my $ns ( sort { $a <=> $b } keys %{ $d->{target_data}{thread_id} } ) + { + new_ns_tree( $d, $ns ); + } + } elsif ( $output eq 'compress' ) { foreach my $tag ( sort { $a <=> $b } ( keys %{$lines} ) ) { @@ -4420,6 +4460,14 @@ format_target_data( $d->{target_data} ) ); } + + if ( defined $d->{target_output} ) { + debug_log( 'tdata', $d->{target_output}, 'Target output' ); + } + + if ( defined $d->{target_ns_output} ) { + debug_log( 'tdata', $d->{target_ns_output}, 'Target namespace output' ); + } maybe_clear_screen(); maybe_show_header($comm_data); @@ -5497,8 +5545,16 @@ my %inner_conf; my %inner_output; +my %inner_ns_output; my %local_target_data; +sub output_namespace { + my ( $rank, $ns, $str ) = @_; + + push @{ $inner_ns_output{$rank}{$ns} }, $str; + return; +} + sub output { my ( $vp, $str ) = @_; @@ -8054,6 +8110,9 @@ my @frames = @{ $thread->{frames} }; output( $vp, "ThreadId: $thread->{id}" ) if ( @threads != 1 ); + if ( $carg->{out_format} eq 'tree' ) { + target_key_pair( $vp, 'thread_id', $thread->{id} ); + } my $strip_below; @@ 
-8095,8 +8154,11 @@ output( $vp, $l ); if ( $carg->{out_format} eq 'tree' ) { + + output_namespace( $vp, $thread->{id}, $l ); push @fl, $l; my $fl = join( ",", @fl ); + $fl = "$thread->{id}|$fl"; if ( $carg->{stack_shows_locals} ) { my @local_names; foreach my $loc ( @{ $frame->{locals} } ) { @@ -8796,6 +8858,14 @@ $r->{target_output}{$tp}; } } + + # Combine the target process responses from child. + if ( exists $r->{target_ns_output} ) { + foreach my $tp ( keys %{ $r->{target_ns_output} } ) { + $handle->{all_replys}->{target_ns_output}{$tp} = + $r->{target_ns_output}{$tp}; + } + } # Copy the target local responses. if ( exists $handle->{target_response} ) { @@ -8812,6 +8882,13 @@ %inner_output = (); + # Save any output we've got from this node. + foreach my $key ( keys %inner_ns_output ) { + $handle->{all_replys}->{target_ns_output}{$key} = + $inner_ns_output{$key}; + } + %inner_ns_output = (); + # Copy the network target errors into response. if ( exists $r->{target_data} ) { if ( exists $handle->{all_replys}->{target_data} ) { @@ -9259,6 +9336,11 @@ foreach my $key ( keys %inner_output ) { $res->{target_output}{$key} = $inner_output{$key}; } + + # Save any output we've got from this node. + foreach my $key ( keys %inner_ns_output ) { + $res->{target_ns_output}{$key} = $inner_ns_output{$key}; + } if (%local_target_data) { $res->{target_data} = \%local_target_data; @@ -9268,6 +9350,7 @@ # Clear down the local inputs. 
     %inner_output = ();
+    %inner_ns_output = ();
     %local_target_data = ();
     $netdata->{target_response} = undef;

From thipadin.seng-long at bull.net Thu Feb 4 14:38:36 2010
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Thu, 4 Feb 2010 15:38:36 +0100
Subject: [padb] Réf. : Réf. : Re: Réf. : Re : Réf. : Bullchanges ( with LSF -mpich2 wrapperi and -openmpiwrapper)
Message-ID:

Hi,

As promised, I'm sending you the output of the 'bjobs' and 'ps' commands on the respective hosts as an example. This example is for the openmpi wrapper. It may help in understanding my code as well.

There are 2 jobs: one (1478) started with the wrapper and one (1477) started without it. With the wrapper we can determine how many procs are on which host (2 on artemis3, 2 on artemis4). Without the wrapper we can only see that the job started on 'artemis3'; we don't know how many procs are on artemis3 and on artemis4 (actually 2 on artemis3 and 2 on artemis4). So this job should not be taken into account.

To distinguish between the two, I look at the bjobs output to find the master host (where the job starts), which should be the first line of EXEC_HOST; in this case it is 'artemis3' for both jobs. On this host the ps command shows mpirun --app 'path_to_app_file' for the wrapper job, while for the other job it shows mpirun without the --app parameter. The appfile in turn contains the TaskStarter command with -p artemis3:37756, a host:port value that all subsequent processes on each remote host should carry, while the job without the wrapper has none.
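
For illustration, the port lookup described above could be sketched in Perl roughly as follows. This is a minimal sketch and not the code from the patch: taskstarter_port() and pids_for_port() are hypothetical helper names, the ps lines are assumed to be in "pid ppid cmd" order, and the sample data is abridged from the appfile and ps listings in this message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of padb): return the host:port value that
# follows a standalone "-p" word on a TaskStarter command line, or undef.
sub taskstarter_port {
    my ($line) = @_;
    return if $line !~ /TaskStarter/;
    my @words = split ' ', $line;
    foreach my $i ( 0 .. $#words - 1 ) {

        # Compare with eq so that "--prefix" is never mistaken for "-p".
        return $words[ $i + 1 ] if $words[$i] eq '-p';
    }
    return;
}

# Hypothetical helper: PIDs of the TaskStarter processes whose -p value
# equals $port, given ps output lines in "pid ppid cmd" order.
sub pids_for_port {
    my ( $port, @ps_lines ) = @_;
    my @pids;
    foreach my $line (@ps_lines) {
        my ( $pid, $ppid, $cmd ) = split ' ', $line, 3;
        next if !defined $cmd;
        my $found = taskstarter_port($cmd);
        push @pids, $pid if defined $found and $found eq $port;
    }
    return @pids;
}

# Sample data, abridged from the appfile and the artemis4 ps listing.
my $appfile_line =
    '-host artemis4 -n 2 --prefix /home_nfs/senglont/ompi_inst/1.3.3/ '
  . '/usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter '
  . '-p artemis3:37756 -c /usr/share/lsf/conf ./pp_sndrcv_spbl';
my $port = taskstarter_port($appfile_line);

my @ps = (
    '10479 10478 ./pp_sleep',
    '10489 10488 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756',
    '10490 10488 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756',
);
my @starters = pids_for_port( $port, @ps );

# prints: port=artemis3:37756 starters=10489 10490
print "port=$port starters=@starters\n";
```

The application processes are then the children of those TaskStarter PIDs, which is why the job started without the wrapper (no -p value anywhere) cannot be matched this way.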
[senglont at artemis3 lsf-ompi]$ bjobs
JOBID   USER    STAT  QUEUE   FROM_HOST  EXEC_HOST   JOB_NAME   SUBMIT_TIME
1477    senglon RUN   normal  artemis3   artemis3    PP_SLNOWR  Feb 4 14:13
1478    senglon RUN   normal  artemis3   2*artemis3  PP_SNDRCV  Feb 4 14:21
                                         2*artemis4
[senglont at artemis3 ]$

Bjobs of 1478:

bjobs -l 1478

Job <1478>, Job Name , User , Project , Status , Queue , Command <#! /bin/bash;# with mpirun wrapper;# essai avec -R span a lancer deux fois lui-meme; # Ok ce script est bon pour lancer 2 jobs;# avec chaque 2proc sur artemis3 et 2proc sur artemis4;#BSUB -J "PP_SNDRCV";#BSUB -m "artemis3 artemis4";#BSUB -o PP_SNDRCV.%J;#BSUB -n 4;#BSUB -e PP_SNDRCVerr.%J;#BSUB -a openmpi;#BSUB -R "span[ptile=2]";source ~/.bashrc_lompi;mpirun.lsf --prefix /home_nfs/senglont/ompi_inst/1.3.3/ ./pp_sndrcv_spbl>
Thu Feb 4 14:21:02: Submitted from host , CWD <$HOME/mympi/lsf-ompi> , Output File , Error File , 4 Processors Requested, Requested Resources , Specified Hosts , ;
Thu Feb 4 14:21:04: Started on 4 Hosts/Processors <2*artemis3> <2*artemis4>, Execution Home , Execution CWD ;
Thu Feb 4 15:19:57: Resource usage collected.
                    The CPU time used is 3526 seconds.
                    MEM: 14 Mbytes; SWAP: 611 Mbytes; NTHREAD: 14
                    PGID: 13623; PIDs: 13631 13635 13637 13638 13623 13624 13628 13629
                    PGID: 13639; PIDs: 13639
                    PGID: 13640; PIDs: 13640
                    PGID: 10491; PIDs: 10491
                    PGID: 10492; PIDs: 10492

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched   -     -    -    -   -   -   -   -    -    -    -
loadStop    -     -    -    -   -   -   -   -    -    -    -

[senglont at artemis3 lsf-ompi]$

PS from artemis3:

[senglont at artemis3 ]$ psu
  PID  PPID CMD
10222 10220 sshd: senglont at pts/5
10223 10222 -bash
13586 27520 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res -d /usr/share/lsf/conf -m
13587 13586 /bin/sh /home_nfs/senglont/.lsbatch/1265289212.1477
13591 13587 /bin/bash /home_nfs/senglont/.lsbatch/1265289212.1477.shell
13592 13591 mpirun --prefix /home_nfs/senglont/ompi_inst/1.3.3 -H artemis3,artemis4 -n 4 .
13594 13592 ./pp_sleep
13595 13592 ./pp_sleep
13623 27520 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res -d /usr/share/lsf/conf -m
13624 13623 /bin/sh /home_nfs/senglont/.lsbatch/1265289662.1478
13628 13624 /bin/bash /home_nfs/senglont/.lsbatch/1265289662.1478.shell
13629 13628 pam -g /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper --prefi
13631 13629 /bin/sh /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper --pref
13635 13631 mpirun --app /home_nfs/senglont/.openmpi_appfile_1478
13637 13635 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
13638 13635 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
13639 13637 ./pp_sndrcv_spbl
13640 13638 ./pp_sndrcv_spbl
13645 27420 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res
13699 10223 ps -o pid,ppid,cmd -u senglont
[senglont at artemis3 lsf-ompi]$
[senglont at artemis3 lsf-ompi]$
[senglont at artemis3 lsf-ompi]$ cat /home_nfs/senglont/.openmpi_appfile_1478
-host artemis4 -n 2 --prefix /home_nfs/senglont/ompi_inst/1.3.3/ /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756 -c /usr/share/lsf/conf -s /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc -a X86_64 ./pp_sndrcv_spbl
-host artemis3 -n 2 --prefix /home_nfs/senglont/ompi_inst/1.3.3/ /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756 -c /usr/share/lsf/conf -s /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc -a X86_64 ./pp_sndrcv_spbl
[senglont at artemis3 lsf-ompi]$

PS from artemis4:

[senglont at artemis4 ~]$ psu
  PID  PPID CMD
10478     1 /home_nfs/senglont/ompi_inst/1.3.3/bin/orted --daemonize -mca ess env -mca ort
10479 10478 ./pp_sleep
10480 10478 ./pp_sleep
10488     1 /home_nfs/senglont/ompi_inst/1.3.3/bin/orted --daemonize -mca ess env -mca ort
10489 10488 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
10490 10488 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
10491 10490 ./pp_sndrcv_spbl
10492 10489 ./pp_sndrcv_spbl
10493 18965 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res
11019 11017 sshd: senglont at pts/8
11020 11019 -bash
11054 11020 ps -o pid,ppid,cmd -u senglont
[senglont at artemis4 ~]$

As you said, we can work on optimizing the code down to a single version (after the commit).

Thipadin.

From thipadin.seng-long at bull.net Tue Feb 9 16:03:09 2010
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Tue, 9 Feb 2010 17:03:09 +0100
Subject: [padb] Réf. : Réf. : Réf. : Re: Réf . : Re: Réf. : Bullchanges ( with LSF -mpich2 wrapper and -openmpi_wrapper combined)
Message-ID:

Hi,

I've finally combined my previous code for the mpich2 and openmpi wrappers on LSF, as we discussed.
I hope you haven't yet committed the previous patch.
On the "outer" side we can store different kinds of jobs (whether mpich2 or openmpi) together in the table.
Each job is tagged with jobid{lsf_mpi} = 1 for mpich2 and 2 for openmpi.
The flag is passed through inner_conf{lsf_mpi} to the inner processes so that they can apply different handling for each wrapper when finding the processes.
The RMGR is 'lsf-mpiwr' (as in "mpi wrapper") because the job must be launched by a wrapper, so it can also be used for further mpi wrappers.

I enjoyed meeting you and hope you can come to CEA often.
I hope you'll commit it soon as we expect to deliver to CEA soon.

Best regards,
Thipadin

-------------- next part --------------
A non-text attachment was scrubbed...
Name: LSF_mpich2_openmpi_2.patch
Type: application/octet-stream
Size: 13736 bytes
Desc: not available

From padb at googlecode.com Mon Feb 15 18:09:14 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 15 Feb 2010 18:09:14 +0000
Subject: [padb] r388 committed - Extend padb to work with (a subset of) LSF jobs....
Message-ID: <000325562e1ecbaef8047fa783d2@google.com>

Revision: 388
Author: apittman
Date: Mon Feb 15 10:08:15 2010
Log: Extend padb to work with (a subset of) LSF jobs.

Note that this commit does not allow padb to attach to all LSF jobs; rather, padb is now able to detect two types of wrapper script commonly used with LSF and extract the information it needs from the wrapper script. As such this is a step forward but not a general-case solution for LSF support and will not meet everybody's needs.

This commit is largely the work of Thipadin @ Bull as sent to the developers list on 9/2/10, with the following changes:
The resource manager has been renamed to the simpler 'lsf'.
lsf_mpi has been replaced with lsf_mode, with valid values being mpich2 or openmpi.
slurp_remote_cmd() has been added instead of calling slurm_cmd() with "ssh host ..."
http://code.google.com/p/padb/source/detail?r=388

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Tue Feb 2 08:17:46 2010
+++ /trunk/src/padb Mon Feb 15 10:08:15 2010
@@ -370,7 +370,8 @@

 # Config options the inner knows about, only forward options if they are in
 # this list.
-my @inner_conf = qw(edb edbopt rmgr scripts slurm_job_step pbs_server);
+my @inner_conf =
+  qw(edb edbopt rmgr scripts slurm_job_step pbs_server lsf_mode lsfmpi_server lsfmpi_mpirpid lsfmpi_port);

 # More config options the inner knows about, these are forwarded on the
 # command line rather than over the sockets.
@@ -507,6 +508,13 @@ find_pids => \&pbs_find_pids, }; +$rmgr{lsf} = { + is_installed => \&lsfmpi_is_installed, + get_active_jobs => \&lsfmpi_get_jobs, + setup_job => \&lsfmpi_setup_pcmd, + find_pids => \&lsfmpi_find_pids, +}; + ############################################################################### # # Config options @@ -905,6 +913,11 @@ close $CFD; return @out; } + +sub slurp_remote_cmd { + my ( $host, $cmd ) = @_; + return slurm_cmd("ssh $host $cmd"); +} sub slurp_dir { my ($dir) = @_; @@ -2839,6 +2852,344 @@ return %pcmd; } + +############################################################################### +# +# lsf-mpich2 wrapper support +# the jobs launched by mpich2_wrapper thanks to mpirun.lsf and #BSUB -a mpich2 +# The job submission file looks like: +# +##! /bin/bash +##BSUB -J "JOB_NAME" +##BSUB -o JOB_NAME.%J +##BSUB -n 4 +##BSUB -e JOB_NAME_err.%J +##BSUB -a mpich2 +#mpirun.lsf ./mpi_prog +# +# lsf-ompi-wrapper support thanks to #BSUB -a openmpi and mpirun.lsf. +# The job submission file looks like: +##! 
/bin/bash +##BSUB -J "PP_SNDRCV" +##BSUB -o PP_SNDRCV.%J +##BSUB -n 4 +##BSUB -e PP_SNDRCVerr.%J +##BSUB -a openmpi +#mpirun.lsf ./pp_sndrcv_spbl +# +############################################################################### + +my %lsfmpi_tabjobs; + +sub lsfmpi_is_installed { + return ( find_exe('mpirun.lsf') + and ( find_exe('mpich2_wrapper') or find_exe('openmpi_wrapper') ) ); +} + +sub lsf_get_line_ppid { + my ( $ppid, $rank_pid, $rank_ppid, @handle ) = @_; + my $ret_line; + my $pid; + foreach my $line (@handle) { + $line =~ s/^ +//; # take off leading space + my @champs = split( /\s+/, $line ); + if ( $champs[$rank_ppid] == $ppid ) { + $pid = $champs[$rank_pid]; + $ret_line = $line; + last; + } + } + return ( $ret_line, $pid ); +} + +sub lsfmpi_get_mpiport { + my ( $host, $portpath ) = @_; + my $portfound = 0; + my $port; + my @handle = slurp_remote_cmd( $host, "cat $portpath" ); + foreach my $line (@handle) { + if ( $line =~ /TaskStarter/ ) { + my @champs = split( " ", $line ); + foreach my $word (@champs) { + if ( $word eq "-p" ) { # don't use =~ because may take --prefix + $portfound = 1; + next; + } + if ( $portfound == 1 ) { + $port = $word; + last; + } + } + last; + } + } + return $port; +} + +sub lsfmpi_get_mpiproc { + my ( $ppid, $host, $job ) = @_; + my $rank_pid = 0; + my $rank_ppid = 1; + my $proc; + my $path_file; + my $count_line = 0; + my $mode; + + #get ps from the leading host(the one that start mpirun.lsf) + my @handle = + slurp_remote_cmd( $host, "ps -o pid=,ppid=,cmd= -u $target_user" ); + + $count_line = @handle; + for ( my $i = 0 ; $i < $count_line ; $i++ ) { # to avoid loop + my ( $line, $pid ); + next if ( !defined $ppid ); + ( $line, $pid ) = + lsf_get_line_ppid( $ppid, $rank_pid, $rank_ppid, @handle ); + next if ( !defined $line ); + if ( $line =~ /mpi/ && $line =~ /-configfile/ ) { + my @champs = split( " ", $line ); + foreach my $word (@champs) { + if ( $word eq "-configfile" ) { + $mode = 'mpich2'; + next; + } + if ( defined 
$mode ) { + $path_file = $word; # get path of -configfile + $proc = $pid; + last; + } + } + if ( $path_file =~ /$job\.newconf$/ ) + { # format is .mpich2_wrapper.jobid.newconf + last; + } + $path_file = undef; + } elsif ( $line =~ /mpi/ && $line =~ /-app/ ) { + my @champs = split( " ", $line ); + foreach my $word (@champs) { + if ( $word eq "--app" ) { + $mode = 'openmpi'; + next; + } + if ( defined $mode ) { + $path_file = $word; # get path file of --app param + $proc = $pid; + last; + } + } + if ( $path_file =~ /$job$/ ) { # format is .openmpi_appfile_jobid + last; + } + $path_file = undef; + } else { + $ppid = $pid; + } + } + return ( $proc, $path_file, $mode ); +} + +sub lsf_get_jobpgid { + my ($jobid) = @_; + my $resfound = 0; + my @proc; + my $cmd = "bjobs -l $jobid "; + my @handle = slurp_cmd($cmd); + foreach my $line (@handle) { + if ( $line =~ /Resource usage collected./i ) { + $resfound = 1; + next; + } + if ( $resfound == 1 ) { + $line =~ s/^ +//; # take off space at start + if ( $line =~ /^PGID:/i ) { + my @champs = split( " ", $line ); + my $pgid = $champs[1]; + chop($pgid) if ( $pgid =~ /;$/ ); + push( @proc, $pgid ); + my $firstpid = 0; + foreach my $word (@champs) { + if ( $word =~ /^PIDs:/ ) { + $firstpid = 1; + next; + } + if ( $firstpid == 1 ) { + push( @proc, $word ); + } + } + last; + } + } + } + return (@proc); +} + +sub lsfmpi_get_hostport { + my $job = shift; + my $d = lsfmpi_get_data(); + my $host; + my $port; + my $mpirunpid; + my $path_port; + my $lsf_mode; + + my @hosts = @{ $d->{$job}{hosts} } if ( defined $d->{$job}{hosts} ); + + $host = $hosts[0] if ( defined $hosts[0] ); + + #get the pgid of the job(first job pid) + my @pgid = lsf_get_jobpgid($job); + my $ppid = $pgid[0]; + + #get the port of the leading proc (mpirun proc port) + if ( defined $ppid and defined $host ) { + ( $mpirunpid, $path_port, $lsf_mode ) = + lsfmpi_get_mpiproc( $ppid, $host, $job ); + $d->{$job}{lsf_mode} = $lsf_mode; # can be 'mpich2' or 'openmpi' + $port = 
lsfmpi_get_mpiport( $host, $path_port ) + if ( defined($path_port) ); + } + return ( $host, $mpirunpid, $port ); +} + +sub lsfmpi_get_lbjobs { + my $jobidfound = 0; + my $found_title = 0; + my $jobid; + my $rank_jobid = 0; + my $rank_user = 1; + my $rank_stat = 2; + my $rank_ehost = 5; + my $rank_jobname = 6; + my $cmd = "bjobs -r -u $target_user "; + my @output = slurp_cmd($cmd); + foreach my $line (@output) { + $line =~ s/^ +//; # suppress blank in front of line + my @champs = split( /\s+/, $line ); + next if ( $champs[$rank_jobid] eq 'JOBID' ); + next if ( $#champs == -1 ); # empty line + if ( $#champs != 0 ) { # line with many fields is first line + $jobid = undef; + $jobid = $champs[$rank_jobid]; + my @ehosts = split( '\*', $champs[$rank_ehost] ); + $lsfmpi_tabjobs{$jobid}{nproc} = $ehosts[0]; + my $exec_host = $ehosts[1]; + push( @{ $lsfmpi_tabjobs{$jobid}{hosts} }, $exec_host ) + if ( defined($exec_host) ); + } elsif ( defined $jobid ) + { # line with one field, should be continued line(exec_host) + my @ehosts = split( '\*', $champs[0] ); + my $exec_host = $ehosts[1]; + chomp($exec_host); + $lsfmpi_tabjobs{$jobid}{nproc} += $ehosts[0]; # nprocess + push( @{ $lsfmpi_tabjobs{$jobid}{hosts} }, $exec_host ); + } + } +} + +sub lsfmpi_get_data { + return \%lsfmpi_tabjobs if ( keys %lsfmpi_tabjobs != 0 ); + lsfmpi_get_lbjobs(); # get job list by bjobs + return \%lsfmpi_tabjobs; +} + +sub lsfmpi_get_jobs { + my $user = shift; + my @ret_jobs; + my $d = lsfmpi_get_data(); + my @jobs = keys %{$d}; + + # filter other jobs that aren't launched by mpich2_wrapper + # (for exemple by mpd; mpiexec; in the submitted job) + # to do this we have criteria below + # jobs launched by mpich2_wrapper will have -configfile parameter + # jobs launched by ompi_wrapper will have --app parameter + foreach my $job (@jobs) { + my ( $server, $mpirpid, $port ) = lsfmpi_get_hostport($job); + if ( defined($mpirpid) and defined($port) and defined($server) ) { + $d->{$job}{server} = $server; + 
$d->{$job}{mpirpid} = $mpirpid; + $d->{$job}{port} = $port; + push( @ret_jobs, $job ); + } + } + return @ret_jobs; +} + +sub lsfmpi_setup_pcmd { + my $job = shift; + my $cmd; + my $index = 0; + my %pcmd; + my $d = lsfmpi_get_data(); + + my ( $server, $mpirpid, $port ); + + $server = $d->{$job}{server}; + $mpirpid = $d->{$job}{mpirpid}; + $port = $d->{$job}{port}; + config_set_internal( 'lsf_mode', $d->{$job}{lsf_mode} ); + config_set_internal( 'lsfmpi_server', $server ); + config_set_internal( 'lsfmpi_mpirpid', $mpirpid ); + config_set_internal( 'lsfmpi_port', $port ); + my @hosts = @{ $d->{$job}{hosts} }; + $pcmd{nprocesses} = $d->{$job}{nproc}; + $pcmd{nhosts} = @hosts; + @{ $pcmd{host_list} } = @hosts; + + return %pcmd; +} + +sub get_pids_ppid { + + # get all pids from ppid be careful about defunct + my ( $ppid, $rank_pid, $rank_ppid, @handle ) = @_; + my $pid; + my @proc; + foreach my $line (@handle) { + $line =~ s/^ +//; # take off leading space + my @champs = split( /\s+/, $line ); + next if ( $champs[$rank_pid] eq 'PID' ); + if ( $champs[$rank_ppid] == $ppid ) { + $pid = $champs[$rank_pid]; + if ( $line =~ /defunct/i ) { + next; + } + push( @proc, $pid ); + } + } + return (@proc); +} + +sub get_pids_fromport { + + # get all pids from port -p host:port_num + my ( $port, $rank_pid, $rank_ppid, $rank_cmd, @handle ) = @_; + my $portfound = 0; + my @proc; + foreach my $line (@handle) { + $line =~ s/^ +//; # take off space at start + my @champs = split( /\s+/, $line ); + my $cmd = $champs[$rank_cmd]; + my $base = basename($cmd); + if ( $base eq "TaskStarter" ) { + if ( $line =~ /$port/ ) { + $portfound = 0; + foreach my $word (@champs) { + if ( $word eq "-p" ) + { # don't use =~ because may take --prefix + $portfound = 1; + next; + } + if ( $portfound == 1 ) { + push( @proc, $champs[$rank_pid] ) if ( $word eq $port ); + last; + } + } + } + } + } + return @proc; +} # open support. 
# @@ -8752,6 +9103,98 @@ } return; } + +# +# LSF-openmpi support and LSF-mpich2 +# +sub lsfmpi_get_proc { + my $job = shift; + my @proc; + my $rank_pid = 1; + my $rank_ppid = 2; + my $rank_cmd = 3; + my $port; + my ( $server, $mpirun_pid ); + $port = $inner_conf{lsfmpi_port}; + $server = $inner_conf{lsfmpi_server}; + $mpirun_pid = $inner_conf{lsfmpi_mpirpid}; + my $cmd = "ps -o uid,pid,ppid,cmd -u $target_user"; + my @handle = slurp_cmd($cmd); + my $hostname = hostname(); + + if ( $hostname eq $server ) { + + #this is the server + #get all mpirun children, it should be TaskStarter pids + #get all TaskStarter children it should be the appli pids + my @ppid_proc; + @ppid_proc = + get_pids_ppid( $mpirun_pid, $rank_pid, $rank_ppid, @handle ); + foreach my $pid (@ppid_proc) { + my @w_proc = get_pids_ppid( $pid, $rank_pid, $rank_ppid, @handle ); + push( @proc, @w_proc ); + } + } + if ( $#proc == -1 ) { # nothing in @proc so try from port + # all cases including case host=server which failed + # get all TaskStarter that matched port num + # get all TaskStarter children it should be the appli pids + my @ppid_proc; + @ppid_proc = + get_pids_fromport( $port, $rank_pid, $rank_ppid, $rank_cmd, @handle ); + foreach my $pid (@ppid_proc) { + my @w_proc = get_pids_ppid( $pid, $rank_pid, $rank_ppid, @handle ); + push( @proc, @w_proc ); + } + } + return @proc; +} + +sub lsfmpi_find_pids { + my $job = shift; + my %vps; + + # Iterate over all processes for this user + if ( $inner_conf{lsf_mode} eq 'mpich2' ) { + foreach my $pid ( lsfmpi_get_proc($job) ) { + + my $vp; + my %env = get_remote_env($pid); + if ( !defined( $env{LSB_JOBID} ) || !defined( $env{PMI_RANK} ) ) { + %env = get_remote_env_bygdb($pid); + } + + if ( $env{LSB_JOBID} eq $job ) { + $vp = $env{PMI_RANK}; + } + if ( defined $vp ) { + $vps{$vp} = $pid; + } + } + } else { + + # lsf_mode eq 'openmpi' + foreach my $pid ( lsfmpi_get_proc($job) ) { + + my $vp; + my %env = get_remote_env($pid); + if ( !defined( 
$env{OMPI_COMM_WORLD_SIZE} )
+                || !defined( $env{OMPI_COMM_WORLD_RANK} ) )
+            {
+                %env = get_remote_env_bygdb($pid);
+            }
+
+            $vp = $env{OMPI_COMM_WORLD_RANK};
+            if ( defined $vp ) {
+                $vps{$vp} = $pid;
+            }
+        }
+    }
+    foreach my $vp ( keys %vps ) {
+        my $pid = $vps{$vp};
+        register_target_process( $vp, $pid );
+    }
+}

 sub rms_find_pids {
     my $jobid = shift;

From ashley at pittman.co.uk Mon Feb 15 18:17:10 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 15 Feb 2010 18:17:10 +0000
Subject: [padb] Réf. : Réf. : Réf. : Re: Réf. : Re: Réf. : Bullchanges ( with LSF -mpich2 wrapper and -openmpi_wrapper combined)
In-Reply-To:
References:
Message-ID: <6713F053-A4CF-43F1-B122-0AE55AD2706E@pittman.co.uk>

On 9 Feb 2010, at 16:03, thipadin.seng-long at bull.net wrote:
> I've finally combined my previous code for the mpich2 and openmpi wrappers on LSF, as we discussed.
> I hope you haven't yet committed the previous patch.
> On the "outer" side we can store different kinds of jobs (whether mpich2 or openmpi) together in the table.
> Each job is tagged with jobid{lsf_mpi} = 1 for mpich2 and 2 for openmpi.
> The flag is passed through inner_conf{lsf_mpi} to the inner processes so that they can apply different handling for each wrapper when finding the processes.
> The RMGR is 'lsf-mpiwr' (as in "mpi wrapper") because the job must be launched by a wrapper, so it can also be used for further mpi wrappers.

I've renamed the rmgr as lsf rather than lsf-mpiwr as the -mpiwr only serves to add confusion. If and when better LSF support comes along it can share the same rmgr setting. I also changed lsf_mpi to lsf_mode and gave it string values instead of int values as well, as this should make the code easier to read.

> I enjoyed meeting you and hope you can come to CEA often.
> I hope you'll commit it soon as we expect to deliver to CEA soon.

Thank you very much for the patch, I'm back from holiday now so have some time to look at this again.

I've committed a variant as r388. I hope I haven't broken anything but can you test it please. I'm interested to see the output if a valid LSF job is specified but it doesn't use a wrapper of the correct style. Is a correct and clear error message given in this case? As I said I don't have access to LSF myself so I've tried to keep any changes to a minimum.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From padb at googlecode.com Mon Feb 15 18:25:26 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 15 Feb 2010 18:25:26 +0000
Subject: [padb] r389 committed - Bump version number to 3.2 and release string to beta0 in preperation ...
Message-ID: <00504502b014be5afd047fa7bd25@google.com>

Revision: 389
Author: apittman
Date: Mon Feb 15 10:24:12 2010
Log: Bump version number to 3.2 and release string to beta0 in preparation for release testing.
http://code.google.com/p/padb/source/detail?r=389

Modified:
 /trunk/configure.in
 /trunk/src/padb

=======================================
--- /trunk/configure.in Mon Dec 21 14:19:11 2009
+++ /trunk/configure.in Mon Feb 15 10:24:12 2010
@@ -1,5 +1,5 @@
 AC_INIT(src/padb)
-AM_INIT_AUTOMAKE(padb,3.1)
+AM_INIT_AUTOMAKE(padb,3.2-beta0)
 AC_PROG_CC
 AC_PROG_INSTALL
 AM_PROG_CC_C_O
=======================================
--- /trunk/src/padb Mon Feb 15 10:08:15 2010
+++ /trunk/src/padb Mon Feb 15 10:24:12 2010
@@ -28,7 +28,7 @@
 # Revision history
-# Version 3.?
+# Version 3.2
 # * Support of PBS Pro
 # * Support for OpenMPI jobs run by mpirun under a slurm allocation.
 # * Modify the Slurm resouce manager code to automatically select a
@@ -350,7 +350,7 @@
 }

 my $prog = basename $0;
-my $version = "3.n (Revision $svn_revision)";
+my $version = "3.2 (Revision $svn_revision)";

 my %conf;

From padb at googlecode.com Mon Feb 15 18:41:44 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Mon, 15 Feb 2010 18:41:44 +0000
Subject: [padb] r390 committed - Initialise a variable to NULL before first use.
Message-ID: <0016e64bb8740a3bfe047fa7f8e3@google.com>

Revision: 390
Author: apittman
Date: Mon Feb 15 10:41:29 2010
Log: Initialise a variable to NULL before first use.

http://code.google.com/p/padb/source/detail?r=390

Modified:
 /trunk/src/minfo.c

=======================================
--- /trunk/src/minfo.c Tue Dec 22 12:03:45 2009
+++ /trunk/src/minfo.c Mon Feb 15 10:41:29 2010
@@ -328,7 +328,7 @@
     int i;
     char *local = base;
-    char *ptr;
+    char *ptr = NULL;
     if ( size == 0 )
         return mqs_ok;

From ashley at pittman.co.uk Mon Feb 15 18:54:28 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 15 Feb 2010 18:54:28 +0000
Subject: [padb] 3.2 Beta release available
Message-ID:

All,

A beta release of the upcoming 3.2 release is available for download from google. This is a major release which fixes a large number of bugs and minor problems as well as introducing significant new features. Amongst the most important are:

autoconf based build procedure (./configure && make && make install)
Support for new resource managers.
Much improved support for threaded applications.
Solaris support.
Better error handling, both with usage errors and with jobs finishing whilst they are being monitored.

The direct download link is:
http://padb.googlecode.com/files/padb-3.2-beta0.tar.gz

Subject to feedback I expect this beta release to bake for a couple of weeks with a tentative release date of the 5th March. Please download this tarball and test it on the platforms you care about, I'm eager to hear any feedback be it good or bad.

Ashley.

From thipadin.seng-long at bull.net Tue Feb 16 16:08:55 2010
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Tue, 16 Feb 2010 17:08:55 +0100
Subject: [padb] Réf. : Re: Réf. : Réf. : Réf. : Re: Réf. : Re: Réf. : Bullchanges ( with LSF -mpich2 wrapper and -openmpi_wrapper combined)
Message-ID:

On 02/15/2010 19:17 Ashley Pittman wrote:
>On 9 Feb 2010, at 16:03, thipadin.seng-long at bull.net wrote:
>> I've eventually combined my previous coding for mpich2 and openmpi wrapper on LSF as we discussed.
>> I hope you haven't yet committed the previous sending.
>> On the "outer" side we can store different combined jobs (whether mpich2 or openmpi) in the table.
>> Each job is tagged in jobid{lsf_mpi} = 1 for mpich2 and 2 for openmpi.
>> The flag is passed through inner_conf{lsf_mpi} to the inner processes so they can do different treatments for each wrapper to find the processes.
>> The RMGR is 'lsf-mpiwr' as mpi wrapper as it must be launched by a wrapper, so it can be used for further mpi wrappers.
>
>I've renamed the rmgr as lsf rather than lsf-mpiwr as the -mpiwr only serves to add confusion. If and when
>better LSF support comes along it can share the same rmgr setting. I also changed lsf_mpi to lsf_mode and gave
>it string values instead of int values as well, as this should make the code easier to read.
>
>> I've enjoyed meeting you. Hoping you can come often to CEA.
>> I hope you'll commit it soon as we expect to deliver to CEA soon.
>
>Thank you very much for the patch, I'm back from holiday now so have some time to look at this again.
>
>I've committed a variant as r388. I hope I haven't broken anything but can you test it please. I'm interested
>to see the output if a valid LSF job is specified but it doesn't use a wrapper of the correct style, is a
>correct and clear error message given in this case? As I said I don't have access to LSF myself so I've tried
>to keep any changes to a minimum.
>
>Ashley,

I tested the 3.2 beta0 release; there is a stray slurm_cmd at line 919, as shown below:

[senglont@artemis1 lsf-ompi]$ ./padb -O rmgr=lsf -atx
Undefined subroutine &main::slurm_cmd called at ./padb line 919.
[senglont@artemis1 lsf-ompi]$ ./padb -V
padb version 3.2 (Revision 389)
Written by Ashley Pittman
http://padb.pittman.org.uk
[senglont@artemis1 lsf-ompi]$

The source is:

sub slurp_remote_cmd {
    my ( $host, $cmd ) = @_;
    return slurm_cmd("ssh $host $cmd");
}

I guess it should have been 'slurp_cmd' instead of 'slurm_cmd'.
I'll modify it myself and retry.

Thipadin.

From padb at googlecode.com Tue Feb 16 22:14:03 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Tue, 16 Feb 2010 22:14:03 +0000
Subject: [padb] r391 committed - Fix a typo in the previous commit, call slurp_cmd which does exist...
Message-ID: <000e0cd118862f52bd047fbf0ddb@google.com>

Revision: 391
Author: apittman
Date: Tue Feb 16 14:13:21 2010
Log: Fix a typo in the previous commit, call slurp_cmd which does exist rather than slurm_cmd() which doesn't.

http://code.google.com/p/padb/source/detail?r=391

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Mon Feb 15 10:24:12 2010
+++ /trunk/src/padb Tue Feb 16 14:13:21 2010
@@ -916,7 +916,7 @@

 sub slurp_remote_cmd {
     my ( $host, $cmd ) = @_;
-    return slurm_cmd("ssh $host $cmd");
+    return slurp_cmd("ssh $host $cmd");
 }

 sub slurp_dir {

From ashley at pittman.co.uk Tue Feb 16 22:16:51 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Tue, 16 Feb 2010 22:16:51 +0000
Subject: [padb] Réf. : Re: Réf. : Réf. : Réf. : Re: Réf. : Re: Réf. : Bullchanges ( with LSF -mpich2 wrapper and -openmpi_wrapper combined)
In-Reply-To:
References:
Message-ID: <9AD40D81-B19F-4C00-8E50-AC35C6E5A85A@pittman.co.uk>

On 16 Feb 2010, at 16:08, thipadin.seng-long at bull.net wrote:
> I guess it should have been 'slurp_cmd' instead of 'slurm_cmd'.
> I'll modify myself and re-try.

Fixed. It shouldn't affect anyone other than you so I won't make another beta release at this stage if you're happy to make the change locally yourself.

From thipadin.seng-long at bull.net Tue Feb 16 23:29:35 2010
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Wed, 17 Feb 2010 00:29:35 +0100
Subject: [padb] Réf. : Re: Réf. : Re: Réf. : Réf. : Réf. : Re: Réf. : Re: Réf. : Bullchanges ( with LSF -mpich2 wrapper and -openmpi_wrapper combined)
Message-ID:

On 16 Feb 2010, at 23:16 Ashley Pittman wrote:
>On 16 Feb 2010, at 16:08, thipadin.seng-long at bull.net wrote:
>> I guess it should have been 'slurp_cmd' instead of 'slurm_cmd'.
>> I'll modify myself and re-try.
>
>Fixed. It shouldn't affect anyone other than you so I won't make another beta release at this stage if you're
>happy to make the change locally yourself.
>

I was testing further and there's still another problem; I guess it comes from the ps command you changed.

[senglont@artemis1 lsf-ompi]$ ./padb -O rmgr=lsf -tx 1516
Use of uninitialized value in numeric eq (==) at ./padb line 2896.
Use of uninitialized value in numeric eq (==) at ./padb line 2896.
Use of uninitialized value in numeric eq (==) at ./padb line 2896.
Use of uninitialized value in numeric eq (==) at ./padb line 2896.

Here's the result of the break point after the call to slurp_cmd:

[senglont@artemis1 lsf-ompi]$ perl -d ./padb -O rmgr=lsf -tx 1516

Loading DB routines from perl5db.pl version 1.28
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

main::(./padb:345): my $svn_revision_string = '$Revision: 389 $';
  DB<1> b 2939
  DB<2> b 2942
  DB<3> c
main::lsfmpi_get_mpiproc(./padb:2939):
2939: my @handle =
2940: slurp_remote_cmd( $host, "ps -o pid=,ppid=,cmd= -u $target_user" );
  DB<3> c
main::lsfmpi_get_mpiproc(./padb:2942):
2942: $count_line = @handle;
  DB<3> p @handle
,ppid=,cmd=
16179
16180
16184
16185
16187
16191
16193
16194
16195
16196
16201
16202
16203
16207
16208
16210
16214
16215
16216
16217
16218
16221
21554
21555
  DB<4>

In my version the ps command was:

my $cmd = "ssh $host ps -o pid,ppid,cmd -u $target_user ";

which displays this on host 'artemis4':

[senglont@artemis1 lsf-ompi]$ ssh artemis4 ps -o pid,ppid,cmd -u senglont
  PID  PPID CMD
16179  2787 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res -d /usr/share/lsf/conf -m artemis1 /home_nfs/senglont/.lsbatch/1266322840.1516
16180 16179 /bin/sh /home_nfs/senglont/.lsbatch/1266322840.1516
16184 16180 /bin/bash /home_nfs/senglont/.lsbatch/1266322840.1516.shell
16185 16184 pam -g /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper --prefix /home_nfs/senglont/ompi_inst/1.3.3/ ./pp_sndrcv_spbl
16187 16185 /bin/sh /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper --prefix /home_nfs/senglont/ompi_inst/1.3.3/ ./pp_sndrcv_spbl
16191 16187 mpirun --app /home_nfs/senglont/.openmpi_appfile_1516 ...................................;
.........................................................................

So can you tell me what you wanted to do?

Thipadin.

From ashley at pittman.co.uk Wed Feb 17 08:59:44 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 17 Feb 2010 08:59:44 +0000
Subject: [padb] Réf. : Re: Réf. : Re: Réf. : Réf. : Réf. : Re: Réf. : Re: Réf. : Bullchanges ( with LSF -mpich2 wrapper and -openmpi_wrapper combined)
In-Reply-To:
References:
Message-ID: <741B6B69-203E-4A32-BBEE-9DAD36D295C6@pittman.co.uk>

On 16 Feb 2010, at 23:29, thipadin.seng-long at bull.net wrote:
>
> On 16 Feb 2010, at 23:16 Ashley Pittman wrote:
>
> I was testing further and there's still another problem, i guess it came from the ps command you changed.
>
> [senglont@artemis1 lsf-ompi]$ ./padb -O rmgr=lsf -tx 1516
> Use of uninitialized value in numeric eq (==) at ./padb line 2896.
> Use of uninitialized value in numeric eq (==) at ./padb line 2896.
> Use of uninitialized value in numeric eq (==) at ./padb line 2896.
> Use of uninitialized value in numeric eq (==) at ./padb line 2896.
>
> Here's the result of the break point after the call to slurp_cmd:

Can you try this patch, I'd forgotten how to format ps commands. The following should work.

Index: padb
===================================================================
--- padb (revision 391)
+++ padb (working copy)
@@ -2937,7 +2937,7 @@

     #get ps from the leading host(the one that start mpirun.lsf)
     my @handle =
-      slurp_remote_cmd( $host, "ps -o pid=,ppid=,cmd= -u $target_user" );
+      slurp_remote_cmd( $host, "ps -o pid= -o ppid= -o cmd= -u $target_user" );

     $count_line = @handle;
     for ( my $i = 0 ; $i < $count_line ; $i++ ) { # to avoid loop

From thipadin.seng-long at bull.net Wed Feb 17 09:12:43 2010
From: thipadin.seng-long at bull.net (thipadin.seng-long at bull.net)
Date: Wed, 17 Feb 2010 10:12:43 +0100
Subject: [padb] Réf. : Réf. : Re: Réf. : Re: Réf. : Réf. : Réf. : Re: Réf. : Re: Réf. : Bullchanges ( with LSF -mpich2 wrapper and -openmpi_wrapper combined)
Message-ID:

Hi,

On reflection, I guess you didn't want the header line of the ps command.
I think putting multiple renamed headers in one -o option doesn't work, as shown here:

[senglont@artemis1 lsf-ompi]$ ps -o ppid=X,comm=Y -u senglont
X,comm=Y
4080
4082
4162
4164
4165
21488
21491
21492
4083

This is because all characters after the first '=' are taken as the header string, so I think you should use one -o option for each renamed header, as in:

[senglont@artemis1 lsf-ompi]$ ps -o pid= -o ppid= -o comm= -u senglont
 4082  4080 sshd
 4083  4082 bash
 4164  4162 sshd
 4165  4164 bash
21488  4165 man
21491 21488 sh
21492 21491 sh
21497 21492 less
23548  4083 ps

I'll modify it and retry.

Thipadin,

From padb at googlecode.com Wed Feb 17 21:00:27 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Wed, 17 Feb 2010 21:00:27 +0000
Subject: [padb] r392 committed - Repair the ps command line so it doesn't display a header and hence is...
Message-ID: <000e0cd23c2ad25866047fd223de@google.com>

Revision: 392
Author: apittman
Date: Wed Feb 17 13:00:01 2010
Log: Repair the ps command line so it doesn't display a header and hence is easier to parse.

http://code.google.com/p/padb/source/detail?r=392

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Tue Feb 16 14:13:21 2010
+++ /trunk/src/padb Wed Feb 17 13:00:01 2010
@@ -2937,7 +2937,7 @@

     #get ps from the leading host(the one that start mpirun.lsf)
     my @handle =
-      slurp_remote_cmd( $host, "ps -o pid=,ppid=,cmd= -u $target_user" );
+      slurp_remote_cmd( $host, "ps -o pid= -o ppid= -o cmd= -u $target_user" );

     $count_line = @handle;
     for ( my $i = 0 ; $i < $count_line ; $i++ ) { # to avoid loop

From padb at googlecode.com Fri Feb 26 20:05:23 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Fri, 26 Feb 2010 20:05:23 +0000
Subject: [padb] r393 committed - Call lstopo with the new --pid option
Message-ID: <000325564d6a717cb50480866b73@google.com>

Revision: 393
Author: apittman
Date: Fri Feb 26 12:04:36 2010
Log: Call lstopo with the new --pid option

http://code.google.com/p/padb/source/detail?r=393

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Wed Feb 17 13:00:01 2010
+++ /trunk/src/padb Fri Feb 26 12:04:36 2010
@@ -10158,8 +10158,8 @@
     handler => \&lstopo,
     arg_long => 'lstopo',
     help => 'Show CPU topology using lstopo',
-    options_i => { lstopo_command => 'lstopo --whole-system -', },
-    options_bool => { lstopo_show_warning => 'yes', },
+    options_i => { lstopo_command => 'lstopo --pid %p -', },
+    options_bool => { lstopo_show_warning => 'no', },
 };

 $allfns{command} = {

From padb at googlecode.com Fri Feb 26 20:09:25 2010
From: padb at googlecode.com (padb at googlecode.com)
Date: Fri, 26 Feb 2010 20:09:25 +0000
Subject: [padb] r394 committed - Run lstopo such that it draws asci art of the machine.
Message-ID: <000e0cd146f8dfa8570480867942@google.com>

Revision: 394
Author: apittman
Date: Fri Feb 26 12:08:10 2010
Log: Run lstopo such that it draws asci art of the machine.

http://code.google.com/p/padb/source/detail?r=394

Modified:
 /trunk/src/padb

=======================================
--- /trunk/src/padb Fri Feb 26 12:04:36 2010
+++ /trunk/src/padb Fri Feb 26 12:08:10 2010
@@ -10158,7 +10158,7 @@
     handler => \&lstopo,
     arg_long => 'lstopo',
     help => 'Show CPU topology using lstopo',
-    options_i => { lstopo_command => 'lstopo --pid %p -', },
+    options_i => { lstopo_command => 'lstopo --pid %p -.txt', },
     options_bool => { lstopo_show_warning => 'no', },
 };
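
A note on the ps change committed as r392 earlier in this thread. As reported there, everything after the first '=' in a single -o argument is taken as header text, so "-o pid=,ppid=,cmd=" yields one pid column headed ",ppid=,cmd=". Giving each column its own -o option with an empty header avoids this, and when every header is empty ps prints no header line at all. The sketch below is not padb code: list_user_procs and parse_ps_lines are hypothetical helper names, and comm is used instead of cmd only because it is the POSIX field name. The demo runs on rows copied from the session above, so it needs no live job.

```shell
#!/bin/sh
# Hypothetical helper (not part of padb): list "pid ppid command" for one
# user with no header line. Each column has its own -o option and an empty
# header ("name="); with all headers empty, ps omits the header line.
list_user_procs() {
    ps -o pid= -o ppid= -o comm= -u "$1"
}

# Hypothetical helper: turn such lines into "pid|ppid|cmd".
# read strips the leading blanks ps uses to right-align the pid column,
# and the last variable receives the rest of the line, spaces included.
parse_ps_lines() {
    while read -r pid ppid cmd; do
        [ -n "$pid" ] || continue   # skip empty lines
        printf '%s|%s|%s\n' "$pid" "$ppid" "$cmd"
    done
}

# Demo on canned rows taken from the thread, so it runs anywhere.
printf '%s\n' ' 4082  4080 sshd' '21492 21491 sh' | parse_ps_lines
```

Run as written, this prints 4082|4080|sshd and 21492|21491|sh. Against a live system the two helpers would be chained as "list_user_procs senglont | parse_ps_lines"; that part is untested here and assumes a ps that follows the POSIX -o rules.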