[padb] Réf. : Re: Réf. : Re: [padb-devel] Patch for Support of PBS Pro resource manager

thipadin.seng-long at bull.net thipadin.seng-long at bull.net
Wed Nov 18 15:48:50 GMT 2009


Hi,

I have now run through the padb tests (assuming the jobid is just the
numeric part). The results are OK, but I hit 2 bugs:

1- Path of remote padb:

[thipa at xn5 padb_open]$ ./padb -O rmgr=pbs -tx 27611
einner: xn20: bash: ./padb: No such file or directory
einner: pdsh at xn5: xn20: ssh exited with exit code 127
einner: xn19: bash: ./padb: No such file or directory
einner: pdsh at xn5: xn19: ssh exited with exit code 127
Unexpected EOF from Inner stdout (connecting)
Unexpected EOF from Inner stderr (connecting)
Waiting for signon from 2 hosts.
Unexpected exit from parallel command (state=connecting)
[thipa at xn5 padb_open]$

[thipa at xn20 padb_open]$ ssh -V
OpenSSH_5.1p1, OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008

I think that, as of OpenSSH 5.1, the remote shell started by ssh no longer
executes .bashrc, so the PATH variable is not set from your .bashrc.
As a consequence, the relative path ./padb is not found on the remote host.
The path used on the remote host must therefore be a full path. I patched
it as follows:

sub pbs_setup_job {

.
.

    # Requires "use File::Basename;" for dirname() and basename().
    my $pwd   = $ENV{PWD};
    my $dirnm = dirname($0);
    my $base  = basename($0);
    # If padb is launched as ./padb then dirname is "." and the path is
    # made absolute; if it is launched with a full path then $0 is used as is.
    my $out;
    if ( $dirnm eq "." ) {
        $out = " $pwd/$base ";
    } else {
        $out = " $0 ";
    }
    $pcmd{padb_path} = $out;

.
}

sub  go_job {

.

my $padb_path  = $pcmd{padb_path};

.
.
 # replace the line  $cmd .= " $0 --inner";  with:

    if (!defined $padb_path) {
       $cmd .= " $0 --inner";
    } else {
       $cmd .= " $padb_path --inner ";
    }

.

.

}

If you have a better idea, I am happy to use it.
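
One possible alternative, as a minimal sketch only: File::Spec->rel2abs can make the invocation path absolute in every case, including a relative path with a directory part such as bin/padb, which the patch above passes through unchanged. The helper name absolute_program_path and the example directories are hypothetical and not part of padb, and the sketch still assumes padb is installed at the same absolute path on the remote hosts.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of padb): return an absolute form of the
# path the program was invoked with, so that a remote shell can find it
# whatever its working directory is.
sub absolute_program_path {
    my ( $invoked_as, $cwd ) = @_;

    # rel2abs leaves absolute paths alone and resolves relative ones
    # against $cwd (or the current directory if $cwd is undefined).
    return File::Spec->rel2abs( $invoked_as, $cwd );
}

# Example directories below are made up for illustration.
foreach my $case (
    [ './padb',        '/home/thipa/padb_open' ],
    [ 'bin/padb',      '/home/thipa' ],
    [ '/usr/bin/padb', '/home/thipa' ],
  )
{
    my ( $path, $cwd ) = @{$case};
    print "$path (from $cwd) -> ", absolute_program_path( $path, $cwd ), "\n";
}
```

In pbs_setup_job this would amount to storing the result for $0 in $pcmd{padb_path} instead of testing dirname for ".".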


2- Use of uninitialized value in subtraction (-) at ./padb line 4077

[thipa at xn5 padb_open]$ ./padb -O rmgr=pbs -O stack-shows-locals=no -O stack-shows-params=no -tx 27611
Use of uninitialized value in subtraction (-) at ./padb line 4077.
-----------------
[0-1,3,6] (4 processes)
-----------------
main() at pp_sndrcv_spbl.c:50
  PMPI_Finalize() at ?:?
    MPID_Finalize() at ?:?
      MPIDI_CH3_Progress_wait() at ?:?
        MPIDU_Sock_wait() at ?:?
          poll() at ?:?
-----------------
[2,4] (2 processes)
-----------------
ThreadId: 1
  -----------------
  [2] (1 processes)
  -----------------
  main() at pp_sndrcv_spbl.c:46
    PMPI_Recv() at ?:?
      MPID_Progress_wait() at ?:?
        MPIDI_CH3_Progress_wait() at ?:?
          MPIDU_Sock_wait() at ?:?
            poll() at ?:?
              ThreadId: 2
                start_thread() at ?:?
                  fd_server() at server.c:354
                    select() at ?:?
  -----------------
  [4] (1 processes)
  -----------------
  main() at pp_sndrcv_spbl.c:50
    PMPI_Finalize() at ?:?
      MPID_Finalize() at ?:?
        MPIDI_CH3_Progress_wait() at ?:?
          MPIDU_Sock_wait() at ?:?
            poll() at ?:?
              ThreadId: 2
                start_thread() at ?:?
                  fd_server() at server.c:354
                    select() at ?:?
[thipa at xn5 padb_open]$ 

The code around line 4077 looks like this:

4058 sub check_signon {
4059     my ( $comm_data, $data ) = @_;
4060     return if ( $conf{check_signon} eq 'none' );
.
.
4075     my $rng = rng_create_empty();
4076 
4077     foreach my $proc ( 0 .. $comm_data->{nprocesses} - 1 ) {
4078         if ( not defined $here{$proc} ) {
4079             rng_add_value( $rng, $proc );
4080         }
4081     }
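
The only subtraction on line 4077 is $comm_data->{nprocesses} - 1, so the warning suggests that nprocesses is undefined at this point when rmgr=pbs is used. A minimal sketch of a guard follows; the function missing_ranks is a hypothetical stand-in for the loop in check_signon, using a plain list instead of the rng_* helpers. It only avoids the warning and does not explain why nprocesses is unset.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the loop in check_signon: return the ranks in
# 0 .. nprocesses-1 that have no entry in %$seen. If nprocesses is
# undefined, return an empty list instead of evaluating undef - 1.
sub missing_ranks {
    my ( $comm_data, $seen ) = @_;

    my $nprocesses = $comm_data->{nprocesses};
    return () if not defined $nprocesses;

    return grep { not defined $seen->{$_} } 0 .. $nprocesses - 1;
}

# Illustrative data: four processes, ranks 0 and 2 have signed on.
my @missing = missing_ranks( { nprocesses => 4 }, { 0 => 1, 2 => 1 } );
print "missing ranks: @missing\n";

# With nprocesses unset there is nothing to check and no warning.
my @none = missing_ranks( {}, {} );
print "ranks checked without nprocesses: ", scalar(@none), "\n";
```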

3- Question about starting inner padb:


How can I start an inner padb by hand on a remote host for debugging, for
example:
perl -d ./padb --inner --jobid=27611.xn0 --stack-trace -O rmgr="pbs" --line-formatted
I used to do this, but the command no longer works now that you have
changed the inner padb to use "call back" and communication over ports.


Here is the diff against r311 again (diff r311 newone).



Could you integrate my new patch, try to fix point 2, and send me back
the new version? I will test it again.

Regards,
Thipadin.








Ashley Pittman <ashley at pittman.co.uk>
11/18/2009 09:00 AM

 
        To:      thipadin.seng-long at bull.net
        cc:      florence.vallee at bull.net, francois.wellenreiter at bull.net, padb-devel at pittman.org.uk, Sylvain.JEAUGEY at bull.net
        Subject: Re: Réf. : Re: [padb-devel] Patch for Support of PBS Pro resource manager

On Tue, 2009-11-17 at 17:17 +0100, thipadin.seng-long at bull.net wrote:
> I have to interrupt it to get the prompt back,
> so I guess it is an infinite loop.
> I have changed 'while(@output)' to 'foreach(@output)' to correct
> this problem.

That looks like a simple mistake on my part.  I prefer not to use $_ in
my code (either implicitly or explicitly) as I think it makes it less
readable.  Once the code works it's easy enough to make the variables
explicit; I just didn't do it in the patch because I prefer to change as
little as I can get away with unless I can test it immediately.
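
For reference, a minimal sketch of the difference, with made-up sample data: while (@output) only tests whether the array is non-empty, so it loops forever unless the body removes elements, whereas foreach with a named loop variable visits each element once without touching $_.

```perl
use strict;
use warnings;

# Made-up sample lines, for illustration only.
my @output = ( "27611.xn0 R", "27612.xn0 Q" );

# foreach with an explicit loop variable: one pass per element, no $_.
my @seen;
foreach my $line (@output) {
    push @seen, $line;
}

# while (@queue) terminates only because the body consumes the array
# with shift; without the shift the condition would stay true forever.
my @queue = @output;
my @drained;
while (@queue) {
    my $line = shift @queue;
    push @drained, $line;
}

print scalar(@seen), " lines via foreach, ",
  scalar(@drained), " lines via while/shift\n";
```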

> 2- Job is not found: 
> 
> With the loop gone I can get further:
> 
> ./padb -O rmgr=pbs -tx 27611.xn0 
> Job 27611.xn0 is not active 
> [thipa at xn5]$ qstat 
> Job id            Name             User              Time Use S Queue 
> ----------------  ---------------- ----------------  -------- - ----- 
> 27611.xn0         STDIN            thipa             00:00:06 R workq

> 
> [thipa at xn5]$ 
> 
> The jobs displayed by qstat have the suffix .xn0 (which
> is the server),
> so we usually give the whole job id as the input jobid.
> So something has to change (code or synopsis).

I guess the job id as you requested it (27611.xn0) does not match what
is returned by pbs_get_jobs(), so there is a problem here to do with the
server.  In the past all job ids have been numeric.  This hasn't been a
problem, but it isn't something that I've strived for; it's just that so
far all resource managers have worked that way, so that's how I think of
it.  There is no technical reason for this to be true, however, so how
about we just say that in future job ids have to be alphanumeric strings?
This would work in this case, although it would have the downside that
you couldn't specify the job as 27611 in the case above.

padb --show-jobs should show you what padb thinks the job id's are and
of course using -a rather than specifying a job tells it to use all jobs
so it'll just attempt to target one in the case above, regardless of
what it thinks it's called.

I'd be happy for a patch supporting either implementation, i.e. I don't
have a strong preference either way.   You can either have the jobid
encompass both the number and the server or you could continue with what
I attempted to encode in the patch I sent you where the job id is the
number and the server becomes a configuration option.
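
As a rough sketch of how both forms could be accepted at once: split the job id into a numeric part and an optional server part, and treat a request without a server as matching any server. The functions split_jobid and jobid_matches are hypothetical and do not exist in padb; the format "number.server" is assumed from the qstat output quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: split "27611.xn0" into ( 27611, "xn0" ) and
# "27611" into ( 27611, undef ). Returns an empty list if the id does
# not start with a number.
sub split_jobid {
    my ($jobid) = @_;
    return unless defined $jobid and $jobid =~ m/\A(\d+)(?:\.(.+))?\z/;
    return ( $1, $2 );
}

# Hypothetical helper: true if the requested id refers to the reported
# job. The numeric parts must agree; the server parts are compared only
# when both ids carry one.
sub jobid_matches {
    my ( $requested, $reported ) = @_;
    my ( $req_num, $req_srv ) = split_jobid($requested);
    my ( $rep_num, $rep_srv ) = split_jobid($reported);

    return 0 unless defined $req_num and defined $rep_num;
    return 0 unless $req_num == $rep_num;
    return 1 unless defined $req_srv and defined $rep_srv;
    return $req_srv eq $rep_srv ? 1 : 0;
}

foreach my $requested ( '27611', '27611.xn0', '27611.xn1', '27612' ) {
    my $result = jobid_matches( $requested, '27611.xn0' ) ? 'match' : 'no match';
    print "$requested against 27611.xn0: $result\n";
}
```

With this approach both "-tx 27611" and "-tx 27611.xn0" would select the same job, at the cost of ambiguity if two servers report the same number.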

Actually this could make life easier for slurm and the way it handles
job steps.  padb effectively appends ".0" to the padb job id before
handing it over to slurm, so this could probably be simplified if the .0
became an optional part of the job id itself rather than a separate
configuration option.

> I am waiting for your patch (or reply) to continue.

I hope this helps you along the way, I can't really code anything from
here as I don't have access to a pbs system.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



-------------- next part --------------
A non-text attachment was scrubbed...
Name: new.patch.diff
Type: application/octet-stream
Size: 7143 bytes
Desc: not available
URL: <http://pittman.org.uk/pipermail/padb-devel_pittman.org.uk/attachments/20091118/da6b0a07/attachment.obj>

