[padb] Re: Bull changes (with LSF -mpich2 wrapper and -openmpi wrapper)

thipadin.seng-long at bull.net
Thu Feb 4 14:38:36 GMT 2010


Hi,
As promised, I am sending you the output of the 'bjobs' and 'ps' commands
on the respective hosts as an example.
This example is for the openmpi wrapper. It may also help to explain my
code.
There are 2 jobs: one (1478) started with the wrapper and one (1477)
started without the wrapper.
With the wrapper we can determine how many procs are on which host (2 on
artemis3, 2 on artemis4).
Without the wrapper we can only see that the job started on 'artemis3'; we
don't know how many procs are on artemis3 and on artemis4
(actually 2 on artemis3 and 2 on artemis4).
So this job should not be taken into account.
To distinguish between the two, I look at the bjobs output to find the
master host (where the job starts), which should be the first entry of
EXEC_HOST; in this case it is 'artemis3' for both jobs. On this host the ps
command shows mpirun --app 'path_to_app_file' for the wrapper job, while
for the other job it shows mpirun without the --app parameter.
The appfile contains the TaskStarter command with -p artemis3:37756, a
host:port value that all subsequent processes of the job should carry on
each remote host, while the job without the wrapper has no such value.
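
To make the check above concrete, here is a minimal Python sketch (not the actual padb code). It only assumes that the ps CMD column is available as a string; the function name appfile_from_cmd is made up for illustration.

```python
import os
import shlex


def appfile_from_cmd(cmd):
    """Return the appfile path if cmd is 'mpirun --app <file>', else None.

    Hypothetical helper: a job whose mpirun has no --app argument is
    treated as started without the wrapper.
    """
    try:
        args = shlex.split(cmd)
    except ValueError:
        # Unbalanced quotes, e.g. a command line truncated by ps.
        return None
    if not args or os.path.basename(args[0]) != "mpirun":
        return None
    if "--app" in args:
        i = args.index("--app")
        if i + 1 < len(args):
            return args[i + 1]
    return None
```

For job 1478 this returns the appfile path, and for job 1477 (plain mpirun with -H and -n) it returns None, so that job would be skipped.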



[senglont at artemis3 lsf-ompi]$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
1477    senglon RUN   normal     artemis3    artemis3    PP_SLNOWR  Feb  4 14:13
1478    senglon RUN   normal     artemis3    2*artemis3  PP_SNDRCV  Feb  4 14:21
                                             2*artemis4
[senglont at artemis3 ]$ 
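
As an illustration of reading the EXEC_HOST column, here is a small Python sketch. It assumes the column entries have already been collected into a list of strings such as "2*artemis3"; the function name parse_exec_hosts is hypothetical.

```python
def parse_exec_hosts(entries):
    """Turn EXEC_HOST entries into (host, slots) pairs, in order.

    The first pair is taken to be the master host. An entry without
    a 'N*' prefix is counted as one slot.
    """
    hosts = []
    for entry in entries:
        count, sep, host = entry.partition("*")
        if sep:
            hosts.append((host, int(count)))
        else:
            hosts.append((entry, 1))
    return hosts
```

Note that for job 1477 this would only give artemis3 with one slot, which is exactly why the bjobs output alone is not enough for the job started without the wrapper.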

Bjobs of 1478

bjobs -l 1478

Job <1478>, Job Name <PP_SNDRCV>, User <senglont>, Project <default>, Status <R
                     UN>, Queue <normal>, Command <#! /bin/bash;#  with mpirun 
                     wrapper;#  essai avec -R span a lancer deux fois lui-meme;
                     #  Ok ce script est bon pour lancer 2 jobs;#  avec chaque 
                     2proc sur artemis3 et 2proc sur artemis4;#BSUB -J "PP_SNDR
                     CV";#BSUB -m "artemis3 artemis4";#BSUB -o PP_SNDRCV.%J;#BS
                     UB -n 4;#BSUB -e PP_SNDRCVerr.%J;#BSUB -a openmpi;#BSUB -R
                      "span[ptile=2]";source ~/.bashrc_lompi;mpirun.lsf --prefi
                     x /home_nfs/senglont/ompi_inst/1.3.3/ ./pp_sndrcv_spbl>
Thu Feb  4 14:21:02: Submitted from host <artemis3>, CWD <$HOME/mympi/lsf-ompi>
                     , Output File <PP_SNDRCV.%J>, Error File <PP_SNDRCVerr.%J>
                     , 4 Processors Requested, Requested Resources <span[ptile=
                     2]>, Specified Hosts <artemis3>, <artemis4>;
Thu Feb  4 14:21:04: Started on 4 Hosts/Processors <2*artemis3> <2*artemis4>, E
                     xecution Home </home_nfs/senglont>, Execution CWD </home_n
                     fs/senglont/mympi/lsf-ompi>;
Thu Feb  4 15:19:57: Resource usage collected.
                     The CPU time used is 3526 seconds.
                     MEM: 14 Mbytes;  SWAP: 611 Mbytes;  NTHREAD: 14
                     PGID: 13623;  PIDs: 13631 13635 13637 13638 13623 13624 
                     13628 13629 
                     PGID: 13639;  PIDs: 13639 
                     PGID: 13640;  PIDs: 13640 
                     PGID: 10491;  PIDs: 10491 
                     PGID: 10492;  PIDs: 10492 


 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

[senglont at artemis3 lsf-ompi]$ 
PS from artemis3
 
[senglont at artemis3 ]$ psu
  PID  PPID CMD
10222 10220 sshd: senglont at pts/5
10223 10222 -bash
13586 27520 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res -d /usr/share/lsf/conf -m
13587 13586 /bin/sh /home_nfs/senglont/.lsbatch/1265289212.1477
13591 13587 /bin/bash /home_nfs/senglont/.lsbatch/1265289212.1477.shell
13592 13591 mpirun --prefix /home_nfs/senglont/ompi_inst/1.3.3 -H artemis3,artemis4 -n 4 .
13594 13592 ./pp_sleep
13595 13592 ./pp_sleep
13623 27520 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res -d /usr/share/lsf/conf -m
13624 13623 /bin/sh /home_nfs/senglont/.lsbatch/1265289662.1478
13628 13624 /bin/bash /home_nfs/senglont/.lsbatch/1265289662.1478.shell
13629 13628 pam -g /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper --prefi
13631 13629 /bin/sh /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper --pref
13635 13631 mpirun --app /home_nfs/senglont/.openmpi_appfile_1478
13637 13635 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
13638 13635 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
13639 13637 ./pp_sndrcv_spbl
13640 13638 ./pp_sndrcv_spbl
13645 27420 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res
13699 10223 ps -o pid,ppid,cmd -u senglont
[senglont at artemis3 lsf-ompi]$ 
[senglont at artemis3 lsf-ompi]$ cat /home_nfs/senglont/.openmpi_appfile_1478
-host artemis4 -n 2   --prefix /home_nfs/senglont/ompi_inst/1.3.3/ /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756 -c /usr/share/lsf/conf -s /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc -a X86_64 ./pp_sndrcv_spbl 

-host artemis3 -n 2   --prefix /home_nfs/senglont/ompi_inst/1.3.3/ /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756 -c /usr/share/lsf/conf -s /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc -a X86_64 ./pp_sndrcv_spbl 

[senglont at artemis3 lsf-ompi]$ 
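
The appfile shown above could be read along these lines. This is only a sketch in Python, assuming each app context sits on a single line of the file (the mail may have wrapped them) and that the value after TaskStarter's -p is the key we want; parse_appfile and the dictionary field names are made up for illustration.

```python
import os
import shlex


def parse_appfile(text):
    """Extract host, process count and TaskStarter -p value per line."""
    entries = []
    for line in text.splitlines():
        args = shlex.split(line)
        if not args:
            continue  # skip blank lines between app contexts
        entry = {"host": None, "n": None, "key": None}
        for i in range(len(args) - 1):
            if args[i] == "-host" and entry["host"] is None:
                entry["host"] = args[i + 1]
            elif args[i] == "-n" and entry["n"] is None:
                entry["n"] = int(args[i + 1])
            elif (args[i] == "-p" and i > 0
                  and os.path.basename(args[i - 1]) == "TaskStarter"):
                # Only accept -p when it directly follows TaskStarter.
                entry["key"] = args[i + 1]
        entries.append(entry)
    return entries
```

With the appfile of job 1478 this gives two entries, one per host, each with n = 2 and the same key artemis3:37756, which is how the per-host process counts can be recovered.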

PS from artemis4
[senglont at artemis4 ~]$ psu
  PID  PPID CMD
10478     1 /home_nfs/senglont/ompi_inst/1.3.3/bin/orted --daemonize -mca ess env -mca ort
10479 10478 ./pp_sleep
10480 10478 ./pp_sleep
10488     1 /home_nfs/senglont/ompi_inst/1.3.3/bin/orted --daemonize -mca ess env -mca ort
10489 10488 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
10490 10488 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
10491 10490 ./pp_sndrcv_spbl
10492 10489 ./pp_sndrcv_spbl
10493 18965 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res
11019 11017 sshd: senglont at pts/8
11020 11019 -bash
11054 11020 ps -o pid,ppid,cmd -u senglont
[senglont at artemis4 ~]$ 
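
To show how the -p value ties the processes on a remote host back to the job, here is a rough Python sketch. It assumes the 'ps -o pid,ppid,cmd' output is available as text and treats the children of each matching TaskStarter as the job's processes; all function names are hypothetical and this is not the padb implementation.

```python
def parse_ps(text):
    """Parse 'ps -o pid,ppid,cmd' output into (pid, ppid, cmd) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        # The header line and any non-numeric lines are skipped.
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2]))
    return rows


def is_starter(cmd, key):
    """True if cmd is a TaskStarter invocation carrying '-p <key>'."""
    tokens = cmd.split()
    if not tokens or not tokens[0].endswith("TaskStarter"):
        return False
    return any(t == "-p" and tokens[i + 1] == key
               for i, t in enumerate(tokens[:-1]))


def ranks_for_key(rows, key):
    """Return pids whose parent is a TaskStarter with the given key."""
    starters = {pid for pid, _, cmd in rows if is_starter(cmd, key)}
    return [pid for pid, ppid, _ in rows if ppid in starters]
```

On artemis4 with key artemis3:37756 this picks out the two ./pp_sndrcv_spbl processes and ignores the ./pp_sleep processes of job 1477, which have orted as their parent and no TaskStarter.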

As you said, we can work on optimizing the code so that there is just one
version (after the commit).

Thipadin.
 
