[padb] Re: Bull changes (with LSF -mpich2 wrapper and -openmpi wrapper)
thipadin.seng-long at bull.net
Thu Feb 4 14:38:36 GMT 2010
Hi,
As promised, I'm sending you the output of the 'bjobs' and 'ps' commands on
the respective hosts as an example.
This example is for the openmpi wrapper. It may also help in understanding
my code.
There are 2 jobs: one (1478) started with the wrapper and one (1477)
started without it.
With the wrapper we can determine how many procs are on which host (2 on
artemis3, 2 on artemis4), etc.
Without the wrapper we can only see that the job started on 'artemis3'; we
don't know how many procs are on artemis3 and how many on artemis4
(actually 2 on artemis3 and 2 on artemis4).
So that job should not be taken into account.
To distinguish between the two, I look at the bjobs output to find the
master host (where the job starts); it should be the first entry of
EXEC_HOST, which in this case is 'artemis3' for both jobs. On that host,
the ps command shows mpirun --app 'path_to_app_file' for the wrapped job,
while the other job shows mpirun without the --app parameter.
The appfile contains the TaskStarter command with -p artemis3:37756, a port
number that all subsequent processes on each remote host should share; the
job without the wrapper has no such appfile.
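The per-host accounting described above can be sketched in Python. This is a
hypothetical helper, not padb's actual code; it assumes the standard LSF
convention that bjobs separates EXEC_HOST entries with ':' and writes 'N*host'
for N slots on a host:

```python
def parse_exec_host(field):
    """Parse an LSF EXEC_HOST field, e.g. '2*artemis3:2*artemis4',
    into a {host: nprocs} mapping. 'N*host' means N slots on host;
    a bare host name means a single slot."""
    counts = {}
    for entry in field.split(":"):
        if "*" in entry:
            n, host = entry.split("*", 1)
            counts[host] = counts.get(host, 0) + int(n)
        else:
            counts[entry] = counts.get(entry, 0) + 1
    return counts

# Job 1478 (with wrapper): per-host counts are recoverable.
print(parse_exec_host("2*artemis3:2*artemis4"))  # {'artemis3': 2, 'artemis4': 2}
# Job 1477 (without wrapper): only the master host is visible.
print(parse_exec_host("artemis3"))               # {'artemis3': 1}
```

The first entry in the mapping's insertion order corresponds to the master
host, which is where ps must be run to look for the --app parameter.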
[senglont at artemis3 lsf-ompi]$ bjobs
JOBID  USER     STAT  QUEUE   FROM_HOST  EXEC_HOST   JOB_NAME   SUBMIT_TIME
1477   senglon  RUN   normal  artemis3   artemis3    PP_SLNOWR  Feb  4 14:13
1478   senglon  RUN   normal  artemis3   2*artemis3  PP_SNDRCV  Feb  4 14:21
                                         2*artemis4
[senglont at artemis3 ]$
Bjobs of 1478:
bjobs -l 1478

Job <1478>, Job Name <PP_SNDRCV>, User <senglont>, Project <default>,
            Status <RUN>, Queue <normal>, Command <#! /bin/bash;# with mpirun
            wrapper;# test with -R span, launching itself twice;
            # OK, this script is good for launching 2 jobs;# with 2 procs
            each on artemis3 and 2 procs on artemis4;#BSUB -J "PP_SNDRCV";
            #BSUB -m "artemis3 artemis4";#BSUB -o PP_SNDRCV.%J;#BSUB -n 4;
            #BSUB -e PP_SNDRCVerr.%J;#BSUB -a openmpi;#BSUB -R
            "span[ptile=2]";source ~/.bashrc_lompi;mpirun.lsf --prefix
            /home_nfs/senglont/ompi_inst/1.3.3/ ./pp_sndrcv_spbl>
Thu Feb  4 14:21:02: Submitted from host <artemis3>, CWD
            <$HOME/mympi/lsf-ompi>, Output File <PP_SNDRCV.%J>, Error File
            <PP_SNDRCVerr.%J>, 4 Processors Requested, Requested Resources
            <span[ptile=2]>, Specified Hosts <artemis3>, <artemis4>;
Thu Feb  4 14:21:04: Started on 4 Hosts/Processors <2*artemis3> <2*artemis4>,
            Execution Home </home_nfs/senglont>, Execution CWD
            </home_nfs/senglont/mympi/lsf-ompi>;
Thu Feb  4 15:19:57: Resource usage collected.
            The CPU time used is 3526 seconds.
            MEM: 14 Mbytes; SWAP: 611 Mbytes; NTHREAD: 14
            PGID: 13623; PIDs: 13631 13635 13637 13638 13623 13624 13628 13629
            PGID: 13639; PIDs: 13639
            PGID: 13640; PIDs: 13640
            PGID: 10491; PIDs: 10491
            PGID: 10492; PIDs: 10492
SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
 loadSched   -    -    -    -   -   -   -   -    -    -    -
 loadStop    -    -    -    -   -   -   -   -    -    -    -
[senglont at artemis3 lsf-ompi]$
PS from artemis3:
[senglont at artemis3 ]$ psu
  PID  PPID CMD
10222 10220 sshd: senglont at pts/5
10223 10222 -bash
13586 27520 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res -d /usr/share/lsf/conf -m
13587 13586 /bin/sh /home_nfs/senglont/.lsbatch/1265289212.1477
13591 13587 /bin/bash /home_nfs/senglont/.lsbatch/1265289212.1477.shell
13592 13591 mpirun --prefix /home_nfs/senglont/ompi_inst/1.3.3 -H artemis3,artemis4 -n 4 .
13594 13592 ./pp_sleep
13595 13592 ./pp_sleep
13623 27520 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res -d /usr/share/lsf/conf -m
13624 13623 /bin/sh /home_nfs/senglont/.lsbatch/1265289662.1478
13628 13624 /bin/bash /home_nfs/senglont/.lsbatch/1265289662.1478.shell
13629 13628 pam -g /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper --prefi
13631 13629 /bin/sh /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/openmpi_wrapper --pref
13635 13631 mpirun --app /home_nfs/senglont/.openmpi_appfile_1478
13637 13635 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
13638 13635 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
13639 13637 ./pp_sndrcv_spbl
13640 13638 ./pp_sndrcv_spbl
13645 27420 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res
13699 10223 ps -o pid,ppid,cmd -u senglont
[senglont at artemis3 lsf-ompi]$
[senglont at artemis3 lsf-ompi]$ cat /home_nfs/senglont/.openmpi_appfile_1478
-host artemis4 -n 2 --prefix /home_nfs/senglont/ompi_inst/1.3.3/ /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756 -c /usr/share/lsf/conf -s /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc -a X86_64 ./pp_sndrcv_spbl
-host artemis3 -n 2 --prefix /home_nfs/senglont/ompi_inst/1.3.3/ /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756 -c /usr/share/lsf/conf -s /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc -a X86_64 ./pp_sndrcv_spbl
[senglont at artemis3 lsf-ompi]$
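Each appfile entry above carries everything needed to account for the job's
processes: the target host, the slot count, and the TaskStarter port shared
by all processes of the job. A rough Python sketch of extracting them (a
hypothetical helper, not padb's code, matching the field layout shown above):

```python
import re

def parse_appfile_entry(entry):
    """Extract (host, nprocs, port) from one line of the wrapper's
    Open MPI appfile, e.g.
    '-host artemis4 -n 2 --prefix ... TaskStarter -p artemis3:37756 ...'."""
    host = re.search(r"-host\s+(\S+)", entry).group(1)
    nprocs = int(re.search(r"-n\s+(\d+)", entry).group(1))
    port = int(re.search(r"TaskStarter\s+-p\s+[^\s:]+:(\d+)", entry).group(1))
    return host, nprocs, port

entry = ("-host artemis4 -n 2 --prefix /home_nfs/senglont/ompi_inst/1.3.3/ "
         "/usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter "
         "-p artemis3:37756 -c /usr/share/lsf/conf ./pp_sndrcv_spbl")
print(parse_appfile_entry(entry))  # ('artemis4', 2, 37756)
```

The port number ties the entries together: every TaskStarter in the appfile,
and hence every process it launches on the remote hosts, reports back to
artemis3:37756.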
PS from artemis4:
[senglont at artemis4 ~]$ psu
  PID  PPID CMD
10478     1 /home_nfs/senglont/ompi_inst/1.3.3/bin/orted --daemonize -mca ess env -mca ort
10479 10478 ./pp_sleep
10480 10478 ./pp_sleep
10488     1 /home_nfs/senglont/ompi_inst/1.3.3/bin/orted --daemonize -mca ess env -mca ort
10489 10488 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
10490 10488 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/bin/TaskStarter -p artemis3:37756
10491 10490 ./pp_sndrcv_spbl
10492 10489 ./pp_sndrcv_spbl
10493 18965 /usr/share/lsf/7.0/linux2.6-glibc2.3-x86_64/etc/res
11019 11017 sshd: senglont at pts/8
11020 11019 -bash
11054 11020 ps -o pid,ppid,cmd -u senglont
[senglont at artemis4 ~]$
As you said, we can work on optimizing the code down to a single wrapper
(after the commit).
Thipadin.
More information about the padb-devel mailing list