From d.love at liverpool.ac.uk Thu Jul 8 16:38:33 2010
From: d.love at liverpool.ac.uk (Dave Love)
Date: Thu, 08 Jul 2010 16:38:33 +0100
Subject: [padb-users] running with SGE/OMPI
Message-ID: <87mxu23rwm.fsf@liv.ac.uk>

I'd like to use padb with OpenMPI jobs under Gridengine. Currently it
sees no jobs. What do I need to make this work? Is a working ompi-ps
what's required? That currently isn't working, and I'm not sure how
it's supposed to, but might be able to fix it.

From ashley at pittman.co.uk Thu Jul 8 22:10:27 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Thu, 8 Jul 2010 22:10:27 +0100
Subject: [padb-users] running with SGE/OMPI
In-Reply-To: <87mxu23rwm.fsf@liv.ac.uk>
References: <87mxu23rwm.fsf@liv.ac.uk>
Message-ID: <474A91EC-F0AF-48F2-BCD1-880140329B7F@pittman.co.uk>

On 8 Jul 2010, at 16:38, Dave Love wrote:
> I'd like to use padb with OpenMPI jobs under Gridengine.

In padb parlance Gridengine would be the scheduler, which padb is not
interested in, and given that you mention ompi-ps, orte is presumably
the resource manager. Of these two, padb only interfaces with the
resource manager. orte is fully supported as a resource manager and has
been since this project went public. The resource manager should be
detected automatically based on which binaries padb can find on your
PATH; you can also set it manually if you need to.

> Currently it
> sees no jobs. What do I need to make this work? Is a working ompi-ps
> what's required? That currently isn't working, and I'm not sure how
> it's supposed to, but might be able to fix it.

You have two choices here: you can either use "orte" as the resource
manager, in which case a working ompi-ps is required, or you can use
"mpirun" as the resource manager, in which case padb will attach to the
orterun (or mpirun) process with gdb and read the data it needs directly.
In both of these cases the data is only available on the node where the
orterun process is running, so that is where you'll need to run padb.
Given that you are using Gridengine, finding that node could be a
non-trivial problem, but it depends on your setup.

orte would be the preferred option. You'll need to ensure that the
environment you run padb under is the same as the one used for the
parallel job (PATH, LD_LIBRARY_PATH) to avoid mismatches between
different versions of the orte tools. Be aware that using an incorrect
ompi-ps version can cause running jobs to fail, so tread carefully.

Ashley,

--

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From ashley at pittman.co.uk Thu Jul 8 22:50:30 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Thu, 8 Jul 2010 22:50:30 +0100
Subject: [padb-users] running with SGE/OMPI
In-Reply-To: <87mxu23rwm.fsf@liv.ac.uk>
References: <87mxu23rwm.fsf@liv.ac.uk>
Message-ID: <04DE6F37-3531-4A22-AAFB-1CF5C6F33F98@pittman.co.uk>

On 8 Jul 2010, at 16:38, Dave Love wrote:
> I'd like to use padb with OpenMPI jobs under Gridengine.

In padb parlance Gridengine would be the scheduler, which padb is not
interested in, and given that you mention ompi-ps, the resource manager
is presumably orte. Of these two, padb only interfaces with the
resource manager. orte is fully supported as a resource manager and has
been since this project went public. The resource manager should be
detected automatically based on which binaries padb can find on your
PATH; you can also set it manually if you need to.

> Currently it
> sees no jobs. What do I need to make this work? Is a working ompi-ps
> what's required? That currently isn't working, and I'm not sure how
> it's supposed to, but might be able to fix it.
You have two choices here: you can either use "orte" as the resource
manager, in which case a working ompi-ps is required, or you can use
"mpirun" as the resource manager, in which case padb will attach to the
orterun (or mpirun) process with gdb and read the data it needs
directly. In both of these cases the data is only available, and hence
you'll need to run padb, on the node where the orterun process is
running. Given that you are using Gridengine, finding this node could
be a non-trivial problem, but it depends on your setup.

orte would be the preferred resource manager to use. You'll need to
ensure that the environment you run padb under is the same as the one
used for the parallel job (PATH, LD_LIBRARY_PATH) to avoid mismatches
between different versions of the orte tools. Be aware that using an
incorrect ompi-ps version can cause orted to crash and running jobs to
fail, so tread carefully.

Ashley.

--

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From d.love at liverpool.ac.uk Fri Jul 9 15:32:18 2010
From: d.love at liverpool.ac.uk (Dave Love)
Date: Fri, 09 Jul 2010 15:32:18 +0100
Subject: [padb-users] running with SGE/OMPI
In-Reply-To: <04DE6F37-3531-4A22-AAFB-1CF5C6F33F98@pittman.co.uk> (Ashley Pittman's message of "Thu, 8 Jul 2010 22:50:30 +0100")
References: <87mxu23rwm.fsf@liv.ac.uk> <04DE6F37-3531-4A22-AAFB-1CF5C6F33F98@pittman.co.uk>
Message-ID: <87sk3s3evh.fsf@liv.ac.uk>

Ashley Pittman writes:

> On 8 Jul 2010, at 16:38, Dave Love wrote:
>
>> I'd like to use padb with OpenMPI jobs under Gridengine.
>
> In padb-parlance Gridengine would be the scheduler which padb is
> un-interested in and given you mention ompi-ps presumably the resource
> manager is orte. Padb only interfaces with the resource manager of
> these two.
I assumed Gridengine is relevant (a) in referring to `jobs', and (b) in
that I think the OpenMPI tight integration is relevant, at least because
it seems ompi-ps appears to be looking in the wrong place for files.

> You have two choices here, you can either use "orte" as the resource
> manager in which case a working ompi-ps is required or you can use
> "mpirun" as the resource manager in which case it'll attach to the
> orterun (or mpirun) process with gdb and read the data it needs
> directly. In both of these cases the data is only available, and
> hence you'll need to run padb, on the node where the orterun process
> is running, given you are using Gridengine finding this node could be
> a non-trivial problem but it depends on your setup.

That's easy, but neither mpirun nor orte work. With mpirun I get

  Error, resource manager "mpirun" not supported

and orte doesn't find any jobs because ompi-ps doesn't. I'll try to
figure out what's going on when I get some time.

> be aware that using an incorrect ompi-ps version can cause orted to
> crash and running jobs to fail so tread carefully.

Thanks for the warning and the rest.

From ashley at pittman.co.uk Mon Jul 12 14:01:14 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Mon, 12 Jul 2010 14:01:14 +0100
Subject: [padb-users] running with SGE/OMPI
In-Reply-To: <87sk3s3evh.fsf@liv.ac.uk>
References: <87mxu23rwm.fsf@liv.ac.uk> <04DE6F37-3531-4A22-AAFB-1CF5C6F33F98@pittman.co.uk> <87sk3s3evh.fsf@liv.ac.uk>
Message-ID: <22542719-B510-4862-B605-FAA84DF89DFA@pittman.co.uk>

On 9 Jul 2010, at 15:32, Dave Love wrote:

> Ashley Pittman writes:
>
> I assumed Gridengine is relevant (a) in referring to `jobs', and (b) in
> that I think the OpenMPI tight integration is relevant, at least because
> it seems ompi-ps appears to be looking in the wrong place for files.
You are right. padb will use the "jobid" that orte allocated to the job
rather than the id that Gridengine gave it, but the tight integration
might have changed the orte behaviour. I see this with mpd (MPICH2) and
PBS as well, where PBS sets an environment variable which causes mpd to
store its temporary files under a different filename. Unfortunately
this is very hard to get around.

> That's easy, but neither mpirun nor orte work. With mpirun I get
>
> Error, resource manager "mpirun" not supported

You need to use the 3.2 beta release for this; I keep forgetting it's
not in 3.0. When using this method of attaching to jobs you have to run
padb on the host where the "mpirun" process is running, and the jobid
will be the pid of that process. Padb uses pdsh to launch itself on the
nodes, so you'll need to have this installed if you haven't already.

> and orte doesn't find any jobs because ompi-ps doesn't. I'll try to
> figure out what's going on when I get some time.

Unfortunately, without a working ompi-ps padb has no way of collecting
the information it needs, so the orte resource manager won't work for
you in this case. You could ask on the ompi-users list to see if there
is anything they recommend. As above, we managed to get this working on
MPICH2 recently by asking users to unset PBS_JOBID in their job script.

Ashley,

--

Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From daniel.kidger at googlemail.com Mon Jul 12 17:58:27 2010
From: daniel.kidger at googlemail.com (Daniel Kidger)
Date: Mon, 12 Jul 2010 17:58:27 +0100
Subject: [padb-users] running with SGE/OMPI
In-Reply-To: <22542719-B510-4862-B605-FAA84DF89DFA@pittman.co.uk>
References: <87mxu23rwm.fsf@liv.ac.uk> <04DE6F37-3531-4A22-AAFB-1CF5C6F33F98@pittman.co.uk> <87sk3s3evh.fsf@liv.ac.uk> <22542719-B510-4862-B605-FAA84DF89DFA@pittman.co.uk>
Message-ID:

> You are right, padb will use the "jobid" that orte had allocated the job rather than the id that
> Gridengine has given it but the tight integration might have changed the orte behaviour. I see
> this with mpd (Mpich2) and PBS as well where PBS sets an environment variable which causes
> mpd to store its temporary files under a different filename. Unfortunately this is very hard to get
> around.

In particular, I found this to come from these lines in mpirun (from
Intel MPI 4.0.0):

---------------
if [ -n "$PBS_ENVIRONMENT" ] ; then
    export MPD_CON_EXT="${PBS_JOBID}_$$"   # PBS Pro and Torque
(lines deleted)
elif [ -n "$MP_JOBID" ] ; then
    export MPD_CON_EXT="${MP_JOBID}_$$"    # SGE
---------------

The environment variable MPD_CON_EXT is used by mpdboot to add an
extension to both the socket /tmp/mpd2.console_ and the logfile
/tmp/mpd2.logfile_

For padb I use my own wrapper that adds the (known) PBS_JOBID to
MPD_CON_EXT (the process id, though, needs to be found by inspection).
padb appears to call mpdlistjobs, which itself honours MPD_CON_EXT.

Hope this helps,

Daniel

On 12 July 2010 14:01, Ashley Pittman wrote:

>
> On 9 Jul 2010, at 15:32, Dave Love wrote:
>
> > Ashley Pittman writes:
> >
> > I assumed Gridengine is relevant (a) in referring to `jobs', and (b) in
> > that I think the OpenMPI tight integration is relevant, at least because
> > it seems ompi-ps appears to be looking in the wrong place for files.
>
> You are right, padb will use the "jobid" that orte had allocated the job
> rather than the id that Gridengine has given it but the tight integration
> might have changed the orte behaviour. I see this with mpd (Mpich2) and PBS
> as well where PBS sets an environment variable which causes mpd to store
> its temporary files under a different filename. Unfortunately this is very
> hard to get around.
>
> > That's easy, but neither mpirun nor orte work. With mpirun I get
> >
> > Error, resource manager "mpirun" not supported
>
> You need to use the 3.2 beta release for this, I keep forgetting it's not
> in 3.0. When using this method of attaching to jobs you have to run padb on
> the host where the "mpirun" process is running and the jobid will be the pid
> of that process. Padb uses pdsh to launch itself on the nodes so you'll need
> to have this installed if you haven't already.
>
> > and orte doesn't find any jobs because ompi-ps doesn't. I'll try to
> > figure out what's going on when I get some time.
>
> Unfortunately without a working ompi-ps padb has no way of collecting the
> information it needs so the orte resource manager won't work for you in this
> case, you could ask on the ompi-users list to see if there is anything they
> recommend, as above we managed to get this working on MPICH2 recently by
> asking users to unset PBS_JOBID in their job script.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
>
> _______________________________________________
> padb-users mailing list
> padb-users at pittman.org.uk
> http://pittman.org.uk/mailman/listinfo/padb-users_pittman.org.uk
>
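
Editorial sketch of the wrapper idea Daniel describes in his message:
rebuild the MPD_CON_EXT value in the "<jobid>_<pid>" form that the
quoted mpirun lines export, then run a command with it set. This is
only an illustration under stated assumptions, not Daniel's actual
wrapper. The function names and argument order are hypothetical; only
MPD_CON_EXT, PBS_JOBID and mpdlistjobs come from the thread, and the
pid still has to be found by inspection.

```shell
#!/bin/sh
# Hypothetical helper functions (names invented for this sketch).

# Print the extension in the "<jobid>_<pid>" form used by the quoted
# mpirun lines: export MPD_CON_EXT="${PBS_JOBID}_$$"
mpd_con_ext() {
    jobid=$1    # batch job id, e.g. the value of PBS_JOBID
    pid=$2      # pid of the mpirun shell, found by inspection
    printf '%s_%s\n' "$jobid" "$pid"
}

# Run the remaining arguments as a command with MPD_CON_EXT set for
# that job; env(1) limits the setting to that one command.
with_mpd_con_ext() {
    jobid=$1
    pid=$2
    shift 2
    env MPD_CON_EXT="$(mpd_con_ext "$jobid" "$pid")" "$@"
}
```

For example, `with_mpd_con_ext "$PBS_JOBID" 12345 mpdlistjobs` would run
mpdlistjobs with the extension for a job whose mpirun shell had pid
12345 (a made-up value). Since padb appears to call mpdlistjobs, padb
could be started through the same function.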