From ashley at pittman.co.uk Wed Sep 1 22:23:40 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 1 Sep 2010 22:23:40 +0100
Subject: [padb-users] padb not finding remote ranks
In-Reply-To: <4C7C0DAC.7080300@opengridcomputing.com>
References: <4C7C0DAC.7080300@opengridcomputing.com>
Message-ID:

On 30 Aug 2010, at 20:59, Steve Wise wrote:

> Hey,
>
> I have an openmpi-1.4.1 8 node 64NP cluster, which is running jobs via orte/mpirun. I start a job that is hanging, and then run padb to get the stack traces, yet padb displays this error and only shows the local process stacks (see output below).
>
> Any ideas?

This looks like a hostname issue. I note that ompi-ps is giving a FQDN for n0 but short names for n[1-7]; padb is then finding every process on n0 but none on any of the other nodes.

It looks like I can probably make padb work in this case by changing the logic around slightly: if the reported hostname is a fully qualified domain name and no hosts are specified for that host then also check the shorter un-qualified hostname. This shouldn't be too hard to add and is far from a "proper" solution, but from the looks of it, it should make padb behave correctly in this case.

I'm on holiday this week, having moved house, and don't have regular internet access or an office right now, but I can take a look at this early next week for you.

Ashley.

--

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From harris.duncan at gmail.com Wed Sep 15 11:47:09 2010
From: harris.duncan at gmail.com (Duncan Harris)
Date: Wed, 15 Sep 2010 11:47:09 +0100
Subject: [padb-users] Problems running padb on large processor counts
Message-ID:

Hi.
One of our users is trying to run padb to find out why his 576 PE job is hanging. However, when he does, he gets this:

[node195]> padb -x -t -a -Ormgr=mpirun
Waiting for signon from 16 hosts.
Waiting for signon from 16 hosts.
Waiting for signon from 16 hosts.
Waiting for signon from 16 hosts.
Waiting for signon from 16 hosts.
einner: node197: Failed to connect to child (node204:53314) at ~/bin/padb line 4414.
einner: pdsh@node195: node197: ssh exited with exit code 111
einner: node197: Failed to connect to child (node210:36682) at ~/bin/padb line 4414.
einner: node209: Use of uninitialized value in numeric eq (==) at ~/bin/padb line 9886.
einner: node208: Use of uninitialized value in numeric eq (==) at ~/bin/padb line 9886.
einner: pdsh@node195: node198: ssh exited with exit code 111
einner: node196: Failed to connect to child (node201:39642) at ~/bin/padb line 4414.
einner: node200: Use of uninitialized value in numeric eq (==) at ~/bin/padb line 9886.
einner: pdsh@node195: node196: ssh exited with exit code 111
einner: node199: Failed to connect to child (node212:54985) at ~/bin/padb line 4414.
einner: pdsh@node195: node199: ssh exited with exit code 111

This is running padb-3.2-beta0 over InfiniBand with OpenMPI 1.4.1 and Red Hat Enterprise Linux v5.3.

Any suggestions as to what the problem could be please?

Cheers.

Duncan

From ashley at pittman.co.uk Wed Sep 15 12:35:12 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 15 Sep 2010 12:35:12 +0100
Subject: [padb-users] Problems running padb on large processor counts
In-Reply-To:
References:
Message-ID:

On 15 Sep 2010, at 11:47, Duncan Harris wrote:

> Hi.
> One of our users is trying to run padb to find out why his 576 PE job
> is hanging. However when he does he gets this:
>
> [node195]> padb -x -t -a -Ormgr=mpirun
>
> Any suggestions as to what the problem could be please?

Can you try setting the environment variable FANOUT to a value of 64, please? When using the mpirun resource manager interface, padb uses pdsh to launch the backend, and by default pdsh is limited to 32 remote hosts concurrently. If he is able to, using the "orte" resource manager interface would also solve this problem, as that does not use pdsh.
Ashley,

--

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From harris.duncan at gmail.com Wed Sep 15 16:14:53 2010
From: harris.duncan at gmail.com (Duncan Harris)
Date: Wed, 15 Sep 2010 16:14:53 +0100
Subject: [padb-users] Problems running padb on large processor counts
In-Reply-To:
References:
Message-ID:

Hi.
Thanks for the prompt reply Ashley, that worked a treat.

Duncan

On Wed, Sep 15, 2010 at 12:35 PM, Ashley Pittman wrote:
>
> On 15 Sep 2010, at 11:47, Duncan Harris wrote:
>
>> Hi.
>> One of our users is trying to run padb to find out why his 576 PE job
>> is hanging. However when he does he gets this:
>>
>> [node195]> padb -x -t -a -Ormgr=mpirun
>>
>> Any suggestions as to what the problem could be please?
>
> Can you try setting the environment variable FANOUT to a value of 64 please, when using the mpirun resource manager interface padb uses pdsh to launch the backend and by default pdsh is limited to 32 remote hosts concurrently. If he is able to then using the "orte" resource manager interface would also solve this problem as that does not use pdsh.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
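
For readers landing on this thread, the fix confirmed above might be applied roughly as follows. This is only a sketch, assuming a POSIX-style shell; the value 64 and the padb command line are taken from the messages above, while the check for whether padb is installed is an addition for illustration and was not part of the original advice.

```shell
# Raise the pdsh fanout above its default of 32 concurrent remote hosts.
# 64 is the value suggested in the thread.
export FANOUT=64

# Re-run the command from the original report. The "command -v" guard is an
# illustrative assumption so the snippet is harmless where padb is absent.
if command -v padb >/dev/null 2>&1; then
    padb -x -t -a -Ormgr=mpirun
fi
```

The thread also mentions the "orte" resource manager interface as an alternative that avoids pdsh altogether; how it is selected is not shown in the messages, so it is left out of the sketch.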