From ljdursi at scinet.utoronto.ca Thu Aug 5 21:15:36 2010
From: ljdursi at scinet.utoronto.ca (Jonathan Dursi)
Date: Thu, 05 Aug 2010 16:15:36 -0400
Subject: [padb-users] Illegal hex digit ' '
Message-ID: <4C5B1BE8.7010105@scinet.utoronto.ca>

So I'm trying to use padb with orte-launched jobs, and gdb 7.1 or 6.8.

Things seem to work ok, but padb reports the following messages constantly, which makes --watch more or less unusable:

einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 169.
einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 195.
einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 247.
einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 169.
einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 195.
einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 247.
einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 169.
einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 195.
einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 247.
einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 169.
einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 195.
einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 247.

Am I doing something wrong?

- Jonathan

--
Jonathan Dursi

From ashley at pittman.co.uk Thu Aug 5 23:14:52 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Thu, 5 Aug 2010 15:14:52 -0700
Subject: [padb-users] Illegal hex digit ' '
In-Reply-To: <4C5B1BE8.7010105@scinet.utoronto.ca>
References: <4C5B1BE8.7010105@scinet.utoronto.ca>
Message-ID: <7DE1FD3D-42A4-4D68-A5F1-373858D1535D@pittman.co.uk>

On 5 Aug 2010, at 13:15, Jonathan Dursi wrote:
> So I'm trying to use padb with orte-launched jobs, and gdb 7.1 or 6.8.
>
> Things seem to work ok, but padb reports the following messages constantly, which makes --watch more or less unusable:
> einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 169.
> einner: Illegal hexadecimal digit ' ' ignored at /scinet/gpc/tools/padb/padb-3.0/bin/padb line 5036, line 195.

These errors come from the code that interacts with gdb; padb runs with Perl warnings enabled, so it does sometimes report this kind of error. In particular, different versions of gdb produce different output, and I spent a lot of effort last year trying to catch and correct these cases.

I suspect the errors would go away if you tried the 3.2 beta release. If they don't, or you don't want to do that, then the fix is to change the first line of padb and remove the -w option from the perl command line. In fact I've just checked: I've removed the -w flag from 3.2 anyway, to avoid just this situation.

> Am I doing something wrong?

No, I don't think so.
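For reference, the -w change is just an edit to the interpreter line of the padb script. A sketch only, using the 3.0 install path from the messages above and assuming the shebang ends in " -w" (check yours first, it may differ):

$ head -1 /scinet/gpc/tools/padb/padb-3.0/bin/padb
#!/usr/bin/perl -w
# If it looks like the above, dropping the trailing " -w" silences these warnings:
$ sed -i '1s/ -w$//' /scinet/gpc/tools/padb/padb-3.0/bin/padb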
You should be aware that using --watch to monitor jobs in a mode which interrupts the job with gdb will have a significant performance impact on the job. Unfortunately this cannot be helped, so I wouldn't recommend doing it routinely unless you are looking for a specific problem, or use a long interval period.

Ashley.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From ljdursi at scinet.utoronto.ca Thu Aug 5 23:52:52 2010
From: ljdursi at scinet.utoronto.ca (Jonathan Dursi)
Date: Thu, 05 Aug 2010 18:52:52 -0400
Subject: [padb-users] Illegal hex digit ' '
In-Reply-To: <7DE1FD3D-42A4-4D68-A5F1-373858D1535D@pittman.co.uk>
References: <4C5B1BE8.7010105@scinet.utoronto.ca> <7DE1FD3D-42A4-4D68-A5F1-373858D1535D@pittman.co.uk>
Message-ID: <4C5B40C4.6020306@scinet.utoronto.ca>

Ashley:

Thanks for the quick response. I hadn't seen the 3.2 beta was available; I'll give that a shot.

- Jonathan

--
Jonathan Dursi

From rpnabar at gmail.com Wed Aug 18 18:11:03 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed, 18 Aug 2010 12:11:03 -0500
Subject: [padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load
Message-ID:

Any ideas what the following error message is indicative of? Did I miss a step in the padb installation?

Background: I'm trying to debug a 32 node (256 core) Open-MPI job using padb.

/opt/sbin/bin/padb --show-jobs --config-option rmgr=orte
/opt/sbin/bin/padb --full-report=8801 --config-option rmgr=orte | tee padb.log.new
cat tee padb.log.new

############################
[snip]
Warning: errors reported by some ranks
========
[0-255]: Error message from /opt/sbin/libexec/minfo: No DLL to load
========
Warning: errors reported by some ranks
========
[0-255]: Error message from /opt/sbin/libexec/minfo: No DLL to load
========
[snip]
#############################

My installation details:

cd /opt/src/padb-3.2-beta0
./configure --prefix=/opt/sbin
make
make install

[/opt is an NFS-mounted dir]

--
Rahul

From ashley at pittman.co.uk Wed Aug 18 19:15:24 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 18 Aug 2010 20:15:24 +0200
Subject: [padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load
In-Reply-To:
References:
Message-ID:

On 18 Aug 2010, at 19:11, Rahul Nabar wrote:
> Any ideas what the following error message is indicative of? Did I
> miss a step in the padb installation?
>
> Background: I'm trying to debug a 32 node (256 core) Open-MPI job using padb.
>
> Warning: errors reported by some ranks
> ========
> [0-255]: Error message from /opt/sbin/libexec/minfo: No DLL to load
> ========
> Warning: errors reported by some ranks
> ========
> [0-255]: Error message from /opt/sbin/libexec/minfo: No DLL to load
> ========

This error means that padb is unable to find the name of the debugger DLL which is supposed to be provided by the MPI library.

To give a bit of background here, the way message queues are implemented is that the MPI library provides a library which the debugger loads into its own memory space and which is used to extract the message queues from the parallel MPI process. This DLL is built and installed with the MPI library but loaded by, and into, the debugger. To allow the debugger to find the library, its install location is built into the MPI process in a text variable. A short aside here: because it's a text location it's defined at build time, so it's impossible for an MPI library to be relocated after build, which is unfortunate.
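One way to see this text variable directly is to attach gdb to a single rank and print it. A sketch only - the PID is a placeholder for one rank of the job, and the path shown is just an example of what a correctly built library might report; roughly speaking, an empty or missing value is what the "No DLL to load" message reflects:

# Attach to one MPI rank and print the variable that holds the DLL location;
# both the PID and the output below are illustrative only.
$ gdb -batch -p 17832 -ex 'print MPIR_dll_name'
$1 = "/opt/ompi_new/lib/openmpi/libompi_dbg_msgq.so"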
The error you are getting means that the MPI library isn't exporting this text string for the filesystem location of the library, which could either be because you aren't really looking at an MPI process or because Open-MPI wasn't built with debugger support.

With Open-MPI the debugger library is called $OPAL_PREFIX/lib/libompi_dbg_msgq.so IIRC, so you could check if this file exists; if it doesn't then you need to check with Open-MPI what steps are needed to ensure this is built. I thought it was built automatically but this is not the case with all MPIs, and it doesn't help matters that in some cases if the build of this DLL fails then the build of MPI could still succeed - I fixed this in around the 1.4 timeframe.

Alternatively, as I say, it could be that padb isn't finding the correct processes. Does the rest of the output look correct for what you are expecting, and are you using some kind of wrapper script between mpirun and your executable? padb should detect this case and act correctly but it is another possible cause.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From rpnabar at gmail.com Wed Aug 18 19:50:47 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed, 18 Aug 2010 13:50:47 -0500
Subject: [padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load
In-Reply-To:
References:
Message-ID:

On Wed, Aug 18, 2010 at 1:15 PM, Ashley Pittman wrote:
>
> This error means that padb is unable to find the name of the debugger DLL which is supposed to be provided by the MPI library.

Thanks Ashley! That helps!

> The error you are getting means that the MPI library isn't exporting this text string for the filesystem location of the library which could either be because you aren't really looking at an MPI process or because Open-MPI wasn't built with debugger support.

It could be either. The config.log from the OpenMPI build shows:

$ ./configure --prefix=/opt/ompi_new --with-tm=/opt/torque FC=ifort CC=icc F77=ifort CXX=icpc CFLAGS=-g -O3 -mp FFLAGS=-mp -recursive -O3 CXXFLAGS=-g CPPFLAGS=-DPgiFortran --disable-shared --enable-static --with-memory-manager --disable-dlopen --enable-openib-rdmacm --with-openib=/usr

Some other relevant parts:

configure:4764: checking whether to debug memory usage
configure:4776: result: no
configure:4796: checking whether to profile memory usage
configure:4808: result: no
configure:4828: checking if want developer-level compiler pickyness
configure:4840: result: no
configure:4855: checking if want developer-level debugging code
configure:4867: result: no
configure:5381: checking if want trace file debugging
configure:5393: result: no

To my naive eyes this doesn't mean much but maybe you have a clue? If not I'll post on the OpenMPI list (or read their make instructions) to see how the debugger support is built in.

> With Open-MPI the debugger library is called $OPAL_PREFIX/lib/libompi_dbg_msgq.so IIRC so you could check if this file exists, if it doesn't then you need to check with Open-MPI what steps are needed to ensure this is built. I thought it was built automatically but this is not the case with all MPIs and it doesn't help matters that in some cases if the build of this DLL fails then the build of MPI could still succeed - I fixed this in around the 1.4 timeframe.
>

I can't find that specific file in my MPI install.
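A quick way to double-check is to search the whole install tree rather than just $OPAL_PREFIX/lib, since the DLL may sit in a subdirectory. A sketch, assuming the /opt/ompi_new prefix from the configure line above:

# Look anywhere under the Open MPI install prefix for the message-queue DLL.
$ find /opt/ompi_new -name 'libompi_dbg_msgq*'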
> Alternatively as I say it could be that padb isn't finding the correct processes, does the rest of the output look correct for what you are expecting and are you using some kind of wrapper script between mpirun and your executable? ?padb should detect this case and act correctly but it is another possible cause. This is the first time I'm using padb (or a stack debugger for that matter!) :) So, not sure what is the "correct" or "typical" output. I've pasted a snippet at the very bottom of this message, just in case there are any clues. I found the process number like so: /opt/sbin/bin/padb --show-jobs --config-option rmgr=orte 25883 /opt/sbin/bin/padb --full-report=25883 --config-option rmgr=orte | tee padb.log.new.new What is suspicious though is that this number does not show up in the ps output. Does that imply padb is mis-discovering the process? ps aux | grep mpi rpnabar 17800 0.0 0.0 20660 2264 pts/0 S+ 13:46 0:00 mpirun -np 256 --host eu001,eu002,eu003,eu004,eu005,eu006,eu007,eu008,eu009,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu020,eu021,eu022,eu023,eu024,eu025,eu026,eu027,eu028,eu029,eu030,eu031,eu032 -mca btl openib,sm,self /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast rpnabar 17832 95.3 25.4 4330112 4181284 pts/0 RLl+ 13:46 0:45 /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast rpnabar 17833 95.5 1.0 270100 169748 pts/0 RLl+ 13:46 0:45 /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast rpnabar 17834 95.6 1.0 335608 169784 pts/0 RLl+ 13:46 0:45 /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast rpnabar 17835 95.3 1.0 335608 169788 pts/0 RLl+ 13:46 0:45 /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast rpnabar 17836 95.6 1.0 335288 169812 pts/0 RLl+ 13:46 0:45 /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast rpnabar 17837 95.6 0.5 191204 90996 pts/0 RLl+ 13:46 0:45 /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast rpnabar 17838 95.6 0.5 256840 91000 pts/0 RLl+ 13:46 0:45 /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast rpnabar 17839 95.6 0.5 256740 90984 pts/0 RLl+ 13:46 0:45 /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast rpnabar 17889 0.0 0.0 61140 756 pts/10 S+ 13:47 0:00 grep mpi ############ Warning: errors reported by some ranks ======== [0-255]: Error message from /opt/sbin/libexec/minfo: No DLL to load ======== Warning: errors reported by some ranks ======== [0-255]: Error message from /opt/sbin/libexec/minfo: No DLL to load ======== Total: 0 communicators, no communication data recorded. Stack trace(s) for thread: 1 ----------------- [0-23,25-153,155-165,167-204,206,208-252,254-255] (250 processes) ----------------- main() at ?:? IMB_init_buffers_iter() at ?:? IMB_bcast() at ?:? 
PMPI_Bcast() at pbcast.c:107 params void * buffer: '0x2 (Invalid pointer)' [4,17,20,93,116,153,168,183,187-188,201-202,206,218,232] 'null pointer' [0,3,5,7-9,11-12,14,16,18,21,23,25,29,31-32,34-35,37-38,40-44,47-48,51,55-56,59-60,70,72,74-77,79-8 1,83,86,89-92,95-96,98-102,104-105,108-111,114-115,119,124,126,131,134-137,140,144-145,147-150,160-161,163,169-172,174,179-180,182,1 84-186,190-191,194-196,198-200,203,208-211,213-216,219,221,224,226,228-229,231,234,236,238-241,243-245,247-248,250-251,254-255] 'valid pointer perm=rwxp' [1-2,6,10,13,15,19,22,26-28,30,33,36,39,45-46,49-50,52-54,57-58,61-69,71,73,78,82,84-85, 87-88,94,97,103,106-107,112-113,117-118,120-123,125,127-130,132-133,138-139,141-143,146,151-152,155-159,162,164-165,167,173,175-178, 181,189,192-193,197,204,212,217,220,222-223,225,227,230,233,235,237,242,246,249,252] int count: more than 3 distinct values MPI_Datatype datatype: more than 3 distinct values int root: more than 3 distinct values MPI_Comm comm: '0x0' [2,4,6,10,13,15,17,19-20,22,30,33,46,49-50,53-54,62-64,66-67,69,73,78,82,84-85,87-88,93-94,97,112-113,116,11 8,120-123,125,128-130,133,138,143,146,151-153,156-159,164-165,167-168,173,175-176,181,183,187-188,192-193,197,201-202,204,206,217-21 8,220,222-223,225,230,232-233,235,237,242] '0x1' [0-1,3,5,7-9,11-12,14,16,18,21,23,25-29,31-32,34-45,47-48,51-52,55-61,65,68,70-72,74-77,79-81,83,86,89-92,95 -96,98-111,114-115,117,119,124,126-127,131-132,134-137,139-142,144-145,147-150,155,160-163,169-172,174,177-180,182,184-186,189-191,1 94-196,198-200,203,208-216,219,221,224,226-229,231,234,236,238-241,243-252,254-255] locals int err = '1048576' [0-23,25-153,155-165,167-204,206,208-252,254-255] ----------------- [0-23,25-153,155-165,167-204,206,208-252,254-255] (250 processes) ----------------- mca_coll_sync_bcast() at coll_sync_bcast.c:44 params void * buff: '0x2 (Invalid pointer)' [4,17,20,93,116,153,168,183,187-188,201-202,206,218,232] 'null pointer' [0,3,5,7-9,11-12,14,16,18,21,23,25,29,31-32,34-35,37-38,40-44,47-48,51,55-56,59-60,70,72,74-77,79 -81,83,86,89-92,95-96,98-102,104-105,108-111,114-115,119,124,126,131,134-137,140,144-145,147-150,160-161,163,169-172,174,179-180,182 ,184-186,190-191,194-196,198-200,203,208-211,213-216,219,221,224,226,228-229,231,234,236,238-241,243-245,247-248,250-251,254-255] 'valid pointer perm=rwxp' [1-2,6,10,13,15,19,22,26-28,30,33,36,39,45-46,49-50,52-54,57-58,61-69,71,73,78,82,84-8 5,87-88,94,97,103,106-107,112-113,117-118,120-123,125,127-130,132-133,138-139,141-143,146,151-152,155-159,162,164-165,167,173,175-17 8,181,189,192-193,197,204,212,217,220,222-223,225,227,230,233,235,237,242,246,249,252] int count: more than 3 distinct values struct ompi_datatype_t * datatype: more than 3 distinct values int root: more than 3 distinct values struct ompi_communicator_t * comm: '0x1 (Invalid pointer)' [0-1,3,5,7-9,11-12,14,16,18,21,23,25-29,31-32,34-45,47-48,51-52,55-61,65,68,70-72,74-77, 79-81,83,86,89-92,95-96,98-111,114-115,117,119,124,126-127,131-132,134-137,139-142,144-145,147-150,155,160-163,169-172,174,177-180,1 82,184-186,189-191,194-196,198-200,203,208-216,219,221,224,226-229,231,234,236,238-241,243-252,254-255] 'null pointer' [2,4,6,10,13,15,17,19-20,22,30,33,46,49-50,53-54,62-64,66-67,69,73,78,82,84-85,87-88,93-94,97,112 -113,116,118,120-123,125,128-130,133,138,143,146,151-153,156-159,164-165,167-168,173,175-176,181,183,187-188,192-193,197,201-202,204 ,206,217-218,220,222-223,225,230,232-233,235,237,242] mca_coll_base_module_t * module = 'null pointer' 
[0-23,25-153,155-165,167-204,206,208-252,254-255] ----------------- [0-23,25-153,155-165,167-204,206,208-252,254-255] (250 processes) ----------------- ompi_coll_tuned_bcast_intra_dec_fixed() at coll_tuned_decision_fixed.c:301 params void * buff: '0x2 (Invalid pointer)' [4,17,20,93,116,153,168,183,187-188,201-202,206,218,232] 'null pointer' [0,3,5,7-9,11-12,14,16,18,21,23,25,29,31-32,34-35,37-38,40-44,47-48,51,55-56,59-60,70,72,74-77, 79-81,83,86,89-92,95-96,98-102,104-105,108-111,114-115,119,124,126,131,134-137,140,144-145,147-150,160-161,163,169-172,174,179-180,1 82,184-186,190-191,194-196,198-200,203,208-211,213-216,219,221,224,226,228-229,231,234,236,238-241,243-245,247-248,250-251,254-255] 'valid pointer perm=rwxp' [1-2,6,10,13,15,19,22,26-28,30,33,36,39,45-46,49-50,52-54,57-58,61-69,71,73,78,82,84 -85,87-88,94,97,103,106-107,112-113,117-118,120-123,125,127-130,132-133,138-139,141-143,146,151-152,155-159,162,164-165,167,173,175- 178,181,189,192-193,197,204,212,217,220,222-223,225,227,230,233,235,237,242,246,249,252] int count: more than 3 distinct values struct ompi_datatype_t * datatype: more than 3 distinct values int root: more than 3 distinct values struct ompi_communicator_t * comm: '0x1 (Invalid pointer)' [0-1,3,5,7-9,11-12,14,16,18,21,23,25-29,31-32,34-45,47-48,51-52,55-61,65,68,70-72,74-7 7,79-81,83,86,89-92,95-96,98-111,114-115,117,119,124,126-127,131-132,134-137,139-142,144-145,147-150,155,160-163,169-172,174,177-180 ,182,184-186,189-191,194-196,198-200,203,208-216,219,221,224,226-229,231,234,236,238-241,243-252,254-255] 'null pointer' [2,4,6,10,13,15,17,19-20,22,30,33,46,49-50,53-54,62-64,66-67,69,73,78,82,84-85,87-88,93-94,97,1 12-113,116,118,120-123,125,128-130,133,138,143,146,151-153,156-159,164-165,167-168,173,175-176,181,183,187-188,192-193,197,201-202,2 04,206,217-218,220,222-223,225,230,232-233,235,237,242] mca_coll_base_module_t * module = 'null pointer' [0-23,25-153,155-165,167-204,206,208-252,254-255] locals size_t message_size: more than 3 distinct values ----------------- [0-23,25-153,155-165,167-204,206,208-252,254-255] (250 processes) ----------------- ompi_coll_tuned_bcast_intra_pipeline() at coll_tuned_bcast.c:310 params void * buffer: '0x2 (Invalid pointer)' [4,17,20,93,116,153,168,183,187-188,201-202,206,218,232] 'null pointer' [0,3,5,7-9,11-12,14,16,18,21,23,25,29,31-32,34-35,37-38,40-44,47-48,51,55-56,59-60,70,72,74-7 7,79-81,83,86,89-92,95-96,98-102,104-105,108-111,114-115,119,124,126,131,134-137,140,144-145,147-150,160-161,163,169-172,174,179-180 ,182,184-186,190-191,194-196,198-200,203,208-211,213-216,219,221,224,226,228-229,231,234,236,238-241,243-245,247-248,250-251,254-255 ] 'valid pointer perm=rwxp' [1-2,6,10,13,15,19,22,26-28,30,33,36,39,45-46,49-50,52-54,57-58,61-69,71,73,78,82, 84-85,87-88,94,97,103,106-107,112-113,117-118,120-123,125,127-130,132-133,138-139,141-143,146,151-152,155-159,162,164-165,167,173,17 5-178,181,189,192-193,197,204,212,217,220,222-223,225,227,230,233,235,237,242,246,249,252] int count: more than 3 distinct values struct ompi_datatype_t * datatype: more than 3 distinct values int root: more than 3 distinct values struct ompi_communicator_t * comm: '0x1 (Invalid pointer)' [0-1,3,5,7-9,11-12,14,16,18,21,23,25-29,31-32,34-45,47-48,51-52,55-61,65,68,70-72,74 -77,79-81,83,86,89-92,95-96,98-111,114-115,117,119,124,126-127,131-132,134-137,139-142,144-145,147-150,155,160-163,169-172,174,177-1 80,182,184-186,189-191,194-196,198-200,203,208-216,219,221,224,226-229,231,234,236,238-241,243-252,254-255] 'null pointer' 
[2,4,6,10,13,15,17,19-20,22,30,33,46,49-50,53-54,62-64,66-67,69,73,78,82,84-85,87-88,93-94,97 ,112-113,116,118,120-123,125,128-130,133,138,143,146,151-153,156-159,164-165,167-168,173,175-176,181,183,187-188,192-193,197,201-202 ,204,206,217-218,220,222-223,225,230,232-233,235,237,242] mca_coll_base_module_t * module = 'null pointer' [0-23,25-153,155-165,167-204,206,208-252,254-255] uint32_t segsize: more than 3 distinct values ----------------- [0-23] (24 processes) ----------------- ompi_coll_tuned_bcast_intra_generic() at coll_tuned_bcast.c:232 params void * buffer: '0x2 (Invalid pointer)' [4,17,20] 'null pointer' [0,3,5,7-9,11-12,14,16,18,21,23] 'valid pointer perm=rwxp' [1-2,6,10,13,15,19,22] int original_count: more than 3 distinct values struct ompi_datatype_t * datatype: more than 3 distinct values int root: more than 3 distinct values struct ompi_communicator_t * comm: '0x1 (Invalid pointer)' [0-1,3,5,7-9,11-12,14,16,18,21,23] 'null pointer' [2,4,6,10,13,15,17,19-20,22] mca_coll_base_module_t * module = 'null pointer' [0-23] uint32_t count_by_segment = '8192' [0-23] ompi_coll_tree_t * tree = 'valid pointer perm=rwxp' [0-23] locals ptrdiff_t extent: more than 3 distinct values int num_segments = '1' [0-23] size_t realsegsize: more than 3 distinct values ompi_request_t *[2] recv_reqs = '{, }' [0-23] int segindex = '1' [0-23] char * tmpbuf: more than 3 distinct values ----------------- [0-23] (24 processes) ----------------- ompi_request_default_wait() at request/req_wait.c:37 params ompi_request_t ** req_ptr: more than 3 distinct values ompi_status_public_t * status: more than 3 distinct values locals ompi_request_t * req = 'valid pointer perm=rwxp' [0-23] ----------------- [0-9,11-16,18-19,21-23] (21 processes) ----------------- opal_progress() at runtime/opal_progress.c:207 ----------------- [0-3,5-7,9,11-13,15-16,18-19,22-23] (17 processes) ----------------- btl_openib_component_progress() at btl_openib_component.c:3175 locals mca_btl_openib_device_t * device = 'valid pointer perm=rwxp' [0-3,5-7,9,11-13,15-16,18-19,22-23] ----------------- [1,13,19] (3 processes) ----------------- t3b_poll_cq() at src/cq.c:406 params struct ibv_cq * ibcq = 'valid pointer perm=rwxp' [1,13,19] int num_entries = '1' [1,13,19] struct ibv_wc * wc = 'valid pointer perm=rwxp ([stack])' [1,13,19] locals struct iwch_cq * chp = 'value optimized out' [1,13,19] int err = 'value optimized out' [1,13,19] int npolled = 'value optimized out' [1,13,19] struct iwch_device * rhp = 'valid pointer perm=rwxp' [1,13,19] ----------------- [1,13,19] (3 processes) ----------------- pthread_spin_lock() at ?:? 
----------------- [2] (1 processes) ----------------- t3b_poll_cq() at src/cq.c:407 params struct ibv_cq * ibcq = 'valid pointer perm=rwxp' [2] int num_entries = '1' [2] struct ibv_wc * wc = 'valid pointer perm=rwxp ([stack])' [2] locals struct iwch_cq * chp = 'value optimized out' [2] int err = 'value optimized out' [2] int npolled = 'value optimized out' [2] struct iwch_device * rhp = 'valid pointer perm=rwxp' [2] ----------------- [15] (1 processes) ----------------- t3b_poll_cq() at src/cq.c:415 params struct ibv_cq * ibcq = 'valid pointer perm=rwxp' [15] int num_entries = '1' [15] struct ibv_wc * wc = 'valid pointer perm=rwxp ([stack])' [15] locals struct iwch_cq * chp = '0x11 (Invalid pointer)' [15] int err = '650348576' [15] int npolled = '0' [15] struct iwch_device * rhp = 'valid pointer perm=rwxp' [15] ----------------- [15] (1 processes) ----------------- iwch_poll_cq_one() at src/cq.c:394 params struct iwch_device * rhp = 'valid pointer perm=rwxp' [15] struct iwch_cq * chp = 'value optimized out' [15] struct ibv_wc * wc = 'valid pointer perm=rwxp ([stack])' [15] locals uint64_t cookie = '46212224' [15] uint8_t cqe_flushed = '0 '\0'' [15] struct t3_cqe * hw_cqe = 'null pointer' [15] struct iwch_qp * qhp = 'valid pointer perm=rwxp' [15] int ret = '0' [15] struct t3_wq * wq = 'null pointer' [15] ----------------- ######################### -- Rahul From ashley at pittman.co.uk Wed Aug 18 20:10:43 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Wed, 18 Aug 2010 21:10:43 +0200 Subject: [padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load In-Reply-To: References: Message-ID: <98E5BA01-5BA4-418B-81DA-CFAA4BA7976D@pittman.co.uk> On 18 Aug 2010, at 20:50, Rahul Nabar wrote: > On Wed, Aug 18, 2010 at 1:15 PM, Ashley Pittman wrote: > > To my naive eyes this doesn't mean much but maybe you have a clue? If > not I'll post on the OpenMPI list (or read their make instructions) to > see how the debugger support is built in. I've checked with Jeff already and it's enabled by default. >> With Open-MPI the debugger library is called $OPAL_PREFIX/lib/libompi_dbg_msgq.so IIRC so you could check if this file exists, if it doesn't then you need to check with Open-MPI what steps are needed to ensure this is built. I thought it was built automatically but this is not the case with all MPI's and it doesn't help matters that in some cases if the build of this DLL fails then the build of MPI could still succeed - I fixed this in around the 1.4 timeframe. >> > > I can't find that specific file in my MPI install. Can you send the list of the files that you do have there? I've just checked on a system here and the correct location is $OPAL_PREFIX/lib/openmpi/libompi_dbg_msgq.so (note the extra openmpi in there). If you do have that file then could you take a peek inside a single MPI process and print out the value of "MPIR_dll_name", just attach with gdb and type "p MPIR_dll_name" should tell you. I assume you are using a recent version of OpenMPI? >> Alternatively as I say it could be that padb isn't finding the correct processes, does the rest of the output look correct for what you are expecting and are you using some kind of wrapper script between mpirun and your executable? padb should detect this case and act correctly but it is another possible cause. > > This is the first time I'm using padb (or a stack debugger for that > matter!) :) So, not sure what is the "correct" or "typical" output. 
> I've pasted a snippet at the very bottom of this message, just in case
> there are any clues.

Mainly that the stack trace is from your application and not from say /bin/sh. Some people like to run "mpirun sh -c /path/to/my-app" and whilst padb should cope with this, if you use a non-standard shell it might not.

> I found the process number like so:
> /opt/sbin/bin/padb --show-jobs --config-option rmgr=orte
> 25883
> /opt/sbin/bin/padb --full-report=25883 --config-option rmgr=orte |
> tee padb.log.new.new
>
> What is suspicious though is that this number does not show up in the
> ps output. Does that imply padb is mis-discovering the process?

No. 25883 is the orte job number; run the command ompi-ps and it'll all become clear.

> Stack trace(s) for thread: 1
> -----------------
> [0-23,25-153,155-165,167-204,206,208-252,254-255] (250 processes)
> -----------------
> main() at ?:?
> IMB_init_buffers_iter() at ?:?
> IMB_bcast() at ?:?
> PMPI_Bcast() at pbcast.c:107
> params
> void * buffer:
[snip]

This is absolutely correct. The extended output with variables like you have can be overwhelming; it's included in --full-report for off-line diagnostics, but if you have a reproducer and can experiment it's often easier to see problems without it, so just specify -xt rather than --full-report.

Ashley,

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From rpnabar at gmail.com Wed Aug 18 20:35:01 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Wed, 18 Aug 2010 14:35:01 -0500
Subject: [padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load
In-Reply-To: <98E5BA01-5BA4-418B-81DA-CFAA4BA7976D@pittman.co.uk>
References: <98E5BA01-5BA4-418B-81DA-CFAA4BA7976D@pittman.co.uk>
Message-ID:

On Wed, Aug 18, 2010 at 2:10 PM, Ashley Pittman wrote:
>
> On 18 Aug 2010, at 20:50, Rahul Nabar wrote:
> I've checked with Jeff already and it's enabled by default.

Great! Thanks for checking.

>
> Can you send the list of the files that you do have there? I've just checked on a system here and the correct location is $OPAL_PREFIX/lib/openmpi/libompi_dbg_msgq.so (note the extra openmpi in there).

which mpirun
/opt/ompi_new/bin/mpirun

cd /opt/ompi_new/lib/openmpi
ls
libompi_dbg_msgq.a libompi_dbg_msgq.la

> If you do have that file then could you take a peek inside a single MPI process and print out the value of "MPIR_dll_name", just attach with gdb and type "p MPIR_dll_name" should tell you. I assume you are using a recent version of OpenMPI?

That's confusing. I don't seem to have that file!

mpirun --version
mpirun (Open MPI) 1.4.1

> Mainly that the stack trace is from your application and not from say /bin/sh. Some people like to run "mpirun sh -c /path/to/my-app" and whilst padb should cope with this, if you use a non-standard shell it might not.

Ok. My shell's bash and the invocation was indeed pretty similar to what you show:

NP=256;mpirun -np $NP --host eu001,eu002,eu003,eu004,eu005,eu006,eu007,eu008,eu009,eu010,eu011,eu012,eu013,eu014,eu015,eu016,eu017,eu018,eu019,eu020,eu021,eu022,eu023,eu024,eu025,eu026,eu027,eu028,eu029,eu030,eu031,eu032 -mca btl openib,sm,self /opt/src/mpitests/imb/src/IMB-MPI1 -npmin $NP bcast

> No. 25883 is the orte job number; run the command ompi-ps and it'll all become clear.

Ah! Got it! :)

>
> This is absolutely correct.
> The extended output with variables like you have can be overwhelming; it's included in --full-report for off-line diagnostics, but if you have a reproducer and can experiment it's often easier to see problems without it, so just specify -xt rather than --full-report.

Thanks for the tip!

Some background on where I'm going with this: I've a problem on our 10GigE network where the OFED broadcast tests don't run when called with a larger number of cores. The test just stalls. I've an OFED developer helping me debug this and he wanted the stacks from each compute node once the bcast test is in this strange stalled state. That's exactly what I am trying to use padb for.

Just to confirm:

Even though I don't have the *.so file and padb is throwing the error about not finding the dll, the stack seen in the output I pasted is still relevant? i.e. is there something that's missing because of the absence of the mpi dll?

Thanks again for all your help!

--
Rahul

From ashley at pittman.co.uk Thu Aug 19 00:03:40 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Thu, 19 Aug 2010 01:03:40 +0200
Subject: [padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load
In-Reply-To:
References: <98E5BA01-5BA4-418B-81DA-CFAA4BA7976D@pittman.co.uk>
Message-ID: <60AAB44E-1375-45EC-BB1D-78312C44537F@pittman.co.uk>

On 18 Aug 2010, at 21:35, Rahul Nabar wrote:
> which mpirun
> /opt/ompi_new/bin/mpirun
>
> cd /opt/ompi_new/lib/openmpi
> ls
> libompi_dbg_msgq.a libompi_dbg_msgq.la

That's odd, I have 158 files in that directory including the libompi_dbg_msgq.la but not libompi_dbg_msgq.a. Are you using a static build of OpenMPI or building it with --disable-dlopen? It looks like you've found a bug with the OpenMPI build though, this .so should exist even for static builds.

> Some background on where I'm going with this:
> I've a problem on our 10GigE network where the OFED broadcast tests
> don't run when called with a larger number of cores. The test just
> stalls. I've an OFED developer helping me debug this and he wanted the
> stacks from each compute node once the bcast test is in this strange
> stalled state. That's exactly what I am trying to use padb for.

I would send him what you sent to me in your previous mail. It's a lot of information and it can be hard to parse because of the long lines, so it's best to re-direct it to a file and attach it to avoid line-wrap.

> Just to confirm:
>
> Even though I don't have the *.so file and padb is throwing the error
> about not finding the dll, the stack seen in the output I pasted is
> still relevant?

Yes it's still relevant, it's the complete stack trace of the application; it's just missing some information that could be included. What is interesting and potentially important is that the stack trace for six processes isn't present. It appears the processes were found, or padb would have complained, and they give warnings about the DLL, but they have no stack trace showing - did you truncate the output in the previous email?

> i.e. is there something that's missing because of the
> absence of the mpi dll?

The "MPI message queues" are missing, this will show you the contents of the send and receive queue for every rank. It's possibly not relevant to what you are looking for if you are only interested in the collectives. An example is shown on the web-page or it can also be seen in the OMPI testing results.
http://padb.pittman.org.uk/modes.html#mpi-queue As a final point debugging collectives can be hard, in a deadlock situation it can be hard to tell if all ranks are on the same iteration or if some are ahead of others and some are behind, I have a patch to Open-MPI to add a counter to all collective calls to allow this situation to be detected and reported correctly, if you're still stuck even with the stack trace then you might find this of use. It'll mean patching you MPI build and fixing the above problem with the DLL. http://padb.pittman.org.uk/extensions.html Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From rpnabar at gmail.com Thu Aug 19 00:20:42 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 18 Aug 2010 18:20:42 -0500 Subject: [padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load In-Reply-To: <60AAB44E-1375-45EC-BB1D-78312C44537F@pittman.co.uk> References: <98E5BA01-5BA4-418B-81DA-CFAA4BA7976D@pittman.co.uk> <60AAB44E-1375-45EC-BB1D-78312C44537F@pittman.co.uk> Message-ID: On Wed, Aug 18, 2010 at 6:03 PM, Ashley Pittman wrote: Thanks again very much Ashley! > That's odd, I have 158 files in that directory including the libompi_dbg_msgq.la but not libompi_dbg_msgq.a, are you using a static build of OpenMPI or building it with --disable-dlopen? ?It looks like you've found a bug with the OpenMPI build though, this .so should exist even for static builds. I wasn't the one who built our original mpi so unfortunately I can't answer your question definitively. Is there a way to query out the build options from the mpi executable? Same for the static vs dynamic compilation. Can I find that from the executable? Alternatively, some snooping in the sources does reveal that config.log has this: $ ./configure --prefix=/opt/ompi_new --with-tm=/opt/torque FC=ifort CC=icc F77=ifort CXX=icpc CFLAGS=-g -O3 -mp FFLAGS=-mp -recurs ive -O3 CXXFLAGS=-g CPPFLAGS=-DPgiFortran --disable-shared --enable-static --with-memory-manager --disable-dlopen --enable-openib-rd macm --with-openib=/usr So it could be that we did use the static and disable-dlopen. But I'd take it with skepticism since I cannot be 100% sure that this indeed was the source that was used. Sorry. :( > I would send him what you sent to me in your previous mail. ?It's a lot of information and it can be hard to parse because of the long lines so it's best to re-direct it to a file and attach it to avoid line-wrap. Yup. I already used a tee to a file. > Yes it's still relevant, it's the complete stack trace of the application, it's just missing some information that could be included. ?What is interesting and potentially important is that the stack trace for six processes isn't present, it appears the processes were found or padb would have complained and they give warnings about the DLL but they have no stack trace showing, did you truncate the output in the previous email? Yes. I had. Here's the full output. I wasn't sure if the list accepts attachments so I posted it online. http://dl.dropbox.com/u/118481/padb.log.new.new.txt > As a final point debugging collectives can be hard, in a deadlock situation it can be hard to tell if all ranks are on the same iteration or if some are ahead of others and some are behind, I have a patch to Open-MPI to add a counter to all collective calls to allow this situation to be detected and reported correctly, if you're still stuck even with the stack trace then you might find this of use. 
?It'll mean patching you MPI build and fixing the above problem with the DLL. That would be my next line of attack, thanks! :) BTW, out of curiosity, is padb an alternative to things like vampir, totalview etc. or are those a different niche with a different goal? -- Rahul From ashley at pittman.co.uk Thu Aug 19 00:58:31 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Thu, 19 Aug 2010 01:58:31 +0200 Subject: [padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load In-Reply-To: References: <98E5BA01-5BA4-418B-81DA-CFAA4BA7976D@pittman.co.uk> <60AAB44E-1375-45EC-BB1D-78312C44537F@pittman.co.uk> Message-ID: On 19 Aug 2010, at 01:20, Rahul Nabar wrote: > On Wed, Aug 18, 2010 at 6:03 PM, Ashley Pittman wrote: > > Alternatively, some snooping in the sources does reveal that > config.log has this: > > $ ./configure --prefix=/opt/ompi_new --with-tm=/opt/torque FC=ifort > CC=icc F77=ifort CXX=icpc CFLAGS=-g -O3 -mp FFLAGS=-mp -recurs > ive -O3 CXXFLAGS=-g CPPFLAGS=-DPgiFortran --disable-shared > --enable-static --with-memory-manager --disable-dlopen > --enable-openib-rd > macm --with-openib=/usr I'd say that was fairly conclusive, I'm happy to take this up with the Open-MPI team myself unless you want to? > http://dl.dropbox.com/u/118481/padb.log.new.new.txt That's good to see, re your problem the process that jumps out at me is rank 255, all processes are in ompi_coll_tuned_bcast_intra_generic() but at three different line numbers, rank 255 is unique in being at coll_tuned_bcast.c:232. This is later in the code and suggests to me that this process is stuck here on one iteration for whatever reason and all other processes have continued on, into the next iteration of MPI_Bcast() and are blocked waiting for rank 255 to complete the previous broadcast and start the next one. This is where the collective state patch comes into it's own really. >> As a final point debugging collectives can be hard, in a deadlock situation it can be hard to tell if all ranks are on the same iteration or if some are ahead of others and some are behind, I have a patch to Open-MPI to add a counter to all collective calls to allow this situation to be detected and reported correctly, if you're still stuck even with the stack trace then you might find this of use. It'll mean patching you MPI build and fixing the above problem with the DLL. > > That would be my next line of attack, thanks! :) > > BTW, out of curiosity, is padb an alternative to things like vampir, > totalview etc. or are those a different niche with a different goal? It's both an alternative and a also different niche. There is overlap with totalview certainly but I try to target a different work-flow, TotalView is good if you are local to the machine, have a working reproducer and are sitting down for an afternoon to work at a problem, padb is more lightweight and is more for taking snapshots of state and then moving on. padb can both be automated and it's results emailed around which I think are both big plus points, it's non-interactive though so can't go into anything like the level of detail. Ashley. -- Ashley Pittman, Bath, UK. 
Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From rpnabar at gmail.com Thu Aug 19 01:34:34 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 18 Aug 2010 19:34:34 -0500 Subject: [padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load In-Reply-To: References: <98E5BA01-5BA4-418B-81DA-CFAA4BA7976D@pittman.co.uk> <60AAB44E-1375-45EC-BB1D-78312C44537F@pittman.co.uk> Message-ID: On Wed, Aug 18, 2010 at 6:58 PM, Ashley Pittman wrote: > I'd say that was fairly conclusive, I'm happy to take this up with the Open-MPI team myself unless you want to? Please, do go ahead. I'd be happy to assist in any way required but you likely understand the meat of the matter better! I'd confuse the Open-MPIers mostly if I posted there. >> http://dl.dropbox.com/u/118481/padb.log.new.new.txt > > That's good to see, re your problem the process that jumps out at me is rank 255, all processes are in ompi_coll_tuned_bcast_intra_generic() but at three different line numbers, rank 255 is unique in being at coll_tuned_bcast.c:232. ?This is later in the code and suggests to me that this process is stuck here on one iteration for whatever reason and all other processes have continued on, into the next iteration of MPI_Bcast() and are blocked waiting for rank 255 to complete the previous broadcast and start the next one. ?This is where the collective state patch comes into it's own really. Thanks for this analysis! One, I'm passing this one to the OFED developer working with me on this. Second, I'm going to try to apply your patch just to see if that makes Steve's debugging life any easier. > > It's both an alternative and a also different niche. ?There is overlap with totalview certainly but I try to target a different work-flow, TotalView is good if you are local to the machine, have a working reproducer and are sitting down for an afternoon to work at a problem, padb is more lightweight and is more for taking snapshots of state and then moving on. ?padb can both be automated and it's results emailed around which I think are both big plus points, it's non-interactive though so can't go into anything like the level of detail. Thanks for the clarification. padb is clearly very useful! :) -- Rahul From daniel.kidger at googlemail.com Thu Aug 19 10:51:07 2010 From: daniel.kidger at googlemail.com (Daniel Kidger) Date: Thu, 19 Aug 2010 10:51:07 +0100 Subject: [padb-users] Fwd: Error message from /opt/sbin/libexec/minfo: No DLL to load In-Reply-To: References: <98E5BA01-5BA4-418B-81DA-CFAA4BA7976D@pittman.co.uk> <60AAB44E-1375-45EC-BB1D-78312C44537F@pittman.co.uk> Message-ID: ---------- Forwarded message ---------- From: Daniel Kidger Date: 19 August 2010 10:50 Subject: Re: [padb-users] Error message from /opt/sbin/libexec/minfo: No DLL to load To: Ashley Pittman Ashley, >As a final point debugging collectives can be hard, in a deadlock situation it can be hard to tell if all >ranks are on the same iteration or if some are ahead of others and some are behind, I have a >patch to Open-MPI to add a counter to all collective calls to allow this situation to be detected and >reported correctly, if you're still stuck even with the stack trace then you might find this of use. It'll >mean patching you MPI build and fixing the above problem with the DLL. I would be particularly interested in this patch. 
Albeit it is often further complicated in that the code I am working on often calls collectives like MPI_Allgather from various subsets of MPI_COMM_WORLD, such that I do not expect all processes to have called it the same number of times - does your patch allow for this?

Daniel

Dr. Dan Kidger
Bull UK

From ashley at pittman.co.uk Thu Aug 19 12:18:07 2010
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Thu, 19 Aug 2010 12:18:07 +0100
Subject: [padb-users] Fwd: Error message from /opt/sbin/libexec/minfo: No DLL to load
In-Reply-To:
References: <98E5BA01-5BA4-418B-81DA-CFAA4BA7976D@pittman.co.uk> <60AAB44E-1375-45EC-BB1D-78312C44537F@pittman.co.uk>
Message-ID: <363EA6D3-9FA4-4ABA-B860-31BD7AAEDF41@pittman.co.uk>

On 19 Aug 2010, at 10:51, Daniel Kidger wrote:
> > As a final point debugging collectives can be hard, in a deadlock situation it can be hard to tell if all ranks are on the same iteration or if some are ahead of others and some are behind, I have a patch to Open-MPI to add a counter to all collective calls to allow this situation to be detected and reported correctly, if you're still stuck even with the stack trace then you might find this of use. It'll mean patching your MPI build and fixing the above problem with the DLL.
>
> I would be particularly interested in this patch.
> Albeit it is often further complicated in that the code I am working on often calls collectives like MPI_Allgather from various subsets of MPI_COMM_WORLD, such that I do not expect all processes to have called it the same number of times - does your patch allow for this?

Yes it does.

To be clear, the "collective debugger" functionality is a proposal for extending the specification between the tool (in this case padb) and the MPI library. The patch is an implementation of the proposal for Open-MPI, so you will need to re-compile your MPI library to use it. Unfortunately it's looking like the proposal might not be formally adopted, purely due to a lack of time on my part, but I'm hoping that it can be made to work somehow.

The patch and its background are online, although unfortunately there is no sample output from when padb uses this.

http://padb.pittman.org.uk/extensions.html

Ashley.

--
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From rpnabar at gmail.com Fri Aug 20 19:11:57 2010
From: rpnabar at gmail.com (Rahul Nabar)
Date: Fri, 20 Aug 2010 13:11:57 -0500
Subject: [padb-users] padb stalls with no output.
Message-ID:

I'm stumped, since I had a perfectly running padb just a day ago and now on the same system it won't produce any output at all. I'm sure it's me doing something terribly silly, but can't figure out what!
------------------------------------------------------------------------------------------------------------ [rpnabar at eu001 test]$ whoami rpnabar [rpnabar at eu001 test]$ NP=64;mpirun -np $NP --host eu001,eu002,eu003,eu004,eu005,eu006,eu007,eu008 -mca btl openib,sm,self /opt/src/mpitests/imb/src/IMB-MPI1 -npmin $NP gather [wait for test to stall] [then in another shell window onto the same node] [rpnabar at eu001 ~]$ whoami rpnabar [rpnabar at eu001 ~]$ /opt/sbin/bin/padb --all --stack-trace --tree --config-option rmgr=orte -v Loading config from "/etc/padb.conf" Loading config from "/home/rpnabar/.padbrc" Loading config from environment Loading config from command line Setting 'rmgr' to 'orte' Active jobs (0) are No active jobs could be found for user 'rpnabar' [gather test is still stalled in the other window] ------------------------------------------------------------------------------------------------------------ Any ideas what could be going on here? What's even more confusing is that ompi-ps produces no output either (see below)! Have I broken my mpi install somehow? But that wouldn't make sense since the actual mpi tests are running file. Again, the symptoms are so bizarre that I suspect I am the one doing something stupid. But can't figure out what it is!! ------------------------------------------------------------------------------------------------------------ [rpnabar at eu001 ~]$ /opt/sbin/bin/padb --show-jobs --config-option rmgr=orte [No output] [rpnabar at eu001 ~]$ ompi-ps [No output] [rpnabar at eu001 ~]$ ompi-ps -v [eu001:11486] orte_ps: Acquiring list of HNPs and setting contact info into RML... [eu001:11486] orte_ps: Gathering Information for HNP: [[14224,0],0]:5891 [No output] ------------------------------------------------------------------------------------------------------------ -- Rahul From ashley at pittman.co.uk Fri Aug 20 19:38:26 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Fri, 20 Aug 2010 19:38:26 +0100 Subject: [padb-users] padb stalls with no output. In-Reply-To: References: Message-ID: <4D1B4A5C-6885-43CC-AB55-55F93F0EACF3@pittman.co.uk> On 20 Aug 2010, at 19:11, Rahul Nabar wrote: > Any ideas what could be going on here? What's even more confusing is > that ompi-ps produces no output either (see below)! Have I broken my > mpi install somehow? But that wouldn't make sense since the actual mpi > tests are running file. Again, the symptoms are so bizarre that I > suspect I am the one doing something stupid. But can't figure out what > it is!! Padb simply calls ompi-ps to get the list of running jobs, if ompi-ps hangs then padb will hang as well, I could make padb handle this case better but only by detecting a timeout and giving an error message to the user, it still wouldn't be able to attach to the job. The problem is with OpenMPI and with it's state directories in /tmp in particular, it could either be that you've had a parallel job that crashed and has left files around or it could that you are running multiple versions of OMPI and they don't like talking to each other. Where possible where you are running multiple versions of OMPI I recommend having your PATH and the rest of your environment the same for padb as it is for the job you are trying to inspect, for most cases it makes no difference but there tend to be corner cases when resource managers get upset otherwise. 
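A quick way to see whether stale state is the culprit is to look for leftover Open MPI session directories on the node that ran mpirun. A sketch only - the exact directory naming varies between Open MPI versions, so treat the pattern below as a guess:

# With no jobs running for this user, anything matching is a candidate
# for stale state left behind by a crashed run.
$ ls -ld /tmp/openmpi-sessions-${USER}* 2>/dev/null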
Can you report this to OpenMPI please, there used to be lots of problems with this kind of issue but it underwent a big cleanup for 1.3 and I've not had a problem myself for a long time now. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From ashley at pittman.co.uk Fri Aug 20 19:46:13 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Fri, 20 Aug 2010 19:46:13 +0100 Subject: [padb-users] padb stalls with no output. In-Reply-To: <4D1B4A5C-6885-43CC-AB55-55F93F0EACF3@pittman.co.uk> References: <4D1B4A5C-6885-43CC-AB55-55F93F0EACF3@pittman.co.uk> Message-ID: <76CB6038-C62C-49CA-9729-C5380F7B4AB9@pittman.co.uk> On 20 Aug 2010, at 19:38, Ashley Pittman wrote: > The problem is with OpenMPI and with it's state directories in /tmp in particular, it could either be that you've had a parallel job that crashed and has left files around or it could that you are running multiple versions of OMPI and they don't like talking to each other. Oh - I forgot to say how to "fix" this issue, wait until there are no jobs running and remove all OPMI related files and directories in /tmp on the node where the mpirun process was running. Alternatively you could try setting the "resource manager" in padb to mpirun instead of orte, this won't use ompi-ps but will use the MPIR interface instead to extract the information from the mpirun process directly, you are limited to 32 nodes if you do this however as it uses pdsh as a backend. You can try increasing this by setting FANOUT=128 in your environment but you risk running out of file descriptors if you do this, see the pdsh man page for details. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From rpnabar at gmail.com Fri Aug 20 20:51:33 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 20 Aug 2010 14:51:33 -0500 Subject: [padb-users] padb stalls with no output. In-Reply-To: <4D1B4A5C-6885-43CC-AB55-55F93F0EACF3@pittman.co.uk> References: <4D1B4A5C-6885-43CC-AB55-55F93F0EACF3@pittman.co.uk> Message-ID: On Fri, Aug 20, 2010 at 1:38 PM, Ashley Pittman wrote: > > On 20 Aug 2010, at 19:11, Rahul Nabar wrote: > > The problem is with OpenMPI and with it's state directories in /tmp in particular, it could either be that you've had a parallel job that crashed Yes. I had one yesterday that warned me "mpi could not guarantee all processes were killed" I followed up with pkills on all mpi processes. And then: mpirun --pernode --host $LIST_OF_NODES orte-clean Is there a better response? >>and has left files around or it could that you are running multiple versions of OMPI Nope. Just one version. But the problem is fixed now! I just deleted the whole /tmp dir on the main node. THanks for that tip! -- Rahul From rpnabar at gmail.com Fri Aug 20 20:53:09 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 20 Aug 2010 14:53:09 -0500 Subject: [padb-users] padb stalls with no output. 
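For what it's worth, a more targeted cleanup than wiping all of /tmp is possible along the lines of the advice above. A sketch only - orte-clean is the command already used earlier in this thread, while the /tmp pattern is an assumption about the default Open MPI session-directory naming, so check what is actually there before deleting anything:

# Run only when no MPI jobs are active: clean orte state on every node,
# then remove any leftover session directories on the node that ran mpirun.
$ mpirun --pernode --host $LIST_OF_NODES orte-clean
$ rm -rf /tmp/openmpi-sessions-${USER}*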
In-Reply-To: <76CB6038-C62C-49CA-9729-C5380F7B4AB9@pittman.co.uk> References: <4D1B4A5C-6885-43CC-AB55-55F93F0EACF3@pittman.co.uk> <76CB6038-C62C-49CA-9729-C5380F7B4AB9@pittman.co.uk> Message-ID: On Fri, Aug 20, 2010 at 1:46 PM, Ashley Pittman wrote: > On 20 Aug 2010, at 19:38, Ashley Pittman wrote: > >> The problem is with OpenMPI and with it's state directories in /tmp in particular, it could either be that you've had a parallel job that crashed and has left files around or it could that you are running multiple versions of OMPI and they don't like talking to each other. > > Oh - I forgot to say how to "fix" this issue, wait until there are no jobs running and remove all OPMI related files and directories in /tmp on the node where the mpirun process was running. Yup! Thanks! This fixed it. It's funny that I had rebooted all nodes but that hadn't fixed it. Somehow the bad state was persistent through the reboot. Maybe because it is based on the files and dirs in /tmp. Maybe I will change my reboot protocol to clean up /tmp -- Rahul From swise at opengridcomputing.com Mon Aug 30 20:59:40 2010 From: swise at opengridcomputing.com (Steve Wise) Date: Mon, 30 Aug 2010 14:59:40 -0500 Subject: [padb-users] padb not finding remote ranks Message-ID: <4C7C0DAC.7080300@opengridcomputing.com> Hey, I have an openmpi-1.4.1 8 node 64NP cluster, which is running jobs via orte/mpirun. I start a job that is hanging, and then run padb to get the stack traces, yet padb displays this error and only shows the local process stacks (see output below). Any ideas? Thanks in advance. Steve ------ [root at n0 ~]# mpirun --output-filename /share/log/out -np 64 --host n0,n1,n2,n3,n4,n5,n6,n7 --mca btl_openib_verbose 0 --mca btl_openib_receive_queues P,65536,64 --mca btl openib,sm,self /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 -npmin 64 gather & [1] 4621 [root at n0 ~]# ompi-ps Information from mpirun [20398,0] ----------------------------------- JobID | State | Slots | Num Procs | ------------------------------------------ [20398,1] | Running | 8 | 64 | Process Name | ORTE Name | Local Rank | PID | Node | State | --------------------------------------------------------------------------------------------------------------------------- /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],0] | 0 | 4629 | n0.asicdesigners.com | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],1] | 0 | 4673 | n1 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],2] | 0 | 4794 | n2 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],3] | 0 | 4694 | n3 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],4] | 0 | 4666 | n4 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],5] | 0 | 4674 | n5 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],6] | 0 | 4671 | n6 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],7] | 0 | 4876 | n7 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],8] | 1 | 4630 | n0.asicdesigners.com | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],9] | 1 | 4674 | n1 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],10] | 1 | 4795 | n2 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],11] | 1 | 4695 | n3 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],12] | 1 | 4667 | n4 | Running | 
/usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],13] | 1 | 4675 | n5 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],14] | 1 | 4672 | n6 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],15] | 1 | 4877 | n7 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],16] | 2 | 4631 | n0.asicdesigners.com | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],17] | 2 | 4675 | n1 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],18] | 2 | 4796 | n2 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],19] | 2 | 4696 | n3 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],20] | 2 | 4668 | n4 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],21] | 2 | 4676 | n5 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],22] | 2 | 4673 | n6 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],23] | 2 | 4878 | n7 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],24] | 3 | 4632 | n0.asicdesigners.com | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],25] | 3 | 4676 | n1 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],26] | 3 | 4797 | n2 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],27] | 3 | 4697 | n3 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],28] | 3 | 4669 | n4 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],29] | 3 | 4677 | n5 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],30] | 3 | 4674 | n6 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],31] | 3 | 4879 | n7 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],32] | 4 | 4633 | n0.asicdesigners.com | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],33] | 4 | 4677 | n1 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],34] | 4 | 4798 | n2 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],35] | 4 | 4698 | n3 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],36] | 4 | 4670 | n4 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],37] | 4 | 4678 | n5 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],38] | 4 | 4675 | n6 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],39] | 4 | 4880 | n7 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],40] | 5 | 4634 | n0.asicdesigners.com | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],41] | 5 | 4678 | n1 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],42] | 5 | 4799 | n2 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],43] | 5 | 4699 | n3 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],44] | 5 | 4671 | n4 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],45] | 5 | 4679 | n5 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],46] | 5 | 4676 | n6 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],47] | 5 | 4881 | n7 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],48] | 6 | 4635 | n0.asicdesigners.com | Running | 
/usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],49] | 6 | 4679 | n1 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],50] | 6 | 4800 | n2 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],51] | 6 | 4700 | n3 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],52] | 6 | 4672 | n4 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],53] | 6 | 4680 | n5 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],54] | 6 | 4677 | n6 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],55] | 6 | 4882 | n7 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],56] | 7 | 4636 | n0.asicdesigners.com | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],57] | 7 | 4680 | n1 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],58] | 7 | 4801 | n2 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],59] | 7 | 4701 | n3 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],60] | 7 | 4673 | n4 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],61] | 7 | 4681 | n5 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],62] | 7 | 4678 | n6 | Running | /usr/mpi/gcc/openmpi-1.4.1/tests/IMB-3.2/IMB-MPI1 | [[20398,1],63] | 7 | 4883 | n7 | Running | [root at n0 ~]# /share/bin/padb --all --stack-trace --tree --config-option rmgr=orte Warning, failed to locate ranks [1-7,9-15,17-23,25-31,33-39,41-47,49-55,57-63] ----------------- [0,8,16,24,32,40,48,56] (8 processes) ----------------- main() at ?:? IMB_init_buffers_iter() at ?:? IMB_gather() at ?:? PMPI_Gather() at pgather.c:175 mca_coll_sync_gather() at coll_sync_gather.c:46 ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714 ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248 mca_pml_ob1_recv() at ../../../../opal/threads/condition.h:99 ----------------- [0,8,16,24,32,40,56] (7 processes) ----------------- opal_progress() at runtime/opal_progress.c:207 ----------------- 48 (1 processes) ----------------- opal_progress() at ../opal/include/opal/sys/amd64/timer.h:46 [root at n0 ~]#