[padb] Message queues on solaris

Ethan Mallove ethan.mallove at sun.com
Fri Nov 6 14:50:35 GMT 2009


On Thu, Nov/05/2009 10:58:12PM, Ashley Pittman wrote:
> 
> I've managed to boot a Solaris 10 machine, take a closer look and get
> it working. r325 should work now, assuming the dll itself is correct.
> 
> Below is the output I get from a two-process job using Sun ClusterTools
> 8.1. What this tells us is twofold:
> 
> 1) The dll isn't built properly and won't open with
> dlopen(...,RTLD_NOW). I've added a fallback to RTLD_LAZY, which
> allows it to work.
> 
> 2) The dll is looking for a type called mca_topo_base_comm_1_0_0_t,
> which isn't present in the library, and at that point it claims it
> cannot display the message queues.
> 
> These are both bugs in the dll itself and not something I can fix in
> padb.
> 
> I've also fixed a number of other problems. mpirun_get_jobs wasn't
> returning any jobs on Solaris, so the --all (-a) option wasn't working;
> this should now behave on Solaris as it does on Linux.
> 
> root at ip-10-250-13-226:~/padb-read-only/src# ./padb -aq
> Warning: errors reported by some ranks
> ========
> [0-1]: Error message from ./minfo.x: image_has_queues() failed
> ========
> ----------------
> [0-1]: Error string from DLL
> ----------------
> Failed to find some type
> ----------------
> [0-1]: Message from DLL
> ----------------
> mca_topo_base_comm_1_0_0_t
> ----------------
> [0-1]: Warning message from minfo
> ----------------
> Unable to dlopen dll with RTLD_NOW, trying LAZY...
> ----------------
> [0-1]: Warning message from minfo
> ----------------
> ld.so.1: minfo.x: fatal: relocation error: file /opt/SUNWhpc/HPC8.1/sun/lib/openmpi/libompi_dbg_msgq.so: symbol opal_malloc: referenced symbol not found
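
For anyone following along, the RTLD_NOW-then-RTLD_LAZY fallback described
in point 1 could look roughly like the sketch below. This is only an
illustration and is not taken from the padb/minfo source: the function name
open_dll_with_fallback and the used_lazy flag are made up for the example.
Only dlopen(), dlerror() and the two RTLD_ flags are real.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not padb's actual code): try to load a library
 * with RTLD_NOW, which resolves all symbols at load time.  If that
 * fails, report why and retry with RTLD_LAZY, which defers function
 * symbol resolution until first use.
 *
 * *used_lazy is set to 1 if the RTLD_LAZY retry was attempted, else 0.
 * Returns the handle, or NULL if both attempts fail. */
void *open_dll_with_fallback(const char *path, int *used_lazy)
{
    void *handle;
    const char *err;

    *used_lazy = 0;
    handle = dlopen(path, RTLD_NOW);
    if (handle != NULL)
        return handle;

    /* dlerror() describes the most recent failure, for example a
     * relocation error naming the unresolved symbol. */
    err = dlerror();
    fprintf(stderr, "Unable to dlopen dll with RTLD_NOW, trying LAZY...\n");
    fprintf(stderr, "%s\n", err != NULL ? err : "(no error string)");

    *used_lazy = 1;
    return dlopen(path, RTLD_LAZY);
}
```

As far as I know, lazy binding only defers function references; references
to data symbols are still bound when the library is loaded, so the LAZY
retry can fail too.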

I'm using ClusterTools 8.2.1 (8.1 is about a year old), though I get a
similar error:

$ padb -aq -Ormgr=mpirun
einner: sh: edb: not found
Warning: errors reported by some ranks
========
[0]: Error message from /home/em162155/software/SunOS/i386/padb/bin/minfo.x: Could not load symbols from dll
========
----------------
[0]: Warning message from minfo
----------------
Unable to dlopen dll with RTLD_NOW, trying LAZY...
----------------
[0]: Warning message from minfo
----------------
ld.so.1: minfo.x: fatal: relocation error: file /ws/hpc-ct-1/hpc-ct-8.2.1/pkgs/09d/SunOS-10/i386/built-with-sun/installs/QEjm/install/lib/openmpi/libompi_dbg_msgq.so: symbol opal_mutex_check_locks: referenced symbol not found
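
One way to check whether the loading process itself exports the symbol named
in a relocation error like the one above is to look it up in the process's
global scope. The sketch below assumes a POSIX dlopen()/dlsym(); the helper
name symbol_is_visible is made up for illustration and is not part of padb
or Open MPI.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `name` can be resolved from the
 * main program and the libraries loaded with it, 0 otherwise. */
int symbol_is_visible(const char *name)
{
    void *self;
    void *addr;
    int found;

    /* A NULL path asks for a handle on the main program itself. */
    self = dlopen(NULL, RTLD_LAZY);
    if (self == NULL)
        return 0;

    dlerror();                      /* clear any stale error state */
    addr = dlsym(self, name);
    found = (dlerror() == NULL && addr != NULL);

    dlclose(self);
    return found;
}
```

Called from a standalone loader, symbol_is_visible("opal_mutex_check_locks")
would be expected to return 0 unless that program links against the library
that defines it.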

-Ethan

> 
> 
> -- 
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
> 
> 
> _______________________________________________
> padb-devel mailing list
> padb-devel at pittman.org.uk
> http://pittman.org.uk/mailman/listinfo/padb-devel_pittman.org.uk
