[padb] Message queues on Solaris

Ashley Pittman ashley at pittman.co.uk
Thu Nov 5 22:58:12 GMT 2009


I've managed to boot a Solaris 10 machine, taken a closer look and got
it working. r325 should work now, assuming the dll itself is correct.

Below is the output I get from a two-process job using Sun HPC
ClusterTools 8.1. It tells us two things:

1) The dll isn't built properly and won't open with
dlopen(..., RTLD_NOW). I've added a fallback to RTLD_LAZY, which
allows it to work.

2) The dll looks for a type called mca_topo_base_comm_1_0_0_t, which
isn't present in the library, and at that point reports that it cannot
display the message queues.

These are both bugs in the dll itself and not something I can fix in
padb.
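
For anyone curious what the fallback in point 1 amounts to, here is a
minimal sketch in C. It is not the actual minfo code: the function name
open_dll_with_fallback and the used_lazy flag are made up for
illustration, and only the standard dlopen()/dlerror() calls are assumed.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper, not padb's real implementation: try to resolve
 * every symbol up front with RTLD_NOW and, if that fails, retry with
 * RTLD_LAZY so that unresolved function symbols are only looked up
 * when they are first called. */
static void *open_dll_with_fallback(const char *path, int *used_lazy)
{
    void *handle;
    const char *err;

    *used_lazy = 0;

    handle = dlopen(path, RTLD_NOW);
    if (handle != NULL)
        return handle;

    /* Keep the loader's own explanation, it usually names the symbol. */
    err = dlerror();
    fprintf(stderr, "Unable to dlopen dll with RTLD_NOW, trying LAZY...\n");
    if (err != NULL)
        fprintf(stderr, "%s\n", err);

    handle = dlopen(path, RTLD_LAZY);
    if (handle != NULL)
        *used_lazy = 1;

    return handle; /* NULL if both attempts failed */
}
```

Note that a lazy open only defers the problem: a missing function
symbol such as opal_malloc will still cause a failure if the dll ever
calls it, so this is a workaround rather than a fix.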

I've also fixed a number of other problems. mpirun_get_jobs wasn't
returning any jobs on Solaris, so the --all (-a) option wasn't working;
this should now behave on Solaris as it does on Linux.

root at ip-10-250-13-226:~/padb-read-only/src# ./padb -aq
Warning: errors reported by some ranks
========
[0-1]: Error message from ./minfo.x: image_has_queues() failed
========
----------------
[0-1]: Error string from DLL
----------------
Failed to find some type
----------------
[0-1]: Message from DLL
----------------
mca_topo_base_comm_1_0_0_t
----------------
[0-1]: Warning message from minfo
----------------
Unable to dlopen dll with RTLD_NOW, trying LAZY...
----------------
[0-1]: Warning message from minfo
----------------
ld.so.1: minfo.x: fatal: relocation error: file /opt/SUNWhpc/HPC8.1/sun/lib/openmpi/libompi_dbg_msgq.so: symbol opal_malloc: referenced symbol not found


-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




