[padb-devel] [padb commit] r42 - Write the extensions page and add my email to the front
codesite-noreply at google.com
Wed Jun 10 14:44:04 BST 2009
Author: apittman
Date: Wed Jun 10 06:43:10 2009
New Revision: 42
Modified:
trunk/doc/extensions.html
trunk/doc/header.html
trunk/doc/index.html
Log:
Write the extensions page and add my email to the front
page.
Modified: trunk/doc/extensions.html
==============================================================================
--- trunk/doc/extensions.html (original)
+++ trunk/doc/extensions.html Wed Jun 10 06:43:10 2009
@@ -1,10 +1,109 @@
-<h2>Patches</h2>
-The advanced Group Deadlock Detection algorithm within <i>padb</i>
-requires modifications to the MPI library to function properly.
-At this time patches are available for Open MPI svn trunk
-only.
-Contact the <a href="mailto:padb-devel at pittman.org.uk">developer mailing list</a>
-for more information.
+
+<h2>MPI collective debugger extension proposal</h2>
+
+<h3>Overview</h3>
+The current
+<a href="http://www.mcs.anl.gov/research/projects/mpi/mpi-debug/">MPI debugger interface</a>
+is used to export information from a running application to a debugger. The current
+interface allows the debugger to look at an MPI process, to iterate over communicators
+within that process and to view message queues associated with a communicator.
+
+<p>
+I propose an extension to this to export information about individual communicators
+within a process, in particular information about collective operations (MPI_Bcast,
+MPI_Reduce et al.).
+
+<h2>Implementation</h2>
+Specifically, I propose adding a communicator-specific counter for
+each possible collective, where the counter simply records the number of times the
+collective has been called on this communicator. Alongside it is kept a second piece
+of data: whether the process is still performing the collective operation.
+
+<p>
+
+A new enum, <i>mqs_comm_class</i>, is added to the interface, with a value for each collective
+call.
+
+<p>
+
+A single extra callback function, <i>mqs_get_comm_coll_state</i>, is added to the
+interface and queries the current communicator in the same way as <i>mqs_next_operation</i>.
+This function takes the standard <i>process</i> parameter, a <i>mqs_comm_class</i> enum as input
+selecting which collective to query, and two <i>int *</i> parameters. The first is a pointer to an
+int which should be set to the number of calls made to the collective; the second
+is a pointer to an int which should be set to zero or one depending on whether the collective operation is still
+active.
+
+<p>
+
+A successful call to <i>mqs_get_comm_coll_state</i> should return <i>mqs_ok</i>, with
+<i>mqs_no_information</i> being returned where information isn't available. This allows
+further enum values to be added in the future, should the MPI Forum approve new collective
+functions, without needing to change the debugger function interface.
+
+<h3>Performance Impact</h3>
+Maintaining this data does add code to the "critical path" of the MPI stack. In
+its simplest form all it requires is a pair of counter increments per collective call,
+one on function entry and one on function exit, so whilst there is a non-zero run-time cost
+associated with maintaining this information it's a minimal one.
+
+<h2>mpi_interface.h</h2>
+The additions required to mpi_interface.h are shown below.
+<pre>
+
+typedef enum
+{
+ mqs_comm_barrier,
+ mqs_comm_broadcast,
+ mqs_comm_allgather,
+ mqs_comm_allgatherv,
+ mqs_comm_allreduce,
+ mqs_comm_alltoall,
+ mqs_comm_alltoallv,
+ mqs_comm_reduce_scatter,
+ mqs_comm_reduce,
+ mqs_comm_gather,
+ mqs_comm_gatherv,
+ mqs_comm_scan,
+ mqs_comm_scatter,
+ mqs_comm_scatterv
+} mqs_comm_class;
+
+/***********************************************************************
+ * Collective extension
+ *
+ * This extension should be considered optional and the debugger should
+ * correctly handle the case where it doesn't exist.
+ *
+ */
+
+/*
+ * Return the state of collective operations for the currently active
+ * communicator, that is the number of times the collective has been
+ * called and if the operation is still in progress.
+ *
+ * The first int is *really* mqs_comm_class.
+ */
+extern int mqs_get_comm_coll_state (mqs_process *, int, int *, int *);
+</pre>
+
+<h2>Benefits</h2>
+The extension allows a debugger or external program to know the state of collective
+calls within the parallel program. In the typical scenario of debugging a hung
+application this knowledge allows the debugger and programmer to know instantly
+which processes are stuck in collective calls and which aren't, either because they
+have successfully made the collective call and returned or because they haven't
+made the calls other ranks in a communicator have. This information allows swift
+identification of problem areas within the job where further investigation may be
+required.
+
+<p>
+This extension was originally developed in early 2007 whilst I was working at
+Quadrics and has proved its value numerous times in real-life cases.
+
+<h2>Sample Implementation</h2>
+At this time a sample implementation is available for OpenMPI only, although work
+is being done on an MPICH2 version.
<p>
Patch for OpenMPI <a href=OpenMPI-padb-groups.patch>trunk.</a>
Modified: trunk/doc/header.html
==============================================================================
--- trunk/doc/header.html (original)
+++ trunk/doc/header.html Wed Jun 10 06:43:10 2009
@@ -1,7 +1,7 @@
<head><title>Padb: A parallel debugging tool</title></head>
<body>
<center>
-<h1>Parallel Application Discovery Browser</h1>
+<h1>Parallel Application Debugger</h1>
<a href="http://padb.pittman.org.uk/" title="Main page">Padb</a>
<a href="usage.html" title="Command line options">usage</a>
<a href="download.html" title="Download page">download</a>
Modified: trunk/doc/index.html
==============================================================================
--- trunk/doc/index.html (original)
+++ trunk/doc/index.html Wed Jun 10 06:43:10 2009
@@ -9,6 +9,9 @@
open source, non-interactive, command line, script-able tool intended
for use by programmers and system administrators alike.
+<p>
+<i>Padb</i> is currently maintained outside of Quadrics by <a href="mailto:ashley at pittman.co.uk">Ashley Pittman</a>
+
<h2>Features</h2>
<ul>
<li>Stack trace generation</li>
@@ -40,7 +43,7 @@
kind of problems facing them at the time. It's been a part of the
Quadrics software stack for a number of years and has recently been
made available to a wider audience. It has been commercially supported
-for a number of years and is known to work at a scale of thousands of
+for a number of years and is known to work at a scale of tens of thousands of
processes.
<h2>Parallel Environments</h2>
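
Note on the extensions.html proposal in this commit: the entry/exit counter scheme it describes could look roughly like the C sketch below. This is not the Open MPI patch and not part of mpi_interface.h. Only the mqs_comm_class enum is taken from the proposal; every sketch_ and SKETCH_ name is hypothetical, the return codes are local stand-ins for mqs_ok and mqs_no_information, and the counters are read in-process, whereas a real debugger DLL would read them out of the target process through the mqs_process handle.

```c
#include <assert.h>
#include <stddef.h>

/* Enum as given in the proposal. */
typedef enum
{
    mqs_comm_barrier,
    mqs_comm_broadcast,
    mqs_comm_allgather,
    mqs_comm_allgatherv,
    mqs_comm_allreduce,
    mqs_comm_alltoall,
    mqs_comm_alltoallv,
    mqs_comm_reduce_scatter,
    mqs_comm_reduce,
    mqs_comm_gather,
    mqs_comm_gatherv,
    mqs_comm_scan,
    mqs_comm_scatter,
    mqs_comm_scatterv
} mqs_comm_class;

/* Hypothetical: number of collectives, derived from the last enum value. */
#define SKETCH_COLL_COUNT (mqs_comm_scatterv + 1)

/* Hypothetical stand-ins for mqs_ok and mqs_no_information. */
enum { SKETCH_OK = 0, SKETCH_NO_INFORMATION = 1 };

/* Hypothetical per-communicator state: one pair of counters per collective. */
typedef struct
{
    unsigned long entered[SKETCH_COLL_COUNT];
    unsigned long exited[SKETCH_COLL_COUNT];
} sketch_coll_counters;

/* Called by the MPI library on entry to a collective. */
static void sketch_coll_enter(sketch_coll_counters *c, mqs_comm_class op)
{
    c->entered[op]++;
}

/* Called by the MPI library on exit from a collective. */
static void sketch_coll_exit(sketch_coll_counters *c, mqs_comm_class op)
{
    assert(c->exited[op] < c->entered[op]); /* exit must follow an entry */
    c->exited[op]++;
}

/*
 * Report the call count and whether a call is still in progress.
 * An unknown collective yields "no information" rather than an error,
 * so new enum values can be added without changing this signature.
 */
static int sketch_get_coll_state(const sketch_coll_counters *c, int op,
                                 int *count, int *active)
{
    if (c == NULL || op < 0 || op >= SKETCH_COLL_COUNT)
        return SKETCH_NO_INFORMATION;
    *count = (int)c->entered[op];
    *active = (c->entered[op] != c->exited[op]);
    return SKETCH_OK;
}
```

With this layout a rank that is stuck inside a barrier reports active = 1 for mqs_comm_barrier, while a rank that never reached the barrier reports a lower count than its peers, which is the distinction the Benefits section of the page relies on.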