[padb-users] start using padb on TORQUE

David Singleton David.Singleton at anu.edu.au
Thu Nov 25 20:01:44 GMT 2010


On 11/26/2010 05:46 AM, Ashley Pittman wrote:
>
> On 10 Nov 2010, at 23:48, Jie Cai wrote:
>
>>
>> On 11/11/10 06:41, Ashley Pittman wrote:
>>>
>>>> (2) In the interactive mode of a PBS job, I get the following information and warning. Please note that no PBS job is detected; I was expecting a PBS job to be detected.
>>>>
>>> pbs_pro support has been included for a while. PBS and Torque support are slightly different and have only been added very recently; in fact the current HEAD will detect jobs and launch itself on the remote nodes but not find the individual processes. It is almost certainly looking for the wrong environment variable, so it should be easy to fix once I get more feedback from people who are testing it (I don't have access to a PBS system, which makes it difficult).
>>>
>>>
>> I am happy to help with this. Our PBS system is built on OpenPBS. I am not sure whether there is a major difference in the interface between the old OpenPBS and Torque or PBS Pro.
>
> Have you made any headway on this?  I'm back from SC now so can devote some time to it myself if you have questions or can get me access to a PBS system.  As I said it should really just be a case of finding out what environment variables are set by pbs and what the parent process of the parallel processes is called.
>

Hi Ashley,

Jie now has padb working in our PBS/OpenMPI environment.  We would be
happy to give you access to our system but since our environment is fairly
"unique", it may not help you much in terms of supporting Torque or other
PBS sites. We have our own version of PBS, based on OpenPBS, which was
started well before Torque came into existence.

I'll comment on a couple of the issues we have come across and you can
decide what is interesting to you.  Jie might have other points to add (he
is the one who has had to hack the perl).

* Long ago, we changed the format of the PBS exechost string (qstat -n
   output) to be more compact. The relevant parts are like pdsh hostlist
   format, eg.
       v[5-6,15-18,30-31]/cpus=0-7/mems=0-1
   We have avoided working out how to get padb to use this format by
   adding an "old exechost format" option to our qstat.  For very large
   jobs, I think our format makes more sense and should be easier to use
   with pdsh.  We haven't looked at clustershell yet (we use c3 for cluster
   management).

* We run all jobs under project groups which are not user login groups.
   That causes grief for "rsh node gdb ..." type debugging because of
   insufficient privileges.  Since this is a common problem for us, we
   have a variant of newgrp that we can insert in remote commands to
   overcome this, eg
       rsh node nfnewgrp projgroup gdb ...

   Note that all variants of PBS support users nominating their job's
   execution group (the group_list/egroup job attributes) but I don't know
   how commonly this is exercised.

* A common variant of MPI job is one launched like
       mpirun wrapper_script mpi_executable
   so that the parent of the MPI tasks is not orted/mpid/mpirun. We are
   interested in ways to support such jobs.  Since job processes are
   contained in cpusets (cgroups) on our system, we can easily get the
   relevant process list and then use the environment to find ranks. Will
   it matter if a non-MPI process with OMPI_COMM_WORLD_RANK set is queried
   for message queue info?  Does it matter that two processes have the same
   rank?
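
To make the compact exechost format from the first point concrete, here is
a minimal Python sketch of expanding it into a plain host list. It is not
what padb or our qstat does. It assumes the string is a single pdsh-style
node expression followed by slash-separated key=value attributes, as in the
example above; the function names expand_hostlist and parse_exechost are
made up for illustration.

```python
import re

def expand_hostlist(spec):
    """Expand a pdsh-style spec such as 'v[5-6,15-18]' into host names."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]([^\[\]]*)", spec)
    if match is None:
        return [spec]  # plain host name, nothing to expand
    prefix, ranges, suffix = match.groups()
    hosts = []
    for part in ranges.split(","):
        low, _, high = part.partition("-")
        high = high or low  # a single number is a range of one
        # Preserve zero padding such as 'n[001-003]'.
        width = len(low) if low.startswith("0") else 0
        for n in range(int(low), int(high) + 1):
            hosts.append(prefix + str(n).zfill(width) + suffix)
    return hosts

def parse_exechost(exechost):
    """Split 'nodes/key=value/...' into (host list, attribute dict).

    Assumption: attributes apply to every node and are separated by '/'.
    """
    nodes, *attrs = exechost.split("/")
    attributes = dict(attr.split("=", 1) for attr in attrs)
    return expand_hostlist(nodes), attributes

hosts, attributes = parse_exechost("v[5-6,15-18,30-31]/cpus=0-7/mems=0-1")
print(hosts)
print(attributes)
```

A real exechost string may contain more than one such group, which this
sketch does not handle.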

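The "use the environment to find ranks" idea in the last point could look
roughly like the Python sketch below. It is only an illustration, not padb
code. It assumes Linux, where /proc/<pid>/environ holds the NUL-separated
environment a process was started with and is readable only by the owner or
root. The pid list is taken as input because the location of the cpuset
task list is site specific. The function names are made up.

```python
import os

RANK_VAR = "OMPI_COMM_WORLD_RANK"  # set by Open MPI for each launched task

def parse_environ(raw):
    """Parse the NUL-separated contents of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:  # skip empty or malformed entries
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env

def rank_of_pid(pid, proc_root="/proc"):
    """Return the rank recorded in a process's environment, or None."""
    path = os.path.join(proc_root, str(pid), "environ")
    try:
        with open(path, "rb") as fh:
            env = parse_environ(fh.read())
    except OSError:
        return None  # process has exited or we lack permission
    value = env.get(RANK_VAR)
    return int(value) if value is not None and value.isdigit() else None

def pids_by_rank(pids, proc_root="/proc"):
    """Group pids by rank, dropping processes with no rank variable.

    A rank that maps to several pids may be a wrapper script plus the MPI
    executable it started, since the child inherits the variable.
    """
    ranks = {}
    for pid in pids:
        rank = rank_of_pid(pid, proc_root)
        if rank is not None:
            ranks.setdefault(rank, []).append(pid)
    return ranks

if __name__ == "__main__":
    # Example: inspect only the current process.
    print(pids_by_rank([os.getpid()]))
```

Any rank with more than one pid in the result is exactly the case the two
questions above are about.
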
Thanks for the very relevant and useful tool.

Cheers,
David

-- 
--------------------------------------------------------------------------
    Dr David Singleton               ANU Supercomputer Facility
    HPC Systems Manager              and NCI National Facility
    David.Singleton at anu.edu.au       Leonard Huxley Bldg (No. 56)
    Phone: +61 2 6125 4389           Australian National University
    Fax:   +61 2 6125 8199           Canberra, ACT, 0200, Australia
--------------------------------------------------------------------------



