Bug 5698 - Allow a prologue/epilogue script for 'mpi' and 'multiple' jobs
Status: RESOLVED LATER
Product: GRAM
Component: general
Version: 4.0.5
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assigned To:
Reported: 2007-12-03 10:49
Modified: 2012-09-12 13:16


Description From 2007-12-03 10:49:31
We need to run MPI jobs which require the following sequence of steps:
1) set up working directory (single process step)
2) mpirun/mpiexec (multi-process step)
3) package output files for stage out (single process step)

Step 3) is crucial because we cannot foretell the exact file names that will be
produced by the job. The MPI executable is third-party software (WRF), so we
can't easily control how it names the output files (of which a variable number
is produced, depending on the input data). All we know is that their names
begin with 'wrfout'.
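
For illustration, here is a rough sketch of the conventional batch script we
would like the submitted job to be equivalent to. The PBS directives, process
counts and file names are assumptions based on our WRF setup and only serve to
show the three-step structure.

*** begin illustrative sketch ******************
#!/bin/bash
#PBS -l nodes=4:ppn=2     # resource request is illustrative only

# 1) single-process setup step: prepare the working directory
cd "$PBS_O_WORKDIR" || exit 1
mkdir -p run && cp namelist.input run/ && cd run

# 2) multi-process step: run the third-party MPI executable (WRF)
mpirun -np 8 ./wrf.exe

# 3) single-process cleanup step: package the output for stage-out
#    (file names are unpredictable; we only know they start with 'wrfout')
tar czf wrfout-results.tar.gz wrfout*
*** end illustrative sketch ******************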

We have been using a workaround so far: we let mpirun/mpiexec execute a script
rather than the actual MPI executable. The script implements the necessary
synchronization magic to execute the setup/cleanup step just once. However,
this solution has limited portability because some implementations of mpirun
send SIGKILL to the processes immediately after MPI_Finalize returns, making it
impossible to perform step 3) above (and thus causing stage-out to fail).

We'd prefer a simple mechanism for running a user-defined script right before
and right after the invocation of mpirun/mpiexec in the pbs.pm-generated job
script. The same mechanism would also be beneficial for jobs of type
'multiple'. It should be very easy to extend pbs.pm accordingly. However, to
make it universally useful rather than confined to a single Grid (site), an
official extension to the GRAM schema is needed which allows the specification
of the 'pre' and 'post' executable for jobs of type 'mpi' or 'multiple'.

Some have suggested that the functionality could be implemented using an
external workflow engine. This might be true, but it should not be necessary in
the first place. Our problem concerns conceptually atomic jobs, not much
different in character from traditional sequential jobs (they have clear
stage-in, computation, and stage-out phases). Forcing users to familiarize
themselves with a workflow engine just to run such typical jobs is not a good
solution. Furthermore, using a workflow engine would complicate matters
because the three steps must then run as separate jobs at the same site. This
requirement seems to defeat straightforward meta-scheduling approaches that
schedule jobs individually, without keeping track of the execution history of
multiple jobs (e.g., Condor-G).
------- Comment #1 From 2008-05-06 07:02:21 -------
I second this request.

This is a very common need for cluster jobs, one that is conventionally handled
in the batch job submission script.

Users may do much more than file management: for instance, they may do
pre-processing, or signal other machines that the job has started or finished.
So the need is for a general command to be run (just once per job) before the
executable.
------- Comment #2 From 2008-05-08 04:49:44 -------
This issue has been discussed on the Globus users list
    http://www.globus.org/mail_archive/gt-user/2008/05/msg00164.html

As further evidence of the pressing need, consider these large sites where
personnel time has been expended to produce site-specific work-arounds:
* LRZ (Gabriel Mateescu)
    http://www.grid.lrz.de/en/mware/globus/download_preamble.html
* Jan Ploski's solution
    https://bi.offis.de/wisent/tiki-index.php?page=Condor-GT4-BigJobs

To deal with sites where the sysadmin has not taken special measures, I have
personally written user-specific scripts to run a per-job prologue.  

My scripts use environment variables specific to the MPI implementation to
determine each process's rank; the root process then executes the prologue
script while the other processes wait for it to finish. Finally, they all run
the MPI executable.
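
For concreteness, such a wrapper might look roughly like the sketch below. The
rank variables (OMPI_COMM_WORLD_RANK for Open MPI, PMI_RANK for MPICH-style
launchers), the flag-file synchronization and the script names are assumptions;
the real scripts are implementation-specific.

*** begin illustrative sketch ******************
#!/bin/bash
# Launched by mpirun/mpiexec in place of the real MPI executable.
RANK="${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-0}}"
DONE_FLAG="prologue.done"            # flag file in the shared job directory

if [ "$RANK" -eq 0 ]; then
    ./prologue.sh                    # root process runs the prologue once
    touch "$DONE_FLAG"
else
    # the other processes wait until the root process has finished
    while [ ! -e "$DONE_FLAG" ]; do sleep 1; done
fi

# finally, every process runs the real MPI executable
exec ./real_mpi_executable "$@"
*** end illustrative sketch ******************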

And I know a user who has gone back to running jobs directly on the batch
system because the medicine (globusrun-ws) was worse than the disease.
------- Comment #3 From 2008-05-08 05:00:05 -------
Here is the JDD syntax I proposed for job prologue/epilogue scripts.
It echoes the syntax for the main (parallel) executable.

In contrast to the parallel executable, a prologue executable would be run just
once per job.  It would be run immediately after stage-in, and just before the
parallel executable is started.  That is, it would execute as does code in a
conventional cluster batch script that precedes the mpirun command.

*** begin suggested code **************************
<prologue>
    <executable>/bin/bash</executable>
    <argument>-c</argument>
    <argument>my bash shell script</argument>
</prologue>
*** end suggested code **************************

Likewise, an epilogue section would specify a program to be executed just once,
after the parallel executable runs, and before stage-out.
------- Comment #4 From 2008-05-08 10:37:10 -------
I like Steve's syntax. The proposed point of execution "immediately before"
mpiexec/mpirun (for prologue) and "immediately after" mpiexec/mpirun (for
epilogue) also seems appropriate. More precisely, the prologue/epilogue should
be executed as child processes of the same process which now executes
mpiexec/mpirun.

Because at this point of execution all resources are already allocated to the
local job by the resource manager, there should be a documented recommendation
for users to limit the amount of computation done in the prologue/epilogue to
avoid wasting resources. An alternative would be to run the
prologue/epilogue on the Globus host (rather than on a particular node chosen
by the resource manager), but I think it would not match the users' intuition
as well as Steve's proposed solution.
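
To make the execution point concrete, the relevant fragment of the
pbs.pm-generated job script could look roughly like the sketch below. The
variable names are hypothetical; the point is only that prologue, mpiexec and
epilogue are run in sequence by the same process, after the resource manager
has allocated the job's resources.

*** begin illustrative sketch ******************
# hypothetical fragment of the generated PBS job script
if [ -n "$PROLOGUE_EXECUTABLE" ]; then
    "$PROLOGUE_EXECUTABLE" $PROLOGUE_ARGS     # runs once, right before mpiexec
fi

mpiexec -np "$COUNT" "$EXECUTABLE" $ARGS
rc=$?

if [ -n "$EPILOGUE_EXECUTABLE" ]; then
    "$EPILOGUE_EXECUTABLE" $EPILOGUE_ARGS     # runs once, after mpiexec, before stage-out
fi

exit $rc
*** end illustrative sketch ******************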
------- Comment #5 From 2008-05-09 03:50:29 -------
Let me be more accurate about the immediacy.

It is important for some purposes (e.g. signaling processes on another machine)
that the prologue run immediately before the parallel processes start.  

But the prologue need only run *after* stage-in, not "immediately after".

Likewise for epilogue: it should run immediately after the parallel process
runs, and before stage-out.

This better models conventional batch-system usage.  Commands in a batch
script run immediately before and after the mpirun command, but files are
usually moved to and from the machine by some other mechanism.
------- Comment #6 From 2008-05-09 03:57:04 -------
This functionality is common to all batch systems, for very good reasons.  This
is a serious omission in WS-GRAM job submission.

It is a "blocker" for many users.  They will give up on WS-GRAM job submission
when they hit it.  I have seen this happen.
------- Comment #7 From 2008-05-09 10:58:06 -------
I showed this to a user, and he had a very good addition to make.

He pointed out that there is no reason why the working directory the prologue
and epilogue run in should be the same as that of the MPI executable.

So we propose that a <prologue> or <epilogue> section may contain a
<directory> element.  It works like the existing <directory> element, but
applies only to the executions inside that section.  In the absence of this
special <directory> element, the working directory is taken to be that of the
<job>.
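
In the generated job script this could be honored with something as simple as
the following sketch (variable names are hypothetical):

*** begin illustrative sketch ******************
# run the prologue in its own directory if one was given,
# otherwise in the job's working directory
(
    cd "${PROLOGUE_DIRECTORY:-$JOB_DIRECTORY}" || exit 1
    "$PROLOGUE_EXECUTABLE" $PROLOGUE_ARGS
)
*** end illustrative sketch ******************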
------- Comment #8 From 2008-07-17 07:04:03 -------
This problem has been mentioned in a report

   "Running MPI Jobs on Grid Resources"
   http://www.gac-grid.org/project-documents/deliverables/wp1.html

as one of several problems hindering real users from submitting scientific jobs
to clusters using Globus.

I would call it the worst of these problems.
------- Comment #9 From 2008-07-17 11:55:23 -------
Steve, Jan,

I have a student working on adding the prologue/epilogue functionality.  I
think this will be available to try out pretty soon, in the next 1-2 months.

Nice doc on running MPI jobs.  I added a link to it in the WS GRAM 4.0 User
guide.
------- Comment #10 From 2008-08-13 17:39:36 -------
Here is the result of Connor's work this summer.
   http://dev.globus.org/wiki/GRAM/prologue

This worked with PBS and a GRAM 4.0 install, but there is more work to do.  To
implement it, we had to preserve the depth of the XML extension elements;
otherwise there would be collisions between things like the prologue's
executable and the base JDD executable.  In this implementation, we used a
non-default Perl module, XML::Simple, to transform the XML extension into Perl
job description hash elements.  But we probably won't end up going with
XML::Simple.  We're looking into doing all the XML-to-Perl job description
hash processing in Java in the GRAM service.  Then the Perl scripts/modules
won't have to do any XML parsing.

So this is not ready for prime time yet, but it would be good to see if this is
in line with what you were thinking.
------- Comment #11 From 2008-08-14 05:43:04 -------
Hi,

All I see at that link is some code and some diffs.
I am not a compute cluster admin, so I can't actually test this.
Where is the documentation?
------- Comment #12 From 2012-09-12 13:16:30 -------
We've migrated our issue tracking software to jira.globus.org. This issue is
being tracked in http://jira.globus.org/browse/GT-83