Bugzilla – Bug 5698
Allow a prologue/epilogue script for 'mpi' and 'multiple' jobs
Last modified: 2012-09-12 13:16:30
We need to run MPI jobs which require the following sequence of steps:

1) set up working directory (single process step)
2) mpirun/mpiexec (multi-process step)
3) package output files for stage out (single process step)

Step 3) is crucial because we cannot foretell the exact file names that will be produced by the job. The MPI executable is third-party software (WRF), so we can't easily control how it names the output files (of which a variable number is produced, depending on the input data). All we know is that their names begin with 'wrfout'.

We have been using a workaround so far: we let mpirun/mpiexec execute a script rather than the actual MPI executable. The script implements the necessary synchronization magic to execute the setup/cleanup step just once. However, this solution has limited portability because some implementations of mpirun send SIGKILL to the processes immediately after the MPI_Finalize, making it impossible to perform step 3) above (and thus making stage-out fail).

We'd prefer a simple mechanism for running a user-defined script right before and right after the invocation of mpirun/mpiexec in the pbs.pm-generated job script. The same mechanism would also be beneficial for jobs of type 'multiple'. It should be very easy to extend pbs.pm accordingly. However, to make it universally useful rather than confined to a single Grid (site), an official extension to the GRAM schema is needed which allows the specification of the 'pre' and 'post' executable for jobs of type 'mpi' or 'multiple'.

Some have suggested that the functionality could be implemented using an external workflow engine. This might be true, but it should not be necessary in the first place. Our problem concerns conceptually atomic jobs, not much different in character than the traditional sequential jobs (we have a clear stage-in, computation, stage-out phase). To force users to familiarize themselves with a workflow engine for the purpose of running such typical jobs is not a good solution. Furthermore, the use of a workflow engine would complicate matters because the three jobs must run at the same site. This requirement seems to defeat straightforward meta-scheduling approaches that schedule jobs individually, without keeping track of the execution history of multiple jobs (e.g., Condor-G).
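To make step 3) concrete, here is a minimal sketch of what a user-supplied epilogue could do: collect every file whose name begins with 'wrfout' into a single archive with a known name, so that stage-out can refer to one fixed file. It is written in Python purely for illustration. The function name `package_outputs` and the archive name `stageout.tar.gz` are assumptions, not part of any existing GRAM or WRF tooling.

```python
import glob
import os
import tarfile


def package_outputs(workdir, prefix="wrfout", archive_name="stageout.tar.gz"):
    """Bundle every regular file in workdir whose name starts with prefix.

    Returns the sorted list of file names that were archived.
    archive_name is a hypothetical, fixed name that stage-out can rely on;
    it deliberately does not start with the prefix, so a second run does
    not pick up the archive itself.
    """
    pattern = os.path.join(glob.escape(workdir), prefix + "*")
    matches = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    archive_path = os.path.join(workdir, archive_name)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in matches:
            # Store only the base name so the archive unpacks flat.
            tar.add(path, arcname=os.path.basename(path))
    return [os.path.basename(p) for p in matches]
```

Because the archive name is fixed, the job description would only need to list that one file for stage-out, regardless of how many 'wrfout' files the run produced.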
I second this request. This is a very common need in cluster jobs, one that is conventionally handled in a batch job submission script. Users may do much more than file management. For instance, they may do pre-processing, or signal other machines that the job has started and finished. So the need is for a general command to be run (just once per job) before the executable.
This issue has been discussed on the Globus users list:
http://www.globus.org/mail_archive/gt-user/2008/05/msg00164.html

As further evidence of pressing need, consider these large sites where personnel time has been expended to produce site-specific work-arounds:

* LRZ (Gabriel Mateescu) http://www.grid.lrz.de/en/mware/globus/download_preamble.html
* Jan Ploski's solution https://bi.offis.de/wisent/tiki-index.php?page=Condor-GT4-BigJobs

To deal with sites where the sysadmin has not taken special measures, I have personally written user-specific scripts to run a per-job prologue. My scripts use environment variables specific to the MPI implementation to determine the rank of the process, then on the root process execute the prologue script; the other processes wait for the root process to finish. Finally they all run the MPI executable.

And I know a user who has gone back to running jobs directly on the batch system because the medicine (globusrun-ws) was worse than the disease.
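The rank-based workaround described above can be sketched roughly as follows. This is an illustrative Python version, not the actual scripts. It assumes the launcher exports one of the rank variables listed in `RANK_VARS` (Open MPI, MPICH-style PMI, and MVAPICH2 names; other implementations differ), and that `marker` is a per-job path on a filesystem shared by all nodes. The names `detect_rank` and `run_with_prologue` are hypothetical.

```python
import os
import subprocess
import sys
import time

# Rank variables exported by some common launchers. This list is an
# assumption and has to be extended for other MPI implementations.
RANK_VARS = ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "MV2_COMM_WORLD_RANK")


def detect_rank(env):
    """Return the MPI rank found in env, or None if no known variable is set."""
    for name in RANK_VARS:
        if name in env:
            return int(env[name])
    return None


def run_with_prologue(prologue_cmd, mpi_cmd, marker, poll_seconds=1.0):
    """Run the prologue once on rank 0, then start the MPI program on every rank.

    marker must be unique per job and visible to all ranks (assumption:
    a shared filesystem). Non-root ranks poll until it appears.
    """
    rank = detect_rank(os.environ)
    if rank is None:
        sys.exit("cannot determine MPI rank from the environment")
    if rank == 0:
        subprocess.check_call(prologue_cmd)  # raises if the prologue fails
        open(marker, "w").close()            # signal the waiting ranks
    else:
        while not os.path.exists(marker):
            time.sleep(poll_seconds)
    # Replace this wrapper process with the real MPI executable.
    os.execvp(mpi_cmd[0], mpi_cmd)
```

Note the weaknesses that make this a workaround rather than a solution: if the prologue fails, the other ranks wait forever, and nothing comparable is possible for an epilogue when mpirun kills the processes right after MPI_Finalize.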
Here is the JDD syntax I proposed for job prologue/epilogue scripts. It echoes the syntax for the main (parallel) executable. In contrast to the parallel executable, a prologue executable would be run just once per job. It would be run after stage-in, and just before the parallel executable is started. That is, it would execute as does code in a conventional cluster batch script that precedes the mpirun command.

*** begin suggested code **************************
<prologue>
  <executable>/bin/bash</executable>
  <argument>-c</argument>
  <argument>my bash shell script</argument>
</prologue>
*** end suggested code **************************

Likewise, an epilogue section would specify a program to be executed just once, after the parallel executable runs, and before stage-out.
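To illustrate how little is needed to consume this syntax, here is a small Python sketch that turns a <prologue> or <epilogue> element into a command line. It is only a sketch: it ignores XML namespaces (which a real job description would carry), and `command_from_section` is a hypothetical helper, not part of GRAM.

```python
import xml.etree.ElementTree as ET

# The proposed element, copied from the suggestion above.
PROLOGUE = """<prologue>
  <executable>/bin/bash</executable>
  <argument>-c</argument>
  <argument>my bash shell script</argument>
</prologue>"""


def command_from_section(xml_text):
    """Return [executable, arg1, ...] for a <prologue> or <epilogue> element.

    Assumption: the element is not namespace-qualified.
    """
    section = ET.fromstring(xml_text)
    executable = section.findtext("executable")
    if executable is None:
        raise ValueError("section has no <executable> element")
    arguments = [arg.text or "" for arg in section.findall("argument")]
    return [executable.strip()] + arguments
```

Calling `command_from_section(PROLOGUE)` returns `['/bin/bash', '-c', 'my bash shell script']`, which can be passed directly to a process launcher without shell quoting issues.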
I like Steve's syntax. The proposed point of execution "immediately before" mpiexec/mpirun (for prologue) and "immediately after" mpiexec/mpirun (for epilogue) also seems appropriate. More precisely, the prologue/epilogue should be executed as child processes of the same process which now executes mpiexec/mpirun. Because at this point of execution all resources are already allocated to the local job by the resource manager, there should be a documented recommendation for users to limit the amount of computation done in the prologue/epilogue to prevent wasting of resources. An alternative would be to run the prologue/epilogue on the Globus host (rather than on a particular node chosen by the resource manager), but I think it would not match the users' intuition as well as Steve's proposed solution.
Let me be more accurate about the immediacy. It is important for some purposes (e.g. signaling processes on another machine) that the prologue run immediately before the parallel processes start. But the prologue need only run *after* stage-in, not "immediately after". Likewise for epilogue: it should run immediately after the parallel process runs, and before stage-out. This models better the conventional batch system usage. Commands in a batch script run immediately before and after the mpirun command, but files are usually moved to and from the machine by some other mechanism.
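The ordering discussed here can be written down as a short Python sketch: all three steps are launched in sequence by the same parent process, with nothing in between. The failure handling is my assumption and was not settled in this thread: a failed prologue aborts the job, while the epilogue runs even if the parallel step fails. `run_job` is a hypothetical name.

```python
import subprocess


def run_job(prologue, parallel, epilogue, run=subprocess.call):
    """Run prologue, parallel step, and epilogue in order as child processes.

    Each argument is a command list such as ["mpirun", "-np", "4", "wrf.exe"];
    prologue and epilogue may be None. Returns the exit code of the parallel
    step, or of the prologue if it failed.
    """
    if prologue:
        rc = run(prologue)
        if rc != 0:
            return rc      # assumption: a failed prologue aborts the job
    rc = run(parallel)     # starts immediately after the prologue returns
    if epilogue:
        run(epilogue)      # assumption: runs even if the parallel step failed
    return rc
```

Stage-in and stage-out are deliberately absent: they happen before and after this whole sequence, by whatever mechanism the service uses.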
This functionality is common to all batch systems, for very good reasons. This is a serious omission in WS-GRAM job submission. It is a "blocker" for many users. They will give up on WS-GRAM job submission when they hit it. I have seen this happen.
I showed this to a user, and he had a very good addition to make. He pointed out that there is no reason why the working directory of the prologue and epilogue should be the same as that of the MPI executable. So we propose that a <prologue> or <epilogue> section may contain a <directory> element. It works like the existing directory elements, but applies only to the executions inside that section. In the absence of this special <directory> element, the working directory is taken to be that of the <job>.
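The fallback rule for <directory> is simple enough to state as code. The Python sketch below assumes elements without namespaces; `working_directory` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def working_directory(section_xml, job_directory):
    """Return the directory a <prologue> or <epilogue> should run in.

    Uses the section's own <directory> if present and non-empty,
    otherwise falls back to the directory of the enclosing <job>.
    """
    section = ET.fromstring(section_xml)
    own = section.findtext("directory")
    if own and own.strip():
        return own.strip()
    return job_directory
```

For example, a section containing `<directory>/scratch/prep</directory>` would run in `/scratch/prep` (a made-up path), while a section without the element would run in the job's directory.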
This problem has been mentioned in a report "Running MPI Jobs on Grid Resources" http://www.gac-grid.org/project-documents/deliverables/wp1.html as one of several problems hindering real users from submitting scientific jobs to clusters using Globus. I would call it the worst of these problems.
Steve, Jan, I have a student working on adding the prologue/epilogue functionality. I think this will be available to try out pretty soon, in the next 1 - 2 months. Nice doc on running MPI jobs. I added a link to it in the WS GRAM 4.0 User guide.
Here is the result of Connor's work this summer. http://dev.globus.org/wiki/GRAM/prologue This worked with PBS and a GRAM 4.0 install, but there is more work to do. To implement it, we had to preserve the depth of the XML extension elements; otherwise there would be collisions between things like the prologue's executable and the base JDD executable. In this implementation, we used a non-default perl module, XML::Simple, to transform the XML extension into perl JD hash elements. But we probably won't end up going with XML::Simple. We're looking into doing all the XML to perl job description hash processing in java in the GRAM service. Then the perl scripts/modules won't have to do any XML parsing. So this is not ready for prime time yet, but it would be good to see if this is in line with what you were thinking.
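The collision problem can be illustrated with a short Python sketch (the actual implementation is in perl with XML::Simple; this is not that code). Keeping the nesting means the prologue's executable lives under the "prologue" key and cannot overwrite the job's own executable. `to_nested` is a hypothetical helper and the paths in the example are made up.

```python
import xml.etree.ElementTree as ET


def to_nested(element):
    """Convert an element into nested dicts so child names keep their parent context.

    Leaf elements become their stripped text. Every child tag maps to a
    list, because elements such as <argument> may repeat.
    """
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        result.setdefault(child.tag, []).append(to_nested(child))
    return result


# Hypothetical job description fragment with two <executable> elements
# at different depths.
JOB = """<job>
  <executable>/usr/bin/wrf.exe</executable>
  <prologue>
    <executable>/bin/bash</executable>
  </prologue>
</job>"""

job = to_nested(ET.fromstring(JOB))
```

Here `job["executable"]` is `['/usr/bin/wrf.exe']` and `job["prologue"][0]["executable"]` is `['/bin/bash']`. A flattened hash keyed only by element name would have had to pick one of the two.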
Hi, All I see at that link is some code and some diffs. I am not a compute cluster admin, so I can't actually test this. Where is the documentation?
We've migrated our issue tracking software to jira.globus.org. This issue is being tracked in http://jira.globus.org/browse/GT-83