Bug 5467 - Condor-G/Jobmanager race results in truncated stdout/err files
Status: RESOLVED FIXED
Product: GRAM
Component: gt2 Gatekeeper/Jobmanager
Version: unspecified
Platform: Macintosh All
Importance: P3 blocker
Target Milestone: 4.0.6
 
Reported: 2007-08-01 09:28
Modified: 2007-08-13 15:55




Description From 2007-08-01 09:28:20
Jaime Frey's analysis of a problem seen on multiple OSG sites that results in a
double stage-out of data, with the second copy improperly being zero bytes
long:

When stage-in for the job is complete, Condor-G tells the jobmanager
to exit. At the same time, the jobmanager learns that the job itself
has finished and starts staging the output files back to Condor-G.
The jobmanager waits until the stage-out is done before exiting. In
the meantime, Condor-G receives word that the job is complete and
tries to restart the jobmanager (creating a new jobmanager process to
replace the one that it thinks is now dead). The new jobmanager
realizes that the original jobmanager is still alive, reports this to
Condor-G, and exits. But before exiting, it calls the cache_cleanup
perl module callout, which removes job-related files like stdout/err.
Condor-G waits for a minute and restarts the jobmanager again. By
this time, the original jobmanager has exited, and the new jobmanager
process proceeds. It repeats the stage-out process, but the files
have been deleted, so empty files end up being transferred.
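
To make the ordering of events concrete, here is a minimal Python sketch that replays the sequence described above in a single process. It is only an illustration and is not taken from the jobmanager or Condor-G code: the function names stage_out, cache_cleanup and replay_race, the file name "stdout", and the modelling of a missing file as an empty transfer are all assumptions made for the example.

```python
import os
import tempfile


def stage_out(path):
    """Return the bytes a stage-out would send; a missing file is sent as empty."""
    if not os.path.exists(path):
        return b""
    with open(path, "rb") as f:
        return f.read()


def cache_cleanup(path):
    """Hypothetical stand-in for the cache_cleanup callout: remove the stdout file."""
    if os.path.exists(path):
        os.remove(path)


def replay_race(workdir):
    """Replay the reported event order and return what each stage-out sent."""
    stdout_path = os.path.join(workdir, "stdout")
    with open(stdout_path, "wb") as f:
        f.write(b"job output\n")

    transfers = []

    # First jobmanager: told to exit, but the job has finished, so it
    # stages out first and stays alive until that is done.
    first_jm_alive = True
    transfers.append(stage_out(stdout_path))

    # Second jobmanager: sees the first one is still alive, reports that,
    # and runs the cleanup on its way out.
    if first_jm_alive:
        cache_cleanup(stdout_path)

    # The first jobmanager exits; about a minute later a third jobmanager
    # starts and repeats the stage-out against the deleted file.
    first_jm_alive = False
    transfers.append(stage_out(stdout_path))
    return transfers


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(replay_race(d))
```

In this model the first transfer carries the real output and the second carries nothing, which matches the double stage-out with a zero-byte second copy.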

I see three problems that should be addressed to fix the lost stdout/err:
1) If a jobmanager that's started to manage an existing job notices  
that a previous jobmanager for the same job is still alive, it  
shouldn't delete the job's files.
2) If a jobmanager finishes staging output files successfully and  
then exits, a new jobmanager for that job shouldn't perform the same  
transfers again.
3) When Condor-G tells a jobmanager to exit, it should wait at least  
a few seconds before starting a new jobmanager for that job.
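
The three points above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual jobmanager or Condor-G logic: restart_jobmanager, may_restart, the "stage_out.done" marker file, and the 5-second MIN_RESTART_DELAY are hypothetical names and values chosen for the example.

```python
import os
import tempfile

MIN_RESTART_DELAY = 5.0  # seconds; an assumed value for "a few seconds"


def restart_jobmanager(job_dir, old_jm_alive):
    """Model a jobmanager started to manage an existing job."""
    stdout_path = os.path.join(job_dir, "stdout")
    done_marker = os.path.join(job_dir, "stage_out.done")  # hypothetical marker

    if old_jm_alive:
        # Point 1: a previous jobmanager is still alive, so report that
        # and exit without deleting any of the job's files.
        return "old jobmanager alive"

    if os.path.exists(done_marker):
        # Point 2: an earlier jobmanager already staged out successfully,
        # so do not repeat the transfers.
        return "already staged out"

    with open(stdout_path, "rb") as f:
        data = f.read()  # stand-in for the real transfer
    with open(done_marker, "w"):
        pass  # record that stage-out completed
    return "staged out %d bytes" % len(data)


def may_restart(told_to_exit_at, now):
    """Point 3: only restart once a minimum delay has passed since the exit request."""
    return now - told_to_exit_at >= MIN_RESTART_DELAY


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "stdout"), "wb") as f:
            f.write(b"job output\n")
        print(restart_jobmanager(d, old_jm_alive=True))
        print(restart_jobmanager(d, old_jm_alive=False))
        print(restart_jobmanager(d, old_jm_alive=False))
```

A marker file is just one way to remember that stage-out finished; any persistent per-job state would serve the same purpose.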
------- Comment #1 From 2007-08-02 16:05:04 -------
A preliminary patch for this issue is on the dev.globus.org wiki:
http://dev.globus.org/images/d/d4/Old-jm-alive.diff
It implements something like option 1 in the bug report description.

joe
------- Comment #2 From 2007-08-13 15:55:48 -------
Fix committed to CVS.