Bugzilla – Bug 5467
Condor-G/Jobmanager race results in truncated stdout/err files
Last modified: 2007-08-13 15:55:48
Jaime Frey's analysis of a problem seen on multiple OSG sites that results in a
double stage-out of data, with the second copy improperly being zero bytes:
When stage-in for the job is complete, Condor-G tells the jobmanager
to exit. At the same time, the jobmanager learns that the job itself
has finished and starts staging the output files back to Condor-G.
The jobmanager waits until the staging out is done before exiting. In
the meantime, Condor-G receives word that the job is complete and
tries to restart the jobmanager (creating a new jobmanager process to
replace the one that it thinks is now dead). The new jobmanager
realizes that the original jobmanager is still alive, says so to
Condor-G and exits. But before exiting, it calls the cache_cleanup
perl module callout, which removes job-related files like stdout/err.
Condor-G waits for a minute and restarts the jobmanager again. By
this time, the original jobmanager has exited, and the new jobmanager
process proceeds. It repeats the stage out process, but the files
have been deleted and empty files end up being transferred.
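The sequence above can be replayed as a small toy model. This is only a sketch in Python, not Globus or Condor code: the function names, the dictionaries standing in for the gatekeeper and submit-side files, and the file contents are all made up for illustration. It also simplifies the timing by letting the first stage-out finish before the cleanup runs.

```python
# Toy model of the race. All names are illustrative assumptions,
# not actual jobmanager or Condor-G interfaces.

def stage_out(remote_files, names):
    """Copy each expected output file; a missing file is sent as empty content."""
    return {name: remote_files.get(name, "") for name in names}


def simulate():
    expected = ["stdout", "stderr"]
    # Files held on the gatekeeper side for this job.
    remote = {"stdout": "hello\n", "stderr": "warning\n"}
    # What Condor-G ends up with on the submit side.
    submit_side = {}

    # Jobmanager 1 is told to exit, but the job has just finished,
    # so it stages the output back before exiting.
    jm1_alive = True
    submit_side.update(stage_out(remote, expected))
    first_copy = dict(submit_side)

    # Jobmanager 2 starts while jobmanager 1 is still alive. It reports
    # that and exits, but runs the cleanup callout on its way out.
    if jm1_alive:
        remote.clear()

    # Jobmanager 1 exits. About a minute later jobmanager 3 starts and
    # repeats the stage-out against files that no longer exist.
    jm1_alive = False
    submit_side.update(stage_out(remote, expected))
    return first_copy, submit_side


if __name__ == "__main__":
    first, final = simulate()
    print("after first stage-out:", first)
    print("after second stage-out:", final)
```

In this model the first copy is intact and the second stage-out overwrites it with empty content, which matches the zero-byte files reported above.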
I see three problems that should be addressed to fix the lost stdout/err:
1) If a jobmanager that's started to manage an existing job notices
that a previous jobmanager for the same job is still alive, it
shouldn't delete the job's files.
2) If a jobmanager finishes staging output files successfully and
then exits, a new jobmanager for that job shouldn't perform the same
stage-out again.
3) When Condor-G tells a jobmanager to exit, it should wait at least
a few seconds before starting a new jobmanager for that job.
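The three points above could look roughly like the following guards. This is a hedged Python sketch, not the actual patch: the marker file name `stage_out.done`, the `previous_manager_alive` flag, the `transfer` and `start_manager` callables, and the 5-second default are all assumptions chosen for illustration, and the job directory is assumed to contain only plain files.

```python
import os
import time

STAGED_MARKER = "stage_out.done"  # hypothetical marker file name


def cleanup_job_files(job_dir, previous_manager_alive):
    """Point 1: skip cleanup while another jobmanager still owns the job."""
    if previous_manager_alive:
        return False
    for name in os.listdir(job_dir):
        os.remove(os.path.join(job_dir, name))
    return True


def stage_out_once(job_dir, transfer):
    """Point 2: record a successful stage-out so a restart does not repeat it."""
    marker = os.path.join(job_dir, STAGED_MARKER)
    if os.path.exists(marker):
        return False
    transfer(job_dir)  # hypothetical callable that copies the output files
    with open(marker, "w") as f:
        f.write("done\n")
    return True


def restart_after_grace(start_manager, grace_seconds=5.0, sleep=time.sleep):
    """Point 3: wait briefly after asking the old jobmanager to exit."""
    sleep(grace_seconds)
    return start_manager()
```

Any one of these guards breaks the sequence at a different step: the first keeps the files in place, the second prevents the repeated transfer, and the third makes it less likely that the new jobmanager overlaps with the old one at all.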
A preliminary patch for this issue is on the dev.globus.org wiki:
It implements something like option 1 in the bug report description.
Fix committed to CVS.