Bugzilla – Bug 1552
Fix race condition in globusrun
Last modified: 2004-03-02 15:21:29
This is a patch by David Smith of LCG and it is a part of the VDT. It fixes a race condition. I'll try to attach the patch, if Bugzilla lets me. Hmph.
Created an attachment (id=315): Fix for race condition. See above.
There have been a lot of patches submitted and applied to this particular bit of code, depending on what behavior the user wants. Some (people with firewall issues) dislike blocking until callbacks happen. Others are uncomfortable with losing error state when batch mode jobs are submitted and globusrun doesn't wait for a callback.

Rather than apply this patch, I'm more inclined to add another option for batch mode processing that controls whether globusrun waits for the job to be submitted to the scheduler and for a state callback to be returned. Do you have any comments?

joe
Hi Joe,

There seems to be a minimum requirement:

(+) If the job allows reads from the gass server, globusrun must wait (i.e. keep the gass server around) until the job has completed stage in.

The current model appeared to be that if batch is enabled, the submission program returns as soon as possible, given the above constraint. (That is the behaviour that the patch should preserve.) This behaviour is fine for us - if you would like to introduce extra options to give greater flexibility, that is great. In fact we don't use globusrun for normal job submission (we use Condor-G), but we do make some functionality tests with globusrun, which is how we noticed.

The problem we found was that for jobs using two phase commit, the final COMMIT_END signal was sent even if the job wasn't in DONE. The job manager would reply with GLOBUS_GRAM_PROTOCOL_ERROR_JOB_QUERY_DENIAL - if the state moved to DONE at about the same time, it was possible to finish with the deny error still set in err. In that case globusrun indicates that something has failed, when in fact it had not.

The patch also 'tidied up', in the sense that as written in the globus-2.4.3 source I was working from, it appeared possible for globusrun to exit before the job was done, in the case where batch mode wasn't requested but read enable for the gass cache was. (This would depend on the monitor state moving past stage in before the test at line 1405, which may or may not have been a likely event - I suppose it depends on the possibility of dispatching callbacks during globus_gram_client_job_request() or globus_gram_client_job_signal().)

Thanks,
David
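For readers following the thread, the COMMIT_END race described above can be modelled in a few lines of C. This is only a sketch under stated assumptions: it does not use the Globus API, it is not the attached patch, and every name in it (job_state_t, job_error_t, send_commit_end, finish_unguarded, finish_guarded, check_race_model) is a hypothetical stand-in. It shows one way the stale denial error could end up in err, and one way a guard on the DONE state might avoid it.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; NOT the Globus definitions. */
typedef enum { JOB_ACTIVE, JOB_DONE } job_state_t;
typedef enum { ERR_NONE = 0, ERR_JOB_QUERY_DENIAL = 1 } job_error_t;

/* Simulated job manager: it accepts the final commit signal only
 * when the job is already DONE, and denies it otherwise. */
job_error_t send_commit_end(job_state_t manager_state)
{
    return (manager_state == JOB_DONE) ? ERR_NONE : ERR_JOB_QUERY_DENIAL;
}

/* Unguarded flow: the signal is sent in whatever state the job is in.
 * If the job reaches DONE just afterwards, the denial is still in err,
 * so the caller reports a failure that did not happen. */
job_error_t finish_unguarded(job_state_t state_at_signal)
{
    job_error_t err = send_commit_end(state_at_signal);
    /* The job moves to DONE around here, but err is never cleared. */
    return err;
}

/* Guarded flow: do not send the signal until DONE has been observed.
 * The wait for the DONE callback is modelled as a plain assignment. */
job_error_t finish_guarded(job_state_t state_at_signal)
{
    job_state_t observed = state_at_signal;
    if (observed != JOB_DONE) {
        observed = JOB_DONE; /* stands in for waiting on a state callback */
    }
    return send_commit_end(observed);
}

/* Small self-check documenting the behaviour of the model. */
void check_race_model(void)
{
    /* Signal sent too early: spurious denial survives. */
    assert(finish_unguarded(JOB_ACTIVE) == ERR_JOB_QUERY_DENIAL);
    /* Signal held back until DONE: no error. */
    assert(finish_guarded(JOB_ACTIVE) == ERR_NONE);
    /* Both flows agree when the job was already DONE. */
    assert(finish_unguarded(JOB_DONE) == ERR_NONE);
    assert(finish_guarded(JOB_DONE) == ERR_NONE);
}
```

Calling check_race_model() from a main() exercises both flows. The model deliberately ignores failed jobs and real callback dispatch; it only isolates the ordering problem between the signal and the DONE state.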
I've committed a variation on this patch to the 3.2 branch and trunk.

joe