Bug 1552 - Fix race condition in globusrun
Status: RESOLVED FIXED
Product: GRAM
Component: gt2 Gram client
Version: 1.6
Platform: PC All
Importance: P2 major
Reported: 2004-02-13 16:16
Modified: 2004-03-02 15:21


Attachments
Fix for race condition (3.26 KB, patch)
2004-02-13 16:17, Alain Roy

Description From 2004-02-13 16:16:36
This is a patch by David Smith of LCG and it is a part of the VDT. It fixes a 
race condition. 

I'll try to attach the patch, if Bugzilla lets me. Hmph.
------- Comment #1 From 2004-02-13 16:17:25 -------
Created an attachment (id=315)
Fix for race condition

See above.
------- Comment #2 From 2004-03-02 09:41:29 -------
There have been a lot of patches submitted and applied to this particular bit 
of code, depending on the behavior the user wants. Some (people with 
firewall issues) dislike blocking until callbacks happen. Others are 
uncomfortable with the error state being lost when batch mode jobs are 
submitted and globusrun doesn't wait for a callback. 
 
Rather than apply this patch, I'm more inclined to add another option for 
batch mode processing that controls whether globusrun waits for the job to be 
submitted to the scheduler and a state callback to be returned. Do you have 
any comments? 
 
joe 
------- Comment #3 From 2004-03-02 12:49:50 -------
Hi Joe,

There seems to be a minimum requirement:

(+) If the job allows reads from the gass server, globusrun must wait (i.e. keep
the gass server around) until the job has completed stage in.

The current model appeared to be that if batch is enabled the submission program
returns as soon as possible, given the above constraint. (That is the behaviour
that the patch should preserve.) This behaviour is fine for us - if you would
like to introduce extra options to give greater flexibility that is great.
In fact we don't use globusrun for normal job submission (we use Condor-G), but
we do make some functional tests with globusrun - which is how we noticed:

The problem we found was that for jobs using two phase commit, the final
COMMIT_END signal was sent even if the job wasn't in DONE. The job manager would
reply with GLOBUS_GRAM_PROTOCOL_ERROR_JOB_QUERY_DENIAL - if the state moved to
DONE at about the same time it was possible to finish with the deny error still
set in err. In that case globusrun indicates that something has failed, when
in fact it had not.
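
The race described above can be illustrated with a small simulation. This is
only a sketch in Python, not the globusrun C source and not the attached patch:
the function names, the JobState enum and the string stand-ins for the return
codes are all hypothetical, and the guard shown is just one plausible way to
avoid the stale error.

```python
from enum import Enum


class JobState(Enum):
    ACTIVE = "ACTIVE"
    DONE = "DONE"


# String stand-ins for the real return codes (assumption for illustration).
SUCCESS = "SUCCESS"
QUERY_DENIAL = "GLOBUS_GRAM_PROTOCOL_ERROR_JOB_QUERY_DENIAL"


def commit_end_reply(manager_state):
    """Stand-in for the job manager: COMMIT_END is denied unless the job is DONE."""
    return SUCCESS if manager_state is JobState.DONE else QUERY_DENIAL


def finish_unguarded(callback_states):
    """Send COMMIT_END once, whatever the current state, then wait for DONE."""
    err = commit_end_reply(callback_states[0])
    for state in callback_states:
        if state is JobState.DONE:
            break  # the wait ends here, but err is never refreshed
    return err


def finish_guarded(callback_states):
    """Hold COMMIT_END back until a DONE callback has been seen."""
    for state in callback_states:
        if state is JobState.DONE:
            return commit_end_reply(state)
    return None  # DONE never observed, so COMMIT_END was not sent


states = [JobState.ACTIVE, JobState.DONE]
print(finish_unguarded(states))  # stale denial although the job finished
print(finish_guarded(states))
```

With the same sequence of state callbacks, the unguarded version ends with the
denial still set while the guarded version reports success.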

The patch also 'tidied up' - in the sense that, as written in the globus-2.4.3
source I was working from, it appeared possible for globusrun to exit before the
job was done, in the case where batch mode wasn't requested but read enable for
the gass cache was. (This would depend on the monitor state moving past stage in
before the test at line 1405, which may or may not have been a likely event - I
suppose it depends on the possibility of dispatching callbacks during
globus_gram_client_job_request() or globus_gram_client_job_signal().)
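
The exit behaviour discussed in this comment can be summarised as a single
predicate. The sketch below is Python rather than the globusrun C code, and the
function and parameter names are hypothetical; it only encodes the rules as
stated here (batch mode returns as early as the gass constraint allows, non-batch
mode waits for the job to finish).

```python
def may_exit(batch, gass_read_enabled, stage_in_done, job_done):
    """Return True when globusrun could safely exit, per the rules above."""
    if batch:
        # Batch mode: return as soon as possible, but keep the gass server
        # around until stage in has completed if the job may read from it.
        return stage_in_done or not gass_read_enabled
    # Non-batch mode: never exit before the job is done, even if the
    # monitor state has already moved past stage in.
    return job_done


# Non-batch job with gass reads enabled, stage in finished, job still running:
print(may_exit(batch=False, gass_read_enabled=True,
               stage_in_done=True, job_done=False))
```

The last call corresponds to the case mentioned above, where exiting would be
premature.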

Thanks,
David
------- Comment #4 From 2004-03-02 15:21:29 -------
I've committed a variation on this patch to 3.2 branch and trunk. 
 
joe