Bugzilla – Bug 5397
GRAM4 recovery of persisted job resources needs to be reviewed
Last modified: 2012-09-05 11:50:19
In certain situations the connection to resources is lost and actions are repeated when persisted job resources are recovered during container startup. The following describes what may happen to resources in state stageIn or stageInResponse. The same is true for all states that involve RFT interactions (stageOut/stageOutResponse, fileCleanUp/fileCleanUpResponse).

Example: A job was in state stageIn when the container was shut down. The transfer resource had been created, the subscription for state changes had been made, and the transfer had been started. The only thing that did not happen before the shutdown was the change of the internal state to stageInResponse. When this resource is recovered, the transfer is started again, which makes RFT throw a RepeatedlyStartedFaultType that GRAM4 ignores. The resource is then processed from scratch in state stageIn, i.e. a new transfer resource is created, a new subscription is made, and the new transfer is started. The connection to the old resources is lost and the transfer is executed a second time.

This case may not happen very often, but it is not optimal. Similar things happen under certain circumstances when the container is stopped and restarted while resources are in state stageInResponse.

The plan is to have more fine-grained behavior. All of the following actions happen while submitting a transfer request:

* create a transfer resource
* subscribe for state changes
* start the transfer
* create a notification consumer resource

In a recovery situation, ideally only those actions that were not executed because the container went down should be repeated, not the submission of the transfer request as a whole.
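
To illustrate the proposed fine-grained recovery, here is a minimal sketch in Python. It is not GRAM4 code: the names PersistedJob, STEPS and submit_transfer, the step identifiers, and the persist callback are all hypothetical. It assumes that each completed action is recorded in the persisted job resource, so that recovery can skip the actions that already ran.

```python
from dataclasses import dataclass, field

# Hypothetical step names, in the order the report lists the actions.
STEPS = (
    "create_transfer_resource",
    "subscribe_state_changes",
    "start_transfer",
    "create_notification_consumer",
)


@dataclass
class PersistedJob:
    """Stand-in for a persisted job resource (assumed structure)."""
    state: str = "stageIn"
    completed: list = field(default_factory=list)  # steps recorded as done


def submit_transfer(job, actions, persist):
    """Run only the steps that are not yet recorded as completed.

    actions maps a step name to a zero-argument callable;
    persist is called after every step so progress survives a shutdown.
    """
    for step in STEPS:
        if step in job.completed:
            continue  # already done before the container went down
        actions[step]()
        job.completed.append(step)
        persist(job)
    # Only after all steps have run does the internal state change.
    job.state = "stageInResponse"
    persist(job)


if __name__ == "__main__":
    calls = []
    actions = {s: (lambda s=s: calls.append(s)) for s in STEPS}
    # Simulate a job whose first three steps finished before shutdown.
    job = PersistedJob(completed=list(STEPS[:3]))
    submit_transfer(job, actions, persist=lambda j: None)
    print(calls, job.state)
```

With this approach a recovered job in state stageIn repeats only the missing actions instead of creating a second transfer resource. A shutdown between an action and the persisting of its completion would still cause that one action to be repeated, so each action would still have to tolerate a repeat.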
Doing some Bugzilla cleanup... Resolving old GRAM3 and GRAM4 issues that are no longer relevant since we've moved on to GRAM5. Also, we're now tracking issues in Jira. Any new issues should be added here: http://jira.globus.org/secure/VersionBoard.jspa?selectedProjectId=10363