Bug 5383 - GRAM recovery: jobs with staging fail if they already passed the staging state.
Status: CLOSED FIXED
Product: GRAM
Component: wsrf managed execution job service
Version: 4.0.4
Platform: Macintosh All
Importance: P3 normal
Target Milestone: 4.0.5
 
Reported: 2007-06-16 11:31
Modified: 2007-06-16 11:48

Description From 2007-06-16 11:31:17
The following explains what happens for stageIn, but is also valid
for stageOut and fileCleanUp.

Jobs with file staging fail during recovery if they are reloaded
after they have already passed the state StageIn.
Reason: StateMachine.processRestartState() checks whether a
transferEndpoint is set in the job resource or is null. If one is
set, the transfer resource is started.
(A start that is called repeatedly is caught and ignored.)
With audit logging in 4.0.5 we don't want to lose the transfer
endpoints, because we want to have them in the audit records. So we
no longer nullify them after a transfer is done. As a result, when a
resource is restarted there is a transfer endpoint even if the job is
already in state Submit, and thus the transfer is started.
But since the job is already in state Submit, the RFT resource has
already been deleted in state StageInResponse, and we get a
NoSuchResource-Exception, which causes the job to fail.

Fix:
Add a new check in StateMachine.restart():
start the transfer only if transferEndpoint is not null AND the job
resource is in state stageIn
(and accordingly for stageOut and fileCleanUp).
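
The check described above could look roughly like the following. This is only a minimal sketch and not the committed code: the class RestartCheckSketch, the JobState enum, the method shouldRestartTransfer() and the use of a plain Object for the endpoint are all hypothetical names chosen for illustration. Only the state names and the condition itself come from the description.

```java
public class RestartCheckSketch {
    // Hypothetical subset of job states; names follow the report.
    enum JobState { StageIn, StageInResponse, Submit, StageOut, FileCleanUp }

    /**
     * Decide whether a transfer should be (re)started on restart.
     * transferEndpoint: the endpoint kept in the job resource (may be null);
     *                   modeled as Object here, which is an assumption.
     * current:          the state the reloaded job is in.
     * transferState:    the state that owns this transfer (e.g. StageIn).
     */
    static boolean shouldRestartTransfer(Object transferEndpoint,
                                         JobState current,
                                         JobState transferState) {
        // Old behaviour (per the report): only the null check was done.
        // New behaviour: additionally require the matching job state.
        return transferEndpoint != null && current == transferState;
    }

    public static void main(String[] args) {
        Object stageInEndpoint = "stage-in-transfer-epr"; // placeholder value

        // Job reloaded while still staging in: restart the transfer.
        System.out.println(shouldRestartTransfer(
                stageInEndpoint, JobState.StageIn, JobState.StageIn));

        // Job already in Submit: endpoint is kept for audit, but no restart.
        System.out.println(shouldRestartTransfer(
                stageInEndpoint, JobState.Submit, JobState.StageIn));

        // No endpoint at all: nothing to restart.
        System.out.println(shouldRestartTransfer(
                null, JobState.StageIn, JobState.StageIn));
    }
}
```

With this condition, the endpoint can stay in the job resource for the audit records without triggering a start against an RFT resource that has already been deleted.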
------- Comment #1 From 2007-06-16 11:47:57 -------
Committed the fix to TRUNK and the 4.0 branch.