Bug 5383

Summary: GRAM recovery: jobs with staging fail if they already passed the staging state.
Product: GRAM
Reporter: Martin Feller <feller@mcs.anl.gov>
Component: wsrf managed execution job service
Assignee: Martin Feller <feller@mcs.anl.gov>
Status: CLOSED FIXED
Severity: normal
CC: feller@mcs.anl.gov, madduri@mcs.anl.gov, smartin@mcs.anl.gov
Priority: P3    
Version: 4.0.4   
Target Milestone: 4.0.5   
Hardware: Macintosh   
OS: All   

Description From 2007-06-16 11:31:17
The following explains what happens for stageIn, but is also valid
for stageOut and fileCleanUp.

Jobs with file staging fail in recovery if they are reloaded after
they have already passed the state StageIn.
Reason: StateMachine.processRestartSate() checks whether a
transferEndpoint is set in the job resource or is null. If one is
set, the transfer resource is started.
(A start that is called repeatedly is caught and ignored.)
With audit logging in 4.0.5 we don't want to lose the transfer
endpoints, because we want to have them in the audit records. So we
no longer nullify them after a transfer is done. As a result, when a
resource is restarted there is still a transfer endpoint even if the
job is already in state Submit, and thus the transfer is started.
But since the job is already in state Submit, the RFT resource has
already been deleted in state StageInResponse, and we get a
NoSuchResource-Exception, which causes the job to fail.

Fix:
Add a new check in StateMachine.restart():
start the transfer only if transferEndpoint is not equal to null AND
the job resource is in state stageIn
(and accordingly for stageOut and fileCleanUp).
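
The check described in the fix could be sketched roughly as follows.
This is only an illustration, not the committed code: the JobState
enum, the RestartCheck class, the shouldRestartTransfer() method and
the use of a String for the endpoint are all hypothetical names and
simplifications chosen for this example.

```java
// Hypothetical sketch; none of these names are taken from the GRAM sources.
enum JobState { STAGE_IN, STAGE_IN_RESPONSE, SUBMIT, STAGE_OUT, FILE_CLEAN_UP }

final class RestartCheck {
    private RestartCheck() {}

    /**
     * Decides whether a transfer should be (re)started during recovery.
     *
     * @param transferEndpoint endpoint stored in the job resource; it may
     *                         still be set after the transfer finished,
     *                         because it is kept for the audit records
     * @param current          state the reloaded job resource is in
     * @param transferState    state that owns this transfer, for example
     *                         STAGE_IN, STAGE_OUT or FILE_CLEAN_UP
     * @return true only if an endpoint exists AND the job is still in the
     *         state that owns the transfer
     */
    static boolean shouldRestartTransfer(String transferEndpoint,
                                         JobState current,
                                         JobState transferState) {
        return transferEndpoint != null && current == transferState;
    }
}
```

With a check of this shape, a job reloaded in state Submit that still
carries its stageIn endpoint would not try to start the transfer again,
so the already deleted RFT resource would never be contacted.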
------- Comment #1 From 2007-06-16 11:47:57 -------
Committed the fix to TRUNK and the 4.0 branch