Bugzilla – Bug 3091
RFT does not retry transfers, even though GRAM specifies <attempts>5</attempts>
Last modified: 2005-04-08 20:11:40
RSL:

<ns17:transfer xmlns:ns17="http://www.globus.org/namespaces/2004/10/rft">
  <ns17:sourceUrl>gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh</ns17:sourceUrl>
  <ns17:destinationUrl>file:///${GLOBUS_SCRATCH_DIR}/job_42fc8eb0-a6f0-11d9-91b9-993ccb6731b0/hello.sh</ns17:destinationUrl>
  <ns17:attempts>5</ns17:attempts>
</ns17:transfer>

Summary of log (will attach full log):

2005-04-06 16:06:13,327 INFO service.TransferWork [Thread-3362,getTransferClient:327] [Request 101775, Transfer 178019] transferring gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh -> gsiftp://viz-login.isi.edu:9001/nfs/home/rynge/job_42fc8eb0-a6f0-11d9-91b9-993ccb6731b0/hello.sh
2005-04-06 16:06:13,330 DEBUG service.TransferWork [Thread-3362,getTransferClient:353] [Request 101775, Transfer 178019] no client to reuse
2005-04-06 16:06:17,588 DEBUG service.TransferWork [Thread-3362,processStates:465] [Request 101775, Transfer 178019] processing state for transfer of gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh -> gsiftp://viz-login.isi.edu:9001/nfs/home/rynge/job_42fc8eb0-a6f0-11d9-91b9-993ccb6731b0/hello.sh
2005-04-06 16:06:17,618 INFO service.TransferWork [Thread-3362,processStates:488] [Request 101775, Transfer 178019] transfer failed
2005-04-06 16:06:17,629 DEBUG service.TransferWork [Thread-3362,statusChanged:155] [Request 101775, Transfer 178019] status changed called with status: 2

Looking at the RFT code, if attempts were tried we should see log messages about it:

    case RFTConstants.STATUS_RETRYING: {
        if (logger.isDebugEnabled()) {
            logger.debug(this.getTransferIdentifiers() + "retry attempt "
                + this.transferJob.getAttempts() + " of " + this.maxAttempts);
        }
        if (this.transferJob.getAttempts().intValue() >= this.maxAttempts) {
            logger.info("Transfer " + transferId + ": " + "transfer failed (retried "
                + this.transferJob.getAttempts() + " times)");
            statusChanged(RFTConstants.STATUS_FAILED);
        } else {
            // Retrying
            statusChanged(RFTConstants.STATUS_RETRYING);
        }
        if(this.transferClient != null) {
            this.transferClient.close();
        }
        break;
    }
RFT does not retry transfers that fail with what it considers a fatal error. What was the error you were getting?
http://www.isi.edu/~rynge/bug-3091/container-log-20050406-1.txt
It is hard to tell if this is the error in question, but in the log I see: Terminal transfer error: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply:500-Command failed. : callback failed. 500-globus_xio: System error in writev: Connection reset by peer 500-globus_xio: A system call failed: Connection reset by peer 500 End.] [Caused by: Server refused performing the request. Custom message: Server reported transfer failure (error code 1) [Nested exception message: Custom message: Unexpected reply: 500-Command failed. : callback failed. 500-globus_xio: System error in writev: Connection reset by peer 500-globus_xio: A system call failed: Connection reset by peer
The current code treats that as a terminal error. But looking at this error, and given my experience transferring SDSS data, I am inclined to mark all errors that happen once the transfer has begun as transient, rather than rely completely on the error code sent by the server.
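The proposal above could be sketched roughly as follows. This is only an illustration of the idea, not the RFT code: the class FailureClassifier, the Outcome enum, and the parameters transferStarted and serverSaysTerminal are hypothetical names made up for this example.

```java
// Hypothetical sketch; none of these names come from the RFT source.
final class FailureClassifier {
    public enum Outcome { RETRY, FAIL }

    /**
     * Decide what to do after a failed attempt.
     * transferStarted:    true if data had begun to flow before the failure
     * serverSaysTerminal: true if the server's error code marks the failure as terminal
     * attemptsSoFar:      attempts already made, including the one that just failed
     * maxAttempts:        limit taken from the job description
     */
    public static Outcome classify(boolean transferStarted, boolean serverSaysTerminal,
                                   int attemptsSoFar, int maxAttempts) {
        // The attempt limit always wins.
        if (attemptsSoFar >= maxAttempts) {
            return Outcome.FAIL;
        }
        // Once the transfer has begun, treat the failure as transient
        // regardless of the server's error code.
        if (transferStarted) {
            return Outcome.RETRY;
        }
        // Before the transfer begins, fall back to the server's classification.
        return serverSaysTerminal ? Outcome.FAIL : Outcome.RETRY;
    }
}
```

Under this sketch, a "Connection reset by peer" that arrives mid-transfer would be retried until the attempt limit is reached, even though the server reports error code 1.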
A possible fix is in trunk.
Full log: http://www.isi.edu/~rynge/bug-3091/container-log-20050407-1.txt

See transfer 195867. Summary:

2005-04-07 19:52:01,008 DEBUG service.TransferWork [Thread-7775,processStates:465] [Request 111055, Transfer 195867] processing state for transfer of gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh -> gsiftp://viz-login.isi.edu:9001/nfs/home/rynge/job_dd5d3260-a7d7-11d9-9011-c7c57ddc5de5/hello.sh
2005-04-07 19:52:01,061 DEBUG service.TransferWork [Thread-7775,processStates:506] [Request 111055, Transfer 195867] retry attempt 1 of 0
2005-04-07 19:52:01,080 INFO service.TransferWork [Thread-7775,processStates:513] Transfer 195867: transfer failed (retried 1 times)

maxAttempts is still wrong. It should be 5 (from the GRAM RSL), but is 0.

Also, it can be argued that saying it has been retried once is wrong: a retry implies a previous attempt. Maybe it should read: transfer failed (tried 1 times)
Hmm. Can I see the RSL for this job? RFT tries the transfer once and only checks the job's number of attempts on a retry, which should be fixed. But what worries me more is whether attempts is being set properly in the RFT database.
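For illustration, a loop that consults the limit before every attempt, including the first, might look like the sketch below. It is not the RFT implementation: AttemptLoop and run are hypothetical names, and the transfer itself is stood in for by a BooleanSupplier that returns true on success. The failure message uses the "tried N times" wording suggested earlier in this bug.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch; not taken from the RFT source.
final class AttemptLoop {
    /**
     * Run the transfer up to maxAttempts times and describe the result.
     * The limit is checked before each attempt, so the first attempt
     * is counted the same way as every later one.
     */
    public static String run(BooleanSupplier transfer, int maxAttempts) {
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            if (transfer.getAsBoolean()) {
                return "transfer succeeded (attempt " + attempts + " of " + maxAttempts + ")";
            }
        }
        // "tried", not "retried": the count includes the first attempt.
        return "transfer failed (tried " + attempts + " times)";
    }
}
```

Note that with this design a maxAttempts of 0 means the transfer is never tried at all, so a limit that is silently lost shows up immediately instead of looking like a single failed retry.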
Thanks for the catch, Mats; this is fixed in trunk. maxAttempts was not being updated in the database properly because the column name was case sensitive (I was using maxAttempts, as in my schema).
By the way, it is not <attempts>5</attempts>; it should be <maxAttempts>5</maxAttempts>. A valid RSL with maxAttempts set to 5 looks like this:

<job>
  <executable>my_echo</executable>
  <directory>${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/</directory>
  <argument>12</argument>
  <argument>abc</argument>
  <argument>34</argument>
  <argument>pdscaex_instr_GrADS_grads23_28919.cfg</argument>
  <argument>pgwynnel was here</argument>
  <environment>
    <name>PI</name>
    <value>3.141</value>
  </environment>
  <environment>
    <name>GLOBUS_DUROC_SUBJOB_INDEX</name>
    <value>0</value>
  </environment>
  <stdout>stdout</stdout>
  <stderr>stderr</stderr>
  <fileStageIn>
    <maxAttempts>5</maxAttempts>
    <transfer>
      <sourceUrl>gsiftp://promptu:2811/tmp/empty_dir/</sourceUrl>
      <destinationUrl>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/</destinationUrl>
    </transfer>
    <transfer>
      <sourceUrl>gsiftp://promptu:2811/bin/echo</sourceUrl>
      <destinationUrl>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/my_echo</destinationUrl>
    </transfer>
  </fileStageIn>
  <!--
  <fileStageOut>
    <transfer>
      <sourceUrl>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/stdout</sourceUrl>
      <destinationUrl>gsiftp://promptu:2811/${GLOBUS_USER_HOME}/stdout.${THROUGHPUT_TESTER_JOB_ID}</destinationUrl>
    </transfer>
    <transfer>
      <sourceUrl>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/stderr</sourceUrl>
      <destinationUrl>gsiftp://promptu:2811/${GLOBUS_USER_HOME}/stderr.${THROUGHPUT_TESTER_JOB_ID}</destinationUrl>
    </transfer>
  </fileStageOut>
  -->
  <fileCleanUp>
    <maxAttempts>5</maxAttempts>
    <deletion>
      <file>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/</file>
    </deletion>
  </fileCleanUp>
</job>
So this is now a Condor issue then. Jaime, the RSL you are generating needs a fix: s/attempts/maxAttempts/
You can find a fixed gridmanager in ftp://ftp.cs.wisc.edu/condor/temporary/forlane/2005-04-08
Jaime, It wasn't just an easy s/// fix. The <maxAttempts> element should be outside the <transfer>s: <fileStageIn> <maxAttempts>5</maxAttempts> <transfer> See comment #9 for a good example. I think it would also be a good idea to have it for fileStageOut.
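A quick way to catch this kind of placement mistake in generated RSL is a small structural check, sketched below. It is not part of GRAM, RFT, or Condor: RslCheck and maxAttemptsWellPlaced are hypothetical names, the set of allowed parent elements is an assumption taken from the example RSL in this bug, and the check assumes unprefixed element names (the default parser here is not namespace-aware).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper; not part of any Globus or Condor distribution.
final class RslCheck {
    // Assumption: parents that may carry <maxAttempts>, based on the example in this thread.
    private static final List<String> ALLOWED_PARENTS =
        Arrays.asList("fileStageIn", "fileStageOut", "fileCleanUp");

    /**
     * Returns true if the RSL contains no stray <attempts> element and every
     * <maxAttempts> element sits directly under a staging element rather
     * than inside a <transfer>.
     */
    public static boolean maxAttemptsWellPlaced(String rsl) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rsl.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("RSL is not well-formed XML", e);
        }
        // The old element name is not what RFT expects.
        if (doc.getElementsByTagName("attempts").getLength() > 0) {
            return false;
        }
        NodeList nodes = doc.getElementsByTagName("maxAttempts");
        for (int i = 0; i < nodes.getLength(); i++) {
            String parent = nodes.item(i).getParentNode().getNodeName();
            if (!ALLOWED_PARENTS.contains(parent)) {
                return false;
            }
        }
        return true;
    }
}
```

Run against the output of the gridmanager, a check like this would have flagged both the original <attempts> element and the first attempted fix, where <maxAttempts> was still inside each <transfer>.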
Is that your final answer? :-) (this is the third time I've been told to change attempts/maxAttempts) Latest and greatest: ftp://ftp.cs.wisc.edu/condor/temporary/forlane/2005-04-08_2
Yeah, sorry about that. I was a little bit confused about what needed to be done. I have verified that the Condor and RFT fixes are working and will close this bug. Thanks guys!