Bug 3091 - RFT does not retry transfers, even though GRAM specifies <attempts>5</attempts>
Status: RESOLVED FIXED
Product: RFT
Component: RFT
Version: development
Platform: PC Linux
Importance: P3 normal
Reported: 2005-04-06 18:24
Modified: 2005-04-08 20:11

Description From 2005-04-06 18:24:11
RSL:

<ns17:transfer xmlns:ns17="http://www.globus.org/namespaces/2004/10/rft">
    <ns17:sourceUrl>gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh</ns17:sourceUrl>
    <ns17:destinationUrl>file:///${GLOBUS_SCRATCH_DIR}/job_42fc8eb0-a6f0-11d9-91b9-993ccb6731b0/hello.sh</ns17:destinationUrl>
    <ns17:attempts>5</ns17:attempts>
</ns17:transfer>


Summary of log (will attach full log):

2005-04-06 16:06:13,327 INFO  service.TransferWork [Thread-3362,getTransferClient:327] [Request 101775, Transfer 178019] transferring gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh  -> gsiftp://viz-login.isi.edu:9001/nfs/home/rynge/job_42fc8eb0-a6f0-11d9-91b9-993ccb6731b0/hello.sh
2005-04-06 16:06:13,330 DEBUG service.TransferWork [Thread-3362,getTransferClient:353] [Request 101775, Transfer 178019] no client to reuse
2005-04-06 16:06:17,588 DEBUG service.TransferWork [Thread-3362,processStates:465] [Request 101775, Transfer 178019] processing state for transfer of gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh  -> gsiftp://viz-login.isi.edu:9001/nfs/home/rynge/job_42fc8eb0-a6f0-11d9-91b9-993ccb6731b0/hello.sh
2005-04-06 16:06:17,618 INFO  service.TransferWork [Thread-3362,processStates:488] [Request 101775, Transfer 178019] transfer failed
2005-04-06 16:06:17,629 DEBUG service.TransferWork [Thread-3362,statusChanged:155] [Request 101775, Transfer 178019] status changed called with status: 2



Looking at the RFT code, if any retries had been attempted we should see log messages about them:

        case RFTConstants.STATUS_RETRYING: {
            if (logger.isDebugEnabled()) {
                logger.debug(this.getTransferIdentifiers()
                             + "retry attempt "
                             + this.transferJob.getAttempts()
                             + " of "
                             + this.maxAttempts);
            }
            if (this.transferJob.getAttempts().intValue() >= this.maxAttempts) {
                logger.info("Transfer " + transferId + ": "
                            + "transfer failed (retried "
                            + this.transferJob.getAttempts()
                            + " times)");
                statusChanged(RFTConstants.STATUS_FAILED);
            } else {
                // Retrying
                statusChanged(RFTConstants.STATUS_RETRYING);
            }
            if(this.transferClient != null) {
                this.transferClient.close();
            }
            break;
        }
------- Comment #1 From 2005-04-06 18:25:37 -------
RFT does not retry transfers that fail with what it considers a fatal error. What
error were you getting?
------- Comment #2 From 2005-04-06 18:26:18 -------
http://www.isi.edu/~rynge/bug-3091/container-log-20050406-1.txt
------- Comment #3 From 2005-04-06 18:27:35 -------
It is hard to tell if this is the error in question, but in the log I see:

Terminal transfer error:
Server refused performing the request. Custom message: Server reported transfer
failure (error code 1) [Nested exception message:  Custom message: Unexpected
reply:500-Command failed. : callback failed.
500-globus_xio: System error in writev: Connection reset by peer
500-globus_xio: A system call failed: Connection reset by peer
500 End.] [Caused by: Server refused performing the request. Custom message:
Server reported transfer failure (error code 1) [Nested exception message: 
Custom message: Unexpected reply: 500-Command failed. : callback failed.
500-globus_xio: System error in writev: Connection reset by peer
500-globus_xio: A system call failed: Connection reset by peer
------- Comment #4 From 2005-04-06 18:48:51 -------
That is considered a terminal error by the current code. But looking at this
error, and given my experience transferring SDSS data, I am inclined to mark all
errors that happen once the transfer has begun as transient, and not rely
completely on the error code sent by the server.
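
A minimal sketch of that idea, assuming a failure can be tagged with whether the transfer had already started. The class, enum, and parameter names are hypothetical and do not come from the RFT source; the only outside fact used is that FTP 4xx replies are transient negative replies and 5xx replies are permanent ones.

```java
/**
 * Hypothetical sketch: classify a transfer failure as transient or terminal
 * based on whether the transfer had started, instead of trusting the server's
 * reply code alone. None of these names exist in RFT.
 */
public class FailureClassifier {

    public enum Kind { TRANSIENT, TERMINAL }

    /**
     * @param transferStarted true once the transfer itself had begun
     * @param replyCode       FTP reply code reported by the server (e.g. 500)
     */
    public static Kind classify(boolean transferStarted, int replyCode) {
        if (transferStarted) {
            // Assumption: anything that breaks after the transfer began
            // (e.g. "Connection reset by peer") is worth retrying,
            // whatever code the server sent.
            return Kind.TRANSIENT;
        }
        // Before the transfer starts, fall back to the reply code:
        // 4xx is a transient negative reply in FTP, everything else is
        // treated as terminal here.
        return (replyCode >= 400 && replyCode < 500) ? Kind.TRANSIENT : Kind.TERMINAL;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, 500));   // failed mid-transfer
        System.out.println(classify(false, 500));  // refused up front
        System.out.println(classify(false, 426));  // transient reply code
    }
}
```

With this rule the 500 reply quoted in comment #3 would be retried, since it arrived after the transfer had begun.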
------- Comment #5 From 2005-04-07 00:31:01 -------
A possible fix is in trunk.
------- Comment #6 From 2005-04-07 22:20:21 -------
Full log:

  http://www.isi.edu/~rynge/bug-3091/container-log-20050407-1.txt

See transfer 195867.

Summary:

2005-04-07 19:52:01,008 DEBUG service.TransferWork [Thread-7775,processStates:465] [Request 111055, Transfer 195867] processing state for transfer of gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh  -> gsiftp://viz-login.isi.edu:9001/nfs/home/rynge/job_dd5d3260-a7d7-11d9-9011-c7c57ddc5de5/hello.sh
2005-04-07 19:52:01,061 DEBUG service.TransferWork [Thread-7775,processStates:506] [Request 111055, Transfer 195867] retry attempt 1 of 0
2005-04-07 19:52:01,080 INFO  service.TransferWork [Thread-7775,processStates:513] Transfer 195867: transfer failed (retried 1 times)


maxAttempts is still wrong. It should be 5 (from the GRAM RSL), but is 0.


Also, it can be argued that saying it has been retried once is wrong: a retry
implies a previous attempt. Maybe it should read:

    transfer failed (tried 1 times)
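
To make the counting concrete, here is a small sketch of a loop that always makes one initial attempt and then keeps going while fewer than maxAttempts tries have been made. It is an illustration only: the class and method names are hypothetical, and this is not how TransferWork is actually structured.

```java
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch of attempt counting: the first try counts as an
 * attempt, not as a retry. Not taken from the RFT source.
 */
public class AttemptCounter {

    /** Runs the transfer until it succeeds or maxAttempts tries were made. */
    public static String run(BooleanSupplier transfer, int maxAttempts) {
        int attempts = 0;
        do {
            attempts++;                       // count every try, including the first
            if (transfer.getAsBoolean()) {
                return "transfer succeeded (" + tried(attempts) + ")";
            }
        } while (attempts < maxAttempts);     // maxAttempts == 0 still means one try
        return "transfer failed (" + tried(attempts) + ")";
    }

    private static String tried(int attempts) {
        return "tried " + attempts + (attempts == 1 ? " time" : " times");
    }

    public static void main(String[] args) {
        // A transfer that always fails, first with the bad value 0, then with 5.
        System.out.println(run(() -> false, 0));
        System.out.println(run(() -> false, 5));
    }
}
```

With maxAttempts read as 0 the loop gives up after a single try, which matches the behaviour in the log above; with 5 it would try five times before failing.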

------- Comment #7 From 2005-04-08 08:48:07 -------
Hmm. Can I see the RSL for this job? RFT tries the transfer once and checks the
job's number of attempts only on a retry, which should be fixed. But what
worries me more is whether attempts is being set properly in the RFT database.
------- Comment #8 From 2005-04-08 10:11:30 -------
Thanks for the catch, Mats; this is fixed in trunk. maxAttempts was not being
updated in the database properly because the column name was case sensitive (I
was using maxAttempts, as in my schema).
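
The RFT persistence code is not shown in this report, so the following is only an illustration of this class of bug, with a plain map standing in for a database row. All names are hypothetical. It shows how a lookup by "maxAttempts" silently falls back to 0 when the stored name differs in case, and how a case-insensitive comparison avoids that.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical illustration: a case mismatch in a column name makes the
 * value look unset. A Map stands in for a database row; this is not RFT code.
 */
public class ColumnLookup {

    /** Row whose keys are compared case-sensitively. */
    public static Map<String, Integer> caseSensitiveRow() {
        Map<String, Integer> row = new HashMap<>();
        row.put("maxattempts", 5);   // stored name is all lower case
        return row;
    }

    /** Same data, but keys are compared ignoring case. */
    public static Map<String, Integer> caseInsensitiveRow() {
        Map<String, Integer> row = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        row.put("maxattempts", 5);
        return row;
    }

    /** Looks up the mixed-case name, defaulting to 0 when it is not found. */
    public static int readMaxAttempts(Map<String, Integer> row) {
        return row.getOrDefault("maxAttempts", 0);
    }

    public static void main(String[] args) {
        System.out.println(readMaxAttempts(caseSensitiveRow()));    // lookup misses
        System.out.println(readMaxAttempts(caseInsensitiveRow()));  // lookup hits
    }
}
```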
------- Comment #9 From 2005-04-08 11:53:23 -------
Btw, it is not <attempts>5</attempts>; it should be <maxAttempts>5</maxAttempts>.
A valid RSL with maxAttempts set to 5 looks like this:

<job>
    <executable>my_echo</executable>
    <directory>${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/</directory>
    <argument>12</argument>
    <argument>abc</argument>
    <argument>34</argument>
    <argument>pdscaex_instr_GrADS_grads23_28919.cfg</argument>
    <argument>pgwynnel was here</argument>
    <environment>
        <name>PI</name>
        <value>3.141</value>
    </environment>
    <environment>
        <name>GLOBUS_DUROC_SUBJOB_INDEX</name>
        <value>0</value>
    </environment>
    <stdout>stdout</stdout>
    <stderr>stderr</stderr>
    <fileStageIn>
        <maxAttempts>5</maxAttempts>
        <transfer>
            <sourceUrl>gsiftp://promptu:2811/tmp/empty_dir/</sourceUrl>
            <destinationUrl>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/</destinationUrl>
        </transfer>
        <transfer>
            <sourceUrl>gsiftp://promptu:2811/bin/echo</sourceUrl>
            <destinationUrl>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/my_echo</destinationUrl>
        </transfer>
    </fileStageIn>
    <!--
    <fileStageOut>
        <transfer>
            <sourceUrl>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/stdout</sourceUrl>
            <destinationUrl>gsiftp://promptu:2811/${GLOBUS_USER_HOME}/stdout.${THROUGHPUT_TESTER_JOB_ID}</destinationUrl>
        </transfer>
        <transfer>
            <sourceUrl>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/stderr</sourceUrl>
            <destinationUrl>gsiftp://promptu:2811/${GLOBUS_USER_HOME}/stderr.${THROUGHPUT_TESTER_JOB_ID}</destinationUrl>
        </transfer>
    </fileStageOut>
    -->
    <fileCleanUp>
        <maxAttempts>5</maxAttempts>
        <deletion>
            <file>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/</file>
        </deletion>
    </fileCleanUp>
</job>
------- Comment #10 From 2005-04-08 12:02:43 -------
So this is now a Condor issue then.

Jaime, the RSL you are generating needs a fix: s/attempts/maxAttempts/ 
------- Comment #11 From 2005-04-08 12:25:27 -------
You can find a fixed gridmanager in
ftp://ftp.cs.wisc.edu/condor/temporary/forlane/2005-04-08
------- Comment #12 From 2005-04-08 14:43:16 -------
Jaime,

It wasn't just an easy s/// fix. The <maxAttempts> element should be outside the
<transfer>s:

   <fileStageIn>
        <maxAttempts>5</maxAttempts>
        <transfer>

See comment #9 for a good example.

I think it would also be a good idea to have it for fileStageOut.
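
For anyone generating RSL, a quick way to check the placement is to parse the document and look for <maxAttempts> as a direct child of <fileStageIn>. The sketch below does that with the standard Java DOM API; the class and method names are hypothetical, the -1 return value is just a convention chosen here, and the check is not part of GRAM or RFT.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/**
 * Hypothetical helper: reads maxAttempts only when it sits directly under
 * fileStageIn, as in the example in comment #9. Not part of GRAM or RFT.
 */
public class StageInCheck {

    /** Returns maxAttempts if it is a direct child of fileStageIn, else -1. */
    public static int stageInMaxAttempts(String rsl) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rsl.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("RSL is not well-formed XML", e);
        }
        Element stageIn = firstChild(root, "fileStageIn");
        if (stageIn == null) {
            return -1;
        }
        Element max = firstChild(stageIn, "maxAttempts");
        return max == null ? -1 : Integer.parseInt(max.getTextContent().trim());
    }

    /** First direct child element with the given tag name, or null. */
    private static Element firstChild(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && name.equals(((Element) n).getTagName())) {
                return (Element) n;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String good = "<job><fileStageIn><maxAttempts>5</maxAttempts>"
                + "<transfer/></fileStageIn></job>";
        String bad = "<job><fileStageIn><transfer><maxAttempts>5</maxAttempts>"
                + "</transfer></fileStageIn></job>";
        System.out.println(stageInMaxAttempts(good));  // element in the right place
        System.out.println(stageInMaxAttempts(bad));   // element nested inside <transfer>
    }
}
```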
------- Comment #13 From 2005-04-08 17:02:45 -------
Is that your final answer? :-)
(this is the third time I've been told to change attempts/maxAttempts)
Latest and greatest:
ftp://ftp.cs.wisc.edu/condor/temporary/forlane/2005-04-08_2
------- Comment #14 From 2005-04-08 20:11:40 -------
Yeah, sorry about that. I was a little confused about what needed to be done.

I have verified that the Condor and RFT fixes are working and will close this bug.

Thanks guys!