<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "http://bugzilla.globus.org/bugzilla/bugzilla.dtd">

<bugzilla version="3.2.3"
          urlbase="http://bugzilla.globus.org/bugzilla/"
          maintainer="bacon@mcs.anl.gov"
>

    <bug>
          <bug_id>3091</bug_id>
          
          <creation_ts>2005-04-06 18:24</creation_ts>
          <short_desc>RFT does not retry transfers, even though GRAM specifies &lt;attempts&gt;5&lt;/attempts&gt;</short_desc>
          <delta_ts>2005-04-08 20:11:40</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>RFT</product>
          <component>RFT</component>
          <version>development</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          
          
          <priority>P3</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Mats Rynge">rynge@isi.edu</reporter>
          <assigned_to name="Ravi Madduri">madduri@mcs.anl.gov</assigned_to>
          <cc>jfrey@cs.wisc.edu</cc>
    
    <cc>ogsa-bugs@globus.org</cc>

      

      
          <long_desc isprivate="0">
            <who name="Mats Rynge">rynge@isi.edu</who>
            <bug_when>2005-04-06 18:24:11</bug_when>
            <thetext>RSL:

&lt;ns17:transfer
xmlns:ns17=&quot;http://www.globus.org/namespaces/2004/10/rft&quot;&gt;&lt;ns17:sourceUrl&gt;gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh&lt;/ns17:sourceUrl&gt;&lt;ns17:destinationUrl&gt;file:///${GLOBUS_SCRATCH_DIR}/job_42fc8eb0-a6f0-11d9-91b9-993ccb6731b0/hello.sh&lt;/ns17:destinationUrl&gt;&lt;ns17:attempts&gt;5&lt;/ns17:attempts&gt;&lt;/ns17:transfer&gt;


Summary of log (will attach full log):

2005-04-06 16:06:13,327 INFO  service.TransferWork
[Thread-3362,getTransferClient:327] [Request 101775, Transfer 178019]
transferring
gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh  -&gt; 
gsiftp://viz-login.isi.edu:9001/nfs/home/rynge/job_42fc8eb0-a6f0-11d9-91b9-993ccb6731b0/hello.sh
2005-04-06 16:06:13,330 DEBUG service.TransferWork
[Thread-3362,getTransferClient:353] [Request 101775, Transfer 178019] no client
to reuse
2005-04-06 16:06:17,588 DEBUG service.TransferWork
[Thread-3362,processStates:465] [Request 101775, Transfer 178019] processing
state for transfer of
gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh  -&gt; 
gsiftp://viz-login.isi.edu:9001/nfs/home/rynge/job_42fc8eb0-a6f0-11d9-91b9-993ccb6731b0/hello.sh
2005-04-06 16:06:17,618 INFO  service.TransferWork
[Thread-3362,processStates:488] [Request 101775, Transfer 178019] transfer failed
2005-04-06 16:06:17,629 DEBUG service.TransferWork
[Thread-3362,statusChanged:155] [Request 101775, Transfer 178019] status changed
called with status: 2



Looking at the RFT code, if retries had been attempted we should see log messages about them:

        case RFTConstants.STATUS_RETRYING: {
            if (logger.isDebugEnabled()) {
                logger.debug(this.getTransferIdentifiers()
                             + &quot;retry attempt &quot;
                             + this.transferJob.getAttempts()
                             + &quot; of &quot;
                             + this.maxAttempts);
            }
            if (this.transferJob.getAttempts().intValue() &gt;= this.maxAttempts) {
                logger.info(&quot;Transfer &quot; + transferId + &quot;: &quot;
                            + &quot;transfer failed (retried &quot;
                            + this.transferJob.getAttempts()
                            + &quot; times)&quot;);
                statusChanged(RFTConstants.STATUS_FAILED);
            } else {
                // Retrying
                statusChanged(RFTConstants.STATUS_RETRYING);
            }
            if(this.transferClient != null) {
                this.transferClient.close();
            }
            break;
        }</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Ravi Madduri">madduri@mcs.anl.gov</who>
            <bug_when>2005-04-06 18:25:37</bug_when>
            <thetext>RFT does not retry transfers that fail with errors it considers fatal. What error were you getting?</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Mats Rynge">rynge@isi.edu</who>
            <bug_when>2005-04-06 18:26:18</bug_when>
            <thetext>http://www.isi.edu/~rynge/bug-3091/container-log-20050406-1.txt</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Mats Rynge">rynge@isi.edu</who>
            <bug_when>2005-04-06 18:27:35</bug_when>
            <thetext>It is hard to tell if this is the error in question, but in the log I see:

Terminal transfer error:
Server refused performing the request. Custom message: Server reported transfer
failure (error code 1) [Nested exception message:  Custom message: Unexpected
reply:500-Command failed. : callback failed.
500-globus_xio: System error in writev: Connection reset by peer
500-globus_xio: A system call failed: Connection reset by peer
500 End.] [Caused by: Server refused performing the request. Custom message:
Server reported transfer failure (error code 1) [Nested exception message: 
Custom message: Unexpected reply: 500-Command failed. : callback failed.
500-globus_xio: System error in writev: Connection reset by peer
500-globus_xio: A system call failed: Connection reset by peer
</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Ravi Madduri">madduri@mcs.anl.gov</who>
            <bug_when>2005-04-06 18:48:51</bug_when>
            <thetext>The current code treats that as a terminal error. But given this error, and my experience
transferring SDSS data, I am inclined to mark all errors that occur once the transfer has begun as
transient, rather than relying completely on the error code sent by the server.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Ravi Madduri">madduri@mcs.anl.gov</who>
            <bug_when>2005-04-07 00:31:01</bug_when>
            <thetext>A possible fix in trunk.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Mats Rynge">rynge@isi.edu</who>
            <bug_when>2005-04-07 22:20:21</bug_when>
            <thetext>Full log:

  http://www.isi.edu/~rynge/bug-3091/container-log-20050407-1.txt

See transfer 195867.

Summary:

2005-04-07 19:52:01,008 DEBUG service.TransferWork
[Thread-7775,processStates:465] [Request 111055, Transfer 195867] processing
state for transfer of
gsiftp://dc-user2.isi.edu:9001/nfs/asd/rynge/condorg/test-viz/hello.sh  -&gt; 
gsiftp://viz-login.isi.edu:9001/nfs/home/rynge/job_dd5d3260-a7d7-11d9-9011-c7c57ddc5de5/hello.sh
2005-04-07 19:52:01,061 DEBUG service.TransferWork
[Thread-7775,processStates:506] [Request 111055, Transfer 195867] retry attempt
1 of 0
2005-04-07 19:52:01,080 INFO  service.TransferWork
[Thread-7775,processStates:513] Transfer 195867: transfer failed (retried 1 times)


maxAttempts is still wrong. It should be 5 (from the GRAM RSL), but is 0.


Also, it can be argued that saying it has been retried once is wrong, since a
retry implies a previous attempt. Maybe it should read: 

    transfer failed (tried 1 times)
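
To illustrate the counting and wording suggested above, here is a minimal sketch. It is not the actual RFT code: RetrySketch, Transfer and runWithRetries are hypothetical names, and it assumes the first try always happens, so a maxAttempts below 1 is treated as 1.

```java
// Sketch only: RetrySketch, Transfer and runWithRetries are hypothetical names,
// not part of the RFT code base.
final class RetrySketch {

    /** Stand-in for one transfer attempt; returns true on success. */
    interface Transfer {
        boolean run(int attempt);
    }

    /** Tries the transfer up to maxAttempts times and reports how many tries were made. */
    static String runWithRetries(Transfer transfer, int maxAttempts) {
        // Assumption: the first try always happens, so treat values below 1 as 1.
        int limit = Math.max(1, maxAttempts);
        int tried = 0;
        while (tried != limit) {
            tried++;
            if (transfer.run(tried)) {
                return "transfer succeeded (tried " + tried + " times)";
            }
        }
        return "transfer failed (tried " + tried + " times)";
    }
}
```

Counting tries rather than retries keeps the message accurate even when only one attempt is made.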

</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Ravi Madduri">madduri@mcs.anl.gov</who>
            <bug_when>2005-04-08 08:48:07</bug_when>
            <thetext>Hmm. Can I see the RSL for this job? RFT tries the transfer once and only checks the job's
number of attempts on a retry, which should be fixed. But what worries me more is whether
attempts is being set properly in the RFT database.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Ravi Madduri">madduri@mcs.anl.gov</who>
            <bug_when>2005-04-08 10:11:30</bug_when>
            <thetext>Thanks for the catch, Mats. This is fixed in trunk. maxAttempts was not being updated properly in the
database because the column name was case sensitive (I was using maxAttempts, as in my schema).</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Ravi Madduri">madduri@mcs.anl.gov</who>
            <bug_when>2005-04-08 11:53:23</bug_when>
            <thetext>By the way, it is not &lt;attempts&gt;5&lt;/attempts&gt;; it should be &lt;maxAttempts&gt;5&lt;/maxAttempts&gt;. A valid RSL with
maxAttempts set to 5 looks like this:

&lt;job&gt;
    &lt;executable&gt;my_echo&lt;/executable&gt;
    &lt;directory&gt;${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/&lt;/directory&gt;
    &lt;argument&gt;12&lt;/argument&gt;
    &lt;argument&gt;abc&lt;/argument&gt;
    &lt;argument&gt;34&lt;/argument&gt;
    &lt;argument&gt;pdscaex_instr_GrADS_grads23_28919.cfg&lt;/argument&gt;
    &lt;argument&gt;pgwynnel was here&lt;/argument&gt;
    &lt;environment&gt;
        &lt;name&gt;PI&lt;/name&gt;
        &lt;value&gt;3.141&lt;/value&gt;
    &lt;/environment&gt;
    &lt;environment&gt;
        &lt;name&gt;GLOBUS_DUROC_SUBJOB_INDEX&lt;/name&gt;
        &lt;value&gt;0&lt;/value&gt;
    &lt;/environment&gt;
    &lt;stdout&gt;stdout&lt;/stdout&gt;
    &lt;stderr&gt;stderr&lt;/stderr&gt;
    &lt;fileStageIn&gt;
        &lt;maxAttempts&gt;5&lt;/maxAttempts&gt;
        &lt;transfer&gt;
            &lt;sourceUrl&gt;gsiftp://promptu:2811/tmp/empty_dir/&lt;/sourceUrl&gt;
            &lt;destinationUrl&gt;file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/&lt;/destinationUrl&gt;
        &lt;/transfer&gt;
        &lt;transfer&gt;
            &lt;sourceUrl&gt;gsiftp://promptu:2811/bin/echo&lt;/sourceUrl&gt;
            &lt;destinationUrl&gt;file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/my_echo&lt;/destinationUrl&gt;
        &lt;/transfer&gt;
    &lt;/fileStageIn&gt;
    &lt;!--
    &lt;fileStageOut&gt;
        &lt;transfer&gt;
            &lt;sourceUrl&gt;file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/stdout&lt;/sourceUrl&gt;
            &lt;destinationUrl&gt;gsiftp://promptu:2811/${GLOBUS_USER_HOME}/stdout.${THROUGHPUT_TESTER_JOB_ID}&lt;/destinationUrl&gt;
        &lt;/transfer&gt;
        &lt;transfer&gt;
            &lt;sourceUrl&gt;file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/stderr&lt;/sourceUrl&gt;
            &lt;destinationUrl&gt;gsiftp://promptu:2811/${GLOBUS_USER_HOME}/stderr.${THROUGHPUT_TESTER_JOB_ID}&lt;/destinationUrl&gt;
        &lt;/transfer&gt;
    &lt;/fileStageOut&gt;
    --&gt;
    &lt;fileCleanUp&gt;
        &lt;maxAttempts&gt;5&lt;/maxAttempts&gt;
        &lt;deletion&gt;
            &lt;file&gt;file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/&lt;/file&gt;
        &lt;/deletion&gt;
    &lt;/fileCleanUp&gt;
&lt;/job&gt;</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Mats Rynge">rynge@isi.edu</who>
            <bug_when>2005-04-08 12:02:43</bug_when>
            <thetext>So this is now a Condor issue then.

Jaime, the RSL you are generating needs a fix: s/attempts/maxAttempts/ </thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Jaime Frey">jfrey@cs.wisc.edu</who>
            <bug_when>2005-04-08 12:25:27</bug_when>
            <thetext>You can find a fixed gridmanager in
ftp://ftp.cs.wisc.edu/condor/temporary/forlane/2005-04-08</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Mats Rynge">rynge@isi.edu</who>
            <bug_when>2005-04-08 14:43:16</bug_when>
            <thetext>Jaime,

It wasn&apos;t just an easy s/// fix. The &lt;maxAttempts&gt; element should be outside the
&lt;transfer&gt;s:

   &lt;fileStageIn&gt;
        &lt;maxAttempts&gt;5&lt;/maxAttempts&gt;
        &lt;transfer&gt;

See comment #9 for a good example.

I think it would also be a good idea to have it for fileStageOut.
</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Jaime Frey">jfrey@cs.wisc.edu</who>
            <bug_when>2005-04-08 17:02:45</bug_when>
            <thetext>Is that your final answer? :-)
(this is the third time I&apos;ve been told to change attempts/maxAttempts)
Latest and greatest:
ftp://ftp.cs.wisc.edu/condor/temporary/forlane/2005-04-08_2</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Mats Rynge">rynge@isi.edu</who>
            <bug_when>2005-04-08 20:11:40</bug_when>
            <thetext>Yeah, sorry about that. I was a little confused about what needed to be
done.

I have verified that the Condor and RFT fixes are working and will close this bug.

Thanks guys!</thetext>
          </long_desc>
      
      

    </bug>

</bugzilla>