Bug 3130 - RFT looses transfer stats in the middle of a transfer
: RFT looses transfer stats in the middle of a transfer
Status: RESOLVED FIXED
: RFT
RFT
: development
: PC Linux
: P3 normal
: ---
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2005-04-11 00:25 by
Modified: 2005-04-11 14:15 (History)


Attachments


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2005-04-11 00:25:43
This problem shows up as a GRAM fault when GRAM can't find the executable after
a stagein: Invalid file permissions on executable 


2005-04-10 22:00:43,719 DEBUG service.TransferWork
    [Thread-5865,statusChanged:171] [Request 125606, Transfer 223075] setting
    overall rp to (finished: 1, active: 0, failed: 0, restarted: 0, pending:
    2, cancelled: 0)
2005-04-10 22:00:49,645 DEBUG service.TransferWork
    [Thread-5913,statusChanged:171] [Request 125606, Transfer 223076] setting
    overall rp to (finished: 0, active: 0, failed: 0, restarted: 0, pending:
    0, cancelled: 0)
2005-04-10 22:00:49,652 DEBUG service.TransferWork
    [Thread-5913,setRequestStatus:225] [Request 125606, Transfer 223076]
    setting status RP to Done

Somehow the stats are lost between the two first lines.

I will attach a full log.
------- Comment #1 From 2005-04-11 00:29:52 -------
Full log:

http://www.isi.edu/~rynge/bug-3130/container-log-20050410-1.txt
------- Comment #2 From 2005-04-11 00:32:21 -------
Looks like "destroy" is being called on the resource before all the transfers
are done. 
------- Comment #3 From 2005-04-11 00:33:00 -------
Actually, it looks like the transfer is getting removed in the middle of the
transfer:

2005-04-10 22:00:44,496 DEBUG service.ReliableFileTransferResource
    [Thread-5839,remove:746] Removing the resource with id: 125606
------- Comment #4 From 2005-04-11 10:10:33 -------
The status ought to go to Cancelled or something if GRAM is prematurely
destroying the resource, right?  So this shouldn't show up as a GRAM fault but
an RFT fault that says the transfer was cancelled.
------- Comment #5 From 2005-04-11 10:14:50 -------
Is it possible that with the polling GRAM is doing the status is fetched before
any OverallStatus values are set (i.e. all counts are zero)?
------- Comment #6 From 2005-04-11 10:25:41 -------
The delete is happening way too early. At that point, the directory has been
expanded (that is the 1 finished), but no transfers has actually taken place (2
pending). A guess is that there might be something wrong with the logic of
detemining if the transfer is done in the new staging listener.
------- Comment #7 From 2005-04-11 11:29:43 -------
According to my logs, the polling is returning an OverallStatus with all counts
zero before it destroys the transfer.  Notice the timestamps of the following
two excerpts.  The first is the GetResourcePropertyResponse message from the
poll, and the second is what it does with the OverallStatus object after it returns:

-------------------------------------------------------------------------------
2005-04-10 23:01:08,999 DEBUG handlers.MessageLoggingHandler
[Thread-23,logMessage:48] SOAPEnvelope: <?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing"><soapenv:Header><wsa:MessageID
soapenv:mustUnderstand="0">uuid:b77f34a0-aa46-11d9-8fd2-c6f981d513e6</wsa:MessageID><wsa:To
soapenv:mustUnderstand="0">http://schemas.xmlsoap.org/ws/2004/03/addressing/role/anonymous</wsa:To><wsa:Action
soapenv:mustUnderstand="0">http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceProperties/GetResourcePropertyResponse</wsa:Action><wsa:From
soapenv:mustUnderstand="0"
xmlns:ns1="http://www.globus.org/namespaces/2004/10/rft"><wsa:Address>https://192.168.0.101:8443/wsrf/services/ReliableFileTransferService</wsa:Address><wsa:ReferenceProperties><ns1:TransferKey
soapenv:mustUnderstand="0">23204</ns1:TransferKey></wsa:ReferenceProperties></wsa:From><wsa:RelatesTo
RelationshipType="wsa:Reply"
soapenv:mustUnderstand="0">uuid:b3841230-aa46-11d9-8fd2-c6f981d513e6</wsa:RelatesTo></soapenv:Header><soapenv:Body><GetResourcePropertyResponse
xmlns="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceProperties-1.2-draft-01.xsd"><ns1:OverallStatusProperty
xmlns:ns1="http://www.globus.org/namespaces/2004/10/rft"><ns1:transfersFinished>0</ns1:transfersFinished><ns1:transfersRestarted>0</ns1:transfersRestarted><ns1:transfersFailed>0</ns1:transfersFailed><ns1:transfersActive>0</ns1:transfersActive><ns1:transfersPending>0</ns1:transfersPending><ns1:transfersCancelled>0</ns1:transfersCancelled></ns1:OverallStatusProperty></GetResourcePropertyResponse></soapenv:Body></soapenv:Envelope>

-------------------------------------------------------------------------------
2005-04-10 23:01:09,212 DEBUG exec.StateMachine
[Thread-23,processStageInResponseState:809] failedCount: 0
2005-04-10 23:01:09,212 DEBUG exec.StateMachine
[Thread-23,processCompletedStagingJob:2334] Entering processCompletedStagingJob()
2005-04-10 23:01:09,213 DEBUG exec.StateMachine
[Thread-23,processCompletedStagingJob:2362] totalTransferCount: 0
2005-04-10 23:01:09,213 DEBUG exec.StateMachine
[Thread-23,processCompletedStagingJob:2363] finishedCount: 0
2005-04-10 23:01:09,213 DEBUG exec.StateMachine
[Thread-23,processCompletedStagingJob:2364] failedCount: 0
2005-04-10 23:01:09,213 DEBUG exec.StateMachine
[Thread-23,processCompletedStagingJob:2365] canceledCount: 0
2005-04-10 23:01:09,213 DEBUG exec.StateMachine
[Thread-23,processCompletedStagingJob:2373] Transfer finished successufully.
2005-04-10 23:01:09,219 DEBUG utils.StagingHelper
[Thread-23,destroyTransfer:239]
[resourceKey:bc2d96f0-aa45-11d9-a357-b72315c75a36] Entering destroyTransfer()
------- Comment #8 From 2005-04-11 11:36:51 -------
I wonder if poll happened when the overallStatus is being constructed from the
database. May be you 
should check for all zeros and try polling again ?
------- Comment #9 From 2005-04-11 11:45:00 -------
Yeah, that's pretty much what I was thinking.  I was going to do this earlier
when I asked you about how the DB is initialized by you said it couldn't happen. :)

Since the StateMachine ignores null OverallStatus objects from the poll method,
I can just check for that condition in the method and return null.  The transfer
job would then be polled again in 5 minutes or so if the notification didn't
come in before then.

So the question is, should RFT be doing something to assure that this state
doesn't occur?  If it's absolutely impossible then we can close this bug and
assume my workaround is a fix.  It still think it's a dangerous value to be
returning.  I'd much prefer a null object.
------- Comment #10 From 2005-04-11 11:48:52 -------
Well .. I can create the object after I get results from the database instead
of before. so you will get a 
old/valid value if you poll in between 
------- Comment #11 From 2005-04-11 13:08:51 -------
Ravi, the patch you gave me to fix this seems to work.  If you concur we can
close the bug and you can commit the patch.
------- Comment #12 From 2005-04-11 13:35:00 -------
Ok. I will commit the patch i gave you.  If you can rerun your test for sanity
and let me know I will close 
this bug
------- Comment #13 From 2005-04-11 14:10:26 -------
I re-ran the test over lunch. Looks good.