Bug 3840 - RFT ordering and concurrency problem in HEAD
: RFT ordering and concurrency problem in HEAD
Status: RESOLVED FIXED
: RFT
RFT
: development
: PC Linux
: P3 normal
: ---
Assigned To:
:
:
:
: 4111
  Show dependency treegraph
 
Reported: 2005-10-26 11:57 by
Modified: 2006-03-02 21:10 (History)


Attachments


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2005-10-26 11:57:45
Ravi,

I think Jarek is right about the gram testing problem being RFT's
fault (or maybe even GRAM's). The following test has RFT debugging
turned on, and every 5 mins, a kill -QUIT is issued to the container.

http://viz-login.isi.edu/gram-testing/test-details.php?uuid=572d5a6a-1d86-449b-a31a-6ff553ca9950


There is one RFT failure at 08:19:39,979. It says it is having problem
to delete a file, but I think that the RFT mistook the request for
being a delete opereation instead of a stage in. The stagin RSL:

<fileStageIn>
  <maxAttempts>5</maxAttempts>
  <transfer>
    <sourceUrl>
        gsiftp://viz-2.isi.edu/nfs/home/rynge/gram-test-suite/tests/job/empty_dir/
    </sourceUrl>
    <destinationUrl>
        file:///scratch/rynge/jobs/${THROUGHPUT_TESTER_JOB_ID}/
    </destinationUrl>
  </transfer>
  <transfer>
    <sourceUrl>
        gsiftp://viz-2.isi.edu/nfs/home/rynge/gram-test-suite/tests/job/sleep.sh
    </sourceUrl>
    <destinationUrl>
        file:///scratch/rynge/jobs/${THROUGHPUT_TESTER_JOB_ID}/sleep.sh
    </destinationUrl>
  </transfer>
</fileStageIn>


Looking at the log, it seems like the first part is handled right:

2005-10-26 08:19:36,867 DEBUG service.TransferWork
[Thread-41,getTransferClient:336] [Request 5, Transfer 9] transferring
gsiftp://viz-2.isi.edu:2811/nfs/home/rynge/
gram-test-suite/tests/job/empty_dir/  -> 
gsiftp://viz-1.isi.edu:2811/scratch/rynge/jobs/e5e50980-4633-11da-8b4c-d7d91db7639c/
2005-10-26 08:19:38,542 DEBUG service.TransferWork [Thread-41,processStates:474]
[Request 5, Transfer 9] processing state for transfer of gsiftp://viz-2.isi.edu:2811
/nfs/home/rynge/gram-test-suite/tests/job/empty_dir/  -> 
gsiftp://viz-1.isi.edu:2811/scratch/rynge/jobs/e5e50980-4633-11da-8b4c-d7d91db7639c/
2005-10-26 08:19:38,542 DEBUG service.TransferWork [Thread-41,processStates:542]
[Request 5, Transfer 9] expanding the url: gsiftp://viz-2.isi.edu:2811/nfs/home/ryng
e/gram-test-suite/tests/job/empty_dir/


But then, after URL expansion, it get confused. Transfer 14 looks
right, but Transfer 10 (which is the one which fails later) should
be a stagin, and not even be running until Transfer 14 is done:


2005-10-26 08:19:39,145 DEBUG service.TransferWork [Thread-41,statusChanged:155]
[Request 5, Transfer 9] status changed called with status: 7
2005-10-26 08:19:39,146 DEBUG service.RFTResourceManager [Thread-41,update:795]
Updating status in database for id: 9 with status: 7
2005-10-26 08:19:39,151 DEBUG service.RFTResourceManager [Thread-41,update:810]
Updated status in database for id: 9 with status: 7
2005-10-26 08:19:39,153 DEBUG database.ReliableFileTransferDbAdapter
[Thread-41,getOverallStatus:865] OverallStatus for requestId: 5 Finished:1
Retrying:0 Failed:0 Active:0 Pending:1 Cancelled:0
2005-10-26 08:19:39,154 DEBUG service.TransferWork [Thread-41,statusChanged:171]
[Request 5, Transfer 9] setting overall rp to (finished: 1, active: 0, failed:
0, restarted: 0, pending: 1, cancelled: 0)
2005-10-26 08:19:39,225 DEBUG service.TransferWork
[Thread-41,startNewTransfer:604] Starting a new transfer
2005-10-26 08:19:39,226 DEBUG service.RFTResourceManager
[Thread-41,getNextSet:719] Cache Size : 10
2005-10-26 08:19:39,232 DEBUG service.TransferWork [Thread-68,run:675] This is a
delete:10
2005-10-26 08:19:39,232 DEBUG service.TransferWork
[Thread-68,getDeleteClient:382] [Request 5, Transfer 10] deleting
gsiftp://viz-1.isi.edu:2811/scratch/rynge/jobs/e5e50980-4633-11da-8b4c-d7d91db7639c/sleep.sh
2005-10-26 08:19:39,237 DEBUG service.TransferWork [Thread-69,run:683] This is a
transfer:14
2005-10-26 08:19:39,240 DEBUG service.TransferWork
[Thread-69,getTransferClient:336] [Request 5, Transfer 14] transferring
gsiftp://viz-2.isi.edu:2811/nfs/home/rynge/gram-test-suite/tests/job/empty_dir//.ignoreme
 -> 
gsiftp://viz-1.isi.edu:2811/scratch/rynge/jobs/e5e50980-4633-11da-8b4c-d7d91db7639c///.ignoreme
2005-10-26 08:19:39,247 DEBUG service.TransferWork
[Thread-69,getTransferClient:362] [Request 5, Transfer 14] no client to reuse
2005-10-26 08:19:39,247 DEBUG service.TransferWork [Thread-69,getNewClient:412]
max active allowed: 100
2005-10-26 08:19:39,350 DEBUG service.TransferWork [Thread-69,getNewClient:417]
Transfer 14: new transfer client for user: rynge
2005-10-26 08:19:39,477 DEBUG service.DeleteClient [Thread-68,<init>:96] Subject
name is null in DeleteClient
2005-10-26 08:19:39,916 DEBUG service.TransferWork [Thread-68,statusChanged:155]
[Request 5, Transfer 10] status changed called with status: 3
------- Comment #1 From 2005-10-26 15:59:42 -------
A "transfer" is treated as a delete if the destination url is null. I am not
sure why the destinationUrl will 
become null in some cases and not in others. I did not see this in tests
------- Comment #2 From 2005-11-02 09:41:20 -------
I changed the title of this bug, as I know have some more information. See

http://www.isi.edu/~rynge/3840/

for a gram rsl file and a container log. I think we have a couple of problems
going on.

The first one is that the urlexpander adds expanded transfers to the end of the
request, and hence changes the order of the transfers.

The second problem is happening around 07:27:08,240 where 2 transfers (directory
and file from subdir 1 in rsl) are started at the same time. This should not
happen when concurrency is 1.

------- Comment #3 From 2006-03-02 21:10:12 -------
This is fixed too.