Bugzilla – Bug 3840
RFT ordering and concurrency problem in HEAD
Last modified: 2006-03-02 21:10:12
You need to log in before you can comment on or make changes to this bug.
Ravi, I think Jarek is right about the gram testing problem being RFT's fault (or maybe even GRAM's). The following test has RFT debugging turned on, and every 5 mins, a kill -QUIT is issued to the container. http://viz-login.isi.edu/gram-testing/test-details.php?uuid=572d5a6a-1d86-449b-a31a-6ff553ca9950 There is one RFT failure at 08:19:39,979. It says it is having problem to delete a file, but I think that the RFT mistook the request for being a delete opereation instead of a stage in. The stagin RSL: <fileStageIn> <maxAttempts>5</maxAttempts> <transfer> <sourceUrl> gsiftp://viz-2.isi.edu/nfs/home/rynge/gram-test-suite/tests/job/empty_dir/ </sourceUrl> <destinationUrl> file:///scratch/rynge/jobs/${THROUGHPUT_TESTER_JOB_ID}/ </destinationUrl> </transfer> <transfer> <sourceUrl> gsiftp://viz-2.isi.edu/nfs/home/rynge/gram-test-suite/tests/job/sleep.sh </sourceUrl> <destinationUrl> file:///scratch/rynge/jobs/${THROUGHPUT_TESTER_JOB_ID}/sleep.sh </destinationUrl> </transfer> </fileStageIn> Looking at the log, it seems like the first part is handled right: 2005-10-26 08:19:36,867 DEBUG service.TransferWork [Thread-41,getTransferClient:336] [Request 5, Transfer 9] transferring gsiftp://viz-2.isi.edu:2811/nfs/home/rynge/ gram-test-suite/tests/job/empty_dir/ -> gsiftp://viz-1.isi.edu:2811/scratch/rynge/jobs/e5e50980-4633-11da-8b4c-d7d91db7639c/ 2005-10-26 08:19:38,542 DEBUG service.TransferWork [Thread-41,processStates:474] [Request 5, Transfer 9] processing state for transfer of gsiftp://viz-2.isi.edu:2811 /nfs/home/rynge/gram-test-suite/tests/job/empty_dir/ -> gsiftp://viz-1.isi.edu:2811/scratch/rynge/jobs/e5e50980-4633-11da-8b4c-d7d91db7639c/ 2005-10-26 08:19:38,542 DEBUG service.TransferWork [Thread-41,processStates:542] [Request 5, Transfer 9] expanding the url: gsiftp://viz-2.isi.edu:2811/nfs/home/ryng e/gram-test-suite/tests/job/empty_dir/ But then, after URL expansion, it get confused. Transfer 14 looks right, but Transfer 10 (which is the one which fails later) should be a stagin, and not even be running until Transfer 14 is done: 2005-10-26 08:19:39,145 DEBUG service.TransferWork [Thread-41,statusChanged:155] [Request 5, Transfer 9] status changed called with status: 7 2005-10-26 08:19:39,146 DEBUG service.RFTResourceManager [Thread-41,update:795] Updating status in database for id: 9 with status: 7 2005-10-26 08:19:39,151 DEBUG service.RFTResourceManager [Thread-41,update:810] Updated status in database for id: 9 with status: 7 2005-10-26 08:19:39,153 DEBUG database.ReliableFileTransferDbAdapter [Thread-41,getOverallStatus:865] OverallStatus for requestId: 5 Finished:1 Retrying:0 Failed:0 Active:0 Pending:1 Cancelled:0 2005-10-26 08:19:39,154 DEBUG service.TransferWork [Thread-41,statusChanged:171] [Request 5, Transfer 9] setting overall rp to (finished: 1, active: 0, failed: 0, restarted: 0, pending: 1, cancelled: 0) 2005-10-26 08:19:39,225 DEBUG service.TransferWork [Thread-41,startNewTransfer:604] Starting a new transfer 2005-10-26 08:19:39,226 DEBUG service.RFTResourceManager [Thread-41,getNextSet:719] Cache Size : 10 2005-10-26 08:19:39,232 DEBUG service.TransferWork [Thread-68,run:675] This is a delete:10 2005-10-26 08:19:39,232 DEBUG service.TransferWork [Thread-68,getDeleteClient:382] [Request 5, Transfer 10] deleting gsiftp://viz-1.isi.edu:2811/scratch/rynge/jobs/e5e50980-4633-11da-8b4c-d7d91db7639c/sleep.sh 2005-10-26 08:19:39,237 DEBUG service.TransferWork [Thread-69,run:683] This is a transfer:14 2005-10-26 08:19:39,240 DEBUG service.TransferWork [Thread-69,getTransferClient:336] [Request 5, Transfer 14] transferring gsiftp://viz-2.isi.edu:2811/nfs/home/rynge/gram-test-suite/tests/job/empty_dir//.ignoreme -> gsiftp://viz-1.isi.edu:2811/scratch/rynge/jobs/e5e50980-4633-11da-8b4c-d7d91db7639c///.ignoreme 2005-10-26 08:19:39,247 DEBUG service.TransferWork [Thread-69,getTransferClient:362] [Request 5, Transfer 14] no client to reuse 2005-10-26 08:19:39,247 DEBUG service.TransferWork [Thread-69,getNewClient:412] max active allowed: 100 2005-10-26 08:19:39,350 DEBUG service.TransferWork [Thread-69,getNewClient:417] Transfer 14: new transfer client for user: rynge 2005-10-26 08:19:39,477 DEBUG service.DeleteClient [Thread-68,<init>:96] Subject name is null in DeleteClient 2005-10-26 08:19:39,916 DEBUG service.TransferWork [Thread-68,statusChanged:155] [Request 5, Transfer 10] status changed called with status: 3
A "transfer" is treated as a delete if the destination url is null. I am not sure why the destinationUrl will become null in some cases and not in others. I did not see this in tests
I changed the title of this bug, as I know have some more information. See http://www.isi.edu/~rynge/3840/ for a gram rsl file and a container log. I think we have a couple of problems going on. The first one is that the urlexpander adds expanded transfers to the end of the request, and hence changes the order of the transfers. The second problem is happening around 07:27:08,240 where 2 transfers (directory and file from subdir 1 in rsl) are started at the same time. This should not happen when concurrency is 1.
This is fixed too.