Bugzilla – Bug 2662
ws-gram service leaves hanging file transfers
Last modified: 2005-02-28 15:55:56
You need to log in before you can comment on or make changes to this bug.
I'm seeing file transfers that hang indefinitely after the gram job that spawned them is finished and destroyed. That is, I'm seeing extra gridftp server processes on both source and destination machines. When these appear, I can't shut down the GT4 container gracefully (with globus-stop-container), but if I kill it, the hung transfers disappear. Here's a short job that produces the problem: <job> <executable>/tmp/foobar/date</executable> <directory>/tmp/foobar/</directory> <fileStageIn> <transfer> <sourceUrl>gsiftp://balthazar.cs.wisc.edu/tmp/smell/</sourceUrl> <destinationUrl>file:///tmp/foobar/</destinationUrl> </transfer> <transfer> <sourceUrl>gsiftp://balthazar.cs.wisc.edu/bin/date</sourceUrl> <destinationUrl>file:///tmp/foobar/date</destinationUrl> </transfer> </fileStageIn> </job> /tmp/smell is an empty directory on balthazar (the machine I'm submitted from). This produces hangs when run with both managed-job-globusrun and globusrun-ws. I'm using code I checked out of CVS on Friday, January 28. Both source and destination machines are x86 running Tao Linux.
Another detail that may be relevant to reproducing this behavior: I'm running everything non-root. The servers are using host certificates owned by the uid they're running under.
GRAM destroys all RFT resources, so this has to be an RFT issue.
I just test RFT from trunk using the following transfers in a transfer request: #Source/Dest URL Pairs gsiftp://ned-0.isi.edu:5678/tmp/small/ gsiftp://ned-0.isi.edu:5678/tmp/foobar/ gsiftp://ned-0.isi.edu:5678/bin/date gsiftp://ned-0.isi.edu:5678/tmp/foobar/date I see no hang or any error to that matter. an empty dir /tmp/foobar is created and a /bin/date was copied into it by the next transfer. And I can see the resource being destroyed which closed all the gridftp connections. I was also able to kill the container w/o any problem. Can I see logs from the container when you see this behaviour ?
Created an attachment (id=496) [details] container output during hanging behavior Here is the output from the container during the job submission that provokes the hanging behavior.
From the logs i see that transfers were completed successfully and destroy has been called. I don't see any exceptions/problems in the log... Can you increase logging level to DEBUG on org.globus.transfer package and send me the log ?
Created an attachment (id=497) [details] debug log during provoked behavior Here is the container output with DEBUG for org.globus.transfer.
From the latest log .. I don't really see anything that would hint any sort of problem from RFT side of things. This piece of log <snip> 2005-01-31 18:13:33,383 DEBUG service.ReliableFileTransferResource [Thread-5,remove:701] Removing the resource with id: 1 2005-01-31 18:13:33,384 DEBUG database.RFTDatabaseSetup [Thread-5,getDBConnection:201] Added 1 active 1 idle 2 2005-01-31 18:13:33,386 DEBUG database.RFTDatabaseSetup [Thread-5,returnDBConnection:258] Reduced 0 active 0 idle 3 2005-01-31 18:13:33,387 DEBUG service.ReliableFileTransferResource [Thread-5,remove:706] Proxy file : /tmp/x509up_11631.tmp 2005-01-31 18:13:33,388 DEBUG service.ReliableFileTransferResource [Thread-5,remove:709] Deleting proxy file 2005-01-31 18:13:33,390 DEBUG service.ReliableFileTransferResource [Thread-5,remove:717] deleting subscriptions file 2005-01-31 18:13:33,394 DEBUG database.RFTDatabaseSetup [Thread-5,getDBConnection:201] Added 1 active 1 idle 2 2005-01-31 18:13:33,400 DEBUG database.RFTDatabaseSetup [Thread-5,returnDBConnection:258] Reduced 0 active 0 idle 3 2005-01-31 18:13:33,401 DEBUG service.TransferWork [Thread-5,closeTransferClients:247] Closing transfer clients for resource 1 </snip> shows everything is cool with respect to RFT.
I don't see any rft problems here .. Jaime, Can you reproduce this error consistently ?
Yes. It doesn't happen every time, but most times. I'm trying to test it with the lastest code out of CVS, but I'm running into compilation errors.
I just tested against code checked out of CVS this past Thursday (Feb 10) at around 1pm CST, and the problem is still there. Again, it doesn't happen every time, but most times with the gram rsl in the ticket. Now, I'm seeing errors in the container output that look related to the problem. I will attach it.
Created an attachment (id=511) [details] debug log with errors
Mike fixed a bug in the new server, which is causing the server to seg fault when setting tcp buffer size. (Bug 2687 ) Can you try your job against fixed server and let me know if you see this error again ?
I updated and recompiled the gridftp-server code on both ends and get the same behavior with what appears to be the exact same NPE exception.
Can you send me the npe exception ?
Created an attachment (id=512) [details] error with updated gridftp server
I just attached the NullPointerException error in the container log with gridftp updated on both ends.
That is a different error. There is a bug ( bug 2724 ) filed for it already. this error should not cause the conditions you described before. The stack trace your reported above is obtained when usage stats are being sent at the end of resource's life time.
The behavior is still occurring, whether or not the exception is related.
Making this a blocker for 3.9.5 as several projects are waiting for Condor-G to start porting their stuff to GT4.
I just tried the same rsl with code from rc1 and gftp server from rc1 many times, but i could not reproduce the problem
Jaime, I have tried to reproduce this error, but have not been able to. I did 50 successful jobs: [rynge@devrandom 2662]$ cat job.xml <job> <executable>/tmp/foobar/date</executable> <directory>/tmp/foobar/</directory> <fileStageIn> <transfer> <sourceUrl>gsiftp://devrandom.isi.edu:9001/tmp/smell/</sourceUrl> <destinationUrl>file:///tmp/foobar/</destinationUrl> </transfer> <transfer> <sourceUrl>gsiftp://devrandom.isi.edu:9001/bin/date</sourceUrl> <destinationUrl>file:///tmp/foobar/date</destinationUrl> </transfer> </fileStageIn> </job> [rynge@devrandom 2662]$ for I in `seq 1 50`; do echo $I; ssh dc-user2.isi.edu rm -rf /tmp/foobar; globusrun-ws -submit -S -F dc-user2.isi.edu:9000 -f job.xml;done This was with a CVS checkout from 2/13, and with full security, running gridftp as root from inetd and the container as the 'globus' user. I did suspect a race condition between the transfers (the second one depend on the directory from the first one), but Ravi has ensured me that the transfers are sequential. Can you give me the options you start the gridftp servers with, and the full commandline of your globusrun-ws invokation? I don't think it has to do with security, but I'm running out of ideas on how to reproduce this problem.
The source gridftp server is non-root from the command-line. The web services container is non-root. The destination gridftp server is root under inetd, but the client credential maps to the container's uid.
Hmm, looks similar enough. What about the globusrun-ws line? What command line arguments do you use for that?
I've been using managed-job-globusrun -factory <hostname> -file <rsl file> I haven't tried globusrun-ws recently, but when I did several weeks ago, it showed the same behavior. I start the source gridftp server (non-root from the command line) with -p 2811. I start the destination gridftp server (root under inetd) with -i -l -a.
Can you try and use and globusrun-ws now and let me know what happens ?
Can you also telnet to the ports that you have your latest gridftp servers running and send me the output ?
Client gridftp server: 220 GridFTP Server balthazar.cs.wisc.edu 0.16 (gcc32dbg, 1108067214-1) ready. ** Development Release ** Server gridftp server: 220 GridFTP Server nostos.cs.wisc.edu 0.16 (gcc32dbg, 1108067214-1) ready. ** Development Release **
Added mike to cc list.
I get hanging transfers when source and destination are the same machine, with the gridftp server running under inetd. I retested globusrun-ws, and still get hangs: globusrun-ws -submit -factory nostos -job-description-file job4h.xml -staging-delegate
Can you start the server with options '-log-level ALL -l /tmp/logfile' and show that log after a hang? (note, your -i -l -a arg line is just logging to a file named -a, probably in /. there is no -a argument to globus-gridftp-server)
The '-l -a' is a holdover from the much older gridftp server. I will attach the gridftp logs generated with '-log-level ALL -l /tmp/logfile'.
Created an attachment (id=513) [details] client gridftp server logfile
Created an attachment (id=514) [details] server gridftp server logfile
based on those logs, but servers think they have exited normally. Are the server processes still hanging around?
Looks like the transfers are done. I don't understand where the hang is. Can you increase logging level to DEBUG for org.globus.transfer=DEBUG org.globus.ftp=DEBUG, and send me the log please ? Also if the container appears to be hanging try to do kill -QUIT <server-jvm-process> which will log all the java threads to the file.
Now, the hung transfers are timing out and exitting 19 1/2 minutes afterwards. No information on the hung transfers goes into the logfile until then. I'll post new logfiles. For the transfers started at 15:51, I disabled logging on the server side to see if that affected the hanging behavior.
Created an attachment (id=515) [details] client gridftp server logfile
Created an attachment (id=516) [details] server gridftp server logfile
Created an attachment (id=517) [details] result of kill -QUIIT on container after hung transfers timed out This is the result of a kill -QUIT on the container after serveral hung transfers timed out. I tried globus-stop-container beforehand and it didn't shutdown.
That is normal behavior for the server. There is a default idle-timeout of 10 minutes, which (due to the way timeouts currently work), actually causes the timeout to happen between 10-20 minutes. I believe it is also normal behavior for RFT to keep the connection open for the resource lifetime (or something). Possibly RFT is not handling the 421 timeout and closing of the connection by the server well?
Is the ManagedExecutableJobService resource not destroying the RFT resources then? The ManagedExecutableJobService resource is destroyed long before the connections time out (at least, the client says it's destroyed). So the long-term hangs may have been solved for a while, and I just haven't waited more than 20 minutes in my recent tests. Will these connections be reused by other RFT resources? If not, I wonder if this could become a scalability issue with large numbers of jobs with small input data.
If I remove the second transfer and switch the executable to /bin/date, no gridftp connections are left open after the job is destroyed.
RFT does not reuse connections across multiple resources. So when GRAM destroys RFT resource the connections should be closed. Can you increase the container log levels of org.globus.ftp and org.globus.transfer to DEBUG and attach the logs ? And are you still using m-j-g for your jobs ?
I'm about to disappear on vacation for a week and a half. I don't know if one of the other Condor staff will be able to help with further investigation.
Ravi, I think this function has a couple of problems: public synchronized static void closeTransferClients(int requestId) { logger.debug("Closing transfer clients for resource " + requestId); for (int i = 0; i < transferClients.size(); i++) { TransferClient closeMe = (TransferClient) transferClients .elementAt(i); if (closeMe.getRequestId() == requestId) { transferClients.remove(i); closeMe.close(); } } for (int i = 0; i < deleteClients.size(); i++) { DeleteClient closeMe = (DeleteClient) deleteClients.elementAt(i); if (closeMe.getRequestId() == requestId) { closeMe.close(); } } } 1) Iterating over the Vector and calling .remove(i) within the loop might skip some valid TransferClient instances (therefore some connections might not be closed?). Calling remove(closeMe) should fix it. 2) DeleteClients are not removed as TransferClients are from its Vector - that probably might lead to a leak.
Jaime, I would be interested in getting a Condor-G install here at ISI for further testing. Do you have anything packaged up or otherwise easily installable to share for this purpose?
I've placed a tarball of condor-g gt4 binaries at ftp://ftp.cs.wisc.edu/condor/temporary/forstu/2005-02-17/condor-gt4.tar.gz The included README should explain how to set it up. I compiled them on my tao linux 1.0 system. If there's an incompatibility with your machine, hopefully another Condor person can compile a different set.
I will make those changes and we can try again.
My suggestion in #45 for 1 has the same issue as the original code. Will need to reset the index after removing the TransferClient.
Created an attachment (id=518) [details] patch to clean up connections Please find attached patch that should fix the connection closing. I was able to stop the container using globus-stop-container. Let me know if it works for you.
Hi. I'm back from vacation. I just applied the patch and retested, and the hanging transfers have disappeared.