Bugzilla – Bug 1871
globus-url-copy often hangs with default buffer size
Last modified: 2005-08-09 13:07:53
You need to log in before you can comment on or make changes to this bug.
On our system (RedHat 7.2 + GT 3.2.0 + all patches) I often see globus-url-copy hanging when used to transfer large files using GASS/https. This is easily reproduced by running "globus-gass-server -r" in one window, and doing transfers in a loop using globus-url-copy -vb <gass-url>/./BIGFILE file:///usr/tmp/TMPFILE in another. The latter often gets stuck in a phase showing "0 bytes 0.00 KB/sec avg 0.00 KB/sec inst" from which it never seems to recover. This happens for both threaded and non-threaded versions of globus-url-copy. Using strace on the non-threaded version, I found that in hanging cases, the following occurs: ----- gettimeofday({1092061833, 833515}, NULL) = 0 select(8, [3 7], [], NULL, {0, 188965}) = 1 (in [7], left {0, 190000}) read(7, "\27\3\0\4\24\304\243\334H\317\0247=nl\33\374\22\266\307"..., 131072) = 48254 gettimeofday({1092061833, 842185}, NULL) = 0 select(8, [3 7], [], NULL, {0, 180295}) = 1 (in [7], left {0, 190000}) read(7, "\27\3\0\4\24\366l\30_\275\16&Z\227\246\223\233\301\323"..., 82818) = 33568 gettimeofday({1092061833, 848637}, NULL) = 0 select(8, [3 7], [], NULL, {0, 173843}) = 1 (in [7], left {0, 180000}) read(7, "\27\3\0\4\24{\340\231[\321\247\354g\220\351\\<77\247\260"..., 49250) = 48254 gettimeofday({1092061833, 857114}, NULL) = 0 select(8, [3 7], [], NULL, {0, 165366}) = 1 (in [7], left {0, 170000}) read(7, "\27\3\0\4\24Kn\337\316\223l\3C@\310\352}D\230!\321F\211"..., 996) = 996 gettimeofday({1092061833, 863643}, NULL) = 0 select(8, [3 7], [], NULL, {0, 158837}) = 1 (in [7], left {0, 160000}) gettimeofday({1092061833, 864119}, NULL) = 0 select(8, [3 7], [], NULL, {0, 158361}) = 1 (in [7], left {0, 160000}) gettimeofday({1092061833, 864445}, NULL) = 0 select(8, [3 7], [], NULL, {0, 158035}) = 1 (in [7], left {0, 160000}) gettimeofday({1092061833, 864764}, NULL) = 0 select(8, [3 7], [], NULL, {0, 157716}) = 1 (in [7], left {0, 160000}) gettimeofday({1092061833, 865086}, NULL) = 0 select(8, [3 7], [], NULL, {0, 157394}) = 1 (in [7], left {0, 160000}) gettimeofday({1092061833, 865405}, NULL) = 0 select(8, [3 7], [], NULL, {0, 157075}) = 1 (in [7], left {0, 160000}) gettimeofday({1092061833, 865723}, NULL) = 0 select(8, [3 7], [], NULL, {0, 156757}) = 1 (in [7], left {0, 160000}) gettimeofday({1092061833, 866039}, NULL) = 0 select(8, [3 7], [], NULL, {0, 156441}) = 1 (in [7], left {0, 160000}) gettimeofday({1092061833, 866355}, NULL) = 0 select(8, [3 7], [], NULL, {0, 156125}) = 1 (in [7], left {0, 160000}) gettimeofday({1092061833, 881131}, NULL) = 0 select(8, [3 7], [], NULL, {0, 141349}) = 1 (in [7], left {0, 150000}) gettimeofday({1092061833, 881455}, NULL) = 0 select(8, [3 7], [], NULL, {0, 141025}) = 1 (in [7], left {0, 150000}) gettimeofday({1092061833, 881772}, NULL) = 0 [..] gettimeofday({1092061834, 10118}, NULL) = 0 select(8, [3 7], [], NULL, {0, 12362}) = 1 (in [7], left {0, 20000}) gettimeofday({1092061834, 10444}, NULL) = 0 select(8, [3 7], [], NULL, {0, 12036}) = 1 (in [7], left {0, 20000}) gettimeofday({1092061834, 27880}, NULL) = 0 select(8, [3 7], [], NULL, {0, 0}) = 1 (in [7], left {0, 0}) gettimeofday({1092061834, 28199}, NULL) = 0 gettimeofday({1092061834, 28308}, NULL) = 0 gettimeofday({1092061834, 28379}, NULL) = 0 select(8, [3 7], [], NULL, {0, 494101}) = 1 (in [7], left {0, 500000}) gettimeofday({1092061834, 28702}, NULL) = 0 select(8, [3 7], [], NULL, {0, 493778}) = 1 (in [7], left {0, 500000}) gettimeofday({1092061834, 29022}, NULL) = 0 select(8, [3 7], [], NULL, {0, 493458}) = 1 (in [7], left {0, 500000}) gettimeofday({1092061834, 29410}, NULL) = 0 select(8, [3 7], [], NULL, {0, 493070}) = 1 (in [7], left {0, 500000}) gettimeofday({1092061834, 29729}, NULL) = 0 select(8, [3 7], [], NULL, {0, 491796}) = 1 (in [7], left {0, 500000}) gettimeofday({1092061834, 30999}, NULL) = 0 [..] ----- i.e., after the first 128 KB chunk is read, no read() is issued anymore, and timeouts also don't trigger recovery from this. Interestingly enough, explicitly setting the buffer size used by globus-url-copy to 128 KB or below using the "-bs" option (default is 1 MB) seems to avoid the problem on our system. Since 128 KB happens to be the initial buffer size supplied to read(), this may have some significance, but then again it might not ;-)
can you paste the output you see when using the -dbg flag?
Uhm, "-dbg" seems to be just for GridFTP, and nothing extra is printed when used with GASS transfers. But I cannot reproduce the hang with GridFTP, just with GASS it appears.
Ah, sorry, didnt see the gass url. I will pass this on to the gass expert here. Joe
I've dug into this and it looks like the client is getting to a state where the XIO select identifies the fd as having data, but then the call to globus_l_xio_system_try_read() is being called with buflen=0 causing no progress to be made.
Further tracing makes it look like the GSI XIO driver is passing a read to the TCP layer which is waiting for an amount of data larger than the buffer. This is forcing the read to get stuck in the XIO event driver. Sam: Could you take a look at this? The test case I've used is a 10M file. globus-gass-server -r globus-url-copy -vb https://<url>/./BIGFILE file:/tmp/bigfile To get a stack trace in gdb, I used break globus_xio_driver_pass_read if in_iovec[0].iov_len < in_wait_for joe
Do you have any updates on this bug? We seem to be encountering the same bug, and it is holding up a new release of the Virtual Data Toolkit, so we are very interested in this. Thanks, Alain Roy
Created an attachment (id=439) [details] Fix
We (the VDT folks) have tried out this patch, and it makes a big difference-- we can now successfully run large jobs with Globus 3.2.1. Thank you, thank you! We will be patching our version of Globus to include this patch. Can I recommend that you also release it as an update for Globus 3.2.1? The world should benefit from this improvement.
*** Bug 2009 has been marked as a duplicate of this bug. ***
Subject: Re: globus-url-copy often hangs with default buffer size This is an automatic reply. Feel free to send additional mail, as only this one notice will be generated. The following is a prerecorded message, sent for versto I'm on holiday until October 31. DAS system administration issues should best be reported to das-sysadm@cs.vu.nl. Other mail will be handled as soon as I return. Regards, Kees Verstoep
There is now an updated xio package with this fix. http://www.globus.org/toolkit/advisories.html?version=3.2
Subject: Re: globus-url-copy often hangs with default buffer size This is an automatic reply. Feel free to send additional mail, as only this one notice will be generated. The following is a prerecorded message, sent for versto I'm away until August 18. DAS system administration issues should best be reported to das-sysadm@cs.vu.nl. Other mail will be handled as soon as I return. Regards, Kees Verstoep