Bug 1871 - globus-url-copy often hangs with default buffer size
: globus-url-copy often hangs with default buffer size
Status: RESOLVED FIXED
: XIO
Globus XIO
: 3.2.0
: PC Linux
: P3 major
: ---
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2004-08-09 10:05 by
Modified: 2005-08-09 13:07 (History)


Attachments
Fix (616 bytes, patch)
2004-10-08 14:32, Sam Meder
Details


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2004-08-09 10:05:12
On our system (RedHat 7.2 + GT 3.2.0 + all patches) I often
see globus-url-copy hanging when used to transfer large files
using GASS/https.  This is easily reproduced by running
"globus-gass-server -r" in one window, and doing transfers
in a loop using
globus-url-copy -vb <gass-url>/./BIGFILE file:///usr/tmp/TMPFILE
in another.  The latter often gets stuck in a phase showing
"0 bytes  0.00 KB/sec avg 0.00 KB/sec inst"
from which it never seems to recover.  This happens for both
threaded and non-threaded versions of globus-url-copy.

Using strace on the non-threaded version, I found that in
hanging cases, the following occurs:
-----
gettimeofday({1092061833, 833515}, NULL) = 0
select(8, [3 7], [], NULL, {0, 188965}) = 1 (in [7], left {0, 190000})
read(7, "\27\3\0\4\24\304\243\334H\317\0247=nl\33\374\22\266\307"..., 131072) =
48254
gettimeofday({1092061833, 842185}, NULL) = 0
select(8, [3 7], [], NULL, {0, 180295}) = 1 (in [7], left {0, 190000})
read(7, "\27\3\0\4\24\366l\30_\275\16&Z\227\246\223\233\301\323"..., 82818) = 33568
gettimeofday({1092061833, 848637}, NULL) = 0
select(8, [3 7], [], NULL, {0, 173843}) = 1 (in [7], left {0, 180000})
read(7, "\27\3\0\4\24{\340\231[\321\247\354g\220\351\\<77\247\260"..., 49250) =
48254
gettimeofday({1092061833, 857114}, NULL) = 0
select(8, [3 7], [], NULL, {0, 165366}) = 1 (in [7], left {0, 170000})
read(7, "\27\3\0\4\24Kn\337\316\223l\3C@\310\352}D\230!\321F\211"..., 996) = 996
gettimeofday({1092061833, 863643}, NULL) = 0
select(8, [3 7], [], NULL, {0, 158837}) = 1 (in [7], left {0, 160000})
gettimeofday({1092061833, 864119}, NULL) = 0
select(8, [3 7], [], NULL, {0, 158361}) = 1 (in [7], left {0, 160000})
gettimeofday({1092061833, 864445}, NULL) = 0
select(8, [3 7], [], NULL, {0, 158035}) = 1 (in [7], left {0, 160000})
gettimeofday({1092061833, 864764}, NULL) = 0
select(8, [3 7], [], NULL, {0, 157716}) = 1 (in [7], left {0, 160000})
gettimeofday({1092061833, 865086}, NULL) = 0
select(8, [3 7], [], NULL, {0, 157394}) = 1 (in [7], left {0, 160000})
gettimeofday({1092061833, 865405}, NULL) = 0
select(8, [3 7], [], NULL, {0, 157075}) = 1 (in [7], left {0, 160000})
gettimeofday({1092061833, 865723}, NULL) = 0
select(8, [3 7], [], NULL, {0, 156757}) = 1 (in [7], left {0, 160000})
gettimeofday({1092061833, 866039}, NULL) = 0
select(8, [3 7], [], NULL, {0, 156441}) = 1 (in [7], left {0, 160000})
gettimeofday({1092061833, 866355}, NULL) = 0
select(8, [3 7], [], NULL, {0, 156125}) = 1 (in [7], left {0, 160000})
gettimeofday({1092061833, 881131}, NULL) = 0
select(8, [3 7], [], NULL, {0, 141349}) = 1 (in [7], left {0, 150000})
gettimeofday({1092061833, 881455}, NULL) = 0
select(8, [3 7], [], NULL, {0, 141025}) = 1 (in [7], left {0, 150000})
gettimeofday({1092061833, 881772}, NULL) = 0
[..]
gettimeofday({1092061834, 10118}, NULL) = 0
select(8, [3 7], [], NULL, {0, 12362})  = 1 (in [7], left {0, 20000})
gettimeofday({1092061834, 10444}, NULL) = 0
select(8, [3 7], [], NULL, {0, 12036})  = 1 (in [7], left {0, 20000})
gettimeofday({1092061834, 27880}, NULL) = 0
select(8, [3 7], [], NULL, {0, 0})      = 1 (in [7], left {0, 0})
gettimeofday({1092061834, 28199}, NULL) = 0
gettimeofday({1092061834, 28308}, NULL) = 0
gettimeofday({1092061834, 28379}, NULL) = 0
select(8, [3 7], [], NULL, {0, 494101}) = 1 (in [7], left {0, 500000})
gettimeofday({1092061834, 28702}, NULL) = 0
select(8, [3 7], [], NULL, {0, 493778}) = 1 (in [7], left {0, 500000})
gettimeofday({1092061834, 29022}, NULL) = 0
select(8, [3 7], [], NULL, {0, 493458}) = 1 (in [7], left {0, 500000})
gettimeofday({1092061834, 29410}, NULL) = 0
select(8, [3 7], [], NULL, {0, 493070}) = 1 (in [7], left {0, 500000})
gettimeofday({1092061834, 29729}, NULL) = 0
select(8, [3 7], [], NULL, {0, 491796}) = 1 (in [7], left {0, 500000})
gettimeofday({1092061834, 30999}, NULL) = 0
[..]
-----
i.e., after the first 128 KB chunk is read, no read() is issued anymore,
and timeouts also don't trigger recovery from this.

Interestingly enough, explicitly setting the buffer size used by
globus-url-copy to 128 KB or below using the "-bs" option
(default is 1 MB) seems to avoid the problem on our system.
Since 128 KB happens to be the initial buffer size supplied to read(),
this may have some significance, but then again it might not ;-)
------- Comment #1 From 2004-08-09 11:27:39 -------
can you paste the output you see when using the -dbg flag?
------- Comment #2 From 2004-08-09 11:56:53 -------
Uhm, "-dbg" seems to be just for GridFTP, and nothing extra is printed
when used with GASS transfers.
But I cannot reproduce the hang with GridFTP, just with GASS it appears.
------- Comment #3 From 2004-08-09 12:12:18 -------
Ah, sorry, didnt see the gass url.  I will pass this on to the gass expert here.

Joe
------- Comment #4 From 2004-08-11 09:45:14 -------
I've dug into this and it looks like the client is getting to a state where the
XIO select identifies the fd as having data, but then the call to
globus_l_xio_system_try_read() is being called with buflen=0 causing no progress
to be made.
------- Comment #5 From 2004-08-11 14:33:29 -------
Further tracing makes it look like the GSI XIO driver is passing a read to the
TCP layer which is waiting for an amount of data larger than the buffer. This is
forcing the read to get stuck in the XIO event driver.

Sam: Could you take a look at this? The test case I've used is a 10M file.

globus-gass-server -r

globus-url-copy -vb https://<url>/./BIGFILE file:/tmp/bigfile

To get a stack trace in gdb, I used 
break globus_xio_driver_pass_read if in_iovec[0].iov_len < in_wait_for

joe
------- Comment #6 From 2004-10-08 14:05:55 -------
Do you have any updates on this bug? We seem to be encountering the same bug, 
and it is holding up a new release of the Virtual Data Toolkit, so we are very 
interested in this. 

Thanks,
Alain Roy
------- Comment #7 From 2004-10-08 14:32:29 -------
Created an attachment (id=439) [details]
Fix
------- Comment #8 From 2004-10-20 09:37:31 -------
We (the VDT folks) have tried out this patch, and it makes a big difference--
we can now successfully run large jobs with Globus 3.2.1. Thank you, thank you!

We will be patching our version of Globus to include this patch. Can I 
recommend that you also release it as an update for Globus 3.2.1? The world 
should benefit from this improvement. 

------- Comment #9 From 2004-10-26 12:10:21 -------
*** Bug 2009 has been marked as a duplicate of this bug. ***
------- Comment #10 From 2004-10-26 14:01:06 -------
Subject: Re:  globus-url-copy often hangs with default buffer size

	This is an automatic reply.  Feel free to send additional
mail, as only this one notice will be generated.  The following
is a prerecorded message, sent for versto


I'm on holiday until October 31.

DAS system administration issues should best be reported to
	das-sysadm@cs.vu.nl.
Other mail will be handled as soon as I return.

Regards,
Kees Verstoep


------- Comment #11 From 2005-08-09 13:07:53 -------
There is now an updated xio package with this fix.

http://www.globus.org/toolkit/advisories.html?version=3.2
------- Comment #12 From 2005-08-09 13:10:06 -------
Subject: Re:  globus-url-copy often hangs with default buffer size

	This is an automatic reply.  Feel free to send additional
mail, as only this one notice will be generated.  The following
is a prerecorded message, sent for versto


I'm away until August 18.

DAS system administration issues should best be reported to
	das-sysadm@cs.vu.nl.
Other mail will be handled as soon as I return.

Regards,
Kees Verstoep