Bug 3737 - Container hangs, many read timeouts in ServiceGroupRegistrationClient
Component: Java WS Core
Version: 4.0.1
Platform: PC Linux
Importance: P3 normal
Target Milestone: ---
Reported: 2005-09-09 14:23
Modified: 2005-09-14 15:15

Container.log from this server. (365.37 KB, text/plain)
2005-09-09 14:24, Laura Pearlman
Typical contents of this index (118.22 KB, text/xml)
2005-09-12 19:34, Laura Pearlman
lsof output from just before I killed the container. (80.15 KB, text/plain)
2005-09-13 12:04, Laura Pearlman



Description From 2005-09-09 14:23:50
This happened on the mds.globus.org index server, which has been running for
months on end (and for weeks on end with gt4.0.1 rcs) with no problems. Two
things have changed recently: I upgraded the server to the 4.0.1 release, and
some of the remote servers registering to the index have been running
different versions (Mats knows the details).

Also, one of the main ISI file/nis/etc. servers (darkstar) went down for a while
during the lifetime of this server. This particular Globus container
installation doesn't live on any darkstar filesystems, but it may have used DNS
or something from there, or some of the remote servers may have relied on darkstar.
------- Comment #1 From 2005-09-09 14:24:41 -------
Created an attachment (id=685) [details]
Container.log from this server.

There's a thread dump at the end.
------- Comment #2 From 2005-09-12 16:47:18 -------
The server doesn't appear to be hung. It appears that all of its threads are busy
accepting/reading data from new connections, but no data is coming, so each thread
sits there until its connection times out. My guess is that it is somehow
related to the DNS problems.
What services connect to this index server?
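To illustrate the failure mode described above: a thread blocked in a read on a socket with no timeout set will wait indefinitely if the peer never sends data. The minimal sketch below (plain java.net, not actual Java WS Core container code) shows how setting SO_TIMEOUT turns such a stuck read into a SocketTimeoutException the server can recover from.

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Sketch only: a client connects but never sends data; without SO_TIMEOUT the
// server thread would block in read() forever, which is the symptom seen here.
public class ReadTimeoutSketch {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            // Client connects to the server but writes nothing.
            Socket client = new Socket("127.0.0.1", server.getLocalPort());
            Socket accepted = server.accept();
            accepted.setSoTimeout(500); // fail blocked reads after 500 ms
            try {
                accepted.getInputStream().read(); // blocks: no data ever arrives
                System.out.println("read returned");
            } catch (SocketTimeoutException e) {
                System.out.println("read timed out");
            }
            client.close();
            accepted.close();
        }
    }
}
```

With a timeout in place, each thread is freed after the timeout instead of accumulating until the container appears hung.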
------- Comment #3 From 2005-09-12 19:34:08 -------
Created an attachment (id=688) [details]
Typical contents of this index
------- Comment #4 From 2005-09-12 19:36:50 -------
Typically, this index polls four remote indexes for their contents (using the
GetResourceProperty aggregator source).

What concerns me is that I found the index in this state about 12 hours after
the machine that runs dns, etc. came back up -- so I'm afraid that these reads
(either to something like a DNS server or reads done as part of a GetRP) may
never be timing out.
------- Comment #5 From 2005-09-13 12:04:42 -------
Created an attachment (id=693) [details]
lsof output from just before I killed the container.
------- Comment #6 From 2005-09-13 15:32:05 -------
My latest thinking is that the server was not timing out the old client 
connections.

I just committed an update to the globus_4_0_branch so that client 
connections will be timed out after 3 minutes (by default) if the server gets 
blocked in a read. The timeout value is configurable via the 'containerTimeout' 
parameter in the main server-config.wsdd file.
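For reference, a sketch of what such a setting might look like in server-config.wsdd. This is an assumption about the placement and units (milliseconds, so 3 minutes = 180000); the exact form should be checked against the shipped config file:

```xml
<globalConfiguration>
  <!-- Hypothetical example: timeout for client connections blocked in read.
       Placement and units should be verified against the actual
       server-config.wsdd shipped with the container. -->
  <parameter name="containerTimeout" value="180000"/>
</globalConfiguration>
```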
------- Comment #7 From 2005-09-13 16:07:40 -------
Just for future reference: the default client-side timeout set by Axis is 10 
------- Comment #8 From 2005-09-14 15:15:50 -------
I committed the timeout fixes to trunk. Because I don't have a good way to 
replicate the problem described in this bug, I will consider this issue resolved 
for now. Please reopen the bug if the problem happens again with the updated
code.