Bugzilla – Bug 3737
Container hangs, many read timeouts in ServiceGroupRegistrationClient
Last modified: 2005-09-14 15:15:50
This happened on the mds.globus.org index server. This has been running for
months on end (and for weeks on end with gt4.0.1 rcs) with no problems. Things
that have changed recently are that I upgraded the server to the 4.0.1 release,
and some of the remote servers registering to the index have been running
different versions (Mats knows details).
Also, one of the main ISI file/nis/etc. servers (darkstar) went down for a while
during the lifetime of this server. This particular Globus container
installation doesn't live on any darkstar filesystems, but it may have used DNS
or something from there, or some of the remote servers may have relied on darkstar.
Created an attachment (id=685) [details]
Container.log from this server.
There's a thread dump at the end.
The server doesn't appear to be hung. It appears that all its threads are busy
accepting/reading data from new connections but no data is coming so it sits
there until the connections time out. So my guess is that it is somehow
related to the DNS problems.
What services connect to this index server?
Created an attachment (id=688) [details]
Typical contents of this index
Typically, this index polls four remote indexes for their contents (using the
GetResourceProperty aggregator source).
What concerns me is that I found the index in this state about 12 hours after
the machine that runs DNS, etc. came back up -- so I'm afraid that these reads
(either to something like a DNS server or reads done as part of a GetRP) may
never be timing out.
Created an attachment (id=693) [details]
lsof output from just before I killed the container.
My latest thinking is that the server was not timing out the old client connections.
I just committed an update to the globus_4_0_branch where client
connections will be timed out after 3 minutes by default if the server is
blocked in a read. The timeout value is configurable via the 'containerTimeout'
parameter in the main server-config.wsdd file.
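
For readers unfamiliar with this kind of fix, here is a minimal sketch of a server-side read timeout using the standard java.net.Socket.setSoTimeout call. It is not the actual container code from the commit; the class and method names (ReadTimeoutSketch, readTimesOut, demo) are made up for illustration, and the demo uses a 200 ms timeout rather than the 3 minute default mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration only; not the Globus container implementation.
public class ReadTimeoutSketch {

    /**
     * Accepts one connection and tries to read a single byte from it.
     * Returns true if the read was abandoned because the timeout expired.
     */
    static boolean readTimesOut(ServerSocket server, int timeoutMillis) throws IOException {
        try (Socket client = server.accept()) {
            // Without this call a read on a silent connection blocks indefinitely,
            // tying up the thread that is servicing the connection.
            client.setSoTimeout(timeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read();
                return false; // data or end-of-stream arrived before the timeout
            } catch (SocketTimeoutException e) {
                return true;  // no data arrived in time; the thread is released
            }
        }
    }

    /** Connects a client that never sends anything and reports whether the read timed out. */
    static boolean demo(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket silentClient = new Socket(server.getInetAddress(), server.getLocalPort())) {
            return readTimesOut(server, timeoutMillis);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + demo(200));
    }
}
```

With a timeout in place, a connection that never sends data costs a worker thread at most the timeout period instead of holding it forever, which is the failure mode suspected in this bug.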
Just for future reference: the default client-side timeout set by Axis is 10
I committed the timeout fixes to trunk. Because I don't have a good way to
replicate the problem described in this bug I will consider this issue resolved
for now. Please reopen the bug if the problem happens again with the updated