Bugzilla – Bug 3010
May leak socket FDs due to potential race condition when socket timeout occurs
Last modified: 2005-04-12 17:43:28
When a timeout occurs in the rrpc_connect() call in rpc.c, there is a potential
race condition between the rrpc_connect() logic and the connectcb() logic. Also,
the monitor lives in rrpc_connect()'s stack frame and may go out of scope before
connectcb() has completed, resulting in undefined behavior.
A user reported an increase in open file handles. The only abnormality recorded
in the RLS log was that a particular RLI site was unreachable and the RLS was
reaching rpc.c:142, where the connection timeout is recorded.
We have tried to recreate this but cannot. Possible differences are that the
user's environment is Solaris (ours is RH Linux), the user's connections cross a
wide area network across the Atlantic (ours is a LAN), and the user's failures
are not completely known (our test is to unplug the host). The user's
environment may involve a combination of errors, including failures at a router
or switch; a building power outage was also reported. It is impractical to
recreate these conditions in full.
In any event, the code in rpc.c/rrpc_connect() can be improved to prevent the
race condition and any/all effects that may stem from that.
The bug was reported against version 3.2.1 and applies to Development as well.
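One way to picture the kind of improvement meant here is a minimal sketch that assumes a pthread-style monitor. This is not the actual rpc.c or globus_xio code: conn_monitor, monitor_new(), monitor_complete() and monitor_wait() are hypothetical names invented for illustration. The idea is to allocate the monitor on the heap with one reference for the waiting caller and one for the callback, so that a callback arriving after the timeout still finds valid memory and can close the handle nobody is waiting for.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical state shared by the connecting caller and the asynchronous
 * connect callback. It is heap-allocated and reference-counted so that it
 * outlives whichever side finishes last. */
typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cond;
    int done;       /* callback has delivered a result */
    int abandoned;  /* caller timed out and stopped waiting */
    int fd;         /* descriptor delivered by the callback, -1 if none */
    int refs;       /* one reference for the caller, one for the callback */
} conn_monitor;

conn_monitor *monitor_new(void)
{
    conn_monitor *m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;
    pthread_mutex_init(&m->mtx, NULL);
    pthread_cond_init(&m->cond, NULL);
    m->done = 0;
    m->abandoned = 0;
    m->fd = -1;
    m->refs = 2;
    return m;
}

/* Drop one reference; the last holder destroys and frees the monitor. */
static void monitor_unref(conn_monitor *m)
{
    pthread_mutex_lock(&m->mtx);
    assert(m->refs > 0);
    int remaining = --m->refs;
    pthread_mutex_unlock(&m->mtx);
    if (remaining == 0) {
        pthread_cond_destroy(&m->cond);
        pthread_mutex_destroy(&m->mtx);
        free(m);
    }
}

/* Called from the connect callback. If the caller already timed out, the
 * descriptor has no owner, so it is closed here instead of being leaked. */
void monitor_complete(conn_monitor *m, int fd)
{
    pthread_mutex_lock(&m->mtx);
    if (m->abandoned) {
        if (fd >= 0)
            close(fd);
    } else {
        m->fd = fd;
        m->done = 1;
        pthread_cond_signal(&m->cond);
    }
    pthread_mutex_unlock(&m->mtx);
    monitor_unref(m);
}

/* Called by the connecting side. Returns the descriptor, or -1 on timeout
 * or failure. The "abandoned" flag is set under the same lock the callback
 * takes, so exactly one side ends up owning the descriptor. */
int monitor_wait(conn_monitor *m, long timeout_ms)
{
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&m->mtx);
    int rc = 0;
    while (!m->done && rc == 0)
        rc = pthread_cond_timedwait(&m->cond, &m->mtx, &deadline);
    int fd = -1;
    if (m->done)
        fd = m->fd;
    else
        m->abandoned = 1;
    pthread_mutex_unlock(&m->mtx);
    monitor_unref(m);
    return fd;
}
```

In this sketch the caller would create the monitor, hand it to the asynchronous open as the callback argument, and then call monitor_wait(). Whether the real connectcb() can be given such an argument is an assumption here.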
Fortunately, I have recreated the bug (w/out shutting down the local power
In the test that I ran, I have an RLS opening TCP connections to 7 other RLS
services (all on a box called plato.isi.edu). By unplugging plato’s network
cable, I induce the timeout. The lsof output shows that the RLS starts off with
54 open files (total), then attempts connections to plato (the lines showing
“IPv4” and “SYN_SENT”), and a while later those attempted connections turn into
7 leaked open files of type “sock” (“can’t identify protocol”). So after one
test run, the RLS open files grow from 54 to 61 (7 more), and after a second run
they grow to 68 (7 more).
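The same growth can also be watched from inside a process rather than through lsof. The following is a minimal sketch, not part of RLS: count_open_fds() and fds_gained_since() are hypothetical helpers, and the cap of 65536 probed descriptors is an assumption made to keep the loop short. Note that lsof totals also include entries that are not descriptors (for example the working directory and mapped files), so only the deltas are expected to match, not the absolute counts.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Count the descriptors currently open in this process by probing each
 * descriptor number with fcntl(F_GETFD), which fails with EBADF for
 * numbers that are not open. */
int count_open_fds(void)
{
    long max = sysconf(_SC_OPEN_MAX);
    if (max < 0 || max > 65536)
        max = 65536;            /* assumed cap, see the note above */
    int count = 0;
    for (long fd = 0; fd < max; fd++) {
        if (fcntl((int)fd, F_GETFD) != -1)
            count++;
    }
    return count;
}

/* Report how many descriptors were gained since a baseline count taken
 * earlier with count_open_fds(). A value that keeps growing by the same
 * amount after every failed update cycle points to a leak. */
int fds_gained_since(int baseline)
{
    assert(baseline >= 0);
    return count_open_fds() - baseline;
}
```

Taking a baseline at startup and logging fds_gained_since(baseline) after each update cycle would show the same steps of 7 that appear in the lsof output, assuming one leaked descriptor per unreachable service.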
Created an attachment (id=553) [details]
lsof output showing increased open files
Added attachment to show increased open files. The server started with 54 open
files, grew by 7 after the first test, then by another 7 after the second test,
for 14 leaked FDs after two tests. The "14" comes from two test runs, each
leaking one FD for each of the "7" offline (unplugged!) RLS/RLI services that
the RLS is trying to update.
The bug has been identified in the globus_xio package.
Created an attachment (id=558) [details]
Revised lsof output showing tests after globus_xio fix
Output of tests performed to confirm that the globus_xio fix successfully
resolves the issue of leaked files when running RLS.
I've confirmed that the globus_xio fix resolves our problem. See 3028 for info
on the globus_xio bug and fix.
I should also mention that the final tests of RLS concerning this bug were
performed with an update package, version 3.0-050325.
*** Bug 1769 has been marked as a duplicate of this bug. ***