Bugzilla – Bug 3010
May leak socket FDs due to potential race condition when socket timeout occurs
Last modified: 2005-04-12 17:43:28
When a timeout occurs in the rrpc_connect() call in rpc.c, there is a potential race condition between the rrpc_connect() logic and the connectcb() logic. Also, the monitor may go out of scope when its stack frame is popped before connectcb() has completed, which induces undefined behavior. A user reported an increase in open file handles. The only abnormality recorded in the RLS log was that a particular RLI site was unreachable and the RLS was reaching rpc.c:142, where the connection timeout is recorded. We have tried to recreate this but cannot. Possible differences: the user's environment is Solaris (ours is RH Linux), the connection is wide-area across the Atlantic (ours is a LAN), and the failures are not completely known (our test is to unplug the host). The user's environment may involve a combination of errors, including failures at a router or switch; a building power outage was also reported. It is impractical to recreate these conditions in full. In any event, the code in rpc.c/rrpc_connect() can be improved to prevent the race condition and any effects that stem from it. The bug was reported against version 3.2.1 and applies to Development as well.
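To illustrate the kind of hardening described above, here is a minimal sketch in C with POSIX threads. It is not the actual rpc.c or globus_xio code: connect_monitor, monitor_new(), monitor_complete() and monitor_wait() are hypothetical names. It assumes the waiter and the callback run on different threads, and that the callback hands over a socket fd (or -1 on failure). The monitor lives on the heap and is reference-counted, so a late callback never touches a dead stack frame, and a socket delivered after the waiter has timed out is closed instead of leaked.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical monitor shared by the connecting thread and the callback. */
typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cond;
    int done;       /* callback delivered a result in time */
    int abandoned;  /* waiter timed out and will not take the fd */
    int fd;         /* socket handed over by the callback, or -1 */
    int refs;       /* one reference for the waiter, one for the callback */
} connect_monitor;

connect_monitor *monitor_new(void)
{
    connect_monitor *m = calloc(1, sizeof *m);
    if (m == NULL)
        return NULL;
    pthread_mutex_init(&m->mtx, NULL);
    pthread_cond_init(&m->cond, NULL);
    m->fd = -1;
    m->refs = 2;
    return m;
}

/* Drop one reference; whoever drops the last one frees the monitor. */
static void monitor_unref(connect_monitor *m)
{
    pthread_mutex_lock(&m->mtx);
    assert(m->refs > 0);
    int last = (--m->refs == 0);
    pthread_mutex_unlock(&m->mtx);
    if (last) {
        pthread_cond_destroy(&m->cond);
        pthread_mutex_destroy(&m->mtx);
        free(m);
    }
}

/* Callback side: runs when the connect attempt finishes, however late. */
void monitor_complete(connect_monitor *m, int fd)
{
    pthread_mutex_lock(&m->mtx);
    if (m->abandoned) {
        if (fd >= 0)
            close(fd);          /* nobody is waiting: close, do not leak */
    } else {
        m->fd = fd;
        m->done = 1;
        pthread_cond_signal(&m->cond);
    }
    pthread_mutex_unlock(&m->mtx);
    monitor_unref(m);
}

/* Waiter side: returns the fd, or -1 on timeout or failed connect. */
int monitor_wait(connect_monitor *m, int timeout_ms)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec  += timeout_ms / 1000;
    ts.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (ts.tv_nsec >= 1000000000L) {
        ts.tv_sec++;
        ts.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&m->mtx);
    while (!m->done) {
        if (pthread_cond_timedwait(&m->cond, &m->mtx, &ts) == ETIMEDOUT)
            break;
    }
    int fd = -1;
    if (m->done)
        fd = m->fd;
    else
        m->abandoned = 1;       /* decided under the same lock as the callback */
    pthread_mutex_unlock(&m->mtx);
    monitor_unref(m);
    return fd;
}
```

Because the timeout decision and the callback both take the same mutex, exactly one side ends up owning the socket, and the monitor is freed only after both sides have released it.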
Fortunately, I have recreated the bug (without shutting down the local power grid ;-). In the test that I ran, an RLS opens TCP connections to 7 other RLS services (all on a box called plato.isi.edu). By unplugging plato’s network cable, I induce the timeout. The lsof output shows that the RLS starts off with 54 open files in total, then attempts connections to plato (the lines showing “IPv4” and “SYN_SENT”); a while later those attempted connections turn into 7 leaked open files of type “sock” (“can’t identify protocol”). So after one test run the RLS open files grow from 54 to 61 (7 more), and after a second run they grow to 68 (7 more).
Created an attachment (id=553): lsof output showing increased open files. Added attachment to show increased open files. The server started with 54 open files, grew by 7 after the first test, then by another 7 after the second test: 14 leaked FDs after two tests. The 14 comes from two runs against the 7 offline (unplugged!) RLS/RLI services that the RLS is trying to update.
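The same before/after comparison can be made from inside a process rather than with lsof. The following is a minimal sketch, not part of RLS: count_open_fds() is a hypothetical helper that probes each descriptor number with fcntl(F_GETFD), which is available on both Linux and Solaris. It assumes no descriptor number exceeds the probe limit used here.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: count descriptors currently open in this process. */
int count_open_fds(void)
{
    long max = sysconf(_SC_OPEN_MAX);
    if (max < 0 || max > 65536)
        max = 65536;            /* assumption: cap the probe range */

    int n = 0;
    for (long fd = 0; fd < max; fd++) {
        /* F_GETFD fails with EBADF when fd is not an open descriptor. */
        if (fcntl((int)fd, F_GETFD) != -1)
            n++;
    }
    return n;
}
```

Calling this before and after a test run, and comparing the two counts, would show the kind of growth seen in the lsof output (54, then 61, then 68) without leaving the process.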
The bug has been identified in the globus_xio package.
Created an attachment (id=558): Revised lsof output showing tests after globus_xio fix. Output of tests performed to confirm that the globus_xio fix successfully resolves the issue of leaked files when running RLS.
I've confirmed that the globus_xio fix resolves our problem. See bug 3028 for info on the globus_xio bug and fix.
I should also mention that the final tests of RLS concerning this bug were performed with update package version 3.0-050325.
*** Bug 1769 has been marked as a duplicate of this bug. ***