Bug 3010 - May leak socket FDs due to potential race condition when socket timeout occurs
Status: RESOLVED FIXED
Product: Replica Location
Component: RLS
Version: 3.2.1
Platform: PC Linux
Importance: P3 normal
Target Milestone: ---
Assigned To:
Depends on: 3028
 
Reported: 2005-03-24 16:58 by
Modified: 2005-04-12 17:43


Attachments
lsof output showing increased open files (15.17 KB, text/plain)
2005-03-28 12:54, Rob S
Revised lsof output showing tests after globus_xio fix (2.49 KB, text/plain)
2005-03-30 18:24, Rob S

Description From 2005-03-24 16:58:05
When a timeout occurs in the rrpc_connect() call in rpc.c, there is a potential
race condition between the rrpc_connect() logic and the connectcb() logic. Also,
the monitor may go out of scope, because its stack frame is popped, before
connectcb() has completed, which results in undefined behavior.
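
To illustrate the hazard described above, here is a minimal sketch of one way
to keep a monitor alive until both the waiting thread and the callback are done
with it. It is not the code from rpc.c and not the fix that was eventually
applied. The names monitor_t, monitor_new(), monitor_wait() and
monitor_complete() are hypothetical, and POSIX threads are assumed. The monitor
lives on the heap and is reference counted, and a callback that arrives after
the timeout closes the descriptor itself instead of leaking it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical monitor shared by the connecting thread and the callback. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int done;       /* set by the callback when the attempt has finished */
    int abandoned;  /* set by the waiter when it gives up after a timeout */
    int fd;         /* descriptor handed over by the callback, or -1 */
    int refs;       /* one reference for the waiter, one for the callback */
} monitor_t;

static monitor_t *monitor_new(void) {
    monitor_t *m = malloc(sizeof *m);
    if (m == NULL) return NULL;
    pthread_mutex_init(&m->mutex, NULL);
    pthread_cond_init(&m->cond, NULL);
    m->done = 0;
    m->abandoned = 0;
    m->fd = -1;
    m->refs = 2;
    return m;
}

/* Drop one reference; the last owner frees the monitor. */
static void monitor_unref(monitor_t *m) {
    pthread_mutex_lock(&m->mutex);
    assert(m->refs > 0);
    int last = (--m->refs == 0);
    pthread_mutex_unlock(&m->mutex);
    if (last) {
        pthread_cond_destroy(&m->cond);
        pthread_mutex_destroy(&m->mutex);
        free(m);
    }
}

/* Callback side: hand over fd, or close it if the waiter already gave up. */
static void monitor_complete(monitor_t *m, int fd) {
    pthread_mutex_lock(&m->mutex);
    if (m->abandoned) {
        if (fd >= 0) close(fd);  /* nobody will ever claim this descriptor */
    } else {
        m->fd = fd;
        m->done = 1;
        pthread_cond_signal(&m->cond);
    }
    pthread_mutex_unlock(&m->mutex);
    monitor_unref(m);
}

/* Waiter side: returns the descriptor, or -1 if the timeout expired first. */
static int monitor_wait(monitor_t *m, int timeout_ms) {
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&m->mutex);
    int rc = 0;
    while (!m->done && rc == 0)
        rc = pthread_cond_timedwait(&m->cond, &m->mutex, &deadline);
    int fd = -1;
    if (m->done)
        fd = m->fd;        /* the waiter now owns the descriptor */
    else
        m->abandoned = 1;  /* a late callback must clean up after itself */
    pthread_mutex_unlock(&m->mutex);
    monitor_unref(m);
    return fd;
}
```

In this sketch the connecting code would call monitor_new(), pass the monitor to
the asynchronous connect as its callback argument, and then call monitor_wait().
Because each side holds its own reference, neither can free the monitor while
the other may still touch it.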

A user reported an increase in open file handles. The only abnormality recorded
in the RLS log was that a particular RLI site was unreachable and the RLS was
reaching rpc.c:142, where the connection timeout is recorded.

We have tried to recreate this but cannot. Possible differences are that the
user's environment is Solaris (ours is RH Linux), that it spans a wide area
network across the Atlantic (ours is a LAN), and that the failures involved are
not completely known (our test is to unplug the host).

The user's environment may involve a combination of errors, including failures
at a router and a switch; a building power outage was also reported. It's
impractical to recreate these conditions in full.

In any event, the code in rpc.c/rrpc_connect() can be improved to prevent the
race condition and any effects that may stem from it.

The bug was reported against version 3.2.1 and applies to Development as well.
------- Comment #1 From 2005-03-28 12:51:34 -------
Fortunately, I have recreated the bug (w/out shutting down the local power 
grid ;-).

In the test that I ran, I have an RLS opening TCP connections to 7 other RLS
services (all on a box called plato.isi.edu). By unplugging plato’s network
cable, I induce the timeout. The lsof output shows that the RLS starts off with
54 open files in total, then attempts connections to plato (the lines showing
“IPv4” and “SYN_SENT”). A while later those attempted connections turn into 7
leaked descriptors, which lsof lists as type “sock” (“can’t identify protocol”).
So after one test run the RLS open files grow from 54 to 61 (7 more), and after
a second run they grow to 68 (7 more).
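
The counts above were taken with lsof. As a rough sketch of how the same growth
could be checked from inside a process, the snippet below counts the entries in
/proc/self/fd. This assumes Linux (the path is Linux-specific), and
count_open_fds() is a hypothetical helper, not part of RLS or lsof.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/* Count this process's open descriptors by listing /proc/self/fd.
 * Linux-specific. Returns -1 if the directory cannot be opened. */
static int count_open_fds(void) {
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL) return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.') continue;  /* skip "." and ".." */
        count++;
    }
    closedir(dir);

    /* The listing includes the descriptor opendir() itself was using. */
    assert(count > 0);
    return count - 1;
}
```

Calling count_open_fds() before and after a test run and comparing the two
values would show a leak as a count that does not return to its baseline, in the
same way the lsof totals grow by 7 per run here.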
------- Comment #2 From 2005-03-28 12:54:46 -------
Created an attachment (id=553) [details]
lsof output showing increased open files

Added attachment to show increased open files. Server started with 54 open
files, grew by 7 after first test, then another 7 after second test. 14 leaked
FDs after two tests. "14" comes from the fact that there are "7" offline
(unplugged!) RLS/RLI services that the RLS is trying to update.
------- Comment #3 From 2005-03-29 02:05:21 -------
The bug has been identified in the globus_xio package.
------- Comment #4 From 2005-03-30 18:24:50 -------
Created an attachment (id=558) [details]
Revised lsof output showing tests after globus_xio fix

Output of tests performed to confirm that the globus_xio fix successfully
resolves the issue of leaked files when running RLS.
------- Comment #5 From 2005-03-30 18:26:26 -------
I've confirmed that the globus_xio fix resolves our problem. See 3028 for info
on the globus_xio bug and fix.
------- Comment #6 From 2005-03-30 18:28:53 -------
I should also mention that the final tests of RLS concerning this bug were
performed with an update package, version 3.0-050325.
------- Comment #7 From 2005-04-12 17:43:28 -------
*** Bug 1769 has been marked as a duplicate of this bug. ***