| Summary: | May leak socket FDs due to potential race condition when socket timeout occurs | ||
|---|---|---|---|
| Product: | Replica Location | Reporter: | Rob S <schuler@isi.edu> |
| Component: | RLS | Assignee: | Rob S <schuler@isi.edu> |
| Status: | RESOLVED FIXED | ||
| Severity: | normal | CC: | aleks@fys.uio.no, annc@isi.edu, link@mcs.anl.gov, shishir@isi.edu |
| Priority: | P3 | ||
| Version: | 3.2.1 | ||
| Target Milestone: | --- | ||
| Hardware: | PC | ||
| OS: | Linux | ||
| Bug Depends on: | 3028 | ||
| Bug Blocks: | |||
| Attachments: | lsof output showing increased open files; Revised lsof output showing tests after globus_xio fix | ||
Fortunately, I have recreated the bug (without shutting down the local power grid ;-). In my test, an RLS opens TCP connections to 7 other RLS services, all on a box called plato.isi.edu. Unplugging plato’s network cable induces the timeout. The lsof output shows that the RLS starts with 54 open files in total and then attempts connections to plato (the lines showing “IPv4” and “SYN_SENT”). A while later, those attempted connections turn into 7 leaked open files of type “sock” (“can’t identify protocol”). So after one test run the RLS's open files grow from 54 to 61 (7 more), and after a second run they grow to 68 (7 more).
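One way to pull these numbers out of saved lsof output is sketched below. It assumes lsof's default text layout, where a pending connection shows `SYN_SENT` at the end of the line and an orphaned socket shows `can't identify protocol`. The function name and the sample lines are made up for illustration and are not copied from the attachment.

```python
def summarize_lsof(text):
    """Count total entries, pending connects, and orphaned sockets in lsof text."""
    summary = {"total": 0, "syn_sent": 0, "orphaned": 0}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the header row that lsof prints first.
        if not line or line.startswith("COMMAND"):
            continue
        summary["total"] += 1
        if "SYN_SENT" in line:
            summary["syn_sent"] += 1
        elif "can't identify protocol" in line:
            summary["orphaned"] += 1
    return summary


# Hypothetical sample lines (process name, PID, and ports are invented).
SAMPLE = """\
COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
rls     4242 rls   10u  IPv4  81234      0t0   TCP client.example.org:40112->plato.isi.edu:39281 (SYN_SENT)
rls     4242 rls   11u  sock    0,5      0t0 81240 can't identify protocol
rls     4242 rls   12u  sock    0,5      0t0 81241 can't identify protocol
"""

print(summarize_lsof(SAMPLE))
```

Running this over snapshots taken before and after each test run would give the total and the orphaned-socket count for each snapshot, which is the comparison described above.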
Created an attachment (id=553)
lsof output showing increased open files
Added attachment to show increased open files. Server started with 54 open
files, grew by 7 after the first test, then by another 7 after the second test:
14 leaked FDs after two tests. Each run leaks 7 because there are 7 offline
(unplugged!) RLS/RLI services that the RLS is trying to update.
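The same kind of growth can be watched from inside a process without lsof. The sketch below is a generic illustration, not RLS or globus_xio code. It counts open descriptors by probing each number with `os.fstat` on a POSIX system; the helper name and the upper bound of 1024 are assumptions. Seven sockets that are never closed stand in for the seven abandoned connection attempts.

```python
import os
import socket


def count_open_fds(max_fd=1024):
    """Count descriptors open in this process by probing 0..max_fd-1."""
    count = 0
    for fd in range(max_fd):
        try:
            os.fstat(fd)
        except OSError:
            continue  # this number is not an open descriptor
        count += 1
    return count


before = count_open_fds()

# Stand-ins for 7 connection attempts whose sockets are never closed.
held = [socket.socket(socket.AF_INET, socket.SOCK_STREAM) for _ in range(7)]
growth = count_open_fds() - before

# Closing every socket returns the count to its starting value.
for s in held:
    s.close()
restored = count_open_fds()

print(growth, restored == before)
```

If each failed attempt releases its socket, the count should return to its starting value after every run instead of stepping up by 7.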
The bug has been identified in the globus_xio package.
Created an attachment (id=558)
Revised lsof output showing tests after globus_xio fix
Output of tests performed to confirm that the globus_xio fix successfully
resolves the issue of leaked files when running RLS.
I've confirmed that the globus_xio fix resolves our problem. See 3028 for info on the globus_xio bug and fix.