Bugzilla – Bug 1546
File descriptor leak & potential deadlock
Last modified: 2008-07-18 14:35:07
You need to log in before you can comment on or make changes to this bug.
This is a patch for two GRAM problems: 1) There is a file descriptor leak. 2) There is a fix for a potential deadlock problem. These were developed by David Smith at CERN and tested as part of the VDT. I will attach the patch separately. -alain
Created an attachment (id=310) [details] Patch for problem described above Patch for problem described above
The first problem (FD leak) ought to have been fixed in Globus 2.4. There was a problem in Globus I/O where failed connects would result in a FD leak, but that was fixed a long time ago (see bug #237). Can you verify that this problem exists with a recent version of the code (and identify version what you are using). A test case sent in with the patches would be excellent. The deadlock problem has been fixed already. See bug #1566 for details on bug and the fix I applied.
Joe, the first problem obviously was *not* fixed in 2.4! Look at the original globus_2_4_3 code in the attachment! We found the leak running the Condor-G "gahp_server" under "valgrind", so this is not some theoretical issue: it actually was a problem! Please do apply the fix; it has been carefully tested. Thanks.
Hi Joe, Regarding the first problem addressed by this patch (file descriptor leak): The version being discussed is globus-2.4.3, it's identified in the patch text, although not directly in the bug comment. The patch is forcing a close on the connection handle from a failed connect. (The close was previously conditional on whether it was a failed connect or whether a subsequent write register failed). Infact there is still some room for tidying up the handling of the error objects here, although as written it doesn't cause any problem. (As an asside there are several places in the existing code where error objects are never released, although given the small finite size of the globus object cache it don't see that it causes any problems) In this case I think there is argument for saying that, given the connection handle should always be closed before globus_l_gram_protocol_connect_callback() is called with an error, the close is indeed not necessary. However we've seen that sometimes the failed connection handle is _not_ closed and thus we suffer from the file descriptor leak. So in the end I made a decision to always attempt a close when there is an error in this routine, rather than distinguish whether the error was a connection error or some other error in the callback processing. If you prefer to enforce the close in the connection logic that would also be very accepable: A test case for this could be attempting a connection (using a secure authentication mode) to a destination that accepts the connection, receives the token but then drops the connection. I had started to look into the lower level connection handling, with the intention of including the required patch in some future suggestions. Indeed I have one solution now, although I haven't convinced myself that it really cannot cause any existing code to misbehave. I include it now so that you can follow that up, if you like. (see attached patch to globus_io_securesocket.c) Many thanks, David
Created an attachment (id=328) [details] Some suggested changes for globus_io_securesocket.c Some possible changes to ensure that failed connection attempts close the connection handle before calling the connection callback routine.
I'm redirecting this bug to the I/O guys. You should definitely try without this patch with the 3.2 beta---Globus I/O has been reworked as a compatibility layer on top of XIO, so the bugs might be a bit different. joe
The semantics of not requiring a close after a failed connect will remain. The fd leak appears to be possible when authentication fails mid way through the handshake. Since 3.2 now uses a globus io compatibility wrapper over globus xio, the priority for fixing this will be pretty low. As a work around, it does not hurt to always call globus_io_register_close(). If the handle has already been destroyed, the call will just fail. As for error objects not being released, this is the way they're intended to be used. As you note, the error object cache size is finite, so it's not a real leak. The system was designed in such a way to allow you to ignore return codes. If you're not going to do anything with an error object, you have no reason to clutter your code with get-ing and free-ing the object. Joe