Bug 4624 - globus_xio_close() dumps a core
Status: RESOLVED WORKSFORME
Product: XIO
Component: Globus XIO
Version: 4.1.0
Platform: Sun Solaris
Priority: P3
Severity: critical
Assigned To: bresnaha@mcs.anl.gov
 
Reported: 2006-07-26 17:49 by msamidi@ligo.caltech.edu
Modified: 2008-01-18 15:29


Description From msamidi@ligo.caltech.edu 2006-07-26 17:49:42

Our application has been experiencing an infrequent core dump (approximately
once every 3 weeks) when it tries to close an XIO handle. The crash happens
inside Globus XIO: before closing the handle, XIO checks whether it is in a
valid state (op->ndx > 0), but for some reason globus_assert(op->ndx > 0)
fails, indicating that op->ndx is negative.

PS: We're using Globus-4.2.0.

Here is the core:

(gdb) where
> #0  0x7f7c0f90 in _lwp_kill () from /lib/libc.so.1
> #1  0x7f75fd80 in raise () from /lib/libc.so.1
> #2  0x7f73ffa0 in abort () from /lib/libc.so.1
> #3  0x7ef3d3f4 in globus_xio_driver_finished_close (in_op=0x3e7b248, in_res=0)
>     at globus_xio_pass.c:548
> #4  0x7ef25ed0 in globus_l_xio_driver_op_close_kickout (user_arg=0x3e7b248)
>     at globus_xio_driver.c:795
> #5  0x7ef3d8e0 in globus_xio_driver_finished_close (in_op=0x3e7b248, in_res=0)
>     at globus_xio_pass.c:593
> #6  0x7ef25ed0 in globus_l_xio_driver_op_close_kickout (user_arg=0x3e7b248)
>     at globus_xio_driver.c:795
> #7  0x7ef3d8e0 in globus_xio_driver_finished_close (in_op=0x3e7b248, in_res=0)
>     at globus_xio_pass.c:593
> #8  0x7ef934c8 in globus_l_xio_tcp_system_close_cb (result=0, 
>     user_arg=0x3e7b248) at globus_xio_tcp_driver.c:2118
> #9  0x7ef4ed64 in globus_l_xio_system_close_kickout (user_arg=0x7fc1a0)
>     at globus_xio_system_select.c:2284
> #10 0x7f4a63b4 in globus_l_callback_thread_poll (user_arg=0x7f4edc38)
>     at globus_callback_threads.c:2482
> #11 0x7f4cc440 in globus_l_thread_pool_thread_start (user_arg=0x33ebc8)
>     at globus_thread_pool.c:217
> #12 0x7f4caa38 in thread_starter (temparg=0x3515c8)
>     at globus_thread_pthreads.c:508

Any idea what might cause this problem?
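
For reference, the application-side call that ends up in the
globus_xio_driver_finished_close() path above is globus_xio_close(); a
stripped-down sketch (handle setup, attrs and error handling omitted, the
"handle" variable is illustrative):

    globus_result_t                     res;

    /* close with no close attr; "handle" is the globus_xio_handle_t
     * being torn down */
    res = globus_xio_close(handle, NULL);
    if (res != GLOBUS_SUCCESS)
    {
        /* report the error */
    }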
------- Comment #1 From 2006-07-26 19:25:03 -------

Please run your application under valgrind or some other memory-checking
tool until you get the same core dump, and verify that memory is not
being corrupted somewhere.

------- Comment #2 From 2007-01-08 12:28:01 -------
Hello John,

I have some additional information. 

If we use the non-threaded GT XIO library, then we don't get core dumps anymore.

The reason we can't use valgrind is that our application runs as multiple
processes on multiple hosts, connecting to a gatekeeper through a single
process written as a Tcl extension. Also, SPARC Solaris doesn't support
valgrind.

I think we've tried using valgrind, but we can't reproduce the core dump.

(In reply to comment #1)
> Please run your application under valgrind or some other memory-checking
> tool until you get the same core dump, and verify that memory is not
> being corrupted somewhere.
------- Comment #3 From 2007-01-09 13:28:35 -------
It seems that there is some race condition occurring. From the information we
have, it is difficult to tell whether the race is in the Globus code or the
application code. Is it possible that, under certain threaded events, your code
calls xio_close twice? Having some sort of memory-profiling tool like
valgrind would be very helpful here, but I do understand that it can perturb
the timing of things.

Would it be possible for you to come up with a small program which recreates
this problem, so that I have a means of debugging?
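
For illustration, a guard along the following lines would rule a double close
out on the application side. This is only a sketch: my_conn_t, my_conn_close
and the closed flag are made-up names, not taken from your code; it just uses
the globus_common mutex calls.

    #include "globus_common.h"
    #include "globus_xio.h"

    /* Hypothetical double-close guard: only the first caller to reach
     * my_conn_close() actually calls globus_xio_close(). */
    typedef struct
    {
        globus_xio_handle_t             handle;
        globus_mutex_t                  lock;
        globus_bool_t                   closed;
    } my_conn_t;

    static globus_result_t
    my_conn_close(
        my_conn_t *                     conn)
    {
        globus_bool_t                   do_close = GLOBUS_FALSE;
        globus_result_t                 res = GLOBUS_SUCCESS;

        globus_mutex_lock(&conn->lock);
        if (!conn->closed)
        {
            conn->closed = GLOBUS_TRUE;
            do_close = GLOBUS_TRUE;
        }
        globus_mutex_unlock(&conn->lock);

        if (do_close)
        {
            /* exactly one thread ever gets here for a given conn */
            res = globus_xio_close(conn->handle, NULL);
        }

        return res;
    }

If a second close is ever attempted, this would make it visible (e.g. by
logging in the else branch) instead of racing inside XIO.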
------- Comment #4 From 2007-01-10 17:38:38 -------
(In reply to comment #3)
> Is it possible that, under certain threaded events, your code
> calls xio_close twice?

I thought about that problem of calling globus_xio_close() twice. But I modified
the code so it checks whether it is closing a valid handle or not, and it is
always closing a valid handle.

Using a small program running continuously, we can't reproduce the problem.
That's the hard part.
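
For reference, such a small test amounts to an open/close loop of roughly the
shape below (calls follow the standard XIO TCP client pattern as far as I know;
"some.host:5000" is a placeholder and error checking is stripped):

    #include "globus_xio.h"

    /* Sketch of a minimal open/close loop against a TCP driver stack;
     * placeholder contact string, no error checking. */
    int main()
    {
        globus_xio_driver_t             tcp_driver;
        globus_xio_stack_t              stack;
        globus_xio_handle_t             handle;
        int                             i;

        globus_module_activate(GLOBUS_XIO_MODULE);
        globus_xio_driver_load("tcp", &tcp_driver);
        globus_xio_stack_init(&stack, NULL);
        globus_xio_stack_push_driver(stack, tcp_driver);

        for (i = 0; i < 100000; i++)
        {
            globus_xio_handle_create(&handle, stack);
            globus_xio_open(handle, "some.host:5000", NULL);
            globus_xio_close(handle, NULL);  /* the call that asserts in production */
        }

        globus_xio_stack_destroy(stack);
        globus_xio_driver_unload(tcp_driver);
        globus_module_deactivate(GLOBUS_XIO_MODULE);
        return 0;
    }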

LDAS is written using a Tcl extension, and I've asked the Tcl programmer to
check every time it calls my function to close a Tcl channel; it is always a
valid Tcl channel, with no double closing.

Let me do some more investigation on my side, and I'll let you know if I find
anything new.