Bugzilla – Bug 4624
globus_xio_close() dumps a core
Last modified: 2008-01-18 15:29:54
Our application has been experiencing an infrequent core dump (approximately once every 3 weeks) when it tries to close an XIO handle. This happens inside Globus XIO: before closing the handle, XIO checks whether the operation is in a valid state (op->ndx > 0). But for some reason globus_assert(op->ndx > 0) fails, which indicates that op->ndx is zero or negative.

PS: We're using Globus-4.2.0.

Here is the core:

(gdb) where
#0  0x7f7c0f90 in _lwp_kill () from /lib/libc.so.1
#1  0x7f75fd80 in raise () from /lib/libc.so.1
#2  0x7f73ffa0 in abort () from /lib/libc.so.1
#3  0x7ef3d3f4 in globus_xio_driver_finished_close (in_op=0x3e7b248, in_res=0)
    at globus_xio_pass.c:548
#4  0x7ef25ed0 in globus_l_xio_driver_op_close_kickout (user_arg=0x3e7b248)
    at globus_xio_driver.c:795
#5  0x7ef3d8e0 in globus_xio_driver_finished_close (in_op=0x3e7b248, in_res=0)
    at globus_xio_pass.c:593
#6  0x7ef25ed0 in globus_l_xio_driver_op_close_kickout (user_arg=0x3e7b248)
    at globus_xio_driver.c:795
#7  0x7ef3d8e0 in globus_xio_driver_finished_close (in_op=0x3e7b248, in_res=0)
    at globus_xio_pass.c:593
#8  0x7ef934c8 in globus_l_xio_tcp_system_close_cb (result=0,
    user_arg=0x3e7b248) at globus_xio_tcp_driver.c:2118
#9  0x7ef4ed64 in globus_l_xio_system_close_kickout (user_arg=0x7fc1a0)
    at globus_xio_system_select.c:2284
#10 0x7f4a63b4 in globus_l_callback_thread_poll (user_arg=0x7f4edc38)
    at globus_callback_threads.c:2482
#11 0x7f4cc440 in globus_l_thread_pool_thread_start (user_arg=0x33ebc8)
    at globus_thread_pool.c:217
#12 0x7f4caa38 in thread_starter (temparg=0x3515c8)
    at globus_thread_pthreads.c:508

Any idea what might cause this problem?
Subject: Re: New: globus_xio_close() dumps a core

Please run your application under valgrind or some other memory-checking tool until you get the same core dump, and verify that memory is not being corrupted somewhere.

bugzilla-daemon@mcs.anl.gov wrote:
> http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4624
>
> Summary: globus_xio_close() dumps a core
> Product: XIO
> Version: 4.1.0
> Platform: Sun
> OS/Version: Solaris
> Status: NEW
> Severity: critical
> Priority: P3
> Component: Globus XIO
> AssignedTo: bresnaha@mcs.anl.gov
> ReportedBy: msamidi@ligo.caltech.edu
> CC: allcock@mcs.anl.gov
>
> [...]
Hello John,

I have some additional information. If we use the non-threaded GT XIO library, we don't get core dumps anymore.

The reason we can't use valgrind is that our application runs as multiple processes on multiple hosts, connecting to a gatekeeper through a single process written inside a Tcl extension. Also, SPARC Solaris doesn't support valgrind. I think we've tried using valgrind, but we couldn't reproduce the core dump.

(In reply to comment #1)
> Subject: Re: New: globus_xio_close() dumps a core
>
> Please run your application under valgrind or some other memory-checking
> tool until you get the same core dump, and verify that memory is not
> being corrupted somewhere.
>
> [...]
It seems that some race condition is occurring. From the information we have, it is difficult to tell whether the race is in the Globus code or in the application code. Is it possible that under certain threaded events your code calls xio_close twice? A memory-profiling tool like valgrind would be very helpful here, but I do understand that it can change the timing of things.

Would it be possible for you to come up with a small program that recreates this problem, so that I have a means of debugging it?
(In reply to comment #3)
> It seems that some race condition is occurring. From the information we
> have, it is difficult to tell whether the race is in the Globus code or in
> the application code. Is it possible that under certain threaded events your
> code calls xio_close twice? A memory-profiling tool like valgrind would be
> very helpful here, but I do understand that it can change the timing of
> things.
>
> Would it be possible for you to come up with a small program that recreates
> this problem, so that I have a means of debugging it?

I thought about that problem, calling globus_xio_close() twice. I modified the code so that it checks whether the handle it is closing is valid, and it is always closing a valid handle.

Using a small program running continuously, we can't reproduce the problem. That's the hard part.

LDAS is written using a Tcl extension. I've asked the Tcl programmer to check every call to my function that closes a Tcl channel, and it is always a valid Tcl channel, with no double closing.

Let me do some more investigation on my side, and I'll let you know if I find anything new.
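
For reference, one way an application can rule out a double close under threads is to make the validity check and the close a single atomic step, since a separate "is it valid?" test followed by the close leaves a window in which two threads can both pass the test. The following is only a minimal sketch under stated assumptions: guarded_handle_t, guarded_close() and run_close_race() are hypothetical application-side names, not part of Globus XIO, and the real close call is represented by a counter rather than by globus_xio_close().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical application-side wrapper; NOT a Globus XIO type. */
enum { HANDLE_OPEN = 0, HANDLE_CLOSED = 1 };

typedef struct {
    atomic_int state;        /* HANDLE_OPEN or HANDLE_CLOSED */
    atomic_int real_closes;  /* times the underlying close would have run */
    atomic_int extra_closes; /* close attempts rejected as duplicates */
} guarded_handle_t;

void guarded_handle_init(guarded_handle_t *h)
{
    atomic_init(&h->state, HANDLE_OPEN);
    atomic_init(&h->real_closes, 0);
    atomic_init(&h->extra_closes, 0);
}

/* Returns 1 if this caller performed the close, 0 if the handle was
 * already closed (or being closed) by another caller. */
int guarded_close(guarded_handle_t *h)
{
    int expected = HANDLE_OPEN;

    /* Check and state change happen in one atomic operation, so only
     * one thread can ever move the handle from OPEN to CLOSED. */
    if (!atomic_compare_exchange_strong(&h->state, &expected, HANDLE_CLOSED)) {
        atomic_fetch_add(&h->extra_closes, 1);
        return 0;
    }

    /* Placeholder: the application's real close call would go here. */
    atomic_fetch_add(&h->real_closes, 1);
    return 1;
}

static void *closer_thread(void *arg)
{
    guarded_close((guarded_handle_t *)arg);
    return NULL;
}

/* Starts up to nthreads threads that all try to close the same handle;
 * returns how many of them performed the real close. */
int run_close_race(int nthreads)
{
    enum { MAX_THREADS = 64 };
    pthread_t tids[MAX_THREADS];
    guarded_handle_t h;
    int started = 0;
    int i;

    if (nthreads > MAX_THREADS) {
        nthreads = MAX_THREADS;
    }
    guarded_handle_init(&h);

    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, closer_thread, &h) == 0) {
            started++;
        }
    }
    for (i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
    }

    printf("real closes: %d, rejected duplicates: %d\n",
           atomic_load(&h.real_closes), atomic_load(&h.extra_closes));
    return atomic_load(&h.real_closes);
}
```

Calling run_close_race(8) from a test program races eight threads against one handle. Logging each rejected duplicate in guarded_close() (for example with the caller's thread id) could show whether the application ever attempts a second close, without needing valgrind. If no duplicate is ever rejected and the assertion still fires, that would point away from an application-level double close.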