Bugzilla – Bug 6356
LIGO: RLS crash on x86/Solaris 10 box
Last modified: 2008-10-14 13:14:21
You need to log in before you can comment on or make changes to this bug.
This record is for keeping track of the test box setup at ISI running Sun x86_64 Solaris 10. The RLS is installed from the GT 4.2.0 binary bundle for x86 Solaris 10. It is using MySQL 5.0(.22 I believe), a MySQL ODBC 3.51(.latest) driver, and uses the libiodbc from the GT package. The database is loaded with data from LIGO so its a fairly large set of replica mappings. There are two test clients that run every 5 minutes and execute 10,000 creates followed by 10,000 deletes (each). We've had a couple crashes on this box. Both stack traces (to follow) indicate the same line of code caused the crash.
Stack trace for crash on Aug 22: (dbx) where current thread: t@3 [1] __lwp_kill(0x3, 0x6), at 0xfe880d57 [2] _thr_kill(0x3, 0x6), at 0xfe87e512 [3] raise(0x6), at 0xfe82d083 [4] umem_do_abort(0x0, 0xfefa5000, 0xfe5ba9f4, 0xfef87bda, 0xfef92fe4, 0xfefa9a00), at 0xfef85f79 [5] umem_err_recoverable(0xfef92fe4, 0xfefa9a00, 0x81dd9b8, 0xfef9300c), at 0xfef860a9 [6] process_free(0x81dd9b8, 0x1, 0x0, 0xfe5baa78, 0xfe9f0df4, 0x81dd9b8), at 0xfef87bda [7] free(0x81dd9b8, 0x24e7f380, 0xfe9f0dc6), at 0xfef87c7f [8] CRYPTO_free(0xfef8a67d, 0x80ac390, 0x0, 0x0, 0x10, 0x0), at 0xfe9f0df4 =>[9] globus_i_gsi_gssapi_init_ssl_context(minor_status = 0xfe5baae4, credential = 0x24e7f448, anon_ctx = GLOBUS_I_GSI_GSS_DEFAULT_CONTEXT), line 2115 in "globus_i_gsi_gss_utils.c" [10] globus_i_gsi_gss_create_cred(minor_status = 0xfe5bab18, cred_usage = 2, output_cred_handle_P = 0x6ffb8e28, cred_handle = 0xfe5bab1c), line 1562 in "globus_i_gsi_gss_utils.c" [11] globus_i_gsi_gss_cred_read(minor_status = 0xfe5bab58, cred_usage = 2, cred_handle = 0x6ffb8e28, desired_subject = (nil)), line 1353 in "globus_i_gsi_gss_utils.c" [12] gss_acquire_cred(minor_status = 0xfe5bacac, desired_name_P = (nil), time_req = 4294967295U, desired_mechs = (nil), cred_usage = 2, output_cred_handle_P = 0x6ffb8e28, actual_mechs = (nil), time_rec = (nil)), line 138 in "acquire_cred.c" [13] globus_i_gsi_gss_create_and_fill_context(minor_status = 0xfe5bad24, context_handle_P = 0xfe5bad30, cred_handle = (nil), cred_usage = 2, req_flags = 2U), line 297 in "globus_i_gsi_gss_utils.c" [14] gss_accept_sec_context(minor_status = 0xfe5badc4, context_handle_P = 0x6ffe92e0, acceptor_cred_handle = (nil), input_token = 0xfe5badb0, input_chan_bindings = (nil), src_name_P = 0x6ffe92f0, mech_type = 0x6ffe92ec, output_token = 0xfe5badb8, ret_flags = 0x6ffe92d4, time_rec = 0x6ffe92d8, delegated_cred_handle_P = 0x6ffe92e4), line 127 in "accept_sec_context.c" [15] globus_l_xio_gsi_read_token_cb(op = 0x8216388, result = 0, nbytes = 106U, user_arg = 0x6ffe92d0), line 1176 in "globus_xio_gsi.c" [16] globus_l_xio_driver_op_read_kickout(user_arg = 0x8216388), line 639 in "globus_xio_driver.c" [17] globus_xio_driver_finished_read(in_op = 0x8216388, result = 0, nbytes = 106U), line 1232 in "globus_xio_pass.c" [18] globus_l_xio_tcp_finish_read(handle = 0x31d59208, result = 0, nbytes = 106U), line 2243 in "globus_xio_tcp_driver.c" [19] globus_l_xio_tcp_system_read_cb(result = 0, nbytes = 106U, user_arg = 0x31d59208), line 2259 in "globus_xio_tcp_driver.c" [20] globus_l_xio_system_kickout(user_arg = 0x81f7090), line 919 in "globus_xio_system_select.c" [21] globus_l_callback_thread_poll(user_arg = 0xfec68960), line 2519 in "globus_callback_threads.c" [22] globus_l_thread_pool_thread_start(user_arg = 0x80d8fe8), line 222 in "globus_thread_pool.c" [23] thread_starter(temparg = 0x80cf028), line 513 in "globus_thread_pthreads.c" [24] _thr_setup(0xfe4a0000), at 0xfe87fd36 [25] _lwp_start(0x3, 0x6, 0xfe8a8000, 0xfe5ba9a8, 0xfe82d083, 0x3), at 0xfe880020
Stack trace for crash on Sep 03: (dbx) where current thread: t@3 [1] __lwp_kill(0x3, 0x6), at 0xfe880d57 [2] _thr_kill(0x3, 0x6), at 0xfe87e512 [3] raise(0x6), at 0xfe82d083 [4] umem_do_abort(0x4, 0xfefa5000, 0xfe5ba95c, 0xfef88b88, 0xfef93b90, 0x80ac710), at 0xfef85f79 [5] umem_err_recoverable(0xfef93b90), at 0xfef860a9 [6] umem_error(0x2, 0x80ac710, 0x250e4b78), at 0xfef88b88 [7] umem_cache_free_debug(0x80ac710, 0x250e4b78), at 0xfef89d31 [8] _umem_cache_free(0x80ac710, 0x250e4b78), at 0xfef8a42b [9] _umem_free(0x250e4b78, 0x18), at 0xfef8a904 [10] process_free(0x250e4b80, 0x1, 0x0, 0xfe5baa78, 0xfe9f0df4, 0x250e4b80), at 0xfef87c55 [11] free(0x250e4b80, 0x24d987b0, 0xfe9f0dc6), at 0xfef87c7f [12] CRYPTO_free(0xfef8a67d, 0x80ac390, 0x0, 0x0, 0x10, 0x0), at 0xfe9f0df4 =>[13] globus_i_gsi_gssapi_init_ssl_context(minor_status = 0xfe5baae4, credential = 0x24f6ddd0, anon_ctx = GLOBUS_I_GSI_GSS_DEFAULT_CONTEXT), line 2115 in "globus_i_gsi_gss_utils.c" [14] globus_i_gsi_gss_create_cred(minor_status = 0xfe5bab18, cred_usage = 2, output_cred_handle_P = 0x80c15a8, cred_handle = 0xfe5bab1c), line 1562 in "globus_i_gsi_gss_utils.c" [15] globus_i_gsi_gss_cred_read(minor_status = 0xfe5bab58, cred_usage = 2, cred_handle = 0x80c15a8, desired_subject = (nil)), line 1353 in "globus_i_gsi_gss_utils.c" [16] gss_acquire_cred(minor_status = 0xfe5bacac, desired_name_P = (nil), time_req = 4294967295U, desired_mechs = (nil), cred_usage = 2, output_cred_handle_P = 0x80c15a8, actual_mechs = (nil), time_rec = (nil)), line 138 in "acquire_cred.c" [17] globus_i_gsi_gss_create_and_fill_context(minor_status = 0xfe5bad24, context_handle_P = 0xfe5bad30, cred_handle = (nil), cred_usage = 2, req_flags = 2U), line 297 in "globus_i_gsi_gss_utils.c" [18] gss_accept_sec_context(minor_status = 0xfe5badc4, context_handle_P = 0x24d79718, acceptor_cred_handle = (nil), input_token = 0xfe5badb0, input_chan_bindings = (nil), src_name_P = 0x24d79728, mech_type = 0x24d79724, output_token = 0xfe5badb8, ret_flags = 0x24d7970c, time_rec = 0x24d79710, delegated_cred_handle_P = 0x24d7971c), line 127 in "accept_sec_context.c" [19] globus_l_xio_gsi_read_token_cb(op = 0x24e060c8, result = 0, nbytes = 106U, user_arg = 0x24d79708), line 1176 in "globus_xio_gsi.c" [20] globus_l_xio_driver_op_read_kickout(user_arg = 0x24e060c8), line 639 in "globus_xio_driver.c" [21] globus_xio_driver_finished_read(in_op = 0x24e060c8, result = 0, nbytes = 106U), line 1232 in "globus_xio_pass.c" [22] globus_l_xio_tcp_finish_read(handle = 0x80e9388, result = 0, nbytes = 106U), line 2243 in "globus_xio_tcp_driver.c" [23] globus_l_xio_tcp_system_read_cb(result = 0, nbytes = 106U, user_arg = 0x80e9388), line 2259 in "globus_xio_tcp_driver.c" [24] globus_l_xio_system_kickout(user_arg = 0x24f67f40), line 919 in "globus_xio_system_select.c" [25] globus_l_callback_thread_poll(user_arg = 0xfec68960), line 2519 in "globus_callback_threads.c" [26] globus_l_thread_pool_thread_start(user_arg = 0x80d8fe8), line 222 in "globus_thread_pool.c" [27] thread_starter(temparg = 0x80cf028), line 513 in "globus_thread_pthreads.c" [28] _thr_setup(0xfe4a0000), at 0xfe87fd36 [29] _lwp_start(0x3, 0x6, 0xfe8a8000, 0xfe5ba8f4, 0xfe82d083, 0x3), at 0xfe880020
libumem was also used during both runs that resulted in dumped core. In both cases, umem_status indicates a double free at the point of the crash. The umem_verify for 8/22 shows all clean. However, the 9/03 umem_verify has a corrupt buffer -- the same indicated in the double free.
This isn't terribly insightful for the corrup buffer but here's the output of umem_verify: > 80ac710 ::umem_verify Summary for cache 'umem_alloc_24' buffer 24f05a58 (allocated) has a corrupt buftag And umem_status: > ::umem_status Status: ready and active Concurrency: 8 Logs: transaction=128k (inactive) Message buffer: umem allocator: duplicate free: buffer freed twice buffer=250e4b78 bufctl=250e64c8 cache: umem_alloc_24 previous transaction on buffer 250e4b78: thread=4 time=T-0.-00000578 slab=24d130a8 cache: umem_alloc_24 libumem.so.1'?? (0xfef89ab4) libumem.so.1'umem_cache_alloc+0xe1 libumem.so.1'umem_alloc+0x3f libumem.so.1'malloc+0x23 umem: heap corruption detected stack trace: libumem.so.1'?? (0xfef86099) libumem.so.1'?? (0xfef88b88) libumem.so.1'?? (0xfef89d31) libumem.so.1'umem_cache_free+0x3f libumem.so.1'umem_free+0xd5 libumem.so.1'?? (0xfef87c55) libumem.so.1'free+0x14 libcrypto.so.0.9.7'CRYPTO_free+0x37 libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'globus_i_gsi_gssapi_init_ssl_context+0xe6 libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'globus_i_gsi_gss_create_cred+0x237 libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'globus_i_gsi_gss_cred_read+0x109 libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'gss_acquire_cred+0x139 libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'globus_i_gsi_gss_create_and_fill_context+0x464 libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'gss_accept_sec_context+0x188 libglobus_xio_gsi_driver_gcc32dbgpthr.so.0.0.0'?? (0xfe284fdc) libglobus_xio_gcc32dbgpthr.so.0.0.0'globus_l_xio_driver_op_read_kickout+0x2d8 libglobus_xio_gcc32dbgpthr.so.0.0.0'globus_xio_driver_finished_read+0x5e3 libglobus_xio_gcc32dbgpthr.so.0.0.0'?? (0xfed31414) libglobus_xio_gcc32dbgpthr.so.0.0.0'?? (0xfed314ea) libglobus_xio_gcc32dbgpthr.so.0.0.0'?? (0xfed0cd7e) libglobus_common_gcc32dbgpthr.so.0.0.0'?? (0xfec3392b) libglobus_common_gcc32dbgpthr.so.0.0.0'globus_l_thread_pool_thread_start+0x30 libglobus_common_gcc32dbgpthr.so.0.0.0'?? (0xfec4c378) libc.so.1'?? (0xfe87fd36) libc.so.1'?? (0xfe880020)
Inspecting the source code along the stack trace, I went looking for memory that could have been passed from outside of the GSI GSSAPI call from a caller (i.e., from RLS or XIO)... =>[13] globus_i_gsi_gssapi_init_ssl_context(minor_status = 0xfe5baae4, credential = 0x24f6ddd0, anon_ctx = GLOBUS_I_GSI_GSS_DEFAULT_CONTEXT), line 2115 in "globus_i_gsi_gss_utils.c" Unfortunatly, the line 2115 does not match up with the source code. I suspect this is because of conditional compiling by the C preprocessor. So the best I can do at this point is to just see where the parameters came from. OM_uint32 globus_i_gsi_gssapi_init_ssl_context( OM_uint32 * minor_status, gss_cred_id_t credential, globus_i_gsi_gss_context_type_t anon_ctx) It's the address of the "credential" variable that the debuggers indicate is the problem. So next step is to see where that came from. [14] globus_i_gsi_gss_create_cred(minor_status = 0xfe5bab18, cred_usage = 2, output_cred_handle_P = 0x80c15a8, cred_handle = 0xfe5bab1c), line 1562 in "globus_i_gsi_gss_utils.c" ... newcred->cred_handle = *cred_handle; *cred_handle = NULL; (*) major_status = globus_i_gsi_gssapi_init_ssl_context( &local_minor_status, (gss_cred_id_t) newcred, GLOBUS_I_GSI_GSS_DEFAULT_CONTEXT); ... (*) marks line 1562 So the credential variable comes from "newcred" which is malloc'ed in the same function on line 1522 and initialized before the call. Next is to find out where the cred_handle memory address comes from. OM_uint32 globus_i_gsi_gss_create_cred( OM_uint32 * minor_status, const gss_cred_usage_t cred_usage, gss_cred_id_t * output_cred_handle_P, globus_gsi_cred_handle_t * cred_handle) That is an input parameter to the function. So next step is to go up the stack to find out where that came from. [15] globus_i_gsi_gss_cred_read(minor_status = 0xfe5bab58, cred_usage = 2, cred_handle = 0x80c15a8, desired_subject = (nil)), line 1353 in "globus_i_gsi_gss_utils.c" ... (*) major_status = globus_i_gsi_gss_create_cred(&local_minor_status, cred_usage, cred_handle, &local_cred_handle); ... (*) indicates line 1353 The variable "local_cred_handle" has its address passed to the create_cred() function (not to be confused with this function's local variable called "cred_handle"). So the next step is to find out where "local_cred_handle" comes from. At this point, I'll just paste in the method from the signature to the last line executed before the crash. OM_uint32 globus_i_gsi_gss_cred_read( OM_uint32 * minor_status, const gss_cred_usage_t cred_usage, gss_cred_id_t * cred_handle, const X509_NAME * desired_subject) { globus_result_t local_result = GLOBUS_SUCCESS; globus_gsi_cred_handle_t local_cred_handle; OM_uint32 local_minor_status; OM_uint32 major_status = GSS_S_COMPLETE; static char * _function_name_ = "globus_i_gsi_gss_cred_read"; GLOBUS_I_GSI_GSSAPI_DEBUG_ENTER; local_result = globus_gsi_cred_handle_init(&local_cred_handle, NULL); if(local_result != GLOBUS_SUCCESS) { local_cred_handle = NULL; GLOBUS_GSI_GSSAPI_ERROR_CHAIN_RESULT( minor_status, local_result, GLOBUS_GSI_GSSAPI_ERROR_WITH_GSI_CREDENTIAL); major_status = GSS_S_FAILURE; goto exit; } local_result = globus_gsi_cred_read(local_cred_handle, desired_subject); if(local_result != GLOBUS_SUCCESS) { GLOBUS_GSI_GSSAPI_ERROR_CHAIN_RESULT( minor_status, local_result, GLOBUS_GSI_GSSAPI_ERROR_WITH_GSS_CREDENTIAL); major_status = GSS_S_FAILURE; goto exit; } major_status = globus_i_gsi_gss_create_cred(&local_minor_status, cred_usage, cred_handle, &local_cred_handle); ... At this point, it appears that the memory buffer that is associated with the crash originates from within the "globus_i_gsi_gss_cred_read" function. I do not see an indication that outside (calling) code is responsible for this buffer.
This is reported as a double-free. Is there a report indicating which shows where the initial free occurred?
How reproducable is this? The stack traces not matching the source code makes it extremely difficult to make any sense out of this bug report. I don't even see where CRYPTO_free() can be called directly from globus_i_gsi_gssapi_init_ssl_context().
(In reply to comment #6) > This is reported as a double-free. Is there a report indicating which shows > where the initial free occurred? > The umem log is here: http://www.isi.edu/~schuler/umalog.20080903
> The umem log is here: > http://www.isi.edu/~schuler/umalog.20080903 > Sorry. That should have been: http://www.isi.edu/~schuler/bug6356/umalog.20080903
(In reply to comment #7) > How reproducable is this? It is not easy to reproduce and I do not have a precise set of steps to recreate the bug. An RLS (from GT 4.2) running (on Solaris 10 x86 64) for a couple days, with a small number (2) of clients, has encountered this problem twice. This is not exactly a "crash", it is actually an ABORT caused by the debug memory library (libumem). Normally, (I think) a double-free could occur without immediate consequences -- which in some ways is even more concerning. I have a couple core dumps here: http://www.isi.edu/~schuler/bug6356/core.20080822.gz http://www.isi.edu/~schuler/bug6356/core.20080903.gz > The stack traces not matching the source code makes > it extremely difficult to make any sense out of this bug report. I don't even > see where CRYPTO_free() can be called directly from > globus_i_gsi_gssapi_init_ssl_context(). > The stack trace line numbers match up, until the last (Globus-related) frame. I believe that may be due to the C preprocessor and #defined code in this compilation unit. But it is really unfortunate that the lineno does not match up for that last frame.
I still have a few problems with this report: The double-free without a report of where the initial free occurred doesn't really indicate anything other than memory corruption has occurred some time before the crash. The initial free of the pointer value could have happened anywhere (in this or another thread). The function where the error is occurring (CRYPTO_free) is not called by the globus_i_gsi_gssapi_init_ssl_context function (even via a macro) shown by the stack trace. This makes me extremely skeptical that I'll be able to make any progress with these core files. Can you give me commands to replicate this test scenario?
Reproducing the bug basicaly comes down to: 1) Installing GT 4.2 2) Running globus-rls-server, use a memory checker (on Solaris I'm using libumem) 3) Running a script like www.isi.edu/~schuler/bug6356/rls-test.sh on a couple client machines every 5 minutes out of cron 4) And waiting a couple days for the ABORT signal to kill the process I agree with your assessment -- that the erroneous free could come from anywhere and not necessarily GSI or an underlying SSL lib.
I ran with valgrind on linux for quite a while and saw some odd things coming out of the mysql driver, but it didn't crash. I did run into a crash while debugging a globus_io test looks like it that might be related. The patch for that is at http://www.mcs.anl.gov/~bester/patches/bug6356.diff
I created the following GPT source packages which I used to patch my Solaris test environment: http://www.isi.edu/~schuler/bug6356/globus_gsi_cert_utils-5.0.tar.gz http://www.isi.edu/~schuler/bug6356/globus_gssapi_gsi-5.9.tar.gz
I should just add that the GSI patch was released with GT 4.2.1, so users of that distribution and later do not need to apply the patches.