Bug 6356 - LIGO: RLS crash on x86/Solaris 10 box
: LIGO: RLS crash on x86/Solaris 10 box
Status: ASSIGNED
: Replica Location
RLS
: development
: Sun Solaris
: P3 normal
: ---
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2008-09-04 13:01 by
Modified: 2008-10-14 13:14 (History)


Attachments


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2008-09-04 13:01:36
This record is for keeping track of the test box setup at ISI running Sun
x86_64 Solaris 10. The RLS is installed from the GT 4.2.0 binary bundle for x86
Solaris 10. It is using MySQL 5.0(.22 I believe), a MySQL ODBC 3.51(.latest)
driver, and uses the libiodbc from the GT package. The database is loaded with
data from LIGO so its a fairly large set of replica mappings. There are two
test clients that run every 5 minutes and execute 10,000 creates followed by
10,000 deletes (each). We've had a couple crashes on this box. Both stack
traces (to follow) indicate the same line of code caused the crash.
------- Comment #1 From 2008-09-04 13:02:21 -------
Stack trace for crash on Aug 22:

(dbx) where
current thread: t@3
  [1] __lwp_kill(0x3, 0x6), at 0xfe880d57 
  [2] _thr_kill(0x3, 0x6), at 0xfe87e512 
  [3] raise(0x6), at 0xfe82d083 
  [4] umem_do_abort(0x0, 0xfefa5000, 0xfe5ba9f4, 0xfef87bda, 0xfef92fe4,
0xfefa9a00), at 0xfef85f79 
  [5] umem_err_recoverable(0xfef92fe4, 0xfefa9a00, 0x81dd9b8, 0xfef9300c), at
0xfef860a9 
  [6] process_free(0x81dd9b8, 0x1, 0x0, 0xfe5baa78, 0xfe9f0df4, 0x81dd9b8), at
0xfef87bda 
  [7] free(0x81dd9b8, 0x24e7f380, 0xfe9f0dc6), at 0xfef87c7f 
  [8] CRYPTO_free(0xfef8a67d, 0x80ac390, 0x0, 0x0, 0x10, 0x0), at 0xfe9f0df4 
=>[9] globus_i_gsi_gssapi_init_ssl_context(minor_status = 0xfe5baae4,
credential = 0x24e7f448, anon_ctx = GLOBUS_I_GSI_GSS_DEFAULT_CONTEXT), line
2115 in "globus_i_gsi_gss_utils.c"
  [10] globus_i_gsi_gss_create_cred(minor_status = 0xfe5bab18, cred_usage = 2,
output_cred_handle_P = 0x6ffb8e28, cred_handle = 0xfe5bab1c), line 1562 in
"globus_i_gsi_gss_utils.c"
  [11] globus_i_gsi_gss_cred_read(minor_status = 0xfe5bab58, cred_usage = 2,
cred_handle = 0x6ffb8e28, desired_subject = (nil)), line 1353 in
"globus_i_gsi_gss_utils.c"
  [12] gss_acquire_cred(minor_status = 0xfe5bacac, desired_name_P = (nil),
time_req = 4294967295U, desired_mechs = (nil), cred_usage = 2,
output_cred_handle_P = 0x6ffb8e28, actual_mechs = (nil), time_rec = (nil)),
line 138 in "acquire_cred.c"
  [13] globus_i_gsi_gss_create_and_fill_context(minor_status = 0xfe5bad24,
context_handle_P = 0xfe5bad30, cred_handle = (nil), cred_usage = 2, req_flags =
2U), line 297 in "globus_i_gsi_gss_utils.c"
  [14] gss_accept_sec_context(minor_status = 0xfe5badc4, context_handle_P =
0x6ffe92e0, acceptor_cred_handle = (nil), input_token = 0xfe5badb0,
input_chan_bindings = (nil), src_name_P = 0x6ffe92f0, mech_type = 0x6ffe92ec,
output_token = 0xfe5badb8, ret_flags = 0x6ffe92d4, time_rec = 0x6ffe92d8,
delegated_cred_handle_P = 0x6ffe92e4), line 127 in "accept_sec_context.c"
  [15] globus_l_xio_gsi_read_token_cb(op = 0x8216388, result = 0, nbytes =
106U, user_arg = 0x6ffe92d0), line 1176 in "globus_xio_gsi.c"
  [16] globus_l_xio_driver_op_read_kickout(user_arg = 0x8216388), line 639 in
"globus_xio_driver.c"
  [17] globus_xio_driver_finished_read(in_op = 0x8216388, result = 0, nbytes =
106U), line 1232 in "globus_xio_pass.c"
  [18] globus_l_xio_tcp_finish_read(handle = 0x31d59208, result = 0, nbytes =
106U), line 2243 in "globus_xio_tcp_driver.c"
  [19] globus_l_xio_tcp_system_read_cb(result = 0, nbytes = 106U, user_arg =
0x31d59208), line 2259 in "globus_xio_tcp_driver.c"
  [20] globus_l_xio_system_kickout(user_arg = 0x81f7090), line 919 in
"globus_xio_system_select.c"
  [21] globus_l_callback_thread_poll(user_arg = 0xfec68960), line 2519 in
"globus_callback_threads.c"
  [22] globus_l_thread_pool_thread_start(user_arg = 0x80d8fe8), line 222 in
"globus_thread_pool.c"
  [23] thread_starter(temparg = 0x80cf028), line 513 in
"globus_thread_pthreads.c"
  [24] _thr_setup(0xfe4a0000), at 0xfe87fd36 
  [25] _lwp_start(0x3, 0x6, 0xfe8a8000, 0xfe5ba9a8, 0xfe82d083, 0x3), at
0xfe880020
------- Comment #2 From 2008-09-04 13:02:57 -------
Stack trace for crash on Sep 03:

(dbx) where
current thread: t@3
  [1] __lwp_kill(0x3, 0x6), at 0xfe880d57 
  [2] _thr_kill(0x3, 0x6), at 0xfe87e512 
  [3] raise(0x6), at 0xfe82d083 
  [4] umem_do_abort(0x4, 0xfefa5000, 0xfe5ba95c, 0xfef88b88, 0xfef93b90,
0x80ac710), at 0xfef85f79 
  [5] umem_err_recoverable(0xfef93b90), at 0xfef860a9 
  [6] umem_error(0x2, 0x80ac710, 0x250e4b78), at 0xfef88b88 
  [7] umem_cache_free_debug(0x80ac710, 0x250e4b78), at 0xfef89d31 
  [8] _umem_cache_free(0x80ac710, 0x250e4b78), at 0xfef8a42b 
  [9] _umem_free(0x250e4b78, 0x18), at 0xfef8a904 
  [10] process_free(0x250e4b80, 0x1, 0x0, 0xfe5baa78, 0xfe9f0df4, 0x250e4b80),
at 0xfef87c55 
  [11] free(0x250e4b80, 0x24d987b0, 0xfe9f0dc6), at 0xfef87c7f 
  [12] CRYPTO_free(0xfef8a67d, 0x80ac390, 0x0, 0x0, 0x10, 0x0), at 0xfe9f0df4 
=>[13] globus_i_gsi_gssapi_init_ssl_context(minor_status = 0xfe5baae4,
credential = 0x24f6ddd0, anon_ctx = GLOBUS_I_GSI_GSS_DEFAULT_CONTEXT), line
2115 in "globus_i_gsi_gss_utils.c"
  [14] globus_i_gsi_gss_create_cred(minor_status = 0xfe5bab18, cred_usage = 2,
output_cred_handle_P = 0x80c15a8, cred_handle = 0xfe5bab1c), line 1562 in
"globus_i_gsi_gss_utils.c"
  [15] globus_i_gsi_gss_cred_read(minor_status = 0xfe5bab58, cred_usage = 2,
cred_handle = 0x80c15a8, desired_subject = (nil)), line 1353 in
"globus_i_gsi_gss_utils.c"
  [16] gss_acquire_cred(minor_status = 0xfe5bacac, desired_name_P = (nil),
time_req = 4294967295U, desired_mechs = (nil), cred_usage = 2,
output_cred_handle_P = 0x80c15a8, actual_mechs = (nil), time_rec = (nil)), line
138 in "acquire_cred.c"
  [17] globus_i_gsi_gss_create_and_fill_context(minor_status = 0xfe5bad24,
context_handle_P = 0xfe5bad30, cred_handle = (nil), cred_usage = 2, req_flags =
2U), line 297 in "globus_i_gsi_gss_utils.c"
  [18] gss_accept_sec_context(minor_status = 0xfe5badc4, context_handle_P =
0x24d79718, acceptor_cred_handle = (nil), input_token = 0xfe5badb0,
input_chan_bindings = (nil), src_name_P = 0x24d79728, mech_type = 0x24d79724,
output_token = 0xfe5badb8, ret_flags = 0x24d7970c, time_rec = 0x24d79710,
delegated_cred_handle_P = 0x24d7971c), line 127 in "accept_sec_context.c"
  [19] globus_l_xio_gsi_read_token_cb(op = 0x24e060c8, result = 0, nbytes =
106U, user_arg = 0x24d79708), line 1176 in "globus_xio_gsi.c"
  [20] globus_l_xio_driver_op_read_kickout(user_arg = 0x24e060c8), line 639 in
"globus_xio_driver.c"
  [21] globus_xio_driver_finished_read(in_op = 0x24e060c8, result = 0, nbytes =
106U), line 1232 in "globus_xio_pass.c"
  [22] globus_l_xio_tcp_finish_read(handle = 0x80e9388, result = 0, nbytes =
106U), line 2243 in "globus_xio_tcp_driver.c"
  [23] globus_l_xio_tcp_system_read_cb(result = 0, nbytes = 106U, user_arg =
0x80e9388), line 2259 in "globus_xio_tcp_driver.c"
  [24] globus_l_xio_system_kickout(user_arg = 0x24f67f40), line 919 in
"globus_xio_system_select.c"
  [25] globus_l_callback_thread_poll(user_arg = 0xfec68960), line 2519 in
"globus_callback_threads.c"
  [26] globus_l_thread_pool_thread_start(user_arg = 0x80d8fe8), line 222 in
"globus_thread_pool.c"
  [27] thread_starter(temparg = 0x80cf028), line 513 in
"globus_thread_pthreads.c"
  [28] _thr_setup(0xfe4a0000), at 0xfe87fd36 
  [29] _lwp_start(0x3, 0x6, 0xfe8a8000, 0xfe5ba8f4, 0xfe82d083, 0x3), at
0xfe880020
------- Comment #3 From 2008-09-04 13:09:50 -------
libumem was also used during both runs that resulted in dumped core. In both
cases, umem_status indicates a double free at the point of the crash. The
umem_verify for 8/22 shows all clean. However, the 9/03 umem_verify has a
corrupt buffer -- the same indicated in the double free.
------- Comment #4 From 2008-09-04 13:17:34 -------
This isn't terribly insightful for the corrup buffer but here's the output of
umem_verify:

> 80ac710 ::umem_verify
Summary for cache 'umem_alloc_24'
  buffer 24f05a58 (allocated) has a corrupt buftag

And umem_status:

> ::umem_status
Status:         ready and active
Concurrency:    8
Logs:           transaction=128k (inactive)
Message buffer:
umem allocator: duplicate free: buffer freed twice
buffer=250e4b78  bufctl=250e64c8  cache: umem_alloc_24
previous transaction on buffer 250e4b78:
thread=4  time=T-0.-00000578  slab=24d130a8  cache: umem_alloc_24
libumem.so.1'?? (0xfef89ab4)
libumem.so.1'umem_cache_alloc+0xe1
libumem.so.1'umem_alloc+0x3f
libumem.so.1'malloc+0x23
umem: heap corruption detected
stack trace:
libumem.so.1'?? (0xfef86099)
libumem.so.1'?? (0xfef88b88)
libumem.so.1'?? (0xfef89d31)
libumem.so.1'umem_cache_free+0x3f
libumem.so.1'umem_free+0xd5
libumem.so.1'?? (0xfef87c55)
libumem.so.1'free+0x14
libcrypto.so.0.9.7'CRYPTO_free+0x37
libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'globus_i_gsi_gssapi_init_ssl_context+0xe6
libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'globus_i_gsi_gss_create_cred+0x237
libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'globus_i_gsi_gss_cred_read+0x109
libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'gss_acquire_cred+0x139
libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'globus_i_gsi_gss_create_and_fill_context+0x464
libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0'gss_accept_sec_context+0x188
libglobus_xio_gsi_driver_gcc32dbgpthr.so.0.0.0'?? (0xfe284fdc)
libglobus_xio_gcc32dbgpthr.so.0.0.0'globus_l_xio_driver_op_read_kickout+0x2d8
libglobus_xio_gcc32dbgpthr.so.0.0.0'globus_xio_driver_finished_read+0x5e3
libglobus_xio_gcc32dbgpthr.so.0.0.0'?? (0xfed31414)
libglobus_xio_gcc32dbgpthr.so.0.0.0'?? (0xfed314ea)
libglobus_xio_gcc32dbgpthr.so.0.0.0'?? (0xfed0cd7e)
libglobus_common_gcc32dbgpthr.so.0.0.0'?? (0xfec3392b)
libglobus_common_gcc32dbgpthr.so.0.0.0'globus_l_thread_pool_thread_start+0x30
libglobus_common_gcc32dbgpthr.so.0.0.0'?? (0xfec4c378)
libc.so.1'?? (0xfe87fd36)
libc.so.1'?? (0xfe880020)
------- Comment #5 From 2008-09-04 18:00:40 -------
Inspecting the source code along the stack trace, I went looking for memory
that could have been passed from outside of the GSI GSSAPI call from a caller
(i.e., from RLS or XIO)...

=>[13] globus_i_gsi_gssapi_init_ssl_context(minor_status = 0xfe5baae4,
credential = 0x24f6ddd0, anon_ctx = GLOBUS_I_GSI_GSS_DEFAULT_CONTEXT), line
2115 in "globus_i_gsi_gss_utils.c"

Unfortunatly, the line 2115 does not match up with the source code. I suspect
this is because of conditional compiling by the C preprocessor. So the best I
can do at this point is to just see where the parameters came from.

OM_uint32
globus_i_gsi_gssapi_init_ssl_context(
    OM_uint32 *                         minor_status,
    gss_cred_id_t                       credential,
    globus_i_gsi_gss_context_type_t     anon_ctx)

It's the address of the "credential" variable that the debuggers indicate is
the problem. So next step is to see where that came from.

  [14] globus_i_gsi_gss_create_cred(minor_status = 0xfe5bab18, cred_usage = 2,
output_cred_handle_P = 0x80c15a8, cred_handle = 0xfe5bab1c), line 1562 in
"globus_i_gsi_gss_utils.c"

...
    newcred->cred_handle = *cred_handle;
    *cred_handle = NULL;

(*) major_status = globus_i_gsi_gssapi_init_ssl_context(
        &local_minor_status,
        (gss_cred_id_t) newcred,
        GLOBUS_I_GSI_GSS_DEFAULT_CONTEXT);
...

(*) marks line 1562

So the credential variable comes from "newcred" which is malloc'ed in the same
function on line 1522 and initialized before the call. Next is to find out
where the cred_handle memory address comes from.

OM_uint32
globus_i_gsi_gss_create_cred(
    OM_uint32 *                         minor_status,
    const gss_cred_usage_t              cred_usage,
    gss_cred_id_t *                     output_cred_handle_P,
    globus_gsi_cred_handle_t *          cred_handle)

That is an input parameter to the function. So next step is to go up the stack
to find out where that came from.

  [15] globus_i_gsi_gss_cred_read(minor_status = 0xfe5bab58, cred_usage = 2,
cred_handle = 0x80c15a8, desired_subject = (nil)), line 1353 in
"globus_i_gsi_gss_utils.c"

...
(*) major_status = globus_i_gsi_gss_create_cred(&local_minor_status,
                                                cred_usage,
                                                cred_handle,
                                                &local_cred_handle);
...

(*) indicates line 1353

The variable "local_cred_handle" has its address passed to the create_cred()
function (not to be confused with this function's local variable called
"cred_handle"). So the next step is to find out where "local_cred_handle" comes
from.

At this point, I'll just paste in the method from the signature to the last
line executed before the crash.

OM_uint32
globus_i_gsi_gss_cred_read(
    OM_uint32 *                         minor_status,
    const gss_cred_usage_t              cred_usage,
    gss_cred_id_t *                     cred_handle,
    const X509_NAME *                   desired_subject)
{
    globus_result_t                     local_result = GLOBUS_SUCCESS;
    globus_gsi_cred_handle_t            local_cred_handle;
    OM_uint32                           local_minor_status;
    OM_uint32                           major_status = GSS_S_COMPLETE;
    static char *                       _function_name_ =
        "globus_i_gsi_gss_cred_read";

    GLOBUS_I_GSI_GSSAPI_DEBUG_ENTER;

    local_result = globus_gsi_cred_handle_init(&local_cred_handle, NULL);
    if(local_result != GLOBUS_SUCCESS)
    {
        local_cred_handle = NULL;
        GLOBUS_GSI_GSSAPI_ERROR_CHAIN_RESULT(
            minor_status, local_result,
            GLOBUS_GSI_GSSAPI_ERROR_WITH_GSI_CREDENTIAL);
        major_status = GSS_S_FAILURE;
        goto exit;
    }

    local_result = globus_gsi_cred_read(local_cred_handle,
                                        desired_subject);
    if(local_result != GLOBUS_SUCCESS)
    {
        GLOBUS_GSI_GSSAPI_ERROR_CHAIN_RESULT(
            minor_status, local_result,
            GLOBUS_GSI_GSSAPI_ERROR_WITH_GSS_CREDENTIAL);
        major_status = GSS_S_FAILURE;
        goto exit;
    }

    major_status = globus_i_gsi_gss_create_cred(&local_minor_status,
                                                cred_usage,
                                                cred_handle,
                                                &local_cred_handle);
...

At this point, it appears that the memory buffer that is associated with the
crash originates from within the "globus_i_gsi_gss_cred_read" function. I do
not see an indication that outside (calling) code is responsible for this
buffer.
------- Comment #6 From 2008-09-10 08:54:06 -------
This is reported as a double-free. Is there a report indicating which shows
where the initial free occurred?
------- Comment #7 From 2008-09-10 13:21:19 -------
How reproducable is this? The stack traces not matching the source code makes
it extremely difficult to make any sense out of this bug report. I don't even
see where CRYPTO_free() can be called directly from
globus_i_gsi_gssapi_init_ssl_context().
------- Comment #8 From 2008-09-10 14:54:34 -------
(In reply to comment #6)
> This is reported as a double-free. Is there a report indicating which shows
> where the initial free occurred?
> 

The umem log is here:
http://www.isi.edu/~schuler/umalog.20080903
------- Comment #9 From 2008-09-10 14:56:13 -------
> The umem log is here:
> http://www.isi.edu/~schuler/umalog.20080903
> 

Sorry. That should have been:
http://www.isi.edu/~schuler/bug6356/umalog.20080903
------- Comment #10 From 2008-09-10 15:12:52 -------
(In reply to comment #7)
> How reproducable is this?

It is not easy to reproduce and I do not have a precise set of steps to
recreate the bug. An RLS (from GT 4.2) running (on Solaris 10 x86 64) for a
couple days, with a small number (2) of clients, has encountered this problem
twice.

This is not exactly a "crash", it is actually an ABORT caused by the debug
memory library (libumem). Normally, (I think) a double-free could occur without
immediate consequences -- which in some ways is even more concerning.

I have a couple core dumps here:
http://www.isi.edu/~schuler/bug6356/core.20080822.gz
http://www.isi.edu/~schuler/bug6356/core.20080903.gz

> The stack traces not matching the source code makes
> it extremely difficult to make any sense out of this bug report. I don't even
> see where CRYPTO_free() can be called directly from
> globus_i_gsi_gssapi_init_ssl_context().
> 

The stack trace line numbers match up, until the last (Globus-related) frame. I
believe that may be due to the C preprocessor and #defined code in this
compilation unit. But it is really unfortunate that the lineno does not match
up for that last frame.
------- Comment #11 From 2008-09-12 13:11:07 -------
I still have a few problems with this report:

The double-free without a report of where the initial free occurred doesn't
really indicate anything other than memory corruption has occurred some time
before the crash. The initial free of the pointer value could have happened
anywhere (in this or another thread). 

The function where the error is occurring (CRYPTO_free) is not called by the
globus_i_gsi_gssapi_init_ssl_context function (even via a macro) shown by the
stack trace. This makes me extremely skeptical that I'll be able to make any
progress with these core files.

Can you give me commands to replicate this test scenario?
------- Comment #12 From 2008-09-16 16:05:26 -------
Reproducing the bug basicaly comes down to:

1) Installing GT 4.2
2) Running globus-rls-server, use a memory checker (on Solaris I'm using
libumem)
3) Running a script like www.isi.edu/~schuler/bug6356/rls-test.sh on a couple
client machines every 5 minutes out of cron
4) And waiting a couple days for the ABORT signal to kill the process

I agree with your assessment -- that the erroneous free could come from
anywhere and not necessarily GSI or an underlying SSL lib.
------- Comment #13 From 2008-09-24 15:32:47 -------
I ran with valgrind on linux for quite a while and saw some odd things coming
out of the mysql driver, but it didn't crash. I did run into a crash while
debugging a globus_io test looks like it that might be related. The patch for
that is at http://www.mcs.anl.gov/~bester/patches/bug6356.diff
------- Comment #14 From 2008-09-25 12:33:16 -------
I created the following GPT source packages which I used to patch my Solaris
test environment:

http://www.isi.edu/~schuler/bug6356/globus_gsi_cert_utils-5.0.tar.gz
http://www.isi.edu/~schuler/bug6356/globus_gssapi_gsi-5.9.tar.gz
------- Comment #15 From 2008-10-14 13:13:18 -------
I should just add that the GSI patch was released with GT 4.2.1, so users of
that distribution and later do not need to apply the patches.