Bugzilla – Bug 5004
Failed assert in globus_l_xio_system_unregister_write on x86_64 platform
Last modified: 2008-03-27 09:52:47
You need to log in before you can comment on or make changes to this bug.
I am moving this from an xio-user discussion to a bugzilla report. I have copied the mail below, which includes one from an RLS user to me (schuler). --- emails below --- A user of ours (of RLS) is running into a problem on a x86_64 machine running RedHat. There seems to be a failed globus_assert in a globus_l_xio_system_unregister_read(...) call. And another failed assert downstream from an xio call. See the assert messages below. The failure is happening on the client side of RLS when making multiple rapid connections -> transactions -> disconnects with multiple threads running (though the threads are NOT sharing the client, they create their own). They sent me a tester that depended on some of their own code, so I rewrote it. I tried it myself on a x86_64 running Suse 9. It failed with: " ... t1095125312:p5917: Fatal error: [Thread System] GLOBUSTHREAD: pthread_mutex_lock() failed [Thread System] invalid value passed to thread interface (EINVAL)Aborted " Not sure yet whether that's related or not. On a 32bit installation I can run the tester with 100 threads and it completes successfully. On a 64 bit machine when running the tester it crashes with 10 threads (concurrent connections). Is this a globus_xio problem on 64bit machines? Do you need more information? thanks, rob -------- Original Message -------- Subject: RE: RLS Client connection thro JNI causes JVM failure Date: Tue, 6 Feb 2007 20:22:08 -0500 Rob, I have more bugs and questions for you. I added connection pooling and stress tested RLS and found it to fail when having multiple threads accessing the RLS service (using different RLSClient objects) simultaneously. I get two types of errors, both when connecting, and both causing a JVM seg fault and core dump. 1. "Assertion FD_ISSET(fd, globus_l_xio_system_write_fds) failed in file globus_xio_system_select.c at line 985." 2. "java: md_rand.c:312: ssleay_rand_add: Assertion `md_c[1] == md_count[1]' failed." The gdb traces are below. I see that the first of these has been reported by another user on a 64 bit machine (http://www.globus.org/mail_archive/discuss/2005/06/msg00046.html). Do you have any bug fixes for these? I'm attaching the test client I used. It starts failing with # of clients > 1. If you need more installation information, let me or Suresh know. Thanks! --Yogesh "Assertion FD_ISSET(fd, globus_l_xio_system_write_fds) failed in file globus_xio_system_select.c at line 985." (gdb) where #0 0x0000003e2642e21d in raise () from /lib64/tls/libc.so.6 #1 0x0000003e2642fa1e in abort () from /lib64/tls/libc.so.6 #2 0x0000002adcd1b3cb in globus_l_xio_system_unregister_write (fd=9) at globus_xio_system_select.c:985 #3 0x0000002adcd1f263 in globus_l_xio_system_handle_write (fd=9) at globus_xio_system_select.c:2002 #4 0x0000002adcd1fd82 in globus_l_xio_system_poll (user_args=0x0) at globus_xio_system_select.c:2249 #5 0x0000002addc183b1 in globus_callback_space_poll (timestop=0x40f5d350, space=-2) at globus_callback_nothreads.c:1430 #6 0x0000002adcac915e in rrpc_connect (host=0x2ad8dcf3b0 "kilimanjaro.extreme.indiana.edu", port=39281, attr=0x40f5f410, h=0x40118020, errmsg=0x40f5f450 "", errmsglen=1024) at rpc.c:144 #7 0x0000002adcac8762 in connect1 (h=0x40117fc0, errmsg=0x40f5f450 "", errlen=1024) at client.c:2462 #8 0x0000002adcac2f59 in globus_rls_client_connect (url=0x401176d0 "rls://kilimanjaro.extreme.indiana.edu", h=0x40f5f8b0) at client.c:204 #9 0x0000002adc9b8a51 in Java_org_globus_replica_rls_RLSClient_connect (env=0x2adc4f9110, obj=0x40f5f968, jurl=0x40f5f960) at jni.c:106 "java: md_rand.c:312: ssleay_rand_add: Assertion `md_c[1] == md_count[1]' failed." (gdb) where #0 0x0000003e2642e21d in raise () from /lib64/tls/libc.so.6 #1 0x0000003e2642fa1e in abort () from /lib64/tls/libc.so.6 #2 0x0000003e26427ae1 in __assert_fail () from /lib64/tls/libc.so.6 #3 0x0000002ade091145 in ssleay_rand_add (buf=0x41966b00, num=144, add=2) at md_rand.c:312 #4 0x0000002ade091e91 in RAND_add (buf=0x41966a70, num=144, entropy=2) at rand_lib.c:151 #5 0x0000002add730326 in globus_gsi_sysconfig_dir_exists_unix (filename=0x2adc939d90 "/u/drlead") at globus_gsi_system_config.c:4600 #6 0x0000002add72fc75 in globus_gsi_sysconfig_get_home_dir_unix (home_dir=0x41966bd8) at globus_gsi_system_config.c:4394 #7 0x0000002add730ee2 in globus_gsi_sysconfig_get_cert_dir_unix (cert_dir=0x41966c50) at globus_gsi_system_config.c:4954 #8 0x0000002add1eb188 in globus_i_gsi_gssapi_init_ssl_context (minor_status=0x41966ce8, credential=0x2adc939e10, anon_ctx=GLOBUS_I_GSI_GSS_DEFAULT_CONTEXT) at globus_i_gsi_gss_utils.c:2097 #9 0x0000002add1e99ad in globus_i_gsi_gss_create_cred (minor_status=0x41966d3c, cred_usage=1, output_cred_handle_P=0x2adc937878, cred_handle=0x41966d40) at globus_i_gsi_gss_utils.c:1532 #10 0x0000002add1e9437 in globus_i_gsi_gss_cred_read (minor_status=0x41966da0, cred_usage=1, cred_handle=0x2adc937878, desired_subject=0x0) at globus_i_gsi_gss_utils.c:1323 #11 0x0000002add1df784 in gss_acquire_cred (minor_status=0x41966f28, desired_name_P=0x0, time_req=4294967295, desired_mechs=0x0, cred_usage=1, output_cred_handle_P=0x2adc937878, actual_mechs=0x0, time_rec=0x0) at acquire_cred.c:126 #12 0x0000002add1e777e in globus_i_gsi_gss_create_and_fill_context (minor_status=0x41966fd0, context_handle_P=0x41966fd8, cred_handle=0x0, cred_usage=1, req_flags=2) at globus_i_gsi_gss_utils.c:291 #13 0x0000002add1e32c8 in gss_init_sec_context (minor_status=0x41967070, initiator_cred_handle=0x0, context_handle_P=0x2adc275b08, target_name=0x2adc937510, mech_type=0x0, req_flags=2, time_req=0, input_chan_bindings=0x0, input_token=0x0, actual_mech_type=0x2adc275b20, output_token=0x41967060, ret_flags=0x2adc275af8, time_rec=0x2adc275afc) at init_sec_context.c:128 #14 0x0000002adce75140 in globus_l_xio_gsi_open_cb (op=0x2adc275750, result=0, user_arg=0x2adc275af0) at globus_xio_gsi.c:1488 #15 0x0000002adce52841 in globus_l_xio_driver_open_op_kickout (user_arg=0x2adc275750) at globus_xio_driver.c:893 #16 0x0000002adce5fcd2 in globus_xio_driver_finished_open (in_dh=0x2adc275a10, in_op=0x2adc275750, in_res=0) at globus_xio_pass.c:231 #17 0x0000002adce96348 in globus_l_xio_tcp_system_connect_cb (result=0, user_arg=0x2adc275cb0) at globus_xio_tcp_driver.c:1797 #18 0x0000002adce6658c in globus_l_xio_system_kickout (user_arg=0x2adc938ae0) at globus_xio_system_select.c:1010 #19 0x0000002addd633b1 in globus_callback_space_poll (timestop=0x41967350, space=-2) at globus_callback_nothreads.c:1430 #20 0x0000002adcc1415e in rrpc_connect (host=0x2adc2749f0 "kilimanjaro.extreme.indiana.edu", port=39281, attr=0x41969410, h=0x2adc2749d0, errmsg=0x41969450 "", errmsglen=1024) at rpc.c:144 #21 0x0000002adcc13762 in connect1 (h=0x2adc274970, errmsg=0x41969450 "", errlen=1024) at client.c:2462 #22 0x0000002adcc0df59 in globus_rls_client_connect (url=0x2adc274940 "rls://kilimanjaro.extreme.indiana.edu", h=0x419698b0) at client.c:204 #23 0x0000002adca03a51 in Java_org_globus_replica_rls_RLSClient_connect (env=0x2adc210140, obj=0x41969968, jurl=0x41969960) at jni.c:106
Created an attachment (id=1185) [details] Tester that invokes RLSClient This tester uses RLSClient to cause a XIO crash on x86_64 platforms (it does not fail on 32bit platforms). If you are interested in using this tester, I can help with the setup of RLS (we would only need the client setup and can use one of my existing servers). I can also get a core file from the LEAD team that encountered the failure.
For what it is worth, a LIGO user running a client tool on Cent OS 5 x86_64 is receiving the same error: t1094719808:p2904: Fatal error: [Thread System] GLOBUSTHREAD: pthread_mutex_lock() failed [Thread System] invalid value passed to thread interface (EINVAL) The error is transient. The client tool simply opens up a GSI socket, writes a few bytes to it, and then reads what a remote server has sent down the wire. I have asked the user to run the tool with GLOBUS_XIO_GSI_DEBUG set to 127 and record the stdout.
I just noticed the same bug in one of the VDT nightly tests. It was running on CentOS 5, x86. It is Globus 4.0.6. We did: > globus/bin/globus-rls-cli create foo bar \ rls://localhost:39281 >/tmp/vdt-run-tests.temp.24761 And we got: t3075419024:p24764: Fatal error: [Thread System] GLOBUSTHREAD: pthread_mutex_lock() failed [Thread System] invalid value passed to thread interface (EINVAL) Is this bug being investigated? It seems to be rather old now. Thanks, -alain
I attached gdb to a running client that was looping over and over again opening a socket, sending some bytes to a server, reading the response, and then closing the socket. Here is the backtrace after the client aborts with the reported error: (gdb) backtrace #0 0x0000003237a30015 in raise () from /lib64/libc.so.6 #1 0x0000003237a31980 in abort () from /lib64/libc.so.6 #2 0x00002aaaae305029 in globus_silent_fatal () at globus_print.c:57 #3 0x00002aaaae30513b in globus_fatal (msg=0x2aaaae314abb "%s %s\n%s %s") at globus_print.c:88 #4 0x00002aaaae309246 in globus_i_thread_report_bad_rc (rc=22, message=0x2aaaae314eb0 "GLOBUSTHREAD: pthread_mutex_lock() failed\n") at globus_thread_common.c:145 #5 0x00002aaaae30a1db in globus_mutex_lock (mut=0x2aaaae521d40) at globus_thread_pthreads.c:823 #6 0x00002aaaae300fa0 in globus_object_free (object=0x1e0a37b0) at globus_object.c:225 #7 0x00002aaaae2f3f3d in s_key_destructor_func (value=0x1e0a37b0) at globus_error.c:289 #8 0x0000003238605919 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0 #9 0x00000032386061c3 in start_thread () from /lib64/libpthread.so.0 #10 0x0000003237acd36d in clone () from /lib64/libc.so.6 #11 0x0000000000000000 in ?? ()
I ran the client with GLOBUS_XIO_GSI_DEBUG = 127. The debugging trace is available at http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/xiodebug.out.gz
I have a pretty good idea why this is happening. The mutex is being checked in a thread specific key destructor handler that potentially occurs after the mutex is destroyed. I think the easiest solution is to initialize the mutex once per process creation, and then never destroy it (just allow the OS to clean it up on unload). I will test a patch soon.
Subject: Re: Failed assert in globus_l_xio_system_unregister_write on x86_64 platform > I have a pretty good idea why this is happening. The mutex is being > checked in > a thread specific key destructor handler that potentially occurs > after the > mutex is destroyed. I think the easiest solution is to initialize > the mutex > once per process creation, and then never destroy it (just allow the > OS to > clean it up on unload). > > I will test a patch soon. That's great news! If you can share a patch against Globus 4.0.6, I can include it in a future VDT release. Thanks! -alain
This is fixed in the trunk and the 4.0 branch. It should be in the 4.0.7 release