Bug 5004

Summary: Failed assert in globus_l_xio_system_unregister_write on x86_64 platform
Product: XIO
Component: Globus XIO
Status: RESOLVED FIXED
Severity: major
Priority: P3
Version: unspecified
Target Milestone: ---
Hardware: PC
OS: Linux
Reporter: Rob S <schuler@isi.edu>
Assignee: John Bresnahan <bresnaha@mcs.anl.gov>
CC: allcock@mcs.anl.gov, annc@isi.edu, genaro@caos.uab.es, roy@cs.wisc.edu, skoranda@gravity.phys.uwm.edu, ysimmhan@cs.indiana.edu
Attachments: Tester that invokes RLSClient

Description From 2007-02-09 17:22:54
I am moving this from an xio-user discussion to a bugzilla report. I have
copied the mail below, which includes one from an RLS user to me (schuler).

--- emails below ---

A user of ours (an RLS user) is running into a problem on an x86_64 machine
running Red Hat. There seems to be a failed globus_assert in a
globus_l_xio_system_unregister_read(...) call, and another failed assert
downstream from an XIO call. See the assert messages below.

The failure happens on the client side of RLS when making multiple rapid
connect -> transact -> disconnect cycles with multiple threads running
(the threads do NOT share a client; each creates its own).

They sent me a tester that depended on some of their own code, so I rewrote it.
I tried it myself on an x86_64 machine running SuSE 9. It failed with:

"
...
t1095125312:p5917: Fatal error: [Thread System] GLOBUSTHREAD:
pthread_mutex_lock() failed

[Thread System] invalid value passed to thread interface (EINVAL)Aborted
"

I'm not sure yet whether that's related.

On a 32-bit installation I can run the tester with 100 threads and it completes
successfully. On a 64-bit machine the tester crashes with only 10 threads
(concurrent connections).
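
For reference, here is a minimal sketch of the failure pattern (NOT the
attached tester, which drives the Java RLSClient through JNI), assuming the
standard globus_rls_client C entry points that appear in the backtraces below.
The URL, thread count, and iteration count are placeholders:

/*
 * Sketch only: each thread creates its own RLS handle and rapidly
 * connects/disconnects.  Error handling and the per-connection
 * "transaction" are omitted.
 */
#include <pthread.h>
#include "globus_rls_client.h"

#define NTHREADS 10   /* enough to crash on x86_64; 100 pass on 32-bit */
#define NCYCLES  50

static void *
worker(void * arg)
{
    globus_rls_handle_t * h;
    int                   i;

    for (i = 0; i < NCYCLES; i++)
    {
        /* each thread uses its own handle -- handles are never shared */
        if (globus_rls_client_connect("rls://localhost:39281", &h)
            == GLOBUS_SUCCESS)
        {
            /* ... a query or update would go here ... */
            globus_rls_client_close(h);
        }
    }
    return NULL;
}

int
main(void)
{
    pthread_t tid[NTHREADS];
    int       i;

    globus_module_activate(GLOBUS_RLS_CLIENT_MODULE);
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    globus_module_deactivate(GLOBUS_RLS_CLIENT_MODULE);
    return 0;
}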

Is this a globus_xio problem on 64-bit machines? Do you need more information?

thanks,
rob

-------- Original Message --------
Subject:     RE: RLS Client connection thro JNI causes JVM failure
Date:     Tue, 6 Feb 2007 20:22:08 -0500


Rob, I have more bugs and questions for you. I added connection pooling and
stress-tested RLS, and found that it fails when multiple threads access the
RLS service simultaneously (using different RLSClient objects). I get two
types of errors, both when connecting, and both causing a JVM segfault and
core dump.
1. "Assertion FD_ISSET(fd, globus_l_xio_system_write_fds) failed in file
globus_xio_system_select.c at line 985."
2. "java: md_rand.c:312: ssleay_rand_add: Assertion `md_c[1] == md_count[1]'
failed."

The gdb traces are below. I see that the first of these has been reported by
another user on a 64-bit machine
(http://www.globus.org/mail_archive/discuss/2005/06/msg00046.html). Do you have
any bug fixes for these? I'm attaching the test client I used. It starts
failing with # of clients > 1. If you need more installation information, let
me or Suresh know.

Thanks!

--Yogesh

"Assertion FD_ISSET(fd, globus_l_xio_system_write_fds) failed in file
globus_xio_system_select.c at line 985."

(gdb) where
#0  0x0000003e2642e21d in raise () from /lib64/tls/libc.so.6
#1  0x0000003e2642fa1e in abort () from /lib64/tls/libc.so.6
#2  0x0000002adcd1b3cb in globus_l_xio_system_unregister_write (fd=9) at
globus_xio_system_select.c:985
#3  0x0000002adcd1f263 in globus_l_xio_system_handle_write (fd=9) at
globus_xio_system_select.c:2002
#4  0x0000002adcd1fd82 in globus_l_xio_system_poll (user_args=0x0) at
globus_xio_system_select.c:2249
#5  0x0000002addc183b1 in globus_callback_space_poll (timestop=0x40f5d350,
space=-2) at globus_callback_nothreads.c:1430
#6  0x0000002adcac915e in rrpc_connect (host=0x2ad8dcf3b0
"kilimanjaro.extreme.indiana.edu", port=39281,
attr=0x40f5f410, h=0x40118020, errmsg=0x40f5f450 "", errmsglen=1024) at
rpc.c:144
#7  0x0000002adcac8762 in connect1 (h=0x40117fc0, errmsg=0x40f5f450 "",
errlen=1024) at client.c:2462
#8  0x0000002adcac2f59 in globus_rls_client_connect (url=0x401176d0
"rls://kilimanjaro.extreme.indiana.edu",
h=0x40f5f8b0) at client.c:204
#9  0x0000002adc9b8a51 in Java_org_globus_replica_rls_RLSClient_connect
(env=0x2adc4f9110, obj=0x40f5f968,
jurl=0x40f5f960) at jni.c:106


"java: md_rand.c:312: ssleay_rand_add: Assertion `md_c[1] == md_count[1]'
failed."
(gdb) where
#0  0x0000003e2642e21d in raise () from /lib64/tls/libc.so.6
#1  0x0000003e2642fa1e in abort () from /lib64/tls/libc.so.6
#2  0x0000003e26427ae1 in __assert_fail () from /lib64/tls/libc.so.6
#3  0x0000002ade091145 in ssleay_rand_add (buf=0x41966b00, num=144, add=2) at
md_rand.c:312
#4  0x0000002ade091e91 in RAND_add (buf=0x41966a70, num=144, entropy=2) at
rand_lib.c:151
#5  0x0000002add730326 in globus_gsi_sysconfig_dir_exists_unix
(filename=0x2adc939d90 "/u/drlead") at
globus_gsi_system_config.c:4600
#6  0x0000002add72fc75 in globus_gsi_sysconfig_get_home_dir_unix
(home_dir=0x41966bd8) at
globus_gsi_system_config.c:4394
#7  0x0000002add730ee2 in globus_gsi_sysconfig_get_cert_dir_unix
(cert_dir=0x41966c50) at
globus_gsi_system_config.c:4954
#8  0x0000002add1eb188 in globus_i_gsi_gssapi_init_ssl_context
(minor_status=0x41966ce8, credential=0x2adc939e10,
anon_ctx=GLOBUS_I_GSI_GSS_DEFAULT_CONTEXT) at globus_i_gsi_gss_utils.c:2097
#9  0x0000002add1e99ad in globus_i_gsi_gss_create_cred
(minor_status=0x41966d3c, cred_usage=1,
output_cred_handle_P=0x2adc937878, cred_handle=0x41966d40) at
globus_i_gsi_gss_utils.c:1532
#10 0x0000002add1e9437 in globus_i_gsi_gss_cred_read (minor_status=0x41966da0,
cred_usage=1, cred_handle=0x2adc937878,
desired_subject=0x0) at globus_i_gsi_gss_utils.c:1323
#11 0x0000002add1df784 in gss_acquire_cred (minor_status=0x41966f28,
desired_name_P=0x0, time_req=4294967295,
desired_mechs=0x0, cred_usage=1, output_cred_handle_P=0x2adc937878,
actual_mechs=0x0, time_rec=0x0)
   at acquire_cred.c:126
#12 0x0000002add1e777e in globus_i_gsi_gss_create_and_fill_context
(minor_status=0x41966fd0,
context_handle_P=0x41966fd8, cred_handle=0x0, cred_usage=1, req_flags=2) at
globus_i_gsi_gss_utils.c:291
#13 0x0000002add1e32c8 in gss_init_sec_context (minor_status=0x41967070,
initiator_cred_handle=0x0,
context_handle_P=0x2adc275b08, target_name=0x2adc937510, mech_type=0x0,
req_flags=2, time_req=0,    input_chan_bindings=0x0, input_token=0x0,
actual_mech_type=0x2adc275b20, output_token=0x41967060,
ret_flags=0x2adc275af8, time_rec=0x2adc275afc) at init_sec_context.c:128
#14 0x0000002adce75140 in globus_l_xio_gsi_open_cb (op=0x2adc275750, result=0,
user_arg=0x2adc275af0) at
globus_xio_gsi.c:1488
#15 0x0000002adce52841 in globus_l_xio_driver_open_op_kickout
(user_arg=0x2adc275750) at globus_xio_driver.c:893
#16 0x0000002adce5fcd2 in globus_xio_driver_finished_open (in_dh=0x2adc275a10,
in_op=0x2adc275750, in_res=0) at
globus_xio_pass.c:231
#17 0x0000002adce96348 in globus_l_xio_tcp_system_connect_cb (result=0,
user_arg=0x2adc275cb0) at
globus_xio_tcp_driver.c:1797
#18 0x0000002adce6658c in globus_l_xio_system_kickout (user_arg=0x2adc938ae0)
at globus_xio_system_select.c:1010
#19 0x0000002addd633b1 in globus_callback_space_poll (timestop=0x41967350,
space=-2) at globus_callback_nothreads.c:1430
#20 0x0000002adcc1415e in rrpc_connect (host=0x2adc2749f0
"kilimanjaro.extreme.indiana.edu", port=39281,
attr=0x41969410, h=0x2adc2749d0, errmsg=0x41969450 "", errmsglen=1024) at
rpc.c:144
#21 0x0000002adcc13762 in connect1 (h=0x2adc274970, errmsg=0x41969450 "",
errlen=1024) at client.c:2462
#22 0x0000002adcc0df59 in globus_rls_client_connect (url=0x2adc274940
"rls://kilimanjaro.extreme.indiana.edu",
h=0x419698b0) at client.c:204
#23 0x0000002adca03a51 in Java_org_globus_replica_rls_RLSClient_connect
(env=0x2adc210140, obj=0x41969968,
jurl=0x41969960) at jni.c:106
------- Comment #1 From 2007-02-09 17:27:43 -------
Created an attachment (id=1185) [details]
Tester that invokes RLSClient

This tester uses RLSClient to cause an XIO crash on x86_64 platforms (it does
not fail on 32-bit platforms).

If you are interested in using this tester, I can help with the setup of RLS
(we would only need the client setup and can use one of my existing servers).

I can also get a core file from the LEAD team that encountered the failure.
------- Comment #2 From 2008-03-11 21:23:20 -------
For what it is worth, a LIGO user running a client tool on CentOS 5 x86_64 is
receiving the same error:

t1094719808:p2904: Fatal error: [Thread System] GLOBUSTHREAD:
pthread_mutex_lock() failed

[Thread System] invalid value passed to thread interface (EINVAL)

The error is transient. 

The client tool simply opens up a GSI socket, writes a few bytes to it, and
then reads what a remote server has sent down the wire. 
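
For context, a rough sketch of that kind of open/write/read/close loop over a
GSI-protected Globus XIO stack is below. The actual tool's driver stack,
contact string, and payload are not in this report, so those details are
placeholders and error checking is omitted:

/*
 * Sketch only, assuming a plain tcp+gsi XIO stack.
 */
#include "globus_xio.h"

int
main(void)
{
    globus_xio_driver_t tcp_driver;
    globus_xio_driver_t gsi_driver;
    globus_xio_stack_t  stack;
    globus_xio_handle_t handle;
    globus_byte_t       buf[1024];
    globus_size_t       nbytes;
    int                 i;

    globus_module_activate(GLOBUS_XIO_MODULE);
    globus_xio_driver_load("tcp", &tcp_driver);
    globus_xio_driver_load("gsi", &gsi_driver);
    globus_xio_stack_init(&stack, NULL);
    globus_xio_stack_push_driver(stack, tcp_driver);
    globus_xio_stack_push_driver(stack, gsi_driver);

    for (i = 0; i < 100; i++)   /* repeated open/write/read/close cycles */
    {
        globus_xio_handle_create(&handle, stack);
        globus_xio_open(handle, "remote.example.org:5000", NULL);
        globus_xio_write(handle, (globus_byte_t *) "ping", 4, 4, &nbytes, NULL);
        globus_xio_read(handle, buf, sizeof(buf), 1, &nbytes, NULL);
        globus_xio_close(handle, NULL);
    }

    globus_module_deactivate(GLOBUS_XIO_MODULE);
    return 0;
}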

I have asked the user to run the tool with GLOBUS_XIO_GSI_DEBUG set to 127 and
record the stdout. 
------- Comment #3 From 2008-03-14 09:01:31 -------
I just noticed the same bug in one of the VDT nightly tests. It was running on
CentOS 5, x86. It is Globus 4.0.6. 

We did:

> globus/bin/globus-rls-cli create foo bar \
  rls://localhost:39281 >/tmp/vdt-run-tests.temp.24761

And we got:

   t3075419024:p24764: Fatal error: [Thread System] GLOBUSTHREAD:
pthread_mutex_lock() failed

 [Thread System] invalid value passed to thread interface (EINVAL)

Is this bug being investigated? It seems to be rather old now. 

Thanks,
-alain
------- Comment #4 From 2008-03-17 16:34:57 -------
I attached gdb to a running client that was repeatedly opening a socket,
sending some bytes to a server, reading the response, and then closing the
socket.

Here is the backtrace after the client aborts with the reported error:

(gdb) backtrace
#0  0x0000003237a30015 in raise () from /lib64/libc.so.6
#1  0x0000003237a31980 in abort () from /lib64/libc.so.6
#2  0x00002aaaae305029 in globus_silent_fatal () at globus_print.c:57
#3  0x00002aaaae30513b in globus_fatal (msg=0x2aaaae314abb "%s %s\n%s %s")
    at globus_print.c:88
#4  0x00002aaaae309246 in globus_i_thread_report_bad_rc (rc=22,
    message=0x2aaaae314eb0 "GLOBUSTHREAD: pthread_mutex_lock() failed\n")
    at globus_thread_common.c:145
#5  0x00002aaaae30a1db in globus_mutex_lock (mut=0x2aaaae521d40)
    at globus_thread_pthreads.c:823
#6  0x00002aaaae300fa0 in globus_object_free (object=0x1e0a37b0)
    at globus_object.c:225
#7  0x00002aaaae2f3f3d in s_key_destructor_func (value=0x1e0a37b0)
    at globus_error.c:289
#8  0x0000003238605919 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
#9  0x00000032386061c3 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003237acd36d in clone () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
------- Comment #5 From 2008-03-17 18:13:17 -------
I ran the client with GLOBUS_XIO_GSI_DEBUG = 127. The debugging trace is
available at

http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/xiodebug.out.gz
------- Comment #6 From 2008-03-18 16:57:04 -------
I have a pretty good idea why this is happening.  The mutex is being locked in
a thread-specific key destructor handler that can run after the mutex has
already been destroyed.  I think the easiest solution is to initialize the
mutex once per process and never destroy it (just let the OS clean it up on
unload).

I will test a patch soon.
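
A rough sketch of that approach follows; the identifiers are hypothetical
(not the actual Globus symbols), and it only illustrates the idea of a
one-time initializer with no matching destroy:

/*
 * Sketch only: guard the mutex with pthread_once so a thread-specific
 * key destructor that fires late (e.g. after module deactivation)
 * still finds a valid mutex.
 */
#include <pthread.h>

static pthread_mutex_t s_object_mutex;
static pthread_once_t  s_mutex_once = PTHREAD_ONCE_INIT;

static void
s_mutex_init_once(void)
{
    /* runs exactly once per process, no matter how many
     * activate/deactivate cycles the module goes through */
    pthread_mutex_init(&s_object_mutex, NULL);
}

static void
s_key_destructor(void * value)
{
    (void) value;  /* the freeing of the stored object is elided here */

    /* the mutex is intentionally never pthread_mutex_destroy()'d --
     * the OS reclaims it when the process exits or the library unloads */
    pthread_once(&s_mutex_once, s_mutex_init_once);
    pthread_mutex_lock(&s_object_mutex);
    /* ... free the thread's stored error object here ... */
    pthread_mutex_unlock(&s_object_mutex);
}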
------- Comment #7 From 2008-03-18 21:00:08 -------
Subject: Re:  Failed assert in globus_l_xio_system_unregister_write
 on x86_64 platform

> I have a pretty good idea why this is happening.  The mutex is being
> locked in a thread-specific key destructor handler that can run after
> the mutex has already been destroyed.  I think the easiest solution is
> to initialize the mutex once per process and never destroy it (just
> let the OS clean it up on unload).
>
> I will test a patch soon.

That's great news! If you can share a patch against Globus 4.0.6, I  
can include it in a future VDT release.
Thanks!
-alain
------- Comment #8 From 2008-03-27 09:52:47 -------
This is fixed in the trunk and the 4.0 branch.  It should be in the 4.0.7
release.