Bugzilla – Bug 6085
LIGO: RLS server crash with GLOBUSTHREAD: pthread_mutex_lock() failed
Last modified: 2008-05-29 13:05:02
The RLS server running on ldas-cit.ligo.caltech.edu crashed. It is a Solaris 10 box. The RLS comes from compiling GT 4.0.7 on Solaris 10 using gcc. The log file shows the following:

2008-05-14 14:18:24 T5: checkidle: Timing out connection 38612EA0
2008-05-14 14:18:24 T5: checkidle: Timing out connection 38492A60
2008-05-14 14:18:24 T5: checkidle: Timing out connection 376A9D78
2008-05-14 14:18:24 T5: checkidle: Timing out connection 376858B0
2008-05-14 14:18:24 T5: checkidle: Timing out connection 37012F18
2008-05-14 14:18:24 T5: checkidle: Timing out connection 36FEEAB0
2008-05-14 14:18:24 T5: checkidle: Timing out connection 36D43C68
2008-05-14 14:18:24 T5: checkidle: Timing out connection 35D1F2A8
2008-05-14 14:18:24 T5: checkidle: Timing out connection 35FBC3A0
2008-05-14 14:18:24 T5: checkidle: Timing out connection 349119F8
2008-05-14 14:18:24 T5: checkidle: Timing out connection 33432778
2008-05-14 14:18:24 T5: checkidle: Timing out connection 1D3EC7E8
2008-05-14 14:18:24 T5: checkidle: Timing out connection 319C73A8
2008-05-14 14:18:24 T5: checkidle: Timing out connection 1E0D11A8
2008-05-14 14:18:24 T5: checkidle: Timing out connection 1ECE40C8
2008-05-14 14:18:53 T11: lrc_bfiupdates: 0
2008-05-14 14:18:54 T17: rli_bfiupdates:
2008-05-14 14:18:55 T10: updatebf(rls://golf.astro.cf.ac.uk:39281): Globus I/O error: globus_xio: System error in writev: Broken pipe
globus_xio: A system call failed: Broken pipe
2008-05-14 14:18:55 T10: update_sendbf: Sending bloomfilter to rls://ldas-cit:39281
2008-05-14 14:18:58 T3: auth_getperms(/DC=org/DC=doegrids/OU=Services/CN=ldas-cit.ligo.caltech.edu): localuser - perms 8
2008-05-14 14:18:58 T3: authcb: Accepted connection from /DC=org/DC=doegrids/OU=Services/CN=ldas-cit.ligo.caltech.edu
2008-05-14 14:19:00 T3: auth_getperms(/O=GermanGrid/OU=AEI/CN=ldr/charlie.amp.uni-hannover.de): localuser - perms 8
2008-05-14 14:19:00 T3: authcb: Accepted connection from /O=GermanGrid/OU=AEI/CN=ldr/charlie.amp.uni-hannover.de
2008-05-14 14:19:23 T11: lrc_bfiupdates: 0
2008-05-14 14:19:24 T17: rli_bfiupdates:
2008-05-14 14:19:53 T11: lrc_bfiupdates: 0
2008-05-14 14:19:54 T17: rli_bfiupdates:
2008-05-14 14:20:01 T3: auth_getperms(/O=GermanGrid/OU=AEI/CN=ygraine.aei.mpg.de): localuser - perms 0
2008-05-14 14:20:01 T3: authcb: Accepted connection from /O=GermanGrid/OU=AEI/CN=ygraine.aei.mpg.de
2008-05-14 14:20:01 T3: Permission denied: /O=GermanGrid/OU=AEI/CN=ygraine.aei.mpg.de
2008-05-14 14:20:06 T16: auth_getperms(/O=GermanGrid/OU=AEI/CN=hanrobot/ldr.aei.uni-hannover.de): localuser - perms 25
2008-05-14 14:20:06 T16: authcb: Accepted connection from /O=GermanGrid/OU=AEI/CN=hanrobot/ldr.aei.uni-hannover.de
2008-05-14 14:20:15 T39: db_open: rli1000 dbuser
2008-05-14 14:20:15 T31: db_open: rli1000 dbuser
2008-05-14 14:20:16 T34: db_open: rli1000 dbuser
2008-05-14 14:20:16 T31: db_exists: L-R-894811392-32.gwf 0
2008-05-14 14:20:16 T35: db_open: rli1000 dbuser
t31:p12951: Fatal error: [Thread System] GLOBUSTHREAD: pthread_mutex_lock() failed
[Thread System] unknown error number: 59
You can find the core file at http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/globus-rls-server.core
I've been trying to get a look at the core file. Was this binary on a sparc architecture or x86? I used gdb and mdb on a sparc sol9 box to look at the core file but I didn't get much info out of it. In the meantime... If someone could open the core file with gdb and give me the output of 'backtrace' that would help. Also, if I remember correctly you build your binaries withOUT debug symbols.
It was/is a Solaris 10 box:

[grid@ldas-cit skoranda]$ uname -a
SunOS ldas-cit 5.10 Generic_127111-11 sun4u sparc SUNW,Sun-Fire-880

The backtrace from the core file is

(gdb) backtrace
#0  0xfe945b84 in _lwp_kill () from /lib/libc.so.1
#1  0xfe8e4bbc in raise () from /lib/libc.so.1
#2  0xfe8c10c0 in abort () from /lib/libc.so.1
#3  0xfef33078 in globus_silent_fatal () at globus_print.c:57
#4  0xfef33124 in globus_fatal (msg=0xfef4a248 "%s %s\n%s unknown error number: %d\n") at globus_print.c:88
#5  0xfef39644 in globus_i_thread_report_bad_rc (rc=59, message=0xfef4a620 "GLOBUSTHREAD: pthread_mutex_lock() failed\n") at globus_thread_common.c:138
#6  0xfef3b21c in globus_mutex_lock (mut=0x27455b70) at globus_thread_pthreads.c:823
#7  0xff294928 in globus_io_register_writev (handle=0x2599c7b0, iov=0xfc5fb220, iovcnt=2, writev_callback=0xff2feaa8 <writevcb>, callback_arg=0xfc5fb160) at globus_io_xio_compat.c:3236
#8  0xff2fe36c in rrpc_writev (h=0x2599c7b0, iov=0xfc5fb220, iovcnt=2, nbw=0xfc5fb230, errmsg=0x259a07d4 "L-R-894811392-32.gwf,0") at rpc.c:317
#9  0x0002ba7c in rrpc_error (c=0x2599c7b0, rc=12, fmt=0x3b6d0 "%s") at server.c:1376
#10 0x0002e14c in lrc_exists (c=0x2599c7b0, dbh=0xfc5fbf4c, arglist=0xfc5fbf58) at server.c:1950
#11 0x0002a65c in procreq (a=0x0) at server.c:1054
#12 0xfe944998 in _lwp_start () from /lib/libc.so.1
#13 0xfe944998 in _lwp_start () from /lib/libc.so.1
---Type <return> to continue, or q <return> to quit---
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
We build our LDR binaries with debug symbols (gcc32dbgpthr).
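For readers following the backtrace: frame #6 (globus_mutex_lock) hands the value returned by pthread_mutex_lock() to globus_i_thread_report_bad_rc(), which prints it as a bare number. Pthread functions return the error number directly instead of setting errno, so the number can be passed to strerror() or looked up in the platform's errno.h. The following is only a minimal sketch of that checking pattern in plain POSIX threads, not Globus code. The names checked_lock and demo_relock_errorcheck are hypothetical, and the EDEADLK case is chosen only because it is easy to provoke portably; it is not a claim about what happened on ldas-cit.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: lock a mutex and report a failure by return code.
 * pthread_mutex_lock() returns the error number; it does not set errno. */
int checked_lock(pthread_mutex_t *m, const char *where)
{
    int rc = pthread_mutex_lock(m);
    if (rc != 0) {
        fprintf(stderr, "%s: pthread_mutex_lock() failed: rc=%d (%s)\n",
                where, rc, strerror(rc));
    }
    return rc;
}

/* Hypothetical demo: relocking an error-checking mutex from the thread
 * that already owns it returns EDEADLK instead of blocking forever. */
int demo_relock_errorcheck(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&m, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = checked_lock(&m, "first lock");   /* expected to succeed */
    assert(rc == 0);
    rc = checked_lock(&m, "second lock");  /* expected to fail with EDEADLK */

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

Calling demo_relock_errorcheck() prints one diagnostic line to stderr and returns the error number from the second lock attempt. Reporting strerror(rc) alongside the number gives the platform's own description of the code where one exists.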
Most of the RLS functions below are in replica/rls/server/server.c; the last one is in replica/rls/client/library/rpc.c -- not that I expect you'll be looking into them, but just in case.

The lifecycle of the connection follows the path:

globus_io_tcp_register_listen
      ||
      \/
listencb: checks for io error condition
      ||
      \/
doaccept: sets security attributes on io handle
      ||
      \/
globus_io_tcp_register_accept
      ||
      \/
acceptcb: checks some RLS error conditions
      ||
      \/
globus_io_register_read
      ||
      \/
readcb: queues the handle (inside an RLS connection object)
      ||
      \/
-- at this point the connection is placed on a "request queue" internal to the RLS. Then a request processing thread (waiting on a condition var) is notified that a request is ready for processing.
      ||
      \/
procreq(): wakes, inspects the request and calls a function (in the case of the crash the function called was lrc_exists())
      ||
      \/
lrc_exists(): makes a db call (in this case the application logic is to report an "error" that the object the caller was looking for didn't exist in the RLS database -- but not a system error, to be clear)
      ||
      \/
rrpc_error(): creates its iovec structure with the error message
      ||
      \/
rrpc_writev() -- found in rpc.c: initializes a mutex/cv to be used by the writevcb (never reached)
      ||
      \/
globus_io_register_writev
      ||
      \/
*BANG*

It seems a pretty simple case of listen->accept->read->write->close.
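The rrpc_writev() step above (initialize a mutex/cv, register an asynchronous write, then wait for the callback) can be sketched in plain POSIX threads as follows. This is an illustration of the general pattern only, under stated assumptions: it does not use the Globus API, every name in it (write_wait, write_done_cb, fake_io_thread, blocking_write) is hypothetical, and a helper thread stands in for the asynchronous completion that globus_io_register_writev would deliver to writevcb. It does not reproduce the crash, which happened inside globus_io_register_writev before the callback was ever reached.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical completion state shared by the caller and the callback. */
struct write_wait {
    pthread_mutex_t mtx;
    pthread_cond_t  cv;
    int             done;    /* set to 1 by the callback */
    size_t          nbytes;  /* byte count reported by the callback */
};

/* Hypothetical completion callback, playing the role of writevcb. */
static void write_done_cb(void *arg, size_t nbytes)
{
    struct write_wait *w = arg;
    pthread_mutex_lock(&w->mtx);
    w->nbytes = nbytes;
    w->done = 1;
    pthread_cond_signal(&w->cv);
    pthread_mutex_unlock(&w->mtx);
}

/* Stand-in for the asynchronous I/O layer: "completes" the write on
 * another thread by invoking the callback with the requested length. */
struct fake_request {
    void (*cb)(void *, size_t);
    void  *arg;
    size_t len;
};

static void *fake_io_thread(void *p)
{
    struct fake_request *r = p;
    r->cb(r->arg, r->len);
    return NULL;
}

/* Register the fake write, then block until the callback signals.
 * Returns 0 on success and stores the byte count in *nbw. */
int blocking_write(size_t len, size_t *nbw)
{
    struct write_wait w;
    struct fake_request req;
    pthread_t tid;
    int rc;

    pthread_mutex_init(&w.mtx, NULL);
    pthread_cond_init(&w.cv, NULL);
    w.done = 0;
    w.nbytes = 0;

    req.cb = write_done_cb;
    req.arg = &w;
    req.len = len;
    rc = pthread_create(&tid, NULL, fake_io_thread, &req);
    if (rc == 0) {
        pthread_mutex_lock(&w.mtx);
        while (!w.done)                 /* loop guards against spurious wakeups */
            pthread_cond_wait(&w.cv, &w.mtx);
        pthread_mutex_unlock(&w.mtx);
        pthread_join(tid, NULL);        /* callback has returned before teardown */
        *nbw = w.nbytes;
    }

    pthread_cond_destroy(&w.cv);
    pthread_mutex_destroy(&w.mtx);
    return rc;
}
```

In this sketch the mutex and condition variable live on the caller's stack, so they are destroyed only after the helper thread has been joined and the callback can no longer touch them.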
You can find two log files for the RLS server running on this platform/machine at

http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/rls.log.1.gz
http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/rls.log.2.gz