Bug 6085 - LIGO: RLS server crash with GLOBUSTHREAD: pthread_mutex_lock() failed
Status: ASSIGNED
Product: Replica Location
Component: RLS
Version: development
Platform: Sun Solaris
Importance: P3 normal
Reported: 2008-05-15 14:40
Modified: 2008-05-29 13:05


Description From 2008-05-15 14:40:33
The RLS server running on ldas-cit.ligo.caltech.edu, a Solaris 10 box, crashed.
The RLS binary was built from GT 4.0.7 on Solaris 10 using gcc.

The log file shows the following:

2008-05-14 14:18:24 T5: checkidle: Timing out connection 38612EA0
2008-05-14 14:18:24 T5: checkidle: Timing out connection 38492A60
2008-05-14 14:18:24 T5: checkidle: Timing out connection 376A9D78
2008-05-14 14:18:24 T5: checkidle: Timing out connection 376858B0
2008-05-14 14:18:24 T5: checkidle: Timing out connection 37012F18
2008-05-14 14:18:24 T5: checkidle: Timing out connection 36FEEAB0
2008-05-14 14:18:24 T5: checkidle: Timing out connection 36D43C68
2008-05-14 14:18:24 T5: checkidle: Timing out connection 35D1F2A8
2008-05-14 14:18:24 T5: checkidle: Timing out connection 35FBC3A0
2008-05-14 14:18:24 T5: checkidle: Timing out connection 349119F8
2008-05-14 14:18:24 T5: checkidle: Timing out connection 33432778
2008-05-14 14:18:24 T5: checkidle: Timing out connection 1D3EC7E8
2008-05-14 14:18:24 T5: checkidle: Timing out connection 319C73A8
2008-05-14 14:18:24 T5: checkidle: Timing out connection 1E0D11A8
2008-05-14 14:18:24 T5: checkidle: Timing out connection 1ECE40C8
2008-05-14 14:18:53 T11: lrc_bfiupdates: 0
2008-05-14 14:18:54 T17: rli_bfiupdates:
2008-05-14 14:18:55 T10: updatebf(rls://golf.astro.cf.ac.uk:39281): Globus I/O error: globus_xio: System error in writev: Broken pipe
globus_xio: A system call failed: Broken pipe

2008-05-14 14:18:55 T10: update_sendbf: Sending bloomfilter to rls://ldas-cit:39281
2008-05-14 14:18:58 T3: auth_getperms(/DC=org/DC=doegrids/OU=Services/CN=ldas-cit.ligo.caltech.edu): localuser - perms 8
2008-05-14 14:18:58 T3: authcb: Accepted connection from /DC=org/DC=doegrids/OU=Services/CN=ldas-cit.ligo.caltech.edu
2008-05-14 14:19:00 T3: auth_getperms(/O=GermanGrid/OU=AEI/CN=ldr/charlie.amp.uni-hannover.de): localuser - perms 8
2008-05-14 14:19:00 T3: authcb: Accepted connection from /O=GermanGrid/OU=AEI/CN=ldr/charlie.amp.uni-hannover.de
2008-05-14 14:19:23 T11: lrc_bfiupdates: 0
2008-05-14 14:19:24 T17: rli_bfiupdates:
2008-05-14 14:19:53 T11: lrc_bfiupdates: 0
2008-05-14 14:19:54 T17: rli_bfiupdates:
2008-05-14 14:20:01 T3: auth_getperms(/O=GermanGrid/OU=AEI/CN=ygraine.aei.mpg.de): localuser - perms 0
2008-05-14 14:20:01 T3: authcb: Accepted connection from /O=GermanGrid/OU=AEI/CN=ygraine.aei.mpg.de
2008-05-14 14:20:01 T3: Permission denied: /O=GermanGrid/OU=AEI/CN=ygraine.aei.mpg.de
2008-05-14 14:20:06 T16: auth_getperms(/O=GermanGrid/OU=AEI/CN=hanrobot/ldr.aei.uni-hannover.de): localuser - perms 25
2008-05-14 14:20:06 T16: authcb: Accepted connection from /O=GermanGrid/OU=AEI/CN=hanrobot/ldr.aei.uni-hannover.de
2008-05-14 14:20:15 T39: db_open: rli1000 dbuser
2008-05-14 14:20:15 T31: db_open: rli1000 dbuser
2008-05-14 14:20:16 T34: db_open: rli1000 dbuser
2008-05-14 14:20:16 T31: db_exists: L-R-894811392-32.gwf 0
2008-05-14 14:20:16 T35: db_open: rli1000 dbuser
t31:p12951: Fatal error: [Thread System] GLOBUSTHREAD: pthread_mutex_lock() failed

[Thread System] unknown error number: 59
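
For readers unfamiliar with how the "59" in the fatal message arises: pthread_mutex_lock() returns the error number directly rather than setting errno, and its numeric value is platform-specific, so it should be looked up in the errno.h of the affected host. The following is a minimal sketch of checking and reporting such a return code. It is not Globus code; checked_lock() and provoke_lock_error() are hypothetical names, and the error-checking mutex is used only to provoke a deterministic failure (EDEADLK), which is not the error seen in this bug.

```c
#define _XOPEN_SOURCE 700  /* for PTHREAD_MUTEX_ERRORCHECK under strict -std modes */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: lock a mutex and report a nonzero return code.
 * pthread_mutex_lock() returns the error number; it does not set errno. */
static int checked_lock(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);
    if (rc != 0) {
        fprintf(stderr, "pthread_mutex_lock() failed: rc=%d (%s)\n",
                rc, strerror(rc));
    }
    return rc;
}

/* Provoke a deterministic failure: relocking an error-checking mutex
 * from the thread that already owns it returns EDEADLK instead of hanging. */
static int provoke_lock_error(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    rc = checked_lock(&m);          /* first lock is expected to succeed */
    if (rc == 0) {
        rc = checked_lock(&m);      /* second lock fails with EDEADLK */
        pthread_mutex_unlock(&m);   /* release the first (successful) lock */
    }
    pthread_mutex_destroy(&m);
    return rc;
}
```

Calling provoke_lock_error() prints one diagnostic line to stderr and returns the error code, which a caller can compare against the constants in errno.h.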
------- Comment #1 From 2008-05-15 14:57:48 -------
You can find the core file at

http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/globus-rls-server.core
------- Comment #2 From 2008-05-21 18:10:24 -------
I've been trying to get a look at the core file. Was this binary on a sparc
architecture or x86? I used gdb and mdb on a sparc sol9 box to look at the core
file but I didn't get much info out of it.

In the meantime... If someone could open the core file with gdb and give me the
output of 'backtrace' that would help.

Also, if I remember correctly you build your binaries withOUT debug symbols.
------- Comment #3 From 2008-05-22 16:09:03 -------
It was/is a Solaris 10 box:

[grid@ldas-cit skoranda]$ uname -a
SunOS ldas-cit 5.10 Generic_127111-11 sun4u sparc SUNW,Sun-Fire-880

The backtrace from the core file is 

(gdb) backtrace
#0  0xfe945b84 in _lwp_kill () from /lib/libc.so.1
#1  0xfe8e4bbc in raise () from /lib/libc.so.1
#2  0xfe8c10c0 in abort () from /lib/libc.so.1
#3  0xfef33078 in globus_silent_fatal () at globus_print.c:57
#4  0xfef33124 in globus_fatal (
    msg=0xfef4a248 "%s %s\n%s unknown error number: %d\n") at globus_print.c:88
#5  0xfef39644 in globus_i_thread_report_bad_rc (rc=59,
    message=0xfef4a620 "GLOBUSTHREAD: pthread_mutex_lock() failed\n")
    at globus_thread_common.c:138
#6  0xfef3b21c in globus_mutex_lock (mut=0x27455b70)
    at globus_thread_pthreads.c:823
#7  0xff294928 in globus_io_register_writev (handle=0x2599c7b0,
    iov=0xfc5fb220, iovcnt=2, writev_callback=0xff2feaa8 <writevcb>,
    callback_arg=0xfc5fb160) at globus_io_xio_compat.c:3236
#8  0xff2fe36c in rrpc_writev (h=0x2599c7b0, iov=0xfc5fb220, iovcnt=2,
    nbw=0xfc5fb230, errmsg=0x259a07d4 "L-R-894811392-32.gwf,0") at rpc.c:317
#9  0x0002ba7c in rrpc_error (c=0x2599c7b0, rc=12, fmt=0x3b6d0 "%s")
    at server.c:1376
#10 0x0002e14c in lrc_exists (c=0x2599c7b0, dbh=0xfc5fbf4c, arglist=0xfc5fbf58)
    at server.c:1950
#11 0x0002a65c in procreq (a=0x0) at server.c:1054
#12 0xfe944998 in _lwp_start () from /lib/libc.so.1
#13 0xfe944998 in _lwp_start () from /lib/libc.so.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
------- Comment #4 From 2008-05-22 16:09:49 -------
We build our LDR binaries with debug symbols (gcc32dbgpthr).
------- Comment #5 From 2008-05-29 12:07:28 -------
Most of the RLS functions below are in replica/rls/server/server.c; the last one
is in replica/rls/client/library/rpc.c. Not that I expect you'll be looking
into them, but just in case. The lifecycle of the connection follows the path:

globus_io_tcp_register_listen
 || 
 \/
listencb: checks for io error condition
 ||
 \/
doaccept: sets security attributes on io handle
 ||
 \/
globus_io_tcp_register_accept
 ||
 \/
acceptcb: checks some RLS error conditions
 ||
 \/
globus_io_register_read
 ||
 \/
readcb: queues the handle (inside an RLS connection object)
 ||
 \/
-- at this point the connection is placed on a "request queue" internal to the
RLS. Then a request processing thread is notified (waiting on a condition var)
that a request is ready for processing.
 ||
 \/
procreq(): wakes and inspects the request and calls a function (in the case of
the crash the function called was lrc_exists())
 ||
 \/
lrc_exists(): makes a db call (in this case the application logic is to report
an "error" that the object the caller was looking for didn't exist in the RLS
database -- but not a system error to be clear)
 ||
 \/
rrpc_error(): creates its iovec structure with the error message
 ||
 \/
rrpc_writev() -- found in rpc.c: initializes a mutex/cv to be used by the
writevcb (never reached)
 ||
 \/
globus_io_register_writev
 ||
 \/
*BANG*

It seems a pretty simple case of listen->accept->read->write->close.
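
The hand-off described above (the read callback queues the connection, then a request processing thread waiting on a condition variable is woken) can be sketched roughly as follows. This is an illustration of the general pattern under stated assumptions, not the RLS source: reqqueue_t, rq_put(), rq_get(), worker() and run_demo() are hypothetical names, and plain integers stand in for RLS connection objects.

```c
#include <pthread.h>
#include <stddef.h>

#define QCAP 8  /* hypothetical capacity, chosen for the demo */

/* Hypothetical request queue guarded by a mutex and a condition variable. */
typedef struct {
    int items[QCAP];
    int head;
    int count;
    pthread_mutex_t mtx;
    pthread_cond_t nonempty;
} reqqueue_t;

static void rq_init(reqqueue_t *q)
{
    q->head = 0;
    q->count = 0;
    pthread_mutex_init(&q->mtx, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

/* Producer side (the read callback in the description): enqueue and notify.
 * Returns 0 on success, -1 if the queue is full. */
static int rq_put(reqqueue_t *q, int conn)
{
    int rc = -1;
    pthread_mutex_lock(&q->mtx);
    if (q->count < QCAP) {
        q->items[(q->head + q->count) % QCAP] = conn;
        q->count++;
        rc = 0;
        pthread_cond_signal(&q->nonempty);
    }
    pthread_mutex_unlock(&q->mtx);
    return rc;
}

/* Consumer side: sleep on the condition variable until a request arrives. */
static int rq_get(reqqueue_t *q)
{
    int conn;
    pthread_mutex_lock(&q->mtx);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->mtx);
    conn = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_mutex_unlock(&q->mtx);
    return conn;
}

typedef struct {
    reqqueue_t *q;
    int nreq;  /* number of requests the worker should process */
    int sum;   /* stand-in for "work done" on each request */
} worker_arg_t;

/* Request processing thread: takes nreq requests off the queue. */
static void *worker(void *p)
{
    worker_arg_t *w = p;
    int i;
    for (i = 0; i < w->nreq; i++)
        w->sum += rq_get(w->q);
    return NULL;
}

/* Queue three fake connections (1, 2, 3) and return what the worker saw. */
static int run_demo(void)
{
    reqqueue_t q;
    worker_arg_t w;
    pthread_t tid;
    int conn;

    rq_init(&q);
    w.q = &q;
    w.nreq = 3;
    w.sum = 0;
    if (pthread_create(&tid, NULL, worker, &w) != 0)
        return -1;
    for (conn = 1; conn <= 3; conn++)
        rq_put(&q, conn);
    pthread_join(tid, NULL);
    /* Destroy the primitives only after every user has finished. */
    pthread_mutex_destroy(&q.mtx);
    pthread_cond_destroy(&q.nonempty);
    return w.sum;
}
```

In this sketch the mutex and condition variable are destroyed only after the worker has been joined. POSIX leaves the behavior of locking a destroyed mutex undefined, so the ordering of teardown relative to the last lock matters in any code that follows this pattern.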
------- Comment #6 From 2008-05-29 13:05:02 -------
You can find two log files for the RLS server running on this platform/machine
at

http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/rls.log.1.gz
http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/rls.log.2.gz