Bug 5684 - LIGO: RLS server unstable on Debian 4.0
Status: RESOLVED WORKSFORME
Product: Replica Location
Component: RLS
Version: development
Platform: PC Linux
Importance: P3 major
Target Milestone: ---
Reported: 2007-11-26 09:43
Modified: 2009-07-08 18:04


Description From 2007-11-26 09:43:24
datarobot@golf:/opt/LDR-0.8.0/globus/bin$ uname -a
Linux golf 2.6.18-5-686 #1 SMP Wed Sep 26 17:54:59 UTC 2007 i686 GNU/Linux
datarobot@golf:/opt/LDR-0.8.0/globus/bin$ cat /etc/issue
Debian GNU/Linux 4.0 \n \l

On the platform/machine above, the RLS server has been unstable. We are used to
measuring the mean time between failures in weeks and months, but on this
machine/platform we are measuring it in hours and days.

The globus-rls-server version is

datarobot@golf:/opt/LDR-0.8.0/globus/bin$ globus-rls-server -v
Version: 4.3

This server was compiled from source from the GT 4.0.5 release, along with the
rest of the supporting Globus libraries.

We are using MySQL 5.0.22 as the relational database backend with MySQL
Connector ODBC 3.51.12 and unixODBC-2.2.11, both compiled from source on this
machine/platform.

Note that because of a bug in the glibc deployed by default for Debian 4.0, we
are running all Globus tools with

LD_ASSUME_KERNEL=2.4.19

in the environment.
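
One way to scope this override to individual commands, rather than to the whole
login environment, is a small wrapper function. This is only a sketch: the
function name with_assumed_kernel is hypothetical and not part of Globus; only
the variable and its value come from this report.

```shell
# Run a single command with LD_ASSUME_KERNEL set for that command only.
# The value 2.4.19 is the one quoted in this report; the function name
# is made up for illustration.
with_assumed_kernel() {
    LD_ASSUME_KERNEL=2.4.19 "$@"
}

# Example (not executed here):
#   with_assumed_kernel ./globus-rls-server -v
```

Because the assignment is a prefix to the command, the variable does not leak
into other programs started from the same shell.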

The server crashes and we cannot correlate the crash with any specific activity
on the machine or within our RLS network. All other servers in the network
appear to be functioning normally.
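
To help correlate crashes with other activity, each run could be wrapped so
that its start time, end time, and exit status are appended to a record file.
The following is a minimal sketch, not something the site is known to use: the
function name run_and_record and the record format are assumptions, and it
relies on the common shell convention that a status above 128 means the process
died on a signal.

```shell
# Usage: run_and_record RECORD_FILE COMMAND [ARGS...]
# Runs COMMAND in the foreground and appends one line per run to RECORD_FILE.
run_and_record() {
    record=$1
    shift
    start=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    "$@"
    status=$?
    end=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    # By shell convention, status > 128 means death by signal (status - 128).
    if [ "$status" -gt 128 ]; then
        note="signal $((status - 128))"
    else
        note="exit"
    fi
    printf '%s %s status=%s (%s)\n' "$start" "$end" "$status" "$note" >> "$record"
    return "$status"
}

# Example (not executed here):
#   run_and_record rls-runs.txt ./globus-rls-server -d -L 8
```

The timestamps are in UTC so they can be compared directly with logs from other
servers in the network.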

We have been running the server with the -d -L 8 options. I have a number of
log files available and will append URL pointers to them.
------- Comment #2 From 2007-11-26 16:16:26 -------
In logs 2, 3, and 4, the log file terminates near an rls_lock_get or
rls_lock_release. I'm not sure that's an indication of the cause of the bug,
but it might be something to look into. The last call within those calls is
globus_mutex_unlock. In the record
http://bugzilla.mcs.anl.gov/globus/show_bug.cgi?id=5481, the problem on Deb 4.0
was related to the cond var getting corrupted on a thread cancel. Not sure if
this could be related or not. There are many lock gets/releases in RLS
operations, so there's a high probability that any crash is going to be "near a
lock get/release" anyway. Again, it may be nothing, but just something I
noticed at first look.
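
A quick way to check this observation across several logs is to count lock
calls in the last few lines of each file. This is a rough sketch that assumes
the -L 8 log contains the names rls_lock_get and rls_lock_release as literal
text; the log format is not shown in this report, and the function name
lock_calls_near_end is hypothetical.

```shell
# Usage: lock_calls_near_end LOGFILE [N]
# Prints how many of the last N lines (default 20) mention an RLS lock call.
lock_calls_near_end() {
    tail -n "${2:-20}" "$1" | grep -c -E 'rls_lock_(get|release)'
}
```

Comparing that count with the count for a window of the same size from the
middle of the log would show whether the tail is unusual or whether lock calls
are simply that frequent everywhere.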

Since the RLS server crashes so quickly and routinely on Deb 4.0, maybe you
could just run it in gdb so we can see what the thread stack traces look like.
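
One way to do that unattended is a gdb batch run that dumps every thread's
backtrace when the server stops on a fatal signal. This is a sketch, assuming
gdb is installed and the server was built with debugging symbols; the helper
name write_gdb_script and the file name rls-crash.gdb are hypothetical, and the
-d -L 8 options are the ones already quoted in the description.

```shell
# Write a gdb command file: start the program and, once it stops
# (for example on SIGSEGV), print a backtrace for every thread.
write_gdb_script() {
    cat > "$1" <<'EOF'
set pagination off
run
thread apply all bt
EOF
}

# Example (not executed here):
#   write_gdb_script rls-crash.gdb
#   gdb -batch -x rls-crash.gdb --args ./globus-rls-server -d -L 8 \
#       > rls-crash-backtrace.txt 2>&1
```

The resulting text file can then be attached to the bug alongside the server
logs.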
------- Comment #3 From 2008-06-11 17:40:20 -------
Per our telecon, Mike L. (XIO) indicated this might be fixed in 4.0.6+. I would
like to close this one -- pending feedback from your Debian 4.0 site. If
they've been running without issue -- and/or if they run without issue upon
upgrading to GT 4.0.6 -- then I'd like to close it.
------- Comment #4 From 2009-07-08 18:04:57 -------
This cannot be reproduced with globus-rls-server from GT 4.2.1 running on
Debian Lenny.