Bugzilla – Bug 5684
LIGO: RLS server unstable on Debian 4.0
Last modified: 2009-07-08 18:04:57
datarobot@golf:/opt/LDR-0.8.0/globus/bin$ uname -a
Linux golf 2.6.18-5-686 #1 SMP Wed Sep 26 17:54:59 UTC 2007 i686 GNU/Linux
datarobot@golf:/opt/LDR-0.8.0/globus/bin$ cat /etc/issue
Debian GNU/Linux 4.0 \n \l
On the platform/machine above, the RLS server has been unstable. We are used to
measuring the mean time between failures in weeks and months, but on this
machine/platform we are measuring it in hours and days.
The globus-rls-server version is
datarobot@golf:/opt/LDR-0.8.0/globus/bin$ globus-rls-server -v
This server was compiled from source from the GT 4.0.5 release, along with the
rest of the supporting Globus libraries.
We are using MySQL 5.0.22 as the relational database backend with MySQL
Connector ODBC 3.51.12 and unixODBC-2.2.11, both compiled from source on this
Note that because of a bug in the glibc deployed by default with Debian 4.0, we are
running all Globus tools with
in the environment.
The server crashes, and we cannot correlate the crashes with any specific activity
on the machine or within our RLS network. All other servers in the network
appear to be functioning normally.
We have been running the server with the -d -L 8 options. I have a number of log
files available and will append URL pointers to them.
Here are links to 5 gzipped RLS log files created with -d -L 8. Each file
records the server's activity up to a crash.
In logs 2, 3, and 4, the log terminates near an rls_lock_get or
rls_lock_release call. I'm not sure that indicates the cause of the bug,
but it might be something to look into. The last call made within those functions
is globus_mutex_unlock. In bug 5481
(http://bugzilla.mcs.anl.gov/globus/show_bug.cgi?id=5481), the problem on Deb 4.0
was related to the cond var getting corrupted on a thread cancel. I'm not sure
whether this is related. There are many lock gets/releases in RLS
operations, so there's a high probability that any crash is going to be "near a
lock get/release" anyway. Again, it may be nothing, just something I
noticed at first look.
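A quick way to check whether each log really ends near a lock call is to scan the tail of every gzipped log for the lock function names. This is a minimal sketch, assuming the -d -L 8 output names rls_lock_get and rls_lock_release literally and that the logs have been downloaded locally; the function name lock_tail_report and the file pattern rls-log-*.gz are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: for each gzipped RLS log given as an argument, print
# how many of its last N lines mention rls_lock_get or rls_lock_release.
# Assumes the debug log names these functions literally.
lock_tail_report() {
    n="$1"
    shift
    for f in "$@"; do
        # gzip -dc decompresses to stdout and leaves the .gz file untouched;
        # grep -c prints 0 (and exits 1) when nothing matches.
        hits=$(gzip -dc "$f" | tail -n "$n" | grep -c -E 'rls_lock_(get|release)')
        echo "$f: $hits"
    done
}

# Example (hypothetical file names): check the last 20 lines of each log.
# lock_tail_report 20 rls-log-*.gz
```

As noted above, lock calls are frequent in RLS operations, so a nonzero count is weak evidence on its own.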
Since the RLS server crashes so quickly and routinely on Deb 4.0, maybe you could
just run it in gdb so we can see what the thread stack traces look like.
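Running the server under gdb as suggested could be scripted with gdb's batch mode, which starts the program, waits for it to stop (for example on SIGSEGV), and then prints a backtrace for every thread. This is a sketch under assumptions: gdb is installed, -d keeps globus-rls-server in the foreground, and SIGPIPE is treated as routine network noise that should not stop the run. The wrapper name gdb_all_thread_bt is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command under gdb in batch mode and, when the
# program stops (e.g. on a crash), dump a backtrace of every thread.
# Usage: gdb_all_thread_bt ./globus-rls-server -d -L 8
gdb_all_thread_bt() {
    # "handle SIGPIPE nostop noprint pass" lets SIGPIPE reach the program
    # without gdb halting there (assumption: SIGPIPE is not the crash).
    gdb -batch \
        -ex 'handle SIGPIPE nostop noprint pass' \
        -ex run \
        -ex 'thread apply all bt' \
        --args "$@"
}
```

For example, from /opt/LDR-0.8.0/globus/bin, `gdb_all_thread_bt ./globus-rls-server -d -L 8 > rls-gdb.txt 2>&1` would keep the traces in a file that could be attached to this bug.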
Per our telecon, Mike L. (XIO) indicated this might be fixed in 4.0.6+. I would
like to close this one, pending feedback from your Debian 4.0 site. If
they have been running without issue, and/or if they run without issue after
upgrading to GT 4.0.6, then we can close it.
This cannot be reproduced with globus-rls-server from GT 4.2.1 running on