Bugzilla – Bug 5684
LIGO: RLS server unstable on Debian 4.0
Last modified: 2009-07-08 18:04:57
datarobot@golf:/opt/LDR-0.8.0/globus/bin$ uname -a
Linux golf 2.6.18-5-686 #1 SMP Wed Sep 26 17:54:59 UTC 2007 i686 GNU/Linux
datarobot@golf:/opt/LDR-0.8.0/globus/bin$ cat /etc/issue
Debian GNU/Linux 4.0 \n \l

On the platform/machine above the RLS server has been unstable. We are used to measuring the mean time between failures in weeks and months, but on this machine/platform we are measuring it in hours and days.

The globus-rls-server version is

datarobot@golf:/opt/LDR-0.8.0/globus/bin$ globus-rls-server -v
Version: 4.3

This server was compiled from source from the GT 4.0.5 release, along with the rest of the supporting Globus libraries. We are using MySQL 5.0.22 as the relational database backend with MySQL Connector ODBC 3.51.12 and unixODBC-2.2.11, both compiled from source on this machine/platform.

Note that because of a bug in the glibc deployed by default for Debian 4.0 we are running all Globus tools with LD_ASSUME_KERNEL=2.4.19 in the environment.

The server crashes and we cannot correlate the crash with any specific activity on the machine or within our RLS network. All other servers in the network appear to be functioning normally.

We have been running the server with the -d -L 8 options. I have a number of log files available and will append URL pointers to them.
Here are links to 5 gzipped RLS log files created with -d -L 8. Each file recorded up to the crash.

http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/debugging/rls.out.1.gz
http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/debugging/rls.out.2.gz
http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/debugging/rls.out.3.gz
http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/debugging/rls.out.4.gz
http://www.lsc-group.phys.uwm.edu/lscdatagrid/downloads/ldr_software/debugging/rls.out.5.gz
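Since each file was recorded up to the crash, the last lines of each log are the most relevant part. A minimal sketch for viewing them together, assuming the files have already been downloaded to the current directory; show_log_tails is a hypothetical helper name, not part of RLS or LDR:

```shell
# Sketch: print the last N lines of each gzipped RLS debug log.
# show_log_tails is a hypothetical helper, not an RLS or LDR tool.
show_log_tails() {
    n="$1"          # number of trailing lines to show per file
    shift
    for f in "$@"; do
        echo "== $f =="
        # Decompress to stdout and keep only the final lines
        gzip -dc "$f" | tail -n "$n"
    done
}
```

For example, `show_log_tails 20 rls.out.*.gz` would show the final 20 lines of every downloaded log.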
In logs 2, 3, and 4, the log file terminates near an rls_lock_get or rls_lock_release. I'm not sure that's an indication of the cause of the bug, but it might be something to look into. The last call within those calls is globus_mutex_unlock. In bug 5481 (http://bugzilla.mcs.anl.gov/globus/show_bug.cgi?id=5481), the problem on Debian 4.0 was related to the condition variable getting corrupted on a thread cancel. Not sure whether this is related.

There are many lock gets/releases in RLS operations, so there's a high probability that any crash is going to be "near a lock get/release" anyway. Again, it may be nothing, just something I noticed at first look.

Since the RLS server crashes so quickly and routinely on Debian 4.0, maybe you could just run it in gdb so we can see what the thread stack traces look like.
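One way to capture those thread stack traces is sketched below, under some assumptions: gdb is installed on the machine and is recent enough to support the -ex option, and the server is started with the same -d -L 8 options as before. run_under_gdb is a hypothetical helper name.

```shell
# Sketch: run a program under gdb in batch mode and, once it stops
# (for example on a crash), print a backtrace for every thread.
# run_under_gdb is a hypothetical helper, not a Globus tool.
run_under_gdb() {
    gdb -batch \
        -ex run \
        -ex 'thread apply all bt' \
        --args "$@"
}
```

Something like `run_under_gdb ./globus-rls-server -d -L 8 > rls.gdb.txt 2>&1` would then leave the backtraces in rls.gdb.txt (an arbitrary file name). LD_ASSUME_KERNEL=2.4.19 would still need to be set in the environment as before.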
Per our telecon, Mike L. (XIO) indicated this might be fixed in 4.0.6+. I would like to close this one -- pending feedback from your Debian 4.0 site. If they've been running without issue -- and/or if they run without issue upon upgrading to GT 4.0.6 -- then I'd like to close it.
This cannot be reproduced with globus-rls-server from GT 4.2.1 running on Debian Lenny.