Bugzilla – Bug 1573
Seg fault in ldapsearch
Last modified: 2005-12-06 17:00:07
I just installed Globus 2.4.3 on an IA-64 system running SuSE Linux (I think it's SP3; I can provide more detailed configuration information later if it's needed). When I try to start up the MDS services with "SXXgris start", it immediately fails. I also get a segmentation fault in grid-info-search, which I've tracked down to the "ldapsearch" command. It's dying on a call to gethostbyname_r(). I've confirmed that the gethostbyname_r() function works in a small test program. I get the same failure in two different builds, one built with gcc 3.2.2, the other with Intel's ecc 7.1. I do not get this failure under a similar IA-32 build. I also don't get it on a nearly identical build on another IA-64 system (which is really odd). I had attempted to apply all currently available updates, but the globus_openldap-2.0.22 update didn't build; see my response to Globus Bugzilla # 1201 for more information on that. I'll attach a log showing the error.
Created an attachment (id=318) Failure log
JP Navarro <navarro@mcs.anl.gov> suggests that <http://curl.haxx.se/mail/lib-2002-12/0067.html> might be related to this. I have no idea, but I thought I'd pass it along.
How likely is it that the patch for bug # 1201 would correct this? (I suspect not, but I thought it's worth asking.)
Not likely that it would fix the particular seg fault you are seeing. But it won't hurt at all to try it in your installation.
I've done a new build with the updated globus_openldap-2.0.22.tar.gz (the one that actually builds), and I still get the seg fault.
I have some more information on this bug. The seg fault seems to occur if ldapsearch links to /usr/lib/libbind.so.0, but not if it links to /lib/libresolv.so.2.

The gory details.

I've done nearly equivalent test builds on two different nodes:

    dtf-test1.sdsc.teragrid.org (old config)
    tg-login1.sdsc.teragrid.org (new config)

Both systems are IA-64 SuSE systems (TeraGrid nodes). tg-login1 is an up-to-date system, including a full set of packages and the latest service pack (SP3, I think) from SuSE. dtf-test1 hasn't been updated for a while, and it's basically configured as a compute node, with a much smaller set of packages, some of them probably in older versions. ("rpm -qa" shows 822 packages on tg-login1 and 272 on dtf-test1 -- not particularly meaningful numbers, but they give a general idea of how the systems are set up.)

Both builds used basically the same set of bundles and options (the build on tg-login1 doesn't have the latest version of the globus_openldap-2.0.22.tar.gz update package, but I'm fairly sure that's not the issue; the dtf-test1 build didn't include some irrelevant extras). I presume an examination of the logs would show different "configure" output due to the different sets of libraries installed on the two nodes. I'll keep the log files indefinitely, and I can look for specifics if you like, but I'm not going to post multi-megabyte log files.

Both nodes share the filesystem on which the builds were done, so I can run both versions of the ldapsearch command from either node. I run the command with no arguments to illustrate the symptoms. In each case, I first set $GLOBUS_LOCATION to the appropriate directory and source the setup script, so I have the proper $LD_LIBRARY_PATH in my environment. I think the "Can't contact LDAP server" message is a normal result of running the command with no arguments.
On dtf-test1:

    The dtf-test1 version of ldapsearch gives:

        ldap_sasl_interactive_bind_s: Can't contact LDAP server

    The tg-login1 version of ldapsearch gives:

        /usr/local/apps/globus-2.4.3-gcc-2004-02-17/bin/ldapsearch: \
            error while loading shared libraries: libbind.so.0: cannot \
            open shared object file: No such file or directory

On tg-login1:

    The dtf-test1 version of ldapsearch gives:

        ldap_sasl_interactive_bind_s: Can't contact LDAP server

    The tg-login1 version of ldapsearch gives:

        Segmentation fault

    (gdb shows the seg fault occurring at the same location I reported earlier.)

Ok, it looks like a shared library problem. I run "ldd" on both ldapsearch executables (again, with $LD_LIBRARY_PATH set properly). After factoring out the different $GLOBUS_LOCATION/lib directory paths, there's only one significant difference: the dtf-test1 version has:

    libresolv.so.2 => /lib/libresolv.so.2

and the tg-login1 version instead has:

    libbind.so.0 => /usr/lib/libbind.so.0

"rpm -qf" tells me that /lib/libresolv.so.2 is part of

    glibc-2.2.5-136 on dtf-test1 and
    glibc-2.2.5-161 on tg-login1

and /usr/lib/libbind.so.0 is part of

    bind9-utils-9.2.2-64 on tg-login1

but it doesn't exist on the older dtf-test1. dtf-test1 has bind9-utils-9.1.3-218 installed, but that version of the package doesn't include the same libraries.

As you'll recall, the seg fault occurs on a call to gethostbyname_r(). On tg-login1, that function is provided by /usr/lib/libbind.so.0. On dtf-test1, the gethostbyname_r() function doesn't exist; there's no man page for it, and no reference to it anywhere in /usr/include. The source file that contains the call to gethostbyname_r (util-int.c line 404, from the openldap sources) has a lot of configuration #ifdef's.
So the version of ldapsearch built on dtf-test1 works because it doesn't call gethostbyname_r() (the function doesn't exist there, but the configure script is smart enough to find an alternative), while the version built on tg-login1 dies with a seg fault because it does call gethostbyname_r(); either it calls it incorrectly or there's a bug in the bind9-utils package itself.

Now that I think about it, I'm not sure how much closer this gets us to tracking down the cause of the error; we already knew that it was dying on a call to gethostbyname_r(). But it may suggest a workaround if we can build the globus_openldap package in a way that prevents it from using gethostbyname_r() even if it's available. I'll look into that next.
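The kind of configuration-dependent choice described above can be pictured with a sketch like the one below. It is not the OpenLDAP code from util-int.c. The macro DEMO_HAVE_GETHOSTBYNAME_R and the function demo_canonical_name are hypothetical names invented for this illustration, and the reentrant branch assumes the 6-argument glibc-style prototype.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* ask glibc headers to declare gethostbyname_r */
#endif
#include <assert.h>
#include <netdb.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical switch standing in for whatever a configure script would
 * define; remove it to build the fallback branch instead. */
#define DEMO_HAVE_GETHOSTBYNAME_R 1

/* Copies the canonical name of `name` into out[0..outlen).
 * Returns 0 on success, -1 if the lookup fails. */
int demo_canonical_name(const char *name, char *out, size_t outlen)
{
    assert(name != NULL && out != NULL && outlen > 0);
#ifdef DEMO_HAVE_GETHOSTBYNAME_R
    /* Reentrant path: the caller supplies all of the storage. */
    struct hostent he, *result = NULL;
    char buf[8192];
    int herr = 0;

    if (gethostbyname_r(name, &he, buf, sizeof buf, &result, &herr) != 0
        || result == NULL)
        return -1;
    snprintf(out, outlen, "%s", result->h_name);
    return 0;
#else
    /* Fallback path: gethostbyname() returns static storage, so
     * serialise callers and copy the answer out before unlocking. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    int rc = -1;

    pthread_mutex_lock(&lock);
    struct hostent *hp = gethostbyname(name);
    if (hp != NULL) {
        snprintf(out, outlen, "%s", hp->h_name);
        rc = 0;
    }
    pthread_mutex_unlock(&lock);
    return rc;
#endif
}
```

A build that never defines the switch never references gethostbyname_r() at all, which is the property a workaround along these lines would rely on.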
Created an attachment (id=320) Small C test case
I've just reproduced the error with a small test program. I've provided the test program as an attachment; now I'll post a transcript that illustrates the problem.

The program calls gethostbyname_r(). If it's compiled and linked with no special options, it works correctly. If it's compiled and linked with "-lbind -lpthread", it gets the same segmentation fault I see in ldapsearch. This leads me to think it's an OS problem, not a Globus or OpenLDAP problem, unless both OpenLDAP and my test case are doing something wrong.

I'll give the TG clusters/software group a pointer to this bug report; perhaps somebody in that group can figure something out.

tg-login1% gcc gethostbyname_r_test.c -o gethostbyname_r_test
tg-login1% ./gethostbyname_r_test
Calling gethostbyname_r with name = "tg-login2"
returned_value = 0 (success)
h.h_name = "tg-login2.sdsc.teragrid.org"
result->h_name = "tg-login2.sdsc.teragrid.org"
Done
tg-login1% ldd gethostbyname_r_test
        libc.so.6.1 => /lib/libc.so.6.1 (0x2000000000054000)
        /lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)
tg-login1% gcc gethostbyname_r_test.c -o gethostbyname_r_test -lbind -lpthread
tg-login1% ./gethostbyname_r_test
Calling gethostbyname_r with name = "tg-login2"
Segmentation fault
Exit 139
tg-login1% ldd gethostbyname_r_test
        libbind.so.0 => /usr/lib/libbind.so.0 (0x2000000000040000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x20000000000fc000)
        libc.so.6.1 => /lib/libc.so.6.1 (0x2000000000134000)
        /lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)
tg-login1% gdb gethostbyname_r_test
GNU gdb 5.3
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "ia64-suse-linux"...
(gdb) run
Starting program: /users/kst/cvs-kst/c/gethostbyname_r_test
[New Thread 1024 (LWP 28495)]
Calling gethostbyname_r with name = "tg-login2"

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1024 (LWP 28495)]
0x20000000002190c1 in memcpy () from /lib/libc.so.6.1
(gdb) where
#0  0x20000000002190c1 in memcpy () from /lib/libc.so.6.1
#1  0x2000000000061dd0 in copy_hostent () from /usr/lib/libbind.so.0
#2  0x20000000000618d0 in gethostbyname_r () from /usr/lib/libbind.so.0
#3  0x4000000000000a00 in main ()
#4  0x2000000000167fa0 in __libc_start_main () from /lib/libc.so.6.1
#5  0x4000000000000840 in _start ()
#6  0x2000000000061dd0 in copy_hostent () from /usr/lib/libbind.so.0
Cannot access memory at address 0x60000f7ffffffff0
(gdb)
A debugging session is active.
Do you still want to close the debugger?(y or n) y
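For readers without access to the attachment, a test of this kind might look roughly like the sketch below. It is not the attached file. The function name try_lookup and the variable lookup_errno are made up here, and the call assumes the 6-argument glibc prototype of gethostbyname_r().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* ask glibc headers to declare gethostbyname_r */
#endif
#include <assert.h>
#include <errno.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_BUFLEN (1u << 20)   /* arbitrary cap on the scratch buffer */

/* Calls gethostbyname_r() for `name` and prints what came back.
 * Returns the function's return value (0 means success). */
int try_lookup(const char *name)
{
    struct hostent h, *result = NULL;
    size_t buflen = 1024;
    char *buf;
    int lookup_errno = 0;   /* not called h_errno, which may be a macro */
    int rc;

    assert(name != NULL);
    buf = malloc(buflen);
    if (buf == NULL)
        return ENOMEM;

    printf("Calling gethostbyname_r with name = \"%s\"\n", name);
    for (;;) {
        rc = gethostbyname_r(name, &h, buf, buflen, &result, &lookup_errno);
        /* ERANGE means the scratch buffer was too small: grow and retry. */
        if (rc != ERANGE || buflen >= MAX_BUFLEN)
            break;
        buflen *= 2;
        char *bigger = realloc(buf, buflen);
        if (bigger == NULL) {
            free(buf);
            return ENOMEM;
        }
        buf = bigger;
    }

    printf("returned_value = %d\n", rc);
    if (rc == 0 && result != NULL)
        printf("result->h_name = \"%s\"\n", result->h_name);
    else
        printf("lookup failed, lookup_errno = %d\n", lookup_errno);

    free(buf);
    return rc;
}
```

Calling try_lookup("tg-login2") from main() and building the program once without extra libraries and once with "-lbind -lpthread" would mirror the two runs in the transcript.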
Keith, I'd like to try your C test case and see if I can replicate it on my own. I'm tempted to add it to the PMR that I've reopened with IBM on pthread problems -- I've been struggling to characterize the higher failure rate I'm seeing with SuSE SLES8 SP3 as installed -- and my test cases have only been in Java, which compounds the complications. So having a simple C test case may be really valuable. What do you think? Jay
Jay: No problem. I've been playing with the idea that this might be caused by an incompatible version of gethostbyname_r(). There's one version in /lib/libc.so.6.1 (from glibc-2.2.5-161) and another in /usr/lib/libbind.so.0 (from bind9-utils-9.2.2-64). My test case passes if it uses the libc version, and fails if it uses the libbind version. If the two versions are incompatible, that could explain the symptoms we're seeing, but I haven't been able to confirm that. (There are several different versions of the gethostbyname_r() function, with 3, 5, and 6 arguments.)
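For reference, the 3-, 5-, and 6-argument forms can be written down as C function-pointer types. This is only a sketch: the typedef names are invented here, the platform attributions are the commonly cited ones, and nothing in this thread establishes which form libbind.so.0 actually implements. A caller compiled against one form but bound at run time to an implementation of another would pass arguments the callee misinterprets.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* ask glibc headers to declare gethostbyname_r */
#endif
#include <netdb.h>
#include <stddef.h>

/* 6 arguments (glibc): returns 0 on success; the answer is delivered
 * through *result. */
typedef int (*ghbn_r6_fn)(const char *name, struct hostent *ret,
                          char *buf, size_t buflen,
                          struct hostent **result, int *h_errnop);

/* 5 arguments (Solaris style): returns the hostent pointer, or NULL
 * on failure. */
typedef struct hostent *(*ghbn_r5_fn)(const char *name, struct hostent *ret,
                                      char *buf, int buflen, int *h_errnop);

/* 3 arguments (AIX / Tru64 style): scratch space lives in an opaque
 * structure that is only forward-declared here. */
struct hostent_data;
typedef int (*ghbn_r3_fn)(const char *name, struct hostent *ret,
                          struct hostent_data *data);

/* Compile-time check of which form <netdb.h> declares: the initialiser
 * draws a compiler diagnostic unless the header's prototype is the
 * 6-argument one. Returns 1 when it compiles and the address is set. */
int header_declares_6arg_variant(void)
{
    ghbn_r6_fn declared = gethostbyname_r;
    return declared != NULL;
}
```

Note that this check only covers the declaration the compiler sees; it says nothing about which shared library supplies the definition at link time.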
I have created a workaround for this bug. It is not a fix for the underlying problem. The workaround consists of modifying the "configure" script in the globus_openldap update package so it avoids using the "-lbind" (/usr/lib/libbind.so) library, forcing it to fall back to other libraries that actually work. I'll add a context diff as an attachment. I can also make the updated package available, but it's probably too big to attach to a bug report; I'll post a URL.
Created an attachment (id=321) Context diff for workaround

I didn't mark this as a patch because I haven't actually tried applying it by feeding it to the "patch" program.
See <http://www.sdsc.edu/~kst/globus-bugzilla-1573/>
I have a new version of the test case. The original version uses "h_errno" as a local variable; under some circumstances, depending on compiler and options, "h_errno" is used as a macro. I've changed it to "my_h_errno".

By turning up the compiler warning levels, I got a complaint about an implicit declaration of the gethostbyname_r function. This can easily cause segmentation faults, especially on a platform like the IA-64 where pointers are bigger than ints. After examining the relevant header file, I determined that adding "-D_REENTRANT" would avoid the warning, but the call still segfaults.

To demonstrate the problem:

tg-login1% gcc -g -W -Wall -D_REENTRANT \
               -I/usr/include/bind gethostbyname_r_test.c \
               -o gethostbyname_r_test \
               -lbind -lpthread
tg-login1% ./gethostbyname_r_test
Calling gethostbyname_r with name = "tg-login2"
Segmentation fault
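The general hazard with implicit declarations mentioned above can be illustrated in isolation. Without a prototype, a pre-C99 compiler assumes a function returns int, so on a platform with 32-bit int and 64-bit pointers a returned address loses its upper half. The sketch below only models that arithmetic; the function names are hypothetical, and since the segfault persists with the prototype in scope, this mechanism is evidently not the whole story for this bug.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Models a 64-bit address squeezed through a 32-bit int: only the low
 * 32 bits survive, and they are sign-extended when widened again. */
uint64_t through_int(uint64_t address)
{
    uint64_t low = address & UINT64_C(0xffffffff);
    if (low & UINT64_C(0x80000000))
        low |= UINT64_C(0xffffffff00000000);
    return low;
}

/* Prints an address next to what is left of it after the round trip. */
void show_truncation(uint64_t address)
{
    printf("0x%016" PRIx64 " -> 0x%016" PRIx64 "\n",
           address, through_int(address));
}
```

For an address shaped like the ones in the gdb trace, show_truncation(UINT64_C(0x2000000000061dd0)) prints 0x2000000000061dd0 -> 0x0000000000061dd0, a value that no longer points into the original mapping.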
Created an attachment (id=323) Updated C test case
A workaround package for this bug is now available on the Globus advisory pages. The update process is the same as for the other packages. The package is intended only for IA-64 installations (SuSE Linux SP3) that experience the segmentation faults when using MDS, but it is safe to apply on other supported platforms. Many thanks, Keith, for your efforts. Workaround: http://www-unix.globus.org/toolkit/advisories.html?version=2.4
The resolution tag has been changed to Fixed rather than NOTGLOBUS because a Globus workaround package now exists for the underlying OS bug.
I don't believe this bug is actually specific to the IA-64. It just happens that, within the TeraGrid, our IA-64 systems have a more recent OS update than our (relatively few) IA-32 systems. Specifically, the IA-64 systems have a newer version of the bind9-devel and bind9-utils packages. I expect that when and if the IA-32 systems are updated, this problem will appear there as well. I'm trying to get this reported to IBM; I'll provide more details when and if I hear back from them. The ugly workaround I implemented modifies the configure script so it doesn't try to use "-lbind". What part of the build process knows to try that in the first place? (The twisty mazes of libtool, autoconf, and so forth can be a bit difficult to navigate.) A cleaner solution, I think, would be to patch whatever looks for "-lbind" in the first place so it doesn't try to do so.