Bug 1573 - Seg fault in ldapsearch
Summary: Seg fault in ldapsearch
Status: RESOLVED FIXED
Product: MDS2
Component: gt2_mds
Version: unspecified
Platform: IA64 Linux
Importance: P2 major
Target Milestone: ---
Reported: 2004-02-18 16:39
Modified: 2005-12-06 17:00


Attachments
Failure log (1.72 KB, text/plain)
2004-02-18 16:40, Keith Thompson
Small C test case (822 bytes, text/plain)
2004-02-19 20:54, Keith Thompson
Context diff for workaround (5.15 KB, text/plain)
2004-02-20 15:06, Keith Thompson
Updated C test case (828 bytes, text/plain)
2004-02-21 21:16, Keith Thompson



Description From 2004-02-18 16:39:30
I just installed Globus 2.4.3 on an IA-64 system running SuSE Linux (I
think it's SP3; I can provide more detailed configuration information
later if it's needed).
   
When I try to start up the MDS services with "SXXgris start",
it immediately fails.  I also get a segmentation fault in
grid-info-search, which I've tracked down to the "ldapsearch" command.
It's dying on a call to gethostbyname_r().  I've confirmed that the
gethostbyname_r() function works in a small test program.

I get the same failure in two different builds, one built with
gcc 3.2.2, the other with Intel's ecc 7.1.  I do not get this failure
under a similar IA-32 build.  I also don't get it on a nearly identical
build on another IA-64 system (which is really odd).

I had attempted to apply all currently available updates, but the
globus_openldap-2.0.22 update didn't build; see my response to
Globus Bugzilla # 1201 for more information on that.

I'll attach a log showing the error.
------- Comment #1 From 2004-02-18 16:40:43 -------
Created an attachment (id=318) [details]
Failure log
------- Comment #2 From 2004-02-18 16:55:52 -------
JP Navarro <navarro@mcs.anl.gov> suggests that
<http://curl.haxx.se/mail/lib-2002-12/0067.html>
might be related to this.  I have no idea, but I thought
I'd pass it along.
------- Comment #3 From 2004-02-18 17:20:28 -------
How likely is it that the patch for bug # 1201 would correct this?
(I suspect not, but I thought it's worth asking.)
------- Comment #4 From 2004-02-18 18:11:20 -------
Not likely that it would fix the particular seg fault you are seeing.  But it 
won't hurt at all to try it in your installation.
------- Comment #5 From 2004-02-19 03:39:24 -------
I've done a new build with the updated globus_openldap-2.0.22.tar.gz
(the one that actually builds), and I still get the seg fault.
------- Comment #6 From 2004-02-19 20:41:18 -------
I have some more information on this bug.  The seg fault seems to
occur if ldapsearch links to /usr/lib/libbind.so.0, but not if it
links to /lib/libresolv.so.2.

The gory details.

I've done nearly equivalent test builds on two different nodes:

    dtf-test1.sdsc.teragrid.org (old config)
    tg-login1.sdsc.teragrid.org (new config)

Both systems are IA-64 SuSE systems (TeraGrid nodes).  tg-login1 is
an up-to-date system, including a full set of packages and the latest
service pack (SP3, I think) from SuSE.  dtf-test1 hasn't been updated
for a while, and it's basically configured as a compute node, with a
much smaller set of packages, some of them probably in older versions.
("rpm -qa" shows 822 packages on tg-login1 and 272 on dtf-test1 --
not particularly meaningful numbers, but they give a general idea of
how the systems are set up.)

Both builds used basically the same set of bundles and options
(the build on tg-login1 doesn't have the latest version of the
globus_openldap-2.0.22.tar.gz update package, but I'm fairly
sure that's not the issue; the dtf-test1 build didn't include some
irrelevant extras).  I presume an examination of the logs would show
different "configure" output due to the different sets of libraries
installed on the two nodes.  I'll keep the log files indefinitely,
and I can look for specifics if you like, but I'm not going to post
multi-megabyte log files.

Both nodes share the filesystem on which the builds were done, so I can
run both versions of the ldapsearch command from either node.  I run
the command with no arguments to illustrate the symptoms.  In each
case, I first set $GLOBUS_LOCATION to the appropriate directory and
source the setup script, so I have the proper $LD_LIBRARY_PATH in
my environment.  I think the "Can't contact LDAP server" message is
a normal result of running the command with no arguments.

On dtf-test1:
    The dtf-test1 version of ldapsearch gives:
        ldap_sasl_interactive_bind_s: Can't contact LDAP server
    The tg-login1 version of ldapsearch gives:
        /usr/local/apps/globus-2.4.3-gcc-2004-02-17/bin/ldapsearch: \
        error while loading shared libraries: libbind.so.0: cannot \
        open shared object file: No such file or directory

On tg-login1:
    The dtf-test1 version of ldapsearch gives:
        ldap_sasl_interactive_bind_s: Can't contact LDAP server
    The tg-login1 version of ldapsearch gives:
        Segmentation fault
        (gdb shows the seg fault occurring at the same location I
        reported earlier.)

Ok, it looks like a shared library problem.

I run "ldd" on both ldapsearch executables (again, with
$LD_LIBRARY_PATH set properly).  After factoring out the different
$GLOBUS_LOCATION/lib directory paths, there's only one significant
difference: the dtf-test1 version has:

    libresolv.so.2 => /lib/libresolv.so.2

and the tg-login1 version instead has:

    libbind.so.0 => /usr/lib/libbind.so.0

"rpm -qf" tells me that /lib/libresolv.so.2 is part of

    glibc-2.2.5-136 on dtf-test1 and
    glibc-2.2.5-161 on tg-login1

and /usr/lib/libbind.so.0 is part of

    bind9-utils-9.2.2-64 on tg-login1

but it doesn't exist on the older dtf-test1.

dtf-test1 has bind9-utils-9.1.3-218 installed, but that version of
the package doesn't include the same libraries.

As you'll recall, the seg fault occurs on a call to gethostbyname_r().
On tg-login1, that function is provided by /usr/lib/libbind.so.0.
On dtf-test1, the gethostbyname_r() function doesn't exist; there's
no man page for it, and no reference to it anywhere in /usr/include.
The source file that contains the call to gethostbyname_r (util-int.c
line 404, from the openldap sources) has a lot of configuration
#ifdef's.

So the version of ldapsearch built on dtf-test1 works because it
doesn't call gethostbyname_r() (because it doesn't exist, but the
configure script is smart enough to find an alternative), but the
version built on tg-login1 dies with a seg fault because it does call
gethostbyname_r(); either it calls it incorrectly or there's a bug
in the bind9-utils package itself.
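
The configure-driven choice described above can be sketched in C.  This
is only an illustration, not the code from util-int.c:
HAVE_GETHOSTBYNAME_R stands in for whatever macro the configure script
defines, host_resolves() and lookup_path() are hypothetical names, and
glibc's six-argument calling convention is assumed.

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` resolves, 0 otherwise. */
int host_resolves(const char *name)
{
    assert(name != NULL);
#ifdef HAVE_GETHOSTBYNAME_R      /* hypothetical macro set by configure */
    struct hostent h, *result = NULL;
    char buf[8192];              /* scratch space for the resolver */
    int my_h_errno = 0;

    /* Assumption: the six-argument glibc form is the one in scope. */
    return gethostbyname_r(name, &h, buf, sizeof buf,
                           &result, &my_h_errno) == 0
           && result != NULL;
#else
    /* Fallback: the older, non-reentrant interface. */
    return gethostbyname(name) != NULL;
#endif
}

/* Reports which branch was compiled in. */
const char *lookup_path(void)
{
#ifdef HAVE_GETHOSTBYNAME_R
    return "gethostbyname_r";
#else
    return "gethostbyname";
#endif
}
```

Compiled without -DHAVE_GETHOSTBYNAME_R, the fallback branch is used,
which corresponds to the dtf-test1 build; defining the macro selects the
branch that the tg-login1 build would take.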

Now that I think about it, I'm not sure how much closer this gets
us to tracking down the cause of the error; we already knew that
it was dying on a call to gethostbyname_r().  But it may suggest a
workaround if we can build the globus_openldap package in a way that
prevents it from using gethostbyname_r() even if it's available.
I'll look into that next.
------- Comment #7 From 2004-02-19 20:54:08 -------
Created an attachment (id=320) [details]
Small C test case
------- Comment #8 From 2004-02-19 20:59:12 -------
I've just reproduced the error with a small test program.  I've provided
the test program as an attachment; now I'll post a transcript that illustrates
the problem.

The program calls gethostbyname_r().  If it's compiled and linked with
no special options, it works correctly.  If it's compiled and linked
with "-lbind -lpthread", it gets the same segmentation fault I see in
ldapsearch.
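
For readers without access to the attachment, a minimal sketch of such a
test follows.  It is not the attached test case: lookup_canonical_name()
is a hypothetical helper, and glibc's six-argument form of
gethostbyname_r() is assumed.

```c
#define _GNU_SOURCE   /* exposes gethostbyname_r() in <netdb.h> on glibc */
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: look up `name` with the six-argument
 * gethostbyname_r() and copy the canonical host name into `out`.
 * Returns 0 on success, -1 on failure.
 */
int lookup_canonical_name(const char *name, char *out, size_t outlen)
{
    struct hostent h;               /* filled in by the resolver */
    struct hostent *result = NULL;  /* points at h on success */
    char buf[8192];                 /* scratch space for strings and lists */
    int my_h_errno = 0;             /* not "h_errno", which may be a macro */
    int rc;

    assert(name != NULL && out != NULL);

    rc = gethostbyname_r(name, &h, buf, sizeof buf, &result, &my_h_errno);
    if (rc != 0 || result == NULL)
        return -1;
    if (strlen(result->h_name) >= outlen)
        return -1;                  /* caller's buffer is too small */
    strcpy(out, result->h_name);
    return 0;
}
```

Linking a program like this with no extra libraries uses the libc
resolver; adding "-lbind -lpthread" to the link line is what selects the
libbind version in the transcript below.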

This leads me to think it's an OS problem, not a Globus or OpenLDAP problem,
unless both OpenLDAP and my test case are doing something wrong.

I'll give the TG clusters/software group a pointer to this bug report; perhaps
somebody in that group can figure something out.

tg-login1% gcc gethostbyname_r_test.c -o gethostbyname_r_test
tg-login1% ./gethostbyname_r_test
Calling gethostbyname_r with name = "tg-login2"
returned_value = 0 (success)
h.h_name       = "tg-login2.sdsc.teragrid.org"
result->h_name = "tg-login2.sdsc.teragrid.org"
Done
tg-login1% ldd gethostbyname_r_test
        libc.so.6.1 => /lib/libc.so.6.1 (0x2000000000054000)
        /lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)
tg-login1% gcc gethostbyname_r_test.c -o gethostbyname_r_test -lbind -lpthread
tg-login1% ./gethostbyname_r_test
Calling gethostbyname_r with name = "tg-login2"
Segmentation fault
Exit 139
tg-login1% ldd gethostbyname_r_test
        libbind.so.0 => /usr/lib/libbind.so.0 (0x2000000000040000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x20000000000fc000)
        libc.so.6.1 => /lib/libc.so.6.1 (0x2000000000134000)
        /lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)
tg-login1% gdb gethostbyname_r_test
GNU gdb 5.3
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "ia64-suse-linux"...
(gdb) run
Starting program: /users/kst/cvs-kst/c/gethostbyname_r_test
[New Thread 1024 (LWP 28495)]
Calling gethostbyname_r with name = "tg-login2"

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1024 (LWP 28495)]
0x20000000002190c1 in memcpy () from /lib/libc.so.6.1
(gdb) where
#0  0x20000000002190c1 in memcpy () from /lib/libc.so.6.1
#1  0x2000000000061dd0 in copy_hostent () from /usr/lib/libbind.so.0
#2  0x20000000000618d0 in gethostbyname_r () from /usr/lib/libbind.so.0
#3  0x4000000000000a00 in main ()
#4  0x2000000000167fa0 in __libc_start_main () from /lib/libc.so.6.1
#5  0x4000000000000840 in _start ()
#6  0x2000000000061dd0 in copy_hostent () from /usr/lib/libbind.so.0
Cannot access memory at address 0x60000f7ffffffff0
(gdb) A debugging session is active.
Do you still want to close the debugger?(y or n) y
------- Comment #9 From 2004-02-20 04:04:02 -------
Keith,

I'd like to try your C test case and see if I can replicate it on my own.  I'm 
tempted to add it to the PMR that I've reopened with IBM on pthread problems -- 
I've been struggling to characterize the higher failure rate I'm seeing with 
SuSE SLES8 SP3 as installed -- and my test cases have only been in Java, which 
compounds the complications.  So having a simple C test case may be really 
valuable.  What do you think?

Jay
------- Comment #10 From 2004-02-20 05:49:08 -------
Jay: No problem.

I've been playing with the idea that this might be caused 
by an incompatible version of gethostbyname_r().  There's one
version in /lib/libc.so.6.1 (from glibc-2.2.5-161) and another in
/usr/lib/libbind.so.0 (from bind9-utils-9.2.2-64).  My test case passes
if it uses the libc version, and fails if it uses the libbind version.
 
If the two versions are incompatible, that could explain the symptoms
we're seeing, but I haven't been able to confirm that.  (There are
several different versions of the gethostbyname_r() function, with 3,
5, and 6 arguments.)
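
For reference, the three calling conventions can be written as
function-pointer types.  This is a sketch from general knowledge of the
platforms named in the comments, not taken from the bind9 or glibc
sources on these systems; the typedef and function names are made up for
illustration, and which form libbind.so.0 implements here has not been
confirmed.

```c
#define _GNU_SOURCE   /* exposes gethostbyname_r() in <netdb.h> on glibc */
#include <assert.h>
#include <netdb.h>
#include <stddef.h>

/* Six arguments: the glibc form; returns an error code. */
typedef int (*ghbn_r_6arg)(const char *name, struct hostent *ret,
                           char *buf, size_t buflen,
                           struct hostent **result, int *h_errnop);

/* Five arguments: the Solaris-style form; returns the hostent pointer. */
typedef struct hostent *(*ghbn_r_5arg)(const char *name,
                                       struct hostent *ret,
                                       char *buf, int buflen,
                                       int *h_errnop);

/* Three arguments: the AIX/HP-UX-style form; hostent_data is opaque here. */
struct hostent_data;
typedef int (*ghbn_r_3arg)(const char *name, struct hostent *ret,
                           struct hostent_data *data);

/*
 * Assumes a glibc system: the assignment below is diagnosed by the
 * compiler if the declaration in scope is not the six-argument form.
 */
int declared_form_is_six_arg(void)
{
    ghbn_r_6arg fn = gethostbyname_r;
    assert(fn != NULL);
    return 1;
}
```

A check like this only compares against the header seen at compile time.
It says nothing about which shared library supplies the symbol at link
time, which is exactly where a mismatch could hide.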
------- Comment #11 From 2004-02-20 15:05:11 -------
I have created a workaround for this bug.  It is not a fix for the underlying
problem.

The workaround consists of modifying the "configure" script in the
globus_openldap update package so it avoids using the "-lbind"
(/usr/lib/libbind.so) library, forcing it to fall back to other
libraries that actually work.

I'll add a context diff as an attachment.  I can also make the updated
package available, but it's probably too big to attach to a bug report;
I'll post a URL.
------- Comment #12 From 2004-02-20 15:06:22 -------
Created an attachment (id=321) [details]
Context diff for workaround

I didn't mark this as a patch because I haven't actually tried applying
it by feeding it to the "patch" program.
------- Comment #13 From 2004-02-20 15:12:14 -------
See <http://www.sdsc.edu/~kst/globus-bugzilla-1573/>
------- Comment #14 From 2004-02-21 21:15:50 -------
I have a new version of the test case.  The original version uses "h_errno"
as a local variable; under some circumstances, depending on compiler and
options, "h_errno" is used as a macro.  I've changed it to "my_h_errno".

By turning up the compiler warning levels, I got a complaint about an
implicit declaration of the gethostbyname_r function.  This can easily
cause segmentation faults, especially on a platform like the IA-64 where
pointers are bigger than ints.  After examining the relevant header file,
I determined that adding "-D_REENTRANT" would avoid the warning, but the
call still segfaults.
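
The pointer-versus-int hazard can be illustrated with plain integer
arithmetic.  The sketch below models a 64-bit pointer value passing
through a 32-bit int, as happens when an undeclared function that really
returns a pointer is assumed to return int.  The function names are
hypothetical, the sample address is taken from the ldd output earlier in
this report, and an unsigned 32-bit type is used to keep the conversion
well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low 32 bits, as a 32-bit int return value would. */
uint64_t through_32bit_int(uint64_t pointer_bits)
{
    return (uint64_t)(uint32_t)pointer_bits;
}

/* Returns 1 if the value is unchanged by the round trip, 0 otherwise. */
int survives_round_trip(uint64_t pointer_bits)
{
    return through_32bit_int(pointer_bits) == pointer_bits;
}

/* Self-check using an IA-64 shared-library load address. */
void truncation_demo(void)
{
    uint64_t addr = UINT64_C(0x2000000000054000);

    assert(through_32bit_int(addr) == UINT64_C(0x54000)); /* high bits lost */
    assert(!survives_round_trip(addr));
    assert(survives_round_trip(UINT64_C(0x54000)));       /* fits in 32 bits */
}
```

On IA-32, where pointers and ints are both 32 bits, the same mistake
often goes unnoticed.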

To demonstrate the problem:

tg-login1% gcc -g -W -Wall -D_REENTRANT \
    -I/usr/include/bind gethostbyname_r_test.c \
    -o gethostbyname_r_test \
    -lbind -lpthread
tg-login1% ./gethostbyname_r_test
Calling gethostbyname_r with name = "tg-login2"
Segmentation fault
------- Comment #15 From 2004-02-21 21:16:41 -------
Created an attachment (id=323) [details]
Updated C test case
------- Comment #16 From 2004-02-26 19:43:37 -------
A workaround package for this bug is now available on the Globus advisory
pages.  The update process is the same as for the other packages.  This package
is needed only on IA-64 installations (SuSE Linux SP3) that experience
segmentation faults when using MDS, but it is safe to install on other
supported platforms.  Many thanks, Keith, for your efforts.

Workaround: http://www-unix.globus.org/toolkit/advisories.html?version=2.4
------- Comment #17 From 2004-02-27 11:45:38 -------
The resolution has been changed to FIXED rather than NOTGLOBUS because a
Globus workaround package now exists for the OS bug.
------- Comment #18 From 2004-08-09 22:52:24 -------
I don't believe this bug is actually specific to the IA-64.  It just 
happens that, within the TeraGrid, our IA-64 systems have a more recent
OS update than our (relatively few) IA-32 systems.  Specifically, the
IA-64 systems have a newer version of the bind9-devel and bind9-utils
packages.

I expect that when and if the IA-32 systems are updated, this problem 
will appear there as well.

I'm trying to get this reported to IBM; I'll provide more details 
when and if I hear back from them.

The ugly workaround I implemented modifies the configure script so it
doesn't try to use "-lbind".  What part of the build process decides to
try that library?  (The twisty mazes of libtool, autoconf,
and so forth can be a bit difficult to navigate.)  A cleaner solution,
I think, would be to patch whatever looks for "-lbind" in the first
place so that it no longer does.