Bug 1532

Summary: Server hangs
Product: Replica Location Reporter: Aleksandr Konstantinov <aleks@fys.uio.no>
Component: RLSAssignee: Rob S <schuler@isi.edu>
Status: RESOLVED FIXED    
Severity: normal CC: annc@isi.edu, gmehta@isi.edu, langgard@nbi.dk, schuler@isi.edu, shishir@isi.edu
Priority: P2    
Version: 2.0   
Target Milestone: ---   
Hardware: PC   
OS: Linux   
Attachments: Small test-program that hangs the server
Patch for freezing RLS problem
Initial patch

Description From 2004-02-09 16:02:43
After some time globus-rls-server hangs and does not accept new 
connections (accept() is never called for new connections). gdb shows
few threads hanging in globus_io_cancel(), which is called (directly and
inderectly) from functions acceptcb(), checkidle() and rrpc_connect() .
------- Comment #1 From 2004-02-09 16:58:47 -------
We need the following information:
1. RLS version (It's most likely not 2.0)
2. Globus version
3. Globus IO library version
4. RLS Configuration information
5. If the problem can be reproduced, a test case (Any set of interactions with
the server which causes it to hang)
6. Log messages logged by RLS (Syslog messages or from stdout if server was
started in debug mode)
------- Comment #2 From 2004-02-10 00:50:59 -------
1. 2.1.4 
2. 2.4.3
3. 4.0.1
4. >cat globus-rls-server.conf
db_user                 ...
db_pwd                  ...
odbcini                 /etc/odbc.ini

update_immediate        true            # Propagate changes to RLI quickly

# LRC options
lrc_server              true
lrc_dbname              lrc1000

# RLI options
rli_server              false
#rli_server             true            # Indicates this is an RLI server
#rli_dbname             rli1000         # mysql-database-name 

# Access control.  There may be multiple acl options, each one
# is a regular expression which is matched against the user's Distinguished
# Name in is X.509 certificate, followed by a colon and a whitespace
# separated list of privs, which are all, lrc_read, lrc_update, rli_read,
# rli_update, stats and admin.  If you use a grid-mapfile to map DNs to
# local usernames you may specify those instead of DNs.
acl                     ...: all
acl                     ...: all
acl                     ...: all
acl                     ...: all

5. Not sure. I think it just hangs after some time being used (couple of days).
In my setup it is being continiuosly contacted by storage element for adding
new lfns and refreshing old ones. It also interacts with RLI server. Which, 
by the way, also hangs but it takes longer time.

6. Server is started with -d and it's stdout and stderr are redirected to file.
There are no messages from server in that file. Actually to find out what is 
going on I added few fprintf(stderr,...) into code. And I see only my messages
there.
------- Comment #3 From 2004-02-10 13:01:26 -------
By default, the logging level is set to 0 which means that only errors are
logged. This can be increased enable higher levels of debug messages.
To start the server with logging set to level 3, for example, you'd do:
$GLOBUS_LOCATION/bin/globus-rls-server -d -L 3
or set the "loglevel" option in the configuration file. (Level 3 might be too
verbose. So, you might want to start at a lower level)

Let me confirm your setup:
1. You have one RLS configured as an LRC only
2. You have another RLS configured as an RLI only
3. The LRC is updating the RLI and these two are the only RLS servers in your setup.
4. You're (continuously ?) adding mappings to the LRC.
   (I don't understand what you mean by refreshing old LFNs. Mappings continue
to exist in the LRC database until explicitly deleted)

I'll try to duplicate this setup and see what happens. Some additional details
might help:
1. What database are you using? MySQL ? 
2. What's the approximate number of lfns/mappings in the LRC?
3. How do you update the LRC?
      a. Are you using the client (globus-rls-cli) provided or the RLS C/Java API?
      b. Do you create a new connection per new LFN added and close it when done
or have a long standing connection?
      c. How often do you add a new lfn?
4. How many fds are in use by the server? You can check this with the "lsof
$GLOBUS_LOCATION/bin/globus-rls-server | wc -l" command.
------- Comment #4 From 2004-02-10 13:58:08 -------
1. Yes.
2. Yes.
3. Yes.
4. Yes and No. Setup is usually left alone for few days without new files 
coming in. So there are no new mapping appearing in LRC. But internal design
of other parts is such that it continiously registers same mapping into LRC
approximately every 10 minutes.


1. PostgreSQL used.
2.
> globus-rls-cli query wildcard lrc lfn '*' rls://host
  __storage_service__: se://dc4.uio.no:8000/se
  se?test_file_2: se://dc4.uio.no:8000/se/se?test_file_2
  test_file3: se://dc4.uio.no:8000/se?test_file3
  test_file3_copy: se://dc4.uio.no:8000/se?test_file3_copy
  test_file3_copy2: se://dc4.uio.no:8000/se?test_file3_copy2
  test_file3_copy4: se://dc4.uio.no:8000/se?test_file3_copy4
  test_file3_copy5: se://dc4.uio.no:8000/se?test_file3_copy5
  test_file5: se://dc4.uio.no:8000/se/test_file5
  test_file6: se://dc4.uio.no:8000/se/test_file6
  test_file_1: se://dc4.uio.no:8000/se/test_file_1
  test_file_1: se://dc4.uio.no:8000/se?test_file_1
  testdms1: gsiftp://testfile
  testdms2: gsiftp://mbranco_test1
  testdms3: gsiftp://testfile
  testdms4: gsiftp://testfil1e
  testdms4: gsiftp://testfile

3. My code uses C API. One connection per mapping.

4. I can't answer that question right now. Frontend is down and I can't 
login as root. Currently it has 1 listening TCP socket and 2 in CLOSE_WAIT
state. I'll try to get more information later while it is running and after
it hangs.
------- Comment #5 From 2004-02-11 14:46:31 -------
Information about open handles (lsof globus-rls-server would only 
show process which have globus-rls-server opened) :

> lsof -n | grep globus-rl | wc -l
     48

Just in case:

> lsof -n | grep globus-rl
globus-rl  1925     root  cwd    DIR        3,1     4096          2 /
globus-rl  1925     root  rtd    DIR        3,1     4096          2 /
globus-rl  1925     root  txt    REG        3,1   460313     109577
/opt/globus/bin/globus-rls-server
globus-rl  1925     root  mem    REG        3,1   106400     124513 /lib/ld-2.3.2.so
globus-rl  1925     root  mem    REG        3,1   140923      48506
/opt/globus/lib/libglobus_rls_client_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    DEL        0,4               32769 /SYSV79016d02
globus-rl  1925     root  mem    REG        3,7    21040     243150
/usr/lib/gconv/gconv-modules.cache
globus-rl  1925     root  mem    REG        3,7     5524     243094
/usr/lib/gconv/ISO8859-1.so
globus-rl  1925     root  mem    REG        3,7   412400      65403
/usr/lib/libodbc.so.1.0.0
globus-rl  1925     root  mem    REG        3,1   176952      48023
/opt/globus/lib/libglobus_io_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1    37756      48003
/opt/globus/lib/libglobus_gss_assist_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1   105476      48013
/opt/globus/lib/libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1    54820      47983
/opt/globus/lib/libglobus_gsi_proxy_core_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1    56156      47973
/opt/globus/lib/libglobus_gsi_credential_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1    36120      47953
/opt/globus/lib/libglobus_gsi_callback_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1    33640      48053
/opt/globus/lib/libglobus_oldgaa_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1    41096      47993
/opt/globus/lib/libglobus_gsi_sysconfig_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1    18636      47963
/opt/globus/lib/libglobus_gsi_cert_utils_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1     6284      48073
/opt/globus/lib/libglobus_openssl_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1    14572      48083
/opt/globus/lib/libglobus_proxy_ssl_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1    13288      48063
/opt/globus/lib/libglobus_openssl_error_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1   213576      48211
/opt/globus/lib/libssl_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1   920868      47753
/opt/globus/lib/libcrypto_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1   174868      47763
/opt/globus/lib/libglobus_common_gcc32dbgpthr.so.0.0.0
globus-rl  1925     root  mem    REG        3,1    15900     124433
/lib/libdl-2.3.2.so
globus-rl  1925     root  mem    REG        3,1    80592     171082
/lib/tls/libpthread-0.34.so
globus-rl  1925     root  mem    REG        3,1   212020     171080
/lib/tls/libm-2.3.2.so
globus-rl  1925     root  mem    REG        3,1    53568     124555
/lib/libnss_compat-2.3.2.so
globus-rl  1925     root  mem    REG        3,1    91624     124437
/lib/libnsl-2.3.2.so
globus-rl  1925     root  mem    REG        3,7   229032      65423
/usr/lib/libodbcpsql.so.2.0.0
globus-rl  1925     root  mem    REG        3,1    23688     124431
/lib/libcrypt-2.3.2.so
globus-rl  1925     root  mem    REG        3,1    52492     124443
/lib/libnss_files-2.3.2.so
globus-rl  1925     root  mem    REG        3,1    18460     124441
/lib/libnss_dns-2.3.2.so
globus-rl  1925     root  mem    REG        3,1    76608     124453
/lib/libresolv-2.3.2.so
globus-rl  1925     root  mem    REG        3,1  1539996     171088
/lib/tls/libc-2.3.2.so
globus-rl  1925     root    0r   CHR        1,3               32636 /dev/null
globus-rl  1925     root    1w   REG        3,1    60650      48523 /var/log/rls.log
globus-rl  1925     root    2w   REG        3,1    60650      48523 /var/log/rls.log
globus-rl  1925     root    3r  FIFO        0,5            12318075 pipe
globus-rl  1925     root    4w  FIFO        0,5            12318075 pipe
globus-rl  1925     root    5u  IPv4   12318076                 TCP *:39281 (LISTEN)
globus-rl  1925     root    6u  unix 0xdb7b2080            12318080 socket
globus-rl  1925     root    7u  IPv4   12318084                 TCP
129.240.86.43:43278->129.240.86.18:39281 (CLOSE_WAIT)
globus-rl  1925     root    8w   CHR        1,3               32636 /dev/null
globus-rl  1925     root    9u  unix 0xc0fe1580            12319203 socket
globus-rl  1925     root   10u  unix 0xd1888a80            12319197 socket
globus-rl  1925     root   11u  unix 0xdfaed580            12319200 socket
globus-rl  1925     root   12u  IPv4   12712157                 TCP
129.240.86.43:39281->129.240.86.18:32773 (CLOSE_WAIT)
------- Comment #6 From 2004-02-25 15:32:39 -------
I've had this setup running without any problems for a few days now. Do you
have
any new information to add? It might be a good idea to also attach your code
that uses the RLS API.
------- Comment #7 From 2004-02-26 15:43:30 -------
The code is a bit too complicated to be attached. But I have some advances.
My setup was:
1. LRC server
2. My server, which registers periodically lfn-pfn mapping into LRC through RLS
API and accepts requests from clients through globus_io.
3. My client which can query both RLS and my server for available information.

After some time both LRC server and my server stoped responding to my client.
Some additional debug output showed that in both even globus_io accept callback
was never reached.
So I tried gdb. It showed me that every time setup hanged programs were sitting
in some unknow function. And backtrace looked like (for both LRC and my server):

- unknow function
- globus_io functions
- program code which was registered as globus_io accept callback
- globus_io functions
- program code which calls globus_io

So I assumed it must be something like "distributed loop" in globus_io. To
break it I rearranged code of my server not to call globus_io operations 
from inside globus_io callbacks. Now it all works for ~3 days without restart.
------- Comment #8 From 2004-02-26 17:49:56 -------
I talked to a globus io/xio developer and he says that it should be ok to use
globus_io functions in globus_io callbacks, provided you have the libraries
built threaded.
To build RLS you'd need the threaded flavors of io anyway.

If you continue to have problems with your server, you could file this under
globus_io, since it doesn't seem to be an RLS problem in particular.

If you don't have any RLS problems in the next couple of days, this bug will be
closed as RESOLVED-NOTABUG.
------- Comment #9 From 2004-03-04 10:52:33 -------
Today I found out that rls (lrc) server does not respond. All that time my
program was registering same lfn-pfn pair into it approximately every 10 seconds.
netstat show 240 connections to rls server with various states: SYN_RECV,
CLOSE_WAIT, ESTABLISHED, FIN_WAIT1.
I killed my executable but rls server did not recover. gdb shows it's threads
hangs in:

#1  0x40264f18 in globus_l_thread_pool_thread_start ()
#1  0x400a4d17 in globus_l_io_poll (user_args=0x0) at globus_io_core.c:3188
#1  0x0805795e in checkidle (a=0x807669c) at server.c:998
#1  0x08057020 in procreq (a=0x0) at server.c:802
#1  0x08057020 in procreq (a=0x0) at server.c:802
#1  0x08057020 in procreq (a=0x0) at server.c:802
#1  0x0805db93 in process_update (vrli=0x80a9080) at update.c:448
#1  0x080542c8 in lrc_llupdates (a=0x807ff7c) at lrc.c:367
#1  0x08053d79 in lrc_bfupdates (a=0x8091ef4) at lrc.c:212
#1  0x08053f41 in lrc_bfiupdates (a=0x8091fa4) at lrc.c:268
#1  0x40264f18 in globus_l_thread_pool_thread_start ()
#1  0x40252853 in globus_callback_signal_poll ()

and at lowest frame there is 
#0  0xffffe002 in ?? ()
everuthere.
------- Comment #10 From 2004-03-04 13:30:55 -------
240 connections is on the higher side. I would like to make sure your client is
managing connections correctly.
To test this out, you could also try (for csh)
while (1)
  globus-rls-cli create <your lfn> <your pfn> rls://<your lrc>
  sleep 10
end
OR
watch -n 10 globus-rls-cli create <your lfn> <your pfn> rls://<your lrc>

which should achieve what you want to do and check if the lrc still hangs.

You should also attach your code so that I can use it in my setup.
------- Comment #11 From 2004-04-15 15:56:57 -------
We've recently seen such hangs on RH9 systems.
A workaround is to set LD_ASSUME_KERNEL to 2.4.1 before starting the server.
This is documented at:

http://www-unix.globus.org/toolkit/docs/3.2/installation/install_config_rls.html#redhat9

Since this is a pthreads incompatibility it's also applicable to earlier
versions of globus.
------- Comment #12 From 2004-06-28 10:44:45 -------
I've seen the same problem on a debian machine. But it only freezes if a new
connection arrives at the same time as one connection is timed out. The
followi\ng is a typical sniplet from the logfile:

21:47:06 T49156: checkidle: Timing out connection 817B3C8
21:47:06 T49156: freeconn: Connection 817B3C8
--21:47:06 T49156: freeconn: Before globus_io_close
21:47:06 T16386: listencb:
21:47:06 T16386: newconn: Connection 80EE498
--21:47:06 T16386: newconn: After setting c-parameters

where lines with -- prefix are extra debug messages I've put in. It looks as
though the call to globus_io_close in freeconn never returns for some reason.

Note the server doesn't really hang, but continues to do lrc_bfiupdates and
rli_bfiupdates:

21:52:06 T163851: lrc_bfiupdates: 0
21:52:06 T294917: rli_bfiupdates:
21:52:06 T294917: lock_get: 8068960 readlock
21:52:06 T294917: lock_get(8068960): 2 readers
21:52:06 T294917: lock_release(8068960): 1 readers

It just doesn't accept new connections.
------- Comment #13 From 2004-06-28 13:11:16 -------
Yes. The server hangs in globus_io_close. This holds up the other connection
management threads. I think Debian also ported the same changes that Redhat did.
The same workaround should work for you.
------- Comment #14 From 2004-06-30 02:46:27 -------
No, it doesn't work. The server hangs no matter if I use the workaround or not.
I attach a small program. This hangs the server in less than 30 seconds if
idletimeout is set to 5 seconds (which I did for testing).
------- Comment #15 From 2004-06-30 02:47:17 -------
Created an attachment (id=404) [details]
Small test-program that hangs the server
------- Comment #16 From 2004-07-02 18:09:29 -------
I ran this test case. For me, one of the threads got stuck in globus_io_close.
This held up the other threads and further connections were not accepted. Also,
connections previously queued up were also not freed.
This doesn't seem to be related to the RH9 issue, although the symptoms are
similar (hanging in globus_io_close)

Jakob: I saw on the ivdgl list that you have a patch for this. At this point,
I'm not sure RLS is the problem here. (We have to investigate why
globus_io_close didn't return, first). But, maybe your patch can shed some
light. Could you attach that also? Could you also let us know if this test is
indicative of your workload?
------- Comment #17 From 2004-07-05 04:08:58 -------
I agree that it's first and foremost a globus_io problem but it would be nice
to
have a workaround in the RLS-server. Here is what I get if I build globus_io
with -DBUILD_DEBUG and set GLOBUS_IO_DEBUG_LEVEL to 9:

18:23:20 T1101429552: checkidle: Timing out connection 83703F8
globus_io_register_cancel(): entering: handle=0x83703f8, handle->state =
4,fd=32
globus_i_io_unregister_operation(): entering: handle=0x83703f8, handle->state =
4, fd=32
globus_i_io_unregister_operation(): exiting, fd=32
globus_i_io_end_operation(): entering: handle=0x83703f8, handle->state =
4,fd=32
globus_i_io_end_operation(): exiting, fd=32
globus_l_io_select_wakeup(): entering
globus_l_io_select_wakeup(): select_active=0 wakeup_pending=0
globus_l_io_select_wakeup(): exiting
globus_io_register_cancel(): exiting
globus_l_io_kickout_cancel_cb(): exiting, fd=138179344
globus_l_io_handle_events(): entering
globus_l_io_handle_events(): Calling select()
globus_l_io_handle_events(): select() returned
globus_l_io_handle_events(): read, fd=5
globus_i_io_unregister_operation(): entering: handle=0xbffff290, handle->state
=
5, fd=5
globus_i_io_unregister_operation(): exiting, fd=5
globus_l_io_handle_events(): exiting
globus_l_io_kickout_cancel_cb(): entering, fd=32
globus_l_io_kickout_cancel_cb(): cancel kickout, fd=32
globus_io_register_close(): entering: handle=0x83703f8, handle->state = 4,
fd=32
globus_l_io_select_wakeup(): entering
globus_l_io_select_wakeup(): select_active=0 wakeup_pending=0
globus_l_io_select_wakeup(): exiting
globus_l_io_kickout_cancel_cb(): exiting, fd=32
globus_l_io_kickout_cb(): entering, fd=5
globus_i_io_end_operation(): entering: handle=0xbffff290, handle->state =
5,fd=5
globus_i_io_end_operation(): exiting, fd=5
18:23:20 T1084644144: listencb:
18:23:20 T1084644144: newconn: Connection 83B2698
globus_l_io_handle_events(): entering
globus_l_io_handle_events(): Calling select()
globus_l_io_handle_events(): select() returned
globus_l_io_handle_events(): exiting
globus_l_io_handle_events(): entering
globus_l_io_handle_events(): Calling select()

You can see the checkidle-call and the new connection coming in at the same
time. After that it hangs in a select-call inside globus_io waiting for new
events to happen in the open file-descriptors. This is not happening because
the
new connection is hanging waiting for the connlistmtx (that is locked by
checkidle) to be unlocked. 

I don't understand why it hangs here (and why there are two select calls) but
the patch that I attach makes a workaround for this. The idea is just to unlock
connlistmtx before calling globus_io_close and globus_io_cancel and relock
afterwards.

It looks like it eliminates this problem and the server is somewhat more
stable.
It still hangs though but now only approximately 1 time pr. day compared to 3-4
times a day before. I'm still investigating.
------- Comment #18 From 2004-07-05 04:10:54 -------
Created an attachment (id=405) [details]
Patch for freezing RLS problem
------- Comment #19 From 2004-07-06 20:55:29 -------
Created an attachment (id=406) [details]
Initial patch

Initial patch to fix the hanging problem, based on suggestions from globus_io
developers. Alternate fix being investigated.
------- Comment #20 From 2004-07-08 03:30:12 -------
Sorry, it doesn't work. Our patched server eventually hung after approximately
a
day of running. This time at some point new connections does not seem to get an
auth callback:

07:01:04 T262161: listencb:
07:01:04 T262161: newconn: Connection 80A0490
<snip>
07:05:50 T311316: acceptcb: Connection 80A0490
07:05:50 T311316: acceptcb: an end-of-file was reached
07:05:50 T311316: freeconn: Connection 80A0490
07:05:50 T311316: freeconn: Connection 80A0490 closed

which is repeated a couple of times. And then:

07:05:50 T16384: dolisten:
07:05:50 T262161: listencb:
07:05:50 T262161: newconn: Connection 80A0490

after which no new connections are accepted.

The problem seems to be unrelated to the previous one but the server still
hangs
and has to be restarted by hand.
------- Comment #21 From 2004-07-08 12:31:53 -------
This seems like a good time to open a new bug. I've filed this (new) problem as
bug #1819
------- Comment #22 From 2004-07-09 12:50:27 -------
Shishir's patch fixes original bug. New issue openned in bug #1819 as noted by 
Shishir.