| Summary: | Server hangs | ||
|---|---|---|---|
| Product: | Replica Location | Reporter: | Aleksandr Konstantinov <aleks@fys.uio.no> |
| Component: | RLS | Assignee: | Rob S <schuler@isi.edu> |
| Status: | RESOLVED FIXED | ||
| Severity: | normal | CC: | annc@isi.edu, gmehta@isi.edu, langgard@nbi.dk, schuler@isi.edu, shishir@isi.edu |
| Priority: | P2 | ||
| Version: | 2.0 | ||
| Target Milestone: | --- | ||
| Hardware: | PC | ||
| OS: | Linux | ||
| Attachments: |
Small test-program that hangs the server
Patch for freezing RLS problem Initial patch |
||
We need the following information: 1. RLS version (It's most likely not 2.0) 2. Globus version 3. Globus IO library version 4. RLS Configuration information 5. If the problem can be reproduced, a test case (Any set of interactions with the server which causes it to hang) 6. Log messages logged by RLS (Syslog messages or from stdout if server was started in debug mode)
1. 2.1.4 2. 2.4.3 3. 4.0.1 4. >cat globus-rls-server.conf db_user ... db_pwd ... odbcini /etc/odbc.ini update_immediate true # Propagate changes to RLI quickly # LRC options lrc_server true lrc_dbname lrc1000 # RLI options rli_server false #rli_server true # Indicates this is an RLI server #rli_dbname rli1000 # mysql-database-name # Access control. There may be multiple acl options, each one # is a regular expression which is matched against the user's Distinguished # Name in is X.509 certificate, followed by a colon and a whitespace # separated list of privs, which are all, lrc_read, lrc_update, rli_read, # rli_update, stats and admin. If you use a grid-mapfile to map DNs to # local usernames you may specify those instead of DNs. acl ...: all acl ...: all acl ...: all acl ...: all 5. Not sure. I think it just hangs after some time being used (couple of days). In my setup it is being continiuosly contacted by storage element for adding new lfns and refreshing old ones. It also interacts with RLI server. Which, by the way, also hangs but it takes longer time. 6. Server is started with -d and it's stdout and stderr are redirected to file. There are no messages from server in that file. Actually to find out what is going on I added few fprintf(stderr,...) into code. And I see only my messages there.
By default, the logging level is set to 0 which means that only errors are
logged. This can be increased enable higher levels of debug messages.
To start the server with logging set to level 3, for example, you'd do:
$GLOBUS_LOCATION/bin/globus-rls-server -d -L 3
or set the "loglevel" option in the configuration file. (Level 3 might be too
verbose. So, you might want to start at a lower level)
Let me confirm your setup:
1. You have one RLS configured as an LRC only
2. You have another RLS configured as an RLI only
3. The LRC is updating the RLI and these two are the only RLS servers in your setup.
4. You're (continuously ?) adding mappings to the LRC.
(I don't understand what you mean by refreshing old LFNs. Mappings continue
to exist in the LRC database until explicitly deleted)
I'll try to duplicate this setup and see what happens. Some additional details
might help:
1. What database are you using? MySQL ?
2. What's the approximate number of lfns/mappings in the LRC?
3. How do you update the LRC?
a. Are you using the client (globus-rls-cli) provided or the RLS C/Java API?
b. Do you create a new connection per new LFN added and close it when done
or have a long standing connection?
c. How often do you add a new lfn?
4. How many fds are in use by the server? You can check this with the "lsof
$GLOBUS_LOCATION/bin/globus-rls-server | wc -l" command.
1. Yes.
2. Yes.
3. Yes.
4. Yes and No. Setup is usually left alone for few days without new files
coming in. So there are no new mapping appearing in LRC. But internal design
of other parts is such that it continiously registers same mapping into LRC
approximately every 10 minutes.
1. PostgreSQL used.
2.
> globus-rls-cli query wildcard lrc lfn '*' rls://host
__storage_service__: se://dc4.uio.no:8000/se
se?test_file_2: se://dc4.uio.no:8000/se/se?test_file_2
test_file3: se://dc4.uio.no:8000/se?test_file3
test_file3_copy: se://dc4.uio.no:8000/se?test_file3_copy
test_file3_copy2: se://dc4.uio.no:8000/se?test_file3_copy2
test_file3_copy4: se://dc4.uio.no:8000/se?test_file3_copy4
test_file3_copy5: se://dc4.uio.no:8000/se?test_file3_copy5
test_file5: se://dc4.uio.no:8000/se/test_file5
test_file6: se://dc4.uio.no:8000/se/test_file6
test_file_1: se://dc4.uio.no:8000/se/test_file_1
test_file_1: se://dc4.uio.no:8000/se?test_file_1
testdms1: gsiftp://testfile
testdms2: gsiftp://mbranco_test1
testdms3: gsiftp://testfile
testdms4: gsiftp://testfil1e
testdms4: gsiftp://testfile
3. My code uses C API. One connection per mapping.
4. I can't answer that question right now. Frontend is down and I can't
login as root. Currently it has 1 listening TCP socket and 2 in CLOSE_WAIT
state. I'll try to get more information later while it is running and after
it hangs.
Information about open handles (lsof globus-rls-server would only show process which have globus-rls-server opened) : > lsof -n | grep globus-rl | wc -l 48 Just in case: > lsof -n | grep globus-rl globus-rl 1925 root cwd DIR 3,1 4096 2 / globus-rl 1925 root rtd DIR 3,1 4096 2 / globus-rl 1925 root txt REG 3,1 460313 109577 /opt/globus/bin/globus-rls-server globus-rl 1925 root mem REG 3,1 106400 124513 /lib/ld-2.3.2.so globus-rl 1925 root mem REG 3,1 140923 48506 /opt/globus/lib/libglobus_rls_client_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem DEL 0,4 32769 /SYSV79016d02 globus-rl 1925 root mem REG 3,7 21040 243150 /usr/lib/gconv/gconv-modules.cache globus-rl 1925 root mem REG 3,7 5524 243094 /usr/lib/gconv/ISO8859-1.so globus-rl 1925 root mem REG 3,7 412400 65403 /usr/lib/libodbc.so.1.0.0 globus-rl 1925 root mem REG 3,1 176952 48023 /opt/globus/lib/libglobus_io_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 37756 48003 /opt/globus/lib/libglobus_gss_assist_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 105476 48013 /opt/globus/lib/libglobus_gssapi_gsi_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 54820 47983 /opt/globus/lib/libglobus_gsi_proxy_core_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 56156 47973 /opt/globus/lib/libglobus_gsi_credential_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 36120 47953 /opt/globus/lib/libglobus_gsi_callback_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 33640 48053 /opt/globus/lib/libglobus_oldgaa_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 41096 47993 /opt/globus/lib/libglobus_gsi_sysconfig_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 18636 47963 /opt/globus/lib/libglobus_gsi_cert_utils_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 6284 48073 /opt/globus/lib/libglobus_openssl_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 14572 48083 /opt/globus/lib/libglobus_proxy_ssl_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 13288 48063 /opt/globus/lib/libglobus_openssl_error_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 213576 48211 /opt/globus/lib/libssl_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 920868 47753 /opt/globus/lib/libcrypto_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 174868 47763 /opt/globus/lib/libglobus_common_gcc32dbgpthr.so.0.0.0 globus-rl 1925 root mem REG 3,1 15900 124433 /lib/libdl-2.3.2.so globus-rl 1925 root mem REG 3,1 80592 171082 /lib/tls/libpthread-0.34.so globus-rl 1925 root mem REG 3,1 212020 171080 /lib/tls/libm-2.3.2.so globus-rl 1925 root mem REG 3,1 53568 124555 /lib/libnss_compat-2.3.2.so globus-rl 1925 root mem REG 3,1 91624 124437 /lib/libnsl-2.3.2.so globus-rl 1925 root mem REG 3,7 229032 65423 /usr/lib/libodbcpsql.so.2.0.0 globus-rl 1925 root mem REG 3,1 23688 124431 /lib/libcrypt-2.3.2.so globus-rl 1925 root mem REG 3,1 52492 124443 /lib/libnss_files-2.3.2.so globus-rl 1925 root mem REG 3,1 18460 124441 /lib/libnss_dns-2.3.2.so globus-rl 1925 root mem REG 3,1 76608 124453 /lib/libresolv-2.3.2.so globus-rl 1925 root mem REG 3,1 1539996 171088 /lib/tls/libc-2.3.2.so globus-rl 1925 root 0r CHR 1,3 32636 /dev/null globus-rl 1925 root 1w REG 3,1 60650 48523 /var/log/rls.log globus-rl 1925 root 2w REG 3,1 60650 48523 /var/log/rls.log globus-rl 1925 root 3r FIFO 0,5 12318075 pipe globus-rl 1925 root 4w FIFO 0,5 12318075 pipe globus-rl 1925 root 5u IPv4 12318076 TCP *:39281 (LISTEN) globus-rl 1925 root 6u unix 0xdb7b2080 12318080 socket globus-rl 1925 root 7u IPv4 12318084 TCP 129.240.86.43:43278->129.240.86.18:39281 (CLOSE_WAIT) globus-rl 1925 root 8w CHR 1,3 32636 /dev/null globus-rl 1925 root 9u unix 0xc0fe1580 12319203 socket globus-rl 1925 root 10u unix 0xd1888a80 12319197 socket globus-rl 1925 root 11u unix 0xdfaed580 12319200 socket globus-rl 1925 root 12u IPv4 12712157 TCP 129.240.86.43:39281->129.240.86.18:32773 (CLOSE_WAIT)
I've had this setup running without any problems for a few days now. Do you have any new information to add? It might be a good idea to also attach your code that uses the RLS API.
The code is a bit too complicated to be attached. But I have some advances. My setup was: 1. LRC server 2. My server, which registers periodically lfn-pfn mapping into LRC through RLS API and accepts requests from clients through globus_io. 3. My client which can query both RLS and my server for available information. After some time both LRC server and my server stoped responding to my client. Some additional debug output showed that in both even globus_io accept callback was never reached. So I tried gdb. It showed me that every time setup hanged programs were sitting in some unknow function. And backtrace looked like (for both LRC and my server): - unknow function - globus_io functions - program code which was registered as globus_io accept callback - globus_io functions - program code which calls globus_io So I assumed it must be something like "distributed loop" in globus_io. To break it I rearranged code of my server not to call globus_io operations from inside globus_io callbacks. Now it all works for ~3 days without restart.
I talked to a globus io/xio developer and he says that it should be ok to use globus_io functions in globus_io callbacks, provided you have the libraries built threaded. To build RLS you'd need the threaded flavors of io anyway. If you continue to have problems with your server, you could file this under globus_io, since it doesn't seem to be an RLS problem in particular. If you don't have any RLS problems in the next couple of days, this bug will be closed as RESOLVED-NOTABUG.
Today I found out that rls (lrc) server does not respond. All that time my program was registering same lfn-pfn pair into it approximately every 10 seconds. netstat show 240 connections to rls server with various states: SYN_RECV, CLOSE_WAIT, ESTABLISHED, FIN_WAIT1. I killed my executable but rls server did not recover. gdb shows it's threads hangs in: #1 0x40264f18 in globus_l_thread_pool_thread_start () #1 0x400a4d17 in globus_l_io_poll (user_args=0x0) at globus_io_core.c:3188 #1 0x0805795e in checkidle (a=0x807669c) at server.c:998 #1 0x08057020 in procreq (a=0x0) at server.c:802 #1 0x08057020 in procreq (a=0x0) at server.c:802 #1 0x08057020 in procreq (a=0x0) at server.c:802 #1 0x0805db93 in process_update (vrli=0x80a9080) at update.c:448 #1 0x080542c8 in lrc_llupdates (a=0x807ff7c) at lrc.c:367 #1 0x08053d79 in lrc_bfupdates (a=0x8091ef4) at lrc.c:212 #1 0x08053f41 in lrc_bfiupdates (a=0x8091fa4) at lrc.c:268 #1 0x40264f18 in globus_l_thread_pool_thread_start () #1 0x40252853 in globus_callback_signal_poll () and at lowest frame there is #0 0xffffe002 in ?? () everuthere.
240 connections is on the higher side. I would like to make sure your client is managing connections correctly. To test this out, you could also try (for csh) while (1) globus-rls-cli create <your lfn> <your pfn> rls://<your lrc> sleep 10 end OR watch -n 10 globus-rls-cli create <your lfn> <your pfn> rls://<your lrc> which should achieve what you want to do and check if the lrc still hangs. You should also attach your code so that I can use it in my setup.
We've recently seen such hangs on RH9 systems. A workaround is to set LD_ASSUME_KERNEL to 2.4.1 before starting the server. This is documented at: http://www-unix.globus.org/toolkit/docs/3.2/installation/install_config_rls.html#redhat9 Since this is a pthreads incompatibility it's also applicable to earlier versions of globus.
I've seen the same problem on a debian machine. But it only freezes if a new connection arrives at the same time as one connection is timed out. The followi\ng is a typical sniplet from the logfile: 21:47:06 T49156: checkidle: Timing out connection 817B3C8 21:47:06 T49156: freeconn: Connection 817B3C8 --21:47:06 T49156: freeconn: Before globus_io_close 21:47:06 T16386: listencb: 21:47:06 T16386: newconn: Connection 80EE498 --21:47:06 T16386: newconn: After setting c-parameters where lines with -- prefix are extra debug messages I've put in. It looks as though the call to globus_io_close in freeconn never returns for some reason. Note the server doesn't really hang, but continues to do lrc_bfiupdates and rli_bfiupdates: 21:52:06 T163851: lrc_bfiupdates: 0 21:52:06 T294917: rli_bfiupdates: 21:52:06 T294917: lock_get: 8068960 readlock 21:52:06 T294917: lock_get(8068960): 2 readers 21:52:06 T294917: lock_release(8068960): 1 readers It just doesn't accept new connections.
Yes. The server hangs in globus_io_close. This holds up the other connection management threads. I think Debian also ported the same changes that Redhat did. The same workaround should work for you.
No, it doesn't work. The server hangs no matter if I use the workaround or not. I attach a small program. This hangs the server in less than 30 seconds if idletimeout is set to 5 seconds (which I did for testing).
Created an attachment (id=404) [details]
Small test-program that hangs the server
I ran this test case. For me, one of the threads got stuck in globus_io_close. This held up the other threads and further connections were not accepted. Also, connections previously queued up were also not freed. This doesn't seem to be related to the RH9 issue, although the symptoms are similar (hanging in globus_io_close) Jakob: I saw on the ivdgl list that you have a patch for this. At this point, I'm not sure RLS is the problem here. (We have to investigate why globus_io_close didn't return, first). But, maybe your patch can shed some light. Could you attach that also? Could you also let us know if this test is indicative of your workload?
I agree that it's first and foremost a globus_io problem but it would be nice to have a workaround in the RLS-server. Here is what I get if I build globus_io with -DBUILD_DEBUG and set GLOBUS_IO_DEBUG_LEVEL to 9: 18:23:20 T1101429552: checkidle: Timing out connection 83703F8 globus_io_register_cancel(): entering: handle=0x83703f8, handle->state = 4,fd=32 globus_i_io_unregister_operation(): entering: handle=0x83703f8, handle->state = 4, fd=32 globus_i_io_unregister_operation(): exiting, fd=32 globus_i_io_end_operation(): entering: handle=0x83703f8, handle->state = 4,fd=32 globus_i_io_end_operation(): exiting, fd=32 globus_l_io_select_wakeup(): entering globus_l_io_select_wakeup(): select_active=0 wakeup_pending=0 globus_l_io_select_wakeup(): exiting globus_io_register_cancel(): exiting globus_l_io_kickout_cancel_cb(): exiting, fd=138179344 globus_l_io_handle_events(): entering globus_l_io_handle_events(): Calling select() globus_l_io_handle_events(): select() returned globus_l_io_handle_events(): read, fd=5 globus_i_io_unregister_operation(): entering: handle=0xbffff290, handle->state = 5, fd=5 globus_i_io_unregister_operation(): exiting, fd=5 globus_l_io_handle_events(): exiting globus_l_io_kickout_cancel_cb(): entering, fd=32 globus_l_io_kickout_cancel_cb(): cancel kickout, fd=32 globus_io_register_close(): entering: handle=0x83703f8, handle->state = 4, fd=32 globus_l_io_select_wakeup(): entering globus_l_io_select_wakeup(): select_active=0 wakeup_pending=0 globus_l_io_select_wakeup(): exiting globus_l_io_kickout_cancel_cb(): exiting, fd=32 globus_l_io_kickout_cb(): entering, fd=5 globus_i_io_end_operation(): entering: handle=0xbffff290, handle->state = 5,fd=5 globus_i_io_end_operation(): exiting, fd=5 18:23:20 T1084644144: listencb: 18:23:20 T1084644144: newconn: Connection 83B2698 globus_l_io_handle_events(): entering globus_l_io_handle_events(): Calling select() globus_l_io_handle_events(): select() returned globus_l_io_handle_events(): exiting globus_l_io_handle_events(): entering globus_l_io_handle_events(): Calling select() You can see the checkidle-call and the new connection coming in at the same time. After that it hangs in a select-call inside globus_io waiting for new events to happen in the open file-descriptors. This is not happening because the new connection is hanging waiting for the connlistmtx (that is locked by checkidle) to be unlocked. I don't understand why it hangs here (and why there are two select calls) but the patch that I attach makes a workaround for this. The idea is just to unlock connlistmtx before calling globus_io_close and globus_io_cancel and relock afterwards. It looks like it eliminates this problem and the server is somewhat more stable. It still hangs though but now only approximately 1 time pr. day compared to 3-4 times a day before. I'm still investigating.
Created an attachment (id=405) [details]
Patch for freezing RLS problem
Created an attachment (id=406) [details]
Initial patch
Initial patch to fix the hanging problem, based on suggestions from globus_io
developers. Alternate fix being investigated.
Sorry, it doesn't work. Our patched server eventually hung after approximately a day of running. This time at some point new connections does not seem to get an auth callback: 07:01:04 T262161: listencb: 07:01:04 T262161: newconn: Connection 80A0490 <snip> 07:05:50 T311316: acceptcb: Connection 80A0490 07:05:50 T311316: acceptcb: an end-of-file was reached 07:05:50 T311316: freeconn: Connection 80A0490 07:05:50 T311316: freeconn: Connection 80A0490 closed which is repeated a couple of times. And then: 07:05:50 T16384: dolisten: 07:05:50 T262161: listencb: 07:05:50 T262161: newconn: Connection 80A0490 after which no new connections are accepted. The problem seems to be unrelated to the previous one but the server still hangs and has to be restarted by hand.
This seems like a good time to open a new bug. I've filed this (new) problem as
bug #1819