Bug 6215 - LIGO: hung globus-gridftp-server processes accumulate over time
Status: RESOLVED FIXED
Product: GridFTP
Component: GridFTP
Version: 4.0.7
Platform: All All
Importance: P3 normal
Reported: 2008-07-10 10:24 by
Modified: 2008-10-20 10:54


Description From 2008-07-10 10:24:47
Today on two of our production systems I found a number of
dead/hanging globus-gridftp-server processes. The output of
'ps auwx' for some of these processes looks like this:

hanrobot 32495  0.0  0.0   6136  3468 ?        S    Jul08 0:00 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i
hanrobot 32496  0.0  0.0   6136  3472 ?        S    Jul08 0:00 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i
hanrobot 32497  0.0  0.0   6136  3476 ?        S    Jul08 0:00 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i
hanrobot 32506  0.0  0.0   6136  3468 ?        S    Jul08 0:00 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i
hanrobot 32510  0.0  0.0   6136  3472 ?        S    Jul08 0:00 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i

The problem is that we allow 200 connections (connections_max
is set to 200 in gridftp.conf), and at the time most of the 200
slots were filled with these hung processes, so new
connections were failing.

Some of the hung processes were weeks old.

Do you know what causes these hung processes? Is there
anything we can do to prevent them, or make them go away
gracefully?
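
One way to spot processes like these before they fill the connection limit is to parse ps output and flag old server processes. The following is a minimal sketch, not part of Globus: it assumes a procps version whose ps supports the etimes (elapsed seconds) output field, and the one-day threshold and the helper names find_stale and list_processes are hypothetical.

```python
import subprocess

SERVER_NAME = "globus-gridftp-server"   # process name from the ps listing above
MAX_AGE_SECONDS = 24 * 60 * 60          # hypothetical threshold: one day


def find_stale(ps_output, max_age=MAX_AGE_SECONDS, name=SERVER_NAME):
    """Return (pid, age_seconds) for matching processes older than max_age.

    ps_output is the text produced by `ps -e -o pid= -o etimes= -o args=`.
    """
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, age, args = fields
        # Skip header lines or anything else that is not "pid seconds command".
        if not (pid.isdigit() and age.isdigit()):
            continue
        # Match on the executable path only, not on its arguments.
        executable = args.split()[0]
        if executable.endswith(name) and int(age) > max_age:
            stale.append((int(pid), int(age)))
    return stale


def list_processes():
    """Run ps; assumes procps with the `etimes` output field."""
    return subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
```

Calling find_stale(list_processes()) from a cron job and comparing the number of results against connections_max would give early warning. Whether it is safe to kill the reported processes depends on the site, since a long-running transfer can legitimately be old.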
------- Comment #1 From 2008-07-10 10:25:36 -------
Backtrace from gdb of one of the hung/dead processes:

(gdb) backtrace
#0  0xf7b0c928 in select () from /lib32/libc.so.6
#1  0xf7e085e7 in globus_l_xio_system_poll (user_args=0x0)
    at globus_xio_system_select.c:2145
#2  0xf7d28d9a in globus_callback_space_poll (timestop=0x804d450, space=-2)
    at globus_callback_nothreads.c:1430
#3  0x0804c180 in main (argc=4, argv=0xfffd0be4)
    at globus_gridftp_server.c:1414
------- Comment #2 From 2008-09-11 14:54:15 -------
A process watchdog timer was added in 4.0.8 that should take care of most
instances of this.  Here is an update package for earlier releases.

http://www.mcs.anl.gov/~mlink/bugs/globus_gridftp_server-2.8.tar.gz

Mike
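
The thread does not describe how the 4.0.8 watchdog is implemented. Purely as an illustration of the general technique, a watchdog timer can be sketched as a timer that is re-armed on every sign of activity and runs a handler when no activity arrives in time. The class name, method names, and handler below are hypothetical and are not Globus code.

```python
import threading


class Watchdog:
    """Call on_expire() if kick() is not called within timeout seconds."""

    def __init__(self, timeout, on_expire):
        self._timeout = timeout
        self._on_expire = on_expire
        self._lock = threading.Lock()
        self._timer = None

    def start(self):
        """Arm the watchdog for the first time."""
        self.kick()

    def kick(self):
        """Record activity: cancel the pending timer and arm a new one."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._timeout, self._on_expire)
            self._timer.daemon = True
            self._timer.start()

    def stop(self):
        """Disarm the watchdog without running the handler."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

A server process built this way would call kick() whenever it reads a command or moves data, and pass a handler that logs and exits, so that an idle, hung process ends itself instead of holding a connection slot.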
------- Comment #3 From 2008-09-11 15:00:10 -------
Is the watchdog timer also in 4.2? I am trying to move all of our LDR instances
to using GridFTP from GT 4.2.
------- Comment #4 From 2008-09-11 15:02:24 -------
Not in 4.2.0, but it will be in 4.2.1 (around early October).
------- Comment #5 From 2008-10-14 17:13:02 -------
I can verify that on CentOS 5.2, when my client code calls abort() on what
appears to be a hung third-party transfer, the forked globus-gridftp-server
process owned by the (non-root) user is reaped a few minutes later.

Is this the expected behavior?
------- Comment #6 From 2008-10-20 10:54:20 -------
Yes, that's exactly the expected behavior.  We'll follow up on the cause of that
particular hang in a new bug.