Bugzilla – Bug 6215
LIGO: hung globus-gridftp-server processes accumulate over time
Last modified: 2008-10-20 10:54:20
Today on two of our production systems I found a number of dead/hanging globus-gridftp-server processes. The output of 'ps auwx' for some of these processes looks like this:

hanrobot 32495 0.0 0.0 6136 3468 ? S Jul08 0:00 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i
hanrobot 32496 0.0 0.0 6136 3472 ? S Jul08 0:00 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i
hanrobot 32497 0.0 0.0 6136 3476 ? S Jul08 0:00 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i
hanrobot 32506 0.0 0.0 6136 3468 ? S Jul08 0:00 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i
hanrobot 32510 0.0 0.0 6136 3472 ? S Jul08 0:00 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i

The problem was that we allow 200 connections (connections_max is set to 200 in gridftp.conf), and at the time most of the 200 slots were filled with these hung processes, so new connections were failing. Some of the hung processes were weeks old.

Do you know what causes these hung processes? Is there anything we can do to prevent them, or make them go away gracefully?
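Until a fix is in place, long-lived server processes like the ones listed above can at least be spotted automatically. The following is a minimal sketch, not something from the Globus distribution: it parses the output of `ps -eo pid,etimes,args` and reports globus-gridftp-server processes older than a chosen age. The function name, the one-day threshold, and the sample ages are assumptions made for illustration, and the `etimes` column (elapsed time in seconds) is only available in newer procps releases.

```python
import subprocess


def find_stale_servers(ps_output, min_age_seconds, command="globus-gridftp-server"):
    """Return (pid, age_seconds) for matching processes at least min_age_seconds old.

    ps_output is expected to have three columns: pid, elapsed seconds, command line.
    """
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)  # pid, elapsed seconds, full command line
        if len(fields) < 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue  # skip the header row and malformed lines
        pid, age, args = int(fields[0]), int(fields[1]), fields[2]
        argv = args.split()
        # Compare only the executable's base name, ignoring its install prefix.
        if argv and argv[0].rsplit("/", 1)[-1] == command and age >= min_age_seconds:
            stale.append((pid, age))
    return stale


# Hypothetical sample input; the PIDs echo the report, the ages are invented.
SAMPLE = """\
  PID ELAPSED COMMAND
32495 1209600 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i
 4101      42 /srv/LDR/globus/sbin/globus-gridftp-server -c /srv/LDR/globus/etc/gridftp.conf -i
  977  999999 /usr/sbin/sshd
"""

if __name__ == "__main__":
    # Assumes a ps that supports the "etimes" output column.
    live = subprocess.run(
        ["ps", "-eo", "pid,etimes,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, age in find_stale_servers(live, min_age_seconds=24 * 3600):
        print(f"pid {pid} has been running for {age} seconds")
```

Age alone does not prove a process is hung, since a legitimate transfer can run for a long time, so the result is better treated as a list of candidates to inspect (for example with gdb) than as a kill list.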
Backtrace from gdb of one of the hung/dead processes:

(gdb) backtrace
#0 0xf7b0c928 in select () from /lib32/libc.so.6
#1 0xf7e085e7 in globus_l_xio_system_poll (user_args=0x0) at globus_xio_system_select.c:2145
#2 0xf7d28d9a in globus_callback_space_poll (timestop=0x804d450, space=-2) at globus_callback_nothreads.c:1430
#3 0x0804c180 in main (argc=4, argv=0xfffd0be4) at globus_gridftp_server.c:1414
A process watchdog timer was added in 4.0.8 that should take care of most instances of this. Here is an update package for earlier releases:
http://www.mcs.anl.gov/~mlink/bugs/globus_gridftp_server-2.8.tar.gz

Mike
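For readers unfamiliar with the term, a watchdog timer in general is a countdown that is reset whenever the process makes progress and that triggers a cleanup action if it ever runs out. The thread does not describe how the 4.0.8 watchdog is implemented, so the sketch below only illustrates the general idea; the class and its `kick`/`stop` methods are hypothetical names, not part of the GridFTP server.

```python
import threading


class Watchdog:
    """Call on_expire if kick() is not called again within timeout seconds."""

    def __init__(self, timeout, on_expire):
        self._timeout = timeout
        self._on_expire = on_expire
        self._timer = None
        self._lock = threading.Lock()

    def kick(self):
        """Record activity: cancel any pending countdown and start a new one."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._timeout, self._on_expire)
            self._timer.daemon = True  # do not keep the process alive on exit
            self._timer.start()

    def stop(self):
        """Disarm the watchdog, for example on a clean shutdown."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

A server built this way would call `kick()` each time a connection shows activity and pass an `on_expire` callback that closes the connection and exits, so an idle or wedged process removes itself instead of occupying a connection slot indefinitely.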
Is the watchdog timer also in 4.2? I am trying to move all of our LDR instances to using GridFTP from GT 4.2.
Not in 4.2.0, but it will be in 4.2.1 (~ early October).
I can verify that on CentOS 5.2, when my client code calls abort() on what appears to be a hung third-party transfer, the forked globus-gridftp-server process owned by the user (non-root) is reaped a few minutes later. Is this the expected behavior?
Yes, that's exactly the behavior. We'll follow up on the cause of that particular hang in a new bug.