Bug 2662 - ws-gram service leaves hanging file transfers
Status: RESOLVED FIXED
Product: RFT
Component: RFT
Version: development
Hardware: All
OS: All
Importance: P3 normal
Target Milestone: ---

Reported: 2005-01-31 10:59
Modified: 2005-02-28 15:55


Attachments
container output during hanging behavior (5.08 KB, text/plain) - 2005-01-31 12:51, Jaime Frey
debug log during provoked behavior (35.88 KB, text/plain) - 2005-01-31 18:18, Jaime Frey
debug log with errors (44.81 KB, text/plain) - 2005-02-14 13:01, Jaime Frey
error with updated gridftp server (16.02 KB, text/plain) - 2005-02-15 11:34, Jaime Frey
client gridftp server logfile (3.39 KB, text/plain) - 2005-02-16 15:22, Jaime Frey
server gridftp server logfile (3.66 KB, text/plain) - 2005-02-16 15:22, Jaime Frey
client gridftp server logfile (25.97 KB, text/plain) - 2005-02-16 17:22, Jaime Frey
server gridftp server logfile (19.52 KB, text/plain) - 2005-02-16 17:23, Jaime Frey
result of kill -QUIT on container after hung transfers timed out (26.03 KB, text/plain) - 2005-02-16 17:28, Jaime Frey
patch to clean up connections (3.77 KB, patch) - 2005-02-17 12:26, Ravi Madduri



Description From 2005-01-31 10:59:41
I'm seeing file transfers that hang indefinitely after the gram job that spawned them is finished and destroyed. That is, I'm seeing extra gridftp server processes on both source and destination machines. When these appear, I can't shut down the GT4 container gracefully (with globus-stop-container), but if I kill it, the hung transfers disappear. Here's a short job that produces the problem:

<job>
  <executable>/tmp/foobar/date</executable>
  <directory>/tmp/foobar/</directory>
  <fileStageIn>
    <transfer>
      <sourceUrl>gsiftp://balthazar.cs.wisc.edu/tmp/smell/</sourceUrl>
      <destinationUrl>file:///tmp/foobar/</destinationUrl>
    </transfer>
    <transfer>
      <sourceUrl>gsiftp://balthazar.cs.wisc.edu/bin/date</sourceUrl>
      <destinationUrl>file:///tmp/foobar/date</destinationUrl>
    </transfer>
  </fileStageIn>
</job>

/tmp/smell is an empty directory on balthazar (the machine I'm submitting from).
This produces hangs when run with both managed-job-globusrun and globusrun-ws.
I'm using code I checked out of CVS on Friday, January 28. Both source and
destination machines are x86 running Tao Linux.
------- Comment #1 From 2005-01-31 11:04:35 -------
Another detail that may be relevant to reproducing this behavior: I'm running
everything non-root. The servers are using host certificates owned by the uid
they're running under.
------- Comment #2 From 2005-01-31 11:15:22 -------
GRAM destroys all RFT resources, so this has to be an RFT issue.
------- Comment #3 From 2005-01-31 12:23:51 -------
I just tested RFT from trunk using the following transfers in a transfer request:
#Source/Dest URL Pairs
gsiftp://ned-0.isi.edu:5678/tmp/small/
gsiftp://ned-0.isi.edu:5678/tmp/foobar/
gsiftp://ned-0.isi.edu:5678/bin/date
gsiftp://ned-0.isi.edu:5678/tmp/foobar/date

I see no hang or any error, for that matter. An empty dir /tmp/foobar is created and /bin/date is copied into it by the next transfer. And I can see the resource being destroyed, which closed all the gridftp connections. I was also able to kill the container w/o any problem.
Can I see logs from the container when you see this behaviour?
------- Comment #4 From 2005-01-31 12:51:09 -------
Created an attachment (id=496)
container output during hanging behavior

Here is the output from the container during the job submission that provokes
the hanging behavior.
------- Comment #5 From 2005-01-31 14:35:51 -------
From the logs I see that the transfers were completed successfully and destroy has been called. I don't see any exceptions/problems in the log. Can you increase the logging level to DEBUG on the org.globus.transfer package and send me the log?
------- Comment #6 From 2005-01-31 18:18:33 -------
Created an attachment (id=497)
debug log during provoked behavior

Here is the container output with DEBUG for org.globus.transfer.
------- Comment #7 From 2005-01-31 18:43:58 -------
From the latest log, I don't really see anything that would hint at any sort of problem from the RFT side of things. This piece of log
<snip>
2005-01-31 18:13:33,383 DEBUG service.ReliableFileTransferResource [Thread-5,remove:701] Removing the resource with id: 1
2005-01-31 18:13:33,384 DEBUG database.RFTDatabaseSetup [Thread-5,getDBConnection:201] Added 1 active 1 idle 2
2005-01-31 18:13:33,386 DEBUG database.RFTDatabaseSetup [Thread-5,returnDBConnection:258] Reduced 0 active 0 idle 3
2005-01-31 18:13:33,387 DEBUG service.ReliableFileTransferResource [Thread-5,remove:706] Proxy file : /tmp/x509up_11631.tmp
2005-01-31 18:13:33,388 DEBUG service.ReliableFileTransferResource [Thread-5,remove:709] Deleting proxy file
2005-01-31 18:13:33,390 DEBUG service.ReliableFileTransferResource [Thread-5,remove:717] deleting subscriptions file
2005-01-31 18:13:33,394 DEBUG database.RFTDatabaseSetup [Thread-5,getDBConnection:201] Added 1 active 1 idle 2
2005-01-31 18:13:33,400 DEBUG database.RFTDatabaseSetup [Thread-5,returnDBConnection:258] Reduced 0 active 0 idle 3
2005-01-31 18:13:33,401 DEBUG service.TransferWork [Thread-5,closeTransferClients:247] Closing transfer clients for resource 1
</snip> 
shows everything is cool with respect to RFT.
------- Comment #8 From 2005-02-07 17:45:31 -------
I don't see any RFT problems here.
Jaime, can you reproduce this error consistently?
------- Comment #9 From 2005-02-08 10:20:19 -------
Yes. It doesn't happen every time, but most times. I'm trying to test it with the latest code out of CVS, but I'm running into compilation errors.
------- Comment #10 From 2005-02-14 13:00:50 -------
I just tested against code checked out of CVS this past Thursday (Feb 10) at
around 1pm CST, and the problem is still there. Again, it doesn't happen every
time, but most times with the gram rsl in the ticket. Now, I'm seeing errors in
the container output that look related to the problem. I will attach it.
------- Comment #11 From 2005-02-14 13:01:51 -------
Created an attachment (id=511)
debug log with errors
------- Comment #12 From 2005-02-15 08:57:48 -------
Mike fixed a bug in the new server which was causing the server to seg fault when setting the TCP buffer size (Bug 2687). Can you try your job against the fixed server and let me know if you see this error again?
------- Comment #13 From 2005-02-15 10:22:28 -------
I updated and recompiled the gridftp-server code on both ends and get the same behavior, with what appears to be the exact same NPE.
------- Comment #14 From 2005-02-15 10:33:16 -------
Can you send me the NPE exception?
------- Comment #15 From 2005-02-15 11:34:49 -------
Created an attachment (id=512)
error with updated gridftp server
------- Comment #16 From 2005-02-15 11:35:25 -------
I just attached the NullPointerException error in the container log, with gridftp updated on both ends.
------- Comment #17 From 2005-02-15 11:49:29 -------
That is a different error. There is a bug (bug 2724) filed for it already. This error should not cause the conditions you described before. The stack trace you reported above is produced when usage stats are being sent at the end of the resource's lifetime.
------- Comment #18 From 2005-02-15 14:12:54 -------
The behavior is still occurring, whether or not the exception is related.
------- Comment #19 From 2005-02-15 14:59:52 -------
Making this a blocker for 3.9.5 as several projects are waiting for Condor-G to
start porting their stuff to GT4.
------- Comment #20 From 2005-02-15 16:05:56 -------
I just tried the same RSL with code from rc1 and the gridftp server from rc1 many times, but I could not reproduce the problem.
------- Comment #21 From 2005-02-15 16:51:32 -------
Jaime,

I have tried to reproduce this error, but have not been able to. I did 50
successful jobs:

[rynge@devrandom 2662]$ cat job.xml
<job>
  <executable>/tmp/foobar/date</executable>
  <directory>/tmp/foobar/</directory>
  <fileStageIn>
    <transfer>
      <sourceUrl>gsiftp://devrandom.isi.edu:9001/tmp/smell/</sourceUrl>
      <destinationUrl>file:///tmp/foobar/</destinationUrl>
    </transfer>
    <transfer>
      <sourceUrl>gsiftp://devrandom.isi.edu:9001/bin/date</sourceUrl>
      <destinationUrl>file:///tmp/foobar/date</destinationUrl>
    </transfer>
  </fileStageIn>
</job>
[rynge@devrandom 2662]$ for I in `seq 1 50`; do echo $I; ssh dc-user2.isi.edu rm -rf /tmp/foobar; globusrun-ws -submit -S -F dc-user2.isi.edu:9000 -f job.xml; done


This was with a CVS checkout from 2/13, and with full security, running gridftp
as root from inetd and the container as the 'globus' user.


I did suspect a race condition between the transfers (the second one depends on the directory from the first one), but Ravi has assured me that the transfers are sequential.


Can you give me the options you start the gridftp servers with, and the full command line of your globusrun-ws invocation? I don't think it has to do with security, but I'm running out of ideas on how to reproduce this problem.
------- Comment #22 From 2005-02-15 17:20:00 -------
The source gridftp server is non-root from the command-line. The web services
container is non-root. The destination gridftp server is root under inetd, but
the client credential maps to the container's uid.
------- Comment #23 From 2005-02-15 18:49:37 -------
Hmm, looks similar enough.

What about the globusrun-ws line? What command line arguments do you use for that?
------- Comment #24 From 2005-02-16 00:43:01 -------
I've been using managed-job-globusrun -factory <hostname> -file <rsl file>
I haven't tried globusrun-ws recently, but when I did several weeks ago, it
showed the same behavior.

I start the source gridftp server (non-root from the command line) with -p 2811.
I start the destination gridftp server (root under inetd) with -i -l -a.
------- Comment #25 From 2005-02-16 09:24:02 -------
Can you try using globusrun-ws now and let me know what happens?
------- Comment #26 From 2005-02-16 12:03:47 -------
Can you also telnet to the ports your latest gridftp servers are running on and send me the output?
------- Comment #27 From 2005-02-16 13:14:52 -------
Client gridftp server:
220 GridFTP Server balthazar.cs.wisc.edu 0.16 (gcc32dbg, 1108067214-1) ready. ** Development Release **

Server gridftp server:
220 GridFTP Server nostos.cs.wisc.edu 0.16 (gcc32dbg, 1108067214-1) ready. ** Development Release **
------- Comment #28 From 2005-02-16 13:23:43 -------
Added mike to cc list.
------- Comment #29 From 2005-02-16 13:49:31 -------
I get hanging transfers when source and destination are the same machine, with
the gridftp server running under inetd.

I retested globusrun-ws, and still get hangs:
globusrun-ws -submit -factory nostos -job-description-file job4h.xml
-staging-delegate
------- Comment #30 From 2005-02-16 14:09:16 -------
Can you start the server with options '-log-level ALL -l /tmp/logfile' and show that log after a hang?

(Note: your '-i -l -a' arg line is just logging to a file named -a, probably in /. There is no -a argument to globus-gridftp-server.)
------- Comment #31 From 2005-02-16 15:22:02 -------
The '-l -a' is a holdover from the much older gridftp server. I will attach the
gridftp logs generated with '-log-level ALL -l /tmp/logfile'.
------- Comment #32 From 2005-02-16 15:22:38 -------
Created an attachment (id=513)
client gridftp server logfile
------- Comment #33 From 2005-02-16 15:22:58 -------
Created an attachment (id=514)
server gridftp server logfile
------- Comment #34 From 2005-02-16 15:30:22 -------
Based on those logs, both servers think they have exited normally. Are the server processes still hanging around?
------- Comment #35 From 2005-02-16 16:46:55 -------
Looks like the transfers are done. I don't understand where the hang is. Can you increase the logging level to DEBUG (org.globus.transfer=DEBUG, org.globus.ftp=DEBUG) and send me the log, please? Also, if the container appears to be hanging, try kill -QUIT <server-jvm-process>, which will dump all the Java threads to the log.
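
For reference, a hedged sketch of where those levels would likely be set, assuming the GT4 container reads $GLOBUS_LOCATION/container-log4j.properties and log4j 1.x category syntax (both assumptions, not confirmed in this thread):

    # assumed file: $GLOBUS_LOCATION/container-log4j.properties
    log4j.category.org.globus.transfer=DEBUG
    log4j.category.org.globus.ftp=DEBUG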
------- Comment #36 From 2005-02-16 17:20:49 -------
Now, the hung transfers are timing out and exiting 19 1/2 minutes afterwards. No information on the hung transfers goes into the logfile until then. I'll post new logfiles. For the transfers started at 15:51, I disabled logging on the server side to see if that affected the hanging behavior.
------- Comment #37 From 2005-02-16 17:22:37 -------
Created an attachment (id=515)
client gridftp server logfile
------- Comment #38 From 2005-02-16 17:23:08 -------
Created an attachment (id=516)
server gridftp server logfile
------- Comment #39 From 2005-02-16 17:28:28 -------
Created an attachment (id=517)
result of kill -QUIT on container after hung transfers timed out

This is the result of a kill -QUIT on the container after several hung transfers timed out. I tried globus-stop-container beforehand and it didn't shut down.
------- Comment #40 From 2005-02-16 18:16:53 -------
That is normal behavior for the server. There is a default idle-timeout of 10 minutes, which (due to the way timeouts currently work) actually causes the timeout to happen between 10 and 20 minutes.
I believe it is also normal behavior for RFT to keep the connection open for the resource lifetime (or something). Possibly RFT is not handling the 421 timeout and the server's closing of the connection well?
------- Comment #41 From 2005-02-16 18:28:02 -------
Is the ManagedExecutableJobService resource not destroying the RFT resources
then? The ManagedExecutableJobService resource is destroyed long before the
connections time out (at least, the client says it's destroyed).

So the long-term hangs may have been solved for a while, and I just haven't
waited more than 20 minutes in my recent tests.

Will these connections be reused by other RFT resources? If not, I wonder if
this could become a scalability issue with large numbers of jobs with small
input data.
------- Comment #42 From 2005-02-16 18:42:47 -------
If I remove the second transfer and switch the executable to /bin/date, no
gridftp connections are left open after the job is destroyed.
------- Comment #43 From 2005-02-16 20:14:40 -------
RFT does not reuse connections across multiple resources, so when GRAM destroys the RFT resource the connections should be closed. Can you increase the container log levels of org.globus.ftp and org.globus.transfer to DEBUG and attach the logs? And are you still using m-j-g for your jobs?
------- Comment #44 From 2005-02-17 00:29:37 -------
I'm about to disappear on vacation for a week and a half. I don't know if one of the other Condor staff will be able to help with further investigation.
------- Comment #45 From 2005-02-17 01:23:59 -------
Ravi, I think this function has a couple of problems:

    public synchronized static void closeTransferClients(int requestId) {
        logger.debug("Closing transfer clients for resource " + requestId);
        for (int i = 0; i < transferClients.size(); i++) {
            TransferClient closeMe = (TransferClient) transferClients
                    .elementAt(i);
            if (closeMe.getRequestId() == requestId) {
                transferClients.remove(i);
                closeMe.close();
            }
        }
        for (int i = 0; i < deleteClients.size(); i++) {
            DeleteClient closeMe = (DeleteClient) deleteClients.elementAt(i);
            if (closeMe.getRequestId() == requestId) {
                closeMe.close();
            }
        }
    }

1) Iterating over the Vector and calling .remove(i) within the loop might skip some valid TransferClient instances (therefore some connections might not be closed?). Calling remove(closeMe) should fix it.

2) DeleteClients are not removed from their Vector the way TransferClients are; that probably leads to a leak.
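
To illustrate problem 1, a minimal standalone sketch (hypothetical class and element names, not RFT code) of how remove(i) in a forward loop skips the element that shifts into slot i:

    import java.util.Vector;

    public class RemoveSkipDemo {
        public static void main(String[] args) {
            Vector clients = new Vector();
            clients.add("match-1");
            clients.add("match-2");
            clients.add("other");
            // Suppose both "match-*" entries belong to the request being closed.
            for (int i = 0; i < clients.size(); i++) {
                if (((String) clients.elementAt(i)).startsWith("match")) {
                    clients.remove(i); // "match-2" shifts into slot i; i++ then skips it
                }
            }
            System.out.println(clients); // prints [match-2, other]: one client never closed
        }
    }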
------- Comment #46 From 2005-02-17 01:34:28 -------
Jaime,

I would be interested in getting a Condor-G install here at ISI for further
testing. Do you have anything packaged up or otherwise easily installable to
share for this purpose?
------- Comment #47 From 2005-02-17 06:36:33 -------
I've placed a tarball of condor-g gt4 binaries at
ftp://ftp.cs.wisc.edu/condor/temporary/forstu/2005-02-17/condor-gt4.tar.gz
The included README should explain how to set it up. I compiled them on my tao
linux 1.0 system. If there's an incompatibility with your machine, hopefully
another Condor person can compile a different set.
------- Comment #48 From 2005-02-17 09:52:43 -------
I will make those changes and we can try again.
------- Comment #49 From 2005-02-17 10:19:52 -------
My suggestion in #45 for problem 1 has the same issue as the original code. We will need to reset the index after removing the TransferClient.
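
A minimal sketch of a corrected loop along these lines (the actual fix is in the patch attached below, which may differ): back the index up after each removal so the shifted element is revisited, and remove DeleteClients from their Vector as well.

    public synchronized static void closeTransferClients(int requestId) {
        logger.debug("Closing transfer clients for resource " + requestId);
        for (int i = 0; i < transferClients.size(); i++) {
            TransferClient closeMe = (TransferClient) transferClients.elementAt(i);
            if (closeMe.getRequestId() == requestId) {
                transferClients.remove(i--); // back up so the shifted element is checked
                closeMe.close();
            }
        }
        for (int i = 0; i < deleteClients.size(); i++) {
            DeleteClient closeMe = (DeleteClient) deleteClients.elementAt(i);
            if (closeMe.getRequestId() == requestId) {
                deleteClients.remove(i--); // also remove here, fixing the leak noted in #45
                closeMe.close();
            }
        }
    }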
------- Comment #50 From 2005-02-17 12:26:33 -------
Created an attachment (id=518)
patch to clean up connections

Please find attached a patch that should fix the connection closing. I was able to stop the container using globus-stop-container. Let me know if it works for you.
------- Comment #51 From 2005-02-28 15:25:43 -------
Hi. I'm back from vacation. I just applied the patch and retested, and the
hanging transfers have disappeared.