Bugzilla – Bug 3568
list function terminates after 200 files
Last modified: 2005-07-28 11:25:52
This error needs to be completed by several people:

a) Kaizar must post a link to his program to test this
b) Mike H. must complete the test for just jglobus
c) we need exact names and versions of the gridftp servers with port numbers
d) the test needs to be rerun on gridftp servers that Bill specifies to us

The reason for uploading the error in this incomplete state is to ensure that others are aware that we are evaluating a performance issue and that this work is completed prior to the 4.0.1 release. I will classify this error as critical and not as a blocker.

===============

First analysis based on the software included in Java CoG Kit 4. The information I got is somewhat incomplete: what are the server version, port number, and machine, so that we can replicate this in other frameworks and rule out that it is a Java CoG Kit issue?

The gridftp server does not return files more than 300 with mlsd. With 200 files it returns fine. But with 300 files it gives a "wait timeout". Hence it fails for some number between 200 and 300 files. I see no way of going to 5000 files ;)

The test output for 1, 101, and 201 files, respectively, is as follows:

# of Files   Time in Secs
==========   ============
1            3.254
101          8.059
201          15.587

Error after 200 files is as follows:

DEBUG [org.globus.cog.abstraction.examples.execution.Test] - Status of ListTask :Failed
Job failed: org.globus.cog.abstraction.impl.file.GeneralException: Could not get list of files in /home/amin/gridftp-test/from server
    at org.globus.cog.abstraction.impl.file.gridftp.FileResourceImpl.list(FileResourceImpl.java:96)
    at org.globus.cog.abstraction.impl.file.TaskHandlerImpl.execute(TaskHandlerImpl.java:247)
    at org.globus.cog.abstraction.impl.file.TaskHandlerImpl.submit(TaskHandlerImpl.java:221)
    at org.globus.cog.abstraction.impl.file.TaskHandlerImpl.submit(TaskHandlerImpl.java:134)
    at org.globus.cog.abstraction.impl.common.task.FileOperationTaskHandler.submit(FileOperationTaskHandler.java:48)
    at org.globus.cog.abstraction.impl.common.task.GenericTaskHandler.submit(GenericTaskHandler.java:51)
    at org.globus.cog.abstraction.impl.common.taskgraph.TaskGraphHandlerImpl.submitExecutableObject(TaskGraphHandlerImpl.java:177)
    at org.globus.cog.abstraction.impl.common.taskgraph.TaskGraphHandlerImpl.handleDependents(TaskGraphHandlerImpl.java:517)
    at org.globus.cog.abstraction.impl.common.taskgraph.TaskGraphHandlerImpl.statusChanged(TaskGraphHandlerImpl.java:131)
    at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:193)
    at org.globus.cog.abstraction.impl.common.task.TaskImpl.setStatus(TaskImpl.java:201)
    at org.globus.cog.abstraction.impl.execution.gt2.JobSubmissionTaskHandler.statusChanged(JobSubmissionTaskHandler.java:309)
    at org.globus.gram.GramJob.setStatus(GramJob.java:179)
    at org.globus.gram.GramCallbackHandler.run(CallbackHandler.java:171)
    at java.lang.Thread.run(Thread.java:534)
Caused by: org.globus.ftp.exception.ServerException: Reply wait timeout. (error code 4)
    at org.globus.ftp.vanilla.FTPControlChannel.waitFor(FTPControlChannel.java:213)
    at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:109)
bash-2.05b$
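
As a rough aid for reading the three timings reported above, the following sketch computes the average extra time per additional file between consecutive measurements. The class and method names are made up for illustration and are not part of the Java CoG Kit; the only data used are the three measurements from this report.

```java
public class ListTimings {
    // Measurements quoted in this report: number of files and elapsed seconds.
    static final int[] FILES = {1, 101, 201};
    static final double[] SECONDS = {3.254, 8.059, 15.587};

    // Average extra seconds per additional file between measurement i-1 and i.
    static double incrementalCost(int i) {
        return (SECONDS[i] - SECONDS[i - 1]) / (FILES[i] - FILES[i - 1]);
    }

    public static void main(String[] args) {
        for (int i = 1; i < FILES.length; i++) {
            System.out.printf("%d -> %d files: %.5f s per additional file%n",
                    FILES[i - 1], FILES[i], incrementalCost(i));
        }
    }
}
```

This gives about 0.048 s per additional file between 1 and 101 files and about 0.075 s between 101 and 201 files, so the listing time appears to grow faster than linearly over this range. Three data points are not enough to extrapolate to larger directories.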
Another report in addition to 3543? I would suggest that any tests use servers from the current trunk or globus_4_0_branch (2.20 and 2.1, respectively), as they contain performance increases as noted in 3543. Also, please be careful with statements such as: "The gridftp server does not return files more than 300 with mlsd. With 200 files it returns fine. But with 300 files it gives a "wait timeout". Hence it fails for some number between 200 and 300 files." Any timeout or failure to wait for results is squarely a client issue... the only thing the server is capable of failing to do in this case is return results within the client's timeout period.
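
To illustrate the point that the timeout is decided on the client side, here is a minimal, self-contained sketch of a client that polls for a reply and gives up after a maximum wait. It is not jglobus code: `ReplyWait`, `waitFor`, and `replyAfter` are hypothetical names, and the "server" is only simulated by a timer.

```java
import java.util.function.BooleanSupplier;

public class ReplyWait {
    // Polls replyReady every delayMs milliseconds until it reports true or
    // maxWaitMs has elapsed. Returns true if the reply arrived in time.
    static boolean waitFor(BooleanSupplier replyReady, long maxWaitMs, long delayMs) {
        long deadline = System.nanoTime() + maxWaitMs * 1_000_000L;
        while (!replyReady.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false; // the client gives up; the server may still be working
            }
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    // Simulated server whose reply becomes available ms milliseconds from now.
    static BooleanSupplier replyAfter(long ms) {
        long readyAt = System.nanoTime() + ms * 1_000_000L;
        return () -> System.nanoTime() >= readyAt;
    }

    public static void main(String[] args) {
        // Same simulated server speed, different client patience.
        System.out.println("maxWait  100 ms: " + waitFor(replyAfter(300), 100, 10));
        System.out.println("maxWait 1000 ms: " + waitFor(replyAfter(300), 1000, 10));
    }
}
```

With the same simulated reply time, only the client's maximum wait determines whether the call is reported as a timeout, which is why a "Reply wait timeout" by itself does not show that the server failed to produce the listing.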
> I would suggest that any tests use servers
> from the current trunk or globus_4_0_branch (2.20 and 2.1, respectively), as
> they contain performance increases as noted in 3543.

Are any such servers running anywhere, or would we have to compile our own?
The test should use the same machine(s) you initially noticed the problem on, so I say compile your own. I have a server set up on pitcairn.mcs.anl.gov:9000 if you want to use that (note that the machine still has a host cert for wiggum.mcs.anl.gov so you'll have to change the expected subject to use that one). I'll try to also get one set up on wiggum.mcs.anl.gov:9000.
In reply to comment #1 from Mike: we should make sure that the documentation explicitly mentions this and shows an example of how to modify the values. E.g. users will take the default configuration and measure this. Is this already in the documentation?
The server is up at wiggum.mcs.anl.gov:9000. This is what will eventually be released in 4.0.1 assuming no more changes.
Verified ok per email thread.