Bug 1301 - random errors during ftp tests
Status: RESOLVED WONTFIX
Product: GSI C
Component: Authentication
Version: 2.2
Platform: PC Linux
Importance: P2 normal
Target Milestone: ---
Reported: 2003-10-17 11:51
Modified: 2008-08-11 15:17



Description From 2003-10-17 11:51:51
> I recently moved a lot of files to NCAR's dataportal (a Solaris machine)
> from our Linux machines. For each run, the Globus API is initiated from
> dataportal. There are 1200 files per run, about 40MB each, so there are
> 1200 file requests for each run. Every time the LBNL side receives a
> file request, it checks the proxy. If the proxy validates, a return path
> is given so the NCAR side can initiate a file transfer. I am using my
> proxy on NCAR for all the file transfers. The proxy is valid for several
> days, more than enough to complete a run.
> 
> For most files in a run, things are fine. However, there are always some
> exceptions, and we want to understand why.  Here they are:
> 
> 1: Even though the proxy is the same for all 1200 files in a run, I
> see gss_import_cred() fail a few times on dmx.lbl.gov (a Linux machine,
> gt2.2.4). This behaviour has been consistent across the many runs I tested.
> To be sure, I printed out the proxy on dmx when gss_import_cred()
> failed, and it turned out to be the same as the NCAR proxy I use.
> The failures happen at random places.
> 
> 2: Not all the time, but often, I see this error message once in a while
> during each run:
>    "the server sent an error response: 530 530 Login incorrect"
> (dmx.lbl.gov, gt2.2.4)
> or this:
>     "the server sent an error response: 425 425 Can't open data
> connection" (datagrid.lbl.gov, gt2.2.4)
> 
> The above errors occur at random among the 1200 file transfers. They show
> up in every test, and we have not had a single test that completes all
> 1200 file transfers. We would like to know what can be done to resolve
> them. Your help would be greatly appreciated.
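
To see how often each failure occurs across a run, the client-side messages quoted above could be tallied by FTP reply code (in RFC 959, 530 means "Not logged in" and 425 means "Can't open data connection"). The following is only a minimal sketch: it assumes the client messages are available as one log line each in the format quoted in the report, and the function name `tally_reply_codes` is hypothetical.

```python
import re
from collections import Counter

# Matches the client message format quoted in the report. The reply code is
# printed twice ("530 530 ..."), so only the first occurrence is captured.
ERROR_RE = re.compile(r"the server sent an error response: (\d{3})\b")


def tally_reply_codes(lines):
    """Count FTP reply codes found in an iterable of client log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input built from the two messages in the report (assumed to be
# one message per line in a real client log).
sample = [
    "the server sent an error response: 530 530 Login incorrect",
    "the server sent an error response: 425 425 Can't open data connection",
    "the server sent an error response: 530 530 Login incorrect",
    "transfer complete",
]

if __name__ == "__main__":
    for code, count in sorted(tally_reply_codes(sample).items()):
        print(code, count)
```

Comparing the per-code counts between runs would show whether the login failures and the data-connection failures vary together or independently.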
------- Comment #1 From 2003-10-17 12:12:52 -------
From the system log,

Oct 16 01:05:39 dmx gridftpd[4678]: PASS password
Oct 16 01:05:39 dmx gridftpd[4678]: failed login from dataportal.ucar.edu [128.117.12.2]

I could find the exact same error once in a while, and the connections immediately before and after each failure go through.
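
To check whether the failed logins cluster in time, the syslog could be scanned for the "failed login from" entries and grouped by minute and peer. This is a rough sketch that assumes the syslog line layout shown in the excerpt above, with one entry per line; the function name `failed_logins_per_minute` is hypothetical.

```python
import re
from collections import Counter

# Layout assumed from the excerpt:
#   "Oct 16 01:05:39 dmx gridftpd[4678]: failed login from <peer> [<ip>]"
# The timestamp is captured down to the minute only.
FAILED_LOGIN_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+"
    r"gridftpd\[\d+\]:\s+failed login from\s+(?P<peer>\S+)"
)


def failed_logins_per_minute(lines):
    """Count failed gridftpd logins per (minute, peer host)."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN_RE.match(line)
        if match:
            counts[(match.group("minute"), match.group("peer"))] += 1
    return counts


# The two entries from the excerpt, one per line.
sample = [
    "Oct 16 01:05:39 dmx gridftpd[4678]: PASS password",
    "Oct 16 01:05:39 dmx gridftpd[4678]: failed login from "
    "dataportal.ucar.edu [128.117.12.2]",
]

if __name__ == "__main__":
    for (minute, peer), count in failed_logins_per_minute(sample).items():
        print(minute, peer, count)
```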

------- Comment #2 From 2003-10-20 11:28:50 -------
A few questions:

* Were all the transfers done sequentially?

* Are you pretty much doing the equivalent of a url-copy for each of the 1200
transfers? IE, are you opening a new connection for each transfer?

* The 1200 files are transferred from different linux machines? IE, not one
machine holding all 1200 copies?

* Are all transfers third party?

To me these errors would indicate that some of the machines run out of FDs at
some point, but I would like to set up the exact environment locally so we can
observe what is happening.
------- Comment #3 From 2003-10-20 11:50:37 -------
Junmin will reply to your questions, and I have some comments.
We have 1024 FDs set up on the machine, and she was the only user on the machine. From the sys log, it does not seem that there were too many connections in any given minute, or that too many files were open concurrently. We've had this discussion with Joe about the number of FDs that are opened in a gridftp connection. We also recently updated the Linux kernel and the driver, but the same error happened both before and after the update. I can fetch the relevant portions of the sys log if you want.
------- Comment #4 From 2003-10-20 12:05:43 -------
Sam, here are the answers to your questions:

>* Were all the transfers done sequentially?
They were done in parallel. There are four threads processing the 1200 
ftps, so there are at most 4 transfers at a time. Each transfer uses 2 streams.

>* Are you pretty much doing the equivalent of a url-copy for each of the 1200
transfers? IE, are you opening a new connection for each transfer?
yes. 

>* The 1200 files are transferred from different linux machines? IE, not one
machine holding all 1200 copies?
For each run, all 1200 files are located on one LBNL Linux machine.

>* Are all transfers third party?
No. The targets use the "file:" protocol, so the transfers are not third party.

>To me these errors would indicate that some of the machines run out of FDs at
some point, but I would like to set up the exact environment locally so we can
observe what is happening.

Alex Sim told me the FD limit is 1024 on the machines. 

I am also concerned about why the same valid proxy failed validation during the runs. 
This cannot be FD related, can it?

- Junmin
------- Comment #5 From 2003-10-20 12:15:56 -------
>>* Were all the transfers done sequentially?
>They were done in parallel. There are four threads processing the 1200 
>ftps, so there are at most 4 transfers at a time. Each transfer uses 2 
>streams.

What we had was 4 concurrent gridftp connections, each with 2 parallel streams. 
What appears on the receiving end is sequential calls, one after another.
However, we force a maximum of 4 concurrent gridftp connections
at any given time.
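
For reproducing the setup locally, the client pattern described in this thread (1200 transfers handled by four threads, with a new connection per transfer) could be approximated as below. This is only a sketch: `fake_transfer` is a hypothetical stand-in for the real per-file gridftp call, its single simulated failure is made up for illustration, and the file names are invented.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONNECTIONS = 4  # concurrent gridftp connections, per the thread
NUM_FILES = 1200     # files per run, per the original report


class ConcurrencyProbe:
    """Context manager that records the peak number of active transfers."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)

    def __exit__(self, *exc_info):
        with self._lock:
            self.active -= 1


def run_transfers(transfer, names, max_connections=MAX_CONNECTIONS):
    """Run transfer(name) for every name; return (peak, failures)."""
    probe = ConcurrencyProbe()
    failures = []

    def worker(name):
        with probe:
            try:
                transfer(name)
            except Exception as err:
                # Record the failure and keep going, so one bad transfer
                # does not abort the rest of the run.
                failures.append((name, str(err)))

    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        list(pool.map(worker, names))
    return probe.peak, failures


def fake_transfer(name):
    """Hypothetical stand-in: fails for one made-up file name."""
    if name == "file0600.dat":
        raise RuntimeError("530 Login incorrect")


if __name__ == "__main__":
    names = [f"file{i:04d}.dat" for i in range(NUM_FILES)]
    peak, failures = run_transfers(fake_transfer, names)
    print(f"peak concurrency: {peak}, failures: {failures}")
```

Recording the failing file names together with the observed peak concurrency makes it easier to line up client-side failures with the server's syslog entries.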