Bug 5590 - gridftp2 server can send truncated control channel messages
Status: RESOLVED FIXED
Product: GridFTP
Component: GridFTP
Version: 4.0.5
Platform: All All
Importance: P3 critical
Target Milestone: 4.0.6
Reported: 2007-10-01 13:48 by
Modified: 2008-01-17 00:24 (History)

Description From 2007-10-01 13:48:49
Hello,

I'm the maintainer of the VDT and one of our customers (LCG) recently let us
know about a problem that they consider to be critical. 

The problem began with a user reporting the following GridFTP error for a
failed transfer:

Reason: Transfer failed. ERROR globus_l_ftp_control_read_cb: Error while
searching for end of reply

You can see details of the debugging process they went through here:
https://gus.fzk.de/pages/ticket_details.php?ticket=26868

Not everything appears to be available there. They sent me a bug report today
that suggests that a particular line of code is incorrect. This line of code
exists in Globus 4.0.5. I'll let them describe the problem in their own words. 

> Hello,
>
> We recently had a problem reported by a site using a 64bit, SL4
> release of DPM (and hence the gridftp2 server) that was traced to an
> internal globus/gridftp2 lib issue:
>
> from VDT RPM vdt_globus_data_server-VDT1.6.0x86_64_rhas_4-3 in
> library(s) libglobus_gridftp_server_control_*
>
> function globus_l_xio_gssapi_ftp_write(), from source file
> globus_xio_gssapi_ftp.c at approximately line 2590:
>
> res = globus_xio_driver_pass_write(
>     op,
>     &handle->auth_write_iov,
>     1,
>     length,
>     cb,
>     handle);
>
> Here the argument 'length' refers to the length of the unwrapped
> (unencrypted) and binary (not base64) version of the message (whereas
> handle->auth_write_iov.iov_len has the actual length); and thus
> 'length' is about 75% of the actual length of the message that the
> function wants to send. The fourth argument to
> globus_xio_driver_pass_write() is the 'wait_for' parameter and sets
> the minimum amount of data to have sent to the kernel before the
> callback is called. Thus if the kernel accepts more than about 75% of
> the length of the message, but less than all of it, the message is
> truncated.
>
> Browsing the globus CVS I noticed that recent versions/branches of
> globus_xio_gssapi_ftp.c have been reworked and probably do not suffer
> from this problem - although I don't know if exactly this problem was
> ever recognised explicitly. I don't know if there are any
> globus/gridftp2 releases based on the new code, but it is likely we
> will find problems again from this bug so it is very desirable to have
> this fixed - either as a fix to the older version or by moving to the
> newer code.
>
> Thank you,
> David
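
The "about 75%" figure quoted above is consistent with plain base64 arithmetic: base64 emits 4 output characters for every 3 input bytes. The following is only a rough sketch of that arithmetic. The function names are hypothetical and do not come from the Globus source, and the sketch ignores any GSSAPI wrapping overhead and the reply-code framing that the real message also carries.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the padded base64 encoding of raw_len bytes:
 * every group of up to 3 input bytes becomes 4 output characters. */
static size_t base64_encoded_len(size_t raw_len)
{
    return 4 * ((raw_len + 2) / 3);
}

/* Fraction of the encoded message covered by a wait_for value that was
 * mistakenly set to the raw (pre-encoding) length. Hypothetical helper. */
static double raw_to_encoded_ratio(size_t raw_len)
{
    assert(raw_len > 0); /* avoid dividing by a zero encoded length */
    return (double) raw_len / (double) base64_encoded_len(raw_len);
}
```

For a 300-byte raw message, base64_encoded_len(300) is 400 and raw_to_encoded_ratio(300) is 0.75, so a wait_for of 300 would only guarantee that three quarters of the encoded message had been written.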

I don't yet know a lot of details about their problem. I don't know how
frequently it occurs, for example. However, they have labeled it as critical. I
think it's disrupting work being done by CMS (based at CERN). 

Can you look at this bug and see if the diagnosis (the bad parameter) is in
fact correct? 

Upgrading to a version of Globus in the trunk (Globus 4.1?) is not an option at
this point: we'll need a patch for Globus 4.0.5. 

Thanks,
-alain
------- Comment #1 From 2007-10-01 13:52:36 -------
For reference, the VDT bug ticket for this problem is: 
http://vdt.cs.wisc.edu/rt/index.html?user=guest&pass=guest&q=2989
------- Comment #2 From 2007-10-08 11:37:59 -------
Any thoughts on this? We do consider this to be a serious problem. 

Here is some additional information from David Smith:

1) They tried a fix as a binary fix, not a source code fix, and it worked. They
replaced length with handle->auth_write_iov.iov_len. We would like your
confirmation that this is an appropriate fix. 

2) On the original symptoms:

> It was a failure in FTS transfers from one site to another - it
> showed up as a problem just in transfers between those two particular
> sites, the exact conditions required appearing to be complex; it
> appears to depend on the size of range markers (and thus the size of
> messages the server was writing on the control channel) and so on the
> transfer rate, number of streams, and tcp buffer sizes, as well as the
> configuration of the system where the globus gridftp2 is running. The
> symptom was the error:
>
> debug: error reading response from gsiftp://node26.datagrid.cea.fr/
> node26.datagrid.cea.fr:/pool_node26/dteam/2007-09-25/david102d.
> 2397.0: globus_l_ftp_control_read_cb: Error while searching for end
> of reply
> debug: fault on connection to gsiftp://node26.datagrid.cea.fr/
> node26.datagrid.cea.fr:/pool_node26/dteam/2007-09-25/david102d.
> 2397.0: globus_l_ftp_control_read_cb: Error while searching for end
> of reply
> debug: error reading response from gsiftp://ccxfer13.in2p3.fr:2811//
> pnfs/in2p3.fr/data/dteam/disk/dapnia/cleroy/ATLAS-filenode20_007: an
> I/O operation was cancelled
> debug: operation complete
>
> error: globus_l_ftp_control_read_cb: Error while searching for end of
> reply
>
> The above is the client-side error I produced while investigating. It was
> reported by the ftp client that was handling the 3rd party copy, an
> older globus 2 based client in this case.  In production the result
> is an FTS error, with a similar error to the above being available
> from the FTS server when the transfer job status is queried.
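
The one-line change in item 1 and the sensitivity to message and buffer sizes in item 2 can be illustrated with a small stand-alone model of the 'wait_for' semantics described in the original report. This is only a sketch under stated assumptions: model_pass_write and its chunk parameter are hypothetical, the fixed chunk size stands in for however much the kernel happens to accept per write, and nothing here calls the real Globus XIO API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of a write with a 'wait_for' threshold.
 * The transport accepts at most 'chunk' bytes per attempt and stops
 * as soon as at least 'wait_for' bytes have been written. The return
 * value is the byte count a completion callback would be told about.
 */
static size_t model_pass_write(size_t total, size_t wait_for, size_t chunk)
{
    size_t written = 0;

    assert(chunk > 0 && wait_for <= total);
    while (written < wait_for) {
        size_t n = total - written;   /* bytes still unsent */
        if (n > chunk) {
            n = chunk;                /* transport takes only part of it */
        }
        written += n;
    }
    return written;
}
```

With a 400-byte encoded message and a transport that takes 350 bytes at a time, model_pass_write(400, 300, 350) returns 350, so the last 50 bytes are never sent. Passing the full length, model_pass_write(400, 400, 350), returns 400. With a 200-byte chunk the buggy call also happens to return 400, which matches the observation that the failure only appears under particular conditions.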
------- Comment #3 From 2007-10-08 16:46:46 -------
That change looks good to me.  I'll commit a fix for that and release an update
package soon.

The changes you noted in the current CVS trunk have the side effect of fixing
the problem as well, but they were designed for better threaded performance
along with other changes throughout the package, so they aren't really suitable
for porting to the 4.0.x branch.

Thanks for reporting this.
------- Comment #4 From 2007-10-08 17:30:04 -------
Subject: Re:  gridftp2 server can send truncated control channel messages

> That change looks good to me.  I'll commit a fix for that and  
> release an update
> package soon.

Great! When you release the update package, can you tell me about it?  
I'll provide the update to VDT users.

Thanks,
-alain
------- Comment #5 From 2007-10-17 13:28:32 -------
Any idea when the update package will be released? 

If it's going to be a while, can I simply change "length" to
"handle->auth_write_iov.iov_len", as David Smith suggested, or are other
changes needed in addition?

thanks,
-alain
------- Comment #6 From 2007-10-17 13:39:24 -------
That's going to be the only change.

I haven't created the update yet as I was hoping to combine the advisories for
this and 5607.
------- Comment #7 From 2007-10-17 14:55:05 -------
Subject: Re:  gridftp2 server can send truncated control channel messages

> That's going to be the only change.

Great.

> I haven't created the update yet as I was hoping to combine the  
> advisories for
> this and 5607.

OK. I was hoping to combine them too, before I rebuilt Globus. Folks  
from EGEE think that this particular bug (5590) is critical enough to  
them that they would like a fix before waiting for the other bug  
(5607). David Smith says it will take him a little while to be able  
to provide you a case you can easily reproduce.

-alain
------- Comment #8 From 2007-12-10 18:37:12 -------
Sorry for the delay on this -- I had missed your last comment and was still
waiting on feedback on 5607.

I've tested the suggested fix and committed to the 4.0 branch, and an update
package can be found here:
http://www-unix.mcs.anl.gov/~mlink/bugs/globus_gridftp_server_control-0.20.tar.gz

Mike
------- Comment #9 From 2007-12-10 21:05:03 -------
Subject: Re:  gridftp2 server can send truncated control channel messages


> Sorry for the delay on this -- I had missed your last comment and  
> was still
> waiting on feedback on 5607.
>
> I've tested the suggested fix and committed to the 4.0 branch, and  
> an update
> package can be found here:
> http://www-unix.mcs.anl.gov/~mlink/bugs/globus_gridftp_server_control-0.20.tar.gz

No problem, we've already taken up the patch.

Thanks for looking it over and committing it!

-alain