Bug 5190 - Error 421 during globus-url-copy, why?
: Error 421 during globus-url-copy, why?
Status: RESOLVED FIXED
: GridFTP
GridFTP
: 4.0.1
: Macintosh All
: P3 normal
: ---
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2007-04-03 10:48 by
Modified: 2007-05-30 11:22 (History)


Attachments


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2007-04-03 10:48:44
Below is a GLOBUS_XIO_TCP_DEBUG=ALL and -dbg -verbose globus-url-copy that ends
in an error 421.  There's a description about the circumstances surrounding
this error at
http://drupal.star.bnl.gov/STAR/comp/grid/documentation/gsiftp-troubleshooting/

[globus_l_xio_tcp_activate] Entering
[globus_l_xio_tcp_activate] Exiting
[globus_l_xio_tcp_init] Entering
[globus_l_xio_tcp_init] Exiting
[globus_l_xio_tcp_attr_init] Entering
[globus_l_xio_tcp_attr_init] Exiting
[globus_l_xio_tcp_attr_cntl] Entering
[globus_l_xio_tcp_attr_cntl] Exiting
debug: starting to get
gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz
debug: connecting to gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz
[globus_l_xio_tcp_attr_init] Entering
[globus_l_xio_tcp_attr_init] Exiting
[globus_l_xio_tcp_attr_cntl] Entering
[globus_l_xio_tcp_attr_cntl] Exiting
[globus_l_xio_tcp_attr_copy] Entering
[globus_l_xio_tcp_attr_copy] Exiting
[globus_l_xio_tcp_attr_cntl] Entering
[globus_l_xio_tcp_attr_cntl] Exiting
[globus_l_xio_tcp_attr_copy] Entering
[globus_l_xio_tcp_attr_copy] Exiting
[globus_l_xio_tcp_open] Entering
[globus_l_xio_tcp_handle_init] Entering
[globus_l_xio_tcp_handle_init] Exiting
[globus_l_xio_tcp_connect] Entering
[globus_l_xio_tcp_attr_copy] Entering
[globus_l_xio_tcp_attr_copy] Exiting
[globus_l_xio_tcp_connect_next] Entering
[globus_l_xio_tcp_apply_handle_attrs] Entering
[globus_l_xio_tcp_apply_handle_attrs] Exiting
[globus_l_xio_tcp_connect_next] Exiting
[globus_l_xio_tcp_connect] Exiting
[globus_l_xio_tcp_open] Exiting
[globus_l_xio_tcp_attr_destroy] Entering
[globus_l_xio_tcp_attr_destroy] Exiting
[globus_l_xio_tcp_system_connect_cb] Entering
[globus_l_xio_tcp_cntl] Entering
[globus_l_xio_tcp_contact_string] Entering
[globus_l_xio_tcp_contact_string] Exiting
[globus_l_xio_tcp_cntl] Exiting
[globus_l_xio_tcp_read] Entering
[globus_l_xio_tcp_read] Exiting
[globus_l_xio_tcp_attr_destroy] Entering
[globus_l_xio_tcp_attr_destroy] Exiting
[globus_l_xio_tcp_system_connect_cb] Exiting
[globus_l_xio_tcp_system_read_cb] Entering
debug: response from gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz:
220 stargrid02.rcf.bnl.gov GridFTP Server 2.3 (gcc32dbg, 1144436882-63) ready.

debug: authenticating with
gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz
[globus_l_xio_tcp_write] Entering
[globus_l_xio_tcp_write] count=1, 1st buflen=13
[globus_l_xio_tcp_write] Exiting
[globus_l_xio_tcp_system_read_cb] Exiting
[globus_l_xio_tcp_system_write_cb] Entering
[globus_l_xio_tcp_read] Entering
[globus_l_xio_tcp_read] Exiting
[globus_l_xio_tcp_system_write_cb] Exiting
[globus_l_xio_tcp_system_read_cb] Entering
debug: response from gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz:
421 Idle Timeout: closing control connection.

debug: fault on connection to
gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz
debug: data callback, error globus_ftp_client: the server responded with an
error, buffer 0xb749d008, length 0, offset=0, eof=true
debug: operation complete
[globus_l_xio_tcp_system_read_cb] Exiting

error: globus_ftp_client: the server responded with an error
421 Idle Timeout: closing control connection.

[globus_l_xio_tcp_write] Entering
[globus_l_xio_tcp_write] count=1, 1st buflen=6
[globus_l_xio_tcp_write] Exiting
[globus_l_xio_tcp_system_write_cb] Entering
[globus_l_xio_tcp_read] Entering
[globus_l_xio_tcp_read] Exiting
[globus_l_xio_tcp_system_write_cb] Exiting
[globus_l_xio_tcp_system_read_cb] Entering
[globus_l_xio_tcp_close] Entering
[globus_l_xio_tcp_close] Exiting
[globus_l_xio_tcp_system_read_cb] Exiting
[globus_l_xio_tcp_system_close_cb] Entering
[globus_l_xio_tcp_attr_destroy] Entering
[globus_l_xio_tcp_attr_destroy] Exiting
[globus_l_xio_tcp_attr_destroy] Entering
[globus_l_xio_tcp_attr_destroy] Exiting
[globus_l_xio_tcp_attr_destroy] Entering
[globus_l_xio_tcp_attr_destroy] Exiting
[globus_l_xio_tcp_handle_destroy] Entering
[globus_l_xio_tcp_handle_destroy] Exiting
[globus_l_xio_tcp_system_close_cb] Exiting
[globus_l_xio_tcp_destroy] Entering
[globus_l_xio_tcp_destroy] Exiting
[globus_l_xio_tcp_deactivate] Entering
[globus_l_xio_tcp_deactivate] Exiting
Source: gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/
Dest:   file:////home/wbetts/gridftp.test/STAR-BNL/output2007-04-02-15-04/
 GSTAR5.fz
------- Comment #1 From 2007-04-03 11:22:02 -------
What this report shows me is that the client (globus-url-copy) successfully
forms a TCP control channel connection with the server.  It then Successfully
reads the 220 Banner message from the server.  The client then attempts to
authenicate with the server.  It sends the AUTH GSSAPI command and posts a read
for a response.  It is this second read that times out.

From what i see here the both sides believe the TCP connection is formed
successfully, enough so that at least 1 message is sent from the server to the
client (220 banner) and possibly 1 from the client to the server (AUTH GSSAPI,
since we dont have server logs we cannot confirm the server actually received
it).

I think the next step should be looking at the gssapi authentication logs on
the gridftp server side to see what commands were actually received and what
replies sent.  I think debugging at the TCP layer may be premature and may be
introducing some red herrings.

To get the desired logs sent the env

export GLOBUS_XIO_GSSAPI_FTP_DEBUG=255,filename
------- Comment #2 From 2007-04-17 08:54:58 -------
Here's the latest from Wayne, with a pointer to some FTP-command level logs:

I have complied with Charles Bacon's request for a server-side log file with
GLOBUS_XIO_GSSAPI_FTP_DEBUG set to 255:

http://drupal.star.bnl.gov/STAR/files/active/0/gridftp-auth.xio_gssapi_ftp_debug.log

This log file covers a test with 11 transfers, in which 1 failed. 
Unfortunately, the logging for all 11 transfer attempts is jumbled together,
and the extra debug lines have no distinguishing tag to separate them from each
other so I can't follow it through, but perhaps the experts will be able to
pull something out of it.

I have also added a couple more notes to my troubleshooting page near the
bottom in "Updates on April 16:"
(
http://drupal.star.bnl.gov/STAR/comp/grid/documentation/gsiftp-troubleshooting/#April_16_2007
), It amounts to the following:
1.  On the server (stargrid02.rcf.bnl.gov), I tried disabling TCP Windows
Scaling ("sysctl -w net.ipv4.tcp_window_scaling=0") -- no improvement
2.  On the server (stargrid02.rcf.bnl.gov), I tried disabling MTU discovery
("sysctl -w net.ipv4.ip_no_pmtu_disc=1") -- no improvement
------- Comment #3 From 2007-04-17 12:09:42 -------
You can disable the pre-auth timeout with the option '-control-preauth-timeout
0' so the server won't forcibly close the connection.  The default of 30s
should be enough time for connection/auth establishment, but at least this way
we can see how it acts given more time.

Mike
------- Comment #4 From 2007-04-19 15:21:15 -------
If you saw the notes on the drupal link, he has tried:
"Increased the control_preauth_timeout on the server to 120, then 300 seconds
(default is 30).  Failures occured with all settings."

Do you think 0 would be even better?  The new data here was in response to
John's request for "export GLOBUS_XIO_GSSAPI_FTP_DEBUG=255,filename", and those
logs are on the page.  Were they useful or no?  Wayne was concerned that since
it happened in the middle of a bunch of transfers that there wasn't a good way
to figure out which connection each log line corresponded to.
------- Comment #5 From 2007-04-19 15:33:05 -------
Ah, I hadn't seen that.  I suggested 0 anyways because the only thing actually
cancelling the connection attempt was the timeout.  I don't expect anything
useful to happen after 300s though.

The debug output wasn't clear.  Try GLOBUS_XIO_GSSAPI_FTP_DEBUG=ALL,filename,1
for the debug.  That will write a new filename.pid for each connection.  The
interesting pid can be found by checking the gridftp log for the line with the
421 error.  the lines are prefixed with [pid].  I'd also add
GLOBUS_XIO_TCP_DEBUG with the same options.
------- Comment #6 From 2007-04-19 15:37:19 -------
Thanks  Mike.  I'll have Wayne set that up and report back.
------- Comment #7 From 2007-04-23 15:39:49 -------
Wayne's email with links to the PID-marked logs:

Following the instructions that Charles sent produces only a single log file
named "filename" rather than separate files for each server instance (no
filename.pid), but it did something just as good - it labeled each line entry
with the pid of the server instance, so a given connection can now be pulled
out easily.  New logfiles with the requested DEBUG options are available here:

http://drupal.star.bnl.gov/STAR/files/active/0/gridftp-auth.xio_gssapi_ftp_debug.labelled.log
http://drupal.star.bnl.gov/STAR/files/active/0/gridftp-auth.xio_tcp_debug.labelled.log

These log files cover a test with 10 failures out of 21 attempted.


Also, I now have access to several additional gsiftp servers at BNL in various
locations on the BNL network.  In early tests (including several with 100
nearly simultaneous connections), no failures occur when using a server outside
of the BNL firewall, but all servers inside the BNL firewall (both inside and
outside of the RACF firewall) have experienced failures that appear to be the
same as stargrid02.  So the BNL firewall is a prime suspect in my mind, though
I still have a purely intuitive suspicion about routers playing tricks on us
and possibly using different routes in the failed transfers.

I've passed these preliminary results to the BNL ITD network engineer assigned
to my open ticket, but haven't heard back from him (so I plan on an "in person"
visit Tuesday morning.)
------- Comment #8 From 2007-05-21 13:53:07 -------
Any comments on the PID-marked logfiles?
------- Comment #9 From 2007-05-30 11:22:37 -------
Unreal.  Turns out to have been a problem with the xinetd their redhat version
was using:

"Under some circumstances, xinetd confused the file descriptor used for 
logging with the descriptors to be passed to the new server. This resulted 
in the server process being unable communicate with the client. This 
updated package contains a backported fix from upstream. "

Closing this as not a GridFTP bug.