Bugzilla – Bug 5190
Error 421 during globus-url-copy, why?
Last modified: 2007-05-30 11:22:37
You need to log in before you can comment on or make changes to this bug.
Below is a GLOBUS_XIO_TCP_DEBUG=ALL and -dbg -verbose globus-url-copy that ends in an error 421. There's a description about the circumstances surrounding this error at http://drupal.star.bnl.gov/STAR/comp/grid/documentation/gsiftp-troubleshooting/ [globus_l_xio_tcp_activate] Entering [globus_l_xio_tcp_activate] Exiting [globus_l_xio_tcp_init] Entering [globus_l_xio_tcp_init] Exiting [globus_l_xio_tcp_attr_init] Entering [globus_l_xio_tcp_attr_init] Exiting [globus_l_xio_tcp_attr_cntl] Entering [globus_l_xio_tcp_attr_cntl] Exiting debug: starting to get gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz debug: connecting to gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz [globus_l_xio_tcp_attr_init] Entering [globus_l_xio_tcp_attr_init] Exiting [globus_l_xio_tcp_attr_cntl] Entering [globus_l_xio_tcp_attr_cntl] Exiting [globus_l_xio_tcp_attr_copy] Entering [globus_l_xio_tcp_attr_copy] Exiting [globus_l_xio_tcp_attr_cntl] Entering [globus_l_xio_tcp_attr_cntl] Exiting [globus_l_xio_tcp_attr_copy] Entering [globus_l_xio_tcp_attr_copy] Exiting [globus_l_xio_tcp_open] Entering [globus_l_xio_tcp_handle_init] Entering [globus_l_xio_tcp_handle_init] Exiting [globus_l_xio_tcp_connect] Entering [globus_l_xio_tcp_attr_copy] Entering [globus_l_xio_tcp_attr_copy] Exiting [globus_l_xio_tcp_connect_next] Entering [globus_l_xio_tcp_apply_handle_attrs] Entering [globus_l_xio_tcp_apply_handle_attrs] Exiting [globus_l_xio_tcp_connect_next] Exiting [globus_l_xio_tcp_connect] Exiting [globus_l_xio_tcp_open] Exiting [globus_l_xio_tcp_attr_destroy] Entering [globus_l_xio_tcp_attr_destroy] Exiting [globus_l_xio_tcp_system_connect_cb] Entering [globus_l_xio_tcp_cntl] Entering [globus_l_xio_tcp_contact_string] Entering [globus_l_xio_tcp_contact_string] Exiting [globus_l_xio_tcp_cntl] Exiting [globus_l_xio_tcp_read] Entering [globus_l_xio_tcp_read] Exiting [globus_l_xio_tcp_attr_destroy] Entering [globus_l_xio_tcp_attr_destroy] Exiting [globus_l_xio_tcp_system_connect_cb] Exiting [globus_l_xio_tcp_system_read_cb] Entering debug: response from gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz: 220 stargrid02.rcf.bnl.gov GridFTP Server 2.3 (gcc32dbg, 1144436882-63) ready. debug: authenticating with gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz [globus_l_xio_tcp_write] Entering [globus_l_xio_tcp_write] count=1, 1st buflen=13 [globus_l_xio_tcp_write] Exiting [globus_l_xio_tcp_system_read_cb] Exiting [globus_l_xio_tcp_system_write_cb] Entering [globus_l_xio_tcp_read] Entering [globus_l_xio_tcp_read] Exiting [globus_l_xio_tcp_system_write_cb] Exiting [globus_l_xio_tcp_system_read_cb] Entering debug: response from gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz: 421 Idle Timeout: closing control connection. debug: fault on connection to gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/GSTAR5.fz debug: data callback, error globus_ftp_client: the server responded with an error, buffer 0xb749d008, length 0, offset=0, eof=true debug: operation complete [globus_l_xio_tcp_system_read_cb] Exiting error: globus_ftp_client: the server responded with an error 421 Idle Timeout: closing control connection. [globus_l_xio_tcp_write] Entering [globus_l_xio_tcp_write] count=1, 1st buflen=6 [globus_l_xio_tcp_write] Exiting [globus_l_xio_tcp_system_write_cb] Entering [globus_l_xio_tcp_read] Entering [globus_l_xio_tcp_read] Exiting [globus_l_xio_tcp_system_write_cb] Exiting [globus_l_xio_tcp_system_read_cb] Entering [globus_l_xio_tcp_close] Entering [globus_l_xio_tcp_close] Exiting [globus_l_xio_tcp_system_read_cb] Exiting [globus_l_xio_tcp_system_close_cb] Entering [globus_l_xio_tcp_attr_destroy] Entering [globus_l_xio_tcp_attr_destroy] Exiting [globus_l_xio_tcp_attr_destroy] Entering [globus_l_xio_tcp_attr_destroy] Exiting [globus_l_xio_tcp_attr_destroy] Entering [globus_l_xio_tcp_attr_destroy] Exiting [globus_l_xio_tcp_handle_destroy] Entering [globus_l_xio_tcp_handle_destroy] Exiting [globus_l_xio_tcp_system_close_cb] Exiting [globus_l_xio_tcp_destroy] Entering [globus_l_xio_tcp_destroy] Exiting [globus_l_xio_tcp_deactivate] Entering [globus_l_xio_tcp_deactivate] Exiting Source: gsiftp://stargrid02.rcf.bnl.gov/star/data08/HRM/ Dest: file:////home/wbetts/gridftp.test/STAR-BNL/output2007-04-02-15-04/ GSTAR5.fz
What this report shows me is that the client (globus-url-copy) successfully forms a TCP control channel connection with the server. It then Successfully reads the 220 Banner message from the server. The client then attempts to authenicate with the server. It sends the AUTH GSSAPI command and posts a read for a response. It is this second read that times out. From what i see here the both sides believe the TCP connection is formed successfully, enough so that at least 1 message is sent from the server to the client (220 banner) and possibly 1 from the client to the server (AUTH GSSAPI, since we dont have server logs we cannot confirm the server actually received it). I think the next step should be looking at the gssapi authentication logs on the gridftp server side to see what commands were actually received and what replies sent. I think debugging at the TCP layer may be premature and may be introducing some red herrings. To get the desired logs sent the env export GLOBUS_XIO_GSSAPI_FTP_DEBUG=255,filename
Here's the latest from Wayne, with a pointer to some FTP-command level logs: I have complied with Charles Bacon's request for a server-side log file with GLOBUS_XIO_GSSAPI_FTP_DEBUG set to 255: http://drupal.star.bnl.gov/STAR/files/active/0/gridftp-auth.xio_gssapi_ftp_debug.log This log file covers a test with 11 transfers, in which 1 failed. Unfortunately, the logging for all 11 transfer attempts is jumbled together, and the extra debug lines have no distinguishing tag to separate them from each other so I can't follow it through, but perhaps the experts will be able to pull something out of it. I have also added a couple more notes to my troubleshooting page near the bottom in "Updates on April 16:" ( http://drupal.star.bnl.gov/STAR/comp/grid/documentation/gsiftp-troubleshooting/#April_16_2007 ), It amounts to the following: 1. On the server (stargrid02.rcf.bnl.gov), I tried disabling TCP Windows Scaling ("sysctl -w net.ipv4.tcp_window_scaling=0") -- no improvement 2. On the server (stargrid02.rcf.bnl.gov), I tried disabling MTU discovery ("sysctl -w net.ipv4.ip_no_pmtu_disc=1") -- no improvement
You can disable the pre-auth timeout with the option '-control-preauth-timeout 0' so the server won't forcibly close the connection. The default of 30s should be enough time for connection/auth establishment, but at least this way we can see how it acts given more time. Mike
If you saw the notes on the drupal link, he has tried: "Increased the control_preauth_timeout on the server to 120, then 300 seconds (default is 30). Failures occured with all settings." Do you think 0 would be even better? The new data here was in response to John's request for "export GLOBUS_XIO_GSSAPI_FTP_DEBUG=255,filename", and those logs are on the page. Were they useful or no? Wayne was concerned that since it happened in the middle of a bunch of transfers that there wasn't a good way to figure out which connection each log line corresponded to.
Ah, I hadn't seen that. I suggested 0 anyways because the only thing actually cancelling the connection attempt was the timeout. I don't expect anything useful to happen after 300s though. The debug output wasn't clear. Try GLOBUS_XIO_GSSAPI_FTP_DEBUG=ALL,filename,1 for the debug. That will write a new filename.pid for each connection. The interesting pid can be found by checking the gridftp log for the line with the 421 error. the lines are prefixed with [pid]. I'd also add GLOBUS_XIO_TCP_DEBUG with the same options.
Thanks Mike. I'll have Wayne set that up and report back.
Wayne's email with links to the PID-marked logs: Following the instructions that Charles sent produces only a single log file named "filename" rather than separate files for each server instance (no filename.pid), but it did something just as good - it labeled each line entry with the pid of the server instance, so a given connection can now be pulled out easily. New logfiles with the requested DEBUG options are available here: http://drupal.star.bnl.gov/STAR/files/active/0/gridftp-auth.xio_gssapi_ftp_debug.labelled.log http://drupal.star.bnl.gov/STAR/files/active/0/gridftp-auth.xio_tcp_debug.labelled.log These log files cover a test with 10 failures out of 21 attempted. Also, I now have access to several additional gsiftp servers at BNL in various locations on the BNL network. In early tests (including several with 100 nearly simultaneous connections), no failures occur when using a server outside of the BNL firewall, but all servers inside the BNL firewall (both inside and outside of the RACF firewall) have experienced failures that appear to be the same as stargrid02. So the BNL firewall is a prime suspect in my mind, though I still have a purely intuitive suspicion about routers playing tricks on us and possibly using different routes in the failed transfers. I've passed these preliminary results to the BNL ITD network engineer assigned to my open ticket, but haven't heard back from him (so I plan on an "in person" visit Tuesday morning.)
Any comments on the PID-marked logfiles?
Unreal. Turns out to have been a problem with the xinetd their redhat version was using: "Under some circumstances, xinetd confused the file descriptor used for logging with the descriptors to be passed to the new server. This resulted in the server process being unable communicate with the client. This updated package contains a backported fix from upstream. " Closing this as not a GridFTP bug.