Bug 1851 - Globus XIO should be smarter about selecting a port
Status: RESOLVED FIXED
: XIO
Globus XIO
: 3.2.1
: PC Linux
: P3 normal
: ---
Reported: 2004-07-23 16:27 by
Modified: 2004-12-09 11:19 (History)


Description From 2004-07-23 16:27:27
When given a port range (via GLOBUS_TCP_PORT_RANGE), Globus I/O tries each 
port in order, from the beginning. You can see that here in 
globus_xio_tcp_driver.c:

    if(port == 0)
    {
        port = min_port;
    }
    else
    {
        max_port = port;
    }
    
    done = GLOBUS_FALSE;
    do
    {
        GlobusLibcSockaddrCopy(myaddr, *addr, addr_len);
        GlobusLibcSockaddrSetPort(myaddr, port);
        
        if(bind(
            fd,
            (struct sockaddr *) &myaddr,
            GlobusLibcSockaddrLen(&myaddr)) < 0)
        {
            if(++port > max_port)
            {
                result = GlobusXIOErrorSystemError("bind", errno);
                goto error_bind;
            }
        }
        else
        {
            done = GLOBUS_TRUE;
        }
    } while(!done);

When encountering a stateful firewall, this can cause problems, as described 
by Jason Smith to the VDT team:

----- Begin Quote
We were having a problem with a new firewall that we recently installed
and eventually tracked down the cause.  The problem happens when a tcp
connection doesn't get closed properly, maybe because one side crashed
or some kind of network interruption.  Since we have a stateful firewall
and it doesn't see the connection being torn down, it is blocking new
connection attempts (syn packets) with the same IP & port pairs, until
this stale connection times out in the state engine of the firewall,
which has to be fairly long so it doesn't break legitimate idle
connections.

The problem is made many times worse by the fact that globus uses this
defined port range, which is there for the sole purpose of letting
globus know about conduits through your firewall, and the fact that
globus searches through this port range sequentially starting from the
beginning, every time it needs a port.  Because of this, ports in the
beginning of this range have a very high probability of being reused
over and over again, and if one of these doesn't get closed properly, it
will continue to be reused over and over since the connections will
never make it through the firewall, causing many failed globus
connections.

The globus I/O library should not be reusing these same ports all the
time.  It would be better if it could somehow remember the last port it
used and continue searching from there for a port to use next time, like
the linux kernel does.  You can essentially think of the problem as
being caused by the firewall being stateful, while globus is not.  If
making globus stateful so it can remember the last port selected is too
hard, an easy solution might be to randomize the selection of the port
from this range.  This would reduce the chance of reusing a recently
used port, thus allowing the state engine in the firewall time to expire
stale connections that weren't closed properly.
----- End Quote

I understand the difficulty in remembering the last port, since we are usually
talking about distinct processes, like job managers.

Here is a proposal: you could choose a random port in the range (make sure to 
seed the random number generator first), then search to the end of the range, 
then wrap around until you've gone through the whole range. 
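
A rough C sketch of this first proposal follows. It is an illustration only, not the Globus XIO change: `bind_random_start` is a hypothetical helper, it assumes an IPv4 socket bound to INADDR_ANY, and it seeds `rand()` from the clock and the process ID.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper, not part of Globus XIO: bind fd to a free port in
 * [min_port, max_port], starting at a random offset and wrapping around.
 * Returns the bound port, or -1 if every port in the range failed. */
int bind_random_start(int fd, int min_port, int max_port)
{
    static int seeded = 0;
    int range = max_port - min_port + 1;
    int start, i;

    assert(min_port > 0 && min_port <= max_port && max_port <= 65535);

    if (!seeded) {
        /* Mix in the pid so processes started in the same second differ. */
        srand((unsigned) time(NULL) ^ (unsigned) getpid());
        seeded = 1;
    }
    start = rand() % range;

    /* Visit each port exactly once: start..max_port, then min_port..start-1. */
    for (i = 0; i < range; i++) {
        struct sockaddr_in addr;
        int port = min_port + (start + i) % range;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons((unsigned short) port);

        if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0) {
            return port;
        }
    }
    return -1;
}
```

A caller would create the socket with `socket(AF_INET, SOCK_STREAM, 0)` and pass the two bounds taken from GLOBUS_TCP_PORT_RANGE. Independent processes can still pick the same recently used port; the sketch only makes that less likely.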

Here is another proposal: you could search the range in a completely random 
order. A simple implementation may require an array of the length of the port 
range that you can randomize, so perhaps this is inappropriate. 
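
As a sketch of what that array-based approach might look like, the hypothetical helper below builds the list of ports and shuffles it with a Fisher-Yates shuffle. The name `shuffled_ports` is made up for illustration, and the caller is assumed to have seeded `rand()` already.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: return a newly allocated array holding every port in
 * [min_port, max_port] in random order (Fisher-Yates shuffle), and store the
 * number of entries in *count. The caller frees the array.
 * Returns NULL if the allocation fails. */
int *shuffled_ports(int min_port, int max_port, int *count)
{
    int n = max_port - min_port + 1;
    int *ports;
    int i;

    assert(min_port <= max_port && count != NULL);

    ports = malloc((size_t) n * sizeof(*ports));
    if (ports == NULL) {
        return NULL;
    }
    for (i = 0; i < n; i++) {
        ports[i] = min_port + i;
    }
    /* Swap each slot, from the end, with a random slot at or before it. */
    for (i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int tmp = ports[i];
        ports[i] = ports[j];
        ports[j] = tmp;
    }
    *count = n;
    return ports;
}
```

The caller would then try `bind()` on each entry in turn. Since a port range can hold at most 65,535 entries, the array stays small, but it is still an allocation on every port selection.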

Or perhaps there is another good solution. We at least wanted to let you know 
that this problem exists, and see what you think about fixing it. 

-alain of the VDT team
------- Comment #1 From 2004-07-23 16:51:03 -------
This isn't difficult at all.  I will release an update to gt 3.2.1 soon.
------- Comment #2 From 2004-07-23 16:54:46 -------
*** Bug 1849 has been marked as a duplicate of this bug. ***
------- Comment #3 From 2004-07-23 18:01:04 -------
Subject: Re:  Globus XIO should be smarter about selecting a port

You rock, Joe.

-alain


------- Comment #4 From 2004-07-24 08:01:01 -------
Subject: Re:  Globus XIO should be smarter about selecting a port


>This isn't difficult at all.  I will release an update to gt 3.2.1 soon.

Here's an interesting idea. This is what one of the Condor developers told
me about how it is done in Condor:

>In Condor, we do not search from the beginning every time.  Instead, the 
>port we start "probing" at is determined via taking the current uTime of 
>day and the port range and hashing it.  Specifically:
>
>int start_trial = low_port +
>                   (curTime.tv_usec * 73/*some prime number*/ % range);

I don't know if it's better than randomness or not. I'll let you think 
about it.

Thanks,
-alain



------- Comment #5 From 2004-07-24 13:33:16 -------
An update that can be applied to 3.2.1 (or 3.2.0 with all the other updates) is 
here:
ftp://ftp.globus.org/pub/gt3/3.2/contrib/globus_xio-0.9-range-state.tar.gz

Simply run $GPT_LOCATION/sbin/gpt-build -verbose -force <flavor> <package>

The first official release of this will have to be 4.0, as it is a bit more 
involved than something I am willing to add to a stable release.

In conjunction with the GLOBUS_TCP_PORT_RANGE env variable, set the new
GLOBUS_TCP_PORT_RANGE_STATE_FILE env variable to point to a location for a state
file (the file doesn't need to exist).  All processes sharing the port range
specified in GLOBUS_TCP_PORT_RANGE should have this env variable set to the
same file.

The file is protected with POSIX advisory locking and is updated with the last
port used.  The next process tries to bind that last port + 1.  If that
fails, it keeps incrementing the port number (wrapping around at the top of the
port range) until it successfully binds or gets back to the starting port number.
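
The locking scheme described above might look roughly like the following. This is a hypothetical sketch, not the code in the update package: `next_start_port` is an invented name, the file is assumed to hold a single decimal port number, and for brevity it records the first candidate port rather than the port that was finally bound.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical sketch: under an exclusive POSIX advisory lock, read the last
 * port from state_path, advance it with wrap-around, write it back, and
 * return the port to try first. Returns -1 on I/O error. */
int next_start_port(const char *state_path, int min_port, int max_port)
{
    struct flock lock;
    char buf[16];
    ssize_t n;
    int last, next, len;
    int fd;

    assert(min_port <= max_port);

    fd = open(state_path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        return -1;
    }
    memset(&lock, 0, sizeof(lock));
    lock.l_type = F_WRLCK;     /* exclusive lock */
    lock.l_whence = SEEK_SET;  /* l_start = 0 and l_len = 0: whole file */
    if (fcntl(fd, F_SETLKW, &lock) < 0) {  /* wait until the lock is ours */
        close(fd);
        return -1;
    }

    n = read(fd, buf, sizeof(buf) - 1);
    buf[n > 0 ? n : 0] = '\0';
    last = atoi(buf);

    /* Empty, corrupt, or out-of-range contents, or the top of the range:
     * start again from min_port. Otherwise move one port forward. */
    if (last < min_port || last >= max_port) {
        next = min_port;
    } else {
        next = last + 1;
    }

    len = snprintf(buf, sizeof(buf), "%d\n", next);
    if (lseek(fd, 0, SEEK_SET) < 0 || ftruncate(fd, 0) < 0 ||
        write(fd, buf, (size_t) len) != len) {
        next = -1;
    }
    close(fd);  /* closing the descriptor releases the lock */
    return next;
}
```

The caller would start its `bind()` loop at the returned port and wrap around the range from there.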

This is just an initial attempt at this.  Later, I may track each port in use 
to further prevent collisions when application lifetime widely varies.

Joe
------- Comment #6 From 2004-07-24 14:20:51 -------
Alain,

Would it be possible to backport this patch and apply it to GT2 in the VDT?  Or
even a simple patch implementing the random or time hashing method like you
suggested?  Anything other than the linear search that always starts from the
beginning would greatly reduce the probability of having connection errors like
we are seeing now.

Thanks,
~Jason
------- Comment #7 From 2004-07-24 18:29:16 -------
It was straightforward enough for me to at least provide a patch to globus io.
It's attached to bug 1850.  As it says, it's untested.  Let me know if you have
any problems.

Also, I made a minor change to the update package above that fixes a potential
problem with threaded builds.

Joe
------- Comment #8 From 2004-08-09 11:30:36 -------
Has anyone attempted using this feature?  Is it working for you?

Joe
------- Comment #9 From 2004-08-24 12:04:05 -------
Still waiting on some feedback here.
------- Comment #10 From 2004-08-25 18:12:24 -------
Sorry Joe--I had sent comments in the past, but as it often seems to happen, 
Bugzilla ignores my email. I'll try typing this into the web form to see if 
that helps. 

I haven't had a chance to test the patch yet--sorry!

That said, I worry about patches that add state via file and use file locking. 
People often forget that file locking is broken when using Linux's NFS, and 
this ends up failing for odd reasons. 

What did you think about the method that is used in Condor? It doesn't rely on
files or file locking. It may not be as successful when everything (file
locking) works, but it's fast, doesn't break when file locking breaks, and
doesn't require extra configuration.

-alain
------- Comment #11 From 2004-08-25 18:36:55 -------
Because of how it will be used, there is _no_ reason to use an NFS filesystem
for the state file.  /tmp will work just fine.

Joe
------- Comment #12 From 2004-08-26 11:23:13 -------
Someone inevitably will put the file onto NFS.  I've debugged a number of
problems when I found that various state files for Globus or Condor had been
placed on NFS and things went wrong when file locking started acting up.  Just
something to keep in mind.
------- Comment #13 From 2004-08-26 12:54:18 -------
Subject: Re:  Globus XIO should be smarter about selecting a port

Either way, even if the file becomes corrupted, things will still
work. The contents are validated and will be ignored if they are invalid.
When the file is updated again, it will be back in working order.
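
A validation step of this kind could be as small as the sketch below. The function name and the file format (a decimal port number, optionally followed by a newline) are assumptions made for illustration, not details taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical validator: return the port stored in text if it is a decimal
 * number inside [min_port, max_port], otherwise -1 so that the caller can
 * ignore the file contents and fall back to min_port. */
int parse_state_port(const char *text, int min_port, int max_port)
{
    char *end;
    long value;

    assert(text != NULL);

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || errno == ERANGE) {
        return -1;  /* no digits at all, or the number overflowed */
    }
    if (*end != '\0' && *end != '\n') {
        return -1;  /* unexpected character right after the number */
    }
    if (value < min_port || value > max_port) {
        return -1;  /* outside the configured port range */
    }
    return (int) value;
}
```

With a check like this, a truncated or garbage file costs one fallback to the start of the range and is repaired by the next write.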

Also, some quick digging into the topic shows that Linux's NFS locking
woes have been in check since 2.12 or so.

Joe

------- Comment #14 From 2004-08-26 14:22:06 -------
In our experience, NFS locking woes on Linux still exist. 

It seems like you are sold on this method, so we'll live with it. Thanks for 
all of your effort--we really appreciate it! 

We'll do our best to do some testing of it in the field--which version do you 
prefer we test with?

-alain
------- Comment #15 From 2004-08-26 14:28:22 -------
I am, sorry ;-)

Test with 3.2.1 and the unofficial update package at:

ftp://ftp.globus.org/pub/gt3/3.2/contrib/globus_xio-0.9-range-state.tar.gz

Joe
------- Comment #16 From 2004-09-02 11:02:25 -------
*** Bug 1908 has been marked as a duplicate of this bug. ***
------- Comment #17 From 2004-12-09 09:25:32 -------
We just updated our firewall with the latest release from Cisco which has a
patch to fix their part of the problem.  I tested it and verified that it now
correctly expires the stale state after the configurable timeout setting, even
with new syn packets trying to get through that are using the same IP/port
pairs.  This update, combined with the globus patch, should greatly reduce the
chance of reusing ports too quickly, which causes problems with stateful firewalls.
Have the globus patches been tested and integrated into official releases,
including VDT?
------- Comment #18 From 2004-12-09 11:19:45 -------
4.0 will be the first official release to have the STATE_FILE patch.