Bugzilla – Bug 1851
Globus XIO should be smarter about selecting a port
Last modified: 2004-12-09 11:19:45
When given a port range (via GLOBUS_TCP_PORT_RANGE), Globus I/O tries each
port in order, from the beginning. You can see that here in
if(port == 0)
port = min_port;
max_port = port;
done = GLOBUS_FALSE;
GlobusLibcSockaddrCopy(myaddr, *addr, addr_len);
(struct sockaddr *) &myaddr,
GlobusLibcSockaddrLen(&myaddr)) < 0)
if(++port > max_port)
result = GlobusXIOErrorSystemError("bind", errno);
done = GLOBUS_TRUE;
When encountering a stateful firewall, this can cause problems, as described
by Jason Smith to the VDT team:
----- Begin Quote
We were having a problem with a new firewall that we recently installed
and eventually tracked down the cause. The problem happens when a tcp
connection doesn't get closed properly, maybe because one side crashed
or some kind of network interruption. Since we have a stateful firewall
and it doesn't see the connection being torn down, it is blocking new
connection attempts (syn packets) with the same IP & port pairs, until
this stale connection times out in the state engine of the firewall,
which has to be fairly long so it doesn't break legitimate idle connections.
The problem is made many times worse by the fact that globus uses this
defined port range, which is there for the sole purpose of letting
globus know about conduits through your firewall, and the fact that
globus searches through this port range sequentially starting from the
beginning, every time it needs a port. Because of this, ports in the
beginning of this range have a very high probability of being reused
over and over again, and if one of these doesn't get closed properly, it
will continue to be reused over and over since the connections will
never make it through the firewall, causing many failed globus
The globus I/O library should not be reusing these same ports all the
time. It would be better if it could somehow remember the last port it
used and continue searching from there for a port to use next time, like
the linux kernel does. You can essentially think of the problem as
being caused by the firewall being stateful, while globus is not. If
making globus stateful so it can remember the last port selected is too
hard, an easy solution might be to randomize the selection of the port
from this range. This would reduce the chance of reusing a recently
used port, thus allowing the state engine in the firewall time to expire
stale connections that weren't closed properly.
----- End Quote
I understand the difficulty in remembering the last port, since we are usually
talking about distinct processes, like job managers.
Here is a proposal: you could choose a random port in the range (make sure to
seed the random number generator first), then search to the end of the range,
then wrap around until you've gone through the whole range.
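A minimal sketch of this first proposal, using plain POSIX sockets rather than the Globus XIO internals. All function names here (next_port_wrapped, seed_port_rng, random_start_port, bind_from_start) are hypothetical and are not part of Globus; the sketch also assumes IPv4 and INADDR_ANY.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: advance one port, wrapping from max_port to min_port. */
int next_port_wrapped(int port, int min_port, int max_port)
{
    return (port >= max_port) ? min_port : port + 1;
}

/* Hypothetical helper: seed rand() once per process. Mixing in the pid
 * keeps processes started within the same second from sharing a seed. */
void seed_port_rng(void)
{
    srand((unsigned int) time(NULL) ^ (unsigned int) getpid());
}

/* Hypothetical helper: pick a starting port in [min_port, max_port].
 * rand() % n has a slight modulo bias, which is harmless for this purpose. */
int random_start_port(int min_port, int max_port)
{
    assert(min_port <= max_port);
    return min_port + rand() % (max_port - min_port + 1);
}

/* Try each port in the range exactly once, beginning at start and
 * wrapping around. Returns the bound port, or -1 if every bind() failed
 * (errno is left over from the last attempt). */
int bind_from_start(int fd, int min_port, int max_port, int start)
{
    struct sockaddr_in addr;
    int port = start;

    assert(min_port <= start && start <= max_port);
    do {
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons((unsigned short) port);
        if (bind(fd, (struct sockaddr *) &addr, sizeof addr) == 0)
            return port;
        port = next_port_wrapped(port, min_port, max_port);
    } while (port != start);
    return -1;
}
```

A caller would run seed_port_rng() once at startup and then call bind_from_start(fd, lo, hi, random_start_port(lo, hi)) for each socket. Like the existing loop, this moves on to the next port after any bind() failure, not only after EADDRINUSE.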
Here is another proposal: you could search the range in a completely random
order. A simple implementation may require an array of the length of the port
range that you can randomize, so perhaps this is inappropriate.
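For comparison, the array-based variant could look roughly like the sketch below, which builds the list of ports and shuffles it with a Fisher-Yates pass. The function name shuffled_port_list is hypothetical, and the caller is assumed to have seeded rand() already.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd array holding every port in
 * [min_port, max_port] in random order, and store its length in *count.
 * Returns NULL if allocation fails. The caller frees the array. */
int *shuffled_port_list(int min_port, int max_port, size_t *count)
{
    size_t n, i;
    int *ports;

    assert(min_port <= max_port);
    n = (size_t) (max_port - min_port) + 1;
    ports = malloc(n * sizeof *ports);
    if (ports == NULL)
        return NULL;

    for (i = 0; i < n; i++)
        ports[i] = min_port + (int) i;

    /* Fisher-Yates shuffle: swap each slot with a random earlier-or-equal
     * slot. rand() % (i + 1) is slightly biased, which is acceptable here. */
    for (i = n - 1; i > 0; i--) {
        size_t j = (size_t) rand() % (i + 1);
        int tmp = ports[i];
        ports[i] = ports[j];
        ports[j] = tmp;
    }

    *count = n;
    return ports;
}
```

The cost is one int per port in the range for every bind attempt, which is the memory concern raised above; the random-start-and-wrap approach needs no extra storage.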
Or perhaps there is another good solution. We at least wanted to let you know
that this problem exists, and see what you think about fixing it.
-alain of the VDT team
This isn't difficult at all. I will release an update to gt 3.2.1 soon.
*** Bug 1849 has been marked as a duplicate of this bug. ***
You rock, Joe.
>This isn't difficult at all. I will release an update to gt 3.2.1 soon.
Here's an interesting idea. This is what one of the Condor developers told
me about how it is done in Condor:
>In Condor, we do not search from the beginning every time. Instead, the
>port we start "probing" at is determined via taking the current uTime of
>day and the port range and hashing it. Specifically:
>int start_trial = low_port +
> (curTime.tv_usec * 73/*some prime number*/ % range);
I don't know if it's better than randomness or not. I'll let you think
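As a self-contained illustration of the quoted expression (not Condor's actual source), the hash can be wrapped in a function. The names hashed_start and hashed_start_now are hypothetical, and range is assumed here to mean high_port - low_port + 1, which the quote does not spell out.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical helper: map a microsecond value onto a starting port in
 * [low_port, high_port] using the multiply-by-73-and-reduce hash quoted
 * above. Assumption: range = high_port - low_port + 1. */
int hashed_start(long usec, int low_port, int high_port)
{
    long range = (long) high_port - low_port + 1;

    assert(range > 0 && usec >= 0);
    /* tv_usec is below 1,000,000, so usec * 73 fits in a 32-bit long. */
    return low_port + (int) ((usec * 73L) % range);
}

/* Hypothetical helper: same as above, fed from the current time of day. */
int hashed_start_now(int low_port, int high_port)
{
    struct timeval now;

    gettimeofday(&now, NULL);
    return hashed_start((long) now.tv_usec, low_port, high_port);
}
```

With a range of 50000-50099, a tv_usec of 1 maps to port 50073 and a tv_usec of 2 maps to 50046. Unlike rand(), this needs no seeding step, but two processes that sample the same microsecond get the same starting port.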
An update that can be applied to 3.2.1 (or 3.2.0 with all the other updates) is
simply do $GPT_LOCATION/sbin/gpt-build -verbose -force <flavor> <package>
The first official release of this will have to be 4.0, as it is a bit more
involved than something I am willing to add to a stable release.
In conjunction with the GLOBUS_TCP_PORT_RANGE env variable, set the new
GLOBUS_TCP_PORT_RANGE_STATE_FILE env variable to point to a location for a state
file (it doesn't need to exist). All processes sharing the port range specified in
GLOBUS_TCP_PORT_RANGE should have this env variable set to the same file.
The file is protected with POSIX advisory locking and is updated with the last
port used. The next process tries to bind that last port + 1. If that
fails, it keeps incrementing the port number (wrapping at the top of the port
range) until it successfully binds or reaches the starting port number again.
This is just an initial attempt at this. Later, I may track each port in use
to further prevent collisions when application lifetime widely varies.
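The state-file idea can be sketched roughly as follows with fcntl() advisory locks. This is not the code in the update package: the function name next_start_port and the file format (one decimal port number as text) are assumptions made for illustration, and the sketch records the port it hands out rather than the port that was eventually bound.

```c
#define _POSIX_C_SOURCE 200809L  /* for ftruncate() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read the last port from the state file at path,
 * return the next one (wrapping at max_port), and store it back.
 * Missing, empty, or out-of-range contents are ignored and the search
 * restarts at min_port. Assumes min_port > 0. */
int next_start_port(const char *path, int min_port, int max_port)
{
    struct flock lock;
    char buf[32];
    ssize_t len;
    long last;
    int fd, start, n;

    assert(0 < min_port && min_port <= max_port);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return min_port;            /* no usable state: fall back */

    /* Exclusive advisory lock on the whole file (l_start = l_len = 0). */
    memset(&lock, 0, sizeof lock);
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    if (fcntl(fd, F_SETLKW, &lock) < 0) {
        close(fd);
        return min_port;
    }

    len = read(fd, buf, sizeof buf - 1);
    if (len < 0)
        len = 0;
    buf[len] = '\0';
    last = strtol(buf, NULL, 10);   /* garbage parses as 0, which is invalid */

    if (len == 0 || last < min_port || last >= max_port)
        start = min_port;           /* invalid, or wrap past max_port */
    else
        start = (int) last + 1;

    n = snprintf(buf, sizeof buf, "%d\n", start);
    if (lseek(fd, 0, SEEK_SET) == 0 && ftruncate(fd, 0) == 0 &&
        write(fd, buf, (size_t) n) != (ssize_t) n) {
        /* Best effort: a bad file is revalidated by the next reader. */
    }

    close(fd);                      /* closing the fd releases the lock */
    return start;
}
```

The returned port would then be used as the starting point of the bind loop. Because invalid contents fall back to min_port, a corrupted file degrades to the old sequential behaviour instead of failing.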
Would it be possible to backport this patch and apply it to GT2 in the VDT? Or
even a simple patch implementing the random or time hashing method like you
suggested? Anything other than the linear search that always starts from the
beginning would greatly reduce the probability of having connection errors like
we are seeing now.
It was straightforward enough for me to at least provide a patch to globus io.
It's attached to bug 1850. As it says, it's untested. Let me know if you have
Also, I made a minor change to the update package above that fixes a potential
problem with threaded builds.
Has anyone attempted using this feature? Is it working for you?
Still waiting on some feedback here.
Sorry Joe--I had sent comments in the past, but as it often seems to happen,
Bugzilla ignores my email. I'll try typing this into the web form to see if
I haven't had a chance to test the patch yet--sorry!
That said, I worry about patches that add state via file and use file locking.
People often forget that file locking is broken when using Linux's NFS, and
this ends up failing for odd reasons.
What did you think about the method that is used in Condor? It doesn't rely on
files or file locking. It may not be as successful when everything (file
locking) works, but it's fast, doesn't break when file locking breaks, and
doesn't require extra configuration.
Because of how it will be used, there is _no_ reason to use an NFS filesystem
for the state file. /tmp will work just fine.
Someone inevitably will put the file onto NFS. I've debugged a number of
problems when I found that various state files for Globus or Condor had been
placed on NFS and things went wrong when file locking started acting up. Just
something to keep in mind.
Either way, even if the file becomes corrupted, things will still
work. The contents are validated and will be ignored if they are invalid.
When the file is updated again, it will be back in working order.
Also, some quick digging into the topic shows that Linux's NFS locking
woes have been in check since 2.12 or so.
In our experience, NFS locking woes on Linux still exist.
It seems like you are sold on this method, so we'll live with it. Thanks for
all of your effort--we really appreciate it!
We'll do our best to do some testing of it in the field--which version do you
prefer we test with?
I am, sorry ;-)
Test with 3.2.1 and the unofficial update package at:
*** Bug 1908 has been marked as a duplicate of this bug. ***
We just updated our firewall with the latest release from Cisco which has a
patch to fix their part of the problem. I tested it and verified that it now
correctly expires the stale state after the configurable timeout setting, even
with new syn packets trying to get through that are using the same IP/port
pairs. This update, combined with the globus patch should greatly reduce the
chance of reusing ports too quickly, causing problems with stateful firewalls.
Have the globus patches been tested and integrated into official releases,
4.0 will be the first official release to have the STATE_FILE patch.