Bug 4927 - RFT reliability
Status: NEW
: RFT
RFT
: development
: All All
: P3 enhancement
: ---
Assigned To:
: 5255 5256 5257
Reported: 2006-12-21 10:39 by
Modified: 2007-06-05 09:57 (History)


Attachments
Changes to RFT code to support enhancement (32.37 KB, text/plain)
2006-12-21 10:42, Patrick Duda
Changes to Persistent Subscription code to support enhancement (20.25 KB, text/plain)
2006-12-21 10:43, Patrick Duda
Changes to Delegation Resource code (18.02 KB, text/plain)
2006-12-21 10:44, Patrick Duda
Changes to Transfer Work code (28.11 KB, text/plain)
2006-12-21 10:45, Patrick Duda
Changes to Transfer Client code (29.69 KB, text/plain)
2006-12-21 10:46, Patrick Duda
New code for RFT (3.10 KB, text/plain)
2006-12-21 10:47, Patrick Duda
New code for fault tolerant RFT (2.69 KB, text/plain)
2006-12-21 10:48, Patrick Duda
Changes to Resource Home Impl code (13.69 KB, text/plain)
2006-12-21 10:50, Patrick Duda
Changes to Reliable File Transfer Impl code (14.62 KB, text/plain)
2006-12-21 10:51, Patrick Duda
Changes to Reliable File Transfer DB code (118.56 KB, text/plain)
2006-12-21 10:51, Patrick Duda
Changes in the database that are needed to support code (3.97 KB, text/plain)
2006-12-21 11:10, Patrick Duda
DS changes (9.15 KB, patch)
2007-01-26 11:12, Patrick Duda
RFT changes (77.23 KB, patch)
2007-01-26 11:13, Patrick Duda
WSRF subscription changes (20.10 KB, patch)
2007-01-26 11:14, Patrick Duda



Description From 2006-12-21 10:39:41
RFT can become more reliable by improving the fault tolerance of the tool. 
Right now, each instance of RFT has its own database and is running in its own
container.  If a particular container or database goes down, then any transfers
that RFT instance was processing will be stopped.  A more fault tolerant model
would allow for another RFT instance to pick up the requests that had been
running on the downed instance and continue to process them.
------- Comment #1 From 2006-12-21 10:42:48 -------
Created an attachment (id=1147) [details]
Changes to RFT code to support enhancement
------- Comment #2 From 2006-12-21 10:43:59 -------
Created an attachment (id=1148) [details]
Changes to Persistent Subscription code to support enhancement
------- Comment #3 From 2006-12-21 10:44:51 -------
Created an attachment (id=1149) [details]
Changes to Delegation Resource code
------- Comment #4 From 2006-12-21 10:45:37 -------
Created an attachment (id=1150) [details]
Changes to Transfer Work code
------- Comment #5 From 2006-12-21 10:46:19 -------
Created an attachment (id=1151) [details]
Changes to Transfer Client code
------- Comment #6 From 2006-12-21 10:47:44 -------
Created an attachment (id=1152) [details]
New code for RFT
------- Comment #7 From 2006-12-21 10:48:41 -------
Created an attachment (id=1153) [details]
New code for fault tolerant RFT
------- Comment #8 From 2006-12-21 10:50:21 -------
Created an attachment (id=1154) [details]
Changes to Resource Home Impl code
------- Comment #9 From 2006-12-21 10:51:14 -------
Created an attachment (id=1155) [details]
Changes to Reliable File Transfer Impl code
------- Comment #10 From 2006-12-21 10:51:48 -------
Created an attachment (id=1156) [details]
Changes to Reliable File Transfer DB code
------- Comment #11 From 2006-12-21 11:10:28 -------
Created an attachment (id=1157) [details]
Changes in the database that are needed to support code
------- Comment #12 From 2006-12-21 11:15:27 -------
The files submitted contain the code developed at NCSA to support the fault
tolerance described in this enhancement.  The assumption made is that all of
the RFT instances will be sharing a MySQL database.  The initial code uses a
single database for all instances, but in the future it is hoped that the code
can be moved to a more fault tolerant DB solution, possibly either a clustered
DB or a replicated DB.

When the new RFT code starts up with a container, a timer is started.  This
timer is used to check the status of running transfers.  Right now the code is
set to check every 30 seconds; this can be changed and probably should be moved
to a JNDI file for user configuration.  A side benefit of this timer is the
ability to recover from a down database.  If the database is taken offline for
some reason but the container continues running, each time the timer fires
it checks the DB connection.  When the DB comes back online, the timer will
reestablish the connection.
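
The timer behaviour described above could be sketched roughly as follows.  This
is only an illustration, not the submitted code: the class name TransferMonitor
and the three callbacks are hypothetical, and the 30 second period is passed in
by the caller rather than hard coded.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: a periodic check that also recovers a lost DB connection. */
final class TransferMonitor {
    private final BooleanSupplier connectionAlive; // e.g. wraps java.sql.Connection.isValid
    private final Runnable reconnect;              // re-opens the DB connection
    private final Runnable checkTransfers;         // looks for stalled transfers
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    TransferMonitor(BooleanSupplier connectionAlive, Runnable reconnect, Runnable checkTransfers) {
        this.connectionAlive = connectionAlive;
        this.reconnect = reconnect;
        this.checkTransfers = checkTransfers;
    }

    /** One timer tick: verify the connection first, then check the transfers. */
    void tick() {
        try {
            if (!connectionAlive.getAsBoolean()) {
                reconnect.run();
                if (!connectionAlive.getAsBoolean()) {
                    return; // DB still down; try again on the next tick
                }
            }
            checkTransfers.run();
        } catch (RuntimeException e) {
            // Swallow the error: an exception escaping a scheduleAtFixedRate task
            // would cancel all later runs, and the timer must keep firing.
        }
    }

    /** Starts the periodic check, e.g. start(30) for the interval mentioned above. */
    void start(long periodSeconds) {
        timer.scheduleAtFixedRate(this::tick, periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    void stop() {
        timer.shutdownNow();
    }
}
```

Catching the exception inside tick() is the important design point: while the
database is down the check simply does nothing, and the first tick after the
database returns reconnects and resumes normal checking.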

Under normal operations the timer will perform a query on the database to see
if any transfers are stalled.  This is accomplished by adding a new field to the
restart table to hold the last time the row was touched.  An assumption is made
at this point that if X amount of time has passed and the request is still
active then the transfer is stalled.  At this point another container can pick
up this request and begin to process it.  A container ID field has also been
added to the request table.  This is to prevent containers from picking up each
other's requests.  The stall time X is hard coded at this point and should be
moved to a JNDI file for user configuration.
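
As a rough illustration of the stall test described above, the sketch below
applies the rule to rows held in memory.  The Row class and its field names are
hypothetical stand-ins for the request and restart table columns, and the rule
that a container only takes over requests owned by a different container is an
assumption about how the container ID is used.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the "stalled transfer" test; not the submitted code. */
final class StalledTransferFinder {

    /** Stand-in for the relevant columns of the request and restart tables. */
    static final class Row {
        final int requestId;
        final String containerId;  // container currently responsible for the request
        final boolean active;      // request has not finished yet
        final Instant lastTouched; // last time the restart row was updated

        Row(int requestId, String containerId, boolean active, Instant lastTouched) {
            this.requestId = requestId;
            this.containerId = containerId;
            this.active = active;
            this.lastTouched = lastTouched;
        }
    }

    /**
     * Returns the IDs of requests that are still active, belong to another
     * container, and have not been touched for longer than the threshold.
     */
    static List<Integer> findStalled(List<Row> rows, String myContainerId,
                                     Instant now, Duration threshold) {
        Instant cutoff = now.minus(threshold);
        List<Integer> stalled = new ArrayList<>();
        for (Row r : rows) {
            boolean ownedByOther = !r.containerId.equals(myContainerId);
            if (r.active && ownedByOther && r.lastTouched.isBefore(cutoff)) {
                stalled.add(r.requestId);
            }
        }
        return stalled;
    }
}
```

In a shared database the takeover itself would also need to be an atomic update
of the container ID, so that two containers noticing the same stalled request
do not both start processing it.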

In order to fully facilitate this failover, changes needed to be made to
Delegation Services and the handling of notifications.  If another container
picks up a stalled transfer it needs to know about the DS information.  In
current Globus code, the DS is persisted to a flat file.  To allow other
containers access to this information, DS has been changed to persist itself
to a DB table.

The same is true for the notification code.  In order for notifications to get
back to the correct location, a new container needs to know this information.
As with DS, notifications are currently being persisted to a flat file.  Under
the new code the notifications are now being persisted to DB tables.

The code that has been submitted to handle the DS and notification changes
still needs some work.  The current code has hard coded connections to the
database that was used for development.  These hard coded connections need to
be removed and made configurable like the rest of the Globus DB code.
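
One possible shape for that configuration is sketched below.  It uses
java.util.Properties as a stand-in for values that would eventually come from a
JNDI file; the class name FailoverSettings and the key names jdbcUrl,
checkIntervalSeconds and stallThresholdSeconds are hypothetical and are not
taken from the Globus code.

```java
import java.util.Properties;

/** Hypothetical sketch: failover settings read from configuration, not hard coded. */
final class FailoverSettings {
    final String jdbcUrl;
    final long checkIntervalSeconds;
    final long stallThresholdSeconds;

    private FailoverSettings(String jdbcUrl, long checkIntervalSeconds,
                             long stallThresholdSeconds) {
        this.jdbcUrl = jdbcUrl;
        this.checkIntervalSeconds = checkIntervalSeconds;
        this.stallThresholdSeconds = stallThresholdSeconds;
    }

    /** Builds the settings, rejecting configurations that omit required values. */
    static FailoverSettings from(Properties p) {
        String url = required(p, "jdbcUrl");
        // 30 seconds matches the check interval currently hard coded in the timer.
        long interval = Long.parseLong(p.getProperty("checkIntervalSeconds", "30"));
        // No default here: the stall time is a deployment decision.
        long threshold = Long.parseLong(required(p, "stallThresholdSeconds"));
        return new FailoverSettings(url, interval, threshold);
    }

    private static String required(Properties p, String key) {
        String value = p.getProperty(key);
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException("missing required setting: " + key);
        }
        return value;
    }
}
```

Failing fast on a missing database URL avoids silently falling back to a
development database, which is the problem with the hard coded connections.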
------- Comment #13 From 2006-12-21 11:30:15 -------
1)  Would it be possible for you to post unified diffs, rather than drop-in
replacements?

2)  The DelegationService code you present appears to be using file persistence
- where's the database code?

3)  There would seem to be problems for long-running transfers that require
proxy refreshes - the client would not know that the delegation resource in use
for the transfer had moved.  That is, the problem of "fault tolerant Delegation
Service" seems to still exist in this scenario unless one has done some kind of
DNS/load-balanced IP readdressing to the second container?

I'm just commenting as an interested observer - Ravi is already on vacation and
should be back in January.
------- Comment #14 From 2006-12-21 22:11:27 -------
(In reply to comment #13)
> 3)  There would seem to be problems for long-running transfers that require
> proxy refreshes - the client would not know that the delegation resource in use
> for the transfer had moved.  That is, the problem of "fault tolerant Delegation
> Service" seems to still exist in this scenario unless one has done some kind of
> DNS/load-balanced IP readdressing to the second container?

Right, this code would work best in combination with a front-end load-balancer
and a back-end highly-available (clustered) database.  Since the DS resources
are shared across the instances via the database, the client could contact any
of the containers to refresh the credential.

However, my understanding is that Patrick doesn't yet have the DS refresh
callback working when the request fails over to another container.  (Patrick,
correct me if I'm wrong.)

Patrick is working on improvements, so please keep the comments/suggestions
coming.
------- Comment #15 From 2006-12-21 22:22:32 -------
(In reply to comment #14)
> Right, this code would work best in combination with a front-end load-balancer

Alternatively, WS-Naming could be used to locate the moved resources.
------- Comment #16 From 2007-01-26 11:12:12 -------
Created an attachment (id=1174) [details]
DS changes

Changes made to the DS code to support new RFT code.
------- Comment #17 From 2007-01-26 11:13:13 -------
Created an attachment (id=1175) [details]
RFT changes

Changes made to RFT code to support new failover functionality.
------- Comment #18 From 2007-01-26 11:14:29 -------
Created an attachment (id=1176) [details]
WSRF subscription changes

Changes made to the PersistentSubscription code in support of new RFT code.
------- Comment #19 From 2007-01-26 11:17:12 -------
Added files that contain cvs diff -u information.

Again, it should be noted that the PersistentSubscription and
ReliableFileTransferResource code has hard coded calls to an NCSA test
database.  This will be corrected and the settings moved to a JNDI file.
------- Comment #20 From 2007-04-06 16:10:18 -------
We're presenting a paper on this work at TeraGrid '07
(http://www.union.wisc.edu/teragrid07/).  The paper is available online at:

  http://grid.ncsa.uiuc.edu/dependable/rrft.pdf
  http://grid.ncsa.uiuc.edu/depenadble/rrft.doc

Comments welcome.
------- Comment #21 From 2007-06-05 09:57:04 -------
Patrick has submitted separate bugs for modifications to the different
services:

  Bug 5255 - RFT
  Bug 5256 - Delegation Service
  Bug 5257 - GT Core Persistence API

The latest patches can be found there.