Bugzilla – Bug 4927
RFT reliability
Last modified: 2007-06-05 09:57:04
RFT can become more reliable by improving the fault tolerance of the tool. Right now, each instance of RFT has its own database and runs in its own container. If a particular container or database goes down, any transfers that RFT instance was processing are stopped. A more fault-tolerant model would allow another RFT instance to pick up the requests that had been running on the downed instance and continue processing them.
Created an attachment (id=1147) [details] Changes to RFT code to support enhancement
Created an attachment (id=1148) [details] Changes to Persistent Subscription code to support enhancement
Created an attachment (id=1149) [details] Changes to Delegation Resource code
Created an attachment (id=1150) [details] Changes to Transfer Work code
Created an attachment (id=1151) [details] Changes to Transfer Client code
Created an attachment (id=1152) [details] New code for RFT
Created an attachment (id=1153) [details] New code for fault tolerant RFT
Created an attachment (id=1154) [details] Changes to Resource Home Impl code
Created an attachment (id=1155) [details] Changes to Reliable File Transfer Impl code
Created an attachment (id=1156) [details] Changes to Reliable File Transfer DB code
Created an attachment (id=1157) [details] Changes in the database that are needed to support code
The files submitted contain the code developed at NCSA to support the fault tolerance described in this enhancement. The assumption is that all of the RFT instances share a MySQL database. The initial code uses a single database for all instances, but in the future it is hoped that the code can be moved to a more fault-tolerant DB solution, possibly either a clustered DB or a replicated DB.

When the new RFT code starts up with a container, a timer is started. This timer is used to check the status of running transfers. Right now the code is set to check every 30 seconds; this can be changed and probably should be moved to a JNDI file for user configuration. A side benefit of this timer is the ability to recover from a downed database. If the database is taken offline for some reason but the container continues running, each time the timer fires it checks the DB connection. When the DB comes back online, the timer reestablishes the connection.

Under normal operation the timer performs a query on the database to see if any transfers are stalled. This is accomplished by adding a new field to the restart table to hold the last time the row was touched. An assumption is made at this point that if X amount of time has passed since the row was last touched and the request is still active, then the transfer is stalled. At this point another container can pick up this request and begin to process it. A container ID field has also been added to the request table. This is to prevent containers from picking up each other's requests. The stall time X is hard-coded at this point and should be moved to a JNDI file for user configuration.

In order to fully facilitate this failover, changes needed to be made to the Delegation Service (DS) and the handling of notifications. If another container picks up a stalled transfer, it needs to know about the DS information. In current Globus code, the DS is persisted to a flat file. To allow other containers access to this information, DS has been changed to persist itself to a DB table. The same is true for the notification code. In order for notifications to get back to the correct location, a new container needs to know the subscription information. As with DS, notifications are currently persisted to a flat file. Under the new code, notifications are persisted to DB tables.

The code that has been submitted to handle the DS and notification changes needs some changes. The current code has hard-coded connections to the database that was used for development. These hard-coded connections need to be removed and made configurable like the rest of the Globus DB code.
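
The stalled-transfer check described above can be illustrated with a small sketch. This is not the submitted NCSA code: it models the shared request/restart rows in memory instead of querying MySQL, and every name in it (TransferRow, StalledTransferScanner, scanAndClaim, the field names) is hypothetical. It also assumes one particular reading of the container ID field, namely that a container skips rows it already owns and only takes over rows owned by another container once they have gone untouched for longer than the stall time.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for one row of the shared request/restart tables. */
final class TransferRow {
    final int requestId;
    String containerId;   // container currently responsible for the request (assumed column)
    Instant lastTouched;  // last time the row was touched (the new field described above)
    boolean active;       // true while the request has not finished

    TransferRow(int requestId, String containerId, Instant lastTouched, boolean active) {
        this.requestId = requestId;
        this.containerId = containerId;
        this.lastTouched = lastTouched;
        this.active = active;
    }
}

/** Sketch of the work one timer tick would do in a single container. */
final class StalledTransferScanner {
    private final String myContainerId;
    private final Duration stallThreshold; // the "X amount of time"; configurable here by assumption

    StalledTransferScanner(String myContainerId, Duration stallThreshold) {
        this.myContainerId = myContainerId;
        this.stallThreshold = stallThreshold;
    }

    /** Stalled: still active, owned by another container, untouched for longer than the threshold. */
    boolean isStalled(TransferRow row, Instant now) {
        return row.active
                && !myContainerId.equals(row.containerId)
                && Duration.between(row.lastTouched, now).compareTo(stallThreshold) > 0;
    }

    /** Claims every stalled row for this container and returns the claimed request IDs. */
    List<Integer> scanAndClaim(List<TransferRow> rows, Instant now) {
        List<Integer> claimed = new ArrayList<>();
        for (TransferRow row : rows) {
            if (isStalled(row, now)) {
                row.containerId = myContainerId; // take ownership
                row.lastTouched = now;           // so other containers do not claim it as well
                claimed.add(row.requestId);
            }
        }
        return claimed;
    }
}
```

In a real deployment scanAndClaim would run from the 30-second timer, and the claim would presumably need to be a single conditional UPDATE against the shared database so that two containers cannot take over the same request at the same moment. That detail is an assumption of this sketch and is not described in the comment above.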
1) Would it be possible for you to post unified diffs, rather than drop-in replacements?

2) The DelegationService code you present appears to be using file persistence - where's the database code?

3) There would seem to be problems for long-running transfers that require proxy refreshes - the client would not know that the delegation resource in use for the transfer had moved. That is, the problem of "fault tolerant Delegation Service" seems to still exist in this scenario unless one has done some kind of DNS/load-balanced IP readdressing to the second container?

I'm just commenting as an interested observer - Ravi is already on vacation and should be back in January.
(In reply to comment #13)
> 3) There would seem to be problems for long-running transfers that require
> proxy refreshes - the client would not know that the delegation resource in use
> for the transfer had moved. That is, the problem of "fault tolerant Delegation
> Service" seems to still exist in this scenario unless one has done some kind of
> DNS/load-balanced IP readdressing to the second container?

Right, this code would work best in combination with a front-end load-balancer and a back-end highly available (clustered) database. Since the DS resources are shared across the instances via the database, the client could contact any of the containers to refresh the credential. However, my understanding is that Patrick doesn't yet have the DS refresh callback working when the request fails over to another container. (Patrick, correct me if I'm wrong.)

Patrick is working on improvements, so please keep the comments/suggestions coming.
(In reply to comment #14)
> Right, this code would work best in combination with a front-end load-balancer

Alternatively, WS-Naming could be used to locate the moved resources.
Created an attachment (id=1174) [details] DS changes
Changes made to the DS code to support new RFT code.
Created an attachment (id=1175) [details] RFT changes
Changes made to RFT code to support new failover functionality.
Created an attachment (id=1176) [details] WSRF subscription changes
Changes made to the PersistentSubscription code in support of new RFT code.
Added files that contain cvs diff -u output. Again, it should be noted that PersistentSubscription and ReliableFileTransferResource have hard-coded calls to an NCSA test database. This will be corrected and the connection settings moved to a JNDI file.
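
As a rough illustration of what moving the hard-coded database calls into configuration could look like, here is a minimal sketch. It uses java.util.Properties as a stand-in for the JNDI file mentioned above, and the class name DbSettings and the rft.db.* keys are hypothetical. It is not the Globus DB configuration code and does not reflect how the eventual patches do it.

```java
import java.util.Properties;

/** Hypothetical holder for database connection settings read from configuration. */
final class DbSettings {
    final String driver;
    final String url;
    final String user;
    final String password;

    private DbSettings(String driver, String url, String user, String password) {
        this.driver = driver;
        this.url = url;
        this.user = user;
        this.password = password;
    }

    /** Builds settings from properties; the key names are assumptions made for this sketch. */
    static DbSettings fromProperties(Properties props) {
        return new DbSettings(
                require(props, "rft.db.driver"),
                require(props, "rft.db.url"),
                require(props, "rft.db.user"),
                props.getProperty("rft.db.password", "")); // an empty password is allowed
    }

    /** Fails fast when a required setting is missing or blank. */
    private static String require(Properties props, String key) {
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required setting: " + key);
        }
        return value.trim();
    }
}
```

With something along these lines, classes such as PersistentSubscription would receive a DbSettings instance instead of embedding the address of the test database, so pointing a container at a different database would not require a code change.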
We're presenting a paper on this work at TeraGrid '07 (http://www.union.wisc.edu/teragrid07/). The paper is available online at:

http://grid.ncsa.uiuc.edu/dependable/rrft.pdf
http://grid.ncsa.uiuc.edu/depenadble/rrft.doc

Comments welcome.
Patrick has submitted separate bugs for modifications to the different services:

Bug 5255 - RFT
Bug 5256 - Delegation Service
Bug 5257 - GT Core Persistence API

The latest patches can be found there.