Bugzilla – Bug 4197
WS GRAM integration in OSG 0.4.x and 0.6.x
Last modified: 2006-06-15 09:17:26
Title: WS GRAM integration in OSG 0.4.x and 0.6.x

Definition:
GT 4.0.1 WS GRAM is in VDT and subsequently in OSG 0.4.0. The WS GRAM service is not deployed and available by default in OSG 0.4.0; it is available as an optional deployment.

In order to test WS GRAM sufficiently for OSG's liking, Frank Wuerthwein has tasked Brian Bockelman (Student, U of Nebraska) to install OSG 0.4.0 and deploy WS GRAM for testing and evaluation. The plan is to test Condor-G submitting a real OSG application with a real workload to the deployed WS GRAM service, and possibly to try out alternate service configurations (e.g. GridFTP running on a separate service host).

After the testing and evaluation process has completed successfully, other OSG sites might be asked to deploy WS GRAM too. After successful use from these deployments, WS GRAM can then be considered as a required OSG service (meaning it will be deployed by default). OSG will continue to deploy Pre-WS GRAM as well.

VDT OSG target release dates
================

OSG 0.6
----------
July 15 is the target date for the OSG 0.6 release
June 1 is the date when final release testing will begin
May 1 is when the set of required services is decided and ITB testing begins

upshot - WS GRAM needs to be ready to go by May 1 to make 0.6.0.

OSG 0.4.1
------------
April 1-15 Release of OSG 0.4.1 (based on VDT 1.3.10)
March 15th Release of VDT 1.3.10

upshot - WS GRAM needs to be ready to go by Feb 15th to make 0.4.1 as a required service.

Deliverables:
1) Approved/certified version of WS GRAM for OSG (coming from GT 4.0 community branch)
2) Web page documenting performance results from testing/evaluation

Tasks:
1) Support Nebraska for any installation questions/issues
2) Support Nebraska during the testing and evaluation period
3) Analyze/debug/resolve any issues
4) Make improvements as necessary
Here are the things we have improved so far in working with large CRAB runs at UNL:

1) Fixed a recovery bug.
2) Updated to the latest WS-GRAM globus_4_0_branch code. This improved error reporting, making it easier to diagnose problems.
3) Updated to the latest RFT globus_4_0_branch code. This fixed some problems with out-of-order transfers and hung GridFTP control channels.
4) Improved GRAM job queue utilization. This allows for faster response times for simple fork jobs. It also improves performance in general because relatively fast events aren't stuck behind relatively slow staging events.
5) Implemented local transport to RFT. This improved container responsiveness since it essentially removed RFT callouts from the list of total connection attempts at any particular time, leaving Condor-G with more available threads to submit jobs.

Current status:

CRAB runs up to about 2000 jobs are completing for the most part. Occasionally a job is left unsubmitted or is held because of an error.

CRAB runs above 2000 jobs aren't faring so well. There's an issue where jobs seem to be lost in the state machine. I can't find any direct evidence (perhaps partly because the log files get so big it's hard to find anything out of the ordinary), but this may be a delegation problem. Close to the end of the run (i.e. when no more activity is observed in the container), a lot of errors are generated pertaining to delegated credential resources that can't be found. This is most likely because Condor-G failed to set the lifetime of the delegated credentials to a period long enough to accommodate the long execution time of the CRAB run.

Once the Condor guys fix this, we can resume testing to see if that helps stability. If it does, we need to figure out a better way to handle the situation so that it's easier to identify and doesn't simply result in lost jobs. If not, obviously we'll have to keep looking for the cause.
I fixed a bug, and the fix seems to have eliminated the problem of jobs seemingly disappearing from the state machine. In fact, there was evidence of this in the logs, in the form of a debug message rather than an error or warning. This is unrelated after all to the expiring delegation issue.

Jaime Frey said that he is testing the latest Condor-G code to make sure it's doing what it's supposed to in terms of refreshing the delegated credential. Meanwhile, Brian is installing the latest available Condor-G release, since it contains fixes that may have an effect on the delegated credential issue. We should have some results by week's end at the latest.
I'm adding bug #3121 as a dependency since a large number of jobs seem to be failing due to problems connecting to GridFTP. This bug pertains to the broken check on the maximum number of transfers that are active at any one time. Hopefully limiting the number of transfers will reduce the traffic between the two machines involved and prevent connection timeouts.
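To illustrate the kind of throttling meant here, below is a minimal sketch of capping the number of simultaneously active transfers with a semaphore. It is not RFT code and does not reflect how RFT implements its limit; `TransferThrottle`, `fake_transfer`, `demo` and the cap of 4 are hypothetical names and values chosen only for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class TransferThrottle:
    """Allow at most max_active transfers to run at once (illustrative only)."""

    def __init__(self, max_active):
        self._slots = threading.BoundedSemaphore(max_active)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of concurrently active transfers seen

    def run(self, transfer, *args):
        # Block until a slot is free, then run the transfer callable.
        with self._slots:
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            try:
                return transfer(*args)
            finally:
                with self._lock:
                    self._active -= 1


def fake_transfer(transfer_id):
    # Stand-in for a real GridFTP transfer; it just sleeps briefly.
    time.sleep(0.01)
    return transfer_id


def demo(n_transfers=20, max_active=4):
    throttle = TransferThrottle(max_active)
    # Many more worker threads than slots, so the throttle is what limits concurrency.
    with ThreadPoolExecutor(max_workers=n_transfers) as pool:
        results = list(
            pool.map(lambda i: throttle.run(fake_transfer, i), range(n_transfers))
        )
    return results, throttle.peak
```

Calling `demo()` submits 20 fake transfers from 20 threads; the recorded peak should never exceed the configured cap, which is the behavior the broken check is supposed to guarantee.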
There were Condor-G issues in 6.7.14 with refreshing a delegation for long CRAB runs (> 12 hours, I think). It was hoped that a new version, Condor-G 6.7.17, would fix this, but Brian backed it out because it caused problems for pre-WS GRAM users on the same machine ([condor-admin #13508]). Jaime is helping Brian set up two Condor installs: one for pre-WS jobs and one for experimental use like this WS GRAM testing.

At the same time, the CRAB application is no longer stable and cannot be used for testing at the moment and, according to Brian, maybe not for a while. We are stepping back to a dummy stage-in-sleep-stage-out job (condor-g-ws-test-sleep-io). Initial testing with a 4 MB input file and a 10 MB output file produced RFT/GridFTP errors. We will reduce the file sizes to see where the breaking point is and then report back. During this run there were also Condor job execution errors that may have been caused by NFS. Brian has changed an NFS setting, hopefully avoiding the problem. We'll see.

Here are the current action items:

1) Brian: Get Condor-G version 6.7.17 running just for WS GRAM jobs on osg-test2.
2) Peter: Adjust the input file size to 0.5 MB for the condor-g-ws-test-sleep-io test.
3) Peter: Adjust the output file size to 1 MB for the condor-g-ws-test-sleep-io test.
4) Peter: Run a 3500-job test run with condor-g-ws-test-sleep-io.

Items 2 and 3 are necessary because of the RFT bug (#3121) that prevents throttling the number of active transfers. After getting the above test to work, we can worry about getting RFT patched and finding a reasonable setting to avoid the GridFTP server timeouts.
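For reference, a small sketch of how test payloads of a chosen size could be generated while hunting for the breaking point. This is not the actual condor-g-ws-test-sleep-io setup; `make_test_file` and `halving_sizes` are hypothetical helpers, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
import os

MB = 1024 * 1024  # assumption: binary megabytes


def make_test_file(path, size_bytes, chunk=64 * 1024):
    """Write size_bytes of random data to path and return the resulting file size."""
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))
            remaining -= n
    return os.path.getsize(path)


def halving_sizes(start_bytes, floor_bytes):
    """List sizes from start_bytes downward, halving each step, stopping below floor_bytes."""
    sizes = []
    size = start_bytes
    while size >= floor_bytes:
        sizes.append(size)
        size //= 2
    return sizes
```

For example, `halving_sizes(4 * MB, MB // 2)` gives the candidate input sizes between the 4 MB that failed and the 0.5 MB in the action items, and `make_test_file` produces a file of each size for a run.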
Tasks 1 through 3 of the latest action list are done (thanks Brian!). I'm running a 500-job test right now just to shake everything down. If that runs cleanly I'll bump it up to 3500.
I haven't had a chance to track down the problems I was having with larger runs. Here's the email I wrote regarding this problem on March 31st:

>> I was able to run a 500-job test, but the 3500-job test had problems. It
>> got hung up at one point waiting for the fileCleanUp transfer request
>> before being destroyed. I looked into it and if I'm not mistaken it
>> looks like RFT lost the request. Here's the line where GRAM registers
>> the transfer request with the RFT notification listener thread:
>>
>> 2006-03-31 12:41:49,653 DEBUG exec.StagingListener [RunQueue
>> FileCleanUp,registerTransferJob:104]
>> [execJobKey:72193bb0-c0b9-11da-9560-fa7a7cd61e10,transferJobKey:
>> 150037]
>> Leaving registerTransferJob()
>>
>> From this I determined that the request ID is "150037" ("transferJobKey"
>> is my name for the RFT request id--don't worry, I checked for sanity's
>> sake that this wasn't a transfer ID too). I then did a search in the
>> MySQL database for this request:
>>
>> mysql> select * from transfer where request_id=150037;
>> Empty set (0.12 sec)
>>
>> mysql> select * from request where id=150037;
>> Empty set (0.00 sec)
>>
>> RFT doesn't appear to delete the database records, so even if the
>> request was destroyed the record should still be there. That said, I
>> also don't see any deliver() calls for that request. Unfortunately I
>> didn't have debugging turned on for RFT, so I'll have to restart the
>> test.
>>
>> Peter
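Since the log files get large, a sketch of pulling the request ID out of a log line like the one quoted above, and building the two lookups from the email, may be useful. The regular expression is an assumption based only on that single log line, and `extract_keys` and `lookup_queries` are hypothetical helper names; the table and column names are the ones used in the quoted queries.

```python
import re

# Assumed pattern, derived from the one log line quoted in the email.
KEY_PATTERN = re.compile(
    r"execJobKey:(?P<exec_key>[0-9a-f-]+),\s*transferJobKey:\s*(?P<transfer_key>\d+)"
)


def extract_keys(log_text):
    """Return (execJobKey, transferJobKey) pairs found in the log text."""
    # Collapse line wraps so a key split across lines still matches.
    flat = " ".join(log_text.split())
    return [
        (m.group("exec_key"), int(m.group("transfer_key")))
        for m in KEY_PATTERN.finditer(flat)
    ]


def lookup_queries(request_id):
    """Build the two MySQL lookups from the email for a given RFT request ID."""
    rid = int(request_id)  # force an integer so nothing else ends up in the SQL
    return [
        f"select * from transfer where request_id={rid};",
        f"select * from request where id={rid};",
    ]


SAMPLE = (
    "2006-03-31 12:41:49,653 DEBUG exec.StagingListener [RunQueue\n"
    "FileCleanUp,registerTransferJob:104]\n"
    "[execJobKey:72193bb0-c0b9-11da-9560-fa7a7cd61e10,transferJobKey:\n"
    "150037]\n"
    "Leaving registerTransferJob()\n"
)
```

Running `extract_keys(SAMPLE)` on the quoted line yields the job key together with request ID 150037, and `lookup_queries(150037)` reproduces the two queries shown in the email.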
Hello,

I have talked to Stu Martin, who said that follow-up on this problem since April 28th (see the bottom of this Bugzilla report) has been superseded by a higher-priority project. Stu also said that he hopes to get some effort available in the next month or so. Is this a correct assessment? When I read the last posting on this Bugzilla thread I thought there would be continuing follow-up, and I was worried that OSG had failed to be available as needed.

Thank you
Ruth
I am marking this campaign as closed. Another campaign bug 4506 has been created to focus the continuing effort in this area. -Stu
I'm increasing the priority of the replacement campaign aimed at this issue, 4050.
Oops! I mean the priority of campaign 4506 is being increased!