Bugzilla – Bug 4664
performance improvements for large run job submissions using condor-g
Last modified: 2008-02-04 11:10:54
Title: WS GRAM Performance improvements for 4.0.x

Projects: VDT, OSG, TG

Technologies: WS GRAM

Definition:
Now that large job runs are completing reliably (bug 4506), the next step is to examine and improve performance. A few problems have already been identified. There is over-synchronization in RFT's createReliableTransfer() and start() operations. This appears to be a bottleneck for large job runs with staging.

Deliverables:
1) Updated version of WS GRAM and/or any dependencies where improvements have been made in globus_4_0_branch
2) Updated version of CVS HEAD components where improvements make sense

Tasks:
1) get a baseline for WS GRAM performance for a large run (3500 jobs)
2) get a baseline profile for the time spent in WS GRAM, RFT and any other service used on the service
3) identify code/routines to be improved
4) test improvements during a large job run
5) repeat step 1

Time Estimate: 10 days

Deadline: Sept 1 (in time to be deployed on TG before SC)
We have been working on this for some time. Here's a summary:

The following is a typical snapshot of the container during a test with 3500 job submissions, taken when all 3500 jobs had been accepted.

-------------------------------------------------------------------
JOB STATES                 | #JOBS IN THAT STATE
-------------------------------------------------------------------
Done:                        1502
FileCleanUpResponse:         7
WaitingForStateChanges:      6
StageInResponse:             9
StageOutResponse:            9
StageOut:                    51
StageIn:                     1887
FileCleanUp:                 29
-------------------------------------------------------------------
TOTAL #JOBS:                 3500
-------------------------------------------------------------------

This shows that 3500 jobs had been accepted by the container, 1502 jobs had already finished, 7 resources were in state FileCleanUpResponse, 1887 resources were in state StageIn, and so on. Most of the unfinished resources are in state StageIn.

Looking at the logs, we found that a job resource sometimes stays in the RunQueue stageIn for up to 5 hours before stageIn is started.

The reason is that adding new jobs to the container is much faster than processing resources in state StageIn: more than 100 new resources can be added to the container per minute, but only about 3-9 resources in state StageIn can be processed per minute. As a result, the RunQueue stageIn fills up and resources stay there for a long time before being processed. So the staging process seems to throttle the whole processing of a resource.

1st approach:
#############
Adding more RunThreads for RunQueue stageIn (6 instead of 3).

result:
Throughput in the stageIn queue improved, but the stageOut queue then started filling up, and in total there was no performance improvement.

2nd approach:
#############
Removing synchronization from the methods that create and start an RFT resource.

result:
Surprisingly, removing the synchronization led to worse performance, not better.

Details about the tests can be found here:
http://www-unix.mcs.anl.gov/~feller/tpTesting/index.html
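The backlog in the RunQueue stageIn follows from simple rate arithmetic. The sketch below is only a rough illustration, not a measurement: it assumes constant rates and a single FIFO queue, and the function name last_job_wait_minutes is made up for this example. The rates are the ones quoted in the comment above (about 100 submissions per minute, 3-9 stageIn completions per minute).

```python
def last_job_wait_minutes(total_jobs, arrival_per_min, service_per_min):
    """Approximate queue wait of the last job submitted.

    Assumes constant arrival and service rates and FIFO processing
    (a simplification; the real container behaves less regularly).
    """
    if service_per_min >= arrival_per_min:
        # The queue keeps up with submissions, so no backlog builds.
        return 0.0
    arrived_at = total_jobs / arrival_per_min   # minute the last job is accepted
    started_at = total_jobs / service_per_min   # minute the queue reaches it
    return started_at - arrived_at


if __name__ == "__main__":
    # Rates taken from the observations above; 3500 jobs as in the test run.
    for stage_in_rate in (3, 6, 9):
        wait = last_job_wait_minutes(3500, 100, stage_in_rate)
        print(f"{stage_in_rate} stageIn/min: last job waits about {wait / 60:.1f} hours")
```

Under these assumptions, multi-hour waits for late arrivals are expected whenever the stageIn rate is an order of magnitude below the submission rate.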
CAMPAIGN ADDENDUM

Now that we no longer see any obvious performance problems with this testing, we need to add 2 more deliverables and a few tasks to this campaign.

Deliverables:
3) An update patch for RFT for 4.0.3
4) A document describing configuration recommendations for the GT container and condor-g, so users can get the same performance results as we've been able to achieve.

Tasks:
6) create RFT patch that applies to 4.0.3 (maybe 4.0.2 too?)
7) attach patch to this bug
8) notify VDT and TG team of patch and work with them so it can be applied in their grids
9) write a 4.0.3 performance recommendations document describing issues and configuration settings in order to avoid them
The recommendation document should include a task to delete all records from the RFT 'marker' table and set up an index on that table (or ensure that the database creates one automatically).
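As a rough sketch of what that task could look like, the snippet below empties a 'marker' table and adds an index to it. Everything except the table name is an assumption: the real RFT schema is not shown in this bug, so the column name transfer_id, the index name, and the helper prepare_marker_table are hypothetical. An in-memory SQLite database stands in for the actual RFT database purely so the example is runnable.

```python
import sqlite3


def prepare_marker_table(conn):
    """Delete all rows from 'marker' and make sure it has an index.

    The indexed column (transfer_id) is a hypothetical stand-in;
    check the real RFT schema before adapting this.
    """
    conn.execute("DELETE FROM marker")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS marker_transfer_id_idx "
        "ON marker (transfer_id)"
    )
    conn.commit()


# Stand-in database with a made-up two-column 'marker' table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marker (transfer_id INTEGER, marker TEXT)")
conn.executemany(
    "INSERT INTO marker VALUES (?, ?)",
    [(1, "0-1024"), (2, "0-2048")],
)
prepare_marker_table(conn)
```

On a production database, the same two statements would be run with that database's own client, preferably while the container is stopped.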
Created an attachment (id=1046)
Recommendations how to prepare server and client (condor-g) before large job submissions

I created the campaign for RFT (4699) and attached the patch for 4.0.3 to that campaign. The file attached here is the recommendation document. Let me know if anybody has comments on this doc or wants to add items.
I'm marking this with a 4.2 Target Milestone. What else do we need to do here? The campaign doesn't have a clear stopping point. If we're done with the task cycle described in the campaign description, we should close this bug out and move on.
The sub campaigns are done. Many improvements have been identified here and fixed. New campaigns need to be created to continue in this area.