Bug 4664 - performance improvements for large run job submissions using condor-g
Status: RESOLVED FIXED
Product: GRAM
Component: Campaign
Version: 4.0.2
Platform: PC Linux
Importance: P3 normal
Target Milestone: 4.2
Assigned To:
Depends on: 4687 4699 4865
Blocks: 4050
Reported: 2006-08-17 08:35
Modified: 2008-02-04 11:10


Attachments
Recommendations how to prepare server and client (condor-g) before large job submissions (57.57 KB, application/pdf)
2006-09-14 02:13, Martin Feller




Description From 2006-08-17 08:35:54
Title: WS GRAM Performance improvements for 4.0.x

Projects: VDT, OSG, TG

Technologies: WS GRAM


Definition:

Now that large job runs are completing reliably (bug 4506), the next step is
to examine and improve performance.  A few problems have already been
identified.  There is over-synchronization in RFT's createReliableTransfer()
and start() operations, which appears to be a bottleneck for large job runs
with staging.
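
For illustration only, a minimal sketch of the over-synchronization pattern
described above (this is not the actual RFT code; the class, the method
bodies, and the timings are hypothetical): when create/start operations are
synchronized on the service instance, concurrent submissions are serialized
even though only a small part of the work needs mutual exclusion.

// Minimal sketch, not the actual RFT code: a resource factory whose
// operations are synchronized on the service instance, so concurrent
// callers are serialized even though only the map update needs the lock.
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class TransferFactorySketch {
    private final Map<String, Object> resources = new HashMap<String, Object>();

    // Coarse-grained: slow work (e.g. persistence) runs under the
    // service-wide lock, so only one create can make progress at a time.
    public synchronized String createReliableTransfer() {
        String key = UUID.randomUUID().toString();
        resources.put(key, expensiveInitialization());
        return key;
    }

    // Finer-grained alternative: do the slow work outside the lock and
    // only guard the shared map.
    public String createReliableTransferNarrowLock() {
        String key = UUID.randomUUID().toString();
        Object resource = expensiveInitialization();
        synchronized (resources) {
            resources.put(key, resource);
        }
        return key;
    }

    private Object expensiveInitialization() {
        try { Thread.sleep(50); } catch (InterruptedException ignored) { }
        return new Object();
    }
}

As comment #1 below notes, simply removing the synchronization did not help
in practice, so the sketch only illustrates where the contention sits, not a
proven fix.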

Deliverables:

1) Updated version of WS GRAM and/or any dependencies where improvements have
been made in globus_4_0_branch
2) Updated version of CVS HEAD components where improvements make sense


Tasks:

1) get a baseline for WS GRAM performance for a large run (3500 jobs)
2) get a baseline profile of the time spent in WS GRAM, RFT, and any other
service used on the server (a minimal timing sketch follows this list)
3) identify code/routines to be improved
4) test improvements during a large job run
5) repeat step 1
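
For task 2, a minimal sketch of the kind of per-operation timing that could
be used to build such a baseline profile; the operation name and the
aggregation here are hypothetical, and in practice container logs or a Java
profiler would more likely be used.

// Minimal sketch: accumulate wall-clock time per named operation and
// print a simple baseline profile. Not taken from the GRAM code base.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class BaselineTimer {
    private static final Map<String, AtomicLong> totals =
            new ConcurrentHashMap<String, AtomicLong>();

    // Accumulate the time spent in a named operation.
    public static void record(String operation, long nanos) {
        totals.computeIfAbsent(operation, k -> new AtomicLong()).addAndGet(nanos);
    }

    // Print the accumulated totals per operation.
    public static void dump() {
        for (Map.Entry<String, AtomicLong> e : totals.entrySet()) {
            System.out.printf("%-20s %10.1f ms%n",
                    e.getKey(), e.getValue().get() / 1e6);
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        Thread.sleep(25);                       // stand-in for the real operation
        record("stageIn", System.nanoTime() - start);
        dump();
    }
}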

Time Estimate:

10 days

Deadline:
Sept 1 (in time to be deployed on TG before SC)
------- Comment #1 From 2006-08-18 08:50:40 -------
We had been working on this for some time. Here's a summary:

The following is a typical snapshot of the container during a test
with 3500 job submissions when all 3500 jobs had been accepted.

-------------------------------------------------------------------
            JOB STATES | #JOBS IN THAT STATE
-------------------------------------------------------------------
                  Done: 1502
   FileCleanUpResponse:    7
WaitingForStateChanges:    6
       StageInResponse:    9
      StageOutResponse:    9
              StageOut:   51
               StageIn: 1887
           FileCleanUp:   29
-------------------------------------------------------------------
           TOTAL #JOBS: 3500
-------------------------------------------------------------------

This shows that all 3500 jobs had been accepted by the container,
1502 jobs had already finished, 7 resources were currently in state
FileCleanUpResponse, 1887 resources were in state StageIn, and so on.
Most of the unfinished resources were in state StageIn.
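
For illustration, a minimal sketch of how such a snapshot could be tallied
from a plain-text dump of resource states (one "<resourceId> <state>" pair
per line); the input format is an assumption, and the snapshot above was
taken from the container itself.

// Minimal sketch: count resources per state from a plain-text dump.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

public class StateSnapshot {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        int total = 0;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length < 2) continue;   // skip malformed lines
            String state = parts[1];
            Integer c = counts.get(state);
            counts.put(state, c == null ? 1 : c + 1);
            total++;
        }
        in.close();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.printf("%22s: %4d%n", e.getKey(), e.getValue());
        }
        System.out.printf("%22s: %4d%n", "TOTAL #JOBS", total);
    }
}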

Looking at the logs, we found that a job resource sometimes sits in the
stageIn RunQueue for up to 5 hours before it is processed, i.e. before
stage-in even starts.

The reason for this is that adding new jobs to the container is much
faster than processing resources in state StageIn: more than 100 new
resources can be added to the container per minute, but only about
3-9 resources in state StageIn can be processed per minute.
As a result, the stageIn RunQueue fills up and resources wait there
for a long time before being processed, so the staging step throttles
the overall processing of every resource.
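
A back-of-the-envelope sketch using the rates reported above (roughly 100
submissions per minute against 3-9 stage-ins per minute) shows why
multi-hour waits in the stageIn queue are expected; the constant-rate
assumption is a simplification, not a model of the actual container.

// Back-of-the-envelope sketch: a fast producer and a slow stage-in
// consumer make the queue backlog (and the wait of the last queued job)
// grow to hours for a 3500-job run.
public class QueueBacklogSketch {
    public static void main(String[] args) {
        double arrivalsPerMin = 100.0;  // observed submission rate
        double stageInsPerMin = 9.0;    // upper end of the observed 3-9/min
        int totalJobs = 3500;

        double minutesToSubmitAll = totalJobs / arrivalsPerMin;   // ~35 min
        double minutesToDrainAll  = totalJobs / stageInsPerMin;   // ~389 min
        double worstCaseWait      = minutesToDrainAll - minutesToSubmitAll;

        System.out.printf("all jobs submitted after %.0f min%n", minutesToSubmitAll);
        System.out.printf("stageIn queue drained after %.0f min (~%.1f h)%n",
                minutesToDrainAll, minutesToDrainAll / 60.0);
        System.out.printf("worst-case wait in the stageIn queue: ~%.1f h%n",
                worstCaseWait / 60.0);
    }
}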

1st approach:
#############
Add more RunThreads for the stageIn RunQueue (6 instead of 3).

result:
Throughput of the stageIn queue improved, but then the stageOut queue
started filling up, so overall there was no performance improvement.

2nd approach:
#############
Remove the synchronization from the methods that create and start an
RFT resource.

result:
Surprisingly, removing the synchronization led to worse, not better,
performance.

Details about the tests can be found here:
http://www-unix.mcs.anl.gov/~feller/tpTesting/index.html
------- Comment #2 From 2006-09-08 16:48:11 -------
        CAMPAIGN ADDENDUM

Now that we no longer see any obvious performance problems in this
testing, we need to add two more deliverables and a few tasks to this
campaign.

Deliverables:

3) An update patch for RFT for 4.0.3
4) A document describing configuration recommendations for the GT
container and condor-g, so that users can get the same performance
results we have been able to achieve.

Tasks:

6) create an RFT patch that applies to 4.0.3 (maybe 4.0.2 too?)
7) attach the patch to this bug
8) notify the VDT and TG teams of the patch and work with them so it
can be applied in their grids
9) write a 4.0.3 performance recommendations document describing the
issues and the configuration settings that avoid them
------- Comment #3 From 2006-09-08 17:29:36 -------
The recommendations document should include a task to delete all
records from the RFT 'marker' table and to set up an index on that
table (or to ensure one is created automatically by the database).
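
For illustration, a minimal JDBC sketch of that cleanup; the table name
'marker' is taken from this comment, but the indexed column, index name,
database name, and credentials are assumptions. In practice this would
normally be done directly in psql against the RFT database.

// Minimal sketch: purge old marker rows and add an index so marker
// lookups stop scanning the table. Requires the PostgreSQL JDBC driver.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RftMarkerCleanup {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/rftDatabase", "rft", "secret");
        Statement stmt = conn.createStatement();
        try {
            // Purge marker records left over from previous runs.
            int deleted = stmt.executeUpdate("DELETE FROM marker");
            System.out.println("deleted " + deleted + " marker rows");

            // Index the lookup column (assumed name) to avoid full scans.
            stmt.executeUpdate(
                "CREATE INDEX marker_transfer_idx ON marker (transfer_id)");
        } finally {
            stmt.close();
            conn.close();
        }
    }
}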
------- Comment #4 From 2006-09-14 02:13:12 -------
Created an attachment (id=1046) [details]
Recommendations how to prepare server and client (condor-g) before large job
submissions

I created the campaign for RFT (4699) and attached the patch for
4.0.3 to that campaign.
The attached file here is the recommendations document.
Let me know if anybody has comments on this doc or wants to add
additional items.
------- Comment #5 From 2006-10-02 18:50:13 -------
I'm marking this with a 4.2 Target Milestone. What else do we need to do here?
The campaign doesn't have a clear stopping point. If we're done with the task
cycle described in the campaign description, we should close this bug out and
move on.
------- Comment #6 From 2008-02-04 11:10:54 -------
The sub-campaigns are done.  Many improvements have been identified and
fixed here.  New campaigns need to be created to continue work in this
area.