Bug 802 - gatekeeper machine with many jobs heavily loaded

Status: RESOLVED FIXED
Product: GRAM
Component: gt2 Gatekeeper/Jobmanager
Version: 1.6
Platform: PC Linux
Importance: P2 critical
Target Milestone: 3.2
Reported: 2003-03-07 18:08
Modified: 2004-02-26 12:23


Attachments

Optimized shell wrapper to globus-job-manager-script.pl (2.11 KB, text/plain)
  2003-03-10 19:19, Alan De Smet
globus-job-manager scalability improvements (35.04 KB, patch)
  2003-03-31 12:56, Alan De Smet
Corrections to scalability improvements (38.50 KB, patch)
  2003-05-14 17:21, Alan De Smet



Description From 2003-03-07 18:08:21
Problem:

When a large number of jobs have been submitted to a Globus head
node, the head node becomes overwhelmed with running processes.


How to reproduce:

Running the following simple script against a Globus 2.2.4 head
node will quickly reproduce the problem:

#! /bin/sh
# Launch 300 background globus-job-run submissions, each running a
# ten-minute sleep on the target gatekeeper.
i=0
while [ $i -lt 300 ]; do
	i=`expr $i + 1`
	echo $i
	globus-job-run MACHINE_TO_TEST/jobmanager-fork \
		/bin/sleep 600 &
done

Once running, monitor the load on MACHINE_TO_TEST.  The load will
likely rise and fall in cycles.  On our particular configuration
the load shoots up to 190.  Use "time globusrun -a -r
MACHINE_TO_TEST" to monitor responsiveness.
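
Something like the following loop records both measurements while the test
runs (the log file name and the one-minute sample interval are arbitrary
choices, not part of the test procedure itself):

#! /bin/sh
# Log the load average and the globusrun round-trip time once a minute.
while true; do
	date
	uptime
	time globusrun -a -r MACHINE_TO_TEST
	sleep 60
done >> gatekeeper-load.log 2>&1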

While an initial spike in usage can be expected as the jobs begin
running, once the jobs are all running, the load remains
extremely high and the machine is less usable.  The slowdown can
cause globus-job-managers to become slow to respond to queries,
making it difficult for any process monitoring things (like
Condor-G) to determine what is going on.  During the peak, a
"globusrun -a -r MACHINE_TO_TEST" might take five minutes.  The
machine is so badly loaded that commands like ps become too slow
to capture the state of the machine. 

This behavior also occurs when using the Condor jobmanager; I've used fork here
because it's simple.  When running real work against the Condor jobmanager, the
load has been even higher, possibly because querying Condor is more work than
querying a fork job.

Some testing suggests that the problem lies partly in the globus-job-manager
processes and their repeated polls of globus-job-manager-script.  Death by a
thousand cuts.

This is a problem because if a head node is fronting for a batch
system with 150 nodes, it's quite reasonable to submit 300 jobs
(to ensure that the system has work to do when the first 150
finish).  For some work we're doing, we expect to submit even
larger numbers of jobs to an even larger pool.  1,000 or 10,000
jobs simultaneously submitted to the head node are real
possibilities.  This sort of load will crush the head node.

This poses a real scalability problem for our use. This appears to
represent a regression.  Under Globus 2.0.x, while lots of submitted
jobs did load the machine, the load was much lower (perhaps one tenth
as bad).
------- Comment #1 From 2003-03-10 19:17:24 -------
Some testing suggests that globus-job-manager-script.pl may be a significant
part of the problem.  It's not clear if it's the core.

I replaced globus-job-manager-script.pl with a simple shell script that handled
poll requests directly and handed all other requests to the real script.  (I'll
attach the shell script to this bug.)  This cut down on the load:

Before: Peak Load: 298, Run length: 32 minutes
After:  Peak Load: 194, Run length: 17 minutes

Before, the machine was only marginally responsive during the core of the run.
Simple commands would take thirty seconds or more to respond.  A globusrun -a -r
could take over five minutes.  After, while the machine was slightly slow, it
remained generally responsive.  A globusrun -a -r might take one minute.

The shell script is not perfectly efficient (it invokes kill, grep, and awk
every time it processes a poll request); perhaps a more efficient implementation
would improve the results even further.

That said, while this does improve the situation, it doesn't completely resolve
it.  I suspect this just represents a minor scalability improvement, not a real
fix.  In all likelihood a larger run (perhaps 1,000 jobs, perhaps 10,000) will
return us to the same situation.

I'm not sure if this information really points to any sort of answer, but
hopefully it provides some insight into the problem.
------- Comment #2 From 2003-03-10 19:19:31 -------
Created an attachment (id=82)
Optimized shell wrapper to globus-job-manager-script.pl

This replaces globus-job-manager-script.pl.  The original
globus-job-manager-script.pl should be moved to
globus-job-manager-script.pl.real.  To use it, you'll need to change the
hard-coded path to globus-job-manager-script.pl.real in the script.  It
optimizes poll commands against fork jobs and hands off all other commands to
the original script.
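
Roughly, the idea is sketched below.  The option letters, the argument-file
field name, and the GRAM_SCRIPT_JOB_STATE output lines are assumptions about
the GT2 script interface rather than a copy of the attachment; compare them
against the real globus-job-manager-script.pl before reusing any of this.

#! /bin/sh
# Answer fork poll requests directly; hand everything else to the
# original script (renamed to globus-job-manager-script.pl.real).
REAL=/usr/local/globus/libexec/globus-job-manager-script.pl.real

# Scan the arguments without consuming them, so the untouched command
# line can be passed straight through when we don't short-circuit.
manager=""; command=""; argfile=""; prev=""
for arg in "$@"; do
	case "$prev" in
		-m) manager=$arg ;;
		-c) command=$arg ;;
		-f) argfile=$arg ;;
	esac
	prev=$arg
done

if [ "$manager" = "fork" ] && [ "$command" = "poll" ]; then
	# Pull the recorded process id out of the argument file ("jobid" is
	# a guess at the field name) and see whether it is still alive.
	pid=`awk -F= '/jobid/ {print $2}' "$argfile"`
	if kill -0 "$pid" 2>/dev/null; then
		echo "GRAM_SCRIPT_JOB_STATE:2"	# 2 = ACTIVE in the GRAM protocol
	else
		echo "GRAM_SCRIPT_JOB_STATE:8"	# 8 = DONE in the GRAM protocol
	fi
else
	exec "$REAL" "$@"
fi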
------- Comment #3 From 2003-03-10 21:48:40 -------
Just to give it a quick and dirty try, I modified my test script to send 3000
jobs out.  This is using my shell wrapper optimization.  Early on (presumably
while still starting the jobs) the load peaked at 598.  The system bogged down
heavily; a simple uptime took 15 minutes to respond at one point.  Once the
initial setup finished, the load dropped to about 300 and the machine remained
extremely slow, but marginally useful (taking about 1 minute to respond to
uptime).

Given that the initial setup took longer than the ten minutes the jobs run for,
and jobs were finishing before all of the jobs had been started, I'll increase
the run time of my sleep job to 3600 seconds (one hour) and try again.  If the
machine remains minimally usable under the running load, making
globus-job-manager-script.pl faster may be enough to solve the problem.  Of
course, most sites doing heavy-duty work will be using a batch system instead
of the fork jobmanager, so a simple workaround like my shell script won't be
good enough.
------- Comment #4 From 2003-03-11 15:41:14 -------
I was away from the machine I submitted my 3000-job run from, and only logged
in remotely to collect the results.  As a result, I didn't see the stream of
error messages alerting me that I had hit a number of limits (processes per
user and the like).  It looks like only about 1,000 of the 3,000 jobs were
actually submitted.
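
For anyone repeating this, it's worth checking the relevant limits before
starting a large run.  A quick sketch (ulimit -u is common in bash/ksh but not
strictly portable, and the /proc paths are Linux-specific):

#! /bin/sh
# Show the limits most likely to bite a large submission test.
ulimit -u				# max processes per user
ulimit -n				# max open file descriptors
cat /proc/sys/kernel/threads-max	# system-wide task ceiling
cat /proc/sys/fs/file-max		# system-wide open-file ceiling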

Retrying with 1,000 jobs and a 60 minute job length returned similar results.
The peak load was about 700 and was regularly over 500.  For the entire duration
of the run there were periods in which the machine became unresponsive for
several minutes at a stretch.  Long after all of the jobs had started, globusrun
-a -r's would still sometimes take over 5 minutes to respond.
------- Comment #5 From 2003-03-31 12:56:38 -------
Created an attachment (id=102)
globus-job-manager scalability improvements

This patch contains a number of changes toward improving the scalability of
Globus on gatekeeper machines.	These changes are necessary for the EDG work.

- If a cache file of job status is being maintained on the machine, the job
manager extracts the status from that file instead of calling the Perl script
(see the sketch after this list).  The script that maintains the cache is
submitted by those submit sites that want to use it.

- To support this cache file it is necessary to know when the status was last
updated.  Added code to manage the last update time.

- Added some logging for STDIO_SIZE requests.
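
To picture the first change: the decision the job manager now makes is roughly
the one sketched below.  The cache file name, its one-line-per-job layout
(contact, state, update time), and the 60-second freshness window are
placeholders for illustration, not the actual format the patch uses.

#! /bin/sh
# Given a job contact, answer its status from the cache when the cache
# entry is fresh; otherwise report a miss so the caller falls back to
# the (expensive) Perl status script.
CACHE=/var/tmp/gram_job_status_cache
CONTACT="$1"

now=`date +%s`
if [ -r "$CACHE" ]; then
	line=`grep "^$CONTACT " "$CACHE"`
	if [ -n "$line" ]; then
		updated=`echo "$line" | awk '{print $3}'`
		if [ `expr "$now" - "$updated"` -lt 60 ]; then
			echo "$line" | awk '{print $2}'	# cached job state
			exit 0
		fi
	fi
fi
echo "cache miss for $CONTACT: fall back to the Perl status script" >&2
exit 1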
------- Comment #6 From 2003-04-08 17:15:22 -------
Is there any news on getting the patch I offered applied?
------- Comment #7 From 2003-04-09 11:16:09 -------
Alan,

We were not able to review, test, and commit this patch before our final batch
of commits to the 2.2 branch, which was then tagged for use in a VDT version.
Nor did it make it into the 2.4 branch.

I am wondering about the severity.  Since you are working from a tag from the 
2.2 branch, I assume that you have applied it to the 2.2 tag checkout in the 
condor-globus repository.

Does it make sense to apply it to the 2.2 branch, or should it be applied to
2.4?  I guess it all depends on which version of Globus will be used in the
next VDT.  Will it be from GT 2.4?

-Stu
------- Comment #8 From 2003-05-02 16:22:51 -------
VDT 1.1.8 is going to be based on Globus 2.2.x.  Checking it into the 2.2 line
would be ideal.
------- Comment #9 From 2003-05-12 16:49:45 -------
It turns out that the patch I previously offered has some weaknesses.

1. If the job manager is restarted, it may be unable to locate its own entry
in the cache file of job statuses (if available).  As a result it will call the
Perl status scripts directly instead of using the cache.

2. If the job manager is restarted, it will refuse to look at the cache file at
all until the file has been updated at least once.

I'm making the necessary changes to fix these.

Do you want a new complete patch to replace the one above (attachment 102), or
just a patch that assumes the patch above has already been applied?
------- Comment #10 From 2003-05-12 17:48:01 -------
Can you check the correlation to bug/patches #931?  The use of the
JobManager.pm-supplied methods fork_and_exec_cmd and pipe_out_cmd should have a
positive impact on your performance, too.  Beyond that, I'd appreciate a diff
against the original source.
------- Comment #11 From 2003-05-13 16:00:40 -------
Jens, the changes in bug #931 appear to be (to quickly summarize) optimizations
to the job-manager scripts.  This change optimizes the job-manager itself to
avoid calling the scripts when possible.  Both are good changes aimed at
reducing load on the gatekeeper.  I expect that used together they will provide
an even bigger benefit, but I don't see any direct correlation between them.
------- Comment #12 From 2003-05-13 16:47:55 -------
Just to note it for future reference, bug #868 might also need to be fixed.  My
current fix for point two from comment #9 doesn't cover all cases; the delay
before doing a status query seems to provide an opportunity for problems.
During the delay there are some cases where the job-manager will decide to
change the job's state.  Under my current patch (attachment 102), this causes
the job-manager to decide that its information is more correct than the cache
of job status information.  Fixing bug #868 might fix this; I'm looking into it.
------- Comment #13 From 2003-05-14 17:21:14 -------
Created an attachment (id=126)
Corrections to scalability improvements

The previous patch had two weaknesses (described in the previous comments) that
would cause the optimizations to not function in some cases.  This patch
corrects most of those weaknesses and should replace the prior patch.
------- Comment #14 From 2003-11-25 17:07:01 -------
We are wondering whether this patch has been applied to any version of Globus.

It has been part of the VDT for quite some time and has proven its usefulness.
We would have thought it would make it into an advisory by now.  We would
rather not fork Globus and end up with a VDT Globus and a Globus Globus.

Thoughts?

Thanks,
-alain
------- Comment #15 From 2003-12-23 21:18:48 -------
Bug #1142 seems to be a duplicate of this bug.  It's odd that nobody
seems to have noticed that before now.  It's also odd that #1142 was
rejected, originally as "NOTABUG" and finally as "WONTFIX".

Anyway, I've implemented a Linux-specific workaround for this, a
script called "throttle-job-manager" that uses STOP and CONT signals
to limit the number of globus-job-manager processes that can be active
simultaneously.  It's a fairly crude solution, almost brute-force,
and it can cause significant delays in commands that attempt to 
communicate with a globus-job-manager process (globus-job-status,
globus-job-get-output, etc.)

See <http://www.sdsc.edu/~kst/throttle-job-manager/>.
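
The script itself is at that URL; below is only a minimal sketch of the
STOP/CONT idea, not the actual throttle-job-manager.  The limit of 20 runnable
job managers is arbitrary, pgrep with -f is assumed to be available, and no
attempt is made to rotate which job managers get to run.

#! /bin/sh
# Keep at most MAX globus-job-manager processes runnable; pause the rest.
# Intended to be run periodically, e.g. from cron.
MAX=20
count=0
for pid in `pgrep -f -u "$USER" globus-job-manager`; do
	count=`expr $count + 1`
	if [ $count -le $MAX ]; then
		kill -CONT "$pid" 2>/dev/null	# let the first MAX keep running
	else
		kill -STOP "$pid" 2>/dev/null	# pause the rest for now
	fi
done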
------- Comment #16 From 2004-01-16 16:34:30 -------
Alan, Alain,

We have not applied this patch yet.  Is the one here the latest version of it?  
We have a release coming up and would like to apply this patch if possible.

-Stu
------- Comment #17 From 2004-01-16 17:53:06 -------
Patch 126 (the one previously attached) remains the most recent.
------- Comment #18 From 2004-02-26 12:23:53 -------
The scalability improvements patch has been applied to the 3.2 branch and the
CVS trunk.
 
joe