Bugzilla – Bug 802
gatekeeper machine with many jobs heavily loaded
Last modified: 2004-02-26 12:23:53
Problem: When a large number of jobs have been submitted to a Globus head node, the head node becomes overwhelmed with running processes.

How to reproduce: Running the following simple script against a Globus 2.2.4 head node will quickly reproduce the problem:

    #! /bin/sh
    i=0
    while [ $i -lt 300 ]; do
        i=`expr $i + 1`
        echo $i
        globus-job-run MACHINE_TO_TEST/jobmanager-fork \
            /bin/sleep 600 &
    done

Once running, monitor the load on MACHINE_TO_TEST. The load will likely rise and fall in cycles. On our particular configuration the load shoots up to 190. Use "time globusrun -a -r MACHINE_TO_TEST" to monitor responsiveness.

While an initial spike in usage can be expected as the jobs begin running, the load remains extremely high even once all of the jobs are running, and the machine is less usable. The slowdown can cause globus-job-managers to become slow to respond to queries, making it difficult for any process monitoring things (like Condor-G) to determine what is going on. During the peak, a "globusrun -a -r MACHINE_TO_TEST" might take five minutes. The machine is so badly loaded that commands like ps become too slow to capture the state of the machine.

This behavior also occurs when using the Condor jobmanager; I've used fork here because it's simple. When running real work against the Condor jobmanager, the load has been even higher, possibly because querying Condor takes more work than querying a fork job.

Some testing suggests that the problem is in part the globus-job-managers and their repeated polls via globus-job-manager-script. Death by a thousand cuts.

This is a problem because if a head node is fronting for a batch system with 150 nodes, it's quite reasonable to submit 300 jobs (to ensure that the system has work to do when the first 150 finish). For some work we're doing, we expect to submit even larger numbers of jobs to an even larger pool. 1,000 or 10,000 jobs simultaneously submitted to the head node are real possibilities. This sort of load will crush the head node.
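The monitoring step above can be scripted so that samples keep arriving even when interactive commands are slow. The following is only a sketch: the helper names load_1min and probe_seconds are made up for illustration, and it assumes uptime prints a "load average:" field and that date supports +%s.

```shell
#!/bin/sh
# Sketch only: helpers for logging load and responsiveness over time.
# load_1min and probe_seconds are hypothetical names, not part of Globus.

# Read uptime-style text on stdin and print the 1-minute load average.
load_1min() {
    sed -n 's/.*load averages\{0,1\}: *\([0-9.]*\).*/\1/p'
}

# Run the given command, discard its output, and print how many whole
# seconds it took (assumes "date +%s" is available).
probe_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example use, run on or against the head node once a minute:
#   while :; do
#       echo "load=$(uptime | load_1min)" \
#            "probe=$(probe_seconds globusrun -a -r MACHINE_TO_TEST)s"
#       sleep 60
#   done
```

Logging one line per minute to a file makes it easier to compare peak load and probe times between runs than watching uptime by hand.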
This poses a real scalability problem for our use. This appears to represent a regression. Under Globus 2.0.x, while lots of submitted jobs did load the machine, the load was much lower (perhaps one tenth as bad).
Some testing suggests that globus-job-manager-script.pl may be a significant part of the problem. It's not clear if it's the core. I replaced globus-job-manager-script.pl with a simple shell script that handled poll requests directly and handed all other requests to the real script. (I'll attach the shell script to this bug.) This cut down on the load:

Before: Peak Load: 298, Run length: 32 minutes
After: Peak Load: 194, Run length: 17 minutes

Before, the machine became only marginally responsive during the core of the run. Simple commands would take thirty seconds or more to respond. A globusrun -a -r could take over five minutes. After, while the machine was slightly slow, it remained generally responsive. A globusrun -a -r might take one minute.

The shell script is not perfectly efficient (it invokes kill, grep, and awk every time it processes a poll request); perhaps a more efficient implementation would improve the results even further.

That said, while this does improve the situation, it doesn't completely resolve it. I suspect this just represents a minor scalability improvement, not a real fix. In all likelihood a larger run (perhaps 1,000 jobs, perhaps 10,000) will return us to the same situation.

I'm not sure if this information really points to any sort of answer, but hopefully it provides some insight into the problem.
Created an attachment (id=82) [details]
Optimized shell wrapper to globus-job-manager-script.pl

Replaces globus-job-manager-script.pl. The original globus-job-manager-script.pl should be moved to globus-job-manager-script.pl.real. To use it, you'll need to change the hard-coded path to globus-job-manager-script.pl.real in the script. The wrapper optimizes poll commands against fork jobs and hands off all other commands to the original script.
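For readers without access to the attachment, the general shape of such a wrapper might look like the sketch below. This is not the attached script. The calling convention (command name followed by a PID), the ACTIVE/DONE output strings, and the REAL_SCRIPT path are all assumptions made for illustration; the real script's arguments and output format are not reproduced here.

```shell
#!/bin/sh
# Sketch only, NOT the attached wrapper. Assumed for illustration:
#   - the wrapper is called as: dispatch COMMAND PID [ARGS...]
#   - a fork job is identified by a single PID
#   - the strings ACTIVE and DONE stand in for the real status output

# Hypothetical location of the renamed original script.
REAL_SCRIPT=${REAL_SCRIPT:-/path/to/globus-job-manager-script.pl.real}

# Succeeds if a process with this PID exists and we may signal it.
# Signal 0 performs the permission and existence check without
# delivering anything to the process.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

# Answer a poll locally instead of starting the Perl script.
poll_fork_job() {
    if pid_alive "$1"; then
        echo "ACTIVE"
    else
        echo "DONE"
    fi
}

# Handle poll requests directly; hand everything else to the original.
dispatch() {
    case "$1" in
        poll) poll_fork_job "$2" ;;
        *)    exec "$REAL_SCRIPT" "$@" ;;
    esac
}
```

The saving comes from answering the frequent poll requests without starting a Perl interpreter each time; every other request still pays the full cost of the original script.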
Just to give it a quick and dirty try, I modified my test script to send out 3000 jobs. This is using my shell wrapper optimization. Early on (presumably while still starting the jobs) the load peaked at 598. The system was heavily bogged down; a simple uptime took 15 minutes to respond at one point. Once the initial setup finished, the load dropped to about 300 and the machine remained extremely slow, but marginally useful (taking about 1 minute to respond to uptime).

Given that the initial setup took longer than the ten-minute job length, and jobs were finishing before all of the jobs were started, I'll increase the run time of my sleep job to 3600 (one hour) and try again. If the machine remains minimally useful under the running load, making globus-job-manager-script.pl faster may be enough to solve the problem. Of course, most sites doing heavy duty work will be using a batch system instead of the fork jobmanager, so a simple workaround like my shell script won't be good enough.
I was away from the machine I submitted my 3000-job run from, and logged in remotely to collect the results. As a result, I didn't see the stream of error messages alerting me that I had hit a number of limits (processes per user and the like). It looks like only about 1,000 of the 3,000 jobs were submitted.

Retrying with 1,000 jobs and a 60 minute job length returned similar results. The peak load was about 700 and was regularly over 500. For the entire duration of the run there were periods in which the machine became unresponsive for several minutes at a stretch. Long after all of the jobs had started, a globusrun -a -r would still sometimes take over 5 minutes to respond.
Created an attachment (id=102) [details]
globus-job-manager scalability improvements

This patch contains a number of changes toward improving the scalability of Globus on gatekeeper machines. These changes are necessary for the EDG work.

- If a cache file of job status is being maintained on the machine, that is used to extract the status instead of calling the Perl script. The script providing this cache is submitted by submit sites desiring to use it.
- To support this cache file it is necessary to know when the status was last updated. Added code to manage the last update time.
- Added some logging for STDIO_SIZE requests.
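The cache lookup described in the first two points can be illustrated roughly as follows. This is a sketch of the idea only, not the patch (the patch changes the job manager itself). The cache file layout, an "updated EPOCH" line followed by "JOBID STATUS" lines, and the function name cached_status are invented for the example.

```shell
#!/bin/sh
# Sketch only. Hypothetical cache file layout:
#   updated 1077798233      <- when the cache was last refreshed (epoch)
#   job1 ACTIVE
#   job2 DONE

# cached_status FILE JOBID MAX_AGE [NOW]
# Prints the cached status and succeeds only if the cache was refreshed
# within MAX_AGE seconds and contains JOBID. Otherwise prints nothing
# and fails, so the caller can fall back to the slow status script.
cached_status() {
    cache_file=$1
    job=$2
    max_age=$3
    now=${4:-$(date +%s)}
    [ -r "$cache_file" ] || return 1
    awk -v job="$job" -v now="$now" -v max="$max_age" '
        $1 == "updated"            { fresh = (now - $2 <= max); next }
        fresh && ($1 "") == (job "") { print $2; found = 1; exit }
        END                        { exit !found }
    ' "$cache_file"
}

# Example use (slow_status_script is a hypothetical fallback):
#   status=$(cached_status /tmp/job-status.cache "$id" 120) ||
#       status=$(slow_status_script "$id")
```

Checking the last update time before trusting an entry is what keeps a stalled cache writer from reporting stale states indefinitely.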
Is there any news on getting the patch I offered applied?
Alan, We were not able to review, test, and commit this patch before our final batch of commits to the 2.2 branch, which was then tagged for use in a VDT version. Nor did it make it into the 2.4 branch. I am wondering about the severity. Since you are working from a tag from the 2.2 branch, I assume that you have applied it to the 2.2 tag checkout in the condor-globus repository. Does it make sense to apply it to the 2.2 branch, or should it be applied to 2.4? I guess it all depends on which version of Globus will be used next in VDT. Will it be from GT 2.4? -Stu
VDT 1.1.8 is going to be based on Globus 2.2.x. Checking it into the 2.2 line would be ideal.
It turns out that the patch I previously offered has some weaknesses.

1. If the job manager is restarted it may be unable to locate its own entry in the cache file of job statuses (if available). As a result it will call the Perl status scripts directly instead of using the cache.
2. If the job manager is restarted it will refuse to look at the cache file at all until the file has been updated at least once.

I'm making the necessary changes to fix these. Do you want a new complete patch to replace the one above (attachment 102 [details]), or just a patch that assumes the patch above?
Can you check the correlation to bug/patches #931? The use of the JobManager.pm supplied methods fork_and_exec_cmd and pipe_out_cmd should have a positive impact on your performance, too. Beyond that, I'd appreciate a diff against the original source.
Jens, the changes in bug #931 appear to just be (to quickly summarize) optimizations to the job-manager scripts. This change optimizes the job-manager itself to avoid calling the scripts when possible. Both are good changes aimed at reducing load on the gatekeeper. I expect that used together they will provide an even bigger benefit, but I don't see any direct correlation between them.
Just to note it for future reference, bug #868 might also need to be fixed. My current fix for point two from comment #9 doesn't solve all cases; the delay before doing a status query seems to provide an opportunity for problems. During the delay there are some cases where the job-manager will decide to change the job's state. Under my current patch (attachment 102 [details]), this causes the job-manager to decide that its information is more correct than the cache of job status information. Fixing bug #868 might fix this; I'm looking into it.
Created an attachment (id=126) [details]
Corrections to scalability improvements

The previous patch had two weaknesses (described in previous comments) that would cause the optimizations to not function in some cases. This patch corrects most of those weaknesses and should replace the prior patch.
We are wondering if this patch has been applied to any version of Globus? It has been part of the VDT for quite some time and has proven its use. We would have thought it would make it into an advisory. We would rather not fork Globus to have a VDT Globus and a Globus Globus. Thoughts? Thanks, -alain
Bug #1142 seems to be a duplicate of this bug. It's odd that nobody seems to have noticed that before now. It's also odd that #1142 was rejected, originally as "NOTABUG" and finally as "WONTFIX". Anyway, I've implemented a Linux-specific workaround for this, a script called "throttle-job-manager" that uses STOP and CONT signals to limit the number of globus-job-manager processes that can be active simultaneously. It's a fairly crude solution, almost brute-force, and it can cause significant delays in commands that attempt to communicate with a globus-job-manager process (globus-job-status, globus-job-get-output, etc.) See <http://www.sdsc.edu/~kst/throttle-job-manager/>.
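The STOP/CONT idea can be illustrated with a short sketch. This is not the throttle-job-manager script linked above; the function names throttle_plan and apply_plan are made up, and how the list of globus-job-manager PIDs is gathered is left to the caller.

```shell
#!/bin/sh
# Sketch only, NOT the throttle-job-manager script. Function names are
# hypothetical; the caller supplies the PIDs to manage.

# throttle_plan LIMIT PID...
# Prints one "SIGNAL PID" line per process: the first LIMIT processes
# are allowed to run (CONT), the remainder are suspended (STOP).
throttle_plan() {
    limit=$1
    shift
    n=0
    for pid in "$@"; do
        n=$((n + 1))
        if [ "$n" -le "$limit" ]; then
            echo "CONT $pid"
        else
            echo "STOP $pid"
        fi
    done
}

# Reads "SIGNAL PID" lines on stdin and sends each signal. Processes
# that have already exited are silently skipped.
apply_plan() {
    while read -r sig pid; do
        kill -"$sig" "$pid" 2>/dev/null || :
    done
}

# Example use: let at most 20 of the listed processes run at once.
#   throttle_plan 20 $PIDS | apply_plan
```

A real throttle would need to rotate which processes are allowed to run, so that suspended job managers eventually get a turn. That rotation is where the delays mentioned above for globus-job-status and similar commands come from.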
Alan, Alain, We have not applied this patch yet. Is the one here the latest version of it? We have a release coming up and would like to apply this patch if possible. -Stu
Patch 126 (the one previously attached) remains the most recent.
the scalability improvements patch has been applied to the 3.2 branch and cvs trunk. joe