Bugzilla – Bug 4452
job submission response is affected by Java 1.5 thread processing
Last modified: 2008-02-04 11:24:29
The details of this problem can be found in this gram-user email thread, which also suggests solutions that should be considered. http://www-unix.globus.org/mail_archive/gram-user/2006/05/msg00022.html
I'm just going to paste the last email on this thread since it's where all the good meat is:

--------------

Apparently Java does not mandate any scheduling order for threads waiting for monitor entrance. Hence, this order may vary between different JVM implementations and operating systems. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6316090

In my environment, Java 1.5 queues up threads waiting for monitor entrance in stack-order, which makes the implementation starvation-prone (if a steady stream of threads is attempting to enter the monitor, the first waiting thread will never be granted monitor entry and will hence be starved out). In Java 1.4, however, threads seem to be queued in ("fair") FIFO-order. Consequently I restarted my container with Java 1.4 and I haven't (yet :) observed any of the abnormal response times seen previously.

As a work-around Sun recommends the use of 'one of the excellent "fair" lock constructs in java.util.concurrent.' Maybe this approach should be taken by Globus developers in the future since the synchronized construct, apparently, cannot be relied upon.

regards, Peter
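For readers unfamiliar with the "fair" lock constructs mentioned in the email, the following is a minimal sketch of what such a replacement might look like. It is not Globus code: the class FairLockSketch and its createJob() method are hypothetical stand-ins for a service method that is currently declared synchronized. The only library facility relied on is java.util.concurrent.locks.ReentrantLock, whose constructor takes a boolean that requests the fair policy. The demo uses lambdas for brevity, so it needs Java 8 or later.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a service whose method was previously 'synchronized'.
class FairLockSketch {
    // 'true' requests the fair policy: under contention the lock favors
    // the thread that has been waiting longest.
    private final ReentrantLock lock = new ReentrantLock(true);
    private int created = 0;

    // Replaces a 'synchronized' method body with explicit lock/unlock.
    int createJob() {
        lock.lock();
        try {
            return ++created;
        } finally {
            lock.unlock(); // always release, even if the body throws
        }
    }

    // Starts 'threads' workers that each call createJob() 'perThread' times
    // and returns the final counter value.
    static int runDemo(int threads, int perThread) {
        FairLockSketch service = new FairLockSketch();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    service.createJob();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return service.created; // safe to read: join() orders it after all updates
    }

    public static void main(String[] args) {
        System.out.println("jobs created: " + runDemo(4, 1000));
    }
}
```

A fair lock trades throughput for predictable ordering, so whether it helps here would have to be measured.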
I also have this problem with Java 1.5. I tried the following entropy fix: http://www.globus.org/mail_archive/gt-user/2006/09/msg00199.html but long delays still occur, even with simple jobs.
In general, the non-FIFO order gives better performance than FIFO order, but it can lead to starvation as mentioned. Using 'fair' locks also reduces performance, although it is hard to say what overall effect a 'fair' lock would have on performance in this case. But, looking at the createManageJob() code, I think the synchronization there is unnecessary. That is, home.create() does not need to be called under a global (service) lock. It only needs to be called under a job-specific lock. If the code were switched to a job-specific lock, this problem would disappear and the throughput of creating jobs would increase.
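The difference between a global lock and a job-specific lock can be sketched as follows. This is only an illustration under assumptions, not the actual GRAM code: PerJobLockSketch, lockFor(), and the string "resource" values are hypothetical, and the check-then-put body merely stands in for home.create(). The point is that threads creating different jobs synchronize on different monitors and therefore never queue behind each other.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: one lock object per job id instead of one service-wide lock.
class PerJobLockSketch {
    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, String> resources = new ConcurrentHashMap<>();

    // Returns the lock object for this job, creating it atomically on first use.
    private Object lockFor(String jobId) {
        return locks.computeIfAbsent(jobId, id -> new Object());
    }

    // Stand-in for the create call: only callers with the same jobId contend.
    String createJob(String jobId) {
        synchronized (lockFor(jobId)) {
            String existing = resources.get(jobId);
            if (existing != null) {
                return existing; // already created by an earlier call
            }
            String created = "resource-" + jobId; // placeholder for home.create()
            resources.put(jobId, created);
            return created;
        }
    }

    int jobCount() {
        return resources.size();
    }

    public static void main(String[] args) {
        PerJobLockSketch service = new PerJobLockSketch();
        System.out.println(service.createJob("job-1"));
        System.out.println(service.createJob("job-2"));
        System.out.println("jobs: " + service.jobCount());
    }
}
```

In this sketch the lock entries are never removed; a real implementation would need to drop a job's lock when the job is destroyed.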
Martin, We have this targeting 4.2, but this may be important and significant enough to look at in the 4.0.4 timeframe. What do you think? Can we try Jarek's suggestion and see if it solves the problem? -Stu
Sure, it seems to be a small change. But it should be done with care and tested like all threading issues.

Alan: So that I can check whether things improve, could you describe in a bit more detail what you mean by "abnormal response times" and "long delays":
* Do you mean response time between job submission and getting the EPR back or the time it takes for a job to be completely processed?
* Does this also happen when destroying a job or querying for resource properties?
* Did this also occur with jobs without staging?
* Did it occur consistently or only sometimes?
* Did it occur during more or less sequential job submission too, or only during concurrent job submission?
* What was the container load when it occurred: only under heavy load with many jobs or also with just a few jobs running?
(In reply to comment #5)
> Sure, it seems to be a small change. But it should be done with care and tested
> like all threading issues.
> Alan: In order for me to be able to check if things improve: Could you describe
> a bit more detailed what you mean by "abnormal response times" and
> "long delays":
> * Do you mean response time between job submission and getting the EPR back
> or the time it takes for a job to be completely processed?

Yes.

> * Does this also happen when destroying a job or querying for resource
> properties?

I'd have to have a specific scenario. As far as I can tell, destroying a job happens fairly quickly, but my tests in this area have not been exhaustive. After encountering problems in Java 1.5, I went back to 1.4.2 in production environments.

> * This also occurred with jobs without staging?

Yes.

> * Did it occur consistently or only sometimes?

Consistently, in that any particular job was likely to encounter delays.

> * Did it occur during more or less sequential job submission too, or only
> during concurrent job submission?

I don't believe that concurrency was an issue.

> * What's the container load when it occurred: only under heavy load with many
> jobs or also with just a few jobs running?

Occurred even with a light load on the host machine.

Agree with your comments on testing. How can we set up an organized test? Note I have torn down the 1.5 test I was mounting, so I would need to recreate it.
Alan, sorry for the delay. Before changing anything I wanted to see that behaviour myself. Could you please confirm that the numbers are roughly comparable to what you experienced?

Here's what I did:
1. Built GT 4.0.4 two times: one with Sun's Java 1.4.2_13, one with Java 1.5.0_01.
2. Started the GT container and ran the stability test against it, i.e. created a steady load of 5 jobs being processed by the container all the time (in reality this may vary a bit). The jobs included file stage in and file stage out, the executable was /bin/date.
3. Then I submitted 50 simple jobs sequentially to the GT container and measured the time each one of them took. The values are measured in seconds.

Steps 2 and 3 were done with GT compiled under Java 1.4 and 1.5, and run under the 1.4 and 1.5 runtimes.

My execution environment is my (quite old) notebook:
Processor: mobile AMD Athlon(tm) XP-M 2500+
RAM: 512 MB

Find the script that measures the time of 50 jobs and the results attached.

All in all I experienced that jobs are processed faster when Java 1.4 was used for compilation of the GT and as runtime, and faster when 1.5 is used. Some jobs sometimes take a bit longer than others. If you look at the attached overview table: Is this what you mean by "long delays"? Can you confirm that these are approximately the timings you experienced?

I also noticed a bigger dispersion in the values when other (resource-consuming) applications like Firefox or Thunderbird were running.
Created an attachment (id=1203) Time measurements of 50 sequential job submissions
Created an attachment (id=1204) script to submit 50 jobs and measure time for each job in seconds
To clarify: I submitted the 50 sequential jobs while the stability test, which created the steady load, was running.
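The measurement procedure described above (50 sequential submissions, each timed in seconds) could be reproduced with a harness along these lines. This is not the attached script, whose contents are not shown here; SubmitTimer and timeSequential() are hypothetical names, and the Runnable passed in is a placeholder for whatever actually submits one job and waits for it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical timing harness: runs a submission action N times in sequence
// and records how long each run took, in seconds.
class SubmitTimer {
    static List<Double> timeSequential(int count, Runnable submit) {
        List<Double> seconds = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            long start = System.nanoTime();
            submit.run(); // placeholder for one blocking job submission
            seconds.add((System.nanoTime() - start) / 1e9);
        }
        return seconds;
    }

    public static void main(String[] args) {
        // The sleep below only simulates a job; replace it with a real submission.
        List<Double> timings = timeSequential(50, () -> {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        double sum = 0;
        double max = 0;
        for (double t : timings) {
            sum += t;
            max = Math.max(max, t);
        }
        System.out.printf("jobs=%d mean=%.3fs max=%.3fs%n",
                timings.size(), sum / timings.size(), max);
    }
}
```

Reporting the maximum alongside the mean makes occasional long delays visible, which is the symptom this bug is about.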
We do not see a problem with threads between Java 1.4 and 1.5. But there have been other changes that removed synchronization, and that synchronization may have been the problem.