Bug 4452 - job submission response is affected by java 1.5 thread processing
Status: RESOLVED INVALID
Product: GRAM
Component: wsrf managed execution job service
Version: 4.0.2
Platform: Macintosh All
Importance: P3 normal
Target Milestone: 4.2
Reported: 2006-05-26 11:32 by
Modified: 2008-02-04 11:24


Attachments
Time measurements of 50 sequential job submissions (2.65 KB, text/plain)
2007-03-08 04:38, Martin Feller
script to submit 50 jobs and measure time for each job in seconds (158 bytes, text/plain)
2007-03-08 04:39, Martin Feller



Description From 2006-05-26 11:32:51
The details of this problem are in the gram-user email thread linked below.
It also suggests solutions that should be considered.

http://www-unix.globus.org/mail_archive/gram-user/2006/05/msg00022.html
------- Comment #1 From 2006-10-02 18:15:15 -------
I'm just going to paste the last email on this thread since it's where all the
good meat is:

--------------
Apparently Java does not mandate any scheduling order for threads waiting
for monitor entrance. Hence, this order may vary between different JVM
implementations and operating systems.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6316090

In my environment, Java 1.5 queues up threads waiting for monitor entrance
in stack-order, which makes the implementation starvation-prone (if a
steady stream of threads are attempting to enter the monitor, the first
waiting thread will never be granted monitor entry and will hence be
starved out). In Java 1.4, however, threads seem to be queued in ("fair")
FIFO-order.

Consequently I restarted my container with Java 1.4 and I haven't (yet :)
observed any of the abnormal response times seen previously.

As a work-around Sun recommends the use of 'one of the excellent
"fair" lock constructs in java.util.concurrent.'

Maybe this approach should be taken by Globus developers in the future
since the synchronized construct, apparently, cannot be relied upon.

regards, Peter
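
For reference, a minimal sketch of the kind of "fair" lock the Sun
recommendation points at, using java.util.concurrent.locks.ReentrantLock with
its fairness flag set. FairLockSketch, createJob() and runDemo() are
hypothetical names used only for illustration; this is not the Globus code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a service method guarded by one global lock.
final class FairLockSketch {
    // 'true' selects the fair policy: under contention the lock favors
    // the longest-waiting thread instead of an unspecified order.
    private final ReentrantLock lock = new ReentrantLock(true);
    private int created = 0;

    int createJob() {
        lock.lock();
        try {
            created++;          // critical section
            return created;
        } finally {
            lock.unlock();      // always release, even on exceptions
        }
    }

    boolean isFair() {
        return lock.isFair();
    }

    // Starts 'threads' workers that each call createJob() 'perThread' times
    // and returns how many jobs those workers created in total.
    static int runDemo(int threads, int perThread) {
        final FairLockSketch service = new FairLockSketch();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread worker = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    service.createJob();
                }
            });
            workers.add(worker);
            worker.start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        // One extra call reads the counter under the lock; subtract it again.
        return service.createJob() - 1;
    }

    public static void main(String[] args) {
        System.out.println("jobs created: " + runDemo(8, 1000));
    }
}
```

Fair locks usually cost some throughput compared with the default policy, so
this trades speed for protection against starvation.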
------- Comment #2 From 2006-10-16 16:25:52 -------
I have this problem also with Java 1.5 .... tried the following entropy fix:

http://www.globus.org/mail_archive/gt-user/2006/09/msg00199.html

but long delays still occur, even with simple jobs.
------- Comment #3 From 2006-11-17 00:06:27 -------
In general, non-FIFO order gives better performance than FIFO order, but it
can lead to starvation as mentioned. Using 'fair' locks also reduces
performance, although it is hard to say what overall effect a 'fair' lock would
have on performance in this case.

But, looking at the createManageJob() code, I think the synchronization there
is unnecessary. That is, home.create() does not need to be called under a
global (service) lock. It only needs to be called under a job-specific lock. If
the code was switched to job-specific lock then this problem would disappear
and the throughput of creating jobs would increase.
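
A minimal sketch of the job-specific locking idea, assuming each job is
identified by a string ID. PerJobLockSketch and its fields are hypothetical,
and the string built inside the lock only stands in for home.create(); the
actual service code may look quite different.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration: one lock per job ID instead of one service-wide lock.
final class PerJobLockSketch {
    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, String> resources = new ConcurrentHashMap<>();

    String create(String jobId) {
        // Every caller using the same jobId gets the same lock object;
        // callers with different IDs never block each other.
        Object jobLock = locks.computeIfAbsent(jobId, id -> new Object());
        synchronized (jobLock) {
            String existing = resources.get(jobId);
            if (existing != null) {
                return existing;        // already created for this job
            }
            // Assumption: stands in for the real home.create() call.
            String created = "resource-for-" + jobId;
            resources.put(jobId, created);
            return created;
        }
    }

    int size() {
        return resources.size();
    }

    public static void main(String[] args) {
        PerJobLockSketch home = new PerJobLockSketch();
        System.out.println(home.create("job-1"));
        System.out.println(home.create("job-2"));
    }
}
```

In this sketch the lock map is never cleaned up; a real service would need to
remove a job's lock entry when the job is destroyed.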
------- Comment #4 From 2007-02-07 16:47:30 -------
Martin,

We have this targeting 4.2, but this may be important and significant enough to
look at in the 4.0.4 timeframe.  What do you think?  Can we try Jarek's
suggestion and see if it solves the problem?

-Stu
------- Comment #5 From 2007-02-08 02:29:51 -------
Sure, it seems to be a small change. But it should be done with care and tested 
like all threading issues.
Alan: So that I can check whether things improve, could you describe in a bit
more detail what you mean by "abnormal response times" and
"long delays":
* Do you mean response time between job submission and getting the EPR back
  or the time it takes for a job to be completely processed?
* Does this also happen when destroying a job or querying for resource
  properties?
* Did this also occur with jobs without staging?
* Did it occur consistently or only sometimes?
* Did it occur during more or less sequential job submission too, or only
  during concurrent job submission?
* What was the container load when it occurred: only under heavy load with many
  jobs or also with just a few jobs running?
------- Comment #6 From 2007-02-12 09:07:38 -------
(In reply to comment #5)
> Sure, it seems to be a small change. But it should be done with care and tested 
> like all threading issues.
> Alan: So that I can check whether things improve, could you describe in a bit
> more detail what you mean by "abnormal response times" and
> "long delays":
> * Do you mean response time between job submission and getting the EPR back
>   or the time it takes for a job to be completely processed?

Yes.

> * Does this also happen when destroying a job or querying for resource
>   properties?

I'd have to have a specific scenario.  As far as I can tell, destroying a job
happens fairly quickly, but my tests in this area have not been exhaustive. 
After encountering problems in Java 1.5, I went back to 1.4.2 in production
environments.

> * Did this also occur with jobs without staging?

Yes.

> * Did it occur consistently or only sometimes?

Consistently, in that any particular job was likely to encounter delays.  

> * Did it occur during more or less sequential job submission too, or only
>   during concurrent job submission?

I don't believe that concurrency was an issue.

> * What was the container load when it occurred: only under heavy load with many
>   jobs or also with just a few jobs running?

Occurred even with a light load on the host machine.

Agree with your comments on testing.  How can we set up an organized test?

Note I have torn down the 1.5 test I was mounting, so would need to recreate
it.
------- Comment #7 From 2007-03-08 04:36:18 -------
Alan, sorry for the delay. Before changing anything I wanted to see that
behaviour myself. Could you please confirm that the numbers are roughly
comparable to what you experienced?

Here's what I did:
1. Built GT 4.0.4 twice: once with Sun's Java 1.4.2_13, once with
   Java 1.5.0_01.
2. Started the GT container and ran the stability test against it, i.e. created
   a steady load of 5 jobs being processed by the container all the time (in
   reality this may vary a bit). The jobs included file stage in and file stage
   out; the executable was /bin/date.
3. Then I submitted 50 simple jobs sequentially to the GT container and
   measured the time each one of them took. The values are measured in
   seconds.

Steps 2 and 3 were done with the Java 1.4 and 1.5 builds and with the 1.4 and
1.5 runtimes.

My execution environment is my (quite old) notebook:
Processor: mobile AMD Athlon(tm) XP-M 2500+
RAM: 512 MB

The script that measures the time of the 50 jobs and the results are attached.
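
As a rough illustration of the measurement in step 3 (this is not the attached
script), the sketch below times a number of sequential runs of a task in
seconds. SequentialTimingSketch and timeRuns() are hypothetical names, and the
sleeping task in main() only stands in for a real job submission.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical timing harness: runs a task sequentially and records
// the wall-clock duration of each run in seconds.
final class SequentialTimingSketch {
    static List<Double> timeRuns(int runs, Runnable submitJob) {
        List<Double> seconds = new ArrayList<>(runs);
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            submitJob.run();                        // one "job submission"
            long elapsed = System.nanoTime() - start;
            seconds.add(elapsed / 1e9);             // nanoseconds to seconds
        }
        return seconds;
    }

    public static void main(String[] args) {
        // Assumption: a short sleep stands in for submitting a simple job.
        Runnable fakeSubmission = () -> {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        List<Double> times = timeRuns(50, fakeSubmission);
        for (int i = 0; i < times.size(); i++) {
            System.out.printf("job %d: %.3f s%n", i + 1, times.get(i));
        }
    }
}
```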

All in all I found that jobs are processed faster when Java 1.4 was used
for compilation of the GT and as runtime, and slower when 1.5 is used. Some
jobs sometimes take a bit longer than others.
If you look at the attached overview table: Is this what you mean by
"long delays"? Can you confirm that these are approximately the timings you
experienced?
I also noticed a bigger dispersion in the values when other
(resource-consuming) applications like Firefox or Thunderbird were running.
------- Comment #8 From 2007-03-08 04:38:08 -------
Created an attachment (id=1203) [details]
Time measurements of 50 sequential job submissions
------- Comment #9 From 2007-03-08 04:39:22 -------
Created an attachment (id=1204) [details]
script to submit 50 jobs and measure time for each job in seconds
------- Comment #10 From 2007-03-08 04:42:23 -------
Maybe that's not clear: I submitted the 50 sequential jobs while the stability
test, which created the steady load, was running.
------- Comment #11 From 2008-02-04 11:24:29 -------
We do not see a problem with threads between java 1.4 and 1.5.  But there have
been other changes that removed synchronization that may have been the
problem.