Bug 3342 - Reliable Job Create isn't entirely reliable
Summary: Reliable Job Create isn't entirely reliable
Status: RESOLVED FIXED
Product: GRAM
Component: wsrf managed job factory service
Version: 4.0.0
Platform: PC Linux
Priority: P3 normal
Target Milestone: 4.0.1
Related bug (dependency): 3348

Reported: 2005-05-12 09:01
Modified: 2005-08-03 17:11


Attachments
reliable-create.diff (4.46 KB, patch)
2005-05-12 09:02, Joe Bester

Description From 2005-05-12 09:01:41
I committed a reliable create test to the CVS trunk which attempts to submit a
job with the same Job Description and Job ID multiple times.

The test sometimes fails against the GT4 ManagedJobFactoryService. It looks like
the single-creation logic is not thread-safe. I'll attach a patch that seems to
fix the problem. I'd appreciate feedback from Peter before committing it.

joe
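
For readers following along, here is a minimal Java sketch of the kind of check-then-create race described above. It is not the ManagedJobFactoryService code and not the attached patch; `JobHomeSketch`, `createJob`, and `submitConcurrently` are hypothetical names, and the "resource" is just a string. The assumption is that creation looks up the job ID and creates a resource only if none exists, so the lookup and the insert have to happen atomically.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a job resource home; not the GT4 API.
class JobHomeSketch {
    private final Map<String, String> resources = new HashMap<>();
    private final AtomicInteger created = new AtomicInteger();

    // Without 'synchronized', two threads could both find the ID missing
    // and both create a resource for the same job ID.
    synchronized String createJob(String jobId, String jobDescription) {
        String existing = resources.get(jobId);
        if (existing != null) {
            return existing; // repeated submission: reuse the existing resource
        }
        String resource = "resource-" + created.incrementAndGet() + ":" + jobDescription;
        resources.put(jobId, resource);
        return resource;
    }

    int createdCount() {
        return created.get();
    }

    // Submits the same job ID from several threads at once and returns
    // how many resources were actually created.
    static int submitConcurrently(int threads) {
        JobHomeSketch home = new JobHomeSketch();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // line the threads up to maximize contention
                    home.createJob("job-42", "/bin/true");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        start.countDown();
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return home.createdCount();
    }
}
```

Calling `JobHomeSketch.submitConcurrently(8)` should report a single created resource. Holding one lock across the whole method is the simplest fix, but it serializes every creation, which is why throughput is the obvious thing to check.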
------- Comment #1 From 2005-05-12 09:02:34 -------
Created an attachment (id=607)
reliable-create.diff

Patch to fix reliable creation based on JobID.
------- Comment #2 From 2005-05-12 10:39:41 -------
Hmm. This sucks because I think it's going to degrade throughput tremendously.
I believe this is where I removed a synch block before to improve throughput
because it didn't seem necessary. That's primarily how we met our performance
goal for throughput. I guess we'll have to think of something else for 4.2.
------- Comment #3 From 2005-05-12 10:56:59 -------
Can you run the performance test with and without this patch for comparison?
------- Comment #4 From 2005-05-12 11:01:58 -------
Sure.  I'll put up a set of web pages and paste the link here when I'm done.
------- Comment #5 From 2005-05-12 11:04:21 -------
An alternative would be to have the code stick the job ID into the resource
home before initializing the resource, and then do resource initialization
outside of the lock. I'm not sure if that would create other troubles.
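
A rough Java sketch of that alternative, under the assumption that the job ID can be reserved with a placeholder before the resource exists. All names here (`ReservingHomeSketch`, `createJob`, `initialize`) are hypothetical and nothing is taken from the GT4 sources. The ID is reserved atomically with `putIfAbsent`, and the slow initialization runs outside any home-wide lock; later callers with the same ID wait on the first caller's result.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; names are illustrative and not taken from GT4.
class ReservingHomeSketch {
    private final ConcurrentHashMap<String, FutureTask<String>> byJobId =
            new ConcurrentHashMap<>();
    final AtomicInteger initializations = new AtomicInteger();

    String createJob(String jobId, String jobDescription) {
        FutureTask<String> mine =
                new FutureTask<String>(() -> initialize(jobDescription));
        // Reserve the job ID atomically; only the first caller's task is stored.
        FutureTask<String> winner = byJobId.putIfAbsent(jobId, mine);
        if (winner == null) {
            winner = mine;
            winner.run(); // initialization runs here, not under a shared lock
        }
        try {
            return winner.get(); // other callers block until the winner finishes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    // Stand-in for the expensive resource initialization.
    private String initialize(String jobDescription) {
        return "resource-" + initializations.incrementAndGet() + ":" + jobDescription;
    }
}
```

One possible trouble with this shape: if initialization fails, the failed placeholder stays in the map and every retry with that job ID sees the same failure unless the entry is removed.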
------- Comment #6 From 2005-05-12 11:28:05 -------
There would need to be some way of mapping the job ID to the job description
object to avoid the seriously bad race condition where I might get my resource
assigned to someone else's job ID if a simple queue were used. So instead of a
queue of new job IDs, have a hashtable of JD->ID (job description to job ID)
entries.
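
One way to read the JD->ID suggestion, sketched in Java with hypothetical names (`PendingJobsSketch`, `reserve`, `claim`, and the toy `JobDescription` class are all assumptions, not GT4 types). Each caller looks up the ID reserved for its own job description object instead of taking whatever is at the head of a shared queue. The map is keyed by object identity so that two callers with identical-looking descriptions still get their own IDs.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical sketch of pending-ID bookkeeping keyed by job description.
class PendingJobsSketch {
    // Toy stand-in for a job description object.
    static final class JobDescription {
        final String executable;

        JobDescription(String executable) {
            this.executable = executable;
        }
    }

    // Identity-keyed and synchronized: equal-looking descriptions from
    // different callers remain distinct entries.
    private final Map<JobDescription, String> pendingIds =
            Collections.synchronizedMap(new IdentityHashMap<JobDescription, String>());

    // Record the job ID chosen for this particular description object.
    void reserve(JobDescription jd, String jobId) {
        pendingIds.put(jd, jobId);
    }

    // Claim the ID reserved for this description; returns null if none is pending.
    String claim(JobDescription jd) {
        return pendingIds.remove(jd);
    }
}
```

With a plain queue, the second caller to finish could pop the first caller's ID; with the map, `claim(jd)` can only return the ID that was reserved for that `jd`.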
------- Comment #7 From 2005-05-13 18:44:39 -------
Here are the throughput stats for 4.0.0:

http://www-unix.mcs.anl.gov/~lane/Test-reports/Throughput/4.0.0/

Here is a sampling of throughput stats after applying the patch:

http://www-unix.mcs.anl.gov/~lane/Test-reports/Throughput/bug_3342/

Fortunately the patch didn't seem to affect throughput at all, so I'm fine with
seeing it committed.  Joe, I'll reassign this back to you for you to close when
it's committed.
------- Comment #8 From 2005-05-16 13:59:31 -------
Patch committed to trunk and globus_4_0_branch.