Bug 4253 - unable to monitor job for state changes
Status: RESOLVED FIXED
Product: GRAM
Component: wsrf managed execution job service
Version: 4.0.1
Hardware: Macintosh All
Importance: P3 normal
Target Milestone: 4.0.2
Assigned To:
 
Reported: 2006-03-06 16:38
Modified: 2006-03-21 14:14


Attachments
Patch to globus_fork_starter (6.83 KB, patch)
2006-03-20 09:05, Joe Bester




Description From 2006-03-06 16:38:07
Frequently, the larger throughput runs are turning up an "unable to monitor job
state changes" fault.

An example:
http://skynet-login.isi.edu/gram-testing/results/8ad818eb-097c-4200-8b70-2ec105808b7a/throughput-tester.log
------- Comment #1 From 2006-03-06 17:23:22 -------
As was discussed on the MUD, it's possible this is simply a problem with the
system reusing PIDs too soon. A race condition exists in this situation between
the unregistering of the PID from one job and the registering of the same PID
by a newly submitted job. Debug logging should be turned on for the
org.globus.exec.service.exec.JobManagerScript class so that PIDs can be
compared. To support the theory, look for two instances of the adapter
returning the same PID close together in time.
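One way to turn that logging on (a hedged sketch, assuming the stock GT4
container logging setup, i.e. a log4j properties file such as
container-log4j.properties) is to add a category entry for that class:

log4j.category.org.globus.exec.service.exec.JobManagerScript=DEBUG

Grepping the container log for the PIDs the adapter reports should then show
whether two jobs ever receive the same PID within a short window.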
------- Comment #2 From 2006-03-16 15:25:15 -------
It's happening every day in the tests, as far as I can tell.  Is there anything
you can do to add monitoring 
code to figure out what's causing it?  If it's a race, it's pretty reliable.
------- Comment #3 From 2006-03-16 16:43:16 -------
I suggested a way to see if PIDs were being reused in a short amount of time.
Are you asking me to add code to explicitly look for this situation? I'd rather
have some evidence in support of the theory before I risk wasting my time trying
to fix it.
------- Comment #4 From 2006-03-16 18:17:13 -------
The reproducibility of this makes me think it is not a PID problem, but
something else. Maybe it will be enough to log what the conflicting job ID is to
get a clue about what is going on.
------- Comment #5 From 2006-03-16 20:03:59 -------
http://skynet-login.isi.edu/gram-testing/

See tests 1143, 1144, 1146, and 1147

An interesting note is that I had to make the tests run longer (2000 seconds
vs. 1200 seconds) in order to trigger the problem with logging enabled. This is
probably because the shorter run did not submit enough jobs (about 800); the
longer test ran about 1300 jobs.
------- Comment #6 From 2006-03-20 09:05:54 -------
Created an attachment (id=891)
Patch to globus_fork_starter

Attaching a patch that prepends a UUID to the fork PID in the fork starter. This
should eliminate job ID reuse problems. No changes to the SEG should be needed
for this patch.
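The idea, roughly (a minimal sketch of the approach only, not the attached
patch; the real globus_fork_starter is built on the Globus common APIs rather
than the plain libuuid calls used here):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <uuid/uuid.h>   /* libuuid; link with -luuid */

/* Build a job id of the form "<uuid>:<pid>" so that a recycled PID can
 * never collide with the id handed out for an earlier job. */
static void make_job_id(pid_t pid, char *buf, size_t len)
{
    uuid_t uuid;
    char uuid_str[37];            /* 36 chars + terminating NUL */

    uuid_generate(uuid);
    uuid_unparse(uuid, uuid_str);
    snprintf(buf, len, "%s:%ld", uuid_str, (long) pid);
}

int main(void)
{
    char job_id[64];

    /* In the real starter the PID comes from fork(); getpid() stands in here. */
    make_job_id(getpid(), job_id, sizeof(job_id));
    printf("job id: %s\n", job_id);
    return 0;
}

Since a fresh UUID is generated for every submission, two jobs can only share
an id if both the UUID and the PID collide, which removes the PID-reuse window
described in comment #1.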
------- Comment #7 From 2006-03-21 10:00:46 -------
What's the status on this?  It looks like the HEAD version of the test is
passing now, but the branch is still failing.
------- Comment #8 From 2006-03-21 12:18:10 -------
I tested the patch, and it looks good to me:

http://skynet-login.isi.edu/gram-testing/
Tests 1413-1418
------- Comment #9 From 2006-03-21 14:14:34 -------
I've committed this patch to CVS on trunk and the 4.0 branch after Mats confirmed
the 4/4/never test succeeds.