Bugzilla – Bug 4253
unable to monitor job for state changes
Last modified: 2006-03-21 14:14:34
Frequently the larger throughput runs are turning up a "unable to monitor job
state changes" fault.
As was discussed on the MUD, it's possible this is simply a problem with the
system reusing PIDs too soon. In this situation, a race condition exists
between the unregistering of the PID from one job and the registering of the
same PID by a newly submitted job. Debug logging should be turned on for the
org.globus.exec.service.exec.JobManagerScript class. This allows for the
comparison of PIDs. To support the theory, one should look for two instances
of the adapter returning the same PID close together in time.
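The log-scanning approach suggested above could be sketched as follows. This is a minimal sketch, not part of the actual toolchain: the log line format, including the `pid=` field, is an assumption, since the real JobManagerScript debug output may look different.

```python
import re
from datetime import datetime

# Hypothetical debug log line format (an assumption -- the real
# JobManagerScript output may differ):
#   "2006-03-21 14:14:34 JobManagerScript: pid=12345"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*\bpid=(\d+)")

def find_reused_pids(lines, window_seconds=60):
    """Return (pid, gap_seconds) pairs where the same PID reappears
    within `window_seconds` of its previous occurrence."""
    last_seen = {}  # pid -> timestamp of most recent sighting
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        pid = int(m.group(2))
        if pid in last_seen:
            gap = (ts - last_seen[pid]).total_seconds()
            if gap <= window_seconds:
                hits.append((pid, gap))
        last_seen[pid] = ts
    return hits
```

If the race theory holds, running something like this over a failing test's log should show a reused PID shortly before each "unable to monitor job for state changes" fault.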
It's happening every day in the tests, as far as I can tell. Is there anything
you can do to add monitoring
code to figure out what's causing it? If it's a race, it's pretty reliable.
I suggested a way to see if PIDs were being reused in a short amount of time.
Are you asking me to add code to explicitly look for this situation? I'd rather
have some evidence in support of the theory before I risk wasting my time trying
to fix it.
The reproducibility of this makes me think it is not a PID problem, but
something else. Maybe it will be enough to log what the conflicting job ID is
to get a clue about what is going on.
See tests 1143, 1144, 1146, and 1147.
An interesting note is that I had to make the tests run for longer (2000 seconds
vs. 1200 seconds) in order to trigger the problem with logging enabled. This is
probably because the shorter run didn't submit enough jobs (about 800); with the
longer test, about 1300 jobs were run.
Created an attachment (id=891)
Patch to globus_fork_starter
Attaching a patch to prepend a uuid to the fork pid in the fork starter. This
should eliminate job id reuse problems. No changes should be needed for the SEG
for this patch.
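The idea behind the patch can be illustrated with a short sketch. This is not the actual globus_fork_starter code (which is C) and the exact identifier format is an assumption; it only shows why prepending a UUID to the PID removes the reuse collision.

```python
import uuid

def make_job_id(pid):
    """Compose a job identifier from a fresh UUID plus the OS PID.

    Mirrors the patch's idea: even if the kernel later hands the same
    PID to a new job, the combined identifier is still unique, so the
    unregister/register race over a bare PID can't cause a collision.
    The "uuid:pid" format here is an assumption for illustration.
    """
    return "%s:%d" % (uuid.uuid4(), pid)
```

Two jobs that happen to get the same PID now register under distinct identifiers, so the SEG side can keep matching on whatever substring it already uses without modification.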
What's the status on this? It looks like the HEAD version of the test is
passing now, but the branch is still failing.
I tested the patch, and it looks good to me.
I've committed this patch to CVS in trunk and 4.0 branch after Mats confirmed
the 4/4/never test succeeds.