Bugzilla – Bug 4253
unable to monitor job for state changes
Last modified: 2006-03-21 14:14:34
The larger throughput runs are frequently turning up an "unable to monitor job state changes" fault. An example: http://skynet-login.isi.edu/gram-testing/results/8ad818eb-097c-4200-8b70-2ec105808b7a/ throughput-tester.log
As was discussed on the MUD, it's possible this is simply a problem with the system reusing PIDs too soon. In that situation a race condition exists between unregistering the PID of one job and registering the same PID again for a newly submitted job. Debug logging should be turned on for the org.globus.exec.service.exec.JobManagerScript class so that the PIDs can be compared. To support the theory, one should look for two instances of the adapter returning the same PID close to each other in time.
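A minimal sketch of that check, for whoever ends up going through the debug output. It assumes a hypothetical log format in which each job submission produces one line starting with a "YYYY-MM-DD HH:MM:SS" timestamp and containing "pid=<number>"; the real JobManagerScript debug lines may look different, so the regular expression would need adjusting. The function name and the 60-second window are also just illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Hypothetical line format (an assumption, not the real log layout):
#   "2006-03-01 12:00:00,123 DEBUG ... pid=4242"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?\bpid=(\d+)")


def find_quick_pid_reuse(lines, window_seconds=60):
    """Return (pid, earlier_time, later_time) for PIDs seen twice within the window."""
    last_seen = {}  # pid -> timestamp of the most recent line reporting it
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not report a PID
        seen_at = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        pid = int(match.group(2))
        previous = last_seen.get(pid)
        if previous is not None and seen_at - previous <= timedelta(seconds=window_seconds):
            hits.append((pid, previous, seen_at))
        last_seen[pid] = seen_at
    return hits
```

Feeding it the lines of the container log, for example `find_quick_pid_reuse(open(path))` with `path` pointing at the log file, would list any PID returned twice within a minute. An empty result would count against the PID reuse theory.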
It's happening every day in the tests, as far as I can tell. Is there anything you can do to add monitoring code to figure out what's causing it? If it's a race, it's pretty reliable.
I suggested a way to see if PIDs were being reused in a short amount of time. Are you asking me to add code to explicitly look for this situation? I'd rather have some evidence in support of the theory before I risk wasting my time trying to fix it.
The reproducibility of this makes me think it is not a PID problem, but something else. Maybe it will be enough to log what the conflicting job ID is to get a clue about what is going on.
http://skynet-login.isi.edu/gram-testing/ See tests 1143, 1144, 1146, and 1147. An interesting note is that I had to make the tests run longer (2000 seconds vs. 1200 seconds) in order to trigger the problem with logging enabled. This is probably because the shorter run did not execute enough jobs, only about 800. With the longer test, about 1300 jobs were run.
Created an attachment (id=891): Patch to globus_fork_starter. Attaching a patch to prepend a UUID to the fork PID in the fork starter. This should eliminate job ID reuse problems. No changes should be needed for the SEG for this patch.
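For anyone reading along without the attachment, here is a rough illustration of the idea only, not the patch itself. The function names and the "<uuid>:<pid>" layout with a colon separator are assumptions made for this sketch; the actual format used in globus_fork_starter may differ.

```python
import uuid


def make_job_id(pid):
    """Build an identifier of the form '<uuid>:<pid>' (layout is an assumption)."""
    return f"{uuid.uuid4()}:{pid}"


def split_job_id(job_id):
    """Split an identifier back into its UUID prefix and integer PID."""
    prefix, _, pid_text = job_id.rpartition(":")
    return prefix, int(pid_text)


# Two jobs that happen to get the same PID still receive distinct identifiers.
first = make_job_id(4242)
second = make_job_id(4242)
```

Because the random prefix differs per job, a late unregister for the old job can no longer collide with the registration of a new job that was handed the same PID.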
What's the status on this? It looks like the HEAD version of the test is passing now, but the branch is still failing.
I tested the patch, and it looks good to me: http://skynet-login.isi.edu/gram-testing/ Tests 1413-1418.
I've committed this patch to CVS in trunk and 4.0 branch after Mats confirmed the 4/4/never test succeeds.