Bugzilla – Bug 5308
gwd doesn't recover AIDs and TIDs
Last modified: 2008-12-22 04:40:33
If gwd crashes and is then asked to recover past jobs, it will lose all the
information regarding Array and Task IDs.
More problems with job array recovery, reported on the mailing list by Emir:
We're using GridWay 5.3 in a multiuser environment. We find the job array
functionality with the PARAM variable extremely useful.
However, we noticed several problems with the recovery of job arrays. All
other jobs recovered fine, but with job arrays the gwd service simply
hangs. The last lines in the log were:
Recovering job 0.
Recovering job 97.
Transfer and execution MADs for the users are started, but they also just hang.
There is no useful message in the job logs. Also, gwd doesn't respond to the TERM
signal and has to be stopped with KILL.
The only way to make GridWay start again is to delete the jobs from the array.
Also, in one case, after we removed all jobs from a job array that had
previously been rescheduled, GridWay managed to recover jobs from a second job
array that hadn't been rescheduled. However, we didn't have a chance to
reproduce this later, so I can't confirm whether this is a rule or just luck.
A bigger problem is that when GridWay did recover jobs from a job
array, the ID and PARAM values of all jobs were set to 0. So even though they
were recovered, they were useless and we had to kill them.