Bug 5308 - gwd doesn't recover AIDs and TIDs
Status: NEW
Product: Gridway
Version: 5.2
Platform: All All
Importance: P3 normal
: ---
Assigned To:
Reported: 2007-05-17 06:45 by
Modified: 2008-12-22 04:40




Description From 2007-05-17 06:45:24
If gwd crashes and is then asked to recover past jobs, it loses all the
information about Array IDs (AIDs) and Task IDs (TIDs).
------- Comment #1 From 2008-12-22 04:40:33 -------
More problems with job array recovery, reported on the mailing list by Emir
Imamagic <eimamagi@srce.hr>:

We're using GridWay 5.3 in a multiuser environment. We find the job array
functionality with the PARAM variable extremely useful.

However, we noticed several problems with the recovery of job arrays. All
other jobs recovered fine, but in the case of job arrays the gwd service
simply hangs. The last messages in the log were:
 Recovering job 0.
 Recovering job 97.
The transfer and execution MADs for users are started but also just hang.
There is no useful message in job.logs. Also, gwd doesn't react to the TERM
signal and has to be brought down with KILL.

The only way to make GridWay start again is to delete the jobs from the array.

Also, in one case, after we removed all jobs from a job array that had
previously been rescheduled, GridWay managed to recover jobs from a second
job array that hadn't been rescheduled. However, we didn't have a chance to
reproduce this later, so I can't confirm whether this is a rule or just luck.

A bigger problem is that when GridWay did recover jobs from a job array,
the ID and PARAM values of all jobs were set to 0. So even though they were
recovered, they were useless and we had to kill them.
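
As a stopgap for the hang described above, where gwd ignores TERM and has to
be stopped with KILL, a wrapper could send TERM first and escalate to KILL
only after a grace period. The following is a minimal Python sketch and is not
part of GridWay: it assumes the daemon is started by the wrapper itself as a
child process, and stop_process, spawn_term_ignoring_child and the grace
period are hypothetical names and values chosen for illustration. The demo
child is a stand-in that ignores TERM; it is not gwd.

```python
import signal
import subprocess
import sys


def stop_process(proc, grace_seconds=5.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    Returns "TERM" or "KILL" depending on which signal stopped the process.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace_seconds)
        return "TERM"
    except subprocess.TimeoutExpired:
        proc.kill()   # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()   # reap the child so it does not linger as a zombie
        return "KILL"


def spawn_term_ignoring_child():
    """Start a stand-in child (NOT gwd) that ignores SIGTERM, as in the report."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(60)\n"
    )
    child = subprocess.Popen([sys.executable, "-c", code],
                             stdout=subprocess.PIPE)
    child.stdout.readline()  # block until SIGTERM is actually being ignored
    return child


if __name__ == "__main__":
    child = spawn_term_ignoring_child()
    print(stop_process(child, grace_seconds=1.0))
```

This only automates the manual workaround. It does not address the recovery
hang itself or the loss of the ID and PARAM values.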