Bugzilla – Bug 6502
First audit log for a job is too late
Last modified: 2008-10-27 12:21:22
We store the first audit event ("job started") for a job in the audit DB when processing of the job starts in the StateMachine, not when the job resource is created as part of the job submission request. In certain situations this is too late and causes problems.

Say we have 1000 jobs in the container. Many of them will sit idle for a while before we start processing them, because not all of them can be processed at the same time. If a destruction request comes in for one or more of them, the jobs that have not yet started processing skip the step where the first audit event is normally inserted into the DB. When such a job finally reaches state "Failed", we try to update its audit record in the database, which fails because we never inserted anything for this job.

Furthermore, because we want to be reliable, we store audit records to disk when an update fails and periodically retry uploading them to the database. In this situation the retry will never succeed, though, because it's a systematic error in ws-gram and not a DB-related problem.

The fix is rather easy: persist the first audit record for a job when the job resource has been successfully created, not when we start processing the job in the StateMachine.
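To illustrate the failure mode and the fix, here is a minimal sketch in Python. It is not the ws-gram code: AuditDB, JobManager, the state names other than "Failed", and the in-memory retry list are all hypothetical stand-ins, and the real service is assumed to behave only roughly like this.

```python
class AuditDB:
    """Hypothetical stand-in for the audit database: one row per job id."""

    def __init__(self):
        self.rows = {}

    def insert(self, job_id, state):
        self.rows[job_id] = state

    def update(self, job_id, state):
        # Modeled as an error when no row exists for the job.
        if job_id not in self.rows:
            raise KeyError(f"no audit record for job {job_id}")
        self.rows[job_id] = state


class JobManager:
    """Hypothetical model of job handling, old and new behavior."""

    def __init__(self, db, audit_on_create):
        self.db = db
        self.audit_on_create = audit_on_create
        # Stand-in for the audit records stored on disk for later retry.
        self.retry_queue = []

    def create_job(self, job_id):
        # New behavior: first audit record written at resource creation.
        if self.audit_on_create:
            self.db.insert(job_id, "Created")

    def start_processing(self, job_id):
        # Old behavior: first audit record written when processing starts.
        if not self.audit_on_create:
            self.db.insert(job_id, "Started")

    def finish(self, job_id, state):
        try:
            self.db.update(job_id, state)
        except KeyError:
            # A retry cannot succeed later: the row still does not exist.
            self.retry_queue.append((job_id, state))


def destroy_before_processing(audit_on_create):
    """Create a job and destroy it before start_processing is ever called."""
    db = AuditDB()
    manager = JobManager(db, audit_on_create)
    manager.create_job("job-1")
    manager.finish("job-1", "Failed")  # destruction request while idle
    return db.rows, manager.retry_queue
```

With audit_on_create=False the final update finds no row and the record is queued for a retry that can never succeed. With audit_on_create=True the row already exists when the job reaches "Failed", so the update goes through and nothing is queued.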
Fix committed to 4.0 branch.