Bug 6502 - First audit log for a job is too late
Status: RESOLVED FIXED
Product: GRAM
Component: wsrf managed execution job service
Version: 4.0.8
Platform: Macintosh All
Importance: P3 major
Target Milestone: 4.0.9
Reported: 2008-10-25 08:52
Modified: 2008-10-27 12:21


Description From 2008-10-25 08:52:06
We store the first audit event (event "job started") for a job in
the audit DB when the processing of a job starts in the StateMachine,
and not when the job resource is created as part of the job submission
request. In certain situations this is too late and causes problems:

Say we have 1000 jobs in the container. Many of them will
sit idle for a while before we start processing them, because
not all of them can be processed at the same time.
If a destruction request comes in for one or more of them, the jobs
that have not yet started processing skip the step where the first
audit event is normally inserted into the DB.
But when such a job finally reaches state "Failed", we try to update
its audit record in the database, which fails because we have not
inserted anything for this job yet.
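
The failure mode can be illustrated with a minimal sketch. It is not
the ws-gram code: the table layout, the function names and the use of
an in-memory SQLite database are all assumptions made for illustration,
and a failed update is modelled as an UPDATE that matches no row.

```python
import sqlite3

# Hypothetical schema; the real ws-gram audit table differs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE audit (job_id TEXT PRIMARY KEY, event TEXT)")

def start_processing(job_id):
    # Old behavior: the "job started" record is inserted only here.
    db.execute("INSERT INTO audit (job_id, event) VALUES (?, ?)",
               (job_id, "job started"))

def finish(job_id, state):
    # Final-state update; returns False when there is no record to update.
    cur = db.execute("UPDATE audit SET event = ? WHERE job_id = ?",
                     (state, job_id))
    return cur.rowcount == 1

# job-1 is processed normally.
start_processing("job-1")
processed_ok = finish("job-1", "Done")

# job-2 is destroyed while still idle, so start_processing never runs.
destroyed_ok = finish("job-2", "Failed")
```

In this sketch processed_ok is True and destroyed_ok is False, which
corresponds to the failing update described above.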

Furthermore: because we want to be reliable, we store audit records
to disk when an update fails and periodically retry uploading them to
the database. In this situation the retry will never succeed, though,
because the cause is a systematic error in ws-gram and not a
DB-related problem.
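
A rough sketch of why retrying does not help here. The on-disk store is
modelled as a plain list, and retry_spool, try_update and the record
layout are hypothetical names, not part of ws-gram.

```python
def retry_spool(spool, try_update):
    """Retry every spooled record once; keep the ones that still fail."""
    return [rec for rec in spool if not try_update(rec)]

# Job IDs that have an initial audit record in the (simulated) DB.
existing = {"job-1"}

def try_update(rec):
    # An update can only succeed if the initial record was inserted.
    return rec["job_id"] in existing

# job-2 was destroyed before processing, so it has no initial record.
spool = [{"job_id": "job-2", "state": "Failed"}]

# Simulate three periodic retry rounds.
for _ in range(3):
    spool = retry_spool(spool, try_update)
```

However many rounds are run, the record for job-2 stays in the spool,
because nothing ever creates the row it is supposed to update.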

The fix is rather easy: Persist the first audit record for a job
when a job resource has been successfully created, and not when
we start processing the job in the StateMachine.
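
A minimal sketch of the proposed ordering, again using a hypothetical
schema and function names and an in-memory SQLite database rather than
the actual ws-gram implementation: the first audit record is written as
soon as the job resource exists, so the later update always finds a row.

```python
import sqlite3

# Hypothetical schema; the real ws-gram audit table differs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE audit (job_id TEXT PRIMARY KEY, event TEXT)")

def create_job_resource(job_id):
    # Fixed behavior: persist the first audit record right after the
    # job resource has been created, before any processing starts.
    db.execute("INSERT INTO audit (job_id, event) VALUES (?, ?)",
               (job_id, "job started"))

def finish(job_id, state):
    # Final-state update; returns False when there is no record to update.
    cur = db.execute("UPDATE audit SET event = ? WHERE job_id = ?",
                     (state, job_id))
    return cur.rowcount == 1

create_job_resource("job-2")

# Destroyed while idle: processing in the StateMachine never starts,
# but the update still succeeds because the record already exists.
destroyed_ok = finish("job-2", "Failed")
final_event = db.execute("SELECT event FROM audit WHERE job_id = ?",
                         ("job-2",)).fetchone()[0]
```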
------- Comment #1 From 2008-10-27 12:21:22 -------
Fix committed to 4.0 branch.