Bug 6400 - Fix audit logging problems in Gram2 and Gram4 in globus_4_0_branch
Status: RESOLVED FIXED
Product: GRAM
Component: wsrf managed execution job service
Version: 4.0.7
Platform: Macintosh All
Importance: P3 major
Target Milestone: 4.0.9
Reported: 2008-09-18 16:38
Modified: 2009-03-19 09:48



Description From 2008-09-18 16:38:08
Audit logging in pre-WS GRAM and WS-GRAM in 4.0.5+ has several critical
problems; see bugs 5713, 5777, 5778, 6203, and item 2 in bug 6358
to get an idea.
Additionally we would like to have the fallback functionality we
implemented in Gram4 for 4.2.1+ (see item 3 in bug 6357).

All of this is already done in globus_4_2_branch (will be in 4.2.1) and in
trunk but it cannot just be copied over to globus_4_0_branch because
we don't have JPA support in globus_4_0_branch.
------- Comment #1 From 2008-10-03 14:00:11 -------
Committed to the 4.0 branch. The tests, originally written for the 4.2 branch
and ported to the 4.0 branch, pass.
------- Comment #2 From 2008-10-16 09:56:30 -------
Having tests does not mean that there's not a bug, or two ...

1. There are 3 types of audit events: "job started" (insert into the database),
   "job queued" (update in the database), and "job finished" (update in the
   database).
   If any db interaction fails, we write a fallback record file to disk
   and periodically retry the db upload.
   Suppose the database goes down after "job started" (i.e. the insert into
   the database went ok) and becomes available again right before the
   "job finished" event. We then have one fallback record file for
   "job queued", because the db was down at that time, while the update of
   the audit record in the db for the "job finished" event was successful.
   When the periodic retry later uploads the "job queued" event into the db,
   it overwrites certain fields that had been written by the "job finished"
   event.
   Not so difficult to fix.

2. I almost always tested with MySQL as the DB system. Yesterday evening I used
   PostgreSQL and found that, once PostgreSQL goes down, the db connection
   pool has broken connections and does not seem to recreate them, even if
   PostgreSQL comes back again.
   I.e. once PostgreSQL goes down, all audit records will be written as
   fallback files and cannot be uploaded, even if PostgreSQL comes back,
   until the container restarts (and reconnects to PostgreSQL).
   It works ok with MySQL, i.e. if the database comes back all db interactions
   with the audit db work fine again.
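
To make the race in item 1 concrete, here is a minimal sketch of one way a
late fallback upload could be kept from undoing a newer update. It is not the
fix that was committed for this bug; the table layout, column names, event
ranks and the apply_event helper are all hypothetical, and SQLite stands in
for the audit database only so that the example is self-contained. The idea
is that a replayed event may still fill in its own field, but the job state
is only ever moved forward.

```python
import sqlite3

# Hypothetical ordering of the three audit events (not GRAM's real schema).
EVENT_RANK = {"started": 0, "queued": 1, "finished": 2}
TIME_COLUMN = {"queued": "queued_time", "finished": "finished_time"}


def make_db():
    """Create an in-memory stand-in for the audit database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE audit (job_id TEXT PRIMARY KEY, state TEXT, "
        "state_rank INTEGER, queued_time TEXT, finished_time TEXT)"
    )
    return db


def apply_event(db, job_id, event, timestamp):
    """Apply one audit event without letting an older event regress the state."""
    rank = EVENT_RANK[event]
    if event == "started":
        db.execute(
            "INSERT INTO audit (job_id, state, state_rank) VALUES (?, ?, ?)",
            (job_id, event, rank),
        )
        return
    column = TIME_COLUMN[event]  # taken from a fixed dict, never from user input
    # The event's own timestamp is always recorded; state and state_rank only
    # change when the incoming event is newer than what is already stored.
    db.execute(
        f"UPDATE audit SET {column} = ?, "
        "state = CASE WHEN state_rank < ? THEN ? ELSE state END, "
        "state_rank = MAX(state_rank, ?) "
        "WHERE job_id = ?",
        (timestamp, rank, event, rank, job_id),
    )


def replay_demo():
    """Replay the scenario from item 1: 'queued' arrives after 'finished'."""
    db = make_db()
    apply_event(db, "job-1", "started", "t0")
    # The db is down here, so "queued" only exists as a fallback file.
    apply_event(db, "job-1", "finished", "t2")
    # Later the periodic retry uploads the stale "queued" record.
    apply_event(db, "job-1", "queued", "t1")
    return db.execute(
        "SELECT state, queued_time, finished_time FROM audit WHERE job_id = ?",
        ("job-1",),
    ).fetchone()


if __name__ == "__main__":
    print(replay_demo())
```

With this guard the late "job queued" upload still records its own timestamp,
but the state written by "job finished" stays in place.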

The fix for 2. is the DataSource parameter "testOnBorrow" combined with the
parameter "validationQuery". Together they make the connection pool check a
connection with the configured SQL query before it is handed to an
application. If the check fails, a new connection seems to be created.
The MySQL driver seems to do something like this implicitly, but the
PostgreSQL driver does not. If I specify "testOnBorrow" and "validationQuery"
in the JNDI audit db configuration, the db interactions work ok again
after a PostgreSQL restart.
MySQL works as expected too with these new parameters.
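
As an illustration of what validate-on-borrow means, here is a minimal toy
pool, not the connection pool used by the container. The ValidatingPool
class and its method names are made up for this sketch, SQLite stands in for
the audit database, and closing a pooled connection is used to simulate a
database restart. Each idle connection is tested with the validation query
before it is handed out; a connection that fails the test is dropped and a
fresh one is opened instead.

```python
import sqlite3


class ValidatingPool:
    """Toy pool that tests each idle connection before lending it out."""

    def __init__(self, connect, validation_query="SELECT 1"):
        self._connect = connect              # factory that opens a new connection
        self._validation_query = validation_query
        self._idle = []                      # connections waiting to be reused

    def borrow(self):
        """Return a working connection, discarding broken idle ones."""
        while self._idle:
            conn = self._idle.pop()
            if self._is_valid(conn):
                return conn
            # Broken connection: drop it and look at the next idle one.
        return self._connect()

    def give_back(self, conn):
        """Return a connection to the pool for later reuse."""
        self._idle.append(conn)

    def _is_valid(self, conn):
        """Run the validation query; any database error means 'broken'."""
        try:
            conn.execute(self._validation_query).fetchone()
            return True
        except sqlite3.Error:
            return False


def restart_demo():
    """Simulate a db restart that breaks the one pooled connection."""
    pool = ValidatingPool(lambda: sqlite3.connect(":memory:"))
    first = pool.borrow()
    pool.give_back(first)
    first.close()  # stands in for the database going down and coming back
    second = pool.borrow()
    usable = second.execute("SELECT 1").fetchone() == (1,)
    return second is not first, usable


if __name__ == "__main__":
    print(restart_demo())
```

Without the check in borrow(), the closed connection would be handed out
again and every later db interaction would fail, which matches the behaviour
seen with PostgreSQL before the two parameters were set.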

I'd like to test these things in a Condor-G submission of 1000 jobs where
I randomly shut down and restart the db. At the end all audit records
should show up ok in the db.

When this works fine I'll commit it.

Tom: I hope you don't rely on the code in the 4.0 branch at the moment and
forgive me that I have to change it. I'll update this bug once I'm done,
hopefully today.
------- Comment #3 From 2008-10-16 11:35:29 -------
(In reply to comment #2)
> 
> Tom: I hope you don't rely on the code in 4.0 branch...

No, I don't, thanks for asking.
------- Comment #4 From 2008-10-16 17:05:12 -------
Fix committed.
I ran a couple of Condor-G tests with PostgreSQL and MySQL as the WS-GRAM
audit database. I stopped and restarted the database several times. It worked
fine for me with both db systems: at the end each job had a full audit
record in the database, and all fallback records that had been temporarily
stored as files in the filesystem while the db was down were gone.
------- Comment #5 From 2009-03-19 09:47:44 -------
*** Bug 5713 has been marked as a duplicate of this bug. ***
------- Comment #6 From 2009-03-19 09:48:31 -------
*** Bug 5777 has been marked as a duplicate of this bug. ***
------- Comment #7 From 2009-03-19 09:48:57 -------
*** Bug 5778 has been marked as a duplicate of this bug. ***