Bugzilla – Bug 6400
Fix audit logging problems in Gram2 and Gram4 in globus_4_0_branch
Last modified: 2009-03-19 09:48:57
Audit logging in Pre-WS GRAM and WS-GRAM in 4.0.5+ has a number of critical
problems. See bugs 5713, 5777, 5778, 6203, and item 2 in bug 6358
to get an idea.
Additionally we would like to have the fallback functionality we
implemented in Gram4 for 4.2.1+ (see item 3 in bug 6357).
All of this is already done in globus_4_2_branch (will be in 4.2.1) and in
trunk but it cannot just be copied over to globus_4_0_branch because
we don't have JPA support in globus_4_0_branch.
Committed to the 4.0 branch. The tests, originally written for the 4.2 branch
and ported to the 4.0 branch, pass.
Having tests does not mean that there's not a bug, or two ...
1. There are 3 types of audit events: "job started" (insert into database),
"job queued" (update in database), "job finished" (update in database).
If any db interaction fails we'll write a fallback record file to disk
and retry the db upload later (periodically).
If the database goes down after "job started", i.e. insert into the
database went ok, and the db is available again right before the event
"job finished", we'll have one fallback record file for "job queued",
because the db was down at that time.
But the update of the audit record in the db for the "job finished" event
succeeds, because the db is up again by then.
Now we periodically try to upload the "job queued" event into the db and
will overwrite certain fields that had been written by the "job finished"
event.
Not so difficult to fix.
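
One possible shape for such a fix, sketched in plain Java: remember which
event last wrote the row and refuse to apply a replayed fallback record that
is older than that. This is only an illustration of the ordering problem in
item 1, not the code that was committed. The class AuditReplaySketch, the
Row type, the Outcome values and the in-memory map standing in for the audit
db are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: class, field, and method names are illustrative only.
class AuditReplaySketch {
    // Event order from the report: started (insert), queued (update), finished (update).
    enum Event { STARTED, QUEUED, FINISHED }

    enum Outcome { APPLIED, SKIPPED_STALE, RETRY_LATER }

    // Stand-in for one audit row; a real table would need some way to record progress.
    static final class Row {
        Event lastEvent;
        String status;

        Row(Event lastEvent, String status) {
            this.lastEvent = lastEvent;
            this.status = status;
        }
    }

    // Applies a (possibly replayed) update to the in-memory stand-in for the audit db.
    static Outcome applyUpdate(Map<String, Row> db, String jobId, Event event, String status) {
        Row row = db.get(jobId);
        if (row == null) {
            return Outcome.RETRY_LATER;   // insert not uploaded yet: keep the fallback file
        }
        if (event.compareTo(row.lastEvent) <= 0) {
            return Outcome.SKIPPED_STALE; // a later event already wrote the row: do not overwrite
        }
        row.lastEvent = event;
        row.status = status;
        return Outcome.APPLIED;
    }

    public static void main(String[] args) {
        Map<String, Row> db = new HashMap<>();
        db.put("job-1", new Row(Event.STARTED, "started"));   // insert went ok
        // db is down during "job queued": that update exists only as a fallback file
        applyUpdate(db, "job-1", Event.FINISHED, "finished"); // db is back, update succeeds
        // periodic upload of the old "job queued" fallback record
        Outcome replay = applyUpdate(db, "job-1", Event.QUEUED, "queued");
        System.out.println(replay + " " + db.get("job-1").status);
    }
}
```

In this sketch the late "job queued" record is dropped as a whole. If that
event carries fields that no other event writes, a real fix would probably
have to fill in only those fields instead of skipping the record entirely.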
2. I almost always tested with MySQL as DB system. Yesterday evening I used
PostgreSQL and found that, once PostgreSQL goes down, the db connection
pool has broken connections and does not seem to recreate them, even if
PostgreSQL comes back again.
I.e. once PostgreSQL goes down, all audit records will be written as
fallback files and cannot be uploaded, even if PostgreSQL comes back,
until the container restarts (and reconnects to PostgreSQL).
It works ok with MySQL, i.e. if the database comes back all db interactions
with the audit db work fine again.
The fix for 2. is the DataSource parameter "testOnBorrow" combined with the
parameter "validationQuery". Together they tell the connection pool to
check a connection with the configured SQL query before it is handed
to an application. If the check fails, a new connection seems to be created.
The MySQL driver seems to do something like that implicitly, but the
PostgreSQL driver does not. If I specify "testOnBorrow" and "validationQuery"
in the JNDI audit db configuration, the db interactions work ok again
after a PostgreSQL restart.
MySQL works as expected too with these new parameters.
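
To illustrate what "testOnBorrow" plus "validationQuery" do, here is a toy
pool in plain Java. It is a sketch of the mechanism described above, not the
pool implementation used by the container. ValidatingPoolSketch, Conn,
FakeConn and runValidationQuery() are hypothetical names, and
runValidationQuery() stands in for executing the configured SQL query (a
cheap statement such as "SELECT 1" would be a typical choice, but that is an
assumption, not something taken from the actual configuration).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Hypothetical toy pool: shows the test-on-borrow idea only.
class ValidatingPoolSketch {
    // Stand-in for a db connection; the method plays the role of the validation query.
    interface Conn {
        boolean runValidationQuery();
    }

    // Fake connection whose "broken" flag simulates a db restart behind its back.
    static final class FakeConn implements Conn {
        boolean broken;

        public boolean runValidationQuery() {
            return !broken;
        }
    }

    private final Deque<Conn> idle = new ArrayDeque<>();
    private final Supplier<Conn> factory;
    private final boolean testOnBorrow;

    ValidatingPoolSketch(Supplier<Conn> factory, boolean testOnBorrow) {
        this.factory = factory;
        this.testOnBorrow = testOnBorrow;
    }

    // Hands out an idle connection, validating it first when testOnBorrow is set.
    Conn borrow() {
        while (!idle.isEmpty()) {
            Conn candidate = idle.pop();
            if (!testOnBorrow || candidate.runValidationQuery()) {
                return candidate;
            }
            // validation failed: drop the broken connection and try the next one
        }
        return factory.get(); // nothing usable is idle: open a fresh connection
    }

    // Returns a connection to the idle list without checking it.
    void giveBack(Conn conn) {
        idle.push(conn);
    }

    public static void main(String[] args) {
        ValidatingPoolSketch pool = new ValidatingPoolSketch(FakeConn::new, true);
        FakeConn first = (FakeConn) pool.borrow();
        pool.giveBack(first);
        first.broken = true;                       // db went down and came back
        System.out.println(pool.borrow() != first); // a fresh connection is handed out
    }
}
```

With testOnBorrow set to false the same pool keeps handing out the broken
connection, which matches the behaviour seen with PostgreSQL before the
parameters were added.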
I'd like to test these things in a Condor-G submission of 1000 jobs where
I randomly shut down and restart the db. At the end all audit records
should show up ok in the db.
When this works fine I'll commit it.
Tom: I hope you don't rely on the code in the 4.0 branch at the moment and
forgive me for having to change it. I'll update this bug once I'm done.
(In reply to comment #2)
> Tom: I hope you don't rely on the code in 4.0 branch...
No, I don't, thanks for asking.
I ran a couple of Condor-G tests with PostgreSQL and MySQL as the WS-GRAM
audit database. I stopped and restarted the database several times. It worked
fine for me with both db systems: at the end each job had a full audit
record in the database, and all fallback records that had been stored
temporarily as files in the filesystem while the db was down were gone.
*** Bug 5713 has been marked as a duplicate of this bug. ***
*** Bug 5777 has been marked as a duplicate of this bug. ***
*** Bug 5778 has been marked as a duplicate of this bug. ***