Bugzilla – Bug 6400
Fix audit logging problems in Gram2 and Gram4 in globus_4_0_branch
Last modified: 2009-03-19 09:48:57
Audit logging in Pre-WS GRAM and WS-GRAM in 4.0.5+ has a number of critical problems; see bugs 5713, 5777, 5778, 6203, and item 2 in 6358 to get an idea. Additionally, we would like to have the fallback functionality we implemented in Gram4 for 4.2.1+ (see item 3 in bug 6357). All of this is already done in globus_4_2_branch (will be in 4.2.1) and in trunk, but it cannot simply be copied over to globus_4_0_branch because we don't have JPA support in globus_4_0_branch.
Committed to the 4.0 branch. The tests, originally written for the 4.2 branch and ported to the 4.0 branch, pass.
Having tests does not mean that there isn't a bug, or two ...

1. There are 3 types of audit events: "job started" (insert into the database), "job queued" (update in the database), and "job finished" (update in the database). If any db interaction fails, we write a fallback record file to disk and retry the db upload later (periodically). Suppose the database goes down after "job started" (i.e. the insert into the database went ok) and is available again right before the "job finished" event. We then have one fallback record file for "job queued", because the db was down at that time, while the update of the audit record for the "job finished" event was successful. When we later retry the upload of the "job queued" event, it overwrites certain fields that had been written by the "job finished" event. Not so difficult to fix.

2. I almost always tested with MySQL as the DB system. Yesterday evening I used PostgreSQL and found that, once PostgreSQL goes down, the db connection pool holds broken connections and does not seem to recreate them, even when PostgreSQL comes back. I.e. once PostgreSQL goes down, all audit records are written as fallback files and cannot be uploaded, even after PostgreSQL is back, until the container restarts (and reconnects to PostgreSQL). It works ok with MySQL: once the database comes back, all interactions with the audit db work fine again.

The fix for 2. is the DataSource parameter "testOnBorrow" combined with the parameter "validationQuery". With these set, the connection pool checks a connection with the configured SQL query before handing it to an application; if the check fails, a new connection seems to be created. The MySQL driver seems to do something like this implicitly, but the PostgreSQL driver does not. If I specify "testOnBorrow" and "validationQuery" in the JNDI audit db configuration, the db interactions work ok again after a PostgreSQL restart. MySQL works as expected with these new parameters too.

I'd like to test these things in a Condor-G submission of 1000 jobs during which I randomly shut down and restart the db. At the end, all audit records should show up ok in the db. When this works fine I'll commit it. Tom: I hope you don't rely on the code in 4.0 branch at the moment and forgive me that I have to change it. I'll update this bug once I'm done, hopefully today.
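To illustrate what "testOnBorrow" with a "validationQuery" buys us, here is a minimal, self-contained Java sketch of the idea. It is not the connection pool's actual code: the classes TestOnBorrowSketch, FakeConnection and Pool are hypothetical, and a simple predicate stands in for running the validation query against the database.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical sketch of test-on-borrow; not the real pool implementation. */
final class TestOnBorrowSketch {

    /** Stand-in for a database connection (hypothetical). */
    static final class FakeConnection {
        final int id;
        boolean broken; // set to true to simulate a db restart killing the connection

        FakeConnection(int id) {
            this.id = id;
        }
    }

    /** Tiny pool that can validate idle connections before handing them out. */
    static final class Pool {
        private final Deque<FakeConnection> idle = new ArrayDeque<>();
        private final Supplier<FakeConnection> factory;
        private final Predicate<FakeConnection> validator; // plays the role of validationQuery
        private final boolean testOnBorrow;

        Pool(Supplier<FakeConnection> factory,
             Predicate<FakeConnection> validator,
             boolean testOnBorrow) {
            this.factory = factory;
            this.validator = validator;
            this.testOnBorrow = testOnBorrow;
        }

        /** Returns an idle connection that passes validation, else a fresh one. */
        FakeConnection borrow() {
            while (!idle.isEmpty()) {
                FakeConnection c = idle.pop();
                if (!testOnBorrow || validator.test(c)) {
                    return c;
                }
                // Validation failed: drop the broken connection and keep looking.
            }
            return factory.get();
        }

        /** Puts a connection back into the idle set. */
        void giveBack(FakeConnection c) {
            idle.push(c);
        }
    }

    public static void main(String[] args) {
        int[] counter = {0};
        Pool pool = new Pool(() -> new FakeConnection(++counter[0]), c -> !c.broken, true);
        FakeConnection first = pool.borrow();
        pool.giveBack(first);
        first.broken = true; // simulate the database going down and coming back
        FakeConnection second = pool.borrow();
        System.out.println("reused broken connection: " + (second == first));
    }
}
```

With the flag switched off, the same pool hands the broken connection straight back to the caller, which matches the PostgreSQL behaviour described above.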
(In reply to comment #2)
> Tom: I hope you don't rely on the code in 4.0 branch...
No, I don't, thanks for asking.
Fix committed. I ran a couple of Condor-G tests with PostgreSQL and MySQL as the WS-GRAM audit database, stopping and restarting the database several times. It worked fine for me with both db systems: at the end each job had a full audit record in the database, and all fallback records that had been temporarily stored as files in the filesystem while the db was down were gone.
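For readers who want to picture the fallback-and-retry cycle exercised by these tests, here is a minimal Java sketch. It is an illustration under assumptions, not the committed code: the names AuditFallbackSketch, AuditDb, Recorder and Pending are hypothetical, an in-memory list stands in for the fallback files, and a single state value stands in for the audit record's fields. The guard that stops a late "job queued" upload from overwriting "job finished" is one possible approach; this bug does not say how the committed fix handles it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of audit fallback and retry; not the committed fix. */
final class AuditFallbackSketch {

    /** Audit events in lifecycle order (declaration order matters below). */
    enum Event { STARTED, QUEUED, FINISHED }

    /** In-memory stand-in for the audit database. */
    static final class AuditDb {
        boolean up = true;
        final Map<String, Event> latest = new HashMap<>();

        /** Applies an event; an older event never overwrites a newer one. */
        void apply(String jobId, Event event) {
            if (!up) {
                throw new IllegalStateException("audit db is down");
            }
            latest.merge(jobId, event,
                    (old, incoming) -> incoming.compareTo(old) > 0 ? incoming : old);
        }
    }

    /** One fallback record (a file on disk in the real system). */
    static final class Pending {
        final String jobId;
        final Event event;

        Pending(String jobId, Event event) {
            this.jobId = jobId;
            this.event = event;
        }
    }

    /** Writes to the db and falls back to a pending list on failure. */
    static final class Recorder {
        final AuditDb db;
        final List<Pending> fallback = new ArrayList<>();

        Recorder(AuditDb db) {
            this.db = db;
        }

        void record(String jobId, Event event) {
            try {
                db.apply(jobId, event);
            } catch (IllegalStateException e) {
                fallback.add(new Pending(jobId, event)); // keep it for a later retry
            }
        }

        /** Periodic retry: uploads pending records, stops if the db is still down. */
        void retryFallback() {
            Iterator<Pending> it = fallback.iterator();
            while (it.hasNext()) {
                Pending p = it.next();
                try {
                    db.apply(p.jobId, p.event);
                    it.remove();
                } catch (IllegalStateException e) {
                    return;
                }
            }
        }
    }

    public static void main(String[] args) {
        AuditDb db = new AuditDb();
        Recorder recorder = new Recorder(db);
        recorder.record("job-1", Event.STARTED);
        db.up = false;
        recorder.record("job-1", Event.QUEUED);   // goes to fallback
        db.up = true;
        recorder.record("job-1", Event.FINISHED);
        recorder.retryFallback();                 // late QUEUED must not win
        System.out.println(db.latest.get("job-1") + ", pending=" + recorder.fallback.size());
    }
}
```

The sketch compresses a whole audit record into one state value; a real record has several fields, so an actual guard would need to decide per field which event is allowed to write it.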
*** Bug 5713 has been marked as a duplicate of this bug. ***
*** Bug 5777 has been marked as a duplicate of this bug. ***
*** Bug 5778 has been marked as a duplicate of this bug. ***