| Summary: | GRAM auditing: Failed database connection loses audit records | ||
|---|---|---|---|
| Product: | GRAM | Reporter: | John Weigand <weigand@fnal.gov> |
| Component: | general | Assignee: | Stuart Martin <smartin@mcs.anl.gov> |
| Status: | RESOLVED DUPLICATE | ||
| Severity: | major | CC: | feller@mcs.anl.gov, greenc@fnal.gov, madduri@mcs.anl.gov, pcanal@fnal.gov, roy@cs.wisc.edu, smartin@mcs.anl.gov |
| Priority: | P2 | ||
| Version: | 4.0.5 | ||
| Target Milestone: | 4.0.7 | ||
| Hardware: | Open Science Grid (OSG) | ||
| OS: | Linux | ||
In GRAM2, I discovered the --check argument to the cron job, which reports that the database connection failed and does NOT remove the audit record. In GRAM4, I still cannot find an option that provides the same capability.
*** This bug has been marked as a duplicate of bug 6400 ***
In both pre-ws and ws GRAM auditing, when a database connection/update fails, the audit record is lost. I am currently testing this with:

- Condor 6.8.3
- Globus 4.0.5
- MySQL 4.1.22

There are a few related issues with this behavior:

1. In both ws and pre-ws, the only indication of the failure is a Java stack trace. These connection problems never appear to be caught, and therefore no log message is generated.
2. In ws, the exception is thrown only once, which suggests to me that it recognizes the failure and never attempts to connect again. In pre-ws, the exception is naturally thrown with each execution of the cron job.
3. In both cases, the audit information is lost. In ws, since there is no queuing/staging of the data, the records just vanish. In pre-ws, the file in $GLOBUS_LOCATION/share/globus_gram_job_manager_auditing is deleted.

I would suggest that in both ws and pre-ws:

1. Audit records should have some form of recovery capability in the event of a database outage.
2. Some type of log message should be generated to notify an admin of a problem.

Also, I should note that the jobs are being successfully processed by the batch job/queue manager (Condor) regardless of the GRAM audit failure.
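
To make the two suggestions concrete, here is a rough sketch of the behavior I have in mind, written in Python purely for illustration. It is not GRAM's uploader and does not use any Globus API: the function name `upload_audit_files`, the `insert_record` callback (standing in for whatever performs the database update), and the spool directory layout are all assumptions of mine. The idea is simply that a spooled audit file is deleted only after the update succeeds, and that a failure produces a log message instead of an uncaught stack trace.

```python
import logging
from pathlib import Path
from typing import Callable, Tuple

log = logging.getLogger("audit_uploader")  # hypothetical logger name


def upload_audit_files(
    spool_dir: Path, insert_record: Callable[[str], None]
) -> Tuple[int, int]:
    """Try to upload every audit file found in spool_dir.

    insert_record is a stand-in for the real database update; it is
    expected to raise an exception when the connection or update fails.
    A file is removed only after insert_record returns successfully.
    Returns (uploaded, retained).
    """
    uploaded = retained = 0
    for path in sorted(spool_dir.iterdir()):
        if not path.is_file():
            continue
        try:
            insert_record(path.read_text())
        except Exception as exc:
            # Keep the file so the next run can retry, and tell the admin.
            log.error(
                "audit upload failed for %s, keeping file for retry: %s",
                path.name,
                exc,
            )
            retained += 1
            continue
        # Only now is it safe to drop the spooled record.
        path.unlink()
        uploaded += 1
    return uploaded, retained
```

Run from cron, something along these lines would leave the audit files in place during a database outage and pick them up on the first run after the database comes back. For ws, which has no staging today, the same approach would first require writing the records to some local spool.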