Bugzilla – Bug 6024
GRAM Audit v2
Last modified: 2012-09-05 13:39:01
Campaign Title:
==================================
GRAM Audit v2

Projects/Grids:
==================================
VDT, OSG, TG, SURA, APAC, DGrid

Technologies:
==================================
GRAM2, GRAM4

Definition:
==================================
There have been a number of very good feature requests to improve GRAM auditing v1, thanks very much for that. In particular, from Terrence Martin, John Weigand, Shawn McKee, Frank Breitling, and Markus Binsteiner.

GRAM Auditing v1 is available in GRAM2 and GRAM4 in GT versions 4.0.5 (and later versions) and 4.2.0 (and later). The plan is to implement the new GRAM audit v2 in GT 4.2.x and not back port this to 4.0.x.

These changes include an audit database schema change. In order to accommodate this interface change in a 4.2 point release, a new configuration option will be added to toggle the GRAM audit version desired. In a 4.2 point release the v2 option will be added, allowing admins to select either the v1 (default) or v2 method. In the next major GT release, 4.4, only the v2 method will be kept.

For reference, the 4.0 DB record fields are listed here:
http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Audit_Logging.html#id2585297

Additional DB field requests:

- session_id
  This is the unique ID for each client interaction with the GT container. The session ID is needed in order to join records across multiple GT auditing tables, for example core audit records and security audit records. There are some plans to have a security audit table for the GT gridshib component.

- active_time*
  Date when the job was started/running in the local resource manager

- job_resource_key
  This is the unique ID (UUID) generated by the service. This is included in the job's EPR.

- lrm_job_terminated_time*
  Date when the job terminated in the local resource manager

- job_all_done_time
  Date when the job was fully processed by the GRAM service. This includes staging, execution, cleanup, etc.
- client_hostname
  This is the hostname of the client that sent the job to the gram service

- execution_host
  This is the hostname that the GRAM service is running on. The host is also part of the grid_job_id, so does this need to be a separate field?

- resource_usage
  Information as reported by the UNIX time command:
  (i) the elapsed real time between invocation and termination
  (ii) the user CPU time (the sum of the tms_utime and tms_cutime values in a struct tms as returned by times(2))
  (iii) the system CPU time (the sum of the tms_stime and tms_cstime values in a struct tms as returned by times(2))
  I assume this was a request for Fork only? Does it make sense to add this if it is only available for Fork?

- Audit logging Usage Records
  http://bugzilla.globus.org/globus/show_bug.cgi?id=5865
  I think this could be a script that fetches the data from the DB and formats the data as necessary. Is that right?

- VOMS FQAN
  This is the unique ID of the client. This is needed when the Grid credential DN is shared. For VOMS and GridShib, this is an "attribute" in the credential. I think a separate security audit table makes the most sense for storing a record that contains the attribute information. The "session ID" stored by the service (WS GRAM, RFT, Delegation service) can then be used to get (via join) the unique client id and any other security information stored.

* These dates/times could be obtained by 2 methods:
  1) When the gram service is notified of a change
  2) Attempt to get a timestamp from the LRM for when the LRM records the change (if available)

In method 1, the date observed by the gram service could be inaccurate if the GRAM container was down or if there were delays in the SEG reporting job events. There would be a discrepancy between the LRM accounting information and the GRAM audit record, but this would probably be quite rare. I think reporting this date as observed by the gram service is reasonable and more consistent, in that SEG modifications per LRM are not needed. Thoughts?
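To make the session_id idea concrete, here is a minimal sketch of the join between a GRAM audit table and a separate security audit table. It uses Python's sqlite3 module purely for illustration. The table names (gram_audit, security_audit), the column names, and the helper functions are all assumptions made up for this example; they are not the v1 or v2 schema.

```python
import sqlite3


def build_demo_db():
    """Create two in-memory audit tables linked by session_id.

    Table and column names are hypothetical, not the real GRAM schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gram_audit (
            grid_job_id  TEXT PRIMARY KEY,
            session_id   TEXT NOT NULL,
            local_job_id TEXT
        );
        CREATE TABLE security_audit (
            session_id TEXT NOT NULL,
            subject_dn TEXT NOT NULL,
            fqan       TEXT
        );
    """)
    return conn


def jobs_with_fqan(conn):
    """Return (grid_job_id, subject_dn, fqan) for every job, joined on session_id."""
    rows = conn.execute("""
        SELECT g.grid_job_id, s.subject_dn, s.fqan
        FROM gram_audit AS g
        JOIN security_audit AS s ON s.session_id = g.session_id
        ORDER BY g.grid_job_id
    """)
    return rows.fetchall()
```

With this layout, two jobs submitted under the same shared DN can still be told apart, because each job's session_id leads to its own FQAN row in the security table.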
--------------------------------------------------------------------
ISSUES brought up with the current implementation

1) Data type for date/time/timestamps in gram auditing

Currently, date/times are stored as strings. This makes it hard to do any of the following:
   a) Show all the jobs after a certain date/time
   b) Count the number of jobs in a particular time window
   c) Display the jobs in order of creation time

The "creation_time" and "queued_time" fields are VARCHAR(40) strings. A typical entry:
   "Mon Feb 11 04:05:49 UTC 2008"

The suggestion is to change to an actual DATETIME type. Then the database can help with time/date related queries. This seems like a good change to make.

2) Audit DB Security

Currently, WS GRAM inserts an initial record, updates it after the LRM job submission, and updates it again at the end of the job. This requires the globus account to have update privileges on the audit table. If the account were compromised, the entire table could be altered. The audit trail could be lost or, possibly even worse, made erroneous.

It was desired to have access to some of the information in the audit record before the job completed. For this reason the insert and update method was chosen. To satisfy both requirements, the gram service could generate a unique audit record for each "audit event". Multiple inserts would then be done for each job, removing the need for update privileges. But this would add complexity to the select statement.

3) Reliability

A requirement for auditing is that no records are lost. Currently, an attempt is made to insert/update the record, but if the database is unavailable, the record is lost. A file based fall back mechanism needs to be implemented. If the DB is unavailable we need to decide how the record should be written to disk and later uploaded. This is actually how GRAM2 audit records are uploaded. A file is created and there is a cron job that occasionally uploads the records and removes the files.
The GRAM4 fall back approach could use the GRAM2 method, or a new one could be devised.

Deliverables:
==================================
1) v2 DB schema
2) GT 4.2.0 based patch to implement new v2 audit behavior in the GRAM4 service
3) GT 4.2.0 based patch to add new v2 audit configuration option (if not specified, then v1 implementation is assumed)
4) GT 4.2.0 based patch to implement new v2 audit behavior in the GRAM2 service

Tasks:
==================================
1) Decide on the datatype that will be used for the date/time fields in the audit DB. These include the new and previously existing date/time fields.
2) Decide on all DB field names
   a) new fields include: session_id, active_time, job_resource_key, lrm_job_terminated_time, job_all_done_time, client_hostname, execution_host
3) Decide on the record format for each insert auditing event (beginning, after LRM submission, end)
4) Create the new v2 DB schema(s) based on tasks 1, 2 and 3
5) Add comments in the GRAM4 service code where the values should be obtained (Martin)
6) Add code to the GRAM4 service to get the values that will populate the new java audit object.
7) Modify code to replace old date/time values stored as strings with the new v2 date/time datatype
8) Modify DB insert/update statements to:
   a) Insert a record shortly after the job has been received from the client.
   b) Insert a record just after the job has been submitted to the LRM.
   c) Insert a record just after all processing has been completed.
9) Implement the method to assure audit records are not lost when the DB is unavailable.
10) Write CLI to fetch an audit record given an LRM job id and output in OGF UR format.
    http://bugzilla.globus.org/globus/show_bug.cgi?id=5865
11) Decide how the auditing of the FQAN will be done.
    a) This should probably be in a separate security/authorization DB table that can be joined/linked from the gram audit records. Code for this should be in the security PIP.
       If so, the security PIP should leverage the DB insert code that is used by GRAM4. But this work should probably be done in a separate campaign.
12) Release new v2 implementation in GT 4.2.? point release (whenever it is ready)
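As a starting point for task 9, here is one possible shape for a file based fall back, sketched in Python with sqlite3 standing in for the audit DB. Everything in it is an assumption for illustration: the gram_audit_events table, the record keys, the JSON spool files, and the function names. It is not how GRAM2 or GRAM4 actually does it. The idea is simply to try the insert first, write the record to a spool directory if the DB call fails, and have a periodic job (for example from cron) replay the spooled files later.

```python
import json
import os
import sqlite3
import tempfile
import uuid

# Hypothetical event table used only by this sketch.
SPOOL_SCHEMA = """
CREATE TABLE gram_audit_events (
    grid_job_id TEXT NOT NULL,
    event       TEXT NOT NULL,
    event_time  TEXT NOT NULL
)
"""


def insert_record(conn, record):
    """Insert one audit record; the transaction commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO gram_audit_events (grid_job_id, event, event_time) "
            "VALUES (:grid_job_id, :event, :event_time)",
            record,
        )


def spool_record(spool_dir, record):
    """Write the record to its own file, renaming into place so readers never see a partial file."""
    os.makedirs(spool_dir, exist_ok=True)
    final_path = os.path.join(spool_dir, uuid.uuid4().hex + ".json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(record, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, final_path)
    return final_path


def write_audit_record(conn, spool_dir, record):
    """Try the DB first; fall back to the spool directory. Returns 'db' or 'spool'."""
    try:
        insert_record(conn, record)
        return "db"
    except sqlite3.Error:
        spool_record(spool_dir, record)
        return "spool"


def upload_spooled(conn, spool_dir):
    """Replay spooled records; each file is removed only after its insert has committed."""
    if not os.path.isdir(spool_dir):
        return 0
    uploaded = 0
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".json"):
            continue  # skip in-progress .tmp files
        path = os.path.join(spool_dir, name)
        with open(path, encoding="utf-8") as fh:
            record = json.load(fh)
        try:
            insert_record(conn, record)
        except sqlite3.Error:
            break  # DB still unavailable; leave the rest for the next run
        os.remove(path)
        uploaded += 1
    return uploaded
```

One thing to decide with this approach: if the uploader dies between the commit and the file removal, the record is inserted again on the next run, so the table would need a way to detect duplicates.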
Martin and I just discussed this some and realized that we will want an audit "terminate" record. This record will be used for the situation when a job is being terminated. There are 3 general reasons that trigger it:
1) a user calls the terminate operation
2) the job resource's lifetime has expired
3) there was an internal or processing error of some sort
   a) file staging failed
   b) LRM job submission failed. Maybe it was not available/down
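To see how a "terminate" record might sit next to the other insert-only audit events, here is a rough Python/sqlite3 sketch. The table layout, the event names, and the three reason codes are assumptions invented for this example; they only mirror the three trigger reasons listed above. SQLite has no native DATETIME type, so the sketch stores UTC timestamps as "YYYY-MM-DD HH:MM:SS" strings, which sort chronologically; a real deployment would use the database's own date/time type. It also shows converting a v1 style date string and the extra select work needed to fold the events back into one row per job.

```python
import sqlite3
from datetime import datetime, timezone

# Format of the v1 strings, e.g. "Mon Feb 11 04:05:49 UTC 2008".
LEGACY_FORMAT = "%a %b %d %H:%M:%S UTC %Y"
STORED_FORMAT = "%Y-%m-%d %H:%M:%S"

# Hypothetical reason codes for the three terminate triggers.
TERMINATE_REASONS = ("user_request", "lifetime_expired", "internal_error")

EVENT_SCHEMA = """
CREATE TABLE gram_audit_events (
    grid_job_id TEXT NOT NULL,
    event       TEXT NOT NULL
                CHECK (event IN ('received', 'submitted', 'done', 'terminate')),
    event_time  TEXT NOT NULL,
    reason      TEXT
);
"""


def parse_legacy_time(text):
    """Convert a v1 date string into a timezone-aware UTC datetime."""
    return datetime.strptime(text, LEGACY_FORMAT).replace(tzinfo=timezone.utc)


def record_event(conn, grid_job_id, event, when, reason=None):
    """Insert one event row; rows are never updated. `when` must be timezone-aware."""
    if event == "terminate" and reason not in TERMINATE_REASONS:
        raise ValueError("terminate events need a known reason")
    stamp = when.astimezone(timezone.utc).strftime(STORED_FORMAT)
    with conn:
        conn.execute(
            "INSERT INTO gram_audit_events VALUES (?, ?, ?, ?)",
            (grid_job_id, event, stamp, reason),
        )


def job_summary(conn, grid_job_id):
    """Fold the per-event rows for one job back into a single dict."""
    row = conn.execute(
        """
        SELECT MAX(CASE WHEN event = 'received'  THEN event_time END),
               MAX(CASE WHEN event = 'submitted' THEN event_time END),
               MAX(CASE WHEN event = 'done'      THEN event_time END),
               MAX(CASE WHEN event = 'terminate' THEN reason END)
        FROM gram_audit_events
        WHERE grid_job_id = ?
        """,
        (grid_job_id,),
    ).fetchone()
    return dict(zip(("received", "submitted", "done", "terminate_reason"), row))


def jobs_received_between(conn, start, end):
    """List job ids received in [start, end), ordered by time."""
    rows = conn.execute(
        "SELECT grid_job_id FROM gram_audit_events "
        "WHERE event = 'received' AND event_time >= ? AND event_time < ? "
        "ORDER BY event_time",
        (
            start.astimezone(timezone.utc).strftime(STORED_FORMAT),
            end.astimezone(timezone.utc).strftime(STORED_FORMAT),
        ),
    )
    return [r[0] for r in rows]
```

Since every write is an INSERT, the account used by the service would only need insert privileges on this table.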
Note: The V2 plans are listed here: http://dev.globus.org/wiki/GRAM_Audit_V2
Doing some bugzilla cleanup... Resolving old GRAM3 and GRAM4 issues that are no longer relevant since we've moved on to GRAM5. Also, we're now tracking issues in jira. Any new issues should be added here: http://jira.globus.org/secure/VersionBoard.jspa?selectedProjectId=10363