Bug 6024 - GRAM Audit v2
Status: RESOLVED WONTFIX
Product: GRAM
Component: Campaign
Version: development
Platform: Macintosh All
Importance: P3 enhancement
Reported: 2008-04-18 16:26 by
Modified: 2012-09-05 13:39 (History)


Description From 2008-04-18 16:26:21
Campaign Title:
==================================
GRAM Audit v2

Projects/Grids:
==================================
VDT, OSG, TG, SURA, APAC, DGrid

Technologies:
==================================
GRAM2, GRAM4

Definition:
==================================
A number of very good feature requests have come in to improve GRAM auditing
v1; thanks very much for those, in particular to Terrence Martin, John
Weigand, Shawn McKee, Frank Breitling, and Markus Binsteiner.  GRAM Auditing v1
is available in GRAM2 and GRAM4 in GT versions 4.0.5 (and later versions) and
4.2.0 (and later).  The plan is to implement the new GRAM audit v2 in GT 4.2.x
and not backport it to 4.0.x.  These changes include an audit database
schema change.  To accommodate this interface change in a 4.2 point release, a
new configuration option will be added that lets admins select either the v1
(default) or v2 method.  In the next major GT release, 4.4, only the v2 method
will be kept.
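
As a rough illustration of the toggle described above, a minimal sketch
follows.  The option name "auditVersion" and the values "v1"/"v2" are
assumptions made for this example; the plan does not name the option.

```python
# Hypothetical option name and values; the plan above does not name them.
AUDIT_VERSION_OPTION = "auditVersion"
SUPPORTED_VERSIONS = ("v1", "v2")
DEFAULT_VERSION = "v1"  # v1 stays the default in 4.2 point releases


def select_audit_version(config):
    """Return the audit version requested in `config`, defaulting to v1."""
    value = config.get(AUDIT_VERSION_OPTION, DEFAULT_VERSION).strip().lower()
    if value not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported audit version: {value!r}")
    return value
```

An absent option yields "v1", so existing 4.2 deployments would keep their
current behavior unless an admin opts in.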

For reference, the 4.0 DB record fields are listed here:
   
http://www-unix.globus.org/toolkit/docs/4.0/execution/wsgram/WS_GRAM_Audit_Logging.html#id2585297

Additional DB field requests:

- session_id
    This is the unique ID for each client interaction with the GT container.
The session ID is needed in order to join GRAM audit records with records in
other GT auditing tables, for example core audit records and security audit
records.  There are some plans to have a security audit table for the GT
GridShib component.

- active_time*
    Date when the job was started/running in the local resource manager

- job_resource_key
    This is the unique ID (UUID) generated by the service.  This is included in
the job's EPR.

- lrm_job_terminated_time*
    Date when the job terminated in the local resource manager

- job_all_done_time
    Date when the job was fully processed by the GRAM service.  This includes
staging, execution, cleanup, etc.

- client_hostname
    This is the hostname of the client that sent the job to the GRAM service

- execution_host
    This is the hostname that the GRAM service is running on.  The host is also
part of the grid_job_id.  Does this need to be a separate field?

- resource_usage
    Information as reported by the UNIX time command:
        (i) the elapsed real time between invocation and termination
        (ii) the user CPU time (the sum of the tms_utime and tms_cutime values
in a struct tms as returned by times(2))
        (iii) the system CPU time (the sum of the tms_stime and tms_cstime
values in a struct tms as returned by times(2))
    I assume this was a request for Fork only?  Does it make sense to add this
if it is only available for Fork?

- Audit logging Usage Records
    http://bugzilla.globus.org/globus/show_bug.cgi?id=5865
    I think this could be a script that fetches the data from the DB and
formats the data as necessary.  Is that right?

- VOMS FQAN
    This is the unique ID of the client.  This is needed when the Grid
credential DN is shared.  For VOMS and GridShib, this is an "attribute" in the
credential.  I think a separate security audit table makes the most sense to
store a record that contains the attribute information.  The "session ID"
stored by the service (WS GRAM, RFT, Delegation service) can then be used to
get (via join) the unique client ID and any other security information stored.
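
The session_id join mentioned for the session_id and VOMS FQAN fields could
look roughly like the sketch below.  It uses an in-memory SQLite database so
that it is self-contained.  Only session_id and job_resource_key come from the
list above; the table names, the other column names, and the sample values are
assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down tables; only session_id and job_resource_key
# are field names taken from the request list.
conn.executescript("""
    CREATE TABLE gram_audit (
        job_resource_key TEXT PRIMARY KEY,
        session_id       TEXT NOT NULL,
        client_dn        TEXT NOT NULL
    );
    CREATE TABLE security_audit (
        session_id TEXT NOT NULL,
        fqan       TEXT NOT NULL
    );
""")

# Two jobs submitted with the same shared DN but different FQANs.
conn.executemany("INSERT INTO gram_audit VALUES (?, ?, ?)", [
    ("job-1", "sess-a", "/O=Grid/CN=shared robot"),
    ("job-2", "sess-b", "/O=Grid/CN=shared robot"),
])
conn.executemany("INSERT INTO security_audit VALUES (?, ?)", [
    ("sess-a", "/vo-x/Role=production"),
    ("sess-b", "/vo-x/Role=analysis"),
])


def fqans_for_job(conn, job_resource_key):
    """Return the FQANs recorded for the session that submitted the job."""
    rows = conn.execute(
        """
        SELECT s.fqan
        FROM gram_audit AS g
        JOIN security_audit AS s ON s.session_id = g.session_id
        WHERE g.job_resource_key = ?
        ORDER BY s.fqan
        """,
        (job_resource_key,),
    ).fetchall()
    return [row[0] for row in rows]
```

Both jobs carry the same DN, so the FQAN reached through session_id is what
tells the two clients apart.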

* These dates/times could be obtained by 2 methods:
    1) Record the time when the GRAM service is notified of a change
    2) Attempt to get a timestamp from the LRM for when the LRM records the
change (if available)

In method 1, the date observed by the GRAM service could be inaccurate if the
GRAM container was down or if there were delays in the SEG reporting job
events.  There would be a discrepancy between the LRM accounting information
and the GRAM audit, but this would probably be quite rare.  I think reporting
this date as observed by the GRAM service is reasonable and more consistent in
that SEG modifications per LRM are not needed.  Thoughts?
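
To make the two methods concrete, here is a small sketch of how a state-change
timestamp might be chosen.  The function name, its parameters, and the idea of
returning the source alongside the timestamp are assumptions for this example,
not part of the proposal.

```python
from datetime import datetime, timezone


def event_time(lrm_timestamp=None, now=None):
    """Return (timestamp, source) for a job state change.

    Method 2: use the LRM's own timestamp when one is supplied.
    Method 1: otherwise use the time the service observed the change.
    `now` can be passed in to make the observed time deterministic.
    """
    if lrm_timestamp is not None:
        return lrm_timestamp, "lrm"
    observed = now if now is not None else datetime.now(timezone.utc)
    return observed, "observed"
```

With method 1 only, callers would simply never pass lrm_timestamp.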

--------------------------------------------------------------------

ISSUES brought up with the current implementation

1) Data type for date/time/timestamps in gram auditing

Currently, date/times are stored as strings.  This makes it hard to do any of
the following:
    1) Show all the jobs after a certain date/time
    2) Count the number of jobs in a particular time window
    3) Display the jobs in order of creation time

The "creation_time" and "queued_time" are VARCHAR(40) strings.  A typical
entry:
"Mon Feb 11 04:05:49 UTC 2008"

The suggestion is to switch to an actual DATETIME type, so that the database
can help with time/date related queries.

This seems like a good change to make.
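
A small sketch of the difference, using Python and an in-memory SQLite
database.  The table layout, the job IDs, and the extra sample times are
assumptions for illustration.  SQLite has no native DATETIME type, so the
converted value is kept as sortable 'YYYY-MM-DD HH:MM:SS' text; a real audit
DB would use its own date/time type.

```python
import sqlite3
from datetime import datetime, timezone

# Format of the v1 strings, e.g. "Mon Feb 11 04:05:49 UTC 2008".
V1_FORMAT = "%a %b %d %H:%M:%S %Z %Y"


def v1_to_db_time(text):
    """Convert a v1 time string (assumed UTC) to 'YYYY-MM-DD HH:MM:SS'."""
    parsed = datetime.strptime(text, V1_FORMAT).replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


# Sorted as plain strings these come out as job-b, job-a, job-c,
# which is not chronological.
V1_ROWS = [
    ("job-a", "Mon Feb 11 04:05:49 UTC 2008"),
    ("job-b", "Fri Jan 04 23:10:00 UTC 2008"),
    ("job-c", "Tue Jan 01 00:30:00 UTC 2008"),
]

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down audit table.
conn.execute("CREATE TABLE gram_audit (job_id TEXT, creation_time TEXT)")
conn.executemany(
    "INSERT INTO gram_audit VALUES (?, ?)",
    [(job_id, v1_to_db_time(text)) for job_id, text in V1_ROWS],
)


def jobs_after(cutoff):
    """1) Jobs created after a certain date/time."""
    rows = conn.execute(
        "SELECT job_id FROM gram_audit WHERE creation_time > ? "
        "ORDER BY creation_time",
        (cutoff,),
    )
    return [row[0] for row in rows]


def count_between(start, end):
    """2) Number of jobs in a time window (inclusive)."""
    return conn.execute(
        "SELECT COUNT(*) FROM gram_audit WHERE creation_time BETWEEN ? AND ?",
        (start, end),
    ).fetchone()[0]


def jobs_by_creation_time():
    """3) Jobs in order of creation time."""
    rows = conn.execute("SELECT job_id FROM gram_audit ORDER BY creation_time")
    return [row[0] for row in rows]
```

The same conversion function could also serve for migrating existing v1
records, assuming they all carry the UTC zone name shown in the typical entry.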

2) Audit DB Security

Currently, WS GRAM inserts an initial record, updates it after the LRM job
submission, and updates it again at the end of the job.  This requires the
globus account to have update privileges on the audit table.  If the account
were compromised, the entire table could be altered.  The audit trail could
be lost or, possibly even worse, made erroneous.

The insert-and-update method was chosen because it was desired to have access
to some of the information in the audit record before the job completed.

To satisfy both requirements, the GRAM service could generate a unique audit
record for each "audit event".  Multiple inserts would then be done for each
job, removing the need for update privileges, but this would add complexity to
the select statement.
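
A sketch of what the insert-only approach and the more complex select might
look like, again with an in-memory SQLite database.  The table name, column
names, and event names ('received', 'submitted', 'done') are assumptions for
illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical event table: one row per audit event, never updated.
conn.execute("""
    CREATE TABLE gram_audit_event (
        job_resource_key TEXT NOT NULL,
        event            TEXT NOT NULL,  -- 'received', 'submitted', 'done'
        event_time       TEXT NOT NULL,
        local_job_id     TEXT            -- known only after LRM submission
    )
""")


def record_event(job_resource_key, event, event_time, local_job_id=None):
    """Append one audit event; only INSERT privilege is needed."""
    conn.execute(
        "INSERT INTO gram_audit_event VALUES (?, ?, ?, ?)",
        (job_resource_key, event, event_time, local_job_id),
    )
    conn.commit()


def job_summary(job_resource_key):
    """Fold a job's event rows back into one v1-style record."""
    row = conn.execute(
        """
        SELECT MAX(CASE WHEN event = 'received'  THEN event_time END),
               MAX(CASE WHEN event = 'submitted' THEN event_time END),
               MAX(CASE WHEN event = 'done'      THEN event_time END),
               MAX(local_job_id)
        FROM gram_audit_event
        WHERE job_resource_key = ?
        """,
        (job_resource_key,),
    ).fetchone()
    return {
        "received_time": row[0],
        "submitted_time": row[1],
        "done_time": row[2],
        "local_job_id": row[3],
    }
```

The partial record is still readable before the job completes: events that
have not happened yet simply come back as None.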

3) Reliability

A requirement for auditing is that no records are lost.  Currently, an attempt
is made to insert/update the record, but if the database is unavailable, the
record is lost.  A file-based fallback mechanism needs to be implemented.  If
the DB is unavailable, we need to decide how the record should be written to
disk and later uploaded.  This is actually how GRAM2 audit records are
uploaded: a file is created, and a cron job occasionally uploads the records
and removes the files.  The GRAM4 fallback approach could use the GRAM2 method
or a new one could be devised.
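
One possible shape for such a fallback is sketched below.  This is not the
GRAM2 implementation; the JSON file format, the spool directory layout, the
function names, and the two-column gram_audit table are all assumptions for
illustration, with SQLite standing in for the audit DB.

```python
import json
import os
import sqlite3
import uuid


def insert_record(conn, record):
    """Insert one record into a hypothetical gram_audit(job_id, event) table."""
    conn.execute(
        "INSERT INTO gram_audit (job_id, event) VALUES (:job_id, :event)",
        record,
    )
    conn.commit()


def write_audit_record(conn, spool_dir, record):
    """Insert `record`; if the DB is unavailable, spool it to a file.

    Returns "db" or "file" to show where the record went.
    """
    try:
        insert_record(conn, record)
        return "db"
    except sqlite3.Error:
        path = os.path.join(spool_dir, uuid.uuid4().hex + ".json")
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(record, handle)
        # Rename last so the uploader never sees a half-written file.
        os.replace(tmp_path, path)
        return "file"


def upload_spooled(conn, spool_dir):
    """Upload spooled records (e.g. from cron) and remove their files."""
    uploaded = 0
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".json"):
            continue
        path = os.path.join(spool_dir, name)
        with open(path) as handle:
            record = json.load(handle)
        insert_record(conn, record)  # on failure the file is left in place
        os.remove(path)
        uploaded += 1
    return uploaded
```

If the process dies between the insert and the file removal, the record would
be uploaded twice on the next run, so a real implementation would also want a
unique key per record to detect duplicates.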

Deliverables:
==================================
1) v2 DB schema
2) GT 4.2.0 based patch to implement new v2 audit behavior in the GRAM4 service
3) GT 4.2.0 based patch to add new v2 audit configuration option (if not
specified, then v1 implementation is assumed)
4) GT 4.2.0 based patch to implement new v2 audit behavior in the GRAM2 service

Tasks:
==================================
1) Decide on the datatype that will be used for the date/time fields in the
audit DB.  These include the new and previously existing date/time fields.
2) Decide on all DB field names
    a) new fields include: session_id, active_time, job_resource_key,
lrm_job_terminated_time, job_all_done_time, client_hostname, execution_host
3) Decide on the record format for each insert auditing event (beginning, after
LRM submission, end)
4) Create the new v2 DB schema(s) based on tasks 1, 2 and 3
5) Add comments in the GRAM4 service code where the values should be obtained
(Martin)
6) Add code to the GRAM4 service to get the values that will populate the new
Java audit object.
7) Modify code to replace the old date/time values stored as strings with the
new v2 date/time datatype
8) Modify DB insert/update statements to:
    a) Insert a record shortly after the job has been received from the client.
    b) Insert a record just after the job has been submitted to the LRM.
    c) Insert a record just after all processing has been completed.
9) Implement the method to ensure audit records are not lost when the DB is
unavailable.
10) Write a CLI to fetch an audit record given an LRM job id and output it in
OGF UR format.
    http://bugzilla.globus.org/globus/show_bug.cgi?id=5865
11) Decide how the auditing of the FQAN will be done.
    a) This should probably be in a separate security/authorization DB table
that can be joined/linked from the gram audit records.  Code for this should be
in the security PIP.  If so, the security PIP should leverage the DB insert
code that is used by GRAM4.  But this work should probably be done in a
separate campaign.
12) Release new v2 implementation in GT 4.2.? point release (whenever it is
ready)
------- Comment #1 From 2008-04-22 10:57:12 -------
Martin and I just discussed this some and realized that we will want an audit
"terminate" record.  This record will be used for the situation when a job is
being terminated.  There are 3 general reasons that trigger it:
  1) a user calls the terminate operation
  2) the job resource's lifetime has expired
  3) there was an internal or processing error of some sort
     a) file staging failed
     b) LRM job submission failed. Maybe it was not available/down
------- Comment #2 From 2008-09-17 09:39:10 -------
Note: The V2 plans are listed here:
http://dev.globus.org/wiki/GRAM_Audit_V2
------- Comment #3 From 2012-09-05 13:39:01 -------
Doing some Bugzilla cleanup...  Resolving old GRAM3 and GRAM4 issues that are
no longer relevant since we've moved on to GRAM5.  Also, we're now tracking
issues in JIRA.  Any new issues should be added here:

http://jira.globus.org/secure/VersionBoard.jspa?selectedProjectId=10363