Bug 3912 - Rotation of gram_condor_log?
Summary: Rotation of gram_condor_log?
Status: RESOLVED DUPLICATE of bug 5731
Product: GRAM
Component: general
Version: 4.0.1
Platform: All All
Importance: P2 question
Reported: 2005-11-15 14:13
Modified: 2008-01-22 09:19



Description From 2005-11-15 14:13:26
The condor.pm script creates a log file for recording all Condor events for all
Condor jobs. It is shared between users. 

This file can grow without bound, but it is not clear how it should be rotated.
Is it safe to rotate it while a job is being executed? For instance, if the log has:

... 
submit alain's job
job begins running
... 

If I rotate the log file at this point, will it cause problems because the
events for the submission and starting can no longer be read? Or does GRAM only
rely on being able to read new events, so it is actually safe to rotate the file
at any time?

Thanks,
-alain
------- Comment #1 From 2007-03-12 18:12:23 -------
Joe,

This recently came up as needing to be understood for OSG.  How can the single
condor log file that is used by the condor SEG be rotated safely?

-Stu
------- Comment #2 From 2007-03-13 15:49:19 -------
Alain,

Currently, the only way to know that the condor log file can be safely rotated
is when all events from the log have been processed by the SEG.  There is not
really a good way to know this.  The SEG keeps a timestamp for recovery
purposes.  The timestamp is unique to a single "event" in the log file.  If the
container was down for some reason, meaning the SEG wasn't running, but condor
continued to run, there would then be unprocessed events waiting for the SEG
when it is restarted.  If you rotate/truncate/remove those events, then they
would be lost, causing problems (most likely a job hanging while it waits for
DONE).
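
As a rough illustration of that condition (a sketch only, not how the SEG is
implemented): assuming the timestamps of the events in the log and the SEG's
recovery timestamp are both available as comparable values, rotation would only
be safe once no event is newer than the recovery timestamp.  The function name
and its inputs are hypothetical.

```python
from datetime import datetime


def safe_to_rotate(event_times, recovery_timestamp):
    """Return True if no logged event is newer than the recovery timestamp.

    event_times: iterable of datetime, one per event in the log
        (hypothetical input; parsing the log itself is not shown).
    recovery_timestamp: datetime of the last event the SEG has processed
        (assumed to be readable from wherever the SEG stores it).
    """
    return all(t <= recovery_timestamp for t in event_times)


events = [datetime(2007, 3, 13, 15, 0), datetime(2007, 3, 13, 15, 30)]
print(safe_to_rotate(events, datetime(2007, 3, 13, 15, 30)))  # True
print(safe_to_rotate(events, datetime(2007, 3, 13, 15, 10)))  # False
```

Even when such a check passes, it says nothing about jobs that are still
running and will write further events to the same file name later.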

LSF and PBS have log rotation schemes that WS GRAM understands and knows how to
deal with.  The recovery timestamp will lead the SEG to a rotated log file and
then it will continue on the current log file.  Something similar could be done
with condor.
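
To make "something similar" concrete, here is a minimal sketch of picking which
files to replay from a recovery timestamp.  It assumes a hypothetical rotation
scheme in which each rotated file is named gram_condor_log.YYYYMMDD and holds
the events written on that day; neither that naming nor the function below
exists in condor or WS GRAM today.

```python
from datetime import datetime

CURRENT_LOG = "gram_condor_log"  # the ".YYYYMMDD" rotation suffix is hypothetical


def files_to_replay(names, recovery_timestamp):
    """Return rotated logs dated on or after the recovery day, oldest first,
    followed by the current log."""
    start = recovery_timestamp.date()
    dated = []
    for name in names:
        prefix, _, suffix = name.rpartition(".")
        if prefix != CURRENT_LOG:
            continue  # not a rotated copy of the shared log
        try:
            day = datetime.strptime(suffix, "%Y%m%d").date()
        except ValueError:
            continue  # suffix is not a date, ignore the file
        if day >= start:
            dated.append((day, name))
    return [name for _, name in sorted(dated)] + [CURRENT_LOG]


names = [
    "gram_condor_log",
    "gram_condor_log.20070311",
    "gram_condor_log.20070312",
    "gram_condor_log.20070313",
]
print(files_to_replay(names, datetime(2007, 3, 12, 18, 12)))
```

With the recovery timestamp on 2007-03-12, the file for 03-11 is skipped and
the files for 03-12 and 03-13 are read before the current log.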

However, one problem with the log rotation method for condor is that the job
submission script does the naming of the log file, and if a job takes a very
long time to run, it will keep trying to write to the name chosen at job
creation time, even if that file has been rotated away.

Given these issues, maybe it's worth revisiting if there is another channel
(central log/DB/...) that the SEG can suck the information out of without using
the user specified log files?  The events the SEG needs are: job started, done,
failed, and exit code.  All events need to have a timestamp in order to
identify them uniquely for recovery.

Thoughts?

-Stu
------- Comment #3 From 2007-03-14 10:29:29 -------
Subject: Re:  Rotation of gram_condor_log?

At 03:49 PM 3/13/2007 -0500, you wrote:
>Currently, the only way to know that the condor log file can be safely rotated
>is when all events from the log have been processed by the SEG.

I was worried about that.

>LSF and PBS have log rotation schemes that WS GRAM understands and knows how to
>deal with.  The recovery timestamp will lead the SEG to a rotated log file and
>then it will continue on the current log file.  Something similar could be done
>with condor.

Today, Condor does not rotate this log file. I suppose it could (I'll 
bring it up with the Condor team), but the original expectation was 
that the log file would be used on a per-job basis (or perhaps 
per-set of jobs basis), not for all jobs.

I'll talk to the Condor Team about it, but no version of Condor today 
rotates this log file.

>Given these issues, maybe it's worth revisiting if there is another channel
>(central log/DB/...) that the SEG can suck the information out of 
>without using the user specified log files?  The events the SEG 
>needs are: job started, done, failed, and exit code.  All events 
>need to have a timestamp in order to identify them uniquely for recovery.

Nothing pops out at me right now: the job log is where we put those 
events. Theoretically you could use condor_q, but that's a bad choice 
for several reasons.

In the past, we had talked about another option: the SEG could read 
multiple log files, one per job. Is that an option?
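
A minimal sketch of what that could look like, assuming each per-job log has
already been parsed into a time-ordered list of (timestamp, job id, event)
tuples.  The function, the job ids, and the tuple layout are hypothetical; only
the event kinds (started, done, failed) come from the list above.

```python
import heapq
from datetime import datetime


def merged_events(per_job_events, after):
    """Yield events from all per-job logs in time order, skipping anything
    at or before the recovery timestamp `after`.

    per_job_events: dict mapping job id -> list of (timestamp, job_id, event)
        tuples, each list already sorted by timestamp (assumed).
    """
    for record in heapq.merge(*per_job_events.values()):
        if record[0] > after:
            yield record


logs = {
    "job1": [
        (datetime(2007, 3, 14, 10, 0), "job1", "started"),
        (datetime(2007, 3, 14, 10, 20), "job1", "done"),
    ],
    "job2": [
        (datetime(2007, 3, 14, 10, 5), "job2", "started"),
        (datetime(2007, 3, 14, 10, 10), "job2", "failed"),
    ],
}
for timestamp, job_id, event in merged_events(logs, datetime(2007, 3, 14, 10, 0)):
    print(timestamp, job_id, event)
```

The strict comparison against the recovery timestamp relies on timestamps
identifying events uniquely, as described earlier in this bug.  With one file
per job, a finished job's log could be removed once its final event has been
processed, so nothing would need rotating.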

-alain
------- Comment #4 From 2007-03-15 11:50:58 -------
Test.  I didn't get Alain's comment in email.  Testing to see if I get this one.
------- Comment #5 From 2008-01-22 09:19:49 -------

*** This bug has been marked as a duplicate of 5731 ***