Bug 3910 - Bad permissions on condor log file prevents job submissions

Status: RESOLVED FIXED
Product: GRAM
Component: wsrf scheduler interface
Version: 4.0.1
Hardware: All
OS: All
Importance: P2 critical
Target Milestone: 4.0.6

Reported: 2005-11-15 13:52 by Alain Roy
Modified: 2007-12-11 16:25


Attachments
Patch for condor.pm (951 bytes, patch) - 2005-11-15 13:53, Alain Roy


Description From Alain Roy 2005-11-15 13:52:48
Summary: In the default setup, only a single user can submit jobs to the Condor
jobmanager. This problem affects both the pre-web services GRAM and web-services
GRAM. 

When a user submits a job, the Condor jobmanager (condor.pm) selects a log file
for the job. It selects the same log file for all jobs, but it does not create
the log file. It merely submits the job to Condor and Condor creates the log file. 

Condor, quite reasonably, sets the permissions on the log file to be 664 when it
creates the log file. In most situations, we don't want to create files that
other users can tamper with. 

This has a bad result for Globus though: as soon as a different user submits a
job to the Condor jobmanager, they will be unable to use that log file. When the
log file can't be used, various errors occur. Nothing works. 

This is a serious problem: in a production environment, lots of different users
will submit jobs to the Condor jobmanager. 

I'm building the VDT, so I'll apply a patch to it in order to fix the problem.
The patch looks like this (it's for condor.pm, not condor.in, but I'm sure you
can figure it out.)

--- condor.pm.orig      2005-11-15 13:13:39.000000000 -0600
+++ condor.pm   2005-11-15 13:40:23.000000000 -0600
@@ -66,6 +66,21 @@
         }
         $self->{condor_logfile} = "$log_dir/gram_condor_log";
     }
+    if(! -e $self->{condor_logfile}) 
+    {
+        # We make sure that the log file exists with the correct 
+        # permissions. If we just let Condor create it, it will
+        # have 664 permissions, and when another user submits a job
+        # they will be unable to write to the log file. We create the 
+        # file in append mode to avoid a race condition, in case
+        # multiple instantiations of this script open and write
+        # to the log file. 
+        if ( open(CONDOR_LOG_FILE, '>>' . $self->{condor_logfile}) ) 
+        {
+            close(CONDOR_LOG_FILE);
+        }
+        chmod(0666, $self->{condor_logfile});
+    }
 
     if($description->jobtype() eq 'multiple' && $description->count > 1)
     {

I'll attach this as a separate file too. As a bit of extra context, this goes
after the block that begins with:

if(! exists($self->{condor_logfile}))

My fix may not be perfect, but something clearly needs to be done.
------- Comment #1 From 2005-11-15 13:53:47 -------
Created an attachment (id=750) [details]
Patch for condor.pm

Matches patch in Alain's description of the bug. 
------- Comment #2 From 2005-11-15 14:36:45 -------
Another option is for Condor to provide another logging implementation that can
handle multiple writers w/o compromising the previously logged information. 
Something like syslog? 
------- Comment #3 From 2005-11-15 14:53:53 -------
Condor uses locking to ensure that multiple writers do not trample each other.
I'm confused about what you are suggesting.
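
For illustration, the usual append-plus-lock pattern looks roughly like this
in Perl (a sketch only, not Condor's actual user-log code, which is C++):

use Fcntl qw(:flock);

sub append_event
{
    my ($logfile, $event) = @_;

    # Append mode creates the file if needed and never truncates it.
    open(my $fh, '>>', $logfile) or die "open $logfile: $!";

    # Take an exclusive lock so concurrent writers don't interleave records.
    flock($fh, LOCK_EX) or die "flock $logfile: $!";
    print $fh $event;

    # close() flushes the buffer and releases the lock.
    close($fh) or die "close $logfile: $!";
}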

Condor also allows each job to have its own log file, so you could have a
single log per job or per user, also avoiding the problem that way. Having a
single log file for all jobs may cause scalability problems due to locking and
very large log files.

If you are suggesting that Condor should send all logging events to a single
process, and that process should write the log file, then there are other big
problems to solve. The main reason we write directly to a log file is so that we
do not rely on another process in order to get the critical logging completed.
What happens if the other process is not there, or is not responding, or is
overwhelmed with requests to log? We push the information to disk as fast as we
can so as not to lose the information. 

------- Comment #4 From 2005-11-15 15:05:57 -------
Sorry if it was confusing; I am just agreeing with your statement that "In most
situations, we don't want to create files that other users can tamper with."
Having a world-writable file may mean you cannot trust what is in the logfile.

Having separate files per user seems OK (?), at the cost of some usability: you
no longer always know where to look for the log information. Another way to do
it would be to have per-user files for reliability and also tail each file and
copy it to syslog.

Making the write functionality setuid would be an option, but that can create
problems in its own right, in my opinion.
------- Comment #5 From 2005-11-15 15:27:32 -------
I agree that there are many possible solutions here.
I'm happy to donate time and effort discussing them: as a member of both the VDT
and the Condor team, I care deeply about getting a good solution. See Bugzilla
3912 for another related issue: rotation of this log file. 

I suspect that we need a short term fix for Globus 4.0.2, so setting the
permissions is probably the right fix. What do you think? 

For Globus 4.2 I'm happy to help brainstorm other solutions.

-alain

------- Comment #6 From 2005-11-15 15:52:39 -------
Ideally I would like to see an interface built into Condor where I can
basically subscribe for job events directly to the resource manager. Then we
could ditch this hackish tailing of log files for events.
------- Comment #7 From 2005-11-15 15:55:48 -------
I think that the reason to log all the jobs to the same file is so the event
generator can tail the file and fire events to GRAM when job status changes.

Creating the file in condor.pm is not correct. I think the file is in G_L/var,
and that directory is not writable by users other than the container owner. The
right fix is to have the jobmanager setup script set the correct permissions.
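
A sketch of the setup-script approach (the log path here is an assumption for
illustration; see the discussion below about where condor.pm actually looks):

# Sketch only: create the log ahead of time with a permissive mode.
my $log = "$ENV{GLOBUS_LOCATION}/var/gram_condor_log";   # assumed path

open(my $fh, '>>', $log) or die "cannot create $log: $!";
close($fh);

# World-writable, so any submitting user's jobmanager can log to it.
chmod(0666, $log) or die "chmod $log: $!";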
------- Comment #8 From 2005-11-15 16:10:08 -------
Mats wrote: "I think the file is in G_L/var". That's not what I see from the
code in condor.pm, but by default, it's in $GLOBUS_LOCATION/tmp/gram_job_state. 

condor.pm looks in the following locations in order. 

1) $GLOBUS_LOCATION/etc/globus-condor.conf (not created by default)
2) $GLOBUS_SPOOL_DIR, set to $GLOBUS_LOCATION/tmp/gram_job_state by the C
jobmanager (pre-web services). 
3) Globus::Core::Paths::tmpdir, which is /tmp. 
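
In rough Perl terms, the lookup order is (a sketch of the logic only, not the
actual condor.pm code; the helper name below is hypothetical):

sub pick_condor_log_dir
{
    my $conf = "$ENV{GLOBUS_LOCATION}/etc/globus-condor.conf";

    # 1) An explicit config file wins when it exists.
    return read_log_dir_from($conf) if -r $conf;   # hypothetical helper

    # 2) Otherwise the spool dir set by the pre-WS C jobmanager.
    return $ENV{GLOBUS_SPOOL_DIR} if exists $ENV{GLOBUS_SPOOL_DIR};

    # 3) Finally, fall back to the system temp dir (usually /tmp);
    #    assuming tmpdir is a package variable in Globus::Core::Paths.
    return $Globus::Core::Paths::tmpdir;
}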

If you only create the log file when you do the setup, then you have a problem.
Imagine a system administrator cleans up the log file when no jobs are running,
because it's grown to be nearly 2GB. When it's re-created by Condor, the
permissions will not be world-writable. 
------- Comment #9 From 2005-11-15 16:37:09 -------
Reading the job's log file is the official sanctioned way to monitor a Condor
job.  That's unlikely to change in the near term.

Is there a problem with the solution present in GT2?  There was a job log
per-job.  When the job is cleaned up, the job log is deleted.  No contention.  
You (usually) don't need to worry about the file growing and filling the disk. 
If jobs have a per-job, owned-by-the-user temporary directory, that would be the
ideal location.  Failing that, some reasonably unique filename in /tmp would do
just fine.
------- Comment #10 From 2005-11-15 16:54:03 -------
I misunderstood which log file we were talking about.

Creating a shared file in tmp/gram_job_state is even worse due to the
permissions on the directory, and could lead to an exploit. If you are going to
use that directory, use mktemp.
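
With Perl's File::Temp, that would look something like this (a sketch; the
template and directory here are illustrative):

use File::Temp qw(tempfile);

# tempfile() opens the file with O_EXCL, so no other user can
# pre-create or hijack the name.
my ($fh, $logfile) = tempfile(
    "gram_condor_log_XXXXXX",                        # template (illustrative)
    DIR    => "$ENV{GLOBUS_LOCATION}/tmp/gram_job_state",
    UNLINK => 0,    # keep the file after this process exits
);
close($fh);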
------- Comment #11 From 2005-11-15 17:03:22 -------
I'm not expecting any miracles in terms of better monitoring interfaces. Nobody
else does it either, AFAIK. I just think that in the age of service based
applications, it'd be nice if resource managers would start providing interfaces
in line with that philosophy.

At the moment there is only one SEG per resource manager per container. There's
nothing preventing us from changing to a per-user SEG model, but that's a lot
more work than just hacking a perl script.
------- Comment #12 From 2005-11-15 17:08:13 -------
Perhaps a log file seems hackish, but it's pretty reliable. We can log the
events whether or not someone is listening. The listener can read past events
whether or not Condor is currently running. Doing something equally reliable in
a direct publish-subscribe model is non-trivial. 

I'm glad that we're discussing what the best solution might be, but that sort of
discussion might be done more easily over the phone or in person. There are a
fair number of issues involved. I'm happy to organize a phone call if people are
interested.

Let's please not forget the high-level problem here. As Globus 4.0.1 ships right
now, only one user can submit jobs to the Condor jobmanager. Is there a chance
that some solution can be in place for Globus 4.0.2? 

I'll stick with my simplistic world-writable hack in the VDT for now, because I
don't have the time or knowledge to modify both pre-web services and
web-services GRAM to do something more sophisticated. If someone can provide a
better fix, I'll gladly take it into the VDT. 
------- Comment #13 From 2005-11-15 17:11:41 -------
BTW, creating a per-job log file would necessitate an equal number of SEGs to
be started. That's totally not scalable. This is why I started talking about a
per-user model. That's still not as scalable as a single SEG per resource
manager, but if permissions are a problem then it may be unavoidable.
------- Comment #14 From 2005-11-15 17:19:16 -------
Why can't a single SEG monitor multiple log files?
------- Comment #15 From 2005-11-16 10:59:22 -------
I'm sure I'm missing all sorts of subtle details here, but I don't understand
why condor doesn't use the syslog facility. This would seem to allow for a
world-appendable, world-readable log file.
------- Comment #16 From 2005-11-16 12:10:10 -------
I need to amend my bug report:

1) This definitely affects pre-web services, because the Condor job log file is
not created by the Globus installation when installing just the pre-web services. 

2) This only partly affects web services, because the setup-seg-condor script
does create the Condor job log file with the correct (world-writable)
permissions. I still think that condor.pm needs to ensure that the file
exists and is world-writable, because:

  a) Users might change the location of the job log file by editing
$GLOBUS_LOCATION/etc/globus-condor.conf

  b) Users might delete the job log file if it grows too large. 

------- Comment #17 From 2005-11-16 12:23:27 -------
Von--

The log in question is a log of what has happened to a job or set of jobs. It is
meant for users to look at, to understand what has happened to their job. It is
machine-readable, and is also meant for user tools (like our DAGMan tool for
coordinating sets of jobs) to read so that they can control what is happening.

If the only option was to send it to /var/log/syslog:

  a) Users could not normally see what happened to their job, 
     because most people do not allow users to see the syslog. 

  b) If the syslog was open:

     b1) I could see information about other people's jobs, which may have
     privacy concerns. Globus's choice to use a single log file for all
     jobs aside, most people do not do this. 

     b2) The syslog rotates, and eventually information about my job
     will be thrown out. It's therefore not a reliable store for users
     who want to keep information persistently: something that is quite common.

If the Globus software would prefer to see user job events in the syslog, it
could do it easily: just read the events out of the user's job log file, and
dump it into the syslog.

------- Comment #18 From 2005-11-16 13:00:32 -------
I think Jaime is right and we will need to modify the SEG to monitor multiple
files and send multiple SEG restart markers. I don't think we want to make this
change in 4.0.2 unless this is viewed as a showstopper.

The SEG would have to be told dynamically which new log files (probably based
on the usernames) it should monitor. If the SEG started looking at
globus_condor_user_*.log, then any user could create a file
globus_condor_user_stugotcha.log and spew events into a log file.

-Stu
------- Comment #19 From 2005-11-16 13:20:44 -------
Or it could check for the well-known name in the well-known directory but also
require 640 permissions (with a privileged globus group being able to read it).
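
Roughly (a sketch; the privileged group name 'globus' is an assumption):

# Only trust a candidate log file if its mode is exactly 0640 and
# it is owned by the privileged group.
sub trusted_log
{
    my ($logfile) = @_;

    my @st = stat($logfile) or return 0;
    my $mode = $st[2] & 07777;               # permission bits only
    my $globus_gid = getgrnam('globus');     # assumed group name

    return ($mode == 0640 && $st[5] == $globus_gid);
}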
------- Comment #20 From 2006-02-02 14:34:12 -------
I think we are seeing the beginnings of scalability problems with having one
Condor log file for all jobs and users. We don't have complete details yet, but
we have an OSG/ATLAS person reporting a very large Condor log file that appears
to be causing the Condor Grid Monitor to consume excessive CPU.

The Grid Monitor uses the Condor log file on a head node to find out the status
of all jobs for a (local) user. It does so fairly often, and as the Condor log
gets very large, this process takes a lot of time. Furthermore, as long as there
are Condor jobs running, there's no way to rotate out the Condor log file and
hence get back to speedy processing. Therefore, in a production environment, the
log file gets too big and the Condor Grid Monitor bogs down. We may also be
seeing submit-side problems as a result of this behavior, but we're not sure.

So, for 4.0.2, please consider moving to a per-user Condor log file. And even
then, what should administrators do when log files grow excessively large?
------- Comment #21 From 2006-02-03 11:35:38 -------
I would like to amend what Tim is saying: this problem is independent of the
Condor-G gridmonitor and happens when the gridmonitor is not in use. 

When using pre-web services GT4, there is a single Condor log file. Each
jobmanager periodically polls the job status by looking at this log file. The
polling happens by reading the entire log file. When many jobs have been
submitted, this is very slow. 

It's not clear how the log file can be rotated: if we rotate it at the wrong
time, the jobmanager might miss events that it hasn't seen yet. 

What do you think?
------- Comment #22 From 2006-02-03 11:42:13 -------
I think it would be pretty easy to modify the jobmanager perl scripts for
pre-WS GRAM to deal with a condor user log file per job. That would solve the
problem completely for pre-WS GRAM, but pose problems for the SEG in WS GRAM.
------- Comment #23 From 2007-02-15 16:49:46 -------
Joe,

If we can't think of a better, easier way to do this, then we should at least
apply Alain's patch.

-Stu
------- Comment #24 From 2007-12-10 19:49:34 -------
This patch is applied to trunk and 4.0 branch.  Note that the condor log file
is normally created when the Condor SEG setup package is run.

Joe