Bug 1425 - jobmanager-condor bug breaks Grid3 scalability improvements
: jobmanager-condor bug breaks Grid3 scalability improvements
Status: RESOLVED FIXED
: GRAM
gt2 Gatekeeper/Jobmanager
: unspecified
: PC Linux
: P2 major
: ---
Assigned To:
:
:
:
:
  Show dependency treegraph
 
Reported: 2003-11-25 17:19 by
Modified: 2004-03-12 17:17 (History)


Attachments
fixed bad return statement (121.76 KB, application/x-gzip-compressed)
2003-12-09 10:10, Stuart Martin
Details


Note

You need to log in before you can comment on or make changes to this bug.


Description From 2003-11-25 17:19:18
SUMMARY:

There is a bug in the Globus jobmanager-condor script "condor.pm" (also known 
as "condor.in"), which can cause the jobmanager and the condor gridmonitor to 
fail and indefinitely report the incorrect status for a job.


TECHNICAL DETAILS:

Specifically, the poll() function's return value *type* is incorrect in certain 
circumstances, breaking code which expects the correct type.  The poll() 
function should always return a Perl hash (which in turn must contain an 
element with a JOB_STATE key).  Instead it is returning a scalar (or an invalid 
hash, depending on the version in CVS -- keep reading).

We tracked it down in bonsai, and the bug was initially introduced in CVS 
version 1.3.14.2 of the "condor.in" file in the Globus CVS repository (see 
green text at bottom right of the diff below), when the function was modified 
with added code that incorrectly returns a scalar status instead of a hash:

http://choate.mcs.anl.gov:8080/bonsai/cvsview2.cgi?
diff_mode=context&whitespace_mode=show&subdir=gram/jobmanager/setup/condor&comma
nd=DIFF_FRAMESET&file=condor.in&rev1=1.3.14.1&rev2=1.3.14.2&root=/home/globdev/C
VS/globus-packages

It appears a fix to this line was attempted in CVS version 1.5.18.1 of 
the "condor.in" file in the Globus CVS repository (see blue text near bottom 
right of the diff below, after line 273), where poll() was modified to return a 
scalar inside a hash -- but it's still missing a JOB_STATE key:

http://choate.mcs.anl.gov:8080/bonsai/cvsview2.cgi?
diff_mode=context&whitespace_mode=show&subdir=gram/jobmanager/setup/condor&comma
nd=DIFF_FRAMESET&file=condor.in&rev1=1.5&rev2=1.5.18.1&root=/home/globdev/CVS/gl
obus-packages

The bug can be fixed by simply correcting the return value to look like this:

	return { JOB_STATE => Globus::GRAM::JobState::DONE }


ACTION REQUESTED:

The Condor team is releasing an update to the current VDT with a fix for this 
bug, but we'd rather not distribute or maintain any "unofficial" Globus 
patches, so we'd like you to please release an advisory fix for Globus 2.4 
(which is derived from the earlier, version 1.3.14.2, condor.in).  Also, please 
fix the head of CVS, which is derived from the later version 1.5.18.1.

Thanks!

-Peter, Alan, Todd, Jaime, etc.
------- Comment #1 From 2003-11-25 17:31:22 -------
Sorry to pipe up here, but I fear that CVS version may (re) introduce the darn
IO::File. I cannot check, because I cannot follow the choate links. If all that
needs to be done is set 

 || return { JOB_STATE => Globus::GRAM::JobState::DONE };

we should do exactly that instead of reverting to a version that uses IO::*.

Mea culpa, isn't it?
------- Comment #2 From 2003-11-25 17:43:12 -------
Clickable links:

The first link (The old erroneous code), the link "287" at the bottom should get
you to the correct line of code:
http://tinyurl.com/wk7h

The second link (the new code), the link "287" at the bottom should get you to
the correct line of code:
http://tinyurl.com/wk7s

One of the other changes in the code is that the old code uses IO::File, the new
code uses open.  The old one should probably be updated because it's in real
world use.  The new one should be updated because it's new; it already includes
the optimization patches and thus does not use IO::File.
------- Comment #3 From 2003-11-25 18:01:31 -------
At that point, we may also fix the cancel() method. At one place it returns a
scalar, at another place a hash. Only one can be correct...
------- Comment #4 From 2003-11-25 18:05:01 -------
Jens,

I explicitly requested that only the line with the "return"
statement in it be fixed.  I didn't ask that any patches be backed
out.

So fear not -- I don't see why fixing this bug would result in the
re-introduction of IO:File.

-Peter

------- Comment #5 From 2003-12-04 13:35:22 -------
Peter,

I am looking into doing what you are requesting.   However, it looks to me that 
the IO bug would be reintroduced if I take version 1.3.14.2 and only change the 
return line as you requested.  Below is the relvant section including your 
fix.  Notice the first line contains the IO:File line, but there is no "use 
IO::File;" statement.  I am thinking we should talk out a solution on the phone.

>>>>
    $condor_log_file = new IO::File($self->{condor_logfile}, '<');
    if ( ! defined $condor_log_file )
    {
        return { JOB_STATE => Globus::GRAM::JobState::DONE };
    }
<<<<
------- Comment #6 From 2003-12-04 14:00:01 -------
Please DO NOT re-introduce IO::File. The latest revision of bug #950 includes a
fix for this bug without reintroducing IO::swampme.
------- Comment #7 From 2003-12-09 10:10:57 -------
Created an attachment (id=278) [details]
fixed bad return statement

This is an update to the condor.in that was in the GT 2.4.3 release.  After
testing it will be added to the globus advisory page at
http://www-unix.globus.org/toolkit/advisories.html
------- Comment #8 From 2004-03-12 17:17:49 -------
Advisory added.