Bugzilla – Bug 1425
jobmanager-condor bug breaks Grid3 scalability improvements
Last modified: 2004-03-12 17:17:49
You need to log in before you can comment on or make changes to this bug.
SUMMARY: There is a bug in the Globus jobmanager-condor script "condor.pm" (also known as "condor.in"), which can cause the jobmanager and the condor gridmonitor to fail and indefinitely report the incorrect status for a job. TECHNICAL DETAILS: Specifically, the poll() function's return value *type* is incorrect in certain circumstances, breaking code which expects the correct type. The poll() function should always return a Perl hash (which in turn must contain an element with a JOB_STATE key). Instead it is returning a scalar (or an invalid hash, depending on the version in CVS -- keep reading). We tracked it down in bonsai, and the bug was initially introduced in CVS version 1.3.14.2 of the "condor.in" file in the Globus CVS repository (see green text at bottom right of the diff below), when the function was modified with added code that incorrectly returns a scalar status instead of a hash: http://choate.mcs.anl.gov:8080/bonsai/cvsview2.cgi? diff_mode=context&whitespace_mode=show&subdir=gram/jobmanager/setup/condor&comma nd=DIFF_FRAMESET&file=condor.in&rev1=1.3.14.1&rev2=1.3.14.2&root=/home/globdev/C VS/globus-packages It appears a fix to this line was attempted in CVS version 1.5.18.1 of the "condor.in" file in the Globus CVS repository (see blue text near bottom right of the diff below, after line 273), where poll() was modified to return a scalar inside a hash -- but it's still missing a JOB_STATE key: http://choate.mcs.anl.gov:8080/bonsai/cvsview2.cgi? diff_mode=context&whitespace_mode=show&subdir=gram/jobmanager/setup/condor&comma nd=DIFF_FRAMESET&file=condor.in&rev1=1.5&rev2=1.5.18.1&root=/home/globdev/CVS/gl obus-packages The bug can be fixed by simply correcting the return value to look like this: return { JOB_STATE => Globus::GRAM::JobState::DONE } ACTION REQUESTED: The Condor team is releasing an update to the current VDT with a fix for this bug, but we'd rather not distribute or maintain any "unofficial" Globus patches, so we'd like you to please release an advisory fix for Globus 2.4 (which is derived from the earlier, version 1.3.14.2, condor.in). Also, please fix the head of CVS, which is derived from the later version 1.5.18.1. Thanks! -Peter, Alan, Todd, Jaime, etc.
Sorry to pipe up here, but I fear that CVS version may (re) introduce the darn IO::File. I cannot check, because I cannot follow the choate links. If all that needs to be done is set || return { JOB_STATE => Globus::GRAM::JobState::DONE }; we should do exactly that instead of reverting to a version that uses IO::*. Mea culpa, isn't it?
Clickable links: The first link (The old erroneous code), the link "287" at the bottom should get you to the correct line of code: http://tinyurl.com/wk7h The second link (the new code), the link "287" at the bottom should get you to the correct line of code: http://tinyurl.com/wk7s One of the other changes in the code is that the old code uses IO::File, the new code uses open. The old one should probably be updated because it's in real world use. The new one should be updated because it's new; it already includes the optimization patches and thus does not use IO::File.
At that point, we may also fix the cancel() method. At one place it returns a scalar, at another place a hash. Only one can be correct...
Jens, I explicitly requested that only the line with the "return" statement in it be fixed. I didn't ask that any patches be backed out. So fear not -- I don't see why fixing this bug would result in the re-introduction of IO:File. -Peter
Peter, I am looking into doing what you are requesting. However, it looks to me that the IO bug would be reintroduced if I take version 1.3.14.2 and only change the return line as you requested. Below is the relvant section including your fix. Notice the first line contains the IO:File line, but there is no "use IO::File;" statement. I am thinking we should talk out a solution on the phone. >>>> $condor_log_file = new IO::File($self->{condor_logfile}, '<'); if ( ! defined $condor_log_file ) { return { JOB_STATE => Globus::GRAM::JobState::DONE }; } <<<<
Please DO NOT re-introduce IO::File. The latest revision of bug #950 includes a fix for this bug without reintroducing IO::swampme.
Created an attachment (id=278) [details] fixed bad return statement This is an update to the condor.in that was in the GT 2.4.3 release. After testing it will be added to the globus advisory page at http://www-unix.globus.org/toolkit/advisories.html
Advisory added.