Bug 5775 - gram status of old jobs incorrect on some lsf systems
Status: RESOLVED WONTFIX
Product: GRAM
Component: gt2 Gatekeeper/Jobmanager
Version: 4.0.3
Platform: All All
Importance: P3 normal
Target Milestone: ---
Reported: 2008-01-11 11:00
Modified: 2012-09-05 13:44




Description From 2008-01-11 11:00:18
Copied from an email I wrote a couple days ago:

Yesterday and today, I worked with Parag to debug the stuck jobs submitted from
samgfwd01.fnal.gov to grid1.oscer.ou.edu. The jobs were submitted using
Condor-G to the pre-WS GRAM service on grid1.oscer.ou.edu (which submits into
LSF).

The jobs were submitted on January 3rd and 4th. After Condor submitted the
jobs to GRAM, it shut down the GRAM jobmanagers, relying on the Condor grid
monitor to obtain the status of the jobs. At some point, the grid monitor
reported that the jobs completed. At some later point, Condor restarted some of
the GRAM jobmanagers to obtain the output files of the completed jobs. By the
time the jobmanagers were restarted, the jobs had been purged by LSF and bjobs
reported the jobs couldn't be found. The jobmanager's reaction was to assume
the problem was temporary and keep the last job status it saw, which was
PENDING.

So Condor is waiting for the jobmanagers to transfer output files and then
report the jobs as DONE. Meanwhile, the jobmanagers think the jobs are still
sitting in the LSF queue waiting to run. Condor has a limit on the number of
jobmanagers it will let run at a time (default 10). So once 10 jobs get into
this state, no other jobs to this gatekeeper can make any forward progress,
because they must wait for one of the 10 jobmanagers to exit.
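
The blocking effect can be illustrated with a toy model. This is only a sketch
of the slot accounting described above, not Condor's actual scheduling code;
the function name, the (job_id, is_stuck) tuples and the return values are
assumptions made for illustration.

```python
from collections import deque


def simulate(jobs, max_jobmanagers=10):
    """Toy model of a per-gatekeeper jobmanager limit.

    jobs: list of (job_id, is_stuck) in the order jobmanagers are started.
    A stuck jobmanager never exits, so it holds its slot forever.
    Returns (finished, stuck_slots, blocked).
    """
    waiting = deque(jobs)
    stuck_slots = []   # slots held by jobmanagers that never exit
    finished = []      # jobs whose jobmanager ran and exited normally

    # Jobs can start only while at least one slot is not held by a stuck job.
    while waiting and len(stuck_slots) < max_jobmanagers:
        job_id, is_stuck = waiting.popleft()
        if is_stuck:
            stuck_slots.append(job_id)
        else:
            finished.append(job_id)

    # Everything still waiting can make no forward progress.
    blocked = [job_id for job_id, _ in waiting]
    return finished, stuck_slots, blocked
```

In this model, once the tenth stuck job takes a slot, every job still waiting
ends up in the blocked list, however healthy it is.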

We don't know how long after a job completes LSF waits before purging it. We
also don't know why Condor took longer than that time to restart some of the
jobmanagers for completed jobs.

There appears to be a bug in GRAM's calling of bjobs. It tries to detect when
bjobs says a job doesn't exist and report an error back to the client. GRAM
assumes bjobs will return exit code 255 in this case. But bjobs on
grid1.oscer.ou.edu returns exit code 0. We don't know if this behavior has
changed between different versions of LSF.
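
To compare behavior across LSF versions, one could record what bjobs does for a
job ID that is known not to exist. The helper below is a minimal sketch, not
part of GRAM; the function name is hypothetical and the job ID in the usage
comment is just an arbitrary value assumed to be unknown to LSF.

```python
import subprocess


def probe_missing_job(command):
    """Run a status command and return (exit_code, stderr_text).

    command: argument list, for example ["bjobs", "<job id>"].
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stderr.strip()


# Example use on a host where bjobs is installed (job ID is an assumption):
#   code, err = probe_missing_job(["bjobs", "999999999"])
#   print(code, repr(err))
```

On grid1.oscer.ou.edu, according to this report, such a probe would be expected
to show exit code 0 rather than 255.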

One immediate remedy that can be applied on grid1.oscer.ou.edu is to change the
GRAM code that polls job status with bjobs. It's in the perl module
$GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/lsf.pm, in subroutine poll().
Where the code compares $exit_code to 255, it should also treat exit code 0
combined with "Job <...> is not found" in stderr as a missing job.
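
The proposed check can be sketched as follows. This is an illustration of the
logic only, written in Python rather than as a patch to the Perl code in
lsf.pm; the function name and the exact regular expression for the job ID are
assumptions.

```python
import re

# Assumed shape of the message; the report only gives "Job <...> is not found".
NOT_FOUND = re.compile(r"Job <[^>]*> is not found")


def job_is_missing(exit_code, stderr_text):
    """Return True if the bjobs result indicates the job no longer exists."""
    # Behavior GRAM already expects: exit code 255 means the job is unknown.
    if exit_code == 255:
        return True
    # Behavior seen on grid1.oscer.ou.edu: exit code 0 plus a message on stderr.
    return exit_code == 0 and NOT_FOUND.search(stderr_text) is not None
```

Requiring the stderr message alongside exit code 0 keeps an ordinary successful
bjobs call from being mistaken for a missing job.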

-----------------------------------------------

Some additional information:
The LSF version on grid1.oscer.ou.edu:
Platform LSF HPC 6.1 for Linux, Nov  4 2005
Copyright 1992-2005 Platform Computing Corporation
------- Comment #1 From 2012-09-05 13:44:42 -------
Doing some Bugzilla cleanup...  Resolving old GRAM3 and GRAM4 issues that are
no longer relevant since we've moved on to GRAM5.  Also, we're now tracking
issues in JIRA.  Any new issues should be added here:

http://jira.globus.org/secure/VersionBoard.jspa?selectedProjectId=10363