Bugzilla – Bug 5775
gram status of old jobs incorrect on some lsf systems
Last modified: 2012-09-05 13:44:42
Copied from an email I wrote a couple days ago:

Yesterday and today, I worked with Parag to debug the stuck jobs submitted from samgfwd01.fnal.gov to grid1.oscer.ou.edu. The jobs were submitted using Condor-G to the pre-WS GRAM service on grid1.oscer.ou.edu (which submits into LSF).

The jobs were submitted on January 3rd and 4th. After Condor submitted the jobs to GRAM, it shut down the GRAM jobmanagers, relying on the Condor grid monitor to obtain the status of the jobs. At some point, the grid monitor reported that the jobs had completed. At some later point, Condor restarted some of the GRAM jobmanagers to obtain the output files of the completed jobs. By the time the jobmanagers were restarted, the jobs had been purged by LSF and bjobs reported that the jobs couldn't be found. The jobmanager's reaction was to assume the problem was temporary and keep the last job status it saw, which was PENDING.

So Condor is waiting for the jobmanagers to transfer output files and then report the jobs as DONE. Meanwhile, the jobmanagers think the jobs are still sitting in the LSF queue waiting to run. Condor has a limit on the number of jobmanagers it will let run at a time (default 10). So once 10 jobs get into this state, no other jobs to this gatekeeper can make any forward progress, because they must wait for one of the 10 jobmanagers to exit.

We don't know how long LSF waits after a job completes before purging it. We also don't know why Condor took longer than that time to restart some of the jobmanagers for completed jobs.

There appears to be a bug in GRAM's calling of bjobs. It tries to detect when bjobs says a job doesn't exist and report an error back to the client. GRAM assumes bjobs will return exit code 255 in this case, but bjobs on grid1.oscer.ou.edu returns exit code 0. We don't know if this behavior has changed between different versions of LSF.

One immediate remedy that can be done on grid1.oscer.ou.edu is to change the GRAM code that executes bjobs.
It's in the Perl module $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/lsf.pm, in subroutine poll(). The code that compares $exit_code to 255 should also treat exit code 0 combined with "Job <...> is not found" in stderr as a job that doesn't exist.

-----------------------------------------------

Some additional information:

The LSF version on grid1.oscer.ou.edu:
Platform LSF HPC 6.1 for Linux, Nov 4 2005
Copyright 1992-2005 Platform Computing Corporation
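The check suggested for poll() can be sketched roughly as follows. This is only an illustration written in Python, not the Perl code from lsf.pm; the function name job_is_missing and the exact regular expression are assumptions based on the "Job <...> is not found" message quoted in this report.

```python
import re

# Assumed pattern for the stderr message quoted in the report:
# "Job <...> is not found". The real bjobs output may differ by LSF version.
NOT_FOUND = re.compile(r"Job <[^>]*> is not found")


def job_is_missing(exit_code: int, stderr: str) -> bool:
    """Return True when a bjobs result indicates the job is unknown to LSF.

    Hypothetical helper: covers both the exit code GRAM already expects (255)
    and the behavior seen on grid1.oscer.ou.edu (exit code 0 plus the
    "is not found" message on stderr).
    """
    if exit_code == 255:
        return True
    return exit_code == 0 and NOT_FOUND.search(stderr) is not None
```

With a check like this, a purged job would be reported as missing instead of being left in its last known state (PENDING), while a normal bjobs run with exit code 0 and no such message would still be treated as a valid status query.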
Doing some Bugzilla cleanup... Resolving old GRAM3 and GRAM4 issues that are no longer relevant since we've moved on to GRAM5. Also, we're now tracking issues in Jira. Any new issues should be added here: http://jira.globus.org/secure/VersionBoard.jspa?selectedProjectId=10363