Bug 6184 - pbs.pm jobmanager fails jobs on qstat failure
Component: gt2 Gatekeeper/Jobmanager
Version: 4.0.4
Platform: All All
Importance: P3 major
Reported: 2008-06-24 21:53
Modified: 2012-09-05 13:44

Description From 2008-06-24 21:53:23

Our PBS installation at the HPCC at USC frequently appears to be "overwhelmed"
and refuses to answer any PBS request, be it qsub, qstat or pbsnodes. The
command returns "permission denied ... errno=15007" and an exit code that is
neither 0 nor 153. The condition occurs at random but is transient. Running
jobs are not affected while PBS is in this state.

Unfortunately, after a run of failing qstat calls, Globus concludes that the
job has died, even though it is still running normally. Worse, my local
workflow manager acts on the wrong information from Globus, considers the job
failed, and restarts it. Thus, I end up with two running copies of the job and,
in certain cases, a WAW (write-after-write) conflict.

I'll attach a proposed fix to the pbs.pm module that appears to fix this bug
for me. I am currently running with an alternate jobmanager that uses the
fixed module. While other people using the regular jobmanager had problems
last night and last weekend, my jobs ran fine and were unaffected. Thus,
I think the fix will solve the problem.
------- Comment #1 From 2008-06-24 21:55:26 -------
*** 493,512 ****
--- 493,520 ----
      # verifying that the job is no longer there.
      if($exit_code == 153)
      {
          $self->log("qstat rc is 153 == Unknown Job ID == DONE");
          $state = Globus::GRAM::JobState::DONE;
          $self->nfssync( $description->stdout() )
              if $description->stdout() ne '';
          $self->nfssync( $description->stderr() )
              if $description->stderr() ne '';
      }
+     elsif($exit_code != 0)
+     {
+     # While this could be an indication of serious trouble, it
+     # is safe to assume, once a job managed to get queued, any
+     # problems running qstat are of transient nature. 
+     $self->log("qstat failed (ec=$exit_code). Tell JM to ignore this poll");
+     return {};
+     }

          # Get 3rd field (after = )
          $_ = (split(/\s+/))[3];

              $state = Globus::GRAM::JobState::PENDING;
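
The three-way decision the patch introduces can be sketched in isolation. This is only an illustration under stated assumptions, not code from pbs.pm: the helper name classify_qstat_exit and the labels 'done', 'parse' and 'skip' are hypothetical, and the sketch does not touch the Globus::GRAM API. It encodes the exit-code meanings given in this report: 153 means the job ID is unknown, 0 means qstat answered, and anything else is treated as a transient failure.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of pbs.pm): map a qstat exit code to the
# action a poll routine would take, following the logic of the patch above.
sub classify_qstat_exit {
    my ($exit_code) = @_;

    # 153 == Unknown Job ID: the job has left the queue, so report it done.
    return 'done' if $exit_code == 153;

    # 0: qstat answered, so the job state can be parsed from its output.
    return 'parse' if $exit_code == 0;

    # Any other code: assume a transient PBS problem and skip this poll
    # instead of declaring the job failed.
    return 'skip';
}

# Walk through one example of each case.
for my $ec (153, 0, 1) {
    printf "qstat exit code %d -> %s\n", $ec, classify_qstat_exit($ec);
}
```

In the patch itself the 'skip' case corresponds to returning an empty hash, which the log message describes as telling the jobmanager to ignore that poll.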
------- Comment #2 From 2012-09-05 13:44:52 -------
Doing some Bugzilla cleanup...  Resolving old GRAM3 and GRAM4 issues that are
no longer relevant since we've moved on to GRAM5.  Also, we're now tracking
issues in JIRA.  Any new issues should be added here: