Bug 6688 - Need fix to LSF job manager to use bacct when bhist doesn't work
Status: RESOLVED FIXED
Product: GRAM
Component: gt2 Gatekeeper/Jobmanager
Version: 4.0.8
Platform: Open Science Grid (OSG) All
Importance: P3 normal
Target Milestone: 4.0.9
Reported: 2009-03-11 13:25
Modified: 2009-05-26 16:34



Description From 2009-03-11 13:25:59
Hi,

The VDT has a new ticket about the LSF job manager.
<http://crt.cs.wisc.edu/Ticket/Display.html?user=guest&pass=guest&id=5009> The
ticket is long and involved, but one major item from it is this:

> Depending on when bhist is run on a job, it may not return any output.
> bhist is only useful for looking at jobs that are in the current
> working set (in the queue, running, or recently exited).  bhist only
> looks at the current lsb.events file (and enough history to get a
> complete picture of any job listed in the current events log).  On a
> busy cluster the lsb.events log could turn over several times a day.
> Once it's cycled out, if you need info on a job you have to parse the
> output of 'bacct -l jobid' or use the LSF API to create a custom
> query tool.

Please see the VDT ticket for complete details.
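
For readers following the thread, the fallback described in the quoted text could be sketched roughly as below. This is only an illustration, not code from lsf.pm: the helper names `run_command` and `job_history` are hypothetical, `bhist` and `bacct` are assumed to be on the PATH, job IDs are assumed to be plain integers, and "bhist printed nothing on standard output" is used as the trigger for trying `bacct -l`, following the ticket's remark that bhist "may not return any output".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of lsf.pm): run a command and return its
# standard output, discarding standard error.
sub run_command {
    my @cmd = @_;
    my $output = qx{@cmd 2>/dev/null};
    return defined $output ? $output : '';
}

# Look up a job's history, preferring bhist and falling back to
# "bacct -l" when bhist prints nothing. Returns (tool, text); tool is
# 'none' when neither command produced any output.
# Assumption: job IDs are plain integers.
sub job_history {
    my ($job_id, $runner) = @_;
    die "unexpected job id '$job_id'\n" unless $job_id =~ /^\d+\z/;
    $runner ||= \&run_command;

    my $text = $runner->('bhist', $job_id);
    return ('bhist', $text) if $text =~ /\S/;

    $text = $runner->('bacct', '-l', $job_id);
    return ('bacct', $text) if $text =~ /\S/;

    return ('none', '');
}

# Usage with a stubbed runner, so the sketch can be tried without LSF.
# The stub pretends that bhist has already lost track of the job.
my $stub = sub { $_[0] eq 'bhist' ? '' : "stubbed accounting text\n" };
my ($tool, $text) = job_history(123, $stub);
print "answered by: $tool\n";
```

Passing the command runner in as a code reference keeps the decision logic separate from the actual LSF calls. The text returned by `bacct -l` would still need to be parsed, as the ticket notes.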

We'd like to request a patch to the LSF job manager to address this problem. Do
you think you can do it? 

-alain
-----------------------------------------------------------------
Alain Roy
Open Science Grid Software Coordinator            roy@cs.wisc.edu
http://opensciencegrid.org                 http://vdt.cs.wisc.edu
------- Comment #1 From 2009-03-11 13:27:29 -------
Oh, one more thing to draw your attention to in the ticket, that is closely
related:

> There's another part to the lsf.pm fix that I haven't seen in the
> ticket. The module assumes that when bjobs reports that it has no
> knowledge of a job ("Job <123> is not found"), it will exit with exit
> status 255. Parag and I observed it exiting with exit status 0 in this
> case. So lsf.pm needs to examine the output of the tool, not just the
> exit status.
------- Comment #2 From 2009-03-26 10:02:07 -------
Any thoughts on this Stu?
Thanks! -alain
------- Comment #3 From 2009-03-27 08:16:34 -------
I have not had a chance to look at this yet.  I should be able to after next
week.

We don't have an LSF install, so we'll need to work with the site's install on
this one.
------- Comment #4 From 2009-03-27 10:30:49 -------
On Mar 27, 2009, at 8:16 AM, bugzilla@globus.org wrote:
> We don't have an LSF install, so we'll need to work with the site's
> install on this one.

Horst will be happy to help you out.

-alain
------- Comment #5 From 2009-05-26 14:50:26 -------
I worked with Horst and have confirmed a solution to this problem.  Sometimes
LSF's bjobs command can return 0 even though the job is actually "not found",
so an additional check was needed to determine whether the job was found.  Here
is the cvs diff from the 4.0 branch that solved the problem.

RCS file: /home/globdev/CVS/globus-packages/gram/jobmanager/setup/lsf/lsf.in,v
retrieving revision 1.19.6.7
retrieving revision 1.19.6.6
diff -r1.19.6.7 -r1.19.6.6
398,400c398
<     # 5/09: On some systems, bjobs can return 0, but the job is *not found*.
<     #       An additional check for this has been added below.
<     if (($exit_code == 255) || (($exit_code == 0) && ($_ =~ /is not found/)))
---
>     if($exit_code == 255)
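
As a standalone illustration of the check in the diff above, the condition can be written as a small predicate that takes the exit code and the tool's output as explicit arguments. This is a sketch under assumptions, not the code in lsf.in: the function name `bjobs_says_not_found` is hypothetical, and the patch itself matches against `$_`, which presumably holds the bjobs output at that point in the module.

```perl
use strict;
use warnings;

# Hypothetical predicate mirroring the patched condition: the job counts
# as "not found" if bjobs exited with 255, or if it exited with 0 but its
# output contains the "is not found" message.
sub bjobs_says_not_found {
    my ($exit_code, $output) = @_;
    $output = '' unless defined $output;
    return 1 if $exit_code == 255;
    return 1 if $exit_code == 0 && $output =~ /is not found/;
    return 0;
}

# A few cases to try; the third output string is made-up placeholder text.
for my $case ([255, ''],
              [0,   "Job <123> is not found\n"],
              [0,   "placeholder listing for job 123\n"]) {
    my ($code, $out) = @$case;
    printf "exit=%d not_found=%d\n", $code, bjobs_says_not_found($code, $out);
}
```

Taking the output as a parameter avoids depending on whatever `$_` last held, at the cost of requiring the caller to capture the bjobs output first.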
------- Comment #6 From 2009-05-26 16:12:17 -------
Thanks Stu!

Your patch doesn't have context. Can you provide a patch with context, or
verify that I'm looking at the correct bit of code? I think it's the following
snippet from the poll() function that needs to be modified:

    # get the exit code of the bjobs command.  For more info, do a 
    # search for $CHILD_ERROR in perlvar documentation.
    $exit_code = $? >> 8;

    # Verifying that the job is no longer there.
    # return code 255 = "Job <123> is not found"
    if($exit_code == 255)
    {
        $self->log("bjobs rc is 255 == Job <123> is not found, running bhist");
        # The job was not found. It can also be that it queried
        # LSF too soon: using bhist to determine whether the job
        # was actually submitted
        $statusBhist = system("$bhist $job_id 1>/dev/null 2>/dev/null");


So I think it's the "if($exit_code == 255)" in the middle there that needs to
be modified. Is that correct?

Thanks,
-alain
------- Comment #7 From 2009-05-26 16:34:24 -------
Yes - that's right.  Maybe this is better.

[anlextwls097-148:jobmanager/setup/lsf] smartin% diff -c lsf.in foo/gram/jobmanager/setup/lsf/lsf.in
*** lsf.in    2009-05-26 14:37:38.000000000 -0500
--- foo/gram/jobmanager/setup/lsf/lsf.in    2007-06-12 15:44:09.000000000 -0500
***************
*** 395,403 ****

      # Verifying that the job is no longer there.
      # return code 255 = "Job <123> is not found"
!     # 5/09: On some systems, bjobs can return 0, but the job is *not found*.
!     #       An additional check for this has been added below.
!     if (($exit_code == 255) || (($exit_code == 0) && ($_ =~ /is not found/)))
      {
          $self->log("bjobs rc is 255 == Job <123> is not found, running bhist");
          # The job was not found. It can also be that it queried
--- 395,401 ----

      # Verifying that the job is no longer there.
      # return code 255 = "Job <123> is not found"
!     if($exit_code == 255)
      {
          $self->log("bjobs rc is 255 == Job <123> is not found, running bhist");
          # The job was not found. It can also be that it queried