Bugzilla – Bug 6688
Need fix to LSF job manager to use bacct when bhist doesn't work
Last modified: 2009-05-26 16:34:24
Hi,

The VDT has a new ticket about the LSF job manager:

<http://crt.cs.wisc.edu/Ticket/Display.html?user=guest&pass=guest&id=5009>

The ticket is long and involved, but one major item from it is this:

> Depending on when bhist is run on a job, it may not return any output.
> bhist is only useful for looking at jobs that are in the current
> working set (in the queue, running, or recently exited). bhist only
> looks at the current lsb.events file (and enough history to get a
> complete picture of any job listed in the current events log). On a
> busy cluster the lsb.events log could turn over several times a day.
> Once it's cycled out if you need info on a job you have to parse the
> output of 'bacct -l jobid' or use the lsf api to create a custom
> query tool.

Please see the VDT ticket for complete details. We'd like to request a patch to the LSF job manager to address this problem. Do you think you can do it?

-alain

-----------------------------------------------------------------
Alain Roy
Open Science Grid Software Coordinator
roy@cs.wisc.edu
http://opensciencegrid.org
http://vdt.cs.wisc.edu
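For illustration only, the fallback requested above could look roughly like the following. This is a Python sketch, not the Perl in lsf.pm, and it is not the patch that was eventually applied. The `run` callable and the `job_history` name are hypothetical; the only command lines used are `bhist <jobid>` and `bacct -l <jobid>`, both taken from the ticket text. It assumes that "bhist knows nothing about the job" shows up as empty output, which is what the ticket describes.

```python
from typing import Callable, List, Tuple

# Hypothetical runner: takes an argv list and returns (exit_code, stdout).
# In a real job manager this would wrap a subprocess call.
Runner = Callable[[List[str]], Tuple[int, str]]


def job_history(job_id: str, run: Runner) -> Tuple[str, str]:
    """Return (source, output), trying bhist first and bacct -l second."""
    _, out = run(["bhist", job_id])
    if out.strip():
        return "bhist", out

    # bhist printed nothing: the job may have cycled out of lsb.events,
    # so fall back to the accounting records.
    _, out = run(["bacct", "-l", job_id])
    if out.strip():
        return "bacct", out

    # Neither tool produced output for this job.
    return "none", ""
```

Passing the runner in keeps the sketch testable without an LSF install; the caller still has to parse whichever output comes back.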
Oh, one more thing to draw your attention to in the ticket, that is closely related:

> There's another part to the lsf.pm fix that I haven't seen in the
> ticket. The module assumes that when bjobs reports that it has no
> knowledge of a job ("Job <123> is not found"), it will exit with exit
> status 255. Parag and I observed it exiting with exit status 0 in this
> case. So lsf.pm needs to examine the output of the tool, not just the
> exit status.
Any thoughts on this Stu? Thanks! -alain
I have not had a chance to look at this yet. I should be able to after next week. We don't have an LSF install, so we'll need to work with the site's install on this one.
On Mar 27, 2009, at 8:16 AM, bugzilla@globus.org wrote:

> We don't have an LSF install, so we'll need to work with the site's
> install on this one.

Horst will be happy to help you out.

-alain
I worked with Horst and have confirmed a solution to this problem. Sometimes LSF's bjobs command can return 0 even though the job is actually "not found", so an additional check was needed to determine whether the job is not found. Here is the cvs diff from the 4 0 branch that solved the problem.

RCS file: /home/globdev/CVS/globus-packages/gram/jobmanager/setup/lsf/lsf.in,v
retrieving revision 1.19.6.7
retrieving revision 1.19.6.6
diff -r1.19.6.7 -r1.19.6.6
398,400c398
<     # 5/09: On some systems, bjobs can return 0, but the job is *not found*.
<     # An additional check for this has been added below.
<     if (($exit_code == 255) || (($exit_code == 0) && ($_ =~ /is not found/)))
---
>     if($exit_code == 255)
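As a rough restatement of the patched condition, here is a small Python sketch (the real change is the Perl line in the diff above). The function name `job_not_found` is made up for illustration. One difference to be aware of: the Perl test matches against `$_`, which is presumably the current line of bjobs output being read, while this sketch takes the whole output as one string.

```python
import re

# Same pattern the patch uses: /is not found/
NOT_FOUND = re.compile(r"is not found")


def job_not_found(exit_code: int, output: str) -> bool:
    """True if bjobs signalled 'job not found' via exit status or message."""
    if exit_code == 255:
        # Original behaviour: exit status 255 means the job is unknown.
        return True
    # New behaviour: some systems exit 0 but still print the message.
    return exit_code == 0 and NOT_FOUND.search(output) is not None
```

Like the patch, this leaves other non-zero exit codes alone, so a failure with some other status is not treated as "not found" even if the message happens to appear.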
Thanks Stu!

Your patch doesn't have context. Can you provide a patch with context, or verify that I'm looking at the correct bit of code? I think it's the following snippet from the poll() function that needs to be modified:

    # get the exit code of the bjobs command. For more info, do a
    # search for $CHILD_ERROR in perlvar documentation.
    $exit_code = $? >> 8;

    # Verifying that the job is no longer there.
    # return code 255 = "Job <123> is not found"
    if($exit_code == 255)
    {
        $self->log("bjobs rc is 255 == Job <123> is not found, running bhist");
        # The job was not found. It can also be that it queried
        # LSF too soon: using bhist to determine whether the job
        # was actually submitted
        $statusBhist = system("$bhist $job_id 1>/dev/null 2>/dev/null");

So I think it's the "if($exit_code == 255)" in the middle there that needs to be modified. Is that correct?

Thanks,
-alain
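For readers unfamiliar with the `$? >> 8` idiom in the snippet above: Perl's `$?` holds a 16-bit wait status, with the child's exit code in the high byte and signal information in the low byte. A minimal Python sketch of the same decoding, with a made-up function name, would be:

```python
def exit_code_from_wait_status(status: int) -> int:
    """Extract the exit code from a 16-bit wait status, like Perl's $? >> 8."""
    # The high byte carries the exit code; the low byte carries signal info.
    return (status >> 8) & 0xFF
```

So a wait status of 65280 (255 shifted left by 8) decodes to the 255 that the code compares against.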
Yes - that's right. Maybe this is better.

[anlextwls097-148:jobmanager/setup/lsf] smartin% diff -c lsf.in foo/gram/jobmanager/setup/lsf/lsf.in
*** lsf.in 2009-05-26 14:37:38.000000000 -0500
--- foo/gram/jobmanager/setup/lsf/lsf.in 2007-06-12 15:44:09.000000000 -0500
***************
*** 395,403 ****
      # Verifying that the job is no longer there.
      # return code 255 = "Job <123> is not found"
!     # 5/09: On some systems, bjobs can return 0, but the job is *not found*.
!     # An additional check for this has been added below.
!     if (($exit_code == 255) || (($exit_code == 0) && ($_ =~ /is not found/)))
      {
          $self->log("bjobs rc is 255 == Job <123> is not found, running bhist");
          # The job was not found. It can also be that it queried
--- 395,401 ----
      # Verifying that the job is no longer there.
      # return code 255 = "Job <123> is not found"
!     if($exit_code == 255)
      {
          $self->log("bjobs rc is 255 == Job <123> is not found, running bhist");
          # The job was not found. It can also be that it queried