Bug 4981 - Hundreds of thousands of bogus entries in PBS server logs

Status: RESOLVED FIXED
Product: GRAM
Component: general
Version: unspecified
Hardware: PC
OS: All
Priority: P3
Severity: major
Reported: 2007-01-26 08:51
Modified: 2008-02-06 11:25


------- Description From 2007-01-26 08:51:53 -------
Whenever an OSG user runs jobs at Vanderbilt (nanohub and usatlas1 are active
right now), a Perl script gets run that apparently has a bug.  It ends up
querying up to 20 times per second for the status of (mostly non-existent)
jobs.  This puts literally hundreds of thousands of bogus entries in our PBS
server logs, similar to the following:

01/26/2007 08:50:39;0080;PBS_Server;Req;req_reject;Reject reply
code=15001(Unknown Job Id), aux=0, type=LocateJob, from nanohub@vmpg01

This causes our daily PBS server log file to swell from under 5 MB to ~250 MB
(or more), depending on how many grid jobs are run.

Here's the output of ps showing the offending jobs:

usatlas1 15198     1  0 08:16 ?        00:00:00 perl
/home/usatlas1/.globus/.gass_cache/local/md5/ec/2a68da2e447335a36279a59953201b/md5/2b/d23018e59fdaa350c5bd91d7dc79bd/data
--dest-url=https://gridui03.usatlas.bnl.gov:20047/tmp/condor_g_scratch.0xab9c360.23977/grid-monitor.vampire.accre.vanderbilt.edu:2119.18227/grid-monitor-job-status
usatlas1 20618 15198  1 08:25 ?        00:00:08 perl
/tmp/grid_manager_monitor_agent.usatlas1.15198.1000 --delete-self
--maxtime=3600s
nanohub  23795     1  5 08:27 ?        00:00:36 perl
/tmp/grid_manager_monitor_agent.nanohub.1495.1000 --delete-self --maxtime=3600s
root     11220  6502  0 08:38 pts/0    00:00:00 grep -i perl

Could someone please track down the exact cause of this bug and fix it? 
Thanks...
------- Comment #1 From 2007-01-29 14:32:51 -------
The PBS jobmanager has a bit of code that's meant to detect when qstat is
reporting that the job no longer exists - see the section of the poll()
function in $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm that
reads:
    $_ = (grep(/job_state/, `$qstat -f $job_id 2>/dev/null`))[0];
    # get the exit code of the qstat command.  for info search $CHILD_ERROR
    # in perlvar documentation.
    $exit_code = $? >> 8;

    # return code 153 = "Unknown Job Id".
    # verifying that the job is no longer there.
    if($exit_code == 153)
    {
        $self->log("qstat rc is 153 == Unknown Job ID == DONE");
        $state = Globus::GRAM::JobState::DONE;

It looks from your PBS logs like it is returning 15001.  Bitshift that right by
8 spots, and you get 150, not 153.  So perhaps your PBS has different error
codes, and you can fix this by adding a check for 150 instead of 153.  Then the
PBS jobmanager will assert that the job is DONE, and it will remove the stale
state files and stop polling for them.

Can you try that and let me know if it works?
------- Comment #2 From 2007-01-30 08:23:06 -------
I have made the following changes to pbs.pm:

    #if($exit_code == 153)
    if($exit_code == 150) # Modified by bugzilla.globus.org request - klb, 1/29
    {
        #$self->log("qstat rc is 153 == Unknown Job ID == DONE");
        $self->log("qstat rc is 150 == Unknown Job ID == DONE");

It has made no difference.  Today's PBS server log is already 81 MB in size. 
Do I need to bounce our grid gateway or anything like that?  Thanks...

Kevin
------- Comment #3 From 2007-01-30 17:48:28 -------
From the OSG mailing list, it was pointed out that 15001 >> 8 is 58, not 150. 
That will teach me to use a calculator right-shifter without checking whether
it was in decimal or hex.

Stu - looking at pbs_error.h in a torque 2.0 installation shows that 15001 is
the return code for "Unknown Job ID".  However, a torque mailing list asserts
"The *only* time you can assume a job is finished is when qstat of the job ID
returns with the exit code 153".  You can see it when you qstat for a random
job:
qstat 001
qstat: Unknown Job Id 001.tg-master.uc.teragrid.org
bacon@tg-login1:/soft/torque-2.0.0p10-r1/include> echo $?
153

So I'm not clear on when 58 or 153 is correct.  The experimental evidence
suggests that a check for 58 may also be required, but I'd suggest backing
that up with some theoretical understanding.
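
For reference, the shift arithmetic is easy to check from the command line (a
quick sanity check, not part of pbs.pm):

    perl -e 'print 15001 >> 8, "\n"'    # prints 58
    perl -e 'print 15001 & 0xFF, "\n"'  # prints 153 (the low byte)

Note that 153 is exactly the low byte of 15001 (15001 = 58*256 + 153), which
would be consistent with qstat's exit status being the PBS error number
truncated to 8 bits - though that is just an observation from the arithmetic,
not something verified against the torque source.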
------- Comment #4 From 2007-01-31 08:58:22 -------
I've modified pbs.pm here at Vanderbilt to check for both 153 and 58.  I'll let
everyone know how this works out.  Thanks...

Kevin
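
For reference, a minimal sketch of that combined check, following the pbs.pm
excerpt quoted in comment #1 (the exact surrounding code may differ between
versions):

    # return code 153 = "Unknown Job Id" under stock torque; some sites
    # apparently see 58 instead (see comments #1-#4 above).
    if($exit_code == 153 || $exit_code == 58)
    {
        $self->log("qstat rc is $exit_code == Unknown Job ID == DONE");
        $state = Globus::GRAM::JobState::DONE;
    }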
------- Comment #5 From 2007-05-10 09:22:28 -------
(In reply to comment #4)
Has your problem been resolved?
------- Comment #6 From 2007-05-10 09:28:51 -------
No...

/usr/spool/PBS/server_logs
root@vmpsched: grep "Reject reply code=15001" 20070510 | wc -l
   1816
/usr/spool/PBS/server_logs
root@vmpsched:

------- Comment #7 From 2007-05-10 10:24:31 -------
(In reply to comment #6)

I see in the list of bugs resolved in Globus 4.0.4 that this bug is resolved.
Did you test 4.0.4?
------- Comment #8 From 2007-05-10 10:33:56 -------
We're running 4.0.3 and no, I haven't tested 4.0.4 because unless I'm
mis-reading or misunderstanding the release notes for Globus 4.0.4, this bug is
listed under Section 9, which is entitled "Known Problems as of 4.0.4".

------- Comment #9 From 2007-05-10 11:08:47 -------
What about 4.1.1?
------- Comment #10 From 2007-05-10 11:12:04 -------
We do not install development releases on our production cluster.  I do not
have a test cluster for grid jobs.

Kevin

------- Comment #11 From 2007-05-11 02:43:13 -------
I asked about this problem on the torqueusers mailing list and got the
following response:
------------------------------------------------------------
Garrick Staples <garrick@clusterresources.com>     
to torqueusers
On Fri, May 11, 2007 at 10:23:32AM +0330, Mehdi Sheikhalishahi alleged:
> Hi,
> I submit the following simple RSL script via Globus to Torque.
> ----------------------------------------------------------------------------------------
> +
> ( &(resourceManagerContact="Server.eng4.shirazu.ac.ir/jobmanager-pbs")
>   (count=2)
>   (label="subjob 0")
>   (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
>                (LD_LIBRARY_PATH /usr/local/globus-4.0.3/lib/))
>   (directory="/home/grid/Torque/simple")
>   (executable="/bin/hostname")
>   (stdout=hostnameOutput)
>   (stderr=hostnameError)
> )
> --------------------------My Report--------------------------
> Job was successfully executed,
> --------------------------hostnameOutput------------------------------
> localhost009
> localhost008
> ----------------------------------------------------------------------------------------
> But there are some strange errors messages on Server logs. The following are
> strange errors:
>
> 05/11/2007 09:38:52;0080;PBS_Server;Req;req_reject;Reject reply
> code=15001(Unknown Job Id), aux=0, type=StatusJob, from
> grid@Server.eng4.shirazu.ac.ir
> 05/11/2007 09:38:52;0100;PBS_Server;Req;;Type AuthenticateUser request
> received from grid@Server.eng4.shirazu.ac.ir, sock=15
> 05/11/2007 09:38:52;0100;PBS_Server;Req;;Type LocateJob request received
> from grid@Server.eng4.shirazu.ac.ir, sock=14
> 05/11/2007 09:38:52;0080;PBS_Server;Req;req_reject;Reject reply
> code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> grid@Server.eng4.shirazu.ac.ir

This is normal for globus.  The globus job manager is looping over
'qstat $jobid' waiting for the job to end.  At some point, qstat returns
the "unknown job id" error and globus knows the job has exited.

------- Comment #12 From 2007-05-11 07:58:40 -------
Please excuse my bluntness, but that's just plain wrong.  We're not seeing one
reject reply code=15001 error for each valid grid job when it exits; we're
seeing hundreds of thousands of them, for jobs that exited days or weeks ago.
That's definitely not "normal"...

Kevin

------- Comment #13 From 2007-09-28 15:18:50 -------
This has come up for Steve Timm at FNAL as well.  On OSG troubleshooting, I had
a question that was left unanswered and that I'm still curious about.  It was:
"The only thing I can think of is that you need to observe when one of them is
happening, and find out the PBS jobid that's being queried for.  Then run qstat
yourself on that jobid and see what return code it is giving back.  That will
tell you what exit_code to also treat as "job over".  I'd like to hear what it
is, though, and what the corresponding condition is that returns that kind of
error, other than just "Unknown Job ID".

I guess one question is whether you guys submit PBS jobs so quickly that you
wrap around through job IDs, so that some other kind of error is being thrown,
like "you don't have permission to see the details of this job id" or
something?"

Any details for the currently affected systems?
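
A hypothetical diagnostic along those lines (the script below is illustrative
only; it just reproduces the way pbs.pm extracts qstat's exit status):

    #!/usr/bin/env perl
    # Print the exit code qstat returns for a given PBS job id,
    # extracted the same way pbs.pm does it ($? >> 8).
    use strict;
    use warnings;

    my $job_id = shift @ARGV or die "usage: $0 <pbs_job_id>\n";
    my @output = `qstat -f $job_id 2>&1`;
    my $exit_code = $? >> 8;
    print "qstat exit code for $job_id: $exit_code\n";
    print @output;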
------- Comment #14 From 2007-10-01 13:42:50 -------
Steve Timm reports that the PBS jobmanager is successfully catching the DONE
status of the jobs (the 15001 entries in the server logs notwithstanding, the
qstat command still sees 153), but that the issue may be on the
gridmanager/gahp side of things.  We have a Condor ticket open and are
investigating.
------- Comment #15 From 2007-10-26 14:27:59 -------
We confirmed that the issue was the gridmonitor constantly touching the
defunct files, since it wasn't sure which ones were relevant.  Jaime Frey has
created a patch to the gridmonitor that caches the results for job ids that
report DONE status and stops touching/polling them.  I do not know what
version of Condor-G the updated gridmonitor appears in, if any.
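
The caching idea described there might look something like the following
sketch (hypothetical names; the actual Condor-G gridmonitor patch may be
structured differently):

    # Remember job ids that have already reported DONE so we stop
    # polling PBS for them on later passes.
    my %done_jobs;

    for my $job_id (@jobs_to_poll) {
        next if $done_jobs{$job_id};      # cached as DONE; skip the qstat
        my $state = poll_job($job_id);    # hypothetical wrapper around qstat
        $done_jobs{$job_id} = 1 if $state eq 'DONE';
    }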
------- Comment #16 From 2008-02-06 11:25:57 -------
Even though this was fixed in Condor-G's gridmonitor, I'm marking it as FIXED
here instead of INVALID.  FIXED bugs are added to the release notes and get
more light of day.