Bugzilla – Bug 4981
Hundreds of thousands of bogus entries in PBS server logs
Last modified: 2008-02-06 11:25:57
You need to log in before you can comment on or make changes to this bug.
Whenever a OSG user runs jobs at Vanderbilt (nanohub and usatlas1 are active right now), a Perl script gets run which apparently has a bug. It ends up querying up to 20 times per second for the status of (mostly non-existent) jobs. This causes literally hundreds of thousands of bogus entries to be placed in our PBS server logs similar to the following: 01/26/2007 08:50:39;0080;PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from nanohub@vmpg01 This causes our daily PBS server log file to swell from under 5 MB in size to ~250 MB (or more) in size, depending on how many grid jobs are run. Here's the output of ps showing the offending jobs: usatlas1 15198 1 0 08:16 ? 00:00:00 perl /home/usatlas1/.globus/.gass_cache/local/md5/ec/2a68da2e447335a36279a59953201b/md5/2b/d23018e59fdaa350c5bd91d7dc79bd/data --dest-url=https://gridui03.usatlas.bnl.gov:20047/tmp/condor_g_scratch.0xab9c360.23977/grid-monitor.vampire.accre.vanderbilt.edu:2119.18227/grid-monitor-job-status usatlas1 20618 15198 1 08:25 ? 00:00:08 perl /tmp/grid_manager_monitor_agent.usatlas1.15198.1000 --delete-self --maxtime=3600s nanohub 23795 1 5 08:27 ? 00:00:36 perl /tmp/grid_manager_monitor_agent.nanohub.1495.1000 --delete-self --maxtime=3600s root 11220 6502 0 08:38 pts/0 00:00:00 grep -i perl Could someone please track down the exact cause of this bug and fix it? Thanks...
The PBS jobmanager has a little bit of code that's meant to detect when qstat is trying to tell it that the job no longer exists - see the section in the poll() function under $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm that reads: $_ = (grep(/job_state/, `$qstat -f $job_id 2>/dev/null`))[0]; # get the exit code of the qstat command. for info search $CHILD_ERROR # in perlvar documentation. $exit_code = $? >> 8; # return code 153 = "Unknown Job Id". # verifying that the job is no longer there. if($exit_code == 153) { $self->log("qstat rc is 153 == Unknown Job ID == DONE"); $state = Globus::GRAM::JobState::DONE; It looks from your PBS logs like it is returning 15001. Bitshift that right by 8 spots, and you get 150, not 153. So perhaps your PBS has different error codes, and you can fix this by adding a check for 150 instead of 153. Then the PBS jobmanager will assert that the job is DONE, and it will remove the stale state files and stop polling for them. Can you try that and let me know if it works?
I have made the following changes to pbs.pm: #if($exit_code == 153) if($exit_code == 150) # Modified by bugzilla.globus.org request - klb, 1/29 { #$self->log("qstat rc is 153 == Unknown Job ID == DONE"); $self->log("qstat rc is 150 == Unknown Job ID == DONE"); It has made no difference. Today's PBS server log is already 81 MB in size. Do I need to bounce our grid gateway or anything like that? Thanks... Kevin
From the OSG mailing list, it was pointed out that 15001 >> 8 is 58, not 150. That will teach me to use a calculator right-shifter without checking whether it was in decimal or hex. Stu - looking at pbs_error.h in a torque 2.0 installation shows that 15001 is the return code for "Unknown Job ID". However, a torque mailing list asserts "The *only* time you can assume a job is finished is when qstat of the job ID returns with the exit code 153". You can see it when you qstat for a random job: qstat 001 qstat: Unknown Job Id 001.tg-master.uc.teragrid.org bacon@tg-login1:/soft/torque-2.0.0p10-r1/include> echo $? 153 So I'm not clear on when 58 or 153 is correct. Sounds like experimental evidence suggests that a check for 58 may also be required, but I'd suggest backing that up with some kind of theory understanding.
I've modified pbs.pm here at Vanderbilt to check for both 153 and 58. I'll let everyone know how this works out. Thanks... Kevin
(In reply to comment #4) > I've modified pbs.pm here at Vanderbilt to check for both 153 and 58. I'll let > everyone know how this works out. Thanks... > > Kevin > Does your problem resolved?
No... /usr/spool/PBS/server_logs root@vmpsched: grep "Reject reply code=15001" 20070510 | wc -l 1816 /usr/spool/PBS/server_logs root@vmpsched: (In reply to comment #5) > (In reply to comment #4) > > I've modified pbs.pm here at Vanderbilt to check for both 153 and 58. I'll let > > everyone know how this works out. Thanks... > > > > Kevin > > > Does your problem resolved? >
(In reply to comment #6) > No... > > /usr/spool/PBS/server_logs > root@vmpsched: grep "Reject reply code=15001" 20070510 | wc -l > 1816 > /usr/spool/PBS/server_logs > root@vmpsched: > > (In reply to comment #5) > > (In reply to comment #4) > > > I've modified pbs.pm here at Vanderbilt to check for both 153 and 58. I'll let > > > everyone know how this works out. Thanks... > > > > > > Kevin > > > > > Does your problem resolved? > > > I see in the list of resolved bugs of Globus 4.0.4 that this bug resloved. Did you test 4.0.4?
We're running 4.0.3 and no, I haven't tested 4.0.4 because unless I'm mis-reading or misunderstanding the release notes for Globus 4.0.4, this bug is listed under Section 9, which is entitled "Known Problems as of 4.0.4". (In reply to comment #7) > > I see in the list of resolved bugs of Globus 4.0.4 that this bug resloved. > Did you test 4.0.4? >
What about 4.1.1? (In reply to comment #8) > We're running 4.0.3 and no, I haven't tested 4.0.4 because unless I'm > mis-reading or misunderstanding the release notes for Globus 4.0.4, this bug is > listed under Section 9, which is entitled "Known Problems as of 4.0.4". > > (In reply to comment #7) > > > > I see in the list of resolved bugs of Globus 4.0.4 that this bug resloved. > > Did you test 4.0.4? > > >
We do not install development releases on our production cluster. I do not have a test cluster for grid jobs. Kevin (In reply to comment #9) > What about 4.1.1? > (In reply to comment #8) > > We're running 4.0.3 and no, I haven't tested 4.0.4 because unless I'm > > mis-reading or misunderstanding the release notes for Globus 4.0.4, this bug is > > listed under Section 9, which is entitled "Known Problems as of 4.0.4". > > > > (In reply to comment #7) > > > > > > I see in the list of resolved bugs of Globus 4.0.4 that this bug resloved. > > > Did you test 4.0.4? > > > > > >
I ask from torqueusers mailing list about this problem and get the following response: ------------------------------------------------------------ Garrick Staples <garrick@clusterresources.com> to torqueusers On Fri, May 11, 2007 at 10:23:32AM +0330, Mehdi Sheikhalishahi alleged: > Hi, > I submit the following simple RSL script via Globus to Torque. > ---------------------------------------------------------------------------------------- > + > ( &(resourceManagerContact="Server.eng4.shirazu.ac.ir/jobmanager-pbs") > (count=2) > (label="subjob 0") > (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0) > (LD_LIBRARY_PATH /usr/local/globus-4.0.3/lib/)) > (directory="/home/grid/Torque/simple") > (executable="/bin/hostname") > (stdout=hostnameOutput) > (stderr=hostnameError) > ) > --------------------------My Report-------------------------- > Job was successfully executed, > --------------------------hostnameOutput------------------------------ > localhost009 > localhost008 > ---------------------------------------------------------------------------------------- > But there are some strange errors messages on Server logs. The following are > strange errors: > > 05/11/2007 09:38:52;0080;PBS_Server;Req;req_reject;Reject reply > code=15001(Unknown Job Id), aux=0, type=StatusJob, from > grid@Server.eng4.shirazu.ac.ir > 05/11/2007 09:38:52;0100;PBS_Server;Req;;Type AuthenticateUser request > received from grid@Server.eng4.shirazu.ac.ir, sock=15 > 05/11/2007 09:38:52;0100;PBS_Server;Req;;Type LocateJob request received > from grid@Server.eng4.shirazu.ac.ir, sock=14 > 05/11/2007 09:38:52;0080;PBS_Server;Req;req_reject;Reject reply > code=15001(Unknown Job Id), aux=0, type=LocateJob, from > grid@Server.eng4.shirazu.ac.ir This is normal for globus. The globus job manager is looping over 'qstat $jobid' waiting for the job to end. At some point, qstat returns the "unknown job id" error and globus knows the job has exited. (In reply to comment #10) > We do not install development releases on our production cluster. I do not > have a test cluster for grid jobs. > > Kevin > > (In reply to comment #9) > > What about 4.1.1? > > (In reply to comment #8) > > > We're running 4.0.3 and no, I haven't tested 4.0.4 because unless I'm > > > mis-reading or misunderstanding the release notes for Globus 4.0.4, this bug is > > > listed under Section 9, which is entitled "Known Problems as of 4.0.4". > > > > > > (In reply to comment #7) > > > > > > > > I see in the list of resolved bugs of Globus 4.0.4 that this bug resloved. > > > > Did you test 4.0.4? > > > > > > > > > >
Please excuse my bluntness, but that's just plain wrong. We're not seeing one reject reply code=15001 error for each valid grid job when it exits, we're seeing hundreds and thousands of them for jobs that exited days or weeks ago. That's definitely not "normal"... Kevin (In reply to comment #11) > I ask from torqueusers mailing list about this problem and get the following > response: > ------------------------------------------------------------ > Garrick Staples <garrick@clusterresources.com> > to torqueusers > On Fri, May 11, 2007 at 10:23:32AM +0330, Mehdi Sheikhalishahi alleged: > > Hi, > > I submit the following simple RSL script via Globus to Torque. > > ---------------------------------------------------------------------------------------- > > + > > ( &(resourceManagerContact="Server.eng4.shirazu.ac.ir/jobmanager-pbs") > > (count=2) > > (label="subjob 0") > > (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0) > > (LD_LIBRARY_PATH /usr/local/globus-4.0.3/lib/)) > > (directory="/home/grid/Torque/simple") > > (executable="/bin/hostname") > > (stdout=hostnameOutput) > > (stderr=hostnameError) > > ) > > --------------------------My Report-------------------------- > > Job was successfully executed, > > --------------------------hostnameOutput------------------------------ > > localhost009 > > localhost008 > > ---------------------------------------------------------------------------------------- > > But there are some strange errors messages on Server logs. The following are > > strange errors: > > > > 05/11/2007 09:38:52;0080;PBS_Server;Req;req_reject;Reject reply > > code=15001(Unknown Job Id), aux=0, type=StatusJob, from > > grid@Server.eng4.shirazu.ac.ir > > 05/11/2007 09:38:52;0100;PBS_Server;Req;;Type AuthenticateUser request > > received from grid@Server.eng4.shirazu.ac.ir, sock=15 > > 05/11/2007 09:38:52;0100;PBS_Server;Req;;Type LocateJob request received > > from grid@Server.eng4.shirazu.ac.ir, sock=14 > > 05/11/2007 09:38:52;0080;PBS_Server;Req;req_reject;Reject reply > > code=15001(Unknown Job Id), aux=0, type=LocateJob, from > > grid@Server.eng4.shirazu.ac.ir > > This is normal for globus. The globus job manager is looping over > 'qstat $jobid' waiting for the job to end. At some point, qstat returns > the "unknown job id" error and globus knows the job has exited. > > (In reply to comment #10) > > We do not install development releases on our production cluster. I do not > > have a test cluster for grid jobs. > > > > Kevin > > > > (In reply to comment #9) > > > What about 4.1.1? > > > (In reply to comment #8) > > > > We're running 4.0.3 and no, I haven't tested 4.0.4 because unless I'm > > > > mis-reading or misunderstanding the release notes for Globus 4.0.4, this bug is > > > > listed under Section 9, which is entitled "Known Problems as of 4.0.4". > > > > > > > > (In reply to comment #7) > > > > > > > > > > I see in the list of resolved bugs of Globus 4.0.4 that this bug resloved. > > > > > Did you test 4.0.4? > > > > > > > > > > > > > > >
This has come up for Steve Timm at FNAL as well. On OSG troubleshooting, I had a question that was left unanswered that I'm still curious about. It was: "The only thing I can think of is that you need to observe when one of them is happening, and find out the PBS jobid that's being queried for. Then run qstat yourself on that jobid and see what return code it is giving back. That will tell you what exit_code to also treat as "job over". I'd like to hear what it is, though, and what the corresponding condition is that returns that kind of error, other than just "Unknown Job ID". I guess one question is whether you guys submit PBS jobs so quickly that you wind up wrapping through job IDs so quickly that there is some kind of other error being thrown, like "you don't have permission to see the details of this job id" or something?" Any details for the currently affected systems?
Steve Timm reports that the PBS jobmanager is successfully catching the DONE status of the jobs (the 15001 on the server logs notwithstanding, the qstat command still sees 153) but that the issue may be on the gridmanager/gahp side of things. We have a condor ticket open and are investigating.
We confirmed that the issue was the gridmonitor constatly touching the defunct files since it wasn't sure which ones were relevant. Jaime Frey has created a patch to the gridmonitor that caches the results on jobids that report DONE status and stops touching/polling them. I do not know what version of condor-g the updated gridmonitor appears in, if any.
Even though this was a fix in Condor-G's gridmonitor, I'm marking it as FIXED here instead of INVALID. FIXED bugs are added to the release notes and get more light of day.