Bug 2524 - Bad invalid script response error
Status: RESOLVED FIXED
Product: GRAM
Component: wsrf managed execution job service
Version: 3.9.4
Platform: PC Linux
Importance: P2 normal
Target Milestone: 3.9.5
Blocks: 2620
 
Reported: 2005-01-06 11:33
Modified: 2005-01-27 13:19




Description From 2005-01-06 11:33:57
Submission ID: uuid:ba911c50-6007-11d9-bf6b-8c05ad099d09
WAITING FOR JOB TO FINISH
========== State Notification ==========
Job State: Failed
========================================
Exit Code: 0
Fault:
fault type: org.globus.exec.generated.FaultType:
stateWhenFailureOccurred: Unsubmitted
command: submit
stackTrace:
org.globus.exec.generated.FaultType: The job manager detected an invalid script response
Timestamp: Thu Jan 06 09:24:12 PST 2005
Originator: Address:
https://128.9.64.179:9000/wsrf/services/ManagedJobFactoryService
Reference property[0]:
<ns1:ResourceID
xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job">ba911c50-6007-11d9-bf6b-8c05ad099d09</ns1:ResourceID>


	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:274)
        ....

gt2ErrorCode: 0
timestamp:
java.util.GregorianCalendar[time=1105032252553,areFieldsSet=true,areAllFieldsSet=true,lenient=true,zone=sun.util.calendar.ZoneInfo[id="GMT",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=1,YEAR=2005,MONTH=0,WEEK_OF_YEAR=2,WEEK_OF_MONTH=2,DAY_OF_MONTH=6,DAY_OF_YEAR=6,DAY_OF_WEEK=5,DAY_OF_WEEK_IN_MONTH=1,AM_PM=1,HOUR=5,HOUR_OF_DAY=17,MINUTE=24,SECOND=12,MILLISECOND=553,ZONE_OFFSET=0,DST_OFFSET=0]
originator: Address:
https://128.9.64.179:9000/wsrf/services/ManagedJobFactoryService
Reference property[0]:
<ns1:ResourceID
xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job">ba911c50-6007-11d9-bf6b-8c05ad099d09</ns1:ResourceID>

faultString: 
faultReason: 
description:
The job manager detected an invalid script response
Message:
null



All the important parts of the real error are missing. For the user, the most
important piece of information is what the scheduler said, so they can correct
their RSL. For example, if the submit failed because the user is not
allowed to submit to that queue, and the scheduler outputs an error saying so,
that error should be shown to the user.

Also, the timestamp is printed as a raw java.util.GregorianCalendar dump
instead of a readable date.
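
For reference, the time=1105032252553 field in the dump above is milliseconds since the Unix epoch. A minimal sketch of rendering it readably, written in Python purely for illustration (the service itself is Java, and `format_epoch_millis` is a hypothetical helper name, not part of GRAM):

```python
from datetime import datetime, timezone


def format_epoch_millis(millis: int) -> str:
    """Render milliseconds since the Unix epoch as an ISO 8601 UTC string."""
    seconds, ms = divmod(millis, 1000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + f".{ms:03d}Z"


# Value taken from the fault output in the description.
print(format_epoch_millis(1105032252553))  # 2005-01-06T17:24:12.553Z
```

The result agrees with the HOUR_OF_DAY=17, MINUTE=24, SECOND=12 fields in the same dump.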
------- Comment #1 From 2005-01-06 12:39:47 -------
Thanks Mats! 
 
I think this is an important issue we ought to fix soon (before 3.9.5?)
so we don't get people asking us what's wrong whenever a job fails!
 
Issues: 
 
1) Saying that the script failed leaks implementation details to the client/user.
   This message should only appear in a log.
 
2) As Mats points out, we must be able to tell the user/client side,
   within the error XML structure, whether what failed was:
   a) the submission (i.e. enqueuing the job description to the scheduler
      with whatever parameters were translated from RSL).
   b) the job application itself (i.e. the "/bin/echo Hello" etc.)

   If a), then we must provide a description of which parameter of the job
   description, or which scheduler policy, was violated.
   For instance:
     - "the user is not allowed to submit to queue xyz"
     - "max wall time supplied for this job goes beyond possible value range"
   As the submission command spits back the submission error(s) right
   away, we should parse its stderr in order to extract the failure information,
   or, if that is too complex for now (it involves dedicated processing depending
   on the scheduler/submission command), we could at least put the error message
   in the complex XML structure sent back inside the SOAP error.
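
A rough sketch of the fallback described in point 2, written in Python only for illustration: classify the failure as a submission or an application failure, and carry the scheduler's stderr verbatim in the fault. The `build_fault` helper and the field names (`phase`, `schedulerMessage`, `exitCode`) are assumptions made for this example, not the actual GRAM fault schema, and it assumes the submission command exits non-zero on failure.

```python
from typing import Optional


def build_fault(submit_exit_code: int, submit_stderr: str,
                job_exit_code: Optional[int] = None) -> Optional[dict]:
    """Return a fault description, or None if nothing failed."""
    if submit_exit_code != 0:
        # Case a): the scheduler rejected the submission. Pass its message
        # through unparsed; fall back to a generic text if it said nothing.
        message = submit_stderr.strip() or "submission failed with no scheduler output"
        return {"phase": "submission", "schedulerMessage": message}
    if job_exit_code not in (None, 0):
        # Case b): the job was accepted but the application itself failed.
        return {"phase": "application", "exitCode": job_exit_code}
    return None


print(build_fault(1, "qsub: Unknown queue\n"))
```

Passing the scheduler text through unparsed avoids per-scheduler parsing logic while still giving the user something to act on.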
 
 
------- Comment #2 From 2005-01-06 12:48:48 -------
I agree, this should be fixed for 3.9.5.  The Perl submit routine traps the
output from the scheduler submission attempts, so I think it is just a matter
of figuring out how this error detail is passed along in an exception that can
then be output by globusrun-ws.
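
One possible shape for passing the detail along, sketched in Python rather than in the Perl and Java the real code uses. `SubmitError` and `describe` are hypothetical names for illustration; this is not how globusrun-ws is implemented.

```python
class SubmitError(Exception):
    """Hypothetical exception that carries the scheduler's own output."""

    def __init__(self, summary: str, scheduler_output: str = ""):
        super().__init__(summary)
        self.scheduler_output = scheduler_output.strip()


def describe(err: SubmitError) -> str:
    """Format the error the way a client could print it."""
    if err.scheduler_output:
        return f"Job failed: {err} ({err.scheduler_output})"
    return f"Job failed: {err}"


try:
    raise SubmitError("job submission was rejected", "qsub: Unknown queue\n")
except SubmitError as e:
    print(describe(e))
```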
------- Comment #3 From 2005-01-06 12:53:45 -------
I thought the code was already fetching the script errors.  It's possible that
the script crashed and left no output.  I'll double check that the code is
doing the right thing with the stderr, though.
------- Comment #4 From 2005-01-06 13:22:04 -------
Testing using an invalid queue name with PBS:

<job>
  <executable>/bin/hostname</executable>
  <directory>/tmp</directory>
  <argument>--fqdn</argument>
  <stdout>${GLOBUS_USER_HOME}/job.out</stdout>
  <stderr>${GLOBUS_USER_HOME}/job.err</stderr>
  <queue>non-existant</queue>
</job>

[rynge@devrandom tmp]$ globusrun-ws -submit -J -S  -factory
https://viz-login.isi.edu:9000/wsrf/services/ManagedJobFactoryService
-factory-type PBS -job-description-file test.xml
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:ea88d150-6016-11d9-81bc-0008744f939a
Termination time: 01/07/2005 19:12 GMT
Current job state: Failed
Destroying job...Done.
Cleaning up any delegated credentials...Done.
globusrun-ws: Job failed: The executable could not be started.



Using m-j-g (managed-job-globusrun) gives the same error as the one in the
description of this bug report.

I added a statement to lib/perl/Globus/GRAM/JobManager/pbs.pm to copy the job
description file to /tmp. Submitting that copy directly with qsub gives me the
right error:

globus@viz-9:~$ qsub /tmp/bla.27692
qsub: Unknown queue


Using another cp statement for the error file, I made sure that the Perl script
catches the right error:

globus@viz-9:~$ cat /tmp/err.27997
qsub: Unknown queue
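
The same check can be sketched without touching pbs.pm: run the submission command and keep its exit code and stderr. The Python below is illustrative only; `run_and_capture` is a hypothetical helper, and a child Python process stands in for qsub (assumed to exit non-zero on failure) so the example runs without PBS installed.

```python
import subprocess
import sys


def run_and_capture(cmd: list) -> tuple:
    """Run a command and return (exit code, stripped stderr text)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr.strip()


# Stand-in for `qsub`: prints the message seen above to stderr and exits 1.
fake_qsub = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('qsub: Unknown queue\\n'); sys.exit(1)",
]

code, err = run_and_capture(fake_qsub)
print(code, err)
```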


------- Comment #5 From 2005-01-25 14:38:18 -------
I committed an untested fix to the trunk.  Mats has a PBS installation that he
is going to test with, so I am reassigning it to him for testing.
------- Comment #6 From 2005-01-27 13:19:38 -------
Mats,

I'm just going to close this bug since it blocks #2620.  Please reopen it if
your tests still fail or mark it verified if they pass.  Thanks.