Bugzilla – Bug 2524
Bad invalid script response error
Last modified: 2005-01-27 13:19:38
Submission ID: uuid:ba911c50-6007-11d9-bf6b-8c05ad099d09
WAITING FOR JOB TO FINISH
========== State Notification ==========
Job State: Failed
========================================
Exit Code: 0
Fault:
fault type: org.globus.exec.generated.FaultType:
stateWhenFailureOccurred: Unsubmitted
command: submit
stackTrace:
org.globus.exec.generated.FaultType: The job manager detected an invalid script response
Timestamp: Thu Jan 06 09:24:12 PST 2005
Originator: Address: https://128.9.64.179:9000/wsrf/services/ManagedJobFactoryService
Reference property[0]:
<ns1:ResourceID xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job">ba911c50-6007-11d9-bf6b-8c05ad099d09</ns1:ResourceID>
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:274)
    ....
gt2ErrorCode: 0
timestamp: java.util.GregorianCalendar[time=1105032252553,areFieldsSet=true,areAllFieldsSet=true,lenient=true,zone=sun.util.calendar.ZoneInfo[id="GMT",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=1,YEAR=2005,MONTH=0,WEEK_OF_YEAR=2,WEEK_OF_MONTH=2,DAY_OF_MONTH=6,DAY_OF_YEAR=6,DAY_OF_WEEK=5,DAY_OF_WEEK_IN_MONTH=1,AM_PM=1,HOUR=5,HOUR_OF_DAY=17,MINUTE=24,SECOND=12,MILLISECOND=553,ZONE_OFFSET=0,DST_OFFSET=0]
originator: Address: https://128.9.64.179:9000/wsrf/services/ManagedJobFactoryService
Reference property[0]:
<ns1:ResourceID xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job">ba911c50-6007-11d9-bf6b-8c05ad099d09</ns1:ResourceID>
faultString:
faultReason:
description: The job manager detected an invalid script response
Message: null

All the important parts of the real error are missing.
For the user, the most important piece of information is what the scheduler said, so that they can correct their RSL. For example, if the submit failed because the user is not allowed to submit to that queue, and the scheduler outputs an error saying so, that error should be shown to the user. Also, the timestamp is in an ugly format.
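To illustrate the timestamp complaint, here is a minimal sketch of rendering the epoch-millisecond value from the fault dump (time=1105032252553) in a compact ISO 8601 form. The function name format_epoch_millis is hypothetical and is not part of GRAM or globusrun-ws; this only shows the kind of output a client might print instead of the raw GregorianCalendar dump.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical helper (not part of GRAM): convert milliseconds since the
# Unix epoch into an ISO 8601 UTC string with millisecond precision.
def format_epoch_millis(millis: int) -> str:
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Integer timedelta arithmetic avoids floating-point rounding.
    moment = epoch + timedelta(milliseconds=millis)
    return moment.isoformat(timespec="milliseconds")

# Value taken from the "timestamp:" field of the fault shown in this report.
print(format_epoch_millis(1105032252553))  # 2005-01-06T17:24:12.553+00:00
```

This agrees with the "Thu Jan 06 09:24:12 PST 2005" line in the same fault, since PST is eight hours behind GMT.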
Thanks Mats! I think this is an important issue that we ought to fix soon (before 3.9.5?) so that people don't ask us what's wrong whenever a job fails!

Issues:

1) Saying that the script failed amounts to leaking implementation details to the client/user. This message should only appear in a log.

2) As Mats points out, we must be able to tell the user/client side, within the error XML structure, whether what failed was:
   a) the submission (i.e. enqueuing the job description to the scheduler with whatever parameters were translated from RSL), or
   b) the job application itself (i.e. the "/bin/echo Hello" etc.).

If a), then we must provide a description of which parameter of the job description, or which scheduler policy, was violated. For instance:
- "the user is not allowed to submit to queue xyz"
- "max wall time supplied for this job goes beyond possible value range"

As the submission command spits back the submission error(s) right away, we should parse its stderr in order to extract the failure information. If that is too complex for now (it involves dedicated processing depending on the scheduler/submission command), we could at least put the error message in the complex XML structure sent back inside the SOAP error.
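A minimal sketch of the distinction described above, assuming a hypothetical JobFault record and fault_from_submit helper (neither exists in GRAM; the real fault is an XML structure carried in the SOAP error). It shows only the simpler fallback: mark the failure as a submission failure and carry the scheduler's stderr text verbatim, without any scheduler-specific parsing.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fault record; field names are illustrative only.
@dataclass
class JobFault:
    phase: str                               # "submission" or "execution"
    description: str                         # user-facing summary
    scheduler_message: Optional[str] = None  # raw text from the scheduler

def fault_from_submit(returncode: int, stderr_text: str) -> Optional[JobFault]:
    """Build a fault for a failed submit command, or None on success."""
    if returncode == 0:
        return None
    # Pass the scheduler's own words through; empty output becomes None.
    message = stderr_text.strip() or None
    return JobFault(
        phase="submission",
        description="The scheduler rejected the job submission",
        scheduler_message=message,
    )

fault = fault_from_submit(1, "qsub: Unknown queue\n")
print(fault.phase, "-", fault.scheduler_message)
```

Scheduler-specific parsing (mapping a message to the offending job description parameter) could later be layered on top of scheduler_message without changing what the client receives.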
I agree, this should be fixed for 3.9.5. The Perl submit routine traps the output from the scheduler submission attempts, so I think it is just a matter of figuring out how this error detail is passed along in an exception that can then be output by globusrun-ws.
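As a rough illustration of passing the trapped output along in an exception, here is a sketch in Python rather than the actual Perl/Java code path. SubmitError and submit are hypothetical names, and the fake_qsub command is a stand-in that merely imitates a scheduler rejecting a job; nothing here reflects the real job manager interfaces.

```python
import subprocess
import sys

class SubmitError(Exception):
    """Hypothetical exception that carries the scheduler's own error text."""

    def __init__(self, returncode: int, scheduler_output: str):
        super().__init__(
            f"job submission failed (exit {returncode}): {scheduler_output}"
        )
        self.returncode = returncode
        self.scheduler_output = scheduler_output

def submit(command: list) -> str:
    """Run a submit command; raise SubmitError with its stderr on failure."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise SubmitError(result.returncode, result.stderr.strip())
    return result.stdout.strip()

# Stand-in for a real scheduler command: writes an error and exits non-zero.
fake_qsub = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('qsub: Unknown queue\\n'); sys.exit(1)",
]

try:
    submit(fake_qsub)
except SubmitError as err:
    # The client can now show the scheduler's message, not a generic fault.
    print(err.scheduler_output)
```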
I thought the code was already fetching the script errors. It's possible that the script crashed and left no output. I'll double check that the code is doing the right thing with the stderr, though.
Testing using an invalid queue name with PBS:

<job>
    <executable>/bin/hostname</executable>
    <directory>/tmp</directory>
    <argument>--fqdn</argument>
    <stdout>${GLOBUS_USER_HOME}/job.out</stdout>
    <stderr>${GLOBUS_USER_HOME}/job.err</stderr>
    <queue>non-existant</queue>
</job>

[rynge@devrandom tmp]$ globusrun-ws -submit -J -S -factory https://viz-login.isi.edu:9000/wsrf/services/ManagedJobFactoryService -factory-type PBS -job-description-file test.xml
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:ea88d150-6016-11d9-81bc-0008744f939a
Termination time: 01/07/2005 19:12 GMT
Current job state: Failed
Destroying job...Done.
Cleaning up any delegated credentials...Done.
globusrun-ws: Job failed: The executable could not be started.

Using m-j-g gives the same error as the first comment in the bug report.

I added a statement to lib/perl/Globus/GRAM/JobManager/pbs.pm to copy the job description file to /tmp. Using that description and qsub gives me the right error:

globus@viz-9:~$ qsub /tmp/bla.27692
qsub: Unknown queue

Using another cp statement for the error file, I made sure that the perl script catches the right error:

globus@viz-9:~$ cat /tmp/err.27997
qsub: Unknown queue
I committed an untested fix to the trunk. Mats has a PBS installation that he is going to test with, so I am reassigning it to him for testing.
Mats, I'm just going to close this bug since it blocks #2620. Please reopen it if your tests still fail or mark it verified if they pass. Thanks.