| Summary: | Bad invalid script response error | ||
|---|---|---|---|
| Product: | GRAM | Reporter: | Mats Rynge <rynge@isi.edu> |
| Component: | wsrf managed execution job service | Assignee: | Mats Rynge <rynge@isi.edu> |
| Status: | RESOLVED FIXED | ||
| Severity: | normal | CC: | alain@isi.edu, bester@mcs.anl.gov, lane@mcs.anl.gov, madduri@mcs.anl.gov, smartin@mcs.anl.gov |
| Priority: | P2 | ||
| Version: | 3.9.4 | ||
| Target Milestone: | 3.9.5 | ||
| Hardware: | PC | ||
| OS: | Linux | ||
| Bug Depends on: | |||
| Bug Blocks: | 2620 | ||
Thanks Mats!
I think this is an important issue we ought to fix soon (before 3.9.5?)
so we don't get people asking us what's wrong whenever a job fails!
Issues:
1) Saying that the script failed leaks an implementation detail to the client/user.
This message should only appear in the log.
2) As Mats points out, we must be able to tell the user/client side,
within the error XML structure, whether what failed was:
a) the submission (i.e. enqueuing the job description to the scheduler
with whatever parameters were translated from RSL).
b) the job application itself (i.e. the "/bin/echo Hello" etc.)
If a), then we must provide a description of which parameter of the job
description, or which scheduler policy, was violated.
For instance:
- "the user is not allowed to submit to queue xyz"
- "max wall time supplied for this job goes beyond possible value range"
As the submission command spits back the submission error(s) right
away, we should parse its stderr in order to extract the failure information,
or, if that is too complex for now (it needs dedicated processing for each
scheduler/submission command), we could at least put the raw error message
in the complex XML structure sent back inside the SOAP error.
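A minimal sketch of the two ideas above: distinguishing a submission failure from an application failure, and always passing the scheduler's stderr through even when it cannot be parsed. It is written in Python purely for illustration (the real submit routine is Perl). `FailureKind`, `classify_submit_error`, `fault_xml`, the pattern table, and the XML element names are all hypothetical and not part of GRAM; a real implementation would need one pattern table per scheduler.

```python
import re
import xml.etree.ElementTree as ET
from enum import Enum


class FailureKind(Enum):
    SUBMISSION = "submission"    # scheduler rejected the job description
    APPLICATION = "application"  # the job ran and the executable itself failed


# Hypothetical mapping from scheduler stderr patterns to the job-description
# parameter that was probably at fault. The patterns are illustrative only.
PBS_PATTERNS = [
    (re.compile(r"Unknown queue", re.IGNORECASE), "queue"),
    (re.compile(r"walltime", re.IGNORECASE), "maxWallTime"),
]


def classify_submit_error(stderr_text):
    """Return the parameter blamed by the stderr text, or None if unknown."""
    for pattern, parameter in PBS_PATTERNS:
        if pattern.search(stderr_text):
            return parameter
    return None


def fault_xml(kind, stderr_text):
    """Build an illustrative fault element that always carries raw stderr."""
    fault = ET.Element("fault", {"kind": kind.value})
    parameter = None
    if kind is FailureKind.SUBMISSION:
        parameter = classify_submit_error(stderr_text)
    if parameter is not None:
        ET.SubElement(fault, "parameter").text = parameter
    # Fallback: include the scheduler message verbatim even when unparsed.
    ET.SubElement(fault, "schedulerMessage").text = stderr_text.strip()
    return ET.tostring(fault, encoding="unicode")


print(fault_xml(FailureKind.SUBMISSION, "qsub: Unknown queue\n"))
```

With this shape a client can branch on the failure kind first and only then look at the parameter, while the verbatim scheduler message remains available for display when no pattern matches.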
I agree, this should be fixed for 3.9.5. The Perl submit routine traps the output from the scheduler submission attempts, so I think it is just a matter of figuring out how this error detail is passed along in an exception that can then be output by globusrun-ws.
I thought the code was already fetching the script errors. It's possible that the script crashed and left no output. I'll double check that the code is doing the right thing with the stderr, though.
Testing using an invalid queue name with PBS:
<job>
<executable>/bin/hostname</executable>
<directory>/tmp</directory>
<argument>--fqdn</argument>
<stdout>${GLOBUS_USER_HOME}/job.out</stdout>
<stderr>${GLOBUS_USER_HOME}/job.err</stderr>
<queue>non-existant</queue>
</job>
[rynge@devrandom tmp]$ globusrun-ws -submit -J -S -factory
https://viz-login.isi.edu:9000/wsrf/services/ManagedJobFactoryService
-factory-type PBS -job-description-file test.xml
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:ea88d150-6016-11d9-81bc-0008744f939a
Termination time: 01/07/2005 19:12 GMT
Current job state: Failed
Destroying job...Done.
Cleaning up any delegated credentials...Done.
globusrun-ws: Job failed: The executable could not be started.
Using m-j-g gives the same error as the first comment in the bug report.
I added a statement to lib/perl/Globus/GRAM/JobManager/pbs.pm to copy the job
description file to /tmp. Using that description and qsub gives me the right error:
globus@viz-9:~$ qsub /tmp/bla.27692
qsub: Unknown queue
Using another cp statement for the error file, I made sure that the perl script
catches the right error:
globus@viz-9:~$ cat /tmp/err.27997
qsub: Unknown queue
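As a rough sketch of what trapping that error and handing it to the client could look like, the following Python example runs a submit command, captures its stderr, and raises an exception carrying the scheduler's own text. This is for illustration only and is not the Perl code in pbs.pm; `SubmissionError` and `submit` are hypothetical names, and a stand-in command that prints the PBS message replaces a real `qsub` so the example runs without a scheduler.

```python
import subprocess
import sys


class SubmissionError(Exception):
    """Hypothetical exception carrying the scheduler's own error text."""

    def __init__(self, returncode, stderr_text):
        # Cover the case where the submit command fails without any output.
        message = stderr_text.strip() or "submit command produced no output"
        super().__init__(message)
        self.returncode = returncode
        self.stderr_text = stderr_text


def submit(command):
    """Run a submit command; return its stdout or raise SubmissionError."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise SubmissionError(result.returncode, result.stderr)
    return result.stdout.strip()


# Stand-in for `qsub <script>`: writes the PBS message to stderr and fails.
fake_qsub = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('qsub: Unknown queue\\n'); sys.exit(1)",
]

try:
    submit(fake_qsub)
except SubmissionError as err:
    print(f"Job submission failed: {err}")
```

Keeping the exit code and the raw stderr on the exception lets the caller decide how much detail to show, instead of collapsing every failure into one generic message.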
I committed an untested fix to the trunk. Mats has a PBS installation that he is going to test with, so I am reassigning it to him for testing.
Mats, I'm just going to close this bug since it blocks #2620. Please reopen it if your tests still fail or mark it verified if they pass. Thanks.