Bug 3185 - pbs job manager exit code
wsrf scheduler interface
: unspecified
: Other Linux
: P1 normal
: 4.0.2
Assigned To:
Reported: 2005-04-15 15:43 by
Modified: 2006-02-20 15:04 (History)

pbs-exit.diff (1.79 KB, patch)
2005-04-15 15:56, Joe Bester



Description From 2005-04-15 15:43:53
The exit code from PBS jobs is passed from the SEG to the GRAM service, but 
the scripts generated by the PBS perl module do not propagate the exit code for 
multiple jobs. 
Applying a technique similar to the one used in the lsf script will catch exit 
codes for multiple jobs which are not run on a cluster. For the cluster case, the 
exit code needs to be passed from the script run via rsh on the machines in 
the PBS node file back to the script started by PBS.
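
For illustration only, here is a rough sketch of how a generated script might catch exit codes for a multiple job that is not run on a cluster. This is not the attached patch or the lsf script's actual code; the function name run_multiple and its arguments are made up for this example. It starts each instance in the background, waits on each pid, and returns the first nonzero exit code.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PBS perl module):
# run COUNT copies of a command in the background and return the
# first nonzero exit code seen, or 0 if every instance succeeded.
run_multiple() {
    rm_count=$1
    shift
    rm_pids=""
    rm_i=0
    while [ "$rm_i" -lt "$rm_count" ]; do
        "$@" &
        rm_pids="$rm_pids $!"
        rm_i=$((rm_i + 1))
    done

    rm_rc=0
    for rm_pid in $rm_pids; do
        rm_status=0
        # wait with a pid operand reports that child's exit status
        wait "$rm_pid" || rm_status=$?
        if [ "$rm_rc" -eq 0 ] && [ "$rm_status" -ne 0 ]; then
            rm_rc=$rm_status
        fi
    done
    return "$rm_rc"
}
```

Something like `run_multiple 4 /path/to/executable arg1` followed by `exit $?` would then hand a failing instance's exit code back to PBS.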
------- Comment #1 From 2005-04-15 15:56:45 -------
Created an attachment (id=581) [details]

Here's a patch for the first bit (jobtype=multiple && !cluster). The cluster
case will probably involve updating the script run on the execution nodes to
write the instance exit codes to a file which the script started by PBS will
read to determine the exit code (as rsh does not propagate exit codes).
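
A rough sketch of that idea, for discussion only. The function names, the exit.N file naming, and the choice to treat a missing file as failure are all assumptions made for this example, not what the patch does. It also assumes the directory is on a file system visible to every node.

```shell
#!/bin/sh
# Hypothetical wrapper that would be started on each execution node
# (for example via rsh). It runs the command and records the exit
# code in a file, since rsh itself does not pass the code back.
run_and_record() {
    rr_file=$1
    shift
    rr_status=0
    "$@" || rr_status=$?
    echo "$rr_status" > "$rr_file"
}

# Hypothetical collector for the script started by PBS, called after
# all instances have finished. Reads DIR/exit.0 .. DIR/exit.(COUNT-1)
# and returns the first nonzero code, or 0 if all instances succeeded.
# Assumption: a missing or empty file counts as failure (code 1).
collect_exit_codes() {
    ce_dir=$1
    ce_count=$2
    ce_rc=0
    ce_i=0
    while [ "$ce_i" -lt "$ce_count" ]; do
        ce_file="$ce_dir/exit.$ce_i"
        if [ -s "$ce_file" ]; then
            ce_status=$(cat "$ce_file")
        else
            ce_status=1
        fi
        if [ "$ce_rc" -eq 0 ] && [ "$ce_status" -ne 0 ]; then
            ce_rc=$ce_status
        fi
        ce_i=$((ce_i + 1))
    done
    return "$ce_rc"
}
```

The script started by PBS would launch `run_and_record` on each node in the node file, wait for the rsh processes to return, then call `collect_exit_codes` and exit with its result.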
------- Comment #2 From 2005-05-18 11:40:29 -------
Please test exit codes with jobtype=mpi jobs too.
------- Comment #3 From 2006-02-17 15:58:14 -------
Exit code handling is in the 4.0 branch now. I've modified the GRAM scheduler
tests to check that it works as well. The script relies on a shared file system
for all compute nodes (but that seems to be the case for other parts of the
script now anyway).

------- Comment #4 From 2006-02-20 15:04:42 -------
The fix is merged into the trunk as well.