Bugzilla – Bug 3775
CAMPAIGN: Investigate job info from schedulers
Last modified: 2006-02-14 12:10:22
Investigate what common job information is available and would be valuable to publish. The
schedulers to investigate are: PBS, LSF, SGE, Condor, fork.
Lessen or remove the need for a job wrapper like kickstart for gram jobs.
Provide a GRAM client with a simple means to locate a job on a service host in order to connect to it.
Here is the current set of information published about a WS GRAM job in GT 4.0:
serviceLevelAgreement: e.g. RSL
state: The current state of the job.
fault: The fault (if generated) indicating the reason for failure of the job to
localUserId: The job owner's local user account name.
userSubject: The GSI certificate DN of the job owner.
holding: Indicates whether a hold has been placed on this job.
stdoutURL: A GridFTP URL to the stdout file.
stderrURL: A GridFTP URL to the stderr file.
credentialPath: The path to the user proxy.
exitCode: A simple 8-bit exit code.
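For discussion, the published set above can be pictured as a single record. The following is only a sketch: the class name, the snake_case field names, the Python types and the choice of which fields are optional are assumptions made for illustration, not the actual WS GRAM resource property schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PublishedJobInfo:
    """Hypothetical record mirroring the properties listed above."""
    service_level_agreement: str            # e.g. the RSL
    state: str                              # current state of the job
    local_user_id: str                      # job owner's local account name
    user_subject: str                       # GSI certificate DN of the owner
    holding: bool = False                   # whether a hold has been placed
    fault: Optional[str] = None             # set only if a fault was generated
    stdout_url: Optional[str] = None        # GridFTP URL to the stdout file
    stderr_url: Optional[str] = None        # GridFTP URL to the stderr file
    credential_path: Optional[str] = None   # path to the user proxy
    exit_code: Optional[int] = None         # simple 8-bit exit code

    def __post_init__(self):
        # An 8-bit exit code can only hold values 0..255.
        if self.exit_code is not None and not 0 <= self.exit_code <= 255:
            raise ValueError("exit_code must fit in 8 bits")


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    job = PublishedJobInfo(
        service_level_agreement="<job/>",
        state="Active",
        local_user_id="alice",
        user_subject="/O=Grid/CN=Alice",
    )
    print(job)
```

Note that nothing in this record says where the job runs or what resources it consumed, which is the gap this campaign looks at.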
I think some job placement information is definitely needed. Publishing rusage-type information
would be nice, but can it be obtained, and if so, how? This investigation may also turn up other
information about a job that would be useful to publish for clients. For each scheduler: what
information is typically available, what can be made available, and how is it obtained (e.g.
command line tool or scheduler's log file or ...)?
example: job placement / id info
example: Rusage type information
exit code --> in order to detect termination signal
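As a point of reference for the rusage and exit code examples above, here is a minimal sketch of what a plain POSIX parent process can learn about a child it launched itself. This is not how GRAM or any of the schedulers gathers this data; the function name and the dictionary keys are made up for illustration. It shows why a bare 8-bit exit code is not enough: the wait status distinguishes a normal exit from termination by a signal, and that distinction is lost if only one small integer is published.

```python
import os
import signal
import sys


def run_and_describe(argv):
    """Run argv (argv[0] must be an absolute path) and describe how it ended."""
    pid = os.posix_spawn(argv[0], argv, os.environ)
    # wait4 returns the raw wait status plus the child's resource usage.
    _, status, rusage = os.wait4(pid, 0)

    info = {
        "exit_code": None,
        "term_signal": None,
        "user_cpu_seconds": rusage.ru_utime,
        "system_cpu_seconds": rusage.ru_stime,
        "max_rss": rusage.ru_maxrss,  # units are platform dependent
    }
    if os.WIFEXITED(status):
        # Normal termination: the low 8 bits passed to exit().
        info["exit_code"] = os.WEXITSTATUS(status)
    elif os.WIFSIGNALED(status):
        # Killed by a signal: there is no exit code at all.
        info["term_signal"] = signal.Signals(os.WTERMSIG(status)).name
    return info


if __name__ == "__main__":
    print(run_and_describe([sys.executable, "-c", "import sys; sys.exit(3)"]))
    print(run_and_describe(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]))
```

With a scheduler in between, the submitting service is not the parent of the job process, so the equivalent data has to come from whatever the scheduler records; that is the question to answer per scheduler.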
1) Gather info from schedulers
2) Provide analysis/recommendations for subsequent campaigns
3) Confirm analysis/recommendations with scheduler admins and developers. For example, if we
decide to gather rusage info from an LSF log file, verify with people from Platform and admins of LSF
installations that the method is reasonable/approved.
*** Bug 3244 has been marked as a duplicate of this bug. ***
Created an attachment (id=728)
Attaching a file which contains information about what the supported schedulers
return with respect to exit codes and rusage data.
Created an attachment (id=729)
Added notes about job placement information.
- the RSL (XML) used to submit the job
- a parsed form of some of the RSL (XML) parameters: project/account,
node types/names/counts, walltime, program, reservation id, etc.
- submit, start, exit times
- resource manager job id
- user DN
- proxy expiration time
- list of hosts the job is running on
One other idea: how could a job choose to publish something arbitrary about itself?
A case in point is the ANL Viz Portal Paraview service. When it launches on the back-end
resource, it needs to communicate the host and port number of the started service back to the
client. Ideally the client should be able to query GRAM for that information.
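To make the suggestion concrete, here is a rough sketch of a record holding the additional items listed above plus a free-form map for job-published values. Everything here is hypothetical: the class name, field names, types and the publish() helper are illustrative only and do not correspond to any existing GRAM interface.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class ProposedJobInfo:
    """Hypothetical record for the additional job information suggested above."""
    rsl_xml: str                                   # the RSL (XML) used to submit
    parsed_rsl: Dict[str, str] = field(default_factory=dict)  # project, walltime, ...
    submit_time: Optional[datetime] = None
    start_time: Optional[datetime] = None
    exit_time: Optional[datetime] = None
    resource_manager_job_id: Optional[str] = None
    user_dn: Optional[str] = None
    proxy_expiration: Optional[datetime] = None
    hosts: List[str] = field(default_factory=list)  # hosts the job is running on
    # Arbitrary key/value pairs the job chooses to publish about itself.
    published: Dict[str, str] = field(default_factory=dict)

    def publish(self, key: str, value: str) -> None:
        """Record a job-supplied value so that a client could query it later."""
        self.published[key] = value


if __name__ == "__main__":
    job = ProposedJobInfo(rsl_xml="<job/>")
    # Placeholder host and port, standing in for the Paraview case.
    job.publish("service_endpoint", "node17.example.org:11111")
    print(job.published)
```

The open question remains how the job itself would reach the service to set such a value; the sketch only shows the shape of the data a client would read.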
Created an attachment (id=794)
Added a conclusions document to this campaign.