Bug 3775 - CAMPAIGN: Investigate job info from schedulers
Status: RESOLVED FIXED
Product: GRAM
Component: wsrf scheduler interface
Version: development
Platform: All All
Importance: P3 normal
Target Milestone: 4.2
Dependency: 4047
Reported: 2005-09-22 15:49
Modified: 2006-02-14 12:10


Attachments
scheduler-info.txt (3.61 KB, text/plain)
2005-10-27 11:32, Joe Bester
scheduler-info.txt (3.84 KB, text/plain)
2005-10-27 14:34, Joe Bester
conculsions.txt (2.65 KB, text/plain)
2005-12-21 16:38, Joe Bester



Description From 2005-09-22 15:49:07
Definition:

Investigate the common information available that would be valuable to publish about jobs.  The 
schedulers to investigate are: PBS, LSF, SGE, Condor, and fork.

Benefits:

Lessen or remove the need for a job wrapper like kickstart for GRAM jobs.
Provide a GRAM client with a simple means to locate a job on a service host in order to connect to it.

Description:

Here is the current set of information published about a WS GRAM job in GT 4.0:

serviceLevelAgreement:  e.g. RSL
state:                  The current state of the job.
fault:                  The fault (if generated) indicating the reason for failure of the job to complete.
localUserId:            The job owner's local user account name.
userSubject:            The GSI certificate DN of the job owner.
holding:                Indicates whether a hold has been placed on this job.
stdoutURL:              A GridFTP URL to the stdout file.
stderrURL:              A GridFTP URL to the stderr file.
credentialPath:         The path to the user proxy.
exitCode:               Simple 8-bit exit code.

I think some job placement information is definitely needed.  Publishing rusage-type information 
would be nice, but can it be obtained, and if so, how?  This investigation may also turn up other 
information about a job that would be useful to publish for clients.  For each scheduler: what 
information is typically available, what can be made available, and how is it obtained (e.g. 
command-line tool, the scheduler's log file, or ...)?

example: job placement / id info
-----------------------
    hostaddr="140.221.65.193"
    nodename="lucky0.mcs.anl.gov"
    username="smartin"
    Schedulerjobid="4188.lucky0.mcs.anl.gov"
    pid="23104"
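A minimal sketch of how a client might read placement information in the key="value" form shown above. The `JobPlacement` record and the `parse_placement` function are hypothetical names used only for illustration; they are not part of GRAM, and the field names simply mirror the example keys, lower-cased.

```python
import re
from dataclasses import dataclass

# Matches lines such as: nodename="lucky0.mcs.anl.gov"
_PAIR = re.compile(r'^\s*(\w+)\s*=\s*"([^"]*)"\s*$')


@dataclass
class JobPlacement:
    # Hypothetical record; fields mirror the example keys, lower-cased.
    hostaddr: str
    nodename: str
    username: str
    schedulerjobid: str
    pid: int


def parse_placement(text: str) -> JobPlacement:
    """Collect key="value" lines and build a JobPlacement from them."""
    fields = {}
    for line in text.splitlines():
        match = _PAIR.match(line)
        if match:
            # Keys are lower-cased so "Schedulerjobid" and "schedulerjobid" agree.
            fields[match.group(1).lower()] = match.group(2)
    return JobPlacement(
        hostaddr=fields["hostaddr"],
        nodename=fields["nodename"],
        username=fields["username"],
        schedulerjobid=fields["schedulerjobid"],
        pid=int(fields["pid"]),
    )
```

A missing key raises `KeyError`, which seems preferable here to silently publishing a partial record.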

example: Rusage type information
-----------------------
    resources_used.cput=00:00:00
    resources_used.mem=0kb
    resources_used.vmem=0kb
    resources_used.walltime=00:00:01
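To make the rusage example concrete, here is a rough sketch that normalizes `resources_used.*` lines into plain numbers. It assumes exactly the formats shown above (HH:MM:SS for times, a `kb` suffix for memory); other schedulers may report these values differently, and the function names are illustrative only.

```python
def parse_duration(value: str) -> int:
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_kb(value: str) -> int:
    """Convert a string such as '0kb' to an integer number of kilobytes."""
    if not value.lower().endswith("kb"):
        raise ValueError(f"unexpected memory unit in {value!r}")
    return int(value[:-2])


def parse_rusage(text: str) -> dict:
    """Extract resources_used.* entries; unrelated lines are ignored."""
    usage = {}
    prefix = "resources_used."
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        key, value = key.strip(), value.strip()
        if not sep or not key.startswith(prefix):
            continue
        name = key[len(prefix):]
        if name in ("cput", "walltime"):
            usage[name + "_seconds"] = parse_duration(value)
        elif name in ("mem", "vmem"):
            usage[name + "_kb"] = parse_kb(value)
    return usage
```

Applied to the four example lines, this yields zero for `cput_seconds`, `mem_kb`, and `vmem_kb`, and one for `walltime_seconds`.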

example: other
-----------------------
    exit code --> in order to detect termination signal
    priority="???"
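The point about detecting a termination signal can be illustrated with a small sketch. A bare 8-bit exit code does not by itself say whether a process exited normally or was killed by a signal. Python's `subprocess` module, for instance, reports a child killed by signal N on POSIX as a return code of -N. The `classify_returncode` function below is a hypothetical helper built on that convention; it is not how any particular scheduler reports this information.

```python
import signal


def classify_returncode(returncode: int) -> dict:
    """Split a subprocess-style return code into exit code and signal name.

    Follows the Python subprocess convention on POSIX: a negative value -N
    means the child was terminated by signal N.
    """
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number without a symbolic name on this platform.
            name = f"signal {signum}"
        return {"exit_code": None, "signal": name}
    # Normal exit: only the low 8 bits are meaningful as an exit code.
    return {"exit_code": returncode & 0xFF, "signal": None}
```

For example, `classify_returncode(subprocess.run(cmd).returncode)` would separate a job that exited with a nonzero status from one that was killed, which is the distinction a published exit code alone cannot express.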

Tasks:
    1) Gather info from schedulers
    2) Provide analysis/recommendations for subsequent campaigns
    3) Confirm analysis/recommendations with scheduler admins and developers.  For example, if we 
       decide to gather rusage info from an LSF log file, verify with people from Platform and 
       admins of LSF installations that the method is reasonable/approved.
------- Comment #1 From 2005-10-04 16:41:40 -------
*** Bug 3244 has been marked as a duplicate of this bug. ***
------- Comment #2 From 2005-10-27 11:32:49 -------
Created an attachment (id=728)
scheduler-info.txt

Attaching a file which contains information about what the supported schedulers
return with respect to exit codes and rusage data.
------- Comment #3 From 2005-10-27 14:34:18 -------
Created an attachment (id=729)
scheduler-info.txt

Added notes about job placement information.
------- Comment #4 From 2005-11-08 14:10:10 -------
Some ideas:
- the RSL (XML) used to submit the job
- a parsed form of some of the RSL (XML) parameters: project/account,
   node types/names/counts, walltime, program, reservation id, etc.
- submit, start, exit times
- resource manager job id
- user DN
- proxy expiration time
- list of hosts the job is running on

One other idea: how could a job choose to publish something arbitrary about itself?
A case in point is the ANL Viz Portal Paraview service.  When it launches on the back-end
resource, it needs to communicate the host and port number of the started service back to the
client.  Ideally, the client should be able to query GRAM for that information.
------- Comment #5 From 2005-12-21 16:38:40 -------
Created an attachment (id=794)
conculsions.txt

Added a conclusions document to this campaign