Bug 4984 - Change in behaviour in usage of client-generated job resource keys
Status: CLOSED FIXED
Product: GRAM
Component: wsrf managed job factory service
Version: 4.0.3
Platform: PC Linux
Importance: P3 normal
Target Milestone: 4.0.6
Reported: 2007-01-29 03:06
Modified: 2007-11-30 23:08




Description From 2007-01-29 03:06:42
Title: Change in behaviour in usage of client-generated job resource keys

Technologies:   Globus Resource Allocation Manager (GRAM)

Definition:

A client of the Managed Job Factory Service (MJFS) can provide a
self-generated UUID which is used as the resource key of a job. This
enables reliable job submission: a client can resubmit a job with the
same self-generated UUID if it does not get a response to the first job
submission request, e.g. due to network problems. If the job was
submitted successfully in the first request and is still running, it
will not be submitted again, to avoid unnecessary and undesired
duplicate resource usage; instead, the EPR of the existing job will be
returned to the client.
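
As an illustration, a minimal client-side sketch of this retry pattern,
where submitJob() is a hypothetical stand-in for the real MJFS job
submission call (not the actual GRAM client API):

  import java.util.UUID;

  public class IdempotentSubmitSketch {

      // Hypothetical stand-in for the remote MJFS call; returns the EPR
      // of the (newly created or already existing) job resource.
      static String submitJob(String resourceKey, String jobDescription)
              throws java.io.IOException {
          return "epr-of-" + resourceKey; // remote call omitted
      }

      public static void main(String[] args) {
          // Generate the resource key once, before the first attempt.
          String key = UUID.randomUUID().toString();
          String epr = null;
          for (int attempt = 0; attempt < 3 && epr == null; attempt++) {
              try {
                  // Resubmitting with the same key is safe: if the first
                  // request already created the job, the service returns
                  // the EPR of the existing job instead of starting a
                  // duplicate.
                  epr = submitJob(key, "<job>...</job>");
              } catch (java.io.IOException lostResponse) {
                  // e.g. network problem: retry with the same key
              }
          }
          System.out.println("job EPR: " + epr);
      }
  }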

This however requires that the MJFS can rely on the fact that the
resource key provided by the client really is a UUID. To avoid problems
with "non-unique UUIDs", the MJFS will in the future generate and use
its own UUID even if the client provides one. The EPR returned to the
client will include the server-side generated resource key.
To enable reliable job submission as described above, the pair
(server-side generated UUID, client-side generated UUID) will be stored
in a mapping in the MEJ-Home. This mapping will then be used to check
whether a job submission is an initial submission or a resubmission
whenever a client provides a self-generated UUID in its job submission
request.
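
For illustration, a rough sketch of that create()-time check; the class
and helper names here are hypothetical, not the real MEJ-Home code:

  import java.util.Hashtable;
  import java.util.UUID;

  // Sketch only; all names are illustrative.
  public class CreateCheckSketch {

      // pairs a client-side UUID with the server-side resource key
      private static final Hashtable idempotenceIdMap = new Hashtable();

      static String create(String clientUuid, String jobDescription) {
          if (clientUuid != null) {
              String existing = (String) idempotenceIdMap.get(clientUuid);
              if (existing != null) {
                  // resubmission: return the key of the job that already
                  // exists, from which the client's EPR is built
                  return existing;
              }
          }
          // initial submission: generate a server-side key, create the
          // resource, then record the pair for later resubmission checks
          String serverKey = UUID.randomUUID().toString();
          // ... create and persist the job resource under serverKey ...
          if (clientUuid != null) {
              idempotenceIdMap.put(clientUuid, serverKey);
          }
          return serverKey;
      }
  }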
------- Comment #1 From 2007-01-29 03:32:52 -------
From looking at the CVS logs, this change was made sometime in May 2006,
but a bug was not filed at that time. The change was made in HEAD and in
the 4.0 community branch.

There are two open issues here:

1. The map that stores the pairs (server-side generated UUID, client-side
   generated UUID) is not persisted, so this information will be lost on
   a container restart. Although one can argue that this will happen
   rarely, we should change it.
2. When a job finishes (fails or runs to completion), the mapping for that
   job (if the client provided a self-generated UUID at all) is not
   removed. This can waste a considerable amount of memory: Condor-G, for
   example, generates UUIDs before job submission, so 1000 jobs submitted
   by Condor-G can leave 1000 mappings in the MEJH that are never removed
   after the jobs finish.
------- Comment #2 From 2007-06-27 07:30:01 -------
*** Bug 5363 has been marked as a duplicate of this bug. ***
------- Comment #3 From 2007-06-27 09:20:19 -------
Fixed 1 and 2:
  Added static methods to MEJH that work synchronized on the static
  Hashtable idempotenceIdMap (a sketch of these methods follows at the
  end of this comment):

   * addIdempotenceIdMapping()
      // add a new mapping; called in
      //   MEJH.create() after a new resource has been fully created and
      //                 persisted to disk for the first time if
      //                 idempotenceId is not null
      //   MEJH.recover() to recreate the mapping for a resource during
      //                  recovery if idempotenceId is not null

   * getIdempotenceIdMapping()
      // get a mapping for an idempotenceId; called in
      //   MEJH.create() to check if a client comes with the id of an already
      //                 existing job resource

   * removeIdempotenceIdMapping() 
      // remove an existing mapping; called in 
      //   MEJR.remove() when a job resource is removed 

  During recovery the mapping of persisted resources is regenerated;
  otherwise the idempotenceId would be lost, which could cause problems
  in the probably very rare scenario where a resource was created but,
  for whatever reason (network outage, etc.), no EPR of that newly
  created resource was returned to the client before the container was
  restarted.
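
For reference, a condensed, self-contained sketch of how these three
helpers might look; signatures and types are inferred from this comment,
not copied from the actual MEJH source:

  import java.util.Hashtable;

  public class MEJHSketch {

      // maps client-side idempotenceId -> server-side resource key
      private static final Hashtable idempotenceIdMap = new Hashtable();

      static void addIdempotenceIdMapping(String id, Object resourceKey) {
          synchronized (idempotenceIdMap) {
              idempotenceIdMap.put(id, resourceKey);
          }
      }

      static Object getIdempotenceIdMapping(String id) {
          synchronized (idempotenceIdMap) {
              return idempotenceIdMap.get(id);
          }
      }

      static void removeIdempotenceIdMapping(String id) {
          synchronized (idempotenceIdMap) {
              idempotenceIdMap.remove(id);
          }
      }
  }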
------- Comment #4 From 2007-07-16 11:24:07 -------
There's a bug in this feature in 4.0.5 when MultiJobs get removed:

During the processing of a MultiJob (MJ), the MJ gets split up into
normal jobs which are then submitted. For these subjobs no UUID is
generated in ManagedMultiJobResource.createSubJob(), so no mapping
is stored in the idempotenceIdMap of ManagedExecutableJobHome when the
job is submitted.
When the subjobs are removed later on, there is no check in
MEJResource.remove() whether an idempotenceId exists or not. This causes
a problem when there is no idempotenceId: an attempt is made to remove a
null key from the idempotenceIdMap, and since Hashtable does not accept
null keys, it throws a NullPointerException.
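
For clarity, a minimal standalone demonstration of the underlying
java.util.Hashtable behaviour (plain Java, not GRAM code):

  import java.util.Hashtable;

  public class NullKeyDemo {
      public static void main(String[] args) {
          Hashtable map = new Hashtable();
          map.put("some-uuid", "some-resource-key");
          // Hashtable permits neither null keys nor null values;
          // remove(null) calls key.hashCode() internally and
          // therefore throws a NullPointerException.
          map.remove(null);
      }
  }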
------- Comment #5 From 2007-07-16 11:35:54 -------
Committed a fix to the 4.0 branch:
All operations of MEJH that work on idempotenceIdMap now check for
null values.
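
A minimal sketch of what such a null guard could look like, shown for
the removeIdempotenceIdMapping() helper sketched after comment #3; the
actual committed code may differ:

  // Sketch only; assumes the static Hashtable idempotenceIdMap from the
  // earlier sketch. Subjobs of a MultiJob carry no idempotenceId, so a
  // null check must come before any Hashtable access.
  static void removeIdempotenceIdMapping(String idempotenceId) {
      if (idempotenceId == null) {
          return; // nothing was ever mapped for this resource
      }
      synchronized (idempotenceIdMap) {
          idempotenceIdMap.remove(idempotenceId);
      }
  }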