Bugzilla – Bug 4984
Change in behaviour in usage of client-generated job resource keys
Last modified: 2007-11-30 23:08:55
Title: Change in behaviour in usage of client-generated job resource keys
Technologies: Globus Resource Allocation Manager (GRAM)
A client of the MJFS can provide a self-generated UUID which is used as a
resource key of a job. This enables reliable job submission: A client can
resubmit a job with the same self-generated UUID in case it does not get a
response on the first job submission request e.g. due to network problems.
If the job had been submitted in the first request and is still running, it
will not be submitted again to avoid unnecessary and undesired duplicate
resource usage. Instead the EPR of the existing job will be returned to the
client.
This, however, requires that the MJFS can rely on the resource key provided by
the client really being a UUID. To avoid problems with "non-unique UUIDs", the
MJFS will in future generate and use its own UUID even if the client provides
one. The EPR returned to the client will include the server-side generated
resource key.
To enable reliable job submission as described above, the pair (server-side
generated UUID, client-side generated UUID) will be stored in a mapping in the
MEJ-Home. This mapping will then be used to check whether a job submission is
an initial job submission or a resubmission if a client provides a
self-generated UUID in its job submission request.
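The lookup described above could be sketched roughly as follows. This is only an illustration under assumptions: the class IdempotentSubmitSketch and its submit() method are hypothetical stand-ins, not the actual MJFS/MEJ-Home code, and the returned string stands in for the resource key carried in the EPR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-in for the job factory; not the actual GRAM classes.
class IdempotentSubmitSketch {
    // client-generated UUID -> server-side generated resource key
    private final Map<String, String> clientToServerKey = new HashMap<>();

    /**
     * Returns the server-side resource key of the job. A repeated call with
     * the same clientId returns the existing key instead of creating a new
     * job. A null clientId means the client did not provide a UUID.
     */
    synchronized String submit(String clientId) {
        if (clientId != null) {
            String existing = clientToServerKey.get(clientId);
            if (existing != null) {
                return existing; // resubmission: hand back the existing job
            }
        }
        // The resource key is always generated on the server side.
        String serverKey = UUID.randomUUID().toString();
        if (clientId != null) {
            clientToServerKey.put(clientId, serverKey);
        }
        return serverKey;
    }
}
```

With this shape, two submissions carrying the same client UUID yield the same resource key, while submissions without a client UUID always create a new job.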
Judging from the CVS logs, this was done sometime in May 2006, but no bug was
filed at that time. The change was made in HEAD and in the 4.0.community branch.
There are two open issues here:
1. The map that stores the pairs (server-side generated UUID, client-side
generated UUID) is not persisted => in case of a container restart this
information will be lost. Although one can argue that this will happen
we should change this.
2. When a job finishes (fails or runs to completion) the mapping for that job
(if the client provided a self-generated UUID at all) is not removed.
This can waste memory unnecessarily. Condor-G, for example, generates
UUIDs before job submission => 1000 jobs submitted by Condor-G can cause
1000 mappings in MEJH that will not be removed after the jobs have finished.
*** Bug 5363 has been marked as a duplicate of this bug. ***
Fixed 1 and 2:
Added static methods to MEJH, that work synchronized on the static
// add a new mapping; called in
// MEJH.create() after a new resource has been fully created and
// persisted to disk for the first time if
// idempotenceId is not null
// MEJH.recover() to recreate the mapping for a resource during
// recovery if idempotenceId is not null
// get a mapping for an idempotenceId; called in
// MEJH.create() to check if a client comes with the id of an already
// existing job resource
// remove an existing mapping; called in
// MEJR.remove() when a job resource is removed
During recovery the mapping of persisted resources is regenerated because
otherwise that idempotenceId is lost which may cause problems in the
probably very rare scenario when a resource has been created and for
whatever reason (network outage, etc) no EPR of that newly created resource
was returned to the client and the container was restarted.
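As a rough illustration of the add/get/remove operations and of the regeneration during recovery, here is a minimal sketch. All names in it (IdempotenceRegistrySketch, addMapping, getMapping, removeMapping, recover, size) are hypothetical and chosen for this example; they are not the method names used in MEJH. The persisted state is modelled simply as a map from resource key to idempotenceId, which is also an assumption.

```java
import java.util.Hashtable;
import java.util.Map;

// Illustrative sketch only; class and method names are hypothetical and do
// not reproduce the actual MEJH code.
class IdempotenceRegistrySketch {
    // idempotenceId (client-generated) -> resource key (server-generated)
    private static final Map<String, String> idempotenceIdMap = new Hashtable<>();

    // Called after a resource has been created and persisted, and on recovery.
    static void addMapping(String idempotenceId, String resourceKey) {
        synchronized (idempotenceIdMap) {
            idempotenceIdMap.put(idempotenceId, resourceKey);
        }
    }

    // Called on create to detect a resubmission; null means "no such job".
    static String getMapping(String idempotenceId) {
        synchronized (idempotenceIdMap) {
            return idempotenceIdMap.get(idempotenceId);
        }
    }

    // Called when a job resource is removed, so the map does not keep growing.
    static void removeMapping(String idempotenceId) {
        synchronized (idempotenceIdMap) {
            idempotenceIdMap.remove(idempotenceId);
        }
    }

    // Recovery: rebuild the in-memory map from the persisted resources.
    // 'persisted' maps resource key -> idempotenceId (null if none was given).
    static void recover(Map<String, String> persisted) {
        for (Map.Entry<String, String> e : persisted.entrySet()) {
            if (e.getValue() != null) {
                addMapping(e.getValue(), e.getKey());
            }
        }
    }

    // Number of mappings currently held in memory.
    static int size() {
        synchronized (idempotenceIdMap) {
            return idempotenceIdMap.size();
        }
    }
}
```

Because the map is rebuilt from the persisted resources, it does not need to be persisted separately, and removing the entry together with the resource addresses the growth described in issue 2.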
There's a bug with this feature in 4.0.5 when MultiJobs get removed:
During the processing of a MultiJob (MJ) the MJ is split up into
normal jobs, which are then submitted. For these subjobs no UUID is
generated in ManagedMultiJobResource.createSubJob(), so no mapping
is stored in the idempotenceIdMap of ManagedExecutableJobHome when the
job is submitted.
When the subjobs get removed later on there's no check in MEJResource.remove()
whether there is an idempotenceId or not.
This causes a problem when there is no idempotenceId: the null idempotenceId
is passed as a key to the idempotenceIdMap for removal without a null check,
and the Hashtable idempotenceIdMap throws a NullPointerException in this case.
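The failure can be reproduced in isolation with plain java.util.Hashtable, which rejects null keys. The class and method names below are made up for this demonstration; only the Hashtable behaviour is taken from the JDK.

```java
import java.util.Hashtable;

// Standalone demonstration; not part of the GRAM code base.
class HashtableNullKeySketch {
    // Returns true if removing a null key from a Hashtable throws an NPE.
    static boolean removeNullThrows() {
        Hashtable<String, String> idempotenceIdMap = new Hashtable<>();
        String idempotenceId = null; // e.g. a subjob created without a UUID
        try {
            idempotenceIdMap.remove(idempotenceId);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }
}
```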
Committed a fix to the 4.0 branch:
All operations of MEJH that work on idempotenceIdMap check
for null values now.
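A minimal sketch of what such null checks might look like, assuming a Hashtable-backed map as described above. The class NullSafeIdempotenceMapSketch and its method names are hypothetical and are not the code that was committed.

```java
import java.util.Hashtable;

// Hypothetical sketch of null-guarded map operations; names are illustrative.
class NullSafeIdempotenceMapSketch {
    private static final Hashtable<String, String> idempotenceIdMap = new Hashtable<>();

    // Ignores the call if either argument is null (Hashtable rejects both).
    static synchronized void add(String idempotenceId, String resourceKey) {
        if (idempotenceId == null || resourceKey == null) {
            return;
        }
        idempotenceIdMap.put(idempotenceId, resourceKey);
    }

    // Returns null for a null id instead of throwing.
    static synchronized String get(String idempotenceId) {
        return idempotenceId == null ? null : idempotenceIdMap.get(idempotenceId);
    }

    // Removing a null id is a no-op, so subjobs without an id are safe.
    static synchronized void remove(String idempotenceId) {
        if (idempotenceId != null) {
            idempotenceIdMap.remove(idempotenceId);
        }
    }
}
```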