Bugzilla – Bug 4984
Change in behaviour in usage of client-generated job resource keys
Last modified: 2007-11-30 23:08:55
Title: Change in behaviour in usage of client-generated job resource keys
Technologies: Globus Resource Allocation Manager (GRAM)
Definition: A client of the MJFS can provide a self-generated UUID, which is used as the resource key of a job. This enables reliable job submission: a client can resubmit a job with the same self-generated UUID if it gets no response to the first job submission request, e.g. due to network problems. If the job was submitted by the first request and is still running, it is not submitted again, which avoids unnecessary and undesired duplicate resource usage. Instead, the EPR of the existing job is returned to the client. This, however, requires that the MJFS can rely on the resource key provided by the client really being a UUID.
To avoid problems with "non-unique UUIDs", the MJFS will in future generate and use its own UUID even if the client provides one. The EPR returned to the client will include the server-side generated resource key. To keep job submission reliable as described above, the pair (server-side generated UUID, client-side generated UUID) will be stored in a mapping in the MEJ-Home. When a client provides a self-generated UUID in its job submission request, this mapping is used to check whether the request is an initial submission or a resubmission.
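The lookup described above can be sketched roughly as follows. This is only an illustration, not the actual MJFS or MEJ-Home code: the class name IdempotentSubmitSketch, the method submit() and the field clientToServerKey are hypothetical, and the job resource creation itself is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-in for the job factory; not the real MJFS/MEJ-Home code.
class IdempotentSubmitSketch {
    // client-side generated UUID -> server-side generated resource key
    private final Map<String, String> clientToServerKey = new HashMap<>();

    // Returns the server-side resource key that would be placed in the EPR.
    synchronized String submit(String clientUuid) {
        if (clientUuid != null) {
            String existing = clientToServerKey.get(clientUuid);
            if (existing != null) {
                // Resubmission: hand back the key of the existing job.
                return existing;
            }
        }
        // Initial submission: the server always generates its own key.
        String serverKey = UUID.randomUUID().toString();
        // ... the job resource would be created and persisted here ...
        if (clientUuid != null) {
            clientToServerKey.put(clientUuid, serverKey);
        }
        return serverKey;
    }
}
```

In this sketch a client that never supplies a UUID simply gets a fresh job on every request, so only clients that opt in get the resubmission check.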
Judging from the CVS logs, this was done sometime in May 2006, but no bug was filed at that time. The change was made in HEAD and in the 4.0.community branch. There are two open issues here:
1. The map that stores the pairs (server-side generated UUID, client-side generated UUID) is not persisted => in case of a container restart this information is lost. Although one can argue that this will happen rarely, we should change this.
2. When a job finishes (fails or runs to completion), the mapping for that job (if the client provided a self-generated UUID at all) is not removed. This wastes heap memory. Condor-G, for example, generates UUIDs before job submission => 1000 jobs submitted by Condor-G can cause 1000 mappings in MEJH that are not removed after the jobs have finished.
*** Bug 5363 has been marked as a duplicate of this bug. ***
Fixed 1 and 2: Added static methods to MEJH that operate, synchronized, on the static Hashtable idempotenceIdMap:
* addIdempotenceIdMapping()
  // add a new mapping; called in
  // MEJH.create() after a new resource has been fully created and
  //   persisted to disk for the first time, if idempotenceId is not null
  // MEJH.recover() to recreate the mapping for a resource during
  //   recovery, if idempotenceId is not null
* getIdempotenceIdMapping()
  // get the mapping for an idempotenceId; called in
  // MEJH.create() to check whether a client comes with the id of an
  //   already existing job resource
* removeIdempotenceIdMapping()
  // remove an existing mapping; called in
  // MEJR.remove() when a job resource is removed
During recovery the mappings of persisted resources are regenerated, because otherwise their idempotenceIds would be lost. That could cause problems in the probably very rare scenario where a resource was created, no EPR of the newly created resource was returned to the client for whatever reason (network outage, etc.), and the container was then restarted.
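A rough sketch of how such helpers and the recovery step might fit together is shown below. Only the three method names and the field name idempotenceIdMap come from the comment above; the class name, the parameter lists, the String key types and the shape of the persisted data passed to recover() are assumptions made for illustration.

```java
import java.util.Hashtable;
import java.util.Map;

// Illustrative only; signatures and types are assumed, not taken from MEJH.
class IdempotenceIdMapSketch {
    // idempotenceId (client-side UUID) -> resource key (server-side UUID)
    private static final Hashtable<String, String> idempotenceIdMap = new Hashtable<>();

    static void addIdempotenceIdMapping(String idempotenceId, String resourceKey) {
        synchronized (idempotenceIdMap) {
            idempotenceIdMap.put(idempotenceId, resourceKey);
        }
    }

    static String getIdempotenceIdMapping(String idempotenceId) {
        synchronized (idempotenceIdMap) {
            return idempotenceIdMap.get(idempotenceId);
        }
    }

    static void removeIdempotenceIdMapping(String idempotenceId) {
        synchronized (idempotenceIdMap) {
            idempotenceIdMap.remove(idempotenceId);
        }
    }

    // Assumed input: resource key -> idempotenceId as read back from the
    // persisted resources; the idempotenceId is null when the client
    // did not supply one.
    static void recover(Map<String, String> persistedResources) {
        for (Map.Entry<String, String> e : persistedResources.entrySet()) {
            if (e.getValue() != null) {
                addIdempotenceIdMapping(e.getValue(), e.getKey());
            }
        }
    }
}
```

Because the map is rebuilt from the persisted resources, it does not need a persistence format of its own; removing a resource also removes the only source from which its mapping could be recreated.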
There's a bug with this feature in 4.0.5 when MultiJobs get removed: while a MultiJob (MJ) is processed, it is split up into normal jobs, which are then submitted. For these subjobs no UUID is generated in ManagedMultiJobResource.createSubJob(), so no mapping is stored in the idempotenceIdMap of ManagedExecutableJobHome when the subjob is submitted. When the subjobs are removed later on, MEJResource.remove() does not check whether there is an idempotenceId. If there is none, the code tries to remove the key null from the idempotenceIdMap without a null check, and the Hashtable idempotenceIdMap throws a NullPointerException.
Committed a fix to the 4.0 branch: All operations of MEJH that work on idempotenceIdMap check for null values now.
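The failure and the kind of guard described here can be reproduced in isolation. The snippet below is a minimal sketch, not the committed fix: the class name and the helper unguardedRemoveThrows() are hypothetical, and only the behaviour of java.util.Hashtable (null keys are rejected with a NullPointerException) is taken as given.

```java
import java.util.Hashtable;

// Hypothetical demo class; not the code committed to the 4.0 branch.
class NullKeyGuardSketch {
    private static final Hashtable<String, String> idempotenceIdMap = new Hashtable<>();

    // Shows the original failure mode: Hashtable does not accept null keys.
    static boolean unguardedRemoveThrows(String idempotenceId) {
        try {
            idempotenceIdMap.remove(idempotenceId);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    // Guarded variant: a job without an idempotenceId has no mapping,
    // so there is nothing to remove.
    static void removeIdempotenceIdMapping(String idempotenceId) {
        if (idempotenceId == null) {
            return;
        }
        synchronized (idempotenceIdMap) {
            idempotenceIdMap.remove(idempotenceId);
        }
    }
}
```

Putting the check inside the map helpers rather than at each call site means a caller such as a subjob without an idempotenceId cannot trigger the exception again.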