Bugzilla – Bug 4865
Events from SEG to JobStateMonitor are deleted too early i.e. jobs keep stuck
Last modified: 2007-06-25 19:04:40
Note 0: I'll attach a diagram that gives an overview of the scenario and
may help to understand what I describe below.
Note 1: "Notification" is not meant in the WSRF context here.
Note 2: The following may not be totally exact, but is meant to illustrate
what happens. Also I'm talking about a Condor pool instead
of a more abstract "LocalResourceManager".
Note 3: GREEN, BLUE, RED refer to the colors in the attached document
Under the current circumstances it happens sometimes that notifications
about job states from the SchedulerEventGenerator to the JobStateMonitor
get lost. This causes some jobs to keep stuck in the container.
A RunThread submits a job to the Condor pool, which writes the status of the
job to its logfile. The SchedulerEventGenerator monitors that logfile
and tells the JobStateMonitor about changes of the job's state in the Condor pool.
After a job is sent to the Condor pool, the job resource is registered
at the JobStateMonitor via the JobStateMonitorSubscriptionManager. The
registration at the JobStateMonitorSubscriptionManager is a non-blocking
action. The JobStateMonitorSubscriptionManager takes registration requests
by all jobs, queues them if necessary and registers them at the
JobStateMonitor on behalf of the job resources (GREEN).
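To make this non-blocking GREEN path concrete, here is a minimal sketch of how
such a queued registration could look; the class and method names are invented
for illustration only and are not the actual GRAM4 code:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Illustrative sketch only: the caller just enqueues the job id and returns
    // immediately; a single worker thread performs the actual (possibly slow)
    // registration at the JobStateMonitor later on.
    class SubscriptionManagerSketch implements Runnable {

        private final BlockingQueue<String> pending = new LinkedBlockingQueue<String>();

        // Non-blocking from the job resource's point of view (GREEN).
        void register(String jobId) {
            pending.add(jobId);
        }

        // Worker loop: drains the queue and registers one job at a time.
        public void run() {
            try {
                while (true) {
                    String jobId = pending.take();
                    registerAtJobStateMonitor(jobId); // may take 0-20 seconds under load
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        private void registerAtJobStateMonitor(String jobId) {
            // stand-in for the real registration call
            System.out.println("registered " + jobId);
        }
    }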
It sometimes happens that events of a job are sent to the JobStateMonitor
from the SchedulerEventGenerator (BLUE) *before* the job resource had been
registered at the JobStateMonitor (GREEN). In that case, the events can't be
sent to the MEJobHome directly but are saved
in the EventCache. When the registration comes in, the cached events are
triggered to the MEJobHome.
Additionally a flushCache Task periodically removes all cached events that
are older than N milliseconds (RED).
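For illustration, a minimal sketch of this caching and flushing behaviour;
names like EventCacheSketch and MAX_AGE_MS are made up for the example and are
not the real code:

    import java.util.HashMap;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Map;

    // Events arriving before a job is registered are cached with a timestamp;
    // a periodic flush task drops everything older than N milliseconds (RED).
    class EventCacheSketch {

        static final long MAX_AGE_MS = 60000; // "N milliseconds", value chosen arbitrarily here

        static class CachedEvent {
            final String state;
            final long arrived = System.currentTimeMillis();
            CachedEvent(String state) { this.state = state; }
        }

        private final Map<String, List<CachedEvent>> cache =
            new HashMap<String, List<CachedEvent>>();

        // An event (BLUE) arrives for a job that is not yet registered: cache it.
        synchronized void cacheEvent(String jobId, String state) {
            List<CachedEvent> events = cache.get(jobId);
            if (events == null) {
                events = new LinkedList<CachedEvent>();
                cache.put(jobId, events);
            }
            events.add(new CachedEvent(state));
        }

        // The registration (GREEN) finally arrives: hand back the cached events for replay.
        synchronized List<CachedEvent> takeCachedEvents(String jobId) {
            return cache.remove(jobId);
        }

        // Periodic flushCache task (RED): anything older than MAX_AGE_MS is discarded,
        // so a job whose registration arrives later than that never sees those events.
        synchronized void flushOldEvents() {
            long now = System.currentTimeMillis();
            for (List<CachedEvent> events : cache.values()) {
                events.removeIf(e -> now - e.arrived > MAX_AGE_MS);
            }
            cache.values().removeIf(List::isEmpty);
        }
    }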
If a registration for job state changes (GREEN) takes place *after* all
events of a job are cached (BLUE) and if the events are cached for longer
than N milliseconds, the job resource will never be notified about job state
changes, since all events had been removed by the flushCache task (RED).
Reason why this happens:
* Short running jobs and good throughput from GRAM4 to the Condor-pool
=> many job state change events enter the JobStateMonitor at the same time.
* The JobStateMonitorSubscriptionManager queues incoming registration
requests (GREEN). The registration at the JobStateMonitor takes from
about 0 to 20 seconds.
This causes registration requests to pile up in the queue and explains
why the registration is delayed: requests accumulate and stay in the queue for
up to 35 minutes before they get registered at the JobStateMonitor.
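As a rough sanity check, assuming an average of about 10 seconds per
registration (the middle of the observed 0-20 second range), a backlog of
only ~200 queued requests ahead of a given job already means roughly
200 * 10 s = 2000 s, i.e. about 33 minutes of waiting, which matches the
delays of up to 35 minutes seen here; the exact numbers of course depend on
the load.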
Some notes and questions:
1. Why is job submission (BLUE) so fast and registration (GREEN) so slow?
The BLUE way should take much more time than the GREEN way.
Is it that the container is so busy that threads don't get
much execution time (see item 2)?
I tried to improve performance of the JobStateMonitor and things became
slightly better, but this just shifted the problem backwards and did
not eliminate it.
(What I did was change some synchronization and remove the DispatcherThread
in the JobStateMonitor, which starts a Thread for each event.)
2. I found that the registration is slow only until all jobs of a large
job submission have been submitted by Condor-G, i.e. all jobs are either done
or in the container.
If all jobs are in the container then things speed up noticeably.
I saw this correlation earlier: During the submission phase all
tasks in the container are slower.
What could cause this?
* Do Container-Threads have higher priority?
* Is the system busy with encryption and decryption of messages?
Some ideas to avoid the problem:
* Go the GREEN way before the BLUE way and use a blocking subscription
instead of a non-blocking one (a minimal sketch follows this list).
* Avoid the flushCache task. I don't know exactly if this can be done:
although the logfile used by the Condor pool contains only entries
for jobs submitted by GRAM, AFAIK this is not the case with other
local resource managers like LSF, PBS, ...
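For the first idea above, a minimal sketch of what "GREEN before BLUE with a
blocking subscription" could look like; interface and method names are
invented for illustration:

    // Registration is done as a blocking call before the job is handed to the
    // scheduler, so no event can ever arrive for a job that is not yet registered.
    class RegisterThenSubmitSketch {

        interface StateMonitor { void registerJob(String jobId) throws Exception; }
        interface Scheduler { void submit(String jobId); }

        static void runJob(String jobId, StateMonitor jsm, Scheduler condor) throws Exception {
            jsm.registerJob(jobId); // GREEN: blocks until the listener is in place
            condor.submit(jobId);   // BLUE: any event generated from now on finds a registered job
        }
    }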
Created an attachment (id=1129) [details]
diagram that illustrates the architecture
I thought the reason I did the non-blocking (un)register was to avoid a
deadlock condition, but unfortunately I can't remember the details. It's
possible I'm wrong about that or that the condition doesn't exist anymore given
other code changes. I'd suggest trying the blocking register and unregister
idea and see if it causes problems for the large job run.
Jarek's patch from bug 4802 is relevant to this problem. Martin tested out
Jarek's patch and it improved the time for JSM registration, but more
investigation and testing is needed.
Created an attachment (id=1132) [details]
new registration model
I found the problem in the meantime and changed some things so that
registrations to the JobStateMonitor no longer pile up but are registered
within a few milliseconds.
The main problem is persistence handling and the fact that the directory
that contains the persisted data was on an NFS-mounted partition.
Things got even worse since access to that NFS-mounted partition on the
server became quite slow for whatever reason. This caused as many as about 450
out of 1000 jobs to get stuck and made job submission in general much slower.
The registration at the JobStateMonitor is a blocking operation which
includes the notifications to the MEJobHome in case there are already
some cached events for that resource in the EventCache. During this
notification a resource property is modified, which in turn causes the
resource to be persisted to disk again (which was slow).
Because of that, registration requests piled up in the
JobStateMonitorSubscriptionManager. Additionally, since the events
"Job is Active" and "Job is Done" come in so quickly one after another,
the processing of these two events may have blocked each other, too.
But the main factor is probably the persistence handling on the slow disk.
I made the registration non-blocking and added an EventDispatchQueue where
events that must be dispatched to a JobStateChangeListener are queued by the
JobStateMonitor. An EventDispatchThread works on that queue.
With a slow disk the queue can of course still grow here too, but no events
will get lost now.
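A rough sketch of this producer/consumer model (simplified names, not the
committed code): the JobStateMonitor only enqueues events, and a single
dispatch thread delivers them to the listener, so slow persistence can delay
delivery but no longer drops events.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class EventDispatchSketch {

        interface JobStateChangeListener { void jobStateChanged(String jobId, String state); }

        static class Event {
            final String jobId;
            final String state;
            Event(String jobId, String state) { this.jobId = jobId; this.state = state; }
        }

        private final BlockingQueue<Event> eventDispatchQueue = new LinkedBlockingQueue<Event>();
        private final JobStateChangeListener listener;

        EventDispatchSketch(JobStateChangeListener listener) {
            this.listener = listener;
        }

        // Started after construction (avoids letting 'this' escape from the constructor).
        void start() {
            Thread dispatcher = new Thread(new Runnable() {
                public void run() { dispatchLoop(); }
            }, "EventDispatchThread");
            dispatcher.setDaemon(true);
            dispatcher.start();
        }

        // Non-blocking: the JobStateMonitor just queues the event and returns.
        void enqueue(String jobId, String state) {
            eventDispatchQueue.add(new Event(jobId, state));
        }

        // Delivery (which may modify resource properties and hit the slow disk)
        // happens on the dispatch thread, decoupled from registration.
        private void dispatchLoop() {
            try {
                while (true) {
                    Event e = eventDispatchQueue.take();
                    listener.jobStateChanged(e.jobId, e.state);
                }
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        }
    }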
I attached another diagram that illustrates the changes.
1) I changed the construction of the SEG due to the warning in
a book about Java threads:
"Do not allow the 'this' reference to escape during construction"
(a small illustration follows this change list).
2) removal of the DispatcherThread
Because of the new creation of the JSM, the resource had to be changed as well.
* SchedulerEventGenerator: new Constructor
* 2 new classes in the org.globus.exec.monitoring:
* removed JobStateMonitorSubscriptionManager and added the two operations
subscribe() and unsubscribe() to the StateMachine directly.
MAX_REGISTER_ATTEMPTS was removed. If a registration at the JobStateMonitor
fails once (because it may already be registered), the job fails.
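To illustrate the warning quoted in change 1); the SEG classes here are only
placeholders, not the real code:

    // Starting a thread from inside a constructor lets 'this' escape before the
    // object is fully constructed; another thread may then see partially
    // initialized fields.
    class UnsafeSegExample {
        UnsafeSegExample() {
            new Thread(new Runnable() {
                public void run() { poll(); }   // 'this' has escaped here
            }).start();
        }
        void poll() { /* read the scheduler log ... */ }
    }

    // Safer pattern: construct first, start the polling thread afterwards.
    class SafeSegExample {
        private SafeSegExample() { }

        static SafeSegExample create() {
            final SafeSegExample seg = new SafeSegExample();
            new Thread(new Runnable() {
                public void run() { seg.poll(); }
            }).start();
            return seg;
        }
        void poll() { /* read the scheduler log ... */ }
    }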
Peter looked at the changes and they are committed to globus_4_0_branch and
globus_4_0_community.
We should think about applying Jarek's GRAM4 patch from bug 4802,
since it reduces time in parts of GRAM4 that caused the problem
described in this bug.
The biggest factor here is the reduced number of persistence operations.
AFAIK GRAM4 in Trunk stores persistence data in a database. I assume
this will improve the persistence situation, too.
This fix has been committed to the "globus_4_0_branch" and
"globus_4_0_community" branch. It is included in the large GRAM patch that
will be included in VDT 1.6.
This fix will be included in GT 4.0.4, setting milestone accordingly.