Bug 4865 - Events from SEG to JobStateMonitor are deleted too early i.e. jobs keep stuck
Status: CLOSED FIXED
Product: GRAM
Component: wsrf managed execution job service
Version: unspecified
Platform: PC Linux
Importance: P3 normal
Target Milestone: 4.0.4
: 4664
 
Reported: 2006-11-21 11:23 by
Modified: 2007-06-25 19:04 (History)


Attachments
diagram that illustrates the architecture (65.38 KB, application/pdf)
2006-11-21 11:25, Martin Feller
new registration model (38.01 KB, application/pdf)
2006-11-28 03:28, Martin Feller




Description From 2006-11-21 11:23:55
Note 0: I'll attach a diagram that gives an overview of the scenario and
        may help in understanding what I describe below.
Note 1: "Notification" is not meant in the WSRF context here.
Note 2: The following may not be totally exact, but shall illustrate
        what happens. Also I'm talking about a Condor pool instead
        of a more abstract "LocalResourceManager".
Note 3: GREEN, BLUE, RED refer to the colors in the attached document

What happens:
#############
Under the current circumstances, notifications about job states from the
SchedulerEventGenerator to the JobStateMonitor are sometimes lost.
This causes some jobs to remain stuck in the container.

A RunThread submits a job to the Condor pool, which writes the status of the
job to its logfile. The SchedulerEventGenerator monitors that logfile
and tells the JobStateMonitor about changes in the job's state in the Condor
pool (BLUE).

After a job is sent to the Condor pool, the job resource is registered
at the JobStateMonitor via the JobStateMonitorSubscriptionManager. The
registration at the JobStateMonitorSubscriptionManager is a non-blocking
action. The JobStateMonitorSubscriptionManager takes registration requests
by all jobs, queues them if necessary and registers them at the
JobStateMonitor on behalf of the job resources (GREEN).

It sometimes happens that events of a job are sent to the JobStateMonitor
from the SchedulerEventGenerator (BLUE) *before* the job resource has been
registered at the JobStateMonitor (GREEN). In that case, the events can't be
sent to the MEJobHome directly but are saved in the EventCache. When the
registration comes in, the cached events are delivered to the MEJobHome.

Additionally a flushCache Task periodically removes all cached events that
are older than N milliseconds (RED).

If a registration for job state changes (GREEN) takes place *after* all
events of a job are cached (BLUE) and if the events are cached for longer
than N milliseconds, the job resource will never be notified about job state
changes, since all events have been removed by the flushCache task (RED).
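
The sequence above can be reproduced in a few lines. The following is only a
minimal sketch and not the GRAM4 code: SketchEventCache and its methods are
hypothetical names, timestamps are passed in explicitly instead of being read
from a clock, and maxAgeMillis plays the role of N.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the EventCache described above; not the GRAM4 class.
class SketchEventCache {

    // One cached job state event plus the time it entered the cache.
    private static final class CachedEvent {
        final String state;
        final long cachedAtMillis;

        CachedEvent(String state, long cachedAtMillis) {
            this.state = state;
            this.cachedAtMillis = cachedAtMillis;
        }
    }

    private final Map<String, List<CachedEvent>> eventsByJob = new HashMap<>();

    // BLUE: an event arrives for a job that is not registered yet.
    synchronized void cache(String jobId, String state, long nowMillis) {
        eventsByJob.computeIfAbsent(jobId, k -> new ArrayList<>())
                   .add(new CachedEvent(state, nowMillis));
    }

    // RED: drop every event older than maxAgeMillis.
    synchronized void flush(long nowMillis, long maxAgeMillis) {
        Iterator<List<CachedEvent>> it = eventsByJob.values().iterator();
        while (it.hasNext()) {
            List<CachedEvent> events = it.next();
            events.removeIf(e -> nowMillis - e.cachedAtMillis > maxAgeMillis);
            if (events.isEmpty()) {
                it.remove();
            }
        }
    }

    // GREEN: the registration arrives; return the states still cached for the job.
    synchronized List<String> register(String jobId) {
        List<String> states = new ArrayList<>();
        List<CachedEvent> events = eventsByJob.remove(jobId);
        if (events != null) {
            for (CachedEvent e : events) {
                states.add(e.state);
            }
        }
        return states;
    }
}
```

If both events of a job are cached, a flush runs after they have aged past
maxAgeMillis, and only then register() is called, the returned list is empty.
That is the stuck-job case: the job resource never hears about "Done".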


Reason why this happens:
########################

* Short running jobs and good throughput from GRAM4 to the Condor-pool
  => many job state change events enter the JobStateMonitor at the same time.

* The JobStateMonitorSubscriptionManager queues incoming registration
  requests (RED). A single registration at the JobStateMonitor takes from
  about 0 to 20 seconds.
  This causes registration requests to pile up in the queue and explains
  why the registration is delayed: requests stay in the queue for
  up to 35 minutes before they get registered at the JobStateMonitor.


Some notes and questions:
#########################


1. Why is job submission (BLUE) so fast and registration so slow?
   The BLUE way should take much more time than the GREEN way.
   Is it that the container is so busy that threads don't get
   much execution time (see item 2)?
   I tried to improve performance of the JobStateMonitor and things became
   slightly better, but this just shifted the problem backwards and did
   not eliminate it.
   (I changed some of the synchronization and removed the DispatcherThread
   in JobStateMonitor, which starts a thread for each notification.)

2. I found that the registration is slow only until all jobs of a large
   job submission are submitted by Condor-G i.e. all jobs are either done
   or in the container.
   If all jobs are in the container then things speed up noticeably.
   I saw this correlation earlier: During the submission phase all
   tasks in the container are slower.
   What could cause this?
   * Do Container-Threads have higher priority?
   * Is the system busy with encryption and decryption of messages?
   ... 


Possible solutions:
###################

* go the GREEN way before the BLUE way and use blocking subscription
  instead of non-blocking.

* avoid the flushCache task. I don't know exactly whether this can be done,
  although the logfile which is used by the Condor pool contains only entries
  for jobs submitted by GRAM. AFAIK this is not the case with other
  local resource managers like LSF, PBS, ...
------- Comment #1 From 2006-11-21 11:25:21 -------
Created an attachment (id=1129) [details]
diagram that illustrates the architecture

------- Comment #2 From 2006-11-21 12:31:36 -------
I thought the reason I did the non-blocking (un)register was to avoid a
deadlock condition, but unfortunately I can't remember the details. It's
possible I'm wrong about that or that the condition doesn't exist anymore given
other code changes. I'd suggest trying the blocking register and unregister
idea and see if it causes problems for the large job run.
------- Comment #3 From 2006-11-22 15:34:15 -------
Jarek's patch from bug 4802 is relevant to this problem.  Martin tested out
Jarek's patch and it improved the time for JSM registration, but more
investigation and testing is needed.
------- Comment #4 From 2006-11-28 03:28:12 -------
Created an attachment (id=1132) [details]
new registration model

In the meantime I found the problem and made some changes so that
registrations at the JobStateMonitor no longer pile up but are completed
within a few milliseconds.

The main problem is persistence handling and the fact that the directory
that contains the persisted data was on an NFS-mounted partition.
Things got even worse since the access to that NFS-mounted partition on the
server became quite slow for whatever reason. This caused about 450
out of 1000 jobs to remain stuck and made job submission in general much
slower.

The registration at the JobStateMonitor was a blocking operation which
included the notifications to the MEJobHome in case there were already
some cached events for that resource in the EventCache. During this
notification a resource property is modified, which in turn causes the
resource to be persisted to disk again (which was slow).
Because of that, registration requests piled up in the
JobStateMonitorSubscriptionManager. Additionally, since the events
"Job is Active" and "Job is Done" come in so quickly one after another,
the processing of these two events may have blocked each other, too.
But the main factor is probably the persistence handling on the slow disk.

I made the registration non-blocking and added an EventDispatchQueue where
the JobStateMonitor queues events that must be dispatched to a
JobStateChangeListener. An EventDispatchThread works on that queue.
With a slow disk, events pile up here too of course, but no events
will get lost now.
I attached another diagram that illustrates the changes.
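
The queue-plus-worker idea can be sketched roughly as follows. This is an
illustration under assumptions and not the committed code: SketchListener and
SketchDispatchQueue are hypothetical names, and the real
JobStateChangeListener interface and the committed classes may look
different.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical listener interface; the real JobStateChangeListener may differ.
interface SketchListener {
    void jobStateChanged(String jobId, String state);
}

// Hypothetical queue with one worker thread that delivers events in order.
class SketchDispatchQueue {

    private static final class Dispatch {
        final SketchListener listener;
        final String jobId;
        final String state;

        Dispatch(SketchListener listener, String jobId, String state) {
            this.listener = listener;
            this.jobId = jobId;
            this.state = state;
        }
    }

    // Marker entry that tells the worker to stop after the earlier entries.
    private static final Dispatch STOP = new Dispatch(null, null, null);

    private final BlockingQueue<Dispatch> queue = new LinkedBlockingQueue<>();
    private Thread worker;

    synchronized void start() {
        if (worker == null) {
            worker = new Thread(this::drain, "sketch-event-dispatch");
            worker.setDaemon(true);
            worker.start();
        }
    }

    // Called by the monitor. Returns immediately, so a slow listener
    // (for example slow persistence) cannot stall the registration.
    void enqueue(SketchListener listener, String jobId, String state) {
        queue.add(new Dispatch(listener, jobId, state));
    }

    // Deliver everything queued so far, then stop the worker.
    // Returns false if the worker did not finish within the timeout.
    boolean stopAndWait(long timeoutMillis) {
        Thread t;
        synchronized (this) {
            t = worker;
        }
        if (t == null) {
            return false;
        }
        queue.add(STOP);
        try {
            t.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !t.isAlive();
    }

    private void drain() {
        try {
            while (true) {
                Dispatch d = queue.take();
                if (d == STOP) {
                    return;
                }
                try {
                    d.listener.jobStateChanged(d.jobId, d.state);
                } catch (RuntimeException e) {
                    // A failing listener must not kill the dispatch thread.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The queue in this sketch is unbounded, so with a slow listener the entries
pile up in memory instead of being dropped, which matches the trade-off
described above.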


ChangeLog:
##########
* JobStateMonitor: 
  1) I changed the construction of the SEG due to the warning in
     a book about Java threads:
     "Do not allow the 'this' reference to escape during construction"
  2) removal of the DispatcherThread
* ManagedJobFactoryResource:
  Because of the new creation of the JSM the resource had to be
  adapted too.
* SchedulerEventGenerator: new Constructor
* 2 new classes in the org.globus.exec.monitoring: 
    - EventDispatcherQueue
    - EventDispatcherThread
* removed JobStateMonitorSubscriptionManager and added the two operations
  subscribe() and unsubscribe() to the StateMachine directly.
  MAX_REGISTER_ATTEMPTS was removed. If a registration at the JobStateMonitor
  fails one time (because it may already be registered) the job fails. 
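
For readers unfamiliar with the 'this' escape warning quoted in the
ChangeLog, here is a minimal sketch of the usual remedy: the constructor
only initializes fields, and a factory method hands the finished object to
the event source. SketchEventSource and SafeMonitor are hypothetical names
and do not reflect the actual JobStateMonitor or SchedulerEventGenerator
signatures.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event source; stands in for something like the SEG.
class SketchEventSource {
    private final List<SafeMonitor> listeners = new ArrayList<>();

    synchronized void addListener(SafeMonitor monitor) {
        listeners.add(monitor);
    }

    synchronized void emit(String state) {
        for (SafeMonitor monitor : listeners) {
            monitor.onEvent(state);
        }
    }
}

// Hypothetical monitor that never publishes 'this' from its constructor.
class SafeMonitor {
    private final List<String> received = new ArrayList<>();

    // Only field initialization happens here. Registering 'this' with the
    // event source inside the constructor would let another thread call
    // onEvent() on a partially constructed object.
    private SafeMonitor() {
    }

    // The object is fully constructed before anyone else can see it.
    static SafeMonitor create(SketchEventSource source) {
        SafeMonitor monitor = new SafeMonitor();
        source.addListener(monitor);
        return monitor;
    }

    synchronized void onEvent(String state) {
        received.add(state);
    }

    synchronized List<String> received() {
        return new ArrayList<>(received);
    }
}
```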

Peter looked at the changes and they are committed to globus_4_0_branch and
globus_4_0_community.

Future Work:
############
We should think about applying Jarek's GRAM4 patch at
http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4802
since it reduces the time spent in the parts of GRAM4 that caused the problem
described in this bug.
The biggest factor here is the reduced number of persistence operations.
AFAIK GRAM4 in Trunk stores persistence data in a database. I assume
this will improve the persistence situation, too.
------- Comment #5 From 2006-11-29 10:41:10 -------
This fix has been committed to the "globus_4_0_branch" and
"globus_4_0_community" branch.  It is included in the large GRAM patch that
will be included in VDT 1.6.

This fix will be included in GT 4.0.4, setting milestone accordingly.