Bugzilla – Bug 4865
Events from SEG to JobStateMonitor are deleted too early i.e. jobs keep stuck
Last modified: 2007-06-25 19:04:40
Note 0: I'll attach a diagram that gives an overview of the scenario and may help in understanding what I describe below.
Note 1: "Notification" is not meant in the WSRF sense here.
Note 2: The following may not be totally exact, but it should illustrate what happens. Also, I'm talking about a Condor pool instead of a more abstract "LocalResourceManager".
Note 3: GREEN, BLUE, RED refer to the colors in the attached document.

What happens:
#############
Under the current circumstances, notifications about job states from the SchedulerEventGenerator to the JobStateMonitor sometimes get lost. This causes some jobs to stay stuck in the container.

A RunThread submits a job to the Condor pool, which writes the status of the job to its logfile. The SchedulerEventGenerator monitors that logfile and tells the JobStateMonitor about changes in the job's state in the Condor pool (BLUE).

After a job is sent to the Condor pool, the job resource is registered at the JobStateMonitor via the JobStateMonitorSubscriptionManager. The registration at the JobStateMonitorSubscriptionManager is a non-blocking action. The JobStateMonitorSubscriptionManager takes registration requests from all jobs, queues them if necessary, and registers them at the JobStateMonitor on behalf of the job resources (GREEN).

It sometimes happens that events of a job are sent to the JobStateMonitor from the SchedulerEventGenerator (BLUE) *before* the job resource has been registered at the JobStateMonitor (GREEN). In that case, the events can't be sent to the MEJobHome directly but are saved in the EventCache. When the registration comes in, the cached events are delivered to the MEJobHome. Additionally, a flushCache task periodically removes all cached events that are older than N milliseconds (RED).
If a registration for job state changes (GREEN) takes place *after* all events of a job are cached (BLUE), and the events have been cached for longer than N milliseconds, the job resource will never be notified about job state changes, since all events have been removed by the flushCache task (RED).

Reason why this happens:
########################
* Short-running jobs and good throughput from GRAM4 to the Condor pool => many job state change events enter the JobStateMonitor at the same time.
* The JobStateMonitorSubscriptionManager queues incoming registration requests (RED). A registration at the JobStateMonitor takes from about 0 to 20 seconds. This causes registration requests to pile up in the queue and explains why registration is delayed: requests stay in the queue for up to 35 minutes before they get registered at the JobStateMonitor.

Some notes and questions:
#########################
1. Why is job submission (BLUE) so fast and registration so slow? The BLUE way should take much more time than the GREEN way. Is the container so busy that threads don't get much execution time (see item 2)? I tried to improve the performance of the JobStateMonitor and things became slightly better, but this just shifted the problem and did not eliminate it. (What I did was change some synchronization and remove the DispatcherThread in the JobStateMonitor, which starts a thread for each notification.)
2. I found that registration is slow only until all jobs of a large job submission have been submitted by Condor-G, i.e. until all jobs are either done or in the container. Once all jobs are in the container, things speed up noticeably. I saw this correlation earlier: during the submission phase all tasks in the container are slower. What could cause this?
* Do Container-Threads have higher priority?
* Is the system busy with encryption and decryption of messages?
...
Possible solutions:
###################
* Go the GREEN way before the BLUE way and use blocking subscription instead of non-blocking.
* Avoid the flushCache task. I don't know exactly whether this can be done, although the logfile used by the Condor pool contains only entries for jobs submitted by GRAM. But AFAIK this is not the case with other local resource managers like LSF, PBS, ...
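
To make the lost-event scenario concrete, here is a minimal Java sketch of the interaction between the event path (BLUE), the registration path (GREEN) and the cache flush (RED). It is not the GRAM4 code: ToyJobStateMonitor, CachedEvent, FlushRaceSketch and the 1000 ms maximum age are hypothetical names and values chosen for illustration, and time is passed in as plain numbers so the outcome is deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a cached job state event (not a GRAM4 class). */
final class CachedEvent {
    final String state;
    final long cachedAtMillis;

    CachedEvent(String state, long cachedAtMillis) {
        this.state = state;
        this.cachedAtMillis = cachedAtMillis;
    }
}

/** Toy monitor: caches events for unregistered jobs and flushes old entries. */
final class ToyJobStateMonitor {
    private final long maxAgeMillis;
    private final Map<String, List<CachedEvent>> cache = new HashMap<>();
    private final Map<String, List<String>> listeners = new HashMap<>();

    ToyJobStateMonitor(long maxAgeMillis) {
        this.maxAgeMillis = maxAgeMillis;
    }

    /** BLUE: an event arrives from the scheduler side. */
    void onEvent(String jobId, String state, long now) {
        List<String> listener = listeners.get(jobId);
        if (listener != null) {
            listener.add(state);            // job already registered: deliver directly
        } else {
            cache.computeIfAbsent(jobId, k -> new ArrayList<>())
                 .add(new CachedEvent(state, now)); // not registered yet: cache it
        }
    }

    /** GREEN: registration; replays whatever is still cached for the job. */
    List<String> register(String jobId) {
        List<String> listener = new ArrayList<>();
        listeners.put(jobId, listener);
        List<CachedEvent> pending = cache.remove(jobId);
        if (pending != null) {
            for (CachedEvent e : pending) {
                listener.add(e.state);
            }
        }
        return listener;
    }

    /** RED: drop every cached event older than maxAgeMillis. */
    void flushCache(long now) {
        Iterator<List<CachedEvent>> it = cache.values().iterator();
        while (it.hasNext()) {
            List<CachedEvent> events = it.next();
            events.removeIf(e -> now - e.cachedAtMillis > maxAgeMillis);
            if (events.isEmpty()) {
                it.remove();
            }
        }
    }
}

class FlushRaceSketch {
    /** Both events arrive early; the flush runs right before the registration. */
    static List<String> run(long registrationTimeMillis) {
        ToyJobStateMonitor jsm = new ToyJobStateMonitor(1000);
        jsm.onEvent("job-1", "Active", 0);
        jsm.onEvent("job-1", "Done", 10);
        jsm.flushCache(registrationTimeMillis);
        return jsm.register("job-1");
    }

    public static void main(String[] args) {
        System.out.println("registered at 500 ms:  " + run(500));   // [Active, Done]
        System.out.println("registered at 5000 ms: " + run(5000));  // []
    }
}
```

In the second run the registration arrives after the cached events have aged out, so the job resource never sees "Active" or "Done". That is the stuck-job case described above.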
Created an attachment (id=1129)
diagram that illustrates the architecture
I thought the reason I did the non-blocking (un)register was to avoid a deadlock condition, but unfortunately I can't remember the details. It's possible I'm wrong about that or that the condition doesn't exist anymore given other code changes. I'd suggest trying the blocking register and unregister idea and see if it causes problems for the large job run.
Jarek's patch from bug 4802 is relevant to this problem. Martin tested Jarek's patch and it improved the time for JSM registration, but more investigation and testing are needed.
Created an attachment (id=1132)
new registration model

I found the problem in the meantime and made some changes so that registrations at the JobStateMonitor no longer pile up but are completed within a few milliseconds.

The main problem is persistence handling and the fact that the directory containing the persisted data was on an NFS-mounted partition. Things got even worse because access to that NFS-mounted partition on the server became quite slow for whatever reason. This caused about 450 out of 1000 jobs to get stuck and made job submission in general much slower.

The registration at the JobStateMonitor is a blocking operation which includes the notifications to the MEJHome in case there are already some cached events for that resource in the EventCache. During this notification a resource property is modified, which in turn causes the resource to be persisted to disk again (which was slow). Because of that, registration requests piled up in the JobStateMonitorSubscriptionManager. Additionally, since the events "Job is Active" and "Job is Done" come in so quickly one after another, the processing of these two events may have blocked each other, too. But the main factor is probably the persistence handling on the slow disk.

I made the registration non-blocking and added an EventDispatchQueue where the JobStateMonitor queues events that must be dispatched to a JobStateChangeListener. An EventDispatchThread works on that queue. With a slow disk, events pile up here too, of course, but no events will get lost now.

I attached another diagram that illustrates the changes.

ChangeLog:
##########
* JobStateMonitor:
  1) I changed the construction of the SEG because of the warning in a book about Java threads: "Do not allow the 'this' reference to escape during construction"
  2) removal of the DispatcherThread
* ManagedJobFactoryResource: Because of the new way the JSM is created, the resource had to be adapted too.
* SchedulerEventGenerator: new constructor
* 2 new classes in the org.globus.exec.monitoring package:
  - EventDispatcherQueue
  - EventDispatcherThread
* Removed JobStateMonitorSubscriptionManager and added the two operations subscribe() and unsubscribe() to the StateMachine directly. MAX_REGISTER_ATTEMPTS was removed. If a registration at the JobStateMonitor fails once (because the job may already be registered), the job fails.

Peter looked at the changes and they are committed to globus_4_0_branch and globus_4_0_community.

Future Work:
############
We should think about applying Jarek's GRAM4 patch at http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4802 since it reduces the time spent in the parts of GRAM4 that caused the problem described in this bug. The biggest factor here is the reduced number of persistence operations. AFAIK GRAM4 in trunk stores persistence data in a database. I assume this will improve the persistence situation, too.
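
For readers without access to the commit, the dispatch-queue idea can be sketched roughly as follows. This is an illustration under assumptions, not the committed code: DispatchSketch, StateListener and the STOP marker are hypothetical names, and the 20 ms sleep only stands in for a slow persistence step. The worker thread is created in the constructor but started from a static factory method, which is one common way to follow the "don't let 'this' escape during construction" advice.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical listener interface, standing in for a job state change listener. */
interface StateListener {
    void stateChanged(String jobId, String state);
}

final class DispatchSketch {
    /** One queued notification: which listener gets which state change. */
    private static final class Task {
        final StateListener listener;
        final String jobId;
        final String state;

        Task(StateListener listener, String jobId, String state) {
            this.listener = listener;
            this.jobId = jobId;
            this.state = state;
        }
    }

    /** Marker object that tells the worker to stop after draining earlier tasks. */
    private static final Task STOP = new Task(null, null, null);

    private final BlockingQueue<Task> queue = new LinkedBlockingQueue<>();
    private final Thread worker;

    private DispatchSketch() {
        // The thread is only created here, not started, so no other thread
        // can observe this object before the constructor has finished.
        worker = new Thread(this::drain, "event-dispatch");
    }

    /** Factory method: construct first, then start the worker. */
    static DispatchSketch start() {
        DispatchSketch d = new DispatchSketch();
        d.worker.start();
        return d;
    }

    /** Non-blocking for the caller: the event is queued, never dropped. */
    void enqueue(StateListener listener, String jobId, String state) {
        queue.add(new Task(listener, jobId, state));
    }

    /** Processes everything queued so far, then stops the worker. */
    void shutdown() {
        queue.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void drain() {
        try {
            while (true) {
                Task t = queue.take();
                if (t == STOP) {
                    return;
                }
                t.listener.stateChanged(t.jobId, t.state); // may be slow; the queue absorbs the delay
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Two events for a slow listener are delivered in order, none lost. */
    static List<String> demo() {
        List<String> seen = Collections.synchronizedList(new ArrayList<String>());
        StateListener slow = (jobId, state) -> {
            try {
                Thread.sleep(20); // stand-in for a slow disk write
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            seen.add(jobId + ":" + state);
        };
        DispatchSketch d = DispatchSketch.start();
        d.enqueue(slow, "job-1", "Active");
        d.enqueue(slow, "job-1", "Done");
        d.shutdown();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [job-1:Active, job-1:Done]
    }
}
```

The trade-off matches what is described above: with a slow listener the queue grows, but callers are not blocked and every event is eventually delivered in arrival order.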
This fix has been committed to the "globus_4_0_branch" and "globus_4_0_community" branches. It is part of the large GRAM patch that will be included in VDT 1.6. This fix will be included in GT 4.0.4; setting the milestone accordingly.