Bug 6139 - Deadlock situation in job termination
Status: RESOLVED FIXED
Product: GRAM
Component: wsrf managed execution job service
Version: alpha
Platform: Macintosh All
Importance: P3 blocker
Target Milestone: 4.2
Reported: 2008-06-09 11:59
Modified: 2008-06-09 12:16


Description, 2008-06-09 11:59:59
Job termination can currently deadlock when a job has staging. If a job
is terminated while it is in a staging response state (stageInResponse,
stageOutResponse, fileCleanUpResponse), the currently running transfer has
to be cancelled.
During cancellation the job resource stays locked. The job is then
unregistered from the StagingListener, the subscription resource is
destroyed (which locks the subscription resource behind the scenes), and
finally the RFT resource is destroyed.

However, if a notification message from this transfer reaches the
StagingListener at the same time, the subscription resource is locked
behind the scenes, and the delivering thread waits to lock the job
resource in StagingListener.deliver().

In rare situations the notification-sending thread holds the lock on the
subscription resource and tries to acquire the lock on the job resource,
while the transfer-cancellation thread holds the lock on the job resource
and tries to acquire the lock on the subscription resource. Each thread
then waits for a lock the other one holds, and neither can proceed.
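
The lock-order inversion described above can be reproduced in isolation.
The following is only a minimal sketch, not GRAM code: LockOrderDemo,
holdThenTry and the two ReentrantLock objects are hypothetical stand-ins
for the job resource and the subscription resource. It uses a timed
tryLock() instead of a blocking lock() so that the demo reports the
conflict instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderDemo {
    /** Takes 'first', waits until both threads hold their first lock, then tries 'second'. */
    static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
                               CountDownLatch bothHold, CountDownLatch bothTried) {
        first.lock();
        try {
            bothHold.countDown();
            bothHold.await();                 // both threads now own their first lock
            boolean got = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (got) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();                // keep 'first' until the other attempt is over
            return got;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    /** Returns { cancellation thread got 2nd lock, notification thread got 2nd lock }. */
    static boolean[] run() throws InterruptedException {
        ReentrantLock jobLock = new ReentrantLock();           // stand-in: job resource
        ReentrantLock subscriptionLock = new ReentrantLock();  // stand-in: subscription resource
        CountDownLatch bothHold = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] acquired = new boolean[2];

        // Cancellation path: job first, then subscription.
        Thread canceller = new Thread(() ->
            acquired[0] = holdThenTry(jobLock, subscriptionLock, bothHold, bothTried));
        // Notification path: subscription first, then job.
        Thread notifier = new Thread(() ->
            acquired[1] = holdThenTry(subscriptionLock, jobLock, bothHold, bothTried));

        canceller.start();
        notifier.start();
        canceller.join();
        notifier.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = run();
        System.out.println("canceller got subscription lock: " + r[0]);
        System.out.println("notifier got job lock: " + r[1]);
    }
}
```

With blocking lock() calls in place of tryLock(), the same interleaving
would leave both threads waiting forever, which is the situation reported
here.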

Solution:

To fix this, the thread that is responsible for sending the notification
from RFT to GRAM must not try to lock the job resource corresponding to
this transfer.
I added a single-threaded ExecutorService that adds those jobs back to the
processing cycle once a notification from RFT reports that a transfer has
finished. This removes the deadlock.
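
The handoff idea can be sketched as follows. This is an illustration
under assumptions, not the actual patch: StagingHandoffSketch, jobLock,
processingCycle and demo() are hypothetical names, and it is assumed here
that the executor thread (rather than the notification thread) is the one
that takes the job lock before re-queuing. The point it shows is that
deliver() returns without ever touching the job lock, so the notification
thread cannot block on it while holding the subscription lock.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class StagingHandoffSketch {
    // Hypothetical stand-in for the job resource lock.
    private final ReentrantLock jobLock = new ReentrantLock();
    // One worker thread: re-queue requests run one at a time, in submission order.
    private final ExecutorService requeuer = Executors.newSingleThreadExecutor();
    // Hypothetical stand-in for the job processing cycle.
    private final List<String> processingCycle = new CopyOnWriteArrayList<>();

    /** Called on the notification thread; only enqueues work, never locks the job. */
    void deliver(String jobId) {
        requeuer.execute(() -> {
            jobLock.lock();               // assumption: taken on the executor thread only
            try {
                processingCycle.add(jobId);
            } finally {
                jobLock.unlock();
            }
        });
    }

    static List<String> demo() throws InterruptedException {
        StagingHandoffSketch s = new StagingHandoffSketch();
        s.jobLock.lock();                 // plays the cancellation thread holding the job
        try {
            s.deliver("job-1");           // returns at once although the job is locked
            s.deliver("job-2");
            if (!s.processingCycle.isEmpty()) {
                throw new IllegalStateException("re-queued while the job was still locked");
            }
        } finally {
            s.jobLock.unlock();
        }
        s.requeuer.shutdown();
        if (!s.requeuer.awaitTermination(5, TimeUnit.SECONDS)) {
            throw new IllegalStateException("re-queue tasks did not finish");
        }
        return s.processingCycle;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());       // prints [job-1, job-2]
    }
}
```

Because only the executor's worker thread waits for the job lock, and it
holds no other lock while waiting, the circular wait between the job
resource and the subscription resource cannot form in this sketch.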