Bug 5247 - job cancellation can lead to container hanging
Status: RESOLVED FIXED
Product: GRAM
Component: wsrf managed execution job service
Version: 4.0.4
Platform: PC Linux
Importance: P1 blocker
Target Milestone: 4.2
: 5656
: 5610
Reported: 2007-04-25 17:22 by
Modified: 2008-04-04 08:08 (History)


Description From 2007-04-25 17:22:59
Canceling a large number of jobs that are not yet fully processed can cause
the container to run out of threads.
The reason is that a container thread is blocked in a call to
MEJR.remove(), which does not return until the resource is in state Done or
Failed. To reach state Failed after a user cancel request, a resource has to
walk through several states once the client requests cancellation:
UserCancel->(FailureFileCleanUp->)FailureCacheCleanUp->Failed.
These states are not processed immediately one after the other; instead, the
resource is put back into the RunQueue after each state is processed.
FileCleanup is a time-consuming task, and if the RunQueue is busy, it may take
a considerable amount of time until the resource's state changes from
UserCancel to Failed.
During that time, the thread that the container assigned to the client's
destroy request is blocked. The maximum number of container threads can be
reached quickly if the container isn't configured to run with many threads.
Another issue is that (AFAIK) container threads are also used for the
notifications needed during FailureFileCleanUp (call from WS-GRAM to RFT). If
all threads are blocked destroying resources, no notification can be delivered
to indicate that a FileCleanUp request has been successfully processed by RFT,
which leads to a deadlock.
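
The starvation described above can be reproduced in isolation. The following
is a minimal sketch, not WS-GRAM code: the class and method names are
hypothetical, a fixed-size java.util.concurrent pool stands in for the
container threads, and a latch stands in for the RFT notification. It assumes
that notification delivery needs a thread from the same pool that the blocked
destroy calls occupy.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the problem; none of these names exist in WS-GRAM. */
final class DestroyStarvationSketch {

    /**
     * Submits destroyRequests blocking "destroy" calls to a pool of poolSize
     * threads, then submits the "notification" that would unblock them.
     * Returns true if every destroy call finished within the timeout.
     */
    static boolean allDestroysComplete(int poolSize, int destroyRequests, long timeoutMillis)
            throws InterruptedException {
        ExecutorService containerThreads = Executors.newFixedThreadPool(poolSize);
        CountDownLatch cleanupNotified = new CountDownLatch(1);
        CountDownLatch destroysDone = new CountDownLatch(destroyRequests);
        try {
            for (int i = 0; i < destroyRequests; i++) {
                containerThreads.submit(() -> {
                    try {
                        // Stands in for a blocking remove(): wait until cleanup is reported.
                        cleanupNotified.await();
                        destroysDone.countDown();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The notification itself needs a free thread from the same pool.
            containerThreads.submit(cleanupNotified::countDown);
            return destroysDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            // Interrupts any destroy calls that are still blocked.
            containerThreads.shutdownNow();
        }
    }
}
```

With as many destroy requests as pool threads, the notification task never
gets a thread and the call times out; with one spare thread, all destroy calls
complete.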

Unfortunately, there seems to be no quick solution here so far.
Possible solutions are:

1. Make the remove() method a non-blocking method.
   This is not ideal, because the client can then no longer rely on the
   resource actually being destroyed. If errors happen, the client has no way
   to be notified about them. The client will never know whether a
   cancellation really succeeded and/or when the resource was really destroyed.

2. Process the necessary steps without queuing the resources.
   Not a nice solution, because all work should be done in the RunThreads that
   work on the RunQueue. Also, this probably wouldn't really avoid the problem,
   because these tasks may take some time even if there's no queuing between
   the tasks.

3. Add a "cancel" method, which must be called if a resource is not in state
   Done or Failed. After the cancellation states described above have been
   processed, the resource is in state Failed and can be destroyed by the
   client. Destroy is then no longer a blocking operation.
   But this requires changes in both the server and the clients, so it is
   not a solution for the 4.0 branch.
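
Option 3 can be sketched as follows. This is a single-threaded illustration
only, not the actual implementation: the class, the method names, the Active
state and the boolean return value of destroy() are assumptions, and the
optional FailureFileCleanUp state is always walked for simplicity. The state
names follow the cancellation sequence given in the description.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of an explicit cancel plus a non-blocking destroy. */
final class CancelThenDestroySketch {

    enum State { ACTIVE, USER_CANCEL, FAILURE_FILE_CLEAN_UP, FAILURE_CACHE_CLEAN_UP, FAILED, DONE }

    static final class JobResource {
        State state = State.ACTIVE;
        boolean destroyed = false;
    }

    // Stands in for the RunQueue that the RunThreads work on.
    private final Queue<JobResource> runQueue = new ArrayDeque<>();

    /** Non-blocking: only marks the job as canceled and queues it. */
    void cancel(JobResource job) {
        if (job.state != State.ACTIVE) {
            return; // already canceling or already in a final state
        }
        job.state = State.USER_CANCEL;
        runQueue.add(job);
    }

    /** Non-blocking: refuses instead of waiting if the job is not Done or Failed. */
    boolean destroy(JobResource job) {
        if (job.state != State.DONE && job.state != State.FAILED) {
            return false;
        }
        job.destroyed = true;
        return true;
    }

    /** One RunThread step: process one state, then requeue unless the job is final. */
    boolean runOneStep() {
        JobResource job = runQueue.poll();
        if (job == null) {
            return false; // nothing left to process
        }
        switch (job.state) {
            case USER_CANCEL:            job.state = State.FAILURE_FILE_CLEAN_UP; break;
            case FAILURE_FILE_CLEAN_UP:  job.state = State.FAILURE_CACHE_CLEAN_UP; break;
            case FAILURE_CACHE_CLEAN_UP: job.state = State.FAILED; break;
            default: break;
        }
        if (job.state != State.FAILED && job.state != State.DONE) {
            runQueue.add(job);
        }
        return true;
    }
}
```

In this sketch no container thread ever waits: a client calls cancel(), learns
by some other means (for example a state notification) that the job reached
Failed, and only then calls destroy().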

In our tests, we use the following workaround:
We configure the container with a large number of threads and use the
containerThreadsHighWaterMark configuration parameter to adjust the number
of threads when there's not much traffic in the container.
------- Comment #1 From 2007-11-29 10:44:01 -------
A proposed design doc for a new implementation was sent to the gram-dev list
(http://www.globus.org/mail_archive/gram-user/2007/10/msg00004.html) to gather
requirements and comments.  Based on the feedback, we redesigned the interface.
After investigating the implementation options, a new design proposal will be
recirculated.
------- Comment #2 From 2008-01-21 16:39:34 -------
The new concept is described here:
http://www.globus.org/mail_archive/gram-dev/2008/01/msg00000.html
Implementation is almost completed.