Bugzilla – Bug 5247
job cancellation can lead to container hanging
Last modified: 2008-04-04 08:08:00
Canceling a large number of jobs that are not yet fully processed can cause the container to run out of threads. The reason is that a container thread is blocked in a call to MEJR.remove(), which does not return until the resource is in state Done or Failed.

To reach state Failed after a user cancel request, a resource has to walk through several states once the client requests cancellation: UserCancel->(FailureFileCleanUp->)FailureCacheCleanUp->Failed. These states are not processed immediately one after the other; instead, the resource is queued into the RunQueue again after each state has been processed. FileCleanup is a time-consuming task, and if the RunQueue is busy, it may take a good amount of time until the resource's state changes from UserCancel to Failed. During that time, the thread that the container assigned to the client's destroy request is blocked. The maximum number of container threads can be reached quickly if the container isn't configured to run with many threads.

Another issue is that (AFAIK) container threads are also used for notifications, which are needed during FailureFileCleanUp (call from WS-GRAM to RFT). If all threads are blocked in the destruction of resources, no notifications can be sent to indicate that a FileCleanUp request has been successfully processed by RFT, which leads to a deadlock.

Unfortunately, there seems to be no quick solution so far. Possible solutions are:

1. Make the remove() method non-blocking. This is not ideal, because the client can then no longer rely on the destruction of resources. If errors happen, the client has no way to get notified about them. The client will never know whether a cancellation really succeeded and/or when the resource was really destroyed.

2. Process the necessary steps without queuing the resources. Not a nice solution, because all work should be done in the RunThreads that work on the RunQueue. Also, this probably wouldn't really avoid the problem, because these tasks may take some time even if there is no queuing between them.

3. Add a "cancel" method, which must be called if a resource is not in state Done or Failed. After the cancellation states described above have been processed, the resource is in state Failed and can be destroyed by the client. Destroy is then no longer a blocking operation. But this will lead to changes in both the server and the clients, so it is no solution for the 4.0 branch.

In our tests, we use the following workaround: we configure the container with a large number of threads and use the containerThreadsHighWaterMark configuration parameter to adjust the number if there's not much traffic in the container.
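
The deadlock described above can be reproduced in miniature with a plain Java thread pool. The following is only a sketch under stated assumptions, not WS-GRAM or container code: the class PoolStarvationDemo, the method allDestroysComplete, and its parameters are hypothetical names invented for this illustration. A fixed-size pool stands in for the container threads, each "destroy" task stands in for a blocking remove() call, and the small task it submits stands in for the notification that needs a thread from the same pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolStarvationDemo {

    /**
     * Hypothetical model of the problem, not real container code.
     * Each "destroy" occupies one pool thread and blocks until a
     * "notification" task, which needs a thread from the SAME pool, has run.
     * Returns true if every destroy finished within the timeout.
     */
    static boolean allDestroysComplete(int poolSize, int cancels, long timeoutMillis) {
        ExecutorService container = Executors.newFixedThreadPool(poolSize);
        CountDownLatch destroyed = new CountDownLatch(cancels);

        Runnable destroy = () -> {
            CountDownLatch failedReached = new CountDownLatch(1);
            // Stand-in for the notification: it is queued on the same pool.
            container.execute(failedReached::countDown);
            try {
                failedReached.await();   // stand-in for the blocking remove()
                destroyed.countDown();   // resource reached its final state
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        try {
            // All destroy requests arrive before any notification can be queued.
            for (int i = 0; i < cancels; i++) {
                container.execute(destroy);
            }
            return destroyed.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            container.shutdownNow();     // interrupt threads that are still blocked
        }
    }

    public static void main(String[] args) {
        // Two threads, two cancels: both threads block, notifications never run.
        System.out.println(allDestroysComplete(2, 2, 300));   // false
        // Spare threads are available for the notifications.
        System.out.println(allDestroysComplete(4, 2, 5000));  // true
    }
}
```

With as many pending destroys as pool threads, the notification tasks stay in the queue forever, which mirrors the situation where no container thread is left to deliver the notifications. Enlarging the pool, as in the workaround above, only moves the limit: the same starvation reappears once the number of concurrent cancels reaches the pool size.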
A proposed design doc for a new implementation was sent to the gram-dev list (http://www.globus.org/mail_archive/gram-user/2007/10/msg00004.html) to gather requirements and comments. Based on the feedback, we redesigned the interface. After investigating the implementation options, we will recirculate a new design proposal.
The new concept is described here: http://www.globus.org/mail_archive/gram-dev/2008/01/msg00000.html The implementation is almost complete.