Bugzilla – Bug 4204
WorkManagerImpl incorrectly implements the WorkManager.schedule(...) method
Last modified: 2006-04-21 13:48:18
You need to log in before you can comment on or make changes to this bug.
WorkManager.schedule(...) must be implemented as a non-blocking method per the interface specification. Per the WorkManager.schedule(...) javadoc from the BEA website, the first two lines read: "Dispatches a Work asynchronously. The work is dispatched and the method returns immediately." The WorkManagerImpl.schedule(...) implementation, however, blocks on the call whenever the underlying thread pool is maxed. Code snippet: public class WorkmanagerImpl implements WorkManager ... { ... private PooledExecutor pool; ... public WorkManagerImpl(int maxPoolSize) { this.pool = new PooledExecutor(maxPoolSize); } ... public synchronized WorkItem schedule(Work work, WorkListener listener) ... if(work.isDaemon()) { Thread thread = new Thread(new WorkWrapper(work, listener)); thread.setName("Work" + thread.getName()); thread.setDaemon(true); thread.start(); } else { this.pool.execute(new WorkWrapper(work, listener)); } ... From PooledExecutor javadoc, " Blocked execution policy If the maximum pool size or queue size is bounded, then it is possible for incoming execute requests to block. There are four supported policies for handling this problem, and mechanics (based on the Strategy Object pattern) to allow others in subclasses: Run (the default) The thread making the execute request runs the task itself. This policy helps guard against lockup. " Since WorkManagerImpl uses PooledExecutor's default Blocked execution policy, when max threads is reached, the next invocation leads to a blocking call -- not only that, but the Runnable passed to execute is executed on the calling thread. With a service like DRS this results in a very long sequence of Work.run(...) methods being invoked on the ServiceThread of the container, thus the entire SOAP rpc blocks until X number of files are replicated across the Grid.
Off the top of my head, my suggestion for fixing this is to run a daemon thread in the background with the sole purpose of waiting on a pending queue of Work, when notified wakes up and attempts to this.pool.execute(nextWork). Change the BlockedExecutionHandler to a Wait policy so that the thread waits until the next thread is available from the pool. The WorkManagerImpl.schedule(...) method (when work is not a daemon) just adds the Work to the pending queue, notifies the schedule daemon, and returns immediately. I'm sure there are other more creative ways to implement it also.
I committed a fix for this problem to trunk. Please test it out and let me know. The PoolExecutor is now configured with an internal queue to queue up the requests, so the schedule method will not block anymore but the queue can build up now.
I am not sure this is fixed. I updated my java core with the fix (just updated the org.globus.wsrf.impl.work package). I attempted a concurrency test and a standalong test. The DRS code was not changed. When the DRS is started it schedules the DiscoverWork. But with the changed java ws core, the discover works never get executed. Seems that the PooledExecutor just queues the work up but never starts it. *REOPENING*
I do not see this behaviour in the tests. Can you do kill -QUIT to see thread dump?
I see what's happening. I have workers that schedule more workers and then wait for the 'child' (if you will) workers to finish by doing a wm.waitForAll (childWorkers). So, since my worker does not relinquish its thread, I support the PooledExecutor can never schedule the 'child' worker. Does that sound plausible based on your knowledge of how PooledExecutor should work? If so, what's the advice to container users? "Don't do that!" maybe? Or perhaps there's a way to configure the workmanager to grow indefinitely to avoid such conditions? Here's the kill -QUIT output of one (of 10) such workers trapped. "Thread-28" daemon prio=1 tid=0x8c411578 nid=0x3510 in Object.wait() [0xba3ff000 ..0xba3ff554] at java.lang.Object.wait(Native Method) - waiting on <0x4bacbf48> (a org.globus.wsrf.impl.work.WorkItemImpl) at java.lang.Object.wait(Object.java:474) at org.globus.wsrf.impl.work.WorkManagerImpl.waitForAll (WorkManagerImpl. java:179) - locked <0x4bacbf48> (a org.globus.wsrf.impl.work.WorkItemImpl) at org.globus.replica.replicator.impl.DiscoverWork._queryCatalogs (Discov erWork.java:324) at org.globus.replica.replicator.impl.DiscoverWork.run (DiscoverWork.java :119) at org.globus.wsrf.impl.work.WorkManagerImpl$WorkWrapper.run (WorkManager Impl.java:376) at EDU.oswego.cs.dl.util.concurrent.PooledExecutor$Worker.run(Unknown So urce) at java.lang.Thread.run(Thread.java:595)
I would view Work as independent processes, and I don't recommend doing synchronization between different Work instances. That can block the whole WorkManager as you exprienced. If you need to do some synchronization between Work instances, implement that Work that waits for other Work instances as a daemon Work (return true in isDeamon(), a new thread will be started specifically for that Work) and let the other independent Work instances queue up as normal. By default WorkManager is configured to use maximum of 5 threads. It can be configured to use more but I do think that there should be some limit on it.
This is closed to my satisfaction. Thanks for the quick fix.
So this fix wasn't committed to globus_4_0_branch? If not, we're going to need a patch for our 4.0.2 users of DRS.
Yes, it looks like I forgot to check in the fix into globus_4_0_branch. Sorry. I just committed the fix to globus_4_0_branch.
Created an attachment (id=930) [details] Patch against 4.0.2 source code