Bugzilla – Bug 3502
Container freezes (CPU spin) when making remote call between ws-resources
Last modified: 2007-09-12 16:42:14
You need to log in before you can comment on or make changes to this bug.
The container freezes up (by way of the CPU spinning 99.9% utilization on the container's thread). I encounter this consistently (20-25% likelihood) when running the Data Replication Service at the point where the DRS creates a RFT resource. The create message contains 10,000 file entries for the new RFT resource. Flow of control leaves the DRS, passes to the container, and never passes to the RFT Factory (createRFT) method. With the "org.globus.wsrf.container=DEBUG" log4j setting, the final log messages prior to the container freeze are: 2005-06-20 14:50:57,377 DEBUG impl.TransferWork [Thread-15,_createTransferRequests:132] _createTransferRequests 2005-06-20 14:50:57,382 DEBUG impl.DiscoverWork [Thread-14,_persistReplication:439] _persistReplication 2005-06-20 14:50:57,385 DEBUG impl.DiscoverWork [Thread-14,stop:71] [perf:discover][Thread-14][8858] 2005-06-20 14:50:57,387 DEBUG impl.ReplicationWorkListener [Thread-14,workCompleted:59] Work completed, key: 24415140 2005-06-20 14:50:57,444 DEBUG impl.TransferWork [Thread-15,_createTransferRequests:170] Creating transfer request, size: 10000 2005-06-20 14:50:57,448 DEBUG impl.TransferWork [Thread-15,_scheduleTransferRequests:208] _scheduleTransferRequests 2005-06-20 14:50:57,467 DEBUG impl.TransferWork [Thread-15,_scheduleTransferRequests:234] Waiting for transfer requests 2005-06-20 14:50:57,467 DEBUG impl.TransferRequestWork [Thread-14,_locateTransferFactory:174] _locateTransferFactory 2005-06-20 14:50:57,811 DEBUG impl.TransferRequestWork [Thread-14,_createTransfer:189] _createTransfer 2005-06-20 14:50:57,812 DEBUG impl.TransferRequestWork [Thread-14,_createTransfer:197] Attempting to create RFT resource 2005-06-20 14:50:58,548 DEBUG container.ServiceRequestQueue [Thread-7,enqueue:52] Putting request in queue 2005-06-20 14:50:58,550 DEBUG container.ServiceDispatcher [Thread-7,run:196] waiting threads: 3 2005-06-20 14:50:58,550 DEBUG container.ServiceRequestQueue [Thread-8,dequeue:32] Getting request from queue 2005-06-20 14:50:58,551 DEBUG container.GSIServiceThread [Thread-8,process:63] Thread-8: processing requests 2005-06-20 14:50:58,784 DEBUG container.GSIServiceThread [Thread-8,process:95] Thread-8: Authenticated globus user: /C=US/O=Globus Alliance/OU=User/CN=fc5bc4e4ca.7e1f4757 2005-06-20 14:50:58,785 DEBUG container.ServiceThread [Thread-8,process:323] Thread Thread-8 processing requests 2005-06-20 14:50:58,887 DEBUG container.ServiceThread [Thread-8,parseRequest:769] ServiceThread: requested file name = 'wsrf/services/ReliableFileTransferFactoryService' I can help with reproducing the bug and generating logs with different log4j settings. Unfortunately, the container process will not respond to "kill -QUIT" which would have produced a good stack trace of all threads. (It has to be killed with "kill -9".) Likewise, I have tried attaching via jdb using the appropriate jvm settings for the container, and the jdb cannot attached once the container is frozen.
I should also note that the container freeze occurs at another point when the DRS checks on the RFT status by calling it's GetResourceProperty operation. I assume that the cause is the same as with the CreateRFT operation.
What JVM are you using? Also, try with a smaller request and see if you can still see the same behavior and try doing kill -QUIT. Also, check if the db itself is not a problem.
java version "1.4.2_06" I will attempt to run smaller tests -- but it's important to note that the problem appears mostly at larger request sizes -- I can run tests including 1000 files in a transfer request with 100%~ success. I'll let you know what I find. As for the database, I'm not clear on what that has to do with it, since RFT doesn't even recieve any message to act on. The db shouldn't be an issue here.
I recreated the freeze with a file / transfer request size of 5000. The "kill -QUIT" signal still has no effect on the process. Just to demonstrate that the freeze results in the same concluding log messages I'll copy them here again: 2005-06-21 10:43:31,411 DEBUG impl.TransferRequestWork [Thread-13,_createTransfer:197] Attempting to create RFT resource 2005-06-21 10:43:32,145 DEBUG container.ServiceRequestQueue [Thread-7,enqueue:52] Putting request in queue 2005-06-21 10:43:32,146 DEBUG container.ServiceDispatcher [Thread-7,run:196] waiting threads: 2 2005-06-21 10:43:32,147 DEBUG container.ServiceRequestQueue [Thread-8,dequeue:32] Getting request from queue 2005-06-21 10:43:32,148 DEBUG container.GSIServiceThread [Thread-8,process:63] Thread-8: processing requests 2005-06-21 10:43:32,392 DEBUG container.GSIServiceThread [Thread-8,process:95] Thread-8: Authenticated globus user: /C=US/O=Globus Alliance/OU=User/CN=fc5bc4e4ca.7e1f4757 2005-06-21 10:43:32,393 DEBUG container.ServiceThread [Thread-8,process:323] Thread Thread-8 processing requests 2005-06-21 10:43:32,489 DEBUG container.ServiceThread [Thread-8,parseRequest:769] ServiceThread: requested file name = 'wsrf/services/ReliableFileTransferFactoryService'
Are you doing kill -QUIT on the right process? What's your log4j settings? Maybe you should enable all logging for all classes.
Yes, in fact I'm doing 'kill -QUIT' on a few pid's -- I first try the pid that shows up in top as taking 99.9 CPU, then I try it's parent, and then it's parent's parent (which is the root of the process tree as shows by pstree). My logging is what I mention in my first report: org.globus.wsrf.container=DEBUG And I also have: org.globus.replica=DEBUG org.globus.transfer=DEBUG The latter two just provide context. By "all logging for all classes", I assume that means: org.globus=DEBUG If you mean something else, please let me know.
Change 'rootCategory' to 'DEBUG'. That will generate a lot of output but might help.
After changing the log4j setting per your instruction, and then recreating the freeze, the following log is produced (last lines): 2005-06-21 13:45:58,251 DEBUG message.MessageElement [Thread-11,<init>:242] soapenv:mustUnderstand = '0' 2005-06-21 13:45:58,252 DEBUG i18n.ProjectResourceBundle [Thread- 14,handleGetObject:72] org.apache.axis.i18n.resource::handleGetObject (startElem00) 2005-06-21 13:45:58,254 DEBUG i18n.ProjectResourceBundle [Thread- 11,handleGetObject:72] org.apache.axis.i18n.resource::handleGetObject (pushHandler00) 2005-06-21 13:45:58,255 DEBUG encoding.SerializationContext [Thread- 14,startElement:991] Start element [http://www.globus.org/namespaces/2004/10/rft]:sourceUrl 2005-06-21 13:45:58,257 DEBUG encoding.DeserializationContext [Thread- 11,pushElementHandler:796] Pushing handler org.apache.axis.message.SOAPHandler@334bfc 2005-06-21 13:45:58,258 DEBUG utils.NSStack [Thread-14,push:75] NSPush (32) 2005-06-21 13:45:58,259 DEBUG encoding.DeserializationContext [Thread- 11,pushNewElement:765] Pushing element To 2005-06-21 13:45:58,261 DEBUG i18n.ProjectResourceBundle [Thread- 14,handleGetObject:72] org.apache.axis.i18n.resource::handleGetObject (endElem00) 2005-06-21 13:45:58,262 DEBUG utils.NSStack [Thread-11,push:75] NSPush (32) 2005-06-21 13:45:58,264 DEBUG encoding.SerializationContext [Thread- 14,endElement:1097] End element sourceUrl 2005-06-21 13:45:58,265 DEBUG encoding.DeserializationContext [Thread- 11,startElement:1067] Exit: DeserializationContext::startElement() 2005-06-21 13:45:58,266 DEBUG utils.NSStack [Thread-14,pop:109] NSPop (32) 2005-06-21 13:45:58,267 DEBUG encoding.DeserializationContext [Thread- 11,endElement:1078] Enter: DeserializationContext::endElement (http://schemas.xmlsoap.org/ws/2004/03/addressing, To) 2005-06-21 13:45:58,269 DEBUG i18n.ProjectResourceBundle [Thread- 14,handleGetObject:72] org.apache.axis.i18n.resource::handleGetObject (startElem00) 2005-06-21 13:45:58,270 DEBUG i18n.ProjectResourceBundle [Thread- 11,handleGetObject:72] org.apache.axis.i18n.resource::handleGetObject (popHandler00) 2005-06-21 13:45:58,272 DEBUG encoding.SerializationContext [Thread- 14,startElement:991] Start element [http://www.globus.org/namespaces/2004/10/rft]:destinationUrl 2005-06-21 13:45:58,273 DEBUG encoding.DeserializationContext [Thread- 11,popElementHandler:829] Popping handler org.apache.axis.message.SOAPHandler@334bfc Let me know if you'd like the entire log attached to this record.
Yes, please attach the entire log.
This URL points to the complete log (bugzilla doesn't seem to want to accept it as an attachment): http://www.isi.edu/~schuler/container.log.ned-0.200506211400.tar.bz2
I don't see anything really wrong in the log. Two interesting things but unrelated to the problem: 1) it looks like you are using GSI Sec Conv authentication on top of https, 2) I'm not seeing any output from MessageLoggingHandler, maybe logging for it was just explicitely disabled. A few other things to try: 1) Ensure you are working with latest code from cvs (if not already) 2) Try with jdk 1.4.2_08 and if that does not help with jdk 1.5
I started by testing with j2sdk1.4.2_08 and jdk1.5.0_03. With j2sdk1.4.2_08 the results are the same. It still hangs and very consistently. However, the good news is that with jdk1.5.0_03 I cannot recreate the hang. The cpu spikes but interestingly the cpu usage seems to be spread across two or three java processes, the sum of which would be near 100%. Also, at times, I see the cpu spike on a single java process up to 96%~ but doesn't top out at 99.9 and freeze. I'm not sure what improvements Sun has made to the JVM since 1.4, but it appears to mitigate this problem. I'm going to continue to run some tests (with jdk1.5) to make sure I cannot recreate the cpu spike during the GetRP call.
Reassigning to current wsrf developer to close/fix as appropriate