Bug 3502 - Container freezes (CPU spin) when making remote call between ws-resources
Status: NEW
Product: Java WS Core
Component: globus_wsrf_core
Version: unspecified
Hardware: PC Linux
Importance: P3 critical
Target Milestone: ---
Assigned To:
Reported: 2005-06-20 17:18
Modified: 2007-09-12 16:42

Description From 2005-06-20 17:18:56
The container freezes up, with the CPU spinning at 99.9% utilization on the
container's thread. I encounter this consistently (20-25% of runs) when running
the Data Replication Service, at the point where the DRS creates an RFT
resource. The create message contains 10,000 file entries for the new RFT
resource. Flow of control leaves the DRS, passes to the container, and never
reaches the RFT Factory (createRFT) method.

With the "org.globus.wsrf.container=DEBUG" log4j setting, the final log messages
prior to the container freeze are:

2005-06-20 14:50:57,377 DEBUG impl.TransferWork
[Thread-15,_createTransferRequests:132] _createTransferRequests
2005-06-20 14:50:57,382 DEBUG impl.DiscoverWork
[Thread-14,_persistReplication:439] _persistReplication
2005-06-20 14:50:57,385 DEBUG impl.DiscoverWork [Thread-14,stop:71]
[perf:discover][Thread-14][8858]
2005-06-20 14:50:57,387 DEBUG impl.ReplicationWorkListener
[Thread-14,workCompleted:59] Work completed, key: 24415140
2005-06-20 14:50:57,444 DEBUG impl.TransferWork
[Thread-15,_createTransferRequests:170] Creating transfer request, size: 10000
2005-06-20 14:50:57,448 DEBUG impl.TransferWork
[Thread-15,_scheduleTransferRequests:208] _scheduleTransferRequests
2005-06-20 14:50:57,467 DEBUG impl.TransferWork
[Thread-15,_scheduleTransferRequests:234] Waiting for transfer requests
2005-06-20 14:50:57,467 DEBUG impl.TransferRequestWork
[Thread-14,_locateTransferFactory:174] _locateTransferFactory
2005-06-20 14:50:57,811 DEBUG impl.TransferRequestWork
[Thread-14,_createTransfer:189] _createTransfer
2005-06-20 14:50:57,812 DEBUG impl.TransferRequestWork
[Thread-14,_createTransfer:197] Attempting to create RFT resource
2005-06-20 14:50:58,548 DEBUG container.ServiceRequestQueue
[Thread-7,enqueue:52] Putting request in queue
2005-06-20 14:50:58,550 DEBUG container.ServiceDispatcher [Thread-7,run:196]
waiting threads: 3
2005-06-20 14:50:58,550 DEBUG container.ServiceRequestQueue
[Thread-8,dequeue:32] Getting request from queue
2005-06-20 14:50:58,551 DEBUG container.GSIServiceThread [Thread-8,process:63]
Thread-8:
processing requests
2005-06-20 14:50:58,784 DEBUG container.GSIServiceThread [Thread-8,process:95]
Thread-8:
Authenticated globus user: /C=US/O=Globus Alliance/OU=User/CN=fc5bc4e4ca.7e1f4757
2005-06-20 14:50:58,785 DEBUG container.ServiceThread [Thread-8,process:323]
Thread Thread-8 processing requests
2005-06-20 14:50:58,887 DEBUG container.ServiceThread
[Thread-8,parseRequest:769] ServiceThread: requested file name =
'wsrf/services/ReliableFileTransferFactoryService'

I can help with reproducing the bug and generating logs with different log4j
settings. Unfortunately, the container process will not respond to "kill -QUIT",
which would have produced a good stack trace of all threads. (It has to be
killed with "kill -9".) Likewise, I have tried attaching jdb using the
appropriate JVM settings for the container, but jdb cannot attach once the
container is frozen.
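
For context on the container-side names in the log above (ServiceRequestQueue,
ServiceDispatcher, ServiceThread): the messages describe a dispatcher putting
incoming requests on a queue from which a pool of service threads dequeues and
processes them. The following is a purely illustrative sketch of that pattern,
with hypothetical class names rather than the actual Globus source, in the
JDK 1.4-era Java used here:

    import java.util.LinkedList;

    // Hypothetical sketch of a request queue feeding a pool of service threads.
    class RequestQueue {
        private final LinkedList queue = new LinkedList();

        // dispatcher side ("Putting request in queue")
        public synchronized void enqueue(Object request) {
            queue.addLast(request);
            notify(); // wake one waiting service thread
        }

        // service-thread side ("Getting request from queue")
        public synchronized Object dequeue() throws InterruptedException {
            while (queue.isEmpty()) {
                wait(); // block until the dispatcher enqueues a request
            }
            return queue.removeFirst();
        }
    }

    // Hypothetical service thread: loops forever, taking one request at a time.
    class ServiceWorker extends Thread {
        private final RequestQueue queue;

        ServiceWorker(RequestQueue queue) {
            this.queue = queue;
        }

        public void run() {
            try {
                while (true) {
                    Object request = queue.dequeue();
                    // authenticate, parse, and dispatch the request to the
                    // target service here
                }
            } catch (InterruptedException e) {
                return; // shut down
            }
        }
    }

In the log above, the last thing Thread-8 does is parse the requested file name
for ReliableFileTransferFactoryService, so the spin appears to happen after a
request has been dequeued rather than while waiting on the queue.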
------- Comment #1 From 2005-06-20 17:24:41 -------
I should also note that the container freeze occurs at another point, when the
DRS checks on the RFT status by calling its GetResourceProperty operation. I
assume that the cause is the same as with the CreateRFT operation.
------- Comment #2 From 2005-06-20 22:56:35 -------
What JVM are you using? Also, try with a smaller request and see if you still
get the same behavior, and try kill -QUIT again. Also, check that the db itself
is not the problem.
------- Comment #3 From 2005-06-21 12:41:40 -------
java version "1.4.2_06"

I will attempt to run smaller tests -- but it's important to note that the
problem appears mostly at larger request sizes -- I can run tests with 1000
files in a transfer request with ~100% success. I'll let you know what I
find.

As for the database, I'm not clear on what that has to do with it, since RFT
doesn't even receive any message to act on. The db shouldn't be an issue here.
------- Comment #4 From 2005-06-21 13:09:16 -------
I recreated the freeze with a transfer request size of 5,000 files. The "kill
-QUIT" signal still has no effect on the process.

Just to demonstrate that the freeze ends with the same concluding log messages,
I'll copy them here again:

2005-06-21 10:43:31,411 DEBUG impl.TransferRequestWork
[Thread-13,_createTransfer:197] Attempting to create RFT resource
2005-06-21 10:43:32,145 DEBUG container.ServiceRequestQueue
[Thread-7,enqueue:52] Putting request in queue
2005-06-21 10:43:32,146 DEBUG container.ServiceDispatcher [Thread-7,run:196]
waiting threads: 2
2005-06-21 10:43:32,147 DEBUG container.ServiceRequestQueue
[Thread-8,dequeue:32] Getting request from queue
2005-06-21 10:43:32,148 DEBUG container.GSIServiceThread [Thread-8,process:63]
Thread-8:
processing requests
2005-06-21 10:43:32,392 DEBUG container.GSIServiceThread [Thread-8,process:95]
Thread-8:
Authenticated globus user: /C=US/O=Globus Alliance/OU=User/CN=fc5bc4e4ca.7e1f4757
2005-06-21 10:43:32,393 DEBUG container.ServiceThread [Thread-8,process:323]
Thread Thread-8 processing requests
2005-06-21 10:43:32,489 DEBUG container.ServiceThread
[Thread-8,parseRequest:769] ServiceThread: requested file name =
'wsrf/services/ReliableFileTransferFactoryService'
 
------- Comment #5 From 2005-06-21 15:10:43 -------
Are you doing kill -QUIT on the right process? What are your log4j settings?
Maybe you should enable all logging for all classes.
------- Comment #6 From 2005-06-21 15:25:55 -------
Yes, in fact I'm doing 'kill -QUIT' on a few pids -- I first try the pid that
shows up in top as taking 99.9% CPU, then I try its parent, and then its
parent's parent (which is the root of the process tree as shown by pstree).

My logging is what I mention in my first report:
org.globus.wsrf.container=DEBUG

And I also have:
org.globus.replica=DEBUG
org.globus.transfer=DEBUG

The latter two just provide context. By "all logging for all classes", I 
assume that means:
org.globus=DEBUG

If you mean something else, please let me know.
------- Comment #7 From 2005-06-21 15:28:49 -------
Change 'rootCategory' to 'DEBUG'. That will generate a lot of output but might 
help. 
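
For reference, the change suggested here is a one-line edit to the rootCategory
entry in the container's log4j.properties. A minimal sketch, where "A1" is only
a placeholder for whatever appender name the file already lists:

    # container log4j.properties -- raise the root logger to DEBUG
    # keep the appender name ("A1" here) that the file already uses
    log4j.rootCategory=DEBUG, A1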
------- Comment #8 From 2005-06-21 15:57:53 -------
After changing the log4j setting per your instruction, and then recreating the 
freeze, the following log is produced (last lines):

2005-06-21 13:45:58,251 DEBUG message.MessageElement [Thread-11,<init>:242]   
soapenv:mustUnderstand = '0'
2005-06-21 13:45:58,252 DEBUG i18n.ProjectResourceBundle [Thread-
14,handleGetObject:72] org.apache.axis.i18n.resource::handleGetObject
(startElem00)
2005-06-21 13:45:58,254 DEBUG i18n.ProjectResourceBundle [Thread-
11,handleGetObject:72] org.apache.axis.i18n.resource::handleGetObject
(pushHandler00)
2005-06-21 13:45:58,255 DEBUG encoding.SerializationContext [Thread-
14,startElement:991] Start element 
[http://www.globus.org/namespaces/2004/10/rft]:sourceUrl
2005-06-21 13:45:58,257 DEBUG encoding.DeserializationContext [Thread-
11,pushElementHandler:796] Pushing handler 
org.apache.axis.message.SOAPHandler@334bfc
2005-06-21 13:45:58,258 DEBUG utils.NSStack [Thread-14,push:75] NSPush (32)
2005-06-21 13:45:58,259 DEBUG encoding.DeserializationContext [Thread-
11,pushNewElement:765] Pushing element To
2005-06-21 13:45:58,261 DEBUG i18n.ProjectResourceBundle [Thread-
14,handleGetObject:72] org.apache.axis.i18n.resource::handleGetObject
(endElem00)
2005-06-21 13:45:58,262 DEBUG utils.NSStack [Thread-11,push:75] NSPush (32)
2005-06-21 13:45:58,264 DEBUG encoding.SerializationContext [Thread-
14,endElement:1097] End element sourceUrl
2005-06-21 13:45:58,265 DEBUG encoding.DeserializationContext [Thread-
11,startElement:1067] Exit: DeserializationContext::startElement()
2005-06-21 13:45:58,266 DEBUG utils.NSStack [Thread-14,pop:109] NSPop (32)
2005-06-21 13:45:58,267 DEBUG encoding.DeserializationContext [Thread-
11,endElement:1078] Enter: DeserializationContext::endElement
(http://schemas.xmlsoap.org/ws/2004/03/addressing, To)
2005-06-21 13:45:58,269 DEBUG i18n.ProjectResourceBundle [Thread-
14,handleGetObject:72] org.apache.axis.i18n.resource::handleGetObject
(startElem00)
2005-06-21 13:45:58,270 DEBUG i18n.ProjectResourceBundle [Thread-
11,handleGetObject:72] org.apache.axis.i18n.resource::handleGetObject
(popHandler00)
2005-06-21 13:45:58,272 DEBUG encoding.SerializationContext [Thread-
14,startElement:991] Start element 
[http://www.globus.org/namespaces/2004/10/rft]:destinationUrl
2005-06-21 13:45:58,273 DEBUG encoding.DeserializationContext [Thread-
11,popElementHandler:829] Popping handler 
org.apache.axis.message.SOAPHandler@334bfc

Let me know if you'd like the entire log attached to this record.
------- Comment #9 From 2005-06-21 16:02:44 -------
Yes, please attach the entire log.
------- Comment #10 From 2005-06-21 16:35:36 -------
This URL points to the complete log (bugzilla doesn't seem to want to accept 
it as an attachment):

http://www.isi.edu/~schuler/container.log.ned-0.200506211400.tar.bz2

------- Comment #11 From 2005-06-22 00:30:01 -------
I don't see anything really wrong in the log. Two interesting things, though
unrelated to the problem: 1) it looks like you are using GSI Sec Conv
authentication on top of https, 2) I'm not seeing any output from
MessageLoggingHandler; maybe logging for it was just explicitly disabled.

A few other things to try:
1) Ensure you are working with the latest code from CVS (if not already)
2) Try with JDK 1.4.2_08, and if that does not help, with JDK 1.5
------- Comment #12 From 2005-06-24 15:42:15 -------
I started by testing with j2sdk1.4.2_08 and jdk1.5.0_03. With j2sdk1.4.2_08
the results are the same: it still hangs, and very consistently.

However, the good news is that with jdk1.5.0_03 I cannot recreate the hang.
The CPU spikes, but interestingly the CPU usage seems to be spread across two
or three java processes, the sum of which is near 100%. Also, at times, I see
the CPU spike on a single java process up to ~96%, but it doesn't top out at
99.9% and freeze. I'm not sure what improvements Sun has made to the JVM since
1.4, but they appear to mitigate this problem.

I'm going to continue to run some tests (with jdk1.5) to make sure I cannot 
recreate the cpu spike during the GetRP call.
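
One extra diagnostic that becomes possible under JDK 1.5 is an in-process check
for monitor deadlocks via java.lang.management. The sketch below uses a
hypothetical DeadlockWatcher class that is not part of the container; it would
have to be started from code already running inside the container JVM, and a
pure CPU spin would not register as a monitor deadlock, so this only rules that
one cause in or out:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Hypothetical diagnostic (JDK 1.5+): periodically report monitor
    // deadlocks in the running JVM.
    public class DeadlockWatcher {

        public static void start() {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
                    while (true) {
                        long[] ids = mx.findMonitorDeadlockedThreads(); // null if none
                        if (ids != null) {
                            ThreadInfo[] infos = mx.getThreadInfo(ids);
                            for (int i = 0; i < infos.length; i++) {
                                System.err.println("DEADLOCK: "
                                    + infos[i].getThreadName()
                                    + " waiting on " + infos[i].getLockName()
                                    + " held by " + infos[i].getLockOwnerName());
                            }
                        }
                        try {
                            Thread.sleep(10000); // check every 10 seconds
                        } catch (InterruptedException e) {
                            return;
                        }
                    }
                }
            }, "deadlock-watcher");
            t.setDaemon(true);
            t.start();
        }
    }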
------- Comment #13 From 2007-09-12 16:42:14 -------
Reassigning to current wsrf developer to close/fix as appropriate