Bugzilla – Bug 4874
LRUCache problem during job submission from Condor-G to GRAM4
Last modified: 2008-12-10 15:45:23
Some days ago I saw a new error during job submission from Condor-G to GRAM4 that I had never seen before. It occurred in a test in one out of about 12000 jobs. The container was logging at info level, which is why I didn't find anything there. I'll attach the logfile of the Condor Gridmanager from the client side. The log suggests that this is probably not a GRAM4 problem but maybe a synchronization issue in the LRUCache class.
Created an attachment (id=1135): logfile of the Gridmanager from Condor on the client-side
This happened again today 3 times during a job submission of 2000 jobs using condor-g.
GT version: 4.0.3
I attached the stack trace from condor's client side Gridmanager.log:

faultDetail:
{http://xml.apache.org/axis/}stackTrace:java.lang.NullPointerException
    at org.globus.wsrf.utils.cache.LRUCache.update(LRUCache.java:123)
    at org.globus.wsrf.impl.ResourceHomeImpl.find(ResourceHomeImpl.java:267)
    at org.globus.wsrf.impl.ResourceContextImpl.getResource(ResourceContextImpl.java:164)
    at org.globus.wsrf.impl.security.authentication.DescriptorHandler.invoke(DescriptorHandler.java:61)
    at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
    at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
    at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
    at org.apache.axis.server.AxisServer.invoke(AxisServer.java:248)
    at org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:676)
    at org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:397)
    at org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:151)
    at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:302)
{http://xml.apache.org/axis/}hostname:osg-itb.ligo.caltech.edu

java.lang.NullPointerException
    at org.apache.axis.message.SOAPFaultBuilder.createFault(SOAPFaultBuilder.java:221)
    at org.apache.axis.message.SOAPFaultBuilder.endElement(SOAPFaultBuilder.java:128)
    at org.apache.axis.encoding.DeserializationContext.endElement(DeserializationContext.java:1087)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
    at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
    at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:645)
    at org.apache.axis.Message.getSOAPEnvelope(Message.java:424)
    at org.apache.axis.message.addressing.handler.AddressingHandler.processClientResponse(AddressingHandler.java:305)
    at org.apache.axis.message.addressing.handler.AddressingHandler.invoke(AddressingHandler.java:110)
    at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
    at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
    at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
    at org.apache.axis.client.AxisClient.invoke(AxisClient.java:190)
    at org.apache.axis.client.Call.invokeEngine(Call.java:2727)
    at org.apache.axis.client.Call.invoke(Call.java:2710)
    at org.apache.axis.client.Call.invoke(Call.java:2386)
    at org.apache.axis.client.Call.invoke(Call.java:2309)
    at org.apache.axis.client.Call.invoke(Call.java:1766)
    at org.globus.exec.generated.bindings.ManagedJobPortTypeSOAPBindingStub.getMultipleResourceProperties(ManagedJobPortTypeSOAPBindingStub.java:1496)
    at org.globus.exec.client.GramJob.refreshStatus(GramJob.java:1677)
    at condor.gahp.gt4.Gt4GramJobStatusHandler$JobStatusRunnable.run(Gt4GramJobStatusHandler.java:134)
    at java.lang.Thread.run(Thread.java:534)
This error happens very rarely, but when the exception is thrown the job fails. I assume that node.getValue() returns null. A very infrequently occurring problem in the LRUCache shouldn't cause a job to fail. Until we know for sure what is happening, we could catch the NullPointerException and log an error message instead of letting the exception propagate to WS-GRAM.
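The catch-and-log idea could look roughly like the sketch below. It is not taken from the Globus sources: the `GuardedCacheUpdate` class, its nested `Cache` interface and the `tryUpdate` method are hypothetical names used only for illustration, and the real `LRUCache.update` signature is assumed rather than known from this report.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of the Globus code base.
final class GuardedCacheUpdate {
    private static final Logger LOG =
            Logger.getLogger(GuardedCacheUpdate.class.getName());

    /** Stand-in for the real cache; only the update call matters here. */
    interface Cache {
        void update(Object key);
    }

    /**
     * Calls cache.update(key) and reports whether it succeeded.
     * A NullPointerException is logged and swallowed so that a failed
     * bookkeeping step does not fail the caller's resource lookup.
     */
    static boolean tryUpdate(Cache cache, Object key) {
        try {
            cache.update(key);
            return true;
        } catch (NullPointerException e) {
            LOG.log(Level.SEVERE, "Cache update failed for key " + key, e);
            return false;
        }
    }

    public static void main(String[] args) {
        Cache healthy = key -> { };
        Cache broken = key -> { throw new NullPointerException("simulated"); };
        System.out.println(tryUpdate(healthy, "job-1"));
        System.out.println(tryUpdate(broken, "job-2"));
    }
}
```

This only hides the symptom. If the cause is a race inside the cache, its internal state may still be inconsistent after the exception has been swallowed.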
I think I'm hitting this issue again (2 times with wsrf from 4.2.1, but the line number is different, maybe due to differences in version). In Gram4 it causes processing of the job to stop:

2008-12-09T13:02:45.155-06:00 WARN processing.StateProcessingTask [pool-1-thread-9,run:63] Job resource 35b3b270-c608-11dd-a1d2-b0667ead9260 not found.
java.lang.NullPointerException
    at org.globus.wsrf.utils.cache.LRUCache.update(LRUCache.java:150)
    at org.globus.wsrf.impl.ResourceHomeImpl.find(ResourceHomeImpl.java:298)
    at org.globus.exec.service.exec.processing.StateProcessingTask.run(StateProcessingTask.java:55)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
    at java.lang.Thread.run(Thread.java:595)
Have you considered switching to ehcache? It's easy to learn and it is robust.
Yes, if we continue with Gram42 development, then Gram will probably have its own job-home mechanism (so far we just used what core provided). Tom suggested using ehcache as cache implementation in this context too.
I'm going to wrap EhCache in LRUCache. Hopefully this will fix everything with no API changes.
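To illustrate the wrapping approach, here is a minimal sketch of a facade that keeps a small cache API while delegating to a backing store under a single lock. It is an assumption-laden stand-in: the real LRUCache API is not shown in this report, the class and method names are hypothetical, and a `java.util.LinkedHashMap` in access order is used in place of EhCache so the example has no external dependencies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical facade; the real LRUCache API is not shown in this report.
final class SynchronizedLruFacade<K, V> {
    private final Map<K, V> backing;

    SynchronizedLruFacade(final int maxEntries) {
        // accessOrder=true keeps entries ordered from least to most
        // recently accessed; the eldest entry is evicted once the
        // size exceeds maxEntries.
        this.backing = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Every method takes the same lock, so the access-order bookkeeping
    // never runs concurrently with a put or remove.
    synchronized V get(K key) {
        return backing.get(key);
    }

    synchronized void put(K key, V value) {
        backing.put(key, value);
    }

    synchronized V remove(K key) {
        return backing.remove(key);
    }

    synchronized int size() {
        return backing.size();
    }
}
```

Callers see only `get`, `put`, `remove` and `size`, so the backing store could later be swapped for another cache implementation without changing them.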