Bug 4874 - LRUCache problem during job submission from Condor-G to GRAM4
Status: ASSIGNED
Product: Java WS Core
Component: globus_wsrf_core
Version: unspecified
Platform: PC Linux
Importance: P3 normal
Reported: 2006-11-29 04:20 by
Modified: 2008-12-10 15:45


Attachments
logfile of the Gridmanager from Condor on the client-side (17.70 KB, text/plain)
2006-11-29 04:21, Martin Feller

Description From 2006-11-29 04:20:00
A few days ago I hit an error during job submission from Condor-G to
GRAM4 that I had never seen before.
It occurred in a test, in one out of about 12000 jobs.
The container was logging at info level, so I didn't find anything
there. I'll attach the logfile of the Condor Gridmanager from the
client side.
The log suggests that this is probably not a GRAM4 problem but
possibly a synchronization issue in the LRUCache class.
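For illustration only: the LRUCache source is not included in this report, so the following is a minimal sketch of how an LRU cache can be made safe for concurrent callers, not the Globus implementation. The class name SynchronizedLruCache is hypothetical. It builds on java.util.LinkedHashMap in access order, where even a read reorders the internal list, which is why reads take the same lock as writes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; NOT the org.globus.wsrf.utils.cache.LRUCache code.
class SynchronizedLruCache<K, V> {
    private final Map<K, V> entries;

    SynchronizedLruCache(final int capacity) {
        // accessOrder=true moves an entry to the end of the iteration order
        // on every get(), so the eldest entry is the least recently used one.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once capacity is exceeded.
                return size() > capacity;
            }
        };
    }

    // In access order, get() is a structural modification of the map,
    // so reads must be guarded by the same lock as writes.
    synchronized V get(K key) {
        return entries.get(key);
    }

    synchronized void put(K key, V value) {
        entries.put(key, value);
    }

    synchronized V remove(K key) {
        return entries.remove(key);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

If two threads touched the recency list without such a lock, one could observe a half-updated link, which is one way a NullPointerException could arise in an update path. Whether that is what happens here is not established by the log.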
------- Comment #1 From 2006-11-29 04:21:31 -------
Created an attachment (id=1135)
logfile of the Gridmanager from Condor on the client-side
------- Comment #2 From 2007-05-08 17:25:03 -------
This happened again today, 3 times during a submission of 2000 jobs using
Condor-G.
GT version: 4.0.3
Here is the stack trace from Condor's client-side Gridmanager.log:

faultDetail:
     {http://xml.apache.org/axis/}stackTrace:java.lang.NullPointerException
     at org.globus.wsrf.utils.cache.LRUCache.update(LRUCache.java:123)
     at org.globus.wsrf.impl.ResourceHomeImpl.find(ResourceHomeImpl.java:267)
     at org.globus.wsrf.impl.ResourceContextImpl.getResource(ResourceContextImpl.java:164)
     at org.globus.wsrf.impl.security.authentication.DescriptorHandler.invoke(DescriptorHandler.java:61)
     at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
     at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
     at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
     at org.apache.axis.server.AxisServer.invoke(AxisServer.java:248)
     at org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:676)
     at org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:397)
     at org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:151)
     at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:302)

     {http://xml.apache.org/axis/}hostname:osg-itb.ligo.caltech.edu

java.lang.NullPointerException
     at org.apache.axis.message.SOAPFaultBuilder.createFault(SOAPFaultBuilder.java:221)
     at org.apache.axis.message.SOAPFaultBuilder.endElement(SOAPFaultBuilder.java:128)
     at org.apache.axis.encoding.DeserializationContext.endElement(DeserializationContext.java:1087)
     at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
     at javax.xml.parsers.SAXParser.parse(SAXParser.java:345)
     at org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
     at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:645)
     at org.apache.axis.Message.getSOAPEnvelope(Message.java:424)
     at org.apache.axis.message.addressing.handler.AddressingHandler.processClientResponse(AddressingHandler.java:305)
     at org.apache.axis.message.addressing.handler.AddressingHandler.invoke(AddressingHandler.java:110)
     at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
     at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
     at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
     at org.apache.axis.client.AxisClient.invoke(AxisClient.java:190)
     at org.apache.axis.client.Call.invokeEngine(Call.java:2727)
     at org.apache.axis.client.Call.invoke(Call.java:2710)
     at org.apache.axis.client.Call.invoke(Call.java:2386)
     at org.apache.axis.client.Call.invoke(Call.java:2309)
     at org.apache.axis.client.Call.invoke(Call.java:1766)
     at org.globus.exec.generated.bindings.ManagedJobPortTypeSOAPBindingStub.getMultipleResourceProperties(ManagedJobPortTypeSOAPBindingStub.java:1496)
     at org.globus.exec.client.GramJob.refreshStatus(GramJob.java:1677)
     at condor.gahp.gt4.Gt4GramJobStatusHandler$JobStatusRunnable.run(Gt4GramJobStatusHandler.java:134)
     at java.lang.Thread.run(Thread.java:534)
------- Comment #3 From 2007-05-09 07:37:48 -------
This error happens very rarely, but when the exception is thrown the job
fails. I assume that node.getValue() returns null.
How about this: a very infrequently occurring problem in the
LRUCache shouldn't cause a job to fail. As long as we don't
know for sure what is happening, we could catch the NullPointerException
and log an error message instead of letting the exception propagate to
WS-GRAM.
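A minimal sketch of the catch-and-log idea, under stated assumptions: SafeCacheUpdate, CacheUpdater and updateQuietly are hypothetical names, and CacheUpdater merely stands in for whatever object exposes the cache's update call. It is not the actual ResourceHomeImpl or LRUCache code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration of "catch the NullPointerException and log it".
class SafeCacheUpdate {
    private static final Logger LOG =
        Logger.getLogger(SafeCacheUpdate.class.getName());

    /** Hypothetical stand-in for the cache's update(key) operation. */
    interface CacheUpdater {
        void update(Object key);
    }

    /**
     * Runs the cache update and returns true on success. If the update
     * throws a NullPointerException, the error is logged and false is
     * returned, so the caller can carry on with the resource lookup.
     */
    static boolean updateQuietly(CacheUpdater cache, Object key) {
        try {
            cache.update(key);
            return true;
        } catch (NullPointerException e) {
            LOG.log(Level.SEVERE, "LRU cache update failed for key " + key, e);
            return false;
        }
    }
}
```

This would only keep the job alive; the underlying cause would still need to be found, and the logged stack trace is what would help with that.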
------- Comment #4 From 2008-12-09 14:33:41 -------
I think I'm hitting this issue again (twice with wsrf from 4.2.1; the
line number is different, maybe due to differences between versions). In GRAM4
it causes processing of the job to stop:

2008-12-09T13:02:45.155-06:00 WARN  processing.StateProcessingTask [pool-1-thread-9,run:63] Job resource 35b3b270-c608-11dd-a1d2-b0667ead9260 not found.
java.lang.NullPointerException
        at org.globus.wsrf.utils.cache.LRUCache.update(LRUCache.java:150)
        at org.globus.wsrf.impl.ResourceHomeImpl.find(ResourceHomeImpl.java:298)
        at org.globus.exec.service.exec.processing.StateProcessingTask.run(StateProcessingTask.java:55)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)
------- Comment #5 From 2008-12-09 14:43:24 -------
Have you considered switching to ehcache?  It's easy to learn and it is robust.
------- Comment #6 From 2008-12-09 14:55:51 -------
Yes, if we continue with Gram42 development, then Gram will probably
have its own job-home mechanism (so far we have just used what core provided).
Tom suggested using ehcache as the cache implementation in this context too.
------- Comment #7 From 2008-12-10 15:44:09 -------
I'm going to wrap EhCache in LRUCache.  Hopefully this will fix everything with
no API changes.