Bug 2986 - Missing OverallStatus object in notification
Status: RESOLVED FIXED
Product: RFT
Component: RFT
Version: development
Hardware: PC Linux
Importance: P3 normal
Target Milestone: ---
Assigned To:
Depends on/Blocks: 2988
Reported: 2005-03-21 15:14 by
Modified: 2005-03-25 19:16


Attachments
container.log (4.51 KB, text/plain) - 2005-03-22 12:14, Peter Lane
synchronized some portion of start call (2.56 KB, patch) - 2005-03-22 17:49, Ravi Madduri



Description From 2005-03-21 15:14:47
I'm seeing cases when a notification from RFT comes in without an OverallStatus
object.  I'm having trouble pinning it down since there's no fault if there's no
status object.  I am seeing the following two errors in the log, though, so I
presume they are related:

2005-03-21 13:03:08,118 ERROR service.ReliableFileTransferImpl [Thread-6,start:111] Error in start
java.rmi.RemoteException: Error in startnull; nested exception is:

    at org.globus.transfer.reliable.service.ReliableFileTransferResource.start(ReliableFileTransferResource.java:519)
    at org.globus.transfer.reliable.service.ReliableFileTransferImpl.start(ReliableFileTransferImpl.java:102)
    at sun.reflect.GeneratedMethodAccessor417.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:324)
    at org.apache.axis.providers.java.RPCProvider.invokeMethod(RPCProvider.java:384)
    at org.globus.axis.providers.RPCProvider.invokeMethodSub(RPCProvider.java:104)
    at org.globus.axis.providers.PrivilegedInvokeMethodAction.run(PrivilegedInvokeMethodAction.java:39)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:379)
    at org.globus.gsi.jaas.GlobusSubject.runAs(GlobusSubject.java:49)
    at org.globus.gsi.jaas.JaasSubject.doAs(JaasSubject.java:84)
    at org.globus.axis.providers.RPCProvider.invokeMethod(RPCProvider.java:94)
    at org.apache.axis.providers.java.RPCProvider.processMessage(RPCProvider.java:281)
    at org.apache.axis.providers.java.JavaProvider.invoke(JavaProvider.java:319)
    at org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
    at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
    at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
    at org.apache.axis.handlers.soap.SOAPService.invoke(SOAPService.java:450)
    at org.apache.axis.server.AxisServer.invoke(AxisServer.java:285)
    at org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:662)
    at org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:393)
    at org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:124)
    at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:297)
Caused by:
    at org.globus.transfer.reliable.service.ReliableFileTransferResource.start(ReliableFileTransferResource.java:512)

------------------------------

2005-03-21 13:45:55,084 ERROR service.URLExpander [Thread-266,changeDir:275] Error expanding a directory URL
org.globus.ftp.exception.ServerException: Server refused performing the request. Custom message: Server refused changing directory (error code 1) [Nested exception message:  Custom message: Unexpected reply: 550 /home/lane/Development/Globus/working/ws-gram/service/java/test/unit/out.22: Not a directory.].  Nested exception is org.globus.ftp.exception.UnexpectedReplyCodeException:  Custom message: Unexpected reply: 550 /home/lane/Development/Globus/working/ws-gram/service/java/test/unit/out.22: Not a directory.
    at org.globus.ftp.vanilla.FTPControlChannel.execute(FTPControlChannel.java:310)
    at org.globus.ftp.FTPClient.changeDir(FTPClient.java:180)
    at org.globus.transfer.reliable.service.URLExpander.changeDir(URLExpander.java:269)
    at org.globus.transfer.reliable.service.URLExpander.<init>(URLExpander.java:93)
    at org.globus.transfer.reliable.service.TransferClient.transfer(TransferClient.java:640)
    at org.globus.transfer.reliable.service.TransferWork.run(TransferWork.java:620)
    at org.globus.wsrf.impl.work.WorkManagerImpl$WorkWrapper.run(WorkManagerImpl.java:342)
    at java.lang.Thread.run(Thread.java:534)
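
For illustration only, a minimal sketch of the guard the description is asking for; the names here are hypothetical and this is not the actual RFT listener code. The point is that a notification consumer which assumes the status object is present fails silently, whereas an explicit check makes the missing OverallStatus visible immediately:

    public final class StatusGuardSketch {
        // Placeholder for whatever status object the notification carries.
        static final class OverallStatus {
            int transfersFinished;
        }

        // deliver() stands in for the notification callback.
        static void deliver(OverallStatus status) {
            if (status == null) {
                // Failing loudly is what the description says is missing today:
                // without it, the absent status object is hard to pin down.
                throw new IllegalStateException("notification arrived without an OverallStatus");
            }
            System.out.println("transfers finished: " + status.transfersFinished);
        }

        public static void main(String[] args) {
            deliver(new OverallStatus());   // normal case
            try {
                deliver(null);              // the reported scenario
            } catch (IllegalStateException expected) {
                System.out.println("caught: " + expected.getMessage());
            }
        }
    }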
------- Comment #1 From 2005-03-21 21:21:13 -------
How old is your checkout? I fixed a possible NPE on March 17th and this seems
similar to what I got then.
------- Comment #2 From 2005-03-21 21:23:28 -------
Actually, looking at the line numbers in the trace, it looks like start is called
twice on a resource.
------- Comment #3 From 2005-03-21 22:48:10 -------
Unless there is an issue with uuids not being unique or something, it's not
really possible for start to be getting called multiple times.  It always does a
create right before it does the start in the same method.

But anyway, I'll update and see if it helps.
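
For reference, a minimal sketch of the create-then-start pattern described above, using hypothetical names rather than the actual ReliableFileTransferResource code: the resource key is a freshly generated UUID, so a second start on the same key would fail visibly rather than pass unnoticed:

    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    public final class CreateThenStartSketch {
        // Tracks whether start() has already run for a given resource key.
        private static final ConcurrentHashMap<String, Boolean> started = new ConcurrentHashMap<>();

        // "create": allocate a fresh key; unique unless UUIDs somehow collide.
        static String createResource() {
            String key = UUID.randomUUID().toString();
            started.put(key, Boolean.FALSE);
            return key;
        }

        // "start": flips the flag exactly once; a second call on the same key fails.
        static void startResource(String key) {
            if (!started.replace(key, Boolean.FALSE, Boolean.TRUE)) {
                throw new IllegalStateException("start called more than once for " + key);
            }
            System.out.println("started " + key);
        }

        public static void main(String[] args) {
            String key = createResource(); // create immediately followed by start,
            startResource(key);            // as the comment above describes
        }
    }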
------- Comment #4 From 2005-03-22 10:36:16 -------
I updated RFT last night and retried the test this morning.  I am still seeing
this error.
------- Comment #5 From 2005-03-22 11:51:09 -------
Container log is here:

http://www-unix.mcs.anl.gov/~lane/container.log_2986_1
------- Comment #6 From 2005-03-22 12:14:08 -------
Created an attachment (id=542)
container.log

container log with just RFT debug messages
------- Comment #7 From 2005-03-22 17:49:25 -------
Created an attachment (id=543)
synchronized some portion of start call

This hasn't given me the error with 4 clients submitting transfers. Let me know
how it works out for you.
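
For readers without the attachment, a minimal sketch of what synchronizing a portion of the start call can look like; this is hypothetical and not the actual diff in attachment 543. Only the non-reentrant setup is serialized, so concurrent clients cannot interleave it, while the transfer work itself stays outside the lock:

    public final class SynchronizedStartSketch {
        private final Object startLock = new Object();
        private boolean statusInitialized = false;

        void start() {
            synchronized (startLock) {
                // Only the status setup is serialized; concurrent callers see it
                // either fully done or not started, never half-built.
                if (!statusInitialized) {
                    statusInitialized = true;
                    System.out.println("status initialized by " + Thread.currentThread().getName());
                }
            }
            // ... the actual transfer work would continue outside the lock ...
        }

        public static void main(String[] args) throws InterruptedException {
            SynchronizedStartSketch resource = new SynchronizedStartSketch();
            Thread a = new Thread(resource::start, "client-a");
            Thread b = new Thread(resource::start, "client-b");
            a.start();
            b.start();
            a.join();
            b.join();
        }
    }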
------- Comment #8 From 2005-03-22 18:24:28 -------
The patch worked for 16 condor jobs.  I can stress it more tomorrow if need be.
------- Comment #9 From 2005-03-23 13:25:43 -------
My latest fix for this seems to work fine. 
------- Comment #10 From 2005-03-23 17:06:38 -------
I'm still getting errors resulting from a missing OverallStatus even though I'm
not seeing RepeatedlyStartedFault.  I'll attach a container log with RFT
debugging turned on.
------- Comment #11 From 2005-03-24 10:17:27 -------
Can I see the container logs for this? And also, how were you able to reproduce
this bug?
------- Comment #12 From 2005-03-24 10:39:04 -------
Here's a container log with RFT debugging turned on.  I killed the container as
soon as I noticed the missing overall status error from GRAM.

http://www-unix.mcs.anl.gov/~lane/container.log_2986_2
------- Comment #13 From 2005-03-24 10:46:18 -------
I don't see any error in this log. Can you maybe tell me the use case you ran
so I can reproduce it myself?
------- Comment #14 From 2005-03-24 11:00:54 -------
I ran the throughput tester with 50 parallel clients submitting jobs with the
following fileStageIn directive:

        <transfer>
            <sourceUrl>gsiftp://logan:2811/tmp/empty_dir/</sourceUrl>
            <destinationUrl>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/</destinationUrl>
        </transfer>
        <transfer>
            <sourceUrl>gsiftp://logan:2811/bin/echo</sourceUrl>
            <destinationUrl>file:///${GLOBUS_SCRATCH_DIR}/${THROUGHPUT_TESTER_JOB_ID}/my_echo</destinationUrl>
        </transfer>
------- Comment #15 From 2005-03-25 15:22:50 -------
After clearing out some other bugs, I am able to consistently reproduce this
with as few as 4 parallel clients.
------- Comment #16 From 2005-03-25 16:02:47 -------
Can you attach a container log with this? I ran a long test with 6 clients
submitting rft requests in parallel for ~5 hours and didn't see this bug once.
------- Comment #17 From 2005-03-25 19:16:58 -------
This looks like my fault.  The missing status object was from the
StagingListener, not the actual notification message from RFT (actually, there
wasn't any message yet).  So basically I was trying to access the
StagingListener before it actually got a notification.
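
A minimal sketch of that race, with hypothetical names rather than the actual GRAM StagingListener API: the listener's status field stays null until the first notification is delivered, so reading it any earlier looks exactly like RFT sending a notification without an OverallStatus:

    public final class ListenerRaceSketch {
        // Hypothetical stand-in for the listener; not the real GRAM class.
        static final class StagingListener {
            private volatile Object overallStatus; // null until a notification arrives

            void deliver(Object status) {
                this.overallStatus = status;
            }

            Object getOverallStatus() {
                return overallStatus;
            }
        }

        public static void main(String[] args) {
            StagingListener listener = new StagingListener();
            // Reading before any notification has been delivered yields null,
            // which was mistaken for a missing status on the RFT side.
            if (listener.getOverallStatus() == null) {
                System.out.println("no notification yet; nothing missing on the RFT side");
            }
            listener.deliver(new Object());
            System.out.println("status present after delivery: " + (listener.getOverallStatus() != null));
        }
    }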