Bug 3864 - RFT errors impact GRAM stability

Status: RESOLVED FIXED
Component: RFT
Version: 4.0.1
Platform: PC Linux
Importance: P3 blocker
Assigned To:
Reported: 2005-11-01 14:40
Modified: 2005-11-17 15:12


Attachments

kill -QUIT pid (80.36 KB, text/plain), 2005-11-01 15:51, Jens-S. Vöckler
GridmanagerLog.voeckler (22.01 KB, application/octet-stream), 2005-11-10 10:27, Jens-S. Vöckler

Description From 2005-11-01 14:40:29
After a failure like

2005-11-01 14:33:37,699 ERROR exec.StateMachine [Thread-22,processCompletedStagi
ngJob:2520] Unable to destroy transfer.
AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
 faultSubcode: 
 faultString: java.net.SocketTimeoutException: Read timed out
 faultActor: 
 faultNode: 
 faultDetail: 
        {http://xml.apache.org/axis/}stackTrace:java.net.SocketTimeoutException:
 Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at org.globus.gsi.gssapi.SSLUtil.read(SSLUtil.java:37)
        [...]

the remote container refuses to execute subsequent jobs:

$ time globusrun-ws -submit \
-F https://ept.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService \
-c /bin/touch $HOME/asdf
Submitting job...Failed.
globusrun-ws: Error submitting job
globus_soap_message_module: Failed sending request
ManagedJobFactoryPortType_createManagedJob.
globus_xio: Operation was canceled
globus_xio: Operation timed out

real    4m0.177s
user    0m0.180s
sys     0m0.010s

What debug levels will you need? This is the 4.0.1 branch as of last week.
------- Comment #1 From 2005-11-01 14:45:27 -------
If the container is still running, make sure to do kill -QUIT <jvm process> and
attach the output.
------- Comment #2 From 2005-11-01 14:57:30 -------
too late :-(
------- Comment #3 From 2005-11-01 15:51:23 -------
Created an attachment (id=734) [details]
kill -QUIT pid

OK, I was able to reproduce the behavior. container.log is attached.
------- Comment #4 From 2005-11-01 23:35:45 -------
This is a serious deadlock between RFT and the Delegation Service, more
specifically between RFTResource.remove() and DelegationResource.refresh().
First, the resource's remove() operation is called under a lock associated with
the resource key. Meanwhile, DelegationResource.refresh() obtains a lock on the
DelegationResource and notifies all listeners of the new credential. The RFT
credential listener performs an RFTResourceHome.find() operation, which tries to
acquire the lock for that key; since the lock is already held by the thread
calling RFTResource.remove(), it blocks. At the same time (and still under the
key lock), RFTResource.remove() blocks trying to obtain the lock on the
DelegationResource in order to remove its credential listener.
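
What comment #4 describes is a classic lock-ordering deadlock: two threads take the
same two locks in opposite orders. The following minimal Java sketch uses
hypothetical, simplified names (it is not the actual Globus source) purely to
illustrate the two opposing lock orders:

// Hypothetical sketch of the deadlock described in comment #4.
public class LockOrderDeadlockSketch {
    private final Object keyLock = new Object();             // per-resource key lock
    private final Object delegationMonitor = new Object();   // DelegationResource monitor

    // Thread A: the RFT resource removal path.
    void removeRftResource() {
        synchronized (keyLock) {                    // (1) key lock acquired
            synchronized (delegationMonitor) {      // (4) blocks: held by thread B
                // remove the RFT credential listener
            }
        }
    }

    // Thread B: the delegation refresh path.
    void refreshDelegatedCredential() {
        synchronized (delegationMonitor) {          // (2) delegation lock acquired
            // Notifying listeners: the RFT listener does a find() on the key,
            // which needs the key lock.
            synchronized (keyLock) {                // (3) blocks: held by thread A
                // update the credential on the found resource
            }
        }
    }
}

With thread A between steps (1) and (4) and thread B between steps (2) and (3),
neither can make progress.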
------- Comment #5 From 2005-11-07 12:38:50 -------
Making this a blocker for 4.0.2
------- Comment #6 From 2005-11-08 09:27:18 -------
Reassigning to Rachana as she may be looking at this from the delegation service
perspective sometime this week.
------- Comment #7 From 2005-11-09 13:23:54 -------
Committed some changes to the Delegation Service to prevent this deadlock. Jens,
can you please update from trunk and run your tests? If a patch would work better,
let me know and I'll add one to this bug. Thanks.
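
The bug does not record the exact change, but one common way to break this kind of
deadlock is to stop invoking listeners while holding the DelegationResource
monitor: snapshot the listener list under the lock and run the callbacks outside
it. The Java sketch below illustrates that pattern under this assumption; it is
not the actual Delegation Service commit.

import java.util.ArrayList;
import java.util.List;

// Hypothetical notifier that never calls listeners while holding its own monitor.
class CredentialNotifier {
    private final List<Runnable> listeners = new ArrayList<>();

    synchronized void addListener(Runnable listener) {
        listeners.add(listener);
    }

    void refresh() {
        List<Runnable> snapshot;
        synchronized (this) {
            // Take a copy of the listener list while holding the lock...
            snapshot = new ArrayList<>(listeners);
        }
        // ...but invoke the callbacks without it, so a listener that needs another
        // lock (e.g. the RFT key lock) can no longer deadlock against this monitor.
        for (Runnable listener : snapshot) {
            listener.run();
        }
    }
}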
------- Comment #8 From 2005-11-09 14:29:18 -------
If it is checked into the *branch*, then I can check it out and compile.
However, compilation will take about 3 hours. 
------- Comment #9 From 2005-11-09 14:34:38 -------
Applied fix to 4.0 branch. Please update and test. Thanks.
------- Comment #10 From 2005-11-10 10:26:11 -------
While I no longer see the remote container block, rerunning the same test workflow
as before with Condor-G steadily increases the load on my submit host (the load
average is over 40 now, with java at 99% CPU), and none of my jobs ever leave the
UNSUBMITTED state. I see various timeouts in the Gridmanager log, some when it
tries to extend delegations.

I'll have to check a couple of other things, but I fear the fix is breaking
Condor-G client submissions. I *can* submit single jobs, with and without advance
delegation. But 50 jobs at a time doesn't fly any more:

018 (581.000.000) 11/10 09:46:02 Globus job submission failed!
    Reason: 0 GT4_GRAM_JOB_SUBMIT timed out
...
012 (581.000.000) 11/10 09:46:02 Job was held.
        Globus error: GT4_GRAM_JOB_SUBMIT timed out
        Code 0 Subcode 0
...

I sent the (local) java several SIGQUIT, but it didn't stop. 
------- Comment #11 From 2005-11-10 10:27:15 -------
Created an attachment (id=748) [details]
GridmanagerLog.voeckler

I've attached the gridmanager logfile, because it may contain clues about what is
going on in the client...
------- Comment #12 From 2005-11-10 10:50:28 -------
Jens,

This does not make any sense to me. Can you send the usual logs: the container
log, the kill -QUIT output (btw, that does not stop the container, it just tells
the JVM to dump its threads), and so on?
------- Comment #13 From 2005-11-10 11:19:24 -------
What format is the attachment? It's binary data, but gunzip and unzip don't
like it.
------- Comment #14 From 2005-11-10 11:21:35 -------
Try using bunzip2 on it.
------- Comment #15 From 2005-11-10 11:36:01 -------
The log is bzip2 compressed. 

I had stopped the container at this point (no QUIT, sorry), and the container log
just contains the "usual" blubber about the MDS null pointer exception. I stopped
the client side rather ungracefully.

I will try to follow Rachana's suggestions and restart once they are integrated
into my workflow code.
------- Comment #16 From 2005-11-10 16:39:41 -------
OK, so the run without delegation in my microjobs (PRE/POST) went well. The
results and all log files, including QUIT dumps in the container logs, can be
found at (I ran against 3 hosts): 

http://griodine.uchicago.edu/ivdgl1/50x8x120/run0009/

I noticed a couple of non-GLUE warnings like:

WARN  impl.SimpleSubscriptionTopicListener [Thread-3601,topicChanged:108] Failed
to send notification for subscription with key
'9ebceed0-5234-11da-b591-e7b99b2b39de'

in the terminable log. 

In the meantime, the latest compilation finished. I will update all sites and try
to run again with delegation enabled. That is the scenario which originally
provoked the deadlock.
------- Comment #17 From 2005-11-10 17:15:21 -------
These notification failed warnings can be ignored. This is already fixed in 
trunk.
------- Comment #18 From 2005-11-11 11:08:59 -------
Yes, the GLUE warnings no longer happen with the new code. The last run, this time
with delegation, went well:

http://griodine.uchicago.edu/ivdgl1/50x8x120/run0010/

Curiously, it took longer than the run without delegation (I need to investigate),
but that may be an artefact of all the SC stuff happening on our machines.

At a superficial glance there were no more strange client stalls, so I am willing
to concede that the behavior in comment #10 was a singularity.

I've sent the containers each a QUIT before stopping them. The container logs
are in the directory above. Only terminable has warnings like

2005-11-10 18:53:42,508 WARN  impl.SimpleSubscriptionTopicListener
[Thread-2320,topicChanged:108] Failed to send notification for subscription with
key '976810b0-524d-11da-8d37-e78d2ecc4988'

I believe things look good. 
------- Comment #19 From 2005-11-11 11:14:15 -------
I see some 

2005-11-10 18:34:10,328 ERROR container.GSIServiceThread
[ServiceThread-977,process:135] Error processing request
java.io.EOFException
	at ...

I presume these occur in the wild, when script kiddies and botnets scan well-known
ports like 8443?
------- Comment #20 From 2005-11-11 11:34:24 -------
Jens, 

The EOF could occur for various reasons; I'm not sure we need to delve into them.
Do your tests work and do jobs run without issues?

Rachana
------- Comment #21 From 2005-11-11 11:41:06 -------
No, we do not need to delve into the EOF message (I strongly suspect script kiddies
and port scanners). I cannot detect any issues, except that the workflow with
delegation had a speed-up of only 8.5 while the one without had 10. However, since
this is statistically insignificant data, I think everything is working well. If
the trend continues, i.e. I repeatedly see that running with delegation is slower
than without, I'll let you know (because it should be faster, from what Stu tells
me).

------- Comment #22 From 2005-11-11 11:58:57 -------
I am not sure I understand why we are expecting jobs with delegation to run faster
than jobs that don't need delegation.
------- Comment #23 From 2005-11-11 12:05:46 -------
I don't understand "they don't need delegation". Every time I contact the remote
side, I need to present credentials, either as a user proxy or as a delegated EPR.
I was told that delegating once and reusing the delegation EPR is faster than
shipping the certificate proxy with each request (and verifying it at the server
each time). I am not aware that I can bypass GSI...
------- Comment #24 From 2005-11-11 12:25:52 -------
If you are not staging and the job you are running does not need credentials,
there is no need to delegate. You will still need some client credentials to talk
to the service.

I'm sorry, but I don't understand the alternate way of delegating - "shipping and
verifying on the server side". Do you mean using secure conversation and
delegation? If so, I would also expect that delegating once using the delegation
service and passing the EPR to each job should be faster than using secure
conversation to delegate for each job.
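
To make the comparison in comment #24 concrete, the sketch below contrasts the two
submission patterns. JobClient, DelegationClient, and the EPR handle are
hypothetical placeholders, not the Globus client API; the point is only that
pattern A pays the delegation cost once, while pattern B pays it once per job.

import java.util.List;

class DelegationPatternsSketch {
    interface DelegationClient { String delegateOnce(); }             // returns an EPR handle
    interface JobClient { void submit(String command, String delegationEpr); }

    // Pattern A: delegate once, reuse the delegation EPR for every job.
    static void sharedDelegation(DelegationClient dc, JobClient jc, List<String> commands) {
        String epr = dc.delegateOnce();            // single delegation round-trip
        for (String command : commands) {
            jc.submit(command, epr);               // each job only references the EPR
        }
    }

    // Pattern B: each submission performs its own delegation (e.g. via secure
    // conversation), paying the delegation overhead once per job.
    static void perJobDelegation(DelegationClient dc, JobClient jc, List<String> commands) {
        for (String command : commands) {
            jc.submit(command, dc.delegateOnce());
        }
    }
}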
------- Comment #25 From 2005-11-17 15:12:18 -------
As per discussions with Jens, closing the bug.

Rachana