Bugzilla – Bug 3864
RFT errors impact GRAM stability
Last modified: 2005-11-17 15:12:18
You need to log in before you can comment on or make changes to this bug.
After a failure like 2005-11-01 14:33:37,699 ERROR exec.StateMachine [Thread-22,processCompletedStagi ngJob:2520] Unable to destroy transfer. AxisFault faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException faultSubcode: faultString: java.net.SocketTimeoutException: Read timed out faultActor: faultNode: faultDetail: {http://xml.apache.org/axis/}stackTrace:java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at org.globus.gsi.gssapi.SSLUtil.read(SSLUtil.java:37) [...] the remote container refuses to execute subsequent jobs: $ time globusrun-ws -submit \ -F https://ept.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService \ -c /bin/touch $HOME/asdf Submitting job...Failed. globusrun-ws: Error submitting job globus_soap_message_module: Failed sending request ManagedJobFactoryPortType_createManagedJob. globus_xio: Operation was canceled globus_xio: Operation timed out real 4m0.177s user 0m0.180s sys 0m0.010s What debug levels will you need? This is 4.0.1 branch as of last week.
If the container is still running make sure to do kill -QUIT <jvm process> and attach the output.
too late :-(
Created an attachment (id=734) [details] kill -QUIT pid OK, I was able to reproduce the behavior. container.log is attached.
This is a serious deadlock situation between RFT and the Delegation service. More specifically between RFTResource.remove() & DelegationResource.refresh(). First, the resource .remove() operation is called under a lock associated with the resource key. Now, the DelegationResource.refresh() obtains a lock on DelegationResource and notifies all listeners of the new credential. The RFT credential listener performs RFTResourceHome.find() operation which tries to get a lock for the key. But since that lock is already aquired by another thread calling the RFTResoure.remove() it blocks. But at the same time (and under the key lock) the RFTResoure.remove() blocks trying to obtain a lock on the DelegationResource to remove its credential listener.
Making this a blocker for 4.0.2
Reassigning to Rachana as she may be looking at this from delegation service perspective sometime this week.
Committed some changes to Delegation Service to prevent this deadlock. Jens, can you please update from trunk and run tests ? If a patch would work better, let me know and I'll add one to this bug. Thanks.
If it is checked into the *branch*, then I can check it out and compile. However, compilation will take about 3 hours.
Applied fix to 4.0 branch. Please update and test. Thanks.
While I don't see the blocked remote container any more, rerunning the same test workflow as before with Condor-G steadily increases the load on my submit host (I am over 40 now, and 99% CPU in java), and none of my jobs ever leave state UNSUBMITTED. I see various timeouts in the Gridmanager log, some when it tries to extend delegations. I'll have to check a couple of other things, but I fear the fix is breaking Condor-G client submissions. I *can* submit single jobs, with and without advance delegation. But 50 jobs at a time doesn't fly any more: 018 (581.000.000) 11/10 09:46:02 Globus job submission failed! Reason: 0 GT4_GRAM_JOB_SUBMIT timed out ... 012 (581.000.000) 11/10 09:46:02 Job was held. Globus error: GT4_GRAM_JOB_SUBMIT timed out Code 0 Subcode 0 ... I sent the (local) java several SIGQUIT, but it didn't stop.
Created an attachment (id=748) [details] GridmanagerLog.voeckler I've attached the gridmanager logfile, because it may contain clues what goes on in the client...
Jens, This does not make any sense to me. Can you send the usual logs? the container log, kill -QUIT output (btw, that does not stop the container, just tells the JVM to dump its threads), etc.
What format is the attachment? It's binary data, but gunzip and unzip don't like it.
Try using bunzip2 on it.
The log is bzip2 compressed. I had stopped the container at this point (no QUIT, sorry), and the container log just contains the "usual" blubber about the MDS null pointer exception. I stopped the client side rather unbenign. I will try to follow Rachana's suggestions, and restart once integrated into my workflow code.
OK, so the run without delegation in my microjobs (PRE/POST) went well. The results and all log files, including QUIT dumps in the container logs, can be found at (I ran against 3 hosts): http://griodine.uchicago.edu/ivdgl1/50x8x120/run0009/ I noticed a couple of non-GLUE warnings like: WARN impl.SimpleSubscriptionTopicListener [Thread-3601,topicChanged:108] Failed to send notification for subscription with key '9ebceed0-5234-11da-b591-e7b99b2b39de' in the terminable log. In the mean-time, the latest compilation finished. I will update all sites, and try to run again with delegation enabled. That is the scenario which originally provoked the dead-lock.
These notification failed warnings can be ignored. This is already fixed in trunk.
Yes, the GLUE warnings do no longer happen in the new code. The last run, this time with delegation, went well: http://griodine.uchicago.edu/ivdgl1/50x8x120/run0010/ Curiously, it took longer than not using delegation (need to investigate), but that may be an artefact of all the SC stuff happening on our machines. Superficially looking, there were no more strange client stalls. So, I am willing to concede that comment #10 was a singularity. I've sent the containers each a QUIT before stopping them. The container logs are in the directory above. Only terminable has warnings like 2005-11-10 18:53:42,508 WARN impl.SimpleSubscriptionTopicListener [Thread-2320,topicChanged:108] Failed to send notification for subscription with key '976810b0-524d-11da-8d37-e78d2ecc4988' I believe things look good.
I see some 2005-11-10 18:34:10,328 ERROR container.GSIServiceThread [ServiceThread-977,process:135] Error processing request java.io.EOFException at ... I presume these occur in the wild, when script kiddies and botnets scan well-known ports like 8443 ?
Jens, The EOF could be for various reasons, not sure we need to delve into them. Do your tests work and jobs run without issues ? Rachana
No, we do not need to delve into the EOF message (I strongly suspect script kiddies and port scanners). I cannot detect any issues except that the with delegation workflow had only a speed-up of 8.5 while without it had 10. However, since this is statistically insignificant data, I think everything is working well. If the trend continues, e.g. I see repeatedly w/delegation is slower then w/o delegation, I'll let you know (because it should be faster from what Stu tells me).
I am not sure I understand why we are expecting jobs with delegation to run faster then jobs that don't need delegation.
I don't understand "they don't need delegation"? Every time I contact the remote side, I will need to present credentials either as user proxy or as delegated EPR. I was told that delegating once, and reusing the delegation EPR is faster then shipping (and at the server, verifying each time) the certificate proxy with each request. I am not aware that I can by-pass GSI...
If you are not staging and the job you are running does not need credentials, there is no need to delegate. You will still need some client credentials to talk to the service. I'm sorry, but I don't understand the alternate way of delegating - "shipping and verifying on server side". You mean using secure conversation and delegation ? If so, I would also expect delegating once using delegation service and passing EPR to each job, should be faster than using secure conversation to delegate for each job.
As per discussions with Jens, closing the bug. Rachan