Bugzilla – Bug 3864
RFT errors impact GRAM stability
Last modified: 2005-11-17 15:12:18
After a failure like
2005-11-01 14:33:37,699 ERROR exec.StateMachine [Thread-22,processCompletedStagingJob:2520] Unable to destroy transfer.
faultString: java.net.SocketTimeoutException: Read timed out
Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
the remote container refuses to execute subsequent jobs:
$ time globusrun-ws -submit \
-F https://ept.uchicago.edu:8443/wsrf/services/ManagedJobFactoryService \
-c /bin/touch $HOME/asdf
globusrun-ws: Error submitting job
globus_soap_message_module: Failed sending request
globus_xio: Operation was canceled
globus_xio: Operation timed out
What debug levels will you need? This is the 4.0.1 branch as of last week.
If the container is still running make sure to do kill -QUIT <jvm process> and
attach the output.
too late :-(
Created an attachment (id=734) [details]
kill -QUIT pid
OK, I was able to reproduce the behavior. container.log is attached.
This is a serious deadlock between RFT and the Delegation service, more
specifically between RFTResource.remove() and DelegationResource.refresh().
First, the resource's remove() operation is called under a lock associated with
the resource key. Meanwhile, DelegationResource.refresh() obtains a lock on the
DelegationResource and notifies all listeners of the new credential. The RFT
credential listener performs an RFTResourceHome.find() operation, which tries
to acquire the lock for the key. Since that lock is already held by the other
thread calling RFTResource.remove(), it blocks. At the same time (and still
under the key lock), RFTResource.remove() blocks trying to obtain the lock on
the DelegationResource in order to remove its credential listener.
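The lock cycle described above can be reproduced with a minimal, self-contained sketch (the class, method, and lock names below are stand-ins for illustration, not the actual GT4 code). A timed tryLock is used so the demo reports the circular wait and terminates instead of actually hanging:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockSketch {
    // Stand-ins for the two locks in the report: the per-resource key lock
    // held by RFTResource.remove(), and the DelegationResource monitor
    // held by DelegationResource.refresh().
    static final ReentrantLock keyLock = new ReentrantLock();
    static final ReentrantLock delegationLock = new ReentrantLock();
    static final CountDownLatch bothHeld = new CountDownLatch(2);
    static final CountDownLatch doneTrying = new CountDownLatch(2);

    public static void main(String[] args) throws InterruptedException {
        // remove() path: takes the key lock first, then wants the
        // delegation lock to unregister its credential listener.
        Thread remove = worker("remove()", keyLock, delegationLock);
        // refresh() path: takes the delegation lock first, then (via the
        // RFT listener and RFTResourceHome.find()) wants the key lock.
        Thread refresh = worker("refresh()", delegationLock, keyLock);
        remove.start(); refresh.start();
        remove.join(); refresh.join();
    }

    static Thread worker(String name, ReentrantLock first, ReentrantLock second) {
        return new Thread(() -> {
            first.lock();
            try {
                bothHeld.countDown();
                bothHeld.await();              // both first-level locks are now held
                // Timed tryLock instead of a blocking lock(), so the demo
                // terminates; with lock() both threads would hang forever.
                if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                    second.unlock();
                } else {
                    System.out.println(name + " blocked: circular wait");
                }
                doneTrying.countDown();
                doneTrying.await();            // keep holding until both have tried
            } catch (InterruptedException ignored) {
            } finally {
                first.unlock();
            }
        });
    }
}
```

Since each thread keeps its first lock until both have attempted the second, both attempts time out, which is exactly the circular wait a `kill -QUIT` thread dump would show as two BLOCKED threads.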
Making this a blocker for 4.0.2
Reassigning to Rachana as she may be looking at this from delegation service
perspective sometime this
Committed some changes to the Delegation Service to prevent this deadlock.
Jens, can you please update from trunk and run tests? If a patch would work
better, let me know and I'll add one to this bug. Thanks.
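The bug does not include the committed diff, but a common shape for this kind of fix is to stop invoking listener callbacks while holding the resource's own lock. A hedged sketch (hypothetical names, not the actual Delegation Service code):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Sketch of the "notify outside the lock" pattern: state is updated under
// the resource lock, but listener callbacks run after it is released.
public class RefreshFixSketch {
    interface CredentialListener { void credentialRefreshed(String credential); }

    // CopyOnWriteArrayList allows safe iteration while listeners are
    // added/removed concurrently, without holding the resource lock.
    private final List<CredentialListener> listeners = new CopyOnWriteArrayList<>();
    private final Object lock = new Object();
    private String credential;

    public void addListener(CredentialListener l) { listeners.add(l); }
    public void removeListener(CredentialListener l) { listeners.remove(l); }

    public void refresh(String newCredential) {
        synchronized (lock) {
            this.credential = newCredential;   // update state under the lock
        }
        // Notify outside the lock: a listener that calls back into code
        // taking other locks (e.g. a resource-key lock) can no longer form
        // a lock cycle with this resource's lock.
        for (CredentialListener l : listeners) {
            l.credentialRefreshed(newCredential);
        }
    }

    public static void main(String[] args) {
        RefreshFixSketch r = new RefreshFixSketch();
        r.addListener(c -> System.out.println("listener saw " + c));
        r.refresh("proxy-42");
    }
}
```

With this ordering, refresh() never holds its own lock while a listener runs, so the remove()/refresh() cycle described earlier cannot occur.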
If it is checked into the *branch*, then I can check it out and compile.
However, compilation will take about 3 hours.
Applied fix to 4.0 branch. Please update and test. Thanks.
While I no longer see the blocked remote container, rerunning the same
workflow as before with Condor-G steadily increases the load on my submit host
(the load average is over 40 now, with java at 99% CPU), and none of my jobs
ever leave state UNSUBMITTED. I see various timeouts in the Gridmanager log,
some when it tries to extend delegations.
I'll have to check a couple of other things, but I fear the fix is breaking
Condor-G client submissions. I *can* submit single jobs, with and without
advance delegation. But 50 jobs at a time doesn't fly any more:
018 (581.000.000) 11/10 09:46:02 Globus job submission failed!
Reason: 0 GT4_GRAM_JOB_SUBMIT timed out
012 (581.000.000) 11/10 09:46:02 Job was held.
Globus error: GT4_GRAM_JOB_SUBMIT timed out
Code 0 Subcode 0
I sent the (local) java several SIGQUIT, but it didn't stop.
Created an attachment (id=748) [details]
I've attached the gridmanager logfile, because it may contain clues about what
is going on in the client...
This does not make any sense to me. Can you send the usual logs? the container
log, kill -QUIT output (btw, that does not stop the container, just tells the
JVM to dump its threads), etc.
What format is the attachment? It's binary data, but gunzip and unzip don't
recognize it.
Try using bunzip2 on it.
The log is bzip2 compressed.
I had stopped the container at this point (no QUIT, sorry), and the container
log just contains the "usual" blubber about the MDS null pointer exception. I
stopped the client side rather ungracefully.
I will try to follow Rachana's suggestions, and restart once the fix is
integrated into my installation.
OK, so the run without delegation in my microjobs (PRE/POST) went well. The
results and all log files, including QUIT dumps in the container logs, can be
found at (I ran against 3 hosts):
I noticed a couple of non-GLUE warnings like:
WARN impl.SimpleSubscriptionTopicListener [Thread-3601,topicChanged:108] Failed
to send notification for subscription with key
in the terminable log.
In the mean-time, the latest compilation finished. I will update all sites, and
try to run again with delegation enabled. That is the scenario which originally
provoked the dead-lock.
These notification failed warnings can be ignored. This is already fixed in
Yes, the GLUE warnings no longer happen in the new code. The last run, this
time with delegation, went well:
Curiously, it took longer than not using delegation (need to investigate), but
that may be an artefact of all the SC stuff happening on our machines.
At a superficial glance, there were no more strange client stalls. So I am
willing to concede that comment #10 was a one-off.
I've sent the containers each a QUIT before stopping them. The container logs
are in the directory above. Only terminable has warnings like
2005-11-10 18:53:42,508 WARN impl.SimpleSubscriptionTopicListener
[Thread-2320,topicChanged:108] Failed to send notification for subscription with
I believe things look good.
I see some
2005-11-10 18:34:10,328 ERROR container.GSIServiceThread
[ServiceThread-977,process:135] Error processing request
I presume these occur in the wild, when script kiddies and botnets scan
well-known ports like 8443?
The EOF could be for various reasons, not sure we need to delve into them. Do
your tests work and jobs run without issues?
No, we do not need to delve into the EOF message (I strongly suspect script
kiddies and port scanners). I cannot detect any issues except that the
with-delegation workflow had a speed-up of only 8.5 while the one without had
10. However, since this is statistically insignificant data, I think
everything is working well. If the trend continues, i.e. I repeatedly see that
with delegation is slower than without delegation, I'll let you know (because
it should be faster from what Stu said).
I am not sure I understand why we expect jobs with delegation to run faster
than jobs that don't need delegation.
I don't understand "they don't need delegation". Every time I contact the
site, I need to present credentials, either as a user proxy or as a delegated
EPR. I was told that delegating once, and reusing the delegation EPR, is
faster than shipping (and, at the server, verifying each time) the certificate
proxy with each request. I am not aware that I can bypass GSI...
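The trade-off being discussed, delegating once and passing a reusable handle to each job versus paying the full delegate-and-verify round trip per job, amounts to caching an expensive setup step. A generic sketch (all names hypothetical, not the Globus API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: "delegate once, reuse the EPR" modeled as a cache
// of an expensive setup step shared by many job submissions.
public class DelegateOnceSketch {
    static final AtomicInteger delegations = new AtomicInteger();

    // Stands in for the expensive delegate-and-verify round trip.
    static String delegate(String proxy) {
        delegations.incrementAndGet();
        return "epr-for-" + proxy;   // the reusable delegation handle
    }

    static final Map<String, String> eprCache = new ConcurrentHashMap<>();

    // Each job submission reuses the cached handle instead of re-delegating.
    static String submitJob(String proxy) {
        String epr = eprCache.computeIfAbsent(proxy, DelegateOnceSketch::delegate);
        return "submitted with " + epr;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 50; i++) {
            submitJob("user-proxy");
        }
        // 50 submissions, but the expensive delegation happened once.
        System.out.println("delegations=" + delegations.get());
    }
}
```

This is why reusing the delegation EPR is expected to be cheaper than presenting and verifying the proxy on every request: the per-job cost drops to a lookup.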
If you are not staging and the job you are running does not need credentials,
there is no need to delegate. You will still need some client credentials to
talk to the service.
I'm sorry, but I don't understand the alternate way of delegating - "shipping
and verifying on server side". Do you mean using secure conversation and
delegation? If so, I would also expect that delegating once using the
delegation service and passing the EPR to each job should be faster than using
secure conversation to delegate for each job.
As per discussions with Jens, closing the bug.