Bugzilla – Bug 3411
Condor-G and pre-WS GT4 problems
Last modified: 2005-11-11 16:03:28
I am running Condor-G's gridmanager against pre-WS GT4.0.0. I am testing how reliably Condor-G reconnects to a job after the submit host has fallen over. The reconnection works fine if the job is still running. However, if the job finished in the meantime, it gets restarted. This is a "no go" for our ATLAS users -- successfully finished jobs must not be run again. I dug into the details with Jaime. Here is Jaime's message:

It looks like the jobmanager got confused by a somewhat unusual sequence of calls from condor-g for job 884.0 [note: this condor-id is from a prior run]. Below is an annotated version of the sequence of events after the gridmanager restarts. The indented lines are the replies to the previous client command. The callbacks can appear between a command and its response. This is my own shorthand and hopefully not too confusing:

callback register()
    rc=0, state=4, fc=155 (stage out failed)
stdio_update( "bad rsl" )
# this is an intentionally bad rsl, the gridmanager takes the appropriate
# action below
    rc=94, state=4, fc=0
# this is strange, the failure code should still be 155
# now we stop and restart the jobmanager to update stage out targets
signal( "stop jobmanager" )
    rc=0, state=4, fc=155
restart jm()
    rc=110 (waiting for commit), contact=...
signal( commit )
    rc=0, state=128 (stageout), fc=0
# the gridmanager hasn't processed the new stageout state and still has
# the job marked as active, so it stops the jobmanager because the grid
# monitor is active
signal( "stop jobmanager" )
callback state=128, fc=0
    rc=0, state=128, fc=0
callback state=4, fc=130 (jm stopped)
# now the gridmanager processes the stageout state and realizes it needs
# to restart the jobmanager again
restart jm()
    rc=110, contact=...
signal( commit )
    rc=0, state=4, fc=130
# this is strange, the state should be 128 and the failure code 0
# (like above)
job status()
callback state=64, fc=130
# hmm, it looks like the jobmanager has now regressed and resubmitted
# the job, but the failure code is still the wrong value of 130
    rc=0, state=64, fc=130
callback state=1, fc=130
# regular job execution sequence follows, but with fc=130 the whole
# way through

I think it's time to consult the globus folks. Find and save the appropriate jobmanager logs if you can.

Back to Jens: The files from a re-run can be found at http://griodine.uchicago.edu/~voeckler/ATLAS/condorg/ The directory contains the submit file, local log file, Gridmanager log file, and GRAM jobmanager log files. As far as I understand Jaime, Globus is producing unexpected results. The service I am running against is a pre-WS GT4, not GT2.
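The numeric values in Jaime's shorthand can be hard to follow by eye. The following is a minimal Python sketch for annotating one line of the trace. It is not part of Condor-G or Globus; the function name `describe` and both lookup tables are illustrative. The state names follow the GRAM protocol job state constants as best understood here (an assumption, not taken from the thread), and the code table contains only the values that the message above glosses itself (110, 130, 155), so other codes such as 94 are shown as `?`.

```python
import re

# GRAM job state values. Names assumed from the GRAM protocol constants;
# the thread itself only confirms 128 = stageout.
JOB_STATES = {
    1: "PENDING", 2: "ACTIVE", 4: "FAILED", 8: "DONE",
    16: "SUSPENDED", 32: "UNSUBMITTED", 64: "STAGE_IN", 128: "STAGE_OUT",
}

# Return codes (rc) and failure codes (fc): only values glossed in the thread.
CODES = {
    0: "no error",
    110: "waiting for commit",
    130: "jm stopped",
    155: "stage out failed",
}

_FIELD = re.compile(r"\b(rc|state|fc)=(\d+)")


def describe(line):
    """Annotate every rc=, state= and fc= field found in one trace line."""
    parts = []
    for key, raw in _FIELD.findall(line):
        value = int(raw)
        table = JOB_STATES if key == "state" else CODES
        parts.append(f"{key}={value} ({table.get(value, '?')})")
    return ", ".join(parts)


if __name__ == "__main__":
    print(describe("rc=0, state=4, fc=155"))
    print(describe("callback state=64, fc=130"))
```

Read this way, the suspicious reply `rc=0, state=4, fc=130` after the second restart says "failed, jm stopped", where the earlier restart had reported state 128 (stage out) with no failure code.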
Is this behavior easily reproducible?
Here is the client-viewed sequence of events for the new test run:

short job:
condor job id: 900.0
globus job id: 4479/1116961085
jobmanager logs: 4479, 4574, 4669, 4736

13:58:05 job submit (jm pid 4479)
13:58:05 rc=110
13:58:10 signal commit
13:58:10 rc=0, fc=0, state=32
13:58:10 callback state=64, fc=0
13:58:11 callback state=1, fc=0
13:58:11 signal stop jm
13:58:11 rc=0, fc=0, state=1
13:58:11 callback state=4, fc=130
# grid monitor fails
13:58:36 job restart (jm pid 4574)
13:58:37 rc=110
13:58:37 signal commit
13:58:37 rc=0, fc=0, state=1
13:59:08 callback state=2, fc=0
# gridmanager dies and restarts
14:09:03 callback register
14:09:03 rc=0, fc=155, state=4
14:09:03 signal stdio update "bad rsl"
14:09:03 rc=94, fc=0, state=4
14:09:03 signal stop jm
14:09:03 rc=0, fc=155, state=4
14:09:03 job restart (jm pid 4669)
14:09:04 rc=110
14:09:08 signal commit
14:09:09 rc=0, fc=0, state=128
14:09:09 signal stop jm
14:09:09 callback state=128, fc=0
14:09:09 rc=0, fc=0, state=128
14:09:10 callback state=4, fc=130
14:10:04 job restart (jm pid 4736)
14:10:05 rc=110
14:10:05 signal commit
14:10:05 rc=0, fc=130, state=4
14:10:05 job status
14:10:05 callback state=64, fc=130
14:10:05 rc=0, fc=130, state=64
14:10:06 callback state=1, fc=130
14:10:16 callback state=2, fc=130
14:15:05 job status
14:15:05 rc=0, fc=130, state=2
14:20:05 job status
14:20:05 rc=0, fc=130, state=2
14:20:16 callback state=128, fc=130
14:20:16 callback state=8, fc=130
14:20:16 signal stdio size
14:20:16 rc=0, fc=130, state=8
14:20:21 signal commit end
14:20:21 rc=0, fc=130, state=8

Jobmanager instance 4669 receives the stop signal from the client in the middle of the stage-out process. It appears to wait until the stage-out process completes before exiting. I suspect that when it updates the state file as part of stage out, it ends up writing a job state of 4 and a failure code of 130 to the file. Then, when jobmanager 4736 reads these values, it somehow ends up treating the job as so-far unsubmitted.
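The symptom in this trace is a job that has reached stage-out (state 128) and later reappears in stage-in (state 64). As a rough aid for scanning similar logs, here is a Python sketch that flags such backward steps. It is only an illustration: `find_regressions`, the `LIFECYCLE` ordering, and the decision to skip state 4 (which shows up transiently after every stop signal) are assumptions made for this example, not Condor-G or Globus logic. The sample lines are copied from the trace above.

```python
import re

# Assumed forward order of GRAM job states:
# unsubmitted, stage-in, pending, active, stage-out, done.
LIFECYCLE = [32, 64, 1, 2, 128, 8]
RANK = {state: position for position, state in enumerate(LIFECYCLE)}

_STATE = re.compile(r"\bstate=(\d+)")


def find_regressions(lines):
    """Return (timestamp, previous_state, new_state) for each backward step.

    States outside LIFECYCLE, notably 4 (failed), are skipped because the
    trace reports state=4 with fc=130 after every "stop jm" signal.
    """
    regressions = []
    previous = None
    for line in lines:
        match = _STATE.search(line)
        if not match:
            continue
        state = int(match.group(1))
        if state not in RANK:
            continue
        if previous is not None and RANK[state] < RANK[previous]:
            regressions.append((line.split()[0], previous, state))
        previous = state
    return regressions


TRACE = """\
14:09:09 rc=0, fc=0, state=128
14:09:09 callback state=128, fc=0
14:09:10 callback state=4, fc=130
14:10:05 rc=0, fc=130, state=4
14:10:05 callback state=64, fc=130
14:10:06 callback state=1, fc=130
14:10:16 callback state=2, fc=130
14:20:16 callback state=128, fc=130
14:20:16 callback state=8, fc=130
""".splitlines()

if __name__ == "__main__":
    print(find_regressions(TRACE))  # [('14:10:05', 128, 64)]
```

On this excerpt the only step flagged is the one at 14:10:05, right after jobmanager 4736 takes over, which matches the suspicion that the restarted jobmanager treated the job as unsubmitted.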
The behavior is easily reproducible. Follow the link and find the submit files. Submit them to a host of your choice (well, you'll have to adjust the globusscheduler, of course). Substitute sleep for keg if you don't have a VDS. Wait, say, five minutes. Kill every process with "condor" in its name on the submit host. Wait until the short job has finished on the remote site. Restart the local Condor. Voila.
Jaime, It should be fine to be able to do a stop manager during stage out, but do you really want to? Don't you want to avoid doing that? If you want to (or can't avoid having to) call stop manager during the stage-out state, then I think this is a show stopper for 4.0.1. It seems odd that this case is only coming up now for pre-WS GRAM. If you can avoid this scenario via condor-g, then this becomes a lower priority. -Stu
Stopping the jobmanager during stageout is not something that condor-g normally does, but can happen under the right circumstances. I may be able to prevent it from happening in this repeatable case, but there's no way to prevent it in all cases, as it's a race condition.
Jens, I've coded up a potential work-around. Try using the condor_gridmanager binary in ftp://ftp.cs.wisc.edu/condor/temporary/forjens/2005-06-01.
I've tried it against stock GT2 and it works as expected. I've tried it against pre-WS GT4 and it works as expected. I suppose the fix will go into Condor 6.7.8? Thanks
The workaround will go into Condor 6.7.8. But the underlying jobmanager problem still needs to be fixed.
Fix committed to CVS. Files may be re-staged if the jobmanager is restarted when this happens (stop during stage out), but the state file will not be modified after the stop signal is received. joe
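To make the described behavior concrete, here is a toy Python model of "the state file will not be modified after the stop signal is received". It is not the jobmanager's actual code, which is not shown in this report; the class `StateFile`, its methods, and `run_scenario` are hypothetical names invented for this sketch. The values 128, 4 and 130 are the state and failure codes quoted earlier in the thread.

```python
STAGE_OUT = 128   # job state "stageout", as glossed in the thread
FAILED = 4        # job state reported together with a failure code
JM_STOPPED = 130  # failure code "jm stopped", as glossed in the thread


class StateFile:
    """Hypothetical stand-in for the jobmanager's persisted job state."""

    def __init__(self, state, failure_code=0):
        self.state = state
        self.failure_code = failure_code
        self.stop_received = False

    def on_stop_signal(self):
        # From here on the persisted record is frozen.
        self.stop_received = True

    def write(self, state, failure_code):
        """Persist a new state unless a stop signal has been received."""
        if self.stop_received:
            return False  # write refused, earlier record stays intact
        self.state = state
        self.failure_code = failure_code
        return True


def run_scenario():
    """Stop arrives during stage-out, then a late (4, 130) write is attempted."""
    state_file = StateFile(state=STAGE_OUT)
    state_file.on_stop_signal()
    wrote = state_file.write(FAILED, JM_STOPPED)
    return wrote, state_file.state, state_file.failure_code


if __name__ == "__main__":
    print(run_scenario())  # (False, 128, 0)
```

In this model a restarted jobmanager would find state 128 with no failure code and could repeat the stage-out, which corresponds to the caveat that files may be re-staged.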