Bug 3411 - Condor-G and pre-WS GT4 problems
Status: RESOLVED FIXED
Product: GRAM
Component: gt2 Gatekeeper/Jobmanager
Version: 4.0.0
Platform: PC Linux
Importance: P3 normal
Target Milestone: 4.0.2
Reported: 2005-05-24 18:05
Modified: 2005-11-11 16:03



Description From 2005-05-24 18:05:35
I am running Condor-G's gridmanager against pre-WS GT4.0.0. I am testing
Condor-G's ability to reconnect to a job after the submit host falls over.
Reconnection works fine if the job is still running. However, if the job
finished in the meantime, it gets restarted. This is a "no go" for our ATLAS
users -- successfully finished jobs must not be run again.

Digging into the details with Jaime, he came to the following conclusion. Here is Jaime's message:

It looks like the jobmanager got confused by a somewhat unusual sequence of
calls from condor-g for job 884.0 [note: this condor-id is from a prior run].
Below is an annotated version of the sequence of events after the gridmanager
restarts. The indented lines are the replies to the previous client command.
Callbacks can appear between a command and its response.

This is my own shorthand and hopefully not too confusing:

callback register()
      rc=0, state=4, fc=155 (stage out failed)
stdio_update( "bad rsl" )
# this is an intentionally bad rsl, the gridmanager takes the
# appropriate action below
      rc=94, state=4, fc=0
# this is strange, the failure code should still be 155
# now we stop and restart the jobmanager to update stage out targets
signal( "stop jobmanager" )
      rc=0, state=4, fc=155
restart jm()
      rc=110 (waiting for commit), contact=...
signal( commit )
      rc=0, state=128 (stageout), fc=0
# the gridmanager hasn't processed the new stageout state and still has
# the job marked as active, so it stops the jobmanager because the grid
# monitor is active
signal( "stop jobmanager" )
callback state=128, fc=0
      rc=0, state=128, fc=0
callback state=4, fc=130 (jm stopped)
# now the gridmanager processes the stageout state and realizes it needs
# to restart the jobmanager again
restart jm()
      rc=110, contact=...
signal( commit )
      rc=0, state=4, fc=130
# this is strange, the state should be 128 and the failure code 0
# (like above)
job status()
callback state=64, fc=130
# hmm, it looks like the jobmanager has now regressed and resubmitted
# the job, but the failure code is still the wrong value of 130
      rc=0, state=64, fc=130
callback state=1, fc=130
# regular job execution sequence follows, but with fc=130 the whole
# way through
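
For reference, here is a small decoder for the numeric codes in these traces. The
state values are the standard GT2 GRAM protocol job states; the failure-code and
rc meanings are taken only from the annotations above, so treat this as a
convenience sketch rather than an excerpt from the GRAM headers:

# Sketch: decode the rc/state/fc numbers used in the traces in this bug.
JOB_STATE = {
    1: "PENDING", 2: "ACTIVE", 4: "FAILED", 8: "DONE",
    16: "SUSPENDED", 32: "UNSUBMITTED", 64: "STAGE_IN", 128: "STAGE_OUT",
}
# fc/rc meanings as annotated in the trace above, not from the GRAM source.
FAILURE_CODE = {0: "no failure", 130: "jobmanager stopped", 155: "stage out failed"}
REPLY_CODE = {0: "success", 94: "reply to the bad stdio_update rsl", 110: "waiting for commit"}

def describe(rc, state, fc):
    return "rc=%s, state=%s, fc=%s" % (REPLY_CODE.get(rc, rc),
                                       JOB_STATE.get(state, state),
                                       FAILURE_CODE.get(fc, fc))

# describe(0, 4, 155) -> 'rc=success, state=FAILED, fc=stage out failed'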


I think it's time to consult the globus folks. Find and save the appropriate
jobmanager logs if you can.

Back to Jens: The files from a re-run can be found at
http://griodine.uchicago.edu/~voeckler/ATLAS/condorg/

It contains the submit file, the local log file, the Gridmanager log file, and the
GRAM jobmanager log files. As far as I understand Jaime, Globus is producing
unexpected results. The service I am running against is pre-WS GT4, not GT2.
------- Comment #1 From 2005-05-25 02:17:55 -------
Is this behavior easily reproducible?
------- Comment #2 From 2005-05-25 04:07:40 -------
Here is the client-viewed sequence of events for the new test run:

short job:
condor job id: 900.0  
globus job id: 4479/1116961085    
jobmanager logs: 4479, 4574, 4669, 4736

13:58:05 job submit (jm pid 4479)
13:58:05    rc=110
13:58:10 signal commit
13:58:10    rc=0, fc=0, state=32
13:58:10 callback state=64, fc=0
13:58:11 callback state=1, fc=0  
13:58:11 signal stop jm
13:58:11    rc=0, fc=0, state=1
13:58:11 callback state=4, fc=130
# grid monitor fails
13:58:36 job restart (jm pid 4574)
13:58:37    rc=110
13:58:37 signal commit
13:58:37    rc=0, fc=0, state=1
13:59:08 callback state=2, fc=0
# gridmanager dies and restarts
14:09:03 callback register
14:09:03    rc=0, fc=155, state=4
14:09:03 signal stdio update "bad rsl"
14:09:03    rc=94, fc=0, state=4
14:09:03 signal stop jm
14:09:03    rc=0, fc=155, state=4
14:09:03 job restart (jm pid 4669)
14:09:04    rc=110
14:09:08 signal commit
14:09:09    rc=0, fc=0, state=128
14:09:09 signal stop jm
14:09:09 callback state=128, fc=0
14:09:09    rc=0, fc=0, state=128
14:09:10 callback state=4, fc=130
14:10:04 job restart (jm pid 4736)
14:10:05    rc=110
14:10:05 signal commit
14:10:05    rc=0, fc=130, state=4
14:10:05 job status
14:10:05 callback state=64, fc=130
14:10:05    rc=0, fc=130, state=64
14:10:06 callback state=1, fc=130
14:10:16 callback state=2, fc=130
14:15:05 job status
14:15:05    rc=0, fc=130, state=2
14:20:05 job status
14:20:05    rc=0, fc=130, state=2
14:20:16 callback state=128, fc=130
14:20:16 callback state=8, fc=130
14:20:16 signal stdio size
14:20:16    rc=0, fc=130, state=8
14:20:21 signal commit end
14:20:21    rc=0, fc=130, state=8

Jobmanager instance 4669 receives the stop signal from the client in the middle of
the stage-out process. It appears to wait until the stage-out process completes
before exiting. I suspect that when it updates the state file as part of stage out,
it ends up writing a job state of 4 and a failure code of 130 to the file. Then when
jobmanager 4736 reads these values, it somehow ends up treating the job as not yet
submitted.
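
To make the suspected failure mode concrete, here is a rough, runnable model of
what the two jobmanager instances appear to do with the state file. The state-file
layout and the decision logic are stand-ins, not the actual jobmanager code:

FAILED, STAGE_IN, STAGE_OUT = 4, 64, 128
JM_STOPPED = 130

# The job is staging out when the stop signal arrives.
state_file = {"job_state": STAGE_OUT, "failure_code": 0}

def jobmanager_4669_stop(state_file):
    # The jobmanager finishes staging out before exiting; the suspicion is that
    # this final update records FAILED/130 in the state file.
    state_file["job_state"] = FAILED
    state_file["failure_code"] = JM_STOPPED          # suspected bad write

def jobmanager_4736_restart(state_file):
    # The restarted jobmanager reads FAILED/130 and somehow classifies the job
    # as not yet submitted, matching the state=64 callback with fc=130 above.
    if (state_file["job_state"], state_file["failure_code"]) == (FAILED, JM_STOPPED):
        print("resubmitting; callbacks will show state=%d fc=%d" % (STAGE_IN, JM_STOPPED))

jobmanager_4669_stop(state_file)
jobmanager_4736_restart(state_file)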
------- Comment #3 From 2005-05-25 10:15:06 -------
The behavior is easily reproducible. Follow the link and find the submit files.
Submit them to a host of your choice (you'll have to adjust the globusscheduler,
of course). Substitute sleep for keg if you don't have a VDS. Wait, say, five
minutes. Kill every process with "condor" in its name on the submit host. Wait
until the short job has finished on the remote site. Restart the local Condor.
Voila.
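
A hypothetical driver for these steps, in case it helps; the submit-file name and
the sleep intervals are placeholders, so adjust them to the files from the URL
above and to your site:

# Hypothetical reproduction driver -- a sketch, not a tested script.
import subprocess, time

subprocess.run(["condor_submit", "short.sub"], check=True)  # submit file from the URL above
time.sleep(5 * 60)                                          # let the job get going remotely
subprocess.run(["pkill", "-f", "condor"])                   # kill everything with "condor" in its name
time.sleep(10 * 60)                                         # until the short job finishes on the remote site
subprocess.run(["condor_master"])                           # restart the local Condor
# Now watch the user log / Gridmanager log: the finished job gets resubmitted.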
------- Comment #4 From 2005-05-27 10:40:52 -------
Jaime,

It should be fine to be able to do a stop manager during stage out, but do you
really want to? Don't you want to avoid doing that? If you want to (or can't avoid)
calling stop manager during the stage-out state, then I think this is a show
stopper for 4.0.1. It seems odd that this case is just coming up now for pre-WS
GRAM. If you can avoid this scenario via condor-g, then this becomes a lower
priority.

-Stu
------- Comment #5 From 2005-06-01 14:20:54 -------
Stopping the jobmanager during stageout is not something that condor-g normally
does, but it can happen under the right circumstances. I may be able to prevent it
from happening in this repeatable case, but there's no way to prevent it in all
cases, as it's a race condition.
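
The race, as I read the trace in comment #2 (purely an illustration; the names
are made up): the gridmanager decides to stop the jobmanager while its last
processed state is still "active", and by the time the stop signal is delivered
the job has already moved into stage-out on the server side:

ACTIVE, STAGE_OUT = 2, 128

def gridmanager_wants_stop(last_processed_state):
    # The decision is based on the last state the gridmanager has processed.
    return last_processed_state == ACTIVE   # grid monitor active, so stop the jm

def deliver_stop(current_server_state):
    # By the time the signal arrives, the job may already be staging out.
    if current_server_state == STAGE_OUT:
        print("stop delivered during stage out")   # the problematic window

if gridmanager_wants_stop(ACTIVE):    # client-side view: still active
    deliver_stop(STAGE_OUT)           # server-side reality: already staging out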
------- Comment #6 From 2005-06-01 16:13:52 -------
Jens, I've coded up a potential work-around. Try using the condor_gridmanager
binary in ftp://ftp.cs.wisc.edu/condor/temporary/forjens/2005-06-01.
------- Comment #7 From 2005-06-02 13:56:34 -------
I've tried it against stock GT2 and it works as expected.
I've tried it against pre-WS GT4 and it works as expected.

I suppose the fix will go into Condor 6.7.8?

Thanks
------- Comment #8 From 2005-06-02 15:03:36 -------
The workaround will go into Condor 6.7.8. But the underlying jobmanager problem
still needs to be fixed.
------- Comment #9 From 2005-11-11 16:03:28 -------
Fix committed to CVS. Files may be re-staged if the jobmanager is restarted when
this happens (a stop during stage out), but the state file will not be modified
after the stop signal is received.

joe
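
In terms of the illustrative model earlier in this bug, the fix described above
amounts to something like the following -- a sketch of the described behavior
only, not the actual CVS change:

FAILED = 4
JM_STOPPED = 130

def maybe_update_state_file(state_file, stop_received, new_state, new_fc):
    # Fixed behavior as described above: once the stop signal has been received,
    # the state file is left alone, so the next jobmanager cannot misread a
    # FAILED/130 entry written during shutdown. Files may still be re-staged on
    # restart, but the recorded job state is preserved.
    if stop_received:
        return                      # do not touch the state file after a stop
    state_file["job_state"] = new_state
    state_file["failure_code"] = new_fc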