<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "http://bugzilla.globus.org/bugzilla/bugzilla.dtd">

<bugzilla version="3.2.3"
          urlbase="http://bugzilla.globus.org/bugzilla/"
          maintainer="bacon@mcs.anl.gov"
>

    <bug>
          <bug_id>2574</bug_id>
          
          <creation_ts>2005-01-18 10:24</creation_ts>
          <short_desc>fork job not being killed at resource destruction</short_desc>
          <delta_ts>2005-04-01 12:00:31</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>GRAM</product>
          <component>wsrf managed execution job service</component>
          <version>3.9.4</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          
          
          <priority>P1</priority>
          <bug_severity>blocker</bug_severity>
          <target_milestone>4.0</target_milestone>
          
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Tim Freeman">tfreeman@mcs.anl.gov</reporter>
          <assigned_to name="Stuart Martin">smartin@mcs.anl.gov</assigned_to>
          <cc>alain@isi.edu</cc>
          <cc>bester@mcs.anl.gov</cc>
          <cc>gaffaney@mcs.anl.gov</cc>
          <cc>lane@mcs.anl.gov</cc>
          <cc>madduri@mcs.anl.gov</cc>
          <cc>rynge@isi.edu</cc>
          <cc>smartin@mcs.anl.gov</cc>

      

      
          <long_desc isprivate="0">
            <who name="Tim Freeman">tfreeman@mcs.anl.gov</who>
            <bug_when>2005-01-18 10:24:11</bug_when>
            <thetext>When ManagedExecutableJobResource.remove() is called, no signals are sent
to the forked job pid.  After an early cancel, the job keeps running.  I also put a
touch file in the fork scheduler adapter's cancellation function, and it does not
get touched, so I think the resource-&gt;sudo-&gt;perl call chain is not being invoked on destroy.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Peter Lane">lane@mcs.anl.gov</who>
            <bug_when>2005-01-21 17:44:41</bug_when>
            <thetext>Fixed in trunk.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Peter Lane">lane@mcs.anl.gov</who>
            <bug_when>2005-02-01 11:32:41</bug_when>
            <thetext>*** Bug 2672 has been marked as a duplicate of this bug. ***</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Peter Lane">lane@mcs.anl.gov</who>
            <bug_when>2005-02-01 11:33:52</bug_when>
            <thetext>Bob, how recent is your GRAM installation?  As noted in comment #1, a fix was committed on 1/21/2005.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Bob Gaffaney">gaffaney@mcs.anl.gov</who>
            <bug_when>2005-02-01 13:46:50</bug_when>
            <thetext>I tried this again for both Condor and PBS with small numbers of jobs. For my 
installation (ws gram rebuilt today) the jobs remained enqueued with the 
scheduler after globusrun-ws -kill had been run.

*****
Log for a manual submission of PBS job below

globusrun-ws -submit -factory https://lucky0:8444/wsrf/services/ManagedJobFactoryService -factory-type PBS -batch -o epr_0 -c /bin/sleep 1000
Submitting job...Done.
Job ID: uuid:d12b8280-7488-11d9-a2b0-0002a5ad41e5
Termination time: 02/02/2005 19:38 GMT
Done
[gaffaney@lucky0 one]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
3048.lucky0      STDIN            gaffaney                0 R luckyq
[gaffaney@lucky0 one]$ globusrun-ws -kill -j epr_0
Requesting original job description...Done.
Destroying job...Done.
[gaffaney@lucky0 one]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
3048.lucky0      STDIN            gaffaney                0 E luckyq
[gaffaney@lucky0 one]$ qstat
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
3048.lucky0      STDIN            gaffaney                0 E luckyq</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Bob Gaffaney">gaffaney@mcs.anl.gov</who>
            <bug_when>2005-02-01 13:49:54</bug_when>
            <thetext>Looking again at what I pasted, I see that the state of the enqueued job
went from R to E, so this may be OK.
</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Bob Gaffaney">gaffaney@mcs.anl.gov</who>
            <bug_when>2005-02-01 13:53:37</bug_when>
            <thetext>OK, it might work for PBS, but Condor still seems to be a problem. Paste
below:

*****************
globusrun-ws -submit -factory https://lucky0:8444/wsrf/services/ManagedJobFactoryService -factory-type Condor -batch -o epr_0 -c /bin/sleep 1000
Submitting job...Done.
Job ID: uuid:84c5203e-748a-11d9-a422-0002a5ad41e5
Termination time: 02/02/2005 19:50 GMT
Done
[gaffaney@lucky0 one]$ condor_q


-- Submitter: lucky0.mcs.anl.gov : &lt;140.221.65.193:58106&gt; : lucky0.mcs.anl.gov
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
3011.0   gaffaney        2/1  13:50   0+00:00:01 R  0   0.0  sleep 1000

1 jobs; 0 idle, 1 running, 0 held
[gaffaney@lucky0 one]$ globusrun-ws -kill -j epr_0
Requesting original job description...Done.
Destroying job...
Done
[gaffaney@lucky0 one]$ condor_q


-- Submitter: lucky0.mcs.anl.gov : &lt;140.221.65.193:58106&gt; : lucky0.mcs.anl.gov
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
3011.0   gaffaney        2/1  13:50   0+00:00:33 R  0   0.0  sleep 1000

1 jobs; 0 idle, 1 running, 0 held
[gaffaney@lucky0 one]$ condor_q


-- Submitter: lucky0.mcs.anl.gov : &lt;140.221.65.193:58106&gt; : lucky0.mcs.anl.gov
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
3011.0   gaffaney        2/1  13:50   0+00:00:46 R  0   0.0  sleep 1000

1 jobs; 0 idle, 1 running, 0 held
[gaffaney@lucky0 one]$ condor_q


-- Submitter: lucky0.mcs.anl.gov : &lt;140.221.65.193:58106&gt; : lucky0.mcs.anl.gov
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
3011.0   gaffaney        2/1  13:50   0+00:00:49 R  0   0.0  sleep 1000

1 jobs; 0 idle, 1 running, 0 held
[gaffaney@lucky0 one]$
</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Peter Lane">lane@mcs.anl.gov</who>
            <bug_when>2005-02-02 12:24:00</bug_when>
            <thetext>If it&apos;s working for other schedulers, then this is an issue with either Condor or the perl scheduler script 
for Condor.  Reassigning to Stu so he can determine who should be looking at this.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Peter Lane">lane@mcs.anl.gov</who>
            <bug_when>2005-02-16 11:42:38</bug_when>
            <thetext>*** Bug 2749 has been marked as a duplicate of this bug. ***</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Peter Lane">lane@mcs.anl.gov</who>
            <bug_when>2005-02-16 11:50:55</bug_when>
            <thetext>Ignore the duplicate assignment.  It was intended for #2575.</thetext>
          </long_desc>
          <long_desc isprivate="0">
            <who name="Peter Lane">lane@mcs.anl.gov</who>
            <bug_when>2005-04-01 12:00:31</bug_when>
            <thetext>I&apos;m closing this since Fork and PBS job removal is working.  Condor has a
separate bug open for it.</thetext>
          </long_desc>
      
      

    </bug>

</bugzilla>