Bug 3333 - Multijob destruction failure
Status: RESOLVED FIXED
: GRAM
wsrf managed multi job service
: 3.9.5
: PC Linux
: P3 normal
: 4.0.1
: 3348
 
Reported: 2005-05-11 13:10 by
Modified: 2005-08-03 17:11 (History)




Description From 2005-05-11 13:10:33
Two machines, host1 and host2, are set up. I have a multijob description 
that in turn contains two jobs. The description specifies that the multijob 
be submitted to host1, the first subjob to host2, and the second subjob to 
host1. 
     I start the container on host1 only and submit the multijob from host2. 
As expected, the multijob submission fails with a connection refused 
exception for host2. 
     When I use the EPR to probe the job status, it reports "Current job 
state: Failed", which means the job resource was not destroyed despite the 
failure. It seems reasonable for the resource to live (until its termination 
time) so that it can report what happened to the job.
     Having learned what happened to the job, I decide to kill the resource. 
The kill fails: "globusrun-ws: Unable to destroy job: Error destroying job". 
     I probe the status again and it still reports "Current job state: Failed" 
followed by the rest of the fault description, which means the resource has 
still not been destroyed.

I am including the console output and the multijob description used, to help 
debug.

The multijob description text is as follows
###########################
<?xml version="1.0" encoding="UTF-8"?>
<multiJob xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
     xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
    <factoryEndpoint>
        <wsa:Address>
            https://9.182.112.51:8443/wsrf/services/ManagedJobFactoryService
        </wsa:Address>
        <wsa:ReferenceProperties>
            <gram:ResourceID>Multi</gram:ResourceID>
        </wsa:ReferenceProperties>
    </factoryEndpoint>
    <directory>${GLOBUS_LOCATION}</directory>
    <count>1</count>

    <job>
        <factoryEndpoint>
            
            <wsa:Address>https://9.182.112.52:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
            <wsa:ReferenceProperties>
                <gram:ResourceID>Fork</gram:ResourceID>
            </wsa:ReferenceProperties>
        </factoryEndpoint>
        <executable>/bin/date</executable>
        <stdout>${GLOBUS_USER_HOME}/stdout.p1</stdout>
        <stderr>${GLOBUS_USER_HOME}/stderr.p1</stderr>
        <count>2</count>
    </job>

    <job>
        <factoryEndpoint>
            
            <wsa:Address>https://9.182.112.51:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
            <wsa:ReferenceProperties>
                <gram:ResourceID>Fork</gram:ResourceID>
            </wsa:ReferenceProperties>
        </factoryEndpoint>
        <executable>/bin/echo</executable>
        <argument>Hello World!</argument>
        <stdout>${GLOBUS_USER_HOME}/stdout.p2</stdout>
        <stderr>${GLOBUS_USER_HOME}/stderr.p2</stderr>
        <count>1</count>
    </job>

</multiJob>
-----------------------------

CONSOLE OUTPUT (full stack trace omitted to save space)
On submission
###########################
$ globusrun-ws -submit -f multijobmultimacdesc.xml -o jobepr.xml
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:becf38d6-b7cf-11d9-8cd2-000c29b13451
Termination time: 04/29/2005 10:24 GMT
Current job state: Failed
Destroying job...Failed.
globusrun-ws: Job failed: Unable to create sub-jobs.
AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
 faultSubcode:
 faultString: java.net.ConnectException: Connection refused
 faultActor:
 faultNode:
 faultDetail:
        {http://xml.apache.org/axis/}stackTrace:java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:331)
...........
<rest-of-the-stack-trace>
.............
$
-----------------------------

Status post 'submission and failure'
###########################
$ globusrun-ws -status -job-epr-file jobepr.xml           
Current job state: Failed
globusrun-ws: Job failed: Unable to create sub-jobs.
AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
 faultSubcode:
 faultString: java.net.ConnectException: Connection refused
 faultActor:
 faultNode:
 faultDetail:
        {http://xml.apache.org/axis/}stackTrace:java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:331)
...........
<rest-of-the-stack-trace>
.............
$
-----------------------------

Kill and status again
###########################
$ globusrun-ws -kill -job-epr-file jobepr.xml
Requesting original job description...Done.
Destroying job...Failed.
globusrun-ws: Unable to destroy job: Error destroying job
globus_soap_message_module: SOAP Fault
Fault code: soapenv:Server.generalException
$ globusrun-ws -status -job-epr-file jobepr.xml
Current job state: Failed
globusrun-ws: Job failed: Unable to create sub-jobs.
AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
 faultSubcode:
 faultString: java.net.ConnectException: Connection refused
 faultActor:
 faultNode:
 faultDetail:
        {http://xml.apache.org/axis/}stackTrace:java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:331)
...........
<rest-of-the-stack-trace>
.............
$
-----------------------------
------- Comment #1 From 2005-05-17 12:53:56 -------
Fixed in trunk and globus_4_0_branch.  The code now simply logs a warning if
it can't destroy a sub-job resource, instead of throwing an exception that
prevents the rest of the multi-job resource destruction from continuing.

I also added a check for a null sub-job EPR before attempting to destroy the
sub-job resource, so no attempt is made to destroy a sub-job that was never
created.
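
For readers following along, the behaviour described in this comment can be
sketched roughly as below. This is only an illustration, not the committed
GRAM code: the class MultiJobDestroySketch, the SubJobHandle interface and the
destroySubJobs method are hypothetical names, and a null list entry stands in
for a null sub-job EPR.

```java
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;

public class MultiJobDestroySketch {

    /** Hypothetical stand-in for a sub-job resource reference (not a GRAM type). */
    interface SubJobHandle {
        void destroy() throws Exception;
    }

    /**
     * Destroys every sub-job that was actually created. A null handle means the
     * sub-job was never created, so it is skipped. A failed destroy is recorded
     * as a warning instead of aborting the loop, so the remaining sub-jobs (and
     * the enclosing multi-job) can still be cleaned up.
     * Returns the number of sub-jobs destroyed successfully.
     */
    static int destroySubJobs(List<SubJobHandle> subJobs, List<String> warnings) {
        int destroyed = 0;
        for (int i = 0; i < subJobs.size(); i++) {
            SubJobHandle handle = subJobs.get(i);
            if (handle == null) {
                continue; // never created, nothing to destroy
            }
            try {
                handle.destroy();
                destroyed++;
            } catch (Exception e) {
                // Warn and keep going rather than rethrowing.
                warnings.add("Unable to destroy sub-job " + i + ": " + e.getMessage());
            }
        }
        return destroyed;
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        List<SubJobHandle> subJobs = new ArrayList<>();
        subJobs.add(null); // sub-job that was never created
        subJobs.add(() -> { throw new ConnectException("Connection refused"); });
        subJobs.add(() -> { }); // sub-job that destroys cleanly
        int destroyed = destroySubJobs(subJobs, warnings);
        System.out.println("destroyed=" + destroyed + " warnings=" + warnings);
    }
}
```

In this sketch the unreachable sub-job produces one warning and the loop still
reaches the last sub-job, which is the property the original report was
missing when the kill request failed outright.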