Bugzilla – Bug 3333
Multijob destruction failure
Last modified: 2005-08-03 17:11:09
Two machines, host1 and host2, are set up. I have a multijob description which in turn contains 2 jobs. The job description specifies that the multijob be submitted on host1, the first subjob on host2, and the second subjob on host1. I start the container only on host1 and submit the multijob from host2. As expected, the multijob submission fails with a connection refused exception for host2. But when I use the EPR to probe the job status, it reports "Current job state: Failed", which means the job resource was not destroyed despite the failure. It seems reasonable that the resource should live (until its termination time) to tell the job's tale. Now that the job's story is known, I decide to kill the resource. The kill fails: "globusrun-ws: Unable to destroy job: Error destroying job". I probe the status again and it says "Current job state: Failed" followed by the rest of the description, which means the resource still has not been destroyed. I am including the console output and the multijob description used, to help debug.
The multijob description text is as follows
###########################
<?xml version="1.0" encoding="UTF-8"?>
<multiJob xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
          xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
    <factoryEndpoint>
        <wsa:Address>
            https://9.182.112.51:8443/wsrf/services/ManagedJobFactoryService
        </wsa:Address>
        <wsa:ReferenceProperties>
            <gram:ResourceID>Multi</gram:ResourceID>
        </wsa:ReferenceProperties>
    </factoryEndpoint>
    <directory>${GLOBUS_LOCATION}</directory>
    <count>1</count>
    <job>
        <factoryEndpoint>
            <wsa:Address>https://9.182.112.52:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
            <wsa:ReferenceProperties>
                <gram:ResourceID>Fork</gram:ResourceID>
            </wsa:ReferenceProperties>
        </factoryEndpoint>
        <executable>/bin/date</executable>
        <stdout>${GLOBUS_USER_HOME}/stdout.p1</stdout>
        <stderr>${GLOBUS_USER_HOME}/stderr.p1</stderr>
        <count>2</count>
    </job>
    <job>
        <factoryEndpoint>
            <wsa:Address>https://9.182.112.51:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
            <wsa:ReferenceProperties>
                <gram:ResourceID>Fork</gram:ResourceID>
            </wsa:ReferenceProperties>
        </factoryEndpoint>
        <executable>/bin/echo</executable>
        <argument>Hello World!</argument>
        <stdout>${GLOBUS_USER_HOME}/stdout.p2</stdout>
        <stderr>${GLOBUS_USER_HOME}/stderr.p2</stderr>
        <count>1</count>
    </job>
</multiJob>
-----------------------------
CONSOLE OUTPUT (Full stack trace not shown to limit space)

On submission
###########################
$ globusrun-ws -submit -f multijobmultimacdesc.xml -o jobepr.xml
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:becf38d6-b7cf-11d9-8cd2-000c29b13451
Termination time: 04/29/2005 10:24 GMT
Current job state: Failed
Destroying job...Failed.
globusrun-ws: Job failed: Unable to create sub-jobs.
AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
 faultSubcode:
 faultString: java.net.ConnectException: Connection refused
 faultActor:
 faultNode:
 faultDetail:
        {http://xml.apache.org/axis/}stackTrace:java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:331)
        ........... <rest-of-the-stack-trace> .............
$
-----------------------------
Status post 'submission and failure'
###########################
$ globusrun-ws -status -job-epr-file jobepr.xml
Current job state: Failed
globusrun-ws: Job failed: Unable to create sub-jobs.
AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
 faultSubcode:
 faultString: java.net.ConnectException: Connection refused
 faultActor:
 faultNode:
 faultDetail:
        {http://xml.apache.org/axis/}stackTrace:java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:331)
        ........... <rest-of-the-stack-trace> .............
$
-----------------------------
Kill and status again
###########################
$ globusrun-ws -kill -job-epr-file jobepr.xml
Requesting original job description...Done.
Destroying job...Failed.
globusrun-ws: Unable to destroy job: Error destroying job
globus_soap_message_module: SOAP Fault
Fault code: soapenv:Server.generalException
$ globusrun-ws -status -job-epr-file jobepr.xml
Current job state: Failed
globusrun-ws: Job failed: Unable to create sub-jobs.
AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
 faultSubcode:
 faultString: java.net.ConnectException: Connection refused
 faultActor:
 faultNode:
 faultDetail:
        {http://xml.apache.org/axis/}stackTrace:java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:331)
        ........... <rest-of-the-stack-trace> .............
$
-----------------------------
Fixed in trunk and globus_4_0_branch. The code now simply logs a warning if it can't destroy a sub-job resource, instead of throwing an exception that prevents the rest of the multi-job resource destruction from continuing. I also added a check for a null sub-job EPR before attempting to destroy the sub-job resource, so no attempt is made to destroy a sub-job that was never created.
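For readers who want to see the shape of the change, here is a minimal sketch of the destruction loop described in this comment. It is written in Python purely for illustration and is not the actual GRAM service code; SubJob, destroy_multijob and the destroy_sub_job callback are hypothetical names. It assumes each sub-job record holds an EPR that is None when the sub-job was never created.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Optional

logger = logging.getLogger("multijob")


@dataclass
class SubJob:
    """Hypothetical record for one sub-job of a multi-job."""
    name: str
    epr: Optional[str]  # None means the sub-job resource was never created


def destroy_multijob(sub_jobs: List[SubJob],
                     destroy_sub_job: Callable[[str], None]) -> List[str]:
    """Destroy every sub-job resource that exists and keep going on errors.

    Returns the names of the sub-jobs whose destruction failed, so the
    caller can still finish destroying the multi-job resource itself.
    """
    failed = []
    for job in sub_jobs:
        # Null-EPR check: nothing to destroy if creation never succeeded.
        if job.epr is None:
            continue
        try:
            destroy_sub_job(job.epr)
        except Exception as exc:
            # Log a warning instead of propagating, so one unreachable
            # host does not block the rest of the cleanup.
            logger.warning("Unable to destroy sub-job %s: %s", job.name, exc)
            failed.append(job.name)
    return failed


if __name__ == "__main__":
    def fake_destroy(epr: str) -> None:
        # Stand-in for the remote destroy call; fails for one endpoint.
        if epr == "epr-on-host2":
            raise ConnectionRefusedError("Connection refused")

    jobs = [SubJob("p1", None), SubJob("p2", "epr-on-host1")]
    print(destroy_multijob(jobs, fake_destroy))
```

In the scenario from this report, the sub-job on host2 was never created, so its EPR would be None and the loop skips it rather than failing the whole kill.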