Bug 4908 - Stage out fails in JobManager.pm
wsrf managed execution job service
: 4.0.3
: PC Linux
: P3 normal
: 4.0.6
Assigned To:
Reported: 2006-12-14 11:36 by
Modified: 2007-11-30 18:50 (History)




Description From 2006-12-14 11:36:10
In the file
gt4.0.3-all-source-installer/source-trees/gram/jobmanager/source/scripts/JobManager.pm
there is a line (no. 754)

  my $count = File::Path::rmtree($job_path);

which prints "Can't fetch initial working directory" to stderr during the
"cleanup" stage.

This output on stderr causes the JobManagerScript
(exec/service/exec/JobManagerScript.java#run(), line 316) to fail and hence to
move the job into the Failed state.
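
For illustration, here is a minimal sketch of one way this message can arise. It assumes (the report does not establish this) that the script's current working directory cannot be determined when rmtree() runs. The File::Path documentation lists a "cannot fetch initial working directory" diagnostic for the case where the getcwd call fails, in which case nothing is deleted. The directories below are hypothetical stand-ins created with File::Temp, and the behaviour depends on the platform and the installed File::Path version (removing the current directory like this works on Linux).

```perl
use strict;
use warnings;
use File::Path ();
use File::Temp ();

# Hypothetical stand-ins: $job_path plays the role of the job directory,
# $gone is a directory we enter and then delete to invalidate the cwd.
my $job_path = File::Temp::tempdir();
my $gone     = File::Temp::tempdir();

chdir($gone) or die "chdir $gone: $!";
rmdir($gone) or die "rmdir $gone: $!";   # the cwd no longer exists

# Collect anything rmtree() would otherwise print to stderr.
my @warnings;
my $count;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    $count = File::Path::rmtree($job_path);
}

# Return to a valid directory before inspecting the result.
chdir('/') or die "chdir /: $!";
my $still_there = -d $job_path ? 1 : 0;

print "rmtree returned: ", (defined $count ? $count : 'undef'), "\n";
print "job dir still exists: $still_there\n";
print "captured warning: $_" for @warnings;

# Tidy up the stand-in directory if rmtree() refused to touch it.
File::Path::rmtree($job_path) if $still_there;
```

If the directory survives and a warning is captured, the same text written to stderr by the real script would be enough to make the Java side report a failure, as described above.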

This is what it looks like from the client side:
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:8c37ed62-8b92-11db-b0e9-00065bf75b41
Termination time: 12/15/2006 16:45 GMT
Current job state: Unsubmitted
Current job state: Active
Current job state: StageOut
Current job state: CleanUp
Current job state: Failed
Destroying job...Done.
Cleaning up any delegated credentials...Done.

This is what it looks like from the server side (with some debug output
enabled):
2006-12-14 17:48:53,870 DEBUG exec.StateMachine [RunQueue
CacheCleanUp-0,runScript:2930] running script cache_cleanup
2006-12-14 17:48:53,871 DEBUG exec.JobManagerScript [Thread-25,run:208]
Executing command:
/usr/bin/sudo -H -u globus16 -S
/opt/pc2/globus/libexec/globus-gridmap-and-execute -g
/etc/grid-security/grid-mapfile
/opt/pc2/globus/libexec/globus-job-manager-script.pl -m ccs -f
/opt/pc2/globus/tmp/gram_job_mgr34706.tmp -c cache_cleanup
2006-12-14 17:48:54,122 DEBUG exec.JobManagerScript [Thread-25,run:225] first
line: null
2006-12-14 17:48:54,124 DEBUG exec.JobManagerScript [Thread-25,run:335] failure
message: Script stderr:
Can't fetch initial working directory at
/opt/pc2/globus/lib/perl/Globus/GRAM/JobManager.pm line 754
2006-12-14 17:48:54,132 DEBUG exec.JobManagerScript [Thread-25,setDone:345]
script is done, setting done flag
2006-12-14 17:48:54,132 DEBUG exec.StateMachine [RunQueue
CacheCleanUp-0,cacheCleanUp:2853] Done waiting for cache_cleanup script
2006-12-14 17:48:54,132 DEBUG exec.StateMachine [RunQueue
CacheCleanUp-0,cacheCleanUp:2860] script return code: 201
2006-12-14 17:48:54,133 DEBUG exec.StateMachine [RunQueue
CacheCleanUp-0,cacheCleanUp:2866] script return code means error!

If I add 
right in front of
  my $count = File::Path::rmtree($job_path);
everything works fine.

The value of $job_path looks like this:

If I enable the debug output of the JobManager.pm (with some nasty hack, as I
was not sure how to do that correctly), the output looked like this:
Thu Dec 14 16:49:09 2006 JM_SCRIPT: Execution of script was successful
Thu Dec 14 16:49:09 2006 JM_SCRIPT: ccsalloc returned jobid 3276
Thu Dec 14 16:49:09 2006 JM_SCRIPT: Returning jobstate PENDING with jobid=3276
Thu Dec 14 16:52:10 2006 JM_SCRIPT: New Perl JobManager created.
Thu Dec 14 16:52:10 2006 JM_SCRIPT: Using jm supplied job dir:
Thu Dec 14 16:52:10 2006 JM_SCRIPT: Using jm supplied job dir:
Thu Dec 14 16:52:10 2006 JM_SCRIPT: cache_cleanup(enter)
Thu Dec 14 16:52:10 2006 JM_SCRIPT: Cleaning files in job dir
Thu Dec 14 16:52:10 2006 JM_SCRIPT: Removed  files from
Thu Dec 14 16:52:10 2006 JM_SCRIPT: cache_cleanup(exit)

Adding the line changes the output to

Thu Dec 14 17:52:01 2006 JM_SCRIPT: cache_cleanup(enter)
Thu Dec 14 17:52:01 2006 JM_SCRIPT: Cleaning files in job dir
Thu Dec 14 17:52:01 2006 JM_SCRIPT: Removed 3 files from
                                           ^^^ the 3 is new
Thu Dec 14 17:52:01 2006 JM_SCRIPT: cache_cleanup(exit)

The problem has been discussed here already, but I don't see any conclusion
that helps me.

I hope that I have provided sufficient information.
------- Comment #1 From 2007-09-19 11:38:07 -------
Reassigning to current GRAM developer to close/fix as appropriate.
------- Comment #2 From 2007-11-09 15:07:28 -------
I was never able to reproduce this. But we recently got an e-mail
on gt-user from someone who has exactly the same problem. If your fix
works for him too and we don't run into problems ourselves, I'll
commit your fix.
Can you tell us in what situations you had that problem?
------- Comment #3 From 2007-11-09 16:07:02 -------
It's embarrassing, but I no longer remember the exact context. I moved
to a different institution and don't work with GRAM any more, so I lost track
of this. Unfortunately I do not have access to that machine any more.

I searched on the mailing lists for the error message and found:

  - the problem occurred when using PBS. I am sure that I submitted the job as
an rsl file over the command line. It was some toy job like "date" or "echo
hello world".

  - discussion whether certificates are set up correctly. Unfortunately I
cannot verify this (lack of access to machine).

  - No details given

  - Claims that the error does not occur if the container is started with
"globus-start-container-detached". I have not tested that but I used
"globus-start-container" when the error occurred.
------- Comment #4 From 2007-11-28 16:19:41 -------
*** Bug 4195 has been marked as a duplicate of this bug. ***
------- Comment #5 From 2007-11-30 18:49:34 -------
Fixed in the 4.0 branch and in trunk by adding chdir("/");
before calling rmtree().
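
As a rough sketch of that change (not the committed code): the snippet below changes to "/" and then removes a directory tree with File::Path::rmtree(). The temporary directory and its two files are hypothetical stand-ins for a real job directory, and the print line only mimics the log message quoted in the description.

```perl
use strict;
use warnings;
use File::Path ();
use File::Temp ();

# Hypothetical stand-in for the job directory cleaned by cache_cleanup.
my $job_path = File::Temp::tempdir();
for my $name (qw(stdout stderr)) {
    open(my $fh, '>', "$job_path/$name") or die "open $job_path/$name: $!";
    close($fh) or die "close: $!";
}

# Switch to a directory that is expected to exist and be accessible,
# so rmtree() can determine its initial working directory.
chdir('/') or die "chdir /: $!";

my $count = File::Path::rmtree($job_path);
print "Removed $count files from $job_path\n";
```

Checking the return value of chdir() is an addition for this sketch; the comment above only states that chdir("/") was added before the rmtree() call.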