Bugzilla – Bug 4908
Stage out fails in JobManager.pm
Last modified: 2007-11-30 18:50:06
In the gt4.0.3-all-source-installer/source-trees/gram/jobmanager/source/scripts/JobManager.pm file there is a line (nr 754)

    my $count = File::Path::rmtree($job_path);

which prints "Can't fetch initial working directory" to stderr during the "cleanup" stage. This output on stderr triggers the JobManagerScript (gt4.0.3-all-source-installer/source-trees/ws-gram/service/java/source/src/org/globus/exec/service/exec/JobManagerScript.java#run(), line 316) to fail and hence to move the job into failed state.

This is what it looks like from the client side:

    Delegating user credentials...Done.
    Submitting job...Done.
    Job ID: uuid:8c37ed62-8b92-11db-b0e9-00065bf75b41
    Termination time: 12/15/2006 16:45 GMT
    Current job state: Unsubmitted
    Current job state: Active
    Current job state: StageOut
    Current job state: CleanUp
    Current job state: Failed
    Destroying job...Done.
    Cleaning up any delegated credentials...Done.

This is what it looks like from the server side (with some debug output enabled):

    2006-12-14 17:48:53,870 DEBUG exec.StateMachine [RunQueue CacheCleanUp-0,runScript:2930] running script cache_cleanup
    2006-12-14 17:48:53,871 DEBUG exec.JobManagerScript [Thread-25,run:208] Executing command: /usr/bin/sudo -H -u globus16 -S /opt/pc2/globus/libexec/globus-gridmap-and-execute -g /etc/grid-security/grid-mapfile /opt/pc2/globus/libexec/globus-job-manager-script.pl -m ccs -f /opt/pc2/globus/tmp/gram_job_mgr34706.tmp -c cache_cleanup
    2006-12-14 17:48:54,122 DEBUG exec.JobManagerScript [Thread-25,run:225] first line: null
    2006-12-14 17:48:54,124 DEBUG exec.JobManagerScript [Thread-25,run:335] failure message: Script stderr: Can't fetch initial working directory at /opt/pc2/globus/lib/perl/Globus/GRAM/JobManager.pm line 754
    2006-12-14 17:48:54,132 DEBUG exec.JobManagerScript [Thread-25,setDone:345] script is done, setting done flag
    2006-12-14 17:48:54,132 DEBUG exec.StateMachine [RunQueue CacheCleanUp-0,cacheCleanUp:2853] Done waiting for cache_cleanup script
    2006-12-14 17:48:54,132 DEBUG exec.StateMachine [RunQueue CacheCleanUp-0,cacheCleanUp:2860] script return code: 201
    2006-12-14 17:48:54,133 DEBUG exec.StateMachine [RunQueue CacheCleanUp-0,cacheCleanUp:2866] script return code means error!

If I add

    chdir("/");

right in front of

    my $count = File::Path::rmtree($job_path);

everything works fine. The value of $job_path looks like this:

    /home-pc2/user/globus16/.globus/8c37ed62-8b92-11db-b0e9-00065bf75b41

If I enable the debug output of the JobManager.pm (with some nasty hack, as I was not sure how to do that correctly), the output looks like this:

    ...
    Thu Dec 14 16:49:09 2006 JM_SCRIPT: Execution of script was successful
    Thu Dec 14 16:49:09 2006 JM_SCRIPT: ccsalloc returned jobid 3276
    Thu Dec 14 16:49:09 2006 JM_SCRIPT: Returning jobstate PENDING with jobid=3276
    Thu Dec 14 16:52:10 2006 JM_SCRIPT: New Perl JobManager created.
    Thu Dec 14 16:52:10 2006 JM_SCRIPT: Using jm supplied job dir: /home-pc2/user/globus16/.globus/a1f169e2-8b8a-11db-86c5-00065bf75b41
    Thu Dec 14 16:52:10 2006 JM_SCRIPT: Using jm supplied job dir: /home-pc2/user/globus16/.globus/a1f169e2-8b8a-11db-86c5-00065bf75b41
    Thu Dec 14 16:52:10 2006 JM_SCRIPT: cache_cleanup(enter)
    Thu Dec 14 16:52:10 2006 JM_SCRIPT: Cleaning files in job dir /home-pc2/user/globus16/.globus/a1f169e2-8b8a-11db-86c5-00065bf75b41
    Thu Dec 14 16:52:10 2006 JM_SCRIPT: Removed files from /home-pc2/user/globus16/.globus/a1f169e2-8b8a-11db-86c5-00065bf75b41
    Thu Dec 14 16:52:10 2006 JM_SCRIPT: cache_cleanup(exit)

Adding the line changes the output to

    Thu Dec 14 17:52:01 2006 JM_SCRIPT: cache_cleanup(enter)
    Thu Dec 14 17:52:01 2006 JM_SCRIPT: Cleaning files in job dir /home-pc2/user/globus16/.globus/55be893e-8b93-11db-8178-00065bf75b41
    Thu Dec 14 17:52:01 2006 JM_SCRIPT: Removed 3 files from /home-pc2/user/globus16/.globus/55be893e-8b93-11db-8178-00065bf75b41
                                                ^^^ the 3 is new
    Thu Dec 14 17:52:01 2006 JM_SCRIPT: cache_cleanup(exit)

The problem has been discussed here already, but I don't see any conclusion that helps me:

    http://www-unix.globus.org/mail_archive/discuss/2006/02/msg00017.html

I hope that I have provided sufficient information.
Reassigning to current GRAM developer to close/fix as appropriate.
Dominic, I was never able to reproduce this. But we recently got an e-mail on gt-user from someone who has exactly the same problem. If your fix works for him too and we don't run into problems ourselves, I'll commit your fix. Can you tell me in what situations you see that problem? Martin
It's embarrassing, but I no longer remember the exact context. I moved to a different institution and don't work with GRAM any more, so I lost track of this. Unfortunately I do not have access to that machine any more.

I am sure that I submitted the job as an rsl file over the command line. It was some toy job like "date" or "echo hello world".

I searched the mailing lists for the error message and found:

http://www.globus.org/mail_archive/discuss/2007/07/msg00000.html
  - The problem occurred when using PBS.

http://www.globus.org/mail_archive/developer-discuss/2006/01/msg00050.html
  - Discussion of whether certificates are set up correctly. Unfortunately I cannot verify this (lack of access to machine).

http://www.globus.org/mail_archive/gt4-friends/2005/05/msg00187.html
  - No details given.

http://www.globus.org/mail_archive/gt-user/2007/10/msg00236.html
  - Claims that the error does not occur if the container is started with "globus-start-container-detached". I have not tested that, but I used "globus-start-container" when the error occurred.
*** Bug 4195 has been marked as a duplicate of this bug. ***
Fixed in the 4.0 branch and in trunk by adding chdir("/"); before calling rmtree().
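
For reference, the shape of that change can be sketched in isolation. This is a minimal illustration, not the committed code: the remove_job_dir helper is hypothetical, and a temporary directory created with File::Temp stands in for the real $job_path. The report does not establish why the working directory lookup fails, so the sketch only shows the workaround itself, which is to move to "/" before rmtree() runs.

```perl
use strict;
use warnings;
use File::Path ();
use File::Temp ();

# Hypothetical helper (not part of JobManager.pm): remove a job directory
# after first moving to a directory that can always be resolved.
sub remove_job_dir {
    my ($job_path) = @_;

    # The workaround from this bug: leave the inherited working directory
    # before rmtree() tries to look it up.
    chdir('/') or warn "chdir to / failed: $!";

    # rmtree() returns the number of files and directories it removed.
    my $count = File::Path::rmtree($job_path);
    return $count;
}

# Usage with a throwaway directory standing in for $job_path (assumption).
my $job_path = File::Temp::tempdir(CLEANUP => 0);
open(my $fh, '>', "$job_path/stdout") or die "cannot create file: $!";
close($fh);

my $count = remove_job_dir($job_path);
print "Removed $count files from $job_path\n";
```

Because $job_path is an absolute path, changing the working directory does not affect which directory gets removed.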