Bug 5207 - GRAM SoftEnv extension bug
Summary: GRAM SoftEnv extension bug
Status: RESOLVED WONTFIX
Product: GRAM
Component: gt2 Gatekeeper/Jobmanager
Version: 4.0.1
Platform: All
OS: All
Priority: P3
Severity: normal
Reported: 2007-04-12 22:11
Modified: 2012-09-05 13:44


Description From 2007-04-12 22:11:44
There appears to be a race condition in the softenv extensions to GRAM in GT
4.0.1 (TeraGrid).  When I submit a job using (pre-WS) GRAM which uses the
softenv extension in the RSL, I periodically see an error message (or sometimes
more than one message) similar to the following in the stderr of the job:

/home/insley/.globus/job/tg-grid1.uc.teragrid.org/11164.1176323402/scheduler_pbs_cmd_script:
line 3:
/home/insley/.globus/job/tg-grid1.uc.teragrid.org/11164.1176323402/pbs_softenv_cmd_script.cache.sh:
No such file or directory

I frequently encounter this problem when attempting to run on multiple nodes
(>50% of the time when I've tried to run on 32 nodes on the UC/ANL cluster).
When I log in to each of the nodes allocated to the job, the
/home/insley/.globus/job/tg-grid1.uc.teragrid.org/11164.1176323402/pbs_softenv_cmd_script.cache.sh
file is indeed there.  This leads us to believe the failure is related to NFS
not having synced by the time the *.cache.sh script is accessed.
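
For context, the generated scheduler_pbs_cmd_script presumably contains
something along these lines; this is a sketch reconstructed from the
JobManager.pm code quoted in comment #1 below (with the job directory
abbreviated to $JOBDIR, and assuming $soft_msc expands to SoftEnv's soft-msc
compiler), not a copy of the actual script:

    # $JOBDIR stands in for /home/insley/.globus/job/tg-grid1.uc.teragrid.org/11164.1176323402
    soft-msc $JOBDIR/pbs_softenv_cmd_script            # compile the softenv commands into a cache script
    . $JOBDIR/pbs_softenv_cmd_script.cache.sh          # sourcing the cache is what fails at "line 3" above
    rm $JOBDIR/pbs_softenv_cmd_script $JOBDIR/pbs_softenv_cmd_script.cache.sh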

If I remove the softenv tag from the RSL and instead specify the same settings
in my .soft file, I have yet to encounter this problem.  However, this job is
intended to be submitted from the visualization gateway by any user, so we
won't have control over their .soft file, hence the need to use the softenv
tag in the RSL.
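
For illustration, the RSL in question looks roughly like the following; the
executable path and the softenv value are placeholders rather than the actual
job, and only the softenv attribute is the extension at issue:

    & (executable = /path/to/vis-application)
      (jobtype = mpi)
      (count = 32)
      (softenv = "+some-softenv-key")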

Also, perhaps even more troubling, the job doesn't actually fail.  When I log
in to each of the nodes where the job should be running, for each error
message like the one above that was printed to stderr I find a node where the
executable is not running.  The job continues to run on the rest of the nodes
(presumably blocked in MPI_INIT) until the wallclock time expires.

(problem reported by Joe Insley to the TeraGrid software-wg)
------- Comment #1 From 2007-04-13 16:21:12 -------
JP,

I think the thing to try would be to touch any file that one of these temp
scripts is about to source or execute, right before it is used.  For example,

Index: JobManager.pm
===================================================================
RCS file:
/home/globdev/CVS/globus-packages/gram/jobmanager/source/scripts/Attic/JobManager.pm,v
retrieving revision 1.15.10.2
diff -c -r1.15.10.2 JobManager.pm
*** JobManager.pm       12 Mar 2007 14:03:17 -0000      1.15.10.2
--- JobManager.pm       13 Apr 2007 21:19:26 -0000
***************
*** 1074,1079 ****
--- 1074,1080 ----
          close(SOFTENV);

          print $job_script_fh "$soft_msc $softenv_script_name\n";
+         print $job_script_fh "touch $softenv_script_name.cache.sh\n";
          print $job_script_fh ". $softenv_script_name.cache.sh\n";
          print $job_script_fh "rm $softenv_script_name"
                             . " $softenv_script_name.cache.sh\n";

Can you try this or something similar and see if that does the trick?

-Stu
------- Comment #2 From 2012-09-05 13:44:38 -------
Doing some bugzilla cleanup...  Resolving old GRAM3 and GRAM4 issues that are
no longer relevant since we've moved on to GRAM5.  Also, we're now tracking
issues in JIRA.  Any new issues should be added here:

http://jira.globus.org/secure/VersionBoard.jspa?selectedProjectId=10363