Bug 755 - stdout/stderr files in GASS cache lost after running short job
Status: RESOLVED DUPLICATE of bug 950
Product: GRAM
Component: gt2 Gatekeeper/Jobmanager
Version: unspecified
Platform: RS/6000 AIX
Importance: P2 normal
Reported: 2003-02-26 16:01 by
Modified: 2004-06-29 17:10


Description From 2003-02-26 16:01:55
We hit a problem when running short jobs using the LoadLeveler job manager on a
multi-node cluster running NSF.

Basically, if the job run-time is short enough, the default stdout and stderr
files appear to be deleted from the GASS cache by the job-manager, making it
impossible to get the output, either interactively using globus-job-run or using
globus-job-get-output after submitting a batch job.

The problem only occurs when [a] the job is short, [b] the job is
scheduled by LoadLeveler to run on a node different from the
gatekeeper/job-manager node, and [c] the gass-cache is on a shared NFS
filesystem (aside: NFS is required by LoadLeveler).

It looks like the problem is that the stdout/stderr files are created in the
GASS cache OK on the node that the job is actually running on, but they show up
as empty on the gatekeeper/job-manager node for 10-20 seconds, due to
the NFS update delay. I'm somewhat guessing here, but perhaps for a short
running job the job-manager receives the job complete state *before* NFS has
updated the corresponding stdout/stderr files on the actual job-manager's node
(i.e. they still show up as empty) and as a result the job-manager cleans them
up. Does this sound reasonable?

The workaround/fix we came up with for the LoadLeveler job manager is to touch
the stdout/stderr files in the ./JobManager/loadleveler.pm script in the poll
function just before returning that the job has completed. This seems to prevent
the stdout/stderr files from being deleted before they are retrieved. I'm not
sure if this is because by touching the files we force NFS to sync up, or
because something in the job-manager cleanup code is specifically looking for
empty and/or invalid timestamped GASS cache files and then removing them.
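
The workaround above can be sketched roughly as follows. This is only an illustration in Python; the real loadleveler.pm is a Perl module, and the names `touch` and `poll_job`, the plain-string states, and the file paths are all hypothetical, not part of the Globus code.

```python
import os


def touch(path):
    """Create path if it is missing and set its timestamps to now, like touch(1)."""
    with open(path, "a"):
        pass
    os.utime(path, None)


def poll_job(job_is_done, stdout_path, stderr_path):
    """Hypothetical poll step: touch the output files before reporting DONE.

    job_is_done stands in for whatever the scheduler query returned.
    """
    if not job_is_done:
        return "ACTIVE"
    # Touch both files on the job-manager node just before reporting
    # completion, as the workaround describes.
    for path in (stdout_path, stderr_path):
        touch(path)
    return "DONE"
```

Whether this helps because the touch forces the NFS client to refresh its view of the files, or for some other reason, is still the open question raised above.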


I can provide more details and maybe try to reproduce an example if you like,
but does this behavior sound reasonable? I expect it would affect other job
managers, so I'm concerned that our fix is only good for LoadLeveler.
Any insight would be welcome, thanks.
------- Comment #1 From 2003-02-26 16:12:46 -------
oops - make that "...multi-node cluster running *NFS*." :-)

------- Comment #2 From 2003-02-27 18:05:01 -------
>We hit a problem when running short jobs using the LoadLeveler job manager on a
>multi-node cluster running NSF.
>
>Basically, if the job run-time is short enough, the default stdout and stderr
>files appear to be deleted from the GASS cache by the job-manager, making it
>impossible to get the output, either interactively using globus-job-run or using
>globus-job-get-output after submitting a batch job.
>
>The problem only occurs when [a] the job is short, [b] the job is
>scheduled by LoadLeveler to run on a node different from the
>gatekeeper/job-manager node, and [c] the gass-cache is on a shared NFS
>filesystem (aside: NFS is required by LoadLeveler).
>
>It looks like the problem is that the stdout/stderr files are created in the
>GASS cache OK on the node that the job is actually running on, but they show up
>as empty on the gatekeeper/job-manager node for 10-20 seconds, due to
>the NFS update delay. I'm somewhat guessing here, but perhaps for a short
>running job the job-manager receives the job complete state *before* NFS has
>updated the corresponding stdout/stderr files on the actual job-manager's node
>(i.e. they still show up as empty) and as a result the job-manager cleans them
>up. Does this sound reasonable?

I think so.  After the JM gets the DONE state for the job, it will do a last 
check for any more information in stdout/stderr and then clean up the job 
(remove all files).  If the data takes 10-20 seconds to get to the gass 
cache file, then it is quite likely that the JM will not have any data to send.

I have been informed of the same problem on a cluster using NFS here at ANL.

You should rule out the possibility that your loadleveler poll script is 
not broken.  Maybe it misinterprets a loadleveler status to be DONE, but I 
think it is probably due to NFS as this would affect long running jobs too.


>The workaround/fix we came up with for the LoadLeveler job manager is to touch
>the stdout/stderr files in the ./JobManager/loadleveler.pm script in the poll
>function just before returning that the job has completed. This seems to prevent
>the stdout/stderr files from being deleted before they are retrieved. I'm not
>sure if this is because by touching the files we force NFS to sync up, or
>because something in the job-manager cleanup code is specifically looking for
>empty and/or invalid timestamped GASS cache files and then removing them.
>
>
>I can provide more details and maybe try to reproduce an example if you like,
>but does this behavior sound reasonable? I expect it would affect other job
>managers, so I'm concerned that our fix is only good for LoadLeveler.
>Any insight would be welcome, thanks.



------- Comment #3 From 2003-02-27 21:37:18 -------
>...
>I have been informed of the same problem on a cluster using NFS here at ANL.
>
>You should rule out the possibility that your loadleveler poll script is 
>not broken.

In other words I should *not* rule out the possibility that my loadleveler 
script *is* broken? 

[did you miss a negative somewhere in that statement... :-)]


Is this something that can/should be fixed in the Globus job-manager source, or 
is it up to the job-manager scripts themselves to make sure that all the job's 
output files are completely 'flushed' in the GASS cache (everywhere!) before 
returning "DONE"?

- Gareth
------- Comment #4 From 2003-02-28 10:05:01 -------
> >...
> >I have been informed of the same problem on a cluster using NFS here at ANL.
> >
> >You should rule out the possibility that your loadleveler poll script is
> >not broken.
>
>In other words I should *not* rule out the possibility that my loadleveler
>script *is* broken?
>
>[did you miss a negative somewhere in that statement... :-)]

I didn't not :-)


>Is this something that can/should be fixed in the Globus job-manager source, or
>is it up to the job-manager scripts themselves to make sure that all the job's
>output files are completely 'flushed' in the GASS cache (everywhere!) before
>returning "DONE"?

I think it makes the most sense to have the JM ensure that files it cares 
about (stdout, stderr, ...) have been 'flushed'.  Doing a touch in the 
script is good for now til we get a better fix in the JM.
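
One way the JM-side check could look is sketched below, purely as an assumption about what "ensure the files have been flushed" might mean. It is not the actual job-manager code (which is not written in Python): the function name `wait_until_stable` and its parameters are hypothetical. The idea is to delay cleanup until an output file's size has stopped changing for a quiet period.

```python
import os
import time


def wait_until_stable(path, quiet_period=1.0, timeout=30.0, interval=0.2):
    """Return the file size once it has not changed for quiet_period seconds.

    Raises TimeoutError if the size keeps changing until timeout expires.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    last_change = time.monotonic()
    while time.monotonic() < deadline:
        # Open the file instead of doing a bare stat. Assumption: on many
        # NFS clients an open() revalidates cached attributes, so this is
        # more likely to observe data written on another node.
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
        now = time.monotonic()
        if size != last_size:
            # Size changed (or first sample): restart the quiet period.
            last_size, last_change = size, now
        elif now - last_change >= quiet_period:
            return size
        time.sleep(interval)
    raise TimeoutError("size of %s did not settle within %.1fs" % (path, timeout))
```

A job can legitimately produce no output, so an empty file cannot simply be treated as "not ready yet". For the delay reported here, quiet_period would have to be longer than the 10-20 seconds the files stay empty, which slows down cleanup for every job.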


>- Gareth



------- Comment #5 From 2003-07-25 10:54:52 -------

*** This bug has been marked as a duplicate of 950 ***