Bugzilla – Bug 755
stdout/stderr files in GASS cache lost after running short job
Last modified: 2004-06-29 17:10:56
We hit a problem when running short jobs using the LoadLeveler job manager on a multi-node cluster running NSF.

Basically, if the job run-time is short enough, the default stdout and stderr files appear to be deleted from the GASS cache by the job-manager, making it impossible to get the output, either interactively using globus-job-run or using globus-job-get-output after submitting a batch job.

The problem only occurs when [a] running short jobs, [b] the job is scheduled by LoadLeveler to run on a node different than the gatekeeper/job-manager node, and [c] the gass-cache is on a shared NFS filesystem (aside: NFS is required by LoadLeveler).

It looks like the problem is that the stdout/stderr files are created in the GASS cache OK on the node that the job is actually running on, but they show up as empty on the gatekeeper/job-manager node for 10-20 seconds, due to the NFS update delay. I'm somewhat guessing here, but perhaps for a short running job the job-manager receives the job complete state *before* NFS has updated the corresponding stdout/stderr files on the actual job-manager's node (i.e. they still show up as empty) and as a result the job-manager cleans them up. Does this sound reasonable?

The workaround/fix we came up with for the LoadLeveler job manager is to touch the stdout/stderr files in the ./JobManager/loadleveler.pm script, in the poll function, just before returning that the job has completed. This seems to prevent the stdout/stderr files from being deleted before they are retrieved. I'm not sure if this is because touching the files forces NFS to sync up, or because something in the job-manager cleanup code specifically looks for empty and/or invalidly timestamped GASS cache files and removes them.

I can provide more details and maybe try to reproduce an example if you like, but does this behavior sound reasonable? I expect it would affect other job managers, so I'm concerned that our fix is only good for LoadLeveler.
Any insight would be welcome, thanks.
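For readers following along, the workaround described above can be sketched roughly as below. This is only an illustration in Python, not the actual loadleveler.pm code (which is Perl): the names `touch_outputs` and `poll`, the `job_is_done` flag, and the "ACTIVE"/"DONE" return strings are all assumptions made up for the example.

```python
from pathlib import Path


def touch_outputs(paths):
    """Behave like the shell `touch`: create each file if it is missing,
    otherwise update its modification time to now."""
    for p in paths:
        Path(p).touch(exist_ok=True)


def poll(job_is_done, stdout_path, stderr_path):
    """Hypothetical poll step: touch the output files just before
    reporting that the job has completed."""
    if not job_is_done:
        return "ACTIVE"
    # Assumption: this mirrors the touch added to the poll function.
    touch_outputs([stdout_path, stderr_path])
    return "DONE"
```

Whether the touch helps because it makes the NFS client refresh its view of the files, or for some other reason, is exactly the open question in this report; the sketch does not settle it.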
Oops - make that "...multi-node cluster running *NFS*." :-)
>We hit a problem when running short jobs using the LoadLeveler job manager
>on a multi-node cluster running NSF.
>
>Basically, if the job run-time is short enough, the default stdout and stderr
>files appear to be deleted from the GASS cache by the job-manager, making it
>impossible to get the output, either interactively using globus-job-run or
>using globus-job-get-output after submitting a batch job.
>
>The problem only occurs when [a] running short jobs, [b] the job is
>scheduled by LoadLeveler to run on a node different than the
>gatekeeper/job-manager node, and [c] the gass-cache is on a shared NFS
>filesystem (aside: NFS is required by LoadLeveler).
>
>It looks like the problem is that the stdout/stderr files are created in the
>GASS cache OK on the node that the job is actually running on, but they
>show up as empty on the gatekeeper/job-manager node for 10-20 seconds, due to
>the NFS update delay. I'm somewhat guessing here, but perhaps for a short
>running job the job-manager receives the job complete state *before* NFS has
>updated the corresponding stdout/stderr files on the actual job-manager's node
>(i.e. they still show up as empty) and as a result the job-manager cleans them
>up. Does this sound reasonable?

I think so. After the JM gets the DONE state for the job, it does a last check for any more information in stdout/stderr and then cleans up the job (removes all files). If the data takes 10-20 seconds to get to the gass cache file, then it is quite likely that the JM will not have any data to send.

I have been informed of the same problem on a cluster using NFS here at ANL.

You should rule out the possibility that your loadleveler poll script is not broken. Maybe it misinterprets a loadleveler status as DONE, but I think NFS is the more likely cause, since a broken poll script would affect long running jobs too.
>The workaround/fix we came up with for the LoadLeveler job manager is to touch
>the stdout/stderr files in the ./JobManager/loadleveler.pm script in the poll
>function just before returning that the job has completed. This seems to
>prevent the stdout/stderr files from being deleted before they are retrieved.
>I'm not sure if this is because by touching the files we force NFS to sync up,
>or because something in the job-manager cleanup code is specifically looking
>for empty and/or invalid timestamped GASS cache files and then removing them.
>
>I can provide more details and maybe try to reproduce an example if you like,
>but does this behavior sound reasonable? I expect it would affect other job
>managers, so I'm concerned that our fix is only good for LoadLeveler.
>
>Any insight would be welcome, thanks.
>...
>I have been informed of the same problem on a cluster using NFS here at ANL.
>
>You should rule out the possibility that your loadleveler poll script is
>not broken.

In other words I should *not* rule out the possibility that my loadleveler script *is* broken?

[did you miss a negative somewhere in that statement... :-)]

Is this something that can/should be fixed in the Globus job-manager source, or is it up to the job-manager scripts themselves to make sure that all the job's output files are completely 'flushed' in the GASS cache (everywhere!) before returning "DONE"?

- Gareth
> >...
> >I have been informed of the same problem on a cluster using NFS here at ANL.
> >
> >You should rule out the possibility that your loadleveler poll script is
> >not broken.
>
>In other words I should *not* rule out the possibility that my loadleveler
>script *is* broken?
>
>[did you miss a negative somewhere in that statement... :-)]

I didn't not :-)

>Is this something that can/should be fixed in the Globus job-manager
>source, or is it up to the job-manager scripts themselves to make sure that
>all the job's output files are completely 'flushed' in the GASS cache
>(everywhere!) before returning "DONE"?

I think it makes the most sense to have the JM ensure that files it cares about (stdout, stderr, ...) have been 'flushed'. Doing a touch in the script is good for now, until we get a better fix in the JM.

>- Gareth
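As an illustration only of what "have the JM ensure that files have been flushed" might look like, here is a minimal Python sketch that waits for an output file to become non-empty and stop changing size before cleanup is allowed. It is not the Globus job-manager code or the eventual fix; the function name `wait_until_stable` and its parameters (`timeout`, `interval`, `stable_checks`) are hypothetical.

```python
import os
import time


def wait_until_stable(path, timeout=30.0, interval=1.0, stable_checks=2):
    """Return True once `path` is non-empty and its size has matched the
    previous poll `stable_checks` times in a row; return False if
    `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    last_size = None
    unchanged = 0
    while True:
        try:
            size = os.stat(path).st_size
        except FileNotFoundError:
            size = None  # file not visible on this node yet
        if size and size == last_size:
            unchanged += 1
            if unchanged >= stable_checks:
                return True
        else:
            unchanged = 0
        last_size = size
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

One obvious limitation of this approach: a job that legitimately writes nothing to stdout/stderr always waits out the full timeout, so the timeout would need to be chosen with the 10-20 second delay reported above in mind.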
*** This bug has been marked as a duplicate of 950 ***