Bugzilla – Bug 3373
globus removes the temporary job directory before pbs writes back into it
Last modified: 2012-09-12 13:54:32
I've configured globus to use pbs as its default job manager and it works fine. But I start to have problems when I configure my pbs server to route all the jobs it receives to a queue on a remote machine. PBS itself is not the problem, because I can get the routing functionality working between the two machines I am testing globus and pbs on.

Here's what happens when I use plain pbs to run my experiment between machines A and B:

Setup: machine A has its default queue set as a routing queue to an execution queue on machine B.

1. The user runs this command on machine A:
   qsub <pbsjob>
   (pbsjob is just a script containing the command hostname)
2. The server on machine A receives the job and routes it to the execution queue on machine B.
3. Machine B executes the job and returns the output to machine A.
4. The job output, containing machine B's hostname, gets written into the user's working directory.

And here's what happens when I include globus in the picture:

Setup: machine A has the GRAM service and the pbs server running. The default queue on machine A is a routing queue to an execution queue on machine B.

1. The user runs "globus-job-run machineA/jobmanager-pbs /bin/hostname" on machine A.
2. The GRAM service on machine A receives it and lets the pbs server on machine A handle the job.
3. pbs on machine A routes the job to the execution queue. But once this step is executed, globus thinks that pbs has already finished executing the job, so it deletes the temporary directory that was created in .globus/job/machineA/, which is supposed to hold the output of the pbs job. Globus also returns nothing on the terminal.
4. The pbs server on machine B receives and executes the job. Once finished, it tries to send the results back to the working directory where the job was launched on machine A. But since this directory doesn't exist anymore, the pbs server on machine B just gives up and puts the output of the job into pbs' undelivered directory.
Is this a known bug in globus? Is there a way to fix this problem without modifying the source?

By the way, routing queues don't work in PBS out of the box. I applied a patch posted on the torque mailing list to get this functionality to work:
http://www.supercluster.org/pipermail/torqueusers/2005-April/001567.html
There could be 2 things at fault here:
1) The jobmanager is detecting that the job is DONE before it is really done.
2) The bytes in the stdout file are not showing up on machine A in the job directory (~/.globus/...) due to NFS delays in propagating the info.

By your description it sounds like it is 1. If so, then the JM perl script would need to be modified to correctly interpret when the job is truly DONE.
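To make case 1 concrete, here is a minimal sketch of the decision a poll would have to make. It is written in Python for illustration only; it is not the pbs.pm code and not the attached patch. The function name and its three inputs are hypothetical, and how a job manager would actually learn that a job was routed to another server is left open. The state letters are the usual qstat ones (Q, H, W, T, R, E, plus C on torque).

```python
from enum import Enum


class JobState(Enum):
    PENDING = "PENDING"
    ACTIVE = "ACTIVE"
    DONE = "DONE"


def classify_poll(local_state, routed_away, output_delivered):
    """Decide what a poll should report (hypothetical helper).

    local_state: qstat state letter from the local server, or None when
        the local server no longer knows the job id.
    routed_away: True if the job was handed to a queue on another server
        (assumed to be known by some other means).
    output_delivered: True once stdout has arrived in the job directory.
    """
    if local_state in ("R", "E"):
        return JobState.ACTIVE
    if local_state not in (None, "C"):
        # Q, H, W, T and similar: the job has not started running yet.
        return JobState.PENDING
    # The job looks finished from the local server's point of view.
    # For a routed job that only means it left this server, so keep
    # reporting ACTIVE until its output has actually come back.
    if local_state is None and routed_away and not output_delivered:
        return JobState.ACTIVE
    return JobState.DONE
```

With this rule the job directory would only become eligible for cleanup after the remote server has delivered the output, which is the ordering the plain pbs experiment relies on.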
Subject: Re: globus removes the temporary job directory before pbs writes back into it

Can you tell me where I can find the jobmanager perl script? I had a look at etc/grid-services/jobmanager-pbs but it looks like that's not the script you were referring to. Or is it the pbs.pm file which needs to be modified?

Thanks.

bugzilla-daemon@mcs.anl.gov wrote:
> http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=3373
>
> smartin@mcs.anl.gov changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>          Status    |NEW                         |ASSIGNED
>
> ------- Additional Comments From smartin@mcs.anl.gov 2005-05-18 10:06 -------
> There could be 2 things at fault here:
> 1) the jobmanager is detecting that the job is DONE before it is really done.
> 2) The bytes in the stdout file are not showing up on Machine A in the job
> directory (~/.globus/...) due to NFS delays in propagating the info.
>
> By your description it sounds like it is 1. If so, then the JM perl script
> would need to be modified to correctly interpret when the job is truly DONE.
>
> ------- You are receiving this mail because: -------
> You reported the bug, or are watching the reporter.
It would be the pbs.pm file that would need to be modified:

$GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm
Created an attachment (id=625): diff of the changes made to pbs.pm
Subject: Re: globus removes the temporary job directory before pbs writes back into it

Hi,

We've written a patch to fix this problem. We have used the bug report posted on the LCG bugzilla as a reference:
https://savannah.cern.ch/bugs/?func=detailitem&item_id=6329

Will this patch be applied on the future releases of Globus?

Thanks,
Gerson

bugzilla-daemon@mcs.anl.gov wrote:
> http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=3373
>
> ------- Additional Comments From smartin@mcs.anl.gov 2005-05-19 15:11 -------
> It would be the pbs.pm file that would need to be modified
>
> $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm
>
> ------- You are receiving this mail because: -------
> You reported the bug, or are watching the reporter.
We've migrated our issue tracking software to jira.globus.org. Any new issues should be added here: http://jira.globus.org/secure/VersionBoard.jspa?selectedProjectId=10363 As this issue hasn't been commented on in several years, we're closing it. If you feel it is still relevant, please add it to jira.