Bug 3373 - globus removes the temporary job directory before pbs writes back into it
Status: RESOLVED LATER
Product: GRAM
Component: gt2 Gatekeeper/Jobmanager
Version: 4.0.0
Platform: PC Linux
Priority/Severity: P3 major
Target Milestone: ---
Reported: 2005-05-17 20:46
Modified: 2012-09-12 13:54


Attachments
diff of the changes made to pbs.pm (3.93 KB, patch)
2005-05-26 00:45, Gerson Galang


Description From 2005-05-17 20:46:12
I've configured globus to use pbs as its default job manager and it works fine.
But I start to have problems when I configure my pbs server to route all the
jobs it receives to a queue on a remote machine. PBS itself is not the problem,
because I can get the routing functionality working between the two machines I
am testing globus and pbs on.

Here's what happens when I just use plain pbs to run my experiment between
machines A and B:
setup: machine A has its default queue set as a routing queue to an execution
queue on machine B (a qmgr sketch of this setup follows the numbered steps).

1. user runs this command on machine A
qsub <pbsjob> -- pbsjob is just a script containing the command hostname

2. server on machine A receives the job and routes it to the execution queue on
machine B.

3. machine B executes the job and returns the output back to machine A.

4. the output of the job, containing machine B's hostname, gets written into the
user's working directory.
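
To make the setup line above concrete, here is a small sketch of how such a
routing queue could be created on machine A, driven from Perl so the qmgr input
is visible. The queue names, the destination host, and the choice to script it
through Perl are assumptions made purely for illustration, not a record of the
actual configuration.

use strict;
use warnings;

# Hypothetical qmgr input that makes machine A's default queue a routing
# queue whose destination is an execution queue on machine B.  Queue and
# host names are made up for illustration.
my $qmgr_input = <<'EOF';
create queue routeq queue_type=route
set queue routeq route_destinations = execq@machineB
set queue routeq enabled = true
set queue routeq started = true
set server default_queue = routeq
EOF

# Feed the commands to qmgr on machine A.
open my $qmgr, '|-', 'qmgr' or die "cannot run qmgr: $!";
print {$qmgr} $qmgr_input;
close $qmgr or warn "qmgr exited with status $?";

With a setup along these lines, a job submitted to the default queue on machine
A is forwarded to the execution queue on machine B, which is the behaviour the
numbered steps describe.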

And here's when I include globus in the picture:
setup: machine A has GRAM service and pbs server running. The default queue on
machine A is a routing queue to an execution queue on machine B.

1. user runs "globus-job-run machineA/jobmanager-pbs /bin/hostname" on machine A

2. gram service on machine A receives it and lets the pbs server on machine A
handle the job.

3. pbs on machine A routes the job to the execution queue. But once this step
gets executed, globus thinks that pbs has already finished executing the job, so
it deletes the temporary directory that was created under ~/.globus/job/machineA/
which is supposed to hold the output of the pbs job (a sketch of this failure
mode follows the list). Globus also returns nothing on the terminal.

4. pbs server on machine B receives and executes the job. Once finished, it tries
to send the results back to the working directory where the job was launched on
machine A. But since this directory doesn't exist anymore, the pbs server on
machine B just gives up and puts the output of the job into pbs's undelivered
directory.
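
A minimal sketch of the kind of poll logic that could produce the behaviour in
step 3 (purely illustrative and hypothetical; this is not the actual pbs.pm
code, and the job id, qstat parsing and state mapping are all assumptions):

#!/usr/bin/perl
# Hypothetical, simplified poll logic that shells out to qstat on the
# submitting host (machine A) and treats "job id no longer listed" as
# "job finished".  NOT the real pbs.pm implementation.
use strict;
use warnings;

# Map a PBS job_state letter to a coarse state, the way a naive poll might.
sub classify_pbs_state {
    my ($state) = @_;
    return 'PENDING' if $state =~ /^[QWH]$/;   # queued, waiting, held
    return 'ACTIVE'  if $state =~ /^[RE]$/;    # running, exiting
    return 'DONE'    if $state eq 'C';         # completed
    return 'UNKNOWN';
}

# Naive poll: ask the local (machine A) server about the job.
sub poll_job {
    my ($job_id) = @_;
    my $out = `qstat -f $job_id 2>/dev/null`;

    if ($? != 0 || $out !~ /job_state\s*=\s*(\S)/) {
        # Once the routing queue on machine A hands the job to machine B,
        # the local server may no longer report the id, so a naive poll
        # concludes DONE here.  The job manager then cleans up
        # ~/.globus/job/machineA/<id>, and when machine B later tries to
        # stage the output back there is no directory to write into.
        return 'DONE';
    }
    return classify_pbs_state($1);
}

print poll_job($ARGV[0] || '123.machineA'), "\n";

Run against a job id the local server still knows about, this prints the mapped
state; against a routed (or otherwise unknown) id it immediately prints DONE,
which is exactly the premature cleanup trigger described above.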

Is this a known bug in globus? Is there a way to fix this problem without
modifying the source?

By the way, routing of queues doesn't work out of the box in PBS (Torque); I
applied a patch posted on the torque mailing list to get this functionality
working:
http://www.supercluster.org/pipermail/torqueusers/2005-April/001567.html
------- Comment #1 From 2005-05-18 10:06:10 -------
There could be 2 things at fault here:
  1) the jobmanager is detecting that the job is DONE before it is really done.
  2) the bytes in the stdout file are not showing up on machine A in the job
     directory (~/.globus/...) due to NFS delays in propagating the info.

By your description it sounds like it is 1. If so, then the JM perl script would
need to be modified to correctly interpret when the job is truly DONE.
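
One hypothetical way to make that interpretation stricter (an assumed sketch for
discussion, not the fix that was later attached): do not report DONE just
because the local qstat has lost track of the job id; also require that the
output files have actually been staged back into the job directory. The
subroutine name and the paths passed to it are illustrative only.

use strict;
use warnings;

# Hypothetical refinement: a job that qstat no longer lists is only
# considered finished once its output files have shown up.
sub looks_finished {
    my ($job_id, $stdout_path, $stderr_path) = @_;

    # Exit status 0 means the server still knows the job
    # (queued, routed, or running somewhere).
    my $listed = system("qstat $job_id >/dev/null 2>&1") == 0;
    return 0 if $listed;

    # qstat has lost track of the job: it may have finished, or it may
    # merely have been routed to another server.  Only call it DONE once
    # PBS has delivered the output back into the job directory.
    return (-e $stdout_path && -e $stderr_path) ? 1 : 0;
}

# Example call with made-up job id and paths:
print looks_finished('123.machineA',
                     "$ENV{HOME}/.globus/job/machineA/x/stdout",
                     "$ENV{HOME}/.globus/job/machineA/x/stderr")
    ? "DONE\n" : "not done yet\n";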
------- Comment #2 From 2005-05-18 19:25:03 -------
Subject: Re: globus removes the temporary job directory before pbs writes back into it

Can you tell me where I can find the jobmanager perl script? I had a 
look at etc/grid-services/jobmanager-pbs but it looks like that's not 
the script you were referring to. Or is it the pbs.pm file which needs 
to be modified?

Thanks.

------- Comment #3 From 2005-05-19 15:11:18 -------
It would be the pbs.pm file that would need to be modified

$GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm
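
For orientation, that file is a Perl module roughly shaped as below (a heavily
abridged sketch following the GRAM job manager interface, with method bodies
omitted; consult the installed file for the real implementation). The poll
method is where the premature DONE decision discussed above would be made.

# Abridged sketch of the shape of
# $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/pbs.pm
# (method bodies omitted; see the installed file for the real code).
package Globus::GRAM::JobManager::pbs;

use Globus::GRAM::JobState;    # job state constants (PENDING, ACTIVE, DONE, ...)
use Globus::GRAM::JobManager;  # base class for LRM-specific job managers
our @ISA = qw(Globus::GRAM::JobManager);

sub submit { ... }   # builds the PBS submit script and runs qsub
sub poll   { ... }   # runs qstat and maps the result to a JobState;
                     # the place where "DONE too early" gets decided
sub cancel { ... }   # runs qdel

1;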
------- Comment #4 From 2005-05-26 00:45:57 -------
Created an attachment (id=625)
diff of the changes made to pbs.pm
------- Comment #5 From 2005-05-26 00:55:02 -------
Subject: Re: globus removes the temporary job directory before pbs writes back into it

Hi,

We've written a patch to fix this problem. We have used the bug report 
posted on the LCG bugzilla as a reference.

https://savannah.cern.ch/bugs/?func=detailitem&item_id=6329

Will this patch be applied in future releases of Globus?

Thanks,
Gerson


------- Comment #6 From 2012-09-12 13:54:32 -------
We've migrated our issue tracking software to jira.globus.org. Any new issues
should be added here:

http://jira.globus.org/secure/VersionBoard.jspa?selectedProjectId=10363

As this issue hasn't been commented on in several years, we're closing it. If
you feel it is still relevant, please add it to jira.