Bug 950 - Globus 2.2.4 does not print output for various job managers, LSF and PBS
Status: RESOLVED FIXED
Product: GRAM
Component: gt2 Gatekeeper/Jobmanager
Version: unspecified
Platform: PC Linux
Importance: P2 critical
Target Milestone: 3.2
Reported: 2003-05-14 10:39
Modified: 2004-02-02 08:22


Attachments
Current LSF for grid3 (11.25 KB, text/plain)
2003-10-14 15:26, Jens-S. Vöckler
More NFS sync logging (23.54 KB, text/plain)
2003-10-14 15:27, Jens-S. Vöckler
Globus 2.4.3 setup/globus/condor.in (6.82 KB, patch)
2003-10-28 10:58, Jens-S. Vöckler
Globus 2.4.3 setup/globus/pbs.in (14.77 KB, patch)
2003-10-28 11:00, Jens-S. Vöckler
Globus 2.4.3 lib/perl/Globus/GRAM/JobManager.pm (10.83 KB, patch)
2003-10-28 11:01, Jens-S. Vöckler
Globus 2.4.3 lib/perl/Globus/GRAM/StdioMerger.pm (1.69 KB, patch)
2003-10-28 11:02, Jens-S. Vöckler
Globus 2.4.3 lib/perl/Globus/GRAM/JobManager/condor.pm (6.92 KB, patch)
2003-10-28 11:04, Jens-S. Vöckler
Globus 2.4.3 lib/perl/Globus/GRAM/JobManager/pbs.pm (14.86 KB, patch)
2003-10-28 11:06, Jens-S. Vöckler
Globus 2.2.4 setup/globus/condor.in (9.32 KB, patch)
2003-10-28 21:02, Jens-S. Vöckler
Globus 2.2.4 setup/globus/fork.in (2.86 KB, patch)
2003-10-28 21:03, Jens-S. Vöckler
Globus 2.2.4 setup/globus/lsf.in (15.87 KB, patch)
2003-10-28 21:04, Jens-S. Vöckler
Globus 2.2.4 setup/globus/pbs.in (16.21 KB, patch)
2003-10-28 21:04, Jens-S. Vöckler
Globus 2.2.4 lib/perl/Globus/GRAM/JobManager.pm (17.38 KB, patch)
2003-10-28 21:05, Jens-S. Vöckler
Globus 2.2.4 lib/perl/Globus/GRAM/StdioMerger.pm (7.10 KB, patch)
2003-10-28 21:06, Jens-S. Vöckler
Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/fork.pm (2.89 KB, patch)
2003-10-28 21:08, Jens-S. Vöckler
Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/condor.pm (9.36 KB, patch)
2003-10-28 21:09, Jens-S. Vöckler
Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/pbs.pm (16.25 KB, patch)
2003-10-28 21:11, Jens-S. Vöckler
Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/lsf.pm (15.90 KB, patch)
2003-10-28 21:12, Jens-S. Vöckler
Globus 2.2.4 collective patches (fork/condor/pbs/lsf) (49.62 KB, application/octet-stream)
2003-10-29 12:43, Jens-S. Vöckler
Globus 2.4.3 collective patches (fork/condor/pbs) (30.61 KB, application/octet-stream)
2003-10-29 12:46, Jens-S. Vöckler
Globus 2.2.4 collective patches (fork/lsf/pbs/condor) (50.01 KB, application/octet-stream)
2003-11-12 17:51, Jens-S. Vöckler
Globus 2.4.3 collective patches (fork/condor/pbs) (31.18 KB, application/octet-stream)
2003-11-12 17:54, Jens-S. Vöckler
Globus 2.2.4 collective patches (fork/lsf/pbs/condor) (50.10 KB, application/octet-stream)
2003-12-02 10:24, Jens-S. Vöckler
Globus 2.4.3 collective patches (fork/condor/pbs) (31.34 KB, application/x-tgz)
2003-12-02 10:25, Jens-S. Vöckler
Patch to the VDT 1.1.12 (Globus 2.2.4) lsf.pm (932 bytes, patch)
2004-01-28 09:02, Gabriele Carcassi




Description From 2003-05-14 10:39:36
I installed Globus 2.2.4 with the LSF job manager.
When I run a job through Globus/LSF, I receive an email reporting that the
command executed successfully, but Globus 2.2.4 does not print any output.
I checked the log file, and it shows no error message.

[dtyu@pdsfgrid2 ~]$ globus-job-run atlasgrid01.usatlas.bnl.gov/jobmanager-lsf -q grid-test /bin/echo "hello"

The email from the LSF admin account is included below.

Was this a cache corruption problem?

Thank you very much. 
Regards,
Dantong



From: 	LSF <lsfadmin@rcf.rhic.bnl.gov>
To: 	dtyu@acas055.usatlas.bnl.gov
Subject: 	Job 71832: <# LSF batch job script built by Globus job manager;  #!
/bin/sh;#BSUB -q grid-test;#BSUB -i /dev/null;#BSUB -o /dev/null;#BSUB -e
/usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768850;#BSUB -n
1;#BSUB -N;GLOBUS_GRAM_MYJOB> Done
Date: 	Mon, 12 May 2003 15:47:46 -0400	
        Job <# LSF batch job script built by Globus job manager;  #!
/bin/sh;#BSUB -q grid-test;#BSUB -i /dev/null;#BSUB -o /dev/null;#BSUB -e
/usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768850;#BSUB -n
1;#BSUB -N;GLOBUS_GRAM_MYJOB> was submitted from host <spider> by user <dtyu>.
Job was executed on host(s) <acas055>, in queue <grid-test>, as user <dtyu>.
</usatlas/u/dtyu> was used as the home directory.
</direct/usatlas+u/dtyu> was used as the working directory.
Started at Mon May 12 15:47:40 2003
Results reported at Mon May 12 15:47:46 2003

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
# LSF batch job script built by Globus job manager


#! /bin/sh
#BSUB -q grid-test
#BSUB -i /dev/null
#BSUB -o /dev/null
#BSUB -e /usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768850
#BSUB -n 1
#BSUB -N
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://spider.usatlas.bnl.gov:6159/; export
GLOBUS_GRAM_MYJOB_CONTACT
X509_CERT_DIR=/etc/grid-security/certificates; export X509_CERT_DIR
GLOBUS_GRAM_JOB_CONTACT=https://spider.usatlas.bnl.gov:6158/16623/1052768844/;
export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_LOCATION=/opt/globus2; export GLOBUS_LOCATION
X509_USER_PROXY=/usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768848;
export X509_USER_PROXY

# Changing to directory as requested by user
cd /usatlas/u/dtyu

# Executing job as requested by user
/bin/echo hello >
/usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768849 < /dev/null &
wait

------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      0.05 sec.
    Max Memory :         2 MB
    Max Swap   :         4 MB

    Max Processes  :         1

Read file </usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768850> for
stderr output of this job.
------- Comment #1 From 2003-05-14 13:00:00 -------
This could be due to NFS problems/delays.

Can you try submitting the LSF job by hand using bsub and verify that the 
output does get to the output file?

I have seen some NFS problems/delays where the scheduler returns a status of 
DONE, but the data is not yet in the job's output file.  We are working on some 
improvements to the scheduler Perl scripts where the JM forces an NFS update 
after the job is DONE.

-Stu
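
The delay described in this comment can be worked around on the reading side by waiting briefly for the output file to become visible. The following is a minimal sketch of that idea, not the job manager's actual implementation: wait_for_output is a hypothetical helper, and the assumption that listing the parent directory prompts the NFS client to refresh its cached entries depends on the client and its mount options.

```shell
#!/bin/sh
# Sketch only: wait until a job output file is visible and non-empty.
# wait_for_output is a hypothetical helper, not part of Globus or LSF.
wait_for_output() {
    file=$1
    tries=${2:-10}      # how many times to look
    delay=${3:-1}       # seconds to sleep between looks
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Assumption: listing the parent directory nudges the NFS client
        # to revalidate its cached directory entries.
        ls "$(dirname "$file")" >/dev/null 2>&1 || true
        if [ -s "$file" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

A job that legitimately prints nothing would make this helper time out, so a caller should treat a non-zero return as "no data seen", not as a job failure.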
------- Comment #2 From 2003-05-15 17:50:34 -------
I submitted the LSF job manually. It did not run because
the long path of the output file could not be created.
When I shortened the output/error file paths, still using
the NFS directory, it worked perfectly well.

I submitted another job through the job manager, and it works fine:
globus-job-run atlasgrid01.usatlas.bnl.gov/jobmanager-lsf /bin/ls -R

I re-submitted /bin/echo "hello" through jobmanager-lsf;
it does NOT work because the
output is too short and gets lost.
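
A manual test like the one described here can be made repeatable by generating the batch script. The sketch below prints a minimal LSF script using only the #BSUB directives that appear in the job scripts quoted in this report (-q, -o, -e). make_bsub_script is a hypothetical helper, and the queue name and paths passed to it are examples.

```shell
#!/bin/sh
# Sketch only: print a minimal LSF batch script with short output paths.
# make_bsub_script is a hypothetical helper, not part of Globus or LSF.
make_bsub_script() {
    queue=$1
    out=$2
    err=$3
    shift 3
    printf '#! /bin/sh\n'
    printf '#BSUB -q %s\n' "$queue"
    printf '#BSUB -o %s\n' "$out"
    printf '#BSUB -e %s\n' "$err"
    # The remaining arguments form the command line; quoting is not
    # preserved, so this only suits simple commands.
    printf '%s\n' "$*"
}
```

The result can be piped to bsub on standard input, for example: make_bsub_script grid-test "$HOME/echo.out" "$HOME/echo.err" /bin/echo hello | bsub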
 
------- Comment #3 From 2003-05-16 09:28:42 -------
We at GriPhyN/Chimera have developed a set of patches after stumbling over the
same problem, which appears to be NFS-related (I ran into it on Condor myself,
and on PBS with PDQ). Unfortunately, the "published" patches are for
2.4.0 (bug 931). Since I don't have access to any LSF installation myself, I
recreated for LSF what I did for Condor and PBS.

If you'd like to try my patches, can you tar up
$GLOBUS_LOCATION/lib/perl/Globus/GRAM and send it to me? I'll patch the files
with what I did for 2.4.0, and send it back for testing (e.g. move the GRAM to
GRAM.old and untar the patches in a new GRAM directory). This test would also
benefit Globus.

I hope that you have installed all the updates for 2.2.4, though, or my patching
process will be a lot more tedious than I anticipated when writing this offer.
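
The swap suggested in this comment (move GRAM to GRAM.old, then untar the patched files into a new GRAM directory) could be scripted roughly as follows. This is a sketch under stated assumptions: install_patched_gram is a hypothetical helper, and the tarball is assumed to contain a top-level GRAM/ directory.

```shell
#!/bin/sh
# Sketch only: install a patched GRAM Perl tree while keeping the original.
# install_patched_gram is a hypothetical helper, not a Globus tool.
install_patched_gram() {
    perl_dir=$1      # e.g. $GLOBUS_LOCATION/lib/perl/Globus
    tarball=$2       # absolute path to the patched tarball
    [ -d "$perl_dir/GRAM" ] || return 1
    [ ! -e "$perl_dir/GRAM.old" ] || return 1   # never overwrite a backup
    mv "$perl_dir/GRAM" "$perl_dir/GRAM.old" || return 1
    if ! tar -xf "$tarball" -C "$perl_dir"; then
        # Roll back so the installed job managers keep working.
        rm -rf "$perl_dir/GRAM"
        mv "$perl_dir/GRAM.old" "$perl_dir/GRAM"
        return 1
    fi
}
```

Keeping GRAM.old in place makes it easy to return to the unpatched scripts if the test fails.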
------- Comment #4 From 2003-06-18 08:36:09 -------
Stu,

About a month ago, you wrote: "We are working on some improvements to the 
scheduler Perl scripts where the JM forces an NFS update after the job is 
DONE." 

How are these progressing? Do you have an estimated time for the release? Will 
they only be for Globus 2.4, or will they include Globus 2.2.4?

Thanks,
-alain

------- Comment #5 From 2003-06-23 14:54:35 -------
We have not been able to work on the NFS updates yet.  After the 3.0 release, I 
hope to schedule some time to update all the scheduler scripts with 
Jens's improvements.

-Stu
------- Comment #6 From 2003-06-23 15:11:37 -------
Globus 2.2.4 has the missing-output problem. You mentioned that 
Globus 2.4 has a fix; does the fix work on Globus 2.2.4? We use 
VDT 1.1.9, which includes Globus 2.2.4. This problem will seriously 
affect our upcoming data production. It also happens with the 
Globus Condor job manager, where job output is constantly
missing. I do not know when Globus 3.0 will be released. If you could 
give me a precise date, I would appreciate it. 
------- Comment #8 From 2003-06-23 16:26:47 -------
The fixes from 2.4 (bug #931) may work on 2.2.4, if and only if the 2.2.4
installation has all updates and advisories installed. Unfortunately, GPT had
problems with some of the fixes: it claimed to have installed an update, but
did not really install it unless -force'd to do so. 

Do you want my diffs for 2.2.4? Which remote scheduling system are you using? 
------- Comment #9 From 2003-07-17 11:05:04 -------
> The fixes from 2.4 (bug #931) may work on 2.2.4, if and only if the 2.2.4 had
> all updates and advisories installed. Unfortunately, GPT displayed some problems
> with some of the fixes, in that it claimed it installed an update, but
> really didn't install it unless -force'd to do so.

I downloaded Globus 2.4.2 and the most recent Globus LSF job manager
(globus_gram_job_manager_setup_lsf-1.4.tar.gz) from
ftp://ftp.globus.org/pub/gt2/2.4/2.4.2/, unless you have another version
that is not listed at http://www.globus.org/gt2.4/download.html. I tested
it with the Globus LSF job manager, and the missing-result problem still exists.
I am not sure how this job manager can help Globus 2.2.4.

> Do you want my diffs for 2.2.4? Which remote scheduling system are you using?

Yes, I do want a fix. I am using LSF 5.X. I downloaded the most recent version
of the Globus LSF job manager for Globus 2.2.4 from
ftp://ftp.globus.org/pub/gt2/2.2/2.2.4/extra/src/globus_gram_job_manager_setup_lsf-1.2.tar.gz
If you could get the source tarball from there and apply your diff, I would be
happy to try it. 

Thank you.
Dantong
------- Comment #10 From 2003-07-23 08:04:18 -------
*** Bug 1076 has been marked as a duplicate of this bug. ***
------- Comment #11 From 2003-07-23 08:05:16 -------
*** Bug 951 has been marked as a duplicate of this bug. ***
------- Comment #12 From 2003-07-25 10:54:53 -------
*** Bug 755 has been marked as a duplicate of this bug. ***
------- Comment #13 From 2003-07-29 13:13:19 -------
We have developed a fix for this and a few other GRAM script related problems, 
but they resulted in adding some new APIs to the Perl. These are committed to 
the CVS trunk, and were developed in the gram_script_cleanup_branch of our 
CVS. 
 
If you like, I can generate a set of source packages containing these changes; 
otherwise, you can wait until the next major or minor Globus Toolkit release 
(which will contain the changes from the trunk), or check out the files from 
CVS yourself. 
 
joe 
------- Comment #14 From 2003-09-09 11:02:23 -------
Hi Joe,

I have tried looking at CVS to get the files myself, but I am not sure if I am 
looking in the correct place.

Could you provide a patch, or tell me how to create one?

Thanks,
Gabriele
------- Comment #15 From 2003-09-17 22:05:46 -------
Dear All:

We have not received the fix described by Joe, even though we contacted him
several times. I do not think that this ticket should be marked as "RESOLVED
FIXED". The following email came from our Grid production manager for ATLAS.
He cannot keep track of a large number of Globus jobs if the standard output
files are corrupted or deleted before the jobs finish. Again, this problem might
be caused by the NFS bug in the Globus LSF job manager, since the GASS cache is
in an NFS directory. The same problem happens with PBS. The Condor job manager
has a much lower corruption rate than PBS and LSF. We really need the fix to be
included in the patches for the current Globus release. 

Regards,
Dantong

---------- Forwarded message ----------
Date: Wed, 17 Sep 2003 15:31:23 -0400
From: LSF <lsfadmin@rcf.rhic.bnl.gov>
To: kaushik@acas036.usatlas.bnl.gov
Subject: Job 29049: <#! /bin/sh;#;# LSF batch job script built by Globus
    Job Manager;#;#BSUB -q grid;#BSUB -i /dev/null;#BSUB -e
    /usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490
    873133/md5/7f/0ff29a97a9ff2823da9dcbefdb17eb/dat> Done

Job <#! /bin/sh;#;# LSF batch job script built by Globus Job Manager;#;#BSUB -q
grid;#BSUB -i /dev/null;#BSUB -e
/usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/7f/0ff29a97a9ff2823da9dcbefdb17eb/dat>
was submitted from host <gremlin> by user <kaushik>.
Job was executed on host(s) <acas036>, in queue <grid>, as user <kaushik>.
</usatlas/u/kaushik> was used as the home directory.
</direct/usatlas+u/kaushik> was used as the working directory.
Started at Wed Sep 17 15:25:55 2003
Results reported at Wed Sep 17 15:31:23 2003

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#! /bin/sh
#
# LSF batch job script built by Globus Job Manager
#
#BSUB -q grid
#BSUB -i /dev/null
#BSUB -e
/usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/7f/0ff29a97a9ff2823da9dcbefdb17eb/data
#BSUB -o
/usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/ca/908394fb9a7c932dcd5ececd6700fc/data
#BSUB -N
#BSUB -n 1
X509_USER_PROXY=/usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/d6/f2416a405e5b65a030259e5d7161d8/data;
export X509_USER_PROXY
GLOBUS_LOCATION=/opt/globus-2.4; export GLOBUS_LOCATION
GLOBUS_GRAM_JOB_CONTACT=https://gremlin.usatlas.bnl.gov:6157/1189/1063826742/;
export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://gremlin.usatlas.bnl.gov:6158/; export
GLOBUS_GRAM_MYJOB_CONTACT
HOME=/usatlas/u/kaushik; export HOME
LOGNAME=kaushik; export LOGNAME

        if test 'X${LD_LIBRARY_PATH}' != 'X'; then
            LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:"
        else
            LD_LIBRARY_PATH=""
        fi
        export LD_LIBRARY_PATH

#Change to directory requested by user
cd /usatlas/scratch/dtyu/atlasgrid/atlas000_14143
/usatlas/scratch/dtyu/atlasgrid/atlas000_14143/dc1-run  &
wait

------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :    309.53 sec.
    Max Memory :       167 MB
    Max Swap   :       207 MB

    Max Processes  :         6

The output (if any) follows:

 error in CFPUT : Can't open configuration file
STOP  batch job impossible  statement executed


PS: The stderr output (if any) follows:



PS:

Fail to open output file
/usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/ca/908394fb9a7c932dcd5ececd6700fc/data:
No such file or directory.
Output is stored in this mail.
Fail to open stderr file
/usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/7f/0ff29a97a9ff2823da9dcbefdb17eb/data:
No such file or directory.
The stderr output is included in this report.
------- Comment #16 From 2003-09-26 09:05:02 -------
I sent this a month ago, must have been some email problem causing it to not
arrive. Sorry about that.

joe

> Date: Tue, 26 Aug 2003 13:52:44 -0500
> From: Joe Bester <bester@mcs.anl.gov>
> To: Dantong Yu <dtyu@bnl.gov>
> Cc: carcassi@bnl.gov, jlauret4@bnl.gov
> Subject: Re: Patches for Globus LSF jobmanager.
>
> On Mon, Aug 25, 2003 at 02:09:29PM -0400, Dantong Yu wrote:
> > Joe:
> > Thank you very much for replying. Since we are anxious waiting for the
> > fix. Could you please generate a set of source packages containing the
> >  fixes you described below at your earliest convenience? Is there a new
> > version for globus_gram_job_manager_setup_lsf?
> >
> >  Thank you very much.
> >  Regards,
> >  Dantong
>
> I've put a few packages on http://www-unix.mcs.anl.gov/~bester/lsf-for-2.4/
> These should be installable on GT2 2.4.2. Let me know if you have any
> issues with them. The lsf package depends on that version of the job
> manager package on that page.


------- Comment #17 From 2003-10-07 09:16:02 -------
I applied the patch via gpt-build on Globus 2.4.2. The missing-job-result
problem still exists. The test result follows. Gabriele, could you
please submit your test result on this node as well?

Thank you very much. 
Dantong
 
[dtyu@stargrid01 ~]$ foreach i ( 1 2 3 4 5 6 7 8 9 10 )
foreach? echo $i
foreach? globus-job-run stargrid01.rcf.bnl.gov/jobmanager-lsf /bin/echo "Hello, $i"
foreach? end
1
Hello, 1
2
3
Hello, 3
4
5
Hello, 5
6
Hello, 6
7
8
Hello, 8
9
10
Hello, 10
[dtyu@stargrid01 ~]$ 
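
In the transcript above, 4 of the 10 runs (2, 4, 7 and 9) returned no output. To make such a test easier to repeat and compare, the loop can be written so that it counts the missing results. This is a minimal sketch in POSIX sh, not in the csh used above; count_missing is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: run a command N times and count the runs whose output is
# missing or wrong. count_missing is a hypothetical helper. The command is
# given the expected greeting as its last argument.
count_missing() {
    runs=$1
    shift
    missing=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # A failing command is treated the same as one that prints nothing.
        out=$("$@" "Hello, $i" 2>/dev/null) || out=
        [ "$out" = "Hello, $i" ] || missing=$((missing + 1))
        i=$((i + 1))
    done
    echo "$missing"
}
```

For the test in this comment, the call would be: count_missing 10 globus-job-run stargrid01.rcf.bnl.gov/jobmanager-lsf /bin/echo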


Below is my lsf.pm file. I modified the LSF setup part at the beginning slightly.


use Globus::GRAM::Error;
use Globus::GRAM::JobState;
use Globus::GRAM::JobManager;
use Globus::Core::Paths;

use Config;

package Globus::GRAM::JobManager::lsf;

@ISA = qw(Globus::GRAM::JobManager);

my ($lsf_profile, $mpirun, $bsub, $bjobs, $bkill);

BEGIN
{
    $lsf_profile = '/usr/lsf/conf/profile.lsf';
    $mpirun = 'no';
    $bsub   = "/usr/lsf/5.1/linux2.4-glibc2.1-x86/bin/bsub";
    $bjobs  = "/usr/lsf/5.1/linux2.4-glibc2.1-x86/bin/bjobs";
    $bkill  = "/usr/lsf/5.1/linux2.4-glibc2.1-x86/bin/bkill";
}

sub submit
{
    my $self = shift;
    my $description = $self->{JobDescription};
    # Use || so the fallback is assigned to the lexical $tag declared here.
    my $tag = $description->cache_tag() || $ENV{GLOBUS_GRAM_JOB_CONTACT};
    my $status;
    my $lsf_job_script;
    my $lsf_job_script_name;
    my $errfile = '';
    my $queue;
    my $job_id;
    my $script_url;
    my @arguments;
    my $email_when = '';
    my $library_path;
    my $cache_pgm = "$Globus::Core::Paths::bindir/globus-gass-cache";
    my @library_vars;

    $self->log('Entering lsf submit');

    # check jobtype
    if(defined($description->jobtype()))
    {
	if($description->jobtype !~ /^(mpi|single|multiple)$/)
	{
	    return Globus::GRAM::Error::JOBTYPE_NOT_SUPPORTED;
	}
	elsif($description->jobtype() eq 'mpi' && $mpirun eq 'no')
	{
	    return Globus::GRAM::Error::JOBTYPE_NOT_SUPPORTED;
	}
    }
    if( $description->directory eq '')
    {
	return Globus::GRAM::Error::RSL_DIRECTORY;
    }
    if((! -d $description->directory) || (! -r $description->directory))
    {
	return Globus::GRAM::Error::BAD_DIRECTORY;
    }

    # make sure the files are accessible (NFS sync) when you check for them
    $self->nfssync( $description->executable() )
	unless $description->executable() eq '';
    $self->nfssync( $description->stdin() )
	unless $description->stdin() eq '';

    if( $description->executable eq '')
    {
	return Globus::GRAM::Error::RSL_EXECUTABLE();
    }
    elsif(! -f $description->executable())
    {
	return Globus::GRAM::Error::EXECUTABLE_NOT_FOUND();
    }
    elsif(! -x $description->executable())
    {
	return Globus::GRAM::Error::EXECUTABLE_PERMISSIONS();
    }
    elsif( $description->stdin() eq '')
    {
	return Globus::GRAM::Error::RSL_STDIN;
    }
    elsif(! -r $description->stdin())
    {
	return Globus::GRAM::Error::STDIN_NOT_FOUND();
    }

    $self->log('Determining job max time cpu from job description');
    if(defined($description->max_cpu_time())) 
    {
	$cpu_time = $description->max_cpu_time();
	$self->log("   using maxcputime of $cpu_time");
    }
    elsif(defined($description->max_time()))
    {
	$cpu_time = $description->max_time();
	$self->log("   using maxtime of $cpu_time");
    }
    else
    {
	$cpu_time = 0;
	$self->log('   using queue default');
    }

    $self->log('Determining job max wall time limit from job description');
    if(defined($description->max_wall_time()))
    {
	$wall_time = $description->max_wall_time();
	$self->log("    using maxwalltime of $wall_time");
    }
    else
    {
	$wall_time = 0;
	$self->log('    using queue default');
    }

    if($description->queue() ne '')
    {
	$queue = $description->queue();
    }
    else {
        $queue = 'star_cas_dd';
    }

    $self->log('Building job script');

    $script_url = "$tag/lsf_job_script.$$"; 
    $self->fork_and_exec_cmd( $cache_pgm, '-add', '-t', $tag, 
			      '-n', $script_url, 'file:/dev/null' );
    $lsf_job_script_name = $self->pipe_out_cmd( $cache_pgm, '-query', '-t', 
						$tag, $script_url );
    chomp($lsf_job_script_name);
    if($lsf_job_script_name eq '')
    {
	return Globus::GRAM::Error::TEMP_SCRIPT_FILE_FAILED();
    }

    local(*JOB);
    open( JOB, '>' . $lsf_job_script_name );
    print JOB<<"EOF";
#! /bin/sh
#
# LSF batch job script built by Globus Job Manager
#
EOF

    if(defined($queue))
    {
	print JOB "#BSUB -q $queue\n";
    }
    if(defined($description->project()))
    {
	print JOB '#BSUB -P ', $description->project(), "\n";
    }

    if($cpu_time != 0)
    {
	if($description->jobtype() eq 'multiple')
	{
	    $total_cpu_time = $cpu_time * $description->count();
	}
	else
	{
	    $total_cpu_time = $cpu_time;
	}
	print JOB "#BSUB -c ${total_cpu_time}\n";
    }

    if($wall_time != 0)
    {
	print JOB "#BSUB -W $wall_time\n";
    }

    if($description->max_memory() != 0)
    {
	$max_memory = $description->max_memory() * 1024;

	if($description->jobtype() eq 'multiple')
	{
	    $total_max_memory = $max_memory * $description->count();
	}
	else
	{
	    $total_max_memory = $max_memory;
	}
	print JOB "#BSUB -M ${total_max_memory}\n";
    }
    print JOB '#BSUB -i ', $description->stdin(), "\n";
    print JOB '#BSUB -e ', $description->stderr(), "\n";
    print JOB '#BSUB -o ', $description->stdout(), "\n";
    print JOB "#BSUB -N\n";
    print JOB '#BSUB -n ', $description->count(), "\n";

    foreach my $tuple ($description->environment())
    {
	if(!ref($tuple) || scalar(@$tuple) != 2)
	{
	    return Globus::GRAM::Error::RSL_ENVIRONMENT();
	}
	print JOB $tuple->[0], '=', $tuple->[1],
		'; export ', $tuple->[0], "\n";
    }

    $library_path = join(':', $description->library_path());
    @library_vars = ('LD_LIBRARY_PATH');

    if($Config{osname} eq 'irix')
    {
	push(@library_vars, 'LD_LIBRARYN32_PATH', 'LD_LIBRARY64_PATH');
    }

    foreach (@library_vars)
    {
	print JOB <<"EOF";

	if test 'X\${$_}' != 'X'; then
	    $_="\${LD_LIBRARY_PATH}:$library_path"
	else
	    $_="$library_path"
	fi
	export $_
EOF
    }

    print JOB "\n#Change to directory requested by user\n";
    print JOB 'cd ', $description->directory(), "\n";

    @arguments = $description->arguments();

    foreach(@arguments)
    {
        if(ref($_))
	{
	    return Globus::GRAM::Error::RSL_ARGUMENTS;
	}
    }
    if($arguments[0])
    {
        foreach(@arguments)
        {
             $_ =~ s/\\/\\\\/g;
	     $_ =~ s/\$/\\\$/g;
	     $_ =~ s/"/\\\"/g; #"
	     $_ =~ s/`/\\\`/g; #`
	     
	     $args .= '"' . $_ . '" ';
        }
    }
    else
    {
	$args = '';
    }
    if($description->jobtype() eq 'mpi')
    {
	print JOB "$mpirun -np ", $description->count(), ' ';
	print JOB $description->executable(), " $args \n";
    }
    elsif($description->jobtype() eq 'multiple')
    {
	for(my $i = 0; $i < $description->count(); $i++)
	{
	    print JOB $description->executable(), " $args &\n";
	}
	print JOB "wait\n";
    }
    else
    {
	print JOB $description->executable(), " $args\n";
    }
    close(JOB);
    chmod 0755, $lsf_job_script_name;

    if($description->logfile() ne '')
    {
        $errfile = "2>" . $description->logfile();
    }
    $self->nfssync( $lsf_job_script_name );
    $job_id = (grep(/is submitted/,
                   split(/\n/, `$bsub < $lsf_job_script_name $errfile`)))[0];

    if($? == 0)
    {
	$job_id =~ m/<([^>]*)>/;
	$job_id = $1;

	return {
	           JOB_ID => $job_id,
		   JOB_STATE => Globus::GRAM::JobState::PENDING
		};
    }
    #system("$cache_pgm -cleanup-url $tag/lsf_job_script.$$");
    $self->fork_and_exec_cmd( $cache_pgm, '-cleanup-url', 
			      "$tag/lsf_job_script.$$" );

    return Globus::GRAM::Error::INVALID_SCRIPT_REPLY;
}

sub poll
{
    # The LSF bjobs command is used to obtain the current
    # status of the job. This status is then returned.
    #
    # The Status field can contain one of the following strings:
    #
    # string        stands for                      Globus context meaning
    # --------------------------------------------------------------------
    # RUN           Running                         ACTIVE
    # PEND          Waiting to be scheduled         PENDING
    # USUSP         Suspended while running         SUSPENDED
    # PSUSP         Suspended while pending         SUSPENDED
    # SSUSP         Suspended by system             SUSPENDED
    # DONE          Completed successfully          DONE
    # EXIT          Completed unsuccessfully        FAILED
    # UNKWN         Unknown state                   *ignore*
    # ZOMBI         Unknown state                   FAILED

    my $self = shift;
    my $description = $self->{JobDescription};
    my $job_id = $description->jobid();
    my $state;
    my $status_line;
    my $exit_code;

    $self->log("polling job $job_id");

    # Get first line matching job id
    # needs to be back-ticks to source lsf profile
    $_ = (grep(/$job_id/, `$bjobs $job_id 2>/dev/null`))[0];

    # get the exit code of the bjobs command.  For more info, do a 
    # search for $CHILD_ERROR in perlvar documentation.
    $exit_code = $? >> 8;

    # Verifying that the job is no longer there.
    # return code 255 = "Job <123> is not found"
    if($exit_code == 255)
    {
        $self->log("bjobs rc is 255 == Job <123> is not found == DONE");
        $state = Globus::GRAM::JobState::DONE;
	$self->nfssync( $description->stdout() )
	    if $description->stdout() ne '';
	$self->nfssync( $description->stderr() )
	    if $description->stderr() ne '';
    }
    else
    {

        # Get 3rd field (status)
        $_ = (split(/\s+/))[2];

        if(/PEND/)
        {
            $state = Globus::GRAM::JobState::PENDING;
        }
        elsif(/DONE/)
        {
            $state = Globus::GRAM::JobState::DONE;
	    $self->nfssync( $description->stdout() )
		if $description->stdout() ne '';
	    $self->nfssync( $description->stderr() )
		if $description->stderr() ne '';
        }
        elsif(/USUSP|SSUSP|PSUSP/)
        {
            $state = Globus::GRAM::JobState::SUSPENDED;
        }
        elsif(/RUN/)
        {
            $state = Globus::GRAM::JobState::ACTIVE;
        }
        elsif(/EXIT/)
        {
            return Globus::GRAM::Error::JOB_EXIT_CODE_NON_ZERO();
        }
        elsif(/UNKWN/)
        {
            # We want the JM to ignore this poll and keep the same state
            # as the previous state.  Returning an empty hash will do the job.
            $self->log("bjobs returned the UNKWN state.  Telling JM to ignore this poll");
            return {};
        }
        elsif(/ZOMBI/)
        {
            return Globus::GRAM::Error::LOCAL_SCHEDULER_ERROR();
        }
        else
        {
            # This else is reached by an unknown response from lsf.
            # It could be that LSF was temporarily unavailable, but that it
            # can recover and the submitted job is fine.
            # We want the JM to ignore this poll and keep the same state
            # as the previous state.  Returning an empty hash will do the job.
            $self->log("bjobs returned an unknown response.  Telling JM to ignore this poll");
            return {};
        }
    }

    return {JOB_STATE => $state};
}

sub cancel
{
    my $self = shift;
    my $description = $self->{JobDescription};
    my $job_id = $description->jobid();

    $self->log("cancel job $job_id");
    # needs to be back-ticks to source lsf profile
    system("$bkill $job_id >/dev/null 2>/dev/null");

    if($? == 0)
    {
	return { JOB_STATE => Globus::GRAM::JobState::FAILED };
    }
    return Globus::GRAM::Error::JOB_CANCEL_FAILED();
}

1;
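
One detail of the generated job scripts quoted in this report is worth noting: the line if test 'X${LD_LIBRARY_PATH}' != 'X' uses single quotes, so the shell compares the literal text and the test is always true. The sketch below shows one way to write the check so that the variable is actually expanded. extend_library_path is a hypothetical helper and is not what the job manager generates.

```shell
#!/bin/sh
# Sketch only: append extra directories to a library path value.
# extend_library_path is a hypothetical helper, not generated by Globus.
extend_library_path() {
    current=$1     # existing value, possibly empty
    extra=$2       # directories to add, possibly empty
    # Double quotes let the shell expand the variables. With single quotes
    # the test would compare literal text and always succeed.
    if test "X$current" != "X" && test "X$extra" != "X"; then
        echo "$current:$extra"
    else
        echo "$current$extra"
    fi
}
```

In a job script this might be used as LD_LIBRARY_PATH=$(extend_library_path "$LD_LIBRARY_PATH" "/some/lib/dir"), where the directory is only an example. It avoids the leading or trailing colon that appears when one side is empty.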
------- Comment #18 From 2003-10-07 09:18:24 -------
Here is the LSF email back to me indicating that the gass cache file is
missing.


From:     LSF <lsfadmin@rcas6146.rcf.bnl.gov>
To:     dtyu@rcf.rhic.bnl.gov
Subject:     Job 513707: <#! /bin/sh;#;# LSF batch job script built by Globus
Job
Manager;#;#BSUB -q star_cas_dd;#BSUB -i /dev/null;#BSUB -e
/u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/2a/0e/e4/074420c2d703fcddb342435443/d>
Done
Date:     Tue, 7 Oct 2003 10:08:25 -0400    
Job <#! /bin/sh;#;# LSF batch job script built by Globus Job Manager;#;#BSUB -q
star_cas_dd;#BSUB -i /dev/null;#BSUB -e
/u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/2a/0e/e4/074420c2d703fcddb342435443/d>
was submitted from host <stargrid01> by user <dtyu>.
Job was executed on host(s) <rcas6146>, in queue <star_cas_dd>, as user <dtyu>.
</u0b/dtyu> was used as the home directory.
</direct/u0b/dtyu> was used as the working directory.
Started at Tue Oct  7 10:08:25 2003
Results reported at Tue Oct  7 10:08:25 2003

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#! /bin/sh
#
# LSF batch job script built by Globus Job Manager
#
#BSUB -q star_cas_dd
#BSUB -i /dev/null
#BSUB -e
/u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/2a/0e/e4/074420c2d703fcddb342435443/data
#BSUB -o
/u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/5b/77/ad/90dfeae07f36d051a28a5fb7c3/data
#BSUB -N
#BSUB -n 1
X509_USER_PROXY=/u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/17/2c/01/a51d0d63736a28daf95bb97969/data;
export X509_USER_PROXY
GLOBUS_LOCATION=/home/globus-2; export GLOBUS_LOCATION
GLOBUS_GRAM_JOB_CONTACT=https://stargrid01.rcf.bnl.gov:6160/6415/1065535689/;
export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://stargrid01.rcf.bnl.gov:6161/; export
GLOBUS_GRAM_MYJOB_CONTACT
HOME=/u0b/dtyu; export HOME
LOGNAME=dtyu; export LOGNAME

        if test 'X${LD_LIBRARY_PATH}' != 'X'; then
            LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:"
        else
            LD_LIBRARY_PATH=""
        fi
        export LD_LIBRARY_PATH

#Change to directory requested by user
cd /u0b/dtyu
/bin/echo "Hello, 2"  &
wait

------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      0.04 sec.
    Max Memory :         3 MB
    Max Swap   :         5 MB

    Max Processes  :         1

The output (if any) follows:

Hello, 2


PS: The stderr output (if any) follows:



PS:

Fail to open output file
/u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/5b/77/ad/90dfeae07f36d051a28a5fb7c3/data:
No such file or directory.
Output is stored in this mail.
Fail to open stderr file
/u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/2a/0e/e4/074420c2d703fcddb342435443/data:
No such file or directory.
The stderr output is included in this report.
------- Comment #19 From 2003-10-07 15:51:34 -------
The missing-result problem also happens with the PBS job manager. One of 
our collaborators at the University of New Mexico discovered that the
NFS-mounted GASS cache might be corrupted and that the result files are missing. 
We discussed this problem in our testbed meeting, and I was assigned to 
report it to the Globus Bugzilla system. 

Here is an error message found in several places:
/home/dtyu/.globus/.gass_cache/local/md5/f7/8c6ea9bac808103b023c4e63d205c8/md5/83/9515c4508fd1ea6d7d23950e6ebbac/data:
line 18: /usr/local/GLOBUS/bin/globus-sh-exec: No such file or directory


I guess that files in the NFS-mounted GASS cache tend to be corrupted
or removed before the job finishes. Even the new fix provided in this thread
does not completely solve the problem. I also found that the Condor job
manager has a much higher success rate and is more robust than LSF and PBS.


Regards,
Dantong



------- Comment #20 From 2003-10-12 11:12:28 -------
Since Joe said that he does not have access to an LSF batch queue system, it is
very hard for him to test whether a fix works. Last time, I invited the Globus
developers to apply for an account at Brookhaven National Lab.
That way, the person who works on this fix can get first-hand experience
testing the GRAM job manager for LSF, which shortens the turnaround time for a
real fix. I am attaching the instructions here again:

Go to:
 http://www.acf.bnl.gov/UserInfo/GettingStarted/NewUser/
After you get an account, follow the instructions at:
http://www.acf.bnl.gov/UserInfo/GettingStarted/

When you apply for an account, use me as your BNL site sponsor.
If you have questions, please ask me via Bugzilla.

Cheers
Dantong
------- Comment #21 From 2003-10-13 16:21:08 -------
How long are you at ANL this week?
I suggest we sit down and look over the problems together.
------- Comment #22 From 2003-10-13 16:33:02 -------
I will be at ANL Oct 13 to Oct 15 (for the grid3 meeting and the first day of
the GriPhyN meeting).
 I am in either A261, A216, or C101 of Building 221 (MCS building).
 We can arrange some time and place to work on this issue. Please let me
 know your arrangement.


 Thank you very much. 
 Dantong
------- Comment #23 From 2003-10-14 15:18:07 -------
Intermediate report:

I spent some time with Dantong today, and first patched the lsf.pm to include
the NFS patches. Alas, this does not improve the reliability.

Besides the successful runs, there are two classes of errors visible. The first
is an LSF error "No such file or directory" in some cases. Note that the output
of the test, while not visible at the submit site, is part of the email report
LSF sends; see Dantong's example message.

The second class of failures reports "Stale NFS handle". Again, the true output
from stdout is reported in LSF's email message on the job, while it is
missing at the submit site.

Finally, the cases that succeed in showing the stdout on the submit site do not
report it as part of LSF's email message for the job. In this case, the
message reports "Read ... for stdXXX of job".

Thus, all these messages appear to be generated by LSF. Also, the errors appear
to be related to the interaction of LSF and NFS. Stu and I agree that there
appears to be (at least one) NFS race condition, where the LSF scheduler expects
to see the files or directories for stdout/stderr in the GASS cache, but cannot
find them there.

I can see two places where the NFS race condition between gatekeeper node,
remote scheduling system, NFS server and worker node (which might all be on
different hosts) can break things despite the NFS syncs:

[1] Files that LSF expects to (physically) see before it starts a job.
[2] Files that LSF expects to handle after a job was done. 

While the Globus JM scripts can take care of case 1, and indeed we patched
Dantong's lsf.pm to do so, there is little we can do about case 2, because it
happens before control passes back to the jobmanager. If the LSF system looks
for the files on a host other than the host where the job ran, it may, due to
NFS lag, not see the file contents immediately. 

I will attach Dantong's patched lsf.pm and JobManager.pm (more NFS reporting) to
this report shortly. We will continue to investigate. 
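
Case 1 could, for example, be handled by polling until the expected path is
visible on the submitting host before the job is handed to LSF. The following
is a minimal shell sketch under that assumption. The helper name wait_for_path
and its default retry values are hypothetical; this is not the code in the
attached lsf.pm or JobManager.pm.

```shell
#!/bin/sh
# Hypothetical helper (not part of Globus): poll until a path becomes
# visible on this host, or give up after a number of attempts.
# Usage: wait_for_path PATH [TRIES] [DELAY_SECONDS]
wait_for_path() {
    path=$1
    tries=${2:-10}
    delay=${3:-1}
    while [ "$tries" -gt 0 ]; do
        # -e succeeds as soon as the file or directory is visible here
        if [ -e "$path" ]; then
            return 0
        fi
        tries=$((tries - 1))
        # do not sleep after the final failed attempt
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

A check like this only shows that the path is visible on the host where it
runs. It says nothing about what the NFS client on a worker node sees, so it
cannot help with case 2.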
------- Comment #24 From 2003-10-14 15:26:12 -------
Created an attachment (id=221) [details]
Current LSF for grid3

This is the lsf.pm that I gave to Dantong for his setup. It is modeled on
ISI's lsf.pm module, which they claim to run successfully.
------- Comment #25 From 2003-10-14 15:27:22 -------
Created an attachment (id=222) [details]
More NFS sync logging

This is a slightly extended version of the JobManager.pm module which contains
plenty more logging information on the attempted NFS sync operations.
------- Comment #26 From 2003-10-14 17:20:18 -------
Jens suggested that I submit some jobs directly to LSF, with the stdout/stderr
files on NFS.

I submitted 19 simple jobs. All of them finished successfully. Jens and I will
look into the LSF outputs more carefully.

This is the command I used:
#!/bin/bash
for ((i=1; i< 20; i++)) do
echo $i;
perl -pi -e "s|echostring[\d]*|echostring$i|" lsf-test-script;
bsub < lsf-test-script ;
done

Here is the LSF job script which I carved out from a globus LSF job.

------------------------------------------------------------
# LSBATCH: User input
#! /bin/sh
#
# LSF batch job script built by Globus Job Manager
#
#BSUB -q grid-test
#BSUB -i /dev/null
#BSUB -o /usatlas/u/dtyu/.globus/.gass_cache/local/data_output_echostring19
#BSUB -e /usatlas/u/dtyu/.globus/.gass_cache/local/data_error_echostring19
#BSUB -N
#BSUB -n 1
X509_USER_PROXY=/usatlas/u/dtyu/.globus/.gass_cache/local/md5/1d/3f725a90295adb77a68e5fd69ff229/md5/bb/816d1eeb3f67675e92e124635b8510/data; export X509_USER_PROXY
GLOBUS_LOCATION=/data/Grid3/globus; export GLOBUS_LOCATION
GLOBUS_GRAM_JOB_CONTACT=https://spider.usatlas.bnl.gov:6447/7568/1066164014/; export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://spider.usatlas.bnl.gov:6449/; export GLOBUS_GRAM_MYJOB_CONTACT
HOME=/usatlas/u/dtyu; export HOME
LOGNAME=dtyu; export LOGNAME

        if test 'X${LD_LIBRARY_PATH}' != 'X'; then
            LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:"
        else
            LD_LIBRARY_PATH=""
        fi
        export LD_LIBRARY_PATH

#Change to directory requested by user
cd /usatlas/u/dtyu
/bin/echo "hello echostring19"  &
wait


------- Comment #27 From 2003-10-16 14:42:08 -------
*** Bug 1249 has been marked as a duplicate of this bug. ***
------- Comment #28 From 2003-10-16 21:11:50 -------
Jens gave me an nfssync tool which can be called in the LSF script.

After I added the lines shown further below, job submission fails with
this error message:

[dtyu@spider ~]$ globus-job-run spider.usatlas.bnl.gov/jobmanager-lsf -q
grid-test /bin/echo "hello"
GRAM Job failed because it is unknown if the job was submitted (error code 126)
[dtyu@spider ~]$ 


Here are the lines that I added:
        for(my $i = 0; $i < $description->count(); $i++)
        {
            print JOB $description->executable(), " $args &\n";
        }
        print JOB "wait\n";
        print JOB '/usatlas/projects/Grid3/nfssync/nfssync -bc ',
$descriptor->stderr(), ' ', $descriptor->stdout(), "\n";
#        print JOB "lsgrun -p -m \"\$LSB_HOSTS\" ",
#                  $description->executable(), " $args\n";


Thank you very much.
Dantong
------- Comment #29 From 2003-10-24 16:12:46 -------
Hi Dantong,

here is another try for the jobmanager. Please remove the nfssync program
line from the script for now, as it does something weird. I have no clue why
Globus would issue an error 126 - I am not that familiar with the source.

         # print JOB '/usatlas/projects/Grid3/nfssync/nfssync -bc ',
#$descriptor->stderr(), ' ', $descriptor->stdout(), "\n";

Instead, follow Stu's directions: in the $HOME/.globus/.gass_cache directory on
the execution site, a file called "config" may or may not exist.

[a] if it DOES NOT exist, simply change into the directory, and execute

        echo "type=flat" > config

[b] if it DOES exist, fire up your text editor, and change the first line
    to read

        type=flat

    save and exit.

Please retry your 10 x submit test scripts with this change. Note that this is
your user's configuration file on your worker pool.

Jens.
------- Comment #30 From 2003-10-24 17:08:01 -------
After I modified config to read type=flat, I got this error:
[dtyu@spider .gass_cache]$ globus-job-run spider.usatlas.bnl.gov/jobmanager-lsf
/bin/echo "hello"
GRAM Job submission failed because cannot access cache files in
~/.globus/.gass_cache, check permissions, quota, and disk space (error code 76)

But I do have quota available on my home directory.

Dantong
------- Comment #31 From 2003-10-24 18:35:57 -------
Hi Dantong,

I am sorry for not having tested it myself before asking you - I am getting
the same error. The part of the Globus developers' instructions that I had
failed to understand amounts to:

cd $HOME/.globus
mv .gass_cache xx
nohup rm -rf xx &
mkdir .gass_cache
cd .gass_cache
echo "type=flat" > config

It will then create a zillion files directly in .gass_cache (as in old
Globus). It will be interesting to see whether this alleviates the NFS problem.
But I do see new problems cropping up from too many files in an NFS directory
(for busy production). Also, the gram*log does not appear to be continuous any more.

Jens.
------- Comment #32 From 2003-10-28 07:58:26 -------
One of our collaborators discovered a similar problem with the PBS+Globus
jobmanager.

From:     Frederick Luehring <luehring@indiana.edu>
To:     grid3-core@ivdgl.org
Cc:     Matt Allen <malallen@indiana.edu>, rats@indiana.edu
Subject:     Globus + PBS troubles at IU
Date:     Mon, 27 Oct 2003 19:25:27 -0500

Hi Everyone,

Two users at the IU ATLAS Tier 2 center (IU_ATLAS_2) have encountered problems
where jobs finish successfully but only a small fraction of the jobs return
sysout and syserr successfully. Before I ask the system administrator to spend
time investigating this, I wanted to check that we were not seeing a previously
reported problem involving an interaction between globus and PBS (we are running
PBS and not condor) that causes the output files to be lost. Does anyone
remember how to test for this problem?

Some further details on what has happened. Ed May has submitted 10 ATLAS jobs
and only gotten two jobs back successfully. He is using globus-submit to submit
the jobs and globus-output to retrieve the output. Nickolai Kouropatkine is
using MOP to run CMS simulation. His files are returned to him by
globus-url-copy.

I would appreciate any advice that the experts could give us on how to debug
this. Thanks greatly...

Fred
------- Comment #33 From 2003-10-28 10:58:55 -------
Created an attachment (id=228) [details]
Globus 2.4.3 setup/globus/condor.in

Updates the condor.in file in the setup directory, in case the jobmanager setup
script is run another time. It is meant to be used in conjunction with the
updated JobManager.pm and StdioMerger.pm file (see these patches). 
------- Comment #34 From 2003-10-28 11:00:07 -------
Created an attachment (id=229) [details]
Globus 2.4.3 setup/globus/pbs.in 

Updates the pbs.in file in the setup directory, in case the jobmanager setup
script is run another time. It is meant to be used in conjunction with the
updated JobManager.pm and StdioMerger.pm file (see these patches). 
------- Comment #35 From 2003-10-28 11:01:29 -------
Created an attachment (id=230) [details]
Globus 2.4.3 lib/perl/Globus/GRAM/JobManager.pm

Patches the JobManager.pm file with new methods to be used in the
scheduler-specific jobmanager scripts.
------- Comment #36 From 2003-10-28 11:02:32 -------
Created an attachment (id=231) [details]
Globus 2.4.3 lib/perl/Globus/GRAM/StdioMerger.pm 

Update the stdio merger to be more efficient... 
------- Comment #37 From 2003-10-28 11:04:16 -------
Created an attachment (id=232) [details]
Globus 2.4.3 lib/perl/Globus/GRAM/JobManager/condor.pm 

Updates the condor.pm jobmanager script to ask the NFS server to sync stdio
files (and others) before sending them back. Note that such a request is *not*
mandatory, and a busy NFS server may still choose to ignore it.
------- Comment #38 From 2003-10-28 11:06:16 -------
Created an attachment (id=233) [details]
Globus 2.4.3 lib/perl/Globus/GRAM/JobManager/pbs.pm 

Updates the pbs.pm jobmanager script to ask the NFS server to sync stdio
files (and others) before sending them back. Note that such a request is *not*
mandatory, and a busy NFS server may still choose to ignore it. Also introduces
a lot of efficiency fixes along the way - a gatekeeper may be able to handle
more simultaneous jobs that way.
------- Comment #39 From 2003-10-28 11:09:55 -------
Ok folks, I need from you, whose jobmanager-<scheduler> is not in the patched
set, your $GLOBUS_LOCATION/setup/globus/<scheduler>.in and
$GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/<scheduler>.pm scripts to be
able to patch those. Please only Globus 2.4.3 for now and here.
 
------- Comment #40 From 2003-10-28 21:02:48 -------
Created an attachment (id=234) [details]
Globus 2.2.4 setup/globus/condor.in

Updates the condor.in file in the setup directory, in case the jobmanager setup
script is run another time. It is meant to be used in conjunction with the
updated JobManager.pm and StdioMerger.pm file (see these patches).
------- Comment #41 From 2003-10-28 21:03:35 -------
Created an attachment (id=235) [details]
Globus 2.2.4 setup/globus/fork.in

Updates the fork.in file in the setup directory, in case the jobmanager setup
script is run another time. It is meant to be used in conjunction with the
updated JobManager.pm and StdioMerger.pm file (see these patches). 
------- Comment #42 From 2003-10-28 21:04:20 -------
Created an attachment (id=236) [details]
Globus 2.2.4 setup/globus/lsf.in

Updates the lsf.in file in the setup directory, in case the jobmanager setup
script is run another time. It is meant to be used in conjunction with the
updated JobManager.pm and StdioMerger.pm file (see these patches). 
------- Comment #43 From 2003-10-28 21:04:56 -------
Created an attachment (id=237) [details]
Globus 2.2.4 setup/globus/pbs.in

Updates the pbs.in file in the setup directory, in case the jobmanager setup
script is run another time. It is meant to be used in conjunction with the
updated JobManager.pm and StdioMerger.pm file (see these patches). 
------- Comment #44 From 2003-10-28 21:05:56 -------
Created an attachment (id=238) [details]
Globus 2.2.4 lib/perl/Globus/GRAM/JobManager.pm

patches JobManager.pm file with new methods to be used in the
scheduler-specific jobmanager scripts.
------- Comment #45 From 2003-10-28 21:06:45 -------
Created an attachment (id=239) [details]
Globus 2.2.4 lib/perl/Globus/GRAM/StdioMerger.pm

Update the stdio merger to be more efficient... 
------- Comment #46 From 2003-10-28 21:08:44 -------
Created an attachment (id=240) [details]
Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/fork.pm

New fork jobmanager. Eradicates IO::* and possibly bugs arising from the
difference between " or " and " || " (the latter binds more tightly than
assignment, the former less tightly).
------- Comment #47 From 2003-10-28 21:09:41 -------
Created an attachment (id=241) [details]
Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/condor.pm

Updates the condor.pm jobmanager script to ask the NFS server to sync stdio
files (and others) before sending them back. Note that such a request is *not*
mandatory, and a busy NFS server may still choose to ignore it. Also eliminates
IO::* and possibly some bugs.
------- Comment #48 From 2003-10-28 21:11:06 -------
Created an attachment (id=242) [details]
Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/pbs.pm

Updates the pbs.pm jobmanager script to ask the NFS server to sync stdio files
(and others) before sending them back. Note that such a request is *not*
mandatory, and a busy NFS server may still choose to ignore it. Also introduces
a lot of efficiency fixes along the way - a gatekeeper may be able to handle
more simultaneous jobs that way. Also, a quoting issue with environment
variables may still exist.
------- Comment #49 From 2003-10-28 21:12:55 -------
Created an attachment (id=243) [details]
Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/lsf.pm

From memory, and to be used with caution. Eradicates IO::* and tries to do
NFS syncing. A quoting issue with environment variables may still exist or may
have been introduced - check.
------- Comment #50 From 2003-10-28 21:15:36 -------
Actually, the backported 2.2.4 patches use knowledge gained from the 2.4.3 port
- I would rank them as slightly "better".
------- Comment #51 From 2003-10-29 12:43:59 -------
Created an attachment (id=247) [details]
Globus 2.2.4 collective patches (fork/condor/pbs/lsf)

This tarball contains all the patches, original files and modified files for a
(vanilla) Globus 2.2.4 installation for the jobmanagers fork, condor, pbs and
lsf. The tarball provides:

lib/perl/Globus/GRAM/JobManager.diff
lib/perl/Globus/GRAM/JobManager.old
lib/perl/Globus/GRAM/JobManager.pm
lib/perl/Globus/GRAM/JobManager/condor.diff
lib/perl/Globus/GRAM/JobManager/condor.old
lib/perl/Globus/GRAM/JobManager/condor.pm
lib/perl/Globus/GRAM/JobManager/fork.diff
lib/perl/Globus/GRAM/JobManager/fork.old
lib/perl/Globus/GRAM/JobManager/fork.pm
lib/perl/Globus/GRAM/JobManager/lsf.diff
lib/perl/Globus/GRAM/JobManager/lsf.old
lib/perl/Globus/GRAM/JobManager/lsf.pm
lib/perl/Globus/GRAM/JobManager/pbs.diff
lib/perl/Globus/GRAM/JobManager/pbs.old
lib/perl/Globus/GRAM/JobManager/pbs.pm
lib/perl/Globus/GRAM/StdioMerger.diff
lib/perl/Globus/GRAM/StdioMerger.old
lib/perl/Globus/GRAM/StdioMerger.pm
setup/globus/condor.diff
setup/globus/condor.in
setup/globus/condor.old
setup/globus/fork.diff
setup/globus/fork.in
setup/globus/fork.old
setup/globus/lsf.diff
setup/globus/lsf.in
setup/globus/lsf.old
setup/globus/pbs.diff
setup/globus/pbs.in
setup/globus/pbs.old

These patches are checked against the 2.4.3 patches, and are the latest. 
------- Comment #52 From 2003-10-29 12:46:13 -------
Created an attachment (id=248) [details]
Globus 2.4.3 collective patches (fork/condor/pbs)

This tarball contains all the patches, original files and modified files for a
(vanilla) Globus 2.4.3 installation for the jobmanagers fork, condor and pbs. I
don't have any LSF available to me, but lsf.pm does need patching, of course.
The tarball provides:

lib/perl/Globus/GRAM/JobManager.diff
lib/perl/Globus/GRAM/JobManager.old
lib/perl/Globus/GRAM/JobManager.pm
lib/perl/Globus/GRAM/JobManager/condor.diff
lib/perl/Globus/GRAM/JobManager/condor.old
lib/perl/Globus/GRAM/JobManager/condor.pm
lib/perl/Globus/GRAM/JobManager/fork.diff
lib/perl/Globus/GRAM/JobManager/fork.old
lib/perl/Globus/GRAM/JobManager/fork.pm
lib/perl/Globus/GRAM/JobManager/pbs.diff
lib/perl/Globus/GRAM/JobManager/pbs.old
lib/perl/Globus/GRAM/JobManager/pbs.pm
lib/perl/Globus/GRAM/StdioMerger.diff
lib/perl/Globus/GRAM/StdioMerger.old
lib/perl/Globus/GRAM/StdioMerger.pm
setup/globus/condor.diff
setup/globus/condor.in
setup/globus/condor.old
setup/globus/fork.diff
setup/globus/fork.in
setup/globus/fork.old
setup/globus/pbs.diff
setup/globus/pbs.in
setup/globus/pbs.old

These patches are checked against the 2.2.4 patches, and are the latest.
------- Comment #53 From 2003-10-29 13:05:52 -------
Mini-HOWTO for using the patches:

Both patch sets have been checked against one another. The tarballs contain the
latest fixes.

[0] Read the instructions to the end before starting. 

[1] Choose the version of Globus you are running. Please note that the current
patch sets only support 2.2.4 for VDT, and 2.4.3 for TeraGrid. Unfortunately, I
cannot simulate all stages of updates of a particular Globus version, thus you
may experience some offset with the patches.

[2] Download the tarball and unpack in a place of your convenience, but NOT in
$GLOBUS_LOCATION. Provided in each tarball are three entries for each patched file:

    [a] The new version ending either in ".pm" or ".in"
    [b] The original file ending in ".old"
    [c] The patch file ending in ".diff"

It is recommended that you attempt to use the GNU patch tool to update your
installation, essentially applying [2c] to your files. The directory location of
the files is preserved. It is NOT recommended to use the [2a] file as a drop-in
replacement. The [2a] and [2b] files are provided as a reference for you to
determine how much your installation deviates from mine. 

[3] Create a backup of those of your files that correspond to the files from
[2a], e.g. by moving them to a suffix ".org" and copying them back onto the
original location with the original suffix. This preserves the timestamp of the
original file.

[4] Patch the file. Refer to the GNU patch manual for how to run patch. If you
run patch from deep within the directory tree, you may need to strip leading
path components from the paths in the patch by using the -p argument, which
specifies how many directory levels you want to cut off. You may experience some
offset when patching.

[5] If you experience any patch failures in step [4], you may try to manually
integrate the "correct" thing. This option is for Perl experts only. If you are
not in that class, back out all patches from the backups you generated in step
[3]. Your installation may not be patchable with the patches I provided. You may
request to have me look at it, but my time in the pre-SC season is very tight. I
can correct obvious blunders.  

Another note: The patches in the lib/perl directory have higher priority, as
those are the files that are actually being used by the jobmanager. The files in
setup/globus have lower priority, but constitute the template with which an
update, or any gpt-* tool run, may overwrite the version in lib/perl.
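
Steps [3] and [4] could be scripted roughly as follows. This is only a sketch:
the function names are made up for illustration, and the paths and the -p strip
level in the commented usage lines are assumptions that must be adapted to the
local installation. It relies on the --dry-run option of GNU patch to test a
patch before applying it.

```shell
#!/bin/sh
# Sketch only: helper names, paths and strip level are hypothetical.

# Step [3]: move FILE aside as FILE.org, then copy it back in place.
# The moved .org copy keeps the timestamp of the original file.
backup_original() {
    file=$1
    if [ ! -f "$file" ]; then
        return 1
    fi
    # refuse to overwrite a backup from an earlier attempt
    if [ -e "$file.org" ]; then
        return 1
    fi
    mv "$file" "$file.org" && cp "$file.org" "$file"
}

# Step [4]: try the patch with --dry-run first, apply it only if that is clean.
# $1 = directory to run patch in
# $2 = absolute path of the .diff file
# $3 = number of leading path components to strip (-p)
apply_diff() {
    dir=$1
    difffile=$2
    strip=$3
    ( cd "$dir" &&
      patch --dry-run -p"$strip" < "$difffile" > /dev/null &&
      patch -p"$strip" < "$difffile" )
}

# Example usage (assumed locations, adjust before running):
# backup_original "$GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager.pm"
# apply_diff "$GLOBUS_LOCATION" /tmp/patches/lib/perl/Globus/GRAM/JobManager.diff 0
```

If the dry run fails, nothing has been modified yet, which makes it easier to
decide between manual integration and backing out as described in step [5].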
------- Comment #54 From 2003-10-29 13:23:36 -------
(From update of attachment 228 [details])
use the collective patch.
------- Comment #55 From 2003-10-29 13:23:53 -------
(From update of attachment 229 [details])
use collective patch.
------- Comment #56 From 2003-10-29 13:24:11 -------
(From update of attachment 230 [details])
use collective patch.
------- Comment #57 From 2003-10-29 13:24:25 -------
(From update of attachment 231 [details])
use collective patch
------- Comment #58 From 2003-10-29 13:24:41 -------
(From update of attachment 232 [details])
use collective patch.
------- Comment #59 From 2003-10-29 13:24:56 -------
(From update of attachment 233 [details])
use collective patch.
------- Comment #60 From 2003-10-29 13:25:12 -------
(From update of attachment 234 [details])
use collective patch.
------- Comment #61 From 2003-10-29 13:25:26 -------
(From update of attachment 235 [details])
use collective patch.
------- Comment #62 From 2003-10-29 13:25:42 -------
(From update of attachment 236 [details])
use collective patch.
------- Comment #63 From 2003-10-29 13:26:00 -------
(From update of attachment 237 [details])
use collective patch.
------- Comment #64 From 2003-10-29 13:26:18 -------
(From update of attachment 238 [details])
use collective patch.
------- Comment #65 From 2003-10-29 13:26:34 -------
(From update of attachment 239 [details])
use collective patch.
------- Comment #66 From 2003-10-29 13:26:49 -------
(From update of attachment 240 [details])
use collective patch.
------- Comment #67 From 2003-10-29 13:27:05 -------
(From update of attachment 241 [details])
use collective patch.
------- Comment #68 From 2003-10-29 13:27:20 -------
(From update of attachment 242 [details])
use collective patch.
------- Comment #69 From 2003-10-29 13:27:37 -------
(From update of attachment 243 [details])
use collective patch.
------- Comment #70 From 2003-10-30 12:40:28 -------
TeraGrid is also seeing this bug - they are having failure rates of 30-40%
pre-patches. This urgently needs to be fixed for them because of SC dependencies.

I've added sandra and nick to the cc list because of this.
------- Comment #71 From 2003-11-12 17:51:42 -------
Created an attachment (id=258) [details]
Globus 2.2.4 collective patches (fork/lsf/pbs/condor)

Updates the $self->pipe_out_cmd method to solve a problem with PBS, where it
didn't see job state changes correctly. Updates files JobManager.* and
StdioMerger.* from the previous tarball. This tarball contains all files:

lib/perl/Globus/GRAM/JobManager/condor.pm
lib/perl/Globus/GRAM/JobManager/condor.old
lib/perl/Globus/GRAM/JobManager/lsf.diff
lib/perl/Globus/GRAM/JobManager/fork.old
lib/perl/Globus/GRAM/JobManager/lsf.old
lib/perl/Globus/GRAM/JobManager/condor.diff
lib/perl/Globus/GRAM/JobManager/pbs.diff
lib/perl/Globus/GRAM/JobManager/fork.pm
lib/perl/Globus/GRAM/JobManager/fork.diff
lib/perl/Globus/GRAM/JobManager/lsf.pm
lib/perl/Globus/GRAM/JobManager/pbs.pm
lib/perl/Globus/GRAM/JobManager/pbs.old
lib/perl/Globus/GRAM/StdioMerger.old
lib/perl/Globus/GRAM/StdioMerger.pm
lib/perl/Globus/GRAM/StdioMerger.diff
lib/perl/Globus/GRAM/JobManager.pm
lib/perl/Globus/GRAM/JobManager.old
lib/perl/Globus/GRAM/JobManager.diff
setup/globus/condor.in
setup/globus/condor.old
setup/globus/lsf.diff
setup/globus/fork.old
setup/globus/lsf.old
setup/globus/condor.diff
setup/globus/pbs.diff
setup/globus/fork.in
setup/globus/fork.diff
setup/globus/lsf.in
setup/globus/pbs.in
setup/globus/pbs.old

Again, only files (JobManager|StdioMerger).(pm|diff) are different.
------- Comment #72 From 2003-11-12 17:54:27 -------
Created an attachment (id=259) [details]
Globus 2.4.3 collective patches (fork/condor/pbs)

Updates the $self->pipe_out_cmd method to solve a problem with PBS, where it
didn't see job state changes correctly. Updates files JobManager.* and
StdioMerger.* from the previous tarball. This tarball contains all files:

lib/perl/Globus/GRAM/JobManager/condor.pm
lib/perl/Globus/GRAM/JobManager/condor.old
lib/perl/Globus/GRAM/JobManager/fork.old
lib/perl/Globus/GRAM/JobManager/condor.diff
lib/perl/Globus/GRAM/JobManager/pbs.diff
lib/perl/Globus/GRAM/JobManager/fork.pm
lib/perl/Globus/GRAM/JobManager/fork.diff
lib/perl/Globus/GRAM/JobManager/pbs.pm
lib/perl/Globus/GRAM/JobManager/pbs.old
lib/perl/Globus/GRAM/StdioMerger.old
lib/perl/Globus/GRAM/StdioMerger.pm
lib/perl/Globus/GRAM/StdioMerger.diff
lib/perl/Globus/GRAM/JobManager.pm
lib/perl/Globus/GRAM/JobManager.old
lib/perl/Globus/GRAM/JobManager.diff
setup/globus/condor.in
setup/globus/condor.old
setup/globus/fork.old
setup/globus/condor.diff
setup/globus/pbs.diff
setup/globus/fork.in
setup/globus/fork.diff
setup/globus/pbs.in
setup/globus/pbs.old

Again, only files (JobManager|StdioMerger).(pm|diff) are different.
------- Comment #73 From 2003-12-01 15:19:36 -------
Please read bug 1425 to avoid making the Condor GridMonitor stumble.
------- Comment #74 From 2003-12-02 10:10:48 -------
I just realized, while trying to run jobs on the LCG/HEP, that the problem may
be more profound than we realized. The LCG gatekeeper does *not* have access to
any shared filesystem the worker nodes can see. Thus, it is the sole
responsibility of the remote scheduling system, an LCG adapted PBS, to propagate
things into the gatekeeper's GASS cache in a timely fashion. Unfortunately, I
rarely see my stdout from batched jobs (as opposed to g-j-r interactive jobs). 
------- Comment #75 From 2003-12-02 10:24:53 -------
Created an attachment (id=269) [details]
Globus 2.2.4 collective patches (fork/lsf/pbs/condor)

Additionally fixes bug #1425.
------- Comment #76 From 2003-12-02 10:25:55 -------
Created an attachment (id=270) [details]
Globus 2.4.3 collective patches (fork/condor/pbs)

Additionally fixes bug #1425.
------- Comment #77 From 2003-12-02 14:05:01 -------
Jens,

Is the only change the patch for bug #1425, or is there something else 
different as well?

Did you find that the problem from #1425 existed in the PBS and LSF job 
managers as well, or just Condor? We only looked at the Condor jobmanager.

-alain


------- Comment #78 From 2003-12-02 14:41:15 -------
Alain,

when integrating fix 1425, I went through all jobmanagers to see whether they
suffer from a similar problem. Additionally, I added an nfssync to the Condor
one at this (new) DONE stage, which was missing before. Download and have a look
yourself :-)

------- Comment #79 From 2003-12-03 15:02:48 -------
Jens,
 Do you know how to apply your latest patch to the Globus LSF
 job submission script? I tried to download your attachment
 and got attachment.cgi. What is the format of the patch?
 
 I have VDT 1.1.11 installed. 

 Regards,
 Dantong
------- Comment #80 From 2003-12-03 16:05:01 -------
On Wed, 3 Dec 2003 bugzilla-daemon@mcs.anl.gov wrote:

>  Do you know how to apply your latest patch into globus LSF job
>  submission script? I tried to download your attachment and got
>  attachment.cgi. What is the format of the patch?

I believe I marked it binary. Try renaming it to something.tar.gz, and ask
"file" what it thinks you downloaded. LSF will be unaffected, no
changes here.

Ciao,
Dipl.-Ing. Jens-S. Vöckler   voeckler at cs dot uchicago dot edu
University of Chicago; Research Institutes Building #402;
5640 South Ellis Avenue; Chicago, IL 60637-1433; USA; +1 773 834 6693
	You can rely on NFS for only one thing - don't!

------- Comment #81 From 2004-01-23 09:53:32 -------
All of the fork/condor/pbs and JobManager patches are now merged to the CVS 
trunk. 
 
joe 
------- Comment #82 From 2004-01-23 10:23:34 -------
Joe,

 Could you please give more details about the CVS trunk? How do I get
 this patch from your CVS trunk?
 Does this patch include LSF? I saw a patch posted
 on 10/14/2003, "LSF for grid3". Is this the newest LSF patch
that can be applied to Globus 2.4.x?

Since I have been following this bug for a long time, more details
about the fix would be highly appreciated for my documentation.
What is the problem that causes this NFS bug? How do you fix the
NFS-related bug, especially for the Globus job manager for LSF?

Regards,
Dantong
------- Comment #83 From 2004-01-23 10:51:13 -------
Hi Dantong,

I have been working on this for a few days now, and today I have discovered
that our problem at BNL doesn't seem to be related to NFS. It is actually much
simpler: the job manager starts polling LSF too soon, LSF says that it doesn't
know about the JOBID, and the JobManager reports the job as DONE.

Naturally, after it's done, the cache is cleaned, and LSF can't write the
output anymore.

A simple change, namely returning PENDING when LSF doesn't find the job,
seems to have removed the problem. I am trying not to rush to conclusions
here... but the current results I am having are _very_ encouraging!

More on this later!
------- Comment #84 From 2004-01-23 11:41:45 -------
Hi all,

I'll go into more detail on what I have found. I'd be interested in knowing
if other people are seeing the same problem, or if this is an RCF-only
problem. In all the preliminary tests, I am not losing outputs anymore, but
before declaring this really fixed, I'll be running further stress tests. But I
have made progress which other people might find interesting.

Initially, I worked on understanding the LSF and NFS setup here at RCF:
there was a general belief that it was an NFS issue, so I started by
understanding those pieces. After gaining enough information, I started
testing Globus (2.4.2) and the LSF jobmanager. While doing that, I noticed
that some jobs were advertised as DONE by Globus while they were advertised
as PENDING by LSF.

Today I gathered some statistics on that, and noticed that the number of jobs 
mistracked was exactly the number of jobs that lost the output. I looked at 
the gram logs, and noticed that for each job mistracked, the following error 
was in the log:
bjobs rc is 255 == Job <123> is not found == DONE
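
For anyone who wants to check their own site, counting such log lines could be
done along these lines. This is a sketch only: the function name is made up,
and the log file location in the usage comment is an assumption about where the
jobmanager writes its gram logs.

```shell
#!/bin/sh
# Hypothetical helper: count log lines where a "not found" reply from bjobs
# was turned into the DONE state. Takes one or more log files as arguments.
count_mistracked() {
    # -F: match the text literally, -c: print the number of matching lines.
    # grep exits non-zero when the count is 0, hence the "|| true".
    cat "$@" | grep -F -c 'is not found == DONE' || true
}

# Example usage (assumed log location):
# count_mistracked "$HOME"/gram_job_mgr_*.log
```

Comparing this count with the number of jobs that lost their output gives the
same kind of statistic as described above.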

That corresponds to the branch in which the lsf JobManager gets an error from
LSF (bjobs), in which case it just reports the job as done. Now I thought:
what does LSF return if I ask for a job it is not aware of? I tried, and got
the same error. The hypothesis was: if the jobmanager contacts LSF too soon,
bjobs might say that the job wasn't found.

So, I have changed this:
    if($exit_code == 255)
    {
        $self->log("bjobs rc is 255 == Job <123> is not found == DONE");
        $state = Globus::GRAM::JobState::DONE;
    }

into this:
    if($exit_code == 255)
    {
        $self->log("bjobs rc is 255 == Job <123> is not found == PENDING");
        $state = Globus::GRAM::JobState::PENDING;
    }

From the log I now see that some jobs print that message once, but then they 
proceed, are reported correctly, and their output is displayed.

Now, this is not at all a complete solution: if a job is really nonexistent 
(because of some error) the jobmanager would keep polling... so I will need to 
find a way to distinguish the two cases. It was just a way to quickly test my 
hypothesis.

Also, I don't know whether this is the only cause of the problems we were 
having, but it sure is part of the equation.

As I said, I will be running further stress tests to gather statistics on 
thousands of jobs and see what comes out. I'll also try to find a good way to 
tell which kind of error condition I am getting (suggestions?).

I could attach a patch right now, but I think it's better to wait until I 
finish. What do people think?

Gabriele
------- Comment #85 From 2004-01-28 09:01:24 -------
Hi all,

I am ready to declare this bug fixed at RCF: yesterday I submitted 1000 jobs 
and all of them were tracked correctly and returned their outputs, except for 
one which failed for other reasons. I really hope this will fix the problem 
for other sites too.

I'll attach a patch with the modification: I had to strip out other STAR 
custom modifications; hopefully that won't affect it. I tried the patch by 
itself, but didn't run another stress test. The diff will be against the lsf.pm 
found in VDT 1.1.12.

Before going into the details, I'd like to thank all the people who bore with 
me during the investigation: Ofer Rind (LSF admin at RCF), Azadeh Handley 
(Platform support), Robert Petckus (NFS admin at RCF), Jens, Joe, Stuart and 
Dantong. Without their knowledge of the various pieces, I wouldn't have gone 
very far.

Azadeh confirmed the issue of bjobs not reporting jobs that were just submitted. 
For the LSF savvy, this is due to "the fact that you use multithreading, which 
will force mbatchd to spawn a child mbatchd to process bjobs queries. When you 
submit a new job if at that moment you have a child mbatchd spawned to process 
your bjobs query, then that child mbatchd will not be aware of your new job. 
Mbatchd would usually kill the child mbatchd and spawn a new process in these 
situations. However it is all the matter of timing and how fast you run bjobs 
after your bsub." As to how to distinguish this case, "bhist is always your 
best bet, as it will directly check the events file rather than polling 
mbatchd"

So what the patch does is this: when bjobs reports that no job was found, 
it checks whether bhist agrees. If bhist also doesn't find the job, the job is 
reported as FAILED; if bhist does find it, the job is reported as PENDING.
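
For illustration only, the decision described above can be sketched as a small
standalone Perl function. This is not the attached patch: the helper name
state_when_bjobs_fails is made up, the states are plain strings rather than the
Globus::GRAM::JobState constants, and how one actually determines whether bhist
found the job is left to the caller as an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the patch): decide what to report when
# bjobs fails.
#   $bjobs_rc        - exit code of bjobs; 255 is the "not found" case
#                      seen in the gram log
#   $bhist_found_job - true if bhist found the job in the events file
#                      (how this is detected is an assumption left open)
# Returns nothing (undef) when bjobs did not return 255, meaning the
# caller should interpret the bjobs output as usual.
sub state_when_bjobs_fails {
    my ($bjobs_rc, $bhist_found_job) = @_;
    return if $bjobs_rc != 255;

    # bjobs does not know the job: trust bhist to break the tie.
    return $bhist_found_job ? 'PENDING' : 'FAILED';
}

# Walk through the three cases.
for my $case ([255, 1], [255, 0], [0, 1]) {
    my ($rc, $found) = @$case;
    my $state = state_when_bjobs_fails($rc, $found);
    printf "bjobs rc=%d, bhist found=%d -> %s\n", $rc, $found,
        defined($state) ? $state : '(use bjobs output)';
}
```

The point of the sketch is only the ordering of the checks: bhist is consulted
solely in the "bjobs rc is 255" branch, so the common path stays unchanged.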

Thanks,
Gabriele
------- Comment #86 From 2004-01-28 09:02:36 -------
Created an attachment (id=296) [details]
Patch to the VDT 1.1.12 (Globus 2.2.4) lsf.pm
------- Comment #87 From 2004-01-29 13:56:29 -------
*** Bug 1283 has been marked as a duplicate of this bug. ***
------- Comment #88 From 2004-02-02 08:22:31 -------
LSF patch committed. 
 
joe