Bugzilla – Bug 950
Globus 2.2.4 does not print output for various job managers, LSF and PBS
Last modified: 2004-02-02 08:22:31
I installed Globus 2.2.4 and the LSF job manager. After I run a Globus/LSF job, I receive an email reporting successful execution of the command, but Globus 2.2.4 does not print any output. I checked the log file; it shows no error message.

[dtyu@pdsfgrid2 ~]$ globus-job-run atlasgrid01.usatlas.bnl.gov/jobmanager-lsf -q grid-test /bin/echo "hello"

Please see the attached email from the LSF admin account. Was this a cache corruption problem? Thank you very much.

Regards,
Dantong

From: LSF <lsfadmin@rcf.rhic.bnl.gov>
To: dtyu@acas055.usatlas.bnl.gov
Subject: Job 71832: <# LSF batch job script built by Globus job manager; #! /bin/sh;#BSUB -q grid-test;#BSUB -i /dev/null;#BSUB -o /dev/null;#BSUB -e /usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768850;#BSUB -n 1;#BSUB -N;GLOBUS_GRAM_MYJOB> Done
Date: Mon, 12 May 2003 15:47:46 -0400

Job <# LSF batch job script built by Globus job manager; #! /bin/sh;#BSUB -q grid-test;#BSUB -i /dev/null;#BSUB -o /dev/null;#BSUB -e /usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768850;#BSUB -n 1;#BSUB -N;GLOBUS_GRAM_MYJOB> was submitted from host <spider> by user <dtyu>.
Job was executed on host(s) <acas055>, in queue <grid-test>, as user <dtyu>.
</usatlas/u/dtyu> was used as the home directory.
</direct/usatlas+u/dtyu> was used as the working directory.
Started at Mon May 12 15:47:40 2003
Results reported at Mon May 12 15:47:46 2003

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
# LSF batch job script built by Globus job manager
#! /bin/sh
#BSUB -q grid-test
#BSUB -i /dev/null
#BSUB -o /dev/null
#BSUB -e /usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768850
#BSUB -n 1
#BSUB -N
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://spider.usatlas.bnl.gov:6159/; export GLOBUS_GRAM_MYJOB_CONTACT
X509_CERT_DIR=/etc/grid-security/certificates; export X509_CERT_DIR
GLOBUS_GRAM_JOB_CONTACT=https://spider.usatlas.bnl.gov:6158/16623/1052768844/; export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_LOCATION=/opt/globus2; export GLOBUS_LOCATION
X509_USER_PROXY=/usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768848; export X509_USER_PROXY
# Changing to directory as requested by user
cd /usatlas/u/dtyu
# Executing job as requested by user
/bin/echo hello > /usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768849 < /dev/null &
wait
------------------------------------------------------------

Successfully completed.

Resource usage summary:
CPU time : 0.05 sec.
Max Memory : 2 MB
Max Swap : 4 MB
Max Processes : 1

Read file </usatlas/u/dtyu/.globus/.gass_cache/globus_gass_cache_1052768850> for stderr output of this job.
This could be due to NFS problems/delays. Can you try submitting the LSF job by hand using bsub and verify that the output does get to the output file? I have seen some NFS problems/delays where the scheduler returns a status of DONE, but the data is not yet in the job's output file. We are working on some improvements to the scheduler Perl scripts where the JM forces an NFS update after the job is DONE. -Stu
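The manual check suggested here can be made tolerant of NFS lag by polling for the output file instead of looking once. The following is only a sketch: `wait_for_output` is a hypothetical helper, not part of Globus or LSF, and listing the parent directory is merely a cheap nudge to the NFS client, not a guaranteed cache flush.

```sh
#!/bin/sh
# Hypothetical helper: poll until FILE exists and is non-empty, or give up.
# usage: wait_for_output FILE [TRIES] [DELAY_SECONDS]
wait_for_output() {
    file=$1
    tries=${2:-10}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Re-read the parent directory so the client looks at it again.
        # This is an assumption about client behaviour, not a guarantee.
        ls -a "$(dirname "$file")" >/dev/null 2>&1
        if [ -s "$file" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

After a hand-submitted job with `bsub -o` pointing at a file on the NFS volume reports DONE, something like `wait_for_output /path/to/outfile 30 2` (the path is a placeholder) would show whether the data arrives late or never.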
I submitted the LSF job manually. It did not run because the long path of the output file could not be created. When I shortened the output/error file names, still using the NFS directory, it worked perfectly well. I submitted another job through Globus, and it works fine:

globus-job-run atlasgrid01.usatlas.bnl.gov/jobmanager-lsf /bin/ls -R

I re-submitted /bin/echo "hello" through jobmanager-lsf; it does NOT work, because the output is too short and gets lost.
We at GriPhyN/Chimera have developed a set of patches after stumbling over the same problem, which appears to be NFS related (I ran into it with Condor myself, and with PBS with PDQ). Alas, the "published" patches are for 2.4.0 (bug 931). Since I don't have access to any LSF installation myself, I recreated for LSF what I did for Condor and PBS. If you'd like to try my patches, can you tar up $GLOBUS_LOCATION/lib/perl/Globus/GRAM and send it to me? I'll patch the files with what I did for 2.4.0 and send them back for testing (e.g. move the GRAM directory to GRAM.old and untar the patches into a new GRAM directory). This test would also benefit Globus. I hope that you have installed all the updates for 2.2.4, though, or my patching process will be a lot more tedious than I anticipated when writing this offer.
Stu, about a month ago you wrote: "We are working on some improvements to the scheduler Perl scripts where the JM forces an NFS update after the job is DONE." How are these progressing? Do you have an estimated time for the release? Will they be only for Globus 2.4, or will they include Globus 2.2.4? Thanks, -alain
We have not been able to work on the NFS updates yet. After the 3.0 release, I hope to schedule some time to update all the scheduler scripts with Jens's improvements. -Stu
Globus 2.2.4 has the problem of missing output. You mentioned that Globus 2.4 has a fix; does the fix work on Globus 2.2.4? We use VDT 1.1.9, which includes Globus 2.2.4. This problem will seriously affect our upcoming data production. It also happens with the Globus Condor job manager, where job outputs are constantly missing. I do not know when Globus 3.0 will be released. If you could give me a precise date, I would appreciate it.
The fixes from 2.4 (bug #931) may work on 2.2.4, but only if the 2.2.4 installation has all updates and advisories installed. Unfortunately, GPT had problems with some of the fixes: it claimed to have installed an update, but did not really install it unless -force'd to do so. Do you want my diffs for 2.2.4? Which remote scheduling system are you using?
> The fixes from 2.4 (bug #931) may work on 2.2.4, if and only if the 2.2.4 had
> all updates and advisories installed. Unfortunately, GPT displayed some
> problems with some of the fixes, in such that it claimed it installed an
> update, but really didn't install it unless -force'd to do so.

I downloaded Globus 2.4.2 and the most recent Globus LSF job manager (globus_gram_job_manager_setup_lsf-1.4.tar.gz) from ftp://ftp.globus.org/pub/gt2/2.4/2.4.2/, unless you have another version that is not listed at http://www.globus.org/gt2.4/download.html. I tested it with the Globus LSF job manager, and the problem of missing results still exists. I am not sure how this job manager can help Globus 2.2.4.

> Do you want my diffs for 2.2.4? Which remote scheduling system are you using?

Yes, I do want a fix. I am using LSF 5.X. I downloaded the most recent version of the Globus LSF job manager for Globus 2.2.4 from ftp://ftp.globus.org/pub/gt2/2.2/2.2.4/extra/src/globus_gram_job_manager_setup_lsf-1.2.tar.gz. If you could get the source tarball from there and apply your diff, I am happy to try it. Thank you. Dantong
*** Bug 1076 has been marked as a duplicate of this bug. ***
*** Bug 951 has been marked as a duplicate of this bug. ***
*** Bug 755 has been marked as a duplicate of this bug. ***
We have developed a fix for this and a few other GRAM script related problems, but they resulted in adding some new APIs to the Perl. These are committed to the CVS trunk, and were developed in the gram_script_cleanup_branch of our CVS. If you like, I can generate a set of source packages containing these changes; otherwise, you can wait until the next major or minor Globus Toolkit release (which will contain the changes from the trunk), or check out the files from CVS yourself. joe
Hi Joe, I have tried looking at CVS to get the files myself, but I am not sure if I am looking in the correct place. Could you provide a patch, or tell me how to create one? Thanks, Gabriele
Dear All: We have not received the fix described by Joe, even though we contacted Joe several times. I do not think this ticket should be marked "RESOLVED FIXED". The following email came from our Grid production manager for ATLAS. He cannot keep track of a large number of Globus jobs if the standard output files are corrupted or deleted before the jobs finish. Again, this problem might be caused by the NFS bug in the Globus LSF job manager, since the GASS cache is in an NFS directory. The same problem happens with PBS. The Condor job manager has a much lower corruption rate than PBS and LSF. We really need the fix to be included in the patches for the current Globus release. Regards, Dantong

---------- Forwarded message ----------
Date: Wed, 17 Sep 2003 15:31:23 -0400
From: LSF <lsfadmin@rcf.rhic.bnl.gov>
To: kaushik@acas036.usatlas.bnl.gov
Subject: Job 29049: <#! /bin/sh;#;# LSF batch job script built by Globus Job Manager;#;#BSUB -q grid;#BSUB -i /dev/null;#BSUB -e /usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/7f/0ff29a97a9ff2823da9dcbefdb17eb/dat> Done

Job <#! /bin/sh;#;# LSF batch job script built by Globus Job Manager;#;#BSUB -q grid;#BSUB -i /dev/null;#BSUB -e /usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/7f/0ff29a97a9ff2823da9dcbefdb17eb/dat> was submitted from host <gremlin> by user <kaushik>.
Job was executed on host(s) <acas036>, in queue <grid>, as user <kaushik>.
</usatlas/u/kaushik> was used as the home directory.
</direct/usatlas+u/kaushik> was used as the working directory.
Started at Wed Sep 17 15:25:55 2003
Results reported at Wed Sep 17 15:31:23 2003

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!
/bin/sh # # LSF batch job script built by Globus Job Manager # #BSUB -q grid #BSUB -i /dev/null #BSUB -e /usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/7f/0ff29a97a9ff2823da9dcbefdb17eb/data #BSUB -o /usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/ca/908394fb9a7c932dcd5ececd6700fc/data #BSUB -N #BSUB -n 1 X509_USER_PROXY=/usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/d6/f2416a405e5b65a030259e5d7161d8/data; export X509_USER_PROXY GLOBUS_LOCATION=/opt/globus-2.4; export GLOBUS_LOCATION GLOBUS_GRAM_JOB_CONTACT=https://gremlin.usatlas.bnl.gov:6157/1189/1063826742/; export GLOBUS_GRAM_JOB_CONTACT GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://gremlin.usatlas.bnl.gov:6158/; export GLOBUS_GRAM_MYJOB_CONTACT HOME=/usatlas/u/kaushik; export HOME LOGNAME=kaushik; export LOGNAME if test 'X${LD_LIBRARY_PATH}' != 'X'; then LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:" else LD_LIBRARY_PATH="" fi export LD_LIBRARY_PATH #Change to directory requested by user cd /usatlas/scratch/dtyu/atlasgrid/atlas000_14143 /usatlas/scratch/dtyu/atlasgrid/atlas000_14143/dc1-run & wait ------------------------------------------------------------ Successfully completed. Resource usage summary: CPU time : 309.53 sec. Max Memory : 167 MB Max Swap : 207 MB Max Processes : 6 The output (if any) follows: error in CFPUT : Can't open configuration file STOP batch job impossible statement executed PS: The stderr output (if any) follows: PS: Fail to open output file /usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/ca/908394fb9a7c932dcd5ececd6700fc/data: No such file or directory. Output is stored in this mail. Fail to open stderr file /usatlas/u/kaushik/.globus/.gass_cache/local/md5/27/59d865951107d1df2de5c490873133/md5/7f/0ff29a97a9ff2823da9dcbefdb17eb/data: No such file or directory. The stderr output is included in this report.
I sent this a month ago; some email problem must have kept it from arriving. Sorry about that.

joe

> Date: Tue, 26 Aug 2003 13:52:44 -0500
> From: Joe Bester <bester@mcs.anl.gov>
> To: Dantong Yu <dtyu@bnl.gov>
> Cc: carcassi@bnl.gov, jlauret4@bnl.gov
> Subject: Re: Patches for Globus LSF jobmanager.
>
> On Mon, Aug 25, 2003 at 02:09:29PM -0400, Dantong Yu wrote:
> > Joe:
> > Thank you very much for replying. Since we are anxious waiting for the
> > fix. Could you please generate a set of source packages containing the
> > fixes you described below at your earliest convenience? Is there a new
> > version for globus_gram_job_manager_setup_lsf?
> >
> > Thank you very much.
> > Regards,
> > Dantong
>
> I've put a few packages on http://www-unix.mcs.anl.gov/~bester/lsf-for-2.4/
> These should be installable on GT2 2.4.2. Let me know if you have any
> issues with them. The lsf package depends on that version of the job
> manager package on that page.
I applied the patch via gpt-build on Globus 2.4.2. The missing job result problem still exists. The following is the test result. Gabriele, could you please submit your test result on this node also? Thank you very much. Dantong

[dtyu@stargrid01 ~]$ foreach i ( 1 2 3 4 5 6 7 8 9 10 )
foreach? echo $i
foreach? globus-job-run stargrid01.rcf.bnl.gov/jobmanager-lsf /bin/echo "Hello, $i"
foreach? end
1
Hello, 1
2
3
Hello, 3
4
5
Hello, 5
6
Hello, 6
7
8
Hello, 8
9
10
Hello, 10
[dtyu@stargrid01 ~]$

cc. My lsf.pm file; I modified it a little at the beginning, in the LSF setup part.

use Globus::GRAM::Error;
use Globus::GRAM::JobState;
use Globus::GRAM::JobManager;
use Globus::Core::Paths;
use Config;

package Globus::GRAM::JobManager::lsf;

@ISA = qw(Globus::GRAM::JobManager);

my ($lsf_profile, $mpirun, $bsub, $bjobs, $bkill);

BEGIN
{
    $lsf_profile = '/usr/lsf/conf/profile.lsf';
    $mpirun = 'no';
    $bsub = "/usr/lsf/5.1/linux2.4-glibc2.1-x86/bin/bsub";
    $bjobs = "/usr/lsf/5.1/linux2.4-glibc2.1-x86/bin/bjobs";
    $bkill = "/usr/lsf/5.1/linux2.4-glibc2.1-x86/bin/bkill";
}

sub submit
{
    my $self = shift;
    my $description = $self->{JobDescription};
    my $tag = $description->cache_tag() or $tag = $ENV{GLOBUS_GRAM_JOB_CONTACT};
    my $status;
    my $lsf_job_script;
    my $lsf_job_script_name;
    my $errfile = '';
    my $queue;
    my $job_id;
    my $script_url;
    my @arguments;
    my $email_when = '';
    my $library_path;
    my $cache_pgm = "$Globus::Core::Paths::bindir/globus-gass-cache";
    my @library_vars;

    $self->log('Entering lsf submit');

    # check jobtype
    if(defined($description->jobtype()))
    {
        if($description->jobtype !~ /^(mpi|single|multiple)$/)
        {
            return Globus::GRAM::Error::JOBTYPE_NOT_SUPPORTED;
        }
        elsif($description->jobtype() eq 'mpi' && $mpirun eq 'no')
        {
            return Globus::GRAM::Error::JOBTYPE_NOT_SUPPORTED;
        }
    }
    if( $description->directory eq '')
    {
        return Globus::GRAM::Error::RSL_DIRECTORY;
    }
    if((! -d $description->directory) || (!
-r $description->directory)) { return Globus::GRAM::Error::BAD_DIRECTORY; } # make sure the files are accessible (NFS sync) when you check for them $self->nfssync( $description->executable() ) unless $description->executable() eq ''; $self->nfssync( $description->stdin() ) unless $description->stdin() eq ''; if( $description->executable eq '') { return Globus::GRAM::Error::RSL_EXECUTABLE(); } elsif(! -f $description->executable()) { return Globus::GRAM::Error::EXECUTABLE_NOT_FOUND(); } elsif(! -x $description->executable()) { return Globus::GRAM::Error::EXECUTABLE_PERMISSIONS(); } elsif( $description->stdin() eq '') { return Globus::GRAM::Error::RSL_STDIN; } elsif(! -r $description->stdin()) { return Globus::GRAM::Error::STDIN_NOT_FOUND(); } $self->log('Determining job max time cpu from job description'); if(defined($description->max_cpu_time())) { $cpu_time = $description->max_cpu_time(); $self->log(" using maxcputime of $cpu_time"); } elsif(defined($description->max_time())) { $cpu_time = $description->max_time(); $self->log(" using maxtime of $cpu_time"); } else { $cpu_time = 0; $self->log(' using queue default'); } $self->log('Determining job max wall time limit from job description'); if(defined($description->max_wall_time())) { $wall_time = $description->max_wall_time(); $self->log(" using maxwalltime of $wall_time"); } else { $wall_time = 0; $self->log(' using queue default'); } if($description->queue() ne '') { $queue = $description->queue(); } else { $queue = 'star_cas_dd'; } $self->log('Building job script'); $script_url = "$tag/lsf_job_script.$$"; $self->fork_and_exec_cmd( $cache_pgm, '-add', '-t', $tag, '-n', $script_url, 'file:/dev/null' ); $lsf_job_script_name = $self->pipe_out_cmd( $cache_pgm, '-query', '-t', $tag, $script_url ); chomp($lsf_job_script_name); if($lsf_job_script_name eq '') { return Globus::GRAM::ERROR::TEMP_SCRIPT_FILE_FAILED(); } local(*JOB); open( JOB, '>' . $lsf_job_script_name ); print JOB<<"EOF"; #! 
/bin/sh # # LSF batch job script built by Globus Job Manager # EOF if(defined($queue)) { print JOB "#BSUB -q $queue\n"; } if(defined($description->project())) { print JOB '#BSUB -P ', $description->project(), "\n"; } if($cpu_time != 0) { if($description->jobtype() eq 'multiple') { $total_cpu_time = $cpu_time * $description->count(); } else { $total_cpu_time = $cpu_time; } print JOB "#BSUB -c ${total_cpu_time}\n"; } if($wall_time != 0) { print JOB "#BSUB -W $wall_time\n"; } if($description->max_memory() != 0) { $max_memory = $description->max_memory() * 1024; if($description->jobtype() eq 'multiple') { $total_max_memory = $max_memory * $description->count(); } else { $total_max_memory = $max_memory; } print JOB "#BSUB -M ${total_max_memory}\n"; } print JOB '#BSUB -i ', $description->stdin(), "\n"; print JOB '#BSUB -e ', $description->stderr(), "\n"; print JOB '#BSUB -o ', $description->stdout(), "\n"; print JOB "#BSUB -N\n"; print JOB '#BSUB -n ', $description->count(), "\n"; foreach my $tuple ($description->environment()) { if(!ref($tuple) || scalar(@$tuple) != 2) { return Globus::GRAM::Error::RSL_ENVIRONMENT(); } print JOB $tuple->[0], '=', $tuple->[1], '; export ', $tuple->[0], "\n"; } $library_path = join(':', $description->library_path()); @library_vars = ('LD_LIBRARY_PATH'); if($Config{osname} eq 'irix') { push(@library_vars, 'LD_LIBRARYN32_PATH', 'LD_LIBRARY64_PATH'); } foreach (@library_vars) { print JOB <<"EOF"; if test 'X\${$_}' != 'X'; then $_="\${LD_LIBRARY_PATH}:$library_path" else $_="$library_path" fi export $_ EOF } print JOB "\n#Change to directory requested by user\n"; print JOB 'cd ', $description->directory(), "\n"; @arguments = $description->arguments(); foreach(@arguments) { if(ref($_)) { return Globus::GRAM::Error::RSL_ARGUMENTS; } } if($arguments[0]) { foreach(@arguments) { $_ =~ s/\\/\\\\/g; $_ =~ s/\$/\\\$/g; $_ =~ s/"/\\\"/g; #" $_ =~ s/`/\\\`/g; #` $args .= '"' . $_ . 
'" '; } } else { $args = ''; } if($description->jobtype() eq 'mpi') { print JOB "$mpirun -np ", $description->count(), ' '; print JOB $description->executable(), " $args \n"; } elsif($description->jobtype() eq 'multiple') { for(my $i = 0; $i < $description->count(); $i++) { print JOB $description->executable(), " $args &\n"; } print JOB "wait\n"; } else { print JOB $description->executable(), " $args\n"; } close(JOB); chmod 0755, $lsf_job_script_name; if($description->logfile() ne '') { $errfile = "2>" . $description->logfile(); } $self->nfssync( $lsf_job_script_name ); $job_id = (grep(/is submitted/, split(/\n/, `$bsub < $lsf_job_script_name $errfile`)))[0]; if($? == 0) { $job_id =~ m/<([^>]*)>/; $job_id = $1; return { JOB_ID => $job_id, JOB_STATE => Globus::GRAM::JobState::PENDING }; } #system("$cache_pgm -cleanup-url $tag/lsf_job_script.$$"); $self->fork_and_exec_cmd( $cache_pgm, '-cleanup-url', "$tag/lsf_job_script.$$" ); return Globus::GRAM::Error::INVALID_SCRIPT_REPLY; } sub poll { # The LSF bjobs command is used to obtain the current # status of the job. This status is then returned. # # The Status field can contain one of the following strings: # # string stands for Globus context meaning # -------------------------------------------------------------------- # RUN Running ACTIVE # PEND Wating to be scheduled PENDING # USUSP Suspended while running SUSPENDED # PSUSP Suspended while pending SUSPENDED # SSUSP Suspended by system SUSPENDED # DONE Completed sucessfully DONE # EXIT Completed unsuccessfully FAILED # UNKWN Unknown state *ignore* # ZOMBI Unknown state FAILED my $self = shift; my $description = $self->{JobDescription}; my $job_id = $description->jobid(); my $state; my $status_line; my $exit_code; $self->log("polling job $job_id"); # Get first line matching job id # needs to be back-ticks to source lsf profile $_ = (grep(/$job_id/, `$bjobs $job_id 2>/dev/null`))[0]; # get the exit code of the bjobs command. 
For more info, do a # search for $CHILD_ERROR in perlvar documentation. $exit_code = $? >> 8; # Verifying that the job is no longer there. # return code 255 = "Job <123> is not found" if($exit_code == 255) { $self->log("bjobs rc is 255 == Job <123> is not found == DONE"); $state = Globus::GRAM::JobState::DONE; $self->nfssync( $description->stdout() ) if $description->stdout() ne ''; $self->nfssync( $description->stderr() ) if $description->stderr() ne ''; } else { # Get 3th field (status) $_ = (split(/\s+/))[2]; if(/PEND/) { $state = Globus::GRAM::JobState::PENDING; } elsif(/DONE/) { $state = Globus::GRAM::JobState::DONE; $self->nfssync( $description->stdout() ) if $description->stdout() ne ''; $self->nfssync( $description->stderr() ) if $description->stderr() ne ''; } elsif(/USUSP|SSUSP|PSUSP/) { $state = Globus::GRAM::JobState::SUSPENDED; } elsif(/RUN/) { $state = Globus::GRAM::JobState::ACTIVE; } elsif(/EXIT/) { return Globus::GRAM::Error::JOB_EXIT_CODE_NON_ZERO(); } elsif(/UNKWN/) { # We want the JM to ignore this poll and keep the same state # as the previous state. Returning an empty hash will do the job. $self->log("bjobs returned the UNKWN state. Telling JM to ignore this poll"); return {}; } elsif(/ZOMBI/) { return Globus::GRAM::Error::LOCAL_SCHEDULER_ERROR(); } else { # This else is reached by an unknown response from lsf. # It could be that LSF was temporarily unavailable, but that it # can recover and the submitted job is fine. # We want the JM to ignore this poll and keep the same state # as the previous state. Returning an empty hash will do the job. $self->log("bjobs returned an unknown response. Telling JM to ignore this poll"); return {}; } } return {JOB_STATE => $state}; } sub cancel { my $self = shift; my $description = $self->{JobDescription}; my $job_id = $description->jobid(); $self->log("cancel job $job_id"); # needs to be back-ticks to source lsf profile system("$bkill $job_id >/dev/null 2>/dev/null"); if($? 
== 0) { return { JOB_STATE => Globus::GRAM::JobState::FAILED }; } return Globus::GRAM::Error::JOB_CANCEL_FAILED(); } 1;
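The ten-run test at the top of this comment (where runs 2, 4, 7 and 9 returned nothing) can be repeated with a small counting helper so that failure rates are comparable between patches. This is a sketch only; `count_missing` is a hypothetical name, and it treats any run with empty stdout as a lost result.

```sh
#!/bin/sh
# Hypothetical helper: run COMMAND N times and report how many runs
# produced no stdout at all.
# usage: count_missing N COMMAND [ARGS...]
count_missing() {
    n=$1
    shift
    missing=0
    i=1
    while [ "$i" -le "$n" ]; do
        # Capture stdout only; stderr is discarded for the count.
        out=$("$@" 2>/dev/null)
        if [ -z "$out" ]; then
            missing=$((missing + 1))
        fi
        i=$((i + 1))
    done
    echo "$missing of $n runs returned no output"
}
```

For example, `count_missing 10 globus-job-run stargrid01.rcf.bnl.gov/jobmanager-lsf /bin/echo hello` repeats the test from the transcript with a fixed message.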
Here is the LSF email back to me indicating that the gass cache file is missing. From: LSF <lsfadmin@rcas6146.rcf.bnl.gov> To: dtyu@rcf.rhic.bnl.gov Subject: Job 513707: <#! /bin/sh;#;# LSF batch job script built by Globus Job Manager;#;#BSUB -q star_cas_dd;#BSUB -i /dev/null;#BSUB -e /u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/2a/0e/e4/074420c2d703fcddb342435443/d> Done Date: Tue, 7 Oct 2003 10:08:25 -0400 Job <#! /bin/sh;#;# LSF batch job script built by Globus Job Manager;#;#BSUB -q star_cas_dd;#BSUB -i /dev/null;#BSUB -e /u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/2a/0e/e4/074420c2d703fcddb342435443/d> was submitted from host <stargrid01> by user <dtyu>. Job was executed on host(s) <rcas6146>, in queue <star_cas_dd>, as user <dtyu>. </u0b/dtyu> was used as the home directory. </direct/u0b/dtyu> was used as the working directory. Started at Tue Oct 7 10:08:25 2003 Results reported at Tue Oct 7 10:08:25 2003 Your job looked like: ------------------------------------------------------------ # LSBATCH: User input #! 
/bin/sh # # LSF batch job script built by Globus Job Manager # #BSUB -q star_cas_dd #BSUB -i /dev/null #BSUB -e /u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/2a/0e/e4/074420c2d703fcddb342435443/data #BSUB -o /u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/5b/77/ad/90dfeae07f36d051a28a5fb7c3/data #BSUB -N #BSUB -n 1 X509_USER_PROXY=/u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/17/2c/01/a51d0d63736a28daf95bb97969/data; export X509_USER_PROXY GLOBUS_LOCATION=/home/globus-2; export GLOBUS_LOCATION GLOBUS_GRAM_JOB_CONTACT=https://stargrid01.rcf.bnl.gov:6160/6415/1065535689/; export GLOBUS_GRAM_JOB_CONTACT GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://stargrid01.rcf.bnl.gov:6161/; export GLOBUS_GRAM_MYJOB_CONTACT HOME=/u0b/dtyu; export HOME LOGNAME=dtyu; export LOGNAME if test 'X${LD_LIBRARY_PATH}' != 'X'; then LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:" else LD_LIBRARY_PATH="" fi export LD_LIBRARY_PATH #Change to directory requested by user cd /u0b/dtyu /bin/echo "Hello, 2" & wait ------------------------------------------------------------ Successfully completed. Resource usage summary: CPU time : 0.04 sec. Max Memory : 3 MB Max Swap : 5 MB Max Processes : 1 The output (if any) follows: Hello, 2 PS: The stderr output (if any) follows: PS: Fail to open output file /u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/5b/77/ad/90dfeae07f36d051a28a5fb7c3/data: No such file or directory. Output is stored in this mail. Fail to open stderr file /u0b/dtyu/.globus/.gass_cache/local/md5/2c/42/56/918b72ff60ceb5ecd7126be167/md5/2a/0e/e4/074420c2d703fcddb342435443/data: No such file or directory. The stderr output is included in this report.
The missing result problem also happens with the PBS job manager. One of our collaborators at the University of New Mexico discovered that the NFS-mounted GASS cache might be corrupted and the result files go missing. We discussed this problem in our testbed meeting, and I was assigned to report it to the Globus Bugzilla system. Here is an error message found in several places:

/home/dtyu/.globus/.gass_cache/local/md5/f7/8c6ea9bac808103b023c4e63d205c8/md5/83/9515c4508fd1ea6d7d23950e6ebbac/data: line 18: /usr/local/GLOBUS/bin/globus-sh-exec: No such file or directory

I guess that NFS-mounted GASS cache files tend to be corrupted or removed before the job finishes. Even the new fix provided in this thread does not completely solve the problem. I also found that the Condor job manager has a much higher success rate and is more robust than LSF and PBS. Regards, Dantong
Since Joe said that he does not have access to an LSF batch queue system, it is very hard for him to test whether a fix works. Last time, I invited the Globus developers to apply for an account at Brookhaven National Lab. That way, the person who works on this fix could get first-hand experience testing the GRAM job manager for LSF, which would shorten the turnaround time for a real fix. I am attaching the instructions here again:

Go to: http://www.acf.bnl.gov/UserInfo/GettingStarted/NewUser/
After you get an account, follow the instructions at: http://www.acf.bnl.gov/UserInfo/GettingStarted/

When you apply for an account, use me as your BNL site sponsor. If you have questions, please ask me via Bugzilla. Cheers, Dantong
How long are you at ANL this week? I suggest we sit together and look over the problems.
I will be at ANL Oct 13 to Oct 15 (for the Grid3 meeting and the first day of the GriPhyN meeting). I will be in A261, A216, or C101 of Building 221 (the MCS building). We can arrange a time and place to work on this issue. Please let me know what suits you. Thank you very much. Dantong
Intermediary report: I spent some time with Dantong today, and first patched the lsf.pm to include the NFS patches. Alas, this does not improve the reliability. Besides the successful runs, there are two classes of errors visible.

In the first class, LSF reports "No such file or directory". Note that the output of the test, while not visible at the submit site, is part of the email report LSF sends; see Dantong's example message.

The second class of failures reports "Stale NFS handle". Again, the true output from stdout is reported in LSF's email message about the job, while it is missing at the submit site.

Finally, the cases that succeed in showing the stdout at the submit site do not report it as part of LSF's email message for the job. In this case, the message reports "Read ... for stdXXX of job". Thus, all these messages appear to be generated by LSF. Also, the errors appear to be related to the interaction of LSF and NFS.

Stu and I agree that there appears to be (at least one) NFS race condition, where the LSF scheduler expects to see the files or directories for stdout/stderr in the GASS cache, but cannot find them there. I can see two places where an NFS race condition between gatekeeper node, remote scheduling system, NFS server and worker node (which might all be different hosts) can break things despite the NFS syncs:

[1] Files that LSF expects to (physically) see before it starts a job.
[2] Files that LSF expects to handle after a job is done.

While the Globus JM scripts can take care of case 1, and indeed we patched Dantong's lsf.pm to do so, there is little we can do about case 2, because it happens before control passes back to the jobmanager. If the LSF system looks for the files on a host other than the host where the job ran, it may, due to NFS lag, not see the file contents immediately.

I will attach Dantong's patched lsf.pm and JobManager.pm (more NFS reporting) to this report shortly. We will continue to investigate.
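The three outcomes described in this report can be told apart mechanically from saved LSF report mails, which helps when tallying a larger test. The sketch below assumes each report has been saved to its own file; `classify_lsf_report` and its labels are hypothetical, and the patterns are just the phrases quoted in this thread.

```sh
#!/bin/sh
# Hypothetical helper: classify one saved LSF report mail.
# usage: classify_lsf_report FILE
# Prints one of: stale-handle, missing-file, delivered, unknown
classify_lsf_report() {
    report=$1
    if grep -q 'Stale NFS handle' "$report"; then
        echo stale-handle
    elif grep -q 'No such file or directory' "$report"; then
        echo missing-file
    elif grep -q 'Read file <.*> for std' "$report"; then
        # LSF wrote to the file itself and only points at it in the mail.
        echo delivered
    else
        echo unknown
    fi
}
```

Note that "delivered" only means LSF believed it wrote the file; whether the submit host then sees the contents is the separate case 2 problem.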
Created an attachment (id=221) Current LSF for grid3

This is the lsf.pm that I gave to Dantong for his setup. It is modeled on ISI's lsf.pm module, which they claim to run successfully.
Created an attachment (id=222) More NFS sync logging

This is a slightly extended version of the JobManager.pm module which contains plenty more logging information on the attempted NFS sync operations.
Jens suggested that I do some direct job submissions to LSF with stdout/stderr files on NFS. I submitted 19 simple jobs. All of them finished successfully. Jens and I will look into the LSF outputs more carefully. This is the command I used:

#!/bin/bash
for ((i=1; i< 20; i++))
do
    echo $i;
    perl -pi -e "s|echostring[\d]*|echostring$i|" lsf-test-script;
    bsub < lsf-test-script ;
done

Here is the LSF job script, which I carved out of a Globus LSF job.

------------------------------------------------------------
# LSBATCH: User input
#! /bin/sh
#
# LSF batch job script built by Globus Job Manager
#
#BSUB -q grid-test
#BSUB -i /dev/null
#BSUB -o /usatlas/u/dtyu/.globus/.gass_cache/local/data_output_echostring19
#BSUB -e /usatlas/u/dtyu/.globus/.gass_cache/local/data_error_echostring19
#BSUB -N
#BSUB -n 1
X509_USER_PROXY=/usatlas/u/dtyu/.globus/.gass_cache/local/md5/1d/3f725a90295adb77a68e5fd69ff229/md5/bb/816d1eeb3f67675e92e124635b8510/data; export X509_USER_PROXY
GLOBUS_LOCATION=/data/Grid3/globus; export GLOBUS_LOCATION
GLOBUS_GRAM_JOB_CONTACT=https://spider.usatlas.bnl.gov:6447/7568/1066164014/; export GLOBUS_GRAM_JOB_CONTACT
GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://spider.usatlas.bnl.gov:6449/; export GLOBUS_GRAM_MYJOB_CONTACT
HOME=/usatlas/u/dtyu; export HOME
LOGNAME=dtyu; export LOGNAME
if test 'X${LD_LIBRARY_PATH}' != 'X'; then
LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:"
else
LD_LIBRARY_PATH=""
fi
export LD_LIBRARY_PATH
#Change to directory requested by user
cd /usatlas/u/dtyu
/bin/echo "hello echostring19" &
wait
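Once the directly submitted jobs have finished, the per-job output files can be checked in one pass rather than by eye. The sketch below assumes the `data_output_echostring<N>` naming and the `hello echostring<N>` payload from the test script above; `check_outputs` itself is a hypothetical helper.

```sh
#!/bin/sh
# Hypothetical helper: verify that each expected output file exists and
# contains the line written by its job.
# usage: check_outputs DIR FIRST LAST
check_outputs() {
    dir=$1
    first=$2
    last=$3
    bad=0
    i=$first
    while [ "$i" -le "$last" ]; do
        f="$dir/data_output_echostring$i"
        # Anchor at end of line so echostring1 does not match echostring19.
        if ! grep -q "hello echostring$i\$" "$f" 2>/dev/null; then
            echo "missing or wrong: $f"
            bad=$((bad + 1))
        fi
        i=$((i + 1))
    done
    [ "$bad" -eq 0 ]
}
```

For the run described here, `check_outputs /usatlas/u/dtyu/.globus/.gass_cache/local 1 19` would list any job whose output did not arrive and return non-zero if there were any.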
*** Bug 1249 has been marked as a duplicate of this bug. ***
Jens gave me an nfssync tool that can be called from the LSF script. After I added the lines below, I got this error message:

[dtyu@spider ~]$ globus-job-run spider.usatlas.bnl.gov/jobmanager-lsf -q grid-test /bin/echo "hello"
GRAM Job failed because it is unknown if the job was submitted (error code 126)
[dtyu@spider ~]$

Here are the lines I added:

for(my $i = 0; $i < $description->count(); $i++)
{
    print JOB $description->executable(), " $args &\n";
}
print JOB "wait\n";
print JOB '/usatlas/projects/Grid3/nfssync/nfssync -bc ', $descriptor->stderr(), ' ', $descriptor->stdout(), "\n";
# print JOB "lsgrun -p -m \"\$LSB_HOSTS\" ",
# $description->executable(), " $args\n";

Thank you very much. Dantong
Hi Dantong, here is another try for the jobmanager. Please remove the nfssync program line from the script for now, as it does something weird. I have no clue why Globus would issue an error 126; I am not that familiar with the source.

# print JOB '/usatlas/projects/Grid3/nfssync/nfssync -bc ',
#$descriptor->stderr(), ' ', $descriptor->stdout(), "\n";

Instead, follow Stu's directions: in the $HOME/.globus/.gass_cache directory on the execution site, there may or may not be a file called "config".

[a] If it DOES NOT exist, simply change into the directory and execute

echo "type=flat" > config

[b] If it DOES exist, fire up your text editor and change the first line to read

type=flat

then save and exit.

Please retry your 10 x submit test scripts with this change. Note that this is your user's configuration file on your worker pool. Jens.
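Cases [a] and [b] can be handled by one small function instead of a text editor. This is a sketch under the assumption that only the first line of `config` matters, as the instructions imply; `set_flat_type` is a hypothetical name, and the function changes nothing else in the cache directory.

```sh
#!/bin/sh
# Hypothetical helper: make "type=flat" the first line of CACHE_DIR/config,
# creating the file if it does not exist.
# usage: set_flat_type CACHE_DIR
set_flat_type() {
    cfg="$1/config"
    if [ -f "$cfg" ]; then
        # Case [b]: replace the first line, keep any remaining lines.
        tmp="$cfg.tmp.$$"
        { echo "type=flat"; sed '1d' "$cfg"; } > "$tmp" && mv "$tmp" "$cfg"
    else
        # Case [a]: no config yet, create it.
        echo "type=flat" > "$cfg"
    fi
}
```

It would be called as `set_flat_type "$HOME/.globus/.gass_cache"` on the execution site.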
After I modified config to type=flat, I got this error:

[dtyu@spider .gass_cache]$ globus-job-run spider.usatlas.bnl.gov/jobmanager-lsf /bin/echo "hello"
GRAM Job submission failed because cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space (error code 76)

But I do have file quota on my home directory.

Dantong
Hi Dantong,

I am sorry for not having tested it myself before asking you - I am getting the same error. The part of the Globus developers' instructions that I had failed to understand amounts to:

cd $HOME/.globus
mv .gass_cache xx
nohup rm -rf xx &
mkdir .gass_cache
cd .gass_cache
echo "type=flat" > config

It will then create a zillion files directly in .gass_cache (as in old Globus). It will be interesting to see whether this alleviates the NFS problem. But I do see new problems cropping up from too many files in an NFS directory (for busy production). Also, the gram*log does not appear to be continuous any more.

Jens.
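The cache reset above can be wrapped in a small function so that it is repeatable and can be tried on a scratch directory first. This is a sketch only: the function name reset_gass_cache_flat and the temporary name ".gass_cache.old.<pid>" (used in place of "xx") are made up here; the "type=flat" line in "config" is the only part taken from the instructions.

```shell
# reset_gass_cache_flat BASE (hypothetical helper)
# BASE is the directory that holds .gass_cache, normally $HOME/.globus.
# Moves the old cache aside, deletes it in the background, and creates a
# fresh cache directory whose config file selects the flat layout.
reset_gass_cache_flat() {
    rg_base=$1
    rg_old="$rg_base/.gass_cache.old.$$"
    if [ -d "$rg_base/.gass_cache" ]; then
        mv "$rg_base/.gass_cache" "$rg_old" || return 1
        # The old tree may be large, so remove it without waiting
        nohup rm -rf "$rg_old" >/dev/null 2>&1 &
    fi
    mkdir -p "$rg_base/.gass_cache" || return 1
    echo "type=flat" > "$rg_base/.gass_cache/config"
}
```

On the execution site this would be invoked as: reset_gass_cache_flat "$HOME/.globus" (with no jobs running, since the existing cache contents are discarded).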
One of our collaborators discovered a similar problem with the PBS + Globus jobmanager.

From: Frederick Luehring <luehring@indiana.edu>
To: grid3-core@ivdgl.org
Cc: Matt Allen <malallen@indiana.edu>, rats@indiana.edu
Subject: Globus + PBS toubles at IU
Date: Mon, 27 Oct 2003 19:25:27 -0500

Hi Everyone,

Two users at the IU ATLAS Tier 2 center (IU_ATLAS_2) have encountered problems where jobs finish successfully but only a small fraction of the jobs return sysout and syserr successfully. Before I ask the system administrator to spend time investigating this, I wanted to check that we were not seeing a previously reported problem involving an interaction between globus and PBS (we are running PBS and not condor) that causes the output files to be lost. Does anyone remember how to test for this problem?

Some further details on what has happened. Ed May has submitted 10 ATLAS jobs and only gotten two jobs back successfully. He is using globus-submit to submit the jobs and globus-output to retrieve the output. Nickolai Kouropatkine is using MOP to run CMS simulation. His files are returned to him by globus-url-copy.

I would appreciate any advice that the experts could give us on how to debug this.

Thanks greatly...

Fred
Created an attachment (id=228) [details] Globus 2.4.3 setup/globus/condor.in Updates the condor.in file in the setup directory, in case the jobmanager setup script is run another time. It is meant to be used in conjunction with the updated JobManager.pm and StdioMerger.pm file (see these patches).
Created an attachment (id=229) [details] Globus 2.4.3 setup/globus/pbs.in Updates the pbs.in file in the setup directory, in case the jobmanager setup script is run another time. It is meant to be used in conjunction with the updated JobManager.pm and StdioMerger.pm file (see these patches).
Created an attachment (id=230) [details] Globus 2.4.3 lib/perl/Globus/GRAM/JobManager.pm patches JobManager.pm file with new methods to be used in the scheduler-specific jobmanager scripts.
Created an attachment (id=231) [details] Globus 2.4.3 lib/perl/Globus/GRAM/StdioMerger.pm Update the stdio merger to be more efficient...
Created an attachment (id=232) [details] Globus 2.4.3 lib/perl/Globus/GRAM/JobManager/condor.pm Updates the condor.pm jobmanager script to ask the NFS server to sync stdio files (and others) before sending them back. Note that such a request is *not* mandatory, and a busy NFS server may still choose to ignore it.
Created an attachment (id=233) [details] Globus 2.4.3 lib/perl/Globus/GRAM/JobManager/pbs.pm Updates the pbs.pm jobmanager script to ask the NFS server to sync stdio files (and others) before sending them back. Note that such a request is *not* mandatory, and a busy NFS server may still choose to ignore it. Also introduces a lot of efficiency fixes along the way - a gatekeeper may be able to handle more simultaneous jobs that way.
OK folks, those of you whose jobmanager-<scheduler> is not in the patched set: I need your $GLOBUS_LOCATION/setup/globus/<scheduler>.in and $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager/<scheduler>.pm scripts to be able to patch them. Please only Globus 2.4.3 for now, and post them here.
Created an attachment (id=234) [details] Globus 2.2.4 setup/globus/condor.in Updates the condor.in file in the setup directory, in case the jobmanager setup script is run another time. It is meant to be used in conjunction with the updated JobManager.pm and StdioMerger.pm file (see these patches).
Created an attachment (id=235) [details] Globus 2.2.4 setup/globus/fork.in Updates the fork.in file in the setup directory, in case the jobmanager setup script is run another time. It is meant to be used in conjunction with the updated JobManager.pm and StdioMerger.pm file (see these patches).
Created an attachment (id=236) [details] Globus 2.2.4 setup/globus/lsf.in Updates the lsf.in file in the setup directory, in case the jobmanager setup script is run another time. It is meant to be used in conjunction with the updated JobManager.pm and StdioMerger.pm file (see these patches).
Created an attachment (id=237) [details] Globus 2.2.4 setup/globus/pbs.in Updates the pbs.in file in the setup directory, in case the jobmanager setup script is run another time. It is meant to be used in conjunction with the updated JobManager.pm and StdioMerger.pm file (see these patches).
Created an attachment (id=238) [details] Globus 2.2.4 lib/perl/Globus/GRAM/JobManager.pm patches JobManager.pm file with new methods to be used in the scheduler-specific jobmanager scripts.
Created an attachment (id=239) [details] Globus 2.2.4 lib/perl/Globus/GRAM/StdioMerger.pm Update the stdio merger to be more efficient...
Created an attachment (id=240) [details] Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/fork.pm New fork jobmanager, eradicates IO::* and possibly things arising from the difference between " or " and " || " (the latter has higher precedence binding above assignment, the former below assignment precedence).
Created an attachment (id=241) [details] Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/condor.pm Updates the condor.pm jobmanager script to ask the NFS server to sync stdio files (and others) before sending them back. Note that such a request is *not* mandatory, and a busy NFS server may still choose to ignore it. Also eliminates IO::* and possibly some bugs.
Created an attachment (id=242) [details] Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/pbs.pm Updates the pbs.pm jobmanager script to ask the NFS server to sync stdio files (and others) before sending them back. Note that such a request is *not* mandatory, and a busy NFS server may still choose to ignore it. Also introduces a lot of efficiency fixes along the way - a gatekeeper may be able to handle more simultaneous jobs that way. Also, a quoting issue with environment variables may still exist.
Created an attachment (id=243) [details] Globus 2.2.4 lib/perl/Globus/GRAM/JobManager/lsf.pm Written from memory, and to be used with caution. Eradicates IO::* and tries to do NFS syncing. A quoting issue with environment variables may still exist or may have been introduced - check.
Actually, the backported 2.2.4 patches use knowledge gained from the 2.4.3 port - I would rank them as slightly "better".
Created an attachment (id=247) [details] Globus 2.2.4 collective patches (fork/condor/pbs/lsf) This tarball contains all the patches, original files and modified files for a (vanilla) Globus 2.2.4 installation for the jobmanagers fork, condor, pbs and lsf. You are getting with this tarball: lib/perl/Globus/GRAM/JobManager.diff lib/perl/Globus/GRAM/JobManager.old lib/perl/Globus/GRAM/JobManager.pm lib/perl/Globus/GRAM/JobManager/condor.diff lib/perl/Globus/GRAM/JobManager/condor.old lib/perl/Globus/GRAM/JobManager/condor.pm lib/perl/Globus/GRAM/JobManager/fork.diff lib/perl/Globus/GRAM/JobManager/fork.old lib/perl/Globus/GRAM/JobManager/fork.pm lib/perl/Globus/GRAM/JobManager/lsf.diff lib/perl/Globus/GRAM/JobManager/lsf.old lib/perl/Globus/GRAM/JobManager/lsf.pm lib/perl/Globus/GRAM/JobManager/pbs.diff lib/perl/Globus/GRAM/JobManager/pbs.old lib/perl/Globus/GRAM/JobManager/pbs.pm lib/perl/Globus/GRAM/StdioMerger.diff lib/perl/Globus/GRAM/StdioMerger.old lib/perl/Globus/GRAM/StdioMerger.pm setup/globus/condor.diff setup/globus/condor.in setup/globus/condor.old setup/globus/fork.diff setup/globus/fork.in setup/globus/fork.old setup/globus/lsf.diff setup/globus/lsf.in setup/globus/lsf.old setup/globus/pbs.diff setup/globus/pbs.in setup/globus/pbs.old These patches are checked against the 2.4.3 patches, and are the latest.
Created an attachment (id=248) [details] Globus 2.4.3 collective patches (fork/condor/pbs) This tarball contains all the patches, original files and modified files for a (vanilla) Globus 2.4.3 installation for the jobmanagers fork, condor and pbs. I don't have any LSF available to me, but lsf.pm does need patching, of course. You are getting with this tarball: lib/perl/Globus/GRAM/JobManager.diff lib/perl/Globus/GRAM/JobManager.old lib/perl/Globus/GRAM/JobManager.pm lib/perl/Globus/GRAM/JobManager/condor.diff lib/perl/Globus/GRAM/JobManager/condor.old lib/perl/Globus/GRAM/JobManager/condor.pm lib/perl/Globus/GRAM/JobManager/fork.diff lib/perl/Globus/GRAM/JobManager/fork.old lib/perl/Globus/GRAM/JobManager/fork.pm lib/perl/Globus/GRAM/JobManager/pbs.diff lib/perl/Globus/GRAM/JobManager/pbs.old lib/perl/Globus/GRAM/JobManager/pbs.pm lib/perl/Globus/GRAM/StdioMerger.diff lib/perl/Globus/GRAM/StdioMerger.old lib/perl/Globus/GRAM/StdioMerger.pm setup/globus/condor.diff setup/globus/condor.in setup/globus/condor.old setup/globus/fork.diff setup/globus/fork.in setup/globus/fork.old setup/globus/pbs.diff setup/globus/pbs.in setup/globus/pbs.old These patches are checked against the 2.2.4 patches, and are the latest.
Mini-HOWTO for using the patches:

Both patch sets have been checked against one another. The tarballs contain the latest fixes.

[0] Read the instructions to the end before starting.

[1] Choose the version of Globus you are running. Please note that the current patch sets only support 2.2.4 for VDT, and 2.4.3 for TeraGrid. Unfortunately, I cannot simulate all stages of updates of a particular Globus version, thus you may experience some offset with the patches.

[2] Download the tarball and unpack it in a place of your convenience, but NOT in $GLOBUS_LOCATION. Each tarball provides three entries for each patched file:
    [a] The new version, ending in either ".pm" or ".in"
    [b] The original file, ending in ".old"
    [c] The patch file, ending in ".diff"
It is recommended that you use the GNU patch tool to update your installation, essentially applying [2c] to your files. The directory location of the files is preserved. It is NOT recommended to use the [2a] file as a drop-in replacement. The [2a] and [2b] files are provided as a reference for you to determine how much your installation deviates from mine.

[3] Create a backup of those of your files that correspond to the files from [2a], e.g. by moving them to a suffix ".org" and copying them back to the original location with the original suffix. This preserves the timestamp of the original file.

[4] Patch the files. Refer to the GNU patch manual for how to run patch. If you run patch from deep within the directory tree, you may need to 'truncate' paths from the front of the patch by using the -p argument, which specifies how many directory levels you want to cut off. You may experience some offset when patching.

[5] If you experience any patch failures in step [4], you may try to manually integrate the "correct" thing. This option is for Perl experts only. If you are not in that class, back out all patches from the backups you generated in step [3]. Your installation may not be patchable with the patches I provided. You may ask me to look at it, but my time in the pre-SC season is very tight. I can correct obvious blunders.

Another note: The patches in the lib/perl directory have higher priority, as those are the files that are actually used by the jobmanager. The files in setup/globus have lower priority, but they constitute the template with which an update, or any gpt-* tool run, may overwrite the version in lib/perl.
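Steps [3] to [5] of the HOWTO could be scripted roughly as follows. This is a sketch rather than a tested installer: the three function names are made up here, and apply_diff assumes GNU patch, whose --dry-run option checks a diff without modifying the target. Naming the target file explicitly, as done here, sidesteps the -p path stripping mentioned in step [4].

```shell
# backup_original FILE: step [3]. Move FILE to FILE.org, then copy it back,
# so that FILE.org keeps the timestamp of the original.
backup_original() {
    mv "$1" "$1.org" && cp "$1.org" "$1"
}

# restore_original FILE: back out a change from the step [3] backup.
restore_original() {
    mv "$1.org" "$1"
}

# apply_diff FILE DIFF: step [4] for one file (assumes GNU patch).
apply_diff() {
    # First verify that the diff applies, without touching FILE
    patch --dry-run "$1" "$2" >/dev/null || return 1
    patch "$1" "$2"
}
```

A possible sequence for one file, with the tarball unpacked in a hypothetical /tmp/patches directory:

    cd $GLOBUS_LOCATION/lib/perl/Globus/GRAM/JobManager
    backup_original lsf.pm
    apply_diff lsf.pm /tmp/patches/lib/perl/Globus/GRAM/JobManager/lsf.diff || restore_original lsf.pm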
(From update of attachment 228 [details]) use the collective patch.
(From update of attachment 229 [details]) use collective patch.
(From update of attachment 230 [details]) use collective patch.
(From update of attachment 231 [details]) use collective patch
(From update of attachment 232 [details]) use collective patch.
(From update of attachment 233 [details]) use collective patch.
(From update of attachment 234 [details]) use collective patch.
(From update of attachment 235 [details]) use collective patch.
(From update of attachment 236 [details]) use collective patch.
(From update of attachment 237 [details]) use collective patch.
(From update of attachment 238 [details]) use collective patch.
(From update of attachment 239 [details]) use collective patch.
(From update of attachment 240 [details]) use collective patch.
(From update of attachment 241 [details]) use collective patch.
(From update of attachment 242 [details]) use collective patch.
(From update of attachment 243 [details]) use collective patch.
TeraGrid is also seeing this bug - they are having failure rates of 30-40% without the patches. Fixing this is urgent for them because of SC dependencies. I've added sandra and nick to the cc list because of this.
Created an attachment (id=258) [details] Globus 2.2.4 collective patches (fork/lsf/pbs/condor) Updates the $self->pipe_out_cmd method to solve a problem with PBS, where it didn't see job state changes correctly. Updates files JobManager.* and StdioMerger.* from the previous tarball. This tarball contains all files: lib/perl/Globus/GRAM/JobManager/condor.pm lib/perl/Globus/GRAM/JobManager/condor.old lib/perl/Globus/GRAM/JobManager/lsf.diff lib/perl/Globus/GRAM/JobManager/fork.old lib/perl/Globus/GRAM/JobManager/lsf.old lib/perl/Globus/GRAM/JobManager/condor.diff lib/perl/Globus/GRAM/JobManager/pbs.diff lib/perl/Globus/GRAM/JobManager/fork.pm lib/perl/Globus/GRAM/JobManager/fork.diff lib/perl/Globus/GRAM/JobManager/lsf.pm lib/perl/Globus/GRAM/JobManager/pbs.pm lib/perl/Globus/GRAM/JobManager/pbs.old lib/perl/Globus/GRAM/StdioMerger.old lib/perl/Globus/GRAM/StdioMerger.pm lib/perl/Globus/GRAM/StdioMerger.diff lib/perl/Globus/GRAM/JobManager.pm lib/perl/Globus/GRAM/JobManager.old lib/perl/Globus/GRAM/JobManager.diff setup/globus/condor.in setup/globus/condor.old setup/globus/lsf.diff setup/globus/fork.old setup/globus/lsf.old setup/globus/condor.diff setup/globus/pbs.diff setup/globus/fork.in setup/globus/fork.diff setup/globus/lsf.in setup/globus/pbs.in setup/globus/pbs.old Again, only files (JobManager|StdioMerger).(pm|diff) are different.
Created an attachment (id=259) [details] Globus 2.4.3 collective patches (fork/condor/pbs) Updates the $self->pipe_out_cmd method to solve a problem with PBS, where it didn't see job state changes correctly. Updates files JobManager.* and StdioMerger.* from the previous tarball. This tarball contains all files: lib/perl/Globus/GRAM/JobManager/condor.pm lib/perl/Globus/GRAM/JobManager/condor.old lib/perl/Globus/GRAM/JobManager/fork.old lib/perl/Globus/GRAM/JobManager/condor.diff lib/perl/Globus/GRAM/JobManager/pbs.diff lib/perl/Globus/GRAM/JobManager/fork.pm lib/perl/Globus/GRAM/JobManager/fork.diff lib/perl/Globus/GRAM/JobManager/pbs.pm lib/perl/Globus/GRAM/JobManager/pbs.old lib/perl/Globus/GRAM/StdioMerger.old lib/perl/Globus/GRAM/StdioMerger.pm lib/perl/Globus/GRAM/StdioMerger.diff lib/perl/Globus/GRAM/JobManager.pm lib/perl/Globus/GRAM/JobManager.old lib/perl/Globus/GRAM/JobManager.diff setup/globus/condor.in setup/globus/condor.old setup/globus/fork.old setup/globus/condor.diff setup/globus/pbs.diff setup/globus/fork.in setup/globus/fork.diff setup/globus/pbs.in setup/globus/pbs.old Again, only files (JobManager|StdioMerger).(pm|diff) are different.
Please read bug 1425 to avoid making the Condor GridMonitor stumble.
I just realized, while trying to run jobs on the LCG/HEP, that the problem may be more profound than we realized. The LCG gatekeeper does *not* have access to any shared filesystem the worker nodes can see. Thus, it is the sole responsibility of the remote scheduling system, an LCG-adapted PBS, to propagate things into the gatekeeper's GASS cache in a timely fashion. Unfortunately, I rarely see my stdout from batched jobs (as opposed to g-j-r interactive jobs).
Created an attachment (id=269) [details] Globus 2.2.4 collective patches (fork/lsf/pbs/condor) Additionally fixes bug #1425.
Created an attachment (id=270) [details] Globus 2.4.3 collective patches (fork/condor/pbs) Additionally fixes bug #1425.
Jens, Is the only change the patch for bug #1425, or is there something else different as well? Did you find that the problem from #1425 existed in the PBS and LSF job managers as well, or just Condor? We only looked at the Condor jobmanager. -alain
Alain, when integrating fix 1425, I went through all jobmanagers to see, if they suffer from a similar problem. Additionally, I added an nfssync to the Condor one at this (new) DONE stage, which was missing before. Download and have a look yourself :-)
Jens, Do you know how to apply your latest patch into globus LSF job submission script? I tried to download your attachment and got attachment.cgi. What is the format of the patch? I have VDT 1.1.11 installed. Regards, Dantong
On Wed, 3 Dec 2003 bugzilla-daemon@mcs.anl.gov wrote:
> Do you know how to apply your latest patch into globus LSF job
> submission script? I tried to download your attachment and got
> attachment.cgi. What is the format of the patch?

I believe I marked it binary. Try renaming it to something.tar.gz, and ask "file", what it thinks that you downloaded. LSF will be unaffected, no changes here.

Ciao,
Dipl.-Ing. Jens-S. Vöckler
voeckler at cs dot uchicago dot edu
University of Chicago; Research Institutes Building #402; 5640 South Ellis Avenue; Chicago, IL 60637-1433; USA; +1 773 834 6693
You can rely on NFS for only one thing - don't!
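The rename-and-inspect advice can be sketched as a small helper. The function name inspect_attachment and the target name patches.tar.gz are made up for illustration; the sketch assumes the download really is a gzip-compressed tarball, which is what the attachment descriptions suggest.

```shell
# inspect_attachment SRC DEST (hypothetical helper)
# Renames a downloaded attachment and checks that it is a gzipped tarball.
inspect_attachment() {
    mv "$1" "$2" || return 1
    # "file" reports the detected type; skipped if it is not installed
    if command -v file >/dev/null 2>&1; then
        file "$2"
    fi
    # gzip -t verifies the compressed stream; tar -tzf lists the members
    gzip -t "$2" && tar -tzf "$2"
}
```

For example: inspect_attachment attachment.cgi patches.tar.gz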
All of the fork/condor/pbs and JobManager patches are now merged to the CVS trunk. joe
Joe,

Could you please give more details about the CVS trunk? How do I get this patch from your CVS trunk? Does this patch include LSF? I saw a patch posted on 10/14/2003 called "LSF for grid3". Is this the newest LSF patch that can be applied to Globus 2.4.x?

Since I have been following this bug for a long time, more details about the fix would be highly appreciated for my documentation. What is the problem that causes this NFS bug? How did you fix the NFS-related bug, especially for the Globus job manager for LSF?

Regards,
Dantong
Hi Dantong,

I have been working on this for a few days now, and today I discovered that our problem at BNL doesn't seem to be related to NFS. It is actually much simpler: the jobmanager starts polling LSF too soon, LSF says that it doesn't know about the JOBID, and the JobManager reports the job as DONE. Naturally, after the job is reported as done, the cache is cleaned, and LSF can't write the output anymore.

A simple change, namely returning PENDING when LSF doesn't find the job, seems to have removed the problem. I am trying not to rush to conclusions here... but the current results I am having are _very_ encouraging! More on this later!
Hi all,

I'll go into more detail on what I have found. I'd be interested in knowing whether other people are seeing the same problem, or whether this is an RCF-only problem. In all the preliminary tests I am not losing outputs anymore, but before declaring this really fixed, I'll be making further stress tests. Still, I have made progress which other people might find interesting.

Initially, I worked on understanding the LSF and NFS setup here at RCF: there was a general belief that it was an NFS issue, so I started by understanding those pieces. After gaining enough information, I started testing Globus (2.4.2) and the LSF jobmanager. While doing that, I noticed that some jobs were advertised as DONE by Globus while they were advertised as PENDING by LSF. Today I gathered some statistics on that, and noticed that the number of mistracked jobs was exactly the number of jobs that lost their output. I looked at the gram logs, and noticed that for each mistracked job, the following error was in the log:

bjobs rc is 255 == Job <123> is not found == DONE

That corresponds to the branch in which the LSF JobManager gets an error from LSF (bjobs), in which case it just reports the job as done. Now I thought: what does LSF return if I ask for a job it is not aware of? I tried, and got the same error. The hypothesis was: if the jobmanager contacts LSF too soon, bjobs might say that the job wasn't found. So, I have changed this:

if($exit_code == 255)
{
    $self->log("bjobs rc is 255 == Job <123> is not found == DONE");
    $state = Globus::GRAM::JobState::DONE;
}

into this:

if($exit_code == 255)
{
    $self->log("bjobs rc is 255 == Job <123> is not found == PENDING");
    $state = Globus::GRAM::JobState::PENDING;
}

From the log I now see that some jobs print that message once, but then they proceed, they are reported correctly and the output is displayed.

Now, this is not at all a complete solution: if a job is really nonexistent (because of some error), the jobmanager would keep polling... so I will need to find a way to distinguish the two cases. It was just a way to quickly test my hypothesis. Also, I don't know whether this is the only cause of the problems we were having, but it sure is part of the equation.

As I said, I will be making further stress tests to gather statistics on thousands of jobs and see what comes out. I'll also try to find a good way to understand which kind of error condition I am getting (suggestions?). I could attach a patch right now, but I think it's better to wait until I finish. What do people think?

Gabriele
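The endless-polling concern raised above could be bounded by counting consecutive "not found" answers. The sketch below is only one possible approach and is not something proposed in this comment: the function name, the miss counter and the limit are all hypothetical, and only the meaning of exit code 255 (bjobs did not find the job) comes from the log line quoted above.

```shell
# state_for_bjobs_rc RC MISSES MAX_MISSES (hypothetical helper)
# RC         exit code of the last bjobs call
# MISSES     consecutive "not found" answers seen before this poll
# MAX_MISSES how many misses to tolerate before giving up
state_for_bjobs_rc() {
    if [ "$1" -ne 255 ]; then
        # bjobs answered; the state comes from its output (not modelled)
        echo OTHER
    elif [ "$2" -lt "$3" ]; then
        # Probably polled too soon after submission: keep waiting
        echo PENDING
    else
        # Still unknown after MAX_MISSES polls: treat the job as lost
        echo FAILED
    fi
}
```

With a limit of 5, state_for_bjobs_rc 255 0 5 prints PENDING and state_for_bjobs_rc 255 5 5 prints FAILED.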
Hi all,

I am ready to declare this bug fixed at RCF: yesterday I submitted 1000 jobs, and all of them were tracked correctly and returned their outputs. All except one, which failed for other reasons. I really hope this will fix the problem for other sites too.

I'll attach a patch with the modification. I had to strip out other STAR custom modifications; hopefully that won't affect it. I tried the patch by itself, but didn't run another stress test. The diff is against the lsf.pm found in VDT 1.1.12.

Before going into the details, I'd like to thank all the people who bore with me during the investigation: Ofer Rind (LSF admin at RCF), Azadeh Handley (Platform support), Robert Petckus (NFS admin at RCF), Jens, Joe, Stuart and Dantong. Without their knowledge of the various pieces, I wouldn't have gone very far.

Azadeh confirmed the issue of bjobs not reporting jobs that were submitted only a moment before. For the LSF savvy, this is due to "the fact that you use multithreading, which will force mbatchd to spawn a child mbatchd to process bjobs queries. When you submit a new job if at that moment you have a child mbatchd spawned to process your bjobs query, then that child mbatchd will not be aware of your new job. Mbatchd would usually kill the child mbatchd and spawn a new process in these situations. However it is all the matter of timing and how fast you run bjobs after your bsub."

As to how to distinguish this case, "bhist is always your best bet, as it will directly check the events file rather than polling mbatchd"

So what the patch does is this: when bjobs reports that no job was found, it checks whether bhist agrees. If bhist does not find the job either, it reports FAILED; if bhist finds the job, it reports PENDING.

Thanks,
Gabriele
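The decision rule described in this comment can be illustrated with a few lines of shell. This is not the attached lsf.pm patch, which is Perl, and it rests on an assumption that is not confirmed anywhere in this thread: that bhist exits with status 0 when it knows the job and non-zero otherwise. A real implementation may need to inspect the bhist output instead. The function name and the BHIST override variable are hypothetical.

```shell
# state_when_bjobs_misses JOBID (hypothetical helper)
# Called after bjobs reported the job as not found.
# ASSUMPTION: bhist exits 0 if it knows JOBID, non-zero otherwise.
# BHIST may name a stand-in command, which is useful for testing.
state_when_bjobs_misses() {
    if "${BHIST:-bhist}" "$1" >/dev/null 2>&1; then
        # The events file knows the job; bjobs was asked too early
        echo PENDING
    else
        # Neither bjobs nor bhist knows the job
        echo FAILED
    fi
}
```

Setting BHIST=true or BHIST=false exercises both branches without an LSF installation.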
Created an attachment (id=296) [details] Patch to the VDT 1.1.12 (Globus 2.2.4) lsf.pm
*** Bug 1283 has been marked as a duplicate of this bug. ***
LSF patch committed. joe