Bugzilla – Bug 4360
globus-job-get-output bug prevents output delivery, PBS jobmanager affected. See also globus-job-clean, globus-job-cancel
Last modified: 2012-09-12 13:24:17
The job submission line in the globus-job-get-output script has an error that causes $GLOBUS_LOCATION to be evaluated in the context of the system the gatekeeper is running on rather than the target compute resource. I edited it as follows to correct it:

$ diff globus-job-get-output.orig globus-job-get-output
184c184
< "&(executable=\$(GLOBUS_LOCATION)/bin/globus-sh-exec)(arguments=-exec \"file=\`\${bindir}/globus-gass-cache -query -t anExtraTag x-gass-cache://${jobid}${stream}\`; if test -r \"\"\$file\"\" ; then ${command} \$file ; else echo Invalid job id. 1>&2; fi\")"
---
> "&(executable=\"\${GLOBUS_LOCATION}/bin/globus-sh-exec\")(arguments=-exec \"file=\`\${bindir}/globus-gass-cache -query -t anExtraTag x-gass-cache://${jobid}${stream}\`; if test -r \"\"\$file\"\" ; then ${command} \$file ; else echo Invalid job id. 1>&2; fi\")"

To get this to work with our PBS jobmanagers, I also had to change the following line so that executables specified with a leading environment variable reference do not get ./ prefixed to them:

#original:
if ($description->executable() =~ m|^[^/]|)

changed to

if ($description->executable() =~ m|^[^/]| && $description->executable =~ m|^[^\$]|)
{
    $description->add('executable', './' . $description->executable());
}
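The effect of the patched jobmanager condition can be illustrated with a small shell sketch. This is not the jobmanager code (that is the Perl above); normalize_executable is a hypothetical helper name that only mirrors the rule: prefix ./ when the executable begins with neither / nor $.

```shell
#!/bin/sh
# Hypothetical helper (not part of Globus) mirroring the patched Perl test:
# prepend "./" only when the executable starts with neither "/" nor "$".
normalize_executable() {
    case "$1" in
        ''|/*|\$*)
            # empty, absolute path, or leading variable reference: leave as-is
            printf '%s\n' "$1"
            ;;
        *)
            # plain relative name: resolve it against the job's directory
            printf '%s\n' "./$1"
            ;;
    esac
}

normalize_executable 'a.out'
normalize_executable '/bin/date'
normalize_executable '${GLOBUS_LOCATION}/bin/globus-sh-exec'
```

With the original single-condition test, the third call would have produced ./${GLOBUS_LOCATION}/bin/globus-sh-exec, which is why the extra check was needed once the RSL carried a literal variable reference.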
globus-job-cancel and globus-job-clean (identical scripts except in name) also suffer from the same problem and should be patched:

# diff globus-job-clean.orig globus-job-clean.patched
202c202
< myrsl="&(executable=\$(GLOBUS_LOCATION)/bin/globus-sh-exec)(arguments=-exec \"bad=0; \$bindir/globus-gass-cache -cleanup-url x-gass-cache://${jobid}stdout >/dev/null 2>/dev/null; if test \$? != 0; then bad=1; fi ; \$bindir/globus-gass-cache -cleanup-url x-gass-cache://${jobid}stderr >/dev/null 2>/dev/null; if test \$? != 0; then bad=1; fi; echo \$bad;\")"
---
> myrsl="&(executable=\"\${GLOBUS_LOCATION}/bin/globus-sh-exec\")(arguments=-exec \"bad=0; \$bindir/globus-gass-cache -cleanup-url x-gass-cache://${jobid}stdout >/dev/null 2>/dev/null; if test \$? != 0; then bad=1; fi ; \$bindir/globus-gass-cache -cleanup-url x-gass-cache://${jobid}stderr >/dev/null 2>/dev/null; if test \$? != 0; then bad=1; fi; echo \$bad;\")"
Can you explain how this patch works? It looks to me like it switches the executable's path from being resolved using the GLOBUS_LOCATION RSL substitution in the original code to using the GLOBUS_LOCATION environment variable in the job's environment in the modified code. I think that the GLOBUS_LOCATION environment variable and the RSL substitution are both set based on the -home argument to the job manager, so I'm wondering why this helps.

joe
Subject: Re: globus-job-get-output bug prevents output delivery, PBS jobmanager affected. See also globus-job-clean, globus-job-cancel

Here's the scenario:

User, Gatekeeper and target computing system all have different GLOBUS_LOCATION

On the user workstation, GLOBUS_LOCATION=/usr/local/globus/globus-4.0.1
On the gatekeeper system, GLOBUS_LOCATION=/usr/local/globus/globus-4.0.1-r3
On the target computing system, GLOBUS_LOCATION=/usr/local/packages/tg/globus-4.0.1-r3

Using globus-job-get-output from the distribution, I get the following error returned:

[dsimmel@kaminari ~]$ globus-job-get-output.orig -r gt4-submit.psc.teragrid.org/jobmanager-rachel-pbs -out $myjob
DATE: Tue Apr 25 14:07:43 2006
PBS JOB ID: 51592
$LOCAL: /carson64a/local/51592
Execution host: carson64a
Current directory: /usr/users/0/dsimmel
/var/spool/OpenPBS/mom_priv/jobs/51592.rache.SC: /usr/local/globus/globus-4.0.1-r3/bin/globus-sh-exec: not found

- - - - -

The DATE, PBS JOB ID, $LOCAL, Execution host, and Current directory lines are returned by the compute platform (rachel) for every job submitted. The error reflects interpretation of $(GLOBUS_LOCATION) in the submitted RSL in the context of the gatekeeper system, rather than the target compute platform, which is where the commands need to execute. This happens despite the fact that we force GLOBUS_LOCATION in our jobmanager script to match the path on the target computing platform.

The PBS script that is generated by the jobmanager and submitted on the compute platform for the -get-output command looks like:

[root@gt4-submit tmp]# cat pbs.rachel.out.23779
#! /bin/sh
# PBS batch job script built by Globus job manager
#
#PBS -S /bin/sh
#PBS -N TG23779
#PBS -m n
#PBS -o /usr/users/0/dsimmel/.globus/job/gt4-submit.psc.teragrid.org/23776.1145988460/stdout
#PBS -e /usr/users/0/dsimmel/.globus/job/gt4-submit.psc.teragrid.org/23776.1145988460/stderr
#PBS -l nodes=1:ppn=1
X509_USER_PROXY="/usr/users/0/dsimmel/.globus/job/gt4-submit.psc.teragrid.org/23776.1145988460/x509_up"; export X509_USER_PROXY;
GLOBUS_LOCATION="/usr/local/packages/tg/globus-4.0.1-r3"; export GLOBUS_LOCATION;
GLOBUS_GRAM_JOB_CONTACT="https://gt4-submit.psc.teragrid.org:50037/23776/1145988460/"; export GLOBUS_GRAM_JOB_CONTACT;
GLOBUS_GRAM_MYJOB_CONTACT="URLx-nexus://gt4-submit.psc.teragrid.org:50038/"; export GLOBUS_GRAM_MYJOB_CONTACT;
HOME="/usr/users/0/dsimmel"; export HOME;
LOGNAME="dsimmel"; export LOGNAME;
LD_LIBRARY_PATH=; export LD_LIBRARY_PATH;
#Source the Globus enviroment script
. /usr/local/packages/tg/globus-4.0.1-r3/etc/globus-user-env.sh
cd ${LOCAL}
export OMP_NUM_THREADS ${PBS_VPPN}
/usr/local/globus/globus-4.0.1-r3/bin/globus-sh-exec "-exec" "file=\`\${bindir}/globus-gass-cache -query -t anExtraTag x-gass-cache://https://gt4-submit.psc.teragrid.org:50037/23746/1145988277/stdout\`; if test -r \"\$file\" ; then \${GLOBUS_SH_CAT-cat} \$file ; else echo Invalid job id. 1>&2; fi" </dev/null

- - - - -

The patch I applied to the globus-job-get-output client is as follows:

[dsimmel@kaminari ~]$ diff $GLOBUS_LOCATION/bin/globus-job-get-output.orig $GLOBUS_LOCATION/bin/globus-job-get-output.patched
184c184
< "&(executable=\$(GLOBUS_LOCATION)/bin/globus-sh-exec)(arguments=-exec \"file=\`\${bindir}/globus-gass-cache -query -t anExtraTag x-gass-cache://${jobid}${stream}\`; if test -r \"\"\$file\"\" ; then ${command} \$file ; else echo Invalid job id. 1>&2; fi\")"
---
> "&(executable=\"\${GLOBUS_LOCATION}/bin/globus-sh-exec\")(arguments=-exec \"file=\`\${bindir}/globus-gass-cache -query -t anExtraTag x-gass-cache://${jobid}${stream}\`; if test -r \"\"\$file\"\" ; then ${command} \$file ; else echo Invalid job id. 1>&2; fi\")"

- - - - -

When I run the patched edition, I get:

[dsimmel@kaminari ~]$ globus-job-get-output.patched -r gt4-submit.psc.teragrid.org/jobmanager-rachel-pbs -out $myjob
DATE: Tue Apr 25 14:39:41 2006
PBS JOB ID: 51593
$LOCAL: /carson64a/local/51593
Execution host: carson64a
Current directory: /usr/users/0/dsimmel
DATE: Tue Apr 25 14:04:41 2006
PBS JOB ID: 51591
$LOCAL: /carson64a/local/51591
Execution host: carson64a
Current directory: /usr/users/0/dsimmel
Tue Apr 25 14:04:41 EDT 2006

- - - - -

The first DATE...Current directory block is for the -get-output job; the rest is the original job's stdout. The PBS script in this case looks like:

[root@gt4-submit tmp]# cat pbs.rachel.out.23809
#! /bin/sh
# PBS batch job script built by Globus job manager
#
#PBS -S /bin/sh
#PBS -N TG23809
#PBS -m n
#PBS -o /usr/users/0/dsimmel/.globus/job/gt4-submit.psc.teragrid.org/23806.1145990378/stdout
#PBS -e /usr/users/0/dsimmel/.globus/job/gt4-submit.psc.teragrid.org/23806.1145990378/stderr
#PBS -l nodes=1:ppn=1
X509_USER_PROXY="/usr/users/0/dsimmel/.globus/job/gt4-submit.psc.teragrid.org/23806.1145990378/x509_up"; export X509_USER_PROXY;
GLOBUS_LOCATION="/usr/local/packages/tg/globus-4.0.1-r3"; export GLOBUS_LOCATION;
GLOBUS_GRAM_JOB_CONTACT="https://gt4-submit.psc.teragrid.org:50037/23806/1145990378/"; export GLOBUS_GRAM_JOB_CONTACT;
GLOBUS_GRAM_MYJOB_CONTACT="URLx-nexus://gt4-submit.psc.teragrid.org:50038/"; export GLOBUS_GRAM_MYJOB_CONTACT;
HOME="/usr/users/0/dsimmel"; export HOME;
LOGNAME="dsimmel"; export LOGNAME;
LD_LIBRARY_PATH=; export LD_LIBRARY_PATH;
#Source the Globus enviroment script
. /usr/local/packages/tg/globus-4.0.1-r3/etc/globus-user-env.sh
cd ${LOCAL}
export OMP_NUM_THREADS ${PBS_VPPN}
${GLOBUS_LOCATION}/bin/globus-sh-exec "-exec" "file=\`\${bindir}/globus-gass-cache -query -t anExtraTag x-gass-cache://https://gt4-submit.psc.teragrid.org:50037/23746/1145988277/stdout\`; if test -r \"\$file\" ; then \${GLOBUS_SH_CAT-cat} \$file ; else echo Invalid job id. 1>&2; fi" </dev/null

- - - - -

In this case, GLOBUS_LOCATION does not get interpreted until the script is run on the target, and the right thing happens. Note that I had to change a line in the PBS jobmanager to prevent it from prefixing the executable $0 with "./" for $0 not beginning with a /:

if ($description->executable() =~ m|^[^/]| && $description->executable =~ m|^[^\$]|)
{
    $description->add('executable', './' . $description->executable());
}

- - - - -

Note that globus-job-clean (a.k.a. globus-job-cancel) also suffers from this problem, and works correctly if the submitted RSL passes through the literal \${GLOBUS_LOCATION} rather than the RSL $(GLOBUS_LOCATION):

[dsimmel@kaminari ~]$ diff $GLOBUS_LOCATION/bin/globus-job-clean.orig $GLOBUS_LOCATION/bin/globus-job-clean.patched
202c202
< myrsl="&(executable=\$(GLOBUS_LOCATION)/bin/globus-sh-exec)(arguments=-exec \"bad=0; \$bindir/globus-gass-cache -cleanup-url x-gass-cache://${jobid}stdout >/dev/null 2>/dev/null; if test \$? != 0; then bad=1; fi ; \$bindir/globus-gass-cache -cleanup-url x-gass-cache://${jobid}stderr >/dev/null 2>/dev/null; if test \$? != 0; then bad=1; fi; echo \$bad;\")"
---
> myrsl="&(executable=\"\${GLOBUS_LOCATION}/bin/globus-sh-exec\")(arguments=-exec \"bad=0; \$bindir/globus-gass-cache -cleanup-url x-gass-cache://${jobid}stdout >/dev/null 2>/dev/null; if test \$? != 0; then bad=1; fi ; \$bindir/globus-gass-cache -cleanup-url x-gass-cache://${jobid}stderr >/dev/null 2>/dev/null; if test \$? != 0; then bad=1; fi; echo \$bad;\")"

- - - - -

- Derek

On Apr 25, 2006, at 9:09 AM, bugzilla-daemon@mcs.anl.gov wrote:

> http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=4360
>
> bester@mcs.anl.gov changed:
>
>             What|Removed |Added
> ----------------------------------------------------------------------------
>         Severity|blocker |normal
> Target Milestone|4.0.1   |---
>
> ------- Comment #2 from bester@mcs.anl.gov 2006-04-25 08:09 -------
> Can you explain how this patch works? It looks to me like it is switching the
> executable's path from being resolved using the GLOBUS_LOCATION RSL
> substitution in the original code, to using the GLOBUS_LOCATION environment
> variable in the job's environment in the modified code. I think that both the
> GLOBUS_LOCATION environment variable and RSL substitution are both set based on
> the -home argument to the job manager, so I'm wondering why this is helping.
>
> joe

---
Derek Simmel <dsimmel@psc.edu>
Pittsburgh Supercomputing Center
(412) 268-1035
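The difference between the two generated PBS scripts comes down to when the path is bound. A minimal shell sketch of that, using the gatekeeper and target paths from the scenario above; the variable names GATEKEEPER_GL and TARGET_GL and the helper resolve_on_target are made up for illustration and merely stand in for the batch job's shell:

```shell
#!/bin/sh
# Paths taken from the scenario in this report.
GATEKEEPER_GL=/usr/local/globus/globus-4.0.1-r3
TARGET_GL=/usr/local/packages/tg/globus-4.0.1-r3

# Early binding: the value is substituted while the script text is generated
# (on the gatekeeper), so the gatekeeper's path is baked into the script.
early_line="$GATEKEEPER_GL/bin/globus-sh-exec"

# Late binding: the literal variable reference survives into the script text.
late_line='${GLOBUS_LOCATION}/bin/globus-sh-exec'

# Hypothetical stand-in for the job's shell on the compute platform:
# expands the given line with the target's GLOBUS_LOCATION in the environment.
resolve_on_target() {
    GLOBUS_LOCATION="$TARGET_GL" sh -c "printf '%s\n' \"$1\""
}

resolve_on_target "$early_line"   # gatekeeper path, which does not exist on the target
resolve_on_target "$late_line"    # target path
```

The early-bound line keeps pointing at /usr/local/globus/globus-4.0.1-r3 no matter what the job's environment says, which matches the "not found" error from the unpatched client.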
Created an attachment (id=947)
patch to gram/jobmanager/source directory

Here's an alternative patch which adds a command-line option to the job manager. It allows systems to present a different GLOBUS_LOCATION for the target execution machine instead of using the same value for the job manager environment and the job environment. This avoids the scheduler-script-specific tweaks.

If you use this new option, -target-globus-location, the GLOBUS_LOCATION RSL substitution will be replaced with that value, and the GLOBUS_LOCATION environment variable in the job's environment will be set to that value. The script invocations used by the job manager (to submit and stage jobs) will have the job manager's own Globus location in their environment.

joe
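As a rough model of the behaviour described for -target-globus-location (this is not the job manager's code, and the fallback to the -home value when the option is absent is an assumption based on the description above), the value seen by the job could be chosen like this. job_globus_location is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical model only; the real logic lives in the job manager patch.
# $1: the job manager's own GLOBUS_LOCATION (from -home)
# $2: the -target-globus-location value, or empty when the option is not used
job_globus_location() {
    # Use the target value when given, otherwise fall back to the job manager's.
    printf '%s\n' "${2:-$1}"
}

# Without the option, job and job manager share one location (assumed default).
job_globus_location /usr/local/globus/globus-4.0.1-r3 ""

# With the option, the job sees the target's location instead.
job_globus_location /usr/local/globus/globus-4.0.1-r3 /usr/local/packages/tg/globus-4.0.1-r3
```

In this model the job manager's own submit and staging scripts would keep using the first argument, which is the split the comment describes.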
I've committed the new patch to the CVS trunk.
(In reply to comment #4)

Apologies for not returning to comment sooner. If I understand this approach correctly, it assumes that the -target-globus-location will be the same for all target computing resources served by the GRAM. This means that in order to serve multiple different target resources that may each have different local GLOBUS_LOCATIONs, we would have to run separate instances of GRAM, one for each target resource with a different GLOBUS_LOCATION. Is this right?
Hi Derek, I'm looking through 4.2 bugs. Did this get resolved? Does Joe's patch do what you needed? -Stu
No, as far as I know, this was not resolved - we continued on here at PSC with the patches I had made at the time. I didn't get an answer to my last questions, and I don't recall being able to utilize the patch Joe made - to be frank I haven't looked at this in quite a long time. It has not been raised as a significant user issue here at PSC (yet) since only a very few users have ever tried to submit jobs via Globus to our systems.

- Derek

---
Derek Simmel
Pittsburgh Supercomputing Center
(412) 268-1035
We've migrated our issue tracking software to jira.globus.org. Any new issues should be added here: http://jira.globus.org/secure/VersionBoard.jspa?selectedProjectId=10363 As this issue hasn't been commented on in several years, we're closing it. If you feel it is still relevant, please add it to jira.