Bug 4431 - freeze by Unsubmitted on personal WS GRAM
Status: RESOLVED INVALID
Product: GRAM
Component: wsrf managed execution job service
Version: 4.0.2
Platform: PC Linux
Importance: P3 normal
Target Milestone: 4.2
Reported: 2006-05-23 00:47
Modified: 2006-10-03 09:15


Attachments
container-log-of-unsubmited.gz (20.97 KB, application/octet-stream)
2006-06-08 03:10, tashiro


Description From 2006-05-23 00:47:27
I installed GT 4.0.2 and run WS GRAM as a non-root user,
but WS GRAM does not work correctly in some cases.

I installed GT 4.0.2 into /home/tashiro/gt402 and started the container.
I submitted a WS GRAM job, but the "Done" state was never reported.
Only "Unsubmitted" was reported, and the command did not
return to the command line until I pressed Control-C.

The failing run is shown below.

tashiro$ ls touched_it
ls: touched_it: No such file or directory
tashiro$ globusrun-ws -submit -F \
    https://myhost:5447/wsrf/services/ManagedJobFactoryService \
    -subject "`grid-cert-info -subject`" -c /bin/touch touched_it
Submitting job...Done.
Job ID: uuid:61fdabd4-e6f3-11da-97e9-00304829a67e
Termination time: 05/20/2006 04:53 GMT
Current job state: Unsubmitted
Canceling...Canceled.
Destroying job...Done.
globusrun-ws: Operation was canceled
tashiro$ ls touched_it
touched_it
tashiro$

The WS container log looks like this:

2006-05-19 13:53:16,356 INFO  exec.StateMachine [RunQueue Other,logJobAccepted:3236] Job 61fdabd4-e6f3-11da-97e9-00304829a67e accepted for local user 'tashiro'
2006-05-19 13:57:42,723 INFO  exec.StateMachine [RunQueue Other,logJobFailed:3255] Job 61fdabd4-e6f3-11da-97e9-00304829a67e failed
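
To pull every entry for one job out of a longer container log, something
like the following might help. It is only a sketch: job_log_entries is a
hypothetical helper name, and the log file is whatever file the container
output was written to (the path is not given in this report). It relies on
the log printing the job ID without the "uuid:" prefix that globusrun-ws
shows, as in the excerpt above.

```shell
# Print all container-log lines that mention one job.
# $1: container log file (path is an assumption; use your own)
# $2: job ID as printed by globusrun-ws, with or without the "uuid:" prefix
job_log_entries() {
    log_file="$1"
    job_id="${2#uuid:}"    # the log shows the bare ID, so drop the prefix
    grep -F -- "$job_id" "$log_file"
}
```

Called with the job ID from the run above, this should print the "accepted"
and "failed" lines of that job and nothing from other jobs.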

I waited about 1 minute before "Unsubmitted" was reported.
I pressed Control-C after "Unsubmitted" was reported.

In the successful case, the command finished in about 10 seconds:
it did not hang, and the job reached "Done".

My GT4 setup is as follows.
  - I have neither root nor globus user privileges.
  - I installed gt4.0.2-all-source-installer.tar.bz2 into $HOME/gt402.
  - I started PostgreSQL 7.4.9 on port 5445 under my own account.
  - I started globus-gridftp-server on port 5446 under my own account.
  - I started the WS container on port 5447 under my own account.
  - I use the Fork jobmanager. I did not change the jobmanager settings.

I edited the following files to make it work.
  - $GLOBUS_LOCATION/etc/globus_wsrf_rft/jndi-config.xml
  - $GLOBUS_LOCATION/etc/globus_wsrf_core/global_security_description.xml
  - $GLOBUS_LOCATION/etc/gram-service/globus_gram_fs_map_config.xml
  - $HOME/.gridmap
  - I also ran $GLOBUS_LOCATION/setup/globus/setup-gram-service-common
    (with the staging subjects specified).

I tested this on several hosts.
2 hosts failed, and the other 8 hosts succeeded.

The following hosts reported "Unsubmitted" and never returned (failure):
  - Red Hat Linux 8.0, Xeon x2, Java 1.5, GCC 3.3.3
  - SuSE Linux Enterprise Server 8 SP4, Opteron x2, Java 1.5, GCC 3.2.2
  (These hosts failed every time I tried.)

I reported the same problem for GT 4.0.1 on Fri, 02 Dec 2005.
http://www-unix.globus.org/mail_archive/discuss/2005/12/msg00007.html

At that time, the problem went away by itself.
WS GRAM began to work fine, although I had changed nothing,
so the cause of the error was never identified.

Now I have the same problem with GT 4.0.2.


I checked the suggestions I had received previously.

tashiro$ ls -l $GLOBUS_LOCATION/var/globus-fork.log
-rw--w--w-    1 tashiro  tashiro         0 May  9 10:58 /home/tashiro/gt402/var/globus-fork.log
tashiro$

tashiro$ cd $GLOBUS_LOCATION/test/globus_scheduler_event_generator_test
globus_scheduler_event_generator_test$ ./TESTS.pl
seg-api-test............ok
seg-module-load-test....ok
seg-timestamp-test......ok
All tests successful.
Files=3, Tests=6,  0 wallclock secs ( 0.10 cusr +  0.06 csys =  0.16 CPU)
globus_scheduler_event_generator_test$ cd \
    $GLOBUS_LOCATION/test/globus_scheduler_event_generator_fork_test
globus_scheduler_event_generator_fork_test$ ./TESTS.pl
Warning: Do not start a service container while this test script is running.
test-fork-seg....ok
All tests successful.
Files=1, Tests=1,  6 wallclock secs ( 0.07 cusr +  0.01 csys =  0.08 CPU)
globus_scheduler_event_generator_fork_test$

The tests appear to have passed.

I also tried another suggestion:
http://www-unix.globus.org/mail_archive/discuss/2005/11/msg00485.html

tashiro$ counter-client \
    -s https://myhost:5447/wsrf/services/CounterService \
    -z "`grid-cert-info -subject`"
Got notification with value: 3
Counter has value: 3
Got notification with value: 13
tashiro$

This test appears to have passed as well.

In addition, Globus Bug 4191 appears to report the same error.
In that report, however, the touched_it file does not exist.
http://bugzilla.globus.org/globus/show_bug.cgi?id=4191

In my case, the touched_it file does exist.

(I tried to ask about this on gt-user@globus.org, but my mail was not delivered.
  There seems to be a delivery problem. I have asked owner-gt-user@globus.org
  about it, but I have not received a reply yet.)
------- Comment #1 From 2006-05-23 09:17:57 -------
Do not subscribe to gt-user for gram problems. Subscribe instead to gram-user.
Send email to majordomo@globus.org with the text "subscribe gram-user".
------- Comment #2 From 2006-05-24 06:38:03 -------
Sorry. I have now subscribed to gram-user.

  (But the undelivered mail to gt-user is a separate problem.
   I am still waiting for a reply from the administrator of the gt-user mailing list.
   There may be other people whose mail is unexpectedly dropped
   by globus.org, as mine was. I am worried that the globus.org
   mailing-list server or the gt-user mailing list is misconfigured.)

Anyway thanks for your suggestion.
------- Comment #3 From 2006-06-06 18:03:09 -------
Can you turn on full GRAM debug logging, restart the container, and submit only
one job? Then attach the container log file to this bug report.
------- Comment #4 From 2006-06-08 03:10:09 -------
Created an attachment (id=975)
container-log-of-unsubmited.gz

I am attaching the precise container debug log.
This log includes some information for me, thus, I gzipped it.

Timeline:
Thu Jun  8 16:41:00 JST 2006 : container started
Thu Jun  8 16:43:00 JST 2006 : job submitted
Thu Jun  8 16:45:04 JST 2006 : "Unsubmitted" was reported
Thu Jun  8 16:50:00 JST 2006 : pressed Control-C
Thu Jun  8 16:53:00 JST 2006 : container stopped

$ echo $GLOBUS_LOCATION
/home/tashiro/gt4test/gt402
$ globusrun-ws -submit -F \
    https://myhost:5447/wsrf/services/ManagedJobFactoryService \
    -subject "`grid-cert-info -subject`" -c /bin/touch touched_it
Submitting job...Done.
Job ID: uuid:695c4406-f6c2-11da-bcca-00304829a67e
Termination time: 06/09/2006 07:43 GMT
Current job state: Unsubmitted
Canceling...Canceled.
Destroying job...Done.
globusrun-ws: Operation was canceled
$ ls -l $HOME/touched_it
-rw-r--r--    1 tashiro  tashiro         0 Jun  8 16:43 /home/tashiro/touched_it
$ 


$ diff -u container-log4j.properties.060608 container-log4j.properties
--- container-log4j.properties.060608   2006-06-08 16:24:14.000000000 +0900
+++ container-log4j.properties  2006-06-08 16:24:26.000000000 +0900
@@ -24,7 +24,7 @@
 # log4j.category.org.globus.mds=DEBUG

 # Uncomment the following line to enable GRAM debugging
-# log4j.category.org.globus.exec=DEBUG
+log4j.category.org.globus.exec=DEBUG

 # Uncomment the following line to enable RFT debugging
-# log4j.category.org.globus.transfer=DEBUG
+log4j.category.org.globus.transfer=DEBUG
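
The same change can be scripted rather than edited by hand. The following is
a minimal sketch, not an official Globus tool: enable_gram_debug is a
hypothetical helper name, the properties file is passed as an argument
because its location depends on the installation, and the two category
lines are assumed to be present in commented-out form exactly as in the
diff above.

```shell
# Uncomment the GRAM and RFT debug categories in a log4j properties file.
# $1: path to container-log4j.properties (location depends on your install)
enable_gram_debug() {
    props="$1"
    # Keep a backup so the change can be reverted later.
    cp "$props" "$props.bak" || return 1
    # Strip the leading "# " from the two category lines only.
    sed -e 's/^#[[:space:]]*\(log4j\.category\.org\.globus\.exec=DEBUG\)/\1/' \
        -e 's/^#[[:space:]]*\(log4j\.category\.org\.globus\.transfer=DEBUG\)/\1/' \
        "$props.bak" > "$props"
}
```

The container has to be restarted afterwards for the new log levels to take
effect, as requested in Comment #3.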
------- Comment #5 From 2006-10-03 01:41:52 -------
I heard the report below from a member of the AIST team.
I don't know whether it has already been reported to the Globus team.

My container is now running fine.
The problem was caused by the network file system (NFS, NAS).

My home directory was on an NFS-mounted file system.
I moved $GLOBUS_LOCATION/var/globus-fork.log to a
non-NFS partition.
(I moved the $GL/var/ directory to /tmp, which is not on NFS, and symlinked it.)
After that, WS GRAM (the container) worked.

The reason appears to be that if locking var/globus-fork.log fails,
the scheduler-event-generator (or globus-fork-starter?)
does not work correctly.
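
The workaround described above might be scripted roughly as follows. This is
a sketch under assumptions, not a tested procedure: fs_type, is_network_fs
and relocate_var are hypothetical helper names, fs_type relies on GNU
coreutils stat, the list of network file system names is not exhaustive, and
the container is assumed to be stopped before var is moved. Note that /tmp
is cleaned at reboot on many systems, so another local directory may be a
better target.

```shell
# Print the file system type of a path, e.g. "nfs" (GNU coreutils stat).
fs_type() {
    stat -f -c %T "$1"
}

# Succeed if a type name looks like a network file system (partial list).
is_network_fs() {
    case "$1" in
        nfs*|cifs|smb*) return 0 ;;
        *) return 1 ;;
    esac
}

# Move <globus_location>/var to a local directory and leave a symlink behind.
# $1: Globus install directory, $2: new local directory (must not exist yet)
relocate_var() {
    gl="$1"
    target="$2"
    # Refuse to run if var is missing, already a symlink, or the target exists.
    if [ ! -d "$gl/var" ] || [ -L "$gl/var" ] || [ -e "$target" ]; then
        echo "relocate_var: nothing done (check $gl/var and $target)" >&2
        return 1
    fi
    mv "$gl/var" "$target" && ln -s "$target" "$gl/var"
}
```

A possible use, with /tmp/gt402-var as an example target name: run
is_network_fs "$(fs_type "$GLOBUS_LOCATION/var")" and, if it succeeds, call
relocate_var "$GLOBUS_LOCATION" /tmp/gt402-var before restarting the
container.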
------- Comment #6 From 2006-10-03 09:15:21 -------
Ok, thanks Tashiro. That's something we should probably add to our
troubleshooting documentation. Resolving as INVALID.