Bug 4749 - client receives no state notifications if RSL <gram:count> is "big"

Component: wsrf managed execution job service
Version: 4.0.1
Platform: PC Windows XP
Priority/Severity: P1 critical
Target Milestone: 4.2
Reported: 2006-10-05 09:29
Modified: 2007-03-05 15:54




Description From 2006-10-05 09:29:25
+++ This bug was initially created as a clone of Bug #2700 +++

I'm reopening this bug because I'm running into a very similar problem, which
makes me believe it was never fully fixed.  We have been using count=25 and
submitting to Condor, exactly as the original poster was.  These are submitted
as batch jobs, however, whose status we check approximately every five minutes.
These jobs typically run in Condor for several hours before finishing.
However, they -do- finish, but that never seems to be detected by GRAM.  The
status typically goes from Pending -> Active okay, but, as I said, never from
Active -> anything else.  Upon close inspection of the state files on disk,
it's evident that -some- of the sub-jobs get marked as Done, but not all of
them.  Another reason I'm reopening this bug is that I'm getting the same
warning message as the original poster:

2006-10-05 08:30:09,503 WARN  exec.JobManagerScript [Thread-316,run:269] 
java.lang.NumberFormatException: For input string: ""
        at java.lang.Integer.parseInt(Integer.java:468)
        at java.lang.Integer.parseInt(Integer.java:497)
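For context, here is a minimal sketch (not the actual GRAM code, and the method name is hypothetical) of how an empty exit-status field handed to `Integer.parseInt` produces exactly this `For input string: ""` exception, plus a defensive parse that would avoid it:

```java
// Hypothetical illustration: Integer.parseInt("") throws
// NumberFormatException with the message 'For input string: ""',
// which matches the warning above. If a job-manager script hands back
// an empty exit-code field, an unguarded parse fails exactly this way.
public class ExitCodeParse {

    /** Defensive parse: fall back to a default when the field is blank. */
    static int parseExitCode(String field, int fallback) {
        if (field == null || field.trim().isEmpty()) {
            return fallback;
        }
        return Integer.parseInt(field.trim());
    }

    public static void main(String[] args) {
        System.out.println(parseExitCode("0", -1));  // normal field parses fine
        System.out.println(parseExitCode("", -1));   // blank field: fallback, no exception
    }
}
```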

I've just turned on GRAM debug and will attach that later on after this batch
of jobs finishes.  By the way, I'm using GT 4.0.1, which of course makes me
think I should have the fix.

Thanks for your assistance!


Adam Bazinet

+++ This bug was initially created as a clone of Bug #2700 +++

Gaurang noticed the following w/ a Condor factory:

"Whenever I try to modify the test.xml in share/gram_client/ and change the
count to something around 20 or more, then if I submit the run using either
manager-job or globusrun-ws, the clients start hanging and lose connection
with the jobs. Even though the jobs appear in the remote [Condor] queue and
run [fine], some interaction between the submit client and the remote side
[appears to break]."

It was verified with both Java and C GRAM clients that the
client side does not receive *any* state change event. The Java client,
which eventually polls the job for status, gets the job status back
as "Unsubmitted".

This indicates a server-side issue: perhaps a problem in the job's state
machine, possibly caused by faulty SEG events.

The problem does *not* appear when submitting to Fork,
only when submitting to Condor. Processes spawned when
count > 1 are executed by Condor on several hosts.

Gaurang noticed the issue most times with count = 50,
and every time with count = 100.
He did not manage to determine a consistent threshold,
but starting at count = 20 the problem appears about 50% of the time.

One WARN log message on the server-side (I'm not sure this is 
significant though):

2005-02-04 14:55:56,700 WARN  exec.JobManagerScript [Thread-87,run:258] 
java.lang.NumberFormatException: For input string: ""
        at java.lang.Integer.parseInt(Integer.java:489)
        at java.lang.Integer.parseInt(Integer.java:518)

We executed with .exec. DEBUG turned on. The log will follow in the next post.
------- Comment #1 From 2006-10-05 14:40:41 -------
I have been monitoring my batch of 25 jobs.  Of the jobs that have finished so
far, two of them have not had their state updated appropriately (#7 and #10),
which is really bizarre!  Here's why (using job #7 of 25 as an example):

1) The event is logged in the Condor XML log file: 

    <a n="MyType"><s>JobTerminatedEvent</s></a>
    <a n="EventTypeNumber"><i>5</i></a>
    <a n="EventTime"><s>2006-10-05T13:53:16</s></a>
    <a n="Cluster"><i>7249</i></a>
    <a n="Proc"><i>7</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="TerminatedNormally"><b v="t"/></a>
    <a n="ReturnValue"><i>0</i></a>
    <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="TotalLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="TotalRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a>
    <a n="SentBytes"><r>1.016383000000000E+06</r></a>
    <a n="ReceivedBytes"><r>1.257883000000000E+06</r></a>
    <a n="TotalSentBytes"><r>1.016383000000000E+06</r></a>
    <a n="TotalReceivedBytes"><r>2.515766000000000E+06</r></a>
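As an aside, the fields above can be pulled out with nothing but the standard JDK XML parser. The sketch below is illustrative only (it is not how GRAM's log reader is implemented); the `<c>` wrapper is added by hand here so the fragment parses as a single document, and the `field` helper is a hypothetical name:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CondorEventPeek {

    /** Return the text value of the <a> element whose n= attribute matches. */
    static String field(String classadXml, String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(classadXml)));
        NodeList as = doc.getElementsByTagName("a");
        for (int i = 0; i < as.getLength(); i++) {
            Element a = (Element) as.item(i);
            if (name.equals(a.getAttribute("n"))) {
                return a.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Abbreviated version of the JobTerminatedEvent above, wrapped in a
        // <c> root so it parses as one well-formed document.
        String event =
            "<c>"
          + "<a n=\"MyType\"><s>JobTerminatedEvent</s></a>"
          + "<a n=\"EventTypeNumber\"><i>5</i></a>"
          + "<a n=\"Cluster\"><i>7249</i></a>"
          + "<a n=\"Proc\"><i>7</i></a>"
          + "<a n=\"ReturnValue\"><i>0</i></a>"
          + "</c>";
        System.out.println(field(event, "MyType"));          // JobTerminatedEvent
        System.out.println(field(event, "EventTypeNumber")); // 5
        System.out.println(field(event, "ReturnValue"));     // 0
    }
}
```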

2) I have GRAM debug turned on, and it also says it was notified of the 'Done' state:

2006-10-05 13:53:16,727 DEBUG monitoring.SchedulerEventGenerator
[Thread-7,run:132] seg input line: 001;1160070796;7249.007.000;8;0
2006-10-05 13:53:16,727 DEBUG monitoring.JobStateMonitor
[Thread-7,addEvent:467]  JSM receiving scheduler event 7249.007.000 [Thu Oct 05
13:53:16 EDT 2006] Done
2006-10-05 13:53:16,728 DEBUG monitoring.JobStateMonitor
[Thread-7,addEvent:474] Dispatching event 7249.007.000 to job
2006-10-05 13:53:16,728 DEBUG monitoring.JobStateMonitor
[Thread-7,dispatchEvent:495] Entering dispatchEvent()
2006-10-05 13:53:16,728 DEBUG monitoring.JobStateMonitor
[Thread-7,dispatchEvent:498] dispatching 7249.007.000 [Thu Oct 05 13:53:16 EDT
2006] Done
2006-10-05 13:53:16,728 DEBUG exec.ManagedExecutableJobHome
[Thread-7,jobStateChanged:445] Receiving jobStateChange event for resource key
timestamp    Thu Oct 05 13:53:16 EDT 2006
(new) state    Done
exitCode    0
2006-10-05 13:53:16,729 DEBUG exec.StateMachine
[resourceKey:5e1bdff0-547a-11db-98df-f477513bd2c7] 7249.007.000 Done
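The SEG input line format can be read off this debug output: `001;1160070796;7249.007.000;8;0` is dispatched as job 7249.007.000 in state Done with exit code 0, and 1160070796 is the Unix timestamp for Thu Oct 05 13:53:16 EDT 2006. Here is a sketch of parsing such a line; the field meanings are inferred from this one log excerpt, not from the SEG source, so treat the layout as an assumption:

```java
public class SegLine {

    final String jobId;
    final long epochSeconds;  // 1160070796 matches the Oct 05 13:53:16 EDT timestamp above
    final int stateCode;      // 8 lines up with "Done" in the JSM log (inferred, not documented here)
    final int exitCode;

    SegLine(String line) {
        // Layout observed above (assumed, not verified against the SEG source):
        //   <prefix>;<epoch-seconds>;<cluster.proc.subproc>;<state-code>;<exit-code>
        String[] f = line.split(";");
        this.epochSeconds = Long.parseLong(f[1]);
        this.jobId = f[2];
        this.stateCode = Integer.parseInt(f[3]);
        this.exitCode = Integer.parseInt(f[4]);
    }

    public static void main(String[] args) {
        SegLine e = new SegLine("001;1160070796;7249.007.000;8;0");
        System.out.println(e.jobId + " state=" + e.stateCode + " exit=" + e.exitCode);
    }
}
```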

That should be pretty clear, right?  But in the
ManagedExecutableJobResourceStateType folder under
~/.globus/persisted/, for this job batch, I currently have
the following listing, in which it is clear that job #7 (the eighth from the
top) never got marked as Done.

<ns25:localJobState xsi:type="ns36:StateEnumeration"
  <ns25:localJobState xsi:type="ns37:StateEnumeration"
  <ns25:localJobState xsi:type="ns38:StateEnumeration"
  <ns25:localJobState xsi:type="ns39:StateEnumeration"
  <ns25:localJobState xsi:type="ns40:StateEnumeration"
  <ns25:localJobState xsi:type="ns41:StateEnumeration"
  <ns25:localJobState xsi:type="ns42:StateEnumeration"
  <ns25:localJobState xsi:type="ns43:StateEnumeration"
  <ns25:localJobState xsi:type="ns44:StateEnumeration"
  <ns25:localJobState xsi:type="ns45:StateEnumeration"
  <ns25:localJobState xsi:type="ns46:StateEnumeration"
  <ns25:localJobState xsi:type="ns47:StateEnumeration"
  <ns25:localJobState xsi:type="ns48:StateEnumeration"
  <ns25:localJobState xsi:type="ns49:StateEnumeration"
  <ns25:localJobState xsi:type="ns50:StateEnumeration"
  <ns25:localJobState xsi:type="ns51:StateEnumeration"
  <ns25:localJobState xsi:type="ns52:StateEnumeration"
  <ns25:localJobState xsi:type="ns53:StateEnumeration"
  <ns25:localJobState xsi:type="ns54:StateEnumeration"
  <ns25:localJobState xsi:type="ns55:StateEnumeration"
  <ns25:localJobState xsi:type="ns56:StateEnumeration"
  <ns25:localJobState xsi:type="ns57:StateEnumeration"
  <ns25:localJobState xsi:type="ns58:StateEnumeration"
  <ns25:localJobState xsi:type="ns59:StateEnumeration"
  <ns25:localJobState xsi:type="ns60:StateEnumeration"
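Since these persisted state files are what reveal the stuck sub-jobs, here is a hedged sketch of scanning a persistence directory for entries whose localJobState never reached Done. The class name and the assumption that the state value appears as element text are both hypothetical (the listing above is truncated, so the exact serialization is unverified):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class StuckJobScan {

    /** Count persisted resource files that do not record a Done state.
     *  Assumes (unverified) the state is serialized as element text. */
    static int countNotDone(Path dir) throws IOException {
        int notDone = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path p : files) {
                String xml = new String(Files.readAllBytes(p));
                // Crude check: a localJobState element whose value is not Done.
                if (xml.contains("localJobState") && !xml.contains(">Done<")) {
                    notDone++;
                }
            }
        }
        return notDone;
    }

    public static void main(String[] args) throws IOException {
        // Demo with two fabricated files: one stuck in Active, one Done.
        Path dir = Files.createTempDirectory("persisted-demo");
        Files.write(dir.resolve("job7.xml"),
            "<ns25:localJobState>Active</ns25:localJobState>".getBytes());
        Files.write(dir.resolve("job8.xml"),
            "<ns25:localJobState>Done</ns25:localJobState>".getBytes());
        System.out.println(countNotDone(dir));  // 1
    }
}
```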
------- Comment #2 From 2006-10-05 14:49:42 -------
The latest GT release is 4.0.3. I'd suggest trying that first.
------- Comment #3 From 2006-10-11 19:00:58 -------
(In reply to comment #2)
> The latest GT release is 4.0.3. I'd suggest trying that first.

I have recently conducted exhaustive tests of both 4.0.2 and 4.0.3.  The
problem, however, persists unchanged.  Not only do sub-jobs sometimes not go
from Active to Done, they occasionally don't even go from Pending to Active. 
This is hugely disappointing for us at this juncture... is there anything else
we can try?  We really appreciate your input.



------- Comment #4 From 2006-10-20 09:59:57 -------

Thanks for this detailed bug report!  I am marking this as a top priority to be
looked at; however, this is a busy time with Supercomputing coming soon.  As
developers' time frees up, this will be looked at, and hopefully we will be able
to provide a patch for you.

------- Comment #5 From 2006-10-20 10:15:57 -------
Thanks a bunch!
------- Comment #6 From 2007-03-05 12:47:13 -------
If I remember right, you didn't run into these problems with 4.0.4 so far.
What's the status here?
I just tried it on my laptop: submitted jobs to my Condor installation
(just one node) with count elements of 50 and 100, both without problems.
You always submitted to a pool consisting of several nodes and never to a
one-node pool, right? Unfortunately, the Condor pool I normally use isn't
available at the moment.
------- Comment #7 From 2007-03-05 13:19:58 -------
That's right. I did two test batches of count 10, of somewhat long-running jobs
(a few hours), to a multi-node pool, if you want to call it that, and had no
problems.  We're not making much use of this feature in our production grid
right now, but I'm pretty sure I could reproduce this problem in 4.0.3 if I had
to, since I had tested that extensively before.  I'm just hoping it has gone
away.  You can do whatever you'd like with this bug at this point, but thanks
for looking into it with me.
------- Comment #8 From 2007-03-05 15:54:04 -------
I'll close it for now, but feel free to reopen it if needed.