Bugzilla – Bug 4749
client receives no state notifications if RSL <gram:count> is "big"
Last modified: 2007-03-05 15:54:22
You need to log in before you can comment on or make changes to this bug.
+++ This bug was initially created as a clone of Bug #2700 +++ I'm reopening this bug because I'm running into a very similar problem which makes me believe it was never fully fixed. We have been using count=25 and submitting to Condor, exactly as the original poster was. These are submitted as batch jobs, however, whose status we check approximately every five minutes. These jobs typically run in Condor for several hours before finishing. However, they -do- finish, but that never seems to be detected by GRAM. The status typically goes from Pending -> Active okay, but as I said, never from Active -> anything else. Upon close inspection of the state files on disk, it's evident that -some- of the sub-jobs get marked as Done, but not all of them. Another reason I'm reopening this bug is because I'm getting the same warning message as the original poster: 2006-10-05 08:30:09,503 WARN exec.JobManagerScript [Thread-316,run:269] java.lang.NumberFormatException: For input string: "" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) at java.lang.Integer.parseInt(Integer.java:468) at java.lang.Integer.parseInt(Integer.java:497) at org.globus.exec.service.exec.JobManagerScript.run(JobManagerScript.java:212) I've just turned on GRAM debug and will attach that later on after this batch of jobs finishes. By the way, I'm using GT 4.0.1, which of course makes me think I should have the fix. Thanks for your assistance! Sincerely, Adam Bazinet +++ This bug was initially created as a clone of Bug #2700 +++ Gaurang noticed the following w/ a Condor factory: "Whenever i try to modify the test.xml in share/gram_client/ and change the count to something around 20 or more then if i submit the run using both manager-job or globusrun-ws the clients start hanging and lose connection with the jobs. Even though the jobs appear in the remote [Condor] queue and run [fine], some interaction between the submit client and the remote side breaks." The problem was verified with both Java and C GRAM clients that the client side does not receive *any* state change event. In the Java client, which polls the job for status eventually, the job status comes back as "Unsubmitted". This indicates a server-side issue. Maybe a problem in the state machine of the job, might be due to faulty SEG events? The problem does *not* appear when submitting to Fork, but when submitting to Condor. Processes spawn by the count > 1 are executed by Condor on several hosts. Gaurang noticed the issue most times with count = 50, and every time with count = 100. He did not managed to determine a consistent threshold, but starting at count = 20 the problem appear 50% of the time. One WARN log message on the server-side (I'm not sure this is significant though): 2005-02-04 14:55:56,700 WARN exec.JobManagerScript [Thread-87,run:258] java.lang.NumberFormatException: For input string: "" at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) at java.lang.Integer.parseInt(Integer.java:489) at java.lang.Integer.parseInt(Integer.java:518) at org.globus.exec.service.exec.JobManagerScript.run(JobManagerScript.java:202) We excuted with .exec. DEBUG turned on. Log will follow in next post.
I have been monitoring my batch of 25 jobs. Of the jobs that have finished so far, two of them have not had their state updated appropriately (#7 and #10), which is really bizarre! Here's why (using job #7 of 25 as an example) 1) The event is logged in the Condor XML log file: <c> <a n="MyType"><s>JobTerminatedEvent</s></a> <a n="EventTypeNumber"><i>5</i></a> <a n="EventTime"><s>2006-10-05T13:53:16</s></a> <a n="Cluster"><i>7249</i></a> <a n="Proc"><i>7</i></a> <a n="Subproc"><i>0</i></a> <a n="TerminatedNormally"><b v="t"/></a> <a n="ReturnValue"><i>0</i></a> <a n="RunLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a> <a n="RunRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a> <a n="TotalLocalUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a> <a n="TotalRemoteUsage"><s>Usr 0 00:00:00, Sys 0 00:00:00</s></a> <a n="SentBytes"><r>1.016383000000000E+06</r></a> <a n="ReceivedBytes"><r>1.257883000000000E+06</r></a> <a n="TotalSentBytes"><r>1.016383000000000E+06</r></a> <a n="TotalReceivedBytes"><r>2.515766000000000E+06</r></a> </c> 2) I have GRAM debug turned on, and it also says it was notified of the 'Done' event: 2006-10-05 13:53:16,727 DEBUG monitoring.SchedulerEventGenerator [Thread-7,run:132] seg input line: 001;1160070796;7249.007.000;8;0 2006-10-05 13:53:16,727 DEBUG monitoring.JobStateMonitor [Thread-7,addEvent:467] JSM receiving scheduler event 7249.007.000 [Thu Oct 05 13:53:16 EDT 2006] Done 2006-10-05 13:53:16,728 DEBUG monitoring.JobStateMonitor [Thread-7,addEvent:474] Dispatching event 7249.007.000 to job 5e1bdff0-547a-11db-98df-f477513bd2c7 2006-10-05 13:53:16,728 DEBUG monitoring.JobStateMonitor [Thread-7,dispatchEvent:495] Entering dispatchEvent() 2006-10-05 13:53:16,728 DEBUG monitoring.JobStateMonitor [Thread-7,dispatchEvent:498] dispatching 7249.007.000 [Thu Oct 05 13:53:16 EDT 2006] Done 2006-10-05 13:53:16,728 DEBUG exec.ManagedExecutableJobHome [Thread-7,jobStateChanged:445] Receiving jobStateChange event for resource key {http://www.globus.org/namespaces/2004/10/gram/job}ResourceID=5e1bdff0-547a-11db-98df-f477513bd2c7 with: timestamp Thu Oct 05 13:53:16 EDT 2006 (new) state Done exitCode 0 2006-10-05 13:53:16,729 DEBUG exec.StateMachine [Thread-7,insertNotifiedJobData:2340] [resourceKey:5e1bdff0-547a-11db-98df-f477513bd2c7] 7249.007.000 Done {http://www.globus.org/namespaces/2004/10/gram/job}ResourceID=5e1bdff0-547a-11db-98df-f477513bd2c7 That should be pretty clear, right? But in the ManagedExecutableJobResourceStateType folder under ~/.globus/persisted/128.8.141.71-8443/, for this job batch, I currently have this listing, in which it is clear that job #7 (which is the eighth from the top) never got marked as Done. <ns25:localJobState xsi:type="ns36:StateEnumeration" xmlns:ns36="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState> <ns25:localJobState xsi:type="ns37:StateEnumeration" xmlns:ns37="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState> <ns25:localJobState xsi:type="ns38:StateEnumeration" xmlns:ns38="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState> <ns25:localJobState xsi:type="ns39:StateEnumeration" xmlns:ns39="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns40:StateEnumeration" xmlns:ns40="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState> <ns25:localJobState xsi:type="ns41:StateEnumeration" xmlns:ns41="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns42:StateEnumeration" xmlns:ns42="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState> <ns25:localJobState xsi:type="ns43:StateEnumeration" xmlns:ns43="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState> <ns25:localJobState xsi:type="ns44:StateEnumeration" xmlns:ns44="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns45:StateEnumeration" xmlns:ns45="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns46:StateEnumeration" xmlns:ns46="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState> <ns25:localJobState xsi:type="ns47:StateEnumeration" xmlns:ns47="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns48:StateEnumeration" xmlns:ns48="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns49:StateEnumeration" xmlns:ns49="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns50:StateEnumeration" xmlns:ns50="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns51:StateEnumeration" xmlns:ns51="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns52:StateEnumeration" xmlns:ns52="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns53:StateEnumeration" xmlns:ns53="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns54:StateEnumeration" xmlns:ns54="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns55:StateEnumeration" xmlns:ns55="http://www.globus.org/namespaces/2004/10/gram/job/types">Done</ns25:localJobState> <ns25:localJobState xsi:type="ns56:StateEnumeration" xmlns:ns56="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState> <ns25:localJobState xsi:type="ns57:StateEnumeration" xmlns:ns57="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState> <ns25:localJobState xsi:type="ns58:StateEnumeration" xmlns:ns58="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState> <ns25:localJobState xsi:type="ns59:StateEnumeration" xmlns:ns59="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState> <ns25:localJobState xsi:type="ns60:StateEnumeration" xmlns:ns60="http://www.globus.org/namespaces/2004/10/gram/job/types">Active</ns25:localJobState>
The latest GT release is 4.0.3. I'd suggest trying that first.
(In reply to comment #2) > The latest GT release is 4.0.3. I'd suggest trying that first. > I have recently conducted exhaustive tests of both 4.0.2 and 4.0.3. The problem, however, persists unchanged. Not only do sub-jobs sometimes not go from Active to Done, they occasionally don't even go from Pending to Active. This is hugely disappointing for us at this juncture... is there anything else we can try? We really appreciate your input. Thanks, Adam Sinc
Adam, Thanks for this detailed bug report! I am marking this as a top priority to be looked at, however this is a busy time with supercomputing coming soon. As developers time frees up this will be looked at and hopefully we will be able to provide a patch for you. -Stu
Thanks a bunch!
Adam, if i remember right, then you didn't run into these problems with 4.0.4 so far, right? What's the status here? I just tried it on my laptop: submitted jobs to my Condor installation (just one node) with count elements of 50 and 100, both without problems. You always submitted to a pool consisting of several nodes and never to a one-node pool, right? Unfortunately the Condor pool i normally use isn't available at the moment.
That's right, I did two test batches of count 10, of somewhat long running jobs (a few hours), to a multi-node pool if you want to call it that, and had no problems. We're not making use of this feature much in our production grid right now, but I'm pretty sure I could reproduce this problem in 4.0.3 if I had to, since I had tested that extensively before. I'm just hoping it has gone away. You can do whatever you'd like with this bug, at this point, but thanks for looking into it with me.
Adam, i'll close it for now, but feel free to open it if needed.