Bug 2582 - CAMPAIGN: WS GRAM Study #5 (container/service stability in job submissions)
Status: RESOLVED FIXED
Product: GRAM
Component: wsrf managed execution job service
Version: 3.9.5
Platform: PC Linux
Importance: P3 normal
Target Milestone: 4.0
Reported: 2005-01-19 16:01
Modified: 2005-04-06 17:45


Attachments
excerpts of server-side log from 5:36 Friday 14 Jan study #5 test (21.82 KB, text/plain)
2005-01-19 16:28, Alain Andrieux
client job submission program (1.85 KB, text/plain)
2005-04-06 17:37, Stuart Martin
The tail end of the log file from the long-run-test.pl (3.21 KB, text/plain)
2005-04-06 17:38, Stuart Martin
pmap output at the beginning of the run (13.59 KB, text/plain)
2005-04-06 17:39, Stuart Martin
pmap output at the end of the 2nd run (74.54 KB, text/plain)
2005-04-06 17:41, Stuart Martin
top output before the start of the job submissions (530 bytes, text/plain)
2005-04-06 17:42, Stuart Martin
top output after all the job submissions (531 bytes, text/plain)
2005-04-06 17:42, Stuart Martin



Description From 2005-01-19 16:01:16
Projects:
 WS GRAM

Technologies:
 Globus Resource Allocation Manager (GRAM)

Definition:

Failures of job submissions, followed by a crash or unresponsiveness
of the GRAM services, have been noticed (see bug #2528 at
http://bugzilla.globus.org/globus/show_bug.cgi?id=2528)

It therefore seems in order to carry out GRAM Study #5, outlined in
http://www-unix.globus.org/toolkit/docs/development/4.0-drafts/perf_overview.html
to determine the average life expectancy of a running GRAM server
processing a steady stream of submissions of identical job descriptions.
However, it is not guaranteed that GRAM will consistently fail after roughly
the same duration, so the results should be taken with a grain of salt
unless strong consistency is established (which would require many long test runs).

Deliverables:
  1. RSL job description file for job type 2 (JT2) as referred to in study #5
     (writing to stdout/err, stagein, stageout)
  2. Script to automate study #5: submit a stream of jobs continuously
     as needed so that a load of 10 jobs is maintained in the GRAM queue.
     Job description is always as defined in deliverable 1.
  3. Run the script several times to obtain a series of results.
     Each run should last a long time (max one month), until the container/GRAM
     services no longer accept job submissions and process jobs to success.

Tasks:

  1. Define a job description file specifying the job to submit in study #5. 
  2. Write a script to automate study #5 and maintain a given job load (10 jobs)    
     for a very long time (one month). Possibility: reuse Throughput Tester 
     (a Java client) as it provides load maintenance functions. 
     Modify Throughput Tester if needed.
  3. Run study #5 repeatedly and provide results in terms of GRAM service 
     stability in the face of study #5 conditions.
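
The load-maintenance idea in task 2 can be sketched as follows. This is only an illustration in Python, not the Throughput Tester itself: `submit_job` is a hypothetical callable that submits one job and blocks until the job reaches a final state, and `maintain_load` and its parameters are assumed names. The defaults mirror the load of 10 jobs and the one-month duration (2592000 s) described above.

```python
import concurrent.futures
import time


def maintain_load(submit_job, load=10, duration_s=2592000.0, clock=time.monotonic):
    """Keep `load` jobs in flight until `duration_s` seconds have elapsed.

    `submit_job` is a hypothetical callable: it submits one job, blocks until
    the job reaches a final state, and returns True on success.
    Returns (number of successful jobs, number of failed jobs).
    """
    succeeded = 0
    failed = 0
    deadline = clock() + duration_s
    with concurrent.futures.ThreadPoolExecutor(max_workers=load) as pool:
        # Start with a full load.
        pending = {pool.submit(submit_job) for _ in range(load)}
        while pending:
            done, pending = concurrent.futures.wait(
                pending, return_when=concurrent.futures.FIRST_COMPLETED)
            for finished in done:
                try:
                    ok = bool(finished.result())
                except Exception:
                    # A submission that raises counts as a failed job.
                    ok = False
                if ok:
                    succeeded += 1
                else:
                    failed += 1
                # Replace every finished job so the load stays constant,
                # until the test duration is over.
                if clock() < deadline:
                    pending.add(pool.submit(submit_job))
    return succeeded, failed
```

Note that a design like this stalls if every in-flight job blocks forever, so a real client would probably also want a per-job timeout.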
 

Time Estimate:
  Until 3.9.4 release? (long running tests)
------- Comment #1 From 2005-01-19 16:28:19 -------
Created an attachment (id=486) [details]
excerpts of server-side log from 5:36 Friday 14 Jan study #5 test
------- Comment #2 From 2005-01-19 16:34:39 -------
BATCH OF STUDY #5 TEST RESULTS:
-------------------------------
Host:                      dc-user2.isi.edu
Port:                      9000
Test start date:           Jan 14 17:36 PST 2005
Last client side output:   Jan 14 20:31 PST 2005
Duration of test:          2 hours 55 minutes
Last successful job at:    Fri Jan 14 19:04:02 PST 2005 (/bin/date in stdout)
Number of successful jobs: 136

AFTER CRASH, MANUAL SUBMISSION OF A SIMPLE JOB:
----------------------------------------------
Used m-j-g (Java client) and globus-ws to submit a /bin/true job.

On client:
- No notifications seemed to be delivered to the client.
- The client hung waiting for notifications forever (Java) or timed out (C).
- The Java client would pull the state of the job and obtain "Unsubmitted".
  
On server: lots of identical authorization-related log messages
(getMultipleResourceProperties). Client pulls again, still Unsubmitted.

All submitted jobs are now stuck in Unsubmitted mode.

INTERPRETATION:
---------------
This is probably what stopped the load maintenance/study #5
test: 10 jobs got stuck, so no more jobs were submitted by the client.

Maybe the pace of submission in the test is too high for GRAM to keep up
with without eventually getting stuck?
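
One way to keep a test client from hanging forever on missing notifications is to pull the job state with a deadline. The sketch below is an assumption about how that could look, not how m-j-g or the Throughput Tester behave: `get_state` is a hypothetical callable returning the job state as a string, and the set of final states ("Done", "Failed") is assumed for illustration.

```python
import time

# Assumed set of final job states, for illustration only.
FINAL_STATES = {"Done", "Failed"}


def wait_for_final_state(get_state, timeout_s, poll_interval_s=5.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Pull the job state until it is final or `timeout_s` seconds pass.

    `get_state` is a hypothetical callable returning the current state name.
    Returns (last observed state, True if a final state was reached).
    """
    deadline = clock() + timeout_s
    state = get_state()
    while state not in FINAL_STATES:
        if clock() >= deadline:
            # Give up: the job is stuck (for example in "Unsubmitted").
            return state, False
        sleep(poll_interval_s)
        state = get_state()
    return state, True
```

A load-maintaining client could treat a timed-out job as failed and free its slot, so that 10 stuck jobs would not silently end the test.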

------- Comment #3 From 2005-01-20 15:19:05 -------
BATCH OF STUDY #5 TEST RESULTS:
-------------------------------
GRAM server:               dc-user2.isi.edu:8888
GridFTP server:            dc-user2:9001
postgreSQL server:         tubby.isi.edu

Test start date:           Wed Jan 19 21:13 PST 2005
Last server output:        Thu Jan 20 04:49 PST 2005
Duration of test:          7 hours 36 minutes

Last successful
   stagein:                Thu Jan 20 04:17 (ls -l ~/123/my_date)
   executable:             Thu Jan 20 04:18:23 PST 2005 (tail -n 1 ~/123/stdout)
   stageout:               Thu Jan 20 04:18 (ls -l /tmp/stdout.study5)

Number of 
   successful jobs:        1423 (wc ~/123/stdout)
   failed jobs:            953 (due to stagein failure, see below)
              
Note: before the test was started, the successful submission of a 
      job with the study #5 JT2 RSL was verified. The container 
      was stopped and restarted afresh.              

ISSUES
------

a) shortly after the test started, errors appeared on the client side:
   - job failure because:
       - StageIn failing because 
          - RFT resource not being created because of 
             - a NoSuchResourceException 
   See the client log excerpt for more details. Those errors appeared 
   throughout the duration of the whole test (that is, until the 
   crash of the container)
   
b) After the last successful submission the container crashed
   because of an OutOfMemoryError
   
CLIENT log (excerpt, default log properties)
--------------------------------------------
WS GRAM Study #5 with load = 10 and duration = 2592000 s
2005-01-19 21:39:35,352 ERROR throughput.ClientThread [Thread-7,stateChanged:260] 
a job failed with handle
https://128.9.64.179:8888/wsrf/services/ManagedExecutableJobService?9dad51f0-6aa5-11d9-978a-cc2162090ce2:
fault type: org.globus.exec.generated.StagingFaultType:
attribute: fileStageIn
command: StageIn
description:
Staging error for RSL element fileStageIn.
faultReason:.
faultString:.
gt2ErrorCode: 0
originator: Address:
https://128.9.64.179:8888/wsrf/services/ManagedJobFactoryService
Reference property[0]:
<ns1:ResourceID
xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job">9dad51f0-6aa5-11d9-978a-cc2162090ce2</ns1:ResourceID>
                                                                               
stackTrace:
org.globus.exec.generated.StagingFaultType: Staging error for RSL element
fileStageIn.
Timestamp: Wed Jan 19 21:39:24 PST 2005
Originator: Address:
https://128.9.64.179:8888/wsrf/services/ManagedJobFactoryService
Reference property[0]:
<ns1:ResourceID
xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job">9dad51f0-6aa5-11d9-978a-cc2162090ce2</ns1:ResourceID>
                                                                               
Caused by: org.oasis.wsrf.faults.BaseFaultType: AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
 faultSubcode:.
 faultString: java.rmi.RemoteException: Unable to create RFT resource; nested
exception is:.
>...org.globus.transfer.reliable.service.exception.RftException: Error
processing delegated credentialError getting delegation resource [Caused by:
org.globus.wsrf.NoSuchResourceException] [Caused by: Error getting delegation
resource [Caused by: org.globus.wsrf.NoSuchResourceException]]
 faultActor:.
 faultNode:.
 faultDetail:.
>...{http://xml.apache.org/axis/}stackTrace:java.rmi.RemoteException: Unable to
create RFT resource; nested exception is:.
 faultDetail:.
>...{http://xml.apache.org/axis/}stackTrace:java.rmi.RemoteException: Unable to
create RFT resource; nested exception is:.
>...org.globus.transfer.reliable.service.exception.RftException: Error
processing delegated credentialError getting delegation resource [Caused by:
org.globus.wsrf.NoSuchResourceException] [Caused by: Error getting delegation
resource [Caused by: org.globus.wsrf.NoSuchResourceException]]
>...at
org.globus.transfer.reliable.service.factory.ReliableFileTransferFactoryService.createReliableFileTransfer(ReliableFileTransferFactoryService.java:185)>...at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[...]
Caused by: org.globus.transfer.reliable.service.exception.RftException: Error
processing delegated credentialError getting delegation resource [Caused by:
org.globus.wsrf.NoSuchResourceException] [Caused by: Error getting delegation
resource [Caused by: org.globus.wsrf.NoSuchResourceException]]
>...at
org.globus.transfer.reliable.service.ReliableFileTransferResource.processDelegatedCredential(ReliableFileTransferResource.java:365)
>...at
org.globus.transfer.reliable.service.ReliableFileTransferResource.<init>(ReliableFileTransferResource.java:222)
>...at
org.globus.transfer.reliable.service.ReliableFileTransferHome.create(ReliableFileTransferHome.java:108)
>...at
org.globus.transfer.reliable.service.factory.ReliableFileTransferFactoryService.createReliableFileTransfer(ReliableFileTransferFactoryService.java:180)>......
23 more
                                                                               
>...{http://xml.apache.org/axis/}hostname:dc-user2.isi.edu
                                                                               
java.rmi.RemoteException: Unable to create RFT resource; nested exception is:.
>...org.globus.transfer.reliable.service.exception.RftException: Error
processing delegated credentialError getting delegation resource [Caused by:
org.globus.wsrf.NoSuchResourceException] [Caused by: Error getting delegation
resource [Caused by: org.globus.wsrf.NoSuchResourceException]]
>...at
org.apache.axis.message.SOAPFaultBuilder.createFault(SOAPFaultBuilder.java:221)
[...]
>...at org.apache.axis.client.Call.invoke(Call.java:1765)
>...at
org.globus.rft.generated.bindings.ReliableFileTransferFactoryPortTypeSOAPBindingStub.createReliableFileTransfer(ReliableFileTransferFactoryPortTypeSOAPBindingStub.java:874)
>...at
org.globus.exec.service.exec.StateMachine.submitStagingRequest(StateMachine.java:1910)
>...at
org.globus.exec.service.exec.StateMachine.processStageInState(StateMachine.java:571)
>...at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>...at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>...at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>...at java.lang.reflect.Method.invoke(Method.java:324)
>...at org.globus.exec.service.exec.StateMachine.processState(StateMachine.java:258)
>...at org.globus.exec.service.exec.RunQueue.run(RunQueue.java:93)
Timestamp: Wed Jan 19 21:39:24 PST 2005
[...]
>...at
org.globus.wsrf.encoding.ObjectDeserializer.toObject(ObjectDeserializer.java:56)
>...at org.globus.exec.client.GramJob.deliver(GramJob.java:1440)
>...at
org.globus.wsrf.impl.notification.NotificationConsumerProvider.notify(NotificationConsumerProvider.java:106)

[...]
>...at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:265)
                                                                               
stateWhenFailureOccurred: StageIn
timestamp:
java.util.GregorianCalendar[time=1106199564817,areFieldsSet=true,areAllFieldsSet=true,lenient=true,zone=sun.util.calendar.ZoneInfo[id="GMT",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=1,YEAR=2005,MONTH=0,WEEK_OF_YEAR=4,WEEK_OF_MONTH=4,DAY_OF_MONTH=20,DAY_OF_YEAR=20,DAY_OF_WEEK=5,DAY_OF_WEEK_IN_MONTH=3,AM_PM=0,HOUR=5,HOUR_OF_DAY=5,MINUTE=39,SECOND=24,MILLISECOND=817,ZONE_OFFSET=0,DST_OFFSET=0]
Message:
null
2005-01-19 21:39:40,678 WARN  client.GramJob
[Thread-7,destroyDelegatedCredential:1279] Unable to destroy resource
AxisFault
 faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.generalException
 faultSubcode:.
 faultString:.
 faultActor:.
 faultNode:.
 faultDetail:.
>...{http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceLifetime-1.2-draft-01.xsd}ResourceUnknownFault:<ns2:Timestamp
xmlns:ns2="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd">2005-01-20T05:39:40.390Z</ns2:Timestamp><ns3:Originator
xmlns:ns3="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd"><ns2:Address
xmlns:ns2="http://schemas.xmlsoap.org/ws/2004/03/addressing">https://128.9.64.179:8888/wsrf/services/DelegationService</ns2:Address><ns4:ReferenceProperties
xmlns:ns4="http://schemas.xmlsoap.org/ws/2004/03/addressing"><ns1:DelegationKey
soapenv:mustUnderstand="0"
xmlns:ns1="http://www.globus.org/08/2004/delegationService"
xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">a55a0ab0-6aa5-11d9-978a-cc2162090ce2</ns1:DelegationKey></ns4:ReferenceProperties><ns5:ReferenceParameters
xmlns:ns5="http://schemas.xmlsoap.org/ws/2004/03/addressing"/></ns3:Originator><ns4:Description
xmlns:ns4="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd">Failed
to remove resource</ns4:Description><ns5:FaultCause
xmlns:ns5="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd"><ns5:Timestamp>2005-01-20T05:39:40.390Z</ns5:Timestamp><ns5:ErrorCode
dialect="http://www.globus.org/fault/stacktrace">org.oasis.wsrf.lifetime.ResourceUnknownFaultType</ns5:ErrorCode><ns5:Description>
>...at
org.globus.wsrf.impl.lifetime.DestroyProvider.destroy(DestroyProvider.java:39)

[...]
>...at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:268)
</ns5:Description></ns5:FaultCause><ns6:FaultCause
xmlns:ns6="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd"><ns6:Timestamp>2005-01-20T05:39:40.393Z</ns6:Timestamp><ns6:ErrorCode
dialect="http://www.globus.org/fault/exception"/><ns6:Description>org.globus.wsrf.NoSuchResourceException
>...at
org.globus.delegation.service.DelegationResource.load(DelegationResource.java:404)
>...at
org.globus.wsrf.impl.ResourceHomeImpl.createNewInstanceAndLoad(ResourceHomeImpl.java:236)
    >...at org.globus.wsrf.impl.ResourceHomeImpl.get(ResourceHomeImpl.java:271)
>...at org.globus.wsrf.impl.ResourceHomeImpl.remove(ResourceHomeImpl.java:304)
>...at org.globus.wsrf.impl.lifetime.DestroyProvider.destroy(DestroyProvider.java:37

[...]
>...at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:268)
</ns6:Description></ns6:FaultCause>
>...{http://xml.apache.org/axis/}exceptionName:org.oasis.wsrf.lifetime.ResourceUnknownFaultType
>...{http://xml.apache.org/axis/}hostname:dc-user2.isi.edu
>...at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

[...]
>...at
org.globus.delegationService.DelegationPortTypeSOAPBindingStub.destroy(DelegationPortTypeSOAPBindingStub.java:1145)
>...at org.globus.exec.client.GramJob.destroyDelegatedCredential(GramJob.java:1271)
>...at org.globus.exec.client.GramJob.destroyDelegatedCredentials(GramJob.java:1200)
>...at org.globus.exec.client.GramJob.destroy(GramJob.java:1181)
>...at
org.globus.exec.service.test.throughput.ClientThread.stateChanged(ClientThread.java:267)
>...at org.globus.exec.client.GramJob.setState(GramJob.java:271)
>...at org.globus.exec.client.GramJob.deliver(GramJob.java:1464)
>...at
org.globus.wsrf.impl.notification.NotificationConsumerProvider.notify(NotificationConsumerProvider.java:106)

[...]
>...at org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:124)
>...at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:265)

[then repeat the same errors many times (953)]
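
Counting repeated errors such as these by hand is tedious. As a rough sketch, assuming the log lines keep the "date time,millis LEVEL logger" prefix seen in the excerpt above, a small Python helper could tally entries per level and logger. The function name and the regular expression are illustrative assumptions, not part of the test tooling.

```python
import re
from collections import Counter

# Matches the start of entries such as
# "2005-01-19 21:39:35,352 ERROR throughput.ClientThread [Thread-7,...]".
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)")


def count_by_level_and_logger(lines):
    """Return a Counter keyed by (level, logger) for lines that start a log entry.

    Continuation lines (stack frames, fault details) do not match and are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match.group("level"), match.group("logger"))] += 1
    return counts
```

Applied to the whole client log, the count for ("ERROR", "throughput.ClientThread") would give a quick cross-check of the number of failed jobs.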


SERVER LOG
----------
Starting...
[...]
,authorize:281] Authorized "/C=US/O=Globus
Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke
"{http://www.globus.org/namespaces/2004/10/rft}createReliableFileTransfer".
2005-01-19 21:39:24,735 ERROR delegation.DelegationUtil
[Thread-3,getDelegationResource:254] Error getting delegation resource
org.globus.wsrf.NoSuchResourceException
>...at
org.globus.delegation.service.DelegationResource.load(DelegationResource.java:404)
>...at org.globus.delegation.service.DelegationHome.find(DelegationHome.java:50)
)
>...at
org.globus.delegation.DelegationUtil.getDelegationResource(DelegationUtil.java:252)
>...at
org.globus.delegation.DelegationUtil.registerDelegationListener(DelegationUtil.java:157)
>...at
org.globus.transfer.reliable.service.ReliableFileTransferResource.processDelegatedCredential(ReliableFileTransferResource.java:358)
[...]
2005-01-19 21:39:24,743 ERROR factory.ReliableFileTransferFactoryService
[Thread-3,createReliableFileTransfer:183] Unable to create RFT resource
Error processing delegated credentialError getting delegation resource [Caused
by: org.globus.wsrf.NoSuchResourceException]

[...]
2005-01-20 04:15:26,438 ERROR delegation.DelegationUtil
[Thread-33114,getDelegationResource:254] Error getting delegation resource
org.globus.wsrf.NoSuchResourceException
>...at
org.globus.delegation.service.DelegationResource.load(DelegationResource.java:404)
>...at org.globus.delegation.service.DelegationHome.find(DelegationHome.java:50)
>...at
org.globus.delegation.DelegationUtil.getDelegationResource(DelegationUtil.java:252)
>...at
org.globus.delegation.DelegationUtil.registerDelegationListener(DelegationUtil.java:167)
>...at
org.globus.exec.service.utils.DelegatedCredential.getDelegatedCredential(DelegatedCredential.java:136)
>...at
org.globus.exec.service.job.ManagedJobResourceImpl.getJobCredential(ManagedJobResourceImpl.java:329)
[...]
2005-01-20 04:15:26,440 ERROR factory.ManagedJobFactoryService
[Thread-33114,createManagedJob:269] Job creation failed.
java.lang.RuntimeException: Couldn't obtain a delegated credential.
>...at
org.globus.exec.service.job.ManagedJobResourceImpl.getJobCredential(ManagedJobResourceImpl.java:338)
[...]
Caused by: org.globus.delegation.DelegationException: Error getting delegation
resource [Caused by: org.globus.wsrf.NoSuchResourceException]
>...at
org.globus.delegation.DelegationUtil.getDelegationResource(DelegationUtil.java:255)
>...at
org.globus.delegation.DelegationUtil.registerDelegationListener(DelegationUtil.java:167)

[...]
2005-01-20 04:17:21,070 INFO  authorization.ServiceAuthorizationChain
[Thread-33162,authorize:281] Authorized "/C=US/O=Globus
Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke
"{http://www.globus.org/namespaces/2004/10/rft}start".
2005-01-20 04:17:21,089 ERROR service.ReliableFileTransferResource
[Thread-33162,store:297] Unable to store subscriptions
java.lang.Exception: Unable to store subscriptionsnull
>...at
org.globus.transfer.reliable.service.ReliableFileTransferResource.storeSubscriptions(ReliableFileTransferResource.java:266)
>...at
org.globus.transfer.reliable.service.ReliableFileTransferResource.store(ReliableFileTransferResource.java:295)
[...]

2005-01-20 04:17:23,301 INFO  authorization.ServiceAuthorizationChain
[Thread-33148,authorize:281] Authorized
"/C=US/O=NPACI/OU=SDSC/CN=host/dc.isi.edu" to invoke
"{http://wsrf.globus.org/core/notification}notify".

cat ~/123/stdout
----------------
Wed Jan 19 21:38:49 PST 2005
Wed Jan 19 21:38:56 PST 2005
Wed Jan 19 21:39:06 PST 2005
[...]
Thu Jan 20 04:16:55 PST 2005
Thu Jan 20 04:17:22 PST 2005
Thu Jan 20 04:17:37 PST 2005
Thu Jan 20 04:18:23 PST 2005


INTERPRETATION:
---------------
  
   It is tempting to interpret the OutOfMemoryError as a result of 
   repeated submissions of jobs and creation of resources on the server 
   while the server is not able to destroy resources for jobs that are Done. 
   We need a better understanding of the behavior of the Throughput Tester 
   (used to maintain load) in order to confirm or rule out this hypothesis.
   
------- Comment #4 From 2005-01-21 15:52:53 -------
BATCH OF STUDY #5 TEST RESULTS:
-------------------------------
GRAM server:               dc-user2.isi.edu:8888
GridFTP server:            dc-user2:9001
postgreSQL server:         tubby.isi.edu

Test start date:           Thu Jan 20 15:24 PST 2005
Last server output:        Fri Jan 21 12:53 PST 2005
Duration of test:          7 hours 29 minutes

Last successful
   stagein:                Thu Jan 20 21:10 (ls -l ~/123/my_date)
   executable:             Thu Jan 20 21:10 (tail -n 1 ~/123/stdout)
   stageout:               Thu Jan 20 21:09 (ls -l /tmp/stdout.study5)

Number of jobs:
   successful:             1433 (wc ~/123/stdout)
   failed:                 1082
   total:                  
              
Note: used same GT installation (trunk Jan 19) 
      as previous post in this campaignzilla.

OBSERVATIONS
------------
 a) same issues in job failure as in previous post 
    and crash of container because of OutOfMemory error.
 b) about same results, i.e.:
    container stability: about 7 hours and 30 minutes before crash
    successful jobs: less than 1500
    failed jobs: around 1,000 (ouch! maybe problem with 
                               installation/test itself).
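
To compare the two runs more precisely, the failure rate and success throughput can be computed from the figures posted in this campaign. This is a small sketch: the helper name is made up, it assumes that the total is simply successful plus failed jobs, and it uses the reported "Duration of test" values (7 h 36 min and 7 h 29 min) as the time base.

```python
def summarize_run(successful, failed, hours):
    """Derive total, failure rate and successes per hour for one test run.

    Assumes total = successful + failed; `hours` is the reported test duration.
    """
    total = successful + failed
    return {
        "total": total,
        "failure_rate": failed / total,
        "successes_per_hour": successful / hours,
    }


# Figures from the Jan 19 and Jan 20 runs reported in this campaign.
run_jan19 = summarize_run(successful=1423, failed=953, hours=7 + 36 / 60)
run_jan20 = summarize_run(successful=1433, failed=1082, hours=7 + 29 / 60)

for name, run in (("Jan 19", run_jan19), ("Jan 20", run_jan20)):
    print(f"{name}: total={run['total']} "
          f"failure_rate={run['failure_rate']:.1%} "
          f"successes/hour={run['successes_per_hour']:.0f}")
```

Under these assumptions both runs lose roughly 40% of their jobs to the staging failure, which matches the "about same results" observation.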

------- Comment #5 From 2005-01-24 19:28:38 -------
Without any staging directive in the RSL, the study #5 test runs very well,
with no errors noticed (I did not carry the test through to the end). This confirms
that the main issue (apart from the eventual OutOfMemoryError) exposed
by the existing study #5 test is buggy RFT/staging (see the results above
for more details).

It could be interesting to run the test with different numbers in terms 
of concurrency of submitted jobs and clients. Therefore:

Repeated the test with different parameters fed to the Throughput Tester:
--------------------------------------------------------------------------

case      |  clients  |  jobs/client  |  summary of results

A=study#5       1           10 jobs      RFT/staging errors (see above)

B              10            1 job       lots of NPE in a Locator class; 
                                         GRAM service stuck after about 100 
                                         "Done" jobs. So in a way this is 
                                         worse than results for A. 

C               1            1 job       see details below:


Case C results:

This is equivalent to a purely serial submission of jobs. 

Observation 1) 

We do not see the staging errors of study #5, i.e. case A (delegated
credentials could not be obtained, which caused RFT to fail).
This seems to indicate that concurrency is instrumental in
triggering those RFT errors.

Observation 2)

2723 jobs were "Done" over a duration of more than 46 hours 
(almost 2 days), which shows a dramatic difference in terms 
of service stability compared to a load of 10 concurrently 
processed jobs.
Afterwards no more successful JT2 jobs could be submitted:
2.1) On the server, which keeps writing to output, there is an OutOfMemoryError
on the createJob call whenever a job is submitted manually. This shows 
that we eventually get memory issues in GRAM even with no job concurrency.
2.2) If the job is stripped of its staging directives it succeeds, which seems
to indicate that the memory leak is due to RFT/staging-related code in GRAM.

Observation 3) 

Here are the ERROR logs on the server:

2005-01-22 04:23:43,825 ERROR exec.ManagedExecutableJobResource
[Thread-2,deliver:1218] Unable to destroy transfer.

2005-01-22 04:25:38,994 ERROR container.GSIServiceThread
[Thread-2000,process:117] Error processing request
java.net.SocketException: Connection reset

hread-14902,authorize:281] Authorized "/C=US/O=Globus
Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke
"{http://www.globus.org/namespaces/2004/10/rft}destroy".
2005-01-22 04:40:48,712 ERROR container.ServiceThread [Thread-3,process:410]
Error closing output stream

hread-3,authorize:281] Authorized "/C=US/O=Globus
Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke
"{http://www.globus.org/namespaces/2004/10/rft}createReliableFileTransfer".
2005-01-22 04:41:29,991 ERROR container.GSIServiceThread
[Thread-14922,process:117] Error processing request
java.net.SocketException: Connection reset

hread-14902,authorize:281] Authorized "/C=US/O=Globus
Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke
"{http://www.globus.org/namespaces/2004/10/rft}start".
2005-01-22 05:28:10,439 ERROR container.GSIServiceThread
[Thread-15113,process:117] Error processing request

hread-15113,authorize:281] Authorized "/C=US/O=Globus
Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke
"{http://www.globus.org/namespaces/2004/10/rft}start".
2005-01-22 19:29:30,818 ERROR service.ReliableFileTransferResource
[Thread-15113,store:297] Unable to store subscriptions
java.lang.Exception: Unable to store subscriptionsnull
>...at
org.globus.transfer.reliable.service.ReliableFileTransferResource.storeSubscriptions(ReliableFileTransferResource.java:266)
[repeated 3 times]

hread-3,authorize:281] Authorized "/C=US/O=Globus
Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke
"{http://www.globus.org/namespaces/2004/10/gram/job/exec}getMultipleResourceProperties".
2005-01-23 13:30:54,376 ERROR container.ServiceThread [Thread-15113,process:410]
Error closing output stream


------- Comment #6 From 2005-01-24 23:03:12 -------
We should try to get someone to perform similar tests with RFT directly to see
if the staging-related leaks are in the GRAM-RFT interaction code or in the
RFT service code.
------- Comment #7 From 2005-01-25 10:36:59 -------
Are you logging stderr too? AFAIK, OOM errors are not logged if you are only
logging stdout. Also, I would like to know more details about this test so I
can simulate it with my client.
------- Comment #8 From 2005-01-25 11:54:58 -------
Alain,

Can you send me the info on the case B results? The NPEs? 
------- Comment #9 From 2005-01-25 15:41:41 -------
Revising the results of case study B (10 clients - 1 job/client):

a) 
All errors that cause jobs to fail are RFT/staging errors and look similar 
to the staging errors in case study A (see logs in results for study #5). 
So at least we have consistency in terms of the causes for job failure 
between case studies A and B. 

b)
NPEs appear (later) in the *client* when it tries to refresh the state of a job

2005-01-21 20:39:58,195 ERROR throughput.ClientThread [Thread-2,run:136] Unable
to refresh job.
java.lang.NullPointerException
>...at
org.globus.exec.generated.service.ManagedJobServiceAddressingLocator.getManagedJobPortTypePort(ManagedJobServiceAddressingLocator.java:12)
>...at
org.globus.exec.utils.client.ManagedJobClientHelper.getPort(ManagedJobClientHelper.java:29)
>...at org.globus.exec.client.GramJob.refreshStatus(GramJob.java:1536)
>...at
org.globus.exec.service.test.throughput.ClientThread.run(ClientThread.java:134)

This NPE is due to a null job EPR passed to the locator. Looks like it 
*might* be a bug in ClientThread but I would need to spend more time on this. 
------- Comment #10 From 2005-02-24 16:28:47 -------
Now that GRAM scalability has been improved, this study should be resumed.
------- Comment #11 From 2005-02-25 16:22:12 -------
I have a container running on Lucky0.  I am running the wsrf gram scheduler
test for fork, which submits jobs (30 in all), in a loop for up to 10000
iterations.  Every 100 iterations I run top and pmap on the globus container
pid.  I'm not sure what pmap will offer up, but it looked interesting :-).
This should run over the weekend.  We'll see if the container is still
functional on Monday.
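
The loop-and-sample approach described here can be sketched roughly as below. This is an illustration in Python, not the Perl script attached to this bug: `run_once` and `sample` are hypothetical callables, and `snapshot` assumes the procps versions of top and pmap are installed on the host.

```python
import subprocess


def snapshot(pid):
    """Capture one batch-mode `top` sample and a `pmap` listing for `pid`.

    Assumes procps `top` (-b batch mode, -n 1 iteration, -p pid) and `pmap`.
    """
    top_out = subprocess.run(["top", "-b", "-n", "1", "-p", str(pid)],
                             capture_output=True, text=True).stdout
    pmap_out = subprocess.run(["pmap", str(pid)],
                              capture_output=True, text=True).stdout
    return top_out, pmap_out


def long_run(run_once, sample, iterations=10000, sample_every=100):
    """Call `run_once(i)` for each iteration and `sample()` every `sample_every`.

    `run_once` stands in for one execution of the scheduler test; `sample`
    stands in for a memory snapshot of the container process.
    Returns a list of (iteration, sample result) pairs.
    """
    samples = []
    for i in range(1, iterations + 1):
        run_once(i)
        if i % sample_every == 0:
            samples.append((i, sample()))
    return samples
```

A caller could pass `lambda: snapshot(container_pid)` as `sample`, where `container_pid` is the pid of the globus container, and compare the first and last snapshots to look for memory growth.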
------- Comment #12 From 2005-04-06 17:35:13 -------
The tests on Lucky revealed a JVM bug:

http://64.233.161.104/search?q=cache:udlaMTxJaksJ:bugs.sun.com/bugdatabase/
view_bug.do%3Fbug_id%3D4959566+4F533F4C494E55583F491418160E4350500306&hl=en

So I moved to Vanguard.mcs.anl.gov, where I was able to perform a long-running test that lasted for more 
than 500,000 jobs over a 23-day period without the container crashing.

See attachments for details.
------- Comment #13 From 2005-04-06 17:37:50 -------
Created an attachment (id=563) [details]
client job submission program

This is a perl script that builds and submits in a loop the
globus_wsrf_gram_scheduler_test program.
------- Comment #14 From 2005-04-06 17:38:50 -------
Created an attachment (id=564) [details]
The tail end of the log file from the long-run-test.pl
------- Comment #15 From 2005-04-06 17:39:43 -------
Created an attachment (id=565) [details]
pmap output at the beginning of the run
------- Comment #16 From 2005-04-06 17:41:39 -------
Created an attachment (id=566) [details]
pmap output at the end of the 2nd run

The long-run-test program was executed for 10,000 iterations twice.  So this
is the output after the *second* long-run-test execution.
------- Comment #17 From 2005-04-06 17:42:09 -------
Created an attachment (id=567) [details]
top output before the start of the job submissions
------- Comment #18 From 2005-04-06 17:42:36 -------
Created an attachment (id=568) [details]
top output after all the job submissions
------- Comment #19 From 2005-04-06 17:45:11 -------
This campaign is done.  Should be rerun after the 4.0 code has been frozen.