Bugzilla – Bug 2582
CAMPAIGN: WS GRAM Study #5 (container/service stability in job submissions)
Last modified: 2005-04-06 17:45:11
Projects: WS GRAM
Technologies: Globus Resource Allocation Manager (GRAM)

Definition:
Failures of job submissions and a subsequent crash/unresponsiveness of the GRAM services have been noticed (see bug #2528 at http://bugzilla.globus.org/globus/show_bug.cgi?id=2528). It therefore seems in order to run GRAM Study #5, outlined in http://www-unix.globus.org/toolkit/docs/development/4.0-drafts/perf_overview.html, to determine the average life expectancy of a running GRAM server processing a steady stream of submissions of identical job descriptions. However, it is not guaranteed that GRAM will consistently fail after more or less the same duration, so the results should be taken with a grain of salt unless strong consistency is established (which would require many long test runs).

Deliverables:
1. RSL job description file for job type 2 (JT2) as referred to in study #5 (writing to stdout/err, staging, stageout).
2. Script to automate study #5: submit a continuous stream of jobs as needed so that a load of 10 jobs is maintained in the GRAM queue. The job description is always the one defined in deliverable 1.
3. Run the script several times to obtain a series of results. Each run should last a long time (max one month), until the container/GRAM services no longer accept job submissions and process jobs to success.

Tasks:
1. Define a job description file specifying the job to submit in study #5.
2. Write a script to automate study #5 and maintain a given job load (10 jobs) for a very long time (one month). Possibility: reuse the Throughput Tester (a Java client), as it provides load maintenance functions. Modify the Throughput Tester if needed.
3. Run study #5 repeatedly and provide results in terms of GRAM service stability under study #5 conditions.

Time Estimate: Until 3.9.4 release? (long running tests)
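The load-maintenance requirement in deliverable 2 (keep 10 jobs outstanding, submitting a new one whenever one finishes) can be sketched roughly as follows. This is only an illustration and not the Throughput Tester's actual logic: `submit_job` is a hypothetical stand-in for a real GRAM submission and simply sleeps, and `maintain_load` is a name invented here.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def submit_job(job_id):
    """Hypothetical stand-in for one GRAM job submission; returns True on success."""
    time.sleep(0.01)  # pretend the job takes some time to reach Done
    return True


def maintain_load(load, total_jobs, submit=submit_job):
    """Keep up to `load` jobs in flight until `total_jobs` have been submitted.

    Returns (succeeded, failed, peak_in_flight).
    """
    succeeded = failed = submitted = peak = 0
    with ThreadPoolExecutor(max_workers=load) as pool:
        in_flight = set()
        while submitted < total_jobs or in_flight:
            # Top up the queue so that `load` jobs are outstanding.
            while submitted < total_jobs and len(in_flight) < load:
                in_flight.add(pool.submit(submit, submitted))
                submitted += 1
            peak = max(peak, len(in_flight))
            # Block until at least one job finishes, then account for it.
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for fut in done:
                try:
                    ok = fut.result()
                except Exception:
                    ok = False
                if ok:
                    succeeded += 1
                else:
                    failed += 1
    return succeeded, failed, peak


if __name__ == "__main__":
    print(maintain_load(load=10, total_jobs=35))
```

Note that a loop of this shape blocks in `wait` if every in-flight job hangs, so a real client would probably also want a per-job timeout.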
Created an attachment (id=486) [details] excerpts of server-side log from 5:36 Friday 14 Jan study #5 test
BATCH OF STUDY #5 TEST RESULTS:
-------------------------------
Host: dc-user2.isi.edu
Port: 9000
Test start date: Jan 14 17:36 PST 2005
Last client side output: Jan 14 20:31 PST 2005
Duration of test: 2 hours 55 minutes
Last successful job at: Fri Jan 14 19:04:02 PST 2005 (/bin/date in stdout)
Number of successful jobs: 136

AFTER CRASH, MANUAL SUBMISSION OF A SIMPLE JOB:
----------------------------------------------
Used m-j-g (Java client) and globus-ws to submit a /bin/true job.
On client:
- No notifications seemed to be delivered to the client.
- The client hung waiting for notifications forever (Java) or timed out (C).
- The Java client would pull the state of the job and obtain "Unsubmitted".
On server:
- Lots of identical authorization-related log messages (getMultipleResourceProperties). The client pulls again, still "Unsubmitted".
All submitted jobs are now stuck in the Unsubmitted state.

INTERPRETATION:
---------------
This is probably what stopped the load maintenance/study #5 test: 10 jobs got stuck, so no more jobs were submitted by the client. Maybe the pace of submission in the test is too high for GRAM to keep up with, so that it eventually gets stuck?
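If the interpretation above is right, the client stalls silently once every load slot is held by a stuck job. A minimal sketch of a watchdog check that would flag this, using the two timestamps from the report; the function name and the 30-minute threshold are assumptions, not part of the test client.

```python
from datetime import datetime, timedelta


def is_stalled(last_completion, now, timeout=timedelta(minutes=30)):
    """Return True if no job has completed within `timeout` (an assumed threshold)."""
    return now - last_completion > timeout


# Timestamps taken from the test report above (PST, 14 Jan 2005).
last_success = datetime(2005, 1, 14, 19, 4, 2)
last_client_output = datetime(2005, 1, 14, 20, 31)

print(is_stalled(last_success, last_client_output))  # True
print(last_client_output - last_success)             # 1:26:58
```

In this run the client kept producing output for almost an hour and a half after the last successful job, so a check like this would have fired long before the test was inspected by hand.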
BATCH OF STUDY #5 TEST RESULTS:
-------------------------------
GRAM server: dc-user2.isi.edu:8888
GridFTP server: dc-user2:9001
postgreSQL server: tubby.isi.edu
Test start date: Wed Jan 19 21:13 PST 2005
Last server output: Thu Jan 20 04:49 PST 2005
Duration of test: 7 hours 36 minutes
Last successful
  stagein: Thu Jan 20 04:17 (ls -l ~/123/my_date)
  executable: Thu Jan 20 04:18:23 PST 2005 (tail -n 1 ~/123/stdout)
  stageout: Thu Jan 20 04:18 (ls -l /tmp/stdout.study5)
Number of
  successful jobs: 1423 (wc ~/123/stdout)
  failed jobs: 953 (due to stagein failure, see below)

Note: before the test was started, the successful submission of a job with the study #5 JT2 RSL was verified. The container was then stopped and restarted afresh.

ISSUES
------
a) Shortly after the test started, errors appeared on the client side:
- job failure, because of
- StageIn failing, because of
- the RFT resource not being created, because of
- a NoSuchResourceException
See the client log excerpt for more details. Those errors appeared throughout the whole duration of the test (that is, until the crash of the container).
b) After the last successful submission, the container crashed because of an OutOfMemoryError.

CLIENT log (excerpt, default log properties)
--------------------------------------------
WS GRAM Study #5 with load = 10 and duration = 2592000 s
2005-01-19 21:39:35,352 ERROR throughput.ClientThread [Thread-7,stateChanged:260] a job failed with handle https://128.9.64.179:8888/wsrf/services/ManagedExecutableJobService?9dad51f0-6aa5-11d9-978a-cc2162090ce2:
fault type: org.globus.exec.generated.StagingFaultType:
attribute: fileStageIn
command: StageIn
description: Staging error for RSL element fileStageIn.
faultReason:.
faultString:.
gt2ErrorCode: 0 originator: Address: https://128.9.64.179:8888/wsrf/services/ManagedJobFactoryService Reference property[0]: <ns1:ResourceID xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job">9dad51f0-6aa5-11d9-978a-cc2162090ce2</ns1:ResourceID> stackTrace: org.globus.exec.generated.StagingFaultType: Staging error for RSL element fileStageIn. Timestamp: Wed Jan 19 21:39:24 PST 2005 Originator: Address: https://128.9.64.179:8888/wsrf/services/ManagedJobFactoryService Reference property[0]: <ns1:ResourceID xmlns:ns1="http://www.globus.org/namespaces/2004/10/gram/job">9dad51f0-6aa5-11d9-978a-cc2162090ce2</ns1:ResourceID> Caused by: org.oasis.wsrf.faults.BaseFaultType: AxisFault faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.userException faultSubcode:. faultString: java.rmi.RemoteException: Unable to create RFT resource; nested exception is:. >...org.globus.transfer.reliable.service.exception.RftException: Error processing delegated credentialError getting delegation resource [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error getting delegation resource [Caused by: org.globus.wsrf.NoSuchResourceException]] faultActor:. faultNode:. faultDetail:. >...{http://xml.apache.org/axis/}stackTrace:java.rmi.RemoteException: Unable to create RFT resource; nested exception is:. faultDetail:. >...{http://xml.apache.org/axis/}stackTrace:java.rmi.RemoteException: Unable to create RFT resource; nested exception is:. >...org.globus.transfer.reliable.service.exception.RftException: Error processing delegated credentialError getting delegation resource [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error getting delegation resource [Caused by: org.globus.wsrf.NoSuchResourceException]] >...at org.globus.transfer.reliable.service.factory.ReliableFileTransferFactoryService.createReliableFileTransfer(ReliableFileTransferFactoryService.java:185)>...at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [...] 
Caused by: org.globus.transfer.reliable.service.exception.RftException: Error processing delegated credentialError getting delegation resource [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error getting delegation resource [Caused by: org.globus.wsrf.NoSuchResourceException]] >...at org.globus.transfer.reliable.service.ReliableFileTransferResource.processDelegatedCredential(ReliableFileTransferResource.java:365) >...at org.globus.transfer.reliable.service.ReliableFileTransferResource.<init>(ReliableFileTransferResource.java:222) >...at org.globus.transfer.reliable.service.ReliableFileTransferHome.create(ReliableFileTransferHome.java:108) >...at org.globus.transfer.reliable.service.factory.ReliableFileTransferFactoryService.createReliableFileTransfer(ReliableFileTransferFactoryService.java:180)>...... 23 more >...{http://xml.apache.org/axis/}hostname:dc-user2.isi.edu java.rmi.RemoteException: Unable to create RFT resource; nested exception is:. >...org.globus.transfer.reliable.service.exception.RftException: Error processing delegated credentialError getting delegation resource [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error getting delegation resource [Caused by: org.globus.wsrf.NoSuchResourceException]] >...at org.apache.axis.message.SOAPFaultBuilder.createFault(SOAPFaultBuilder.java:221) [...] 
>...at org.apache.axis.client.Call.invoke(Call.java:1765) >...at org.globus.rft.generated.bindings.ReliableFileTransferFactoryPortTypeSOAPBindingStub.createReliableFileTransfer(ReliableFileTransferFactoryPortTypeSOAPBindingStub.java:874) >...at org.globus.exec.service.exec.StateMachine.submitStagingRequest(StateMachine.java:1910) >...at org.globus.exec.service.exec.StateMachine.processStageInState(StateMachine.java:571) >...at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >...at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >...at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >...at java.lang.reflect.Method.invoke(Method.java:324) >...at org.globus.exec.service.exec.StateMachine.processState(StateMachine.java:258) >...at org.globus.exec.service.exec.RunQueue.run(RunQueue.java:93) Timestamp: Wed Jan 19 21:39:24 PST 2005 [...] >...at org.globus.wsrf.encoding.ObjectDeserializer.toObject(ObjectDeserializer.java:56) >...at org.globus.exec.client.GramJob.deliver(GramJob.java:1440) >...at org.globus.wsrf.impl.notification.NotificationConsumerProvider.notify(NotificationConsumerProvider.java:106) [...] 
>...at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:265) stateWhenFailureOccurred: StageIn timestamp: java.util.GregorianCalendar[time=1106199564817,areFieldsSet=true,areAllFieldsSet=true,lenient=true,zone=sun.util.calendar.ZoneInfo[id="GMT",offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=1,YEAR=2005,MONTH=0,WEEK_OF_YEAR=4,WEEK_OF_MONTH=4,DAY_OF_MONTH=20,DAY_OF_YEAR=20,DAY_OF_WEEK=5,DAY_OF_WEEK_IN_MONTH=3,AM_PM=0,HOUR=5,HOUR_OF_DAY=5,MINUTE=39,SECOND=24,MILLISECOND=817,ZONE_OFFSET=0,DST_OFFSET=0] Message: null 2005-01-19 21:39:40,678 WARN client.GramJob [Thread-7,destroyDelegatedCredential:1279] Unable to destroy resource AxisFault faultCode: {http://schemas.xmlsoap.org/soap/envelope/}Server.generalException faultSubcode:. faultString:. faultActor:. faultNode:. faultDetail:. >...{http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-ResourceLifetime-1.2-draft-01.xsd}ResourceUnknownFault:<ns2:Timestamp xmlns:ns2="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd">2005-01-20T05:39:40.390Z</ns2:Timestamp><ns3:Originator xmlns:ns3="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd"><ns2:Address xmlns:ns2="http://schemas.xmlsoap.org/ws/2004/03/addressing">https://128.9.64.179:8888/wsrf/services/DelegationService</ns2:Address><ns4:ReferenceProperties xmlns:ns4="http://schemas.xmlsoap.org/ws/2004/03/addressing"><ns1:DelegationKey soapenv:mustUnderstand="0" xmlns:ns1="http://www.globus.org/08/2004/delegationService" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">a55a0ab0-6aa5-11d9-978a-cc2162090ce2</ns1:DelegationKey></ns4:ReferenceProperties><ns5:ReferenceParameters xmlns:ns5="http://schemas.xmlsoap.org/ws/2004/03/addressing"/></ns3:Originator><ns4:Description xmlns:ns4="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd">Failed to remove resource</ns4:Description><ns5:FaultCause 
xmlns:ns5="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd"><ns5:Timestamp>2005-01-20T05:39:40.390Z</ns5:Timestamp><ns5:ErrorCode dialect="http://www.globus.org/fault/stacktrace">org.oasis.wsrf.lifetime.ResourceUnknownFaultType</ns5:ErrorCode><ns5:Description> >...at org.globus.wsrf.impl.lifetime.DestroyProvider.destroy(DestroyProvider.java:39) [...] >...at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:268) </ns5:Description></ns5:FaultCause><ns6:FaultCause xmlns:ns6="http://docs.oasis-open.org/wsrf/2004/06/wsrf-WS-BaseFaults-1.2-draft-01.xsd"><ns6:Timestamp>2005-01-20T05:39:40.393Z</ns6:Timestamp><ns6:ErrorCode dialect="http://www.globus.org/fault/exception"/><ns6:Description>org.globus.wsrf.NoSuchResourceException >...at org.globus.delegation.service.DelegationResource.load(DelegationResource.java:404) >...at org.globus.wsrf.impl.ResourceHomeImpl.createNewInstanceAndLoad(ResourceHomeImpl.java:236) >...at org.globus.wsrf.impl.ResourceHomeImpl.get(ResourceHomeImpl.java:271) >...at org.globus.wsrf.impl.ResourceHomeImpl.remove(ResourceHomeImpl.java:304) >...at org.globus.wsrf.impl.lifetime.DestroyProvider.destroy(DestroyProvider.java:37 [...] >...at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:268) </ns6:Description></ns6:FaultCause> >...{http://xml.apache.org/axis/}exceptionName:org.oasis.wsrf.lifetime.ResourceUnknownFaultType >...{http://xml.apache.org/axis/}hostname:dc-user2.isi.edu >...at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) [...] 
>...at org.globus.delegationService.DelegationPortTypeSOAPBindingStub.destroy(DelegationPortTypeSOAPBindingStub.java:1145)
>...at org.globus.exec.client.GramJob.destroyDelegatedCredential(GramJob.java:1271)
>...at org.globus.exec.client.GramJob.destroyDelegatedCredentials(GramJob.java:1200)
>...at org.globus.exec.client.GramJob.destroy(GramJob.java:1181)
>...at org.globus.exec.service.test.throughput.ClientThread.stateChanged(ClientThread.java:267)
>...at org.globus.exec.client.GramJob.setState(GramJob.java:271)
>...at org.globus.exec.client.GramJob.deliver(GramJob.java:1464)
>...at org.globus.wsrf.impl.notification.NotificationConsumerProvider.notify(NotificationConsumerProvider.java:106)
[...]
>...at org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:124)
>...at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:265)
[then repeat the same errors many times (953)]

SERVER LOG
----------
Starting...
[...]
,authorize:281] Authorized "/C=US/O=Globus Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke "{http://www.globus.org/namespaces/2004/10/rft}createReliableFileTransfer".
2005-01-19 21:39:24,735 ERROR delegation.DelegationUtil [Thread-3,getDelegationResource:254] Error getting delegation resource
org.globus.wsrf.NoSuchResourceException
>...at org.globus.delegation.service.DelegationResource.load(DelegationResource.java:404)
>...at org.globus.delegation.service.DelegationHome.find(DelegationHome.java:50)
>...at org.globus.delegation.DelegationUtil.getDelegationResource(DelegationUtil.java:252)
>...at org.globus.delegation.DelegationUtil.registerDelegationListener(DelegationUtil.java:157)
>...at org.globus.transfer.reliable.service.ReliableFileTransferResource.processDelegatedCredential(ReliableFileTransferResource.java:358)
[...]
2005-01-19 21:39:24,743 ERROR factory.ReliableFileTransferFactoryService [Thread-3,createReliableFileTransfer:183] Unable to create RFT resource Error processing delegated credentialError getting delegation resource [Caused by: org.globus.wsrf.NoSuchResourceException] [...] 2005-01-20 04:15:26,438 ERROR delegation.DelegationUtil [Thread-33114,getDelegationResource:254] Error getting delegation resource org.globus.wsrf.NoSuchResourceException >...at org.globus.delegation.service.DelegationResource.load(DelegationResource.java:404) >...at org.globus.delegation.service.DelegationHome.find(DelegationHome.java:50) >...at org.globus.delegation.DelegationUtil.getDelegationResource(DelegationUtil.java:252) >...at org.globus.delegation.DelegationUtil.registerDelegationListener(DelegationUtil.java:167) >...at org.globus.exec.service.utils.DelegatedCredential.getDelegatedCredential(DelegatedCredential.java:136) >...at org.globus.exec.service.job.ManagedJobResourceImpl.getJobCredential(ManagedJobResourceImpl.java:329) [...] 2005-01-20 04:15:26,440 ERROR factory.ManagedJobFactoryService [Thread-33114,createManagedJob:269] Job creation failed. java.lang.RuntimeException: Couldn't obtain a delegated credential. >...at org.globus.exec.service.job.ManagedJobResourceImpl.getJobCredential(ManagedJobResourceImpl.java:338) [...] Caused by: org.globus.delegation.DelegationException: Error getting delegation resource [Caused by: org.globus.wsrf.NoSuchResourceException] >...at org.globus.delegation.DelegationUtil.getDelegationResource(DelegationUtil.java:255) >...at org.globus.delegation.DelegationUtil.registerDelegationListener(DelegationUtil.java:167) [...] 2005-01-20 04:17:21,070 INFO authorization.ServiceAuthorizationChain [Thread-33162,authorize:281] Authorized "/C=US/O=Globus Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke "{http://www.globus.org/namespaces/2004/10/rft}start". 
2005-01-20 04:17:21,089 ERROR service.ReliableFileTransferResource [Thread-33162,store:297] Unable to store subscriptions
java.lang.Exception: Unable to store subscriptionsnull
>...at org.globus.transfer.reliable.service.ReliableFileTransferResource.storeSubscriptions(ReliableFileTransferResource.java:266)
>...at org.globus.transfer.reliable.service.ReliableFileTransferResource.store(ReliableFileTransferResource.java:295)
[...]
2005-01-20 04:17:23,301 INFO authorization.ServiceAuthorizationChain [Thread-33148,authorize:281] Authorized "/C=US/O=NPACI/OU=SDSC/CN=host/dc.isi.edu" to invoke "{http://wsrf.globus.org/core/notification}notify".

cat ~/123/stdout
----------------
Wed Jan 19 21:38:49 PST 2005
Wed Jan 19 21:38:56 PST 2005
Wed Jan 19 21:39:06 PST 2005
[...]
Thu Jan 20 04:16:55 PST 2005
Thu Jan 20 04:17:22 PST 2005
Thu Jan 20 04:17:37 PST 2005
Thu Jan 20 04:18:23 PST 2005

INTERPRETATION:
---------------
It is tempting to interpret the OutOfMemoryError as the result of repeated job submissions creating resources on the server while the server is unable to destroy the resources of jobs that are Done. We need to understand the behavior of the Throughput Tester (used to maintain load) better in order to confirm or refute this hypothesis.
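To see which server-side errors dominate a run like the one above, the timestamped container log lines can be tallied per logger and message. The sketch below assumes every entry follows the layout seen in the excerpts (timestamp, level, logger, `[thread,method:line]`, message); the regular expression and the function name are illustrative, and stack-trace lines are simply skipped.

```python
import re
from collections import Counter

# Matches entries such as:
# 2005-01-19 21:39:24,735 ERROR delegation.DelegationUtil [Thread-3,getDelegationResource:254] Error ...
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+"
    r"\[(?P<thread>[^,\]]+),(?P<method>[^:\]]+):(?P<line>\d+)\]\s*(?P<msg>.*)$"
)


def tally_errors(lines):
    """Count ERROR entries per (logger, message); unparsable lines are skipped."""
    counts = Counter()
    for raw in lines:
        m = LOG_LINE.match(raw.strip())
        if m and m.group("level") == "ERROR":
            counts[(m.group("logger"), m.group("msg"))] += 1
    return counts


# Entries copied from the server log excerpt above.
sample = [
    "2005-01-19 21:39:24,735 ERROR delegation.DelegationUtil [Thread-3,getDelegationResource:254] Error getting delegation resource",
    "org.globus.wsrf.NoSuchResourceException",
    "2005-01-20 04:15:26,438 ERROR delegation.DelegationUtil [Thread-33114,getDelegationResource:254] Error getting delegation resource",
    "2005-01-20 04:15:26,440 ERROR factory.ManagedJobFactoryService [Thread-33114,createManagedJob:269] Job creation failed.",
    '2005-01-20 04:17:23,301 INFO authorization.ServiceAuthorizationChain [Thread-33148,authorize:281] Authorized "/C=US/O=NPACI/OU=SDSC/CN=host/dc.isi.edu" to invoke "{http://wsrf.globus.org/core/notification}notify".',
]

for (logger, msg), n in tally_errors(sample).most_common():
    print(n, logger, msg)
```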
BATCH OF STUDY #5 TEST RESULTS:
-------------------------------
GRAM server: dc-user2.isi.edu:8888
GridFTP server: dc-user2:9001
postgreSQL server: tubby.isi.edu
Test start date: Thu Jan 20 15:24 PST 2005
Last server output: Fri Jan 21 12:53 PST 2005
Duration of test: 7 hours 29 minutes
Last successful
  stagein: Thu Jan 20 21:10 (ls -l ~/123/my_date)
  executable: Thu Jan 20 21:10 (tail -n 1 ~/123/stdout)
  stageout: Thu Jan 20 21:09 (ls -l /tmp/stdout.study5)
Number of jobs:
  successful: 1433 (wc ~/123/stdout)
  failed: 1082
  total:

Note: used the same GT installation (trunk Jan 19) as in the previous post in this campaignzilla.

OBSERVATIONS
------------
a) Same job failure issues as in the previous post, and crash of the container because of an OutOfMemory error.
b) About the same results, i.e.:
  container stability: about 7 hours and 30 minutes before crash
  successful jobs: less than 1500
  failed jobs: around 1,000 (ouch! maybe a problem with the installation/test itself).
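For a quick side-by-side of the two runs at load 10, the reported counts and durations can be turned into a failure fraction and a success rate. This is plain arithmetic on the numbers posted above; the `summarize` helper and the run labels are introduced here for illustration.

```python
def summarize(successful, failed, hours, minutes):
    """Return (failure fraction, successful jobs per hour) for one run."""
    total = successful + failed
    duration_h = hours + minutes / 60
    return failed / total, successful / duration_h


# (successful, failed, hours, minutes) as reported in the two posts above.
runs = {
    "Jan 19 run": (1423, 953, 7, 36),
    "Jan 20 run": (1433, 1082, 7, 29),
}

for name, args in runs.items():
    frac, rate = summarize(*args)
    print(f"{name}: {frac:.1%} failed, {rate:.0f} successful jobs/hour")
```

Both runs come out at roughly 40% failed jobs, which fits the "about same results" observation.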
Without any staging directive in the RSL, the study #5 test runs very well with no errors noticed (the test was not carried through to the end). This confirms that the main issue (apart from the eventual OutOfMemory error) pointed out by the existing study #5 test is buggy RFT/staging (see the results above for more details).

It could be interesting to run the test with different numbers in terms of concurrency of submitted jobs and clients. Therefore:

Repeated the test with different parameters fed to the Throughput Tester:
--------------------------------------------------------------------------
case       | clients | jobs/client | summary of results
A=study#5  | 1       | 10 jobs     | RFT/staging errors (see above)
B          | 10      | 1 job       | lots of NPE in a Locator class; GRAM service stuck after about 100 "Done" jobs. So in a way this is worse than the results for A.
C          | 1       | 1 job       | see details below

Case C results:
This is equivalent to a purely serial submission of jobs.

Observation 1) We do not see the staging errors of study #5, i.e. case A (the failure to obtain delegated credentials, which made RFT fail). This seems to indicate that concurrency is instrumental in triggering those RFT errors.

Observation 2) 2723 jobs were "Done" over a duration of more than 46 hours (almost 2 days), which shows a dramatic difference in service stability compared to a load of 10 concurrently processed jobs. Afterwards no more successful JT2 jobs could be submitted:
2.1) On the server, which keeps writing to output, there is an OutOfMemoryError on the createJob call whenever a job is submitted manually. This shows that we eventually get memory issues in GRAM even with no job concurrency.
2.2) If the job is stripped of its staging directives it succeeds, which seems to indicate that the memory leak is due to RFT/staging-related things in GRAM.
Observation 3) Here are the ERROR logs on the server:
2005-01-22 04:23:43,825 ERROR exec.ManagedExecutableJobResource [Thread-2,deliver:1218] Unable to destroy transfer.
2005-01-22 04:25:38,994 ERROR container.GSIServiceThread [Thread-2000,process:117] Error processing request
java.net.SocketException: Connection reset
hread-14902,authorize:281] Authorized "/C=US/O=Globus Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke "{http://www.globus.org/namespaces/2004/10/rft}destroy".
2005-01-22 04:40:48,712 ERROR container.ServiceThread [Thread-3,process:410] Error closing output stream
hread-3,authorize:281] Authorized "/C=US/O=Globus Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke "{http://www.globus.org/namespaces/2004/10/rft}createReliableFileTransfer".
2005-01-22 04:41:29,991 ERROR container.GSIServiceThread [Thread-14922,process:117] Error processing request
java.net.SocketException: Connection reset
hread-14902,authorize:281] Authorized "/C=US/O=Globus Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke "{http://www.globus.org/namespaces/2004/10/rft}start".
2005-01-22 05:28:10,439 ERROR container.GSIServiceThread [Thread-15113,process:117] Error processing request
hread-15113,authorize:281] Authorized "/C=US/O=Globus Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke "{http://www.globus.org/namespaces/2004/10/rft}start".
2005-01-22 19:29:30,818 ERROR service.ReliableFileTransferResource [Thread-15113,store:297] Unable to store subscriptions
java.lang.Exception: Unable to store subscriptionsnull
>...at org.globus.transfer.reliable.service.ReliableFileTransferResource.storeSubscriptions(ReliableFileTransferResource.java:266)
[repeated 3 times]
hread-3,authorize:281] Authorized "/C=US/O=Globus Alliance/OU=User/CN=1016d3c1646.f0ecdb2d" to invoke "{http://www.globus.org/namespaces/2004/10/gram/job/exec}getMultipleResourceProperties".
2005-01-23 13:30:54,376 ERROR container.ServiceThread [Thread-15113,process:410] Error closing output stream
We should try to get someone to perform similar tests with RFT directly to see if the staging-related leaks are with the GRAM-RFT interaction code or the RFT service code.
Are you logging stderr too? AFAIK, OOM errors are not logged if you are only logging stdout. Also, I would like to know more details about this test so I can simulate it with my client.
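One way to make sure stderr ends up in the same log as stdout is to merge the two streams when the process is launched. A minimal sketch, assuming the container is started from a wrapper script; `run_logged` is a made-up helper, and the child process here is a stand-in that writes one line to each stream rather than the real container start command.

```python
import os
import subprocess
import sys
import tempfile


def run_logged(cmd, log_path):
    """Run `cmd`, appending both stdout and stderr to `log_path`; return the exit code."""
    with open(log_path, "ab") as log:
        # stderr=subprocess.STDOUT merges the two streams into the same file.
        return subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)


# Stand-in child process (not the real container command).
child = [sys.executable, "-c",
         "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"]

with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "container.log")
    exit_code = run_logged(child, log_path)
    with open(log_path) as fh:
        captured = fh.read()

print(captured)
```

From a shell, the equivalent is redirecting with `> container.log 2>&1`.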
Alain, Can you send me the info on the case B results? The NPEs?
Revising the results of case study B (10 clients - 1 job/client):

a) All errors that cause jobs to fail are RFT/staging errors and look similar to the staging errors in case study A (see logs in results for study #5). So at least we have consistency in terms of the causes of job failure between case studies A and B.

b) NPEs appear (later) in the *client* when it tries to refresh the state of a job:
2005-01-21 20:39:58,195 ERROR throughput.ClientThread [Thread-2,run:136] Unable to refresh job.
java.lang.NullPointerException
>...at org.globus.exec.generated.service.ManagedJobServiceAddressingLocator.getManagedJobPortTypePort(ManagedJobServiceAddressingLocator.java:12)
>...at org.globus.exec.utils.client.ManagedJobClientHelper.getPort(ManagedJobClientHelper.java:29)
>...at org.globus.exec.client.GramJob.refreshStatus(GramJob.java:1536)
>...at org.globus.exec.service.test.throughput.ClientThread.run(ClientThread.java:134)
This NPE is due to a null job EPR being passed to the locator. It looks like it *might* be a bug in ClientThread, but I would need to spend more time on this.
Now that GRAM scalability has been improved, this study should be resumed.
I have a container running on Lucky0. I am running the wsrf gram scheduler test for fork to submit jobs (30 in all) in a loop, up to 10000 iterations. Every 100 iterations I run top and pmap on the globus container pid. I'm not sure what pmap will offer up, but it looked interesting :-). This should run over the weekend. We'll see if the container is still functional on Monday.
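The sampling scheme described above (run the test in a loop and snapshot the container every 100 iterations) might look roughly like this. It is not the attached Perl script: `long_run`, `sample_process` and `submit_batch` are hypothetical names, and the `top -b -n 1 -p PID` and `pmap PID` invocations assume the Linux procps versions of those tools.

```python
import subprocess


def sample_process(pid, run=subprocess.check_output):
    """Capture one `top` batch snapshot and the `pmap` listing for `pid`."""
    return {
        "top": run(["top", "-b", "-n", "1", "-p", str(pid)]),
        "pmap": run(["pmap", str(pid)]),
    }


def long_run(iterations, submit_batch, pid, every=100, sampler=sample_process):
    """Call `submit_batch(i)` `iterations` times, sampling `pid` every `every` iterations.

    `submit_batch` is a placeholder for one pass of the job submission test.
    Returns a list of (iteration, sample) pairs.
    """
    samples = []
    for i in range(1, iterations + 1):
        submit_batch(i)
        if i % every == 0:
            samples.append((i, sampler(pid)))
    return samples
```

Passing the sampler and the command runner as parameters keeps the loop testable without a live container, and the collected samples can be compared afterwards to look for memory growth.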
The tests on Lucky revealed a JVM bug: http://64.233.161.104/search?q=cache:udlaMTxJaksJ:bugs.sun.com/bugdatabase/view_bug.do%3Fbug_id%3D4959566+4F533F4C494E55583F491418160E4350500306&hl=en

So I moved to Vanguard.mcs.anl.gov, where I was able to perform a long-run test that lasted for more than 500,000 jobs over a 23-day period without the container crashing. See attachments for details.
Created an attachment (id=563) [details] client job submission program This is a perl script that builds and submits in a loop the globus_wsrf_gram_scheduler_test program.
Created an attachment (id=564) [details] The tail end of the log file from the long-run-test.pl
Created an attachment (id=565) [details] pmap output at the beginning of the run
Created an attachment (id=566) [details] pmap output at the end of the 2nd run The long-run-test program was executed for 10,000 iterations twice. So this is the output after the *second* long-run-test execution.
Created an attachment (id=567) [details] top output before the start of the job submissions
Created an attachment (id=568) [details] top output after all the job submissions
This campaign is done. Should be rerun after the 4.0 code has been frozen.