Bug 4751 - GRAM2 GRAM4 Performance Comparison
Summary: GRAM2 GRAM4 Performance Comparison
Status: RESOLVED FIXED
Product: GRAM
Component: Campaign
Version: 4.0.3
Platform: Macintosh All
Importance: P3 normal
Target Milestone: ---
Assigned To:
Depends on: 4858 5027
Blocks: 4036
Reported: 2006-10-05 13:53
Modified: 2008-02-05 15:04


Attachments
GRAM4 patch (see below) (21.99 KB, patch) - 2006-10-23 11:22, Martin Feller
GRAM4 patch v2 (23.07 KB, patch) - 2006-10-24 15:06, Martin Feller
Overview about how many jobs are in which state during a test with 1000 simple jobs submitted concurrently by Condor-G (19.16 KB, text/plain) - 2006-11-13 02:57, Martin Feller




Description From 2006-10-05 13:53:18
Title: GRAM2 GRAM4 Performance Comparison

Projects: OSG, TG

Technologies:    GRAM4, GRAM2

Definition:

Recently, the GRAM4 testing focus has been on analyzing and improving GRAM4
throughput for a typical OSG application scenario (campaign 4664).  Initially,
GRAM4 throughput was significantly worse than GRAM2's, but after bug fixes and
new optimizations the two services now perform very close to each other.
The OSG application scenario consisted of a job with 2 MB stageIn data, 2 MB
stageOut data and fileCleanUp.

GRAM4 is now ready for a comprehensive test comparison with GRAM2.  This will
single out individual GRAM features and provide performance and throughput
results for those specific interactions.  It will show whether certain GRAM4
features need further optimization, and it will provide results for a
comparison with the corresponding GRAM2 feature performance.

The testing will show the effects on performance and system resources by
modifying these variables:
    1) Creating a unique job directory for each job or sharing an existing one
    2) Varying the size of staging data including no staging directives at all
    3) Credential delegation: None, Shared, Each job (*)
    4) Sequential or parallel job submission
    5) Throttled or unthrottled job submission


Test Cases
###########

The following are the standard test cases that will be used for the above
testing variables:

   1) No staging, no delegation

   2) No staging, delegation

   3) Stage    In of 1 x 10KB file, delegation (*), unique job dir, fileCleanUp
   4) Stage   Out of 1 x 10KB file, delegation (*), unique job dir, fileCleanUp
   5) Stage InOut of 1 x 10KB file, delegation (*), unique job dir, fileCleanUp

   6) Stage    In of 1 x 1 MB file, delegation (*), unique job dir, fileCleanUp
   7) Stage   Out of 1 x 1 MB file, delegation (*), unique job dir, fileCleanUp
   8) Stage InOut of 1 x 1 MB file, delegation (*), unique job dir, fileCleanUp

   9)  Stage    In of 1 x 100 MB file, delegation (*), unique job dir,
fileCleanUp
   10) Stage   Out of 1 x 100 MB file, delegation (*), unique job dir,
fileCleanUp
   11) Stage InOut of 1 x 100 MB file, delegation (*), unique job dir,
fileCleanUp

   12) Stage    In of 100 x 1 MB file, delegation (*), unique job dir,
fileCleanUp
   13) Stage   Out of 100 x 1 MB file, delegation (*), unique job dir,
fileCleanUp
   14) Stage InOut of 100 x 1 MB file, delegation (*), unique job dir,
fileCleanUp

(*) credential delegation in job submissions to GRAM2: per job (required in
protocol)
    credential delegation in job submissions to GRAM4: shared (standard
functionality for all condor-g jobs)    


Test Scenarios
###########

software:
    GRAM4 service from 4.0.3 + patches
    GRAM2 service from 4.0.3
        - standard LRM job state monitoring where each job queries the LRM
        - Or should we use the SEG here with GRAM2?
    condor-g version 6.7.19
        - Use grid-monitor for remote job state monitoring
    custom condor-g from version 6.7.19
        - delegation code was removed in the custom version

Test results will be collected at -
http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0

For each test scenario executed, the effects on the system resources (cpu,
memory, paging, ...) of the client and the server machine will be collected.

I) ONE client submits ONE job at a time, 100 iterations, to GRAM2 and GRAM4

    Use wrapper script around globusrun-ws/globusrun for (1) and (2)
    Use custom condor-g for (1)
    Use condor-g for (2)
        - compare results from the globusrun and condor-g tests to understand
the overhead of the condor-g client.
    Use condor-g for (3) - (14)

    Client-side measurements for each iteration will be taken from just before
the remote service operation call until just after it returns.
    The mean, median and mode are to be reported for each test.
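
    As an illustration of this client-side measurement, here is a minimal Java
sketch (purely hypothetical; SubmissionTimer and submitJob() are invented
stand-ins, not part of the testing framework or of globusrun-ws/Condor-G)
that times each iteration and reports mean, median and mode:

        import java.util.Arrays;

        // Illustrative only: times a stand-in submission call over N iterations
        // and reports mean, median and mode of the per-iteration durations (msec).
        public class SubmissionTimer {

            public static void main(String[] args) throws Exception {
                int iterations = 100;
                long[] durations = new long[iterations];

                for (int i = 0; i < iterations; i++) {
                    long start = System.currentTimeMillis();
                    submitJob();                     // stand-in for the remote service operation call
                    durations[i] = System.currentTimeMillis() - start;
                }

                Arrays.sort(durations);

                long sum = 0;
                for (int i = 0; i < iterations; i++) {
                    sum += durations[i];
                }
                double mean = (double) sum / iterations;

                double median = (iterations % 2 == 0)
                    ? (durations[iterations / 2 - 1] + durations[iterations / 2]) / 2.0
                    : durations[iterations / 2];

                // mode: value with the longest run of equal entries in the sorted array
                long mode = durations[0];
                int bestRun = 1, run = 1;
                for (int i = 1; i < iterations; i++) {
                    run = (durations[i] == durations[i - 1]) ? run + 1 : 1;
                    if (run > bestRun) {
                        bestRun = run;
                        mode = durations[i];
                    }
                }

                System.out.println("mean=" + mean + " median=" + median + " mode=" + mode);
            }

            private static void submitJob() throws Exception {
                Thread.sleep(50);                    // placeholder for the actual submission
            }
        }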


II) ONE client submits 1000 jobs concurrently to GRAM2, GRAM4

    a) Throttled
    We need to determine the best throttle value for GRAM4.  Is 1000 jobs
enough to provide an accurate steady state performance picture?  
     We'll use Condor-G for all scenarios (1) - (14)

    b) Unthrottled
        We'll use Condor-G for all scenarios (1) - (14)

    Service-side measurements will be taken for the test duration, starting
when the first job resource is created and ending when the last job resource
has been destroyed.


III) M clients on M hosts submit N jobs concurrently to GRAM2, GRAM4

    a) Throttled
        We'll use Condor-G for all scenarios (1) - (14)

    i) Test with the GridFTP server running on a separate host

    b)  unthrottled:
        We'll use Condor-G for all scenarios (1) - (14)

    i) Test with the GridFTP server running on a separate host

    Service-side measurements will be taken for the test duration, starting
when the first job resource is created and ending when the last job resource
has been destroyed.

Deliverables:

1) Collected test results for above tests
2) GRAM comparison paper

Tasks:

1) Execute all tests in scenario I, collecting data
    a. 16 tests to GRAM2
    b. 16 tests to GRAM4

2) Execute all tests in scenario II, collecting data
    a. 14 throttled tests to GRAM2
    b. 14 unthrottled tests to GRAM2
    c. 14 throttled tests to GRAM4
    d. 14 unthrottled tests to GRAM4

3) Execute all tests in scenario III, collecting data
    a. 14 throttled tests to GRAM2
    b. 14 unthrottled tests to GRAM2
    c. do test cases 5, 8, 11, 14 throttled to GRAM2 with GridFTP on a
separate host
    d. do test cases 5, 8, 11, 14 unthrottled to GRAM2 with GridFTP on a
separate host
    e. 14 throttled tests to GRAM4
    f. 14 unthrottled tests to GRAM4
    g. do test cases 5, 8, 11, 14 throttled to GRAM4 with GridFTP on a
separate host
    h. do test cases 5, 8, 11, 14 unthrottled to GRAM4 with GridFTP on a
separate host

4) Add performance results to GRAM comparison paper

Time Estimate:

10 days
------- Comment #1 From 2006-10-13 09:24:20 -------
Recently reported in bug 4745 is the poor performance of a basic multi-job
submission.  A basic multi-job test for GRAM2 and GRAM4 should be added to the
set of tests currently run at http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/

I'm thinking of a multi-job with 2 sub-jobs, where each sub-job is a no-staging
job like rows #2 and #3.  The multi-job service and the GRAM services should all
run on the same host.  This keeps the testing simple, but still provides a
valuable benchmark against which to compare the performance experienced by users.
------- Comment #2 From 2006-10-20 09:58:10 -------
Peter mentioned some time ago that the communication should take place here
and not just via email. Right. That's why I continue to report here from
now on.

A test with concurrent job submission, throttled <--> unthrottled, has finished.
Results are on http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/ (3rd link)

I just realized that the tests so far did not include file cleanup
and that the jobs didn't have a unique job directory but shared a common
one, as described in the campaign.

Also, there are still problems with Condor during tests. In some tests
to GRAM4 the jobs are simply not submitted by Condor on the client side.
The client Condor logfile contains submission timeouts in these tests,
while on the server side the container logfile does not show errors.
I ran 20 tests with 10 jobs per test and reproduced this behaviour: one
test of the 20 didn't submit any job, with no errors on the server side.
I sent an e-mail to Jaime Frey from Condor with a description and the logs.

A document which describes the testing setup is currently being reviewed
and will hopefully be attached here soon.


A short summary for those who were not involved in the communication which
didn't take place on bugzilla:

* The automated testing software is finished and tests can be run now with
  clients globusrun, globusrun-ws and condor-g

* sequential job submission tests:
   1st link on http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/
   The entries in the "Test#" and "mean" columns of the result table
   are links, even if they don't look like links.
   The link in the "Test#" column shows the job description used for
   the test and how the job is submitted.
   The link in the "mean" column leads to a histogram that shows
   the distribution of the job durations.
   Some histograms (like Tests 4, 7, 16, 19) show strange behaviour.
   Maybe this is due to DestroyProvider errors that occurred in the
   container. I'll have a look at that.

* concurrent job submission tests:
   2nd link on http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/
     Be careful with the results, since sometimes jobs get stuck in
     the Condor pool:
     Test #2:
       Two jobs changed their state from running to idle in Condor
       and didn't change that state for about 4 minutes until I stopped them.
     Test #3:
       Similar here: one job wasn't started for a long time. The reason
       could be either GRAM2 or Condor here.
------- Comment #3 From 2006-10-23 11:22:12 -------
Created an attachment (id=1100) [details]
GRAM4 patch (see below)

Here's the patch which includes the changes to GRAM4 which
I used for testing. It was done on a fresh
"cvs co -r globus_4_0_branch ws-gram" 2 hours ago.

The changes are:

* Only one RunQueue exists. The number of RunThread objects
  working on that queue is configured via JNDI as before.
  All run threads work on this queue.

* RunQueueGroup.java is no longer needed.

* Some unnecessary import statements have been removed. There are
  still a lot of them that are not needed; I suggest removing them
  in the future.

* A test for the notifications sent by RFT to GRAM4 was added:
  only notifications that indicate that a transfer failed or has
  successfully finished will be processed in GRAM4
  (this was a bugzilla "campaign")

* Some additional logging statements (debug mode) have been added.

The patch must be applied to the code in the directory 
service/java/source/src/org/globus/exec/service/exec

service/java/source/src/org/globus/exec/service/exec/RunQueueGroup.java
must be removed, otherwise compilation fails, since RunQueueGroup uses
parts of RunQueue which no longer exist (categories of
RunQueues: RUN_QUEUE_STAGE_IN, ...).
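
For illustration only, the single-queue/multiple-worker pattern described
above can be sketched like this (a plain BlockingQueue drained by a
configurable number of threads; the names are invented and this is not the
actual RunQueue/RunThread code, which wires the thread count via JNDI):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Sketch only: one shared queue, a configurable number of worker threads
    // draining it. Class and method names are invented.
    public class SingleRunQueueSketch {

        private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<Runnable>();

        public SingleRunQueueSketch(int numRunThreads) {
            for (int i = 0; i < numRunThreads; i++) {
                Thread worker = new Thread(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                queue.take().run();             // block until work is available
                            }
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt(); // stop on shutdown
                        }
                    }
                }, "run-thread-" + i);
                worker.setDaemon(true);
                worker.start();
            }
        }

        public void enqueue(Runnable task) throws InterruptedException {
            queue.put(task);                                    // every producer feeds the same queue
        }
    }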

I hope all changes are ok with you, please review them.

I tested the code in a number of tests with 1000 concurrent jobs
and it worked fine.

comments?
------- Comment #4 From 2006-10-24 15:06:18 -------
Created an attachment (id=1105) [details]
GRAM4 patch v2

Here's a revised version of the patch.
The first one contained some unused code I forgot to remove.
Additionally, some hints from Jarek are incorporated in the new RunQueue.
------- Comment #5 From 2006-10-24 21:06:39 -------
Hi Martin.  Just wanted to say that Stu and I talked today and that the 1000
concurrent job runs should run through to completion (not stopped prematurely).
 Also I am curious how you are handling/reporting errors that happen during the
run.  Did you encounter any errors during the test on
http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/10_19_2006/?
------- Comment #6 From 2006-10-27 11:11:39 -------
Lisa:

All jobs run to completion now (if they run to completion at all), although I
still think the idea was not too bad. In any case we need some mechanism to
stop tests with problems in order not to let one test block the
following tests. We decided to stop a test if it doesn't finish within
a generously measured timeframe.

So far, every now and then tests fail or get stuck for various reasons:

* jobs don't get submitted by Condor
* jobs get submitted by Condor but Condor doesn't release them from
  a hold state
* jobs get stuck in the server-side Condor pool
* jobs get stuck in state WaitingForStateChanges. These jobs are not
  stuck in the server-side Condor pool.
  The reason could be that Condor doesn't log them being processed or
  that the SEG somehow misses that they are finished.
  Whatever the reason is: for those jobs MEJHome.jobStateChanged() is
  not called when they're processed, and so they stay stuck in
  WaitingForStateChanges.

Maybe this sounds bigger than it is, since these errors don't happen too
often, but they do occur in about every fourth to fifth test.

I don't know exactly what you mean by error and failure handling here:
What I did with tests that had failures and errors is that I didn't use
them and didn't publish their results on the webpage (except for some to
illustrate the kind of problem).
In the production tests that will start now we should not use them at all.
Additionally, all tests now run against a fresh container and don't run
against a container that had errors in a past test (see below).
Tell me if that's not what you were thinking about.

What happened in the meantime:

* The code which includes the experience from former tests is in CVS now
  (globus_4_0_branch).
  So this code is the latest production code and will be used for tests.
* I extended the testing framework in the way I described in the doc:
  All tests with job submissions to GRAM4 are submitted to a
  "Worker Container". It is started right before a test via a GRAM4 job
  to the "Controller Container" and terminated after the test finishes.
  After termination the logfile is saved and the persistence data is removed.
  This ensures that each test runs against a fresh container on the
  server side.

Lisa: tell me if you have comments on the Testing doc. I can add the changes
to the doc then.

So I'll start tests over the weekend against the latest branch code
and plan to run all tests in the next week (if not too many errors occur).

Comments welcome!

Martin
------- Comment #7 From 2006-11-02 09:33:49 -------
I changed one parameter in the Condor configuration on the client side
which affects the performance of job submissions to GRAM2
(GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE; see below). This parameter
wasn't explicitly set before, which is why I didn't realize it and its
importance. From monitoring during the test, the new value (50) seems
to be high enough, since there were rarely more than 40
gridmanager processes.

The values for concurrent job submissions to GRAM2 can be found on
http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/11_02_2006/

Here's a summary of all parameters that affect throughput and performance
and some comments by Jaime Frey on these parameters:


1. GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE (our setting: 2000)
##############################################
Limits the total number of jobs that Condor-G will have submitted to a
remote resource at any given time.

2. GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE (our setting: 50)
###############################################
the maximum number of jobs that can be in the process of being submitted
at any time. It affects both pre-WS GRAM and WS GRAM. If it's set the same
for both, then you don't have to worry about it.

3. GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE (default:5, our new setting: 50)
###########################################
This parameter only affects pre-WS GRAM. It defaults to 10. So if 20 jobs
complete at the same time, Condor-G will only restart the jobmanagers for 10.
As stage-out and cleanup finish for those 10, Condor-G will restart
jobmanagers for the remaining 10. If there are 5 submits going on as well,
then only 5 jobmanagers will be restarted at a time for completed jobs.
This can certainly slow down the runtime for your tests.

If your attempt is to show the throughput that a single client can expect
to see, then the defaults for these limits are fine. If you're attempting
to measure the throughput that a server can provide to a set of clients,
I would suggest raising  these limits to 3-4 times their default values.

4. ENABLE_GRID_MONITOR (our setting: TRUE)
######################
The GridMonitor allows you to submit many more jobs to a GT2 GRAM server
than is normally possible.

The biggest factor, though, is probably the grid monitor. When it's active,
jobmanagers are shut down and the grid monitor reports the status of all jobs
every minute. When Condor-G hears from the grid monitor that a job has
completed, it restarts the appropriate jobmanager. The jobmanager then has
to notice that the job is completed before it starts stage-out. This adds
overhead to job execution time, which will be especially noticeable for your
/bin/true jobs because they're so short. 
Without the grid monitor, the gatekeeper runs a jobmanager process for each
job. If several hundred jobs are submitted (across all users), the machine
slows to a crawl and becomes unusable. If most or all users use condor-g with
the grid monitor, the gatekeeper can handle a lot more simultaneously
submitted jobs. 
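
Put together, the client-side settings mentioned above would look roughly
like this in the Condor-G configuration (an illustrative fragment using the
values quoted in this comment, not the exact file used for the tests):

    # Condor-G client-side settings for the GRAM2/GRAM4 tests (illustrative)
    GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 2000
    GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 50
    GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 50
    ENABLE_GRID_MONITOR = TRUE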
------- Comment #8 From 2006-11-02 09:41:09 -------
While collecting values for concurrent job submission with Condor-G as
client and GRAM4 as server I found that jobs sometimes get stuck in certain
states. During my last tests with code from branch this happened again,
and more often, so I had a closer look.

The following problems only occurred in tests with job submissions to GRAM4.
BTW: all job submissions to GRAM2 ran without problems.
This makes me think that it's a GRAM4 <--> Condor problem.

Problem 1: Jobs get stuck in state StageInHold:
  Jobs submitted by Condor have a hold state set to "stageIn". So
  they run until they reach the state stageIn and must be released
  by Condor in order to be processed further.
  I sent an email to Jaime Frey to verify this but didn't get an answer
  so far.
  Unfortunately the jobs don't get released, and that causes submission
  timeouts.

Problem 2: Jobs get stuck in state StageInResponse:
  This has happened twice now, and both times GRAM4 didn't receive
  any notifications from RFT although the logfile indicates that RFT
  successfully processed the transfers.

Problem 3 (and most ugly): today all concurrent tests against GRAM4 failed:
  Reasons on the client-side:

  018 (2285.425.000) 11/02 02:51:31 Globus job submission failed!
    Reason: 0 java.net.SocketTimeoutException: Read timed out

  018 (2286.006.000) 11/02 02:59:31 Globus job submission failed!
    Reason: 0 GT4_GRAM_JOB_SUBMIT timed out

  If a test fails, then even a submission with globusrun-ws as client
  takes very long. It seems like there's really a connection problem
  in the container that does not occur in sequential submission.
  These connection problems didn't appear in job submissions with
  a small number of jobs.

These are too many errors to continue, so I suggest investing time
here to figure out what is going on:

1) I'm just doing a new setup of GT4 again. Maybe it was a "Monday setup".
2) If this doesn't work, I'll do a setup from a new CVS download.
3) If this doesn't work either, we should take a closer look at what happens.


Ok, maybe we don't need steps 2) and 3).
I just set up the container again and it seems to work much better in a
currently running test (although there was one submission error again, but a
different one).

The only reason I can think of for these strange connection errors:
The container (code from branch) was compiled with Java 1.4 some days ago.
That's the Java version they use on osg-041. So far I hadn't run a concurrent
test with Condor-G as client against that new container.
Then Jarek and I started profiling and had to install Java 1.5 since the
profiler demands it. So the container I used for the concurrent test was
executed by a different Java runtime than it was compiled with.
AFAIK that's the only thing that changed.

Maybe all errors 1) - 3) can be explained by that, since 2) is very new to me.
If anybody wants to see logfiles, I can provide them (in DEBUG mode for
GRAM and RFT) for all kinds of errors.
------- Comment #9 From 2006-11-07 13:06:20 -------
Update:

Since the time of my last entry (#8) there have been problems with job
submissions of 1000 concurrent jobs from Condor-G to GT4 from branch. I
mentioned these problems in #8, Problem 3. These timeouts didn't appear
in every test, but in about every second or third test. Sometimes just 1-4
jobs out of 1000 have problems, sometimes the first 10 jobs fail directly
(in which case the test is aborted).
The GT4 logfile does not indicate any problem here.

I realized that all jobs with the "GT4_GRAM_JOB_SUBMIT timed out" failure
message actually entered the GT4 container but got stuck in a hold state.
Jaime Frey from Condor verified that it's the default behaviour of Condor
to add a hold state to the job descriptions. When a job reaches the hold
state (before submission, or before stageIn in case there is stageIn) it must
be released from that hold by Condor in order to be processed further. Condor
waits 5 minutes to get a message from the job, and if there's no notification
from the job, the job is held. And that's exactly what we see in our case.

I don't think it's a hardware problem, since sequential job submissions
and concurrent job submissions to GRAM2 don't cause errors.

Finally I went back to the software versions I used for testing before
switching to code from branch, and this works without problems.

Here's the server-side software that does not cause problems at the moment:
* Container version globus_4_0_2.
* I then replaced WS-GRAM and RFT with versions from globus_4_0_branch in
  order to have the improvements gained during performance testing.
* Additionally, I exchanged axis.jar with a version from globus_4_0_branch,
  since the older version didn't close the TCP connections of notifications
  properly.

=> There have been no problems during 6 tests.
=> So it does not seem to be a Condor problem.

I talked to Stuart and Jarek about that.

Plan:
* Replace the container core of the globus_4_0_2 setup with core from
  globus_4_0_branch to check whether core could be the problem.
* If not: try the same (working) setup as above, but with globus_4_0_3
  instead of globus_4_0_2.
* Make sure all containers are compiled and run with the same Java
  version (1.4).
* If all this does not help and we still have problems with newer versions
  of GT: make sure that the axis.jar used by Condor-G is the same version
  as on the server side.
------- Comment #10 From 2006-11-13 02:57:18 -------
Created an attachment (id=1126) [details]
Overview about how many jobs are in which state during a test with 1000 simple
jobs submitted concurrently by Condor-G

Jarek made some changes to GRAM4 which brought quite good performance
improvements when we measured times with the profiler "YourKit".
Additionally, I measured some tests with a newer version of a security
library from Bouncy Castle, which brought some performance improvement
too.
During these time measurements 10 simple jobs were submitted sequentially
by "globusrun-ws".
But these improvements could not be seen at all during a concurrent,
unthrottled job submission of 1000 simple jobs from Condor-G to GRAM4.

Job Characteristics:
* jobs are dummy jobs (/bin/true), no staging, no fileCleanUp, no unique
  job directory per job
* jobs are executed in a condor-pool on the server-side with about 240 nodes


Explanation:

Short:
  In this scenario GRAM4 is fast enough. It's the server-side Condor pool
  that throttles performance, not GRAM4.
  => Performance improvements in job submissions with these simple jobs
  can't be measured with that pool.

Details:
  Since we have unthrottled submission and no staging before job submission,
  the number of jobs in the pool is often much higher than the number of
  available nodes. Sometimes up to 500-600 jobs are displayed by "condor_q".

  The attached file shows a snapshot taken every minute during the test,
  showing how many job resources are in which state.
  Example:
  -------------------------------------------------------------------
  2006-11-08 21:28:13
  -------------------------------------------------------------------
                     JOB STATES | #JOBS IN THAT STATE
  -------------------------------------------------------------------
                         CleanUp:    1
                            Done:  423
          WaitingForStateChanges:  576
  -------------------------------------------------------------------
  TOTAL #JOBS: 1000
  -------------------------------------------------------------------

  This means that at 21:28:13 all 1000 jobs had been submitted by the
  client (TOTAL #JOBS: 1000), 1 job is in state CleanUp, 423 jobs are fully
  processed (Done) and 576 are in state WaitingForStateChanges.

  The test ran from about 21:14 to 21:48 and took about 34 minutes.
  After 11 minutes all 1000 jobs had been submitted by the client.

  Jobs in states Done (1), WaitingForStateChanges (2) and PendingHold (3)
  don't cause work for the container, since they are fully processed (1),
  waiting for a signal from the SchedulerEventGenerator (2) or waiting
  to get released from the hold state by Condor on the client side (3).

  What can be seen in the attached file is that during each snapshot there
  are really not many resources that need to be processed. The biggest
  share of resources is either done or waiting, especially after all 1000
  jobs have been submitted by the client (about 21:25).
  So the container really doesn't have much work to do. We are not faster
  because the server-side Condor pool is throttling performance,
  not because GRAM4 is slow here.
  I watched the pool during the test and found that jobs are processed
  until the very end of the test. So it's not the case that Condor is fast
  and the SchedulerEventGenerator doesn't react fast enough.
  Our pool is able to process about 30 jobs per minute, at least when
  that many jobs (500-600) are queued. I assume this is management overhead.

  Jobs with staging are another issue.
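
  (For illustration, the aggregation behind each snapshot amounts to a simple
  tally of job states; the Java sketch below is hypothetical and not the
  actual monitoring code.)

      import java.util.Map;
      import java.util.TreeMap;

      // Illustrative only: tallies how many job resources are in each state,
      // producing the kind of per-minute snapshot shown in the attachment.
      // The input array stands in for whatever the monitoring script reads.
      public class StateSnapshot {

          public static void main(String[] args) {
              String[] states = { "Done", "Done", "WaitingForStateChanges", "CleanUp" };

              Map<String, Integer> counts = new TreeMap<String, Integer>();
              for (String state : states) {
                  Integer c = counts.get(state);
                  counts.put(state, (c == null) ? 1 : c + 1);   // count per state
              }

              for (Map.Entry<String, Integer> e : counts.entrySet()) {
                  System.out.println(e.getKey() + ": " + e.getValue());
              }
              System.out.println("TOTAL #JOBS: " + states.length);
          }
      }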
------- Comment #11 From 2006-11-13 03:39:47 -------
Latest tests:

Concurrent job submission from Condor-G -> GRAM4,
this time with the code that is supposed to work.

I ran the following 3 scenarios (same as on
http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/11_09_2006/):

1. stageIn, stageOut, shared job directory
2. no staging, shared job directory
3. stageIn, stageOut, fileCleanUp, unique job directory

(I know, the ordering is strange ... but I forgot to change it)

For each scenario 30 tests with 10 jobs each were run
(so altogether 90 tests with 10 jobs each)

Failures:
#########

Scenario 1:
  Test #10: all jobs got stuck in state "StageInResponse"
  Test #13: all jobs got stuck in state "StageInResponse"

Scenario 2:
  Test #1:  all jobs got stuck in state "PendingHold"
  Test #18: all jobs got stuck in state "PendingHold"
  Test #22: all jobs got stuck in state "PendingHold"
  Test #26: 2 jobs got stuck in state "PendingHold"

Scenario 3:
  Test #0:  all jobs got stuck in state "StageInResponse"
  Test #8:  all jobs got stuck in state "StageIn"
  Test #12: all jobs got stuck in state "StageInResponse"
  Test #18: all jobs got stuck in state "PendingHold"


Explanations we have so far:

jobs keep stuck in state "StageInResponse":
   Job resources are in this state when fileStageIn was started.
   Normally GRAM4 gets a notification from RFT when the transfer finished.
   GRAM4 didn't receive the notifications. I aborted the test after about
   10-15 minutes (or later).
   From the logfiles can be seen that GRAM4 subscribes for notifications of
   job state changes of RFT resources and notifications are sent when RFT
   resources change their state, but somehow they don't reach their target.
   We didn't find a solution here so far.

jobs keep stuck in state "PendingHold":
   This is what was discussed recently:
   The jobs wait to get released by Condor from that state but don't get
   releases. After 5 minutes Condor declares a submission timeout.
   Jarek found that in case jobs keep stuck in state PendingHold indeed no
   subscribe call for jobstate notifications entered the container. This of
   course explains the behaviour: Since there's no subscribe call, Condor
   does not get notified about job state changes and will never release
   resource from a hold state since it does not know about the states of the
   resources.
   I sent an email to Jaime for further information about the Condor-G code.
   but didn't get an answer so far.

jobs keep stuck in state "StageIn":
   This is strange, since logging of the jobs stops in the middle of a
   StateMachine.processStageInState(). And then nothing goes on for the next
   10 minutes until abortion.
   Jarek saw this problem once and from looking at the thread dump of the JVM
   it seems to be a very rare Axis (WebServices Engine) problem.

In case anybody is interested:
The whole output of the 90 tests can be found here:
www-unix.mcs.anl.gov/~feller/10JobsManyIterations_11_10_2006.tar.gz (303MB)
For information about where to find what: feller(at)mcs.anl.gov
------- Comment #12 From 2006-11-14 06:43:36 -------
We probably found the origin of all three failures I mentioned in #11.
At least, the 160 tests from last night (each with 10 jobs) ran without
any problems, and that is probably too many to call it coincidence.

It was a thread-safety issue in Axis which in the end seemed to cause all three
problems. Jarek found it yesterday.

Details in short:
The main reason for all the failures was that the SOAP messages in a
notification call had (sometimes!) not been correctly serialized. This was due
to the fact that a type mapping specified in the client- or server-side
wsdd file was somehow ignored (sometimes), and this led to wrong
serialization of the Java type and to the problems.
Client- or server-side in this context means it seemed to happen on both
sides:
* on the client-side when Condor-G subscribed for notifications of job state
  changes in order to release the jobs from a hold state. So due to the
  wrong serialization the subscribe calls didn't reach the container.
* on the server-side when GRAM subscribed for notifications of state changes
  in RFT resources: When RFT sent the notifications they didn't pass through
  Axis due to these serialization errors.

The first hint came from the container logs in full debug mode.
Java thread dumps of the container and the Condor-G client finally pointed to
the same code in Axis, where threads were stuck. All failures seem to be
caused by missing synchronization there.

I think this is hot stuff! I wouldn't have found this on my own without
the help of that Geronimo-guy :-)
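
To make the class of bug concrete, here is a generic Java illustration
(invented names, not the actual Axis code) of how an unsynchronized, lazily
populated registry can intermittently misbehave under concurrent access,
together with the straightforward synchronized fix:

    import java.util.HashMap;
    import java.util.Map;

    // Generic illustration of the failure class described above, NOT Axis code:
    // a registry (think "type mappings") shared between threads. Without
    // synchronization, the check-then-act below gives no visibility or
    // atomicity guarantees, so under concurrent load a lookup can
    // intermittently miss an entry or see the map in an inconsistent state --
    // the kind of "sometimes ignored" behaviour seen with the wsdd type mapping.
    public class RegistrySketch {

        private final Map<String, String> mappings = new HashMap<String, String>();

        // BROKEN: no synchronization around the shared HashMap.
        public String lookupUnsafe(String key) {
            String value = mappings.get(key);   // may miss an entry another thread just added
            if (value == null) {
                value = loadMapping(key);
                mappings.put(key, value);       // concurrent puts can corrupt the map
            }
            return value;
        }

        // FIXED: the same logic with access to the shared map synchronized.
        public String lookupSafe(String key) {
            synchronized (mappings) {
                String value = mappings.get(key);
                if (value == null) {
                    value = loadMapping(key);
                    mappings.put(key, value);
                }
                return value;
            }
        }

        private String loadMapping(String key) {
            return "mapping-for-" + key;        // stand-in for reading the wsdd configuration
        }
    }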
------- Comment #13 From 2006-11-14 14:41:01 -------
short info:

another 110 tests passed without problems with the updated software
=> we declare our problems from #11 solved.

But without new problems life would be boring:
some(!) RAID systems on OSG crashed; one of them contained our data.
The OSG admins assume everything can be recovered. If not: I saved the most
important things.
But osg-test1 will not be available for tests during the next few days.
------- Comment #14 From 2006-11-28 07:36:12 -------
I moved the location of the persistence data of the container from an
NFS location to a local disk.
=> some results of job submissions vary much less than before.

Persistence data on local partition:
http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/11_27_2006/

Persistence data on NFS partition:
http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/10_28_2006/

I didn't run all tests in 11_27_2006, just some to see the effect.
------- Comment #15 From 2006-12-04 10:21:13 -------
Performance information

1. Jarek's WS-GRAM patch:
     doesn't show improvement. Probably due to the fact that
     Jarek made it while I was still storing the persistence data on NFS,
     since what his change did was reduce the number of times the
     resource data is persisted to disk.

2. Connection sharing in RFT:
     this shows good improvement.

http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/12_04_2006/standard/
http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/12_04_2006/jareksPatch/
http://www-unix.mcs.anl.gov/~feller/GRAM/perf/4-0/12_04_2006/jareksPatchPlusConnSharing/

I did some measurements of calls to RFT (WS calls from GRAM). I compared the
time it takes for the called methods to be processed (a) with the time it
takes to call those methods from GRAM4 via a WS call (b). I found that the WS
overhead is quite big. In the following overview there are always two
measurements (N/M msec): the first one is with local transport enabled, the
second with local transport disabled.

Creation of an RFT resource:
 a) ReliableFileTransferFactoryService.createReliableFileTransfer(): 
       79/79 msec
 b) call to ReliableFileTransferFactoryService.createReliableFileTransfer()
    from GRAM:
       455/530 msec

Subscription for state changes of a RFT resource:
 a) ReliableFileTransferImpl.subscribe(): 
       28/10 msec
 b) call to ReliableFileTransferImpl.subscribe():
       378/448 msec

Start of an RFT resource: 
 a) ReliableFileTransferImpl.start():
       4/4 msec
 b) call to ReliableFileTransferImpl.start():
       407/501 msec

In both cases (local transport enabled/disabled) we lose more than 1 second
during WS communication between GRAM and RFT due to the WS overhead.
The positive effect of enabling LocalTransport is smaller than I expected. From
the logging output it seems that in both cases the messages go through Axis,
which is, to my knowledge, not necessary at all. Not only does it take
longer than calling the methods directly, it also burdens the container
with additional WS calls.
I must check whether creation, subscription and destruction of the RFT resource
and the subscription resource can be done without WS calls.
This of course can only be done if RFT is in the same container.
Are there any known problems with this?

BTW: getting a delegated credential in GRAM4 from the DelegationService
     is done without a WS call
------- Comment #16 From 2008-02-05 15:04:00 -------
A TeraGrid 2007 paper, "GT4 GRAM: A Functionality and Performance Study", was
written based on the work done here.

http://www.globus.org/alliance/publications/papers.php#TG07-GRAM