Bug 1460 - Condor Jobmanager missing several features
Status: NEW
Product: LRMA
Component: Jobmanagers
Version: unspecified
Platform: All All
Importance: P3 contribution
Target Milestone: ---
Assigned To:
Reported: 2003-12-10 08:52 by
Modified: 2008-07-18 15:24


Attachments
Updated condor.pm jobmanager file (12.51 KB, text/plain)
2003-12-10 08:53, Beth Kirschner


Description From 2003-12-10 08:52:18
The condor jobmanager is missing several useful features:

1) file transfer between client/server
2) MPI
3) (optional) XML-based user logs
------- Comment #1 From 2003-12-10 08:53:52 -------
Created an attachment (id=279)
Updated condor.pm jobmanager file
------- Comment #2 From 2004-01-15 17:12:31 -------
These improvements are made against an old version of condor.in. Any chance
you could submit a patch based on the condor.in in the GT 3.2 alpha release?

Jaime, any comments on the proposed patch?

-Stu
------- Comment #3 From 2004-03-01 22:34:24 -------
Hey, sorry for not replying sooner. Here are my initial reactions to the
proposed change, in no particular order...

I think the default for XML logging should be false.

Hard-coding whether to do XML logging in condor.pm concerns me a little. If an
administrator changes this setting while jobs are still in the queue, the
jobmanagers for those jobs will get very confused. This problem could be avoided
by having the poll() function guess the format of the log file by looking at the
first line.
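A minimal sketch of the first-line check, written in Python rather than the Perl of condor.pm. It assumes that a classic user log begins with a three-digit event number (e.g. "000 (...) Job submitted from host: ...") and that an XML log begins with a "<" character; the function names are hypothetical. An empty log (job not yet written to) comes back as "unknown", so poll() could simply try again later.

```python
def guess_log_format(first_line: str) -> str:
    """Heuristically classify a Condor user log by its first line."""
    stripped = first_line.lstrip()
    if stripped.startswith("<"):
        # Assumption: XML logs start with markup.
        return "xml"
    # Classic events start with a three-digit event number such as "000".
    if len(stripped) >= 3 and stripped[:3].isdigit():
        return "classic"
    return "unknown"


def guess_log_format_from_file(path: str) -> str:
    """Read only the first line of the log and classify it."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return guess_log_format(fh.readline())
```

Because the format is decided per log file, changing the site-wide setting would only affect jobs submitted afterwards.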

In the poll_xml_log() function, I don't see anything that unlinks the temp file
created there. Also, could the encapsulation of the job log within the "jobfile"
tag be done in memory (avoiding the temp file entirely), or would that be less
efficient?
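One way to do the encapsulation in memory, sketched in Python with the standard xml.etree.ElementTree module: read the log text, wrap it in a "jobfile" element as a string, and parse the result, so there is no temp file to unlink. This assumes the XML log is a sequence of well-formed event elements with no enclosing root element. The function name and the sample event markup in the usage note are illustrative only.

```python
import xml.etree.ElementTree as ET


def parse_xml_user_log(log_text: str) -> list:
    """Wrap the raw event stream in a <jobfile> root and parse it in memory."""
    # An XML declaration is only legal at the very start of a document,
    # so drop one if the log happens to begin with it.
    if log_text.lstrip().startswith("<?xml"):
        log_text = log_text.split("?>", 1)[1]
    root = ET.fromstring("<jobfile>" + log_text + "</jobfile>")
    # Each child of the synthetic root is one logged event.
    return list(root)
```

For example, parse_xml_user_log('<c/>\n<c/>\n') returns two elements. Memory use grows with the log size, which is unlikely to matter for a single job's user log.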

I'm also confused as to why poll() removes the user log only if it's not in XML
format.

The submit_event_user_notes doesn't seem to be related to any of the features
mentioned in the ticket.

Enabling file transfer may not be appropriate for some clusters. If a cluster
has a shared filesystem, condor file transfer would be inefficient, though it
shouldn't produce wrong behavior. Maybe this should be marked as an optional
feature, though I'm not sure what default is best.
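A sketch of making file transfer a site option, in Python. The mode parameter stands for a hypothetical jobmanager setting; the emitted lines use Condor submit-description commands (should_transfer_files, when_to_transfer_output, transfer_input_files). With IF_NEEDED, Condor transfers files only when the job runs on a machine outside the submit machine's filesystem domain, which may be a reasonable default for clusters with a shared filesystem. The input_files list is also hypothetical; the jobmanager would need some way to learn it.

```python
def file_transfer_lines(mode: str, input_files=None) -> list:
    """Return submit-description lines controlling Condor file transfer."""
    mode = mode.upper()
    if mode not in ("YES", "NO", "IF_NEEDED"):
        raise ValueError(f"unsupported file-transfer mode: {mode}")
    lines = [f"should_transfer_files = {mode}"]
    if mode != "NO":
        # Copy output back to the submit machine when the job exits.
        lines.append("when_to_transfer_output = ON_EXIT")
        if input_files:
            lines.append("transfer_input_files = " + ", ".join(input_files))
    return lines
```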

This brings up a problem with the jobmanager-batch system interface. As far as I
know, there's no way for condor.pm to know what (if any) files the client
requested gram to stage in or out. This is especially important for input files
for pools without a shared filesystem.

To run MPI jobs, a Condor submit machine needs special configuration. If a
client submits an MPI job to a Condor submit machine that isn't configured
properly, the job may sit idle in the queue forever. I think it'd be good to add
a bit of code to the submit() function that tries to determine whether the
submit machine has been configured to run MPI jobs. It need not be foolproof,
just catch the common cases.
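A sketch of such a heuristic, in Python. It assumes the dedicated-scheduling convention in which execute machines serving MPI jobs advertise a DedicatedScheduler attribute of the form "DedicatedScheduler@<submit host>". The caller is assumed to have already collected that attribute from each machine ad (for example from condor_status output); that collection step and the function name are hypothetical and not shown here.

```python
def mpi_submit_configured(submit_host: str, dedicated_scheduler_values) -> bool:
    """Return True if any machine ad names this host's dedicated scheduler.

    dedicated_scheduler_values holds the DedicatedScheduler attribute of each
    machine ad, or None where the attribute is absent.
    """
    expected = f"DedicatedScheduler@{submit_host}".lower()
    for value in dedicated_scheduler_values:
        if value is None:
            continue
        # Attribute values may arrive quoted, e.g. "DedicatedScheduler@host".
        if value.strip().strip('"').lower() == expected:
            return True
    return False
```

If this returns False, submit() could reject an MPI request with an error instead of leaving the job idle.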

The modified condor.pm appears to be based on an old version. Someone would have
to massage it into the current condor.pm. That shouldn't be difficult.
------- Comment #4 From 2004-03-02 14:03:56 -------
Jaime - thanks for the input. Given the issues and changes required, we'll have
to schedule some work for this, but I don't see it happening too soon.
------- Comment #5 From 2006-06-20 23:14:06 -------
I'd like to raise this bug back from the dead (last comment was from March
2004).  There is some movement on issue (1) - file transfer between
client/server.  This has been implemented and thoroughly used on at least two
OSG sites (UCSD and Caltech, I believe).  There is some documentation here:
http://osg-docdb.opensciencegrid.org/cgi-bin/ShowDocument?docid=382.  Over the
next couple of days, I'll see if I can coax it into a format appropriate for a
patch, and attach it to this bug (if there is interest).

As for issue (2), I'd really like to see this happen (with the features that
Jaime outlined above). I realize that all the Globus folks are rather busy, so
I'll see if we can put some resources into this here at UNL.