Bugzilla – Bug 1460
Condor Jobmanager missing several features
Last modified: 2008-07-18 15:24:32
The condor jobmanager is missing several useful features: 1) file transfer between client and server, 2) MPI, 3) (optional) XML-based user logs.
Created an attachment (id=279): Updated condor.pm jobmanager file
These improvements are made against an old version of condor.in. Any chance you could submit a patch based on condor.in in the gt 3.2 alpha release? Jaime, any comments on the proposed patch? -Stu
Hey, sorry for not replying sooner. Here are my initial reactions to the proposed change, in no particular order...
I think the default for XML logging should be false. Hard-coding whether to do XML logging in condor.pm concerns me a little. If an administrator changes this setting while any jobs are submitted, the jobmanagers for those jobs will get very confused. This problem could be avoided by having the poll() function guess the format of the log file by looking at the first line.
In the poll_xml_log() function, I don't see anything that unlinks the temp file created there. Also, could the encapsulation of the job log within the "jobfile" tag be done in memory (avoiding the temp file entirely), or would that be less efficient? I'm also confused as to why poll() removes the user log only if it's not in XML format.
The submit_event_user_notes change doesn't seem to be related to any of the features mentioned in the ticket.
Enabling file transfer may not be appropriate for some clusters. If a cluster has a shared filesystem, condor file transfer would be inefficient, though it shouldn't produce wrong behavior. Maybe this should be marked as an optional feature, though I'm not sure what default is best. This brings up a problem with the jobmanager-batch system interface. As far as I know, there's no way for condor.pm to know what (if any) files the client requested gram to stage in or out. This is especially important for input files for pools without a shared filesystem.
To run MPI jobs, a condor submit machine needs special configuration. If a client submits an MPI job to a condor submit machine that isn't configured properly, the job may sit idle in the queue forever. I think it'd be good to add a bit of code to the submit() function that tries to determine if the submit machine has been configured to run MPI jobs. It need not be foolproof, just catch the common cases.
The modified condor.pm appears to be based on an old version. Someone would have to massage it into the current condor.pm. That shouldn't be difficult.
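To make two of the suggestions above concrete (guessing the log format from the first line, and wrapping the log in a "jobfile" tag in memory rather than through a temp file), here is a rough sketch. It is written in Python purely for illustration; condor.pm itself is Perl, and this is not the code from the attachment. The function names are hypothetical. It assumes that an XML user log starts with "<", that a classic user log starts with a three-digit event number, and that the XML log is a sequence of elements with no root element and no XML declaration.

```python
import xml.etree.ElementTree as ET


def guess_log_format(first_line):
    """Guess the user log format from its first line.

    Assumption: XML logs begin with '<'; classic logs begin with a
    three-digit event number such as '000'.
    """
    stripped = first_line.lstrip()
    if stripped.startswith("<"):
        return "xml"
    if len(stripped) >= 3 and stripped[:3].isdigit():
        return "classic"
    return "unknown"


def parse_xml_events(log_text):
    """Wrap rootless XML event fragments in <jobfile> in memory and parse them.

    No temp file is created, so there is nothing to unlink afterwards.
    Assumption: log_text contains no XML declaration of its own.
    """
    wrapped = "<jobfile>" + log_text + "</jobfile>"
    root = ET.fromstring(wrapped)
    # Each direct child of <jobfile> is one logged event.
    return list(root)
```

With something like this, poll() would not need a hard-coded XML switch: it could read the first line of the job's log and dispatch on the guessed format, so a configuration change made while jobs are queued would not confuse their jobmanagers. Whether in-memory wrapping is efficient enough for very large logs is still an open question.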
Jaime - thanks for the input. Given the issues and changes required, we'll have to schedule some work for this, but I don't see it happening too soon.
I'd like to raise this bug back from the dead (last comment was from March 2004). There is some movement on issue (1) - file transfer between client/server. This has been implemented and thoroughly used on at least two OSG sites (UCSD and Caltech, I believe). There is some documentation here: http://osg-docdb.opensciencegrid.org/cgi-bin/ShowDocument?docid=382. Over the next couple of days, I'll see if I can coax it into a format appropriate for a patch, and attach it to this bug (if there is interest). As for issue (2), I'd really like to see this happen (with the features that Jaime outlined below). I realize that all the globus folk are rather busy, so I'll see if we can put some resources into this here at UNL.