Bugzilla – Bug 660
NFS vs. local installation
Last modified: 2009-03-13 15:01:49
References:

<http://bugzilla.globus.org/bugzilla/show_bug.cgi?id=135>
(This bug report was recently closed; in my opinion, it should not have been. I don't see an option to re-open it.)

<http://www-unix.globus.org/mail_archive/discuss/2003/01/msg00265.html>
(A description of the problem I recently posted to discuss@globus.org.)

It has been said that Globus 2.X is designed to be locally installed (unlike Globus 1.X, which has the install/deploy mechanism -- overly complex, but it worked). Here at SDSC, most applications, including Globus, are installed in a large NFS-mounted filesystem that's shared by several hundred workstations.

In one possible scenario, I install Globus 2.2.3 in /usr/local/apps/globus-2.2.3, which is on an NFS filesystem. There are, say, 100 workstations (all with the same hardware and OS) that need shared access to the Globus installation for the client programs and libraries. In addition, there are, say, 3 server systems sharing the same Globus installation and providing Globus services (gsigatekeeper, gsiftp, gris).

If I don't pay attention to the NFS issues, I get all three servers trying to write to the same $GLOBUS_LOCATION/var/globus-gatekeeper.log file. I don't want three systems trying to write simultaneously to the same log file, but the filesystem is configured to map "root" to "nobody" *and* it's mounted read-only, so none of them have permission to write to the log file anyway.

This is a well-known problem with a known workaround: make the var subdirectory a symlink to a locally mounted filesystem, say, /scratch/slocal/globus-2.2.3/var. All three servers have to have the exact same directory path for the symlink target (which is ok for SDSC, but could be a problem for some). Nothing under the var subdirectory needs to be visible to the client workstations, so var can just be a dangling symlink as seen from all systems other than the servers.

If that were the only issue, I wouldn't mind, but we're not done yet.
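A minimal sketch of the var workaround described above, run against a throwaway directory tree rather than a real installation. The `$demo` sandbox and the `local_var` variable are assumptions made for illustration; on a real system the two paths would be the NFS installation and a directory on local disk that exists under the same name on every server.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox so the sketch can run anywhere. On a real system these
# would be e.g. /usr/local/apps/globus-2.2.3 (NFS) and
# /scratch/slocal/globus-2.2.3/var (local disk).
demo=$(mktemp -d)
GLOBUS_LOCATION="$demo/nfs/globus-2.2.3"
local_var="$demo/local/globus-2.2.3/var"

# Fake installation with an existing log file.
mkdir -p "$GLOBUS_LOCATION/var"
touch "$GLOBUS_LOCATION/var/globus-gatekeeper.log"

# On each server: create the local target and copy any existing contents.
mkdir -p "$local_var"
cp -Rp "$GLOBUS_LOCATION/var/." "$local_var/"

# Once, from a host that can write to the installation:
# replace the var directory with a symlink to the local path.
rm -rf "$GLOBUS_LOCATION/var"
ln -s "$local_var" "$GLOBUS_LOCATION/var"

ls -l "$GLOBUS_LOCATION/var"
```

On hosts where the local path does not exist, the symlink simply dangles, which matches the observation that clients do not need anything under var.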
There are several files under $GLOBUS_LOCATION/etc that need to be distinct for each server system. Charles Bacon has said that the only such files are:

   globus-job-manager.conf
   grid-info-resource-ldif.conf
   grid-info-resource-register.conf
   grid-info.conf

At the time, I pointed out that the LDAP certificate and key also needed to be distinct for each server, but they've since been moved to /etc/grid-security/ldap . I suspect that the set of files under etc that need to be localized is a moving target, changing from one release to the next. Even if it doesn't change, it's not well documented, and there's no direct support for this kind of thing in the Globus installation procedures; all this stuff has to be done manually.

I've thought of making the entire $GLOBUS_LOCATION/etc directory a symlink to a local directory, as I do for $G_L/var -- but unlike $G_L/var, the $G_L/etc directory contains things that need to be visible to the clients. It's practical to replicate the etc directory across the 3 server systems, but not across the 100 client systems.

Something else that nobody has mentioned so far: there's also a $GLOBUS_LOCATION/tmp directory. On the installations I've checked, it just contains an empty subdirectory called "gram_job_state". I don't know what it's used for, or whether it needs to be localized for each server, or whether having it on a read-only filesystem is going to cause problems.

The fact that we keep thinking of new instances of the problem months after the initial response tells me that this area needs to be cleaned up.

So, there are (at least) three relevant classes of files under $GLOBUS_LOCATION:

1. Read-only files that are needed for both client and server systems, and can be shared by all systems.
   Examples:
      etc/* (with some exceptions)
      bin/*
      sbin/*
      libexec/*
      lib/*
      include/*

2. Read-only files that need to be localized for each server, which are not needed by clients.
   Examples:
      etc/globus-job-manager.conf
      etc/grid-info-resource-ldif.conf
      etc/grid-info-resource-register.conf
      etc/grid-info.conf

3. Writable files that need to be localized for each server, which are not needed by clients.
   Examples:
      var/globus-gatekeeper.log
      tmp/* (???)

If these classes of files could be separated into distinct directories, it would go a long way towards making NFS installations easier. (The distinction between classes 2 and 3 may not be important.)

My suggestion:

Step 0: Given the current architecture, clearly document in the Admin Guide (<http://www.globus.org/gt2.2/admin/guide-install.html>), in a step-by-step series of instructions, what needs to be done to share a Globus installation on a (possibly read-only) NFS-mounted filesystem. This would include an exhaustive list of which files under $G_L/etc need to be localized.

Step 1: Move all the shareable files in $GLOBUS_LOCATION/etc into the $GLOBUS_LOCATION/share directory. (The name "share" is highly suggestive, don't you think?) The $GLOBUS_LOCATION/etc directory would then contain *only* read-only files that are not needed by clients. Given this arrangement, I could do a Globus installation, then copy the etc, var, and possibly tmp directories to a local filesystem on each server, and replace the original etc, var, and tmp directories with symlinks (i.e., extend what I do now for $G_L/var to $G_L/etc and $G_L/tmp). I would no longer have to keep track of the arbitrary subset of files under $G_L/etc that need to be localized.

Step 2: Make the Globus installation procedure handle this automatically. If I set a new environment variable, $GLOBUS_LOCAL_DIR, pointing to a directory on a local filesystem, the installation procedure would automatically create the proper subdirectories as symlinks. Now I don't even have to remember that var, etc, and tmp are the directories I need to set up; the installation procedure would handle this for me.
(This could be done either in gpt-{build,install} or in gpt-postinstall.) If $GLOBUS_LOCAL_DIR is not set, everything is done the same way it is now.

There are other possible approaches, including the Globus-1.1.X style "deploy" directory. I've tried to suggest an approach that's as close as possible to the current architecture.
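To make Step 2 concrete, here is a rough sketch of what such a post-install step might do. Nothing like this exists in GPT: the `localize_dirs` function is hypothetical, $GLOBUS_LOCAL_DIR is only the variable proposed above, and the sketch assumes Step 1 has already been done, so that etc holds nothing clients need. It runs against a scratch directory, not a real installation.

```shell
#!/bin/sh
set -eu

# Hypothetical post-install helper (not part of GPT).
# $1 = installation root, $2 = local directory (empty means "not set").
localize_dirs() {
    root=$1
    local_dir=$2
    # Proposal: with no local directory given, behave exactly as today.
    [ -n "$local_dir" ] || return 0
    for d in etc var tmp; do
        # Skip directories that are missing or already symlinks.
        [ -d "$root/$d" ] && [ ! -L "$root/$d" ] || continue
        mkdir -p "$local_dir"
        mv "$root/$d" "$local_dir/$d"      # move the contents to local disk
        ln -s "$local_dir/$d" "$root/$d"   # leave a symlink in the install
    done
}

# Demo against a fake installation tree.
demo=$(mktemp -d)
root="$demo/nfs/globus"
mkdir -p "$root/etc" "$root/var" "$root/tmp/gram_job_state" "$root/share"

localize_dirs "$root" "$demo/local"
ls -l "$root"
```

Each additional server would still need its own copy of the local directory under the same path; the sketch only covers the first one.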
This guide would be a good idea. Right now, however, we are responding to cluster/NFS queries with a recommendation to just install into NFS, have a single gatekeeper, and use ganglia on the backend. That particular setup requires no extra steps.
This problem report has nothing to do with clusters. The situation I'm facing involves multiple client systems and a few server systems sharing a single Globus installation.
Understood. I was not attempting to address all of the needs in your response with my cluster response. A common case of what you're describing is a cluster install, and there is a good way to do that without modification. The rest will have to wait on more documentation. When that exists, this bug will move from assigned to fixed.
Sorry, but your response really didn't address anything I was asking about. I understand how to do installations on clusters.

The current procedure for installing Globus on a shared NFS-mounted filesystem is poorly documented, confusing, and clunky. Making it merely confusing and clunky would be an improvement, but not enough of one to justify closing this bug report.

In the meantime, I could really use a definitive list of which directories and files need to be localized for each server system and are not needed by client systems. (I'm assuming that client-only systems will not be bothered if these files are missing.) So far, I know about the $G_L/var directory and the following files under $G_L/etc:

   globus-job-manager.conf
   grid-info-resource-ldif.conf
   grid-info-resource-register.conf
   grid-info.conf

Is this a complete list?

In particular, please provide information about the $G_L/tmp directory. What is it for? Does it need to be writeable by server systems? Does it need to be visible to client systems?
I understand that there needs to be a distinct $GLOBUS_LOCATION/tmp directory for each server system (for the tmp/gram_job_state subdirectory), so add that to the list.

Also (though this isn't strictly a Globus issue), if you install GSI-OpenSSH, the $GLOBUS_LOCATION/etc/ssh directory needs to be *partially* localized for each system running the sshd. In particular, there need to be distinct copies of the key files (6 of them) -- but the ssh_config file needs to be visible on all systems running the ssh client. I'm fairly sure that the moduli and ssh_prng files can be shared among all client and server systems, as can the ssh_config and sshd_config files unless there's a need for system-specific customization.

Will somebody please take a look at this and tell me whether there's anything I haven't thought of?
Further investigation shows that the ssh key files may not be a problem. I've found that I can just create them as symbolic links to the system ssh key files. For example, if the system ssh keys are in the /etc/ssh directory, I can do something like the following:

    cd $GLOBUS_LOCATION/etc/ssh
    rm -f *key*
    ln -s /etc/ssh/*key* .

Even if the $GLOBUS_LOCATION/etc/ssh directory is shared across multiple systems (via an NFS mount), the symlinks will correctly point to the local keys on each system.
It may be worse than I thought.

I have an NMI 2.1 installation on a shared NFS filesystem, visible to numerous client machines and, so far, to a single server machine. The server is named giis (it happens to be a GIIS server, but that's beside the point for now). One of the client machines, which I'll use as an example, is elmak.

I want to install some Globus services on another system, orion. I temporarily shut down services on giis and replaced some files and directories (based on the list in this bug report) with symlinks into a directory under /var/globus, which is on a local non-NFS filesystem. On giis, I created /var/globus and copied the existing files to the appropriate locations. On orion, I did the same thing, editing the *.conf files to refer to the correct hostname.

Now the following files and directories are symlinks into /var/globus:

   etc/globus-job-manager.conf
   etc/grid-info-resource-ldif.conf
   etc/grid-info-resource-register.conf
   etc/grid-info.conf
   tmp/
   var/

On giis and orion, all these files and directories exist on local disk, which is what I need for the services to work properly. On elmak (and other client machines), I have not created a /var/globus directory, and the linked files do not exist.

Now I try to do a grid-info-search:

    elmak% grid-info-search -h giis.npaci.edu -b 'Mds-Vo-name=npaci, o=Grid' -x -nowrap '(objectClass=MdsHost)' dn
    /usr/local/apps/nmi-2.1/bin/grid-info-search: /usr/local/apps/nmi-2.1/etc/grid-info.conf: not found

Apparently $GLOBUS_LOCATION/etc/grid-info.conf needs to be visible on each client system, and must be unique for each server system. In fact, it may need to be unique on each client system. After I put everything back the way it was (etc/*.conf on the shared filesystem), I ran "grid-info-search" with no arguments on elmak; it gave me information about giis.sdsc.edu. (I suppose that makes some sense, since there is no MDS service on elmak.)

Hmm.
Now that I think about this, my guess is that grid-info.conf is used only by the client. If my guess is correct, I can leave grid-info.conf on the shared filesystem (i.e., it shouldn't be on the list). The only drawback of this is that all grid-info-search queries default to the information specified in the shared grid-info.conf file, rather than the local host.
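One way to see, from a client, which files under etc have become dangling symlinks (the situation behind the "not found" error above) is a `find` over the directory. This is only a sketch run against a scratch tree: the `$demo` directory and the files created in it are made up for illustration, and on a real client the starting point would be $GLOBUS_LOCATION/etc.

```shell
#!/bin/sh
set -eu

# Scratch tree standing in for $GLOBUS_LOCATION on a client machine.
demo=$(mktemp -d)
mkdir -p "$demo/etc"
echo "shared copy" > "$demo/etc/grid-info.conf"
# A symlink into a local directory that does not exist on this host.
ln -s "$demo/no-such-dir/grid-info-slapd.conf" "$demo/etc/grid-info-slapd.conf"

# Print every symlink whose target is missing here. "test -e" follows the
# link, so it fails exactly for the dangling ones.
dangling=$(find "$demo/etc" -type l ! -exec test -e {} \; -print)
echo "$dangling"
```

Anything this prints on a client is a file that client programs cannot read, so it should only list files that are truly server-only.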
I've made a little more progress figuring this stuff out.

It seems that grid-info.conf does not need to be localized. The main effect of sharing a single copy among all clients is that the host option for grid-info-search is going to be the same for all client systems. I don't think that's a problem. (I usually specify a host anyway.)

However, there seem to be two more files that need to be localized, beyond what we've figured out so far. Recall that I had originally installed Globus (actually NMI 2.1) on giis.sdsc.edu (aka giis.npaci.edu), which is the NPACI GIIS server; I'm now trying to set up some Globus services (primarily gridftp) on orion.sdsc.edu. The files grid-info-slapd.conf and grid-info-site-policy.conf both contain settings that are appropriate only for the GIIS server (I don't want GRISes reporting to orion, for example).

So, the current set of things that need to be localized *seems* to be:

   the var directory
   the tmp directory
   etc/globus-job-manager.conf
   etc/grid-info-resource-ldif.conf
   etc/grid-info-resource-register.conf
   etc/grid-info-site-policy.conf
   etc/grid-info-slapd.conf

(but *not* etc/grid-info.conf).

Let me emphasize again that organizing everything so that shared files are in one subdirectory and server-only files are in another subdirectory would have saved me a great deal of time.

(My gpt-wizard tool, <http://www.sdsc.edu/~kst/gpt-wizard/>, optionally handles this stuff automatically. The current version handles the set of directories and files that I thought were needed as of a couple of months ago; I'll update it once my understanding stabilizes.)
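The manual procedure for that list can be sketched as a loop that moves each item to local disk and leaves a symlink behind. This is an illustration only, run against a scratch tree: the `$demo` sandbox, the `items` and `local_dir` variables, and the placeholder file contents are all assumptions, and the list itself is the tentative one above, which may still be incomplete.

```shell
#!/bin/sh
set -eu

demo=$(mktemp -d)
GLOBUS_LOCATION="$demo/nfs/nmi-2.1"   # stands in for the shared install
local_dir="$demo/var/globus"          # stands in for /var/globus

# Tentative list of things to localize (grid-info.conf is deliberately absent).
items="var tmp
etc/globus-job-manager.conf
etc/grid-info-resource-ldif.conf
etc/grid-info-resource-register.conf
etc/grid-info-site-policy.conf
etc/grid-info-slapd.conf"

# Fake installation so the sketch can run anywhere.
mkdir -p "$GLOBUS_LOCATION/etc" "$GLOBUS_LOCATION/var" \
         "$GLOBUS_LOCATION/tmp/gram_job_state"
for f in $items; do
    case $f in etc/*) echo "placeholder" > "$GLOBUS_LOCATION/$f" ;; esac
done
echo "shared" > "$GLOBUS_LOCATION/etc/grid-info.conf"

# Move each item to local disk and replace it with a symlink.
for f in $items; do
    src="$GLOBUS_LOCATION/$f"
    dst="$local_dir/$f"
    [ -L "$src" ] && continue          # already localized
    mkdir -p "$(dirname "$dst")"
    mv "$src" "$dst"
    ln -s "$dst" "$src"
done

ls -l "$GLOBUS_LOCATION" "$GLOBUS_LOCATION/etc"
```

On a second server the symlinks already exist, so only the local side is needed: create the same paths under the local directory and edit the copied *.conf files for that host.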
tmp/gram_job_state is needed, and is _not_ automatically recreated by the jobmanager. If $G_L/tmp is symlinked to /tmp and /tmp is cleaned during boot, make sure that /tmp/gram_job_state is added again.
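A small sketch of the boot-time fix described above: recreate the directory if it is missing. The `ensure_gram_job_state` function is a hypothetical helper, and the ownership and permissions the jobmanager expects are not stated in this report, so none are set here. The demo uses a scratch directory in place of /tmp.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: make sure the job-state directory exists.
# $1 is the directory that $GLOBUS_LOCATION/tmp resolves to (e.g. /tmp).
ensure_gram_job_state() {
    mkdir -p "$1/gram_job_state"   # no-op if it already exists
}

# Demo against a scratch directory instead of the real /tmp.
demo=$(mktemp -d)
ensure_gram_job_state "$demo"
ensure_gram_job_state "$demo"      # safe to run again on every boot
ls -ld "$demo/gram_job_state"
```

Something along these lines could be called from a local boot script before the Globus services start.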
Reassigning to gt-dev@globus.org pending triage of these bugs following my departure