Bug 5617 - GRAM4 seg hangs with fork jobs
Status: RESOLVED FIXED
: GRAM
wsrf scheduler interface
: 4.0.5
: PC Linux
: P3 normal
: 4.2.1
: 4.0.x
Reported: 2007-10-15 16:26 by
Modified: 2008-07-18 14:11 (History)


Description From 2007-10-15 16:26:04
SEG hangs when the logfile, globus-fork.log, is on NFS. 

I traced this down to the "fork_starter.c" code. Specifically, the permissions
of the log file are set to 622.  The open/write C code in "fork_starter.c" has
a do-while loop around the "fcntl" call (see below).  If the file is on an NFS
mount, rc comes back as -1 with errno 11 (EAGAIN) because fcntl cannot get a
write lock over NFS. (Note: I am running rpc.rstatd on the system.) Because
every subsequent call to fcntl returns the same result, this becomes an
infinite loop.  If I change the file permissions to 666, it proceeds normally.
It also works if I skip the write lock by setting rc=0 in the EAGAIN case
(tried in my simple test program).

Below is the loop.

 -Jeff

    do
    {
        rc = fcntl(logfd, F_SETLKW, &lock);

        if (rc < 0)
        {
            switch (errno)
            {
                case EACCES:
                case EAGAIN:
                    rc = 1;
                    break;
                case EBADF:
                    globus_assert(errno != EBADF);
                    break;
                case EDEADLK:
                    globus_assert(errno != EDEADLK);
                    break;
                case EFAULT:
                    globus_assert(errno != EFAULT);
                    break;
                case EINTR:
                    rc = 1;
                    break;
            }
        }
    }
    while (rc == 1);
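
One way to keep a loop like the one above from spinning forever is to bound
the number of retries. The following is only a minimal sketch under stated
assumptions, not the fix that was eventually committed:
lock_with_retry_limit and its max_attempts parameter are hypothetical names,
and the one-second pause between attempts is an arbitrary choice.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not part of fork_starter.c: take a write lock on
 * the whole file, giving up after max_attempts failed tries.
 * Returns 0 on success, -1 on failure with errno set by fcntl. */
static int lock_with_retry_limit(int fd, int max_attempts)
{
    struct flock lock;
    int attempt;
    int saved_errno = 0;

    assert(max_attempts > 0);

    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;             /* 0 means "through end of file" */

    for (attempt = 0; attempt < max_attempts; attempt++)
    {
        if (fcntl(fd, F_SETLKW, &lock) == 0)
        {
            return 0;
        }

        saved_errno = errno;
        if (saved_errno != EINTR &&
            saved_errno != EAGAIN &&
            saved_errno != EACCES)
        {
            /* EBADF, EDEADLK, EFAULT, ...: retrying will not help */
            return -1;
        }

        sleep(1);               /* arbitrary pause before the next try */
    }

    /* Retries exhausted: report the last fcntl error to the caller. */
    errno = saved_errno;
    return -1;
}
```

With this shape, a caller that gets -1 back can log strerror(errno) and fail
the job instead of hanging.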
------- Comment #1 From 2007-10-17 10:42:39 -------
*** Bug 5620 has been marked as a duplicate of this bug. ***
------- Comment #2 From 2007-10-17 10:43:09 -------
Thanks for the report Jeff.  Joe is off til the end of the month, but we should
be able to make this change for 4.0.6.

-Stu
------- Comment #3 From 2008-01-14 10:22:08 -------
Jeff,

Joe and I discussed this some.  Seems this needs further investigation.  I'm
removing the 4.0.6 milestone.

-Stu
------- Comment #4 From 2008-02-01 09:39:12 -------
I've put a new version of the globus fork starter in
http://www-unix.mcs.anl.gov/~bester/patches/globus_fork_starter-0.4.tar.gz
which should do a better job of detecting errors in this situation before the
job is started and reporting them. I don't have access to a system that doesn't
have working fcntl locks, so I can't verify that this catches errors properly.
If this detects the problem for you, we can probably call that program in the
setup package to check that the logging file will work in practice.
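
For illustration only, a setup-time check of the kind described above could
look roughly like the sketch below. It is not the code in the patch tarball:
check_log_lockable is a hypothetical name, and the sketch assumes the log file
already exists and that a non-blocking F_SETLK attempt is an acceptable probe.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical pre-flight check: returns 0 if a write lock can be taken
 * and released on the file at path, otherwise the errno value of the
 * step that failed. */
static int check_log_lockable(const char *path)
{
    struct flock lock;
    int fd;
    int err = 0;

    assert(path != NULL);

    fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0)
    {
        return errno;
    }

    memset(&lock, 0, sizeof(lock));
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;   /* l_start = 0, l_len = 0: whole file */

    /* F_SETLK fails immediately instead of blocking like F_SETLKW. */
    if (fcntl(fd, F_SETLK, &lock) < 0)
    {
        err = errno;
    }
    else
    {
        lock.l_type = F_UNLCK;
        if (fcntl(fd, F_SETLK, &lock) < 0)
        {
            err = errno;
        }
    }

    close(fd);
    return err;
}
```

A setup script could call this on the configured log path and print
strerror() of any nonzero result rather than letting the first job discover
the problem.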
------- Comment #5 From 2008-05-14 15:10:01 -------
Any feedback on this patched version?
------- Comment #6 From 2008-07-18 14:11:38 -------
This fix is committed to the 4.2 branch (for 4.2.1), the 4.0 branch (for
4.0.8), and trunk.