os: fork+exec incompatible with 'write error returned by close' #22243

rsc · 2017-10-13T01:18:04Z

[Split off from discussion in #22220, which is about a related ETXTBSY problem.]

The conventional wisdom is that you must always be careful to check the result of close(2), because write(2) may accept data and then fail to write it, and the OS will only tell you in close(2). Usually there is mention of NFS at this point, but honestly I'm not sure how to demonstrate this behavior reliably in a real program. (If anyone knows, that would help.)

That is, let's assume this can happen:

T1: fd = open
T1: write(fd), success
T1: close(fd), error

In a multithreaded program that does fork+exec you have to worry about fds leaking into the programs started by exec. So we mark all our fds O_CLOEXEC. They still make it into the child process during fork, but they are closed automatically when the child does the +exec.

So this can happen:

T1: fd = open
T1: write(fd), success
3: T2: start fork+exec, forking P2 with its own copy of fd, called fd2 here
4: P2 does exec, which does close(fd2)
5: T1: close(fd)

Or this can happen:

T1: fd = open
T1: write(fd), success
3: T2: start fork+exec, forking P2 with its own copy of fd, called fd2 here
4: T1: close(fd)
5: P2 does exec, which does close(fd2)

In both cases, we end up with the same fd open in two processes, the parent that originally opened it and is going to fastidiously check the result of close, and the child that inherited it and is not going to check the result of close, because the close is implicit in the exec.

My question is this: if you have two fds pointing at the same underlying open file, and that open file has a pending write error, and then both fds get closed, which close operation reports the write error?

The first? The second? Neither of those is a good answer, because the parent, who we want to get the error, can be either the first or the second depending on the race with the child.

Both? That would be helpful in this case, although I think it would be surprising behavior.

The one that isn't a close-on-exec? That's not even possible, because if the parent close happens first then the kernel has no idea whether the child is going to close fd2 explicitly or during exec.

The first close that isn't a close-on-exec? That would work too, although I think it would also be surprising.

Looking at the Linux kernel I see no opportunity for differentiating the different closes. It looks like they will both do:

if (filp->f_op->flush)
	retval = filp->f_op->flush(filp, id);

I grepped all of Linux v4.12-rc6 (what I already had checked out) for \.flush *= and found that only fs/cifs, fs/ecryptfs, and handful of device drivers seem to have implementations. So it looks like, no, close really doesn't fail in practice, so this is moot.

If you ever did want to guarantee to see the error returned by close, though, I don't think Go can guarantee that - unless the kernel returns the error from all closes of an underlying open file.

/cc @ianlancetaylor @crawshaw @aclements

The text was updated successfully, but these errors were encountered:

rsc · 2017-10-13T01:18:14Z

I guess I will close this.

rsc · 2017-10-13T01:26:16Z

Twitter points out to me that apparently POSIX defines that close can return EINTR, meaning the fd didn't actually get closed and it's up to the caller to repeat the call! But on Linux at least that's not what EINTR means, if it can happen at all. https://lwn.net/Articles/576478/

The comments on that article also suggest that AFS (in particular OpenAFS) is the file system that most often returned errors on close, which makes sense I suppose.

ianlancetaylor · 2017-10-13T01:38:07Z

The Linux kernel NFS implementation does set .flush, and the implementation does return errors, so the standard reference to NFS seems accurate.

But I agree that it seems possible to lose those errors if there is a fork.

We could handle this in the syscall package, but is it worth it?

crawshaw · 2017-10-13T02:14:58Z

NFS is the best theoretical example, but if you want another you can construct a test around, there are a variety of ways to do this with a TCP socket.

The easiest way I can think of involves a fake network device interfering with packets. It may be possible to do something more portable with SO_LINGER.

rsc · 2017-10-13T15:07:48Z

We could handle this in the syscall package, but is it worth it?

If the flush implementation returns errors from only the first close, not also from subsequent closes, then even this single-threaded C program will miss the error:

fd = open(name, O_WRONLY|O_CLOEXEC)
if(write(fd) < 0)
	error()
if(fork() == 0) {
	exec()
	exit(1)
}
wait()
if(close(fd) < 0)
	error()

The implicit close during the child's exec will consume the error. Given that a single-threaded C program has the problem, I don't see what we can do in package syscall.

rsc · 2017-10-13T15:12:22Z

The alternative of course is that all the closes get an error, that fsync returns an error every time you call it, not just the first time. I got lost following the NFS code trying to figure out if that's true.

ianlancetaylor · 2017-10-13T17:33:24Z

As far as I can see the NFS code does not save the close/flush value, but I am not certain.

We could fix this in the syscall package as follows: syscall.Close acquires a read lock on a rwmutex across the actual system call. forkAndExecInChild acquires a write lock on the same rwmutex. Thus Close and forkAndExecInChild can not occur simultaneously (and, unfortunately, only one fork can happen at a time). In forkAndExecInChild the parent opens a pipe before the fork (on GNU/Linux this already happens if some SysProcAttr fields are set). After the fork, the parent waits for a response on the pipe. After the fork, the child walks through all the open file descriptors (we've cleverly stored the maximum file descriptor value from syscall.Open, syscall.Pipe, syscall.Socket, etc.). For each file descriptor marked close-on-exec, the child closes it. Any error from close is reported back to the parent on the pipe. When all file descriptors are closed, the child notifies the parent on the pipe that this is complete. The parent saves the errors, indexed by file descriptor; this is safe because the descriptors are, of course, open in the parent. When the parent sees the notification on the pipe, it releases the lock, permitting syscall.Close calls to proceed. When a syscall.Close closes a descriptor, it checks (and clears) the saved error. If the close call succeeded but there is a saved error, syscall.Close returns the saved error.

RalphCorderoy · 2017-10-13T17:47:24Z

syscall.Close acquires a read lock on a rwmutex across the actual system call

Things like syscall.Dup2 can also close FDs so they'd need blocking too.

rsc closed this as completed Oct 13, 2017

golang locked and limited conversation to collaborators Oct 13, 2018

gopherbot added the FrozenDueToAge label Oct 13, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

os: fork+exec incompatible with 'write error returned by close' #22243

os: fork+exec incompatible with 'write error returned by close' #22243

rsc commented Oct 13, 2017

rsc commented Oct 13, 2017

rsc commented Oct 13, 2017

ianlancetaylor commented Oct 13, 2017

crawshaw commented Oct 13, 2017

rsc commented Oct 13, 2017

rsc commented Oct 13, 2017

ianlancetaylor commented Oct 13, 2017

RalphCorderoy commented Oct 13, 2017

os: fork+exec incompatible with 'write error returned by close' #22243

os: fork+exec incompatible with 'write error returned by close' #22243

Comments

rsc commented Oct 13, 2017

rsc commented Oct 13, 2017

rsc commented Oct 13, 2017

ianlancetaylor commented Oct 13, 2017

crawshaw commented Oct 13, 2017

rsc commented Oct 13, 2017

rsc commented Oct 13, 2017

ianlancetaylor commented Oct 13, 2017

RalphCorderoy commented Oct 13, 2017