Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: StartProcess ETXTBSY race on Unix systems #22315

Open
rsc opened this issue Oct 18, 2017 · 30 comments
Open

os: StartProcess ETXTBSY race on Unix systems #22315

rsc opened this issue Oct 18, 2017 · 30 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@rsc
Copy link
Contributor

rsc commented Oct 18, 2017

Modern Unix systems appear to have a fundamental design flaw in the interaction between multithreaded programs, fork+exec, and the prohibition on executing a program if that program is open for writing.

Below is a simple multithreaded C program. It creates 20 threads all doing the same thing: write an exit 0 shell script to /var/tmp/fork-exec-N (for different N), and then fork and exec that script. Repeat ad infinitum. Note that the shell script fds are opened O_CLOEXEC, so that an fd being written by one thread does not leak into the fork+exec's shell script of a different thread.

On my Linux workstation, this program produces a never-ending stream of ETXTBSY errors. The problem is that O_CLOEXEC is not enough. The fd being written by one thread can leak into the forked child of a second thread, and it stays there until that child calls exec. If the first thread closes the fd and calls exec before the second thread's child does exec, then the first thread's exec will get ETXTBSY, because somewhere in the system (specifically, in the child of the second thread), there is an fd still open for writing the first thread's shell script, and according to modern Unix rules, one must not exec a program if there exists any fd anywhere open for writing that program.

Five years ago this bit us because cmd/go installed cmd/cgo (that is, copied the binary from a temporary location to somewhere/bin/cgo) and then executed it. To fix this we put a sleep+retry loop around the fork+exec of cgo when it gets ETXTBSY. Now (as of last week or so) we don't ever install cmd/cgo and execute it in the same cmd/go process, so that specific race is gone, although as I write this cmd/go still has the sleep+retry loop, which I intend to remove.

Last week this bit us again because cmd/go updated a build stamp in the binary, closed it, and executed it. The resulting flaky ETXTBSY failures were reported as #22220. A pending CL fixes this by not updating the build stamp in temporary binaries, which are the main ones we execute. There's still one case where we write+execute a program, which is go test -cpuprofile x.prof pkg. The cpuprofile flag (and a few others) cause cmd/go to leave the pkg.test in the current directory for debugging purposes but also run the test. Luckily running the test is currently the final thing cmd/go does, and it waits for any other fork+exec'ed programs to finish before fork+exec'ing the test. So the race cannot happen in this case.

In general this race is going to happen every time anyone writes a program that both writes and executes a program. It's easy to imagine other build systems running into this, but also programs that do things like unzip a zip file and then run a program inside it - think a program supervisor or mini container runtime. As soon as there are multiple threads doing fork+exec at the same time, and one of them is doing fork+exec of a program that was previously open for write in the same process, you have a mysterious flaky problem.

It seems like maybe Go should take care of this, if possible. We've now hit it twice in cmd/go, five years apart, and at least this past time it took the better part of a day to figure out. (I don't remember how long it took five years ago, in part because I don't remember anything about discovering it five years ago. I also don't want to rediscover all this five years from now.)

There are a few hacks we could use:

  • In os.StartProcess, if we see ETXTBSY, sleep 100ms and try again, maybe a few times, up to say 1 second of sleeping. In general we don't know how long to sleep.
  • Arrange with a locking mechanism that close must never complete during a fork+exec sequence. The end of the fork+exec sequence needs to be the point where we know the close-on-exec fds have been closed. Unfortunately there is no portable way to identify that point.
    • If the exec fails and the child tells us and exits, we can wait for the exit. That's easy.
    • If the exec succeeds, we find out because the exec closes the child's end of the status pipe, and we get EOF.
      • If we know that an OS does close-on-exec work in increasing fd order, then we could also track the maximum fd we've opened and move the status pipe above that. Then seeing the status pipe close would mean all other fds are closed too.
      • If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure would help.)
    • It may not be OK to block all closes on a wedged fork+exec (in general an exec'ed program may be loaded from some slow network server).
  • Note that vfork(2) is not a solution. Vfork is defined as the parent does not continue executing until the child is no longer using the parent's memory image. In the case of a successful exec, at least on Linux, vfork releases the memory image before doing any of the close-on-exec work, so the parent continues running before the child has closed the fds we care about.

None of these seem great. The ETXTBSY sleep, up to 1 second, might be the best option. It would certainly reduce the flake rate and in many cases would probably make it undetectable. It would not help exec of very slow-to-load programs, but that's not the common case.

I wondered how Java deals with this, and the answer seems to be that Java doesn't deal with this. https://bugs.openjdk.java.net/browse/JDK-8068370 was filed in 2014 and is still open.

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <errno.h>
#include <stdint.h>

void* runner(void*);

int
main(void)
{
	int i;
	pthread_t pid[20];

	for(i=1; i<20; i++)
		pthread_create(&pid[i], 0, runner, (void*)(uintptr_t)i);
	runner(0);
	return 0;
}

char script[] = "#!/bin/sh\nexit 0\n";

void*
runner(void *v)
{
	int i, fd, pid, status;
	char buf[100], *argv[2];
	
	i = (int)(uintptr_t)v;
	snprintf(buf, sizeof buf, "/var/tmp/fork-exec-%d", i);
	argv[0] = buf;
	argv[1] = 0;
	for(;;) {
		fd = open(buf, O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0777);
		if(fd < 0) {
			perror("open");
			exit(2);
		}
		write(fd, script, strlen(script));
		close(fd);
		pid = fork();
		if(pid < 0) {
			perror("fork");
			exit(2);
		}
		if(pid == 0) {
			execve(buf, argv, 0);
			exit(errno);
		}
		if(waitpid(pid, &status, 0) < 0) {
			perror("waitpid");
			exit(2);
		}
		if(!WIFEXITED(status)) {
			perror("waitpid not exited");
			exit(2);
		}
		status = WEXITSTATUS(status);
		if(status != 0)
			fprintf(stderr, "exec: %d %s\n", status, strerror(status));
	}
	return 0;
}
@rsc rsc added this to the Go1.10 milestone Oct 18, 2017
@rsc
Copy link
Contributor Author

rsc commented Oct 18, 2017

@gopherbot
Copy link

Change https://golang.org/cl/71570 mentions this issue: cmd/go: skip updateBuildID on binaries we will run

@gopherbot
Copy link

Change https://golang.org/cl/71571 mentions this issue: cmd/go: delete ETXTBSY hack that is no longer needed

@RalphCorderoy
Copy link

Userspace workarounds seem flawed or less than ideal. This is a kernel
problem, like O_CLOEXEC. Perhaps lobby for a O_CLOFORK that's similar
but close on fork instead. The writer would open, write, close, fork,
exec so wouldn't make use of it, but any other thread that forks
wouldn't carry the FD with it so the writer's close would succeeding in
nailing the sole, final, reference to the "open file description", as
POSIX calls it.

@ianlancetaylor
Copy link
Contributor

O_CLOFORK is a good idea. Does anybody want to suggest that to the Linux kernel maintainers? I expect that if someone can get it into Linux it will flow through to the other kernels.

I'm going to repeat a hack I described elsewhere that I believe would work for pure Go programs.

  • record the highest file descriptor returned by syscall.Open, syscall.Socket, syscall.Dup, etc.
  • add a new RWMutex in syscall: forkMutex
  • during syscall.Close, acquire a read lock on forkMutex
  • in syscall.forkAndExecInChild acquire a write lock on forkMutex, and
  • open a pipe in the parent (as we already do if UidMappings is set), and
  • in the child, loop through the descriptors up to the highest one,
  • closing each one that is marked close-on-exec, then close the pipe to the parent
  • in the parent, when the pipe is closed, release the forkMutex lock

The effect of this should be that when syscall.Close returns, we know for sure that there is no forked child that has an open copy of the descriptor.

The disadvantages are that all forks are serialized, and that all forks waste time closing descriptors that will shortly be closed anyhow. Also, of course, forks temporarily block closes, but that is unlikely to be significant.

@RalphCorderoy
Copy link

O_CLOFORK is a good idea. Does anybody want to suggest that to the Linux kernel maintainers?

I'm happy to have a go, but I'm a nobody on that list. I was assuming folks here might have the ear of a Google kernel developer or two in that area that would vet the idea and suggest it to the list if worthy. :-)

during syscall.Close, acquire a read lock on forkMutex

And syscall.Dup2 and Dup3 as they may cause newfd to close.

Do syscall.Open et al also synchronise with forkMutex somehow? I'm wondering if they can be creating more FDs, either above or below the highwater mark, whilst forkAndExecInChild is looping, closing close-on-exec ones.

@ianlancetaylor
Copy link
Contributor

Is there a place to file a feature request against the Linux kernel? I know nothing about the kernel development process. I hear it uses git.

Agree about Dup2 and Dup3.

As far as I can see it doesn't matter if syscall.Open and friends create a new FD while the child is looping, because the child won't see the new descriptor anyhow.

@rsc
Copy link
Contributor Author

rsc commented Oct 18, 2017

@ianlancetaylor thanks, yes, the explicit closes would solve the problem with slow execs, which would be nice. That might make this actually palatable. You also don't even need the extra pipe if you use vfork in this approach.

I agree with @RalphCorderoy that there's a race between the "maintain the max" and "fork", in that Open might create a new fd, then fork runs in a different thread before Open can update the max. But since fds are created lowest-available, it should suffice for the child to assume that max is, say, 10 larger than it is.

Also note that this need not be an RWMutex (and for that matter the current syscall.ForkMutex need not be an RWMutex either). It just needs to be an "either-or" mutex. An RWMutex allows N readers or 1 writer. The mutex we need would allow N of type A or N of type B, just never a mix. If we built that (not difficult, I don't think), then programs that never fork would not serialize any of their closes, and programs that fork a lot but don't close things would not serialize any of their forks.

O_CLOFORK would require having fcntl F_SETFL/F_GETFL support for that bit too, and it would complicate fork a little more than it already is. An alternative that would be equally fine for us would be a "close all fd's above" or "tell me the maximum fd of my process" syscall. I don't know if a new bit or a new syscall is more likely.

@rsc
Copy link
Contributor Author

rsc commented Oct 18, 2017

I should maybe also note that macOS fixes this problem by putting #if 0 around the ETXTBSY check in the kernel implementation of exec. That would be a third option for Linux although probably less likely than the other two.

@RalphCorderoy
Copy link

I've emailed linux-kernel@vger.kernel.org. Will reference an archive once it appears.
If they're unpersuaded, then there's the POSIX folks at Open Group; they have a bug tracker.

@RalphCorderoy
Copy link

linux-kernel mailing-list archive of post: https://marc.info/?l=linux-kernel&m=150834137201488

gopherbot pushed a commit that referenced this issue Oct 19, 2017
On modern Unix systems it is basically impossible for a multithreaded
program to open a binary for write, close it, and then fork+exec that
same binary. So don't write the binary if we're going to fork+exec it.

This fixes the ETXTBSY flakes.

Fixes #22220.
See also #22315.

Change-Id: I6be4802fa174726ef2a93d5b2f09f708da897cdb
Reviewed-on: https://go-review.googlesource.com/71570
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Oct 20, 2017
This hack existed because cmd/go used to install (write) and then run
cmd/cgo in the same invocation, and writing and then running a program
is a no-no in modern multithreaded Unix programs (see #22315).

As of CL 68338, cmd/go no longer installs any programs that it then
tries to use. It never did this for any program other than cgo, and
CL 68338 removed that special case for cgo.

Now this special case, added for #3001 long ago, can be removed too.

Change-Id: I338f1f8665e9aca823e33ef7dda9d19f665e4281
Reviewed-on: https://go-review.googlesource.com/71571
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@bradfitz
Copy link
Contributor

What's the plan here for Go 1.10?

@RalphCorderoy, looks like you never got a reply, eh?

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.11 Dec 6, 2017
@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 27, 2018
@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jun 27, 2018
@ianlancetaylor
Copy link
Contributor

Looks like Solaris and macOS and OpenBSD have O_CLOFORK already. Hopefully it will catch on further.

@bcmills
Copy link
Contributor

bcmills commented Dec 15, 2022

I observed what I believe is an equivalent race for pipes when investigating #36107.

The race can cause a Write call to a pipe to spuriously succeed (instead of returning EPIPE) after the read side of the pipe has been closed.

The mechanism is the same:

  • The parent creates the pipe and (with syscall.ForkLock held) sets it to be CLOEXEC.
  • The parent process calls os.StartProcess, which forks a child process.
    • (The child process inherits a copy of the file descriptors for the pipe.)
  • Before the child process has reached its exec, the parent process closes the read side of the pipe.
    • (However, the FD for that side remains open in the child process.)
  • The parent process calls Write on the write side, causing the kernel to buffer the write (since the pipe FD is still open in the child). The Write succeeds.
  • Finally, the child process reaches its exec call, closing its copy of the read FD and dropping the buffered bytes.

This race can be observed by running os_test.TestEPIPE concurrently with a call to os.StartProcess.

@gopherbot
Copy link

Change https://go.dev/cl/458015 mentions this issue: os: clean up tests

@gopherbot
Copy link

Change https://go.dev/cl/458016 mentions this issue: os/exec: retry ETXTBSY errors in TestFindExecutableVsNoexec

gopherbot pushed a commit that referenced this issue Dec 16, 2022
I made this test parallel in CL 439196, which exposed it to the
fork/exec race condition described in #22315. The ETXTBSY errors from
that race should resolve on their own, so we can simply retry the call
to get past them.

Fixes #56811.
Updates #22315.

Change-Id: I2c6aa405bf3a1769d69cf08bf661a9e7f86440b4
Reviewed-on: https://go-review.googlesource.com/c/go/+/458016
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
gopherbot pushed a commit that referenced this issue Jan 19, 2023
- Use testenv.Command instead of exec.Command to try to get more
  useful timeout behavior.

- Parallelize tests that appear not to require global state.
  (And add explanatory comments for a few that are not
  parallelizable for subtle reasons.)

- Consolidate some “Helper” tests with their parent tests.

- Use t.TempDir instead of os.MkdirTemp when appropriate.

- Factor out subtests for repeated test helpers.

For #36107.
Updates #22315.

Change-Id: Ic24b6957094dcd40908a59f48e44c8993729222b
Reviewed-on: https://go-review.googlesource.com/c/go/+/458015
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
@geofft
Copy link

geofft commented Jun 11, 2023

If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure would help.)

Since this was written, Linux and FreeBSD added a close_range syscall (about fall 2020 - Linux kernel 5.9, FreeBSD 12.2, should be identical semantics). The reason for close_range instead of closefrom (which also exists in a few OSes) is so that if you want to have a child process inherit, say, fds 0, 1, 2, and 1000, you can do two syscalls instead of 998.

Assuming that the set of open fds is sparse, you can also do a pretty good job of this on many OSes, including older Linux versions, by using /dev/fd, /proc/self/fd, or equivalent. From a quick Google, https://github.com/cptpcrd/close_fds seems to be a pretty thorough attempt at finding the best OS-specific way.

Also, just to make sure I understand correctly - the idea here is that you would have some sort of process-wide two-sided lock where calls to close require a critical section around the close call alone and calls to fork require a critical section on the other side from before fork until unwanted file descriptors are closed. (It's not exactly an rwlock because you can have multiple closes at once or multiple forks at once, you just can't have one of each. Also note the lock has to work properly across fork.) And the reason a "close all fds above x" operation would help is that it would allow you to reduce the second critical section to be [fork, close_range] instead of [fork, exec triggering close-on-exec], which is helpful because exec could potentially be slow because of the filesystem, and also because if you're waiting for a self-pipe to close, it might get closed before other FDs are closed. Do I have all of that right?

By the way, there is one mildly exciting complication here, which is that closing the FD can also potentially be slow because of the filesystem. At least on Linux, closing a file descriptor calls the filesystem's flush operation on the underlying file description - even if there are other file descriptors open to that file description which will eventually be closed too! So it's possible that if thread A opens a file on a slow filesystem, thread B forks a child for exec, and thread A then closes the file, both thread A and the child will be slowed down. This isn't any worse than the status quo, to be clear, where the close happens on exec. But it is a really good argument in favor of O_CLOFORK, and that it should be implemented in the form where the FD is never copied to the new process's FD table in the first place instead of actually being closed on fork.

(Another related subtlety - close can return errors! The file might fail to write for I/O reasons or the filesystem might check quota only when the writes is flushed. This can cause actual data loss if you assume it succeeded. On Linux, close returns the error from the filesystem's flush operation and I believe always gets rid of the FD, and I guess it's up to the filesystem whether a second flush on another FD to the same open file description also returns the same error, or if it's lost state at that point. And close_range and of course close-on-exec do not report errors back to userspace the way close does. So that's another really good argument for O_CLOFORK implemented as never-inherit, to ensure that errors from close successfully get to the right close call.)

Note that vfork(2) is not a solution. Vfork is defined as the parent does not continue executing until the child is no longer using the parent's memory image. In the case of a successful exec, at least on Linux, vfork releases the memory image before doing any of the close-on-exec work, so the parent continues running before the child has closed the fds we care about.

The bigger problem is that vfork (at least on Linux and FreeBSD) only suspends the calling thread, and other threads continue to run, so it doesn't solve the problem at all. (If it weren't for that, I think vfork + close_range would avoid this problem, since the parent is still suspended during close_range.)

@ianlancetaylor
Copy link
Contributor

@geofft Thanks for pointing out close_range. Unfortunately I'm not sure how to use it in a fully reliable way, as Go programs can of course turn off the "close-on-exec" flag for a file descriptor. I think that what we want in the absence of O_CLOFORK is an efficient way to close all descriptors marked close-on-exec, while leaving the non-close-on-exec descriptors alone.

@gopherbot
Copy link

Change https://go.dev/cl/522015 mentions this issue: cmd/go: retry ETXTBSY errors when running test binaries

gopherbot pushed a commit that referenced this issue Aug 22, 2023
An ETXTBSY error when starting a test binary is almost certainly
caused by the race reported in #22315. That race will resolve quickly
on its own, so we should just retry the command instead of reporting a
spurious failure.

Fixes #62221.

Change-Id: I408f3eaa7ab5d7efbc7a2b1c8bea3dbc459fc794
Reviewed-on: https://go-review.googlesource.com/c/go/+/522015
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
@gopherbot
Copy link

Change https://go.dev/cl/522176 mentions this issue: [release-branch.go1.21] cmd/go: retry ETXTBSY errors when running test binaries

cellularmitosis pushed a commit to cellularmitosis/go that referenced this issue Aug 24, 2023
An ETXTBSY error when starting a test binary is almost certainly
caused by the race reported in golang#22315. That race will resolve quickly
on its own, so we should just retry the command instead of reporting a
spurious failure.

Fixes golang#62221.

Change-Id: I408f3eaa7ab5d7efbc7a2b1c8bea3dbc459fc794
Reviewed-on: https://go-review.googlesource.com/c/go/+/522015
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
gopherbot pushed a commit that referenced this issue Aug 30, 2023
…t binaries

An ETXTBSY error when starting a test binary is almost certainly
caused by the race reported in #22315. That race will resolve quickly
on its own, so we should just retry the command instead of reporting a
spurious failure.

Fixes #62222.
Updates #62221.

Change-Id: I408f3eaa7ab5d7efbc7a2b1c8bea3dbc459fc794
Reviewed-on: https://go-review.googlesource.com/c/go/+/522015
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Bryan Mills <bcmills@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
(cherry picked from commit 4dc2564)
Reviewed-on: https://go-review.googlesource.com/c/go/+/522176
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
@gopherbot
Copy link

Change https://go.dev/cl/560415 mentions this issue: cmd/go: avoid copying a binary to be exec'd in TestScript/gotoolchain_path

gopherbot pushed a commit that referenced this issue Feb 6, 2024
…_path

Runinng 'go build' writes the binary in a separate process, so avoids
the race described in #22315. However, the script engine's 'cp'
command currently executes in-process, so it does not avoid that bug
and may retain stale file descriptors when running tests in parallel.

Avoid the race in this particular test by giving the final binary
location in the '-o' argument instead of copying it there after the
fact.

Fixes #64019.

Change-Id: I96d276f33c09e39f465e9877356f1d8f2ae55062
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest,gotip-windows-amd64-longtest
Reviewed-on: https://go-review.googlesource.com/c/go/+/560415
Auto-Submit: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Matloob <matloob@golang.org>
ezz-no pushed a commit to ezz-no/go-ezzno that referenced this issue Feb 18, 2024
…_path

Runinng 'go build' writes the binary in a separate process, so avoids
the race described in golang#22315. However, the script engine's 'cp'
command currently executes in-process, so it does not avoid that bug
and may retain stale file descriptors when running tests in parallel.

Avoid the race in this particular test by giving the final binary
location in the '-o' argument instead of copying it there after the
fact.

Fixes golang#64019.

Change-Id: I96d276f33c09e39f465e9877356f1d8f2ae55062
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest,gotip-windows-amd64-longtest
Reviewed-on: https://go-review.googlesource.com/c/go/+/560415
Auto-Submit: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Matloob <matloob@golang.org>
DrJosh9000 added a commit to buildkite/agent that referenced this issue Apr 22, 2024
Since #2325, we learned that the "text file busy" error reported by the customer was probably golang/go#22315.

This change makes the retry and log pathway more specific to that error, and retries more times more frequently.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests