Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: process.Wait marks the process as already finished on ptrace stop signals on Linux #60321

Open
mitar opened this issue May 20, 2023 · 14 comments · May be fixed by #60461
Open

os: process.Wait marks the process as already finished on ptrace stop signals on Linux #60321

mitar opened this issue May 20, 2023 · 14 comments · May be fixed by #60461
Labels
help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@mitar
Copy link
Contributor

mitar commented May 20, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20.4 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/mitar/.cache/go-build"
GOENV="/home/mitar/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/mitar/.go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/mitar/.go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.4"
GCCGO="/usr/bin/gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3740724317=/tmp/go-build -gno-record-gcc-switches"

What did you do?

https://go.dev/play/p/PYGxV0Kg3pF

What did you expect to see?

2009/11/10 23:00:00 status: true 5
2009/11/10 23:00:00 end

Or nothing, if process.Wait returns only on exit status and not stopped status.

What did you see instead?

2009/11/10 23:00:00 status: true 5
2009/11/10 23:00:00 os: process already finished
@ianlancetaylor
Copy link
Contributor

Thanks for the bug report. I don't understand why this is happening. We call waitid with WEXITED, which is documented as waiting for children that have terminated. Why does it return for a child that has been trapped but not terminated?

strace -f reports

pid 557792] waitid(P_PID, 557797, {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=557797, si_uid=100, si_status=SIGSTOP, si_utime=0, si_stime=0}, WEXITED|WNOWAIT, NULL) = 0

but why does waitid return?

@ianlancetaylor ianlancetaylor changed the title os/exec: process.Wait() marks the process as already finished on stop signals as well os: process.Wait marks the process as already finished on stop signals as well May 20, 2023
@ianlancetaylor ianlancetaylor added help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 20, 2023
@ianlancetaylor ianlancetaylor added this to the Backlog milestone May 20, 2023
@mitar
Copy link
Contributor Author

mitar commented May 20, 2023

A good catch. Go uses waitid in blockUntilWaitable which does (should?) return only on process exit, but then uses syscall.Wait4 in Process.wait itself, which does return for both exited and stopped. This is why I used process.Wait call in the first place, because I thought that I will be getting exited and stopped (which I do). But this seems to break some assumptions in the code.

So one way to fix this is to use waitid instead of syscall.Wait4 and really wait only for exited processes in both places (given that we figure out why waitid returns for stopped processes as well).

BTW, I could not find anywhere explicitly documented that waitid does not return for stopped AND traced processes. In Linux man page for waitid it is unclear what is the behavior here, because for waitpid it says:

WUNTRACED --- also return if a child has stopped (but not traced via ptrace(2)). Status for traced children which have stopped is provided even if this option is not specified.

So maybe the same "Status for traced children which have stopped is provided even if this option is not specified." holds for waitid as well?

BSD seems to have clearer options here:

WTRAPPED	 Report	the status of selected processes which are being
		 traced	via ptrace(2) and have trapped or reached a break-
		 point.	 This flag is implicitly set for the functions wait(),
		 waitpid(), wait3(), and wait4().
		 For the waitid() and wait6() functions, the flag has to be
		 explicitly included in	options	if status reports from trapped
		 processes are expected.

@bcmills
Copy link
Contributor

bcmills commented May 22, 2023

BTW, I could not find anywhere explicitly documented that waitid does not return for stopped AND traced processes.

Isn't that required by POSIX?
https://pubs.opengroup.org/onlinepubs/9699919799/functions/waitid.html:

The options argument is used to specify which state changes waitid() shall wait for. It is formed by OR'ing together the following flags:

It goes on to describe WCONTINUED, WEXITED, and WSTOPPED, and to state that “[a]pplications shall specify at least one of the flags WEXITED, WSTOPPED, or WCONTINUED to be OR'ed in with the options argument.” Nothing there about ptrace because ptrace isn't in POSIX.

https://pubs.opengroup.org/onlinepubs/9699919799/functions/wait.html does give some detail about ptrace, but only to say:

Most applications do not need to concern themselves with such extensions because they have control over what extensions they or their children use. However, applications, such as command interpreters, that invoke arbitrary processes may see this behavior when those arbitrary processes misuse such extensions.

At any rate, I wonder if the Linux waitid behavior was broken by the patch described in (and linked from) https://lwn.net/Articles/688624/.

@ianlancetaylor
Copy link
Contributor

@bcmills Nice find. Seems like on Linux we can't rely on waitid to do what we need. That sucks. I wonder if using pidfd_open would fix that problem (#60320)?

@bcmills bcmills changed the title os: process.Wait marks the process as already finished on stop signals as well os: process.Wait marks the process as already finished on ptrace stop signals on Linux May 22, 2023
@mitar
Copy link
Contributor Author

mitar commented May 26, 2023

A nice find indeed. Now that I know what to look for it is interesting that this man page says nothing about it, but this one does:

Since Linux 4.7, the __WALL flag is automatically implied if the child is being ptraced.

mitar added a commit to mitar/go that referenced this issue May 26, 2023
Using pidfd allows us to have a handle on the process and
poll the handle to non-blocking wait for the process to
exit.

Fixes golang#34396
Fixes golang#60321
Fixes golang#60320
@mitar
Copy link
Contributor Author

mitar commented May 26, 2023

I made a draft in #60461 which addresses this. See Gerrit for some of my comments. It is not yet perfect but it does fix the issue here.

mitar added a commit to mitar/go that referenced this issue May 26, 2023
Using pidfd allows us to have a handle on the process and
poll the handle to non-blocking wait for the process to
exit.

Fixes golang#34396
Fixes golang#60321
Fixes golang#60320
@gopherbot
Copy link

Change https://go.dev/cl/498615 mentions this issue: os/exec: use pidfd for waiting and signaling of processes

@bcmills
Copy link
Contributor

bcmills commented May 26, 2023

Interesting; the man-pages update containing that line was commited back in 2016, so the pages on linux.die.net must be quite old. 😵‍💫

(https://man7.org/linux/man-pages/man2/waitid.2.html also includes the update.)

@mitar
Copy link
Contributor Author

mitar commented May 26, 2023

I just finished the comments now.

@mitar
Copy link
Contributor Author

mitar commented May 26, 2023

So while working on this I think there are two ways to address this: pidfd helps here because it allows us to really detect exit and just exit so we do not have to call wait unnecessary. But another solution is to simply expand cases when we loop around wait and do another wait if wait returned for any other reason than exit/signal (we could do such a loop in blockUntilWaitable as well, which is not necessary with pidfd approach).

mitar added a commit to mitar/go that referenced this issue May 26, 2023
Using pidfd allows us to have a handle on the process and
poll the handle to non-blocking wait for the process to
exit.

Fixes golang#34396
Fixes golang#60321
Fixes golang#60320
@ianlancetaylor
Copy link
Contributor

A loop sounds good if it works, but would it work? If a process is stopped due to ptrace will waitid immediately return a notification that the process is stopped? In other words, would a loop just be a busy wait?

@mitar
Copy link
Contributor Author

mitar commented May 26, 2023

A good point. I tested this now (by removing blockUntilWaitable from the call and letting the loop handle everything). On ptrace's trap signal it returns once and the second call then blocks. But I could not find any documentation to support this behavior.

@mitar
Copy link
Contributor Author

mitar commented May 26, 2023

(This is helpful behavior for this particular problem, but for my own init program in fact means that only one wait in the process succeeds and if there are multiple waits for the same process (e.g., one by pid and one using -1, like a wait loop to reap unknown reparented processes to the init), it is arbitrary which one will succeed and which one will stay blocked. And there is no timeout to blocking.)

@mitar
Copy link
Contributor Author

mitar commented May 26, 2023

I have read through kernel code and I do not think that the patch linked above broke this. That patch deals with the question if threads or not are waited on by default. And if thread is ptraced, then it is waited on.

From looking at the code it seems that wait simply always waits on all signals from ptraced processes. I also tested this by calling waitid with WEXITED|__WALL options without ptracing the process and sent a SIGSTOP signal. Wait didn't return on that. But it did return if I added WSTOPPED to options. So I think BSD has it better here where they provide WTRAPPED.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants