
os: on NetBSD, *Process.Wait sometimes deadlocks after cmd.Process.Signal returns "process already finished" #48789

Closed
bcmills opened this issue Oct 5, 2021 · 11 comments
Labels: FrozenDueToAge, NeedsInvestigation, OS-NetBSD

bcmills commented Oct 5, 2021

I've noticed a recurring pattern in cmd/go test failures on NetBSD builders, and I believe that it indicates a bug in either os.Process or the kernel itself on that platform.

The cmd/go tests start processes running “in the background” using os/exec, and use a waitOrStop function to terminate any remaining processes at the conclusion of each test.

The waitOrStop function starts a goroutine, then blocks on cmd.Wait(). The background goroutine blocks until either the call to cmd.Wait completes or the Context is canceled, then sends a signal to the process. If that signal fails with os: process already finished, then we assume that the process actually has already finished, and the background goroutine simply blocks until the call to cmd.Wait (inevitably) returns.

That all happens here:

go/src/cmd/go/script_test.go

Lines 1164 to 1206 in a05a7d4

errc := make(chan error)
go func() {
	select {
	case errc <- nil:
		return
	case <-ctx.Done():
	}

	err := cmd.Process.Signal(interrupt)
	if err == nil {
		err = ctx.Err() // Report ctx.Err() as the reason we interrupted.
	} else if err.Error() == "os: process already finished" {
		errc <- nil
		return
	}

	if killDelay > 0 {
		timer := time.NewTimer(killDelay)
		select {
		// Report ctx.Err() as the reason we interrupted the process...
		case errc <- ctx.Err():
			timer.Stop()
			return
		// ...but after killDelay has elapsed, fall back to a stronger signal.
		case <-timer.C:
		}

		// Wait still hasn't returned.
		// Kill the process harder to make sure that it exits.
		//
		// Ignore any error: if cmd.Process has already terminated, we still
		// want to send ctx.Err() (or the error from the Interrupt call)
		// to properly attribute the signal that may have terminated it.
		_ = cmd.Process.Kill()
	}

	errc <- err
}()

waitErr := cmd.Wait()
if interruptErr := <-errc; interruptErr != nil {
	return interruptErr
}


What I'm seeing on (some of?) the NetBSD builders is that after cmd.Process.Signal fails with process already finished, the call to cmd.Wait continues to block, seemingly forever.

The relevant goroutine traces are:

goroutine 1536 [syscall, 8 minutes]:
syscall.Syscall6(0x42d7, 0x9117f20, 0x0, 0x8f1a5a0, 0x0, 0x0, 0x0)
	/tmp/workdir/go/src/syscall/asm_unix_386.s:44 +0x5 fp=0x9117ecc sp=0x9117ec8 pc=0x80bc1e5
syscall.wait4(0x42d7, 0x9117f20, 0x0, 0x8f1a5a0)
	/tmp/workdir/go/src/syscall/zsyscall_netbsd_386.go:35 +0x5b fp=0x9117f04 sp=0x9117ecc pc=0x80b8f3b
syscall.Wait4(0x42d7, 0x9117f44, 0x0, 0x8f1a5a0)
	/tmp/workdir/go/src/syscall/syscall_bsd.go:145 +0x3b fp=0x9117f28 sp=0x9117f04 pc=0x80b6abb
os.(*Process).wait(0x8e98150)
	/tmp/workdir/go/src/os/exec_unix.go:44 +0xf2 fp=0x9117f60 sp=0x9117f28 pc=0x80e2762
os.(*Process).Wait(...)
	/tmp/workdir/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0x8c7bc30)
	/tmp/workdir/go/src/os/exec/exec.go:507 +0x4d fp=0x9117f9c sp=0x9117f60 pc=0x817eedd
cmd/go_test.waitOrStop({0x874073c, 0x8dc2300}, 0x8c7bc30, {0x873f89c, 0x869bd38}, 0x6472661e3)
	/tmp/workdir/go/src/cmd/go/script_test.go:1203 +0x11a fp=0x9117fc0 sp=0x9117f9c pc=0x85800ea
cmd/go_test.(*testScript).startBackground.func1()
	/tmp/workdir/go/src/cmd/go/script_test.go:1143 +0x5f fp=0x9117ff0 sp=0x9117fc0 pc=0x857ff8f
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_386.s:1311 +0x1 fp=0x9117ff4 sp=0x9117ff0 pc=0x80ab241
created by cmd/go_test.(*testScript).startBackground
	/tmp/workdir/go/src/cmd/go/script_test.go:1142 +0x390

goroutine 1537 [chan send, 8 minutes]:
runtime.gopark(0x869b798, 0x8dbc430, 0xf, 0x16, 0x2)
	/tmp/workdir/go/src/runtime/proc.go:366 +0xf6 fp=0x90e8f04 sp=0x90e8ef0 pc=0x807f376
runtime.chansend(0x8dbc400, 0x87302f0, 0x1, 0x8580314)
	/tmp/workdir/go/src/runtime/chan.go:258 +0x2f4 fp=0x90e8f48 sp=0x90e8f04 pc=0x804db54
runtime.chansend1(0x8dbc400, 0x87302f0)
	/tmp/workdir/go/src/runtime/chan.go:144 +0x24 fp=0x90e8f60 sp=0x90e8f48 pc=0x804d854
cmd/go_test.waitOrStop.func1()
	/tmp/workdir/go/src/cmd/go/script_test.go:1176 +0x184 fp=0x90e8ff0 sp=0x90e8f60 pc=0x8580314
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_386.s:1311 +0x1 fp=0x90e8ff4 sp=0x90e8ff0 pc=0x80ab241
created by cmd/go_test.waitOrStop
	/tmp/workdir/go/src/cmd/go/script_test.go:1165 +0x10e

Note that goroutine 1537 is blocked at script_test.go:1176, which is the send on errc after cmd.Process.Signal fails with os: process already finished.

Goroutine 1536 is blocked at the call to cmd.Wait, which is itself blocked on syscall.Wait4.


The failure rate with these symptoms is fairly high: something on the order of 20 failures per month.

greplogs --dashboard -md -l -e '(?m)panic: test timed out.*(?:.*\n)*.*\[syscall, .* minutes\]:\n(?:.+\n\t.+\n)*syscall\.Wait.*\n\t.+\n(?:.+\n\t.+\n)*cmd/go_test\.waitOrStop'

2021-10-04T22:46:23-17674e2/netbsd-386-9_0
2021-10-04T18:15:09-9f8d558/netbsd-386-9_0
2021-09-30T19:56:06-eb9f090/netbsd-386-9_0
2021-09-29T15:23:27-aeb4fba/netbsd-amd64-9_0
2021-09-28T17:18:36-ff7b041/netbsd-amd64-9_0
2021-09-28T15:26:21-583eeaa/netbsd-386-9_0
2021-09-27T18:57:20-3d795ea/netbsd-amd64-9_0
2021-09-22T16:24:17-74ba70b/netbsd-amd64-9_0
2021-09-22T15:00:53-91c2318/netbsd-amd64-9_0
2021-09-21T20:39:31-48cf96c/netbsd-386-9_0
2021-09-20T23:04:13-d7e3e44/netbsd-arm64-bsiegert
2021-09-20T00:13:47-a83a558/netbsd-amd64-9_0
2021-09-17T19:32:44-74e384f/netbsd-amd64-9_0
2021-09-16T23:57:40-8d2a9c3/netbsd-386-9_0
2021-09-16T19:38:19-bcdc61d/netbsd-amd64-9_0
2021-09-10T17:11:39-5a4b9f9/netbsd-amd64-9_0
2021-09-09T16:32:28-a53e3d5/netbsd-amd64-9_0
2021-09-08T11:57:03-9295723/netbsd-386-9_0
2021-09-07T03:56:13-6226020/netbsd-386-9_0
2021-09-04T10:58:11-5ec298d/netbsd-386-9_0
2021-08-31T16:43:46-6815235/netbsd-386-9_0
2021-08-30T22:07:49-b06cfe9/netbsd-386-9_0
2021-08-27T05:13:44-2c60a99/netbsd-386-9_0
2021-08-24T22:23:12-54cdef1/netbsd-386-9_0
2021-08-23T21:22:58-8157960/netbsd-386-9_0
2021-08-23T21:22:58-8157960/netbsd-arm64-bsiegert
2021-08-22T21:43:43-1958582/netbsd-386-9_0
2021-08-20T03:25:17-c92c2c9/netbsd-386-9_0
2021-08-19T20:50:13-65074a4/netbsd-amd64-9_0
2021-08-18T21:19:22-c2bd9ee/netbsd-386-9_0
2021-08-18T20:11:28-165ebd8/netbsd-386-9_0
2021-08-17T16:22:15-cf12b0d/netbsd-386-9_0
2021-08-17T15:00:04-3001b0a/netbsd-386-9_0
2021-08-17T04:37:32-a304273/netbsd-386-9_0
2021-08-17T01:29:37-1951afc/netbsd-386-9_0
2021-08-16T18:44:38-56a919f/netbsd-386-9_0
2021-08-16T18:44:32-ff36d11/netbsd-amd64-9_0
2021-08-16T13:38:52-a192ef8/netbsd-386-9_0
2021-08-12T17:43:16-39634e7/netbsd-386-9_0
2021-08-09T20:06:35-f1dce31/netbsd-386-9_0
2021-08-06T16:51:12-70546f6/netbsd-386-9_0
2021-07-28T03:27:13-b39e0f4/netbsd-386-9_0
2021-07-15T20:39:22-0941dbc/netbsd-amd64-9_0
2021-06-29T16:57:13-3463852/netbsd-386-9_0
2021-06-28T20:51:30-956c81b/netbsd-386-9_0
2021-06-24T03:45:33-a9bb382/netbsd-386-9_0
2021-06-15T20:59:42-d77f4c0/netbsd-amd64-9_0
2021-06-11T20:31:30-16b5d76/netbsd-386-9_0
2021-06-09T17:11:44-df35ade/netbsd-386-9_0
2021-06-09T15:09:13-139e935/netbsd-386-9_0
2021-05-21T17:43:46-4fda54c/netbsd-arm64-bsiegert
2021-05-21T17:35:47-8876b9b/netbsd-386-9_0
2021-05-17T16:02:12-b1aff42/netbsd-386-9_0
2021-05-12T15:23:09-0388670/netbsd-386-9_0
2021-05-12T02:04:57-1a0ea1a/netbsd-386-9_0
2021-05-11T18:22:54-9b84814/netbsd-386-9_0
2021-05-10T13:16:56-2870259/netbsd-386-9_0
2021-05-06T16:00:55-6c591f7/netbsd-386-9_0
2020-09-19T05:13:19-ccf581f/netbsd-386-9_0

bcmills added the NeedsInvestigation and OS-NetBSD labels on Oct 5, 2021
bcmills added this to the Go1.18 milestone on Oct 5, 2021
bcmills commented Oct 5, 2021

@bsiegert, @coypoop: do you know who might be able to look into this on the NetBSD side?

@ianlancetaylor, @tklauser: is this possibly related to #13987?

bcmills changed the title from “os: *Process.Wait sometimes deadlocks after cmd.Process.Signal fails with "process already finished"” to “os: on NetBSD, *Process.Wait sometimes deadlocks after cmd.Process.Signal returns "process already finished"” on Oct 5, 2021
bcmills commented Oct 5, 2021

#44801 may be related, in that it involves a hang in os.Process.Wait.

ianlancetaylor commented

I see two possible sequences that could lead to this.

First, wait6(P_PID, p.Pid, nil, WEXITED|WNOWAIT, nil, nil) might return with no error, and then wait4(p.Pid, &status, 0, &rusage) hangs without returning.

Second, kill(p.Pid, SIGQUIT) might return ESRCH, and then wait4(p.Pid, &status, 0, &rusage) hangs without returning.

Neither should be possible. The first case seems more likely. If we don't see a response from somebody familiar with NetBSD we should probably move the netbsd build tag from os/wait_wait6.go to os/wait_unimp.go. CC @tklauser since CL 315281 changed the os package to start using wait_wait6.go on NetBSD.
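
For reference, the two-step sequence being described boils down to roughly the following. This is a condensed sketch, not the real implementation: the wait6 helper below is a hypothetical stand-in for the wrapper in os/wait_wait6.go, while syscall.Wait4 is the call that os/exec_unix.go actually blocks in (see the stack trace above).

package main

import (
	"fmt"
	"syscall"
)

// wait6 is a hypothetical stand-in for the wait6(2) wrapper in
// os/wait_wait6.go. With WEXITED|WNOWAIT it blocks until the child is
// waitable but does not reap it, so the zombie remains for a later wait4.
func wait6(pid int) error {
	// raw system call omitted in this sketch
	return nil
}

// waitTwoStep illustrates the sequence described above. In the reported
// failures, step 2 hangs even though step 1 already returned (first case)
// or kill(2) already reported ESRCH (second case) -- neither of which
// should be possible, since the child must still be an unreaped zombie
// at that point.
func waitTwoStep(pid int) (syscall.WaitStatus, error) {
	if err := wait6(pid); err != nil { // step 1: block until waitable, don't reap
		return 0, err
	}
	var status syscall.WaitStatus
	var rusage syscall.Rusage
	_, err := syscall.Wait4(pid, &status, 0, &rusage) // step 2: reap the child
	return status, err
}

func main() {
	fmt.Println("sketch only; see os/wait_wait6.go and os/exec_unix.go for the real code")
}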

tklauser commented Oct 6, 2021

The first case seems more likely to me too, given that the failure seems to have started occurring only after https://golang.org/cl/315281 was submitted on 2021-05-02.

I'll send a CL to move the netbsd build tag from os/wait_wait6.go to os/wait_unimp.go.
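
Mechanically, that move is just an edit to the build constraints at the top of the two files. Approximately (only the netbsd entry changes; the other GOOS values are from memory and elided where noted):

// os/wait_wait6.go — netbsd removed, leaving roughly:
//go:build dragonfly || freebsd
// +build dragonfly freebsd

// os/wait_unimp.go — netbsd added alongside the GOOS values already
// listed there (full list elided). On these platforms blockUntilWaitable
// is a stub that always returns false, so Wait skips the wait6 pre-check
// and goes straight to wait4.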

gopherbot commented

Change https://golang.org/cl/354249 mentions this issue: os: don't use wait6 on netbsd

gopherbot pushed a commit that referenced this issue Oct 7, 2021
CL 315281 changed the os package to use wait6 on netbsd. This seems to
be causing frequent test failures as reported in #48789. Revert that
change for now, so that wait6 is no longer used on netbsd.

Updates #13987
Updates #16028
For #48789

Change-Id: Ieddffc65611c7f449971eaa8ed6f4299a5f742c2
Reviewed-on: https://go-review.googlesource.com/c/go/+/354249
Trust: Tobias Klauser <tobias.klauser@gmail.com>
Trust: Bryan C. Mills <bcmills@google.com>
Trust: Benny Siegert <bsiegert@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
bsiegert commented Nov 9, 2021

@bcmills Have you seen more deadlocks since the commit above landed at the beginning of October?

bcmills commented Nov 9, 2021

Looks like the deadlocks ended at that CL. 👍

The only failures I can find involving cmd/go_test.waitOrStop since Oct. 7 look like memory corruption (#49209) rather than deadlocks.

greplogs --dashboard -md -l -e \(\?ms\)\\Anetbsd.\*cmd/go_test\\.waitOrStop --since=2021-10-06

2021-11-05T16:51:14-c58417b/netbsd-amd64-9_0
2021-11-04T20:24:01-99699d1/netbsd-386-9_0
2021-11-03T19:32:33-74f99d0/netbsd-amd64-9_0
2021-10-27T21:34:46-de1abf7/netbsd-386-9_0
2021-10-19T07:45:46-ee92daa/netbsd-amd64-9_0
2021-10-06T23:19:38-f375844/netbsd-amd64-9_0
2021-10-06T23:01:01-4ffa2f1/netbsd-amd64-9_0
2021-10-06T20:30:12-4a37a1d/netbsd-amd64-9_0
2021-10-06T20:21:49-f580b75/netbsd-386-9_0

bcmills commented Dec 15, 2021

We believe that the netbsd-amd64-9_0-n2 builder is not affected by #49209, so it looks like this deadlock still occurs.

greplogs --dashboard -md -l -e '(?ms)\Anetbsd-[a-z0-9_-]+-n2 .*cmd/go_test\.waitOrStop' --since=2021-11-01

2021-12-14T01:48:22-1afa432/netbsd-amd64-9_0-n2

goroutine 3696 [chan receive, 2 minutes]:
runtime.gopark(0x18?, 0xf1ed60?, 0x0?, 0xc0?, 0x0?)
	/tmp/workdir/go/src/runtime/proc.go:366 +0xd6 fp=0xc000597e10 sp=0xc000597df0 pc=0x438c96
runtime.chanrecv(0xc00092a600, 0xc000597f20, 0x1)
	/tmp/workdir/go/src/runtime/chan.go:577 +0x56c fp=0xc000597ea0 sp=0xc000597e10 pc=0x407b4c
runtime.chanrecv1(0xc00002de00?, 0x38?)
	/tmp/workdir/go/src/runtime/chan.go:440 +0x18 fp=0xc000597ec8 sp=0xc000597ea0 pc=0x407578
os/exec.(*Cmd).Wait(0xc000824160)
	/tmp/workdir/go/src/os/exec/exec.go:515 +0x188 fp=0xc000597f40 sp=0xc000597ec8 pc=0x552168
cmd/go_test.waitOrStop({0xbd8d08?, 0xc00091d740}, 0xc000824160, {0xbd73e8?, 0xbd06c0}, 0x646b83877)
	/tmp/workdir/go/src/cmd/go/script_test.go:1288 +0x137 fp=0xc000597f90 sp=0xc000597f40 pc=0x9c0417
cmd/go_test.(*testScript).startBackground.func1()
	/tmp/workdir/go/src/cmd/go/script_test.go:1161 +0x4d fp=0xc000597fe0 sp=0xc000597f90 pc=0x9c004d
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc000597fe8 sp=0xc000597fe0 pc=0x467f61
created by cmd/go_test.(*testScript).startBackground
	/tmp/workdir/go/src/cmd/go/script_test.go:1160 +0x43b

goroutine 3697 [chan send, 2 minutes]:
runtime.gopark(0x47cbb2?, 0x47ab18?, 0x25?, 0x0?, 0x50e6?)
	/tmp/workdir/go/src/runtime/proc.go:366 +0xd6 fp=0xc000443e28 sp=0xc000443e08 pc=0x438c96
runtime.chansend(0xc0000d49c0, 0xc000443f80, 0x1, 0xc000443eb8?)
	/tmp/workdir/go/src/runtime/chan.go:258 +0x425 fp=0xc000443eb0 sp=0xc000443e28 pc=0x406ca5
runtime.chansend1(0xc000443f90?, 0xbd73e8?)
	/tmp/workdir/go/src/runtime/chan.go:144 +0x1d fp=0xc000443ee0 sp=0xc000443eb0 pc=0x40685d
cmd/go_test.waitOrStop.func1()
	/tmp/workdir/go/src/cmd/go/script_test.go:1285 +0x2cf fp=0xc000443fe0 sp=0xc000443ee0 pc=0x9c07af
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc000443fe8 sp=0xc000443fe0 pc=0x467f61
created by cmd/go_test.waitOrStop
	/tmp/workdir/go/src/cmd/go/script_test.go:1250 +0x12d

bcmills reopened this on Dec 15, 2021
bcmills commented Dec 15, 2021

Hmm, maybe not! That send is here:
https://cs.opensource.google/go/go/+/master:src/cmd/go/script_test.go;l=1285;drc=2580d0e08d5e9f979b943758d3c49877fb2324cb

which is not on the os.ErrProcessDone path.
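
For context, the "os.ErrProcessDone path" is the branch that treats a failed Signal as meaning the process has already exited. A runnable sketch of that check using the sentinel error os.ErrProcessDone (available since Go 1.16) rather than comparing the error string; this is illustrative, not the cmd/go test code:

package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	cmd := exec.Command("sleep", "0.1")
	if err := cmd.Start(); err != nil {
		fmt.Println("start:", err)
		return
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()
	time.Sleep(500 * time.Millisecond) // let the child exit and be reaped by Wait

	// A Signal sent after the process has been reaped fails with
	// os.ErrProcessDone, which errors.Is recognizes.
	if err := cmd.Process.Signal(os.Interrupt); err != nil {
		if errors.Is(err, os.ErrProcessDone) {
			fmt.Println("process already finished")
		} else {
			fmt.Println("signal failed:", err)
		}
	}
	fmt.Println("wait:", <-done)
}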

gopherbot commented

Change https://go.dev/cl/431855 mentions this issue: os: use wait6 to avoid wait/kill race on netbsd

gopherbot pushed a commit that referenced this issue Sep 19, 2022
Resend of CL 315281 which was partially reverted by CL 354249 after the
original CL was suspected to cause test failures as reported in #48789.
It seems that both wait4 and wait6 lead to that particular deadlock, so
let's use wait6. That way we at least don't hit #13987 on netbsd.

Updates #13987
For #48789
For #50138

Change-Id: Iadc4a771217b7e9e821502e89afa07036e0dcb6f
Reviewed-on: https://go-review.googlesource.com/c/go/+/431855
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Auto-Submit: Tobias Klauser <tobias.klauser@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
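
As background on why wait6 helps with the wait/kill race (#13987): because WNOWAIT leaves the child unreaped, Wait can mark the process as done and synchronize with Signal before the pid is actually released. A rough sketch of that ordering, loosely modeled on os/exec_unix.go and simplified here (types and helpers are illustrative, not the os package's real API):

package main

import (
	"errors"
	"sync"
	"syscall"
)

var errProcessDone = errors.New("os: process already finished")

type process struct {
	pid   int
	sigMu sync.RWMutex // synchronizes signal against the done transition in wait
	done  bool
}

// blockUntilWaitable stands in for the wait6-based helper in os/wait_wait6.go:
// wait6 with WEXITED|WNOWAIT blocks until the child has exited without reaping it.
func blockUntilWaitable(pid int) error { return nil }

// wait sketches the ordering that makes the race benign: learn that the
// child exited (no reaping yet), mark it done, wait out any in-flight
// signal call, and only then reap the pid with wait4.
func (p *process) wait() error {
	if err := blockUntilWaitable(p.pid); err != nil {
		return err
	}
	p.sigMu.Lock()
	p.done = true
	p.sigMu.Unlock()
	var status syscall.WaitStatus
	_, err := syscall.Wait4(p.pid, &status, 0, nil) // pid only becomes reusable here
	return err
}

// signal refuses to send once the process is marked done, so it can never
// target a pid that wait has already released for reuse.
func (p *process) signal(sig syscall.Signal) error {
	p.sigMu.RLock()
	defer p.sigMu.RUnlock()
	if p.done {
		return errProcessDone
	}
	return syscall.Kill(p.pid, sig)
}

func main() {}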
gopherbot commented

Change https://go.dev/cl/483396 mentions this issue: os: adjust wait6/waitid comment for netbsd

gopherbot pushed a commit that referenced this issue Apr 11, 2023
CL 431855 changed (*Process).blockUntilWaitable on netbsd to use wait6
again.

Update #48789

Change-Id: I948f5445a44ab2e82c02560480a2a244d2b5f473
Reviewed-on: https://go-review.googlesource.com/c/go/+/483396
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
golang locked and limited conversation to collaborators on Apr 8, 2024