Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/exec: TestContextCancel is flaky on netbsd-arm64-bsiegert builder #42061

Closed
bcmills opened this issue Oct 19, 2020 · 10 comments
Closed

os/exec: TestContextCancel is flaky on netbsd-arm64-bsiegert builder #42061

bcmills opened this issue Oct 19, 2020 · 10 comments
Labels
ExpertNeeded NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-NetBSD
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Oct 19, 2020

2020-10-18T08:22:01-515e6a9/netbsd-arm64-bsiegert
2020-10-16T03:02:36-912262b/netbsd-arm64-bsiegert
2020-10-14T15:19:28-52669c4/netbsd-arm64-bsiegert
2020-10-12T21:13:04-15a11ce/netbsd-arm64-bsiegert
2020-10-07T20:18:35-ade5161/netbsd-arm64-bsiegert
2020-10-06T22:56:00-f8e5540/netbsd-arm64-bsiegert
2020-10-06T21:55:49-2e4ceaf/netbsd-arm64-bsiegert

It looks like the test hard-codes a 1-second timeout on the cancellation terminating the child process. That might not be a reasonable assumption on slow, heavily-loaded builders.

go/src/os/exec/exec_test.go

Lines 1081 to 1083 in f8df205

if time.Since(start) > time.Second {
t.Fatal("canceling context did not stop program")
}

CC @bsiegert @bradfitz @ianlancetaylor

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 19, 2020
@bcmills bcmills added this to the Backlog milestone Oct 19, 2020
@bsiegert
Copy link
Contributor

Do you suggest lengthening the grace time? Or something more clever?

@bcmills
Copy link
Contributor Author

bcmills commented Oct 19, 2020

I would be inclined to eliminate the dependency on the grace time entirely.

That could be done fairly easily by swapping the goroutines: instead of calling io.WriteString on the test-function goroutine and stdout.Read in a background goroutine, I would call io.WriteString and cancel on the background goroutine and move the Read loop to the test function.

Then you just keep reading until EOF, and if EOF never comes the test crashes itself and you figure out what happened from the goroutine dump (which you'll probably need anyway, because if the test legitimately fails there's probably a deadlock somewhere in os/exec).

@gopherbot
Copy link

Change https://golang.org/cl/303351 mentions this issue: os/exec: Fix a flake in TestContextCancel

@gopherbot
Copy link

Change https://golang.org/cl/303352 mentions this issue: os/exec: simplify TestContextCancel

gopherbot pushed a commit that referenced this issue Mar 23, 2021
TestContextCancel is a test that ensures a process is killed soon after
canceling the context, even if Wait is not called (#16222). The test
checks whether the process exited without calling Wait by writing some
data to its stdin.

Currently the test involves two goroutines writing to stdin and reading
from stdout. However the reading goroutine is not very necessary to
detect the process exit.

This patch simplifies the test by connecting the process stdout to
/dev/null.

For #42061

Change-Id: I0447a1c024ee5abb050c627ec3766b731b02181a
Reviewed-on: https://go-review.googlesource.com/c/go/+/303352
Trust: Emmanuel Odeke <emmanuel@orijtech.com>
Run-TryBot: Emmanuel Odeke <emmanuel@orijtech.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@bcmills
Copy link
Contributor Author

bcmills commented Dec 1, 2021

Unfortunately still flaky.

greplogs --dashboard -md -l -e 'FAIL: TestContextCancel .*\n\s+.*canceling context did not stop program' --since=2021-03-23

2021-11-30T20:04:58-7ccbcc9/netbsd-arm64-bsiegert
2021-11-11T20:25:07-eb68e33/netbsd-arm64-bsiegert
2021-11-01T15:55:29-0ae4c7f/netbsd-arm64-bsiegert
2021-10-27T20:03:17-68bd512/netbsd-arm64-bsiegert
2021-10-26T23:12:17-13eccaa/netbsd-arm64-bsiegert
2021-10-12T20:53:52-ac6d706/netbsd-arm64-bsiegert
2021-10-11T17:02:03-577bb7d/netbsd-arm64-bsiegert
2021-10-05T17:10:45-5140ad1/netbsd-arm64-bsiegert
2021-09-29T17:06:53-99d5d8a/netbsd-arm64-bsiegert
2021-09-21T20:39:31-48cf96c/netbsd-arm64-bsiegert
2021-09-06T14:02:24-ecfff58/netbsd-arm64-bsiegert
2021-09-06T06:52:46-c7f09ee/netbsd-arm64-bsiegert
2021-09-02T15:14:50-782aa42/netbsd-arm64-bsiegert
2021-08-17T23:59:53-ddfcc02/netbsd-arm64-bsiegert
2021-08-16T08:37:46-fcdc3c0/netbsd-arm64-bsiegert
2021-07-26T22:15:24-bfbb288/netbsd-arm64-bsiegert
2021-07-15T20:16:41-69728ea/netbsd-arm64-bsiegert
2021-07-08T17:53:43-296ddf2/netbsd-arm64-bsiegert
2021-06-17T14:53:37-4dede02/netbsd-arm64-bsiegert
2021-05-26T13:25:43-e4615ad/netbsd-arm64-bsiegert
2021-05-21T17:35:02-5fee772/netbsd-arm64-bsiegert
2021-05-09T17:07:22-c14ecac/netbsd-arm64-bsiegert
2021-05-07T17:43:31-b44c78b/netbsd-arm64-bsiegert
2021-05-06T16:28:56-5f9fe47/netbsd-arm64-bsiegert
2021-05-04T23:14:32-cb0658e/netbsd-arm64-bsiegert
2021-04-21T21:14:04-54af9fd/netbsd-arm64-bsiegert
2021-04-19T23:17:10-3711ea0/netbsd-arm64-bsiegert
2021-04-19T21:27:43-e97d8eb/netbsd-arm64-bsiegert
2021-04-19T16:49:40-5780ab4/netbsd-arm64-bsiegert
2021-04-16T21:19:53-b91f8a4/netbsd-arm64-bsiegert
2021-04-15T12:38:13-7ad496b/netbsd-arm64-bsiegert
2021-04-15T00:40:43-566a87c/netbsd-arm64-bsiegert
2021-04-14T03:15:34-e7ab1a5/netbsd-arm64-bsiegert
2021-04-12T16:44:49-70ed28e/netbsd-arm64-bsiegert
2021-04-11T20:41:32-0da9eff/netbsd-arm64-bsiegert
2021-04-07T13:24:10-4520da4/netbsd-arm64-bsiegert
2021-04-07T05:19:22-1395432/netbsd-arm64-bsiegert
2021-04-03T20:14:37-971c715/netbsd-arm64-bsiegert
2021-04-02T16:30:18-759116b/netbsd-arm64-bsiegert
2021-03-31T20:13:06-4acefa0/netbsd-arm64-bsiegert
2021-03-30T04:48:27-032ef4b/netbsd-arm64-bsiegert
2021-03-29T18:38:06-2476449/netbsd-arm64-bsiegert
2021-03-27T05:38:26-49dccf1/netbsd-arm64-bsiegert
2021-03-25T19:16:13-34ef294/netbsd-arm64-bsiegert
2021-03-25T19:13:14-0fc595e/netbsd-arm64-bsiegert
2021-03-23T22:36:59-4e27aa6/netbsd-arm64-bsiegert

@bcmills bcmills reopened this Dec 1, 2021
@gopherbot
Copy link

Change https://golang.org/cl/368317 mentions this issue: os/exec: in TestContextCancel, dump goroutines on failure

gopherbot pushed a commit that referenced this issue Dec 2, 2021
If this test fails, we want to know exactly what the os/exec
goroutines are doing. Panicking gives us a goroutine dump,
whereas t.Fatal does not.

While we're here, use exponential backoff instead of a hard-coded 1ms
sleep. We want to give the OS enough time to actually terminate the
subprocess.

For #42061

Change-Id: I3d50a71ac314853c68a935218e7f97ce18b08b5b
Reviewed-on: https://go-review.googlesource.com/c/go/+/368317
Trust: Bryan C. Mills <bcmills@google.com>
Run-TryBot: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@bcmills
Copy link
Contributor Author

bcmills commented Dec 16, 2021

We have a goroutine dump! Curiously, at the point where the failure occurs the command is not running at all. This looks a lot like #50187, which suggests that perhaps NetBSD is sometimes failing to close input/output pipes when the underlying process exits or is terminated.

@bsiegert, @coypoop: could you please investigate?

--- FAIL: TestContextCancel (60.76s)
panic: canceling context did not stop program [recovered]
	panic: canceling context did not stop program

goroutine 87 [running]:
panic({0x2b1a00, 0x39c140})
	/var/gobuilder/buildlet/go/src/runtime/panic.go:941 +0x3d8 fp=0x40002cdb50 sp=0x40002cda90 pc=0x44488
testing.tRunner.func1.2({0x2b1a00, 0x39c140})
	/var/gobuilder/buildlet/go/src/testing/testing.go:1390 +0x1c8 fp=0x40002cdc00 sp=0x40002cdb50 pc=0xe6c98
testing.tRunner.func1()
	/var/gobuilder/buildlet/go/src/testing/testing.go:1393 +0x380 fp=0x40002cdd70 sp=0x40002cdc00 pc=0xe6800
runtime.deferCallSave(0x40002cde50, 0x400020ff88?)
	/var/gobuilder/buildlet/go/src/runtime/panic.go:750 +0x94 fp=0x40002cdd80 sp=0x40002cdd70 pc=0x44054
runtime.runOpenDeferFrame(0x4000012910?, 0x40000128c0)
	/var/gobuilder/buildlet/go/src/runtime/panic.go:723 +0x1b8 fp=0x40002cddd0 sp=0x40002cdd80 pc=0x43e78
panic({0x2b1a00, 0x39c140})
	/var/gobuilder/buildlet/go/src/runtime/panic.go:838 +0x20c fp=0x40002cde90 sp=0x40002cddd0 pc=0x442bc
os/exec_test.TestContextCancel(0x40000829c0)
	/var/gobuilder/buildlet/go/src/os/exec/exec_test.go:994 +0x300 fp=0x40002cdf60 sp=0x40002cde90 pc=0x2876e0
testing.tRunner(0x40000829c0, 0x343e10)
	/var/gobuilder/buildlet/go/src/testing/testing.go:1440 +0x110 fp=0x40002cdfb0 sp=0x40002cdf60 pc=0xe63b0
testing.(*T).Run.func1()
	/var/gobuilder/buildlet/go/src/testing/testing.go:1487 +0x30 fp=0x40002cdfd0 sp=0x40002cdfb0 pc=0xe7120
runtime.goexit()
	/var/gobuilder/buildlet/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x40002cdfd0 sp=0x40002cdfd0 pc=0x75704
created by testing.(*T).Run
	/var/gobuilder/buildlet/go/src/testing/testing.go:1487 +0x328

goroutine 1 [chan receive]:
runtime.gopark(0x4000062000?, 0x8de0c?, 0xf0?, 0xfd?, 0x1be388e5?)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:366 +0xe8 fp=0x4000105970 sp=0x4000105950 pc=0x47228
runtime.chanrecv(0x40001961c0, 0x0, 0x1)
	/var/gobuilder/buildlet/go/src/runtime/chan.go:577 +0x590 fp=0x4000105a00 sp=0x4000105970 pc=0x16c40
runtime.chanrecv1(0x400006c1e0?, 0x40001964b8?)
	/var/gobuilder/buildlet/go/src/runtime/chan.go:440 +0x14 fp=0x4000105a30 sp=0x4000105a00 pc=0x166a4
testing.tRunner.func1()
	/var/gobuilder/buildlet/go/src/testing/testing.go:1406 +0x458 fp=0x4000105ba0 sp=0x4000105a30 pc=0xe68d8
testing.tRunner(0x400010cea0, 0x4000105cb8)
	/var/gobuilder/buildlet/go/src/testing/testing.go:1446 +0x14c fp=0x4000105bf0 sp=0x4000105ba0 pc=0xe63ec
testing.runTests(0x4000112c80?, {0x55bd80, 0x1f, 0x1f}, {0xb300000000000000?, 0x226b0?, 0x561c80?})
	/var/gobuilder/buildlet/go/src/testing/testing.go:1838 +0x3e8 fp=0x4000105ce0 sp=0x4000105bf0 pc=0xe8c78
testing.(*M).Run(0x4000112c80)
	/var/gobuilder/buildlet/go/src/testing/testing.go:1720 +0x510 fp=0x4000105ee0 sp=0x4000105ce0 pc=0xe79f0
main.main()
	_testmain.go:111 +0x1e8 fp=0x4000105f70 sp=0x4000105ee0 pc=0x2883a8
runtime.main()
	/var/gobuilder/buildlet/go/src/runtime/proc.go:255 +0x278 fp=0x4000105fd0 sp=0x4000105f70 pc=0x46e48
runtime.goexit()
	/var/gobuilder/buildlet/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x4000105fd0 sp=0x4000105fd0 pc=0x75704

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:366 +0xe8 fp=0x4000036fa0 sp=0x4000036f80 pc=0x47228
runtime.goparkunlock(...)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:372
runtime.forcegchelper()
	/var/gobuilder/buildlet/go/src/runtime/proc.go:306 +0xb8 fp=0x4000036fd0 sp=0x4000036fa0 pc=0x470b8
runtime.goexit()
	/var/gobuilder/buildlet/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x4000036fd0 sp=0x4000036fd0 pc=0x75704
created by runtime.init.6
	/var/gobuilder/buildlet/go/src/runtime/proc.go:294 +0x28

goroutine 3 [GC sweep wait]:
runtime.gopark(0x1?, 0x0?, 0x0?, 0x0?, 0x0?)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:366 +0xe8 fp=0x4000037770 sp=0x4000037750 pc=0x47228
runtime.goparkunlock(...)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:372
runtime.bgsweep(0x0?)
	/var/gobuilder/buildlet/go/src/runtime/mgcsweep.go:297 +0x110 fp=0x40000377b0 sp=0x4000037770 pc=0x32ca0
runtime.gcenable.func1()
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:177 +0x2c fp=0x40000377d0 sp=0x40000377b0 pc=0x2850c
runtime.goexit()
	/var/gobuilder/buildlet/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x40000377d0 sp=0x40000377d0 pc=0x75704
created by runtime.gcenable
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:177 +0x74

goroutine 4 [GC scavenge wait]:
runtime.gopark(0x165990d2c3686c?, 0x10000?, 0x0?, 0x0?, 0x0?)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:366 +0xe8 fp=0x4000037f10 sp=0x4000037ef0 pc=0x47228
runtime.goparkunlock(...)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:372
runtime.bgscavenge(0x0?)
	/var/gobuilder/buildlet/go/src/runtime/mgcscavenge.go:357 +0x280 fp=0x4000037fb0 sp=0x4000037f10 pc=0x30c40
runtime.gcenable.func2()
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:178 +0x2c fp=0x4000037fd0 sp=0x4000037fb0 pc=0x284ac
runtime.goexit()
	/var/gobuilder/buildlet/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x4000037fd0 sp=0x4000037fd0 pc=0x75704
created by runtime.gcenable
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:178 +0xb8

goroutine 5 [finalizer wait]:
runtime.gopark(0x0?, 0x343d70?, 0x0?, 0xc0?, 0x1000000010?)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:366 +0xe8 fp=0x4000036590 sp=0x4000036570 pc=0x47228
runtime.goparkunlock(...)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:372
runtime.runfinq()
	/var/gobuilder/buildlet/go/src/runtime/mfinal.go:177 +0xd4 fp=0x40000367d0 sp=0x4000036590 pc=0x27694
runtime.goexit()
	/var/gobuilder/buildlet/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x40000367d0 sp=0x40000367d0 pc=0x75704
created by runtime.createfing
	/var/gobuilder/buildlet/go/src/runtime/mfinal.go:157 +0x9c

goroutine 42 [GC worker (idle)]:
runtime.gopark(0x4000032758?, 0x4000198820?, 0xa8?, 0x27?, 0xe7120?)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:366 +0xe8 fp=0x4000032740 sp=0x4000032720 pc=0x47228
runtime.gcBgMarkWorker()
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:1207 +0xec fp=0x40000327d0 sp=0x4000032740 pc=0x2a9ac
runtime.goexit()
	/var/gobuilder/buildlet/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x40000327d0 sp=0x40000327d0 pc=0x75704
created by runtime.gcBgMarkStartWorkers
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:1131 +0x2c

goroutine 58 [GC worker (idle)]:
runtime.gopark(0x592ac0?, 0x1?, 0x5f?, 0x33?, 0x284ff8?)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:366 +0xe8 fp=0x4000032f40 sp=0x4000032f20 pc=0x47228
runtime.gcBgMarkWorker()
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:1207 +0xec fp=0x4000032fd0 sp=0x4000032f40 pc=0x2a9ac
runtime.goexit()
	/var/gobuilder/buildlet/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x4000032fd0 sp=0x4000032fd0 pc=0x75704
created by runtime.gcBgMarkStartWorkers
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:1131 +0x2c

goroutine 59 [GC worker (idle)]:
runtime.gopark(0x165990d2c4c1ce?, 0x0?, 0x0?, 0x0?, 0x0?)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:366 +0xe8 fp=0x4000034740 sp=0x4000034720 pc=0x47228
runtime.gcBgMarkWorker()
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:1207 +0xec fp=0x40000347d0 sp=0x4000034740 pc=0x2a9ac
runtime.goexit()
	/var/gobuilder/buildlet/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x40000347d0 sp=0x40000347d0 pc=0x75704
created by runtime.gcBgMarkStartWorkers
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:1131 +0x2c

goroutine 60 [GC worker (idle)]:
runtime.gopark(0x165990d2c480dd?, 0x0?, 0x0?, 0x0?, 0x0?)
	/var/gobuilder/buildlet/go/src/runtime/proc.go:366 +0xe8 fp=0x4000034f40 sp=0x4000034f20 pc=0x47228
runtime.gcBgMarkWorker()
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:1207 +0xec fp=0x4000034fd0 sp=0x4000034f40 pc=0x2a9ac
runtime.goexit()
	/var/gobuilder/buildlet/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x4000034fd0 sp=0x4000034fd0 pc=0x75704
created by runtime.gcBgMarkStartWorkers
	/var/gobuilder/buildlet/go/src/runtime/mgc.go:1131 +0x2c
FAIL	os/exec	61.869s

@bcmills
Copy link
Contributor Author

bcmills commented Dec 16, 2021

It's a consistent pattern. The os/exec goroutines are gone, but writes to the input pipe continue to succeed.

greplogs --dashboard -md -l -e 'FAIL: TestContextCancel ' --since=2021-12-02

2021-12-15T23:51:57-6e7c691/netbsd-arm64-bsiegert
2021-12-15T19:56:22-567b177/netbsd-arm64-bsiegert
2021-12-09T13:36:38-474ebb9/netbsd-arm64-bsiegert
2021-12-04T04:41:31-cd5f2cf/netbsd-arm64-bsiegert

@gopherbot
Copy link

Change https://golang.org/cl/372795 mentions this issue: os/exec: skip TestContextCancel on netbsd/arm64

gopherbot pushed a commit that referenced this issue Dec 17, 2021
For #42061

Change-Id: I3b4c774ad9e375d4bfef1cfb4336c35ed30a6430
Reviewed-on: https://go-review.googlesource.com/c/go/+/372795
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@bcmills
Copy link
Contributor Author

bcmills commented Jun 13, 2023

Duplicate of #58699

@bcmills bcmills marked this as a duplicate of #58699 Jun 13, 2023
@bcmills bcmills closed this as not planned Won't fix, can't repro, duplicate, stale Jun 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExpertNeeded NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-NetBSD
Projects
None yet
Development

No branches or pull requests

3 participants