
cmd/go: ScriptTest consistently timing out on dragonfly-amd64-5_8 builder #38797

Closed
bcmills opened this issue May 1, 2020 · 10 comments
Labels
Builders (x/build issues: builders, bots, dashboards), FrozenDueToAge, NeedsInvestigation, OS-Dragonfly
Milestone
Go1.15

Comments

@bcmills
Contributor

bcmills commented May 1, 2020

As of CL 231223, the dragonfly-amd64-5_8 builder (and only that builder) is consistently timing out on the cmd/go tests, during one of the script tests:
https://build.golang.org/log/fb00e74ea4d50113498d60db4d6b30c09ee0a4ea

The builder goes unresponsive enough that the test's usual timeout behavior doesn't halt the test in time, so all we get is a goroutine dump from the test process (which does not reveal the source of the hang).

I tried to use gomote ssh to investigate, but it failed due to a configuration error (#38796):

~/go/src$ gomote ssh
$ ssh -p 2222 user-bcmills-dragonfly-amd64-5_8-0@farmer.golang.org # auth using https://github.com/bcmills.keys
Enter passphrase for key '/usr/local/google/home/bcmills/.ssh/id_rsa':
instance "user-bcmills-dragonfly-amd64-5_8-0" host type "host-dragonfly-amd64-5_8" does not have SSH configured
Connection to farmer.golang.org closed.

I don't see how CL 231223 could be causing deadlocks, since it is mostly a straight refactor, but I tried a revert (in CL 231557) and it passed as a SlowBot. So I'm not sure what to do about that: I hate to roll back based on a seemingly-unrelated failure on a builder I can't even access, especially given that the mainline dragonfly-amd64 builder is still passing.

If I could at least figure out which test is deadlocking, I could add a skip for that test on the theory that it's likely a bad interaction with a kernel bug...
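
Such a skip would look roughly like this (a sketch only; TestSuspectScript is a placeholder for whichever script test turns out to be hanging, not a real test name):

package example

import (
	"runtime"
	"testing"
)

// TestSuspectScript stands in for whichever cmd/go script test is
// deadlocking on the dragonfly-amd64-5_8 builder.
func TestSuspectScript(t *testing.T) {
	if runtime.GOOS == "dragonfly" {
		t.Skip("skipping on dragonfly: hangs on the 5.8 builder; see golang.org/issue/38797")
	}
	// ... run the script as usual ...
}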

CC @tuxillo @andybons

@bcmills added the OS-Dragonfly and NeedsInvestigation labels May 1, 2020
@bcmills added this to the Go1.15 milestone May 1, 2020
@bcmills
Contributor Author

bcmills commented May 1, 2020

go test proxy running at GOPROXY=http://127.0.0.1:64987/mod
go proxy: no archive rsc.io v1.5.2: file does not exist
panic: test timed out after 9m0s

goroutine 1853 [running]:
testing.(*M).startAlarm.func1()
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/testing/testing.go:1534 +0xdf
created by time.goFunc
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/time/sleep.go:167 +0x44

…

goroutine 1849 [IO wait, 8 minutes]:
internal/poll.runtime_pollWait(0x8013eea18, 0x72, 0xffffffffffffffff)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc0008bb278, 0x72, 0x8001, 0x8000, 0xffffffffffffffff)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0008bb260, 0xc0005f6000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/internal/poll/fd_unix.go:160 +0x193
os.(*File).read(...)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/file_posix.go:31
os.(*File).Read(0xc000010688, 0xc0005f6000, 0x8000, 0x8000, 0x0, 0x406f00, 0xc0008bafc0)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/file.go:116 +0x71
io.copyBuffer(0xbb9d60, 0xc0001cd480, 0xbb9bc0, 0xc000010688, 0xc0005f6000, 0x8000, 0x8000, 0xc0008bafc0, 0xc000494fb0, 0x1)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/io/io.go:409 +0x122
io.Copy(...)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/io/io.go:368
os/exec.(*Cmd).writerDescriptor.func1(0xc000494fb0, 0xc0001cd3c0)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/exec/exec.go:311 +0x63
os/exec.(*Cmd).Start.func1(0xc0000d49a0, 0xc0001cd520)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/exec/exec.go:441 +0x27
created by os/exec.(*Cmd).Start
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/exec/exec.go:440 +0x60a

goroutine 1848 [IO wait, 8 minutes]:
internal/poll.runtime_pollWait(0x8013eee78, 0x72, 0xffffffffffffffff)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/runtime/netpoll.go:220 +0x55
internal/poll.(*pollDesc).wait(0xc0008bb1b8, 0x72, 0x8001, 0x8000, 0xffffffffffffffff)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0008bb1a0, 0xc0005c2000, 0x8000, 0x8000, 0x0, 0x0, 0x0)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/internal/poll/fd_unix.go:160 +0x193
os.(*File).read(...)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/file_posix.go:31
os.(*File).Read(0xc000010670, 0xc0005c2000, 0x8000, 0x8000, 0x0, 0x9af600, 0xc000494738)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/file.go:116 +0x71
io.copyBuffer(0xbb9d60, 0xc0001cd460, 0xbb9bc0, 0xc000010670, 0xc0005c2000, 0x8000, 0x8000, 0x3, 0x0, 0xc0004313e0)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/io/io.go:409 +0x122
io.Copy(...)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/io/io.go:368
os/exec.(*Cmd).writerDescriptor.func1(0xc0001ef4a0, 0xc0000d4840)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/exec/exec.go:311 +0x63
os/exec.(*Cmd).Start.func1(0xc0000d49a0, 0xc0001cd4e0)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/exec/exec.go:441 +0x27
created by os/exec.(*Cmd).Start
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/exec/exec.go:440 +0x60a

goroutine 1850 [chan receive]:
cmd/go_test.ctxWait(0xbc81a0, 0xc0001ef4a0, 0xc0000d49a0, 0x0, 0x0)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/cmd/go/script_test.go:1030 +0x160
cmd/go_test.(*testScript).cmdExec.func1(0xc0000d49a0, 0xc0002aede0)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/cmd/go/script_test.go:643 +0x43
created by cmd/go_test.(*testScript).cmdExec
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/cmd/go/script_test.go:642 +0x326

goroutine 1851 [syscall, 8 minutes]:
syscall.Syscall6(0x7, 0x77678, 0xc000490e9c, 0x0, 0xc0002f8750, 0x0, 0x0, 0xc0002f8750, 0x0, 0xc000490e68)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/syscall/asm_unix_amd64.s:39 +0x5
syscall.wait4(0x77678, 0xc000490e9c, 0x0, 0xc0002f8750, 0x90, 0xabfa00, 0x461101)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/syscall/zsyscall_dragonfly_amd64.go:34 +0x7b
syscall.Wait4(0x77678, 0xc000490eec, 0x0, 0xc0002f8750, 0xc00030d680, 0xc000490f58, 0x406f2f)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/syscall/syscall_bsd.go:129 +0x51
os.(*Process).wait(0xc0002fa540, 0xc0008bad80, 0x0, 0x0)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/exec_unix.go:38 +0x7b
os.(*Process).Wait(...)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/exec.go:125
os/exec.(*Cmd).Wait(0xc0000d49a0, 0xc0008baf00, 0xc000490fb0)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/os/exec/exec.go:507 +0x60
cmd/go_test.ctxWait.func1(0xc0008bb380, 0xc0000d49a0)
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/cmd/go/script_test.go:1023 +0x2b
created by cmd/go_test.ctxWait
	/tmp/workdir-host-dragonfly-amd64-5_8/go/src/cmd/go/script_test.go:1023 +0x77

…
FAIL	cmd/go	547.208s
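
For context, the ctxWait frames above correspond to a pattern roughly like the following simplified sketch (not the actual script_test.go code): cmd.Wait runs in a helper goroutine, and the caller blocks on either its result or the test context. In the dump, the Wait goroutine is parked in wait4 and never returns, so the receive never completes.

package example

import (
	"context"
	"os/exec"
)

// ctxWaitSketch shows the shape visible in the dump: Wait runs in its own
// goroutine and the caller blocks on its result or on context cancellation.
// (The real test also sends the child an interrupt on cancellation before
// waiting again; that detail is omitted here.)
func ctxWaitSketch(ctx context.Context, cmd *exec.Cmd) error {
	errc := make(chan error, 1)
	go func() { errc <- cmd.Wait() }()
	select {
	case err := <-errc:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}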

@jirfag

jirfag commented May 6, 2020

looks similar to #38824

@bcmills
Contributor Author

bcmills commented May 6, 2020

@jirfag, could you expand on that a bit? The goroutine dump linked above doesn't seem to show anything stuck in syscall.StartProcess, which is the key symptom of that issue.

@jirfag

jirfag commented May 7, 2020

you are right, sorry, I misread it

@bcmills self-assigned this May 13, 2020
@bcmills
Contributor Author

bcmills commented May 13, 2020

This builder seems to be fairly slow overall. I logged in using gomote ssh and ran make.bash and several runs of ../bin/go test cmd/go, and noticed that even make.bash seems very sluggish.

I made some tweaks to the cmd/go tests to make the timeout-shutdown path more responsive, and I haven't been able to reproduce the timeout since then, but I also haven't seen a go test cmd/go run clock in under 500s (the default go test timeout is 600s, and somehow most of the builders end up reducing that to 540s).

However, the builder does not seem to have GO_TEST_SHORT set, so by my understanding it should be running go test -short cmd/go rather than the full 500s long-mode test. That command reliably completes in ~55s when run in a gomote ssh session, so I have no idea why it would be timing out after 540s on the builder.

@tuxillo, is there a reason this builder should be particularly slow, or something else about it that would cause it to run long tests instead of short ones?

@andybons, is there some way to confirm that the tests are indeed being run with -short as expected?
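
(For reference, the knob that -short controls is testing.Short(); a throwaway diagnostic along these lines would make the mode visible in the logs. Sketch only:)

package example

import "testing"

// TestReportShortMode is a throwaway diagnostic: it records whether the test
// binary was invoked with -short. Note that go test only prints t.Logf output
// for failing tests or when run with -v.
func TestReportShortMode(t *testing.T) {
	t.Logf("testing.Short() = %v", testing.Short())
}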

@gopherbot

Change https://golang.org/cl/233526 mentions this issue: cmd/go: terminate TestScript commands more aggressively when the test times out

@bcmills added the Builders label May 13, 2020
@tuxillo
Contributor

tuxillo commented May 13, 2020

@bcmills that is strange indeed. The builder is running as a VM on a Linux host, the same host as the other dragonfly builder.
How do I reproduce the issue?

@bcmills
Contributor Author

bcmills commented May 13, 2020

@tuxillo, I still haven't been able to reproduce the failure reliably outside of the post-submit builds. I'm hoping that CL 233526 will help to clarify the post-submit failures.

gopherbot pushed a commit that referenced this issue May 14, 2020
cmd/go: terminate TestScript commands more aggressively when the test times out

- Avoid starting subprocesses when the test is already very close to
  timing out. The overhead of starting and stopping processes may
  cause the test to exceed its deadline even if each individual
  process is signaled soon after it is started.

- If a command does not shut down quickly enough after receiving
  os.Interrupt, send it os.Kill using the same style of grace period
  as in CL 228438.

- Fail the test if a background command whose exit status is not
  ignored is left running at the end of the test. We have no reliable
  way to distinguish a failure due to the termination signal from an
  unexpected failure, and the termination signal varies across
  platforms (so may cause failure on one platform but success on
  another).

For #38797

Change-Id: I767898cf551dca45579bf01a9d1bb312e12d6193
Reviewed-on: https://go-review.googlesource.com/c/go/+/233526
Run-TryBot: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Jay Conrod <jayconrod@google.com>
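
The interrupt-then-kill grace period described in that commit message follows a familiar pattern; a rough sketch (not the actual CL 233526 code, with the grace period left as a plain parameter rather than however the real code derives it):

package example

import (
	"os"
	"os/exec"
	"time"
)

// waitOrKill sketches an interrupt-then-kill grace period: send os.Interrupt,
// give the command a window to exit on its own, then escalate to os.Kill if
// it has not exited. Error handling is minimal for brevity.
func waitOrKill(cmd *exec.Cmd, grace time.Duration) error {
	cmd.Process.Signal(os.Interrupt) // best effort; not supported on every platform

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err
	case <-time.After(grace):
		cmd.Process.Kill()
		return <-done
	}
}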
@bcmills
Contributor Author

bcmills commented May 15, 2020

This symptom is no longer occurring, although I don't know what would have fixed it. (My CL was only improving the diagnostics on failure.)

I still suspect that there may be something wrong with this builder, but I don't know what and don't plan to investigate further. We can refer back to this issue if there is another regression on this builder.

Since the specific symptom reported in this issue is no longer present, closing as non-reproducible.

@bcmills closed this as completed May 15, 2020
@tuxillo
Contributor

tuxillo commented May 17, 2020

@bcmills Thanks a lot for taking care of this. If the builder starts failing again, I'll reinstall it.

@golang locked and limited conversation to collaborators May 17, 2021
@rsc unassigned bcmills Jun 23, 2022