
os/exec: failures with "netpollBreak write failed" on linux-amd64 since 2021-11-10 #49533


Closed
bcmills opened this issue Nov 11, 2021 · 16 comments
Labels: FrozenDueToAge, NeedsFix, release-blocker, Testing

Comments

@bcmills
Contributor

bcmills commented Nov 11, 2021

@bcmills bcmills added the NeedsInvestigation and release-blocker labels Nov 11, 2021
@bcmills bcmills modified the milestones: Go1.19, Go1.18 Nov 11, 2021
@bcmills
Contributor Author

bcmills commented Nov 11, 2021

(@prattmic, @ChrisHines: Is this possibly related to #44343?)

@prattmic
Member

errno 32 is EPIPE, meaning the other end of the pipe (netpollBreakRd) is closed. The runtime never closes that FD, so I suspect this is a bug where something closes an FD it doesn't own.
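For reference, a minimal, self-contained sketch (Linux assumed; not taken from the runtime) of the condition described above: once a pipe's read end has been closed, a write to its write end fails with EPIPE, which is errno 32 on Linux.

package main

import (
	"errors"
	"fmt"
	"syscall"
)

func main() {
	// Create a pipe: p[0] is the read end, p[1] is the write end.
	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		panic(err)
	}

	// Simulate someone closing a descriptor they don't own: close the read end.
	syscall.Close(p[0])

	// A write to the other end now fails with EPIPE (errno 32). Go ignores
	// SIGPIPE for descriptors other than stdout/stderr, so the write returns
	// the error instead of killing the process.
	_, err := syscall.Write(p[1], []byte("x"))
	fmt.Println(err, errors.Is(err, syscall.EPIPE)) // broken pipe true
}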

@prattmic prattmic self-assigned this Nov 11, 2021
@bcmills
Contributor Author

bcmills commented Nov 11, 2021

The runtime never closes that FD, so I suspect this is a bug where something closes an FD it doesn't own.

Ooh, that sounds like another os.File finalizer race, maybe! Or, since the failure is in os/exec in both of the above, could it be related to #49500?
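For readers unfamiliar with the hazard being alluded to, here is an illustrative sketch (Linux assumed; not the test code): an *os.File created with os.NewFile closes its descriptor from a finalizer once the File becomes unreachable, so code that keeps using only the raw FD number can find it closed underneath it.

package main

import (
	"fmt"
	"os"
	"runtime"
	"syscall"
	"time"
)

func main() {
	// Hold only the raw descriptor, as the runtime does with its netpoll FDs.
	fd, err := syscall.Open("/dev/null", syscall.O_RDONLY, 0)
	if err != nil {
		panic(err)
	}

	// Wrap it in an *os.File and let the wrapper become unreachable.
	// os.NewFile attaches a finalizer that closes the descriptor when the
	// *os.File is garbage collected.
	func() {
		_ = os.NewFile(uintptr(fd), "wrapper")
	}()

	// Trigger a GC and give the finalizer goroutine a chance to run.
	// Finalization is not deterministic, so this is only a demonstration.
	runtime.GC()
	time.Sleep(100 * time.Millisecond)

	// The raw descriptor may now have been closed behind our back.
	var st syscall.Stat_t
	fmt.Println(syscall.Fstat(fd, &st)) // likely "bad file descriptor"
}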

@bcmills
Contributor Author

bcmills commented Nov 11, 2021

Yep, here it is, I think:

// We have no good portable way to check whether an FD is open.
// We use NewFile to create a *os.File, which lets us
// know whether it is open, but then we have to cope with
// the finalizer on the *os.File.
f := os.NewFile(fd, "")
if _, err := f.Stat(); err != nil {
	// Close the file to clear the finalizer.
	// We expect the Close to fail.
	f.Close()
} else {
	fmt.Printf("fd %d open at test start\n", fd)
	haveUnexpectedFDs = true
	// Use a global variable to avoid running
	// the finalizer, which would close the FD.
	unfinalizedFiles = append(unfinalizedFiles, f)
}

@bcmills bcmills changed the title from runtime: failures with "netpollBreak write failed" on linux-amd64 since 2021-11-10 to os/exec: failures with "netpollBreak write failed" on linux-amd64 since 2021-11-10 Nov 11, 2021
@bcmills bcmills added the Testing and NeedsFix labels Nov 11, 2021
@gopherbot gopherbot removed the NeedsInvestigation label Nov 11, 2021
@prattmic
Member

Oof, that is quite broken. But are we sure that is the cause of these crashes? That only runs for GO_WANT_HELPER_PROCESS=1, but the test logs look like it is the main test process crashing.

Either way, this may be another case of the new pacer shaking out a bug with slightly earlier GC runs.

cc @mknyszek

@bcmills
Contributor Author

bcmills commented Nov 11, 2021

That only runs for GO_WANT_HELPER_PROCESS=1, but the test logs look like it is the main test process crashing.

If I'm reading correctly, it's the opposite: it is only skipped for GO_WANT_HELPER_PROCESS=1.
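That is, the guard has roughly this shape (a paraphrase, not the actual exec_test.go; the helper function below is illustrative): the FD check runs in the main test process and returns early in helper processes re-exec'd with GO_WANT_HELPER_PROCESS=1.

package main

import (
	"fmt"
	"os"
)

func init() {
	if os.Getenv("GO_WANT_HELPER_PROCESS") == "1" {
		return // helper process: skip the FD check
	}
	checkOpenFDs() // main test process: probe low-numbered FDs
}

// checkOpenFDs stands in for the loop quoted earlier in the thread.
func checkOpenFDs() { fmt.Println("checking FDs in the main test process") }

func main() {}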

@prattmic
Member

🤦

@prattmic
Member

(I am attempting to reproduce this issue, no luck yet.)

Upon closer inspection, I'm still not entirely convinced this code is the issue. Note that:

  1. poll.IsPollDescriptor returns true for the epoll FD and the two netpollBreak FDs, so they should be skipped.
  2. The loop either closes files it thinks are bad FDs (Stat failed), thus clearing the finalizer, or puts the os.File in a global to force it to live for the entire process lifetime.

I do see two potential sources of bugs:

  1. Most notably, the epoll FD and netpollBreak FDs are initialized in netpollInit, which occurs on the first os.NewFile, net FD creation, or timer creation. The first iteration through the loop may happen before any of those, so FD 3 may pass that check, and then during NewFile become one of the netpoll FDs. However, in that case FD 3 should become the epoll FD, with netpollBreak as FDs 4 and 5, so if anything the epoll FD should be the one closed.

  2. The f.Stat(); f.Close() pattern assumes there can't be a racing open in between, but an earlier init function could have started a goroutine that opens a file concurrently.

@ianlancetaylor
Member

I think there is a race condition here. The call to f.Stat could occur before the pipe is created, so it fails. Then the pipe could be created, perhaps with the same descriptor as f. Then the init function calls f.Close, closing the pipe descriptor. This doesn't seem like a likely race, but it doesn't seem impossible.
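To make that sequence concrete, here is a self-contained sketch with the interleaving forced rather than left to chance (assuming fd 3 is not already open in the process, and mirroring the Stat-then-Close pattern quoted above rather than the exact test code):

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Step 1: the init loop probes fd 3, which is not open yet, so Stat fails.
	const fd = 3
	f := os.NewFile(uintptr(fd), "")
	_, statErr := f.Stat()

	// Step 2: between the Stat and the Close, something else is handed the
	// lowest free descriptors. A plain pipe stands in here for the runtime
	// creating its netpollBreak pipe; it will typically receive fds 3 and 4.
	var p [2]int
	if err := syscall.Pipe(p[:]); err != nil {
		panic(err)
	}

	// Step 3: the init loop, seeing the earlier Stat error, closes f to clear
	// the finalizer, and in doing so closes fd 3, which now belongs to the pipe.
	if statErr != nil {
		f.Close()
	}

	// A write to the pipe now fails with EPIPE because its read end (fd 3)
	// was closed out from under it, which is the symptom in the crash logs.
	_, err := syscall.Write(p[1], []byte("x"))
	fmt.Printf("pipe fds %v, write error: %v\n", p, err)
}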

@prattmic
Member

I did manage to reproduce this on a linux-amd64-fedora gomote after ~30min with:

$ gomote run -dir /workdir/go/src/os/exec $(gomote-instance) /workdir/go/bin/go test -c os/exec
$ gomote run -dir /workdir/go/src/os/exec $(gomote-instance) /bin/bash -c 'while /workdir/go/src/os/exec/exec.test -test.short; do true; done'

@prattmic
Member

I've also seen two failures of the form:

runtime: epollctl failed with 22
fatal error: runtime: epollctl failed

goroutine 4 [running]:
runtime.throw({0x74bc90?, 0x3?})
        /workdir/go/src/runtime/panic.go:992 +0x71 fp=0xc000033de0 sp=0xc000033db0 pc=0x436c91
runtime.netpollinit()
        /workdir/go/src/runtime/netpoll_epoll.go:54 +0x165 fp=0xc000033e30 sp=0xc000033de0 pc=0x432d05
runtime.netpollGenericInit()
        /workdir/go/src/runtime/netpoll.go:127 +0x3a fp=0xc000033e48 sp=0xc000033e30 pc=0x43239a
runtime.doaddtimer(0xc000022500, 0xc000068000)
        /workdir/go/src/runtime/time.go:287 +0x30 fp=0xc000033ea0 sp=0xc000033e48 pc=0x455650
runtime.modtimer(0xc000068000, 0x6726dc4c38, 0x0, 0x777938, {0x0?, 0x0}, 0x0)
        /workdir/go/src/runtime/time.go:493 +0x366 fp=0xc000033ee8 sp=0xc000033ea0 pc=0x456186
runtime.resettimer(...)
        /workdir/go/src/runtime/time.go:540
time.resetTimer(...)
        /workdir/go/src/runtime/time.go:230
runtime.scavengeSleep(0x4792a1c0)
        /workdir/go/src/runtime/mgcscavenge.go:243 +0x73 fp=0xc000033f38 sp=0xc000033ee8 pc=0x422a53
runtime.bgscavenge(0x0?)
        /workdir/go/src/runtime/mgcscavenge.go:381 +0x18f fp=0xc000033fc8 sp=0xc000033f38 pc=0x422c8f
runtime.gcenable.func2()
        /workdir/go/src/runtime/mgc.go:178 +0x26 fp=0xc000033fe0 sp=0xc000033fc8 pc=0x41acc6
runtime.goexit()
        /workdir/go/src/runtime/asm_amd64.s:1579 +0x1 fp=0xc000033fe8 sp=0xc000033fe0 pc=0x468961
created by runtime.gcenable
        /workdir/go/src/runtime/mgc.go:178 +0xaa

goroutine 1 [running, locked to thread]:
        goroutine running on other thread; stack unavailable

This is a failure attempting to add netpollBreakRd to epoll, which seems to be in the same vein.

I'm going to fix the races in that init function and we'll see if that gets rid of the crashes. I think it is worthwhile either way.

@bcmills
Contributor Author

bcmills commented Nov 12, 2021

This one also looks like it could be related:

2021-11-08T17:30:10-6a9d811/illumos-amd64:

runtime: port_getn on fd 3 failed (errno=9)
fatal error: runtime: netpoll failed

runtime stack:
runtime.throw({0x745af3?, 0x3?})
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/runtime/panic.go:992 +0x71 fp=0xfffffc7fc65ff130 sp=0xfffffc7fc65ff100 pc=0x436ab1
runtime.netpoll(0xc000024000?)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/runtime/netpoll_solaris.go:249 +0x46f fp=0xfffffc7fc65ffdc8 sp=0xfffffc7fc65ff130 pc=0x4328cf
runtime.findrunnable()
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/runtime/proc.go:2941 +0x585 fp=0xfffffc7fc65ffeb8 sp=0xfffffc7fc65ffdc8 pc=0x43e3a5
runtime.schedule()
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/runtime/proc.go:3361 +0x239 fp=0xfffffc7fc65fff00 sp=0xfffffc7fc65ffeb8 pc=0x43f599
runtime.park_m(0xc0000031e0?)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/runtime/proc.go:3510 +0x14d fp=0xfffffc7fc65fff30 sp=0xfffffc7fc65fff00 pc=0x43facd
runtime.mcall()
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/runtime/asm_amd64.s:433 +0x4c fp=0xfffffc7fc65fff40 sp=0xfffffc7fc65fff30 pc=0x46608c

goroutine 1 [runnable, locked to thread]:
runtime.syscall_sysvicall6(0x6bedd0, 0x2, 0x4, 0xc0001120f8, 0x0, 0x0, 0x0, 0x0)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/runtime/syscall_solaris.go:45 +0x85 fp=0xc0000b3ba0 sp=0xc0000b3b48 pc=0x464365
syscall.Fstat(0x994d60?, 0xc0001120c0?)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/syscall/zsyscall_solaris_amd64.go:485 +0x57 fp=0xc0000b3c18 sp=0xc0000b3ba0 pc=0x473fd7
…
FAIL	os/exec

@gopherbot
Contributor

Change https://golang.org/cl/364035 mentions this issue: os/exec: avoid NewFile on unknown FDs
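One way to probe whether a descriptor is open without creating an *os.File at all (and therefore without a finalizer and without ever closing anything) is to fstat the raw fd. A minimal, Linux-only sketch of that idea; the CL itself may take a different approach:

package main

import (
	"fmt"
	"syscall"
)

// fdIsOpen reports whether fd currently refers to an open descriptor by
// calling fstat on the raw fd. No *os.File is created, so no finalizer is
// registered and nothing is ever closed.
func fdIsOpen(fd int) bool {
	var st syscall.Stat_t
	return syscall.Fstat(fd, &st) == nil
}

func main() {
	for fd := 3; fd <= 10; fd++ {
		if fdIsOpen(fd) {
			fmt.Printf("fd %d open at test start\n", fd)
		}
	}
}

Because this never closes a descriptor, an fd number that gets reused between the probe and later code cannot be closed out from under its new owner.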

@prattmic
Member

With https://golang.org/cl/364035, I have not reproduced a crash for >45min. Before, I could repro in <5min.

@mknyszek
Contributor

Have you seen something like this?

runtime: netpoll: break fd ready for -2
fatal error: runtime: netpoll: break fd ready for something unexpected

I'm investigating #49500 and managed to reproduce such a failure. (Running TestExtraFiles in a loop.)

Full stack trace:

fd 4 open at test start
runtime: netpoll: break fd ready for -2
fatal error: runtime: netpoll: break fd ready for something unexpected

runtime stack:
runtime.throw({0x75844e?, 0x0?})
	/tmp/workdir/go/src/runtime/panic.go:992 +0x71
runtime.netpoll(0x800?)
	/tmp/workdir/go/src/runtime/netpoll_kqueue.go:147 +0x2e7
runtime.findrunnable()
	/tmp/workdir/go/src/runtime/proc.go:2750 +0x185
runtime.schedule()
	/tmp/workdir/go/src/runtime/proc.go:3361 +0x239
runtime.park_m(0xc000083520?)
	/tmp/workdir/go/src/runtime/proc.go:3510 +0x14d
runtime.mcall()
	/tmp/workdir/go/src/runtime/asm_amd64.s:433 +0x43

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000082820, {0x748d16?, 0xc00012bbb0?}, 0x7775e8)
	/tmp/workdir/go/src/testing/testing.go:1458 +0x37a
testing.runTests.func1(0xc000082820?)
	/tmp/workdir/go/src/testing/testing.go:1810 +0x6e
testing.tRunner(0xc000082820, 0xc00012bcd8)
	/tmp/workdir/go/src/testing/testing.go:1410 +0x102
testing.runTests(0xc00019e640?, {0x98dd00, 0x1f, 0x1f}, {0x801400108?, 0x40?, 0x0?})
	/tmp/workdir/go/src/testing/testing.go:1808 +0x457
testing.(*M).Run(0xc00019e640)
	/tmp/workdir/go/src/testing/testing.go:1690 +0x5d9
main.main()
	_testmain.go:111 +0x1aa

goroutine 9 [chan receive]:
os/exec_test.TestExtraFilesRace(0xc000082ea0)
	/tmp/workdir/go/src/os/exec/exec_test.go:753 +0x3e5
testing.tRunner(0xc000082ea0, 0x7775e8)
	/tmp/workdir/go/src/testing/testing.go:1410 +0x102
created by testing.(*T).Run
	/tmp/workdir/go/src/testing/testing.go:1457 +0x35f

goroutine 10 [syscall]:
syscall.Syscall6(0x214, 0x0, 0x133f, 0x0, 0x18, 0x0, 0x0)
	/tmp/workdir/go/src/syscall/asm_unix_amd64.s:39 +0x5
os.(*Process).blockUntilWaitable(0xc0000a4000)
	/tmp/workdir/go/src/os/wait_wait6.go:30 +0x5a
os.(*Process).wait(0xc0000a4000)
	/tmp/workdir/go/src/os/exec_unix.go:22 +0x28
os.(*Process).Wait(...)
	/tmp/workdir/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0xc000146000)
	/tmp/workdir/go/src/os/exec/exec.go:507 +0x54
os/exec.(*Cmd).Run(0x6192c8db?)
	/tmp/workdir/go/src/os/exec/exec.go:341 +0x39
os/exec.(*Cmd).CombinedOutput(0xc000146000)
	/tmp/workdir/go/src/os/exec/exec.go:567 +0x96
os/exec_test.TestExtraFilesRace.func3(0xc000082ea0?, 0x7775e8?)
	/tmp/workdir/go/src/os/exec/exec_test.go:731 +0x25
created by os/exec_test.TestExtraFilesRace
	/tmp/workdir/go/src/os/exec/exec_test.go:751 +0x35b

goroutine 11 [runnable]:
syscall.Syscall6(0x214, 0x0, 0x133e, 0x0, 0x18, 0x0, 0x0)
	/tmp/workdir/go/src/syscall/asm_unix_amd64.s:39 +0x5
os.(*Process).blockUntilWaitable(0xc000016330)
	/tmp/workdir/go/src/os/wait_wait6.go:30 +0x5a
os.(*Process).wait(0xc000016330)
	/tmp/workdir/go/src/os/exec_unix.go:22 +0x28
os.(*Process).Wait(...)
	/tmp/workdir/go/src/os/exec.go:132
os/exec.(*Cmd).Wait(0xc000146420)
	/tmp/workdir/go/src/os/exec/exec.go:507 +0x54
os/exec.(*Cmd).Run(0x0?)
	/tmp/workdir/go/src/os/exec/exec.go:341 +0x39
os/exec.(*Cmd).CombinedOutput(0xc000146420)
	/tmp/workdir/go/src/os/exec/exec.go:567 +0x96
os/exec_test.TestExtraFilesRace.func3(0x0?, 0x0?)
	/tmp/workdir/go/src/os/exec/exec_test.go:731 +0x25
created by os/exec_test.TestExtraFilesRace
	/tmp/workdir/go/src/os/exec/exec_test.go:752 +0x3c5

goroutine 12 [runnable]:
syscall.Syscall(0x3, 0xb, 0xc00005801e, 0x5e2)
	/tmp/workdir/go/src/syscall/asm_unix_amd64.s:19 +0x5
syscall.read(0xc0001a0780?, {0xc00005801e?, 0x44ce01?, 0x4ca29e?})
	/tmp/workdir/go/src/syscall/zsyscall_freebsd_amd64.go:949 +0x4d
syscall.Read(...)
	/tmp/workdir/go/src/syscall/syscall_unix.go:188
internal/poll.ignoringEINTRIO(...)
	/tmp/workdir/go/src/internal/poll/fd_unix.go:794
internal/poll.(*FD).Read(0xc0001a0780?, {0xc00005801e?, 0x5e2?, 0x5e2?})
	/tmp/workdir/go/src/internal/poll/fd_unix.go:163 +0x285
os.(*File).read(...)
	/tmp/workdir/go/src/os/file_posix.go:31
os.(*File).Read(0xc000010360, {0xc00005801e?, 0x8282980e8?, 0xc00002e6a0?})
	/tmp/workdir/go/src/os/file.go:119 +0x5e
bytes.(*Buffer).ReadFrom(0xc000073440, {0x7cf800, 0xc000010360})
	/tmp/workdir/go/src/bytes/buffer.go:204 +0x98
io.copyBuffer({0x7cf240, 0xc000073440}, {0x7cf800, 0xc000010360}, {0x0, 0x0, 0x0})
	/tmp/workdir/go/src/io/io.go:412 +0x14b
io.Copy(...)
	/tmp/workdir/go/src/io/io.go:385
os/exec.(*Cmd).writerDescriptor.func1()
	/tmp/workdir/go/src/os/exec/exec.go:311 +0x3a
os/exec.(*Cmd).Start.func1(0x0?)
	/tmp/workdir/go/src/os/exec/exec.go:441 +0x25
created by os/exec.(*Cmd).Start
	/tmp/workdir/go/src/os/exec/exec.go:440 +0x715

goroutine 20 [IO wait]:
internal/poll.runtime_pollWait(0x82820eac0, 0x72)
	/tmp/workdir/go/src/runtime/netpoll.go:233 +0x89
internal/poll.(*pollDesc).wait(0xc00005c240?, 0xc000202000?, 0x1)
	/tmp/workdir/go/src/internal/poll/fd_poll_runtime.go:83 +0x32
internal/poll.(*pollDesc).waitRead(...)
	/tmp/workdir/go/src/internal/poll/fd_poll_runtime.go:88
internal/poll.(*FD).Read(0xc00005c240, {0xc000202000, 0x200, 0x200})
	/tmp/workdir/go/src/internal/poll/fd_unix.go:167 +0x25a
os.(*File).read(...)
	/tmp/workdir/go/src/os/file_posix.go:31
os.(*File).Read(0xc000090010, {0xc000202000?, 0x90b40a?, 0xc00002dea0?})
	/tmp/workdir/go/src/os/file.go:119 +0x5e
bytes.(*Buffer).ReadFrom(0xc000092000, {0x7cf800, 0xc000090010})
	/tmp/workdir/go/src/bytes/buffer.go:204 +0x98
io.copyBuffer({0x7cf240, 0xc000092000}, {0x7cf800, 0xc000090010}, {0x0, 0x0, 0x0})
	/tmp/workdir/go/src/io/io.go:412 +0x14b
io.Copy(...)
	/tmp/workdir/go/src/io/io.go:385
os/exec.(*Cmd).writerDescriptor.func1()
	/tmp/workdir/go/src/os/exec/exec.go:311 +0x3a
os/exec.(*Cmd).Start.func1(0x7775e8?)
	/tmp/workdir/go/src/os/exec/exec.go:441 +0x25
created by os/exec.(*Cmd).Start
	/tmp/workdir/go/src/os/exec/exec.go:440 +0x715

@mknyszek
Contributor

Actually yeah, given that "fd 4 open at test start" gets printed, I think it's the same. I'm gonna bet FD 4 gets erroneously closed.

@golang golang locked and limited conversation to collaborators Jun 24, 2023