Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: apparent deadlock in TestFdReadRace on Solaris-derivative builders #44818

Closed
bcmills opened this issue Mar 5, 2021 · 7 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 OS-illumos OS-Solaris release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Mar 5, 2021

2021-03-05T09:24:34-2217e89/illumos-amd64

goroutine 260 [semacquire, 2 minutes]:
sync.runtime_Semacquire(0xc00031f4a8)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc00031f4a0)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/sync/waitgroup.go:130 +0x65
os_test.TestFdReadRace(0xc000453080)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/os/pipe_test.go:474 +0x1e9
testing.tRunner(0xc000453080, 0x5e5358)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/testing/testing.go:1194 +0xef
created by testing.(*T).Run
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/testing/testing.go:1239 +0x2b3

…

goroutine 383 [syscall, 2 minutes]:
syscall.read(0x6, 0xc000445786, 0xa, 0xa, 0x0, 0x405ba5, 0xc000051260)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/syscall/zsyscall_solaris_amd64.go:766 +0x8d
syscall.Read(...)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/syscall/syscall_unix.go:188
internal/poll.ignoringEINTRIO(...)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/internal/poll/fd_unix.go:582
internal/poll.(*FD).Read(0xc00004ef00, 0xc000445786, 0xa, 0xa, 0x0, 0x0, 0x0)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/internal/poll/fd_unix.go:163 +0x14e
os.(*File).read(...)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/os/file_posix.go:32
os.(*File).Read(0xc000010270, 0xc000445786, 0xa, 0xa, 0x0, 0x0, 0x6fdee0)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/os/file.go:119 +0x77
os_test.TestFdReadRace.func1(0xc00031f4a0, 0xc000010270, 0xc000051260, 0xc000453080)
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/os/pipe_test.go:453 +0xf2
created by os_test.TestFdReadRace
	/var/tmp/workdir-host-illumos-amd64-jclulow/go/src/os/pipe_test.go:448 +0x18a

2020-04-23T19:40:07-6303c34/solaris-amd64-oraclerel

goroutine 194 [semacquire, 2 minutes]:
sync.runtime_Semacquire(0xc00007e008)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc00007e000)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/sync/waitgroup.go:130 +0x64
os_test.TestFdReadRace(0xc0006c2140)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/pipe_test.go:454 +0x1da
testing.tRunner(0xc0006c2140, 0x5b0c00)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:1027 +0xdc
created by testing.(*T).Run
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/testing/testing.go:1078 +0x355

…

goroutine 417 [syscall, 2 minutes]:
syscall.read(0xd, 0xc00029e786, 0xa, 0xa, 0x0, 0xc0007c2000, 0xc0000801e0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/syscall/zsyscall_solaris_amd64.go:765 +0x90
syscall.Read(...)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/syscall/syscall_unix.go:189
internal/poll.(*FD).Read(0xc0000800c0, 0xc00029e786, 0xa, 0xa, 0x0, 0x0, 0x0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/internal/poll/fd_unix.go:156 +0x133
os.(*File).read(...)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/file_posix.go:31
os.(*File).Read(0xc0003b0000, 0xc00029e786, 0xa, 0xa, 0x0, 0x0, 0x6f65e0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/file.go:116 +0x71
os_test.TestFdReadRace.func1(0xc00007e000, 0xc0003b0000, 0xc0007c2000, 0xc0006c2140)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/pipe_test.go:433 +0xed
created by os_test.TestFdReadRace
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/os/pipe_test.go:428 +0x17f

It's not obvious to me whether this in a bug in the test, the os package, the Go runtime, or a longstanding bug in Solaris-based kernels.

CC @jclulow @ianlancetaylor @bradfitz

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-illumos OS-Solaris labels Mar 5, 2021
@bcmills bcmills added this to the Backlog milestone Mar 5, 2021
@bcmills
Copy link
Contributor Author

bcmills commented May 3, 2021

Given the sudden increase in the failure rate at the beginning of March, I think this needs to be a release-blocker at least until we can figure out whether it is a symptom of a 1.17 regression in the os package.

@bcmills bcmills modified the milestones: Backlog, Go1.17 May 3, 2021
@toothrot
Copy link
Contributor

toothrot commented May 6, 2021

Also /cc'ing @griesemer as an owner of os

@heschi
Copy link
Contributor

heschi commented May 13, 2021

Weekly check-in: this needs to be investigated before beta 1.

@dmitshur
Copy link
Contributor

Another weekly ping as this is a release blocking issue.

@dmitshur dmitshur added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label May 20, 2021
@ianlancetaylor
Copy link
Contributor

I think I see a possible sequence of events that can lead to this. TestFdReadRace is making a concurrent call to Read in one goroutine and to Fd and Close in a different goroutine. The call to Fd is interesting because it puts the descriptor into blocking mode. I think that this can happen:

  • goroutine 2 calls Fd
    • Fd calls the fcntl system call to put the descriptor into blocking mode
  • goroutine 1 calls Read
    • Read calls readLock which adds a reference to the associated fdMutex
    • Read calls the read system call, which blocks, because the descriptor is in blocking mode
  • goroutine 2 calls Close
    • Close calls increfAndClose which adds a reference and sets mutexClosed in the fdMutex
    • Close calls evict to unblock any I/O waiting in the poller, but there is none
    • Close calls decref, which removes a reference, leaving the one added by Read
    • Close sees that the FD is in blocking mode, so it returns

At this point goroutine 2 has returned and exited, and goroutine 1 is sitting in a blocked call to read which will never wake up. This is consistent with the stack traces in the test failures.

As far as I can tell this could happen on any system. I don't know why we only see the problem on Solaris. The test has a 10 millisecond wait to let goroutine 1 call Read before goroutine 2 calls Fd. If Read is called before the descriptor is put into blocking mode, the problem described here won't happen. It is possible that on Solaris the timeout is not enough for goroutine 1 to call Read, but I don't know why Solaris would be significantly different here.

So I'm not sure that this analysis is correct. Maybe I'm missing something. If this is correct, I don't see a way to fix it. Concurrent calls to Fd and Read will always carry the possibility of a blocking read system call that never returns. This is not a likely sequence of events. But see https://golang.org/cl/123176, which was written because of an internal Google test case that did exactly that (internal Google reference 111218325).

@gopherbot
Copy link

Change https://golang.org/cl/322893 mentions this issue: os: deflake TestFdReadRace

@golang golang locked and limited conversation to collaborators May 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 OS-illumos OS-Solaris release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants