Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: TestDialerLocalAddr and TestNetpollDeadlock are flaky on macOS 10.12 #22019

Closed
dr2chase opened this issue Sep 25, 2017 · 16 comments
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@dr2chase
Copy link
Contributor

What version of Go are you using (go version)?

development (1.9+)
go version devel +4fa3d69e00 Fri Sep 22 14:03:36 2017 -0400 darwin/amd64

Does this issue reproduce with the latest release?

intermittently. That's the bug.

What operating system and processor architecture are you using (go env)?

go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/drchase/GoogleDrive/work/gocode"
GORACE=""
GOROOT="/Users/drchase/GoogleDrive/work/go"
GOTOOLDIR="/Users/drchase/GoogleDrive/work/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/y4/90mx9nxx54bfvgh9jh0_410h0095tn/T/go-build459430235=/tmp/go-build -gno-record-gcc-switches -fno-common"
CXX="clang++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

If you run run.bash repeatedly on a Mac, for instance running a binary search to isolate a single failing function, the test sometimes fails. This then causes the binary search to go down a false path and not work.

@ianlancetaylor ianlancetaylor changed the title net/TestDialerLocalAddr is flaky on Mac and should be disabled net: TestDialerLocalAddr is flaky on Mac and should be disabled Sep 25, 2017
@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin labels Sep 25, 2017
@ianlancetaylor ianlancetaylor added this to the Go1.10 milestone Sep 25, 2017
@ianlancetaylor
Copy link
Contributor

CC @mikioh

@ianlancetaylor
Copy link
Contributor

Here are some failures from the builders:

2017-12-01T00:23:05-301b127/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.11s)
	dial_test.go:642: tcp6 [::]:0->::1: got dial tcp6 [::]:0->[::1]:59465: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-01T15:43:52-08176b2/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:642: tcp [::1]:0->::1: got dial tcp [::1]:0->[::1]:49317: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-01T20:54:31-7b57e21/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.05s)
	dial_test.go:642: tcp6 :0->::1: got dial tcp6 :0->[::1]:59475: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-01T21:05:40-8bb51a7/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:642: tcp [::]:0->::1: got dial tcp [::]:0->[::1]:49294: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-01T23:50:23-bfa7a55/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:642: tcp 0.0.0.0:0->::1: got dial tcp 0.0.0.0:0->[::1]:59413: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-04T21:45:40-9d70b3a/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.02s)
	dial_test.go:642: tcp6 0.0.0.0:0->::1: got dial tcp6 0.0.0.0:0->[::1]:49309: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-05T18:42:38-49fec9b/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:642: tcp6 [::1]:0->::1: got dial tcp6 [::1]:0->[::1]:49296: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-06T18:33:50-3e52f22/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:642: tcp6 [::]:0->::1: got dial tcp6 [::]:0->[::1]:53707: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-14T17:12:09-1fc71e3/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.03s)
	dial_test.go:637: tcp 0.0.0.0:0->::1: got dial tcp 0.0.0.0:0->[::1]:49306: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

2017-12-18T20:24:12-3aeb395/freebsd-arm-paulzhol:
--- FAIL: TestDialerLocalAddr (0.03s)
	dial_test.go:637: tcp [::]:0->127.0.0.1: got dial tcp [::]:0->127.0.0.1:46629: connect: address already in use; want <nil>
FAIL
FAIL	net

2018-01-02T20:02:06-2ba3fd4/darwin-amd64-10_12:
--- FAIL: TestDialerLocalAddr (75.02s)
	dial_test.go:637: tcp [::]:0->::1: got dial tcp [::]:0->[::1]:59610: getsockopt: operation timed out; want <nil>
FAIL
FAIL	net

@gopherbot
Copy link

Change https://golang.org/cl/86000 mentions this issue: net: ignore TestDialerLocalAddr timeout failures on Darwin

gopherbot pushed a commit that referenced this issue Jan 3, 2018
I don't know why these errors occur. Ignore them to avoid breaking the
build.

Updates #22019

Change-Id: Ia048e6d9b928e8e237b311ff3a364e7a23af4aa4
Reviewed-on: https://go-review.googlesource.com/86000
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Unplanned Jan 3, 2018
@bcmills
Copy link
Contributor

bcmills commented May 15, 2019

#22403 was marked as a duplicate of this, and recurred in https://build.golang.org/log/b8eabc31874c5ab9c691a4a3e4dceb2c8656b16c:

--- FAIL: TestNetpollDeadlock (60.07s)
    crash_test.go:95: testprognet NetpollDeadlock exit status: exit status 2
    crash_test.go:409: output does not start with "done\n":
        dialing
        SIGQUIT: quit
        PC=0x7fffa2d6dc86 m=0 sigcode=0
        
        goroutine 0 [idle]:
        runtime.pthread_cond_wait(0x123c288, 0x123c248, 0x7fff00000000)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sys_darwin.go:368 +0x39
        runtime.semasleep(0xffffffffffffffff, 0x7fff5fbff4f8)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/os_darwin.go:63 +0x85
        runtime.notesleep(0x123c048)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/lock_sema.go:173 +0xe0
        runtime.stoplockedm()
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:2068 +0x88
        runtime.schedule()
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:2469 +0x485
        runtime.park_m(0xc000000180)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/proc.go:2610 +0x9d
        runtime.mcall(0x1057ce6)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/asm_amd64.s:318 +0x5b
        
        goroutine 1 [select, locked to thread]:
        net.(*sysDialer).dialParallel(0xc0000ee080, 0x1168940, 0xc0000ae008, 0xc0000dc060, 0x1, 0x1, 0xc0000dc070, 0x1, 0x1, 0x0, ...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:489 +0x322
        net.(*Dialer).DialContext(0xc000044e30, 0x1168940, 0xc0000ae008, 0x1140620, 0x3, 0x11421f0, 0xf, 0x0, 0x0, 0x0, ...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:422 +0x4f0
        net.(*Dialer).Dial(...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:347
        net.Dial(0x1140620, 0x3, 0x11421f0, 0xf, 0x1, 0x8, 0x0, 0x0)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:318 +0xa8
        main.NetpollDeadlockInit()
        	/private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/testdata/testprognet/net.go:19 +0xa7
        main.registerInit(...)
        	/private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/testdata/testprognet/main.go:20
        main.init.0()
        	/private/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/testdata/testprognet/net.go:13 +0x106
        
        goroutine 34 [syscall]:
        os/signal.signal_recv(0x0)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/sigqueue.go:139 +0x9f
        os/signal.loop()
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:23 +0x22
        created by os/signal.init.0
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/os/signal/signal_unix.go:29 +0x41
        
        goroutine 23 [IO wait]:
        internal/poll.runtime_pollWait(0x1671f08, 0x77, 0xc000045950)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/netpoll.go:184 +0x55
        internal/poll.(*pollDesc).wait(0xc0000f4018, 0x77, 0x1168900, 0xc000086200, 0xc0000f4000)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/internal/poll/fd_poll_runtime.go:87 +0x10b
        internal/poll.(*pollDesc).waitWrite(...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/internal/poll/fd_poll_runtime.go:96
        internal/poll.(*FD).WaitWrite(...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/internal/poll/fd_unix.go:498
        net.(*netFD).connect(0xc0000f4000, 0x1168900, 0xc000086200, 0x0, 0x0, 0x11674c0, 0xc0000f2080, 0x0, 0x0, 0x0, ...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_unix.go:152 +0x27e
        net.(*netFD).dial(0xc0000f4000, 0x1168900, 0xc000086200, 0x1168f20, 0x0, 0x1168f20, 0xc0000b0150, 0x0, 0x1, 0xc000045b78)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/sock_posix.go:149 +0xff
        net.socket(0x1168900, 0xc000086200, 0x1140620, 0x3, 0x1e, 0x1, 0x0, 0x0, 0x1168f20, 0x0, ...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/sock_posix.go:70 +0x1c0
        net.internetSocket(0x1168900, 0xc000086200, 0x1140620, 0x3, 0x1168f20, 0x0, 0x1168f20, 0xc0000b0150, 0x1, 0x0, ...)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/ipsock_posix.go:141 +0x141
        net.(*sysDialer).doDialTCP(0xc0000ee080, 0x1168900, 0xc000086200, 0x0, 0xc0000b0150, 0x1116f00, 0x1256a30, 0x0)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/tcpsock_posix.go:65 +0xc2
        net.(*sysDialer).dialTCP(0xc0000ee080, 0x1168900, 0xc000086200, 0x0, 0xc0000b0150, 0xa3036326a7, 0x2b9c0830, 0x2b9c0830000345b8)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/tcpsock_posix.go:61 +0xd7
        net.(*sysDialer).dialSingle(0xc0000ee080, 0x1168900, 0xc000086200, 0x1167900, 0xc0000b0150, 0x0, 0x0, 0x0, 0x0)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:578 +0x34b
        net.(*sysDialer).dialSerial(0xc0000ee080, 0x1168900, 0xc000086200, 0xc0000dc060, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:546 +0x221
        net.(*sysDialer).dialParallel.func1(0x1168900, 0xc000086200, 0xc0000dc001)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:467 +0x95
        created by net.(*sysDialer).dialParallel
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:482 +0x226
        
        goroutine 35 [select]:
        net.(*netFD).connect.func2(0x1168900, 0xc000086200, 0xc0000f4000, 0xc00008a1e0, 0xc00008a180)
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_unix.go:129 +0xba
        created by net.(*netFD).connect
        	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_unix.go:128 +0x256
        
        rax    0x104
        rbx    0x500
        rcx    0x7fff5fbff318
        rdx    0x500
        rdi    0x123c288
        rsi    0x50100000600
        rbp    0x7fff5fbff3b0
        rsp    0x7fff5fbff318
        r8     0x0
        r9     0x60
        r10    0x0
        r11    0x202
        r12    0x123c288
        r13    0x16
        r14    0x1
        r15    0x7fffaba873c0
        rip    0x7fffa2d6dc86
        rflags 0x203
        cs     0x7
        fs     0x0
        gs     0x0
FAIL
FAIL	runtime	90.150s

@bcmills bcmills changed the title net: TestDialerLocalAddr is flaky on Mac and should be disabled net: TestDialerLocalAddr and TestNetpollDeadlock are flaky on Mac May 15, 2019
@bcmills
Copy link
Contributor

bcmills commented Sep 11, 2019

@bcmills bcmills added the Testing An issue that has been verified to require only test changes, not just a test failure. label Sep 11, 2019
@bcmills bcmills modified the milestones: Unplanned, Go1.14 Sep 11, 2019
@bcmills
Copy link
Contributor

bcmills commented Sep 11, 2019

@mikioh @bradfitz @ianlancetaylor, I think we need someone to look into this for 1.14.

I'm worried about another case like #32336, where an occasional test flake turned out to be a symptom of a serious bug.

@bcmills
Copy link
Contributor

bcmills commented Sep 16, 2019

@bcmills
Copy link
Contributor

bcmills commented Sep 26, 2019

@bcmills
Copy link
Contributor

bcmills commented Oct 2, 2019

@bcmills
Copy link
Contributor

bcmills commented Oct 2, 2019

#32919 may be related.

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@ianlancetaylor
Copy link
Contributor

As far as I can see all failures are on Darwin 10.12. There are no failures on newer versions. This makes me suspect that it is a kernel problem that was fixed.

@bcmills
Copy link
Contributor

bcmills commented Oct 15, 2019

@ianlancetaylor, as far as I can tell that is indeed the case.

Do you think 10.12 is in wide enough usage to warrant a workaround for the kernel issue?

If not, should we skip TestNetpollDeadlock on the darwin-amd64-10_12 builder, and perhaps upgrade the darwin-amd64-race builder to get past the bug?

(CC @bradfitz @dmitshur @toothrot)

@ianlancetaylor
Copy link
Contributor

I personally have no idea whether 10.12 is in wide use. I would be fine with a workaround but I don't have one. The test seems to be timing out waiting for a non-blocking connect to complete. I don't know why.

@dmitshur
Copy link
Contributor

In case it's helpful, this StatCounter Global Stats page suggests 10.12 is currently at just under 10% use, while 10.13 is at just over 20%, and 10.14 is at over 50%. (10.15 isn't on the chart yet; the latest entry is from September, before it was out.)

@bcmills
Copy link
Contributor

bcmills commented Oct 15, 2019

To me that suggests that we should probably upgrade the -race builder and add some skips for the flaky tests. I'll file an issue for the upgrade.

@bcmills bcmills changed the title net: TestDialerLocalAddr and TestNetpollDeadlock are flaky on Mac net: TestDialerLocalAddr and TestNetpollDeadlock are flaky on macOS 10.12 Oct 15, 2019
@gopherbot
Copy link

Change https://golang.org/cl/202618 mentions this issue: net: ignore or skip known-flaky localhost Dial operations on macOS 10.12 builder

@golang golang locked and limited conversation to collaborators Oct 21, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

6 participants