Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: deadlock in TestNotTemporaryRead via net.withTCPConnPair #29685

Closed
Helflym opened this issue Jan 11, 2019 · 12 comments
Closed

net: deadlock in TestNotTemporaryRead via net.withTCPConnPair #29685

Helflym opened this issue Jan 11, 2019 · 12 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Testing An issue that has been verified to require only test changes, not just a test failure.
Milestone

Comments

@Helflym
Copy link
Contributor

Helflym commented Jan 11, 2019

Hi,
I'm trying to resolve timeouts occurring on aix/ppc64 with net.TestNotTemporaryRead.

https://build.golang.org/log/45540cc03c1d37057e8f725d7f2dd431652ddf4c
https://build.golang.org/log/37d60c3b3cd46cf39d118f84e695049d390da40e
...

This timeout occurs because Accept() seems to be stuck in a infinite loop if the server is already closed. It's only a guess because I can't trigger the bug manually on my local machine. However, a similar behaviorr can be easily made with: https://play.golang.org/p/0IXrHf87i-2.
It does work on linux/amd64 but it times out on aix/ppc64. This might not be the root of this bug but a possible workaround can be to increase the delay on the server.

However, I've several questions:

  • Why is the client doing Accept() and the server doing Dial() ? Is it supposed to be the opposite or it doesn't matter ? This is the case for some others tests of net_test.go.
  • What behavior Go is expected when an accept() is made when the server is already closed (but the port is still listened) ?
    Should it succeed or not ? On aix/ppc64, accept syscall returns EAGAIN (because of O_NONBLOCK flag) and on linux/amd64 it does succeed.

I've also discovered that the behavior of connect is slightly different on AIX than on Linux (I don't know about others OSes). I've tried with the following C code (taken from #6828): accept_after_connect.c.txt. The first connect doesn't return EINPROGRESS as on Linux. It doesn't seem a bug as a connection can result from the listen syscall.
Does Go want EINPROGRESS to be returned ? (*netFD).connect will wait with netpoll if it is.

@ianlancetaylor
Copy link
Contributor

Why is the client doing Accept() and the server doing Dial() ? Is it supposed to be the opposite or it doesn't matter ? This is the case for some others tests of net_test.go.

The names used are confusing but I can't see that it matters in this case.

What behavior Go is expected when an accept() is made when the server is already closed (but the port is still listened) ?

Go expects that calling the Accept method on a closed Listener will return an error. But I'm not sure that is what you are asking; I'm not sure quite what you are asking. The TestNotTemporaryRead test is intended to see what happens when Read is called on a connected socket, and then the other side of the socket connection is closed.

Does Go want EINPROGRESS to be returned ?

As opposed to what? The Go code will work fine if connect blocks until the connection is complete. It's more efficient if the kernel returns EINPROGRESS, because then we don't have to tie up a thread. But if we tie up a thread waiting for connect to complete, everything should still work.

The stack traces from the trybots suggest that the Accept never completes. But it appears that the Dial does complete. I don't understand how a Dial could complete without the corresponding Accept also completing. Does AIX have anything like strace that could show the exact system calls? What does your test program display on AIX?

@julieqiu julieqiu changed the title net: net.TestNotTemporaryRead timeouts on aix/ppc64 net: TestNotTemporaryRead timeouts on aix/ppc64 Jan 11, 2019
@julieqiu julieqiu added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jan 11, 2019
@Helflym
Copy link
Contributor Author

Helflym commented Jan 14, 2019

Does Go want EINPROGRESS to be returned ?

As opposed to what? The Go code will work fine if connect blocks until the connection is complete. It's more efficient if the kernel returns EINPROGRESS, because then we don't have to tie up a thread. But if we tie up a thread waiting for connect to complete, everything should still work.

That's the problem. On AIX, connect doesn't block after a listen. On Linux, man says:

   listen() marks the socket referred to by sockfd as a passive socket, that is, as a socket that will be used to accept incoming connection requests using accept(2).

There is nothing similar on AIX. Therefore, once a socket is opened with listen, every connection is accepted. There is no "passive socket".

Using the C program above, I've the following trace on AIX:

socket(2, 1, 0)                                 = 3
bind(3, 0x0FFFFFFFFFFFEF1C, 16)                 = 0
listen(3, 5)                                    = 0
ngetsockname(3, 0x0FFFFFFFFFFFEF1C, 0x0FFFFFFFFFFFEF30) = 0
socket(2, 1, 0)                                 = 4
kfcntl(4, F_GETFL, 0x0000000110008158)          = 2
kioctl(4, -2147195266, 0x0FFFFFFFFFFFEDF0, 0x0000000000000000) = 0
kioctl(4, -2147195267, 0x0FFFFFFFFFFFEDF0, 0x0000000000000000) = 0
kfcntl(4, F_SETFL, 0x0000000000000006)          = 0
getsockopt(4, 65535, 4104, 0x0FFFFFFFFFFFEE64, 0x0FFFFFFFFFFFEE60) = 0
connext(4, 0x0FFFFFFFFFFFEF1C, 16)              = 0

The last connect (connext here) doesn't return any error, while on Linux it does.

socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
listen(3, 5)                            = 0
getsockname(3, {sa_family=AF_INET, sin_port=htons(36821), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 4
fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(4, {sa_family=AF_INET, sin_port=htons(36821), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)

If you look at TestDialContextCancelRace, the connect function explicitly returns EINPROGRESS if not error is returned. It's the case on AIX.
However, I didn't managed to do something similar on TestNotTemporaryRead...

I don't know if it's possible to change how net.Listen works to match Linux behavior on AIX. Moreover, it seems a little risky to modify it as some AIX users might want this feature as they're used to.

Anyway, this difference only impacts TestNotTemporaryRead at the moment. Moreover, this bug only happens when a machine is very busy. Increasing the sleep duration and adding another sleep in the client should resolve the problem because Accept will be called before the server is closed and thanks to the second sleep, the read will still be issued after it (it might not be needed but it's safer I think).

@ianlancetaylor
Copy link
Contributor

On GNU/Linux EINPROGRESS just means that the descriptor was marked as non-blocking but the connect would block. The connect does still succeed, and when it does, the poller will mark the socket as readable, and a second call to connect will complete without error.

@gopherbot
Copy link

Change https://golang.org/cl/158038 mentions this issue: net: increase TestNotTemporaryRead server sleep

@Helflym
Copy link
Contributor Author

Helflym commented Jan 16, 2019

Well, I've submitted a CL to fix this test. I'll see later if I can provide a more general fix on that issue.

gopherbot pushed a commit that referenced this issue Jan 16, 2019
On aix/ppc64, if the server closes before the client calls Accept,
this test will fail.

Increasing the time before the server closes should resolve this
timeout.

Updates #29685

Change-Id: Iebb849d694fc9c37cf216ce1f0b8741249b98016
Reviewed-on: https://go-review.googlesource.com/c/158038
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@bcmills
Copy link
Contributor

bcmills commented Jul 8, 2019

Another recent timeout in this test, again on the aix-ppc64 builder:
https://build.golang.org/log/ac2fca1f71734020ee6533f578ccbbcf89fe8cd8

@Helflym
Copy link
Contributor Author

Helflym commented Jul 9, 2019

Yes, it seems the new sleep time is still not enough... This test isn't fully compatible with AIX behavior which is slightly different if an accept is made after a listen. However, I won't have time to work on it before 1.13 release. Is it possible to mark it as flaky until a better fix is made ?

@bcmills
Copy link
Contributor

bcmills commented Jul 10, 2019

Is it possible to mark it as flaky until a better fix is made ?

Yep, see testenv.SkipFlaky.

if runtime.GOOS == "aix" {
	testenv.SkipFlaky(t, 29685)
}

@bcmills bcmills added NeedsFix The path to resolution is known, but the work has not been done. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jul 10, 2019
@bcmills bcmills added this to the Go1.14 milestone Jul 10, 2019
@bcmills bcmills added the Testing An issue that has been verified to require only test changes, not just a test failure. label Jul 10, 2019
@gopherbot
Copy link

Change https://golang.org/cl/185717 mentions this issue: net: set TestNotTemporaryRead flaky for AIX

gopherbot pushed a commit that referenced this issue Jul 11, 2019
This test sometimes times out when the machine is busy.
The reason behind is still a bit blurry. But it seems to comes from
the fact that on AIX, once a listen is performed a socket, every
connection will be accepted even before an accept is made (which only
occurs when a machine is busy). On Linux, a socket is created as a
"passive socket" which seems to wait for the accept before allowing
incoming connections.

Updates #29685

Change-Id: I41b053b7d5f5b4420b72d6a217be72e41220d769
Reviewed-on: https://go-review.googlesource.com/c/go/+/185717
Run-TryBot: Clément Chigot <clement.chigot@atos.net>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@bcmills
Copy link
Contributor

bcmills commented Feb 11, 2022

This failure mode does not appear to be specific to AIX.

greplogs --dashboard -md -l -e 'goroutine \d+ \[chan receive, \d+ minutes\]:\n(.+\n\t.+\n)*net\.withTCPConnPair'

2022-02-09T22:58:35-2bf5ae0/openbsd-arm64-jsing
2021-08-14T15:14:43-50f4ebb/illumos-amd64
2021-04-16T03:16:11-cf2396c/illumos-amd64
2019-08-25T04:39:15-1a7c15f/illumos-amd64-joyent
2019-07-10T21:26:21-80cca23/aix-ppc64
2019-07-10T18:26:53-9d56dd8/aix-ppc64
2019-07-06T13:25:59-13327f2/aix-ppc64
2019-06-26T20:10:05-fc26cba/aix-ppc64
2019-06-25T19:10:58-fa0537b/aix-ppc64
2019-06-25T17:51:54-816ce1a/aix-ppc64
2019-06-25T13:27:07-e28f0d9/aix-ppc64
2019-06-24T18:23:08-fafc92d/aix-ppc64
2019-06-21T17:24:29-41b9e99/aix-ppc64
2019-06-21T14:52:06-3f1422c/aix-ppc64
2019-06-19T19:23:51-26954bd/aix-ppc64
2019-06-10T16:19:39-8f296f5/aix-ppc64
2019-06-09T16:23:11-323212b/aix-ppc64
2019-06-06T11:40:17-5ec1406/aix-ppc64
2019-06-06T00:11:29-c00ff65/aix-ppc64

The stack traces in the more recent openbsd and illumos failures have more runtime details in them, but as far as I can tell they are essentially the same as the older aix-ppc64 traces. Here's the one from openbsd-arm64-jsing:

goroutine 2061 [IO wait, 14 minutes]:
runtime.gopark(0x400040ed18?, 0x9d748?, 0xe0?, 0x1d?, 0x258f4d?)
	/home/gopher/build/go/src/runtime/proc.go:366 +0xe8 fp=0x400040ece0 sp=0x400040ecc0 pc=0x47818
runtime.netpollblock(0xb?, 0x0?, 0x0?)
	/home/gopher/build/go/src/runtime/netpoll.go:522 +0x16c fp=0x400040ed20 sp=0x400040ece0 pc=0x40a6c
internal/poll.runtime_pollWait(0x4e83d45d8, 0x72)
	/home/gopher/build/go/src/runtime/netpoll.go:302 +0xa4 fp=0x400040ed50 sp=0x400040ed20 pc=0x70b84
internal/poll.(*pollDesc).wait(0x40003f8300?, 0x16e0c?, 0x0)
	/home/gopher/build/go/src/internal/poll/fd_poll_runtime.go:83 +0x2c fp=0x400040ed80 sp=0x400040ed50 pc=0x9820c
internal/poll.(*pollDesc).waitRead(...)
	/home/gopher/build/go/src/internal/poll/fd_poll_runtime.go:88
internal/poll.(*FD).Accept(0x40003f8300)
	/home/gopher/build/go/src/internal/poll/fd_unix.go:614 +0x1d8 fp=0x400040ee20 sp=0x400040ed80 pc=0x9c1c8
net.(*netFD).accept(0x40003f8300)
	/home/gopher/build/go/src/net/fd_unix.go:172 +0x28 fp=0x400040eee0 sp=0x400040ee20 pc=0x1413f8
net.(*TCPListener).accept(0x400000e888)
	/home/gopher/build/go/src/net/tcpsock_posix.go:139 +0x2c fp=0x400040ef20 sp=0x400040eee0 pc=0x15e0cc
net.(*TCPListener).Accept(0x400000e888)
	/home/gopher/build/go/src/net/tcpsock.go:288 +0x30 fp=0x400040ef60 sp=0x400040ef20 pc=0x15cec0
net.withTCPConnPair.func1()
	/home/gopher/build/go/src/net/net_test.go:427 +0x48 fp=0x400040efd0 sp=0x400040ef60 pc=0x1b14a8
runtime.goexit()
	/home/gopher/build/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x400040efd0 sp=0x400040efd0 pc=0x760b4
created by net.withTCPConnPair
	/home/gopher/build/go/src/net/net_test.go:426 +0x110

goroutine 609 [chan receive, 14 minutes]:
runtime.gopark(0x4000184060?, 0x15e38?, 0x58?, 0xde?, 0x1d230?)
	/home/gopher/build/go/src/runtime/proc.go:366 +0xe8 fp=0x400015ddd0 sp=0x400015ddb0 pc=0x47818
runtime.chanrecv(0x4000133140, 0x400015dee8, 0x1)
	/home/gopher/build/go/src/runtime/chan.go:577 +0x58c fp=0x400015de60 sp=0x400015ddd0 pc=0x1727c
runtime.chanrecv1(0x2af000?, 0x2?)
	/home/gopher/build/go/src/runtime/chan.go:440 +0x14 fp=0x400015de90 sp=0x400015de60 pc=0x16cb4
net.withTCPConnPair(0x4000456820, 0x26e888, 0x26e880)
	/home/gopher/build/go/src/net/net_test.go:445 +0x1b0 fp=0x400015df30 sp=0x400015de90 pc=0x1b1210
net.TestNotTemporaryRead(0x0?)
	/home/gopher/build/go/src/net/net_test.go:546 +0x3c fp=0x400015df60 sp=0x400015df30 pc=0x1b20fc
testing.tRunner(0x4000456820, 0x26e890)
	/home/gopher/build/go/src/testing/testing.go:1439 +0x110 fp=0x400015dfb0 sp=0x400015df60 pc=0xe0ac0
testing.(*T).Run.func1()
	/home/gopher/build/go/src/testing/testing.go:1486 +0x30 fp=0x400015dfd0 sp=0x400015dfb0 pc=0xe1830
runtime.goexit()
	/home/gopher/build/go/src/runtime/asm_arm64.s:1259 +0x4 fp=0x400015dfd0 sp=0x400015dfd0 pc=0x760b4
created by testing.(*T).Run
	/home/gopher/build/go/src/testing/testing.go:1486 +0x328

@bcmills bcmills changed the title net: TestNotTemporaryRead timeouts on aix/ppc64 net: deadlock in TestNotTemporaryRead via net.withTCPConnPair Feb 11, 2022
@gopherbot
Copy link

Change https://go.dev/cl/385314 mentions this issue: net: in TestNotTemporaryRead, do not assume that a dialed connection has been accepted

@gopherbot
Copy link

Change https://go.dev/cl/385754 mentions this issue: net: in TestNotTemporaryRead, reject io.EOF on platforms other than plan9

gopherbot pushed a commit that referenced this issue Apr 1, 2022
…lan9

Updates #29685

Change-Id: Id8dca078213942666871ac8ded663326e98427fe
Reviewed-on: https://go-review.googlesource.com/c/go/+/385754
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@golang golang locked and limited conversation to collaborators Feb 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Testing An issue that has been verified to require only test changes, not just a test failure.
Projects
None yet
Development

No branches or pull requests

6 participants