runtime/netpoll: randomly hanging Accept on macOS since go1.18 #54529

Open
tmm1 opened this issue Aug 18, 2022 · 21 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin
Milestone

Comments

@tmm1
Contributor

tmm1 commented Aug 18, 2022

Context

We ship golang binaries that run a net/http.Server to a number of users on different platforms.

Recently we've started receiving reports from macOS users, mostly on M1 Mac minis, of an issue where the HTTP server stops responding.

The reports started shortly after we shipped an upgrade from go1.17 to go1.18.

Observations

The issue is very intermittent, with some users experiencing it at least once a week.

At the same time, my own M1 Mac mini has never experienced this issue.

We have been trying to debug this for a couple of months, and have been able to gather some information from users:

  • does not seem to be ulimit related, as other parts of the binary continue to function properly, creating files and making network requests

  • the listening socket is active, because curl experiences a timeout instead of connection refused

    curl: (28) Failed to connect to 127.0.0.1 port 8089 after 25926 ms: Operation timed out
    
  • the Go stack trace shows TCPListener.Accept sitting idle in netpoll

    goroutine 69 [IO wait, 771 minutes]:
    internal/poll.runtime_pollWait(0x12bdee198, 0x72)
        runtime/netpoll.go:302 +0xa4
    internal/poll.(*pollDesc).wait(0x140000c2780?, 0x4?, 0x0)
        internal/poll/fd_poll_runtime.go:83 +0x2c
    internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:88
    internal/poll.(*FD).Accept(0x140000c2780)
        internal/poll/fd_unix.go:614 +0x1d0
    net.(*netFD).accept(0x140000c2780)
        net/fd_unix.go:172 +0x28
    net.(*TCPListener).accept(0x1400066e0c0)
        net/tcpsock_posix.go:139 +0x2c
    net.(*TCPListener).Accept(0x1400066e0c0)
        net/tcpsock.go:288 +0x30
    

I'm looking for any suggestions on how to debug this further, or if anyone has seen anything similar.

Currently I am wondering if this issue affects just the single socket, or is something process-wide. To verify, we've spun up another listener on a secondary port, and are waiting to hear if that port keeps working when the primary port stops responding.

As a workaround, we are considering adding an internal health check in the process that tries to connect back to itself over loopback. If the connection times out, we could restart the TCPListener.
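
Something along these lines is what we have in mind. This is only a rough sketch: the function name, intervals, and restart strategy below are illustrative, not our actual code.

// Watchdog sketch: dial our own listener over loopback and recycle it if the
// dial times out. Names and intervals here are made up for illustration.
package main

import (
	"context"
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	for {
		// If the health check ever fails, serveWithHealthCheck returns and we
		// start over with a fresh TCPListener.
		serveWithHealthCheck(":8089", "127.0.0.1:8089")
	}
}

// serveWithHealthCheck serves HTTP on listenAddr and periodically dials
// healthAddr over loopback; if the dial times out, it shuts the server down
// and returns so the caller can re-listen.
func serveWithHealthCheck(listenAddr, healthAddr string) {
	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	srv := &http.Server{Handler: http.DefaultServeMux}
	go srv.Serve(ln)

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		conn, err := net.DialTimeout("tcp", healthAddr, 5*time.Second)
		if err == nil {
			conn.Close() // the listener accepted; all good
			continue
		}
		// The dial timed out: assume the listener is wedged and recycle it.
		log.Printf("loopback health check failed: %v; restarting listener", err)
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		srv.Shutdown(ctx)
		cancel()
		return
	}
}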

What version of Go are you using (go version)?

$ go version
go version go1.18.5 darwin/arm64

Does this issue reproduce with the latest release?

Have not tried with go 1.19

@panjf2000
Member

panjf2000 commented Aug 19, 2022

I would suggest that you take a look at the Send-Q and Recv-Q of the TCP listener and check whether they keep accumulating without being drained while the issue is happening. If so, it may indicate that Go's netpoll fails to wake up the goroutine that holds the listener to accept new sockets; if not, it might not be a bug in Go.

@panjf2000
Member

panjf2000 commented Aug 19, 2022

I've gone through the netpoll code roughly but didn't catch anything; maybe I missed something. I'll keep browsing the source code.

@tmm1
Contributor Author

tmm1 commented Aug 19, 2022

take a look at Send-Q and Recv-Q of the TCP listener

What's the best way to do so on macOS? Does netstat show this info?

EDIT: The answer is yes, a simple netstat -an will work.

@panjf2000
Member

BTW, what's the output of sudo sysctl -p on your Mac? @tmm1

@panjf2000
Member

panjf2000 commented Aug 19, 2022

take a look at Send-Q and Recv-Q of the TCP listener

What's the best way to do so on macOS? Does netstat show this info?

EDIT: The answer is yes, a simple netstat -an will work.

You may also want to add -L to the command to show the sizes of the SYN queue and the accept queue. @tmm1

@tmm1
Contributor Author

tmm1 commented Aug 19, 2022

BTW, What's the output of sudo sysctl -p on your Mac?

Doesn't work here.. sysctl: illegal option -- p

@panjf2000
Member

BTW, What's the output of sudo sysctl -p on your Mac?

Doesn't work here.. sysctl: illegal option -- p

Then go with echo "#include <sys/sysctl.h>" | gcc -M -xc - | grep sysctl, open the .h file printed by that command, and search for somaxconn to see what we can find.

@panjf2000
Member

panjf2000 commented Aug 19, 2022

Another thought: it could be that your somaxconn is too small to accept more sockets and your listener is being overwhelmed. Have you tried netstat -nt | grep {listener_port} to find out how many connections were stuck in the SYN_RCVD state during this issue, or checked how many times the listen queue overflowed (netstat -s | grep overflow) and how many connections were dropped (netstat -s | grep dropped)?

@joedian joedian added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 19, 2022
@panjf2000
Member

BTW, What's the output of sudo sysctl -p on your Mac?

Doesn't work here.. sysctl: illegal option -- p

Then go with echo "#include <sys/sysctl.h>" | gcc -M -xc - | grep sysctl, open the .h file printed by that command, and search for somaxconn to see what we can find.

I found a #define KIPC_SOMAXCONN in sysctl.h, so the somaxconn configuration on macOS should be kern.ipc.somaxconn. Run sysctl kern.ipc.somaxconn=1000 on your Mac and it ought to fix your case.

Please let me know if the above solution doesn't work for your case (:

@seankhliao seankhliao added this to the Unplanned milestone Aug 20, 2022
@tmm1
Contributor Author

tmm1 commented Aug 24, 2022

I received some of the requested details from a customer: https://gist.github.com/thully/03bc2a4e73cae53e079e2db66d3bebb1

@tmm1
Contributor Author

tmm1 commented Aug 26, 2022

I received some of the requested details from a customer:

Our process listens on *:8089. I added another listener on 127.0.0.1:58089, which continues to work normally when port 8089 stops responding.
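
For reference, the extra listener is nothing fancier than a second net.Listen on loopback with a trivial handler, roughly like this (a sketch; the /status path is made up):

// Sketch of the secondary diagnostic listener; it only exists to tell a
// per-socket hang on *:8089 apart from a process-wide netpoll problem.
package main

import (
	"io"
	"log"
	"net"
	"net/http"
)

func main() {
	// The primary listener on *:8089 is set up elsewhere in the real program.
	ln, err := net.Listen("tcp", "127.0.0.1:58089")
	if err != nil {
		log.Fatal(err)
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok\n")
	})
	log.Fatal(http.Serve(ln, mux))
}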

Excerpts:

lsof shows both listeners. There are a few files open, but nowhere close to the open file limit.

Neither lsof nor netstat -at shows any tcp connections going to either port.

$ lsof -nPp
COMMAND   PID   USER   FD      TYPE             DEVICE  SIZE/OFF                NODE NAME
channels- 650 userna    5u     IPv4 0x405cd9dd81544b77       0t0                 TCP 127.0.0.1:58089 (LISTEN)
channels- 650 userna   10u     IPv6 0x405cd9d8b4aa1017       0t0                 TCP *:8089 (LISTEN)

$ netstat -an
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)    
tcp4       0      0  127.0.0.1.58089        *.*                    LISTEN     
tcp46      0      0  *.8089                 *.*                    LISTEN     

netstat -s shows no overflows or drops

netstat -aL shows that the listen queue is indeed filled up, just for the one port.

     -L    Show the size of the various listen queues.  The first count shows the number of unaccepted
           connections.  The second count shows the amount of unaccepted incomplete connections.  The third
           count is the maximum number of queued connections.
$ netstat -anL
Current listen queue sizes (qlen/incqlen/maxqlen)
Listen         Local Address         
0/0/128        127.0.0.1.58089        
128/128/128    *.8089

So @panjf2000 is correct in that the connections are piling up to the somaxconn limit, and once it fills up, clients start receiving timeouts.

But I don't see how increasing somaxconn would make any difference, since the Go process is stuck and refusing to drain sockets off the queue.

goroutine 52 [IO wait, 2085 minutes]:
internal/poll.runtime_pollWait(0xe47d058, 0x72)
	runtime/netpoll.go:302 +0x89
internal/poll.(*pollDesc).wait(0xc0010a2400?, 0x4?, 0x0)
	internal/poll/fd_poll_runtime.go:83 +0x32
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:88
internal/poll.(*FD).Accept(0xc0010a2400)
	internal/poll/fd_unix.go:614 +0x22c
net.(*netFD).accept(0xc0010a2400)
	net/fd_unix.go:172 +0x35
net.(*TCPListener).accept(0xc000fe65a0)
	net/tcpsock_posix.go:139 +0x28
net.(*TCPListener).Accept(0xc000fe65a0)
	net/tcpsock.go:288 +0x3d
github.com/soheilhy/cmux.(*cMux).Serve(0xc000fcdbd0)
	github.com/soheilhy/cmux@v0.1.4/cmux.go:162 +0xa9
created by main.StartHTTPServer
	github.com/fancybits/channels-server/http.go:770 +0x1953

@tmm1
Contributor Author

tmm1 commented Aug 26, 2022

netstat -aL shows that the listen queue is indeed filled up, just for the one port.

I asked a user to check the current values, and it says:

Current listen queue sizes (qlen/incqlen/maxqlen)
Listen         Local Address         
0/0/128        127.0.0.1.58089
15/15/128      *.8089

So it appears those are incrementing on some users' systems for some reason. And once they hit 128, things stop making their way down to the process?

I wonder what the application is supposed to do to tell the kernel to clear out the unaccepted connections?

@tmm1
Contributor Author

tmm1 commented Aug 26, 2022

I was able to reproduce this somewhat on my own M1 mini using a SYN flood. At least I can make the qlen/incqlen max out, but in my case the numbers reset back to zero very quickly afterwards.

tmm1@tmm1-m1 ~ % echo; date; sudo /opt/homebrew/Cellar/hping/3.20051105/sbin/hping3 -q -c 256 -S -p 8089 --faster localhost; echo; date; netstat -anL | grep 8089; netstat -nt | grep 8089 | head -6; sleep 2; echo; date; netstat -anL | grep 8089

Fri Aug 26 14:48:05 PDT 2022
HPING localhost (utun13 127.0.0.1): S set, 40 headers + 0 data bytes
--- localhost hping statistic ---
256 packets tramitted, 0 packets received, 100% packet loss
round-trip min/avg/max = 0.0/0.0/0.0 ms

Fri Aug 26 14:48:05 PDT 2022
0/0/128        127.0.0.1.58089
128/128/128    *.8089
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2722         SYN_RCVD
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2721         SYN_RCVD
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2720         SYN_RCVD
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2719         SYN_RCVD
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2718         SYN_RCVD
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2717         SYN_RCVD

Fri Aug 26 14:48:07 PDT 2022
0/0/128        127.0.0.1.58089
0/0/128        *.8089

@tmm1
Contributor Author

tmm1 commented Aug 27, 2022

Re-reading @panjf2000's advice above, things are starting to make more sense.

I would suggest that you take a look at the Send-Q and Recv-Q of the TCP listener and check whether they keep accumulating without being drained while the issue is happening. If so, it may indicate that Go's netpoll fails to wake up the goroutine that holds the listener to accept new sockets; if not, it might not be a bug in Go.

I'm not seeing any Send-Q/Recv-Q on the LISTEN socket via netstat -ant. This indicates that the kernel has nothing to deliver to the application, so it's normal that Go's netpoll is not waking up: there are no kqueue events to respond to.

You may also want to add -L to the command to show the sizes of the SYN queue and the accept queue.

The SYN/accept queue is increasing, and getting stuck. So this is clearly some sort of kernel bug on the macOS side.

Users affected by this bug never see qlen drop back down to zero. Even when they use hping3 to simulate a SYN flood, the number spikes to 128 and drops back down quickly, but not all the way to zero.

run sysctl kern.ipc.somaxconn=1000 on your Mac and it ought to fix your case

The issue is not a qlen spike, but rather a leak somewhere.

However, increasing somaxconn is still a useful workaround, as it delays the problem. I confirmed that after changing the sysctl and restarting my program, netstat -L now reports a much higher maxqlen.

The users reporting this issue are experiencing maxqlen exhaustion every 1-3 days, with up to 4 leaks an hour.

So far one commonality is that all the affected users have some kind of network extension installed, as shown by systemextensionsctl list.

@panjf2000
Member

panjf2000 commented Aug 27, 2022

I'm not seeing any Send-Q/Recv-Q on the LISTEN socket via netstat -ant. This indicates that the kernel has nothing to deliver to the application, so it's normal that Go's netpoll is not waking up: there are no kqueue events to respond to.

I'm not an expert with macOS, and although netstat differs a bit between Linux and macOS, I think that without -L, Recv-Q and Send-Q represent the count of bytes not yet copied by the user program connected to the socket and the count of bytes not yet acknowledged by the remote host, as they do on Linux; with -L, they represent the SYN queue and the accept queue we really care about.

I would suggest that you take a look at the Send-Q and Recv-Q of the TCP listener and check whether they keep accumulating without being drained while the issue is happening. If so, it may indicate that Go's netpoll fails to wake up the goroutine that holds the listener to accept new sockets; if not, it might not be a bug in Go.

What I really meant before was the SYN queue and the accept queue. Therefore, I think your case does indicate an issue inside Go's netpoll: either kevent didn't work properly in netpoll to wake up the listener's goroutine, or netpoll didn't use it correctly.

@tmm1
Contributor Author

tmm1 commented Aug 27, 2022

It seems like the app can't do anything other than call accept(). When a real request comes in and accept is called, the qlen is not cleared out, just decremented by one. So it doesn't seem like Go could be doing anything else, unless waking up and calling accept() earlier would have made a difference.

@tmm1
Contributor Author

tmm1 commented Aug 29, 2022

I filed FB11405981 with Apple.

A user reports that removing the Little Snitch network extension fixed the issue. More details here: https://community.getchannels.com/t/channels-dvr-stops-responding-randomly-on-macos/32418/83

@jimlambrt

jimlambrt commented Dec 6, 2023

I am experiencing this as well on Ventura 13.6.1 and Sonoma 14.1.2.

goroutine 238 [IO wait]:
internal/poll.runtime_pollWait(0x10807c9b8, 0x72)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/runtime/netpoll.go:343 +0xa0
internal/poll.(*pollDesc).wait(0x140000c7c00?, 0x0?, 0x0)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/internal/poll/fd_poll_runtime.go:84 +0x28
internal/poll.(*pollDesc).waitRead(...)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0x140000c7c00)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/internal/poll/fd_unix.go:611 +0x250
net.(*netFD).accept(0x140000c7c00)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/net/fd_unix.go:172 +0x28
net.(*TCPListener).accept(0x140000b0760)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/net/tcpsock_posix.go:152 +0x28
net.(*TCPListener).Accept(0x140000b0760)

Running sysctl kern.ipc.somaxconn=1000 did not fix this for me.

@tmm1
Contributor Author

tmm1 commented Dec 8, 2023

goroutine 238 [IO wait]:

This looks normal to me?

In my case, note the really long wait times indicating the "stuck" state:

goroutine 69 [IO wait, 771 minutes]:

goroutine 52 [IO wait, 2085 minutes]:

@jimlambrt

Update: I've resolved my issue. The listener's IP address was not routable. Once I fixed that, the listener had no problem accepting connections, of course.

@chenweiyj

I have hit the same problem when I create more than 200 nodes connected via net.DialTCP. After many rounds of message delivery, the final broadcast leaves around 40 nodes without the broadcast message on my M1 MacBook Air with macOS Sonoma 14.4.
