runtime/netpoll: randomly hanging Accept on macOS since go1.18 #54529

Open
tmm1 opened this issue Aug 18, 2022 · 21 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-Darwin
Milestone

Comments

@tmm1
Contributor

tmm1 commented Aug 18, 2022

Context

We ship golang binaries that run a net/http.Server to a number of users on different platforms.

Recently we've started receiving reports from macOS users, mostly on M1 Mac minis, of an issue where the HTTP server stops responding.

The reports started shortly after we shipped an upgrade from go1.17 to go1.18.

Observations

The issue is very intermittent, with some users experiencing it at least once a week.

At the same time, my own M1 Mac mini has never experienced this issue.

We have been trying to debug this for a couple of months, and have been able to gather some information from users:

  • does not seem to be ulimit related, as other parts of the binary continue to function properly, creating files and making network requests

  • the listening socket is active, because curl experiences a timeout instead of connection refused

    curl: (28) Failed to connect to 127.0.0.1 port 8089 after 25926 ms: Operation timed out
    
  • the Go stack trace shows TCPListener.Accept sitting idle in netpoll

    goroutine 69 [IO wait, 771 minutes]:
    internal/poll.runtime_pollWait(0x12bdee198, 0x72)
        runtime/netpoll.go:302 +0xa4
    internal/poll.(*pollDesc).wait(0x140000c2780?, 0x4?, 0x0)
        internal/poll/fd_poll_runtime.go:83 +0x2c
    internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:88
    internal/poll.(*FD).Accept(0x140000c2780)
        internal/poll/fd_unix.go:614 +0x1d0
    net.(*netFD).accept(0x140000c2780)
        net/fd_unix.go:172 +0x28
    net.(*TCPListener).accept(0x1400066e0c0)
        net/tcpsock_posix.go:139 +0x2c
    net.(*TCPListener).Accept(0x1400066e0c0)
        net/tcpsock.go:288 +0x30
    

I'm looking for any suggestions on how to debug this further, or if anyone has seen anything similar.

Currently I am wondering if this issue affects just the single socket, or is something process-wide. To verify, we've spun up another listener on a secondary port, and are waiting to hear if that port keeps working when the primary port stops responding.

As a workaround, we are considering adding an internal health check in the process that tries to connect back to itself over loopback. If the connection times out, we could restart the TCPListener.
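
Something along these lines is what we have in mind. This is only a rough sketch: the function name, intervals, and restart strategy below are illustrative, not our actual code.

// Watchdog sketch: dial our own listener over loopback and recycle it if the
// dial times out. Names and intervals here are made up for illustration.
package main

import (
	"context"
	"log"
	"net"
	"net/http"
	"time"
)

func main() {
	for {
		// If the health check ever fails, serveWithHealthCheck returns and we
		// start over with a fresh TCPListener.
		serveWithHealthCheck(":8089", "127.0.0.1:8089")
	}
}

// serveWithHealthCheck serves HTTP on listenAddr and periodically dials
// healthAddr over loopback; if the dial times out, it shuts the server down
// and returns so the caller can re-listen.
func serveWithHealthCheck(listenAddr, healthAddr string) {
	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	srv := &http.Server{Handler: http.DefaultServeMux}
	go srv.Serve(ln)

	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		conn, err := net.DialTimeout("tcp", healthAddr, 5*time.Second)
		if err == nil {
			conn.Close() // the listener accepted; all good
			continue
		}
		// The dial timed out: assume the listener is wedged and recycle it.
		log.Printf("loopback health check failed: %v; restarting listener", err)
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		srv.Shutdown(ctx)
		cancel()
		return
	}
}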

What version of Go are you using (go version)?

$ go version
go version go1.18.5 darwin/arm64

Does this issue reproduce with the latest release?

Have not tried with go 1.19

@panjf2000
Member

panjf2000 commented Aug 19, 2022

I would suggest that you take a look at the Send-Q and Recv-Q of the TCP listener and check whether they keep accumulating without being drained while the issue is happening. If so, it may indicate that Go's netpoll fails to wake up the goroutine that holds the listener to accept new sockets; if not, it might not be a bug in Go.

@panjf2000
Member

panjf2000 commented Aug 19, 2022

I've gone through the netpoll code roughly but didn't catch anything; maybe I missed something. I'll keep browsing the source code.

@tmm1
Contributor Author

tmm1 commented Aug 19, 2022

take a look at Send-Q and Recv-Q of the TCP listener

What's the best way to do so on macOS? Does netstat show this info?

EDIT: The answer is yes, a simple netstat -an will work.

@panjf2000
Member

BTW, what's the output of sudo sysctl -p on your Mac? @tmm1

@panjf2000
Member

panjf2000 commented Aug 19, 2022

take a look at Send-Q and Recv-Q of the TCP listener

What's the best way to do so on macOS? Does netstat show this info?

EDIT: The answer is yes, a simple netstat -an will work.

You may also want to add -L to the command to show the sizes of the SYN queue and the accept queue. @tmm1

@tmm1
Contributor Author

tmm1 commented Aug 19, 2022

BTW, What's the output of sudo sysctl -p on your Mac?

Doesn't work here.. sysctl: illegal option -- p

@panjf2000
Member

BTW, What's the output of sudo sysctl -p on your Mac?

Doesn't work here.. sysctl: illegal option -- p

Then go with echo "#include <sys/sysctl.h>" | gcc -M -xc - | grep sysctl, open the .h file printed by that command, and search for somaxconn to see what we can find.

@panjf2000
Member

panjf2000 commented Aug 19, 2022

Another thought: it could be that your somaxconn is too small to accept more sockets and your listener is being overwhelmed. Have you tried netstat -nt | grep {listener_port} to find out how many connections were stuck in the SYN_RCVD state during this issue, or checked how many times the listen queue overflowed (netstat -s | grep overflow) and how many connections were dropped (netstat -s | grep dropped)?

@joedian joedian added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 19, 2022
@panjf2000
Member

BTW, What's the output of sudo sysctl -p on your Mac?

Doesn't work here.. sysctl: illegal option -- p

Then go with echo "#include <sys/sysctl.h>" | gcc -M -xc - | grep sysctl, open the .h file printed by that command, and search for somaxconn to see what we can find.

I found a #define KIPC_SOMAXCONN in sysctl.h, so the somaxconn configuration on macOS should be kern.ipc.somaxconn. Run sysctl kern.ipc.somaxconn=1000 on your Mac and it ought to fix your case.

Please let me know if the above solution doesn't work for your case (:

@seankhliao seankhliao added this to the Unplanned milestone Aug 20, 2022
@tmm1
Contributor Author

tmm1 commented Aug 24, 2022

I received some of the requested details from a customer: https://gist.github.com/thully/03bc2a4e73cae53e079e2db66d3bebb1

@tmm1
Contributor Author

tmm1 commented Aug 26, 2022

I received some of the requested details from a customer:

Our process listens on *:8089. I added another listener on 127.0.0.1:58089, which continues to work normally when port 8089 stops responding.
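
For reference, the extra listener is nothing fancier than a second net.Listen on loopback with a trivial handler, roughly like this (a sketch; the /status path is made up):

// Sketch of the secondary diagnostic listener; it only exists to tell a
// per-socket hang on *:8089 apart from a process-wide netpoll problem.
package main

import (
	"io"
	"log"
	"net"
	"net/http"
)

func main() {
	// The primary listener on *:8089 is set up elsewhere in the real program.
	ln, err := net.Listen("tcp", "127.0.0.1:58089")
	if err != nil {
		log.Fatal(err)
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok\n")
	})
	log.Fatal(http.Serve(ln, mux))
}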

Excerpts:

lsof shows both listeners. There are a few files open, but nowhere close to the open file limit.

Neither lsof nor netstat -at shows any tcp connections going to either port.

$ lsof -nPp
COMMAND   PID   USER   FD      TYPE             DEVICE  SIZE/OFF                NODE NAME
channels- 650 userna    5u     IPv4 0x405cd9dd81544b77       0t0                 TCP 127.0.0.1:58089 (LISTEN)
channels- 650 userna   10u     IPv6 0x405cd9d8b4aa1017       0t0                 TCP *:8089 (LISTEN)

$ netstat -an
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)    
tcp4       0      0  127.0.0.1.58089        *.*                    LISTEN     
tcp46      0      0  *.8089                 *.*                    LISTEN     

netstat -s shows no overflows or drops

netstat -aL shows that the listen queue is indeed filled up, just for the one port.

     -L    Show the size of the various listen queues.  The first count shows the number of unaccepted
           connections.  The second count shows the amount of unaccepted incomplete connections.  The third
           count is the maximum number of queued connections.
$ netstat -anL
Current listen queue sizes (qlen/incqlen/maxqlen)
Listen         Local Address         
0/0/128        127.0.0.1.58089        
128/128/128    *.8089

So @panjf2000 is correct in that the connections are piling up to the somaxconn limit, and once it fills up, clients start receiving timeouts.

But I don't see how increasing somaxconn would make any difference, since the Go process is stuck and refusing to drain sockets off the queue.

goroutine 52 [IO wait, 2085 minutes]:
internal/poll.runtime_pollWait(0xe47d058, 0x72)
	runtime/netpoll.go:302 +0x89
internal/poll.(*pollDesc).wait(0xc0010a2400?, 0x4?, 0x0)
	internal/poll/fd_poll_runtime.go:83 +0x32
internal/poll.(*pollDesc).waitRead(...)
	internal/poll/fd_poll_runtime.go:88
internal/poll.(*FD).Accept(0xc0010a2400)
	internal/poll/fd_unix.go:614 +0x22c
net.(*netFD).accept(0xc0010a2400)
	net/fd_unix.go:172 +0x35
net.(*TCPListener).accept(0xc000fe65a0)
	net/tcpsock_posix.go:139 +0x28
net.(*TCPListener).Accept(0xc000fe65a0)
	net/tcpsock.go:288 +0x3d
github.com/soheilhy/cmux.(*cMux).Serve(0xc000fcdbd0)
	github.com/soheilhy/cmux@v0.1.4/cmux.go:162 +0xa9
created by main.StartHTTPServer
	github.com/fancybits/channels-server/http.go:770 +0x1953

@tmm1
Contributor Author

tmm1 commented Aug 26, 2022

netstat -aL shows that the listen queue is indeed filled up, just for the one port.

I asked a user to check the current values, and it says:

Current listen queue sizes (qlen/incqlen/maxqlen)
Listen         Local Address         
0/0/128        127.0.0.1.58089
15/15/128      *.8089

So it appears those are incrementing on some users' systems for some reason. And once they hit 128, things stop making their way down to the process?

I wonder what the application is supposed to do to tell the kernel to clear out the unaccepted connections?

@tmm1
Contributor Author

tmm1 commented Aug 26, 2022

I was able to reproduce this somewhat on my own M1 mini using a SYN flood. At least I can make the qlen/incqlen max out, but in my case the numbers reset back to zero very quickly afterwards.

tmm1@tmm1-m1 ~ % echo; date; sudo /opt/homebrew/Cellar/hping/3.20051105/sbin/hping3 -q -c 256 -S -p 8089 --faster localhost; echo; date; netstat -anL | grep 8089; netstat -nt | grep 8089 | head -6; sleep 2; echo; date; netstat -anL | grep 8089

Fri Aug 26 14:48:05 PDT 2022
HPING localhost (utun13 127.0.0.1): S set, 40 headers + 0 data bytes
--- localhost hping statistic ---
256 packets tramitted, 0 packets received, 100% packet loss
round-trip min/avg/max = 0.0/0.0/0.0 ms

Fri Aug 26 14:48:05 PDT 2022
0/0/128        127.0.0.1.58089
128/128/128    *.8089
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2722         SYN_RCVD
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2721         SYN_RCVD
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2720         SYN_RCVD
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2719         SYN_RCVD
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2718         SYN_RCVD
tcp4       0      0  127.0.0.1.8089         127.0.0.1.2717         SYN_RCVD

Fri Aug 26 14:48:07 PDT 2022
0/0/128        127.0.0.1.58089
0/0/128        *.8089

@tmm1
Contributor Author

tmm1 commented Aug 27, 2022

Re-reading @panjf2000's advice above, things are starting to make more sense.

I would suggest that you take a look at the Send-Q and Recv-Q of the TCP listener and check whether they keep accumulating without being drained while the issue is happening. If so, it may indicate that Go's netpoll fails to wake up the goroutine that holds the listener to accept new sockets; if not, it might not be a bug in Go.

I'm not seeing any Send-Q/Recv-Q on the LISTEN socket via netstat -ant. This indicates that the kernel has nothing to deliver to the application, so it's normal that Go's netpoll is not waking up: there are no kqueue events to respond to.

You may also want to add -L to the command to show the sizes of the SYN queue and the accept queue.

The SYN/accept queue is increasing, and getting stuck. So this is clearly some sort of kernel bug on the macOS side.

Users affected by this bug never see qlen drop back down to zero. Even when they use hping3 to simulate a SYN flood, the number spikes to 128 and drops back down quickly, but not all the way to zero.

run sysctl kern.ipc.somaxconn=1000 on your Mac and it ought to fix your case

The issue is not a qlen spike, but rather a leak somewhere.

However, increasing somaxconn is still a useful workaround, as it delays the problem. I confirmed that after changing the sysctl and restarting my program, netstat -L now reports a much higher maxqlen.

The users reporting this issue are experiencing maxqlen exhaustion every 1-3 days, with up to 4 leaks an hour.

So far one commonality is that all the affected users have some kind of network extension installed, as shown by systemextensionsctl list.

@panjf2000
Member

panjf2000 commented Aug 27, 2022

I'm not seeing any Send-Q/Recv-Q on the LISTEN socket via netstat -ant. This indicates that the kernel has nothing to deliver to the application, so it's normal that Go's netpoll is not waking up: there are no kqueue events to respond to.

I'm not an expert with macOS, and although netstat differs a bit between Linux and macOS, I think that without -L, Recv-Q and Send-Q represent the count of bytes not yet copied by the user program connected to the socket and the count of bytes not yet acknowledged by the remote host, as they do on Linux; with -L, they represent the SYN queue and the accept queue we really care about.

I would suggest that you take a look at the Send-Q and Recv-Q of the TCP listener and check whether they keep accumulating without being drained while the issue is happening. If so, it may indicate that Go's netpoll fails to wake up the goroutine that holds the listener to accept new sockets; if not, it might not be a bug in Go.

What I really meant before was the SYN queue and the accept queue. Therefore, I think your case does indicate an issue inside Go's netpoll: either kevent didn't work properly in netpoll to wake up the listener's goroutine, or netpoll didn't use it correctly.

@tmm1
Contributor Author

tmm1 commented Aug 27, 2022

It seems like the app can't do anything other than call accept(). When a real request comes in and accept is called, the qlen is not cleared out, just decremented by one. So it doesn't seem like Go could be doing anything else, unless waking up and calling accept() earlier would have made a difference.

@tmm1
Contributor Author

tmm1 commented Aug 29, 2022

I filed FB11405981 with Apple.

A user reports that removing the Little Snitch network extension fixed the issue. More details here: https://community.getchannels.com/t/channels-dvr-stops-responding-randomly-on-macos/32418/83

@jimlambrt

jimlambrt commented Dec 6, 2023

I am experiencing this as well on Ventura 13.6.1 and Sonoma 14.1.2.

goroutine 238 [IO wait]:
internal/poll.runtime_pollWait(0x10807c9b8, 0x72)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/runtime/netpoll.go:343 +0xa0
internal/poll.(*pollDesc).wait(0x140000c7c00?, 0x0?, 0x0)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/internal/poll/fd_poll_runtime.go:84 +0x28
internal/poll.(*pollDesc).waitRead(...)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Accept(0x140000c7c00)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/internal/poll/fd_unix.go:611 +0x250
net.(*netFD).accept(0x140000c7c00)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/net/fd_unix.go:172 +0x28
net.(*TCPListener).accept(0x140000b0760)
	/Users/jimlambrt/.asdf/installs/golang/1.21.1/go/src/net/tcpsock_posix.go:152 +0x28
net.(*TCPListener).Accept(0x140000b0760)

Running sysctl kern.ipc.somaxconn=1000 did not fix this for me.

@tmm1
Contributor Author

tmm1 commented Dec 8, 2023

goroutine 238 [IO wait]:

This looks normal to me?

In my case, note the really long wait times indicating the "stuck" state:

goroutine 69 [IO wait, 771 minutes]:

goroutine 52 [IO wait, 2085 minutes]:

@jimlambrt

Update: I've resolved my issue. The listener's IP address was not routable. Once I fixed that, the listener had no problem accepting connections, of course.

@chenweiyj

I have hit the same problem when I create more than 200 nodes connected via net.DialTCP. After many rounds of message delivery, the final broadcast leaves around 40 nodes without the broadcast message on my M1 MacBook Air with macOS Sonoma 14.4.
