net: FileConn can yield blocking descriptors, leading to livelock #61205

Closed
philhofer opened this issue Jul 6, 2023 · 6 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

@philhofer
Contributor

What version of Go are you using (go version)?

go1.20.5

$ go version

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

Linux/amd64

What did you do?

We use net.FileConn to construct a net.Conn from a file descriptor passed in from another process via a unix socket.
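For readers unfamiliar with the pattern, here is a minimal sketch of wrapping a raw socket descriptor with net.FileConn. It is not the reporter's code: it assumes a local socketpair stands in for a descriptor received from another process, and the helper names connFromFD and roundTrip are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"syscall"
)

// connFromFD wraps a raw socket descriptor in a net.Conn.
// net.FileConn duplicates the descriptor, so the temporary *os.File
// (and with it the original descriptor) is closed before returning.
func connFromFD(fd int, name string) (net.Conn, error) {
	f := os.NewFile(uintptr(fd), name)
	defer f.Close()
	return net.FileConn(f)
}

// roundTrip sends msg across a fresh socketpair whose two ends are both
// wrapped via net.FileConn, and returns what the peer read.
func roundTrip(msg string) (string, error) {
	// Stand-in for a descriptor passed in over a unix socket (assumption).
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return "", err
	}
	a, err := connFromFD(fds[0], "a")
	if err != nil {
		syscall.Close(fds[1])
		return "", err
	}
	defer a.Close()
	b, err := connFromFD(fds[1], "b")
	if err != nil {
		return "", err
	}
	defer b.Close()

	if _, err := a.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(b, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := roundTrip("ping")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("received:", got)
}
```

Note that the error from net.FileConn is checked explicitly here; whether that error is always surfaced when poller registration fails is the question this issue raises.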

net.FileConn calls internal/poll.FD.Init(net, true), which treats the descriptor as pollable only if initializing the runtime poller succeeds:

        err := fd.pd.init(fd)
        if err != nil {
                // If we could not initialize the runtime poller,
                // assume we are using blocking mode.
                fd.isBlocking = 1
        }

The trouble here is that fd.pd.init(fd) can produce an error when the number of poll watches exceeds /proc/sys/fs/epoll/max_user_watches, and thus the non-blocking file descriptor is incorrectly labelled as blocking.
The consequence of this is that calls to write(2), read(2), etc. that return EAGAIN do not invoke the netpoller and instead loop forever (or as long as the operation would block), which consumes 100% of the CPU time. Here's a flame graph I captured from a profile:
[Image: flame graph screenshot, 2023-07-06]
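To make the described symptom concrete, here is a simplified model of a retry loop that spins on EAGAIN instead of waiting for readiness. This is an illustration only and not the code in internal/poll; the helpers spinRead and countSpins are hypothetical, and the loop is capped so the example terminates.

```go
package main

import (
	"fmt"
	"syscall"
)

// spinRead retries a non-blocking read whenever it returns EAGAIN, without
// ever waiting for the descriptor to become readable. It gives up after
// maxSpins EAGAIN results so that this illustration terminates.
func spinRead(fd int, buf []byte, maxSpins int) (n, spins int, err error) {
	for {
		n, err = syscall.Read(fd, buf)
		if err != syscall.EAGAIN {
			if n < 0 {
				n = 0
			}
			return n, spins, err
		}
		spins++
		if spins >= maxSpins {
			return 0, spins, err
		}
	}
}

// countSpins runs spinRead against an empty non-blocking socket and reports
// how many EAGAIN results were seen before the cap was reached.
func countSpins(maxSpins int) (int, error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return 0, err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])
	if err := syscall.SetNonblock(fds[0], true); err != nil {
		return 0, err
	}
	_, spins, err := spinRead(fds[0], make([]byte, 16), maxSpins)
	if err != syscall.EAGAIN {
		return spins, fmt.Errorf("expected EAGAIN, got %v", err)
	}
	return spins, nil
}

func main() {
	spins, err := countSpins(1000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("EAGAIN results before giving up:", spins)
}
```

Without the cap, a loop of this shape would burn CPU for as long as the peer sends nothing, which matches the 100% CPU symptom reported above.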

I think the correct fix here is to have poll.FD.Init produce a real error when netpoller registration fails.

@ianlancetaylor
Contributor

What is the value of /proc/sys/fs/epoll/max_user_watches on your systems?

@ianlancetaylor added this to the Go1.22 milestone Jul 6, 2023
@ianlancetaylor added the NeedsInvestigation label Jul 6, 2023
@philhofer
Contributor Author

# cat /proc/sys/fs/epoll/max_user_watches 
1766110

I'm not 100% certain this is the error that causes isBlocking to be set to 1. I am certain, though, that a file descriptor leak in our process led to this livelock, so exceeding max_user_watches was my best guess as to the error that is being returned from fd.pd.init.
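As a rough way to check the guess above, a process can compare its open descriptor count with the epoll limit. The sketch below assumes Linux with /proc mounted; the helper names are hypothetical, and the descriptor count is only a proxy, since max_user_watches is a per-user limit on epoll registrations rather than on open descriptors.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

const maxUserWatchesPath = "/proc/sys/fs/epoll/max_user_watches"

// parseLimit parses the contents of the max_user_watches file,
// which is a decimal integer followed by a newline.
func parseLimit(s string) (int64, error) {
	return strconv.ParseInt(strings.TrimSpace(s), 10, 64)
}

// readMaxUserWatches reads the per-user epoll watch limit from /proc.
func readMaxUserWatches() (int64, error) {
	b, err := os.ReadFile(maxUserWatchesPath)
	if err != nil {
		return 0, err
	}
	return parseLimit(string(b))
}

// countOpenFDs counts the entries in /proc/self/fd. The directory handle
// used for the listing is itself counted, so treat the result as approximate.
func countOpenFDs() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	limit, err := readMaxUserWatches()
	if err != nil {
		fmt.Println("could not read limit:", err)
		return
	}
	open, err := countOpenFDs()
	if err != nil {
		fmt.Println("could not count descriptors:", err)
		return
	}
	fmt.Printf("open descriptors: %d, max_user_watches: %d\n", open, limit)
}
```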

@philhofer
Contributor Author

I've uploaded a (large) execution trace to a public S3 bucket:

aws s3 cp s3://sneller-samples/pub/go-61205.trace . --no-sign-request

@ramondeklein

Trace file can be downloaded using https://sneller-samples.s3.amazonaws.com/pub/go-61205.trace.

@ianlancetaylor
Contributor

As far as I can tell, if epoll_ctl fails because max_user_watches is exceeded, then the error is returned back up the stack and should be returned from net.FileConn. So I don't think I understand the problem. Can you show a test case? Thanks.

@bcmills added the WaitingForInfo label Jul 13, 2023
@gopherbot

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot closed this as not planned Aug 13, 2023

5 participants