Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: forceCloseSockets in test is not safe for finalized fds #15525

Closed
bradfitz opened this issue May 3, 2016 · 15 comments
Closed

net: forceCloseSockets in test is not safe for finalized fds #15525

bradfitz opened this issue May 3, 2016 · 15 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@bradfitz
Copy link
Contributor

bradfitz commented May 3, 2016

I was just debugging mysterious test failures (EBADFs) in random places in the net package from a CL of mine.

I notice the test failures went away when I disabled my new (unrelated) tests.

It turned out that I was forgetting to close some net.Conns in my test (fixed: https://go-review.googlesource.com/#/c/22031/6..7/src/net/net_test.go) and the corruption in the net package state was coming from finalizers.

So somehow the finalizers are messing up the event loop's state.

/cc @aclements @ianlancetaylor @dvyukov

@bradfitz bradfitz added this to the Go1.7Maybe milestone May 3, 2016
@dvyukov
Copy link
Member

dvyukov commented May 3, 2016

Repro?
Were EBADF coming from your calls or from runtime netpoll?

@bradfitz
Copy link
Contributor Author

bradfitz commented May 3, 2016

Repro is check out patchset 6 or earlier from the Gerrit review above (git fetch https://go.googlesource.com/go refs/changes/31/22031/6 && git checkout FETCH_HEAD) and then run the net tests and watch it fail. Use go test -count=20 net to make it fail reliably.

@mikioh
Copy link
Contributor

mikioh commented May 3, 2016

I suppose that this is the same as #14910. Once we break net.{Conn,Listener,PacketConn} loose from our control it can be garbage collected. Keep your belongings with you safely.

@dvyukov
Copy link
Member

dvyukov commented May 4, 2016

@mikioh I don't see that Brad's tests call .Fd()

@mikioh
Copy link
Contributor

mikioh commented May 4, 2016

@dvyukov,

Yup, I mistook the code path.

@rsc
Copy link
Contributor

rsc commented May 17, 2016

@randall77, I believe this is a possible dup of #15277. If SSA thinks the arg slots are dead and is reusing them for other things, that might end up smashing them too early, causing premature garbage collection.

@rsc rsc modified the milestones: Go1.7, Go1.7Maybe May 17, 2016
@randall77
Copy link
Contributor

I can't repro. Following Brad's instructions I get a smattering of i/o timeouts, bind: address already in use, unreachable networks, etc, but no EBADF or anything similar. I tried on both linux and darwin amd64.

Anyone who can repro, please provide more detailed instructions. Or try patching in https://go-review.googlesource.com/c/22365/ for me and see if it helps.

@rsc
Copy link
Contributor

rsc commented May 18, 2016

I tried refs/changes/31/22031/6 on both Linux and Darwin and they both just hang. After 10 minutes the test runner sends them a SIGQUIT. So that seems not to be a working repro.

@bradfitz
Copy link
Contributor Author

bradfitz commented May 18, 2016

I'll try to repro this again. (I was on Darwin at the time, sitting next to @ianlancetaylor who I showed this to also)

I don't think it required a specific -testing.run=.... regexp, but maybe it did.

@rsc
Copy link
Contributor

rsc commented May 18, 2016

I reproduced with "go test -short -count=20 net" on Linux. I get lots of failures, some of them probably not real problems, but I also do get many EBADFs.

  • Cherry-picking Keith's CL did not help.
  • "go test -a -gcflags=-ssa=0 -short -count=20" still fails.

Next step is probably to make the os.File close finalizer print the stacks of all current goroutines before closing an fd.

@bradfitz
Copy link
Contributor Author

(Since Russ can reproduce this, I'm not spending time on an easier repro.)

@bradfitz bradfitz added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 20, 2016
@rsc rsc modified the milestones: Go1.7Beta, Go1.7 May 27, 2016
@rsc
Copy link
Contributor

rsc commented May 27, 2016

To reproduce this on Linux it suffices to start at tip and comment out the defer ln.Close() in the goroutine inside net_test.go's TestZeroByteRead and then run go test -short -count=20 net. I am completely baffled. The test is installing various hooks that might be causing problems, so maybe this is just a buggy test. Or maybe it's a real finalization problem. It's difficult to tell. I asked @aclements to look at it when he finishes the bug he is currently working on.

I can't figure out how, even in the presence of early finalization, ln might get closed more than once, but that seems to be what is happening. Closing the fd twice could, if the fd were reused between the two, step on someone else. But then having stepped on someone else, when that someone else closes their fd, that could step on a third person, and so on, creating a chain of errors like in the failure output. I can't come up with another explanation for how this one async close could cause so many different test failures otherwise. But I also can't explain how the fd could possibly be closed twice. We're so careful about that - even if the program were racy or there were a normal Close racing with the finalizer close (and there doesn't appear to be), the uses of the fd are interlocked and reference counted, and we only actually close the fd when the ref count drops to zero.

Note: I commented last night that it looked like the liveness bug I just fixed, but that makes no sense since that was specific to SSA and this is not. I deleted that comment.

@rsc rsc modified the milestones: Go1.7Maybe, Go1.7Beta May 27, 2016
@rsc rsc changed the title net: Conn finalizers can corrupt network state net: forceCloseSockets in test is not safe for finalized fds May 27, 2016
@rsc
Copy link
Contributor

rsc commented May 27, 2016

It is in fact a double-close of an fd. The problem is that if a test does not close its own *netFD (what's inside a net.Conn or net.Listener), then a later test that makes use of forceCloseSockets will close the fd itself (using syscall.Close, not the *netFD which has all the right checks). Then when the finalizer on the *netFD eventually runs, it too will close that same numeric fd, and by then it may have been reused, leading to the kind of chain reaction I described in my previous message. Skipping the two tests that make use of forceCloseSockets makes the repro pass.

So either we just make sure to close all the *netFDs we open in the net test, or we change forceCloseSockets to call close on *netFDs not sockets, or we delete forceCloseSockets entirely (it seems like a bad idea as implemented today), or we change it to use syscall.Shutdown instead of syscall.Close, or we live with the fact that forgetting to close a *netFD blows up badly.

Since it's just a bad test, its OK to fix for Go 1.7 or to leave for Go 1.8. Leaving for @mikioh.

@mikioh
Copy link
Contributor

mikioh commented May 27, 2016

Oops, I don't remember the reason why I left forceCloseSockets in non-main test functions but it's clearly wrong, thanks.

@gopherbot
Copy link

CL https://golang.org/cl/23505 mentions this issue.

@mikioh mikioh removed their assignment May 31, 2016
@golang golang locked and limited conversation to collaborators May 31, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants