|
|
Created:
15 years, 3 months ago by dho Modified:
15 years, 3 months ago Reviewers:
CC:
rsc, golang-dev Visibility:
Public. |
DescriptionFix netFD.Close races demonstrated in Issues 321 and 271.
Patch Set 1 #Patch Set 2 : code review 163052: Addresses some race conditions in calling Close. #
Total comments: 1
Patch Set 3 : code review 163052: Fix netFD.Close races #Patch Set 4 : code review 163052: Fix netFD.Close races #
Total comments: 10
Patch Set 5 : code review 163052: Fix netFD.Close races #Patch Set 6 : code review 163052: Fix netFD.Close races #Patch Set 7 : code review 163052: Fix netFD.Close races #Patch Set 8 : code review 163052: Fix netFD.Close races #Patch Set 9 : code review 163052: Fix netFD.Close races demonstrated in Issues 321 and 271. #
MessagesTotal messages: 27
Hello rsc, agl (cc: golang-dev@googlegroups.com), I'd like you to review the following change.
Sign in to reply to this message.
can you reproduce #360? does this fix it? On Mon, Nov 30, 2009 at 20:41, <devon.odell@gmail.com> wrote: > Reviewers: rsc, agl, > > Message: > Hello rsc, agl (cc: golang-dev@googlegroups.com), > > I'd like you to review the following change. > > > Description: > Addresses some race conditions in calling Close. > > This is probably really ugly, but it seems to work for all > the provided test cases, which I can reproduce. > > Please review this at http://codereview.appspot.com/163052 > > Affected files: > M src/pkg/net/fd.go > M src/pkg/syscall/mkerrors.sh > M src/pkg/syscall/zerrors_darwin_386.go > M src/pkg/syscall/zerrors_darwin_amd64.go > M src/pkg/syscall/zerrors_freebsd_amd64.go > M src/pkg/syscall/zerrors_linux_386.go > M src/pkg/syscall/zerrors_linux_amd64.go > > >
Sign in to reply to this message.
2009/11/30 Russ Cox <rsc@golang.org>: > can you reproduce #360? does this fix it? Yes, and yes. Also seems to fix #321, but I can't test #271, so feedback from wjosephson would be nice. --dho > On Mon, Nov 30, 2009 at 20:41, <devon.odell@gmail.com> wrote: >> Reviewers: rsc, agl, >> >> Message: >> Hello rsc, agl (cc: golang-dev@googlegroups.com), >> >> I'd like you to review the following change. >> >> >> Description: >> Addresses some race conditions in calling Close. >> >> This is probably really ugly, but it seems to work for all >> the provided test cases, which I can reproduce. >> >> Please review this at http://codereview.appspot.com/163052 >> >> Affected files: >> M src/pkg/net/fd.go >> M src/pkg/syscall/mkerrors.sh >> M src/pkg/syscall/zerrors_darwin_386.go >> M src/pkg/syscall/zerrors_darwin_amd64.go >> M src/pkg/syscall/zerrors_freebsd_amd64.go >> M src/pkg/syscall/zerrors_linux_386.go >> M src/pkg/syscall/zerrors_linux_amd64.go >> >> >> >
Sign in to reply to this message.
Thanks for looking into this. I'm worried that this is just pushing the races down deeper but not solving them. A ref count without any kind of locking seems unlikely to be correct, and there are other uses of the fd besides in the poll server. I think that in netFD, any use of the fd field needs to be bracketed by a call to a new method lockfd() and a call to a new method unlockfd(). To catch all the uses I suggest renaming the fd field to sysfd. (You can't just grep because in most of the conn structs, there is a fd *netFD field. Changing the name will help keep these two separate anyway.) The methods would be something like // lockfd allows the caller to use sysfd. // When finished, the caller must call unlockfd. func (fd *netFD) lockfd() { fd.sysmu.Lock(); fd.sysref++; fd.sysmu.Unlock(); } // unlock releases sysfd. After calling this, // the caller should not refer to fd.sysfd again // without calling lockfd. func (fd *netFD) unlockfd() { fd.sysmu.Lock(); if fd.sysref--; fd.sysref == 0 && fd.closing && fd.sysfd >= 0 { // In case the user has set linger, // switch to blocking mode so the close blocks. // As long as this doesn't happen often, // we can handle the extra OS processes. // Otherwise we'll need to use the pollserver // for Close too. Sigh. syscall.SetNonblock(fd.sysfd, false); fd.file.Close(); fd.sysfd = -1; } fd.sysmu.Unlock(); } Then, here's a typical func from sock.go: func setReuseAddr(fd *netFD, reuse bool) os.Error { return setsockoptInt(fd.fd, syscall.SOL_SOCKET, syscall.SO_REUSEADDR, boolint(reuse)) } It needs to be: func setReuseAddr(fd *netFD, reuse bool) os.Error { fd.lockfd(); defer fd.unlockfd(); return setsockoptInt(fd.sysfd, syscall.SOL_SOCKET, syscall.SO_REUSEADDR, boolint(reuse)) } There are many such places. In addition to the fd.sysfd there are a couple uses of fd.file.Fd(), which should probably be rewritten to fd.sysfd, and one or two fd.file.Method() for various values of Method. I think the poll server is one place that doesn't need it: the poll server always acts on behalf of another method like Read or Write, so as long as the top-level Read or Write has the lock/unlock the pollserver's usage will be handled automatically. For example, Read says func (fd *netFD) Read(p []byte) (n int, err os.Error) { if fd == nil || fd.file == nil { return 0, os.EINVAL } fd.rio.Lock(); defer fd.rio.Unlock(); and you just have to throw in fd.lockfd(); defer fd.unlockfd(); and you're done, no need to fiddle with the poll server. This should also take care of the Close race that Adam fixed by moving the close to the poll server. Close now doesn't need to bother the poll server at all, just set the closing flag: func (fd *netFD) Close() os.Error { fd.lockfd(); fd.closing = true; fd.unlockfd(); return nil; } so the WaitCloseAck part of the poll server can probably be yanked out. http://codereview.appspot.com/163052/diff/16/19 File src/pkg/syscall/zerrors_darwin_386.go (right): http://codereview.appspot.com/163052/diff/16/19#newcode390 src/pkg/syscall/zerrors_darwin_386.go:390: SHUT_RD = 0; please add these 3 lines to zerrors_darwin_amd64.go too. i'm sure it's the same values. also, are any of these generated by the script? i note that there is no 0x prefix on these. and what about freebsd 386?
Sign in to reply to this message.
On Mon, Nov 30, 2009 at 9:22 PM, <rsc@golang.org> wrote: > also, are any of these generated by the script? > i note that there is no 0x prefix on these. > > and what about freebsd 386? Those are my fault. I edited the script, but the delta was huge, probably because I've a Ubuntu Karmic based system and it was generated on Hardy. Do I edited it by hand (before the FreeBSD stuff landed.) AGL
Sign in to reply to this message.
On Tue, Dec 01, 2009 at 05:22:40AM +0000, rsc@golang.org wrote: > Thanks for looking into this. > > I'm worried that this is just pushing the races down deeper > but not solving them. A ref count without any kind of > locking seems unlikely to be correct, and there are other > uses of the fd besides in the poll server. I don't really understand the issue at hand. I see something eating a ton of CPU and not making any system calls. At a guess a loop over a closed channel or similar. > I think that in netFD, any use of the fd field needs to be bracketed > by a call to a new method lockfd() and a call to a new method > unlockfd(). To catch all the uses I suggest renaming the fd field > to sysfd. (You can't just grep because in most of the conn structs, > there is a fd *netFD field. Changing the name will help keep these > two separate anyway.) I made this change and the others you suggested (to try and understand better something that's fairly hazy in my head right now). It's still not working. Even with copious amounts of debugging (fmt.Printf in every loop) I still can't seem to get *any* of the loops looping. > // lockfd allows the caller to use sysfd. > // When finished, the caller must call unlockfd. > func (fd *netFD) lockfd() { > fd.sysmu.Lock(); > fd.sysref++; > fd.sysmu.Unlock(); > } Given (in practise for most modern CPUs) fd.sysref++ is atomic, it's not clear if there is any value touching the mutex there. > It needs to be: > > func setReuseAddr(fd *netFD, reuse bool) os.Error { > fd.lockfd(); > defer fd.unlockfd(); > > return setsockoptInt(fd.sysfd, syscall.SOL_SOCKET, syscall.SO_REUSEADDR, boolint(reuse)) > } So we bump a reference for a few microseconds doing a call that might as well fail if there is a close going on underneath it. I don't see how the lock there helps much. > There are many such places. I think I got them all. > In addition to the fd.sysfd there are a couple uses of fd.file.Fd(), > which should probably be rewritten to fd.sysfd, and one or two > fd.file.Method() for various values of Method. I should have those too. > This should also take care of the Close race that Adam fixed > by moving the close to the poll server. Close now doesn't > need to bother the poll server at all, just set the closing flag: > > func (fd *netFD) Close() os.Error { > fd.lockfd(); > fd.closing = true; > fd.unlockfd(); > return nil; > } > > so the WaitCloseAck part of the poll server can probably > be yanked out. Gone. And 'cc Anyhow, to eyeball this as a starting point: http://codereview.appspot.com/164057
Sign in to reply to this message.
2009/12/1 Chris Wedgwood <cw@f00f.org>: > On Tue, Dec 01, 2009 at 05:22:40AM +0000, rsc@golang.org wrote: > >> Thanks for looking into this. >> >> I'm worried that this is just pushing the races down deeper >> but not solving them. A ref count without any kind of >> locking seems unlikely to be correct, and there are other >> uses of the fd besides in the poll server. Sorry, was staring at it for 3 hours figuring it out before I implemented it, and I guess I still didn't have it all figured out. I was thinking that the rio/wio locks would take care of it, but the refcounting isn't done in a locked region. More on this further down. > I don't really understand the issue at hand. I see something eating a > ton of CPU and not making any system calls. At a guess a loop over a > closed channel or similar. I wasn't able to figure out which channel it was, even with tons of print()s. >> I think that in netFD, any use of the fd field needs to be bracketed >> by a call to a new method lockfd() and a call to a new method >> unlockfd(). To catch all the uses I suggest renaming the fd field >> to sysfd. (You can't just grep because in most of the conn structs, >> there is a fd *netFD field. Changing the name will help keep these >> two separate anyway.) > > I made this change and the others you suggested (to try and understand > better something that's fairly hazy in my head right now). > > It's still not working. > > Even with copious amounts of debugging (fmt.Printf in every loop) I > still can't seem to get *any* of the loops looping. I don't think that will work all by itself. The refcounting, syscall.Shutdown and EOF detection is still necessary. >> // lockfd allows the caller to use sysfd. >> // When finished, the caller must call unlockfd. >> func (fd *netFD) lockfd() { >> fd.sysmu.Lock(); >> fd.sysref++; >> fd.sysmu.Unlock(); >> } > > Given (in practise for most modern CPUs) fd.sysref++ is atomic, it's > not clear if there is any value touching the mutex there. > >> It needs to be: >> >> func setReuseAddr(fd *netFD, reuse bool) os.Error { >> fd.lockfd(); >> defer fd.unlockfd(); >> >> return setsockoptInt(fd.sysfd, syscall.SOL_SOCKET, syscall.SO_REUSEADDR, boolint(reuse)) >> } > > So we bump a reference for a few microseconds doing a call that might > as well fail if there is a close going on underneath it. I don't see > how the lock there helps much. I don't think that the bumping the reference has anything to do with making the race more difficult to hit. This problem is typically easier to hit when run under ktrace / strace / similar, so I'd guess that with that and the copious print debugging that we both did, that would defer it if anything did. >> There are many such places. > > I think I got them all. > >> In addition to the fd.sysfd there are a couple uses of fd.file.Fd(), >> which should probably be rewritten to fd.sysfd, and one or two >> fd.file.Method() for various values of Method. > > I should have those too. > >> This should also take care of the Close race that Adam fixed >> by moving the close to the poll server. Close now doesn't >> need to bother the poll server at all, just set the closing flag: >> >> func (fd *netFD) Close() os.Error { >> fd.lockfd(); >> fd.closing = true; >> fd.unlockfd(); >> return nil; >> } >> >> so the WaitCloseAck part of the poll server can probably >> be yanked out. > > Gone. And 'cc > > > > Anyhow, to eyeball this as a starting point: > > http://codereview.appspot.com/164057 I'll merge this into mine, eyeball for any missed references, and re-upload. --dho
Sign in to reply to this message.
2009/12/1 Devon H. O'Dell <devon.odell@gmail.com>: > 2009/12/1 Chris Wedgwood <cw@f00f.org>: >> It's still not working. >> >> Even with copious amounts of debugging (fmt.Printf in every loop) I >> still can't seem to get *any* of the loops looping. > > I don't think that will work all by itself. The refcounting, > syscall.Shutdown and EOF detection is still necessary. After merging in your patchset, 360 remains fixed, but 321 is again an issue. --dho
Sign in to reply to this message.
Hello rsc, agl, cw (cc: golang-dev@googlegroups.com), I'd like you to review the following change.
Sign in to reply to this message.
On 2009/12/01 15:08:03, dho wrote: > Hello rsc, agl, cw (cc: mailto:golang-dev@googlegroups.com), > > I'd like you to review the following change. On top of the changes cw made, there was a missing addref in sock.go, and there's an implied reference between AddFD and WakeFD. This change seems better, while also still fixing both issues. --dho
Sign in to reply to this message.
Looks pretty good, thanks. A few small comments. http://codereview.appspot.com/163052/diff/1016/1017 File src/pkg/net/fd.go (right): http://codereview.appspot.com/163052/diff/1016/1017#newcode120 src/pkg/net/fd.go:120: // This check verifies that the underlying file descriptor hasn't been drop this comment and the TODO below http://codereview.appspot.com/163052/diff/1016/1017#newcode129 src/pkg/net/fd.go:129: fd.getref(); This one and the one in WakeFD should be unnecessary: the calling goroutine should have done it for us already. That avoids having a getref and delref split across functions. http://codereview.appspot.com/163052/diff/1016/1017#newcode272 src/pkg/net/fd.go:272: blank line inserted; delete http://codereview.appspot.com/163052/diff/1016/1017#newcode338 src/pkg/net/fd.go:338: func (fd *netFD) getref() { These are typically called incref and decref. http://codereview.appspot.com/163052/diff/1016/1017#newcode376 src/pkg/net/fd.go:376: func (fd *netFD) Close() os.Error { Still need the if fd == nil || fd.file == nil checks from the old version of Close. http://codereview.appspot.com/163052/diff/1016/1017#newcode399 src/pkg/net/fd.go:399: if (isEAGAIN(err) && fd.rdeadline >= 0) || (isEOF(err) && n == 0) { What does checking isEOF do? Is this left over from a previous fix? http://codereview.appspot.com/163052/diff/1016/1017#newcode514 src/pkg/net/fd.go:514: fd.getref(); I see - you need the ref ops in WaitRead because the refs here are so tight. I'd move these up out of the for loop, just one fd.incref(); defer fd.decref() for the entire function should be fine. http://codereview.appspot.com/163052/diff/1016/1018 File src/pkg/net/sock.go (right): http://codereview.appspot.com/163052/diff/1016/1018#newcode140 src/pkg/net/sock.go:140: fd.delref(); might as well use defer like above, for consistency. http://codereview.appspot.com/163052/diff/1016/1021 File src/pkg/net/unixsock.go (right): http://codereview.appspot.com/163052/diff/1016/1021#newcode346 src/pkg/net/unixsock.go:346: if l == nil || l.fd == nil || l.fd.sysfd < 0 { might as well drop the || l.fd.sysfd < 0. it's only correct half the time anyway because of the race. if sysfd < 0 the accept will get EINVAL for us.
Sign in to reply to this message.
2009/12/1 <rsc@golang.org>: > Looks pretty good, thanks. > A few small comments. Some responses inline... > http://codereview.appspot.com/163052/diff/1016/1017 > File src/pkg/net/fd.go (right): > > http://codereview.appspot.com/163052/diff/1016/1017#newcode120 > src/pkg/net/fd.go:120: // This check verifies that the underlying file > descriptor hasn't been > drop this comment and the TODO below Done > http://codereview.appspot.com/163052/diff/1016/1017#newcode129 > src/pkg/net/fd.go:129: fd.getref(); > This one and the one in WakeFD should be > unnecessary: the calling goroutine should have > done it for us already. > That avoids having a getref and delref split > across functions. I think this is due to me not grabbing refs in CheckDeadlines; done. > http://codereview.appspot.com/163052/diff/1016/1017#newcode272 > src/pkg/net/fd.go:272: > blank line inserted; delete Done > http://codereview.appspot.com/163052/diff/1016/1017#newcode338 > src/pkg/net/fd.go:338: func (fd *netFD) getref() { > These are typically called incref and decref. Done > http://codereview.appspot.com/163052/diff/1016/1017#newcode376 > src/pkg/net/fd.go:376: func (fd *netFD) Close() os.Error { > Still need the if fd == nil || fd.file == nil checks from the old > version of Close. Done > http://codereview.appspot.com/163052/diff/1016/1017#newcode399 > src/pkg/net/fd.go:399: if (isEAGAIN(err) && fd.rdeadline >= 0) || > (isEOF(err) && n == 0) { > What does checking isEOF do? > Is this left over from a previous fix? Without this, #360 persists. Though strace / ktrace don't show it, I'm having a hard time believing that something's not doing a syscall, getting 0 bytes back, and failing to detect the EOF. I've seen it happen before -- so we wait on the cr channel, and we never send anything down it because we have 0 bytes to read, and we have an EOF from the async syscall.Shutdown() on the fd. I don't know if that's ok or indicative of something else (I'm guessing the latter, but I don't have a good idea of what right now). > http://codereview.appspot.com/163052/diff/1016/1017#newcode514 > src/pkg/net/fd.go:514: fd.getref(); > I see - you need the ref ops in WaitRead > because the refs here are so tight. > I'd move these up out of the for loop, just one > > fd.incref(); > defer fd.decref() > > for the entire function should be fine. Done > http://codereview.appspot.com/163052/diff/1016/1018 > File src/pkg/net/sock.go (right): > > http://codereview.appspot.com/163052/diff/1016/1018#newcode140 > src/pkg/net/sock.go:140: fd.delref(); > might as well use defer like above, for consistency. > > http://codereview.appspot.com/163052/diff/1016/1021 > File src/pkg/net/unixsock.go (right): > > http://codereview.appspot.com/163052/diff/1016/1021#newcode346 > src/pkg/net/unixsock.go:346: if l == nil || l.fd == nil || l.fd.sysfd < > 0 { > might as well drop the || l.fd.sysfd < 0. > it's only correct half the time anyway > because of the race. > if sysfd < 0 the accept will get EINVAL for us. Done > http://codereview.appspot.com/163052 >
Sign in to reply to this message.
http://codereview.appspot.com/163052/diff/1016/1017 File src/pkg/net/fd.go (right): http://codereview.appspot.com/163052/diff/1016/1017#newcode399 src/pkg/net/fd.go:399: if (isEAGAIN(err) && fd.rdeadline >= 0) || (isEOF(err) && n == 0) { On 2009/12/01 17:25:02, rsc wrote: > What does checking isEOF do? > Is this left over from a previous fix? why do we even have isEOF? i can see isEGAIN is needed because err==os.EAGAIN won't work, but err==os.EOF should surely? i really don't understand why this is needed, i'm also wondering if there are syscalls that i can't see for some reason with strace (seems unlikely though)
Sign in to reply to this message.
2009/12/1 <cw@f00f.org>: > > http://codereview.appspot.com/163052/diff/1016/1017 > File src/pkg/net/fd.go (right): > > http://codereview.appspot.com/163052/diff/1016/1017#newcode399 > src/pkg/net/fd.go:399: if (isEAGAIN(err) && fd.rdeadline >= 0) || > (isEOF(err) && n == 0) { > On 2009/12/01 17:25:02, rsc wrote: >> >> What does checking isEOF do? >> Is this left over from a previous fix? > > why do we even have isEOF? i can see isEGAIN is needed because > err==os.EAGAIN won't work, but err==os.EOF should surely? > > i really don't understand why this is needed, i'm also wondering if > there are syscalls that i can't see for some reason with strace (seems > unlikely though) It's bogus. I'm removing it, as it causes the net tests to fail, but now we'll have to figure out what exactly is wrong with the websockets test. --dho
Sign in to reply to this message.
On Tue, Dec 01, 2009 at 02:34:29PM -0500, Devon H. O'Dell wrote: > It's bogus. I'm removing it, as it causes the net tests to fail, but > now we'll have to figure out what exactly is wrong with the websockets > test. I dropped if from what I have so we should basically have identical trees now? http://codereview.appspot.com/164057 This is working for me in all cases, but I'm having a hard time convincing myself it's entirely correct.
Sign in to reply to this message.
2009/12/1 Chris Wedgwood <cw@f00f.org>: > On Tue, Dec 01, 2009 at 02:34:29PM -0500, Devon H. O'Dell wrote: > >> It's bogus. I'm removing it, as it causes the net tests to fail, but >> now we'll have to figure out what exactly is wrong with the websockets >> test. > > I dropped if from what I have so we should basically have identical > trees now? > > http://codereview.appspot.com/164057 Minus the fact that I didn't upload the tree where I got rid of the n == 0/isEOF check -- our working copies are exactly the same. > This is working for me in all cases, but I'm having a hard time > convincing myself it's entirely correct. We're down to websocket at this point, so let's see what's up with that... --dho
Sign in to reply to this message.
Hi all, This should incorporate all of the changes discussed between cw's and my changes and fix issue #321 and hopefully #271 as well.
Sign in to reply to this message.
LGTM thanks.
Sign in to reply to this message.
Hello rsc, agl, cw (cc: golang-dev@googlegroups.com), I'd like you to review the following change.
Sign in to reply to this message.
LGTM. I am waiting for you (dho) to add Shutdown to FreeBSD before I submit this.
Sign in to reply to this message.
*** Submitted as http://code.google.com/p/go/source/detail?r=048b72a03aa3 *** net: fix netFD.Close races Fixes issue 271. Fixes issue 321. R=rsc, agl, cw CC=golang-dev http://codereview.appspot.com/163052 Committer: Russ Cox <rsc@golang.org>
Sign in to reply to this message.
|