Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/sys/unix: panic when using solaris Event Ports #54254

Closed
nshalman opened this issue Aug 4, 2022 · 22 comments
Closed

x/sys/unix: panic when using solaris Event Ports #54254

nshalman opened this issue Aug 4, 2022 · 22 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-illumos OS-Solaris
Milestone

Comments

@nshalman
Copy link

nshalman commented Aug 4, 2022

Summary

If you are running on illumos, you are likely suffering from https://www.illumos.org/issues/14898 and you should look into applying updates. If you are on an LTS release of some sort, you can ask your distro provider if that fix can be backported for you.

If you are running on Solaris, you should probably contact support and send them a link to the illumos issue to see if there's a corresponding fix available for Solaris.

There's a small chance that you're running into #54363 if the version of x/sys/unix in use doesn't contain the fix from https://go.dev/cl/422338.

Original Report (for posterity)

CAVEAT

This is clearly my own fault, but I am not smart enough (yet?) to figure out where my bug is.
It was introduced while attempting to fix a different problem in golang/sys@594fa53
as worked on in https://go-review.googlesource.com/c/sys/+/380034

What version of Go are you using (go version)?

$ go version
go version go1.19 solaris/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
warning: GOPATH set to GOROOT (/home/nshalman/go/) has no effect
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/nshalman/.cache/go-build"
GOENV="/home/nshalman/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="solaris"
GOINSECURE=""
GOMODCACHE="/home/nshalman/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="solaris"
GOPATH="/home/nshalman/go/"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/nshalman/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/nshalman/go/pkg/tool/solaris_amd64"
GOVCS=""
GOVERSION="go1.19"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/nshalman/sys/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3559396907=/tmp/go-build -gno-record-gcc-switches"

What did you do?

See fsnotify/fsnotify#371 (comment)

git clone https://github.com/nshalman/fsnotify -b fen-v2
cd fsnotify
mkdir deleteme
go run ./cmd/fsnotify deleteme/ &
for i in {1..500}; do touch deleteme/$i; done;

What did you expect to see?

No panics

What did you see instead?

panic: mismanaged memory

goroutine 33 [running]:
golang.org/x/sys/unix.(*EventPort).peIntToExt(0xc000200150, 0xc0004f52c0, 0xc000105da8)
        /home/nshalman/go/pkg/mod/golang.org/x/sys@v0.0.0-20220731174439-a90be440212d/unix/syscall_solaris.go:959 +0x265
golang.org/x/sys/unix.(*EventPort).Get(0xc000200150, {0xc000105da8, 0x8, 0x0?}, 0x0?, 0x0?)
        /home/nshalman/go/pkg/mod/golang.org/x/sys@v0.0.0-20220731174439-a90be440212d/unix/syscall_solaris.go:1002 +0x2e9
github.com/fsnotify/fsnotify.(*Watcher).readEvents(0xc000200060)
        /home/nshalman/deleteme/fsnotify/fen.go:148 +0x10a
created by github.com/fsnotify/fsnotify.NewWatcher
        /home/nshalman/deleteme/fsnotify/fen.go:47 +0x165
exit status 2
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 4, 2022
@gopherbot gopherbot added this to the Unreleased milestone Aug 4, 2022
@nshalman nshalman changed the title x/sys/unix: panic when using solaris EventPorts x/sys/unix: panic when using solaris Event Ports Aug 4, 2022
@nshalman
Copy link
Author

nshalman commented Aug 4, 2022

Reproduction is hard and inconsistent in how long it takes. Here's a test case that can eventually trigger it at least sometimes:

diff --git a/unix/syscall_solaris_test.go b/unix/syscall_solaris_test.go
index c2b28be..d597355 100644
--- a/unix/syscall_solaris_test.go
+++ b/unix/syscall_solaris_test.go
@@ -12,7 +12,9 @@ import (
        "io/ioutil"
        "os"
        "os/exec"
+       "path/filepath"
        "runtime"
+       "sync"
        "testing"

        "golang.org/x/sys/unix"

Then add this test

func TestEventPortMemoryStress(t *testing.T) {
        path, err := os.MkdirTemp("", "eventport")
        if err != nil {
                t.Fatalf("unable to create a tempdir: %v", err)
        }
        defer os.RemoveAll(path)

        stat, err := os.Stat(path)
        if err != nil {
                t.Fatalf("Failed to stat %s: %v", path, err)
        }
        port, err := unix.NewEventPort()
        if err != nil {
                t.Fatalf("NewEventPort failed: %v", err)
        }
        defer port.Close()
        cookie := stat.Mode()
        err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
        if err != nil {
                t.Errorf("AssociatePath failed: %v", err)
        }
        if !port.PathIsWatched(path) {
                t.Errorf("PathIsWatched unexpectedly returned false")
        }

        c := make(chan int)
        done := make(chan bool)
        var mu sync.Mutex
        go func (c chan int, done chan bool) {
                for {
                        _, err = port.GetOne(nil)
                        if err != nil {
                                t.Errorf("GetOne failed: %v", err)
                        }
                        mu.Lock()
                        err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
                        mu.Unlock()
                        select {
                        case _, _ = <-done:
                                return
                        default:
                                if err != nil {
                                        t.Errorf("AssociatePath failed: %v", err)
                                }
                        }
                        c <- 1
                }
        } (c, done)

        iterations := 500000
        for i := 0; i < iterations; i++ {
                mu.Lock()
                file, err := os.Create(filepath.Join(path, fmt.Sprintf("%d", i)))
                        if err != nil {
                                t.Fatalf("unable to create files in %s: %v", path, err)
                        }
                file.Close()
                mu.Unlock()
        }
        var sum int
        for i := 0; i < iterations; i++ {
                sum += <-c
        }
        done <- true
        if sum != iterations {
                t.Errorf("didn't get all %d events", iterations)
        }
}

It usually takes many seconds before it triggers, e.g.:

$ go test -run TestEventPortMemoryStress
panic: mismanaged memory

goroutine 19 [running]:
golang.org/x/sys/unix.(*EventPort).peIntToExt(0xc00010a870, 0xc000068ed0, 0xc000114800)
        /home/nshalman/sys/unix/syscall_solaris.go:959 +0x265
golang.org/x/sys/unix.(*EventPort).GetOne(0xc00010a870, 0xc000112018?)
        /home/nshalman/sys/unix/syscall_solaris.go:932 +0xeb
golang.org/x/sys/unix_test.TestEventPortMemoryStress.func1(0xc00010e820?, 0x5ad410?)
        /home/nshalman/sys/unix/syscall_solaris_test.go:368 +0xb4
created by golang.org/x/sys/unix_test.TestEventPortMemoryStress
        /home/nshalman/sys/unix/syscall_solaris_test.go:366 +0x5fb
exit status 2
FAIL    golang.org/x/sys/unix   27.582s

@dmitshur
Copy link
Contributor

dmitshur commented Aug 4, 2022

CC @golang/runtime, @ianlancetaylor.

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 4, 2022
@prattmic
Copy link
Member

prattmic commented Aug 4, 2022

I'm not sure what the issue is, but in your test the c channel is unbuffered. As a result the AssosciatePath goroutine loops only runs once concurrently with the os.Create loop (because c <- 1 will block until sum += <-c, which is after the os.Create loop).

Is that intentional? It looks like you want those goroutines to race as much as possible? If so, eliminating c in favor of a sync.WaitGroup may make this reproduce more readily.

@prattmic
Copy link
Member

prattmic commented Aug 4, 2022

FWIW, at https://cs.opensource.google/go/x/sys/+/master:unix/syscall_solaris.go;l=954 uintptr(unsafe.Pointer(fCookie.fobj)) == uintptr(peInt.Object) is a violation of the unsafe.Pointer requirements. It is not safe to store a Go pointer as uintptr or uint64.

That said, I don't immediately see how that would cause problems, as the EventPort.cookies map is keeping fobj alive, and we don't move objects.

@nshalman
Copy link
Author

nshalman commented Aug 5, 2022

For posterity I'm updating this comment with the version of the test that I ended up using to further debug this issue and to test the fix.

func TestEventPortMemoryStress(t *testing.T) {
        path, err := os.MkdirTemp("", "eventport")
        if err != nil {
                t.Fatalf("unable to create a tempdir: %v", err)
        }
        defer os.RemoveAll(path)

        stat, err := os.Stat(path)
        if err != nil {
                t.Fatalf("Failed to stat %s: %v", path, err)
        }
        port, err := unix.NewEventPort()
        if err != nil {
                t.Fatalf("NewEventPort failed: %v", err)
        }
        defer port.Close()


        iterations := 100000
        for i := 0; i < iterations; i++ {
                cookie := fmt.Sprintf("cookie %d", i)
                err = port.AssociatePath(path, stat, unix.FILE_MODIFIED, cookie)
                if err != nil {
                        t.Errorf("AssociatePath failed: %v", err)
                }
                if !port.PathIsWatched(path) {
                        t.Errorf("PathIsWatched unexpectedly returned false")
                }
                file, err := os.Create(filepath.Join(path, fmt.Sprintf("%d", i)))
                if err != nil {
                        t.Fatalf("unable to create files in %s: %v", path, err)
                }
                file.Close()
                err = os.Remove(filepath.Join(path, fmt.Sprintf("%d", i)))
                if err != nil {
                        t.Errorf("os.Remove failed: %v", err)
                }

                _, err = port.GetOne(nil)
                if err != nil {
                        t.Errorf("GetOne failed: %v", err)
                }
        }
}

@nshalman
Copy link
Author

nshalman commented Aug 5, 2022

Yeah... If I run that test with GOGC=off it never panics. Either I'm not creating the references the garbage collector needs so that it understands what I'm trying to do (more likely) or maybe there's a garbage collector bug (less likely)

GOGC=off go test -v -count 30 -run TestEventPortMemoryStress
ran to completion for me (n=1)

Without GOGC=off I have yet to get through 30 iterations. (n > 10)

@nshalman
Copy link
Author

nshalman commented Aug 5, 2022

FWIW, at https://cs.opensource.google/go/x/sys/+/master:unix/syscall_solaris.go;l=954 uintptr(unsafe.Pointer(fCookie.fobj)) == uintptr(peInt.Object) is a violation of the unsafe.Pointer requirements. It is not safe to store a Go pointer as uintptr or uint64.

That said, I don't immediately see how that would cause problems, as the EventPort.cookies map is keeping fobj alive, and we don't move objects.

Based on the discovery that the panics are pretty clearly related to garbage collection, I think you're onto something.
Is there a way I could change the code to not violate the unsafe.Pointer requirements?

I thought all of these maps were safe (the key in fds is an actual file descriptor, not a pointer...):

	fds   map[uintptr]*fileObjCookie
	paths map[string]*fileObjCookie
	cookies map[*interface{}]*fileObjCookie

One thing that I appear to be doing that could maybe be changed is that I'm working pretty hard to pass the pointer to the exact thing that the user gave me to the call to port_associate, but maybe I'm trying too hard. As long as I pass in some pointer in that is not going to get garbage collected, as long as I can figure out what it points to when I get it back, I can then return to the user whatever they asked me to associate...

@nshalman
Copy link
Author

nshalman commented Aug 5, 2022

Well, I tried simplifying things and while that makes all the code way easier to read, and the tests pass with GOGC=off,
the tests still fail when the garbage collector is allowed to run.

For now, my code is in a github branch here:
golang/sys@master...nshalman:sys:simplify-and-test

The simplification looks like a good idea to me, but doesn't solve the problem. Maybe there really is a GC bug?!
I'm happy to open a CR for the simplification if that will be helpful.

@prattmic
Copy link
Member

prattmic commented Aug 5, 2022

I thought all of these maps were safe (the key in fds is an actual file descriptor, not a pointer...):

	fds   map[uintptr]*fileObjCookie
	paths map[string]*fileObjCookie
	cookies map[*interface{}]*fileObjCookie

These are all fine. Storing as *fileObjCookie, *interface{} (though a bit odd), and unsafe.Pointer are all fine.

Casting to uintptr (even temporarily) is a problem because the GC loses track of the object and if it is not otherwise kept alive the GC may free the object while it is still in use.

This code is full of unsafe uintptr casts. e.g., https://cs.opensource.google/go/x/sys/+/master:unix/syscall_solaris.go;l=833-834 casts fobj to uintptr in the argument to port_associate, which is not safe [1]. Yet fCookie references fobj and is used later in the function, so it should be keeping the object alive.

Given the amount of uintptr code here, I suspect that one of these cases fails to keep an object alive and is causing the problem. If you have trouble determining the problematic spots, one debugging technique would be to add runtime.KeepAlive calls for any potentially problematic objects.

[1] Note: you've probably seen case 4 in unsafe.Pointer about casting to uintptr in the argument to syscall functions. That rule only applies to assembly functions that have uintptr argument, or Go functions with //go:uintptrkeepalive (such as syscall.Syscall on Linux). port_associate is neither, so fobj could die during the function call.

@prattmic
Copy link
Member

prattmic commented Aug 5, 2022

I suspect it would also be helpful to know in the panic("mismanaged memory") case whether the cookie is missing from the map, or the pointers mismatched (and if so what their values are. do they look like valid pointers?)

nshalman added a commit to nshalman/sys that referenced this issue Aug 7, 2022
@nshalman
Copy link
Author

nshalman commented Aug 7, 2022

I've pushed up a commit with some heavy-handed debugging to answer those questions at nshalman/sys@1bbdf46
Again, GOGC=off go test completes with no issues.
go test will now pretty reliably trigger the error message, but I find it particularly baffling most of the time:

$ go test
Cookie received from port_get at 0xc000492750: cookie 42866
          A cookie in the jar at 0xc000492000: cookie 42886
 Last cookie port_associate'd at 0xc000492000: cookie 42886
       Address at port_associate 0xc000492000
--- FAIL: TestEventPortMemoryStress (0.58s)
panic: mismanaged memory [recovered]
        panic: mismanaged memory
<snip>
$ go test
Cookie received from port_get at 0xc00046acc0: cookie 34760
          A cookie in the jar at 0xc00046a000: cookie 34775
 Last cookie port_associate'd at 0xc00046a000: cookie 34775
       Address at port_associate 0xc00046a000
--- FAIL: TestEventPortMemoryStress (0.55s)
panic: mismanaged memory [recovered]
        panic: mismanaged memory
<snip>

So it looks like port_get is getting back a different address than the one that was provided to port_associate, but again, this odd behavior only happens when the garbage collector is enabled.

However, on one run I saw the most confusing output of all:

$ go test
Cookie received from port_get at 0xc0004306f0: cookie 61295
          A cookie in the jar at 0xc000430000: cookie 61295
 Last cookie port_associate'd at 0xc000430000: cookie 61295
       Address at port_associate 0xc000430000
--- FAIL: TestEventPortMemoryStress (0.87s)
panic: mismanaged memory [recovered]
        panic: mismanaged memory
<snip>

Instead of getting back the wrong cookie, I somehow got the right cookie but at a different address.

The address seeming to be different at port_get and port_associate being triggered by the garbage collector feels like maybe it's an OS bug rather than a golang bug, but I really don't know.

Can we add the OS-illumos tag as well as that's where I'm doing my development and testing.
CC @jclulow @4ad

@nshalman
Copy link
Author

nshalman commented Aug 7, 2022

Had a brief discussion with @rmustacc on IRC who helped me come up with some dtrace to look a little closer at what's going on from the OS side.

fbt:portfs:port_associate_fop:entry
{
	printf(" in: %p->%p", args[2], args[4]);
}
fbt::port_copy_event:entry
/args[1]->portkev_source == 7/
{
	printf("out: %x->%p\n", args[1]->portkev_object, args[1]->portkev_user);
}
 
fbt::port_pfp_setup:entry
{
	printf(" in: %x->%p", args[4], args[6]);
}
 14  70111         port_associate_fop:entry  in: c000130000->c000016090
 14  70103             port_pfp_setup:entry  in: c000130000->c000016090
 14  69995            port_copy_event:entry out: c000130000->c000016090
 
 14  70111         port_associate_fop:entry  in: c000130050->c0000a0780
 14  70103             port_pfp_setup:entry  in: c000130050->c0000a0780
 14  69995            port_copy_event:entry out: c000130050->c0000a0780
 
 14  70111         port_associate_fop:entry  in: c00051e000->c000016060
 14  69995            port_copy_event:entry out: c00051e000->c000516b70

I think the lack of port_pfp_setup:entry is probably significant.

Cookie received from port_get at 0xc000516b70: cookie 5796
          A cookie in the jar at 0xc000016060: cookie 5814
 Last cookie port_associate'd at 0xc000016060: cookie 5814
       Address at port_associate 0xc000016060
--- FAIL: TestEventPortMemoryStress (0.11s)
panic: mismanaged memory [recovered]
	panic: mismanaged memory

I've filed https://www.illumos.org/issues/14898 in case this is indeed a bug on the illumos end.

@nshalman
Copy link
Author

nshalman commented Aug 9, 2022

This is clearly my own fault, but I am not smart enough (yet?) to figure out where my bug is.

Oh sweet summer child...

I'm now convinced that this is https://www.illumos.org/issues/14898 rather than an issue with the code in x/sys/unix.
I have tested https://code.illumos.org/c/illumos-gate/+/2299 (as of patchset 1) and it made this problem go away for me.

Since it will take a while for that fix to land, and longer for it to make it out to all machines in production, I'm inclined to updated the panic message to reference this issue. Perhaps something along the lines of:
panic("unexpected address received from event port (issue #54254)")

I think there is a nontrivial chance that once fsnotify/fsnotify#371 lands (which this bug was preventing me from feeling comfortable landing), illumos and Solaris (which might have this same issue) systems might start tripping over this panic and I want to provide users with a hint of what to do about it.

Thank you everyone for your help guiding me into narrowing down the issue. Your feedback on the panic message is requested so that I know what to put into the CR.

@gopherbot
Copy link

Change https://go.dev/cl/422099 mentions this issue: unix: Improve solaris event port panic message

nshalman added a commit to nshalman/sys that referenced this issue Aug 10, 2022
For golang/go#54254

Change-Id: Id59bacfabc5c818478f6a9af8d585f56f74c2bde
nshalman added a commit to nshalman/sys that referenced this issue Aug 17, 2022
For golang/go#54254

Change-Id: Id59bacfabc5c818478f6a9af8d585f56f74c2bde
@nshalman
Copy link
Author

I've updated the description at the top of this issue to include some helpful text to anyone directed here from the error message proposed in https://go.dev/cl/422099

gopherbot pushed a commit to golang/sys that referenced this issue Aug 25, 2022
For golang/go#54254

Change-Id: Id59bacfabc5c818478f6a9af8d585f56f74c2bde
Reviewed-on: https://go-review.googlesource.com/c/sys/+/422099
Reviewed-by: Nahum Shalman <nahamu@gmail.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Robert Griesemer <gri@google.com>
@nshalman
Copy link
Author

I've tweaked the language in the description one more time now that golang/sys@c680a09 has landed.
Thank you to everyone!

@davepacheco
Copy link

Sorry to drive-by this issue with a tangential one, but was anything determined about the uintptr casts that @prattmic mentioned above might be unsafe? I'm digging a bit into #53289 and several apparently-related failure modes (plus oxidecomputer/omicron#1130) and I'm wondering if they could be related to this one.

@ianlancetaylor
Copy link
Contributor

@davepacheco The code was rewritten (https://go.dev/cl/422338) to avoid the unsafe conversions.

The exact rules can be found at https://pkg.go.dev/unsafe#Pointer.

@davepacheco
Copy link

Thanks -- that's helpful! Do you know what release(s) that will wind up in (or is there some way for me to figure that out)?

@ianlancetaylor
Copy link
Contributor

Sorry, I'm not sure exactly what you are asking. If you are asking about the changes to golang.org/x/sys/unix, the golang.org/x/sys/unix package doesn't currently have a release system. We expect people to update (go get -u) the package as needed to pick up new features and bug fixes.

(Systematic releases for the golang.org/x repos is #21324, but nobody is thinking about that as far as I know.)

@davepacheco
Copy link

My mistake. I thought this package was tied to the Go runtime and so needed a new Go release to get the fix. (Sorry for my ignorance here...I'm a user of software written in Go but I haven't worked with Go much myself.)

@ianlancetaylor
Copy link
Contributor

@davepacheco No worries. (For future notice, see https://go.dev/wiki/Questions for good places to ask questions about Go.)

@golang golang locked and limited conversation to collaborators Oct 25, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-illumos OS-Solaris
Projects
None yet
Development

No branches or pull requests

7 participants