Skip to content

runtime: crash in hybrid barrier initializing C memory #19928

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
rsc opened this issue Apr 11, 2017 · 32 comments
Open

runtime: crash in hybrid barrier initializing C memory #19928

rsc opened this issue Apr 11, 2017 · 32 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@rsc
Copy link
Contributor

rsc commented Apr 11, 2017

Received report via email of code crashing in write barrier finding a bad pointer (pointing to a free span). Code looks like:

keys := (*[1 << 20]*C.char)(unsafe.Pointer(C.malloc(C.size_t(uintptr(n)) * C.size_t(unsafe.Sizeof((*C.char)(nil))))))[:0:n]
values := (*[1 << 20]*C.char)(unsafe.Pointer(C.malloc(C.size_t(uintptr(n)) * C.size_t(unsafe.Sizeof((*C.char)(nil))))))[:0:n]
for key, vals := range req.Header {
	for _, value := range vals {
		keys = append(keys, C.CString(key))
		values = append(values, C.CString(value))
	}
}

The problem appears to be that C.malloc returns uninitialized memory. Then the initialization of that memory during append overwrites uninitialized pointer slots, and the new hybrid write barrier tries to follow the uninitialized pointers as if they were real pointers.

Prior to the hybrid barrier, the old data being overwritten for off-heap memory was never considered. Now it is. This will invalidate code like the above. It took a long time to notice, so apparently this is rare.

One workaround is to make C.malloc return zeroed memory, which would have a performance cost people might not like (especially since it doesn't seem to matter to most people?). Another workaround is to tell people to use C.calloc for allocating C data structures with pointers, or maybe rewriting just the C.malloc calls that mention pointer sizes.

Really the best fix would be for the GC not to follow these pointers. @aclements says there are some cases where do want the write barrier to follow pointers in non-managed-memory, specifically non-global off-heap runtime structures, but maybe we can distinguish these from other kinds of memory (like C memory) somehow.

If there is a simple fix and we receiver any reports of external problems along these lines, we can consider including the fix in Go 1.8.2.

@rsc rsc added this to the Go1.8.2 milestone Apr 11, 2017
@rsc
Copy link
Contributor Author

rsc commented Apr 11, 2017

Also, @ianlancetaylor, do you think this code should work as far as the Go pointer rules are concerned? There are no Go-managed data structures anywhere in sight, so it seems to me like they should.

@rsc
Copy link
Contributor Author

rsc commented Apr 11, 2017

/cc @RLH

@aclements
Copy link
Member

Really the best fix would be for the GC not to follow these pointers.

One downside of this is that it will increase the cost of the write barrier, since we'll have to add a check for if the slot is in the heap or a global.

@aclements says there are some cases where do want the write barrier to follow pointers in non-managed-memory, specifically non-global off-heap runtime structures, but maybe we can distinguish these from other kinds of memory (like C memory) somehow.

I'm not sure off the top of my head how many of these there are. It's possible we could manually annotate the writes to have write barriers (with a write barrier that bypasses the slot check), though that seems error-prone, especially going forward. Or we could compile all write barriers in the runtime to the slot-check-bypassing write barrier, using the knowledge that the runtime never writes a pointer to an uninitialized structure.

@ianlancetaylor
Copy link
Member

I agree that according to the current Go pointer rules this code is OK. Programs aren't permitted to store a Go pointer in C memory, and indeed that is not happening. Apparently the memory happens to contain a bit pattern that looks like a Go pointer, but it is not actually a Go pointer.

Changing C.malloc to return zeroed memory would fix this case but it wouldn't fix all cases of the general problem. It's currently permitted to call C code that allocates an array on the stack, fails to initialize it, and then calls a Go function with a pointer that the Go function uses to fill in elements. The Go function is not permitted to store Go pointers into the array, but it is permitted to store C pointers. That would invoke the write barrier on uninitialized memory from the C stack, which could potentially contain bit patterns that look like Go pointers.

We could extend the Go pointer rules to say that C memory that has a pointer type must be explicitly initialized if it is visible from Go. To make that simpler we could modify C.malloc to zero out memory, but that would only be necessary, and need only be done, when invoked with a pointer type. I don't see how to check that rule with cgocheck=1 but we could check it with cgocheck=2 (which does cgo checking at every write barrier).

@aclements
Copy link
Member

To summarize in-person discussion: we're all concerned about increasing the write barrier cost for something that appears to almost never be a problem in practice. So, we're leaning toward tweaking the rules to say something like: if Go writes to pointers in the memory, it has to be initialized (this is already the case for pointer reads; this would extend it to pointer writes) and possibly extending the cgo interface to make the rules easy to stay in (perhaps providing C.new that takes a type and zeroes memory). But we'll wait to see if this comes up again before diving into a rule change.

@bcmills
Copy link
Contributor

bcmills commented Apr 16, 2017

We could probably get at least some mileage out of whether the pointer has a C type or a Go type.

It seems reasonable to require that all memory accessed through Go types must be initialized (since Go values are normally zero-inititalized), but much less so for C types (since C out-parameters are often left uninitialized in idiomatic C code). With #16623 it might be more plausible for the compiler to be aware of whether the type of a given slot is C or Go.

We could vet the C memory for Go pointers at the time of conversion to Go types (e.g. in a C.AsSlice builtin per #13656 (comment)) by pre-scanning all of the pointer slots to ensure that they are valid.

Then we could apply the more expensive write barrier only to pointers of C types, for which assignments are relatively rare (and often somewhat expensive anyway because they tend to correlate with transitions between Go and C stacks).

@ianlancetaylor
Copy link
Member

@bcmills Perhaps, but I'm somewhat reluctant to further complicate the cgo pointer rules by adding rules that depend on the type used to access the memory, especially since when using cgo that type often varies between C code and Go code. For example, C code will often use void* where Go code will use unsafe.Pointer, so what is the correct handling when C code allocates an array of void* and calls a Go function that receives a *unsafe.Pointer and converts it to a []unsafe.Pointer? I'm not saying we can't write a rule but I'm concerned that we can't write a rule that people can actually understand and use.

@bcmills
Copy link
Contributor

bcmills commented Apr 16, 2017

I think we would have to declare that unsafe.Pointer (by itself or with any number of pointer indirections) is a potentially non-Go type (that is, needs the more expensive write barrier). But it might be prudent to use a more robust write barrier for unsafe.Pointer anyway, since its use in application code increases the risk of subtle pointer errors.

So the rules would be, roughly:

  • Each pointer to a Go type must contain nil, a pointer to a valid object in the Go heap, or a pointer to a valid, fully-initialized object outside the Go heap.
  • A "Go type" for this purpose is defined as:
    • A type of the form C.t in a cgo source file is not a Go type.
    • unsafe.Pointer is not a Go type.
    • A pointer is a Go type only if its base type is a Go type.
    • An array is a Go type only if its element type is a Go type.
    • All other types are Go types. Notably:
      • A struct type declared in Go code is always a Go type.
      • A slice type is always a Go type.

Note that the code in Russ's example would be invalid under this rule: *[1 << 20]*C.char is not a Go type, but []*C.char is. (We would presumably detect it as invalid while evaluating the slice expression if GODEBUG=cgocheck is set to a sufficiently high threshold.)

I will leave it up to you to decide whether that rule is too complicated.

@rsc
Copy link
Contributor Author

rsc commented Apr 17, 2017

The current type system does not distinguish between C memory and Go memory and cannot be reinterpreted to do so.

@bcmills
Copy link
Contributor

bcmills commented Apr 18, 2017

@rsc I don't think anyone is proposing that the type system should distinguish between C memory and Go memory. The rule I described above constrains initialization, not object location.

It's quite possible that there is a better choice of phrases than "Go type", but it wasn't obvious to me. The general idea is that a value of a "Go type" is always initialized by a Go statement or expression, whereas a value of a "non-Go type" may or may not be.

The current type system does distinguish between pointers and non-pointers, particularly with the use of unsafe.Pointer and across cgo boundaries. In the case of C out-parameters we have a third kind of data ("non-pointer stored in a pointer type"), and the proposed rule is an attempt to fit that into the existing type system (by allowing non-pointers to be stored in pointer slots of "non-Go" types).

@bradfitz bradfitz modified the milestones: Go1.8.2, Go1.8.3 May 18, 2017
@broady
Copy link
Contributor

broady commented May 23, 2017

Bumping to Go 1.8.4 (or maybe Go 1.9)

@rsc
Copy link
Contributor Author

rsc commented Oct 13, 2017

Are we just not going to fix this for Go 1.8, @aclements?

@aclements
Copy link
Member

Since we haven't received any external reports of this being a problem and there are ways to work around it, I would lean toward not fixing this for 1.8.

It's also still unclear to me what the fix would be. We decided in April that we didn't want to make the write barrier more expensive (and complicate the runtime) to detect whether the slot is outside Go-managed memory, especially without evidence that this is actually a problem. So that leaves changing the rules (and maybe making C.malloc zero memory and/or introducing C.new).

Either way, doesn't seem like a 1.8 issue, so I'm re-milestoning it to 1.10.

@aclements aclements modified the milestones: Go1.8.5, Go1.10 Oct 13, 2017
@bcmills
Copy link
Contributor

bcmills commented Oct 16, 2017

So that leaves changing the rules (and maybe making C.malloc zero memory and/or introducing C.new).

I'm more concerned about the implications for passed-in buffers than for the Go C.malloc wrapper itself: it's easy enough for Go callers to use calloc explicitly if they know they need to, but if we're exporting a Go API to C, it won't necessarily be obvious what kind of transitive zero-initialization requirements that would impose on C callers.

@bcmills
Copy link
Contributor

bcmills commented Jul 12, 2018

Is Go code allowed to write a nil pointer into C memory?

Yes.

If so, do we need to scrutinize the slot address when we see a nil write?

Not sure.

@aclements
Copy link
Member

To be a little more concrete, here's pseudo-code for what I'm proposing:

have old, slot from write barrier buffer
new := *slot
newObj := findObject(new)
if newObj == nil {
    if findObject(slot) == nil && slot is not in Go data segment {
        Non-Go pointer stored into non-Go memory. Ignore new and old.
        continue
    }
}
shade(newObj)
shade(findObject(old))

Notably, this never inspects old or tries to guess whether it looks like a garbage value. Pointer writes that could lead to a garbage old are discarded on the basis of writing a non-Go pointer (the new value) into non-Go memory (the slot address). We don't even try to shade old.

Unfortunately, while this avoids the slow-path slot check for most non-nil non-C pointer writes, it still triggers that path for writes of nil to Go memory, which are obviously very common, as well as writes of pointers to Go globals into Go heap memory.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/123658 mentions this issue: cmd/cgo: add note about bug writing C pointers to uninitialized C memory

@chfast
Copy link

chfast commented Jul 12, 2018

Please note that a C function struct S f() is transformed by the compiler to struct S* f(struct S*) and actually called as

struct S _;
f(&_);

Users don't have control over it, so this means that returning C structs from "exported" Go functions can also cause troubles.

@ianlancetaylor
Copy link
Member

@chfast I'm likely missing something but I don't see how that can cause a problem. The transformation that you describe is done implicitly by the C compiler. The problems described here only arise in Go code. The Go code is going to return a struct in the usual way for Go code, with the usual zero initialization of the result as neeed. The fact that the C code will then copy that into uninitialized C memory doesn't matter.

@chfast
Copy link

chfast commented Jul 12, 2018

Sorry, I incorrectly assumed that "exported" Go functions have C ABI.

@ianlancetaylor
Copy link
Member

@chfast Behind the scenes, C code that calls an exported Go function actually calls a C wrapper, generated by cgo, that calls the Go function using the Go ABI and translates the results into the C ABI.

gopherbot pushed a commit that referenced this issue Jul 16, 2018
Describe the problem as a bug, since it is not implied by the rest of
the pointer passing rules, and it may be possible to fix it.

Updates #19928

Change-Id: I2d336e7336b2a215c0b8cf909a203201ef1b054e
Reviewed-on: https://go-review.googlesource.com/123658
Reviewed-by: Austin Clements <austin@google.com>
@rsc
Copy link
Contributor Author

rsc commented Aug 17, 2018

Nothing more here for Go 1.11.

@rsc rsc modified the milestones: Go1.11, Go1.12 Aug 17, 2018
@andybons andybons modified the milestones: Go1.12, Go1.13 Feb 12, 2019
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@pdefrain
Copy link

We have an implementation that panics, Go 1.18, frequently (panics within minutes of any restart when loaded). The process is C++, RHEL/8, intel compiler, that loads a Golang shared object. The interface between C++ and Go is one that passes a byte*/[]byte back and forth using a Go channel for each direction. Messages are marshalled into buffers which are all allocated from C.malloc and un-marshaled into message objects in the other language. There is no other interaction between C++ and Go. Messages queued from C++ to Go run on a C++ thread that marshals the message, then un-marshals it into a Go message and puts it into a Go channel for goroutines to take. Messages from Go to C++ are queued on a Go channel by a goroutine, and taken off that channel using a C++ thread that dequeues the Go message from the channel, marshals it into a byte[], then un-marshals it into a C message. Marshaling is used as a means to avoid type converting - unsafe.Pointer is used along with a byte size for the un-marshaling functions - the unsafe.Pointer always being C.malloc memory.

Many variations on this theme have been tried, as well as variations on which heap to use for the byte pointers. The Go channels have even been removed and replace with C.malloc shared memory. Always the same problem occurs.

(gdb) bt
#0 runtime.raise () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/sys_linux_amd64.s:165
#1 0x00007f0617a20a25 in runtime.dieFromSignal (sig=6) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/signal_unix.go:769
#2 0x00007f0617a20c78 in runtime.crash () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/signal_unix.go:861
#3 0x00007f0617a0b5b1 in runtime.fatalthrow.func1 () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/panic.go:1257
#4 0x00007f0617a0b530 in runtime.fatalthrow () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/panic.go:1250
#5 0x00007f0617a0b2f1 in runtime.throw (s=...) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/panic.go:1198
#6 0x00007f0617a2ea96 in runtime.gentraceback (pc0=139664142949814, sp0=824634060688, lr0=, gp=, skip=0, pcbuf=0x0, max=2147483647, callback=
{void (runtime.stkframe *, void *, bool *)} 0x7f060d824a18, v=0x0, flags=0) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/traceback.go:274
#7 0x00007f06179f4e57 in runtime.scanstack (gp=0xc000000b60, gcw=0xc000045698) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/mgcmark.go:748
#8 0x00007f06179f3d91 in runtime.markroot.func1 () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/mgcmark.go:232
#9 0x00007f06179f3b50 in runtime.markroot (gcw=0xc000045698, i=22) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/mgcmark.go:205
#10 0x00007f06179f59b9 in runtime.gcDrain (gcw=0xc000045698, flags=3) at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/mgcmark.go:1013
#11 0x00007f06179f2a05 in runtime.gcBgMarkWorker.func2 () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/mgc.go:1269
#12 0x00007f0617a39146 in runtime.systemstack () at /home/defrain/legit3/ams/mediaserver/Golang/go/src/runtime/asm_amd64.s:383
#13 0x0000000000000000 in ?? ()

The SIGABRT comes from a GC thread. The GC code is generating a traceback in what appears to be a stack scan and the traceback code panics when it fails to identify a function pointer. I don’t know exactly why the panic generating gentraceback is in the “middle” of the GC code but it appears related to resolving defer operations (it’s not generating a traceback for logging, but, to identify what’s going on in the call stack it’s searching). It appears the GC code may be misusing frame.lr in stack frame 7 along the path executing (which ends in the fatal panic).

                    flr = findfunc(frame.lr)
                    if !flr.valid() {
                            // This happens if you get a profiling interrupt at just the wrong time.
                            // In that context it is okay to stop early.
                            // But if callback is set, we're doing a garbage collection and must
                            // get everything, so crash loudly.
                            doPrint := printing
                            if doPrint && gp.m.incgo && f.funcID == funcID_sigpanic {
                                    // We can inject sigpanic
                                    // calls directly into C code,
                                    // in which case we'll see a C
                                    // return PC. Don't complain.
                                    doPrint = false
                            }
                           if callback != nil || doPrint {
                                    print("runtime: unexpected return pc for ", funcname(f), " called from ", hex(frame.lr), "\n")
                                    tracebackHexdump(gp.stack, &frame, lrPtr)
                            }
                            if callback != nil {
                                    throw("unknown caller pc")

When the panic occurs, I find that frame.lr = 2, or 3, or 0 -- generally a small integer value. It could be there is code not setting frame.lr = 0 before the code gets here to avoid the valid() check.

I've found that if I modify the above code, to detect what appears to be an invalid pointer in frame.lr, and not check it, that our code runs fine w/o any memory growth for long periods of time. But still, it eventually panics, possibly because the invalid pointer check is only an approximation as opposed to being a proper fix.

@ianlancetaylor
Copy link
Member

@pdefrain Please open a new issue for your problem rather than adding on to this rather old issue. Your problem does not sound the same as the problem described in this issue, and I'm not even sure that this issue is still broken.

In your new issue please include reproduction instructions if possible. Thanks.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
@mknyszek mknyszek moved this to Triage Backlog in Go Compiler / Runtime Jul 15, 2022
levakin added a commit to levakin/go-duckdb that referenced this issue Dec 4, 2023
levakin added a commit to levakin/go-duckdb that referenced this issue Jan 9, 2024
levakin added a commit to levakin/go-duckdb that referenced this issue Jan 15, 2024
levakin added a commit to levakin/go-duckdb that referenced this issue Jan 17, 2024
marcboeker added a commit to marcboeker/go-duckdb that referenced this issue Jan 18, 2024

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
* appender: fix typo

* appender: remove unused error

* make duckdb.Conn exported to be used with conn.Raw method in std SQL package

* appender: remove not needed parenthesis

* Implement Apache Arrow interface and Arrow type structs

Added and defined structures for Arrow schema and Arrow array in a new file "arrow.h". This allows Arrow-compatible communication and data sharing, which is memory-efficient. Implemented Arrow interface in the connection.go file. Also updated the 'duckdb_test.go' file to include tests for the Arrow interface.

* call queryArrowArray only when there are more rows to get

* Rename QueryArrowContext to QueryArrow

Renames the function QueryArrowContext to QueryArrow. This change is to simplify the function name as 'Context' in the name isn't required or adds any meaning.

* prepareConfig: refactor to free option string pointers

* use calloc instead of malloc to avoid potential bug

golang/go#19928

* arrow: separate DuckDB Arrow interface

- conn structure was made unexported
- NewConnector was renamed to OpenConnector to be consistent with sql package and now returns ConnectorCloser interface to expose Close method
- appender: NewAppenderFromConn receives connection of type any to be compatible with sql.Conn.Raw method. Tests and docs were adopted as well.

* Refactor to use driver.Conn instead of any.

* styleguide fixes

* arrow: remove Query method leaving only QueryContext

* Simplifications on pointer handling

* Fix tests

* Fix indentation

* Fix formatting

---------

Co-authored-by: Marc Boeker <marc@at6.net>
@mjd95
Copy link

mjd95 commented May 3, 2024

@pdefrain Please open a new issue for your problem rather than adding on to this rather old issue. Your problem does not sound the same as the problem described in this issue, and I'm not even sure that this issue is still broken.

Just wanted to confirm that this one is still an issue, at least in Go 1.22.1. We were hitting this with some code similar to @rsc 's original example, and worked around by switching C.malloc to C.calloc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: Triage Backlog
Development

No branches or pull requests