Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime,cmd/compile: apparent memory corruption in compress/flate #54596

Closed
bcmills opened this issue Aug 22, 2022 · 24 comments
Closed

runtime,cmd/compile: apparent memory corruption in compress/flate #54596

bcmills opened this issue Aug 22, 2022 · 24 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Aug 22, 2022

#!watchflakes
post <- pkg == "compress/flate" && `fatal error: found bad pointer in Go heap`

greplogs -l -e \(\?ms\)runtime\\.throw.\*FAIL\\s+compress/flate --since=2021-10-09
2022-08-17T20:19:28-9c2b481/freebsd-amd64-12_3
2022-08-17T03:15:44-e1b62ef/netbsd-arm64-bsiegert

A very similar failure mode was reported in #48846 and thought to have been fixed by @cherrymui.

(attn @golang/runtime)

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 22, 2022
@dr2chase dr2chase added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 24, 2022
@mknyszek mknyszek added this to the Go1.20 milestone Aug 24, 2022
@heschi
Copy link
Contributor

heschi commented Aug 29, 2022

@cherrymui
Copy link
Member

All three failures seem to have in common:

  • the GC complains about a bad pointer pointing to an unallocated span, but the object hex dump shows the memory content is 0. In face, the entire object seems 0. I think this looks like there is something zeroing the object in concurrent with the GC.
  • the object is in a span of size 663552 (0xa2000), which may be a flate.Writer which has size 656648 (0xa0508)

One possibility for the concurrent write of 0 is allocation. It may be that the GC scans the object when (or before) mallocgc zeroes it. For an object with that size, it is a large object and goes on this code path https://cs.opensource.google/go/go/+/master:src/runtime/malloc.go;l=1036

Why the GC sees the object before mallocgc returns? I don't really know. A conservative scan that happens to see a value coincide with the newly allocated pointer? That seems possible. mallocgc sets span.freeindex to 1 before zeroing the object or setting the heap bitmap. scanConservative (and greyobject) checks for span.IsFree, which in turn checks for freeindex https://cs.opensource.google/go/go/+/master:src/runtime/mgcmark.go;l=1431 . So it seems possible for scanConservative to pick up that object.

The span is from mcache.allocLarge, which calls initHeapBits https://cs.opensource.google/go/go/+/master:src/runtime/mcache.go;l=255
which, however, seems to set the heap bits only in certain cases -- noscan, or all pointers https://cs.opensource.google/go/go/+/master:src/runtime/mbitmap.go;l=735
For the case of a flate.Writer, it would be unset. I guess then the GC is possible to use an old heap bitmap to scan an object that isn't (fully) initialized? That seems bad.

In the third failure https://build.golang.org/log/4c7618e90e934e0158912f4fb949875d171bd521 it is complaining about the first word of the object. If the object is really a flate.Writer, the first word is not a pointer. This may indicate it is not using the right pointer bitmap, which the theory above would lead to.

In Go 1.19, where heapBits.initSpan always set the span's heap bits (to either all 0s or all pointers, never leave it unset), which seems to prevent this problem.

Maybe we should increment span.freeindex later, after zeroing the object and maybe setting the heap bits, so the GC won't see it, even for a conservative scan? This seems easy to do for large objects, but may be harder to do for small objects (without refactoring nextFree etc.).

I haven't reproduced the failure myself (didn't try very hard either). This is just a hypothesis. I may have missed something...

cc @mknyszek @randall77 @aclements

@mknyszek
Copy link
Contributor

A conservative scan that happens to see a value coincide with the newly allocated pointer?

One issue with this is while we're still in some runtime frame a conservative scan shouldn't happen. Asynchronous preemption backs out, I think.

But I agree with the rest of your thinking, i.e. the fact that it's a large object and that it seems like a pointer might somehow be getting published early (while freeindex is still unset). Either that, or there's a pointer into that space that was for some old object before the span was freed, and then materialized after the space for it had died (and it's just likely to land on the large object because it's large).

I think you're also onto something with the fact that the bitmap isn't cleared. Maybe there's yet-another place somewhere that we might hold onto a stale pointer that's otherwise safe because we assume the bitmap is going to get cleared (this was the case with noscan objects and cgo not long ago)?

We can try always zeroing the bitmap (unfortunate...) and see if the problem still shows up. That's data, but it would take a while to be sure given how rare this is.

@cherrymui
Copy link
Member

One issue with this is while we're still in some runtime frame a conservative scan shouldn't happen. Asynchronous preemption backs out, I think.

Oh, I didn't mean conservatively scanning the runtime frame (sorry for being unclear). I think it is possible that a conservative scan of a (completely separate) user frame, on a different thread, happens to see a value (possibly a dead pointer) that is within the range of the newly allocated span.

That said, I don't really know that it was conservative scan. There might be other possibilities for a GC to see such pointer that I couldn't think of.

We can try always zeroing the bitmap (unfortunate...) and see if the problem still shows up. That's data, but it would take a while to be sure given how rare this is.

I can try that. Last night I ran stress flate.test -test.short overnight on a gomote and saw 15 failures over 300000 runs. I can make the change and let it run for a day.

Among the failures the object hex dump are not always all 0. It could be that it is not (always) allocation. Or it may still be allocation, just that when it throws the allocating goroutine has already moved on, returned from mallocgc and started filling in the content.

I think you're also onto something with the fact that the bitmap isn't cleared. Maybe there's yet-another place somewhere that we might hold onto a stale pointer that's otherwise safe because we assume the bitmap is going to get cleared (this was the case with noscan objects and cgo not long ago)?

Yeah, this is unfortunate. But I'm also not really sure how safe it is for the "all pointer" case. If the above is possible, I guess it is still possible for the GC to see the newly allocated but not yet zeroed pointer word? Just much harder as it is just one word?

@mknyszek
Copy link
Contributor

Oh, I didn't mean conservatively scanning the runtime frame

Ohhhhhhh, I see. That's a really good point. It didn't occur to me that pointers could in effect get "published" this way. We probably do need to be defensive about it, as you suggest.

@gopherbot
Copy link

Change https://go.dev/cl/426834 mentions this issue: runtime: always initialize heap bits for a span

@cherrymui
Copy link
Member

With the CL above, before the gomote was killed by coordinator restart, I got

3h36m5s: 101279 runs so far, 0 failures

compared to previously

14h0m0s: 310600 runs so far, 15 failures (0.00%)

It seems positive that zeroing heap bitmap helps.

Independently I'm going to try if delaying incrementing span.freeindex makes any difference.

@gopherbot
Copy link

Change https://go.dev/cl/427619 mentions this issue: runtime: delay incrementing freeindex in malloc

@cherrymui
Copy link
Member

cherrymui commented Sep 1, 2022

With CL 427619 (without CL 426834), it also doesn't fail running overnight (18h4m40s: 522171 runs so far, 0 failures). So delaying incrementing freeindex also seems to help.

As another data point, disabling async preemption (without either CL above) also seems to stop the failure (15h58m40s: 462136 runs so far, 0 failures).

I think these all support the hypothesis above, i.e. conservative scan, GC seeing the newly allocated object before zeroing, and heap bitmap not set.

@bcmills
Copy link
Contributor Author

bcmills commented Sep 14, 2022

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "compress/flate" && `fatal error: found bad pointer in Go heap`
2022-08-17 20:19 freebsd-amd64-12_3 go@9c2b481b compress/flate.TestDeflateInflate (log)
runtime: pointer 0xc0014a6000 to unallocated span span.base()=0xc0014a6000 span.limit=0xc0014c6020 span.state=0
runtime: found in object at *(0xc0012d6000+0x18058)
object=0xc0012d6000 s.base()=0xc001276000 s.limit=0xc001316508 s.spanclass=0 s.elemsize=663552 s.state=mSpanInUse
 *(object+0) = 0x0
 *(object+8) = 0x0
 *(object+16) = 0x0
 *(object+24) = 0x0
 *(object+32) = 0x0
 *(object+40) = 0x0
 *(object+48) = 0x0
...
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw({0x559f67?, 0x6?})
	/tmp/workdir/go/src/runtime/panic.go:1047 +0x5d fp=0xc000055df8 sp=0xc000055dc8 pc=0x4347bd
runtime.badPointer(0x827ccaf60, 0xc000055e70?, 0xc0012d6000, 0xc000055f00?)
	/tmp/workdir/go/src/runtime/mbitmap.go:316 +0x150 fp=0xc000055e48 sp=0xc000055df8 pc=0x413390
runtime.findObject(0xc00095af00?, 0xc00002cf00?, 0xc00004a800?)
	/tmp/workdir/go/src/runtime/mbitmap.go:359 +0xa6 fp=0xc000055e80 sp=0xc000055e48 pc=0x413526
runtime.scanobject(0xc00002bc40?, 0xc00002bc40)
...
	/tmp/workdir/go/src/compress/flate/deflate.go:667 +0x48 fp=0xc000c75a78 sp=0xc000c75a38 pc=0x4edf28
compress/flate.testSync(0xc000116680, 0xc002049900?, {0x634210, 0x8, 0x8}, {0xc0000f80ba, 0x2})
	/tmp/workdir/go/src/compress/flate/deflate_test.go:265 +0x4a5 fp=0xc000c75d58 sp=0xc000c75a78 pc=0x4f6e45
compress/flate.testToFromWithLevelAndLimit(0xc000116680, 0xc000076340?, {0x634210, 0x8, 0x8}, {0xc0000f80ba, 0x2}, 0x0)
	/tmp/workdir/go/src/compress/flate/deflate_test.go:369 +0x49d fp=0xc000c75e40 sp=0xc000c75d58 pc=0x4f7f9d
compress/flate.testToFromWithLimit(0x7ffffffff1b0?, {0x634210, 0x8, 0x8}, {0xc0000f80ba, 0x2}, {0x0, 0x0, 0x0, 0x0, ...})
	/tmp/workdir/go/src/compress/flate/deflate_test.go:374 +0x6a fp=0xc000c75ea0 sp=0xc000c75e40 pc=0x4f808a
compress/flate.TestDeflateInflate(0xc0001164e0?)
	/tmp/workdir/go/src/compress/flate/deflate_test.go:386 +0x14f fp=0xc000c75f70 sp=0xc000c75ea0 pc=0x4f82af
testing.tRunner(0xc000116680, 0x55c940)
2022-08-26 17:48 darwin-amd64-10_14 go@296c40db compress/flate.TestBestSpeed (log)
runtime: pointer 0xc000f68000 to unallocated span span.base()=0xc000f68000 span.limit=0xc000f77fff span.state=0
runtime: found in object at *(0xc0013c0000+0x0)
object=0xc0013c0000 s.base()=0xc0013a0000 s.limit=0xc001440508 s.spanclass=0 s.elemsize=663552 s.state=mSpanInUse
 *(object+0) = 0x0 <==
 *(object+8) = 0x0
 *(object+16) = 0x0
 *(object+24) = 0x0
 *(object+32) = 0x0
 *(object+40) = 0x0
 *(object+48) = 0x0
...
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw({0x115b36f?, 0x6?})
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-amd64-10_14/go/src/runtime/panic.go:1047 +0x5d fp=0x7000090edce8 sp=0x7000090edcb8 pc=0x1033e7d
runtime.badPointer(0x1409a20, 0x7000090edd60?, 0xc0013c0000, 0x1024328?)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-amd64-10_14/go/src/runtime/mbitmap.go:314 +0x150 fp=0x7000090edd38 sp=0x7000090edce8 pc=0x1013210
runtime.findObject(0xc0030c4000?, 0x102fcff?, 0xc000031738?)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-amd64-10_14/go/src/runtime/mbitmap.go:357 +0xa6 fp=0x7000090edd70 sp=0x7000090edd38 pc=0x10133a6
runtime.scanobject(0xc000031738?, 0xc000031738)
...
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-amd64-10_14/go/src/runtime/slice.go:103 +0x52 fp=0xc000331978 sp=0xc000331950 pc=0x104b472
compress/flate.newHuffmanBitWriter(...)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-amd64-10_14/go/src/compress/flate/huffman_bit_writer.go:97
compress/flate.(*compressor).init(0xc0011ca000, {0x118b4a0?, 0xc00002c0c0}, 0x1)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-amd64-10_14/go/src/compress/flate/deflate.go:569 +0x51 fp=0xc000331a00 sp=0xc000331978 pc=0x10ee591
compress/flate.NewWriter({0x118b4a0, 0xc00002c0c0}, 0x0?)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-amd64-10_14/go/src/compress/flate/deflate.go:667 +0x48 fp=0xc000331a40 sp=0xc000331a00 pc=0x10ef148
compress/flate.TestBestSpeed(0xc000117380)
	/var/folders/9w/4l2_g3kx01x199n37fbmv3s80000gn/T/workdir-host-darwin-amd64-10_14/go/src/compress/flate/deflate_test.go:666 +0x831 fp=0xc000331f70 sp=0xc000331a40 pc=0x10fb771
testing.tRunner(0xc000117380, 0x115dd30)
2022-09-07 06:18 freebsd-amd64-13_0 go@6375f508 compress/flate.TestDeflateInflate (log)
runtime: pointer 0xc00191a000 to unused region of span span.base()=0xc000968000 span.limit=0xc000977fff span.state=1
runtime: found in object at *(0xc001828000+0x1c078)
object=0xc001828000 s.base()=0xc0017e8000 s.limit=0xc001888508 s.spanclass=0 s.elemsize=663552 s.state=mSpanInUse
 *(object+0) = 0x0
 *(object+8) = 0x0
 *(object+16) = 0x0
 *(object+24) = 0x0
 *(object+32) = 0x0
 *(object+40) = 0x0
 *(object+48) = 0x0
...
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw({0x55a430?, 0x6?})
	/tmp/workdir/go/src/runtime/panic.go:1047 +0x5d fp=0xc00005bdf8 sp=0xc00005bdc8 pc=0x433ebd
runtime.badPointer(0x827bb1898, 0xc00005be70?, 0xc001828000, 0x4242c8?)
	/tmp/workdir/go/src/runtime/mbitmap.go:314 +0x150 fp=0xc00005be48 sp=0xc00005bdf8 pc=0x413170
runtime.findObject(0xc0000141e0?, 0x0?, 0x0?)
	/tmp/workdir/go/src/runtime/mbitmap.go:357 +0xa6 fp=0xc00005be80 sp=0xc00005be48 pc=0x413306
runtime.scanobject(0xc000027238?, 0xc000027238)
...
	/tmp/workdir/go/src/runtime/malloc.go:1199 +0x27 fp=0xc000155d18 sp=0xc000155cf0 pc=0x40cf27
compress/flate.NewWriter({0x57d940, 0xc0001c0060}, 0x62a094?)
	/tmp/workdir/go/src/compress/flate/deflate.go:666 +0x2f fp=0xc000155d58 sp=0xc000155d18 pc=0x4edf2f
compress/flate.testToFromWithLevelAndLimit(0xc000582680, 0xc0005600d0?, {0x62a094, 0x2, 0x2}, {0xc0001480aa, 0x2}, 0x0)
	/tmp/workdir/go/src/compress/flate/deflate_test.go:344 +0x88 fp=0xc000155e40 sp=0xc000155d58 pc=0x4f77a8
compress/flate.testToFromWithLimit(0x7ffffffff020?, {0x62a094, 0x2, 0x2}, {0xc0001480aa, 0x2}, {0x0, 0x0, 0x0, 0x0, ...})
	/tmp/workdir/go/src/compress/flate/deflate_test.go:374 +0x6a fp=0xc000155ea0 sp=0xc000155e40 pc=0x4f7caa
compress/flate.TestDeflateInflate(0xc0005824e0?)
	/tmp/workdir/go/src/compress/flate/deflate_test.go:386 +0x14f fp=0xc000155f70 sp=0xc000155ea0 pc=0x4f7ecf
testing.tRunner(0xc000582680, 0x55cde0)
2022-09-16 16:32 linux-386-buster go@686b38b5 compress/flate.TestDeflateInflate (log)
runtime: pointer 0x9ec6000 to unallocated span span.base()=0x9ec6000 span.limit=0x9ed5fff span.state=0
runtime: found in object at *(0xa7a2000+0x3c)
object=0xa7a2000 s.base()=0xa702000 s.limit=0xa7a2484 s.spanclass=0 s.elemsize=663552 s.state=mSpanInUse
 *(object+0) = 0x0
 *(object+4) = 0x0
 *(object+8) = 0x0
 *(object+12) = 0x0
 *(object+16) = 0x0
 *(object+20) = 0x0
 *(object+24) = 0x0
...
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw({0x8192faf, 0x3e})
	/workdir/go/src/runtime/panic.go:1047 +0x4d fp=0x983bec4 sp=0x983beb0 pc=0x807ed6d
runtime.badPointer(0xef819a50, 0x9ec6000, 0xa7a2000, 0x3c)
	/workdir/go/src/runtime/mbitmap.go:314 +0xa2 fp=0x983bee4 sp=0x983bec4 pc=0x8058582
runtime.findObject(0x9ec6000, 0xa7a2000, 0x3c)
	/workdir/go/src/runtime/mbitmap.go:357 +0x9f fp=0x983befc sp=0x983bee4 pc=0x80587df
runtime.scanobject(0xa7a2000, 0x942c95c)
...
	/workdir/go/src/runtime/malloc.go:1199 +0x2c fp=0x99e0e80 sp=0x99e0e6c pc=0x8053a1c
compress/flate.NewWriter({0x81b8668, 0x9af8048}, 0x9)
	/workdir/go/src/compress/flate/deflate.go:666 +0x23 fp=0x99e0ea0 sp=0x99e0e80 pc=0x8131443
compress/flate.testToFromWithLevelAndLimit(0x9482690, 0x9, {0x8262258, 0x8, 0x8}, {0x9a5c09a, 0x2}, 0x0)
	/workdir/go/src/compress/flate/deflate_test.go:344 +0x45 fp=0x99e0f14 sp=0x99e0ea0 pc=0x813a3b5
compress/flate.testToFromWithLimit(0x9482690, {0x8262258, 0x8, 0x8}, {0x9a5c09a, 0x2}, {0x0, 0x0, 0x0, 0x0, ...})
	/workdir/go/src/compress/flate/deflate_test.go:374 +0x65 fp=0x99e0f3c sp=0x99e0f14 pc=0x813a875
compress/flate.TestDeflateInflate(0x9482690)
	/workdir/go/src/compress/flate/deflate_test.go:386 +0x11d fp=0x99e0f9c sp=0x99e0f3c pc=0x813aa1d
testing.tRunner(0x9482690, 0x8195784)
2022-09-29 19:35 linux-386-softfloat go@545adcfe compress/flate.TestDeflateInflate (log)
runtime: pointer 0x91e8000 to unallocated span span.base()=0x91e8000 span.limit=0x91f7fff span.state=0
runtime: found in object at *(0x9fb8000+0x3c)
object=0x9fb8000 s.base()=0x9f18000 s.limit=0x9fb8484 s.spanclass=0 s.elemsize=663552 s.state=mSpanInUse
 *(object+0) = 0x0
 *(object+4) = 0x0
 *(object+8) = 0x0
 *(object+12) = 0x0
 *(object+16) = 0x0
 *(object+20) = 0x0
 *(object+24) = 0x0
...
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw({0x819834d, 0x3e})
	/workdir/go/src/runtime/panic.go:1047 +0x4d fp=0x8d89ec4 sp=0x8d89eb0 pc=0x807fd3d
runtime.badPointer(0xef54ea20, 0x91e8000, 0x9fb8000, 0x3c)
	/workdir/go/src/runtime/mbitmap.go:314 +0xa2 fp=0x8d89ee4 sp=0x8d89ec4 pc=0x80589a2
runtime.findObject(0x91e8000, 0x9fb8000, 0x3c)
	/workdir/go/src/runtime/mbitmap.go:357 +0x9f fp=0x8d89efc sp=0x8d89ee4 pc=0x8058bff
runtime.scanobject(0x9fb8000, 0x8828f5c)
...
	/workdir/go/src/compress/flate/deflate.go:666 +0x23 fp=0x89fad30 sp=0x89fad10 pc=0x8136703
compress/flate.testSync(0x8882e10, 0x6, {0x8269086, 0x1, 0x1}, {0x88b8078, 0x2})
	/workdir/go/src/compress/flate/deflate_test.go:265 +0x417 fp=0x89faea0 sp=0x89fad30 pc=0x813e897
compress/flate.testToFromWithLevelAndLimit(0x8882e10, 0x6, {0x8269086, 0x1, 0x1}, {0x88b8078, 0x2}, 0x0)
	/workdir/go/src/compress/flate/deflate_test.go:369 +0x4bc fp=0x89faf14 sp=0x89faea0 pc=0x813faec
compress/flate.testToFromWithLimit(0x8882e10, {0x8269086, 0x1, 0x1}, {0x88b8078, 0x2}, {0x0, 0x0, 0x0, 0x0, ...})
	/workdir/go/src/compress/flate/deflate_test.go:374 +0x65 fp=0x89faf3c sp=0x89faf14 pc=0x813fb65
compress/flate.TestDeflateInflate(0x8882e10)
	/workdir/go/src/compress/flate/deflate_test.go:386 +0x11d fp=0x89faf9c sp=0x89faf3c pc=0x813fd0d
testing.tRunner(0x8882e10, 0x819ab88)

watchflakes

@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "compress/flate" && `fatal error: found bad pointer in Go heap`
2022-11-03 15:15 windows-amd64-2008 go@582a6c2d compress/flate.TestDeflateInflate (log)
runtime: pointer 0xc0022de000 to unallocated span span.base()=0xc0022de000 span.limit=0xc00231dffc span.state=0
runtime: found in object at *(0xc0022dc000+0xa8)
object=0xc0022dc000 s.base()=0xc00223c000 s.limit=0xc0022dc508 s.spanclass=0 s.elemsize=663552 s.state=mSpanInUse
 *(object+0) = 0x0
 *(object+8) = 0x0
 *(object+16) = 0x0
 *(object+24) = 0x0
 *(object+32) = 0x0
 *(object+40) = 0x0
 *(object+48) = 0x0
...
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw({0x573b4c?, 0x6?})
	C:/workdir/go/src/runtime/panic.go:1047 +0x65 fp=0x22fd38 sp=0x22fd08 pc=0x438e05
runtime.badPointer(0x7c9b98, 0x22fdb0?, 0xc0022dc000, 0x0?)
	C:/workdir/go/src/runtime/mbitmap.go:314 +0x150 fp=0x22fd88 sp=0x22fd38 pc=0x4146f0
runtime.findObject(0xc0016c2000?, 0xc00004cb08?, 0x22f8f0?)
	C:/workdir/go/src/runtime/mbitmap.go:357 +0xb7 fp=0x22fdc0 sp=0x22fd88 pc=0x414897
runtime.scanobject(0x908?, 0xc000037738)
...
	C:/workdir/go/src/compress/flate/deflate.go:666 +0x2f fp=0xc0007c1a88 sp=0xc0007c1a48 pc=0x503c8f
compress/flate.testSync(0xc000043d40, 0xc000be0600?, {0x65c368, 0x8, 0x8}, {0xc0000c62da, 0x2})
	C:/workdir/go/src/compress/flate/deflate_test.go:265 +0x47c fp=0xc0007c1d58 sp=0xc0007c1a88 pc=0x50c7bc
compress/flate.testToFromWithLevelAndLimit(0xc000043d40, 0xc0001a0270?, {0x65c368, 0x8, 0x8}, {0xc0000c62da, 0x2}, 0x0)
	C:/workdir/go/src/compress/flate/deflate_test.go:369 +0x49d fp=0xc0007c1e40 sp=0xc0007c1d58 pc=0x50d91d
compress/flate.testToFromWithLimit(0x24?, {0x65c368, 0x8, 0x8}, {0xc0000c62da, 0x2}, {0x0, 0x0, 0x0, 0x0, ...})
	C:/workdir/go/src/compress/flate/deflate_test.go:374 +0x6a fp=0xc0007c1ea0 sp=0xc0007c1e40 pc=0x50da0a
compress/flate.TestDeflateInflate(0xc000043520?)
	C:/workdir/go/src/compress/flate/deflate_test.go:386 +0x14f fp=0xc0007c1f70 sp=0xc0007c1ea0 pc=0x50dc2f
testing.tRunner(0xc000043d40, 0x576770)

watchflakes

@aclements
Copy link
Member

Just hit on a trybot: https://storage.googleapis.com/go-build-log/65e11c86/windows-amd64-2016_3020f2c6.log

Marking release-blocker, since this can cause random crashes on any platform. It sounds like we have a pretty good hypothesis as to the cause. We should finish the fix and land it.

@cherrymui
Copy link
Member

Yeah, I'm working on the fix. Hope to land soon-ish.

@gopherbot
Copy link

Change https://go.dev/cl/449017 mentions this issue: runtime: make GC see object as allocated after it is initialized

@cherrymui
Copy link
Member

@aclements @mknyszek CL 427619 and CL 449017 are two possible fixes. I stress-tested on gomotes and both runs 11+ hours, 300000+ runs with no failure, so both seem to fix the issue. Let me know which one do you think is better. Thanks.

(Probably still need some benchmarks to make sure the fast path performance doesn't regress. Will do.)

@mknyszek
Copy link
Contributor

mknyszek commented Nov 11, 2022

I accidentally clicked submit in the wrong Gerrit tab. I sent a revert: https://go.dev/cl/449501.

@mknyszek mknyszek reopened this Nov 11, 2022
@cherrymui
Copy link
Member

@gopherbot please backport this to previous releases. This may cause (rare) GC crashes or memory corruptions. The large object case is new for Go 1.20, but it could still happen with small objects in previous releases. Thanks.

@gopherbot
Copy link

Backport issue(s) opened: #56750 (for 1.20).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@cherrymui
Copy link
Member

Well, gopherbot picks the "Go 1.20" from my previous comment and opens 1.20 backport, which is not what we want... I'll open 1.18 and 1.19 ones manually...

@gopherbot
Copy link

Change https://go.dev/cl/453235 mentions this issue: [release-branch.go1.19] runtime: make GC see object as allocated after it is initialized

@gopherbot
Copy link

Change https://go.dev/cl/453255 mentions this issue: [release-branch.go1.18] runtime: make GC see object as allocated after it is initialized

gopherbot pushed a commit that referenced this issue Nov 25, 2022
…r it is initialized

When the GC is scanning some memory (possibly conservatively),
finding a pointer, while concurrently another goroutine is
allocating an object at the same address as the found pointer, the
GC may see the pointer before the object and/or the heap bits are
initialized. This may cause the GC to see bad pointers and
possibly crash.

To prevent this, we make it that the scanner can only see the
object as allocated after the object and the heap bits are
initialized. Currently the allocator uses freeindex to find the
next available slot, and that code is coupled with updating the
free index to a new slot past it. The scanner also uses the
freeindex to determine if an object is allocated. This is somewhat
racy. This CL makes the scanner use a different field, which is
only updated after the object initialization (and a memory
barrier).

Updates #54596.
Fixes #56751.

Change-Id: I2a57a226369926e7192c253dd0d21d3faf22297c
Reviewed-on: https://go-review.googlesource.com/c/go/+/449017
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
(cherry picked from commit febe7b8)
Reviewed-on: https://go-review.googlesource.com/c/go/+/453255
gopherbot pushed a commit that referenced this issue Nov 25, 2022
…r it is initialized

When the GC is scanning some memory (possibly conservatively),
finding a pointer, while concurrently another goroutine is
allocating an object at the same address as the found pointer, the
GC may see the pointer before the object and/or the heap bits are
initialized. This may cause the GC to see bad pointers and
possibly crash.

To prevent this, we make it that the scanner can only see the
object as allocated after the object and the heap bits are
initialized. Currently the allocator uses freeindex to find the
next available slot, and that code is coupled with updating the
free index to a new slot past it. The scanner also uses the
freeindex to determine if an object is allocated. This is somewhat
racy. This CL makes the scanner use a different field, which is
only updated after the object initialization (and a memory
barrier).

Updates #54596.
Fixes #56752.

Change-Id: I2a57a226369926e7192c253dd0d21d3faf22297c
Reviewed-on: https://go-review.googlesource.com/c/go/+/449017
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
(cherry picked from commit febe7b8)
Reviewed-on: https://go-review.googlesource.com/c/go/+/453235
andrew-d pushed a commit to tailscale/go that referenced this issue Dec 7, 2022
…r it is initialized

When the GC is scanning some memory (possibly conservatively),
finding a pointer, while concurrently another goroutine is
allocating an object at the same address as the found pointer, the
GC may see the pointer before the object and/or the heap bits are
initialized. This may cause the GC to see bad pointers and
possibly crash.

To prevent this, we make it that the scanner can only see the
object as allocated after the object and the heap bits are
initialized. Currently the allocator uses freeindex to find the
next available slot, and that code is coupled with updating the
free index to a new slot past it. The scanner also uses the
freeindex to determine if an object is allocated. This is somewhat
racy. This CL makes the scanner use a different field, which is
only updated after the object initialization (and a memory
barrier).

Updates golang#54596.
Fixes golang#56752.

Change-Id: I2a57a226369926e7192c253dd0d21d3faf22297c
Reviewed-on: https://go-review.googlesource.com/c/go/+/449017
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
(cherry picked from commit febe7b8)
Reviewed-on: https://go-review.googlesource.com/c/go/+/453235
@gopherbot gopherbot reopened this Jan 4, 2023
@gopherbot
Copy link

Found new dashboard test flakes for:

#!watchflakes
post <- pkg == "compress/flate" && `fatal error: found bad pointer in Go heap`
2022-11-07 19:46 windows-amd64-2012 go@601ad2e4 compress/flate.TestWriteError (log)
runtime: pointer 0xc001aa0000 to unused region of span span.base()=0xc0014f4000 span.limit=0xc0014f6000 span.state=1
runtime: found in object at *(0xc00186a000+0x20a8)
object=0xc00186a000 s.base()=0xc00182a000 s.limit=0xc0018ca508 s.spanclass=0 s.elemsize=663552 s.state=mSpanInUse
 *(object+0) = 0x0
 *(object+8) = 0x0
 *(object+16) = 0x0
 *(object+24) = 0x0
 *(object+32) = 0x0
 *(object+40) = 0x0
 *(object+48) = 0x0
...
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw({0x573b4c?, 0x6?})
	C:/workdir/go/src/runtime/panic.go:1047 +0x65 fp=0x23fd30 sp=0x23fd00 pc=0x438e85
runtime.badPointer(0x28e920b0, 0x23fda8?, 0xc00186a000, 0x23fe38?)
	C:/workdir/go/src/runtime/mbitmap.go:314 +0x150 fp=0x23fd80 sp=0x23fd30 pc=0x414730
runtime.findObject(0xc000d28000?, 0x431f0e?, 0xc00002e138?)
	C:/workdir/go/src/runtime/mbitmap.go:357 +0xb7 fp=0x23fdb8 sp=0x23fd80 pc=0x4148d7
runtime.scanobject(0xc00002e138?, 0xc00002e138)
...
	C:/workdir/go/src/runtime/mgcmark.go:474 +0x1a5 fp=0xc000365d38 sp=0xc000365cd8 pc=0x41d9e5
runtime.deductAssistCredit(0x4b965c?)
	C:/workdir/go/src/runtime/malloc.go:1207 +0x5a fp=0xc000365d60 sp=0xc000365d38 pc=0x40dfda
runtime.mallocgc(0x8, 0x53f980, 0x0)
	C:/workdir/go/src/runtime/malloc.go:932 +0xd0 fp=0xc000365dc8 sp=0xc000365d60 pc=0x40d7f0
runtime.convT64(0xb2ae)
	C:/workdir/go/src/runtime/iface.go:382 +0x45 fp=0xc000365df0 sp=0xc000365dc8 pc=0x40b725
compress/flate.TestWriteError(0xc000498340)
	C:/workdir/go/src/compress/flate/writer_test.go:65 +0xb6 fp=0xc000365f70 sp=0xc000365df0 pc=0x51a816
testing.tRunner(0xc000498340, 0x576830)

watchflakes

@cherrymui
Copy link
Member

Failure is before the fix.

@golang golang locked and limited conversation to collaborators Jan 4, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Projects
Status: Done
Development

No branches or pull requests

7 participants