cmd/compile: go statement desugaring results in additional allocations #46349

qmuntal · 2021-05-24T15:59:25Z

What version of Go are you using (`go version`)?

go version devel go1.17-cca23a7 Sat May 22 00:51:17 2021 +0000 windows/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

go env Output

set GO111MODULE=
set GOARCH=amd64
set GOBIN=
set GOCACHE=C:\Users\***\AppData\Local\go-build
set GOENV=C:\Users\***\AppData\Roaming\go\env
set GOEXE=.exe
set GOFLAGS=
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOINSECURE=
set GOMODCACHE=C:\Users\***\go\pkg\mod
set GONOPROXY=*.hp.com
set GONOSUMDB=*.hp.com
set GOOS=windows
set GOPATH=C:\Users\***\go
set GOPRIVATE=*.hp.com
set GOPROXY=direct
set GOROOT=C:\Users\***\sdk\gotip
set GOSUMDB=off
set GOTMPDIR=
set GOTOOLDIR=C:\Users\***\sdk\gotip\pkg\tool\windows_amd64
set GOVCS=
set GOVERSION=devel go1.17-cca23a7 Sat May 22 00:51:17 2021 +0000
set GCCGO=gccgo
set AR=ar
set CC=gcc
set CXX=g++
set CGO_ENABLED=1
set GOMOD=C:\Users\***\Documents\code\go3mf-thumbnailer\go.mod
set CGO_CFLAGS=-g -O2
set CGO_CPPFLAGS=
set CGO_CXXFLAGS=-g -O2
set CGO_FFLAGS=-g -O2
set CGO_LDFLAGS=-g -O2
set PKG_CONFIG=pkg-config
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\***\AppData\Local\Temp\go-build497339627=/tmp/go-build -gno-record-gcc-switches

What did you do?

package main

import (
  "sync"
  "testing"
)

func BenchmarkAlloc(b *testing.B) {
  count := 10
  for n := 0; n < b.N; n++ {
    var wg sync.WaitGroup
    wg.Add(count)
    for i := 0; i < count; i++ {
        go func() {
          wg.Done()
        }()
     }
     wg.Wait()
  }
}

gotip test -run AAAA -bench BenchmarkAlloc -benchmem

What did you expect to see?

Same result as with go1.16.3 and gotip+GOEXPERIMENT=noregabi:

goos: windows
goarch: amd64
pkg: main
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
BenchmarkCreate-8         211135              6922 ns/op              16 B/op          1 allocs/op
PASS
ok      main      2.048s

What did you see instead?

Alloc count grow linearly with the size of the loop:

goos: windows
goarch: amd64
pkg: main      
cpu: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
BenchmarkCreate-8         183998              7023 ns/op             176 B/op         11 allocs/op
PASS
ok      main            2.575s

Notice that the number of allocations is directly proportional to the number of iterations inside the loop.

Context

I've found this regression while trying gotip in a private codebase. The provided benchmark is just a synthetic and minimal example I've found that reproduces what I'm seeing in the real code.

This is a memory profile snap for this benchmark, generated using go tool pprof -alloc_objects mem.out:

         .          .      6:)
         .          .      7:
         .          .      8:func BenchmarkCreate(b *testing.B) {
         .          .      9:   count := 10
         .          .     10:   for n := 0; n < b.N; n++ {
    131074     131074     11:           var wg sync.WaitGroup
         .          .     12:           wg.Add(count)
         .          .     13:           for i := 0; i < count; i++ {
   1507351    1507351     14:                   go func() {
         .          .     15:                           wg.Done()
         .          .     16:                   }()
         .          .     17:           }
         .          .     18:           wg.Wait()

The text was updated successfully, but these errors were encountered:

mknyszek · 2021-05-24T16:24:41Z

Thanks for the report.

I believe this is due to a known change in Go 1.17 related to go and defer 'desugaring,' that may result in more allocations in some cases. You mention that you saw this in a larger codebase, but could you tell us more about what you saw there? Specifically, is there a significant performance regression in that case?

CC @thanm @cherrymui

mknyszek · 2021-05-24T16:26:18Z

Not marking this a release blocker until we have more information. Also because to some degree this is expected, and for the most part we've found the benefits to outweigh the costs overall by a significant margin.

qmuntal · 2021-05-24T16:54:41Z

I believe this is due to a known change in Go 1.17 related to go and defer 'desugaring,' that may result in more allocations in some cases. You mention that you saw this in a larger codebase, but could you tell us more about what you saw there? Specifically, is there a significant performance regression in that case?

I'm seeing a huge increase of allocations in all multiple parts of the codebase that are called from within a loop an unbounded number of times. In some realistic non-worth-case benchmarks I'm seeing 1000% allocation increase.

The codebase that I'm benchmarking parses 3D software-generated files, such as point clouds or 3D meshes, which can have millions of 3D points. Until now the parsing algorithm had O(1) complexity with respect to the number of vectors, but in gotip is O(N).

On the other hand, I'm also seeing a 12% speed performance gain in most of the benchmarks, which is much appreciated when processing big files. Still, I'm not sure that this speed gain is realistic as there are more allocations so I expect the GC to eat part of this improvement.

mknyszek · 2021-05-24T16:59:16Z

@qmuntal I think we need more details as to the nature of these heap allocations. Do they all involve a go statement? Or do they look like something else (like defer statements)?

You can use the go build flag -gcflags=-m=2 to get inlining and escape analysis data at build time to help your understand what the new sources of allocation are.

qmuntal · 2021-05-24T17:22:00Z

@mknyszek Not all the offending lines use go or defer. Here are some examples

val, err := strconv.ParseUint(string(a.Value), 10, 24) // a.Value is a byte slice

if d.OnStart != nil {
   // t is a complex struct holding a byte slice reference
   d.OnStart(t) // <- Ofending!
}

wn := runtime.GOMAXPROCS(0)
var wg sync.WaitGroup
wg.Add(wn)
for wi := 0; wi < wn; wi++ {
  go func(wi int) { // <- Ofending!
    // ...
  }
}

// This comes from https://tip.golang.org/src/mime/mediatype.go?s=3622:3707#152
 params = make(map[string]string)

I will do the inlining and scape analysis tomorrow morning (CET), I'm running out of time now.

Thanks for jumping into this.

mknyszek · 2021-05-24T17:23:58Z

CC @mdempsky @dr2chase for escape analysis and even more compiler regabi expertise.

cherrymui · 2021-05-24T17:26:53Z

Yes, with register ABI there are more go'd closures needs allocation now. But I'm not sure whether it is really a problem for performance. We already needs to allocate the new goroutine's stack (which is not counted as memory allocation in profile, but it's there). Now we need to do an extra allocation for that closure (which is typically a much smaller allocation). As you posted, the overall performance is improved.

We could do something clever, e.g. allocating the closure on the new goroutine's stack. But there is a complexity for that. I'm not sure it's worth doing. We could consider that if it is a real problem.

cherrymui · 2021-05-24T17:30:01Z

The escape analysis isn't really relevant here. For go statement, it runs asynchronously and so in general we always need to allocate that closure. Except the case that it is not a real closure, i.e. it doesn't capture anything. With register ABI, there are more closures become "real closures", in that we need to capture arguments as well.

mknyszek · 2021-05-24T17:33:06Z

Right, it's not really relevant to the desugaring, I looped in escape analysis people because reacting to @qmuntal claim that lots of other non-go, non-defer sites now end up allocating. But, it's still unclear to me where the new allocations are coming from.

qmuntal · 2021-05-24T18:21:37Z

My codebase is memory intensive and I've always tried to reduce the allocations in order to get performance gains indirectly. In this case, I haven't seen any performance regression but just much more allocations everywhere, so If you say this is the expected behavior when migrating to regabi then it's ok for me. If in the future I see there is a real performance issue here I can report it back with more data.

I let you decide what to do with this issue, I can either close it or provide you more data to better understand where to new allocations are coming from.

mknyszek · 2021-05-24T19:15:03Z

In that case, I'll at least move this issue out of the Go 1.17 milestone.

qmuntal · 2021-05-24T20:09:24Z

Yes, with register ABI there are more go'd closures needs allocation now. But I'm not sure whether it is really a problem for performance. We already needs to allocate the new goroutine's stack (which is not counted as memory allocation in profile, but it's there). Now we need to do an extra allocation for that closure (which is typically a much smaller allocation)

@cherrymui would it make sense to also hide the allocations for that closure from the profile? I tend to hunt down allocations for the sake of performance and from go1.17 the memory profiles will have much more noise due to these additional small allocations that presumably don't have a big performance impact and may be unavoidable.

For example, in my codebase the following line allocates in go1.17 and does not allocate in go1.16:

val, err := strconv.ParseUint(string(a.Value), 10, 24) // allocation!

In go1.16 I would have been worried because the allocation would probably be a.Value escaping to the heap, which is bad, but in go1.17 this allocation is inevitable and irrelevant because it is related to the... register abi?

cherrymui · 2021-05-24T20:45:07Z

would it make sense to also hide the allocations for that closure from the profile?

I think we probably don't want to do that. Under the hood it is just a regular Go allocation. And go realClosure() already allocates in the past and counted in profile.

In general, while memory profile counts all allocations, not all of them have the same impact to performance (CPU and memory usage).

val, err := strconv.ParseUint(string(a.Value), 10, 24) // allocation!

In go1.16 I would have been worried because the allocation would probably be a.Value escaping to the heap, which is bad, but in go1.17 this allocation is inevitable and irrelevant because it is related to the... register abi?

At a quick look it is not obvious to me that this is register ABI related. (The only thing expected due to register ABI is for the go statement (and perhaps defer statement inside a loop, which is rare).) Maybe there is a real issue here. I could take a careful look.

cherrymui · 2021-05-24T20:52:26Z

val, err := strconv.ParseUint(string(a.Value), 10, 24) // allocation!

Could you explain more about this? Could you show the output of -gcflags=-m on this line, for both Go 1.16 and tip?

For a simple example I run, there is an allocation for the []byte to string conversion, and it is the same with Go 1.16 and tip. The underlying a.Value does not necessarily escape, also same with Go 1.16 and tip.

zikaeroh · 2021-05-25T02:23:02Z

I also saw a change in allocations (some 10 extra) in my benchmarks stemming from regabidefer when it was stabilized a bit ago, but didn't report it as I couldn't figure out how to actually determine where the allocations were coming from with a pprof diff and the performance was still largely better with regabi enabled (so, probably not too effectual in my case). I tried removing code and found it to be a little confusing; it seemed like it was coming from deferring an opencensus trace span end call, but I tried to reproduce it outside my application with no success, so gave up.

-gcflags=-m for me is identical between regabi and noregabi. -m=2 is (regabi left, noregabi right):

235,281c235,247
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.file (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.file.close() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
< <autogenerated>:1: .this does not escape
< <autogenerated>:1: parameter .this leaks to {heap} with derefs=1:
< <autogenerated>:1:   flow: {heap} = *.this:
< <autogenerated>:1:     from .this.Common (dot of pointer) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common (address-of) at <autogenerated>:1
< <autogenerated>:1:     from .this.Common.RunDefaultServeMux() (call parameter) at <autogenerated>:1
< <autogenerated>:1: leaking param content: .this
---
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape
> <autogenerated>:1: .this does not escape

But, maybe there's something else I can inspect.

If it's of interest and I'm not sidetracking things, I'm happy to run whatever is needed. I know I was a little surprised to see more allocations when the regabi was enabled.

qmuntal · 2021-05-25T07:25:27Z

Could you explain more about this? Could you show the output of -gcflags=-m on this line, for both Go 1.16 and tip?

For a simple example I run, there is an allocation for the []byte to string conversion, and it is the same with Go 1.16 and tip. The underlying a.Value does not necessarily escape, also same with Go 1.16 and tip.

I was wrong when saying that line did not allocate in go 1.16 but did allocate in go 1.17. In both versions this line is allocating and string(a.Value) escapes to heap. Yet there are more allocations in this line than in go 1.16.

This is the output of -gcflags=-m for both versions: string(a.Value) escapes to heap

I found another strange thing that could be related to this issue. When comparing inuse_objects with regabi disabled and enabled, the former has 5461 objects allocated and the latest 70619. Almost all the new allocations come from runtime.doaddtimer which is indirectly triggered by runtime.bgscavenge.

This is an svg dump from the go tool pprof:

inuse_objects.zip

cherrymui · 2021-05-25T15:14:01Z

@zikaeroh Thanks for the information. If I see it correctly, the difference in -m=2 output above should not make a difference in allocation, at least by itself.

Maybe the extra allocations you saw came from go'd (or deferred in a loop) closures, which is expected.

cherrymui · 2021-05-25T15:17:01Z

@qmuntal

Yet there are more allocations in this line than in go 1.16.

On that line? Could you post the assembly output for that line, using -gcflags=-S, for 1.16 and tip? And maybe provide more context about that line of code?

Almost all the new allocations come from runtime.doaddtimer which is indirectly triggered by runtime.bgscavenge.

I'd be surprised for this being register ABI related. Are your comparing 1.16 vs. tip, or register ABI off (GOEXPERIMENT=none) vs. on? Thanks.

qmuntal · 2021-05-25T15:56:32Z

I'd be surprised for this being register ABI related. Are your comparing 1.16 vs. tip, or register ABI off (GOEXPERIMENT=none) vs. on? Thanks.

I'm comparing register ABI off (GOEXPERIMENT=noregabi) vs on, both in tip.

qmuntal · 2021-05-25T16:19:25Z

On that line? Could you post the assembly output for that line, using -gcflags=-S, for 1.16 and tip? And maybe provide more context about that line of code?

Just this line of code is in a public repo, so you can check it yourself: https://github.com/HPInc/go3mf/blob/master/decoder.go#L462

These are the assembly dumps:

Go tip

decoder.go:462)      MOVQ    "".a+248(SP), R8
decoder.go:462)      MOVQ    "".a+256(SP), CX
decoder.go:462)      XORL    AX, AX
decoder.go:462)      MOVQ    R8, BX
decoder.go:462)      PCDATA  $1, $4
decoder.go:462)      CALL    runtime.slicebytetostring(SB)
decoder.go:462)      MOVL    $10, CX
decoder.go:462)      MOVL    $24, DI
decoder.go:462)      CALL    strconv.ParseUint(SB)

Go 1.16

decoder.go:462)      MOVQ    "".a+256(SP), AX
decoder.go:462)      MOVQ    "".a+264(SP), DI
decoder.go:462)      MOVQ    $0, (SP)
decoder.go:462)      MOVQ    AX, 8(SP)
decoder.go:462)      MOVQ    DI, 16(SP)
decoder.go:462)      PCDATA  $1, $1
decoder.go:462)      CALL    runtime.slicebytetostring(SB)
decoder.go:462)      MOVQ    24(SP), AX
decoder.go:462)      MOVQ    32(SP), CX
decoder.go:462)      MOVQ    AX, (SP)
decoder.go:462)      MOVQ    CX, 8(SP)
decoder.go:462)      MOVQ    $10, 16(SP)
decoder.go:462)      MOVQ    $24, 24(SP)
decoder.go:462)      CALL    strconv.ParseUint(SB)
decoder.go:462)      MOVQ    32(SP), AX

Edited: Headers swapped.

thanm · 2021-05-25T17:32:35Z

That second excerpt (labeled "Go tip") does not look like Go tip to me-- it looks like Go tip plus GOEXPERIMENT=noregabi. All of the values being passed to subroutines are on the stack.

qmuntal · 2021-05-25T17:39:28Z

That second excerpt (labeled "Go tip") does not look like Go tip to me-- it looks like Go tip plus GOEXPERIMENT=noregabi. All of the values being passed to subroutines are on the stack.

You are partially right, I swapped the headers... I will edit the post to fix it. Thanks!

cherrymui · 2021-05-26T16:03:24Z

@qmuntal thanks for posting the assembly. From allocation's perspective they look the same. It is unclear to me where the extra allocation comes from.

networkimprov · 2021-05-27T22:46:33Z

@gopherbot remove WaitingForInfo

gopherbot · 2021-07-04T17:25:57Z

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

mknyszek added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label May 24, 2021

mknyszek added this to the Go1.17 milestone May 24, 2021

mknyszek changed the title ~~allocation regression when using regabi in tip~~ cmd/compile: go statement desugaring results in additional allocations May 24, 2021

mknyszek modified the milestones: Go1.17, Backlog May 24, 2021

gopherbot removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label May 27, 2021

mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 28, 2021

cherrymui added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jun 4, 2021

gopherbot closed this as completed Jul 4, 2021

golang locked and limited conversation to collaborators Jul 4, 2022

gopherbot added the FrozenDueToAge label Jul 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: go statement desugaring results in additional allocations #46349

cmd/compile: go statement desugaring results in additional allocations #46349

qmuntal commented May 24, 2021

mknyszek commented May 24, 2021

mknyszek commented May 24, 2021

qmuntal commented May 24, 2021

mknyszek commented May 24, 2021

qmuntal commented May 24, 2021 •

edited

mknyszek commented May 24, 2021

cherrymui commented May 24, 2021

cherrymui commented May 24, 2021

mknyszek commented May 24, 2021

qmuntal commented May 24, 2021

mknyszek commented May 24, 2021

qmuntal commented May 24, 2021

cherrymui commented May 24, 2021

cherrymui commented May 24, 2021

zikaeroh commented May 25, 2021 •

edited

qmuntal commented May 25, 2021

cherrymui commented May 25, 2021

cherrymui commented May 25, 2021

qmuntal commented May 25, 2021 •

edited

qmuntal commented May 25, 2021 •

edited

thanm commented May 25, 2021

qmuntal commented May 25, 2021

cherrymui commented May 26, 2021

networkimprov commented May 27, 2021

gopherbot commented Jul 4, 2021

cmd/compile: go statement desugaring results in additional allocations #46349

cmd/compile: go statement desugaring results in additional allocations #46349

Comments

qmuntal commented May 24, 2021

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

Context

mknyszek commented May 24, 2021

mknyszek commented May 24, 2021

qmuntal commented May 24, 2021

mknyszek commented May 24, 2021

qmuntal commented May 24, 2021 • edited

mknyszek commented May 24, 2021

cherrymui commented May 24, 2021

cherrymui commented May 24, 2021

mknyszek commented May 24, 2021

qmuntal commented May 24, 2021

mknyszek commented May 24, 2021

qmuntal commented May 24, 2021

cherrymui commented May 24, 2021

cherrymui commented May 24, 2021

zikaeroh commented May 25, 2021 • edited

qmuntal commented May 25, 2021

cherrymui commented May 25, 2021

cherrymui commented May 25, 2021

qmuntal commented May 25, 2021 • edited

qmuntal commented May 25, 2021 • edited

thanm commented May 25, 2021

qmuntal commented May 25, 2021

cherrymui commented May 26, 2021

networkimprov commented May 27, 2021

gopherbot commented Jul 4, 2021

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

qmuntal commented May 24, 2021 •

edited

zikaeroh commented May 25, 2021 •

edited

qmuntal commented May 25, 2021 •

edited

qmuntal commented May 25, 2021 •

edited