Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime,cmd/compile: redesign PCDATA encoding #61984

Open
aclements opened this issue Aug 13, 2023 · 1 comment
Open

runtime,cmd/compile: redesign PCDATA encoding #61984

aclements opened this issue Aug 13, 2023 · 1 comment
Labels
binary-size compiler/runtime Issues related to the Go compiler and/or runtime. FeatureRequest NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@aclements
Copy link
Member

I've been experimenting with alternate encodings for the PCDATA tables, with the goal of significantly improving decode speed, at possibly a slight increase in binary size.

@mknyszek, @dr2chase and I designed a format that's >4x faster to decode than the Go 1.21 tables, while increasing binary size by only 1–2.5%. This issue is to track (eventually) moving to this format in the compiler and runtime.

I've described the format here and have an out-of-tree implementation of the encoder and decoder (apologies for the messy, experimental code).

Here's an example benchmark, using the go binary:

goos: linux
goarch: amd64
pkg: github.com/aclements/go-misc/pcvaluetab
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
                               │  go.bench   │
                               │   sec/op    │
Decode/go/varint-cache-nohit-8   50.40n ± 2%
Decode/go/varint-cache-hit-8     10.23n ± 0%
Decode/go/varint-cache-none-8    42.46n ± 0%
Decode/go/alt-8                  11.69n ± 1%

file size change: +1.975729%

All of the benchmarks generate a random sample of 1024 (PC, table) pairs to lookup and report the average lookup time.

"Decode/go/alt" is the new implementation. "varint-cache-nohit" is the current varint format, with no cache hits. "varint-cache-hit" is similar, but repeats each lookup 8 times to simulate a high cache hit rate. "varint-cache-none" is the varint format, but with no cache.

Here are the same results for github.com/kubernetes/kubernetes/cmd/kubelet:

goos: linux
goarch: amd64
pkg: github.com/aclements/go-misc/pcvaluetab
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
                                    │ kubelet.bench │
                                    │    sec/op     │
Decode/kubelet/varint-cache-nohit-8     56.21n ± 1%
Decode/kubelet/varint-cache-hit-8       11.04n ± 1%
Decode/kubelet/varint-cache-none-8      48.61n ± 1%
Decode/kubelet/alt-8                    11.80n ± 1%

file size change: +1.248393%
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 13, 2023
@aclements aclements added this to the Backlog milestone Aug 13, 2023
@dmitshur dmitshur added Performance NeedsFix The path to resolution is known, but the work has not been done. FeatureRequest binary-size labels Aug 14, 2023
@aclements
Copy link
Member Author

I went a little deeper on benchmarking. I wanted to see how the PC being looked up affected speed, so in addition to selecting a "random" PC like before, I added benchmarks for looking up PC offset 0 and PC offset 4096. Indeed, if you read down the benchstat table, you can see that my proposed encoding is relatively insensitive to the PC value, while the varint encoding slows down pretty significantly with larger PC values. Even at PC offset 0, where you'd expect the varint encoding to do pretty well, it's still slower than my proposed encoding (looking at the pc=0 rows, comparing varint to alt).

perflock go test -test.run ^$ -test.bench Decode -bench-binary ~/sdk/go1.20.6/bin/go -bench-binary /tmp/kubelet -test.count 20 -test.timeout 20m
$ benchstat -col '/enc /cachehit' -filter .unit:sec/op all.bench 
goos: linux
goarch: amd64
pkg: github.com/aclements/go-misc/pcvaluetab
cpu: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
                          /enc │                                          varint                                           │                 alt                 │
                     /cachehit │              │                  0                   │                 7:1                 │                                     │
                               │    sec/op    │    sec/op     vs base                │   sec/op     vs base                │   sec/op     vs base                │
Decode/bin=go/pc=random-8        35.750n ± 1%   44.105n ± 0%  +23.37% (p=0.000 n=20)   9.348n ± 1%  -73.85% (p=0.000 n=20)   11.625n ± 1%  -67.48% (p=0.000 n=20)
Decode/bin=go/pc=0-8             10.295n ± 1%   19.510n ± 0%  +89.51% (p=0.000 n=20)   6.346n ± 1%  -38.36% (p=0.000 n=20)    8.582n ± 0%  -16.63% (p=0.000 n=20)
Decode/bin=go/pc=4096-8          286.75n ± 0%   286.15n ± 1%        ~ (p=0.499 n=20)   38.99n ± 0%  -86.40% (p=0.000 n=20)    11.58n ± 0%  -95.96% (p=0.000 n=20)
Decode/bin=kubelet/pc=random-8    43.98n ± 0%    51.21n ± 0%  +16.45% (p=0.000 n=20)   10.31n ± 0%  -76.55% (p=0.000 n=20)    11.76n ± 1%  -73.26% (p=0.000 n=20)
Decode/bin=kubelet/pc=0-8        11.190n ± 0%   20.190n ± 0%  +80.43% (p=0.000 n=20)   6.464n ± 1%  -42.23% (p=0.000 n=20)    8.577n ± 1%  -23.35% (p=0.000 n=20)
Decode/bin=kubelet/pc=4096-8     340.55n ± 0%   340.40n ± 0%        ~ (p=0.370 n=20)   45.77n ± 0%  -86.56% (p=0.000 n=20)    11.69n ± 1%  -96.57% (p=0.000 n=20)
geomean                           51.04n         66.52n       +30.33%                  13.85n       -72.87%                   10.53n       -79.37%

I also happened to run it on a much older, apparently much slower computer. The same conclusions hold:

$ benchstat -col '/enc /cachehit' -filter .unit:sec/op all.bench 
goos: linux
goarch: amd64
pkg: github.com/aclements/go-misc/pcvaluetab
cpu: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
                           /enc │                                          varint                                           │                 alt                 │
                      /cachehit │              │                  0                   │                 7:1                 │                                     │
                                │    sec/op    │    sec/op     vs base                │   sec/op     vs base                │   sec/op     vs base                │
Decode/bin=go/pc=random-48         95.36n ± 1%   106.90n ± 0%  +12.10% (p=0.000 n=20)   26.58n ± 0%  -72.13% (p=0.000 n=20)   33.67n ± 0%  -64.69% (p=0.000 n=20)
Decode/bin=go/pc=0-48              36.14n ± 1%    51.35n ± 1%  +42.09% (p=0.000 n=20)   19.66n ± 0%  -45.60% (p=0.000 n=20)   25.65n ± 0%  -29.01% (p=0.000 n=20)
Decode/bin=go/pc=4096-48          441.55n ± 0%   448.05n ± 0%   +1.47% (p=0.000 n=20)   69.34n ± 0%  -84.30% (p=0.000 n=20)   34.71n ± 0%  -92.14% (p=0.000 n=20)
Decode/bin=kubelet/pc=random-48   106.40n ± 1%   115.85n ± 1%   +8.88% (p=0.000 n=20)   27.73n ± 0%  -73.94% (p=0.000 n=20)   34.27n ± 0%  -67.79% (p=0.000 n=20)
Decode/bin=kubelet/pc=0-48         44.62n ± 1%    61.06n ± 2%  +36.83% (p=0.000 n=20)   21.17n ± 1%  -52.55% (p=0.000 n=20)   25.59n ± 0%  -42.65% (p=0.000 n=20)
Decode/bin=kubelet/pc=4096-48     509.65n ± 0%   517.85n ± 0%   +1.61% (p=0.000 n=20)   78.04n ± 0%  -84.69% (p=0.000 n=20)   36.77n ± 0%  -92.78% (p=0.000 n=20)
geomean                            124.3n         144.2n       +16.08%                  34.41n       -72.31%                  31.45n       -74.69%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binary-size compiler/runtime Issues related to the Go compiler and/or runtime. FeatureRequest NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
Development

No branches or pull requests

3 participants