cmd/compile: implement more optimizations on loong64 #59120

Open
7 of 12 tasks
xen0n opened this issue Mar 19, 2023 · 12 comments
Labels
arch-loong64 Issues solely affecting the loongson architecture. compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance

@xen0n
Member

xen0n commented Mar 19, 2023

This issue mainly tracks the implementation progress of various low-hanging-fruit optimizations on loong64.

There are many missed optimization opportunities on loong64. A quick survey of SSA intrinsics turns up the following:

  • runtime.publicationBarrier
    • dmb st on arm64
    • dbar 0 on LA64 v1.00
    • dbar <TBD> on next revision of LA64 (finer-grained barriers are to be supported)
  • runtime.Bswap{32,64}
  • runtime/internal/sys.Prefetch{,Streamed}
    • preld on LA64 v1.00
  • runtime/internal/atomic.{And,Or}
  • math.{Trunc,Ceil,Floor,RoundToEven} not possible with LA64 v1.00
    • LA64 v1.00 frint.[sd] is not orthogonal: no fixed rounding mode variants (unlike e.g. ftintr{m,p,z,ne}).
  • math.Round
    • frint.[sd] on LA64 v1.00 -- have to check if the rounding mode behavior is tolerable
  • math.Abs
    • fabs.[sd] on LA64 v1.00
  • math.Copysign
    • fcopysign.[sd] on LA64 v1.00
  • math.FMA
    • f{,n}m{add,sub}.[sd] on LA64 v1.00: CL 483355
  • math/bits.TrailingZeros{64,32} (ssa.OpCtz{64,32})
  • math/bits.Len{64,32,} (ssa.OpBitLen{64,32})
    • clz.[wd] on LA64 v1.00: CL 483356
    • significant performance regression across the board; confirmed to be a micro-architecture quirk, alleviated somewhat by various alignment tricks
  • math/bits.Reverse{64,32,8} (ssa.OpBitRev{64,32,8})

We may want to implement (and preferably benchmark) all of the above.
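
As a rough illustration of the benchmarking step, a micro-benchmark for one of the listed intrinsics could look like the sketch below. It calls `testing.Benchmark` from a plain program so that it is self-contained; `sumLen64` and `benchLen64` are hypothetical names, and the existing math/bits benchmarks in the Go tree may be structured differently.

```go
package main

import (
	"fmt"
	"math/bits"
	"testing"
)

// sink keeps the compiler from discarding the benchmarked work.
var sink int

// sumLen64 (hypothetical helper) adds up bits.Len64 over n inputs.
// The "| 1" keeps every input non-zero.
func sumLen64(n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s += bits.Len64(uint64(i) | 1)
	}
	return s
}

// benchLen64 (hypothetical) drives sumLen64 for b.N iterations.
func benchLen64(b *testing.B) {
	sink = sumLen64(b.N)
}

func main() {
	r := testing.Benchmark(benchLen64)
	fmt.Printf("Len64: %d iterations, %d ns/op\n", r.N, r.NsPerOp())
}
```

Running the same program before and after wiring up an intrinsic gives a first-order comparison; results on a given machine may still be dominated by code layout effects.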

cc @golang/loong64

@xen0n xen0n added arch-loong64 Issues solely affecting the loongson architecture. compiler/runtime Issues related to the Go compiler and/or runtime. labels Mar 19, 2023
xen0n added a commit to xen0n/go that referenced this issue Mar 20, 2023
…ns with EXTW{B,H}

Updates golang#59120

Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e
xen0n added a commit to xen0n/go that referenced this issue Mar 20, 2023
Updates golang#59120

Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54
xen0n added a commit to xen0n/go that referenced this issue Mar 20, 2023
Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89
xen0n added a commit to xen0n/go that referenced this issue Mar 20, 2023
Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac
@heschi heschi added the NeedsFix The path to resolution is known, but the work has not been done. label Mar 20, 2023
xen0n added a commit to xen0n/go that referenced this issue Mar 20, 2023
…n loong64

tests TODO

Updates golang#59120

Change-Id: Icde85d717999600954244c1105b7c55759d3469f
@cherrymui cherrymui added this to the Unplanned milestone Mar 20, 2023
xen0n added a commit to xen0n/go that referenced this issue Mar 25, 2023
…xtensions

Updates golang#59120

Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e
@gopherbot

Change https://go.dev/cl/483356 mentions this issue: cmd/compile: wire up math/bits.Len intrinsics for loong64

xen0n added a commit to xen0n/go that referenced this issue Apr 10, 2023
Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │    before    │                after                 │
               │    sec/op    │    sec/op     vs base                │
ReverseBytes     3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16   0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32   1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64   2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean           1.882n        0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for
micro-architecture-related fluctuations.

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
…xtensions

8- and 16-bit sign extensions and 32-bit zero extensions were realized
with left and right shifts before this change. We now support assembling
EXTWB, EXTWH and BSTRPICKV, so each of the three can be done with a
single instruction.
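
For readers unfamiliar with the shift-pair idiom mentioned above, the following Go sketch shows that a left shift followed by a right shift on a 64-bit value gives the same result as a plain narrowing-and-widening conversion. The function names are hypothetical, and the sketch only demonstrates the arithmetic equivalence, not the instructions the compiler actually emits.

```go
package main

import "fmt"

// signExtend8Shifts sign-extends the low 8 bits of x: the left shift moves
// bit 7 into bit 63, and the arithmetic right shift replicates it.
func signExtend8Shifts(x int64) int64 { return (x << 56) >> 56 }

// signExtend16Shifts does the same for the low 16 bits.
func signExtend16Shifts(x int64) int64 { return (x << 48) >> 48 }

// zeroExtend32Shifts clears the upper 32 bits; the right shift is logical
// because the operand is unsigned.
func zeroExtend32Shifts(x uint64) uint64 { return (x << 32) >> 32 }

func main() {
	x := int64(0x123456789abcdef0)
	u := uint64(x)
	fmt.Println(signExtend8Shifts(x) == int64(int8(x)))
	fmt.Println(signExtend16Shifts(x) == int64(int16(x)))
	fmt.Println(zeroExtend32Shifts(u) == uint64(uint32(u)))
}
```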

Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 479495  │              this CL               │
                      │   sec/op    │   sec/op     vs base               │
BinaryTree17             14.12 ± 1%    14.06 ± 1%       ~ (p=0.393 n=10)
Fannkuch11               3.420 ± 0%    3.421 ± 0%  +0.04% (p=0.001 n=10)
FmtFprintfEmpty         94.72n ± 0%   94.97n ± 0%  +0.26% (p=0.000 n=10)
FmtFprintfString        152.6n ± 0%   155.3n ± 0%  +1.77% (p=0.000 n=10)
FmtFprintfInt           154.5n ± 0%   154.5n ± 0%       ~ (p=0.263 n=10)
FmtFprintfIntInt        237.7n ± 0%   237.1n ± 0%  -0.21% (p=0.000 n=10)
FmtFprintfPrefixedInt   313.1n ± 0%   313.0n ± 0%  -0.03% (p=0.000 n=10)
FmtFprintfFloat         394.1n ± 0%   392.8n ± 0%  -0.32% (p=0.000 n=10)
FmtManyArgs             934.3n ± 0%   912.6n ± 0%  -2.32% (p=0.000 n=10)
GobDecode               15.29m ± 1%   15.23m ± 1%       ~ (p=0.280 n=10)
GobEncode               17.76m ± 0%   17.66m ± 0%  -0.60% (p=0.000 n=10)
Gzip                    416.0m ± 0%   404.4m ± 0%  -2.79% (p=0.000 n=10)
Gunzip                  83.20m ± 0%   80.88m ± 0%  -2.79% (p=0.000 n=10)
HTTPClientServer        87.82µ ± 1%   87.09µ ± 1%  -0.83% (p=0.000 n=10)
JSONEncode              18.56m ± 0%   18.54m ± 0%       ~ (p=0.123 n=10)
JSONDecode              76.53m ± 0%   78.22m ± 1%  +2.21% (p=0.000 n=10)
Mandelbrot200           7.217m ± 0%   7.215m ± 0%       ~ (p=0.143 n=10)
GoParse                 7.587m ± 1%   7.520m ± 1%       ~ (p=0.165 n=10)
RegexpMatchEasy0_32     134.2n ± 0%   134.5n ± 0%  +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K     1.366µ ± 0%   1.364µ ± 0%  -0.15% (p=0.000 n=10)
RegexpMatchEasy1_32     163.0n ± 0%   164.0n ± 0%  +0.61% (p=0.000 n=10)
RegexpMatchEasy1_1K     1.497µ ± 0%   1.492µ ± 0%  -0.33% (p=0.000 n=10)
RegexpMatchMedium_32    1.415µ ± 0%   1.403µ ± 0%  -0.85% (p=0.000 n=10)
RegexpMatchMedium_1K    41.61µ ± 0%   41.05µ ± 0%  -1.36% (p=0.000 n=10)
RegexpMatchHard_32      2.121µ ± 0%   2.070µ ± 0%  -2.43% (p=0.000 n=10)
RegexpMatchHard_1K      62.64µ ± 0%   60.87µ ± 0%  -2.83% (p=0.000 n=10)
Revcomp                  1.204 ± 0%    1.210 ± 0%  +0.51% (p=0.000 n=10)
Template                118.0m ± 0%   115.2m ± 1%  -2.31% (p=0.000 n=10)
TimeParse               414.8n ± 0%   410.6n ± 0%  -1.01% (p=0.000 n=10)
TimeFormat              510.7n ± 0%   508.2n ± 0%  -0.48% (p=0.000 n=10)
geomean                 102.3µ        101.7µ       -0.60%

                     │  CL 479495   │               this CL               │
                     │     B/s      │     B/s       vs base               │
GobDecode              47.88Mi ± 1%   48.05Mi ± 1%       ~ (p=0.280 n=10)
GobEncode              41.20Mi ± 0%   41.45Mi ± 0%  +0.60% (p=0.000 n=10)
Gzip                   44.49Mi ± 0%   45.77Mi ± 0%  +2.87% (p=0.000 n=10)
Gunzip                 222.4Mi ± 0%   228.8Mi ± 0%  +2.87% (p=0.000 n=10)
JSONEncode             99.69Mi ± 0%   99.82Mi ± 0%       ~ (p=0.118 n=10)
JSONDecode             24.19Mi ± 0%   23.66Mi ± 1%  -2.19% (p=0.000 n=10)
GoParse                7.281Mi ± 2%   7.343Mi ± 1%       ~ (p=0.187 n=10)
RegexpMatchEasy0_32    227.4Mi ± 0%   226.9Mi ± 0%  -0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K    715.0Mi ± 0%   716.0Mi ± 0%  +0.13% (p=0.000 n=10)
RegexpMatchEasy1_32    187.3Mi ± 0%   186.1Mi ± 0%  -0.62% (p=0.000 n=10)
RegexpMatchEasy1_1K    652.3Mi ± 0%   654.5Mi ± 0%  +0.34% (p=0.000 n=10)
RegexpMatchMedium_32   21.57Mi ± 0%   21.74Mi ± 0%  +0.80% (p=0.000 n=10)
RegexpMatchMedium_1K   23.47Mi ± 0%   23.79Mi ± 0%  +1.38% (p=0.000 n=10)
RegexpMatchHard_32     14.39Mi ± 0%   14.74Mi ± 0%  +2.45% (p=0.000 n=10)
RegexpMatchHard_1K     15.59Mi ± 0%   16.04Mi ± 0%  +2.87% (p=0.000 n=10)
Revcomp                201.3Mi ± 0%   200.3Mi ± 0%  -0.51% (p=0.000 n=10)
Template               15.69Mi ± 0%   16.06Mi ± 1%  +2.37% (p=0.000 n=10)
geomean                61.31Mi        61.82Mi       +0.84%

The test binaries were pre-compiled with `go test -c`, and the test runs
were wrapped with `perf stat record` for recording dynamic instruction
counts. The instruction count, IPC and branch misprediction rate did not
meaningfully change.

As for the JSONDecode regression, `perf stat` was used to check
micro-architectural details:

$ sudo perf stat <test executable> -test.timeout=30m -test.run='^$' \
    -test.cpu=1 -test.bench='JSONDecode' -test.count=1 -test.benchtime=50x

Before:

          4,256.10 msec task-clock               #    1.061 CPUs utilized
            61,431      context-switches         #   14.434 K/sec
                 3      cpu-migrations           #    0.705 /sec
             3,297      page-faults              #  774.652 /sec
    10,364,990,422      cycles                   #    2.435 GHz
    19,640,571,817      instructions             #    1.89  insn per cycle
     4,267,623,324      branches                 #    1.003 G/sec
        44,164,375      branch-misses            #    1.03% of all branches

After:

          4,343.17 msec task-clock               #    1.061 CPUs utilized
            62,742      context-switches         #   14.446 K/sec
                 5      cpu-migrations           #    1.151 /sec
             3,044      page-faults              #  700.871 /sec
    10,577,322,342      cycles                   #    2.435 GHz
    19,582,895,547      instructions             #    1.85  insn per cycle
     4,266,051,537      branches                 #  982.244 M/sec
        46,298,286      branch-misses            #    1.09% of all branches

Instruction count decreased by 0.29% but cycle count went up by 2.05%,
and the branch misprediction rate rose too. This is likely caused by the
micro-architecture's sensitivity to changed code layout; the
optimization implemented here should be a net win otherwise.
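
The percentages quoted above follow directly from the two `perf stat` outputs. A small Go sketch of the arithmetic, with `pctChange` as a hypothetical helper name and the counter values copied from the listings:

```go
package main

import "fmt"

// pctChange returns the relative change from before to after, in percent.
func pctChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Counter values copied from the "Before" and "After" perf stat output.
	instrBefore, instrAfter := 19_640_571_817.0, 19_582_895_547.0
	cyclesBefore, cyclesAfter := 10_364_990_422.0, 10_577_322_342.0

	fmt.Printf("instructions: %+.2f%%\n", pctChange(instrBefore, instrAfter))
	fmt.Printf("cycles:       %+.2f%%\n", pctChange(cyclesBefore, cyclesAfter))
	fmt.Printf("IPC:          %.2f -> %.2f\n",
		instrBefore/cyclesBefore, instrAfter/cyclesAfter)
}
```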

Updates golang#59120

Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
The runtime malloc implementation makes use of these, among others.

Some generic strength reduction rules for Ctz ops have also been added,
though only enabled for loong64 for now. This is necessary to make the
optimization profitable at all, as the LA464 micro-architecture
apparently handles the `TrailingZeros64(x) < 64` part in
runtime.nextFreeFast very badly if the compiled branch is no longer a
simple BEQZ (which used to be the case, when the compiler was able to
peek into the pure Go implementation of TrailingZeros). Without the
generic rules this change would be a big perf hit (as bad as 7-10% in
select go1 benchmark cases).

The generic changes were benchmarked on linux/amd64 (Threadripper 3990X)
and darwin/arm64 (Apple M1 Pro) too, but the results are either mixed
(amd64) or even a net loss (arm64). So, for now those rules are guarded
with a predicate that only enables them for loong64.
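
To make the `TrailingZeros64(x) < 64` point concrete: `math/bits.TrailingZeros64` returns 64 only for a zero input, so the comparison is equivalent to a plain non-zero test, which is the kind of equivalence a strength reduction rule can rely on. The sketch below only checks that equivalence; the function names are hypothetical and it does not reproduce the actual rewrite rules added by the change.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hasSetBitViaCtz is the shape of the check described above.
func hasSetBitViaCtz(x uint64) bool { return bits.TrailingZeros64(x) < 64 }

// hasSetBitDirect is the reduced form: a simple comparison against zero.
func hasSetBitDirect(x uint64) bool { return x != 0 }

func main() {
	for _, x := range []uint64{0, 1, 2, 1 << 63, ^uint64(0)} {
		fmt.Printf("%#x: ctz-form=%v direct=%v\n",
			x, hasSetBitViaCtz(x), hasSetBitDirect(x))
	}
}
```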

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                │   before    │                after                │
                │   sec/op    │   sec/op     vs base                │
TrailingZeros     2.758n ± 0%   1.004n ± 0%  -63.60% (p=0.000 n=10)
TrailingZeros8    1.508n ± 0%   1.219n ± 0%  -19.20% (p=0.000 n=10)
TrailingZeros16   3.526n ± 0%   1.437n ± 0%  -59.25% (p=0.000 n=10)
TrailingZeros32   3.161n ± 0%   1.004n ± 0%  -68.23% (p=0.000 n=10)
TrailingZeros64   2.759n ± 0%   1.003n ± 0%  -63.65% (p=0.000 n=10)
geomean           2.638n        1.121n       -57.51%

Go1 benchmark results on the same machine:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479496 v8 │              this CL               │
                      │    sec/op    │   sec/op     vs base               │
BinaryTree17              14.10 ± 1%    13.64 ± 1%  -3.28% (p=0.000 n=10)
Fannkuch11                3.421 ± 0%    3.421 ± 0%       ~ (p=0.075 n=10)
FmtFprintfEmpty          94.78n ± 0%   94.50n ± 0%  -0.30% (p=0.000 n=10)
FmtFprintfString         155.0n ± 0%   154.1n ± 1%       ~ (p=1.000 n=10)
FmtFprintfInt            157.2n ± 0%   155.2n ± 1%  -1.27% (p=0.000 n=10)
FmtFprintfIntInt         242.1n ± 0%   238.0n ± 1%  -1.73% (p=0.000 n=10)
FmtFprintfPrefixedInt    337.6n ± 0%   334.6n ± 0%  -0.89% (p=0.000 n=10)
FmtFprintfFloat          399.0n ± 0%   396.4n ± 0%  -0.65% (p=0.000 n=10)
FmtManyArgs              959.8n ± 0%   923.4n ± 0%  -3.79% (p=0.000 n=10)
GobDecode                15.63m ± 3%   15.17m ± 1%  -2.90% (p=0.001 n=10)
GobEncode                18.43m ± 3%   17.62m ± 0%  -4.38% (p=0.000 n=10)
Gzip                     405.1m ± 0%   405.4m ± 0%  +0.06% (p=0.035 n=10)
Gunzip                   86.84m ± 0%   87.20m ± 0%  +0.41% (p=0.000 n=10)
HTTPClientServer         88.47µ ± 0%   86.92µ ± 1%  -1.75% (p=0.000 n=10)
JSONEncode               18.84m ± 0%   18.66m ± 0%  -0.95% (p=0.000 n=10)
JSONDecode               79.35m ± 0%   75.77m ± 1%  -4.51% (p=0.000 n=10)
Mandelbrot200            7.215m ± 0%   7.215m ± 0%       ~ (p=0.315 n=10)
GoParse                  7.591m ± 1%   7.407m ± 1%  -2.43% (p=0.000 n=10)
RegexpMatchEasy0_32      133.8n ± 0%   134.3n ± 0%  +0.37% (p=0.000 n=10)
RegexpMatchEasy0_1K      1.540µ ± 0%   1.544µ ± 0%  +0.26% (p=0.000 n=10)
RegexpMatchEasy1_32      164.1n ± 0%   165.4n ± 0%  +0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K      1.626µ ± 0%   1.629µ ± 0%  +0.18% (p=0.000 n=10)
RegexpMatchMedium_32     1.403µ ± 0%   1.413µ ± 0%  +0.71% (p=0.000 n=10)
RegexpMatchMedium_1K     41.22µ ± 0%   41.59µ ± 0%  +0.90% (p=0.000 n=10)
RegexpMatchHard_32       2.071µ ± 0%   2.060µ ± 0%  -0.53% (p=0.000 n=10)
RegexpMatchHard_1K       61.05µ ± 0%   61.30µ ± 0%  +0.41% (p=0.001 n=10)
Revcomp                   1.351 ± 0%    1.357 ± 0%  +0.42% (p=0.000 n=10)
Template                 117.3m ± 1%   110.6m ± 2%  -5.71% (p=0.000 n=10)
TimeParse                411.9n ± 0%   411.7n ± 0%       ~ (p=0.117 n=10)
TimeFormat               514.2n ± 0%   499.9n ± 0%  -2.77% (p=0.000 n=10)
geomean                  104.2µ        103.0µ       -1.15%

                     │ CL 479496 v8 │               this CL               │
                     │     B/s      │     B/s       vs base               │
GobDecode              46.84Mi ± 3%   48.24Mi ± 1%  +2.98% (p=0.001 n=10)
GobEncode              39.72Mi ± 4%   41.53Mi ± 0%  +4.57% (p=0.000 n=10)
Gzip                   45.68Mi ± 0%   45.65Mi ± 0%  -0.05% (p=0.029 n=10)
Gunzip                 213.1Mi ± 0%   212.2Mi ± 0%  -0.41% (p=0.000 n=10)
JSONEncode             98.23Mi ± 0%   99.18Mi ± 0%  +0.97% (p=0.000 n=10)
JSONDecode             23.32Mi ± 0%   24.42Mi ± 1%  +4.72% (p=0.000 n=10)
GoParse                7.277Mi ± 1%   7.458Mi ± 1%  +2.49% (p=0.000 n=10)
RegexpMatchEasy0_32    228.1Mi ± 0%   227.3Mi ± 0%  -0.36% (p=0.000 n=10)
RegexpMatchEasy0_1K    634.2Mi ± 0%   632.5Mi ± 0%  -0.27% (p=0.000 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   184.5Mi ± 0%  -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K    600.4Mi ± 0%   599.4Mi ± 0%  -0.17% (p=0.000 n=10)
RegexpMatchMedium_32   21.75Mi ± 0%   21.60Mi ± 0%  -0.70% (p=0.000 n=10)
RegexpMatchMedium_1K   23.69Mi ± 0%   23.48Mi ± 0%  -0.89% (p=0.000 n=10)
RegexpMatchHard_32     14.73Mi ± 0%   14.81Mi ± 0%  +0.52% (p=0.000 n=10)
RegexpMatchHard_1K     15.99Mi ± 0%   15.93Mi ± 0%  -0.42% (p=0.000 n=10)
Revcomp                179.4Mi ± 0%   178.6Mi ± 0%  -0.42% (p=0.000 n=10)
Template               15.78Mi ± 1%   16.73Mi ± 2%  +6.04% (p=0.000 n=10)
geomean                59.97Mi        60.58Mi       +1.02%

The change should be a net win, as all it does is pattern-match Ctz ops
and replace them with the respective native instructions, so any
performance regression is likely also micro-architecture related, as
observed in CL 479496's results. (Indeed, some of the more drastic
improvements may well also be coincidental, but the point is that there
is at least a small amount of deterministic improvement anyway.)

Updates golang#59120

Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479498 v11 │               this CL               │
                      │    sec/op     │   sec/op     vs base                │
BinaryTree17               13.64 ± 1%    13.75 ± 2%        ~ (p=0.579 n=10)
Fannkuch11                 3.421 ± 0%    3.650 ± 0%   +6.70% (p=0.000 n=10)
FmtFprintfEmpty           94.50n ± 0%   94.45n ± 0%   -0.05% (p=0.000 n=10)
FmtFprintfString          154.1n ± 1%   155.2n ± 0%        ~ (p=0.689 n=10)
FmtFprintfInt             155.2n ± 1%   154.4n ± 0%        ~ (p=0.785 n=10)
FmtFprintfIntInt          238.0n ± 1%   237.1n ± 0%        ~ (p=0.721 n=10)
FmtFprintfPrefixedInt     334.6n ± 0%   312.8n ± 0%   -6.52% (p=0.000 n=10)
FmtFprintfFloat           396.4n ± 0%   390.5n ± 0%   -1.49% (p=0.000 n=10)
FmtManyArgs               923.4n ± 0%   905.0n ± 0%   -2.00% (p=0.000 n=10)
GobDecode                 15.17m ± 1%   14.93m ± 1%   -1.59% (p=0.000 n=10)
GobEncode                 17.62m ± 0%   17.33m ± 0%   -1.65% (p=0.001 n=10)
Gzip                      405.4m ± 0%   404.3m ± 0%   -0.26% (p=0.000 n=10)
Gunzip                    87.20m ± 0%   80.92m ± 0%   -7.20% (p=0.000 n=10)
HTTPClientServer          86.92µ ± 1%   86.14µ ± 0%   -0.90% (p=0.000 n=10)
JSONEncode                18.66m ± 0%   18.49m ± 0%   -0.91% (p=0.000 n=10)
JSONDecode                75.77m ± 1%   77.34m ± 1%   +2.07% (p=0.000 n=10)
Mandelbrot200             7.215m ± 0%   6.521m ± 0%   -9.62% (p=0.000 n=10)
GoParse                   7.407m ± 1%   7.324m ± 1%   -1.12% (p=0.003 n=10)
RegexpMatchEasy0_32       134.3n ± 0%   134.6n ± 0%   +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K       1.544µ ± 0%   1.365µ ± 0%  -11.63% (p=0.000 n=10)
RegexpMatchEasy1_32       165.4n ± 0%   164.1n ± 0%   -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K       1.629µ ± 0%   1.492µ ± 0%   -8.41% (p=0.000 n=10)
RegexpMatchMedium_32      1.413µ ± 0%   1.404µ ± 0%   -0.64% (p=0.000 n=10)
RegexpMatchMedium_1K      41.59µ ± 0%   41.05µ ± 0%   -1.28% (p=0.000 n=10)
RegexpMatchHard_32        2.060µ ± 0%   2.072µ ± 0%   +0.58% (p=0.000 n=10)
RegexpMatchHard_1K        61.30µ ± 0%   60.89µ ± 0%   -0.68% (p=0.000 n=10)
Revcomp                    1.357 ± 0%    1.199 ± 1%  -11.64% (p=0.000 n=10)
Template                  110.6m ± 2%   112.3m ± 2%        ~ (p=0.105 n=10)
TimeParse                 411.7n ± 0%   414.2n ± 1%   +0.60% (p=0.000 n=10)
TimeFormat                499.9n ± 0%   496.9n ± 0%   -0.60% (p=0.000 n=10)
geomean                   103.0µ        101.0µ        -1.98%

                     │ CL 479498 v11 │                this CL                │
                     │      B/s      │      B/s       vs base                │
GobDecode               48.24Mi ± 1%    49.02Mi ± 1%   +1.62% (p=0.000 n=10)
GobEncode               41.53Mi ± 0%    42.23Mi ± 0%   +1.69% (p=0.001 n=10)
Gzip                    45.65Mi ± 0%    45.77Mi ± 0%   +0.25% (p=0.000 n=10)
Gunzip                  212.2Mi ± 0%    228.7Mi ± 0%   +7.76% (p=0.000 n=10)
JSONEncode              99.18Mi ± 0%   100.08Mi ± 0%   +0.91% (p=0.000 n=10)
JSONDecode              24.42Mi ± 1%    23.93Mi ± 1%   -2.03% (p=0.000 n=10)
GoParse                 7.458Mi ± 1%    7.544Mi ± 1%   +1.15% (p=0.001 n=10)
RegexpMatchEasy0_32     227.3Mi ± 0%    226.8Mi ± 0%   -0.21% (p=0.000 n=10)
RegexpMatchEasy0_1K     632.5Mi ± 0%    715.7Mi ± 0%  +13.15% (p=0.000 n=10)
RegexpMatchEasy1_32     184.5Mi ± 0%    186.0Mi ± 0%   +0.81% (p=0.000 n=10)
RegexpMatchEasy1_1K     599.4Mi ± 0%    654.3Mi ± 0%   +9.17% (p=0.000 n=10)
RegexpMatchMedium_32    21.60Mi ± 0%    21.74Mi ± 0%   +0.64% (p=0.000 n=10)
RegexpMatchMedium_1K    23.48Mi ± 0%    23.78Mi ± 0%   +1.30% (p=0.000 n=10)
RegexpMatchHard_32      14.81Mi ± 0%    14.72Mi ± 0%   -0.58% (p=0.000 n=10)
RegexpMatchHard_1K      15.93Mi ± 0%    16.04Mi ± 0%   +0.72% (p=0.000 n=10)
Revcomp                 178.6Mi ± 0%    202.2Mi ± 1%  +13.18% (p=0.000 n=10)
Template                16.73Mi ± 2%    16.48Mi ± 2%        ~ (p=0.093 n=10)
geomean                 60.58Mi         62.23Mi        +2.72%

The only significant regression is the Fannkuch11 case; perf records were
manually inspected, and the hottest part of the code is virtually
unchanged except for the alignment of two instructions, which seem to sit
on different sides of a 32- or even 64-byte boundary. So again, the
regression is likely due to micro-architecture quirks, and the change is
in fact a win across the board.

Updates golang#59120

Change-Id: Ibbf64988c9d06f7c1d359480a1d6aecfa2c25b65
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
For the SubFromLen64 codegen test case to work as intended, we need
to fold c-(-(x-d)) into x+(c-d).

Still, some instances of LeadingZeros are not optimized into single
CLZ instructions right now (actually, the LeadingZeros micro-benchmarks
are currently still compiled with redundant adds/subs of 64, due to
interference of loop optimizations before lowering), but perf numbers
indicate it's not that bad after all.
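
The folding rule mentioned above is a plain algebraic identity that also holds under wrapping two's-complement arithmetic. A short Go sketch, with hypothetical function names, checks it alongside the relation between `Len64` and `LeadingZeros64` that makes the fold useful; how the compiler's intermediate form actually reaches the `c-(-(x-d))` shape is not shown here.

```go
package main

import (
	"fmt"
	"math/bits"
)

// unfolded is the expression shape before folding: c - (-(x - d)).
func unfolded(c, d, x int64) int64 { return c - (-(x - d)) }

// folded is the simplified form: x + (c - d).
func folded(c, d, x int64) int64 { return x + (c - d) }

// clzViaLen computes the leading zero count as 64 - Len64(x).
func clzViaLen(x uint64) int { return 64 - bits.Len64(x) }

// clzDirect is the library's leading zero count.
func clzDirect(x uint64) int { return bits.LeadingZeros64(x) }

func main() {
	fmt.Println(unfolded(64, 64, 8) == folded(64, 64, 8))
	fmt.Println(clzViaLen(0xf0) == clzDirect(0xf0))
}
```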

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │   before    │                after                │
               │   sec/op    │   sec/op     vs base                │
LeadingZeros     3.675n ± 0%   1.545n ± 1%  -57.96% (p=0.000 n=10)
LeadingZeros8    2.001n ± 0%   1.868n ± 0%   -6.62% (p=0.000 n=10)
LeadingZeros16   3.144n ± 0%   1.864n ± 1%  -40.71% (p=0.000 n=10)
LeadingZeros32   4.265n ± 1%   1.653n ± 1%  -61.24% (p=0.000 n=10)
LeadingZeros64   3.962n ± 0%   1.539n ± 0%  -61.16% (p=0.000 n=10)
geomean          3.299n        1.688n       -48.84%

go1 benchmark results on the same box:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 483355  │              this CL               │
                      │   sec/op    │   sec/op     vs base               │
BinaryTree17             13.75 ± 2%    13.70 ± 2%       ~ (p=0.579 n=10)
Fannkuch11               3.650 ± 0%    3.415 ± 0%  -6.46% (p=0.000 n=10)
FmtFprintfEmpty         94.45n ± 0%   94.98n ± 0%  +0.56% (p=0.000 n=10)
FmtFprintfString        155.2n ± 0%   151.1n ± 0%  -2.61% (p=0.000 n=10)
FmtFprintfInt           154.4n ± 0%   153.6n ± 0%  -0.52% (p=0.000 n=10)
FmtFprintfIntInt        237.1n ± 0%   234.7n ± 0%  -0.99% (p=0.000 n=10)
FmtFprintfPrefixedInt   312.8n ± 0%   314.2n ± 0%  +0.45% (p=0.000 n=10)
FmtFprintfFloat         390.5n ± 0%   402.1n ± 0%  +2.97% (p=0.000 n=10)
FmtManyArgs             905.0n ± 0%   918.6n ± 0%  +1.51% (p=0.000 n=10)
GobDecode               14.93m ± 1%   14.98m ± 1%  +0.33% (p=0.015 n=10)
GobEncode               17.33m ± 0%   17.26m ± 1%  -0.39% (p=0.023 n=10)
Gzip                    404.3m ± 0%   404.6m ± 0%  +0.08% (p=0.000 n=10)
Gunzip                  80.92m ± 0%   80.97m ± 0%  +0.06% (p=0.000 n=10)
HTTPClientServer        86.14µ ± 0%   84.39µ ± 0%  -2.03% (p=0.000 n=10)
JSONEncode              18.49m ± 0%   18.50m ± 0%       ~ (p=0.436 n=10)
JSONDecode              77.34m ± 1%   76.26m ± 1%  -1.40% (p=0.000 n=10)
Mandelbrot200           6.521m ± 0%   6.508m ± 0%       ~ (p=0.138 n=10)
GoParse                 7.324m ± 1%   7.413m ± 1%  +1.22% (p=0.005 n=10)
RegexpMatchEasy0_32     134.6n ± 0%   134.6n ± 0%       ~ (p=0.195 n=10)
RegexpMatchEasy0_1K     1.365µ ± 0%   1.366µ ± 0%  +0.07% (p=0.038 n=10)
RegexpMatchEasy1_32     164.1n ± 0%   164.1n ± 0%       ~ (p=0.230 n=10)
RegexpMatchEasy1_1K     1.492µ ± 0%   1.492µ ± 0%       ~ (p=0.211 n=10)
RegexpMatchMedium_32    1.404µ ± 0%   1.403µ ± 0%  -0.07% (p=0.000 n=10)
RegexpMatchMedium_1K    41.05µ ± 0%   41.04µ ± 0%  -0.04% (p=0.000 n=10)
RegexpMatchHard_32      2.072µ ± 0%   2.071µ ± 0%  -0.05% (p=0.000 n=10)
RegexpMatchHard_1K      60.89µ ± 0%   60.87µ ± 0%  -0.04% (p=0.000 n=10)
Revcomp                  1.199 ± 1%    1.200 ± 0%       ~ (p=0.481 n=10)
Template                112.3m ± 2%   112.9m ± 2%       ~ (p=0.353 n=10)
TimeParse               414.2n ± 1%   412.5n ± 0%  -0.40% (p=0.000 n=10)
TimeFormat              496.9n ± 0%   496.6n ± 0%       ~ (p=0.341 n=10)
geomean                 101.0µ        100.7µ       -0.26%

                     │  CL 483355   │                this CL                │
                     │     B/s      │     B/s       vs base                 │
GobDecode              49.02Mi ± 1%   48.87Mi ± 1%  -0.32% (p=0.014 n=10)
GobEncode              42.23Mi ± 0%   42.40Mi ± 1%  +0.40% (p=0.022 n=10)
Gzip                   45.77Mi ± 0%   45.73Mi ± 0%  -0.07% (p=0.000 n=10)
Gunzip                 228.7Mi ± 0%   228.6Mi ± 0%  -0.06% (p=0.000 n=10)
JSONEncode             100.1Mi ± 0%   100.0Mi ± 0%       ~ (p=0.470 n=10)
JSONDecode             23.93Mi ± 1%   24.27Mi ± 1%  +1.43% (p=0.000 n=10)
GoParse                7.544Mi ± 1%   7.448Mi ± 1%  -1.26% (p=0.005 n=10)
RegexpMatchEasy0_32    226.8Mi ± 0%   226.7Mi ± 0%  -0.06% (p=0.001 n=10)
RegexpMatchEasy0_1K    715.7Mi ± 0%   715.1Mi ± 0%  -0.08% (p=0.022 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   186.0Mi ± 0%       ~ (p=0.493 n=10)
RegexpMatchEasy1_1K    654.3Mi ± 0%   654.6Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchMedium_32   21.74Mi ± 0%   21.74Mi ± 0%  +0.02% (p=0.022 n=10)
RegexpMatchMedium_1K   23.78Mi ± 0%   23.79Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchHard_32     14.72Mi ± 0%   14.73Mi ± 0%  +0.06% (p=0.000 n=10)
RegexpMatchHard_1K     16.04Mi ± 0%   16.04Mi ± 0%       ~ (p=1.000 n=10) ¹
Revcomp                202.2Mi ± 1%   202.0Mi ± 0%       ~ (p=0.469 n=10)
Template               16.48Mi ± 2%   16.38Mi ± 2%       ~ (p=0.342 n=10)
geomean                62.23Mi        62.21Mi       -0.04%
¹ all samples are equal

In this case though, all significant perf changes are likely due to
micro-architectural quirks.

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89
RegexpMatchEasy0_1K    715.0Mi ± 0%   716.0Mi ± 0%  +0.13% (p=0.000 n=10)
RegexpMatchEasy1_32    187.3Mi ± 0%   186.1Mi ± 0%  -0.62% (p=0.000 n=10)
RegexpMatchEasy1_1K    652.3Mi ± 0%   654.5Mi ± 0%  +0.34% (p=0.000 n=10)
RegexpMatchMedium_32   21.57Mi ± 0%   21.74Mi ± 0%  +0.80% (p=0.000 n=10)
RegexpMatchMedium_1K   23.47Mi ± 0%   23.79Mi ± 0%  +1.38% (p=0.000 n=10)
RegexpMatchHard_32     14.39Mi ± 0%   14.74Mi ± 0%  +2.45% (p=0.000 n=10)
RegexpMatchHard_1K     15.59Mi ± 0%   16.04Mi ± 0%  +2.87% (p=0.000 n=10)
Revcomp                201.3Mi ± 0%   200.3Mi ± 0%  -0.51% (p=0.000 n=10)
Template               15.69Mi ± 0%   16.06Mi ± 1%  +2.37% (p=0.000 n=10)
geomean                61.31Mi        61.82Mi       +0.84%

The test binaries were pre-compiled with `go test -c`, and the test runs
were wrapped with `perf stat record` for recording dynamic instruction
counts. The instruction count, IPC and branch misprediction rate did not
meaningfully change.

As for the JSONDecode regression, `perf stat` was used to check
micro-architectural details:

$ sudo perf stat <test executable> -test.timeout=30m -test.run='^$' \
    -test.cpu=1 -test.bench='JSONDecode' -test.count=1 -test.benchtime=50x

Before:

          4,256.10 msec task-clock               #    1.061 CPUs utilized
            61,431      context-switches         #   14.434 K/sec
                 3      cpu-migrations           #    0.705 /sec
             3,297      page-faults              #  774.652 /sec
    10,364,990,422      cycles                   #    2.435 GHz
    19,640,571,817      instructions             #    1.89  insn per cycle
     4,267,623,324      branches                 #    1.003 G/sec
        44,164,375      branch-misses            #    1.03% of all branches

After:

          4,343.17 msec task-clock               #    1.061 CPUs utilized
            62,742      context-switches         #   14.446 K/sec
                 5      cpu-migrations           #    1.151 /sec
             3,044      page-faults              #  700.871 /sec
    10,577,322,342      cycles                   #    2.435 GHz
    19,582,895,547      instructions             #    1.85  insn per cycle
     4,266,051,537      branches                 #  982.244 M/sec
        46,298,286      branch-misses            #    1.09% of all branches

Instruction count decreased by 0.29% but cycle count went up by 2.05%,
and the branch misprediction rate rose too. This is likely caused by the
micro-architecture's sensitivity to the changed code layout; the
optimization implemented here should be a net win otherwise.
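
The percentages above can be re-derived from the raw counters. The sketch below is only a convenience for checking the arithmetic; the `counters` type and helper names are hypothetical, and the numbers are copied from the two `perf stat` outputs quoted in this message.

```go
package main

import "fmt"

// counters holds the subset of perf stat values quoted above.
type counters struct {
	cycles, instructions, branches, branchMisses float64
}

// pctChange returns the relative change from before to after, in percent.
func pctChange(before, after float64) float64 { return (after - before) / before * 100 }

// ipc is instructions retired per cycle.
func (c counters) ipc() float64 { return c.instructions / c.cycles }

// missRate is the share of branches that were mispredicted, in percent.
func (c counters) missRate() float64 { return c.branchMisses / c.branches * 100 }

func main() {
	before := counters{10364990422, 19640571817, 4267623324, 44164375}
	after := counters{10577322342, 19582895547, 4266051537, 46298286}

	fmt.Printf("instructions: %+.2f%%\n", pctChange(before.instructions, after.instructions))
	fmt.Printf("cycles:       %+.2f%%\n", pctChange(before.cycles, after.cycles))
	fmt.Printf("IPC:          %.2f -> %.2f\n", before.ipc(), after.ipc())
	fmt.Printf("miss rate:    %.2f%% -> %.2f%%\n", before.missRate(), after.missRate())
}
```

Fewer instructions combined with more cycles means a lower IPC, which is consistent with the code-layout explanation given above rather than with extra work being executed.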

Updates golang#59120

Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
The runtime malloc implementation makes use of these, among others.

Some generic strength reduction rules for Ctz ops have also been added,
though they are only enabled for loong64 for now. They are necessary to
make the optimization profitable at all, as the LA464 micro-architecture
apparently handles the `TrailingZeros64(x) < 64` part in
runtime.nextFreeFast very badly if the compiled branch is no longer a
simple BEQZ (which used to be the case, when the compiler was able to peek
into the pure Go implementation of TrailingZeros). Without the generic
rules this change would be a big perf hit (as bad as 7~10% in select go1
benchmark cases).

The generic changes were benchmarked on linux/amd64 (Threadripper 3990X)
and darwin/arm64 (Apple M1 Pro) too, but the results are either mixed
(amd64) or even a net loss (arm64). So, for now those rules are guarded
with a predicate that only enables them for loong64.
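
The exact rules are not reproduced in this message, but the identity they presumably rely on is easy to state: `TrailingZeros64(x)` returns 64 only for `x == 0`, so the comparison against 64 can be reduced to a test against zero. A minimal Go sketch, with hypothetical function names:

```go
package main

import (
	"fmt"
	"math/bits"
)

// viaCtz has the shape of the check quoted above.
func viaCtz(x uint64) bool { return bits.TrailingZeros64(x) < 64 }

// viaZeroTest is the strength-reduced form: a plain comparison against
// zero, which can compile down to a single branch-if-zero style test.
func viaZeroTest(x uint64) bool { return x != 0 }

func main() {
	for _, x := range []uint64{0, 1, 2, 1 << 63, ^uint64(0)} {
		// The last two columns should always agree.
		fmt.Println(x, viaCtz(x), viaZeroTest(x))
	}
}
```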

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                │   before    │                after                │
                │   sec/op    │   sec/op     vs base                │
TrailingZeros     2.758n ± 0%   1.004n ± 0%  -63.60% (p=0.000 n=10)
TrailingZeros8    1.508n ± 0%   1.219n ± 0%  -19.20% (p=0.000 n=10)
TrailingZeros16   3.526n ± 0%   1.437n ± 0%  -59.25% (p=0.000 n=10)
TrailingZeros32   3.161n ± 0%   1.004n ± 0%  -68.23% (p=0.000 n=10)
TrailingZeros64   2.759n ± 0%   1.003n ± 0%  -63.65% (p=0.000 n=10)
geomean           2.638n        1.121n       -57.51%

Go1 benchmark results on the same machine:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479496 v8 │              this CL               │
                      │    sec/op    │   sec/op     vs base               │
BinaryTree17              14.10 ± 1%    13.64 ± 1%  -3.28% (p=0.000 n=10)
Fannkuch11                3.421 ± 0%    3.421 ± 0%       ~ (p=0.075 n=10)
FmtFprintfEmpty          94.78n ± 0%   94.50n ± 0%  -0.30% (p=0.000 n=10)
FmtFprintfString         155.0n ± 0%   154.1n ± 1%       ~ (p=1.000 n=10)
FmtFprintfInt            157.2n ± 0%   155.2n ± 1%  -1.27% (p=0.000 n=10)
FmtFprintfIntInt         242.1n ± 0%   238.0n ± 1%  -1.73% (p=0.000 n=10)
FmtFprintfPrefixedInt    337.6n ± 0%   334.6n ± 0%  -0.89% (p=0.000 n=10)
FmtFprintfFloat          399.0n ± 0%   396.4n ± 0%  -0.65% (p=0.000 n=10)
FmtManyArgs              959.8n ± 0%   923.4n ± 0%  -3.79% (p=0.000 n=10)
GobDecode                15.63m ± 3%   15.17m ± 1%  -2.90% (p=0.001 n=10)
GobEncode                18.43m ± 3%   17.62m ± 0%  -4.38% (p=0.000 n=10)
Gzip                     405.1m ± 0%   405.4m ± 0%  +0.06% (p=0.035 n=10)
Gunzip                   86.84m ± 0%   87.20m ± 0%  +0.41% (p=0.000 n=10)
HTTPClientServer         88.47µ ± 0%   86.92µ ± 1%  -1.75% (p=0.000 n=10)
JSONEncode               18.84m ± 0%   18.66m ± 0%  -0.95% (p=0.000 n=10)
JSONDecode               79.35m ± 0%   75.77m ± 1%  -4.51% (p=0.000 n=10)
Mandelbrot200            7.215m ± 0%   7.215m ± 0%       ~ (p=0.315 n=10)
GoParse                  7.591m ± 1%   7.407m ± 1%  -2.43% (p=0.000 n=10)
RegexpMatchEasy0_32      133.8n ± 0%   134.3n ± 0%  +0.37% (p=0.000 n=10)
RegexpMatchEasy0_1K      1.540µ ± 0%   1.544µ ± 0%  +0.26% (p=0.000 n=10)
RegexpMatchEasy1_32      164.1n ± 0%   165.4n ± 0%  +0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K      1.626µ ± 0%   1.629µ ± 0%  +0.18% (p=0.000 n=10)
RegexpMatchMedium_32     1.403µ ± 0%   1.413µ ± 0%  +0.71% (p=0.000 n=10)
RegexpMatchMedium_1K     41.22µ ± 0%   41.59µ ± 0%  +0.90% (p=0.000 n=10)
RegexpMatchHard_32       2.071µ ± 0%   2.060µ ± 0%  -0.53% (p=0.000 n=10)
RegexpMatchHard_1K       61.05µ ± 0%   61.30µ ± 0%  +0.41% (p=0.001 n=10)
Revcomp                   1.351 ± 0%    1.357 ± 0%  +0.42% (p=0.000 n=10)
Template                 117.3m ± 1%   110.6m ± 2%  -5.71% (p=0.000 n=10)
TimeParse                411.9n ± 0%   411.7n ± 0%       ~ (p=0.117 n=10)
TimeFormat               514.2n ± 0%   499.9n ± 0%  -2.77% (p=0.000 n=10)
geomean                  104.2µ        103.0µ       -1.15%

                     │ CL 479496 v8 │               this CL               │
                     │     B/s      │     B/s       vs base               │
GobDecode              46.84Mi ± 3%   48.24Mi ± 1%  +2.98% (p=0.001 n=10)
GobEncode              39.72Mi ± 4%   41.53Mi ± 0%  +4.57% (p=0.000 n=10)
Gzip                   45.68Mi ± 0%   45.65Mi ± 0%  -0.05% (p=0.029 n=10)
Gunzip                 213.1Mi ± 0%   212.2Mi ± 0%  -0.41% (p=0.000 n=10)
JSONEncode             98.23Mi ± 0%   99.18Mi ± 0%  +0.97% (p=0.000 n=10)
JSONDecode             23.32Mi ± 0%   24.42Mi ± 1%  +4.72% (p=0.000 n=10)
GoParse                7.277Mi ± 1%   7.458Mi ± 1%  +2.49% (p=0.000 n=10)
RegexpMatchEasy0_32    228.1Mi ± 0%   227.3Mi ± 0%  -0.36% (p=0.000 n=10)
RegexpMatchEasy0_1K    634.2Mi ± 0%   632.5Mi ± 0%  -0.27% (p=0.000 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   184.5Mi ± 0%  -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K    600.4Mi ± 0%   599.4Mi ± 0%  -0.17% (p=0.000 n=10)
RegexpMatchMedium_32   21.75Mi ± 0%   21.60Mi ± 0%  -0.70% (p=0.000 n=10)
RegexpMatchMedium_1K   23.69Mi ± 0%   23.48Mi ± 0%  -0.89% (p=0.000 n=10)
RegexpMatchHard_32     14.73Mi ± 0%   14.81Mi ± 0%  +0.52% (p=0.000 n=10)
RegexpMatchHard_1K     15.99Mi ± 0%   15.93Mi ± 0%  -0.42% (p=0.000 n=10)
Revcomp                179.4Mi ± 0%   178.6Mi ± 0%  -0.42% (p=0.000 n=10)
Template               15.78Mi ± 1%   16.73Mi ± 2%  +6.04% (p=0.000 n=10)
geomean                59.97Mi        60.58Mi       +1.02%

The change should be a net win, as all it does is pattern-match Ctz ops
and replace them with the respective native instructions, so any
performance regression is likely also micro-architecture related, as
observed in CL 479496's results. (Indeed, some of the more drastic
improvements may well also be coincidental, but the point is that there is
at least a small deterministic improvement anyway.)

Updates golang#59120

Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479498 v11 │               this CL               │
                      │    sec/op     │   sec/op     vs base                │
BinaryTree17               13.64 ± 1%    13.75 ± 2%        ~ (p=0.579 n=10)
Fannkuch11                 3.421 ± 0%    3.650 ± 0%   +6.70% (p=0.000 n=10)
FmtFprintfEmpty           94.50n ± 0%   94.45n ± 0%   -0.05% (p=0.000 n=10)
FmtFprintfString          154.1n ± 1%   155.2n ± 0%        ~ (p=0.689 n=10)
FmtFprintfInt             155.2n ± 1%   154.4n ± 0%        ~ (p=0.785 n=10)
FmtFprintfIntInt          238.0n ± 1%   237.1n ± 0%        ~ (p=0.721 n=10)
FmtFprintfPrefixedInt     334.6n ± 0%   312.8n ± 0%   -6.52% (p=0.000 n=10)
FmtFprintfFloat           396.4n ± 0%   390.5n ± 0%   -1.49% (p=0.000 n=10)
FmtManyArgs               923.4n ± 0%   905.0n ± 0%   -2.00% (p=0.000 n=10)
GobDecode                 15.17m ± 1%   14.93m ± 1%   -1.59% (p=0.000 n=10)
GobEncode                 17.62m ± 0%   17.33m ± 0%   -1.65% (p=0.001 n=10)
Gzip                      405.4m ± 0%   404.3m ± 0%   -0.26% (p=0.000 n=10)
Gunzip                    87.20m ± 0%   80.92m ± 0%   -7.20% (p=0.000 n=10)
HTTPClientServer          86.92µ ± 1%   86.14µ ± 0%   -0.90% (p=0.000 n=10)
JSONEncode                18.66m ± 0%   18.49m ± 0%   -0.91% (p=0.000 n=10)
JSONDecode                75.77m ± 1%   77.34m ± 1%   +2.07% (p=0.000 n=10)
Mandelbrot200             7.215m ± 0%   6.521m ± 0%   -9.62% (p=0.000 n=10)
GoParse                   7.407m ± 1%   7.324m ± 1%   -1.12% (p=0.003 n=10)
RegexpMatchEasy0_32       134.3n ± 0%   134.6n ± 0%   +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K       1.544µ ± 0%   1.365µ ± 0%  -11.63% (p=0.000 n=10)
RegexpMatchEasy1_32       165.4n ± 0%   164.1n ± 0%   -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K       1.629µ ± 0%   1.492µ ± 0%   -8.41% (p=0.000 n=10)
RegexpMatchMedium_32      1.413µ ± 0%   1.404µ ± 0%   -0.64% (p=0.000 n=10)
RegexpMatchMedium_1K      41.59µ ± 0%   41.05µ ± 0%   -1.28% (p=0.000 n=10)
RegexpMatchHard_32        2.060µ ± 0%   2.072µ ± 0%   +0.58% (p=0.000 n=10)
RegexpMatchHard_1K        61.30µ ± 0%   60.89µ ± 0%   -0.68% (p=0.000 n=10)
Revcomp                    1.357 ± 0%    1.199 ± 1%  -11.64% (p=0.000 n=10)
Template                  110.6m ± 2%   112.3m ± 2%        ~ (p=0.105 n=10)
TimeParse                 411.7n ± 0%   414.2n ± 1%   +0.60% (p=0.000 n=10)
TimeFormat                499.9n ± 0%   496.9n ± 0%   -0.60% (p=0.000 n=10)
geomean                   103.0µ        101.0µ        -1.98%

                     │ CL 479498 v11 │                this CL                │
                     │      B/s      │      B/s       vs base                │
GobDecode               48.24Mi ± 1%    49.02Mi ± 1%   +1.62% (p=0.000 n=10)
GobEncode               41.53Mi ± 0%    42.23Mi ± 0%   +1.69% (p=0.001 n=10)
Gzip                    45.65Mi ± 0%    45.77Mi ± 0%   +0.25% (p=0.000 n=10)
Gunzip                  212.2Mi ± 0%    228.7Mi ± 0%   +7.76% (p=0.000 n=10)
JSONEncode              99.18Mi ± 0%   100.08Mi ± 0%   +0.91% (p=0.000 n=10)
JSONDecode              24.42Mi ± 1%    23.93Mi ± 1%   -2.03% (p=0.000 n=10)
GoParse                 7.458Mi ± 1%    7.544Mi ± 1%   +1.15% (p=0.001 n=10)
RegexpMatchEasy0_32     227.3Mi ± 0%    226.8Mi ± 0%   -0.21% (p=0.000 n=10)
RegexpMatchEasy0_1K     632.5Mi ± 0%    715.7Mi ± 0%  +13.15% (p=0.000 n=10)
RegexpMatchEasy1_32     184.5Mi ± 0%    186.0Mi ± 0%   +0.81% (p=0.000 n=10)
RegexpMatchEasy1_1K     599.4Mi ± 0%    654.3Mi ± 0%   +9.17% (p=0.000 n=10)
RegexpMatchMedium_32    21.60Mi ± 0%    21.74Mi ± 0%   +0.64% (p=0.000 n=10)
RegexpMatchMedium_1K    23.48Mi ± 0%    23.78Mi ± 0%   +1.30% (p=0.000 n=10)
RegexpMatchHard_32      14.81Mi ± 0%    14.72Mi ± 0%   -0.58% (p=0.000 n=10)
RegexpMatchHard_1K      15.93Mi ± 0%    16.04Mi ± 0%   +0.72% (p=0.000 n=10)
Revcomp                 178.6Mi ± 0%    202.2Mi ± 1%  +13.18% (p=0.000 n=10)
Template                16.73Mi ± 2%    16.48Mi ± 2%        ~ (p=0.093 n=10)
geomean                 60.58Mi         62.23Mi        +2.72%

The only significant regression is the Fannkuch11 case. Perf records were
manually inspected: the hottest part of the code is virtually unchanged
except for the alignment of two instructions, which seem to sit on
different sides of a 32- or even 64-byte boundary. So again, the
regression is likely due to micro-architecture quirks, and the change is
in fact a win across the board.
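
To make the boundary remark concrete, the sketch below tests whether two instruction addresses fall into the same aligned block. The addresses are invented for illustration (the real ones are not given here), and `sameBlock` is a hypothetical helper; the only fact assumed is that LoongArch instructions are 4 bytes wide.

```go
package main

import "fmt"

// sameBlock reports whether addresses a and b lie in the same aligned
// block of the given size. size must be a power of two.
func sameBlock(a, b, size uint64) bool {
	return a&^(size-1) == b&^(size-1)
}

func main() {
	// Hypothetical addresses of two adjacent 4-byte instructions,
	// before and after the surrounding code shifted by 4 bytes.
	before := [2]uint64{0x12038, 0x1203c}
	after := [2]uint64{0x1203c, 0x12040}

	for _, size := range []uint64{32, 64} {
		fmt.Println(size,
			sameBlock(before[0], before[1], size),
			sameBlock(after[0], after[1], size))
	}
}
```

A shift of a single instruction slot is enough to split such a pair across a boundary, which is why unrelated codegen changes can move a hot loop's timing.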

Updates golang#59120

Change-Id: Ibbf64988c9d06f7c1d359480a1d6aecfa2c25b65
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
For the SubFromLen64 codegen test case to work as intended, we need
to fold c-(-(x-d)) into x+(c-d).
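
The fold is plain integer algebra and holds under two's-complement wrap-around. The sketch below checks both sides of the rewrite and the source-level identity `64 - bits.Len64(n) == bits.LeadingZeros64(n)`. The function names are hypothetical, and `subFromLen64` is only an assumption about what a test with that name exercises.

```go
package main

import (
	"fmt"
	"math/bits"
)

// unfolded and folded are the two sides of the rewrite
// c-(-(x-d)) => x+(c-d).
func unfolded(x, c, d int64) int64 { return c - (-(x - d)) }
func folded(x, c, d int64) int64   { return x + (c - d) }

// subFromLen64 is an assumed shape for the code under test: with
// c = d = 64 the fold leaves just the leading-zero count.
func subFromLen64(n uint64) int { return 64 - bits.Len64(n) }

func main() {
	for _, x := range []int64{0, 1, 63, -1, -1 << 63} {
		fmt.Println(x, unfolded(x, 64, 64) == folded(x, 64, 64))
	}
	for _, n := range []uint64{0, 1, 0xff, 1 << 63} {
		fmt.Println(n, subFromLen64(n) == bits.LeadingZeros64(n))
	}
}
```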

Still, some instances of LeadingZeros are not optimized into single
CLZ instructions right now (in fact, the LeadingZeros micro-benchmarks
are still compiled with redundant adds/subs of 64, due to interference
from loop optimizations before lowering), but the perf numbers indicate
that the result is not that bad after all.

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │   before    │                after                │
               │   sec/op    │   sec/op     vs base                │
LeadingZeros     3.675n ± 0%   1.545n ± 1%  -57.96% (p=0.000 n=10)
LeadingZeros8    2.001n ± 0%   1.868n ± 0%   -6.62% (p=0.000 n=10)
LeadingZeros16   3.144n ± 0%   1.864n ± 1%  -40.71% (p=0.000 n=10)
LeadingZeros32   4.265n ± 1%   1.653n ± 1%  -61.24% (p=0.000 n=10)
LeadingZeros64   3.962n ± 0%   1.539n ± 0%  -61.16% (p=0.000 n=10)
geomean          3.299n        1.688n       -48.84%

go1 benchmark results on the same box:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 483355  │              this CL               │
                      │   sec/op    │   sec/op     vs base               │
BinaryTree17             13.75 ± 2%    13.70 ± 2%       ~ (p=0.579 n=10)
Fannkuch11               3.650 ± 0%    3.415 ± 0%  -6.46% (p=0.000 n=10)
FmtFprintfEmpty         94.45n ± 0%   94.98n ± 0%  +0.56% (p=0.000 n=10)
FmtFprintfString        155.2n ± 0%   151.1n ± 0%  -2.61% (p=0.000 n=10)
FmtFprintfInt           154.4n ± 0%   153.6n ± 0%  -0.52% (p=0.000 n=10)
FmtFprintfIntInt        237.1n ± 0%   234.7n ± 0%  -0.99% (p=0.000 n=10)
FmtFprintfPrefixedInt   312.8n ± 0%   314.2n ± 0%  +0.45% (p=0.000 n=10)
FmtFprintfFloat         390.5n ± 0%   402.1n ± 0%  +2.97% (p=0.000 n=10)
FmtManyArgs             905.0n ± 0%   918.6n ± 0%  +1.51% (p=0.000 n=10)
GobDecode               14.93m ± 1%   14.98m ± 1%  +0.33% (p=0.015 n=10)
GobEncode               17.33m ± 0%   17.26m ± 1%  -0.39% (p=0.023 n=10)
Gzip                    404.3m ± 0%   404.6m ± 0%  +0.08% (p=0.000 n=10)
Gunzip                  80.92m ± 0%   80.97m ± 0%  +0.06% (p=0.000 n=10)
HTTPClientServer        86.14µ ± 0%   84.39µ ± 0%  -2.03% (p=0.000 n=10)
JSONEncode              18.49m ± 0%   18.50m ± 0%       ~ (p=0.436 n=10)
JSONDecode              77.34m ± 1%   76.26m ± 1%  -1.40% (p=0.000 n=10)
Mandelbrot200           6.521m ± 0%   6.508m ± 0%       ~ (p=0.138 n=10)
GoParse                 7.324m ± 1%   7.413m ± 1%  +1.22% (p=0.005 n=10)
RegexpMatchEasy0_32     134.6n ± 0%   134.6n ± 0%       ~ (p=0.195 n=10)
RegexpMatchEasy0_1K     1.365µ ± 0%   1.366µ ± 0%  +0.07% (p=0.038 n=10)
RegexpMatchEasy1_32     164.1n ± 0%   164.1n ± 0%       ~ (p=0.230 n=10)
RegexpMatchEasy1_1K     1.492µ ± 0%   1.492µ ± 0%       ~ (p=0.211 n=10)
RegexpMatchMedium_32    1.404µ ± 0%   1.403µ ± 0%  -0.07% (p=0.000 n=10)
RegexpMatchMedium_1K    41.05µ ± 0%   41.04µ ± 0%  -0.04% (p=0.000 n=10)
RegexpMatchHard_32      2.072µ ± 0%   2.071µ ± 0%  -0.05% (p=0.000 n=10)
RegexpMatchHard_1K      60.89µ ± 0%   60.87µ ± 0%  -0.04% (p=0.000 n=10)
Revcomp                  1.199 ± 1%    1.200 ± 0%       ~ (p=0.481 n=10)
Template                112.3m ± 2%   112.9m ± 2%       ~ (p=0.353 n=10)
TimeParse               414.2n ± 1%   412.5n ± 0%  -0.40% (p=0.000 n=10)
TimeFormat              496.9n ± 0%   496.6n ± 0%       ~ (p=0.341 n=10)
geomean                 101.0µ        100.7µ       -0.26%

                     │  CL 483355   │                this CL                │
                     │     B/s      │     B/s       vs base                 │
GobDecode              49.02Mi ± 1%   48.87Mi ± 1%  -0.32% (p=0.014 n=10)
GobEncode              42.23Mi ± 0%   42.40Mi ± 1%  +0.40% (p=0.022 n=10)
Gzip                   45.77Mi ± 0%   45.73Mi ± 0%  -0.07% (p=0.000 n=10)
Gunzip                 228.7Mi ± 0%   228.6Mi ± 0%  -0.06% (p=0.000 n=10)
JSONEncode             100.1Mi ± 0%   100.0Mi ± 0%       ~ (p=0.470 n=10)
JSONDecode             23.93Mi ± 1%   24.27Mi ± 1%  +1.43% (p=0.000 n=10)
GoParse                7.544Mi ± 1%   7.448Mi ± 1%  -1.26% (p=0.005 n=10)
RegexpMatchEasy0_32    226.8Mi ± 0%   226.7Mi ± 0%  -0.06% (p=0.001 n=10)
RegexpMatchEasy0_1K    715.7Mi ± 0%   715.1Mi ± 0%  -0.08% (p=0.022 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   186.0Mi ± 0%       ~ (p=0.493 n=10)
RegexpMatchEasy1_1K    654.3Mi ± 0%   654.6Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchMedium_32   21.74Mi ± 0%   21.74Mi ± 0%  +0.02% (p=0.022 n=10)
RegexpMatchMedium_1K   23.78Mi ± 0%   23.79Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchHard_32     14.72Mi ± 0%   14.73Mi ± 0%  +0.06% (p=0.000 n=10)
RegexpMatchHard_1K     16.04Mi ± 0%   16.04Mi ± 0%       ~ (p=1.000 n=10) ¹
Revcomp                202.2Mi ± 1%   202.0Mi ± 0%       ~ (p=0.469 n=10)
Template               16.48Mi ± 2%   16.38Mi ± 2%       ~ (p=0.342 n=10)
geomean                62.23Mi        62.21Mi       -0.04%
¹ all samples are equal

In this case though, all significant perf changes are likely due to
micro-architectural quirks.

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │    before    │                after                 │
               │    sec/op    │    sec/op     vs base                │
ReverseBytes     3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16   0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32   1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64   2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean           1.882n        0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for
micro-architecture-related fluctuations.

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
…xtensions

8- and 16-bit sign extensions and 32-bit zero extensions were realized
with left and right shifts before this change. We now support assembling
EXTWB, EXTWH and BSTRPICKV, so all three can be done with a single insn
respectively.

Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 479495  │              this CL               │
                      │   sec/op    │   sec/op     vs base               │
BinaryTree17             14.12 ± 1%    14.06 ± 1%       ~ (p=0.393 n=10)
Fannkuch11               3.420 ± 0%    3.421 ± 0%  +0.04% (p=0.001 n=10)
FmtFprintfEmpty         94.72n ± 0%   94.97n ± 0%  +0.26% (p=0.000 n=10)
FmtFprintfString        152.6n ± 0%   155.3n ± 0%  +1.77% (p=0.000 n=10)
FmtFprintfInt           154.5n ± 0%   154.5n ± 0%       ~ (p=0.263 n=10)
FmtFprintfIntInt        237.7n ± 0%   237.1n ± 0%  -0.21% (p=0.000 n=10)
FmtFprintfPrefixedInt   313.1n ± 0%   313.0n ± 0%  -0.03% (p=0.000 n=10)
FmtFprintfFloat         394.1n ± 0%   392.8n ± 0%  -0.32% (p=0.000 n=10)
FmtManyArgs             934.3n ± 0%   912.6n ± 0%  -2.32% (p=0.000 n=10)
GobDecode               15.29m ± 1%   15.23m ± 1%       ~ (p=0.280 n=10)
GobEncode               17.76m ± 0%   17.66m ± 0%  -0.60% (p=0.000 n=10)
Gzip                    416.0m ± 0%   404.4m ± 0%  -2.79% (p=0.000 n=10)
Gunzip                  83.20m ± 0%   80.88m ± 0%  -2.79% (p=0.000 n=10)
HTTPClientServer        87.82µ ± 1%   87.09µ ± 1%  -0.83% (p=0.000 n=10)
JSONEncode              18.56m ± 0%   18.54m ± 0%       ~ (p=0.123 n=10)
JSONDecode              76.53m ± 0%   78.22m ± 1%  +2.21% (p=0.000 n=10)
Mandelbrot200           7.217m ± 0%   7.215m ± 0%       ~ (p=0.143 n=10)
GoParse                 7.587m ± 1%   7.520m ± 1%       ~ (p=0.165 n=10)
RegexpMatchEasy0_32     134.2n ± 0%   134.5n ± 0%  +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K     1.366µ ± 0%   1.364µ ± 0%  -0.15% (p=0.000 n=10)
RegexpMatchEasy1_32     163.0n ± 0%   164.0n ± 0%  +0.61% (p=0.000 n=10)
RegexpMatchEasy1_1K     1.497µ ± 0%   1.492µ ± 0%  -0.33% (p=0.000 n=10)
RegexpMatchMedium_32    1.415µ ± 0%   1.403µ ± 0%  -0.85% (p=0.000 n=10)
RegexpMatchMedium_1K    41.61µ ± 0%   41.05µ ± 0%  -1.36% (p=0.000 n=10)
RegexpMatchHard_32      2.121µ ± 0%   2.070µ ± 0%  -2.43% (p=0.000 n=10)
RegexpMatchHard_1K      62.64µ ± 0%   60.87µ ± 0%  -2.83% (p=0.000 n=10)
Revcomp                  1.204 ± 0%    1.210 ± 0%  +0.51% (p=0.000 n=10)
Template                118.0m ± 0%   115.2m ± 1%  -2.31% (p=0.000 n=10)
TimeParse               414.8n ± 0%   410.6n ± 0%  -1.01% (p=0.000 n=10)
TimeFormat              510.7n ± 0%   508.2n ± 0%  -0.48% (p=0.000 n=10)
geomean                 102.3µ        101.7µ       -0.60%

                     │  CL 479495   │               this CL               │
                     │     B/s      │     B/s       vs base               │
GobDecode              47.88Mi ± 1%   48.05Mi ± 1%       ~ (p=0.280 n=10)
GobEncode              41.20Mi ± 0%   41.45Mi ± 0%  +0.60% (p=0.000 n=10)
Gzip                   44.49Mi ± 0%   45.77Mi ± 0%  +2.87% (p=0.000 n=10)
Gunzip                 222.4Mi ± 0%   228.8Mi ± 0%  +2.87% (p=0.000 n=10)
JSONEncode             99.69Mi ± 0%   99.82Mi ± 0%       ~ (p=0.118 n=10)
JSONDecode             24.19Mi ± 0%   23.66Mi ± 1%  -2.19% (p=0.000 n=10)
GoParse                7.281Mi ± 2%   7.343Mi ± 1%       ~ (p=0.187 n=10)
RegexpMatchEasy0_32    227.4Mi ± 0%   226.9Mi ± 0%  -0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K    715.0Mi ± 0%   716.0Mi ± 0%  +0.13% (p=0.000 n=10)
RegexpMatchEasy1_32    187.3Mi ± 0%   186.1Mi ± 0%  -0.62% (p=0.000 n=10)
RegexpMatchEasy1_1K    652.3Mi ± 0%   654.5Mi ± 0%  +0.34% (p=0.000 n=10)
RegexpMatchMedium_32   21.57Mi ± 0%   21.74Mi ± 0%  +0.80% (p=0.000 n=10)
RegexpMatchMedium_1K   23.47Mi ± 0%   23.79Mi ± 0%  +1.38% (p=0.000 n=10)
RegexpMatchHard_32     14.39Mi ± 0%   14.74Mi ± 0%  +2.45% (p=0.000 n=10)
RegexpMatchHard_1K     15.59Mi ± 0%   16.04Mi ± 0%  +2.87% (p=0.000 n=10)
Revcomp                201.3Mi ± 0%   200.3Mi ± 0%  -0.51% (p=0.000 n=10)
Template               15.69Mi ± 0%   16.06Mi ± 1%  +2.37% (p=0.000 n=10)
geomean                61.31Mi        61.82Mi       +0.84%

The test binaries were pre-compiled with `go test -c`, and the test runs
were wrapped with `perf stat record` for recording dynamic instruction
counts. The instruction count, IPC and branch misprediction rate did not
meaningfully change.

As for the JSONDecode regression, `perf stat` is used to check
micro-architectural details:

$ sudo perf stat <test executable> -test.timeout=30m -test.run='^$' \
    -test.cpu=1 -test.bench='JSONDecode' -test.count=1 -test.benchtime=50x

Before:

          4,256.10 msec task-clock               #    1.061 CPUs utilized
            61,431      context-switches         #   14.434 K/sec
                 3      cpu-migrations           #    0.705 /sec
             3,297      page-faults              #  774.652 /sec
    10,364,990,422      cycles                   #    2.435 GHz
    19,640,571,817      instructions             #    1.89  insn per cycle
     4,267,623,324      branches                 #    1.003 G/sec
        44,164,375      branch-misses            #    1.03% of all branches

After:

          4,343.17 msec task-clock               #    1.061 CPUs utilized
            62,742      context-switches         #   14.446 K/sec
                 5      cpu-migrations           #    1.151 /sec
             3,044      page-faults              #  700.871 /sec
    10,577,322,342      cycles                   #    2.435 GHz
    19,582,895,547      instructions             #    1.85  insn per cycle
     4,266,051,537      branches                 #  982.244 M/sec
        46,298,286      branch-misses            #    1.09% of all branches

Instruction count decreased by 0.29% but cycle count went up by 2.05%,
while branch misprediction rate raised too. This is likely caused by the
micro-architecture's sensitivity towards changed code layout; the
optimization implemented here should be a net win otherwise.

Updates golang#59120

Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
The runtime malloc implementation makes use of these, among others.

Some generic strength reduction rules for Ctz ops have also been added,
though only enabled for loong64 for now. This is necessary to make the
optimization profitable at all, as the LA464 architecture apparently
handles the `TrailingZeros64(x) < 64` part in runtime.nextFreeFast very
badly if the compiled branch isn't a simple BEQZ any more (that used to
be the case before, when the compiler is able to peek into the pure Go
implementation of TrailingZeros). Without the generic rules this change
is going to be a big perf hit (as bad as 7~10% in select go1 benchmark
cases).

The generic changes are benchmarked on linux/amd64 (Threadripper 3990X)
and darwin/arm64 (Apple M1 Pro) too, but results are either mixed
(amd64) or even net loss (arm64). So, for now those rules are guarded
with a predicate that only enables them for loong64.

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                │   before    │                after                │
                │   sec/op    │   sec/op     vs base                │
TrailingZeros     2.758n ± 0%   1.004n ± 0%  -63.60% (p=0.000 n=10)
TrailingZeros8    1.508n ± 0%   1.219n ± 0%  -19.20% (p=0.000 n=10)
TrailingZeros16   3.526n ± 0%   1.437n ± 0%  -59.25% (p=0.000 n=10)
TrailingZeros32   3.161n ± 0%   1.004n ± 0%  -68.23% (p=0.000 n=10)
TrailingZeros64   2.759n ± 0%   1.003n ± 0%  -63.65% (p=0.000 n=10)
geomean           2.638n        1.121n       -57.51%

Go1 benchmark results on the same machine:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479496 v8 │              this CL               │
                      │    sec/op    │   sec/op     vs base               │
BinaryTree17              14.10 ± 1%    13.64 ± 1%  -3.28% (p=0.000 n=10)
Fannkuch11                3.421 ± 0%    3.421 ± 0%       ~ (p=0.075 n=10)
FmtFprintfEmpty          94.78n ± 0%   94.50n ± 0%  -0.30% (p=0.000 n=10)
FmtFprintfString         155.0n ± 0%   154.1n ± 1%       ~ (p=1.000 n=10)
FmtFprintfInt            157.2n ± 0%   155.2n ± 1%  -1.27% (p=0.000 n=10)
FmtFprintfIntInt         242.1n ± 0%   238.0n ± 1%  -1.73% (p=0.000 n=10)
FmtFprintfPrefixedInt    337.6n ± 0%   334.6n ± 0%  -0.89% (p=0.000 n=10)
FmtFprintfFloat          399.0n ± 0%   396.4n ± 0%  -0.65% (p=0.000 n=10)
FmtManyArgs              959.8n ± 0%   923.4n ± 0%  -3.79% (p=0.000 n=10)
GobDecode                15.63m ± 3%   15.17m ± 1%  -2.90% (p=0.001 n=10)
GobEncode                18.43m ± 3%   17.62m ± 0%  -4.38% (p=0.000 n=10)
Gzip                     405.1m ± 0%   405.4m ± 0%  +0.06% (p=0.035 n=10)
Gunzip                   86.84m ± 0%   87.20m ± 0%  +0.41% (p=0.000 n=10)
HTTPClientServer         88.47µ ± 0%   86.92µ ± 1%  -1.75% (p=0.000 n=10)
JSONEncode               18.84m ± 0%   18.66m ± 0%  -0.95% (p=0.000 n=10)
JSONDecode               79.35m ± 0%   75.77m ± 1%  -4.51% (p=0.000 n=10)
Mandelbrot200            7.215m ± 0%   7.215m ± 0%       ~ (p=0.315 n=10)
GoParse                  7.591m ± 1%   7.407m ± 1%  -2.43% (p=0.000 n=10)
RegexpMatchEasy0_32      133.8n ± 0%   134.3n ± 0%  +0.37% (p=0.000 n=10)
RegexpMatchEasy0_1K      1.540µ ± 0%   1.544µ ± 0%  +0.26% (p=0.000 n=10)
RegexpMatchEasy1_32      164.1n ± 0%   165.4n ± 0%  +0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K      1.626µ ± 0%   1.629µ ± 0%  +0.18% (p=0.000 n=10)
RegexpMatchMedium_32     1.403µ ± 0%   1.413µ ± 0%  +0.71% (p=0.000 n=10)
RegexpMatchMedium_1K     41.22µ ± 0%   41.59µ ± 0%  +0.90% (p=0.000 n=10)
RegexpMatchHard_32       2.071µ ± 0%   2.060µ ± 0%  -0.53% (p=0.000 n=10)
RegexpMatchHard_1K       61.05µ ± 0%   61.30µ ± 0%  +0.41% (p=0.001 n=10)
Revcomp                   1.351 ± 0%    1.357 ± 0%  +0.42% (p=0.000 n=10)
Template                 117.3m ± 1%   110.6m ± 2%  -5.71% (p=0.000 n=10)
TimeParse                411.9n ± 0%   411.7n ± 0%       ~ (p=0.117 n=10)
TimeFormat               514.2n ± 0%   499.9n ± 0%  -2.77% (p=0.000 n=10)
geomean                  104.2µ        103.0µ       -1.15%

                     │ CL 479496 v8 │               this CL               │
                     │     B/s      │     B/s       vs base               │
GobDecode              46.84Mi ± 3%   48.24Mi ± 1%  +2.98% (p=0.001 n=10)
GobEncode              39.72Mi ± 4%   41.53Mi ± 0%  +4.57% (p=0.000 n=10)
Gzip                   45.68Mi ± 0%   45.65Mi ± 0%  -0.05% (p=0.029 n=10)
Gunzip                 213.1Mi ± 0%   212.2Mi ± 0%  -0.41% (p=0.000 n=10)
JSONEncode             98.23Mi ± 0%   99.18Mi ± 0%  +0.97% (p=0.000 n=10)
JSONDecode             23.32Mi ± 0%   24.42Mi ± 1%  +4.72% (p=0.000 n=10)
GoParse                7.277Mi ± 1%   7.458Mi ± 1%  +2.49% (p=0.000 n=10)
RegexpMatchEasy0_32    228.1Mi ± 0%   227.3Mi ± 0%  -0.36% (p=0.000 n=10)
RegexpMatchEasy0_1K    634.2Mi ± 0%   632.5Mi ± 0%  -0.27% (p=0.000 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   184.5Mi ± 0%  -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K    600.4Mi ± 0%   599.4Mi ± 0%  -0.17% (p=0.000 n=10)
RegexpMatchMedium_32   21.75Mi ± 0%   21.60Mi ± 0%  -0.70% (p=0.000 n=10)
RegexpMatchMedium_1K   23.69Mi ± 0%   23.48Mi ± 0%  -0.89% (p=0.000 n=10)
RegexpMatchHard_32     14.73Mi ± 0%   14.81Mi ± 0%  +0.52% (p=0.000 n=10)
RegexpMatchHard_1K     15.99Mi ± 0%   15.93Mi ± 0%  -0.42% (p=0.000 n=10)
Revcomp                179.4Mi ± 0%   178.6Mi ± 0%  -0.42% (p=0.000 n=10)
Template               15.78Mi ± 1%   16.73Mi ± 2%  +6.04% (p=0.000 n=10)
geomean                59.97Mi        60.58Mi       +1.02%

The change should be a net win, as all it does is pattern-match Ctz ops
and replace them with the respective native instructions, so any
performance regression is likely micro-architecture related as well, as
observed in CL 479496's results. (Indeed, some of the more drastic
improvements may well be coincidental too, but the point is that there is
at least a small deterministic improvement anyway.)

Updates golang#59120

Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479498 v11 │               this CL               │
                      │    sec/op     │   sec/op     vs base                │
BinaryTree17               13.64 ± 1%    13.75 ± 2%        ~ (p=0.579 n=10)
Fannkuch11                 3.421 ± 0%    3.650 ± 0%   +6.70% (p=0.000 n=10)
FmtFprintfEmpty           94.50n ± 0%   94.45n ± 0%   -0.05% (p=0.000 n=10)
FmtFprintfString          154.1n ± 1%   155.2n ± 0%        ~ (p=0.689 n=10)
FmtFprintfInt             155.2n ± 1%   154.4n ± 0%        ~ (p=0.785 n=10)
FmtFprintfIntInt          238.0n ± 1%   237.1n ± 0%        ~ (p=0.721 n=10)
FmtFprintfPrefixedInt     334.6n ± 0%   312.8n ± 0%   -6.52% (p=0.000 n=10)
FmtFprintfFloat           396.4n ± 0%   390.5n ± 0%   -1.49% (p=0.000 n=10)
FmtManyArgs               923.4n ± 0%   905.0n ± 0%   -2.00% (p=0.000 n=10)
GobDecode                 15.17m ± 1%   14.93m ± 1%   -1.59% (p=0.000 n=10)
GobEncode                 17.62m ± 0%   17.33m ± 0%   -1.65% (p=0.001 n=10)
Gzip                      405.4m ± 0%   404.3m ± 0%   -0.26% (p=0.000 n=10)
Gunzip                    87.20m ± 0%   80.92m ± 0%   -7.20% (p=0.000 n=10)
HTTPClientServer          86.92µ ± 1%   86.14µ ± 0%   -0.90% (p=0.000 n=10)
JSONEncode                18.66m ± 0%   18.49m ± 0%   -0.91% (p=0.000 n=10)
JSONDecode                75.77m ± 1%   77.34m ± 1%   +2.07% (p=0.000 n=10)
Mandelbrot200             7.215m ± 0%   6.521m ± 0%   -9.62% (p=0.000 n=10)
GoParse                   7.407m ± 1%   7.324m ± 1%   -1.12% (p=0.003 n=10)
RegexpMatchEasy0_32       134.3n ± 0%   134.6n ± 0%   +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K       1.544µ ± 0%   1.365µ ± 0%  -11.63% (p=0.000 n=10)
RegexpMatchEasy1_32       165.4n ± 0%   164.1n ± 0%   -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K       1.629µ ± 0%   1.492µ ± 0%   -8.41% (p=0.000 n=10)
RegexpMatchMedium_32      1.413µ ± 0%   1.404µ ± 0%   -0.64% (p=0.000 n=10)
RegexpMatchMedium_1K      41.59µ ± 0%   41.05µ ± 0%   -1.28% (p=0.000 n=10)
RegexpMatchHard_32        2.060µ ± 0%   2.072µ ± 0%   +0.58% (p=0.000 n=10)
RegexpMatchHard_1K        61.30µ ± 0%   60.89µ ± 0%   -0.68% (p=0.000 n=10)
Revcomp                    1.357 ± 0%    1.199 ± 1%  -11.64% (p=0.000 n=10)
Template                  110.6m ± 2%   112.3m ± 2%        ~ (p=0.105 n=10)
TimeParse                 411.7n ± 0%   414.2n ± 1%   +0.60% (p=0.000 n=10)
TimeFormat                499.9n ± 0%   496.9n ± 0%   -0.60% (p=0.000 n=10)
geomean                   103.0µ        101.0µ        -1.98%

                     │ CL 479498 v11 │                this CL                │
                     │      B/s      │      B/s       vs base                │
GobDecode               48.24Mi ± 1%    49.02Mi ± 1%   +1.62% (p=0.000 n=10)
GobEncode               41.53Mi ± 0%    42.23Mi ± 0%   +1.69% (p=0.001 n=10)
Gzip                    45.65Mi ± 0%    45.77Mi ± 0%   +0.25% (p=0.000 n=10)
Gunzip                  212.2Mi ± 0%    228.7Mi ± 0%   +7.76% (p=0.000 n=10)
JSONEncode              99.18Mi ± 0%   100.08Mi ± 0%   +0.91% (p=0.000 n=10)
JSONDecode              24.42Mi ± 1%    23.93Mi ± 1%   -2.03% (p=0.000 n=10)
GoParse                 7.458Mi ± 1%    7.544Mi ± 1%   +1.15% (p=0.001 n=10)
RegexpMatchEasy0_32     227.3Mi ± 0%    226.8Mi ± 0%   -0.21% (p=0.000 n=10)
RegexpMatchEasy0_1K     632.5Mi ± 0%    715.7Mi ± 0%  +13.15% (p=0.000 n=10)
RegexpMatchEasy1_32     184.5Mi ± 0%    186.0Mi ± 0%   +0.81% (p=0.000 n=10)
RegexpMatchEasy1_1K     599.4Mi ± 0%    654.3Mi ± 0%   +9.17% (p=0.000 n=10)
RegexpMatchMedium_32    21.60Mi ± 0%    21.74Mi ± 0%   +0.64% (p=0.000 n=10)
RegexpMatchMedium_1K    23.48Mi ± 0%    23.78Mi ± 0%   +1.30% (p=0.000 n=10)
RegexpMatchHard_32      14.81Mi ± 0%    14.72Mi ± 0%   -0.58% (p=0.000 n=10)
RegexpMatchHard_1K      15.93Mi ± 0%    16.04Mi ± 0%   +0.72% (p=0.000 n=10)
Revcomp                 178.6Mi ± 0%    202.2Mi ± 1%  +13.18% (p=0.000 n=10)
Template                16.73Mi ± 2%    16.48Mi ± 2%        ~ (p=0.093 n=10)
geomean                 60.58Mi         62.23Mi        +2.72%

The only significant regression is the Fannkuch11 case. The perf records
were manually inspected: the hottest part of the code is virtually
unchanged except for the alignment of two instructions, which seem to sit
on different sides of a 32- or even 64-byte boundary. So again, the
regression is likely due to micro-architecture quirks, and the change is
in fact a win across the board.

Updates golang#59120

Change-Id: Ibbf64988c9d06f7c1d359480a1d6aecfa2c25b65
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
For the SubFromLen64 codegen test case to work as intended, we need
to fold c-(-(x-d)) into x+(c-d).

Still, some instances of LeadingZeros are not yet optimized into single
CLZ instructions (in fact, the LeadingZeros micro-benchmarks are still
compiled with redundant adds/subs of 64, due to interference from loop
optimizations before lowering), but the perf numbers indicate it's not
that bad after all.

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │   before    │                after                │
               │   sec/op    │   sec/op     vs base                │
LeadingZeros     3.675n ± 0%   1.545n ± 1%  -57.96% (p=0.000 n=10)
LeadingZeros8    2.001n ± 0%   1.868n ± 0%   -6.62% (p=0.000 n=10)
LeadingZeros16   3.144n ± 0%   1.864n ± 1%  -40.71% (p=0.000 n=10)
LeadingZeros32   4.265n ± 1%   1.653n ± 1%  -61.24% (p=0.000 n=10)
LeadingZeros64   3.962n ± 0%   1.539n ± 0%  -61.16% (p=0.000 n=10)
geomean          3.299n        1.688n       -48.84%

go1 benchmark results on the same box:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 483355  │              this CL               │
                      │   sec/op    │   sec/op     vs base               │
BinaryTree17             13.75 ± 2%    13.70 ± 2%       ~ (p=0.579 n=10)
Fannkuch11               3.650 ± 0%    3.415 ± 0%  -6.46% (p=0.000 n=10)
FmtFprintfEmpty         94.45n ± 0%   94.98n ± 0%  +0.56% (p=0.000 n=10)
FmtFprintfString        155.2n ± 0%   151.1n ± 0%  -2.61% (p=0.000 n=10)
FmtFprintfInt           154.4n ± 0%   153.6n ± 0%  -0.52% (p=0.000 n=10)
FmtFprintfIntInt        237.1n ± 0%   234.7n ± 0%  -0.99% (p=0.000 n=10)
FmtFprintfPrefixedInt   312.8n ± 0%   314.2n ± 0%  +0.45% (p=0.000 n=10)
FmtFprintfFloat         390.5n ± 0%   402.1n ± 0%  +2.97% (p=0.000 n=10)
FmtManyArgs             905.0n ± 0%   918.6n ± 0%  +1.51% (p=0.000 n=10)
GobDecode               14.93m ± 1%   14.98m ± 1%  +0.33% (p=0.015 n=10)
GobEncode               17.33m ± 0%   17.26m ± 1%  -0.39% (p=0.023 n=10)
Gzip                    404.3m ± 0%   404.6m ± 0%  +0.08% (p=0.000 n=10)
Gunzip                  80.92m ± 0%   80.97m ± 0%  +0.06% (p=0.000 n=10)
HTTPClientServer        86.14µ ± 0%   84.39µ ± 0%  -2.03% (p=0.000 n=10)
JSONEncode              18.49m ± 0%   18.50m ± 0%       ~ (p=0.436 n=10)
JSONDecode              77.34m ± 1%   76.26m ± 1%  -1.40% (p=0.000 n=10)
Mandelbrot200           6.521m ± 0%   6.508m ± 0%       ~ (p=0.138 n=10)
GoParse                 7.324m ± 1%   7.413m ± 1%  +1.22% (p=0.005 n=10)
RegexpMatchEasy0_32     134.6n ± 0%   134.6n ± 0%       ~ (p=0.195 n=10)
RegexpMatchEasy0_1K     1.365µ ± 0%   1.366µ ± 0%  +0.07% (p=0.038 n=10)
RegexpMatchEasy1_32     164.1n ± 0%   164.1n ± 0%       ~ (p=0.230 n=10)
RegexpMatchEasy1_1K     1.492µ ± 0%   1.492µ ± 0%       ~ (p=0.211 n=10)
RegexpMatchMedium_32    1.404µ ± 0%   1.403µ ± 0%  -0.07% (p=0.000 n=10)
RegexpMatchMedium_1K    41.05µ ± 0%   41.04µ ± 0%  -0.04% (p=0.000 n=10)
RegexpMatchHard_32      2.072µ ± 0%   2.071µ ± 0%  -0.05% (p=0.000 n=10)
RegexpMatchHard_1K      60.89µ ± 0%   60.87µ ± 0%  -0.04% (p=0.000 n=10)
Revcomp                  1.199 ± 1%    1.200 ± 0%       ~ (p=0.481 n=10)
Template                112.3m ± 2%   112.9m ± 2%       ~ (p=0.353 n=10)
TimeParse               414.2n ± 1%   412.5n ± 0%  -0.40% (p=0.000 n=10)
TimeFormat              496.9n ± 0%   496.6n ± 0%       ~ (p=0.341 n=10)
geomean                 101.0µ        100.7µ       -0.26%

                     │  CL 483355   │                this CL                │
                     │     B/s      │     B/s       vs base                 │
GobDecode              49.02Mi ± 1%   48.87Mi ± 1%  -0.32% (p=0.014 n=10)
GobEncode              42.23Mi ± 0%   42.40Mi ± 1%  +0.40% (p=0.022 n=10)
Gzip                   45.77Mi ± 0%   45.73Mi ± 0%  -0.07% (p=0.000 n=10)
Gunzip                 228.7Mi ± 0%   228.6Mi ± 0%  -0.06% (p=0.000 n=10)
JSONEncode             100.1Mi ± 0%   100.0Mi ± 0%       ~ (p=0.470 n=10)
JSONDecode             23.93Mi ± 1%   24.27Mi ± 1%  +1.43% (p=0.000 n=10)
GoParse                7.544Mi ± 1%   7.448Mi ± 1%  -1.26% (p=0.005 n=10)
RegexpMatchEasy0_32    226.8Mi ± 0%   226.7Mi ± 0%  -0.06% (p=0.001 n=10)
RegexpMatchEasy0_1K    715.7Mi ± 0%   715.1Mi ± 0%  -0.08% (p=0.022 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   186.0Mi ± 0%       ~ (p=0.493 n=10)
RegexpMatchEasy1_1K    654.3Mi ± 0%   654.6Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchMedium_32   21.74Mi ± 0%   21.74Mi ± 0%  +0.02% (p=0.022 n=10)
RegexpMatchMedium_1K   23.78Mi ± 0%   23.79Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchHard_32     14.72Mi ± 0%   14.73Mi ± 0%  +0.06% (p=0.000 n=10)
RegexpMatchHard_1K     16.04Mi ± 0%   16.04Mi ± 0%       ~ (p=1.000 n=10) ¹
Revcomp                202.2Mi ± 1%   202.0Mi ± 0%       ~ (p=0.469 n=10)
Template               16.48Mi ± 2%   16.38Mi ± 2%       ~ (p=0.342 n=10)
geomean                62.23Mi        62.21Mi       -0.04%
¹ all samples are equal

In this case though, all significant perf changes are likely due to
micro-architectural quirks.

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │    before    │                after                 │
               │    sec/op    │    sec/op     vs base                │
ReverseBytes     3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16   0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32   1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64   2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean           1.882n        0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for
micro-architecture-related fluctuations.

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac
xen0n added a commit to xen0n/go that referenced this issue Apr 11, 2023
Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
          │    before    │                after                 │
          │    sec/op    │    sec/op     vs base                │
Reverse     4.2280n ± 0%   0.8029n ± 0%  -81.01% (p=0.000 n=10)
Reverse8    1.0050n ± 0%   0.8029n ± 0%  -20.11% (p=0.000 n=10)
Reverse16   1.9600n ± 0%   0.8029n ± 0%  -59.04% (p=0.000 n=10)
Reverse32   4.0205n ± 0%   0.8029n ± 0%  -80.03% (p=0.000 n=10)
Reverse64   4.0360n ± 0%   0.8029n ± 0%  -80.11% (p=0.000 n=10)
geomean      2.668n        0.8029n       -69.90%

The operation seems to be unused in the tree except in compress/flate,
for which a very slight improvement (time geomean -0.16%, throughput
geomean +0.16%) was observed with the change applied.
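
For context on why a compressor would want bit reversal: DEFLATE packs
Huffman codes starting from the most significant bit of the code, while the
rest of the stream is packed starting from the least significant bit, so
codes are typically stored bit-reversed. The sketch below is a minimal
illustration of that use, not a copy of the compress/flate source; the
helper names `reverseCode` and `reverseCodeLoop` are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// reverseCode reverses the low bitLen bits of code (1 <= bitLen <= 16)
// with a single 16-bit reversal, which is what the intrinsic speeds up.
func reverseCode(code uint16, bitLen uint) uint16 {
	return bits.Reverse16(code << (16 - bitLen))
}

// reverseCodeLoop is a bit-by-bit reference implementation.
func reverseCodeLoop(code uint16, bitLen uint) uint16 {
	var r uint16
	for i := uint(0); i < bitLen; i++ {
		r = r<<1 | (code>>i)&1
	}
	return r
}

func main() {
	for _, c := range []struct {
		code   uint16
		bitLen uint
	}{{0b110, 3}, {0b1, 1}, {0b10110, 5}, {0xffff, 16}} {
		fmt.Printf("%0*b -> %0*b (reference %0*b)\n",
			int(c.bitLen), c.code,
			int(c.bitLen), reverseCode(c.code, c.bitLen),
			int(c.bitLen), reverseCodeLoop(c.code, c.bitLen))
	}
}
```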

Updates golang#59120

Change-Id: Ie1b446386655e0bb6808e435257293c30420626e
@gopherbot

Change https://go.dev/cl/483656 mentions this issue: cmd/compile: wire up bits.Reverse intrinsics for loong64

xen0n added a commit to xen0n/go that referenced this issue May 1, 2023
…xtensions

8- and 16-bit sign extensions and 32-bit zero extensions were realized
with a left shift followed by a right shift before this change. We now
support assembling EXTWB, EXTWH and BSTRPICKV, so each of the three can be
done with a single instruction.
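
As an illustration of the shift-pair idiom being replaced, the sketch below
writes each extension as a left shift followed by a right shift and compares
it with the plain Go conversion. The function names are hypothetical, and
which instruction the compiler actually emits for the conversions depends on
the toolchain version and target.

```go
package main

import "fmt"

// signExtend8Shifts sign-extends the low 8 bits: left shift, then
// arithmetic right shift (x is signed, so >> replicates the sign bit).
func signExtend8Shifts(x int64) int64 { return (x << 56) >> 56 }

// signExtend16Shifts sign-extends the low 16 bits the same way.
func signExtend16Shifts(x int64) int64 { return (x << 48) >> 48 }

// zeroExtend32Shifts zero-extends the low 32 bits: left shift, then
// logical right shift (x is unsigned, so >> shifts in zeros).
func zeroExtend32Shifts(x uint64) uint64 { return (x << 32) >> 32 }

func main() {
	x := int64(0x123456789abcdef0)
	// Each pair should print the same value twice.
	fmt.Println(signExtend8Shifts(x), int64(int8(x)))
	fmt.Println(signExtend16Shifts(x), int64(int16(x)))
	fmt.Println(zeroExtend32Shifts(uint64(x)), uint64(uint32(x)))
}
```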

Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 479495  │              this CL               │
                      │   sec/op    │   sec/op     vs base               │
BinaryTree17             14.12 ± 1%    14.06 ± 1%       ~ (p=0.393 n=10)
Fannkuch11               3.420 ± 0%    3.421 ± 0%  +0.04% (p=0.001 n=10)
FmtFprintfEmpty         94.72n ± 0%   94.97n ± 0%  +0.26% (p=0.000 n=10)
FmtFprintfString        152.6n ± 0%   155.3n ± 0%  +1.77% (p=0.000 n=10)
FmtFprintfInt           154.5n ± 0%   154.5n ± 0%       ~ (p=0.263 n=10)
FmtFprintfIntInt        237.7n ± 0%   237.1n ± 0%  -0.21% (p=0.000 n=10)
FmtFprintfPrefixedInt   313.1n ± 0%   313.0n ± 0%  -0.03% (p=0.000 n=10)
FmtFprintfFloat         394.1n ± 0%   392.8n ± 0%  -0.32% (p=0.000 n=10)
FmtManyArgs             934.3n ± 0%   912.6n ± 0%  -2.32% (p=0.000 n=10)
GobDecode               15.29m ± 1%   15.23m ± 1%       ~ (p=0.280 n=10)
GobEncode               17.76m ± 0%   17.66m ± 0%  -0.60% (p=0.000 n=10)
Gzip                    416.0m ± 0%   404.4m ± 0%  -2.79% (p=0.000 n=10)
Gunzip                  83.20m ± 0%   80.88m ± 0%  -2.79% (p=0.000 n=10)
HTTPClientServer        87.82µ ± 1%   87.09µ ± 1%  -0.83% (p=0.000 n=10)
JSONEncode              18.56m ± 0%   18.54m ± 0%       ~ (p=0.123 n=10)
JSONDecode              76.53m ± 0%   78.22m ± 1%  +2.21% (p=0.000 n=10)
Mandelbrot200           7.217m ± 0%   7.215m ± 0%       ~ (p=0.143 n=10)
GoParse                 7.587m ± 1%   7.520m ± 1%       ~ (p=0.165 n=10)
RegexpMatchEasy0_32     134.2n ± 0%   134.5n ± 0%  +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K     1.366µ ± 0%   1.364µ ± 0%  -0.15% (p=0.000 n=10)
RegexpMatchEasy1_32     163.0n ± 0%   164.0n ± 0%  +0.61% (p=0.000 n=10)
RegexpMatchEasy1_1K     1.497µ ± 0%   1.492µ ± 0%  -0.33% (p=0.000 n=10)
RegexpMatchMedium_32    1.415µ ± 0%   1.403µ ± 0%  -0.85% (p=0.000 n=10)
RegexpMatchMedium_1K    41.61µ ± 0%   41.05µ ± 0%  -1.36% (p=0.000 n=10)
RegexpMatchHard_32      2.121µ ± 0%   2.070µ ± 0%  -2.43% (p=0.000 n=10)
RegexpMatchHard_1K      62.64µ ± 0%   60.87µ ± 0%  -2.83% (p=0.000 n=10)
Revcomp                  1.204 ± 0%    1.210 ± 0%  +0.51% (p=0.000 n=10)
Template                118.0m ± 0%   115.2m ± 1%  -2.31% (p=0.000 n=10)
TimeParse               414.8n ± 0%   410.6n ± 0%  -1.01% (p=0.000 n=10)
TimeFormat              510.7n ± 0%   508.2n ± 0%  -0.48% (p=0.000 n=10)
geomean                 102.3µ        101.7µ       -0.60%

                     │  CL 479495   │               this CL               │
                     │     B/s      │     B/s       vs base               │
GobDecode              47.88Mi ± 1%   48.05Mi ± 1%       ~ (p=0.280 n=10)
GobEncode              41.20Mi ± 0%   41.45Mi ± 0%  +0.60% (p=0.000 n=10)
Gzip                   44.49Mi ± 0%   45.77Mi ± 0%  +2.87% (p=0.000 n=10)
Gunzip                 222.4Mi ± 0%   228.8Mi ± 0%  +2.87% (p=0.000 n=10)
JSONEncode             99.69Mi ± 0%   99.82Mi ± 0%       ~ (p=0.118 n=10)
JSONDecode             24.19Mi ± 0%   23.66Mi ± 1%  -2.19% (p=0.000 n=10)
GoParse                7.281Mi ± 2%   7.343Mi ± 1%       ~ (p=0.187 n=10)
RegexpMatchEasy0_32    227.4Mi ± 0%   226.9Mi ± 0%  -0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K    715.0Mi ± 0%   716.0Mi ± 0%  +0.13% (p=0.000 n=10)
RegexpMatchEasy1_32    187.3Mi ± 0%   186.1Mi ± 0%  -0.62% (p=0.000 n=10)
RegexpMatchEasy1_1K    652.3Mi ± 0%   654.5Mi ± 0%  +0.34% (p=0.000 n=10)
RegexpMatchMedium_32   21.57Mi ± 0%   21.74Mi ± 0%  +0.80% (p=0.000 n=10)
RegexpMatchMedium_1K   23.47Mi ± 0%   23.79Mi ± 0%  +1.38% (p=0.000 n=10)
RegexpMatchHard_32     14.39Mi ± 0%   14.74Mi ± 0%  +2.45% (p=0.000 n=10)
RegexpMatchHard_1K     15.59Mi ± 0%   16.04Mi ± 0%  +2.87% (p=0.000 n=10)
Revcomp                201.3Mi ± 0%   200.3Mi ± 0%  -0.51% (p=0.000 n=10)
Template               15.69Mi ± 0%   16.06Mi ± 1%  +2.37% (p=0.000 n=10)
geomean                61.31Mi        61.82Mi       +0.84%

The test binaries were pre-compiled with `go test -c`, and the test runs
were wrapped with `perf stat record` for recording dynamic instruction
counts. The instruction count, IPC and branch misprediction rate did not
meaningfully change.

As for the JSONDecode regression, `perf stat` was used to check
micro-architectural details:

$ sudo perf stat <test executable> -test.timeout=30m -test.run='^$' \
    -test.cpu=1 -test.bench='JSONDecode' -test.count=1 -test.benchtime=50x

Before:

          4,256.10 msec task-clock               #    1.061 CPUs utilized
            61,431      context-switches         #   14.434 K/sec
                 3      cpu-migrations           #    0.705 /sec
             3,297      page-faults              #  774.652 /sec
    10,364,990,422      cycles                   #    2.435 GHz
    19,640,571,817      instructions             #    1.89  insn per cycle
     4,267,623,324      branches                 #    1.003 G/sec
        44,164,375      branch-misses            #    1.03% of all branches

After:

          4,343.17 msec task-clock               #    1.061 CPUs utilized
            62,742      context-switches         #   14.446 K/sec
                 5      cpu-migrations           #    1.151 /sec
             3,044      page-faults              #  700.871 /sec
    10,577,322,342      cycles                   #    2.435 GHz
    19,582,895,547      instructions             #    1.85  insn per cycle
     4,266,051,537      branches                 #  982.244 M/sec
        46,298,286      branch-misses            #    1.09% of all branches

Instruction count decreased by 0.29% but cycle count went up by 2.05%,
and the branch misprediction rate rose too. This is likely caused by the
micro-architecture's sensitivity to the changed code layout; the
optimization implemented here should be a net win otherwise.
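
The percentages quoted above follow directly from the two `perf stat`
outputs. The sketch below only redoes that arithmetic with the counter
values copied from the listings; `pctChange` is a hypothetical helper.

```go
package main

import "fmt"

// Counter values copied from the "Before" and "After" perf stat listings.
const (
	cyclesBefore, cyclesAfter     = 10_364_990_422.0, 10_577_322_342.0
	insnBefore, insnAfter         = 19_640_571_817.0, 19_582_895_547.0
	branchesBefore, branchesAfter = 4_267_623_324.0, 4_266_051_537.0
	missesBefore, missesAfter     = 44_164_375.0, 46_298_286.0
)

// pctChange returns the relative change from before to after, in percent.
func pctChange(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	fmt.Printf("instructions: %+.2f%%\n", pctChange(insnBefore, insnAfter))
	fmt.Printf("cycles:       %+.2f%%\n", pctChange(cyclesBefore, cyclesAfter))
	fmt.Printf("IPC:          %.2f -> %.2f\n",
		insnBefore/cyclesBefore, insnAfter/cyclesAfter)
	fmt.Printf("branch miss:  %.2f%% -> %.2f%%\n",
		missesBefore/branchesBefore*100, missesAfter/branchesAfter*100)
}
```

Fewer instructions retired in more cycles means a lower IPC, which is
consistent with a front-end or layout effect rather than with extra work
in the generated code.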

Updates golang#59120

Change-Id: Ia7dd0dfe20c0ea3e64889e2b38c6b2118b50d56e
(cherry picked from commit 6c2c3c8470a0a5d0e756e50cf45f140d553ef0b2)
xen0n added a commit to xen0n/go that referenced this issue May 1, 2023
…intrinsics for loong64

The runtime malloc implementation makes use of these, among others.

Some generic strength reduction rules for Ctz ops have also been added,
though they are only enabled for loong64 for now. They are necessary to
make the optimization profitable at all, as the LA464 micro-architecture
apparently handles the `TrailingZeros64(x) < 64` part in
runtime.nextFreeFast very badly if the compiled branch is no longer a
simple BEQZ (which used to be the case, when the compiler was able to peek
into the pure Go implementation of TrailingZeros). Without the generic
rules this change would be a big perf hit (as bad as 7~10% in select go1
benchmark cases).

The generic changes were benchmarked on linux/amd64 (Threadripper 3990X)
and darwin/arm64 (Apple M1 Pro) too, but the results are either mixed
(amd64) or even a net loss (arm64). So, for now those rules are guarded
with a predicate that only enables them for loong64.
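
To illustrate the kind of equivalence such a strength reduction rule can
exploit: `bits.TrailingZeros64` returns 64 only for a zero argument, so
`TrailingZeros64(x) < 64` is the same test as `x != 0`, which can compile
to a single branch on zero. The sketch below only demonstrates that
equivalence; it does not reproduce the rules in this change, and the
function names are hypothetical.

```go
package main

import (
	"fmt"
	"math/bits"
)

// viaCtz is the form found in source code: count trailing zeros, then
// compare the count against the word size.
func viaCtz(x uint64) bool { return bits.TrailingZeros64(x) < 64 }

// viaZeroTest is the strength-reduced form: a plain test against zero.
func viaZeroTest(x uint64) bool { return x != 0 }

func main() {
	for _, x := range []uint64{0, 1, 1 << 63, 0xf0, ^uint64(0)} {
		fmt.Printf("%#x: ctz<64=%v nonzero=%v\n", x, viaCtz(x), viaZeroTest(x))
	}
}
```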

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
                │   before    │                after                │
                │   sec/op    │   sec/op     vs base                │
TrailingZeros     2.758n ± 0%   1.004n ± 0%  -63.60% (p=0.000 n=10)
TrailingZeros8    1.508n ± 0%   1.219n ± 0%  -19.20% (p=0.000 n=10)
TrailingZeros16   3.526n ± 0%   1.437n ± 0%  -59.25% (p=0.000 n=10)
TrailingZeros32   3.161n ± 0%   1.004n ± 0%  -68.23% (p=0.000 n=10)
TrailingZeros64   2.759n ± 0%   1.003n ± 0%  -63.65% (p=0.000 n=10)
geomean           2.638n        1.121n       -57.51%

Go1 benchmark results on the same machine:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479496 v8 │              this CL               │
                      │    sec/op    │   sec/op     vs base               │
BinaryTree17              14.10 ± 1%    13.64 ± 1%  -3.28% (p=0.000 n=10)
Fannkuch11                3.421 ± 0%    3.421 ± 0%       ~ (p=0.075 n=10)
FmtFprintfEmpty          94.78n ± 0%   94.50n ± 0%  -0.30% (p=0.000 n=10)
FmtFprintfString         155.0n ± 0%   154.1n ± 1%       ~ (p=1.000 n=10)
FmtFprintfInt            157.2n ± 0%   155.2n ± 1%  -1.27% (p=0.000 n=10)
FmtFprintfIntInt         242.1n ± 0%   238.0n ± 1%  -1.73% (p=0.000 n=10)
FmtFprintfPrefixedInt    337.6n ± 0%   334.6n ± 0%  -0.89% (p=0.000 n=10)
FmtFprintfFloat          399.0n ± 0%   396.4n ± 0%  -0.65% (p=0.000 n=10)
FmtManyArgs              959.8n ± 0%   923.4n ± 0%  -3.79% (p=0.000 n=10)
GobDecode                15.63m ± 3%   15.17m ± 1%  -2.90% (p=0.001 n=10)
GobEncode                18.43m ± 3%   17.62m ± 0%  -4.38% (p=0.000 n=10)
Gzip                     405.1m ± 0%   405.4m ± 0%  +0.06% (p=0.035 n=10)
Gunzip                   86.84m ± 0%   87.20m ± 0%  +0.41% (p=0.000 n=10)
HTTPClientServer         88.47µ ± 0%   86.92µ ± 1%  -1.75% (p=0.000 n=10)
JSONEncode               18.84m ± 0%   18.66m ± 0%  -0.95% (p=0.000 n=10)
JSONDecode               79.35m ± 0%   75.77m ± 1%  -4.51% (p=0.000 n=10)
Mandelbrot200            7.215m ± 0%   7.215m ± 0%       ~ (p=0.315 n=10)
GoParse                  7.591m ± 1%   7.407m ± 1%  -2.43% (p=0.000 n=10)
RegexpMatchEasy0_32      133.8n ± 0%   134.3n ± 0%  +0.37% (p=0.000 n=10)
RegexpMatchEasy0_1K      1.540µ ± 0%   1.544µ ± 0%  +0.26% (p=0.000 n=10)
RegexpMatchEasy1_32      164.1n ± 0%   165.4n ± 0%  +0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K      1.626µ ± 0%   1.629µ ± 0%  +0.18% (p=0.000 n=10)
RegexpMatchMedium_32     1.403µ ± 0%   1.413µ ± 0%  +0.71% (p=0.000 n=10)
RegexpMatchMedium_1K     41.22µ ± 0%   41.59µ ± 0%  +0.90% (p=0.000 n=10)
RegexpMatchHard_32       2.071µ ± 0%   2.060µ ± 0%  -0.53% (p=0.000 n=10)
RegexpMatchHard_1K       61.05µ ± 0%   61.30µ ± 0%  +0.41% (p=0.001 n=10)
Revcomp                   1.351 ± 0%    1.357 ± 0%  +0.42% (p=0.000 n=10)
Template                 117.3m ± 1%   110.6m ± 2%  -5.71% (p=0.000 n=10)
TimeParse                411.9n ± 0%   411.7n ± 0%       ~ (p=0.117 n=10)
TimeFormat               514.2n ± 0%   499.9n ± 0%  -2.77% (p=0.000 n=10)
geomean                  104.2µ        103.0µ       -1.15%

                     │ CL 479496 v8 │               this CL               │
                     │     B/s      │     B/s       vs base               │
GobDecode              46.84Mi ± 3%   48.24Mi ± 1%  +2.98% (p=0.001 n=10)
GobEncode              39.72Mi ± 4%   41.53Mi ± 0%  +4.57% (p=0.000 n=10)
Gzip                   45.68Mi ± 0%   45.65Mi ± 0%  -0.05% (p=0.029 n=10)
Gunzip                 213.1Mi ± 0%   212.2Mi ± 0%  -0.41% (p=0.000 n=10)
JSONEncode             98.23Mi ± 0%   99.18Mi ± 0%  +0.97% (p=0.000 n=10)
JSONDecode             23.32Mi ± 0%   24.42Mi ± 1%  +4.72% (p=0.000 n=10)
GoParse                7.277Mi ± 1%   7.458Mi ± 1%  +2.49% (p=0.000 n=10)
RegexpMatchEasy0_32    228.1Mi ± 0%   227.3Mi ± 0%  -0.36% (p=0.000 n=10)
RegexpMatchEasy0_1K    634.2Mi ± 0%   632.5Mi ± 0%  -0.27% (p=0.000 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   184.5Mi ± 0%  -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K    600.4Mi ± 0%   599.4Mi ± 0%  -0.17% (p=0.000 n=10)
RegexpMatchMedium_32   21.75Mi ± 0%   21.60Mi ± 0%  -0.70% (p=0.000 n=10)
RegexpMatchMedium_1K   23.69Mi ± 0%   23.48Mi ± 0%  -0.89% (p=0.000 n=10)
RegexpMatchHard_32     14.73Mi ± 0%   14.81Mi ± 0%  +0.52% (p=0.000 n=10)
RegexpMatchHard_1K     15.99Mi ± 0%   15.93Mi ± 0%  -0.42% (p=0.000 n=10)
Revcomp                179.4Mi ± 0%   178.6Mi ± 0%  -0.42% (p=0.000 n=10)
Template               15.78Mi ± 1%   16.73Mi ± 2%  +6.04% (p=0.000 n=10)
geomean                59.97Mi        60.58Mi       +1.02%

The change should be a net win, as all it does is pattern-match Ctz ops
and replace them with the respective native instructions, so any
performance regression is likely micro-architecture related as well, as
observed in CL 479496's results. (Indeed, some of the more drastic
improvements may well be coincidental too, but the point is that there is
at least a small deterministic improvement anyway.)

Updates golang#59120

Change-Id: I6c90f727eb00e0add2a5f8575ac045b9e288af54
(cherry picked from commit ba1650c3c739434795465d953ef9a193a68c5024)
xen0n added a commit to xen0n/go that referenced this issue May 1, 2023
Benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │ CL 479498 v11 │               this CL               │
                      │    sec/op     │   sec/op     vs base                │
BinaryTree17               13.64 ± 1%    13.75 ± 2%        ~ (p=0.579 n=10)
Fannkuch11                 3.421 ± 0%    3.650 ± 0%   +6.70% (p=0.000 n=10)
FmtFprintfEmpty           94.50n ± 0%   94.45n ± 0%   -0.05% (p=0.000 n=10)
FmtFprintfString          154.1n ± 1%   155.2n ± 0%        ~ (p=0.689 n=10)
FmtFprintfInt             155.2n ± 1%   154.4n ± 0%        ~ (p=0.785 n=10)
FmtFprintfIntInt          238.0n ± 1%   237.1n ± 0%        ~ (p=0.721 n=10)
FmtFprintfPrefixedInt     334.6n ± 0%   312.8n ± 0%   -6.52% (p=0.000 n=10)
FmtFprintfFloat           396.4n ± 0%   390.5n ± 0%   -1.49% (p=0.000 n=10)
FmtManyArgs               923.4n ± 0%   905.0n ± 0%   -2.00% (p=0.000 n=10)
GobDecode                 15.17m ± 1%   14.93m ± 1%   -1.59% (p=0.000 n=10)
GobEncode                 17.62m ± 0%   17.33m ± 0%   -1.65% (p=0.001 n=10)
Gzip                      405.4m ± 0%   404.3m ± 0%   -0.26% (p=0.000 n=10)
Gunzip                    87.20m ± 0%   80.92m ± 0%   -7.20% (p=0.000 n=10)
HTTPClientServer          86.92µ ± 1%   86.14µ ± 0%   -0.90% (p=0.000 n=10)
JSONEncode                18.66m ± 0%   18.49m ± 0%   -0.91% (p=0.000 n=10)
JSONDecode                75.77m ± 1%   77.34m ± 1%   +2.07% (p=0.000 n=10)
Mandelbrot200             7.215m ± 0%   6.521m ± 0%   -9.62% (p=0.000 n=10)
GoParse                   7.407m ± 1%   7.324m ± 1%   -1.12% (p=0.003 n=10)
RegexpMatchEasy0_32       134.3n ± 0%   134.6n ± 0%   +0.22% (p=0.000 n=10)
RegexpMatchEasy0_1K       1.544µ ± 0%   1.365µ ± 0%  -11.63% (p=0.000 n=10)
RegexpMatchEasy1_32       165.4n ± 0%   164.1n ± 0%   -0.79% (p=0.000 n=10)
RegexpMatchEasy1_1K       1.629µ ± 0%   1.492µ ± 0%   -8.41% (p=0.000 n=10)
RegexpMatchMedium_32      1.413µ ± 0%   1.404µ ± 0%   -0.64% (p=0.000 n=10)
RegexpMatchMedium_1K      41.59µ ± 0%   41.05µ ± 0%   -1.28% (p=0.000 n=10)
RegexpMatchHard_32        2.060µ ± 0%   2.072µ ± 0%   +0.58% (p=0.000 n=10)
RegexpMatchHard_1K        61.30µ ± 0%   60.89µ ± 0%   -0.68% (p=0.000 n=10)
Revcomp                    1.357 ± 0%    1.199 ± 1%  -11.64% (p=0.000 n=10)
Template                  110.6m ± 2%   112.3m ± 2%        ~ (p=0.105 n=10)
TimeParse                 411.7n ± 0%   414.2n ± 1%   +0.60% (p=0.000 n=10)
TimeFormat                499.9n ± 0%   496.9n ± 0%   -0.60% (p=0.000 n=10)
geomean                   103.0µ        101.0µ        -1.98%

                     │ CL 479498 v11 │                this CL                │
                     │      B/s      │      B/s       vs base                │
GobDecode               48.24Mi ± 1%    49.02Mi ± 1%   +1.62% (p=0.000 n=10)
GobEncode               41.53Mi ± 0%    42.23Mi ± 0%   +1.69% (p=0.001 n=10)
Gzip                    45.65Mi ± 0%    45.77Mi ± 0%   +0.25% (p=0.000 n=10)
Gunzip                  212.2Mi ± 0%    228.7Mi ± 0%   +7.76% (p=0.000 n=10)
JSONEncode              99.18Mi ± 0%   100.08Mi ± 0%   +0.91% (p=0.000 n=10)
JSONDecode              24.42Mi ± 1%    23.93Mi ± 1%   -2.03% (p=0.000 n=10)
GoParse                 7.458Mi ± 1%    7.544Mi ± 1%   +1.15% (p=0.001 n=10)
RegexpMatchEasy0_32     227.3Mi ± 0%    226.8Mi ± 0%   -0.21% (p=0.000 n=10)
RegexpMatchEasy0_1K     632.5Mi ± 0%    715.7Mi ± 0%  +13.15% (p=0.000 n=10)
RegexpMatchEasy1_32     184.5Mi ± 0%    186.0Mi ± 0%   +0.81% (p=0.000 n=10)
RegexpMatchEasy1_1K     599.4Mi ± 0%    654.3Mi ± 0%   +9.17% (p=0.000 n=10)
RegexpMatchMedium_32    21.60Mi ± 0%    21.74Mi ± 0%   +0.64% (p=0.000 n=10)
RegexpMatchMedium_1K    23.48Mi ± 0%    23.78Mi ± 0%   +1.30% (p=0.000 n=10)
RegexpMatchHard_32      14.81Mi ± 0%    14.72Mi ± 0%   -0.58% (p=0.000 n=10)
RegexpMatchHard_1K      15.93Mi ± 0%    16.04Mi ± 0%   +0.72% (p=0.000 n=10)
Revcomp                 178.6Mi ± 0%    202.2Mi ± 1%  +13.18% (p=0.000 n=10)
Template                16.73Mi ± 2%    16.48Mi ± 2%        ~ (p=0.093 n=10)
geomean                 60.58Mi         62.23Mi        +2.72%

The only significant regression is the Fannkuch11 case. The perf records
were manually inspected: the hottest part of the code is virtually
unchanged except for the alignment of two instructions, which seem to sit
on different sides of a 32- or even 64-byte boundary. So again, the
regression is likely due to micro-architecture quirks, and the change is
in fact a win across the board.

Updates golang#59120

Change-Id: Ibbf64988c9d06f7c1d359480a1d6aecfa2c25b65
(cherry picked from commit 03e1790d8d84c3955b0294992f1d7b6b7693ed3f)
xen0n added a commit to xen0n/go that referenced this issue May 1, 2023
… for loong64

For the SubFromLen64 codegen test case to work as intended, we need
to fold c-(-(x-d)) into x+(c-d).

Still, some instances of LeadingZeros are not yet optimized into single
CLZ instructions (in fact, the LeadingZeros micro-benchmarks are still
compiled with redundant adds/subs of 64, due to interference from loop
optimizations before lowering), but the perf numbers indicate it's not
that bad after all.

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │   before    │                after                │
               │   sec/op    │   sec/op     vs base                │
LeadingZeros     3.675n ± 0%   1.545n ± 1%  -57.96% (p=0.000 n=10)
LeadingZeros8    2.001n ± 0%   1.868n ± 0%   -6.62% (p=0.000 n=10)
LeadingZeros16   3.144n ± 0%   1.864n ± 1%  -40.71% (p=0.000 n=10)
LeadingZeros32   4.265n ± 1%   1.653n ± 1%  -61.24% (p=0.000 n=10)
LeadingZeros64   3.962n ± 0%   1.539n ± 0%  -61.16% (p=0.000 n=10)
geomean          3.299n        1.688n       -48.84%

go1 benchmark results on the same box:

goos: linux
goarch: loong64
pkg: test/bench/go1
                      │  CL 483355  │              this CL               │
                      │   sec/op    │   sec/op     vs base               │
BinaryTree17             13.75 ± 2%    13.70 ± 2%       ~ (p=0.579 n=10)
Fannkuch11               3.650 ± 0%    3.415 ± 0%  -6.46% (p=0.000 n=10)
FmtFprintfEmpty         94.45n ± 0%   94.98n ± 0%  +0.56% (p=0.000 n=10)
FmtFprintfString        155.2n ± 0%   151.1n ± 0%  -2.61% (p=0.000 n=10)
FmtFprintfInt           154.4n ± 0%   153.6n ± 0%  -0.52% (p=0.000 n=10)
FmtFprintfIntInt        237.1n ± 0%   234.7n ± 0%  -0.99% (p=0.000 n=10)
FmtFprintfPrefixedInt   312.8n ± 0%   314.2n ± 0%  +0.45% (p=0.000 n=10)
FmtFprintfFloat         390.5n ± 0%   402.1n ± 0%  +2.97% (p=0.000 n=10)
FmtManyArgs             905.0n ± 0%   918.6n ± 0%  +1.51% (p=0.000 n=10)
GobDecode               14.93m ± 1%   14.98m ± 1%  +0.33% (p=0.015 n=10)
GobEncode               17.33m ± 0%   17.26m ± 1%  -0.39% (p=0.023 n=10)
Gzip                    404.3m ± 0%   404.6m ± 0%  +0.08% (p=0.000 n=10)
Gunzip                  80.92m ± 0%   80.97m ± 0%  +0.06% (p=0.000 n=10)
HTTPClientServer        86.14µ ± 0%   84.39µ ± 0%  -2.03% (p=0.000 n=10)
JSONEncode              18.49m ± 0%   18.50m ± 0%       ~ (p=0.436 n=10)
JSONDecode              77.34m ± 1%   76.26m ± 1%  -1.40% (p=0.000 n=10)
Mandelbrot200           6.521m ± 0%   6.508m ± 0%       ~ (p=0.138 n=10)
GoParse                 7.324m ± 1%   7.413m ± 1%  +1.22% (p=0.005 n=10)
RegexpMatchEasy0_32     134.6n ± 0%   134.6n ± 0%       ~ (p=0.195 n=10)
RegexpMatchEasy0_1K     1.365µ ± 0%   1.366µ ± 0%  +0.07% (p=0.038 n=10)
RegexpMatchEasy1_32     164.1n ± 0%   164.1n ± 0%       ~ (p=0.230 n=10)
RegexpMatchEasy1_1K     1.492µ ± 0%   1.492µ ± 0%       ~ (p=0.211 n=10)
RegexpMatchMedium_32    1.404µ ± 0%   1.403µ ± 0%  -0.07% (p=0.000 n=10)
RegexpMatchMedium_1K    41.05µ ± 0%   41.04µ ± 0%  -0.04% (p=0.000 n=10)
RegexpMatchHard_32      2.072µ ± 0%   2.071µ ± 0%  -0.05% (p=0.000 n=10)
RegexpMatchHard_1K      60.89µ ± 0%   60.87µ ± 0%  -0.04% (p=0.000 n=10)
Revcomp                  1.199 ± 1%    1.200 ± 0%       ~ (p=0.481 n=10)
Template                112.3m ± 2%   112.9m ± 2%       ~ (p=0.353 n=10)
TimeParse               414.2n ± 1%   412.5n ± 0%  -0.40% (p=0.000 n=10)
TimeFormat              496.9n ± 0%   496.6n ± 0%       ~ (p=0.341 n=10)
geomean                 101.0µ        100.7µ       -0.26%

                     │  CL 483355   │                this CL                │
                     │     B/s      │     B/s       vs base                 │
GobDecode              49.02Mi ± 1%   48.87Mi ± 1%  -0.32% (p=0.014 n=10)
GobEncode              42.23Mi ± 0%   42.40Mi ± 1%  +0.40% (p=0.022 n=10)
Gzip                   45.77Mi ± 0%   45.73Mi ± 0%  -0.07% (p=0.000 n=10)
Gunzip                 228.7Mi ± 0%   228.6Mi ± 0%  -0.06% (p=0.000 n=10)
JSONEncode             100.1Mi ± 0%   100.0Mi ± 0%       ~ (p=0.470 n=10)
JSONDecode             23.93Mi ± 1%   24.27Mi ± 1%  +1.43% (p=0.000 n=10)
GoParse                7.544Mi ± 1%   7.448Mi ± 1%  -1.26% (p=0.005 n=10)
RegexpMatchEasy0_32    226.8Mi ± 0%   226.7Mi ± 0%  -0.06% (p=0.001 n=10)
RegexpMatchEasy0_1K    715.7Mi ± 0%   715.1Mi ± 0%  -0.08% (p=0.022 n=10)
RegexpMatchEasy1_32    186.0Mi ± 0%   186.0Mi ± 0%       ~ (p=0.493 n=10)
RegexpMatchEasy1_1K    654.3Mi ± 0%   654.6Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchMedium_32   21.74Mi ± 0%   21.74Mi ± 0%  +0.02% (p=0.022 n=10)
RegexpMatchMedium_1K   23.78Mi ± 0%   23.79Mi ± 0%  +0.04% (p=0.000 n=10)
RegexpMatchHard_32     14.72Mi ± 0%   14.73Mi ± 0%  +0.06% (p=0.000 n=10)
RegexpMatchHard_1K     16.04Mi ± 0%   16.04Mi ± 0%       ~ (p=1.000 n=10) ¹
Revcomp                202.2Mi ± 1%   202.0Mi ± 0%       ~ (p=0.469 n=10)
Template               16.48Mi ± 2%   16.38Mi ± 2%       ~ (p=0.342 n=10)
geomean                62.23Mi        62.21Mi       -0.04%
¹ all samples are equal

In this case though, all significant perf changes are likely due to
micro-architectural quirks.

Updates golang#59120

Change-Id: Icc8f7d8e79c6168aae634f5c36f044f3fd034d89
(cherry picked from commit 80a298243a07e982573e14723d8133fc5be45065)
xen0n added a commit to xen0n/go that referenced this issue May 1, 2023
…nsics for loong64

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │    before    │                after                 │
               │    sec/op    │    sec/op     vs base                │
ReverseBytes     3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16   0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32   1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64   2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean           1.882n        0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for
micro-architecture-related fluctuations.

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac
(cherry picked from commit 4e0bacc50e09ea7defbf1e769b6ee5467e82e881)
xen0n added a commit to xen0n/go that referenced this issue May 1, 2023
Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
          │    before    │                after                 │
          │    sec/op    │    sec/op     vs base                │
Reverse     4.2280n ± 0%   0.8029n ± 0%  -81.01% (p=0.000 n=10)
Reverse8    1.0050n ± 0%   0.8029n ± 0%  -20.11% (p=0.000 n=10)
Reverse16   1.9600n ± 0%   0.8029n ± 0%  -59.04% (p=0.000 n=10)
Reverse32   4.0205n ± 0%   0.8029n ± 0%  -80.03% (p=0.000 n=10)
Reverse64   4.0360n ± 0%   0.8029n ± 0%  -80.11% (p=0.000 n=10)
geomean      2.668n        0.8029n       -69.90%

The operation seems to be used nowhere in the tree except
compress/flate, where a very slight improvement (time geomean -0.16%,
throughput geomean +0.16%) was observed with the change applied.
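
For context, a minimal sketch of the kind of bit reversal a DEFLATE
encoder needs: Huffman codes are emitted starting from their most
significant bit, so a code of a given length is reversed before being
packed into the LSB-first bit stream. The helper name reverseLowBits is
hypothetical, and this is only an approximation of what compress/flate
does internally.

```go
package main

import (
	"fmt"
	"math/bits"
)

// reverseLowBits reverses the low n bits of code (1 <= n <= 16).
// Shifting the code to the top of the 16-bit word first means that
// bits.Reverse16 leaves the reversed code in the low n bits.
func reverseLowBits(code uint16, n uint) uint16 {
	return bits.Reverse16(code << (16 - n))
}

func main() {
	// The 3-bit code 110 becomes 011 once reversed.
	fmt.Printf("%03b\n", reverseLowBits(0b110, 3))
}
```

With the intrinsic in place, a bits.Reverse16 call like this one is
expected to compile to a short hardware bit-reversal sequence rather
than the generic shift-and-mask code.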

Updates golang#59120

Change-Id: Ie1b446386655e0bb6808e435257293c30420626e
(cherry picked from commit 7e6c4dce73a400b8928207c66442eaf9fcd535fa)
xen0n added a commit to xen0n/go that referenced this issue May 1, 2023
…nsics for loong64

Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
               │    before    │                after                 │
               │    sec/op    │    sec/op     vs base                │
ReverseBytes     3.0130n ± 0%   0.6517n ± 2%  -78.37% (p=0.000 n=10)
ReverseBytes16   0.9027n ± 0%   0.6526n ± 2%  -27.71% (p=0.000 n=10)
ReverseBytes32   1.7040n ± 0%   0.6511n ± 1%  -61.79% (p=0.000 n=10)
ReverseBytes64   2.7080n ± 0%   0.6499n ± 1%  -76.00% (p=0.000 n=10)
geomean           1.882n        0.6513n       -65.40%

Go1 benchmark results indicate no meaningful change except for
micro-architecture-related fluctuations.

Updates golang#59120

Change-Id: I39c1edbd7363f454ad1e848a25abeced722b16ac

[xen0n: removed Bswap16 because go1.20 doesn't support this op]
(cherry picked from commit 4e0bacc50e09ea7defbf1e769b6ee5467e82e881)
xen0n added a commit to xen0n/go that referenced this issue May 1, 2023
Micro-benchmark results on Loongson 3A5000:

goos: linux
goarch: loong64
pkg: math/bits
          │    before    │                after                 │
          │    sec/op    │    sec/op     vs base                │
Reverse     4.2280n ± 0%   0.8029n ± 0%  -81.01% (p=0.000 n=10)
Reverse8    1.0050n ± 0%   0.8029n ± 0%  -20.11% (p=0.000 n=10)
Reverse16   1.9600n ± 0%   0.8029n ± 0%  -59.04% (p=0.000 n=10)
Reverse32   4.0205n ± 0%   0.8029n ± 0%  -80.03% (p=0.000 n=10)
Reverse64   4.0360n ± 0%   0.8029n ± 0%  -80.11% (p=0.000 n=10)
geomean      2.668n        0.8029n       -69.90%

The operation seems to be used nowhere in the tree except
compress/flate, where a very slight improvement (time geomean -0.16%,
throughput geomean +0.16%) was observed with the change applied.

Updates golang#59120

Change-Id: Ie1b446386655e0bb6808e435257293c30420626e
(cherry picked from commit 7e6c4dce73a400b8928207c66442eaf9fcd535fa)
@gopherbot

Change https://go.dev/cl/577515 mentions this issue: cmd/compile/internal: intrinsify publicationBarrier on loong64

@gopherbot

Change https://go.dev/cl/580280 mentions this issue: cmd/compile, math: make math.Ceil/Floor/RoundToEven/Trunc/Abs/CopySign intrinsics on loong64

@gopherbot

Change https://go.dev/cl/580283 mentions this issue: cmd/compile: intrinsics for math.min/max and implement float min/max in hardware on loong64
