Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: poor register allocator behavior in compression code #16122

Open
flanglet opened this issue Jun 20, 2016 · 17 comments
Open

cmd/compile: poor register allocator behavior in compression code #16122

flanglet opened this issue Jun 20, 2016 · 17 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@flanglet
Copy link

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    go1.6 windows/amd64 and go1.7beta1 windows/amd64
  2. What operating system and processor architecture are you using (go env)?
    set GOARCH=amd64
    set GOBIN=
    set GOEXE=.exe
    set GOHOSTARCH=amd64
    set GOHOSTOS=windows
    set GOOS=windows
    set GOPATH=E:\Users\fred\Documents\Prog\kanzi\go
    set GORACE=
    set GOROOT=E:\Program Files\go
    set GOTOOLDIR=E:\Program Files\go\pkg\tool\windows_amd64
    set GO15VENDOREXPERIMENT=1
    set CC=gcc
    set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0
    set CXX=g++
    set CGO_ENABLED=1
  3. What did you do?
    I ran "go build TestZRLT.go" then "TestZRLT.exe" both for Go 1.6 and 1.7 beta1
    The source code is very simple: https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/test/TestZRLT.go.
    It runs a correctness and a performance test tor the Zero Run Length Transform:
    https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/function/ZRLT.go.
  4. What did you expect to see?
    I expected to see no performance regression from 1.6 to 1.7beta1
  5. What did you see instead?
    ZRLT encoding is much faster with 1.7beta1 but decoding is much slower.

Output for 1.6:
Speed test
Iterations: 50000

ZRLT encoding [ms]: 10694
Throughput [MB/s]: 222
ZRLT decoding [ms]: 7419
Throughput [MB/s]: 321

ZRLT encoding [ms]: 10753
Throughput [MB/s]: 221
ZRLT decoding [ms]: 7472
Throughput [MB/s]: 319

ZRLT encoding [ms]: 10724
Throughput [MB/s]: 222
ZRLT decoding [ms]: 7393
Throughput [MB/s]: 322

Output for 1.7beta1:
Speed test
Iterations: 50000

ZRLT encoding [ms]: 6834
Throughput [MB/s]: 348
ZRLT decoding [ms]: 11560
Throughput [MB/s]: 206

ZRLT encoding [ms]: 6828
Throughput [MB/s]: 349
ZRLT decoding [ms]: 11589
Throughput [MB/s]: 205

ZRLT encoding [ms]: 6790
Throughput [MB/s]: 351
ZRLT decoding [ms]: 11558
Throughput [MB/s]: 206

I narrowed down the issue to the run length decoding loop:

for val <= 1 {
    runLength = (runLength << 1) | int(val)
    srcIdx++

    if srcIdx >= srcEnd {
            break
    }

    val = src[srcIdx]
}

If I replace 'for val <= 1 {' with 'for val&1 == val {', the decoding becomes much faster (although not as fast as with Go 1.6)

Output for 1.7beta1 with code change:

Speed test
Iterations: 50000

ZRLT encoding [ms]: 6800
Throughput [MB/s]: 350
ZRLT decoding [ms]: 7669
Throughput [MB/s]: 310

ZRLT encoding [ms]: 6813
Throughput [MB/s]: 349
ZRLT decoding [ms]: 7689
Throughput [MB/s]: 310

ZRLT encoding [ms]: 6775
Throughput [MB/s]: 351
ZRLT decoding [ms]: 7662
Throughput [MB/s]: 311

@adg adg changed the title Go 1.7beta1 performance drop vs. Go 1.6 in compression code cmd/compile: Go 1.7beta1 performance drop vs. Go 1.6 in compression code Jun 20, 2016
@adg
Copy link
Contributor

adg commented Jun 20, 2016

cc @randall77 @josharian @dr2chase

@adg adg added this to the Go1.7Maybe milestone Jun 20, 2016
@randall77
Copy link
Contributor

I'm not seeing what you are seeing.
The performance of 1.6.2 and tip are nearly identical for decode (tip is quite a bit better for encode).
I also don't see any variance between val <= 1 and val&1 == val.

That said, the code for the inner loop in decode isn't very good. The register allocator uses more copies than are necessary.

The indexing produces code like:

  y = x + 1
  load from p + x + 1

The latter is a single complex-addressing mode load, and the x+1 was folded into it.
In between those two instructions, both y and x are live. So the y=x+1 can't be done in place and a bunch of moves need to fix things up afterward.

I'll try to fix this up during the 1.8 cycle.

@randall77 randall77 modified the milestones: Go1.8, Go1.7Maybe Jun 21, 2016
@randall77 randall77 self-assigned this Jun 21, 2016
@flanglet
Copy link
Author

OK. I suppose the issue has been addressed between 1.7.beta1 and tip then. What remains is an optimization.

@quentinmit quentinmit changed the title cmd/compile: Go 1.7beta1 performance drop vs. Go 1.6 in compression code cmd/compile: poor register allocator behavior in compression code Jun 30, 2016
@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Jun 30, 2016
@rsc
Copy link
Contributor

rsc commented Oct 21, 2016

@randall77, anything left here for Go 1.8?

@randall77
Copy link
Contributor

No, I did not get to fixing anything else here. Punting to 1.9.

@randall77 randall77 modified the milestones: Go1.9, Go1.8 Oct 21, 2016
@josharian
Copy link
Contributor

I grabbed commit 3872c76b9b410c44428e74b2065f3e2291cb8095 of github.com/flanglet/kanzi based on the date of the commit and of this issue, extracted all relevant code into a single file, and updated it to use regular test/benchmark form: https://gist.github.com/josharian/e0bc6e238d4914a44289b44bc4ae3640.

This shows a steady regression over time (tip at b53acd8):

name \ time/op  17          18          tip
Speed/encode-8  314µs ± 1%  406µs ± 1%  438µs ± 2%
Speed/decode-8  314µs ± 1%  333µs ± 1%  356µs ± 1%

@josharian
Copy link
Contributor

josharian commented May 16, 2017

CL 43491 applied to tip helps a fair amount:

name            old time/op  new time/op  delta
Speed/encode-8   438µs ± 2%   374µs ± 0%  -14.71%  (p=0.000 n=10+9)
Speed/decode-8   356µs ± 1%   320µs ± 1%   -9.93%  (p=0.000 n=10+10)

That brings it back below 1.8 levels, although still not as good as 1.7.

@josharian
Copy link
Contributor

CL 43491 is in, so we're back below the 1.8 level. Changing to milestone 1.10 for the original goal of getting back to 1.7 levels.

@josharian josharian modified the milestones: Go1.10, Go1.9 May 18, 2017
@dr2chase
Copy link
Contributor

I spent some time (intermission at a concert) noodling with pictures of loops, and it seems to me that there's families of loops where we can arrange to have "optimal" answers, and we should probably arrange to do that in the block ordering code and then see what's left for heuristics.

For example, if there is a block in the loop that dominates all the exits and itself is conditional with an exit successor, rotate that block to the bottom, and its in-loop successor to the beginning of the run of blocks for the loop.

If we have a diamond with an exit on one of the two arms, A -> (B,C), B -> (X,D), C -> D, D -> A, order it C D A B.

In "layout", we'd do this by detecting transition into a loop, decoding the loop type, and inferring the best start block if it fits one of our models, and I think that the rest just falls out -- i.e., if you start with C, then the only logical successor is D then A, then B. This might subsume the loop rotation phase.

@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017
@flanglet
Copy link
Author

go1.10beta1 shows a 15% performance regression on this code vs.
go 1.9.2 (see bench at [https://github.com/flanglet/kanzi/blob/master/go/src/kanzi/benchmark/Functions_test.go]).

go1.9.2 windows/amd64

go test Functions_test.go -bench ZRLT -benchtime 5s -cpuprofile=r:\cpu.prof

BenchmarkZRLT-8            10000           1032659 ns/op
PASS
ok      command-line-arguments  10.680s

Duration: 10.60s, Total samples = 10.42s (98.28%)
Showing nodes accounting for 10.41s, 99.90% of 10.42s total
Dropped 2 nodes (cum <= 0.05s)
      flat  flat%   sum%        cum   cum%
     6.39s 61.32% 61.32%      6.39s 61.32%  kanzi/function.(*ZRLT).Forward 
     4.02s 38.58% 99.90%      4.02s 38.58%  kanzi/function.(*ZRLT).Inverse  <===
         0     0% 99.90%     10.41s 99.90%  command-line-arguments.BenchmarkZRLT
         0     0% 99.90%     10.41s 99.90%  testing.(*B).launch 
         0     0% 99.90%     10.41s 99.90%  testing.(*B).runN

go1.10beta1 windows/amd64

go test Functions_test.go -bench ZRLT -benchtime 5s -cpuprofile=r:\cpu.prof

BenchmarkZRLT-8            10000           1185367 ns/op
PASS
ok      command-line-arguments  12.179s

Duration: 12.10s, Total samples = 12.61s (104.20%)

Showing nodes accounting for 12.27s, 97.30% of 12.61s total
Dropped 46 nodes (cum <= 0.06s)
Showing top 10 nodes out of 24 
      flat  flat%   sum%        cum   cum%
     6.36s 50.44% 50.44%      6.36s 50.44%  kanzi/function.(*ZRLT).Forward
     5.58s 44.25% 94.69%      5.58s 44.25%  kanzi/function.(*ZRLT).Inverse   <===
     0.08s  0.63% 95.32%      0.08s  0.63%  runtime.stdcall1
     0.06s  0.48% 95.80%      0.13s  1.03%  runtime/pprof.(*profMap).lookup
     0.04s  0.32% 96.11%      0.07s  0.56%  runtime.mapaccess1_fast64
     0.04s  0.32% 96.43%      0.13s  1.03%  runtime.ready
     0.04s  0.32% 96.75%      0.17s  1.35%  runtime/pprof.(*profileBuilder).addC
PUData
     0.03s  0.24% 96.99%      0.07s  0.56%  runtime/pprof.readProfile
     0.02s  0.16% 97.15%      0.08s  0.63%  runtime.findrunnable
     0.02s  0.16% 97.30%      0.12s  0.95%  runtime.park_m

Problematic lines: 166 and 194


ROUTINE ======================== kanzi/function.(*ZRLT).Inverse 
     5.58s      5.58s (flat, cum) 44.25% of Total
         .          .    147:   runLength := 1
         .          .    148:   srcIdx, dstIdx := 0, 0
         .          .    149:   var err error
         .          .    150:
         .          .    151:   if srcIdx < srcEnd {
     410ms      410ms    152:           for dstIdx < dstEnd {
      90ms       90ms    153:                   if runLength > 1 {
     260ms      260ms    154:                           runLength--
      60ms       60ms    155:                           dst[dstIdx] = 0
         .          .    156:                           dstIdx++
         .          .    157:                           continue
         .          .    158:                   }
         .          .    159:
     340ms      340ms    160:                   if srcIdx >= srcEnd {
         .          .    161:                           break
         .          .    162:                   }
         .          .    163:
     770ms      770ms    164:                   val := src[srcIdx]
         .          .    165:
     1.14s      1.14s    166:                   if val <= 1 {
         .          .    167:                           // Generate the run leng
th bit by bit (but force MSB)
         .          .    168:                           runLength = 1
         .          .    169:
     100ms      100ms    170:                           for val <= 1 {
      50ms       50ms    171:                                   runLength = (run
Length << 1) | int(val)
      40ms       40ms    172:                                   srcIdx++
         .          .    173:
         .          .    174:                                   if srcIdx >= src
End {
         .          .    175:                                           break
         .          .    176:                                   }
         .          .    177:
      30ms       30ms    178:                                   val = src[srcIdx
]
         .          .    179:                           }
         .          .    180:
      50ms       50ms    181:                           continue
         .          .    182:                   }
         .          .    183:
         .          .    184:                   // Regular data processing
     340ms      340ms    185:                   if val == 0xFF {
         .          .    186:                           srcIdx++
         .          .    187:
         .          .    188:                           if srcIdx >= srcEnd {
         .          .    189:                                   break
         .          .    190:                           }
         .          .    191:
         .          .    192:                           dst[dstIdx] = 0xFE + src
[srcIdx]
         .          .    193:                   } else {
     1.62s      1.62s    194:                           dst[dstIdx] = val - 1
         .          .    195:                   }
         .          .    196:
         .          .    197:                   srcIdx++
     280ms      280ms    198:                   dstIdx++
         .          .    199:           }
         .          .    200:   }
         .          .    201:
         .          .    202:   // If runLength is not 1, add trailing 0s
         .          .    203:   end := dstIdx + runLength - 1

@randall77
Copy link
Contributor

@flanglet I don't see the same performance issues you do.
(I'm on linux/amd64, in case that makes a difference.)

$ ~/go1.9.2/bin/go test -test.bench=ZRLT -test.count=10 > log192
$ ~/goTip/bin/go test -test.bench=ZRLT -test.count=10 > logTip
$ benchstat log192 logTip 
name  old time/op  new time/op  delta
ZRLT   541µs ± 9%   547µs ± 9%   ~     (p=0.393 n=10+10)

There's a lot of noise in this benchmark; it varies a lot from run to run. That makes it hard to make any definitive statements.

@flanglet
Copy link
Author

I don't see the same performance issues you do.
(I'm on linux/amd64, in case that makes a difference.)

I also do not see much difference on a Linux machine. But the performance drop is reproducible every time the benchmark runs on the Windows 7 machine.

There's a lot of noise in this benchmark; it varies a lot from run to run. That makes it hard to make any definitive statements.

Increase the test count. I will work on reducing the noise.

@randall77
Copy link
Contributor

@flanglet : Then I don't understand what is going on. The register allocator is identical on linux and windows, so this may be an entirely separate problem from the one that started this issue.
Are your linux and windows boxes two separate machines? Might it be the CPU, not the OS?

@flanglet
Copy link
Author

The Windows 7 machine has an Intel Core7 2600 (Sandy bridge) CPU.
The Linux machine has an Intel Core7 7700K (Kaby Lake) CPU.
So. yeah, it might be CPU related.

@randall77
Copy link
Contributor

@TocarIP , might be up your alley.

@TocarIP
Copy link
Contributor

TocarIP commented Dec 12, 2017

Couldn't reproduce on Sandy Bridge
model name : Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
I see:
ZRLT-24 935µs ± 0% 931µs ± 1% -0.40% (p=0.009 n=10+10)

@flanglet
Copy link
Author

Well, that is weird. It happens every time on my machine.
FWIW, I downloaded the go package from https://golang.org/dl/#go1.10beta1. Maybe you tested a slightly different version ?
Anyway, I will wait for newer releases and keep on testing this case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests

10 participants