cmd/compile: eliminate some bounds checks #5364

bradfitz · 2013-04-27T23:27:32Z

I was just surprised to see a container/heap Swap call show up in a profile and looked
to see what code was generated.

It seems like the redundant bounds checks could be eliminated.

And it seems like some redundant code to call runtime.panicindex is generated too.

$ cat x.go
package main
func swap(s []int, i, j int) { s[i], s[j] = s[j], s[i] }
func main() { swap([]int{0, 1, 2, 3}, 1, 2) }

$ go tool 6g -S x.go 
--- prog list "swap" ---
0000 (x.go:2) TEXT    swap+0(SB),$0-40
0001 (x.go:2) MOVQ    i+24(FP),DI
0002 (x.go:2) MOVQ    j+32(FP),SI
0003 (x.go:2) MOVQ    s+0(FP),CX
0004 (x.go:2) MOVQ    s+8(FP),AX
0005 (x.go:2) LOCALS  ,$0
0006 (x.go:2) TYPE    s+0(FP){[]int},$24
0007 (x.go:2) TYPE    i+24(FP){int},$8
0008 (x.go:2) TYPE    j+32(FP){int},$8
0009 (x.go:2) MOVQ    CX,BX
0010 (x.go:2) MOVQ    DI,BP
0011 (x.go:2) CMPQ    DI,AX
0012 (x.go:2) JCS     $1,15
0013 (x.go:2) CALL    ,runtime.panicindex+0(SB)
0014 (x.go:2) UNDEF   ,
0015 (x.go:2) LEAQ    (BX)(BP*8),BX
0016 (x.go:2) MOVQ    (BX),DX
0017 (x.go:2) MOVQ    CX,BX
0018 (x.go:2) MOVQ    DI,BP
0019 (x.go:2) CMPQ    DI,AX
0020 (x.go:2) JCS     $1,23
0021 (x.go:2) CALL    ,runtime.panicindex+0(SB)
0022 (x.go:2) UNDEF   ,
0023 (x.go:2) LEAQ    (BX)(BP*8),BX
0024 (x.go:2) MOVQ    CX,BP
0025 (x.go:2) MOVQ    SI,R8
0026 (x.go:2) CMPQ    SI,AX
0027 (x.go:2) JCS     $1,30
0028 (x.go:2) CALL    ,runtime.panicindex+0(SB)
0029 (x.go:2) UNDEF   ,
0030 (x.go:2) LEAQ    (BP)(R8*8),BP
0031 (x.go:2) MOVQ    (BP),R8
0032 (x.go:2) MOVQ    R8,(BX)
0033 (x.go:2) MOVQ    CX,BX
0034 (x.go:2) MOVQ    SI,BP
0035 (x.go:2) CMPQ    SI,AX
0036 (x.go:2) JCS     $1,39
0037 (x.go:2) CALL    ,runtime.panicindex+0(SB)
0038 (x.go:2) UNDEF   ,
0039 (x.go:2) LEAQ    (BX)(BP*8),BX
0040 (x.go:2) MOVQ    DX,(BX)
0041 (x.go:2) RET     ,

$ go build -o /tmp/x --gcflags=-l x.go
$ objdump -D /tmp/x | less

0000000000400c00 <main.swap>:
  400c00:       64 48 8b 0c 25 f0 ff    mov    %fs:0xfffffffffffffff0,%rcx
  400c07:       ff ff 
  400c09:       48 3b 21                cmp    (%rcx),%rsp
  400c0c:       77 05                   ja     400c13 <main.swap+0x13>
  400c0e:       e8 dd 7a 01 00          callq  4186f0 <runtime.morestack40>
  400c13:       48 8b 7c 24 20          mov    0x20(%rsp),%rdi
  400c18:       48 8b 74 24 28          mov    0x28(%rsp),%rsi
  400c1d:       48 8b 4c 24 08          mov    0x8(%rsp),%rcx
  400c22:       48 8b 44 24 10          mov    0x10(%rsp),%rax
  400c27:       48 89 cb                mov    %rcx,%rbx
  400c2a:       48 89 fd                mov    %rdi,%rbp
  400c2d:       48 39 c7                cmp    %rax,%rdi
  400c30:       73 55                   jae    400c87 <main.swap+0x87>
  400c32:       48 8d 1c eb             lea    (%rbx,%rbp,8),%rbx
  400c36:       48 8b 13                mov    (%rbx),%rdx
  400c39:       48 89 cb                mov    %rcx,%rbx
  400c3c:       48 89 fd                mov    %rdi,%rbp
  400c3f:       48 39 c7                cmp    %rax,%rdi
  400c42:       73 3c                   jae    400c80 <main.swap+0x80>
  400c44:       48 8d 1c eb             lea    (%rbx,%rbp,8),%rbx
  400c48:       48 89 cd                mov    %rcx,%rbp
  400c4b:       49 89 f0                mov    %rsi,%r8
  400c4e:       48 39 c6                cmp    %rax,%rsi
  400c51:       73 26                   jae    400c79 <main.swap+0x79>
  400c53:       4a 8d 6c c5 00          lea    0x0(%rbp,%r8,8),%rbp
  400c58:       4c 8b 45 00             mov    0x0(%rbp),%r8
  400c5c:       4c 89 03                mov    %r8,(%rbx)
  400c5f:       48 89 cb                mov    %rcx,%rbx
  400c62:       48 89 f5                mov    %rsi,%rbp
  400c65:       48 39 c6                cmp    %rax,%rsi
  400c68:       73 08                   jae    400c72 <main.swap+0x72>
  400c6a:       48 8d 1c eb             lea    (%rbx,%rbp,8),%rbx
  400c6e:       48 89 13                mov    %rdx,(%rbx)
  400c71:       c3                      retq   
  400c72:       e8 a9 bd 00 00          callq  40ca20 <runtime.panicindex>
  400c77:       0f 0b                   ud2    
  400c79:       e8 a2 bd 00 00          callq  40ca20 <runtime.panicindex>
  400c7e:       0f 0b                   ud2    
  400c80:       e8 9b bd 00 00          callq  40ca20 <runtime.panicindex>
  400c85:       0f 0b                   ud2    
  400c87:       e8 94 bd 00 00          callq  40ca20 <runtime.panicindex>
  400c8c:       0f 0b                   ud2    
        ...

remyoudompheng · 2013-05-20T22:11:18Z

Comment 1:

The calls to runtime.panicindex are annotated by the line number of the index
expressions so that they show nicely in stack traces.
Panics on the same line could be merged into a single one but I think being able to
differentiate them by their PC is useful for debugging.
The subject of redundant bounds check has already been raised a few times, I'm surprised
I can't find an existing issue about it.

DanielMorsing · 2013-05-21T08:52:31Z

Comment 2:

To remove redundant bounds check, we need some way of doing data-flow analysis before we
walk the functions. That would also make a whole class of optimizations open to us, like
caching repeated pointer indirects in registers when it can be done safely.

rsc · 2013-07-30T13:45:30Z

Comment 3:

Labels changed: added go1.3.

robpike · 2013-08-20T23:36:16Z

Comment 4:

Labels changed: removed go1.3.

bradfitz · 2013-11-19T05:43:36Z

Comment 5:

Another one, mostly the same as above, but with constants:
package main
func main() {
    x := make([]byte, 10)
    foo(x)
}
func foo(x []byte) {
    _ = x[5]
    _ = x[4]
    _ = x[3]
    _ = x[2]
    _ = x[1]
    _ = x[0]
}
We only need one check in foo. If 5 works, the rest will too.

rsc · 2013-12-04T01:43:48Z

Comment 6:

Labels changed: added repo-main.

rsc · 2014-03-03T20:46:14Z

Comment 7:

Adding Release=None to all Priority=Someday bugs.

Labels changed: added release-none.

klauspost · 2015-11-18T12:41:36Z

I saw in the Go 1.6 planning that @dr2chase might want to look at bounds check elimination. I think work is still on-going. It seems like a perfect fit for the SSA compiler.

For fun I have created all the cases I can think of, and added it as a boundscheck.go gist. It can be copied directly to the playground.

My main thought is that with a SSA tree it would be possible to attach a possible "minimum" and "maximum" value to each slice and variable. For values that we don't know, or escape, we set it to the maximum/minimum value of the type.

btracey · 2015-11-18T14:28:48Z

There are a bunch more involving more complicated expressions. Some examples

if len(a) < m*n {
    return 
}
for i := 0: i < m; i++ {
    for j := 0 ; j <n ;j++{
       _ =  a[i*m + j]
   }
}

Or the more common (and slightly more complicated)

if n < stride {
    return
}
if len(a) < m*stride {
    return 
}
for i := 0: i < m; i++ {
    for j := 0 ; j <n ;j++{
       _ =  a[i*stride + j]
   }
}

And there are lots involving multiple slices

if len(a) != len(b) {
    return
}
var dot float64
for i, v := range a {
    dot += b[i] * v
}

dr2chase · 2015-11-18T15:37:53Z

It's slipped to 1.7, the analysis ought to exist in buggy form later this week or early next week. In the short run I'm not targeting multiplication, but I am trying to get the following cases:

for i := 0 ; i + K-1 < len(a); i+= K { ... a[i] ... a[i+K-1] }

if len(a) > len(b) {
  t := a
  a = b
  b = t
}
for i := 0; i < len(a); i++ { ... a[i] ... b[i] ... }

The multiplication expression is harder than you might think because of integer overflow.
Consider

if len(a) < m*n {
    return 
}
for i := 0: i < m; i++ {
    for j := 0 ; j <n ;j++{
       _ =  a[i*m + j]
   }
}

Suppose m = 4, n = 1<<62.
Then m*n == 0, len(a) is greater than 0, and therefore we can remove bounds checking, right? :-)

Just remember, we design languages to let integers overflow and wrap because it's "more efficient".

klauspost · 2015-11-18T21:16:59Z

In the long term some way of indicating a guaranteed slice length for the loop - for instance by reslicing, eg. a = a[:16], or using arrays.

Combined that with some way of indicating a maximum lookup value v = a[x&15] or v = a[byte(x)] would be a win for the compression packages, since they have a lot of these types of cases.

dr2chase · 2015-11-19T18:23:05Z

By-the-way, this idiom ought to work somewhat better (for analysis purposes):

if m < 0 {
  return
} // this might be necessary; easier analysis should suffice if m>=0 is known.
n := len(a)/m
// m*n <= len(a), no overflow possible.
for i := 0: i < m; i++ {
    for j := 0 ; j <n ;j++{
       _ =  a[i*m + j]
   }
}

I'll put this in my queue.

btracey · 2015-11-19T18:56:53Z

Our actual use case involves matrix views, and so it can't be as simple as the above code. Our real code is closer to

if m < 0 {
    panic("negative rows")
}
if n < 0 {
   panic("negative columns")
}
if stride < n {
    panic("bad stride")
}
if (m-1)*stride + n > len(a) {
     panic("out of bounds error")
}
for i := 0; i < m; i++ {
    for j := 0; j < n; j++ {
      _ = a[i*stride +j]
    }
}

note in particular that in general stride != m.

Honestly though, I think the highest impact you could have on optimizing gonum code is (code snippets assume correct bounds checking has been done above)

Dot product

var dot float64
for i, v := range x {
   dot += y[i]
}

Scale

for i := range x {
    x[i] *= beta
}

"Axpy", i.e. y += a*x

for i, v := range x {
    y[i] += a*v
}

All of these also come in the "strided" form you mention above, where instead i is incremented by a constant value. If these three expressions could have bounds checks removed and SIMD, there would be a huge speedup in the BLAS benchmarks.

If you're interested for testing purposes, we have a bunch of benchmarks at godoc.org/github.com/gonum/blas/native, and godoc.org/github.com/gonum/matrix/mat64 .

Thanks for thinking about compiler optimizations!

btracey · 2015-11-19T18:58:33Z

Oh, to add, those benchmarks at present use assembly for the three cases I mentioned above. The pure go versions can be run with the -noasm tag.

dr2chase · 2015-11-23T04:20:40Z

Same problem with potential overflow here:

if (m-1)*stride + n > len(a) {
     panic("out of bounds error")
}
for i := 0; i < m; i++ {
    for j := 0; j < n; j++ {
      _ = a[i*stride +j]
    }
}

because

m:=2; n := 0x7fffFFFFffffFFFF; stride :=1,
(m-1)*stride + n  < len(a) // == -0x8000000000000000

and the inner loop can run and run and run.

btracey · 2015-11-23T06:02:37Z

I see. If the check is

maxIdx := (m-1)*stride + n
if  maxIdx < 0 || maxIdx > len(a) {
     panic("out of bounds error")
}
for i := 0; i < m; i++ {
    for j := 0; j < n; j++ {
      _ = a[i*stride +j]
    }
}

Is that then correct?

btracey · 2015-11-23T06:03:28Z

(also assuming m, n and stride are each individually checked to be >= 0)

dr2chase · 2015-11-25T21:37:05Z

I suspect it is not correct, given suitably hacked modulo arithmetic.
e.g., suppose stride is 1 << 62, consider multiples 0, 1, 2, 3, 4, they are:

0
1<<62 == large positive suitable for indexing into space.
2<<62 == -(1<<63)
3<<62 ==  -(1<<62)
4<<62 == 0

Let stride = n = 1<<62, m = 4, then maxIdx = 0.

I think that division is ultimately what you want, though I'm nowhere near there yet on that inference.

bradfitz · 2016-04-01T11:20:20Z

Fixed.

See:

https://go-review.googlesource.com/20517
https://go-review.googlesource.com/21008 (issue encoding/binary, cmd/compile: improve bounds checking hints #14900)
Issue cmd/compile: swapping two elements of a slice generates double the necessary boundary checks #10131(basically a dup, but with more info from @brtzsnr)

(And maybe other SSA changes)

bradfitz added accepted Performance labels Mar 3, 2014

rsc added this to the Unplanned milestone Apr 10, 2015

rsc removed priority-someday labels Apr 10, 2015

rsc changed the title ~~cmd/gc: eliminate some bounds checks~~ cmd/compile: eliminate some bounds checks Jun 8, 2015

dsnet mentioned this issue Jul 22, 2015

cmd/compile: Use wide integer load/store instructions if possible #11819

Closed

btracey mentioned this issue Mar 17, 2016

cmd/compile: unnecessary bounds checks are not removed #14808

Open

bradfitz closed this as completed Apr 1, 2016

golang locked and limited conversation to collaborators Apr 2, 2017

gopherbot added the FrozenDueToAge label Apr 2, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: eliminate some bounds checks #5364

cmd/compile: eliminate some bounds checks #5364

bradfitz commented Apr 27, 2013

remyoudompheng commented May 20, 2013

DanielMorsing commented May 21, 2013

rsc commented Jul 30, 2013

robpike commented Aug 20, 2013

bradfitz commented Nov 19, 2013

rsc commented Dec 4, 2013

rsc commented Mar 3, 2014

klauspost commented Nov 18, 2015

btracey commented Nov 18, 2015

dr2chase commented Nov 18, 2015

klauspost commented Nov 18, 2015

dr2chase commented Nov 19, 2015

btracey commented Nov 19, 2015

btracey commented Nov 19, 2015

dr2chase commented Nov 23, 2015

btracey commented Nov 23, 2015

btracey commented Nov 23, 2015

dr2chase commented Nov 25, 2015

bradfitz commented Apr 1, 2016

cmd/compile: eliminate some bounds checks #5364

cmd/compile: eliminate some bounds checks #5364

Comments

bradfitz commented Apr 27, 2013

remyoudompheng commented May 20, 2013

DanielMorsing commented May 21, 2013

rsc commented Jul 30, 2013

robpike commented Aug 20, 2013

bradfitz commented Nov 19, 2013

rsc commented Dec 4, 2013

rsc commented Mar 3, 2014

klauspost commented Nov 18, 2015

btracey commented Nov 18, 2015

dr2chase commented Nov 18, 2015

klauspost commented Nov 18, 2015

dr2chase commented Nov 19, 2015

btracey commented Nov 19, 2015

btracey commented Nov 19, 2015

dr2chase commented Nov 23, 2015

btracey commented Nov 23, 2015

btracey commented Nov 23, 2015

dr2chase commented Nov 25, 2015

bradfitz commented Apr 1, 2016