Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: eliminate some bounds checks #5364

Closed
bradfitz opened this issue Apr 27, 2013 · 19 comments
Closed

cmd/compile: eliminate some bounds checks #5364

bradfitz opened this issue Apr 27, 2013 · 19 comments

Comments

@bradfitz
Copy link
Contributor

I was just surprised to see a container/heap Swap call show up in a profile and looked
to see what code was generated.

It seems like the redundant bounds checks could be eliminated.

And it seems like some redundant code to call runtime.panicindex is generated too.

$ cat x.go
package main
func swap(s []int, i, j int) { s[i], s[j] = s[j], s[i] }
func main() { swap([]int{0, 1, 2, 3}, 1, 2) }

$ go tool 6g -S x.go 
--- prog list "swap" ---
0000 (x.go:2) TEXT    swap+0(SB),$0-40
0001 (x.go:2) MOVQ    i+24(FP),DI
0002 (x.go:2) MOVQ    j+32(FP),SI
0003 (x.go:2) MOVQ    s+0(FP),CX
0004 (x.go:2) MOVQ    s+8(FP),AX
0005 (x.go:2) LOCALS  ,$0
0006 (x.go:2) TYPE    s+0(FP){[]int},$24
0007 (x.go:2) TYPE    i+24(FP){int},$8
0008 (x.go:2) TYPE    j+32(FP){int},$8
0009 (x.go:2) MOVQ    CX,BX
0010 (x.go:2) MOVQ    DI,BP
0011 (x.go:2) CMPQ    DI,AX
0012 (x.go:2) JCS     $1,15
0013 (x.go:2) CALL    ,runtime.panicindex+0(SB)
0014 (x.go:2) UNDEF   ,
0015 (x.go:2) LEAQ    (BX)(BP*8),BX
0016 (x.go:2) MOVQ    (BX),DX
0017 (x.go:2) MOVQ    CX,BX
0018 (x.go:2) MOVQ    DI,BP
0019 (x.go:2) CMPQ    DI,AX
0020 (x.go:2) JCS     $1,23
0021 (x.go:2) CALL    ,runtime.panicindex+0(SB)
0022 (x.go:2) UNDEF   ,
0023 (x.go:2) LEAQ    (BX)(BP*8),BX
0024 (x.go:2) MOVQ    CX,BP
0025 (x.go:2) MOVQ    SI,R8
0026 (x.go:2) CMPQ    SI,AX
0027 (x.go:2) JCS     $1,30
0028 (x.go:2) CALL    ,runtime.panicindex+0(SB)
0029 (x.go:2) UNDEF   ,
0030 (x.go:2) LEAQ    (BP)(R8*8),BP
0031 (x.go:2) MOVQ    (BP),R8
0032 (x.go:2) MOVQ    R8,(BX)
0033 (x.go:2) MOVQ    CX,BX
0034 (x.go:2) MOVQ    SI,BP
0035 (x.go:2) CMPQ    SI,AX
0036 (x.go:2) JCS     $1,39
0037 (x.go:2) CALL    ,runtime.panicindex+0(SB)
0038 (x.go:2) UNDEF   ,
0039 (x.go:2) LEAQ    (BX)(BP*8),BX
0040 (x.go:2) MOVQ    DX,(BX)
0041 (x.go:2) RET     ,

$ go build -o /tmp/x --gcflags=-l x.go
$ objdump -D /tmp/x | less

0000000000400c00 <main.swap>:
  400c00:       64 48 8b 0c 25 f0 ff    mov    %fs:0xfffffffffffffff0,%rcx
  400c07:       ff ff 
  400c09:       48 3b 21                cmp    (%rcx),%rsp
  400c0c:       77 05                   ja     400c13 <main.swap+0x13>
  400c0e:       e8 dd 7a 01 00          callq  4186f0 <runtime.morestack40>
  400c13:       48 8b 7c 24 20          mov    0x20(%rsp),%rdi
  400c18:       48 8b 74 24 28          mov    0x28(%rsp),%rsi
  400c1d:       48 8b 4c 24 08          mov    0x8(%rsp),%rcx
  400c22:       48 8b 44 24 10          mov    0x10(%rsp),%rax
  400c27:       48 89 cb                mov    %rcx,%rbx
  400c2a:       48 89 fd                mov    %rdi,%rbp
  400c2d:       48 39 c7                cmp    %rax,%rdi
  400c30:       73 55                   jae    400c87 <main.swap+0x87>
  400c32:       48 8d 1c eb             lea    (%rbx,%rbp,8),%rbx
  400c36:       48 8b 13                mov    (%rbx),%rdx
  400c39:       48 89 cb                mov    %rcx,%rbx
  400c3c:       48 89 fd                mov    %rdi,%rbp
  400c3f:       48 39 c7                cmp    %rax,%rdi
  400c42:       73 3c                   jae    400c80 <main.swap+0x80>
  400c44:       48 8d 1c eb             lea    (%rbx,%rbp,8),%rbx
  400c48:       48 89 cd                mov    %rcx,%rbp
  400c4b:       49 89 f0                mov    %rsi,%r8
  400c4e:       48 39 c6                cmp    %rax,%rsi
  400c51:       73 26                   jae    400c79 <main.swap+0x79>
  400c53:       4a 8d 6c c5 00          lea    0x0(%rbp,%r8,8),%rbp
  400c58:       4c 8b 45 00             mov    0x0(%rbp),%r8
  400c5c:       4c 89 03                mov    %r8,(%rbx)
  400c5f:       48 89 cb                mov    %rcx,%rbx
  400c62:       48 89 f5                mov    %rsi,%rbp
  400c65:       48 39 c6                cmp    %rax,%rsi
  400c68:       73 08                   jae    400c72 <main.swap+0x72>
  400c6a:       48 8d 1c eb             lea    (%rbx,%rbp,8),%rbx
  400c6e:       48 89 13                mov    %rdx,(%rbx)
  400c71:       c3                      retq   
  400c72:       e8 a9 bd 00 00          callq  40ca20 <runtime.panicindex>
  400c77:       0f 0b                   ud2    
  400c79:       e8 a2 bd 00 00          callq  40ca20 <runtime.panicindex>
  400c7e:       0f 0b                   ud2    
  400c80:       e8 9b bd 00 00          callq  40ca20 <runtime.panicindex>
  400c85:       0f 0b                   ud2    
  400c87:       e8 94 bd 00 00          callq  40ca20 <runtime.panicindex>
  400c8c:       0f 0b                   ud2    
        ...
@remyoudompheng
Copy link
Contributor

Comment 1:

The calls to runtime.panicindex are annotated by the line number of the index
expressions so that they show nicely in stack traces.
Panics on the same line could be merged into a single one but I think being able to
differentiate them by their PC is useful for debugging.
The subject of redundant bounds check has already been raised a few times, I'm surprised
I can't find an existing issue about it.

@DanielMorsing
Copy link
Contributor

Comment 2:

To remove redundant bounds check, we need some way of doing data-flow analysis before we
walk the functions. That would also make a whole class of optimizations open to us, like
caching repeated pointer indirects in registers when it can be done safely.

@rsc
Copy link
Contributor

rsc commented Jul 30, 2013

Comment 3:

Labels changed: added go1.3.

@robpike
Copy link
Contributor

robpike commented Aug 20, 2013

Comment 4:

Labels changed: removed go1.3.

@bradfitz
Copy link
Contributor Author

Comment 5:

Another one, mostly the same as above, but with constants:
package main
func main() {
    x := make([]byte, 10)
    foo(x)
}
func foo(x []byte) {
    _ = x[5]
    _ = x[4]
    _ = x[3]
    _ = x[2]
    _ = x[1]
    _ = x[0]
}
We only need one check in foo. If 5 works, the rest will too.

@rsc
Copy link
Contributor

rsc commented Dec 4, 2013

Comment 6:

Labels changed: added repo-main.

@rsc
Copy link
Contributor

rsc commented Mar 3, 2014

Comment 7:

Adding Release=None to all Priority=Someday bugs.

Labels changed: added release-none.

@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@rsc rsc changed the title cmd/gc: eliminate some bounds checks cmd/compile: eliminate some bounds checks Jun 8, 2015
@klauspost
Copy link
Contributor

I saw in the Go 1.6 planning that @dr2chase might want to look at bounds check elimination. I think work is still on-going. It seems like a perfect fit for the SSA compiler.

For fun I have created all the cases I can think of, and added it as a boundscheck.go gist. It can be copied directly to the playground.

My main thought is that with a SSA tree it would be possible to attach a possible "minimum" and "maximum" value to each slice and variable. For values that we don't know, or escape, we set it to the maximum/minimum value of the type.

@btracey
Copy link
Contributor

btracey commented Nov 18, 2015

There are a bunch more involving more complicated expressions. Some examples

if len(a) < m*n {
    return 
}
for i := 0: i < m; i++ {
    for j := 0 ; j <n ;j++{
       _ =  a[i*m + j]
   }
}

Or the more common (and slightly more complicated)

if n < stride {
    return
}
if len(a) < m*stride {
    return 
}
for i := 0: i < m; i++ {
    for j := 0 ; j <n ;j++{
       _ =  a[i*stride + j]
   }
}

And there are lots involving multiple slices

if len(a) != len(b) {
    return
}
var dot float64
for i, v := range a {
    dot += b[i] * v
}

@dr2chase
Copy link
Contributor

It's slipped to 1.7, the analysis ought to exist in buggy form later this week or early next week. In the short run I'm not targeting multiplication, but I am trying to get the following cases:

for i := 0 ; i + K-1 < len(a); i+= K { ... a[i] ... a[i+K-1] }
if len(a) > len(b) {
  t := a
  a = b
  b = t
}
for i := 0; i < len(a); i++ { ... a[i] ... b[i] ... }

The multiplication expression is harder than you might think because of integer overflow.
Consider

if len(a) < m*n {
    return 
}
for i := 0: i < m; i++ {
    for j := 0 ; j <n ;j++{
       _ =  a[i*m + j]
   }
}

Suppose m = 4, n = 1<<62.
Then m*n == 0, len(a) is greater than 0, and therefore we can remove bounds checking, right? :-)

Just remember, we design languages to let integers overflow and wrap because it's "more efficient".

@klauspost
Copy link
Contributor

In the long term some way of indicating a guaranteed slice length for the loop - for instance by reslicing, eg. a = a[:16], or using arrays.

Combined that with some way of indicating a maximum lookup value v = a[x&15] or v = a[byte(x)] would be a win for the compression packages, since they have a lot of these types of cases.

@dr2chase
Copy link
Contributor

By-the-way, this idiom ought to work somewhat better (for analysis purposes):

if m < 0 {
  return
} // this might be necessary; easier analysis should suffice if m>=0 is known.
n := len(a)/m
// m*n <= len(a), no overflow possible.
for i := 0: i < m; i++ {
    for j := 0 ; j <n ;j++{
       _ =  a[i*m + j]
   }
}

I'll put this in my queue.

@btracey
Copy link
Contributor

btracey commented Nov 19, 2015

Our actual use case involves matrix views, and so it can't be as simple as the above code. Our real code is closer to

if m < 0 {
    panic("negative rows")
}
if n < 0 {
   panic("negative columns")
}
if stride < n {
    panic("bad stride")
}
if (m-1)*stride + n > len(a) {
     panic("out of bounds error")
}
for i := 0; i < m; i++ {
    for j := 0; j < n; j++ {
      _ = a[i*stride +j]
    }
}

note in particular that in general stride != m.

Honestly though, I think the highest impact you could have on optimizing gonum code is (code snippets assume correct bounds checking has been done above)

  1. Dot product
var dot float64
for i, v := range x {
   dot += y[i]
}
  1. Scale
for i := range x {
    x[i] *= beta
}
  1. "Axpy", i.e. y += a*x
for i, v := range x {
    y[i] += a*v
}

All of these also come in the "strided" form you mention above, where instead i is incremented by a constant value. If these three expressions could have bounds checks removed and SIMD, there would be a huge speedup in the BLAS benchmarks.

If you're interested for testing purposes, we have a bunch of benchmarks at godoc.org/github.com/gonum/blas/native, and godoc.org/github.com/gonum/matrix/mat64 .

Thanks for thinking about compiler optimizations!

@btracey
Copy link
Contributor

btracey commented Nov 19, 2015

Oh, to add, those benchmarks at present use assembly for the three cases I mentioned above. The pure go versions can be run with the -noasm tag.

@dr2chase
Copy link
Contributor

Same problem with potential overflow here:

if (m-1)*stride + n > len(a) {
     panic("out of bounds error")
}
for i := 0; i < m; i++ {
    for j := 0; j < n; j++ {
      _ = a[i*stride +j]
    }
}

because

m:=2; n := 0x7fffFFFFffffFFFF; stride :=1,
(m-1)*stride + n  < len(a) // == -0x8000000000000000

and the inner loop can run and run and run.

@btracey
Copy link
Contributor

btracey commented Nov 23, 2015

I see. If the check is

maxIdx := (m-1)*stride + n
if  maxIdx < 0 || maxIdx > len(a) {
     panic("out of bounds error")
}
for i := 0; i < m; i++ {
    for j := 0; j < n; j++ {
      _ = a[i*stride +j]
    }
}

Is that then correct?

@btracey
Copy link
Contributor

btracey commented Nov 23, 2015

(also assuming m, n and stride are each individually checked to be >= 0)

@dr2chase
Copy link
Contributor

I suspect it is not correct, given suitably hacked modulo arithmetic.
e.g., suppose stride is 1 << 62, consider multiples 0, 1, 2, 3, 4, they are:

0
1<<62 == large positive suitable for indexing into space.
2<<62 == -(1<<63)
3<<62 ==  -(1<<62)
4<<62 == 0

Let stride = n = 1<<62, m = 4, then maxIdx = 0.

I think that division is ultimately what you want, though I'm nowhere near there yet on that inference.

@bradfitz
Copy link
Contributor Author

bradfitz commented Apr 1, 2016

@bradfitz bradfitz closed this as completed Apr 1, 2016
@golang golang locked and limited conversation to collaborators Apr 2, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants