Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: range loop over []byte, []int can avoid base pointer increments #15809

Closed
randall77 opened this issue May 24, 2016 · 10 comments
Closed
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Performance
Milestone

Comments

@randall77
Copy link
Contributor

var t [256]byte

func f(b *[16]byte) {
    for i, v := range b {
        b[i] = t[v]
    }
}

compiles to:

    0x000f 00015 (tmp1.go:7)    MOVBLZX (AX), DX
    0x0012 00018 (tmp1.go:7)    LEAQ    "".t(SB), BX
    0x0019 00025 (tmp1.go:7)    MOVBLZX (BX)(DX*1), DX
    0x001d 00029 (tmp1.go:7)    MOVQ    "".b+8(FP), BX
    0x0022 00034 (tmp1.go:7)    MOVB    DL, (BX)(CX*1)
    0x0025 00037 (tmp1.go:6)    INCQ    AX
    0x0028 00040 (tmp1.go:6)    INCQ    CX
    0x002b 00043 (tmp1.go:6)    CMPQ    CX, $16
    0x002f 00047 (tmp1.go:6)    JLT $0, 15

It looks like we strength-reduced the index into b into a pointer increment. But the terminating condition still uses the integer, so we end up having 2 induction variables instead of one. We should either compute the terminating condition using an end pointer (and throw away the induction variable CX), or throw away the pointer induction variable (AX) and use an indexed load.

@brtzsnr

@randall77 randall77 added this to the Go1.8 milestone May 24, 2016
@brtzsnr
Copy link
Contributor

brtzsnr commented May 24, 2016

I don't recall any strength reduction code in the current compiler. This is I think how the code is generated by the frontend and is one of the test cases that a SR will hopefully address.

Alternatively, should the frontend generate better code?

@josharian
Copy link
Contributor

We'd have to be careful that end pointers don't confuse the GC. See related comments in gc/range.go (which is indeed probably the place for improvements here). See also my pending bool2int CL for an interaction between SSA and range / end-of-slice front end code.

@dr2chase
Copy link
Contributor

A potential sort-of-optimization for the future is to keep the locally-root (might not be really-root, depending on what our callers fed us) pointers safely on the stack, and then let the derived and sometimes-off-the-end pointers be marked as not live as long as we knew a 1:1 correspondence to the local roots. Means that the GC scans less, means that off-the-end is okay, means that more often the GC's offset calculation isn't needed.

@dr2chase
Copy link
Contributor

Except for pointers to stack-allocated things, oops, never mind.

@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 11, 2016
@rsc
Copy link
Contributor

rsc commented Oct 21, 2016

This is not a strength reduction. It's just the code generated for range loops over arrays/slices: there is a pointer being advanced and an integer index. The pointer advancement avoids a multiply and a bounds check, at least in the old system. You could change the range expansion to avoid the pointer++ when the element size is something that has cheap index calculation, and then mark the indexing as bounds-checked already.

@rsc rsc modified the milestones: Go1.9, Go1.8 Oct 21, 2016
@rsc rsc changed the title cmd/compile: pointless strength reduction cmd/compile: range loop over []byte, []int can avoid base pointer increments Oct 21, 2016
@martisch martisch self-assigned this Dec 19, 2016
@martisch
Copy link
Contributor

Started working on this and created a CL that computes the element position directly each time if it is cheap to do. Currently looking at the effects, adjusting tests, benchmarking and evaluating e.g. if we could just compare against a pointer to the end of the array. I am planning to send a CL for 1.9 within the next 2 weeks.

with new CL:

        0x0000 00000 (/Users/martisch/test/main.go:7)   MOVQ    "".b+8(FP), AX
        0x0005 00005 (/Users/martisch/test/main.go:7)   MOVQ    $0, CX
        0x0007 00007 (/Users/martisch/test/main.go:7)   CMPQ    CX, $16
        0x000b 00011 (/Users/martisch/test/main.go:7)   JGE     $0, 42
        0x000d 00013 (/Users/martisch/test/main.go:7)   TESTB   AL, (AX) // can and should be outside the loop?
        0x000f 00015 (/Users/martisch/test/main.go:8)   MOVBLZX (AX)(CX*1), DX
        0x0013 00019 (/Users/martisch/test/main.go:8)   LEAQ    "".t(SB), BX
        0x001a 00026 (/Users/martisch/test/main.go:8)   MOVBLZX (BX)(DX*1), DX
        0x001e 00030 (/Users/martisch/test/main.go:8)   MOVB    DL, (AX)(CX*1)
        0x0021 00033 (/Users/martisch/test/main.go:7)   INCQ    CX
        0x0024 00036 (/Users/martisch/test/main.go:7)   CMPQ    CX, $16
        0x0028 00040 (/Users/martisch/test/main.go:7)   JLT     $0, 13
        0x002a 00042 (/Users/martisch/test/main.go:10)  RET

without:

        0x0000 00000 (/Users/martisch/test/main.go:7)   MOVQ    "".b+8(FP), AX
        0x0005 00005 (/Users/martisch/test/main.go:7)   TESTB   AL, (AX)
        0x0007 00007 (/Users/martisch/test/main.go:6)   MOVQ    AX, CX
        0x000a 00010 (/Users/martisch/test/main.go:7)   MOVQ    $0, DX
        0x000c 00012 (/Users/martisch/test/main.go:7)   CMPQ    DX, $16
        0x0010 00016 (/Users/martisch/test/main.go:7)   JGE     $0, 47
        0x0012 00018 (/Users/martisch/test/main.go:8)   MOVBLZX (AX), BX
        0x0015 00021 (/Users/martisch/test/main.go:8)   LEAQ    "".t(SB), SI
        0x001c 00028 (/Users/martisch/test/main.go:8)   MOVBLZX (SI)(BX*1), BX
        0x0020 00032 (/Users/martisch/test/main.go:8)   MOVB    BL, (CX)(DX*1)
        0x0023 00035 (/Users/martisch/test/main.go:7)   INCQ    AX
        0x0026 00038 (/Users/martisch/test/main.go:7)   INCQ    DX
        0x0029 00041 (/Users/martisch/test/main.go:7)   CMPQ    DX, $16
        0x002d 00045 (/Users/martisch/test/main.go:7)   JLT     $0, 18
        0x002f 00047 (/Users/martisch/test/main.go:10)  RET

@rsc
Copy link
Contributor

rsc commented Dec 19, 2016

Thanks. Note that the "pointer to the end of the array" is not well defined in Go. You can't point past the last element (that's a pointer to a different object and will confuse the GC), you can't use an integer instead of a pointer (that won't be updated if the object is on the stack and moves, confusing the program itself), and you can't use a pointer to the last element (an array with 0 elements has no last element). You could use the pointer to the last element as long as there's a check for zero-size arrays that bypasses everything entirely. Anyway, just keep all that in mind. Thanks again.

@martisch
Copy link
Contributor

@rsc thanks. What i had in mind was having a look at what you described: "You could use the pointer to the last element as long as there's a check for zero-size arrays".

On top off all what you mentioned one also needs to keep e.g. the order in which assignments are made in mind. I seem to have found a bug in the string range case (in new and old code) while working on the array range code. #18376

@gopherbot
Copy link

CL https://golang.org/cl/38061 mentions this issue.

@randall77 randall77 modified the milestones: Go1.10, Go1.9 Jun 6, 2017
gopherbot pushed a commit that referenced this issue Oct 13, 2017
In range loops over slices and arrays besides a variable to track the
index an extra variable containing the address of the current element
is used. To compute a pointer to the next element the elements size is
added to the address.

On 386 and amd64 an element of size 1, 2, 4 or 8 bytes can by copied
from an array using a MOV instruction with suitable addressing mode
that uses the start address of the array, the index of the element and
element size as scaling factor. Thereby, for arrays and slices with
suitable element size we can avoid keeping and incrementing an extra
variable to compute the next elements address.

Shrinks cmd/go by 4 kilobytes.

AMD64:
name                   old time/op    new time/op    delta
BinaryTree17              2.66s ± 7%     2.54s ± 0%  -4.53%  (p=0.000 n=10+8)
Fannkuch11                3.02s ± 1%     3.02s ± 1%    ~     (p=0.579 n=10+10)
FmtFprintfEmpty          45.6ns ± 1%    42.2ns ± 1%  -7.46%  (p=0.000 n=10+10)
FmtFprintfString         69.8ns ± 1%    70.4ns ± 1%  +0.84%  (p=0.041 n=10+10)
FmtFprintfInt            80.1ns ± 1%    79.0ns ± 1%  -1.35%  (p=0.000 n=10+10)
FmtFprintfIntInt          127ns ± 1%     125ns ± 1%  -1.00%  (p=0.007 n=10+9)
FmtFprintfPrefixedInt     158ns ± 2%     152ns ± 1%  -4.11%  (p=0.000 n=10+10)
FmtFprintfFloat           218ns ± 1%     214ns ± 1%  -1.61%  (p=0.000 n=10+10)
FmtManyArgs               508ns ± 1%     504ns ± 1%  -0.93%  (p=0.001 n=9+10)
GobDecode                6.76ms ± 1%    6.78ms ± 1%    ~     (p=0.353 n=10+10)
GobEncode                5.84ms ± 1%    5.77ms ± 1%  -1.31%  (p=0.000 n=10+9)
Gzip                      223ms ± 1%     218ms ± 1%  -2.39%  (p=0.000 n=10+10)
Gunzip                   40.3ms ± 1%    40.4ms ± 3%    ~     (p=0.796 n=10+10)
HTTPClientServer         73.5µs ± 0%    73.3µs ± 0%  -0.28%  (p=0.000 n=10+9)
JSONEncode               12.7ms ± 1%    12.6ms ± 8%    ~     (p=0.173 n=8+10)
JSONDecode               57.5ms ± 1%    56.1ms ± 2%  -2.40%  (p=0.000 n=10+10)
Mandelbrot200            3.80ms ± 1%    3.86ms ± 6%    ~     (p=0.579 n=10+10)
GoParse                  3.25ms ± 1%    3.23ms ± 1%    ~     (p=0.052 n=10+10)
RegexpMatchEasy0_32      74.4ns ± 1%    76.9ns ± 1%  +3.39%  (p=0.000 n=10+10)
RegexpMatchEasy0_1K       243ns ± 2%     248ns ± 1%  +1.86%  (p=0.000 n=10+8)
RegexpMatchEasy1_32      71.0ns ± 2%    72.8ns ± 1%  +2.55%  (p=0.000 n=10+10)
RegexpMatchEasy1_1K       370ns ± 1%     383ns ± 0%  +3.39%  (p=0.000 n=10+9)
RegexpMatchMedium_32      107ns ± 0%     113ns ± 1%  +5.33%  (p=0.000 n=6+10)
RegexpMatchMedium_1K     35.0µs ± 1%    36.0µs ± 1%  +3.13%  (p=0.000 n=10+10)
RegexpMatchHard_32       1.65µs ± 1%    1.69µs ± 1%  +2.23%  (p=0.000 n=10+9)
RegexpMatchHard_1K       49.8µs ± 1%    50.6µs ± 1%  +1.59%  (p=0.000 n=10+10)
Revcomp                   398ms ± 1%     396ms ± 1%  -0.51%  (p=0.043 n=10+10)
Template                 63.4ms ± 1%    60.8ms ± 0%  -4.11%  (p=0.000 n=10+9)
TimeParse                 318ns ± 1%     322ns ± 1%  +1.10%  (p=0.005 n=10+10)
TimeFormat                323ns ± 1%     336ns ± 1%  +4.15%  (p=0.000 n=10+10)

Updates: #15809.

Change-Id: I55915aaf6d26768e12247f8a8edf14e7630726d1
Reviewed-on: https://go-review.googlesource.com/38061
Run-TryBot: Martin Möhrmann <moehrmann@google.com>
Reviewed-by: Keith Randall <khr@golang.org>
@martisch
Copy link
Contributor

martisch commented Nov 10, 2017

Was fixed by https://golang.org/cl/38061 for most used architectures.

@golang golang locked and limited conversation to collaborators Nov 10, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. Performance
Projects
None yet
Development

No branches or pull requests

8 participants