cmd/compile: new backend; multiple efficiency issues... #16192

GordonBGood · 2016-06-27T08:41:08Z

Please answer these questions before submitting your issue. Thanks!

Version: go version go1.7beta2 windows/amd64
Environment:
set GOARCH=amd64
set GOBIN=
set GOEXE=.exe
set GOHOSTARCH=amd64
set GOHOSTOS=windows
set GOOS=windows
set GOPATH=F:\Go
set GORACE=
set GOROOT=F:\Go
set GOTOOLDIR=F:\Go\pkg\tool\windows_amd64
set CC=gcc
set GOGCCFLAGS=-m64 -mthreads -fmessage-length=0 -fdebug-prefix-map=C:\Users\Super\AppData\Local\Temp\go-build631535858=/tmp/go-build -gno-record-gcc-switches
set CXX=g++
set CGO_ENABLED=1

However, these issues are likely also related to other operating systems and architectures, as in at least x386.

Runnable Program: Go Playground link:

// SoE project main.go
package main

import (
"fmt"
"math"
)

func primesOdds(top uint32) func() uint32 {
topndx := int((top - 3) / 2)
topsqrtndx := (int(math.Sqrt(float64(top))) - 3) / 2
cmpsts := make([]uint32, (topndx/32)+1)
for i := 0; i <= topsqrtndx; i++ {
    if cmpsts[i>>5]&(1<<(uint32(i)&0x1F)) == 0 {
        p := i + i + 3
        for j := (p*p - 3) >> 1; j <= topndx; j += p {
            cmpsts[j>>5] |= uint32(1) << (uint32(j) & 0x1F)
        }
    }
}
i := -1
return func() uint32 {
    oi := i
    if i <= topndx {
        i++
    }
    for i <= topndx && cmpsts[i>>5]&(uint32(1)<<(uint32(i)&0x1F)) != 0 {
        i++
    }
    if oi < 0 {
        return 2
    } else {
        return (uint32(oi) << 1) + 3
    }
}
}

func main() {
iter := primesOdds(1000000)
count := 0
for v := iter(); v <= 1000000; v = iter() {
    count++
}
fmt.Printf("%v\r\n", count)
}

What I saw:

The issue is the assembly code as viewed when "go tool compile -S > main.asm" is run, with a portion of that file as follows:

0x00e9 00233 (main.go:15)   LEAQ    (CX)(CX*1), R8  ;; **ISSUE 1** - change
0x00ed 00237 (main.go:15)   LEAQ    3(CX)(CX*1), R9 ;; prime 'p' calculation; good
0x00f2 00242 (main.go:16)   IMULQ   R9, R9  ;; 'sqr' calculation - good
0x00f6 00246 (main.go:16)   ADDQ    $-3, R9 ;; 'sqr' - 3 is good
0x00fa 00250 (main.go:16)   SARQ    $1, R9  ;; including shortcut divide by 2 - good
0x00fd 00253 (main.go:16)   CMPQ    R9, SI  ;; advance range compare
0x0100 00256 (main.go:16)   JGT $0, 319     ;; check - good
0x0102 00258 (main.go:17)   MOVQ    R9, R11 ;; **ISSUE 2** - change
0x0105 00261 (main.go:17)   SARQ    $5, R9  ;; calculate word address - good
0x0109 00265 (main.go:17)   CMPQ    R9, DX  ;; *** only here if array bounds check
0x010c 00268 (main.go:17)   JCC PANIC       ;; *** only here if array bounds check
0x0112 00274 (main.go:17)   MOVL    (AX)(R9*4), R12 ;; **ISSUE 3** - not this way
0x0116 00278 (main.go:17)   MOVQ    R11, R13    ;; **ISSUE 4** - not this way
0x0119 00281 (main.go:17)   ANDQ    $31, R11        ;; **ISSUE 5** - unnecessary
0x011d 00285 (main.go:17)   MOVQ    R11, CX     ;; part of **ISSUE 2** - change
0x0120 00288 (main.go:17)   MOVL    R10, R14        ;; **ISSUE 6** - unnecessary
0x0123 00291 (main.go:17)   SHLL    CX, R10     ;; 1 << ('j'&0x1F) - good
0x0126 00294 (main.go:17)   ORL R10, R12                ;; part of **ISSUE 3**
0x0129 00297 (main.go:17)   MOVL    R12, (AX)(R9*4) ;; part of **ISSUE 3**
0x012d 00301 (main.go:16)   LEAQ    3(R8)(R13*1), R9    ;; part of **ISSUE 1**
0x0132 00306 (main.go:13)   MOVQ    "".i+64(SP), CX ;; **ISSUE 7**; unnecessary
0x0137 00311 (main.go:17)   MOVL    R14, R10        ;; part of **ISSUE 6**
0x013a 00314 (main.go:16)   CMPQ    R9, SI  ;; end of loop
0x013d 00317 (main.go:16)   JLE $0, 258     ;; check - good
0x013f 00319 (main.go:13)   LEAQ    1(CX), R9       ;; part of **ISSUE #7** - not this way

Expected: 78498 is expected and that is what is output - not the issue:

The issue is the assembly code as viewed when "go tool compile -S > main.asm" is run, with a portion of that file as follows:
```
OuterLoop:
LEAQ    3(CX)(CX*1), R8 ;; prime 'p' calculation; good, **left in R8 as per ISSUE 1**
MOVQ    R8, R9  ;; **ISSUE 1** fixed
IMULQ   R9, R9  ;; 'sqr' calculation - good
ADDQ    $-3, R9 ;; 'sqr' - 3 is good
SARQ    $1, R9  ;; including shortcut divide by 2 - good - left in R9
CMPQ    R9, SI  ;; advance range compare
JGT PastInner       ;; check - good
InnerLoop:
MOVQ    R9, R11 ;; **ISSUE 2** fixed
MOVQ    R9,CX   ;; **ISSUE 4** fixed
SARQ    $5, R11 ;; calculate word address - good
MOVQ    $1,R10  ;; **ISSUE 6** fixed
CMPQ    R11, DX ;; *** only here if array bounds check
JCC PANIC       ;; *** only here if array bounds check
SHLL    CX, R10 ;; 1 << ('j'&0x1F) - good
ADDQ    R8, R9  ;; part of **ISSUE  1** fixed
ORL R10, (AX)(R11*4)    ;; **ISSUE 3** fixed
CMPQ    R9, SI  ;; end of loop
JLE InnerLoop   ;; check - good
PastInner:
MOVQ    "".i+64(SP), CX ;; **ISSUE 7** fixed; 'i' may well already be in another register
LEAQ    1(CX), R9       ;; now more available registers, if other register, just ADD $1
```
ISSUE 1: Preserves "2 * 'i'", that requires a full 'p' calculation inside the loop using an LEAQ instruction at 272, instead of preserving the full 'p' ('i' + 'i' + 3), that would then eliminate needing to recalculate 'p' inside the loop and would allow for a simple add instruction at line 272, which is a little faster.

ISSUE 2: Preserves the original in a new register before clobbering the original register in order to save latency (ignoring that the CPU will likely use Out of Order Execution - OOE, anyway), where a simple reordering of instructions would do the same and not require the advanced contents be calculated/moved back to the original register at the end of the loop. This is a common pattern.

ISSUE 3: Ignores that the "cmpsts[j>>5] |= ..." can be encoded with a single instruction "ORL R..., (BASEREG)(INDEXREG*4)" to save some complexity and time.

ISSUE 4: the same as ISSUE 2, where a simple instruction order change can mean that no register use swap needs to be made and alleviates the need for more complex LEA use.

ISSUE 5: When a uint32 value is shifted by uint32 bits, the compiler correctly eliminates a logical "and" by 0x1F (31) as the CPU limits the shift to this anyway; the issue is that if shifted by a uint, it doesn't eliminate it as it should (workaround is to use uint32 for shifts). We should check to see if a int32 shifted by 31 bits also gets eliminated as it should; in fact any masking above 31 (above 63 for 64-bit registers) is unnecessary.

ISSUE 6 is cmd/compile: complicated bounds check elimination #16092, where a register is preserved instead of using a simple MOV immediate. This is pervasive further outside these bounds: as if the compiler has a - "avoid immediate MOV at all costs".

ISSUE 7: This instruction is completely unnecessary in restoring a value to the CX register when it never gets used in the loop and gets clobbered for each loop. Correctly, the CX register should be reloaded if necessary outside the loop. This is issue cmd/compile: regalloc restoring to dead register #14761.

The general observation is that the compiler tends to overuse LEA instructions, which instructions are very effective when necessary, but cost a little bit in speed as used instead of other simpler instructions: they are slightly slower than those simpler instructions, which doesn't seem to be taken into account.

Summary: The golang compiler is quite a way from being optimum, and won't come close to "Cee" (C/C++) efficiency until is comes much closer than this. The changes here aren't that complex, and some simple rule changes/additions should suffice. While version 1.7 is better than 1.6, it still is nowhere near as efficient as it needs to be. Rule changes/additions as suggested above can make tight inner loops run up to twice as fast or more on some architectures and some situations, although not that much here for the amd64 as a high end CPU will re-order instructions to minimize the execution time.

The text was updated successfully, but these errors were encountered:

randall77 · 2016-06-27T13:35:14Z

Issue 2: yes, it looks like there are cases where we should perform the op on the copy instead of the original. I'll have to think about how to detect when that is beneficial.
Issue 3: Even before doing things like OR AX, (BX), we also would want OR (BX), AX. The problem is on amd64 there are a ton of possible instructions you could potentially want, including all the ops times all the addressing modes. So although I think it would be easy to handle this particular case, doing it generally would be a large maintenance burden.
Issue 5 : should be an easy rule fix.
Issue 7: I think this is #14761.

GordonBGood · 2016-06-27T14:38:42Z

@randall77:

Issue 2: yes, it looks like there are cases where we should perform the op on the copy instead of the original. I'll have to think about how to detect when that is beneficial.

I think Issue 1 and 4 are all various symptoms of this: to me it looks like it would be better to keep the original and modify the copy more times than not.

Issue 3: Even before doing things like OR AX, (BX), we also would want OR (BX), AX. The problem is on amd64 there are a ton of possible instructions you could potentially want, including all the ops times all the addressing modes. So although I think it would be easy to handle this particular case, doing it generally would be a large maintenance burden.

x86 doesn't have so many modes, but I think it has the read/modify/write instructions?

Regarding addressing complexity, all the addressing modes are already used by the LEA instruction, although the execution time assessments don't seem quite right yet; it you solve that one, then the read/modify/write is solved except for latency.

Your concern about OR (BX),AX is misplaced I think, as latency is just dependent on whether the memory is in the cache or not, as for all memory accesses; the OR AX,(BX) is more of a concern only if the destination is in the dependency chain, then it is quite long, but almost always shorter or about the same time as separate MOV (BX),AX/XOR DX,AX/MOV AX,(BX) instructions. Note that 1) there is a latency of the first MOV, but the manipulations of the AX register can hide it, and 2) the use of the instructions is usually is this order where there is no dependency on the final (BX) destination.

I'm just saying - GCC can do it in the specific case where the modify/assignment operators are used, why can't we? GCC does not use them when the modify/assignment operators are not used although Clang/LLVM does. We'll never be able to effectively do extreme loop unrolling of some kinds without it (modulos type of things).

Issue 7: I think this is #14761.

Yes, I think you're right.

randall77 · 2016-06-27T16:40:02Z

x86 doesn't have so many modes, but I think it has the read/modify/write instructions?

The SSA backend encodes each addressing mode in a separate opcode. So for integer loads, we already already have:
MOVBload, MOVBQSXload, MOVWload, MOVWQSXload, MOVLload, MOVLQSXload, MOVQload, MOVOload, MOVBloadidx1, MOVWloadidx1, MOVWloadidx2, MOVLloadidx1, MOVLloadidx4, MOVQloadidx1, MOVQloadidx8.
Similarly for stores. And that's not all the loads, for instance we haven't implemented MOVWloadix4 yet.
That's just the MOV opcode. Multiply that by ADD,SUB,AND,OR,XOR,MUL,DIV,NOT,NEG, and probably others. It gets pretty unwieldy pretty quickly.

I'd like to solve this problem at some point. Maybe we bite the bullet and just do all those opcodes (and corresponding rewrite rules). Maybe we autogenerate somehow. It's low priority at the moment because I don't think it buys a whole lot (it's the same microops, just encoded more efficiently) and it's only x86. It's not needed for all the other architectures, and those are higher priority at the moment.

Regarding addressing complexity, all the addressing modes are already used by the LEA instruction, although the execution time assessments don't seem quite right yet; it you solve that one, then the read/modify/write is solved except for latency.

Your concern about OR (BX),AX is misplaced I think, as latency is just dependent on whether the memory is in the cache or not, as for all memory accesses; the OR AX,(BX) is more of a concern only if the destination is in the dependency chain, then it is quite long, but almost always shorter or about the same time as separate MOV (BX),AX/XOR DX,AX/MOV AX,(BX) instructions. Note that 1) there is a latency of the first MOV, but the manipulations of the AX register can hide it, and 2) the use of the instructions is usually is this order where there is no dependency on the final (BX) destination.

GordonBGood · 2016-06-28T20:30:48Z

@randall77:

The SSA backend encodes each addressing mode in a separate opcode. So for integer loads, we already already have:
MOVBload, MOVBQSXload, MOVWload, MOVWQSXload, MOVLload, MOVLQSXload, MOVQload, MOVOload, MOVBloadidx1, MOVWloadidx1, MOVWloadidx2, MOVLloadidx1, MOVLloadidx4, MOVQloadidx1, MOVQloadidx8.
Similarly for stores. And that's not all the loads, for instance we haven't implemented MOVWloadix4 yet.
That's just the MOV opcode. Multiply that by ADD,SUB,AND,OR,XOR,MUL,DIV,NOT,NEG, and probably others. It gets pretty unwieldy pretty quickly.

I'd like to solve this problem at some point. Maybe we bite the bullet and just do all those opcodes (and corresponding rewrite rules). Maybe we autogenerate somehow. It's low priority at the moment because I don't think it buys a whole lot (it's the same microops, just encoded more efficiently) and it's only x86. It's not needed for all the other architectures, and those are higher priority at the moment.

You are right that in the general case, using the read/modify/write opcodes doesn't make much difference to timing as generally the latency of the first move will be masked by the computations of the modify value so the difference is likely only ten to fifteen percent for a tight loop as here. However, if one is optimizing using loop unrolling techniques (where the modify value is either immediate or a pre-calculated register), it can make a very large difference as in currently using a load followed by the very short modify followed by the store takes 3 clock cycles where it would take only 1 clock cycle for the read/modify/write instruction (both with no latency/dependency on the memory contents of the store/write).

Often, for maximum speed, the base address will be a "unsafe.Pointer", which has the same read/load followed by modify followed by write/store template as currently used here (in fact it has another issue as it produces the simple base address for the MOV's by a LEA instruction when not necessary, which makes the timing even worse; I'll raise an Issue about that once I document it properly).

Regarding addressing complexity, all the addressing modes are already used by the LEA instruction, although the execution time assessments don't seem quite right yet; it you solve that one, then the read/modify/write is solved except for latency.

As I said, it seems you are already handling all of the addressing modes for LEA, although not the load/store variations; couldn't you just apply those rules to all the variations of the other opcodes?

And autogeneration of the code seems to be the way to do it, as it would seem that the rules would be the same for each different load or store for each of the different opcodes.

I can't comment on your priorities, and the current implementation of read then modify then write is "adequate", but we'll never compete with Cee "extreme" optimizations without this.

GordonBGood · 2016-10-30T06:22:57Z

Just as a further goal to shoot for, Rust/LLVM produces the following assembly code for the same algorithm, although in Rust I had to use an unsafe mutable pointer reference to eliminate the array bounds check; the following code is about as good as it gets without hand optimizing:

    .p2align    4, 0x90     ;; alignment for better efficiency on loop branch destination
.LBB22_73:
    movq    %rcx, %rax  ;; note that the loop variable is kept in rcx for this case: left shift ready
    shrq    $5, %rax            ;; moved for the right shift word calculation here
    movl    $1, %edx            ;; this should be fixed as here as per issue 16092 using immediate load
    shll    %cl, %edx       ;; only works because the and is not necessary when shifting 32-bit reg
    orl %edx, (%r14,%rax,4) ;; note use of combined read/modify/write instruction!
    addq    %rbx, %rcx  ;; note only use of a reg for the prime value added for next loop
.LBB22_68:
    cmpq    %r13, %rcx  ;; loop limit kept in register
    jb  .LBB22_73       ;; note that the array base is kept in a reg (r14), doesn't need load in loop

On my Sky Lake Intel i5-6500, this takes about 3.2 CPU cycles per loop. The CPU is quite likely executing these instructions Out Of Order (OOE) in order to minimize latencies when registers are updated and immediately used in the next instruction.

Note that the above code recognizes that the loop variable can sit in the rcx register and thus be used for the variable left shift when shifting 32-bit registers (also works when shifting 64-bit registers) as the mask of the lower five bits (six bits for 64-bit) is not necessary and has been eliminated. golang version 1.7 eliminates the mask, but does not then recognize that it can also then use rcx as the loop variable eliminating one register move and the use of one extra register. A total of only six registers are used in the above inner loop, meaning it can be just as efficient in x86 code.

Note that is seems that stable golang version 1.7 for at least x86_64 target is very good at eliminating the bounds check in this case, recognizing that topndx is related to the `len(cmpsts), that the bounds check is therefore not necessary, thus eliminating it.

josharian · 2017-05-16T18:54:10Z

I'll leave it to @randall77 to decide about all the specific recommendations here and whether we want to tackle any of them for 1.9.

A quick update on the overall execution speed, though--things have at least been getting steadily better. For 1.7.5, 1.8.1, and tip at b53acd8:

name \ time/op  p17          p18          ptip
PrimesOdds-8    3.17ns ± 1%  2.56ns ± 1%  2.23ns ± 1%

This uses the same code as before, but converted straightforwardly into a benchmark.

package p

import (
	"math"
	"testing"
)

func primesOdds(top uint32) func() uint32 {
	topndx := int((top - 3) / 2)
	topsqrtndx := (int(math.Sqrt(float64(top))) - 3) / 2
	cmpsts := make([]uint32, (topndx/32)+1)
	for i := 0; i <= topsqrtndx; i++ {
		if cmpsts[i>>5]&(1<<(uint32(i)&0x1F)) == 0 {
			p := i + i + 3
			for j := (p*p - 3) >> 1; j <= topndx; j += p {
				cmpsts[j>>5] |= uint32(1) << (uint32(j) & 0x1F)
			}
		}
	}
	i := -1
	return func() uint32 {
		oi := i
		if i <= topndx {
			i++
		}
		for i <= topndx && cmpsts[i>>5]&(uint32(1)<<(uint32(i)&0x1F)) != 0 {
			i++
		}
		if oi < 0 {
			return 2
		} else {
			return (uint32(oi) << 1) + 3
		}
	}
}

func BenchmarkPrimesOdds(b *testing.B) {
	iter := primesOdds(1000000)
	for i := 0; i < b.N; i++ {
		iter()
	}
}

davecheney · 2017-05-16T20:54:43Z

Josh, would you consider adding this benchmark to the Go1 suite?

…

On Wed, 17 May 2017, 04:54 Josh Bleecher Snyder ***@***.***> wrote: I'll leave it to @randall77 <https://github.com/randall77> to decide about all the specific recommendations here and whether we want to tackle any of them for 1.9. A quick update on the overall execution speed, though--things have at least been getting steadily better. For 1.7.5, 1.8.1, and tip at b53acd8 <b53acd8> : name \ time/op p17 p18 ptip PrimesOdds-8 3.17ns ± 1% 2.56ns ± 1% 2.23ns ± 1% This uses the same code as before, but converted straightforwardly into a benchmark. package p import ( "math" "testing" ) func primesOdds(top uint32) func() uint32 { topndx := int((top - 3) / 2) topsqrtndx := (int(math.Sqrt(float64(top))) - 3) / 2 cmpsts := make([]uint32, (topndx/32)+1) for i := 0; i <= topsqrtndx; i++ { if cmpsts[i>>5]&(1<<(uint32(i)&0x1F)) == 0 { p := i + i + 3 for j := (p*p - 3) >> 1; j <= topndx; j += p { cmpsts[j>>5] |= uint32(1) << (uint32(j) & 0x1F) } } } i := -1 return func() uint32 { oi := i if i <= topndx { i++ } for i <= topndx && cmpsts[i>>5]&(uint32(1)<<(uint32(i)&0x1F)) != 0 { i++ } if oi < 0 { return 2 } else { return (uint32(oi) << 1) + 3 } } } func BenchmarkPrimesOdds(b *testing.B) { iter := primesOdds(1000000) for i := 0; i < b.N; i++ { iter() } } — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#16192 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAAcA1BqRjhOpMncUcq0sXFDCNg2cVasks5r6fDUgaJpZM4I-32j> .

josharian · 2017-05-16T21:10:15Z

I'd love to see the Go1 suite expand. (And maybe shrink, too--I think fmt is over-represented.) I'm sure we could find a bunch of candidates (#16122 also comes to mind--recency effects). But we should do a considered, bulk addition. Would you mind filing a new issue, where we could collect a set of proposed additions and then consider them en masse?

navytux · 2017-05-16T21:47:17Z

@josharian, maybe an issue label would do here?

josharian · 2017-05-16T22:00:21Z

I think @bradfitz and @spf13 were trying to trim down labels. Let's discuss all this in a new issue so as to leave this one on-topic. I'll file one now.

GordonBGood · 2017-05-17T02:53:25Z

My problem with the mini benchmark as per @josharian is that the timing includes the determination of the results and not just the culling of the composites as was the intention; thus, the improved results may only indicate better iteration rather than an improved tight composite culling loop that was intended to be tested.

The benchmark timing results should only include the

	iter := primesOdds(1000000)

line and not the

	for i := 0; i < b.N; i++ {
		iter()
	}

lines.

josharian · 2017-05-17T04:24:05Z

Sorry for misinterpreting. To be clear, is this the correct translation?

var sink func() uint32

func BenchmarkPrimesOdds(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = primesOdds(1000000)
	}
}

For that, with tip at c20e545, I see a regression from 1.8 to 1.9:

name \ time/op  17          18          tip
PrimesOdds-8    978µs ± 1%  884µs ± 1%  990µs ± 2%

bradfitz · 2018-06-19T17:43:46Z

I see: (where tip == 1caa062)

bradfitz@gdev:~/src/issue_16192$ benchstat 1.6 tip
name          old time/op  new time/op  delta
PrimesOdds-4  4.07ns ± 1%  4.42ns ± 1%  +8.46%  (p=0.000 n=10+9)

bradfitz@gdev:~/src/issue_16192$ benchstat 1.6 1.7
name          old time/op  new time/op  delta
PrimesOdds-4  4.07ns ± 1%  4.40ns ± 1%  +8.18%  (p=0.000 n=10+10)

bradfitz@gdev:~/src/issue_16192$ benchstat 1.7 tip
name          old time/op  new time/op  delta
PrimesOdds-4  4.40ns ± 1%  4.42ns ± 1%   ~     (p=0.227 n=10+9)

randall77 added this to the Go1.8 milestone Jun 27, 2016

randall77 added the Performance label Jun 27, 2016

randall77 self-assigned this Jun 27, 2016

quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Oct 11, 2016

rsc modified the milestones: Go1.9, Go1.8 Oct 21, 2016

josharian mentioned this issue May 16, 2017

proposal: expand the go1 bench suite #20384

Closed

randall77 modified the milestones: Go1.10, Go1.9 May 31, 2017

bradfitz modified the milestones: Go1.10, Go1.11 Nov 28, 2017

bradfitz modified the milestones: Go1.11, Unplanned Jun 19, 2018

rsc unassigned randall77 Jun 23, 2022

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: new backend; multiple efficiency issues... #16192

cmd/compile: new backend; multiple efficiency issues... #16192

GordonBGood commented Jun 27, 2016 •

edited

randall77 commented Jun 27, 2016

GordonBGood commented Jun 27, 2016 •

edited

randall77 commented Jun 27, 2016

GordonBGood commented Jun 28, 2016

GordonBGood commented Oct 30, 2016

josharian commented May 16, 2017

davecheney commented May 16, 2017 via email

josharian commented May 16, 2017

navytux commented May 16, 2017

josharian commented May 16, 2017

GordonBGood commented May 17, 2017

josharian commented May 17, 2017

bradfitz commented Jun 19, 2018

cmd/compile: new backend; multiple efficiency issues... #16192

cmd/compile: new backend; multiple efficiency issues... #16192

Comments

GordonBGood commented Jun 27, 2016 • edited

randall77 commented Jun 27, 2016

GordonBGood commented Jun 27, 2016 • edited

randall77 commented Jun 27, 2016

GordonBGood commented Jun 28, 2016

GordonBGood commented Oct 30, 2016

josharian commented May 16, 2017

davecheney commented May 16, 2017 via email

josharian commented May 16, 2017

navytux commented May 16, 2017

josharian commented May 16, 2017

GordonBGood commented May 17, 2017

josharian commented May 17, 2017

bradfitz commented Jun 19, 2018

GordonBGood commented Jun 27, 2016 •

edited

GordonBGood commented Jun 27, 2016 •

edited