
cmd/compile: performance of slice append is optimizable #36405

Closed
lhonglwl opened this issue Jan 6, 2020 · 5 comments
Labels
FrozenDueToAge, WaitingForInfo (Issue is not actionable because of missing required information, which needs to be provided.)

Comments


lhonglwl commented Jan 6, 2020

What version of Go are you using (go version)?

$ go version
go version go1.13.5 linux/amd64

What did you do?

package main

import "testing"

func BenchmarkAppend(b *testing.B) {
    buf1 := make([]byte, 3, 10)
    buf1[0] = 0
    buf1[1] = 1
    buf1[2] = 2
    buf2 := []byte{'a', 'b', 'c', 'd', 'e', 'f', 'g'}
    for i := 0; i < b.N; i++ {
        buf1 = buf1[:3]
        buf1 = append(buf1, buf2...)
    }
}

func BenchmarkAppend2(b *testing.B) {
    buf1 := make([]byte, 3, 10)
    buf1[0] = 0
    buf1[1] = 1
    buf1[2] = 2
    buf2 := buf1[3:10]
    buf2[0] = 'a'
    buf2[1] = 'b'
    buf2[2] = 'c'
    buf2[3] = 'd'
    buf2[4] = 'e'
    buf2[5] = 'f'
    buf2[6] = 'g'
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        buf1 = buf1[:3]
        buf1 = append(buf1, buf2...)
    }
}

func BenchmarkAppend3(b *testing.B) {
    buf1 := make([]byte, 3, 10)
    buf1[0] = 0
    buf1[1] = 1
    buf1[2] = 2
    for i := 0; i < b.N; i++ {
        buf2 := []byte{'a', 'b', 'c', 'd', 'e', 'f', 'g'}
        buf1 = buf1[:3]
        buf1 = append(buf1, buf2...)
    }
}

func BenchmarkAppend4(b *testing.B) {
    buf1 := make([]byte, 3, 10)
    buf1[0] = 0
    buf1[1] = 1
    buf1[2] = 2
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        buf2 := buf1[3:10]
        buf2[0] = 'a'
        buf2[1] = 'b'
        buf2[2] = 'c'
        buf2[3] = 'd'
        buf2[4] = 'e'
        buf2[5] = 'f'
        buf2[6] = 'g'
        buf1 = buf1[:3]
        buf1 = append(buf1, buf2...)
    }
}

output:

BenchmarkAppend-8       1000000000               0.879 ns/op
BenchmarkAppend2-8      193146717                6.20 ns/op
BenchmarkAppend3-8      172225722                7.04 ns/op
BenchmarkAppend4-8      170996562                6.96 ns/op

What did you expect to see?

I expected BenchmarkAppend2 to be clearly faster than BenchmarkAppend, since the memory of buf1 and buf2 is contiguous; the append in BenchmarkAppend2 should not need an actual memmove.

But the benchmark shows BenchmarkAppend2 costs much more time than BenchmarkAppend, as if it allocated memory on every iteration, since its cost is almost equal to BenchmarkAppend3 and BenchmarkAppend4.

I wonder whether there is something wrong or a bug in append's memory copy, or whether this will be fixed/optimized in the future.

lhonglwl changed the title from "performance of slice append is optimizable" to "append: performance of slice append is optimizable" on Jan 6, 2020
randall77 (Contributor) commented:

BenchmarkAppend and BenchmarkAppend2 have essentially the same inner loop.
BenchmarkAppend:

     260ms      260ms    10f8af6: MOVL 0x44(SP), SI                       ;example.com/m.BenchmarkAppend issue36405_test.go:13
         .          .    10f8afa: MOVL 0x41(SP), DI                       ;issue36405_test.go:13
         .          .    10f8afe: MOVL DI, 0x3(DX)
      70ms       70ms    10f8b01: MOVL SI, 0x6(DX)                        ;example.com/m.BenchmarkAppend issue36405_test.go:13
     260ms      260ms    10f8b04: INCQ CX                                 ;example.com/m.BenchmarkAppend issue36405_test.go:11
         .          .    10f8b07: CMPQ CX, 0x108(AX)                      ;issue36405_test.go:11
         .          .    10f8b0e: JLE 0x10f8b63
         .          .    10f8b10: CMPQ $0x3, BX                           ;issue36405_test.go:12
         .          .    10f8b14: JB 0x10f8b6d
         .          .    10f8b16: CMPQ $0xa, BX                           ;issue36405_test.go:13
         .          .    10f8b1a: JAE 0x10f8af6

BenchmarkAppend2:

         .          .    10f8bec: MOVL 0x54(SP), SI                       ;issue36405_test.go:32
         .          .    10f8bf0: MOVL 0x51(SP), DI
     1.12s      1.12s    10f8bf4: MOVL DI, 0x3(DX)                        ;example.com/m.BenchmarkAppend2 issue36405_test.go:32
     190ms      190ms    10f8bf7: MOVL SI, 0x6(DX)
         .          .    10f8bfa: INCQ CX                                 ;issue36405_test.go:30
         .          .    10f8bfd: CMPQ CX, 0x108(AX)
         .          .    10f8c04: JLE 0x10f8c59
         .          .    10f8c06: CMPQ $0x3, BX                           ;issue36405_test.go:31
         .          .    10f8c0a: JB 0x10f8c63
     110ms      110ms    10f8c0c: CMPQ $0xa, BX                           ;example.com/m.BenchmarkAppend2 issue36405_test.go:32
         .          .    10f8c10: JAE 0x10f8bec                           ;issue36405_test.go:32

I think what is happening here is the fact that the source and destination arrays are aliased in version 2. When copying those 7 bytes, we use two 4-byte load/stores. That would cause havoc with the write buffer on the chip, as it has to resolve the writes before doing the reads.

Are you seeing this problem in a real application? Generally one doesn't append portions of a buffer to itself.
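
For illustration, a minimal runnable sketch (not from the original comment) confirming that buf2 in BenchmarkAppend2 aliases buf1's spare capacity:

package main

import "fmt"

func main() {
    buf1 := make([]byte, 3, 10)
    buf2 := buf1[3:10] // shares buf1's backing array (its spare capacity)
    // Every append(buf1, buf2...) therefore reads the same bytes that the
    // previous iteration's append just wrote.
    fmt.Println(&buf1[:10][3] == &buf2[0]) // true: aliased memory
}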

ianlancetaylor changed the title from "append: performance of slice append is optimizable" to "cmd/compile: performance of slice append is optimizable" on Jan 6, 2020
ianlancetaylor added the WaitingForInfo label (Issue is not actionable because of missing required information, which needs to be provided.) on Jan 6, 2020

lhonglwl commented Jan 7, 2020

I doubt it is necessary to allocate new memory for BenchmarkAppend2; can't we detect this situation and avoid the unnecessary memmove?

I found this issue because serialization/deserialization with the encoding/binary and bytes.Buffer packages performs poorly (costing almost 10 times as much as C/C++). Using pprof and reading the source code, I found that much of the time goes to runtime.mallocgc and the runtime.convT* functions, which correspond to allocating temporary buffers and copying memory. So I decided to write my own buffer type with fewer temporary allocations and less memory copying, reading and writing directly in a []byte buffer.

However, we can't add data to a []byte by index assignment past its length (only with append), even when it has enough capacity. So I made a new []byte slice pointing just past the end of the buffer, read/wrote through it, and then appended it back to the source buffer, just like BenchmarkAppend2 (see the sketch below).

Unfortunately, BenchmarkAppend2 performs far worse than expected, and it should not.
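
A minimal runnable sketch (illustrative, not part of the original report) of the pattern described above: write into the buffer's spare capacity through a second slice, then append that slice back, as BenchmarkAppend2 does. The name scratch is hypothetical.

package main

import "fmt"

func main() {
    buf := make([]byte, 3, 10)
    scratch := buf[3:10] // second slice over buf's unused capacity
    copy(scratch, "abcdefg")
    // append's memmove copies scratch onto the very same bytes it already occupies.
    buf = append(buf, scratch...)
    fmt.Printf("%q\n", buf) // "\x00\x00\x00abcdefg"
}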

lhonglwl closed this as completed on Jan 7, 2020
lhonglwl reopened this on Jan 7, 2020
randall77 (Contributor) commented:

I found that much of the time goes to runtime.mallocgc and the runtime.convT* functions, which correspond to allocating temporary buffers and copying memory.

The benchmarks you provided don't call any of these things, at least in the inner loop. So I don't think this issue is directly applicable to your original problem (slowness in encoding/binary and/or bytes.Buffer).

Perhaps you could provide a benchmark for your replacement code?

However, we can't add data to a []byte by index assignment past its length (only with append), even when it has enough capacity. So I made a new []byte slice pointing just past the end of the buffer, read/wrote through it, and then appended it back to the source buffer, just like BenchmarkAppend2.

You don't need to use append to write to a []byte. You can reslice past the length (you can reslice up to the capacity) and write directly, using direct assignments, the copy builtin, or encoding/binary.

e.g.:

b := make([]byte, 0, 10)
b = b[:1]
b[0] = 'a'
copy(b[1:4], ...some slice of size 3...)
b = b[:4]
n := binary.PutVarint(b[4:cap(b)], 999) // slice up to capacity so PutVarint has room to write
b = b[:4+n]
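
A self-contained, runnable variant of the sketch above; the payload slice and the values are illustrative stand-ins, not from the original comment:

package main

import (
    "encoding/binary"
    "fmt"
)

func main() {
    payload := []byte{'x', 'y', 'z'} // stands in for "some slice of size 3"

    b := make([]byte, 0, 10)
    b = b[:1]
    b[0] = 'a'
    copy(b[1:4], payload) // write directly into the spare capacity
    b = b[:4]
    n := binary.PutVarint(b[4:cap(b)], 999) // encode straight into the backing array
    b = b[:4+n]

    fmt.Printf("%d bytes: % x\n", len(b), b)
}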


lhonglwl commented Jan 7, 2020

The benchmarks you provided don't call any of these things, at least in the inner loop. So I don't think this issue is directly applicable to your original problem (slowness in encoding/binary and/or bytes.Buffer).

This issue is not the reason for the poor performance of encoding/binary and bytes.Buffer; it's just where I found it.

You don't need to use append to write to a []byte. You can reslice past the length (you can reslice up to the capacity) and write directly, using direct assignments, the copy builtin, or encoding/binary.

Thanks, reslicing does solve my problem. In this situation append may simply have to be avoided, since it performs poorly until it is optimized (if it ever is).

randall77 (Contributor) commented:

I don't think there's any optimization we can actually do here.
The performance is constrained by the hardware - it's just not designed to do writes immediately followed by differently aligned reads quickly. See the Intel optimization manual, section 3.6.5.1.

Go is optimized to do appends fast when the backing store of the slice you're appending to is write-only. Switching between reads and writes on every small append is not a common use case, and is thus not something we optimize for.
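
For illustration, a sketch of that append-friendly shape, under the assumption that the code serializes into a separate output buffer: the destination's backing store is only written in the hot path, and the appended bytes come from separate memory. appendRecord and its parameters are hypothetical, not an API from this thread.

package main

import (
    "encoding/binary"
    "fmt"
)

// appendRecord serializes a tag, a fixed-width uint32, and a payload by
// appending to dst. The destination's backing store is only written here,
// and the bytes being appended come from separate memory, which is the
// case the append path handles well.
func appendRecord(dst []byte, tag byte, value uint32, payload []byte) []byte {
    dst = append(dst, tag)
    var tmp [4]byte
    binary.BigEndian.PutUint32(tmp[:], value)
    dst = append(dst, tmp[:]...)
    dst = append(dst, payload...)
    return dst
}

func main() {
    out := make([]byte, 0, 64)
    out = appendRecord(out, 0x01, 999, []byte("abcdefg"))
    fmt.Printf("% x\n", out)
}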

I'm going to close this issue as not actionable. Please reopen if you disagree.

golang locked and limited conversation to collaborators on Jan 6, 2021