
cmd/compile: optimise ranges over string(byteSlice) #27148

Open
mvdan opened this issue Aug 22, 2018 · 9 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Comments

@mvdan
Member

mvdan commented Aug 22, 2018

Take these two benchmarks: https://play.golang.org/p/QI4BxUq8MGp

The first code is cleaner, more idiomatic, and easier to write/maintain. The second is much trickier, and I'm not even sure I wrote it correctly.
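For reference, the two variants look roughly like this (a sketch, not the exact playground code; the Input definition matches the one quoted later in the thread, and sink only exists here to keep the loop bodies from being optimized away):

package p

import (
	"bytes"
	"testing"
	"unicode/utf8"
)

var Input = bytes.Repeat([]byte("abcde"), 500)

var sink rune

// Idiomatic: convert to string once per iteration and let range do the UTF-8 decoding.
func BenchmarkIdiomatic(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for _, r := range string(Input) {
			sink = r
		}
	}
}

// Manual: decode runes directly from the byte slice, with no conversion.
func BenchmarkManual(b *testing.B) {
	for i := 0; i < b.N; i++ {
		for j := 0; j < len(Input); {
			r, size := utf8.DecodeRune(Input[j:])
			sink = r
			j += size
		}
	}
}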

Lucky for us, the first tends to perform about the same or slightly better in terms of time:

$ go test -bench=. -benchmem
goos: linux
goarch: amd64
pkg: mvdan.cc/p
BenchmarkIdiomatic-4      500000              2389 ns/op            2688 B/op          1 allocs/op
BenchmarkManual-4         500000              2645 ns/op               0 B/op          0 allocs/op
PASS
ok      mvdan.cc/p      2.578s

However, as one can see, it still incurs an extra allocation. I'm not sure why that is - via go test -gcflags=-m, I can see ./f_test.go:16:27: BenchmarkIdiomatic string(Input) does not escape.

We have optimized other common string conversion patterns, such as switch string(byteSlice)
in 566e3e0, and I believe someMap[string(byteSlice)] was optimized too.
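For comparison, these are the shapes that already avoid the copy (a sketch; b is a []byte and m a map[string]int, both placeholder names):

// The temporary string cannot outlive either expression, so the compiler
// elides the allocation for the conversion in both cases.
switch string(b) {
case "foo", "bar":
	// ...
}
if v, ok := m[string(b)]; ok {
	_ = v
}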

Would it be possible to get rid of this allocation somehow? My knowledge of the runtime and compiler is a bit limited, but I'd think that it is possible.

As for a real use case - there are a few occurrences in the standard library that could be greatly simplified. For example, in encoding/json we'd remove ten lines of tricky code:

diff --git a/src/encoding/json/decode.go b/src/encoding/json/decode.go
index 2e734fb39e..fa09b85aab 100644
--- a/src/encoding/json/decode.go
+++ b/src/encoding/json/decode.go
@@ -1210,20 +1210,11 @@ func unquoteBytes(s []byte) (t []byte, ok bool) {
 	// then no unquoting is needed, so return a slice of the
 	// original bytes.
 	r := 0
-	for r < len(s) {
-		c := s[r]
-		if c == '\\' || c == '"' || c < ' ' {
-			break
-		}
-		if c < utf8.RuneSelf {
-			r++
-			continue
-		}
-		rr, size := utf8.DecodeRune(s[r:])
-		if rr == utf8.RuneError && size == 1 {
+	for i, c := range string(s) {
+		if c == '\\' || c == '"' || c < ' ' || c == utf8.RuneError {
 			break
 		}
-		r += size
+		r = i
 	}
 	if r == len(s) {
 		return s, true

However, that currently means a bad regression in speed and allocations:

name           old time/op    new time/op    delta
CodeDecoder-4    27.3ms ± 1%    31.2ms ± 1%   +14.16%  (p=0.002 n=6+6)

name           old speed      new speed      delta
CodeDecoder-4  71.1MB/s ± 1%  62.3MB/s ± 1%   -12.41%  (p=0.002 n=6+6)

name           old alloc/op   new alloc/op   delta
CodeDecoder-4    2.74MB ± 0%    5.13MB ± 0%   +87.28%  (p=0.002 n=6+6)

name           old allocs/op  new allocs/op  delta
CodeDecoder-4     77.5k ± 0%    184.3k ± 0%  +137.72%  (p=0.002 n=6+6)

I haven't investigated why my microbenchmark is faster when simpler, while json gets so much slower when made simpler.

Any input appreciated. cc @josharian @martisch @randall77 @TocarIP

@mvdan mvdan added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Aug 22, 2018
@mvdan mvdan added this to the Go1.12 milestone Aug 22, 2018
@mvdan
Member Author

mvdan commented Aug 22, 2018

Another similar optimization: https://go-review.googlesource.com/c/go/+/108985

Perhaps there's a better way to iterate over the utf8 runes in a byte slice that I'm missing. Hopefully not, as that would be a bit embarrassing :)

@martisch
Contributor

martisch commented Aug 22, 2018

One reason this has not yet been optimized to avoid the allocation, the way range []byte(string) was, is that it is hard to prove in general that s is not modified inside the body of "for i, c := range string(s)". A write through any byte pointer anywhere in the loop body (even in other packages, or through interfaces) can change s, and thereby change the result compared to the required semantics, where string(s) is copied on loop entry.
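To illustrate (a contrived sketch): the spec says the range expression string(s) is evaluated, and hence copied, once before the loop begins, so a write to s inside the loop must not affect the runes the loop observes:

s := []byte("héllo")
for i, c := range string(s) { // iterates over a snapshot of s
	if i == 0 {
		s[1] = 'X' // clobbers the first byte of 'é' in the underlying slice
	}
	_ = c // with the copy, the loop still sees 'é' at i == 1; without it, it would see 'X' followed by a RuneError
}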

However, there might be cases like this one where the optimization could trigger with some very conservative analysis.

Maybe we could solve this problem with a range loop variant that iterates over the runes of a byte slice (taking changes to the byte slice during the loop into account) and does not need to allocate in general, similar to how map iteration does not iterate over a snapshot of the map taken on loop entry.
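A rough sketch of what such a lowering might look like (hypothetical, not what the compiler does today), decoding directly from the byte slice s instead of from a copy:

// for i, c := range string(s) { ... } could roughly become:
for i := 0; i < len(s); {
	c, size := utf8.DecodeRune(s[i:]) // reads s directly, so writes to s during the loop are observed
	// ... original loop body, using i and c ...
	i += size
}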

The other question is why BenchmarkManual is so much slower even if it does not need to allocate.

@martisch
Contributor

Note also there is a slight difference between the two versions of the code given:
c == utf8.RuneError is not the same check as rr == utf8.RuneError && size == 1.

@mvdan
Member Author

mvdan commented Aug 22, 2018

Conservative analysis sounds fine to me. I imagine that most of the use cases that matter would not involve pointers, reflection, nor any other magic.

The range loop variant seems interesting. Are there any downsides to this method? I imagine it would essentially be as if the compiler "expanded" the idiomatic code into the manual code.

The other question is why BenchmarkManual is so much slower even if it does not need to allocate.

Yes - but note that in encoding/json, the opposite is true. So it might just be an issue with my benchmark code.

c == utf8.RuneError is not the same check as rr == utf8.RuneError && size == 1

That came to mind, but I wasn't sure. How could I have written the idiomatic version to behave equally?

At least encoding/json doesn't care - both versions of the code pass all the tests. Which is not to say that they're equivalent or equally valid, but at least the idiomatic version isn't horribly broken.
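One way that might work in a range loop (a sketch, only relevant if the stricter check actually matters): when range yields utf8.RuneError, re-decode at that index to distinguish a real U+FFFD in the input (size 3) from an encoding error (size 1):

for i, c := range string(s) {
	if c == '\\' || c == '"' || c < ' ' {
		break
	}
	if c == utf8.RuneError {
		if _, size := utf8.DecodeRune(s[i:]); size == 1 {
			break // an actual encoding error, matching rr == utf8.RuneError && size == 1
		}
	}
	// ...
}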

@martisch
Contributor

martisch commented Aug 22, 2018

Note that it's not only "magic" that can't be used: any function whose body isn't known at compile time (including functions inserted by the compiler), and any function that writes through a byte pointer or to a byte slice, also can't appear in the loop, because we need to prove that none of them can alias the slice we are iterating over. This will need per-function information (#25999).

@josharian
Contributor

However, as one can see, it still incurs an extra allocation.

Where exactly is the allocation? pprof should tell you.
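For example (standard tooling; the exact symbols in the output will vary):

go test -run=NONE -bench=Idiomatic -memprofile=mem.out
go tool pprof -alloc_space mem.out
(pprof) top
(pprof) list BenchmarkIdiomatic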

@quasilyte
Contributor

quasilyte commented Sep 19, 2018

string(Input) does not escape, but that expression is compiled to a call to
runtime.slicebytetostring, which does allocate (though not always; see below).

And now a little experiment:

- var Input = bytes.Repeat([]byte("abcde"), 500)
+ var Input = bytes.Repeat([]byte("abcde"), 5)
BenchmarkIdiomatic-8 50000000  20.9 ns/op 0 B/op 0 allocs/op
BenchmarkManual-8    100000000 23.1 ns/op 0 B/op 0 allocs/op

The problem is that the non-escaping temporary buffer is allocated on the stack,
and it has a very limited capacity.

So if the string being converted does not fit in that buffer, it is heap-allocated:

// The constant is known to the compiler.
// There is no fundamental theory behind this number.
const tmpStringBufSize = 32

type tmpBuf [tmpStringBufSize]byte

Somewhere inside slicebytetostring:

// buf is tmpBuf
if buf != nil && len(b) <= len(buf) {
	p = unsafe.Pointer(buf)
} else {
	p = mallocgc(uintptr(len(b)), nil, false)
}

I think there was an issue somewhere about creating read-only []byte slice
views for temporary non-escaping strings. The problem is that if someone
holds that []byte and mutates it while we're iterating over the read-only view,
the code will behave differently, since string(s) creates a copy.
(Update: #2205)

@mvdan
Member Author

mvdan commented Nov 28, 2018

Doesn't look like anyone plans on working on this, so I'm removing it from 1.12 for now. I think we should keep the issue open for a while, to see if there's any interest.

@mvdan
Member Author

mvdan commented Apr 16, 2019

Poking around encoding/json, I just found one of the reasons why the hand-rolled version is slower: the r += size statement throws off the prove pass, since r could overflow as a signed integer and become negative, so c := s[r] keeps a bounds check. That's unfortunate, and it does show up in the CPU profile.

I can't imagine a way to remove that bounds check, other than having the compiler treat utf8.DecodeRune in a special way. Tricks like if r += size; r < 0 { break } don't seem to appease prove either; but in any case, that would be replacing a bounds check with another branch.
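In other words, a rough sketch of where the information is lost:

r := 0
for r < len(s) {
	c := s[r] // bounds check stays: prove knows r < len(s) here, but cannot show r >= 0
	// ...
	_, size := utf8.DecodeRune(s[r:])
	r += size // prove treats size as an arbitrary int, so in principle r could overflow and go negative
}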

I still think there should be an easy and performant way to iterate over the runes in a []byte, without losing performance or incurring allocations/copies.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022