Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/url: optimize unescape and escape #17860

Open
dsnet opened this issue Nov 8, 2016 · 14 comments
Open

net/url: optimize unescape and escape #17860

dsnet opened this issue Nov 8, 2016 · 14 comments

Comments

@dsnet
Copy link
Member

dsnet commented Nov 8, 2016

According to a profiling data from a large number of servers at Google, URL related functions are among the top 150 functions by CPU time. The simplest optimizations I see are around using LUTs in shouldEscape, isHex, and unHex.

@dsnet dsnet added this to the Go1.9 milestone Nov 8, 2016
@LMMilewski
Copy link
Member

Based on the profiling data that you mention, it looks like the unescape time is split evenly between runtime.makeslice and runtime.slicebytetostring.
The net/url.shouldEscape function is responsible only for ~7% of the CPU time. LUTs might not help as much as we would like.

@bradfitz
Copy link
Contributor

bradfitz commented Feb 8, 2017

Regarding the allocations and slicebytetostring, see compiler feature request #6714.

/cc @randall77 @cherrymui @josharian @dr2chase

@bradfitz
Copy link
Contributor

bradfitz commented Feb 8, 2017

See also: #18990

@bradfitz bradfitz modified the milestones: Go1.10, Go1.9 May 23, 2017
@ccbrown
Copy link

ccbrown commented Sep 12, 2017

I'm not sure it's the kind of optimization you want, but I find that if I do some "hand optimization", just eliminating branches, I get pretty significant speed increases:

const ESCAPE_ME = "net/url: optimize unescape and escape"

var s string

func BenchmarkPathEscape(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s = url.PathEscape(ESCAPE_ME)
	}
}

func shouldPathEscape(c byte) bool {
	if 'A' <= c && c <= 'Z' || 'a' <= c && c <= 'z' || '0' <= c && c <= '9' {
		return false
	}
	switch c {
	case '-', '_', '.', '~', ':', '@', '&', '=', '+', '$':
		return false
	}
	return true
}

func MyPathEscape(s string) string {
	hexCount := 0
	for i := 0; i < len(s); i++ {
		if shouldPathEscape(s[i]) {
			hexCount++
		}
	}
	if hexCount == 0 {
		return s
	}
	t := make([]byte, len(s)+2*hexCount)
	j := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		if shouldPathEscape(c) {
			t[j] = '%'
			t[j+1] = "0123456789ABCDEF"[c>>4]
			t[j+2] = "0123456789ABCDEF"[c&15]
			j += 3
		} else {
			t[j] = c
			j++
		}
	}
	return string(t)
}

func BenchmarkMyPathEscape(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s = MyPathEscape(ESCAPE_ME)
	}
}
goos: darwin
goarch: amd64
BenchmarkPathEscape-8     	 3000000	       438 ns/op
BenchmarkMyPathEscape-8   	10000000	       147 ns/op

It seems to me like calls to functions like escape should probably be inlined when a parameter on which many branches depend is given a const argument:

func PathEscape(s string) string {
    return escape(s, encodePathSegment) // not inlined, but probably should be
}

@martisch
Copy link
Contributor

martisch commented Sep 12, 2017

I also experimented with some optimizations (mostly unescpae) but have not cleaned them up in a CL yet. So i am interested what we can improve too. Maybe i still get around to finish them for 1.10.

I was thinking of creating a table [256]uint8 where byte contains some flags. e.g. shouldPathEscape (if we need more flags this can become uint32)
so we can use the same lookup table in multiple functions to quickly look up properties of ASCII characters while at the same time not wasting ram for a lookup table each time. This could be even stored in a separate package so multiple std lib packages (e.g unicode.IsSpace or strings.Fields) can use the same table.

For unhex i prefer using 9*(c>>6)` + (c&15) vs a LUT. I will have to benchmark this again but will send as a small separate improvement for the std lib occurrences if it still checks out to be faster than the current branch based version.

@ccbrown
Some comments while briefly looking at the code from the above comment:

  • t[j+2] could maybe avoid some bounds check here by checking _ = t[j+2] earlier in the block.
  • we could remember the first position where a to be escaped symbol is and skip ahead to that in the second loop after making a copy of the start up to that symbol. But that would make performance worse when to position is very near the start of the string.
if shouldPathEscape(s[i]) {
 hexCount++
}

if this is not already recognized by the compiler to become a setCC instruction with an add on AMD6 4 instead of a branch we should have a look at optimizing that. (maybe there is even an add that just takes the flag without setCC)

Does your new version pass all the tests?

@ccbrown
Copy link

ccbrown commented Sep 12, 2017

Tried a few more things.

Using a boolean LUT does make things faster:

func MyBoolLUTEscape(s string, shouldEscapeLUT [256]bool) string {
	hexCount := 0
	for i := 0; i < len(s); i++ {
		if shouldEscapeLUT[s[i]] {
			hexCount++
		}
	}
	if hexCount == 0 {
		return s
	}
	t := make([]byte, len(s)+2*hexCount)
	j := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		if shouldEscapeLUT[c] {
			t[j] = '%'
			t[j+1] = "0123456789ABCDEF"[c>>4]
			t[j+2] = "0123456789ABCDEF"[c&15]
			j += 3
		} else {
			t[j] = c
			j++
		}
	}
	return string(t)
}
BenchmarkPathEscape-8              	 3000000	       438 ns/op
BenchmarkMyPathEscape-8            	10000000	       144 ns/op
BenchmarkMyBoolLUTPathEscape-8     	10000000	       128 ns/op

Compacting all of the different encodings into a single [256]byte LUT is slower, but still reasonably fast (and would potentially save 6*256 bytes in memory):

func MyLUTEscape(s string, mode encoding) string {
	hexCount := 0
	mask := byte(1 << byte(mode))
	for i := 0; i < len(s); i++ {
		if shouldEscapeLUT[s[i]]&mask != 0 {
			hexCount++
		}
	}
	if hexCount == 0 {
		return s
	}
	t := make([]byte, len(s)+2*hexCount)
	j := 0
	for i := 0; i < len(s); i++ {
		c := s[i]
		if shouldEscapeLUT[s[i]]&mask != 0 {
			t[j] = '%'
			t[j+1] = "0123456789ABCDEF"[c>>4]
			t[j+2] = "0123456789ABCDEF"[c&15]
			j += 3
		} else {
			t[j] = c
			j++
		}
	}
	return string(t)
}
BenchmarkMyLUTEscape-8             	10000000	       151 ns/op

Note while these functions can be re-used by all of the encodings, I'm still excluding the c == ' ' && mode == encodeQueryComponent case for now. If needed, query component encoding can get its own function implementation.

@martisch

t[j+2] could maybe avoid some bounds check here by checking _ = t[j+2] earlier in the block.

we could remember the first position where a to be escaped symbol is and skip ahead to that in the second loop after making a copy of the start up to that symbol. But that would make performance worse when to position is very near the start of the string.

Tried both. Neither really made a measurable difference for me.

if this is not already recognized by the compiler to become a setCC instruction with an add on AMD64 instead of a branch we should have a look at optimizing that.

That function actually got inlined for me. So it didn't use any conditional instructions like that there.

But in the LUT version, maybe that sort of thing could work here:

perf_test.go:207	0x10f004d		4084f8			TESTL DI, AL
perf_test.go:207	0x10f0050		74e3			JE 0x10f0035
perf_test.go:208	0x10f0052		48ffc6			INCQ SI
perf_test.go:207	0x10f0055		ebde			JMP 0x10f0035

I'm not really gonna try to optimize this at such a low level yet though.

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Unplanned May 18, 2018
@gopherbot
Copy link

Change https://golang.org/cl/134296 mentions this issue: net/url: remove an allocation for short strings in escape

gopherbot pushed a commit that referenced this issue Sep 11, 2018
Use a 64 byte array to avoid an allocation on the assumption that
most url escaping is performed on short strings. Also adds a fast
path for escaping strings whose only replacements are spaces which
is common in query components.

Adds benchmarks for QueryEscape, PathEscape, QueryUnescape and
PathUnescape but no optimizations are include for the unescape functions
so I don't include those benchmark results here.

Reduces allocations by 10% in the existing String benchmark with a
modest performance increase.

name               old time/op    new time/op    delta
QueryEscape/#00-8    64.6ns ± 1%    43.8ns ± 0%  -32.14%  (p=0.000 n=9+9)
QueryEscape/#1-8     276ns ± 3%     249ns ± 0%   -9.62%  (p=0.000 n=10+7)
QueryEscape/#2-8     176ns ± 2%     155ns ± 3%  -12.21%  (p=0.000 n=10+10)
QueryEscape/#3-8     388ns ± 1%     362ns ± 0%   -6.55%  (p=0.000 n=10+8)
QueryEscape/#4-8    2.32µs ± 2%    2.27µs ± 2%   -2.26%  (p=0.001 n=10+10)
PathEscape/#00-8     78.0ns ± 3%    63.4ns ± 1%  -18.69%  (p=0.000 n=10+10)
PathEscape/#1-8      276ns ± 2%     260ns ± 0%   -6.01%  (p=0.000 n=10+10)
PathEscape/#2-8      175ns ± 0%     153ns ± 0%  -12.53%  (p=0.000 n=8+10)
PathEscape/#3-8      389ns ± 2%     361ns ± 0%   -7.21%  (p=0.000 n=10+9)
PathEscape/#4-8     2.30µs ± 2%    2.27µs ± 1%   -1.33%  (p=0.001 n=9+10)
String-8             3.56µs ± 4%    3.42µs ± 7%   -4.00%  (p=0.003 n=10+10)

name               old alloc/op   new alloc/op   delta
QueryEscape/#00-8     16.0B ± 0%      8.0B ± 0%  -50.00%  (p=0.000 n=10+10)
QueryEscape/#1-8      128B ± 0%       64B ± 0%  -50.00%  (p=0.000 n=10+10)
QueryEscape/#2-8     64.0B ± 0%     32.0B ± 0%  -50.00%  (p=0.000 n=10+10)
QueryEscape/#3-8      128B ± 0%       64B ± 0%  -50.00%  (p=0.000 n=10+10)
QueryEscape/#4-8      832B ± 0%      832B ± 0%     ~     (all equal)
PathEscape/#00-8      32.0B ± 0%     16.0B ± 0%  -50.00%  (p=0.000 n=10+10)
PathEscape/#1-8       128B ± 0%       64B ± 0%  -50.00%  (p=0.000 n=10+10)
PathEscape/#2-8      64.0B ± 0%     32.0B ± 0%  -50.00%  (p=0.000 n=10+10)
PathEscape/#3-8       128B ± 0%       64B ± 0%  -50.00%  (p=0.000 n=10+10)
PathEscape/#4-8       704B ± 0%      704B ± 0%     ~     (all equal)
String-8             1.84kB ± 0%    1.66kB ± 0%   -9.57%  (p=0.000 n=10+10)

name               old allocs/op  new allocs/op  delta
QueryEscape/#00-8      2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.000 n=10+10)
QueryEscape/#1-8      2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.000 n=10+10)
QueryEscape/#2-8      2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.000 n=10+10)
QueryEscape/#3-8      2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.000 n=10+10)
QueryEscape/#4-8      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
PathEscape/#00-8       2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.000 n=10+10)
PathEscape/#1-8       2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.000 n=10+10)
PathEscape/#2-8       2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.000 n=10+10)
PathEscape/#3-8       2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.000 n=10+10)
PathEscape/#4-8       2.00 ± 0%      2.00 ± 0%     ~     (all equal)
String-8               69.0 ± 0%      61.0 ± 0%  -11.59%  (p=0.000 n=10+10)

Updates #17860

Change-Id: I45c5e9d40b242f874c61f6ccc73bf94c494bb868
Reviewed-on: https://go-review.googlesource.com/134296
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@iand
Copy link
Contributor

iand commented Sep 11, 2018

While working on 134296 I saw that the benchmark for the String() method still shows about 60 allocations being made. This is because we don't pre-size the strings.Builder and also call escape at least 5 times for different url components.

I think this could be optimized by pre-sizing the builder with the known lengths (scheme, opaque, rawquery) and passing the builder to a new escape function that writes to a builder. The existing escape function could be modified to create a builder and call the new function or the existing callers of escape could do it (QueryEscape, PathEscape etc.)

I can work on this if there is interest in this approach.

@bradfitz
Copy link
Contributor

@iand, sounds reasonable.

@iand
Copy link
Contributor

iand commented Sep 12, 2018

After spending several more minutes examining the String benchmark I realised that the reported timings and allocations are the cumulative totals of all 20 test cases.

I investigated modifying the various escape functions write to a shared strings.Builder but couldn't reduce the allocations and timings by enough to justify the extra complexity.

Also looked at pre-sizing the builder. It's easy enough to calculate the capacity needed for the fixed components and estimate the length of most of the encoded parts, but path is little more tricky. Performing these calculations adds timing overhead which makes the function slower overall, even with the one or two allocations it saves.

I found I could get just as good a benefit by simply growing the builder to a fixed 64 bytes. Some timings for variants on this approach are below. However, it's likely that I'm just fitting to the test cases.

I'm happy to send a CL for a one line change setting the initial builder size to 64 bytes, but I'm not convinced it adds much value.

Current master:

BenchmarkString-8        500000	      3393 ns/op	    1664 B/op	61 allocs/op

Pre-sized builder.

32 bytes:  BenchmarkString-8   500000	2966 ns/op	    1632 B/op	43 allocs/op
64 bytes:  BenchmarkString-8   500000	2837 ns/op	    1648 B/op	33 allocs/op
128 bytes: BenchmarkString-8   500000	2891 ns/op	    2800 B/op	32 allocs/op

@gopherbot
Copy link

Change https://golang.org/cl/174998 mentions this issue: net/url: rework shouldEscape func to bitmask flag for performance boost

@cristaloleg
Copy link

What is the state of this issue? Looks like it's stalled after 1.13 freeze, but no other problems are mentioned. Kindly ping @iand

@ahfuzhang
Copy link

I write a new version of FastQueryEscape(), it's 7.33 times faster than url.QueryEscape().
Of course, I based your codes.

The detail is here: https://www.cnblogs.com/ahfuzhang/p/17473178.html

@gopherbot
Copy link

Change https://go.dev/cl/514335 mentions this issue: net/url: optimize QueryEscape and PathEscape

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants