
encoding/json: incorrect usage of sync.Pool #27735

Open
dsnet opened this issue Sep 18, 2018 · 39 comments
Labels
NeedsFix (The path to resolution is known, but the work has not been done.), Performance
Comments

@dsnet
Member

dsnet commented Sep 18, 2018

https://golang.org/cl/84897 introduces a denial-of-service attack on json.Marshal via a live-lock situation with sync.Pool.

Consider this snippet:

package main

import (
	"encoding/json"
	"fmt"
	"runtime"
	"time"
)

type CustomMarshaler int

func (c CustomMarshaler) MarshalJSON() ([]byte, error) {
	time.Sleep(500 * time.Millisecond) // simulate processing time
	b := make([]byte, int(c))
	b[0] = '"'
	for i := 1; i < len(b)-1; i++ {
		b[i] = 'a'
	}
	b[len(b)-1] = '"'
	return b, nil
}

func main() {
	processRequest := func(size int) {
		json.Marshal(CustomMarshaler(size))
		time.Sleep(1 * time.Millisecond) // simulate idle time
	}

	// Simulate a steady stream of infrequent large requests.
	go func() {
		for {
			processRequest(1 << 28) // 256MiB
		}
	}()

	// Simulate a storm of small requests.
	for i := 0; i < 1000; i++ {
		go func() {
			for {
				processRequest(1 << 10) // 1KiB
			}
		}()
	}

	// Continually run a GC and track the allocated bytes.
	var stats runtime.MemStats
	for i := 0; ; i++ {
		runtime.ReadMemStats(&stats)
		fmt.Printf("Cycle %d: %dB\n", i, stats.Alloc)
		time.Sleep(time.Second)
		runtime.GC()
	}
}

This is a variation of #23199 (comment), a situation suggested by @bcmills.

Essentially, we have a 1-to-1000 ratio of goroutines that use 256MiB or 1KiB buffers, respectively. The 256MiB buffer occasionally inserted into the sync.Pool keeps getting held onto by the 1KiB routines. On my machine, after 300 GC cycles, the above program occupies 6GiB of heap, when I expect it to be 256MiB in the worst case.

\cc @jnjackins @bradfitz

@gopherbot

Change https://golang.org/cl/136035 mentions this issue: encoding/json: fix usage of sync.Pool

@bcmills bcmills added the Performance and NeedsFix labels Sep 19, 2018
@bcmills bcmills added this to the Go1.12 milestone Sep 19, 2018
@ianlancetaylor ianlancetaylor modified the milestones: Go1.12, Go1.13 Dec 12, 2018
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@flimzy
Contributor

flimzy commented Aug 25, 2019

It's been nearly a year with no real attention to this issue.

So what are the options for fixing this?

The simplest would be to set an arbitrary size limit for buffers to recycle. Whatever number is selected will hurt performance for someone, I'd wager.
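
For illustration, a minimal sketch of that first option, assuming encodeState still embeds a bytes.Buffer and reuses the package's encodeStatePool; the putEncodeState helper and the 64KiB threshold are hypothetical:

const maxRecycledBufferSize = 64 << 10 // arbitrary 64KiB cutoff

func putEncodeState(e *encodeState) {
	if e.Cap() > maxRecycledBufferSize {
		return // too large: let the GC reclaim it rather than pooling it
	}
	e.Reset()
	encodeStatePool.Put(e)
}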

When using a json.Encoder, an additional configuration parameter could be exposed (e.g. SetMaxBufferSize). That complicates sharing buffer pools across encoder instances, though.

Providing a package-level variable is an option, too, but an ugly one that should absolutely be avoided (I just mention it for completeness' sake).

Eliminating the need for internal buffering in the json encoder could be another option. It wouldn't solve this problem for everyone, but it would provide recourse for those who are affected enough to care. I believe this is the exact issue that was brought up in https://go-review.googlesource.com/c/go/+/135595, which makes it a bit ironic that this issue may be holding up that one :)

@bcmills
Contributor

bcmills commented Aug 26, 2019

@flimzy, one approach would be to keep statistics on the output size, and only return a buffer to the pool if it isn't a statistical outlier. The problem there is that the statistical computations need to be cheap enough that they don't swamp out the benefit from using sync.Pool in the first place.

Another (simpler?) approach would be to pull a buffer from the pool, use it, and abandon it instead of returning it to the pool according to some probability distribution computed from the final high-water mark and the actual buffer size. (One simple distribution would be: 1.0 if the buffer is within 2x of the final high-water mark, or 0.0 otherwise.)
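
For illustration, a rough sketch of that second approach with the simple 1.0/0.0 distribution, again assuming encodeState embeds a bytes.Buffer and using a hypothetical putEncodeState helper:

func putEncodeState(e *encodeState) {
	highWater := e.Len() // final length of the encoded output
	if e.Cap() > 2*highWater {
		return // capacity is an outlier for this output: abandon the buffer to the GC
	}
	e.Reset()
	encodeStatePool.Put(e)
}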

@dsnet
Member Author

dsnet commented Aug 26, 2019

@flimzy, are you running into an issue related to this in production? Seeing how it performs badly in production may be helpful data in coming up with a reasonable approach for this.

When I filed this issue, it was just a theoretical issue I saw. It was not a concrete problem I was experiencing.

@flimzy
Contributor

flimzy commented Aug 26, 2019

@dsnet: No, I'm not. I'm interested in resolving this issue because it seems to be a blocker for making headway on some other json-encoder related issues that are affecting me: https://go-review.googlesource.com/c/go/+/135595, and by extension #33714

@flimzy
Contributor

flimzy commented Aug 26, 2019

Thanks for the feedback, @bcmills

I like the simplicity of your second approach. I think we need to be careful not to discard buffers too aggressively, for applications that naturally have a large disparity in JSON payload sizes. A simple example comes to mind: a REST API that responds with {"ok":true} for PUT requests, but with 10s or 100s of KB in responses for a GET. Perhaps treating 10k (or another arbitrary limit) as the high-water mark whenever the actual mark is smaller would guard against this.

I can still imagine certain applications where this would be sub-optimal, but maybe this is a reasonable starting point, to be tuned later.

@flimzy
Contributor

flimzy commented Aug 26, 2019

As I consider this further, I have to wonder: is this really a DoS attack vector? If it were json.Unmarshal, I could see it being easily exploited (just send a huge payload to your favorite Go-powered REST API). But since this is json.Marshal, exploiting it would require several other validation opportunities to have already been disregarded.

I don't mean to suggest this doesn't warrant some amount of attention; I just suspect the danger is overstated.

@iand
Contributor

iand commented Aug 26, 2019

Another approach would be to remove the package level encodeStatePool and allow the caller to optionally supply one when creating an encoder. The reasoning here is that the caller has more knowledge about the distribution of payloads and the edge cases that a package-level statistical count would struggle with.
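
As a rough sketch of the shape such an API could take (all of these identifiers are hypothetical; none of them exist in encoding/json today):

// BufferPool is a hypothetical interface the caller could supply,
// letting the application decide which buffers are worth keeping.
type BufferPool interface {
	Get() *bytes.Buffer
	Put(*bytes.Buffer)
}

// defaultBufferPool preserves the current behavior for callers
// that don't supply a pool of their own.
type defaultBufferPool struct{ p sync.Pool }

func (d *defaultBufferPool) Get() *bytes.Buffer {
	if v := d.p.Get(); v != nil {
		return v.(*bytes.Buffer)
	}
	return new(bytes.Buffer)
}

func (d *defaultBufferPool) Put(b *bytes.Buffer) {
	b.Reset()
	d.p.Put(b)
}

A NewEncoder variant (or an option on Encoder) would then draw its working buffer from the supplied pool instead of the package-level encodeStatePool.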

@flimzy
Contributor

flimzy commented Aug 26, 2019

Another approach would be to remove the package level encodeStatePool and allow the caller to optionally supply one when creating an encoder.

This does seem like the only approach to cover all use cases optimally. Is there any precedent for something similar elsewhere in the stdlib to use for guidance?

@gopherbot

Change https://golang.org/cl/195217 mentions this issue: Add a simple, optional 'keep buffer' capability to the Encoder type

@flimzy
Contributor

flimzy commented Sep 13, 2019

I have created https://golang.org/cl/195217 as the simplest change I could think of to address the concerns raised in #27735. My main concern with this implementation is that it would become possible for a third-party library to starve the buffer pool by, for instance, never returning anything to the pool.

The other realistic approach, IMO, is to expose the encodeStatePool object in the public API. This has a larger footprint on the public API, and it requires every user of json.NewEncoder who wishes to use the new functionality to propagate their own pool throughout their code (including to any third-party libraries that do JSON encoding) to avoid ending up with lots of fragmented pools.

Even with those drawbacks, the latter may be the better approach, but I started with the former because it is so simple that it seemed worth the minor effort to put it in front of reviewers.

@dsnet
Member Author

dsnet commented Dec 5, 2020

I've been thinking of ways to guarantee bounds on the worst-case behavior, and this is a prototype idea:

// bufferPool is a pool of buffers.
//
// Example usage:
//
//	p := getBuffer()
//	defer putBuffer(p)
//	p.buf = appendFoo(p.buf, foo)        // may resize buffer to arbitrarily large sizes
//	return append([]byte(nil), p.buf...) // single copy of the buffer contents
//
var bufferPool = sync.Pool{
	New: func() interface{} { return new(pooledBuffer) },
}

type pooledBuffer struct {
	buf     []byte
	strikes int // number of times the buffer was under-utilized
}

func getBuffer() *pooledBuffer {
	return bufferPool.Get().(*pooledBuffer)
}

func putBuffer(p *pooledBuffer) {
	// Recycle large buffers if sufficiently utilized.
	// If a buffer is under-utilized enough times sequentially,
	// then it is discarded, ensuring that a single large buffer
	// won't be kept alive by a continuous stream of small usages.
	switch {
	case cap(p.buf) <= 1<<16: // always recycle buffers smaller than 64KiB
		p.strikes = 0
	case cap(p.buf)/2 <= len(p.buf): // at least 50% utilization
		p.strikes = 0
	case p.strikes < 4:
		p.strikes++
	default:
		return // discard the buffer; too large and too often under-utilized
	}
	p.buf = p.buf[:0]
	bufferPool.Put(p)
}

In the worst-case scenario, we can ensure a bound on the minimal utilization based on the % threshold (e.g., 50% in the example above) and maximum number of sequential recycles below that threshold (e.g., 4x in the example above).

For the constants chosen above, the worst case sequence of utilization would be:

  • 50%, 0%, 0%, 0%, 0%, 50%, 0%, 0%, 0%, 0%, 50%, 0%, 0%, 0%, 0%, ...

On average, that's a worst case utilization of 10%, which is far better than the theoretical worst-case of 0% if the code naively recycles buffers without any limits.

Advantages of the algorithm:

  • It's simple; easy to analyze and reason about.
  • It's fast; only requires basic integer arithmetic.
  • It doesn't depend on any global state where statistics are concurrently gathered.
  • For usages of all the same approximate size (regardless of how big or small), this ensures buffers are always recycled.

Happy to hear suggestions for improvements.

@iand
Contributor

iand commented Dec 5, 2020

@dsnet I like this. It's elegant and low cost. What is the reasoning behind dropping <64k buffers?

@josharian
Contributor

The idea of keeping statistics locally is a nice one. We do lose some amount of information every time we throw away an over-sized buffer. An alternative, which might be helpful if we kept more sophisticated statistics, would be to nil out buf and return the pooledBuffer object to the pool. Then on Get we could pre-allocate a new buf with a better initial size.

Even with these simple stats, it might be worth zeroing buf and returning the pooledBuffer to avoid an allocation. We could even include a small byte-array struct field in pooledBuffer to initialize buf from, providing an allocation-free path when a small buf is needed.

@dsnet
Member Author

dsnet commented Dec 5, 2020

@dsnet I like this. It's elegant and low cost. What is the reasoning behind dropping <64k buffers?

It's not dropping <64k buffers; it always chooses to recycle them in the pool. That specific code path isn't strictly necessary, but for buffers below a certain size it probably doesn't hurt to always recycle them regardless of utilization.

An alternative, which might be helpful if we kept more sophisticated statistics, would be to nil out buf and return the pooledBuffer object to the pool. Then on Get we could pre-allocate a new buf with a better initial size.

Interesting idea. I think a useful statistic to obtain would perhaps be the maximum length encountered over the last N usages. If we evict a buffer due to low utilization, we can allocate a new one based on a recently seen maximum length.
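
For illustration, a rough sketch combining the two ideas, building on the prototype above (it reuses that bufferPool and simplifies "the last N usages" to a single recently observed maximum; the field names are illustrative):

type pooledBuffer struct {
	buf     []byte
	strikes int // number of times the buffer was under-utilized
	maxLen  int // largest output length seen since the last eviction
}

func getBuffer() *pooledBuffer {
	p := bufferPool.Get().(*pooledBuffer)
	if p.buf == nil {
		// The previous buffer was evicted for under-utilization;
		// allocate a replacement sized by the recently seen maximum.
		p.buf = make([]byte, 0, p.maxLen)
	}
	return p
}

func putBuffer(p *pooledBuffer) {
	if len(p.buf) > p.maxLen {
		p.maxLen = len(p.buf)
	}
	switch {
	case cap(p.buf) <= 1<<16, cap(p.buf)/2 <= len(p.buf):
		p.strikes = 0
		p.buf = p.buf[:0]
	case p.strikes < 4:
		p.strikes++
		p.buf = p.buf[:0]
	default:
		p.buf, p.strikes = nil, 0 // drop only the large buffer; keep the stats
	}
	bufferPool.Put(p) // the pooledBuffer itself is always recycled
}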

@lavalamp

@dsnet It's been a while! Friendly ping: any chance we can get this change on track for Go 1.21? (https://go.dev/cl/455776)

@dsnet
Member Author

dsnet commented Feb 13, 2023

Thank you very much for implementing this and exploring it as a solution.

I was in the process of reviewing this a few weeks ago and was saddened by the amount of logic being added (300 LOC for the logic, and another 300 LOC for the tests). I believe that a segmented buffer is probably the right approach in the long term, but much of the complexity stems from the existing logic in json assuming that it's outputting to a bytes.Buffer. Thus, your segmented buffer is forced to implement non-trivial methods like Truncate, WriteByte, WriteString, and WriteTo. Had the rest of json been changed to rely only on append-to mechanisms (e.g., #53685), the segmented buffer implementation could be reduced to <200 LOC (for both logic and tests). This is no fault of your implementation, but rather of the rest of the package that it's trying to interface with.

I see several possible paths forward:

  1. Clean up the rest of json to stop depending on so many methods of bytes.Buffer and instead rely mostly on append-to operations. However, @mvdan, @johanbrandhorst, and I have been discussing completely replacing the underlying syntactical encoder/decoder with a different implementation in the future, so this may not be worth the effort.
  2. Just implement a simple capacity limit on buffers returned to the pool. Something like https://go-review.googlesource.com/c/go/+/136035, where we just accept the loss of performance for larger outputs.
  3. Still use tiered sync.Pools, but maintain a single contiguous buffer (instead of segmented ones). When we grow, we copy the contents into a larger buffer twice the size. With such a buffer it is much simpler to implement methods like Truncate, WriteByte, or WriteString. Also, since the unused capacity is never larger than the length, we could consider just returning the result of Bytes without cloning it. This would avoid the final clone that we do before we return the buffer to the user in Marshal. (A rough sketch follows below.)
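
A rough sketch of option 3's core mechanics (size-classed pools plus copy-on-grow into a single contiguous buffer); the names are illustrative, this is not the code of any CL, and it assumes math/bits is imported:

// pools[i] recycles buffers whose capacity is exactly 1<<i bytes.
var pools [30]sync.Pool

func classFor(n int) int {
	if n <= 64 {
		n = 64 // smallest class
	}
	return bits.Len(uint(n - 1)) // smallest i such that 1<<i >= n
}

func getBuf(n int) []byte {
	c := classFor(n)
	if c < len(pools) {
		if v := pools[c].Get(); v != nil {
			return v.([]byte)[:0]
		}
	}
	return make([]byte, 0, 1<<c)
}

func putBuf(b []byte) {
	c := classFor(cap(b))
	if c < len(pools) && cap(b) == 1<<c { // only exact size classes are recycled
		pools[c].Put(b[:0])
	}
}

// grow keeps the buffer contiguous: when out of room, copy the contents
// into a buffer of roughly twice the size and recycle the old one.
func grow(b []byte, need int) []byte {
	if cap(b)-len(b) >= need {
		return b
	}
	nb := getBuf(2 * (len(b) + need))
	nb = append(nb, b...)
	putBuf(b)
	return nb
}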

@lavalamp

  1. Seems much riskier (more logic changed, more chances for bugs) -- json output is much more complex than tracking how many bytes are in a buffer.
  2. Capping the size doesn't solve the problem in general; it helps a range of sizes but removes the benefits of the cache for other sizes.
  3. In running benchmarks to select the buffer sizes used in my change, it became obvious that allocating really large buffers is sometimes really slow. This would be substantially worse performance variance than what I have implemented. (Ironically this was my original suggestion and the current approach was your counter-suggestion :) )

I found that the existing json tests exercised ~every code path already and the specific tests I added are just for ease in debugging and peace of mind.

The code could be made shorter by using generics to cover both Write and WriteString with one function; it is also probably overly optimized and could be made simpler and only slightly slower. OTOH, the tests could be extended with a randomized test that proves the same behavior as the stock bytes.Buffer.

For my use case (kube-apiserver) I really need to keep all the speed from caching and also to not be 5xing memory usage for certain request patterns, as the current code does. Is there anything I can do to this change to make it acceptable? My other options aren't great.

@dsnet
Member Author

dsnet commented Feb 20, 2023

I discussed it with @mvdan, and we think option 1 is the best path forward. I don't think it's that hard to switch the encoder to use append-like APIs. I'll send out a CL for that.

Thus, the segmented buffer will only need to implement the following methods:

  • Grow, AvailableBuffer, Write, Reset, Bytes

You'll notice that AvailableBuffer is not currently present on bytes.Buffer, but may be added in #53685.

You could optionally implement the following wrappers:

func (b *buffer) WriteString(s string) (int, error) {
    return b.Write(append(b.AvailableBuffer(), s...))
}
func (b *buffer) WriteByte(c byte) error {
    _, err := b.Write(append(b.AvailableBuffer(), c))
    return err
}

or they can instead be methods on encodeState.

@gopherbot

Change https://go.dev/cl/469555 mentions this issue: encoding/json: use append for Compact and Indent

@gopherbot

Change https://go.dev/cl/469556 mentions this issue: encoding/json: unify encodeState.string and encodeState.stringBytes

@gopherbot

Change https://go.dev/cl/469558 mentions this issue: encoding/json: use append-like operations for encoding

@lavalamp

I'm not sold on using AvailableBuffer like that here, because if the thing being written doesn't fit in the buffer there will be an allocation. But there is a fairly easy way to use it that doesn't have that problem, I think.

@dsnet
Member Author

dsnet commented Feb 21, 2023

You're correct that there could be an allocation outside of the segmented buffer, but the number of allocations asymptotically approaches zero so long as the buffer grows exponentially (which it does).

For the cases where the appended bytes could be large, we call Grow beforehand, which guarantees sufficient capacity to prevent an allocation. I did this for the logic prior to calling appendCompact and the appending of base64 encoded data since we know those sizes. We could also consider doing it for strings. All other writes are small (e.g., bools and numbers), where the probability of an allocation is small and low cost.
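
For example, a sketch of that pattern for base64 data, assuming a buffer type with the Grow, AvailableBuffer, and Write methods discussed above (shown against bytes.Buffer, taking AvailableBuffer from #53685 as a given):

func appendBase64(b *bytes.Buffer, src []byte) {
	n := base64.StdEncoding.EncodedLen(len(src))
	b.Grow(n) // reserve capacity so the Write below cannot allocate
	dst := b.AvailableBuffer()[:n]
	base64.StdEncoding.Encode(dst, src)
	b.Write(dst) // dst aliases b's spare capacity, so this copies in place
}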

@lavalamp

lavalamp commented Feb 21, 2023

I don't think it's ideal but it can be made to work by truncating the existing buffer and allocating a one-off buffer of the requested size or bigger. Likewise when there are only a few bytes left in the current buffer, I would truncate it and make another when AvailableBuffer is called.

I found that the best performance was with many smallish buffers (< 64k IIRC) rather than continuing to double the size. So, the allocations are not exponentially rare when the output is big enough. (But they can be mostly prevented by wasting some bytes, as described.)

@lavalamp

OK, I read the above CLs -- is there more coming or shall I rebase on top of those?

@dsnet
Member Author

dsnet commented Feb 21, 2023

Even if the buffer is not exponentially grown, so long as we explicitly call Grow before large writes, I believe the performance will still be fine. Writes of single-byte delimiters, bools, and numbers are sufficiently small that even with a 4k buffer, the probability of an occasional small allocation is okay.

OK, I read the above CLs -- is there more coming or shall I rebase on top of those?

That's all for what's relevant for your change.

@lavalamp

I've rebased and it's working, but it's much slower than it was so I've got more work to do before I push a new revision.

@lavalamp

I have perf back to baseline, but git codereview mail does not like the fact that I've pulled your changes locally and rebased on top of them (it thinks I'm trying to impersonate you :) ), how do I appease it?

@lavalamp

OK I think I figured out my rebase problem and https://go.dev/cl/455776 is ready for a pass. It is not much smaller because the b.Write(append(b.AvailableBuffer(), s...)) style one-liners are significantly slower than the code I'd gotten to. WriteByte in particular is called so often that all the details matter a lot.

gopherbot pushed a commit that referenced this issue Feb 24, 2023
This is part of the effort to reduce direct reliance on bytes.Buffer
so that we can use a buffer with better pooling characteristics.

Avoid direct use of bytes.Buffer in Compact and Indent and
instead modify the logic to rely only on append.
This avoids reliance on the bytes.Buffer.Truncate method,
which makes switching to a custom buffer implementation easier.

Performance:

	name                old time/op    new time/op    delta
	EncodeMarshaler    25.5ns ± 8%    25.7ns ± 9%   ~     (p=0.724 n=10+10)

	name                old alloc/op   new alloc/op   delta
	EncodeMarshaler     4.00B ± 0%     4.00B ± 0%   ~     (all equal)

	name                old allocs/op  new allocs/op  delta
	EncodeMarshaler      1.00 ± 0%      1.00 ± 0%   ~     (all equal)

Updates #27735

Change-Id: I8cded03fab7651d43b5a238ee721f3472530868e
Reviewed-on: https://go-review.googlesource.com/c/go/+/469555
Run-TryBot: Joseph Tsai <joetsai@digital-static.net>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Auto-Submit: Joseph Tsai <joetsai@digital-static.net>
Reviewed-by: Bryan Mills <bcmills@google.com>
gopherbot pushed a commit that referenced this issue Feb 24, 2023
This is part of the effort to reduce direct reliance on bytes.Buffer
so that we can use a buffer with better pooling characteristics.

Unify these two methods as a single version that uses generics
to reduce duplicated logic. Unfortunately, we lack a generic
version of utf8.DecodeRune (see #56948), so we cast []byte to string.
The []byte variant is slightly slower for multi-byte unicode since
casting results in a stack-allocated copy operation.
Fortunately, this code path is used only for TextMarshalers.
We can also delete TestStringBytes, which exists to ensure
that the two duplicate implementations remain in sync.

Performance:

    name              old time/op    new time/op    delta
    CodeEncoder          399µs ± 2%     409µs ± 2%   +2.59%  (p=0.000 n=9+9)
    CodeEncoderError     450µs ± 1%     451µs ± 2%     ~     (p=0.684 n=10+10)
    CodeMarshal          553µs ± 2%     562µs ± 3%     ~     (p=0.075 n=10+10)
    CodeMarshalError     733µs ± 3%     737µs ± 2%     ~     (p=0.400 n=9+10)
    EncodeMarshaler     24.9ns ±12%    24.1ns ±13%     ~     (p=0.190 n=10+10)
    EncoderEncode       12.3ns ± 3%    14.7ns ±20%     ~     (p=0.315 n=8+10)

    name              old speed      new speed      delta
    CodeEncoder       4.87GB/s ± 2%  4.74GB/s ± 2%   -2.53%  (p=0.000 n=9+9)
    CodeEncoderError  4.31GB/s ± 1%  4.30GB/s ± 2%     ~     (p=0.684 n=10+10)
    CodeMarshal       3.51GB/s ± 2%  3.46GB/s ± 3%     ~     (p=0.075 n=10+10)
    CodeMarshalError  2.65GB/s ± 3%  2.63GB/s ± 2%     ~     (p=0.400 n=9+10)

    name              old alloc/op   new alloc/op   delta
    CodeEncoder          327B ±347%     447B ±232%  +36.93%  (p=0.034 n=9+10)
    CodeEncoderError      142B ± 1%      143B ± 0%     ~     (p=1.000 n=8+7)
    CodeMarshal         1.96MB ± 2%    1.96MB ± 2%     ~     (p=0.468 n=10+10)
    CodeMarshalError    2.04MB ± 3%    2.03MB ± 1%     ~     (p=0.971 n=10+10)
    EncodeMarshaler      4.00B ± 0%     4.00B ± 0%     ~     (all equal)
    EncoderEncode        0.00B          0.00B          ~     (all equal)

    name              old allocs/op  new allocs/op  delta
    CodeEncoder           0.00           0.00          ~     (all equal)
    CodeEncoderError      4.00 ± 0%      4.00 ± 0%     ~     (all equal)
    CodeMarshal           1.00 ± 0%      1.00 ± 0%     ~     (all equal)
    CodeMarshalError      6.00 ± 0%      6.00 ± 0%     ~     (all equal)
    EncodeMarshaler       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
    EncoderEncode         0.00           0.00          ~     (all equal)

There is a very slight performance degradation for CodeEncoder
due to an increase in allocation sizes. However, the number of allocations
did not change. This is likely due to remote effects of the growth rate
differences between bytes.Buffer and the builtin append function.
We shouldn't overly rely on the growth rate of bytes.Buffer anyways
since that is subject to possibly change in #51462.
As the benchtime increases, the alloc/op goes down indicating
that the amortized memory cost is fixed.

Updates #27735

Change-Id: Ie35e480e292fe082d7986e0a4d81212c1d4202b3
Reviewed-on: https://go-review.googlesource.com/c/go/+/469556
Run-TryBot: Joseph Tsai <joetsai@digital-static.net>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Auto-Submit: Joseph Tsai <joetsai@digital-static.net>
gopherbot pushed a commit that referenced this issue Feb 24, 2023
As part of the effort to rely less on bytes.Buffer,
switch most operations to use more natural append-like operations.
This makes it easier to swap bytes.Buffer out with a buffer type
that only needs to support a minimal subset of operations.

As a simplification, we can remove the use of the scratch buffer
and use the available capacity of the buffer itself as the scratch.
Also, declare an inlineable mayAppendQuote function to conditionally
append a double-quote if necessary.

Performance:

    name              old time/op    new time/op    delta
    CodeEncoder          405µs ± 2%     397µs ± 2%  -1.94%  (p=0.000 n=20+20)
    CodeEncoderError     453µs ± 1%     444µs ± 4%  -1.83%  (p=0.000 n=19+19)
    CodeMarshal          559µs ± 4%     548µs ± 2%  -2.02%  (p=0.001 n=19+17)
    CodeMarshalError     724µs ± 3%     716µs ± 2%  -1.13%  (p=0.030 n=19+20)
    EncodeMarshaler     24.9ns ±15%    22.9ns ± 5%    ~     (p=0.086 n=20+17)
    EncoderEncode       14.0ns ±27%    15.0ns ±20%    ~     (p=0.365 n=20+20)

There is a slight performance gain across the board due to
the elimination of the scratch buffer. Appends are done directly
into the unused capacity of the underlying buffer,
avoiding an additional copy. See #53685

Updates #27735

Change-Id: Icf6d612a7f7a51ecd10097af092762dd1225d49e
Reviewed-on: https://go-review.googlesource.com/c/go/+/469558
Reviewed-by: Daniel Martí <mvdan@mvdan.cc>
Auto-Submit: Joseph Tsai <joetsai@digital-static.net>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Run-TryBot: Joseph Tsai <joetsai@digital-static.net>
@gopherbot

Change https://go.dev/cl/471200 mentions this issue: encoding/json: improve Marshal memory utilization

@dsnet
Member Author

dsnet commented Feb 25, 2023

I'm still a bit saddened by the amount of code being added. I sent out https://go.dev/cl/471200 as a simpler alternative implementation.

Here's the performance comparison:

name \ time/op             v0                v1             v2
LargeOutput                1.85s ± 3%        1.56s ± 3%     1.35s ± 9%
CodeEncoder                394µs ± 1%        408µs ± 2%     379µs ± 1%
CodeEncoderError           437µs ± 1%        450µs ± 1%     423µs ± 1%
AllocationWastage          396µs ± 2%        719µs ± 3%     482µs ± 4%
CodeMarshal                534µs ± 1%        722µs ± 7%     633µs ± 4%
CodeMarshalError           694µs ± 2%        905µs ± 4%     789µs ± 3%
MarshalBytes/32            170ns ± 3%        173ns ± 3%     162ns ± 2%
MarshalBytes/256           387ns ± 2%        456ns ± 1%     378ns ± 2%
MarshalBytes/4096         3.93µs ± 2%       3.98µs ± 1%    3.89µs ± 1%
MarshalBytesError/32       372µs ± 1%        361µs ± 1%     370µs ± 1%
MarshalBytesError/256      372µs ± 1%        363µs ± 1%     371µs ± 1%
MarshalBytesError/4096     379µs ± 1%        369µs ± 3%     377µs ± 2%
EncodeMarshaler           22.8ns ± 2%       25.1ns ±12%    24.7ns ± 5%
EncoderEncode             12.5ns ± 3%       13.2ns ± 1%    15.0ns ±19%

name \ wasted_b/op         v0                v1             v2
AllocationWastage          2.25M ± 0%        0.16M ± 0%     0.16M ± 0%

where v0 is the existing implementation, v1 is https://go.dev/cl/455776, and v2 is https://go.dev/cl/471200.

We see that both v1 and v2 result in low allocation wastage compared to v0.
This is the problem we're trying to solve.

Comparing performance of v2 relative to v1, we get:

name                         old time/op      new time/op   delta
LargeOutput                  1.56s ± 3%       1.35s ± 9%   -13.49%  (p=0.000 n=8+10)
CodeEncoder                  408µs ± 2%       379µs ± 1%    -7.08%  (p=0.000 n=10+9)
CodeEncoderError             450µs ± 1%       423µs ± 1%    -5.92%  (p=0.000 n=8+9)
AllocationWastage            719µs ± 3%       482µs ± 4%   -32.87%  (p=0.000 n=9+10)
CodeMarshal                  722µs ± 7%       633µs ± 4%   -12.37%  (p=0.000 n=9+10)
CodeMarshalError             905µs ± 4%       789µs ± 3%   -12.80%  (p=0.000 n=10+9)
MarshalBytes/32              173ns ± 3%       162ns ± 2%    -6.37%  (p=0.000 n=10+9)
MarshalBytes/256             456ns ± 1%       378ns ± 2%   -17.03%  (p=0.000 n=10+10)
MarshalBytes/4096           3.98µs ± 1%      3.89µs ± 1%    -2.34%  (p=0.000 n=9+10)
MarshalBytesError/32         361µs ± 1%       370µs ± 1%    +2.59%  (p=0.000 n=9+10)
MarshalBytesError/256        363µs ± 1%       371µs ± 1%    +2.22%  (p=0.000 n=10+9)
MarshalBytesError/4096       369µs ± 3%       377µs ± 2%    +1.92%  (p=0.002 n=10+10)
EncodeMarshaler             25.1ns ±12%      24.7ns ± 5%      ~     (p=0.343 n=10+10)
EncoderEncode               13.2ns ± 1%      15.0ns ±19%   +13.39%  (p=0.003 n=8+10)

There's generally a performance improvement over the more complicated https://go.dev/cl/455776 implementation.
The apparent performance degradation of EncoderEncode seems to be largely in the noise.

I feel bad sending out an alternative CL after you (@lavalamp) clearly spent a lot of time on your CL. I'm okay with you taking attribution credit for the CL if you want. After all, my CL would not have come about until you spent the grunt work to prove that segmented buffers were a viable solution.

(The benchmarks for v2 were not run with the optimization to bytes.Join in https://go.dev/cl/456336)

@dsnet
Member Author

dsnet commented Feb 27, 2023

I found that the best performance was with many smallish buffers

I found this comment curious, so I spent some time investigating why this might be the case.

The cause is that as buffers grow larger, they are recycled through the sync.Pool more infrequently, with longer time periods between each use. Consequently, there's a higher probability that a larger buffer has been GC'd before it even gets a chance to be reused. At the same time, there are benefits to larger buffer sizes, as hardware is often optimized for long sequential memory accesses. Thus, there are two goals competing against each other: we do want to grow the buffer sizes over time, but not so quickly that larger buffers are likely to be GC'd too early.

@lavalamp

I put my benchmark results in a comment on the CL, they are directionally the same.

I am happy with any solution as long as the problem gets solved :)

@lavalamp

Thus, there are two goals that are competing against each other.

Yeah, the most striking thing was that variance was super high with large buffers, so it makes sense that there are competing effects.

@mitar
Contributor

mitar commented Oct 21, 2023

@dsnet Have you seen the approach taken by https://github.com/valyala/bytebufferpool? It also attempts to collect some stats.
