
encoding/json: add (*Decoder).SetLimit #56733

Open
rolandshoemaker opened this issue Nov 14, 2022 · 22 comments

Comments

@rolandshoemaker
Member

Typically, the advice for avoiding reading excessively large values into memory from an untrusted io.Reader is to wrap the reader in an io.LimitedReader. For encoding/json.NewDecoder this is not necessarily a reasonable approach, since the Decoder may be intended to read from a long-lived stream (e.g. a net.Conn) where the user may not want to limit the total number of bytes read from the Reader across multiple reads, but does want to limit the number of bytes read during a single call to Decoder.Decode (i.e. reading 100 messages of 500 bytes each across the lifetime of the Decoder may be perfectly acceptable, but a single 50,000-byte read is not).

A solution to this problem would be to add a method to Decoder, which allows limiting the amount read from the Reader into the internal Decoder buffer on each call to Decoder.Decode, returning an error if the number of bytes required to decode the value exceeds the set limit. Something along the lines of:

// LimitValueSize limits the number of bytes read from the wrapped io.Reader
// during a single call to dec.Decode. If decoding the value requires reading
// more than limit bytes, dec.Decode will return an error.
func (dec *Decoder) LimitValueSize(limit int)
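
For illustration, a minimal sketch of how the proposed method might be used on a long-lived connection. LimitValueSize does not exist today, and the message type, handle callback, and 64 KiB figure are placeholder assumptions:

package jsonstream

import (
	"encoding/json"
	"net"
)

type message struct {
	Kind string `json:"kind"`
	Body string `json:"body"`
}

// serveConn decodes a stream of JSON messages from a single connection,
// rejecting any one value that would require reading more than 64 KiB.
// LimitValueSize is the proposed method, not part of encoding/json today.
func serveConn(conn net.Conn, handle func(message)) error {
	dec := json.NewDecoder(conn)
	dec.LimitValueSize(64 << 10) // hypothetical: per-value byte limit
	for {
		var msg message
		if err := dec.Decode(&msg); err != nil {
			return err // oversized value, malformed JSON, or connection error
		}
		handle(msg)
	}
}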

cc @dsnet

@gopherbot added this to the Proposal milestone on Nov 14, 2022
@rittneje

Should the bytes that are already in the decoder's buffer count against this limit on subsequent calls to Decode? I would think so as otherwise it would lead to inconsistent behavior, but the way it is phrased in your proposal is not entirely clear.

@dsnet
Member

dsnet commented Nov 15, 2022

The feature seems reasonable in general, but it's unclear whether the limit should be value-based or token-based. Value-based limits make sense when calling Decode, while token-based limits make sense when calling Token.

As a work-around, you could reset the limit on an io.LimitedReader after every Decode call.
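
For concreteness, a sketch of that workaround, reusing the placeholder message type and handle callback from the sketch above (imports: encoding/json, io, net):

// readLoop wraps the connection in an io.LimitedReader and refills its byte
// budget before every Decode call, so each value is limited to maxValueBytes.
func readLoop(conn net.Conn, handle func(message)) error {
	const maxValueBytes = 64 << 10
	lr := &io.LimitedReader{R: conn, N: maxValueBytes}
	dec := json.NewDecoder(lr)
	for {
		lr.N = maxValueBytes // reset the budget for the next value
		var msg message
		if err := dec.Decode(&msg); err != nil {
			// hitting the limit typically surfaces as an unexpected-EOF error
			return err
		}
		handle(msg)
	}
}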

@Jorropo
Member

Jorropo commented Nov 15, 2022

@dsnet The annoying thing with LimitedReader is that, as you pointed out, you need to reset it every time.

@Jorropo
Member

Jorropo commented Nov 15, 2022

Should the bytes that are already in the decoder's buffer count against this limit on subsequent calls to Decode? I would think so as otherwise it would lead to inconsistent behavior, but the way it is phrased in your proposal is not entirely clear.

The goal is to limit how much memory a single message is allowed to keep alive, to help prevent memory-exhaustion attacks.
So I would say this should limit how big the buffer is allowed to grow, not counting buffered over-reads, since those are implementation details.
All of this to say: I think this should limit how big one message is allowed to be.

@Jorropo
Member

Jorropo commented Nov 15, 2022

I think when the limit is reached, Decode should continue to read (but discard) the data and fully read past the problematic message.
This can be implemented with a "simple" state machine that counts how many (), {}, "", ... we have seen (like how the initial buffering already works, but discarding data that has already been read).

This means that if you have two messages, and the first one is too big but the second one is not, Decode would return an error the first time, but if you call it again it would successfully read the second message.
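
A rough sketch of that delimiter-counting skip (a hypothetical helper, not existing encoding/json code; it assumes the caller knows how many array/object levels were open when the limit was hit, and it reads directly from a bufio.Reader):

// skipRestOfValue discards input until the currently open JSON arrays and
// objects close, tracking strings and escape sequences so that brackets
// inside string literals are not miscounted.
func skipRestOfValue(r *bufio.Reader, depth int) error {
	inString, escaped := false, false
	for depth > 0 {
		b, err := r.ReadByte()
		if err != nil {
			return err
		}
		switch {
		case escaped:
			escaped = false // this byte is escaped; ignore it
		case inString && b == '\\':
			escaped = true
		case b == '"':
			inString = !inString
		case inString:
			// ordinary byte inside a string literal; ignore
		case b == '{' || b == '[':
			depth++
		case b == '}' || b == ']':
			depth--
		}
	}
	return nil
}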

@rsc
Contributor

rsc commented Nov 16, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Nov 30, 2022

If you call SetValueLimit and then only call Token, is there just no limit?

Using an io.LimitedReader where you reset the limit is also a possibility, of course,
and it seems clearer what it means.

Maybe part of the problem is when the reset happens. What if there was just a SetInputLimit that applied to all future reads, and you have to reset it yourself when you want to allow more data?

@rsc
Contributor

rsc commented Dec 7, 2022

@rolandshoemaker
Any thoughts on SetInputLimit (with no automatic reset)?
Or about using an io.LimitedReader directly?

@rolandshoemaker
Member Author

I think in cases where you are directly calling the Decoder, the usage of an io.LimitedReader that you reset yourself is reasonable, but it doesn't address the case where you pass a Decoder to something else which calls Token or Decode, where you may not be able to determine when you need to reset your reader's limit.

I've paged out the context about how the parser consumes bytes, but it seems like generally the limit could apply to both values and tokens equally, such that it is essentially only limiting the number of bytes being read from the reader for a particular parsing step. I may be misunderstanding though. @dsnet, were you saying that it may make more sense to have different limits for values vs. tokens, because tokens are inherently smaller and as such you may want to have a more constrained limit?

@dsnet
Member

dsnet commented Dec 12, 2022

it doesn't address the case where you pass a Decoder to something else which calls Token or Decode, where you may not be able to determine when you need to reset your reader's limit.

I'm not sure I understand this use case. Why would the called function need to check whether a limit exists or be responsible for resetting the reader's limit? That seems to be the responsibility of the caller, not the callee:

  1. If the callee experiences an error as it uses the json.Decoder, then great, the limiter is working as intended.
  2. If the callee function returns successfully, then the caller knows the limit was not hit and resets the limit.
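
A sketch of that caller-side pattern, using the io.LimitedReader workaround from earlier (src, limit, and callee are placeholders; imports: encoding/json, io):

// run hands the Decoder to a callee that calls Token or Decode itself.
// The caller alone manages the limit: it refills the byte budget only after
// the callee returns successfully.
func run(src io.Reader, limit int64, callee func(*json.Decoder) error) error {
	lr := &io.LimitedReader{R: src, N: limit}
	dec := json.NewDecoder(lr)
	for {
		if err := callee(dec); err != nil {
			return err // point 1: the limiter (or a real decode error) stopped the callee
		}
		lr.N = limit // point 2: the callee succeeded, so reset the budget for the next call
	}
}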

@dsnet were you saying that it may make more sense to have different limits for values vs. tokens, because tokens are inherently smaller and as such you may want to have a more constrained limit?

Possibly yes. There are applications that parse a stream of JSON, where the stream is massive. In such an application, you would want to ensure no token exceeds a certain size. Technically, this can be achieved by constantly resetting the per-value limit right before reading every single token, but that's expensive and probably a frustrating API to use. It would be preferable to specify something like dec.LimitTokenSize(1<<20) once and not worry about it again.

Also, with value-specific limiting, it's unclear whether limits can be nested. For example:

  • Calling dec.LimitValueSize right before the top-level value seems to imply that the top-level value cannot exceed N bytes.
  • Suppose I then call dec.Token to read a leading '[' token, which transitions the decoder state to now parse values within a JSON array.
  • What's the behavior of now calling dec.LimitValueSize again? Is that going to change the top-level value limit? or is it setting a nested limit for the immediate next JSON value within the JSON array? Does it panic?

Also, what's the behavior of this functionality with regard to JSON whitespace? Does whitespace count toward the limit? Usually, the purpose of a limit is to ensure some reasonable bound on memory usage. While whitespace needs to be parsed, the implementation could be done in a way that it takes only O(1) memory to skip N bytes of whitespace.

@rsc
Contributor

rsc commented Jan 12, 2023

I talked to @rolandshoemaker about this. The API we think makes sense is

// Limit sets a limit on the number of bytes of JSON input consumed by
// any single call to the Decode or Token methods.
// After Limit(n), if a Decode or Token call would need to consume more than n bytes
// to process the entire JSON value or token,
// the call will instead return an error after consuming only n bytes.
// Calling Limit(-1) corresponds to not having a limit, which is the default for a new Decoder.
func (dec *Decoder) Limit(n int64)

Decode and Token would both reset the limit when they are called, and then the input reader would enforce the limit as it reads more input. Right now Token happens to call Decode in a somewhat confusing way to get at the string and number parsers. We would change Token to call an unexported decode that is Decode without the limit reset. Except for that one case, Decode and Token are not called from inside the json package as far as I can see.

The semantics would be that

dec := NewDecoder(r)
dec.Limit(n)
dec.Decode(x)
dec.Token()

is exactly the same as

r := &io.LimitedReader{R: r, N: 1<<63 - 1}
dec := NewDecoder(r)
r.N = n
dec.Decode(x)
r.N = n
dec.Token()

except for the errors being returned when the limit is reached.

It's true that users can construct this themselves with io.LimitedReader today, but it is a pain, and JSON decoding is an out-of-memory footgun that we should try to make easier to protect against. Adding one call to dec.Limit to your program is much easier than working out how to use io.LimitedReader correctly.

The goal is simply to have a big, effective, easily understood hammer to allow limiting the memory impact of a given Decoder method call. In a JSON v2 that streamed the input instead, the limit would still apply, to disallow things like decoding an infinite stream [[],[],[],[],[],[],[],[],... into an any and taking up an arbitrary amount of memory. It's true that the hammer may prevent processing certain inputs that wouldn't or don't take an arbitrary amount of memory, such as inputs containing gigabytes of spaces. That's fine with me. It's a hammer not a scalpel, and users who care about time spent processing the input may want to disallow such inputs anyway.

@rsc
Contributor

rsc commented Jan 18, 2023

Any thoughts on or objections to the previous comment's API? @dsnet?

@dsnet
Member

dsnet commented Jan 19, 2023

The proposed API assumes that the Decoder never leaks to user implementations.

Suppose we wanted to introduce the following API:

type Unmarshaler2 interface {
    UnmarshalJSON2(*json.Decoder) error
}

where types that implement UnmarshalJSON2 can unmarshal themselves using the streaming Decoder.Token API. Such an API is fairly reasonable as it avoids the quadratic runtime behavior of recursive UnmarshalJSON calls (this is a real issue, not hypothetical).

In this situation, the implementation of UnmarshalJSON2 would call the Decoder.Token or Decoder.Decode methods, but the proposed behavior means that the limit would be reset over and over recursively (making it practically useless). I guess we could work around this by having recursive calls to Token or Decode not reset the limit. For example, the implementation of Decoder.Decode could be something like:

func (d *Decoder) Decode(p any) error {
    if d.topLevel {
        d.topLevel = false
        defer func() {
            d.resetLimit()
            d.topLevel = true
        }()
    }
    ...
}

The difference in semantics between top-level and recursive calls makes me feel a little uneasy.

@dsnet
Member

dsnet commented Jan 19, 2023

I think the fundamental problem is that a memory limit is a property of a particular Unmarshal call, and we're trying to introduce it as an option off of the Decoder state. While this is certainly possible, it introduces weird edge cases due to the mismatch.

Ideally, we would have something like:

func UnmarshalReader(r io.Reader, p any, o UnmarshalOptions) error

where UnmarshalOptions supports a memory limit (among other possible options).
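
For illustration only, one possible shape of such an API; the field names and the use of io.LimitReader here are assumptions, not part of any proposal text (imports: encoding/json, io):

// UnmarshalOptions carries per-call options; ByteLimit is the one sketched here.
type UnmarshalOptions struct {
	ByteLimit int64 // if > 0, cap on the number of input bytes this call may consume
}

// UnmarshalReader enforces the byte limit by wrapping the reader, so exceeding
// the limit surfaces as a truncated-input decode error.
func UnmarshalReader(r io.Reader, p any, o UnmarshalOptions) error {
	if o.ByteLimit > 0 {
		r = io.LimitReader(r, o.ByteLimit)
	}
	return json.NewDecoder(r).Decode(p)
}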

@dsnet
Member

dsnet commented Jan 19, 2023

I thought more about this while washing up...

I propose instead:

// Limit sets a limit on the number of bytes of input that may be consumed 
// for each top-level JSON value.
// After Limit(n), if a Decode or Token call would need to consume more than
// a total of n bytes to make progress on the current top-level JSON value,
// the call will instead report an error.
// Calling Limit(n) when the Decoder has already begun parsing a top-level JSON value
// sets the limit for the remainder of the current value.
// Subsequent top-level JSON values are subject to the entirety of the limit.
// Calling Limit(-1) corresponds to not having a limit, which is the default for a new Decoder.
func (dec *Decoder) Limit(n int64)

Properties:

  • The main difference is in when the limit is reset. The earlier proposal resets after every Decode and Token call, which makes the limit a property of the call itself, unrelated to the JSON state machine. In contrast, my proposal resets the limit whenever the JSON state machine pops back up to the top level. This resolves what made me uneasy about the API in my earlier comment.
  • Practically speaking, the behavior is identical for almost all common usages; the difference only becomes observable with more advanced usages of Decoder, such as the hypothetical Unmarshaler2 API. For the example posted with the earlier proposal, it behaves identically.
  • There is no support for nested limits. This would be complicated, slow, and the use cases are overly esoteric.
  • There is partial support for limiting on a per-token basis: it can be accomplished by calling Limit just prior to every Token call (see the sketch below). This takes advantage of the fact that the next JSON token must be part of the remainder of the current top-level JSON value.

I like that it operates as a hammer when you want to limit on a per-top-level value basis (which is by far the most common use case), but can be used as a scalpel when you want to limit on a per-token basis (which is only for rare and more advanced use cases).
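
A sketch of that per-token usage with the API proposed in this comment (Limit is the proposed method, not existing API; maxTokenBytes is a placeholder; imports: encoding/json, io):

// readTokens caps every individual token by refreshing the limit right
// before each Token call.
func readTokens(dec *json.Decoder, maxTokenBytes int64) error {
	for {
		dec.Limit(maxTokenBytes) // proposed method from this comment
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err // includes a single token exceeding maxTokenBytes
		}
		_ = tok // process the token here
	}
}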

@ianlancetaylor
Contributor

Thanks for the analysis. Once we start thinking about recursive cases, it's not hard to imagine wanting to know what the current limit is before starting some expensive operation. So perhaps we should call this method SetLimit, and perhaps we should consider adding a Limit method to return the current limit.

@rsc
Contributor

rsc commented Jan 25, 2023

SetLimit sounds fine for now. I'm not sure whether we'll need Limit but leaving the door open sounds fine.

@rsc changed the title from "proposal: encoding/json: allow limiting Decoder.Decode read size" to "proposal: encoding/json: add (*Decoder).SetLimit" on Jan 25, 2023
@rsc
Contributor

rsc commented Jan 25, 2023

Have all concerns about this proposal been addressed?

@rsc
Contributor

rsc commented Feb 1, 2023

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Feb 9, 2023

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc changed the title from "proposal: encoding/json: add (*Decoder).SetLimit" to "encoding/json: add (*Decoder).SetLimit" on Feb 9, 2023
@rsc modified the milestones: Proposal → Backlog on Feb 9, 2023
@gopherbot

Change https://go.dev/cl/471875 mentions this issue: encoding/json: implement SetLimit

@dsnet
Member

dsnet commented Mar 3, 2023

I propose renaming this to SetByteLimit, which limits the maximum byte size of a JSON value. It is conceivable that we could also add SetDepthLimit, which limits the maximum nesting depth of a JSON value.

In #58786, @bigpigeon mentioned needing a maximum depth limit, which is a reasonable feature to potentially add.
