
encoding/json: add (*Decoder).SetLimit #56733

Open
rolandshoemaker opened this issue Nov 14, 2022 · 22 comments

Comments

@rolandshoemaker
Member

Typically, the advice for avoiding reading excessively large values into memory from an untrusted io.Reader is to wrap the reader in an io.LimitedReader. For encoding/json.NewDecoder this is not necessarily a reasonable approach, since the Decoder may be intended to read from a long-lived stream (e.g. a net.Conn) where the user may not want to limit the total number of bytes read from the Reader across multiple reads, but does want to limit the number of bytes read during a single call to Decoder.Decode (i.e. reading 100 messages of 500 bytes each across the lifetime of the Decoder may be perfectly acceptable, but a single 50,000-byte read is not).

A solution to this problem would be to add a method to Decoder, which allows limiting the amount read from the Reader into the internal Decoder buffer on each call to Decoder.Decode, returning an error if the number of bytes required to decode the value exceeds the set limit. Something along the lines of:

// LimitValueSize limits the number of bytes read from the wrapped io.Reader
// during a single call to dec.Decode. If decoding the value requires reading
// more than limit bytes, dec.Decode will return an error.
func (dec *Decoder) LimitValueSize(limit int)
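
For illustration, a minimal sketch of how the proposed method might be used on a long-lived connection. LimitValueSize does not exist today, and the message type, handle callback, and 64 KiB figure are placeholder assumptions:

package jsonstream

import (
	"encoding/json"
	"net"
)

type message struct {
	Kind string `json:"kind"`
	Body string `json:"body"`
}

// serveConn decodes a stream of JSON messages from a single connection,
// rejecting any one value that would require reading more than 64 KiB.
// LimitValueSize is the proposed method, not part of encoding/json today.
func serveConn(conn net.Conn, handle func(message)) error {
	dec := json.NewDecoder(conn)
	dec.LimitValueSize(64 << 10) // hypothetical: per-value byte limit
	for {
		var msg message
		if err := dec.Decode(&msg); err != nil {
			return err // oversized value, malformed JSON, or connection error
		}
		handle(msg)
	}
}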

cc @dsnet

@gopherbot added this to the Proposal milestone on Nov 14, 2022
@rittneje

Should the bytes that are already in the decoder's buffer count against this limit on subsequent calls to Decode? I would think so as otherwise it would lead to inconsistent behavior, but the way it is phrased in your proposal is not entirely clear.

@dsnet
Member

dsnet commented Nov 15, 2022

The feature seems reasonable in general, but it's unclear whether the limit should be value-based or token-based. Value-based limits make sense when calling Decode, while token-based limits make sense when calling Token.

As a work-around, you could reset the limit on an io.LimitedReader after every Decode call.
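
For concreteness, a sketch of that workaround, reusing the placeholder message type and handle callback from the sketch above (imports: encoding/json, io, net):

// readLoop wraps the connection in an io.LimitedReader and refills its byte
// budget before every Decode call, so each value is limited to maxValueBytes.
func readLoop(conn net.Conn, handle func(message)) error {
	const maxValueBytes = 64 << 10
	lr := &io.LimitedReader{R: conn, N: maxValueBytes}
	dec := json.NewDecoder(lr)
	for {
		lr.N = maxValueBytes // reset the budget for the next value
		var msg message
		if err := dec.Decode(&msg); err != nil {
			// hitting the limit typically surfaces as an unexpected-EOF error
			return err
		}
		handle(msg)
	}
}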

@Jorropo
Member

Jorropo commented Nov 15, 2022

@dsnet The annoying thing with LimitedReader is that, as you pointed out, you need to reset it every time.

@Jorropo
Member

Jorropo commented Nov 15, 2022

Should the bytes that are already in the decoder's buffer count against this limit on subsequent calls to Decode? I would think so as otherwise it would lead to inconsistent behavior, but the way it is phrased in your proposal is not entirely clear.

The goal is to limit how much memory a single message is allowed to keep alive, to help prevent memory-exhaustion attacks.
So I would say this should limit how big the buffer is allowed to grow, not counting buffered over-reads, since those are implementation details.
All of this to say: I think this should limit how big one message is allowed to be.

@Jorropo
Member

Jorropo commented Nov 15, 2022

I think when the limit is reached, Decode should continue to read (but discard) the data and fully read past the problematic message.
This can be implemented with a "simple" state machine that counts how many (), {}, "", ... we have seen (like how the initial buffering already works, but discarding data that has already been read).

This means that if you have two messages, and the first one is too big but the second one is not, Decode would return an error the first time, but if you call it again it would successfully read the second message.
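
A rough sketch of that delimiter-counting skip (a hypothetical helper, not existing encoding/json code; it assumes the caller knows how many array/object levels were open when the limit was hit, and it reads directly from a bufio.Reader):

// skipRestOfValue discards input until the currently open JSON arrays and
// objects close, tracking strings and escape sequences so that brackets
// inside string literals are not miscounted.
func skipRestOfValue(r *bufio.Reader, depth int) error {
	inString, escaped := false, false
	for depth > 0 {
		b, err := r.ReadByte()
		if err != nil {
			return err
		}
		switch {
		case escaped:
			escaped = false // this byte is escaped; ignore it
		case inString && b == '\\':
			escaped = true
		case b == '"':
			inString = !inString
		case inString:
			// ordinary byte inside a string literal; ignore
		case b == '{' || b == '[':
			depth++
		case b == '}' || b == ']':
			depth--
		}
	}
	return nil
}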

@rsc
Contributor

rsc commented Nov 16, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Nov 30, 2022

If you call SetValueLimit and then only call Token, is there just no limit?

Using an io.LimitedReader where you reset the limit is also a possibility, of course,
and it seems clearer what it means.

Maybe part of the problem is when the reset happens. What if there was just a SetInputLimit that applied to all future reads, and you have to reset it yourself when you want to allow more data?

@rsc
Contributor

rsc commented Dec 7, 2022

@rolandshoemaker
Any thoughts on SetInputLimit (with no automatic reset)?
Or about using an io.LimitedReader directly?

@rolandshoemaker
Member Author

I think in cases where you are directly calling the Decoder, the usage of an io.LimitedReader that you reset yourself is reasonable, but it doesn't address the case where you pass a Decoder to something else which calls Token or Decode, where you may not be able to determine when you need to reset your reader's limit.

I've paged out the context about how the parser consumes bytes, but it seems like generally the limit could apply to both values and tokens equally, such that it is essentially only limiting the number of bytes being read from the reader for a particular parsing step. I may be misunderstanding though. @dsnet, were you saying that it may make more sense to have different limits for values vs. tokens, because tokens are inherently smaller and as such you may want to have a more constrained limit?

@dsnet
Member

dsnet commented Dec 12, 2022

it doesn't address the case where you pass a Decoder to something else which calls Token or Decode, where you may not be able to determine when you need to reset your reader's limit.

I'm not sure I understand this use case. Why would the called function need to check whether a limit exists or be responsible for resetting the reader's limit? That seems to be the responsibility of the caller, not the callee:

  1. If the callee experiences an error as it uses the json.Decoder, then great, the limiter is working as intended.
  2. If the callee function returns successfully, then the caller knows the limit was not hit and resets the limit.
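
A sketch of that caller-side pattern, using the io.LimitedReader workaround from earlier (src, limit, and callee are placeholders; imports: encoding/json, io):

// run hands the Decoder to a callee that calls Token or Decode itself.
// The caller alone manages the limit: it refills the byte budget only after
// the callee returns successfully.
func run(src io.Reader, limit int64, callee func(*json.Decoder) error) error {
	lr := &io.LimitedReader{R: src, N: limit}
	dec := json.NewDecoder(lr)
	for {
		if err := callee(dec); err != nil {
			return err // point 1: the limiter (or a real decode error) stopped the callee
		}
		lr.N = limit // point 2: the callee succeeded, so reset the budget for the next call
	}
}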

@dsnet were you saying that it may make more sense to have different limits for values vs. tokens, because tokens are inherently smaller and as such you may want to have a more constrained limit?

Possibly yes. There are applications that parse a stream of JSON, where the stream is massive. In such an application, you would want to ensure no token exceeds a certain size. Technically, this can be achieved by constantly resetting the per-value limit right before reading every single token, but that's expensive and probably a frustrating API to use. It would be preferable to specify something like dec.LimitTokenSize(1<<20) once and not worry about it again.

Also, with value-specific limiting, it's unclear whether limits can be nested. For example:

  • Calling dec.LimitValueSize right before the top-level value seems to imply that the top-level value cannot exceed N bytes.
  • Suppose I then call dec.Token to read a leading '[' token, which transitions the decoder state to now parse values within a JSON array.
  • What's the behavior of now calling dec.LimitValueSize again? Is that going to change the top-level value limit? or is it setting a nested limit for the immediate next JSON value within the JSON array? Does it panic?

Also, what's the behavior of this functionality with regard to JSON whitespace? Does whitespace count toward the limit? Usually, the purpose of a limit is to ensure some reasonable bound on memory usage. While whitespace needs to be parsed, the implementation could be done in a way that it takes only O(1) memory to skip N bytes of whitespace.

@rsc
Contributor

rsc commented Jan 12, 2023

I talked to @rolandshoemaker about this. The API we think makes sense is

// Limit sets a limit on the number of bytes of JSON input consumed by
// any single call to the Decode or Token methods.
// After Limit(n), if a Decode or Token call would need to consume more than n bytes
// to process the entire JSON value or token,
// the call will instead return an error after consuming only n bytes.
// Calling Limit(-1) corresponds to not having a limit, which is the default for a new Decoder.
func (dec *Decoder) Limit(n int64)

Decode and Token would both reset the limit when they are called, and then the input reader would enforce the limit as it reads more input. Right now Token happens to call Decode in a somewhat confusing way to get at the string and number parsers. We would change Token to call an unexported decode that is Decode without the limit reset. Except for that one case, Decode and Token are not called from inside the json package as far as I can see.

The semantics would be that

dec := NewDecoder(r)
dec.Limit(n)
dec.Decode(x)
dec.Token()

is exactly the same as

r := &io.LimitedReader{R: r, N: 1<<63 - 1}
dec := NewDecoder(r)
r.N = n
dec.Decode(x)
r.N = n
dec.Token()

except for the errors being returned when the limit is reached.

It's true that users can construct this themselves with io.LimitedReader today, but it is a pain, and JSON decoding is an out-of-memory footgun that we should try to make easier to protect against. Adding one call to dec.Limit to your program is much easier than working out how to use io.LimitedReader correctly.

The goal is simply to have a big, effective, easily understood hammer to allow limiting the memory impact of a given Decoder method call. In a JSON v2 that streamed the input instead, the limit would still apply, to disallow things like decoding an infinite stream [[],[],[],[],[],[],[],[],... into an any and taking up an arbitrary amount of memory. It's true that the hammer may prevent processing certain inputs that wouldn't or don't take an arbitrary amount of memory, such as inputs containing gigabytes of spaces. That's fine with me. It's a hammer not a scalpel, and users who care about time spent processing the input may want to disallow such inputs anyway.

@rsc
Contributor

rsc commented Jan 18, 2023

Any thoughts on or objections to the previous comment's API? @dsnet?

@dsnet
Member

dsnet commented Jan 19, 2023

The proposed API assumes that the Decoder never leaks to user implementations.

Suppose we wanted to introduce the following API:

type Unmarshaler2 interface {
    UnmarshalJSON2(*json.Decoder) error
}

where types that implement UnmarshalJSON2 can unmarshal themselves using the streaming Decoder.Token API. Such an API is fairly reasonable as it avoids the quadratic runtime behavior of recursive UnmarshalJSON calls (this is a real issue, not hypothetical).

In this situation, the implementation of UnmarshalJSON2 would call the Decoder.Token or Decoder.Decode methods, but the proposed behavior means that the limit would be reset over and over recursively (making it practically useless). I guess we could work around this by having recursive calls to Token or Decode not reset the limit. For example, the implementation of Decoder.Decode could be something like:

func (d *Decoder) Decode(p any) error {
    if d.topLevel {
        d.topLevel = false
        defer func() {
            d.resetLimit()
            d.topLevel = true
        }()
    }
    ...
}

The difference in semantics between top-level and recursive calls makes me feel a little uneasy.

@dsnet
Member

dsnet commented Jan 19, 2023

I think the fundamental problem is that a memory limit is a property of a particular Unmarshal call, and we're trying to introduce it as an option off of the Decoder state. While this is certainly possible, it introduces weird edge cases due to the mismatch.

Ideally, we would have something like:

func UnmarshalReader(r io.Reader, p any, o UnmarshalOptions) error

where UnmarshalOptions supports a memory limit (among other possible options).
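
For illustration only, one possible shape of such an API; the field names and the use of io.LimitReader here are assumptions, not part of any proposal text (imports: encoding/json, io):

// UnmarshalOptions carries per-call options; ByteLimit is the one sketched here.
type UnmarshalOptions struct {
	ByteLimit int64 // if > 0, cap on the number of input bytes this call may consume
}

// UnmarshalReader enforces the byte limit by wrapping the reader, so exceeding
// the limit surfaces as a truncated-input decode error.
func UnmarshalReader(r io.Reader, p any, o UnmarshalOptions) error {
	if o.ByteLimit > 0 {
		r = io.LimitReader(r, o.ByteLimit)
	}
	return json.NewDecoder(r).Decode(p)
}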

@dsnet
Member

dsnet commented Jan 19, 2023

I thought more about this while washing up...

I propose instead:

// Limit sets a limit on the number of bytes of input that may be consumed 
// for each top-level JSON value.
// After Limit(n), if a Decode or Token call would need to consume more than
// a total of n bytes to make progress on the current top-level JSON value,
// the call will instead report an error.
// Calling Limit(n) when the Decoder has already begun parsing a top-level JSON value
// sets the limit for the remainder of the current value.
// Subsequent top-level JSON values are subject to the entirety of the limit.
// Calling Limit(-1) corresponds to not having a limit, which is the default for a new Decoder.
func (dec *Decoder) Limit(n int64)

Properties:

  • The main difference is in when the limit is reset. The earlier proposal resets after every Decode and Token call, which makes the limit a property of the call itself, unrelated to the JSON state machine. In contrast, my proposal resets the limit whenever the JSON state machine pops back up to the top level. This resolves what made me uneasy about the API in my earlier comment.
  • Practically speaking, the behavior is identical for almost all common usages; the difference only becomes observable with more advanced usages of Decoder, such as the hypothetical Unmarshaler2 API. For the example posted with the earlier proposal, it behaves identically.
  • There is no support for nested limits. This would be complicated, slow, and the use cases are overly esoteric.
  • There is partial support for limiting on a per-token basis: it can be accomplished by calling Limit just prior to every Token call (see the sketch below). This takes advantage of the fact that the next JSON token must be part of the remainder of the current top-level JSON value.

I like that it operates as a hammer when you want to limit on a per-top-level value basis (which is by far the most common use case), but can be used as a scalpel when you want to limit on a per-token basis (which is only for rare and more advanced use cases).
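
A sketch of that per-token usage with the API proposed in this comment (Limit is the proposed method, not existing API; maxTokenBytes is a placeholder; imports: encoding/json, io):

// readTokens caps every individual token by refreshing the limit right
// before each Token call.
func readTokens(dec *json.Decoder, maxTokenBytes int64) error {
	for {
		dec.Limit(maxTokenBytes) // proposed method from this comment
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err // includes a single token exceeding maxTokenBytes
		}
		_ = tok // process the token here
	}
}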

@ianlancetaylor
Contributor

Thanks for the analysis. Once we start thinking about recursive cases, it's not hard to imagine wanting to know what the current limit is before starting some expensive operation. So perhaps we should call this method SetLimit, and perhaps we should consider adding a Limit method to return the current limit.

@rsc
Contributor

rsc commented Jan 25, 2023

SetLimit sounds fine for now. I'm not sure whether we'll need Limit but leaving the door open sounds fine.

@rsc changed the title from "proposal: encoding/json: allow limiting Decoder.Decode read size" to "proposal: encoding/json: add (*Decoder).SetLimit" on Jan 25, 2023
@rsc
Contributor

rsc commented Jan 25, 2023

Have all concerns about this proposal been addressed?

@rsc
Contributor

rsc commented Feb 1, 2023

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc
Contributor

rsc commented Feb 9, 2023

No change in consensus, so accepted. 🎉
This issue now tracks the work of implementing the proposal.
— rsc for the proposal review group

@rsc changed the title from "proposal: encoding/json: add (*Decoder).SetLimit" to "encoding/json: add (*Decoder).SetLimit" on Feb 9, 2023
@rsc modified the milestones: Proposal → Backlog on Feb 9, 2023
@gopherbot

Change https://go.dev/cl/471875 mentions this issue: encoding/json: implement SetLimit

@dsnet
Member

dsnet commented Mar 3, 2023

I propose renaming this to SetByteLimit, which limits the maximum byte size of a JSON value. It is conceivable that we could also add SetDepthLimit, which limits the maximum nesting depth of a JSON value.

In #58786, @bigpigeon mentioned needing a maximum depth limit, which is a reasonable feature to potentially add.
