Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: encoding/json: add a function to report whether a []byte represents a JSON array before unmarshal #53170

Open
Lz-Gustavo opened this issue May 31, 2022 · 5 comments

Comments

@Lz-Gustavo
Copy link

Problem statement

Many web applications utilize JSON encoding as its primary data exchange format. Among these, some applications may even accept different JSON schemes as an acceptable argument for a single method. For an example, a common use case would be to accept both of the following JSONs as request bodies on a HTTP API endpoint:

{
  "name": "foo"
}
[
  {
    "name": "foo"
  },
  {
    "name": "bar"
  }
]

However, implementing this kind of behavior on the current state of the library is not trivial, and may lead to inefficient or even wrong implementations. There may exists multiple other ways to do this, but I will mostly focus on two:

  1. Unmarshal the received JSON into an interface{}, and implement a common type assertion on a switch statement to decide which value you should convert the unmarshaled value

    var v interface{}
    json.Unmarshal(raw, &v) // ignoring error handling
    switch v.(type) {
    case []interface{}:
        // deal as a JSON array, possibly returning a slice of a desired type A
    
    case map[string]interface{}:
        // deal as a JSON object, returning an initialized struct of type A
    }
  2. Implement an in-place verification over the []byte to check if it starts with { or [

    // source: https://stackoverflow.com/a/55014220
    x := bytes.TrimLeft(data, " \t\r\n")
    
    isArray := len(x) > 0 && x[0] == '['
    isObject := len(x) > 0 && x[0] == '{'

Even though 1 utilizes native features from Go, it may be considered far from optimal due to the underlying reflection usage and the unmarshal itself. Considering raw the informed []byte as a valid JSON, there's no need to unmarshal it entirely only to detect whether it represents a list or not. As for the other option, although being faster than 1, it may be considered error prone and unsafe for a lot of developers to deploy it on a production environment. I like its simplicity, but I guess it could be even more simple.

Proposal

As a solution for this, I propose the implementation of a new exported function on pkg encoding/json to inform whether a raw []byte represents a JSON array (i.e. is a valid JSON and starts with [). The signature could be something as follows, similar to Valid:

func IsArray(data []byte) bool

With this function available, one could easily implement the same use case described before as follows:

// consider that raw is an already validated JSON
if json.IsArray(raw) {
  // unmarshal into a []A slice

} else {
  // unmarshal into a A struct
}  

As for its underlying implementation, I guess it could not only utilize the same checkValid function to detect if data represents a valid JSON before searching for [, but also utilize already implemented methods of the scanner struct as a reference for the JSON traversal. I'd love to contribute with a PR if you guys agree it's a valid addition to the library.

OBS: I'm still not sure about the name itself. I admit I don't feel very comfortable with the noun Array, but it's the same noun utilized on the JSON RFC. Should it be called slice? Any ideas?

@gopherbot gopherbot added this to the Proposal milestone May 31, 2022
@seankhliao
Copy link
Member

Running a parser/validator over the entire input just to check that it's a valid json array, only to then do it again but to unmarshal it doesn't seem to be very efficient.

If this is to be used in the context of a json.Unmarshaler, the only test needed is x[0] == '['.
You haven't stated why it is "unsafe"

@Lz-Gustavo
Copy link
Author

Running a parser/validator over the entire input just to check that it's a valid json array, only to then do it again but to unmarshal it doesn't seem to be very efficient.

If this is to be used in the context of a json.Unmarshaler, the only test needed is x[0] == '['. You haven't stated why it is "unsafe"

good point!

I agree that just mentioning as "unsafe" is a weak argument, not to mention it is imprecise. What I meant by "unsafe" is in the sense that, IMHO, implementing this kind of verification by hand is "less safe" (in terms of reliable) when compared to utilizing a peer-reviewed native feature of the standard library.

Also, could you elaborate on the usage of a json.Unmarshaler? Considering that use case, wouldn't you also require to invoke a second time the UnmarshalJSON method for a different type if you happen to fail the first one?

The whole point of utilizing the json.IsArray() on the example use case would be to avoid invoking two times an unmarshal method for a given []byte.

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Jun 1, 2022
@earthboundkid
Copy link
Contributor

@Lz-Gustavo
Copy link
Author

https://go.dev/play/p/U2xktwpNq46

That's my point @carlmjohnson. With a json.IsArray() function, in your example you wouldn't had to call two times the json.Unmarshal() within your implementation of UnmarshalJSON(). Although correct, notice that for a non-array JSON you unnecessarily had to call the first json.Unmarshal() at line 17, so them you could successfully unmarshal it on line 22.

With the proposed function, that method would be something like this:

func (nn *names) UnmarshalJSON(b []byte) error {
	if json.Array(b) {
		if err := json.Unmarshal(b, (*[]named)(nn)); err != nil {
			return err
		}
	
	} else {
		var n named
		if err := json.Unmarshal(b, &n); err != nil {
			return err
		}
		*nn = names{n}
	}
	return nil
}

Which I believe performs better since it avoids a second invocation of json.Unmarshal() for some inputs. Ok, this approach adds the non-negligible cost of json.Array(), which should be carefully evaluated. I'll run a benchmark considering json.Valid() to be the same cost of this function to evaluate this hypothesis.

@apparentlymart
Copy link

apparentlymart commented Sep 1, 2022

Given that the subsequent json.Unmarshal will fully validate the contents of the possible array anyway, could it be plausible to define this as json.MaybeArray(b), where returning true means that the given bytes are either a valid JSON document with an array at its root or an invalid JSON document?

If the goal is only to choose which type to pass into the second argument of Unmarshal then it seems unnecessary to check whether the remainder of the buffer is valid JSON.


Perhaps this generalizes as:

func FirstToken(data []byte) (Token, error) {
    // ...
}

My intent here is that this is functionally equivalent to calling NewDecoder with a bytes.Reader containing data and then calling Token on it, but potentially implemented in a way that minimizes allocations and doesn't do all the state machine advancing that Decoder.Token does today. I would also not expect it to provide the same guarantees of valid nesting that Decoder.Token provides, under the assumption that the caller is about to use json.Unmarshal anyway; the only possible errors then would seem to be io.EOF or a *json.SyntaxError describing an invalid token at the start of the buffer.

The motivating example would then look like this:

	if json.NextToken(b) == json.Delim('[') {
		if err := json.Unmarshal(b, (*[]named)(nn)); err != nil {
			return err
		}
	
	} else {
		var n named
		if err := json.Unmarshal(b, &n); err != nil {
			return err
		}
		*nn = names{n}
	}

Perhaps for improved readability this would also benefit from some extra constants:

const ObjectOpen = Delim('{')
const ObjectClose = Delim('}')
const ArrayOpen = Delim('[')
const ArrayClose = Delim(']')

...which would then allow json.NextToken(b) == json.ArrayOpen, which is slightly longer but perhaps easier to guess what it represents.

The NextToken forumulation is also more friendly to a switch statement without repeatedly re-tokenizing the beginning of the buffer:

switch json.NextToken(b) {
	case json.ArrayOpen:
		// ...
	case json.ObjectOpen:
		// ...
}

... though admittedly most of the other implementations of json.Delim are more likely to be used in a type switch than in a value switch, in which case it would be a little non-ergonomic to first type switch and then, in the case json.Delim:, switch again to see exactly what kind of delimiter it is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

5 participants