I've run into some performance issues when decoding large JSON documents and have a proposal for adding a new API to the encoding/json package. I'm willing to write the code and tests and go through the review process, but before I do that I'd like to get approval that this new API would be accepted.
As I mentioned, I'm currently using the Decoder in encoding/json to parse a large JSON document, which in my case contains a JSON list of objects. The problem I have is that the Decoder's Decode function does two things: it first reads the bytes of the next object in the list and then unmarshals that JSON into a Go value. The main consumer of CPU, however, is unmarshaling the data. To test this I wrote the following program.
func ReadListMinimal(path string) error {
	f, err := os.Create("/tmp/pprof")
	if err != nil {
		fmt.Printf("could not create CPU profile: %v", err)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		fmt.Printf("could not start CPU profile: %v", err)
	}
	defer pprof.StopCPUProfile()

	docsRead := 0
	file, err := os.Open(path)
	if err != nil {
		return err
	}

	decoder := NewDecoder(file)

	// Read opening bracket
	if _, err := decoder.Token(); err != nil {
		return fmt.Errorf("expected a JSON list of objects")
	}

	item := &QueuedItem{nil, nil, nil}
	for decoder.More() {
		docsRead++
		err = decoder.DecodeRaw(&item.bytes)
		if err != nil {
			return fmt.Errorf("expected a JSON list of objects")
		}
		if docsRead > 500000 {
			return nil
		}
	}

	// Read closing bracket
	if _, err := decoder.Token(); err != nil {
		return fmt.Errorf("expected a JSON list of objects")
	}

	return file.Close()
}
The program reads the first 500000 objects from the list and then exits, discarding the objects after reading them. What I found was that 35.36s of 43.51s (81.26%) of the time spent running the program is spent unmarshaling data.
For a large JSON document we can only have a single reader, so we are limited by the speed at which that single reader can process the document, even if we have more CPUs available to us.
My proposal is to add a new API to the Decoder, which I will call DecodeRaw. DecodeRaw would read the next JSON value just like Decode does, but instead of unmarshaling it, it would return the bytes of that value to the user, who could then unmarshal them whenever they deem appropriate. This would allow developers to have a single reader reading the large JSON document and multiple goroutines unmarshaling the data, taking advantage of all the CPUs on the machine when reading large JSON files. My initial proposal is that the DecodeRaw function would look something like the code below. Note that this is not my final proposal; it is just meant to drive discussion.
func (dec *Decoder) DecodeRaw(bytes *[]byte) error {
	if dec.err != nil {
		return dec.err
	}

	if err := dec.tokenPrepareForDecode(); err != nil {
		return err
	}

	if !dec.tokenValueAllowed() {
		return &SyntaxError{msg: "not at beginning of value"}
	}

	// Read whole value into buffer.
	n, err := dec.readValue()
	if err != nil {
		return err
	}
	*bytes = dec.buf[dec.scanp : dec.scanp+n]
	dec.scanp += n

	// fixup token streaming state
	dec.tokenValueEnd()

	return err
}
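To make the intended usage concrete, here is a rough sketch of the fan-out side: worker goroutines unmarshaling raw JSON values that a single reader (for example, one calling the proposed DecodeRaw in a loop and copying each returned slice, since as sketched above it aliases the decoder's buffer) would push onto a channel. The package name, item type, channel, and worker count are placeholders of mine, not part of the proposal.

```go
package jsonlist // illustrative package name

import (
	"encoding/json"
	"sync"
)

// item is a placeholder for whatever type the caller actually unmarshals into.
type item struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// unmarshalInParallel starts the requested number of worker goroutines, each
// unmarshaling raw JSON values received from rawCh, and reports any errors on
// the returned channel, which is closed once rawCh is closed and drained.
func unmarshalInParallel(rawCh <-chan []byte, workers int) <-chan error {
	errCh := make(chan error, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for raw := range rawCh {
				var it item
				if err := json.Unmarshal(raw, &it); err != nil {
					errCh <- err
				}
			}
		}()
	}
	go func() {
		wg.Wait()
		close(errCh)
	}()
	return errCh
}
```

Reporting errors on a channel rather than stopping at the first failure is just one possible choice; the point is only that unmarshaling can proceed on all CPUs while a single goroutine owns the Decoder.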
So my question is: does the community think this would be a useful function to add to the Decoder?
ALTree changed the title from "Request to add a new DecodeRaw API to encoding/json Decoder" to "proposal: encoding/json: add a new DecodeRaw API" on Jan 27, 2017.
That sounds like it would solve my issue and would remove the need to add a new API. I'll try this out over the weekend and close this issue if it works.
I'm going to close this since I've verified that using RawMessage solves this problem, and as a result there is no need to add a new API. One thing that I am planning to look into a little more is why there is still a fair amount of time spent in the unmarshal call when using RawMessage; the new timings, for example, show 2.54s of 5.76s being spent in unmarshal. It seems like there may be some optimization possible here, since we should just be able to pass the bytes back to the user. I'll investigate this more, and if I find anything I will open a separate issue.
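For completeness, here is a minimal sketch of the RawMessage approach described above: Decode fills a json.RawMessage for each element of the list, deferring the expensive unmarshaling so it can be done elsewhere, for example by the worker pool sketched earlier. The package and function names and the channel are illustrative, not from the issue.

```go
package jsonlist // illustrative package name

import (
	"encoding/json"
	"io"
)

// readRawValues decodes each element of a JSON array into a json.RawMessage
// and sends it on out, deferring the expensive unmarshaling to the consumers.
// The channel is closed when the array has been fully read or an error occurs.
func readRawValues(r io.Reader, out chan<- json.RawMessage) error {
	defer close(out)

	dec := json.NewDecoder(r)
	if _, err := dec.Token(); err != nil { // opening '[' of the list
		return err
	}

	for dec.More() {
		var raw json.RawMessage
		if err := dec.Decode(&raw); err != nil {
			return err
		}
		out <- raw
	}

	_, err := dec.Token() // closing ']'
	return err
}
```

Because RawMessage's UnmarshalJSON copies the data into its own slice, each value is safe to hand off to another goroutine, and since json.RawMessage is just a named []byte, the values can be converted and fed to a worker pool like the one sketched earlier.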