I've run into some performance issues when decoding large JSON documents and have a proposal for adding a new API to the encoding/json package. I'm willing to write the code and tests and go through the review process, but before I do that I'd like to get approval that this new API would be accepted.
As I mentioned, I'm currently using the Decoder in encoding/json to parse a large JSON document, which in my case contains a JSON list of objects. The problem I have is that the Decoder's Decode function does two things: it first reads the bytes of the next object in the list and then unmarshals that JSON into a Go value. The main consumer of CPU, however, is unmarshaling the data. To test this I wrote the following program.
func ReadListMinimal(path string) error {
	f, err := os.Create("/tmp/pprof")
	if err != nil {
		fmt.Printf("could not create CPU profile: %v", err)
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		fmt.Printf("could not start CPU profile: %v", err)
	}
	defer pprof.StopCPUProfile()

	docsRead := 0
	file, err := os.Open(path)
	if err != nil {
		return err
	}

	decoder := NewDecoder(file)

	// Read opening bracket
	if _, err := decoder.Token(); err != nil {
		return fmt.Errorf("expected a JSON list of objects")
	}

	item := &QueuedItem{nil, nil, nil}
	for decoder.More() {
		docsRead++
		err = decoder.DecodeRaw(&item.bytes)
		if err != nil {
			return fmt.Errorf("expected a JSON list of objects")
		}
		if docsRead > 500000 {
			return nil
		}
	}

	// Read closing bracket
	if _, err := decoder.Token(); err != nil {
		return fmt.Errorf("expected a JSON list of objects")
	}

	return file.Close()
}
The program reads the first 500000 objects from the list and then exits, discarding the objects after reading them. What I found was that 35.36s of 43.51s (81.26%) of the time spent running the program is spent unmarshaling data.
For a large JSON document we can only have a single reader, so we are limited by the speed at which that single reader can process the document, even if we have more CPUs available to us.
My proposal is to add a new API to the Decoder, which I will call DecodeRaw. DecodeRaw would read the next JSON value just like Decode does, but instead of unmarshaling it, it would return the bytes of that value to the user, who could then unmarshal them whenever they deem appropriate. This would allow developers to have a single reader reading the large JSON document and multiple goroutines unmarshaling the data, taking advantage of all the CPUs on the machine when reading large JSON files. My initial proposal is that the DecodeRaw function would look something like the code below. Note that this is not my final proposal; it is just meant to drive discussion.
func (dec *Decoder) DecodeRaw(bytes *[]byte) error {
	if dec.err != nil {
		return dec.err
	}

	if err := dec.tokenPrepareForDecode(); err != nil {
		return err
	}

	if !dec.tokenValueAllowed() {
		return &SyntaxError{msg: "not at beginning of value"}
	}

	// Read whole value into buffer.
	n, err := dec.readValue()
	if err != nil {
		return err
	}
	*bytes = dec.buf[dec.scanp : dec.scanp+n]
	dec.scanp += n

	// fixup token streaming state
	dec.tokenValueEnd()

	return err
}
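To make the intended usage concrete, here is a rough sketch of the fan-out side: worker goroutines unmarshaling raw JSON values that a single reader (for example, one calling the proposed DecodeRaw in a loop and copying each returned slice, since as sketched above it aliases the decoder's buffer) would push onto a channel. The package name, item type, channel, and worker count are placeholders of mine, not part of the proposal.

```go
package jsonlist // illustrative package name

import (
	"encoding/json"
	"sync"
)

// item is a placeholder for whatever type the caller actually unmarshals into.
type item struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// unmarshalInParallel starts the requested number of worker goroutines, each
// unmarshaling raw JSON values received from rawCh, and reports any errors on
// the returned channel, which is closed once rawCh is closed and drained.
func unmarshalInParallel(rawCh <-chan []byte, workers int) <-chan error {
	errCh := make(chan error, workers)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for raw := range rawCh {
				var it item
				if err := json.Unmarshal(raw, &it); err != nil {
					errCh <- err
				}
			}
		}()
	}
	go func() {
		wg.Wait()
		close(errCh)
	}()
	return errCh
}
```

Reporting errors on a channel rather than stopping at the first failure is just one possible choice; the point is only that unmarshaling can proceed on all CPUs while a single goroutine owns the Decoder.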
So my question is: does the community think this would be a useful function to add to the Decoder?
ALTree changed the title from "Request to add a new DecodeRaw API to encoding/json Decoder" to "proposal: encoding/json: add a new DecodeRaw API" on Jan 27, 2017.
That sounds like it would solve my issue and would remove the need to add a new API. I'll try this out over the weekend and close this issue if it works.
I'm going to close this since I've verified that using RawMessage solves this problem, and as a result there is no need to add a new API. One thing that I am planning to look into a little more is why there is still a fair amount of time spent in the unmarshal call when using RawMessage; the new timings, for example, show 2.54s of 5.76s being spent in unmarshal. It seems like there may be some optimization possible here, since we should just be able to pass the bytes back to the user. I'll investigate this more, and if I find anything I will open a separate issue.
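For completeness, here is a minimal sketch of the RawMessage approach described above: Decode fills a json.RawMessage for each element of the list, deferring the expensive unmarshaling so it can be done elsewhere, for example by the worker pool sketched earlier. The package and function names and the channel are illustrative, not from the issue.

```go
package jsonlist // illustrative package name

import (
	"encoding/json"
	"io"
)

// readRawValues decodes each element of a JSON array into a json.RawMessage
// and sends it on out, deferring the expensive unmarshaling to the consumers.
// The channel is closed when the array has been fully read or an error occurs.
func readRawValues(r io.Reader, out chan<- json.RawMessage) error {
	defer close(out)

	dec := json.NewDecoder(r)
	if _, err := dec.Token(); err != nil { // opening '[' of the list
		return err
	}

	for dec.More() {
		var raw json.RawMessage
		if err := dec.Decode(&raw); err != nil {
			return err
		}
		out <- raw
	}

	_, err := dec.Token() // closing ']'
	return err
}
```

Because RawMessage's UnmarshalJSON copies the data into its own slice, each value is safe to hand off to another goroutine, and since json.RawMessage is just a named []byte, the values can be converted and fed to a worker pool like the one sketched earlier.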