encoding/json: provide tokenizer #6050

Closed
gopherbot opened this issue Aug 5, 2013 · 22 comments

@gopherbot

by krolaw:

This would be useful when parsing (really) large JSON documents in low-memory
environments.

At the moment one has to write one's own comma and key handling code. Since such a
tokenizer is already available in the encoding/xml package (through RawToken), it would
be consistent to have similar functionality in the json package.

Thanks.
@rsc
Contributor

rsc commented Aug 5, 2013

Comment 1:

I agree that it might be nice in some cases. One reason there's a tokenizer in xml and not in json
is that tokenizing json is nearly trivial; not so for xml.
It's a bit too late to design new APIs for Go 1.2. Perhaps for Go 1.3.
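
(As a rough illustration of why JSON tokenizing is so simple: the first non-space byte of a token already determines its kind. This is only a sketch, not the encoding/json scanner itself.)

// tokenKind is a sketch showing how the first non-space byte of a JSON
// token determines its kind; handling of the token's remaining bytes is
// straightforward once the kind is known.
func tokenKind(b byte) string {
    switch {
    case b == '{' || b == '}' || b == '[' || b == ']' || b == ':' || b == ',':
        return "delimiter"
    case b == '"':
        return "string"
    case b == '-' || ('0' <= b && b <= '9'):
        return "number"
    case b == 't' || b == 'f':
        return "bool"
    case b == 'n':
        return "null"
    default:
        return "invalid"
    }
}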

Labels changed: added priority-later, go1.3, removed priority-triage, go1.2maybe.

Status changed to Accepted.

@robpike
Contributor

robpike commented Aug 20, 2013

Comment 2:

Labels changed: removed go1.3.

@rsc
Contributor

rsc commented Nov 27, 2013

Comment 3:

Labels changed: added go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 4:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 5:

Labels changed: added repo-main.

@extemporalgenome
Contributor

Comment 6:

I'm certainly interested in this for a number of my projects. Existing third-party JSON
scanning packages are often just clones of encoding/json with the scanner exported; to
avoid code duplication in programs that need a scanner, it's better to use the third-party
lib for (un)marshaling as well, though that lib may or may not be maintained (and often
isn't).

@nitingupta910

I'm parsing the Wikidata dump (http://dumps.wikimedia.org/other/wikidata/), which is in JSON format (~36GB) and is a single giant array containing objects:

[
    { "id":"Q1", "type":"item", ...},
    { "id":"Q8", "type":"item", ...},

    (millions of such objects)
]

With the current json Decoder interface it seems impossible to parse this JSON without reading the whole file into memory. So exposing a tokenizer would be really helpful.
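
(For readers arriving later: the streaming API that eventually closed this issue handles exactly this shape. Below is a minimal sketch using the Decoder.Token, Decoder.More, and Decode methods that shipped in Go 1.5; Entity, f, and process are placeholder names, and error handling is abbreviated with log.Fatal.)

dec := json.NewDecoder(f) // f is an io.Reader over the ~36GB dump

// Consume the opening '[' of the top-level array.
if _, err := dec.Token(); err != nil {
    log.Fatal(err)
}

// Decode one object at a time; only one element is held in memory at once.
for dec.More() {
    var item Entity // placeholder struct with Id, Type, ... fields
    if err := dec.Decode(&item); err != nil {
        log.Fatal(err)
    }
    process(item)
}

// Consume the closing ']'.
if _, err := dec.Token(); err != nil {
    log.Fatal(err)
}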

@aclements
Member

I'd like to suggest an alternate, higher-level API for streaming JSON decoding.

Tokenizers and the existing "whole value" API are two opposite extremes on a spectrum. Many applications that require streaming decoding need incremental parsing at the upper levels of the JSON structure, but closer to the leaves simply need to get (potentially compound) values. At these lower levels, a tokenizer interface just gets in the way. For example, in @nitingupta910's Wikidata example, it's necessary to incrementally parse the top-level array, but for each element of the array it's far more convenient to use the regular JSON parser to get whole objects than to piece them together from the token stream.

Hence, my suggestion is to view the JSON data as a tree and expose a caller-driven in-order traversal of this tree. At any point in the traversal, the caller can ask for the entire subtree as a decoded JSON value, or it can descend or ascend the tree as appropriate. This is not entirely unlike a tokenizer, except that it's hierarchical (not linear), the caller can switch into and out of the full JSON decoder as convenient (rather than being trapped in one world or the other), and it abstracts the details of syntax such as separating commas and balancing brackets and braces.

This can all be done as a natural extension to the existing Decoder. The Decode method already reads just the next value and stops (even if there's more data in the Reader). All we would need are methods for descending into compound values, ascending after the last member of a compound value, and reporting where we are in the in-order traversal. Something like:

// Enter descends into the next compound JSON value in its input.
//
// If the next value is an array, subsequent calls to Decode will
// return the elements of that array in order. If the next value is an
// object, subsequent calls to Decode will return the key/value pairs
// of that object, alternating between returning the key and the value
// associated with that key. If the next value is any other type,
// Enter returns TraversalError.
func (*Decoder) Enter() error

// Exit ascends from the current compound JSON value in its input.
//
// The decoder must be at the end of an array or object that was
// previously Entered; otherwise, Exit returns TraversalError.
func (*Decoder) Exit() error

// Peek returns the type of the next value in the input. It may read
// from the input in order to determine the type of the value.
//
// If the next value is a non-compound value, Peek returns TypeNumber,
// TypeString, TypeBoolean, or TypeNull. If the next value is a
// compound value, Peek returns TypeArray or TypeObject. If the
// decoder has Entered an array and there are no more values in the
// array, Peek returns EndArray. Likewise, if the decoder has Entered
// an object and there are no more key/value pairs in the object, Peek
// returns EndObject. If there are no more values, Peek returns an
// io.EOF error.
func (*Decoder) Peek() (ValueType, error)

For example, using this interface, @nitingupta910's example could be parsed with the following code (eliding error handling):

d.Enter()
for {
    if typ, _ := d.Peek(); typ == EndArray {
        break
    }
    d.Decode(&object)
    process(object)
}
d.Exit()

@peterwald
Contributor

Good idea @aclements. I like the idea of unifying the Decoder and Tokenizer so that you can switch seamlessly back and forth between levels of abstraction as needed.

@keks

keks commented Apr 9, 2015

@peterwald: You said you wanted to add a tokenizer for 1.5 and that it's nearly finished. Did you implement a traditional tokenizer or @aclements' proposal? I couldn't find your code on GitHub or Gerrit.

@peterwald
Contributor

@keks Basically both. The scanner that's currently in the encoding/json package needs work. State is spread throughout the scanner and the decoder, and in places the decoder actually does some of its own scanning. In some places, scanning is done multiple times using different scanner instances, etc.

I ended up having to rewrite portions of the scanner, so that's why it's taking me a while to make sure everything is well tested and structured. I wrote an underlying scanner that's simpler, then layered the Decoder on top of that, and added a higher-level API similar to what @aclements proposed. In addition, I have a "walker" helper type that allows you to scan ahead to a particular node (using the underlying tokenizer) and then Decode or read tokens as needed. All of this is done using a forward-only streaming model, so it should be very memory efficient regardless of the size of the document.
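
(To give a feel for that "seek, then Decode" pattern, here is a sketch built only on the Decoder.Token/More/Decode methods that ultimately shipped, not on @peterwald's walker itself; seekKey is a hypothetical helper, and the usual encoding/json and fmt imports are assumed.)

// seekKey assumes dec has just consumed the opening '{' of an object.
// It advances past key/value pairs until it has read the key named want,
// leaving dec positioned to Decode that key's value.
func seekKey(dec *json.Decoder, want string) error {
    for dec.More() {
        key, err := dec.Token() // object keys come back as strings
        if err != nil {
            return err
        }
        if key == want {
            return nil // the caller can now dec.Decode the value
        }
        // Skip the value belonging to a non-matching key.
        var skip json.RawMessage
        if err := dec.Decode(&skip); err != nil {
            return err
        }
    }
    return fmt.Errorf("key %q not found", want)
}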

I have not pushed the code to Gerrit yet. I don't know what the policy is on getting unfinished code out for review, but I wanted to make sure it was pretty well complete.

I am working to push it before the end of the week.

@josharian
Contributor

@peterwald if you've already invested a lot of time understanding the scanner state machine, you might also enjoy looking at #10335. (Be sure to look at the discussion in the retracted CL as well.) No pressure, just thought I'd mention it.

@peterwald
Contributor

Thanks @josharian, I think the issue is that the state machine stays one byte behind the reader, so in certain cases, an error condition is allocated and cached in anticipation of a possible error on the next state transition (in other words, an error at the current position may not become evident until the next state transition). It looks like my changes will remove the need for that. I can include the test cases from #10335 in my patch to be sure.

@rsc rsc added this to the Unplanned milestone Apr 10, 2015
@larsth

larsth commented Apr 10, 2015

I really like @aclements's idea. I am writing a library that interprets the JSON protocol[1] from the gpsd daemon, and gpsd will truncate[2] all JSON responses after the 1536th character (byte?), so I need to create a streaming JSON reader that fills in the closing } and ] in the correct sequence. An extended JSON decoder would be very useful.

[1]: gpsd JSON core protocol:
http://manpages.ubuntu.com/manpages/trusty/man5/gpsd_json.5.html

[2] is from [1]:

The length limit for responses and reports is 1536 characters, 
including trailing newline; longer responses will be truncated, 
so client code must be prepared for the possibility of invalid 
JSON fragments.
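
(What "fill in the closing } and ]" could look like in practice: a sketch that tracks open brackets and appends the missing closers, assuming the truncation did not land inside a string or in the middle of a token; names are illustrative.)

// closeTruncated appends whatever closing brackets and braces a truncated
// JSON document is missing, so that a normal decoder can parse it.
func closeTruncated(data []byte) []byte {
    var stack []byte // closers we still owe, innermost last
    inString, escaped := false, false
    for _, b := range data {
        switch {
        case escaped:
            escaped = false
        case inString:
            if b == '\\' {
                escaped = true
            } else if b == '"' {
                inString = false
            }
        case b == '"':
            inString = true
        case b == '{':
            stack = append(stack, '}')
        case b == '[':
            stack = append(stack, ']')
        case b == '}' || b == ']':
            if len(stack) > 0 {
                stack = stack[:len(stack)-1]
            }
        }
    }
    // Append the missing closers in reverse order (innermost first).
    for i := len(stack) - 1; i >= 0; i-- {
        data = append(data, stack[i])
    }
    return data
}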

@rsc rsc removed accepted labels Apr 14, 2015
@keks

keks commented Apr 17, 2015

Why was this suddenly marked Unplanned and unaccepted? It's still on Rob's list for 1.5.

@peterwald, do you think you will get this done before May?

@ianlancetaylor
Contributor

It's not sudden. This was changed to release-none in the old issue tracker in December 2013. That label has now been mapped to milestone:unplanned in this issue tracker.

We've dropped the accepted label entirely as it wasn't helpful.

See Russ's recent note to golang-dev: https://groups.google.com/d/msg/golang-dev/YU56eKYiXJg/K1CkXo7Iy0gJ .

None of this means that it won't get done for Go 1.5. It does mean that if this does not get done it will not block the Go 1.5 release.

@keks

keks commented Apr 17, 2015

Oh, okay. Sorry for the noise then; I am just really looking forward to this :)
I actually read over Russ' post before but apparently I didn't read to the end...

@peterwald
Contributor

@keks Yes this will be done any day now. I ended up having to substantially rewrite the patch since I last posted here. It's very close.

@peterwald
Contributor

I've pushed the CL. https://go-review.googlesource.com/#/c/9073/.

Tear it apart and let me know what you find.

@gopherbot
Author

CL https://golang.org/cl/9073 mentions this issue.

@gopherbot
Author

CL https://golang.org/cl/11651 mentions this issue.

@rsc rsc closed this as completed in 0cf48b4 Jul 27, 2015
@mikioh mikioh modified the milestones: Go1.5, Unplanned Jul 28, 2015
@peterwald
Contributor

FYI, for anyone coming back to this issue... I had originally provided an API for forward seeking over the stream of JSON tokens as part of this change. The seeking part was not accepted into the standard library, so I've extracted it into its own package. It extends the existing json.Decoder.

https://github.com/exponent-io/json-seek

@golang golang locked and limited conversation to collaborators Oct 4, 2016