encoding/json: provide tokenizer #6050
Comments
I agree that it might be nice in some cases. One reason there's a tokenizer in xml and not in json is that tokenizing JSON is nearly trivial; not so for XML. It's a bit too late to design new APIs for Go 1.2. Perhaps for Go 1.3.

Labels changed: added priority-later, go1.3; removed priority-triage, go1.2maybe. Status changed to Accepted.
I'm certainly interested in this for a number of my projects. Existing third-party JSON scanning packages are often just clones of encoding/json with the scanner exported; to avoid code duplication, a program that needs a scanner ends up using the third-party library for (un)marshaling as well, even though that library may or may not be maintained (and often isn't).
I'm parsing the Wikidata dump (http://dumps.wikimedia.org/other/wikidata/), which is in JSON format (~36 GB) and consists of a single giant array of objects:
With the current json Decoder interface it seems impossible to parse this JSON without reading the whole file into memory, so exposing the tokenizer would be really helpful.
I'd like to suggest an alternate, higher-level API for streaming JSON decoding. Tokenizers and the existing "whole value" API are two opposite extremes on a spectrum. Many applications that require streaming decoding need incremental parsing at the upper levels of the JSON structure, but closer to the leaves they simply need to get (potentially compound) values. At those lower levels, a tokenizer interface just gets in the way. For example, in @nitingupta910's Wikidata example, it's necessary to incrementally parse the top-level array, but for each element of the array it's far more convenient to use the regular JSON parser to get whole objects than to piece them together from the token stream.

Hence, my suggestion is to view the JSON data as a tree and expose a caller-driven in-order traversal of this tree. At any point in the traversal, the caller can ask for the entire subtree as a decoded JSON value, or it can descend or ascend the tree as appropriate. This is not entirely unlike a tokenizer, except that it's hierarchical (not linear), the caller can switch into and out of the full JSON decoder as convenient (rather than being trapped in one world or the other), and it abstracts away details of syntax such as separating commas and balancing brackets and braces.

This can all be done as a natural extension to the existing Decoder. The Decode method already reads just the next value and stops (even if there's more data in the Reader). All we would need are methods for descending into compound values, ascending after the last member of a compound value, and reporting where we are in the in-order traversal. Something like:

// Enter descends into the next compound JSON value in its input.
// Exit ascends from the current compound JSON value in its input.
// Peek returns the type of the next value in the input. It may read ahead in the input to determine this.

For example, using this interface, @nitingupta910's example could be parsed with the following code (eliding error handling):
Good idea @aclements. I like the idea of unifying the Decoder and Tokenizer such that you can switch seamlessly back and forth between whichever level of abstraction is desired at the time.
@peterwald: You said you wanted to add a tokenizer for 1.5 and that it's nearly finished. Did you implement a traditional tokenizer or @aclements' proposal? I couldn't find your code on GitHub or Gerrit.
@keks Basically both. The scanner that's currently in the encoding/json package needs work. State is spread throughout the scanner and the decoder, and in places the decoder actually does some of its own scanning. In some places, scanning is done multiple times using different scanner instances, and so on. I ended up having to rewrite portions of the scanner, which is why it's taking me a while to make sure everything is well tested and structured.

I wrote an underlying scanner that's simpler, then layered the Decoder on top of that, and added a higher-level API similar to what @aclements proposed. In addition, I have a "walker" helper type that allows you to scan ahead to a particular node (using the underlying tokenizer) and then Decode or read tokens as needed. All of this is done using a forward-only streaming model, so it should be very memory efficient regardless of the size of the document.

I have not pushed the code to Gerrit yet. I don't know what the policy is on getting unfinished code out for review, but I wanted to make sure it was pretty well complete. I am working to push it before the end of the week.
@peterwald if you've already invested a lot of time understanding the scanner state machine, you might also enjoy looking at #10335. (Be sure to look at the discussion in the retracted CL as well.) No pressure, just thought I'd mention it. |
Thanks @josharian, I think the issue is that the state machine stays one byte behind the reader, so in certain cases an error condition is allocated and cached in anticipation of a possible error on the next state transition (in other words, an error at the current position may not become evident until the next state transition). It looks like my changes will remove the need for that. I can include the test cases from #10335 in my patch to be sure.
I really like @aclements's idea. I am writing a library that interprets the JSON protocol [1] from the gpsd daemon, and gpsd will truncate [2] all JSON responses after the 1536th character (byte?), so I need to create a streaming JSON reader that fills in the closing } and ] in the correct sequence. An extended JSON decoder would be very useful.

[1]: gpsd JSON core protocol
[2]: from [1]
Why was this suddenly marked Unplanned and unaccepted? It's still on Rob's list for 1.5. @peterwald, do you think you will get this done before May?
It's not sudden. This was changed to release-none in the old issue tracker in December 2013. That label has now been mapped to milestone:unplanned in this issue tracker. We've dropped the accepted label entirely as it wasn't helpful. See Russ's recent note to golang-dev: https://groups.google.com/d/msg/golang-dev/YU56eKYiXJg/K1CkXo7Iy0gJ . None of this means that it won't get done for Go 1.5. It does mean that if this does not get done it will not block the Go 1.5 release. |
Oh, okay. Sorry for the noise then; I am just really looking forward to this :) |
@keks Yes this will be done any day now. I ended up having to substantially rewrite the patch since I last posted here. It's very close. |
I've pushed the CL. https://go-review.googlesource.com/#/c/9073/. Tear it apart and let me know what you find. |
CL https://golang.org/cl/9073 mentions this issue. |
CL https://golang.org/cl/11651 mentions this issue. |
FYI, for anyone coming back to this issue... I had originally provided an API to do forward seeking on the stream of JSON tokens as part of this change. The seeking part was not accepted into the standard library, so I've extracted it into its own package, which extends the existing json.Decoder.