Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: encoding/xml: add ability to create decoder with start element already on stack #49313

Closed
SamWhited opened this issue Nov 3, 2021 · 16 comments

Comments

@SamWhited
Copy link
Member

SamWhited commented Nov 3, 2021

In a project that includes handlers similar to http.Handler except reading from XML, not the byte stream directly, I frequently find myself doing something like the following (where "xmlstream" is a package that contains the functionality of the io package except dealing with XML tokens instead of bytes):

// HandleXML implements XMLHandler. The token reader is the original stream
// but limited to the remainder of the element that triggered this handler, and start is the
// token that triggered the handler.
func (something) HandleXML(r xml.TokenReader, start *xml.StartElement) error { 
 	d := xml.NewTokenDecoder(xmlstream.MultiReader(
           xmlstream.Token(*start),
           t,
        ))
        // Pop the start element through the decoder, otherwise the stack is wrong
        // and when we read the end element it will throw an error about there being
        // no matching start element.
        d.Token()

        var foo someType
        d.DecodeElement(&foo, start)

In this example we could change our handler types to pass in a decoder directly, but this isn't desirable for several reasons:

  1. the Handler type could be defined in another package that we don't control
  2. the actual type of r is not actually xml.TokenReader but something that includes various other methods and implements xml.TokenReader, passing in a Decoder would make us have to provide two arguments, the decoder wrapping r and r itself which would be odd.
  3. r is actually a pipeline of various wrapped XML readers that perform transformations and checks on a document. We don't want the entire document read through a decoder with its associated overhead and validity checks as we know the document is valid already which is why we create one (for the users convenience, not for checking validity) only in the handler and only if the user actually needs it.

Instead I'd like to have an equivalent of the Decode/DecodeElement pattern but for creating a new decoder, something like:

// NewTokenDecoderElement is like NewTokenDecoder except that it creates a
// decoder with start already on the stack as the outermost tag.
//
// This is to accommodate the common pattern where a token is popped from a
// reader and if it is a StartElement it is decoded and otherwise some other
// action is performed without requiring that the entire stream be subject to
// the overhead of a decoder.
func NewTokenDecoderElement(t TokenReader, start *StartElement) *Decoder

This would allow us to pop from an xml.TokenReader and only create a decoder when the start element is encountered, then decode into a struct (consuming the end element) without error. The initial example then becomes:

func (something) HandleXML(r xml.TokenReader, start *xml.StartElement) error { 
 	d := xml.NewTokenDecoderElement(r, start)

        var foo someType
        d.DecodeElement(&foo, start)

EDIT: I should mention that while this can be done in an external package using the workaround from the initial example, doing it directly in encoding/xml has the benefit of not doing namespace transformations on the start element (we can just call pushElement directly). This reduces overhead and possibly other subtle issues due to double decoding and the way XML namespaces are handled.

/cc @rsc

@gopherbot gopherbot added this to the Proposal milestone Nov 3, 2021
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Nov 3, 2021
@SamWhited
Copy link
Member Author

Ping

@rsc
Copy link
Contributor

rsc commented Jan 12, 2022

We haven't gotten this out of incoming yet, sorry. It won't be for Go 1.18.

@rsc rsc moved this from Incoming to Active in Proposals (old) Jan 12, 2022
@rsc
Copy link
Contributor

rsc commented Jan 12, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented Jan 19, 2022

I'm confused about

func (something) HandleXML(r xml.TokenReader, start *xml.StartElement) error { 
 	d := xml.NewTokenDecoderElement(r, start)

        var foo someType
        d.DecodeElement(&foo, start)

Why does start appear twice?

What is wrong with the current API which allows

d := xml.NewTokenDecoder(r)
d.DecodeElement(&foo, start)

?

@SamWhited
Copy link
Member Author

This contrived example shows the error:

https://go.dev/play/p/P2n9VLzEQX6

Because the start token has already been popped from the stream (to check what the name is to trigger the appropriate handler or callback) it is never read through the decoder. If you call decode there is no start element on the stack and the XML appears invalid. There is a workaround as detailed in the initial post, but this requires yet another wrapper and looks rather ugly.

@rsc
Copy link
Contributor

rsc commented Feb 9, 2022

I still need to take time to understand this better.

@rsc
Copy link
Contributor

rsc commented Aug 17, 2022

Instead of xml.NewTokenDecoderElement, what about adding a Push method to Decoder, like:

package xml

// Push pushes the token onto the input stack, so that the next call to RawToken, Token, or Decode
// will use this token as the first element of the input.
func (d *Decoder) Push(Token)

Then NewTokenDecoderElement = NewTokenDecoder+Push, but Push could be used in other contexts as well.

Thoughts?

@SamWhited
Copy link
Member Author

That seems better to me at first glance since it's a lot more flexible. I'm assuming any XML errors would be deferred until the token was read back out?
Even if XML level errors are deferred, is an error or ok bool on Push necessary in case the type pushed is not one of the token types?

I'll go stick that in some of the code where I would have used NewTokenDecoderElement and see what it looks like.

@SamWhited
Copy link
Member Author

Actually, looking again this isn't the same thing. The next call to Token would return the start token, but in the initial examples the start token has already been consumed, its namespaces are just being added as the default (if necessary) and a matching end element will be necessary. Maybe Push should operate that way so that we don't end up with a double-decoding issue if one of the underlying readers somewhere in the pipeline is also a decoder.

@rsc
Copy link
Contributor

rsc commented Oct 26, 2022

Actually, looking again this isn't the same thing. The next call to Token would return the start token, but in the initial examples the start token has already been consumed, its namespaces are just being added as the default (if necessary) and a matching end element will be necessary.

Apologies for the delay in replying.

I don't understand this comment. It sounds like you are saying that Push cannot be used to "un-read" a token from a decoder. I agree with that. But I thought we were talking about prepping a new decoder that hasn't been used at all yet. In that case, it seems like

d := xml.NewTokenDecoder(r)
d.Push(start)
d.DecodeElement(&foo)

should work. It's true that it will never make sense to push an element that was just read from the TokenDecoder. But you could imagine other manipulations that might make sense, such as peeking at the next element from r to see if it's something you want to modify and then pushing that directly.

@SamWhited
Copy link
Member Author

I'm having a hard time expressing this in an example as it either becomes very long or doesn't look like it makes sense. The question is, with your proposal, which of the following would be returned (ignoring for a moment that it's silly to pop a token just to push it immediately):

r := strings.NewReader(`<one xmlns="foo"><two/></one>`)
d := xml.NewDecoder(r)
one, _ := d.Token()
d.Push(one.(xml.StartElement))

// Does this result in "one" being returned, or "two"?
d.Token()

@rsc
Copy link
Contributor

rsc commented Oct 26, 2022

d.Token would return one, but that's not how you'd use d.Push. Instead you would make a new decoder, as in your original comment above, and seed it with "one".

@SamWhited
Copy link
Member Author

Right, so in an example like the following:

r := strings.NewReader(`<one xmlns="foo"><two/></one>`)
d := xml.NewDecoder(r)
tok, _ := d.Token()
start := tok.(xml.StartElement)

switch start.Name.Space {
case "foo":
  FooHandler(Middleware(d), start)
case "bar":
  BarHandler(d, start)
}

If we wanted to use push inside FooHandler, we already know what start is, so we have to pop it even though we've already popped it once and then pushed it back onto the stack. eg.

func FooHandler(r xml.TokenReader, start xml.StartElement) {
  d := xml.NewTokenDecoder(r)
  d.Push(start)
  d.Token() // Discard, we already know what this is and don't need it again.
  d.DecodeElement(&whatever, start)
}

Instead, it would be nice if we could avoid the unnecessary d.Token since we know what we're pushing already, there's no need to pop it again.

@rsc
Copy link
Contributor

rsc commented Apr 19, 2023

Paging this back in again. I'm OK with having d.Push and d.Token and having to use both in that example. Because you might just as often use

d := xml.NewTokenDecoder(r)
d.Push(start)
d.Decode(&whatever)

Of course as was noted in the initial post, a custom xml.TokenReader can do the same already, something like

d := xml.NewTokenDecoder(xmlcat(start, r))
d.Decode(&whatever)

I'm not sure there's really a huge amount of difference between those, and since third-party implementations of xmlcat already exist (including your xmlstream.MultiReader), I think maybe we should leave things as they are.

I thought for a bit about Push and I can't decide which semantics of Push you'd really want. That is, if you want to push <A><B><C> into a stream that continues data</C></B></A>, there are reasonably arguments for writing Push(A); Push(B); Push(C) and reasonable arguments for writing Push(C), Push(B), Push(A), depending eventually on whether you think of the token sequence being manipulated by Push as a stack attached to the input or a queue checked before the main input. Using a custom xmlstream.MultiReader lets those APIs define that detail in whatever way is natural instead of us having to decide it.

@rsc
Copy link
Contributor

rsc commented May 3, 2023

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc
Copy link
Contributor

rsc commented May 10, 2023

No change in consensus, so declined.
— rsc for the proposal review group

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Declined
Development

No branches or pull requests

3 participants