Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: encoding/xml: option to treat unknown fields as an error #30301

Closed
zelch opened this issue Feb 18, 2019 · 28 comments
Closed

proposal: encoding/xml: option to treat unknown fields as an error #30301

zelch opened this issue Feb 18, 2019 · 28 comments

Comments

@zelch
Copy link

zelch commented Feb 18, 2019

Towards the end of 2013, issue #6901 was opened requesting support to error on unknown fields instead of silently dropping the data.

In 2016, that issue was closed in favor of issue #15314, where DisallowUnknownFields was implemented for JSON, however that issue was JSON specific, and as a result the issue was never really resolved for encoding/xml.

As such, I would like to propose either adding DisallowUnknownFields to the encoding/xml Decoder type, to exactly mirror the JSON API, or adding DisallowUnknownElements and DisallowUnknownAttributes to the encoding/xml Decoder type.

The former would provide more consistency between the JSON and XML interfaces, the latter would be a moderately better fit for XML.

From a look through the code, it looks like it should be reasonably trivial to implement either version, and I would be happy to do the implantation work if there is agreement on how to go about it.

@dmitshur dmitshur added FeatureRequest NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Feb 22, 2019
@dmitshur dmitshur added this to the Go1.13 milestone Feb 22, 2019
@dmitshur
Copy link
Contributor

/cc @rsc per owners.

@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@gopherbot
Copy link

Change https://golang.org/cl/191962 mentions this issue: encoding/xml: disallow unknown elements/attributes in Decoder

@escholtz
Copy link
Contributor

escholtz commented Sep 18, 2019

I think we should consider returning all unknown elements/attributes (or perhaps the first N). This is useful for cases where xml is being used as a file format and you want to find all element typos/mistakes at once (in the same way the go compiler returns more than one error at once). Behavior could be controlled using an integer instead of a bool.

The downside is that the style would be different from the encoding/json package. While consistency is ideal, there are already several differences between encoding/json and encoding/xml so maybe one more would be ok?

@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@utrack
Copy link

utrack commented Sep 15, 2021

It would be really helpful to have this - the CL mentioned here is fairly compact and goes in style with JSON decoder.

@gopherbot
Copy link

Change https://golang.org/cl/362243 mentions this issue: encoding/xml: optional Decoder field enforce

@johnnybubonic
Copy link

It appears this has stagnated. I have resolved the merge conflicts and otherwise presented the original CL as-is.

https://go-review.googlesource.com/c/go/+/362243

@johnnybubonic
Copy link

Update: no update, no traction, no response, no review.

Time moves on, another day without strictly enforced XML. Once again, XML is treated as the redheaded stepchild compared to JSON. I tire.

🕓

@cristaloleg
Copy link

Should this be added to the review meeting minutes?

@johnnybubonic
Copy link

If you think it'd help! It's past the freeze so..

@ianlancetaylor ianlancetaylor changed the title encoding/xml: option to treat unknown fields as an error proposal: encoding/xml: option to treat unknown fields as an error Feb 10, 2022
@ianlancetaylor ianlancetaylor modified the milestones: Backlog, Proposal Feb 10, 2022
@ianlancetaylor ianlancetaylor removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. FeatureRequest labels Feb 10, 2022
@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Feb 10, 2022
@ianlancetaylor
Copy link
Contributor

This is an API change so I'll move it into the proposal process.

It's definitely true that there is very little maintenance on XML and also for that matter JSON.

@rsc
Copy link
Contributor

rsc commented Feb 16, 2022

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc rsc moved this from Incoming to Active in Proposals (old) Feb 16, 2022
@johnnybubonic
Copy link

@rsc Much thanks!

@rsc
Copy link
Contributor

rsc commented Feb 23, 2022

It sounds like there are three things to disallow:

  • DisallowUnknownElements, to reject nested elements we don't expect
  • DisallowUnknownAttributes, to reject attributes we don't expect
  • DisallowUnknownCDATA, to reject plain text (other than white space) we don't expect (?)

Do we need the third? Probably?

@johnnybubonic
Copy link

That's an interesting idea. I didn't see any mention of the need in any of the original discussion when Eddie Scholtz submitted his original CL.

What would be the potential use case for that? My personal use case for DisallowUnknownElements/DisallowUnknownAttributes is so I can know if my structs match the input rather than vice versa.

I use W3C XSD for the actual data input validation (via https://pkg.go.dev/github.com/lestrrat-go/libxml2/xsd but there are a couple other implementations out there as well) - which I'm certainly not suggesting be stdlib (as there are multiple data validation/schema formats out there for XML besides XSD - DTD, Schematron, RELAX NG, etc.).
I think this is where XML differs significantly from JSON - there's a handful of validation methods out there as XML itself is expected to conform to a schema (however it is defined/implemented). Although some attempts at JSON validation have occurred, JSON doesn't have that inherent expectation.

All that to say that typically XML validation itself is done via schema implementation, but as there isn't a way for golang to "self-generate" a struct type definition from an XSD (or et. al.), this is where the Disallow* would come into play. Typically in XML, the element contents would be validated by the schema mechanism so I'm having difficulty figuring out where DisallowUnknownCDATA plays into that.

Does that make sense? I'm not opposed to DisallowUnknownCDATA, it just seems to be incongruous to the other two proposed additions in function -- at least in my own mind.

@escholtz
Copy link
Contributor

We needed DisallowUnknownElements and DisallowUnknownAttributes. But adding all three sounds reasonable.

Our specific use case is service configuration via xml files. Sometimes customers hand edit the configuration. Catching typos during unmarshalling is a better user experience than silently using default values for a given field.

Ideally we would be able to provide line/column number (or atleast byte offset which can be converted) for unknown fields and attributes. It would also be ideal to gather all of the unknown fields and attributes instead of just the first one. I think much of this quickly grows beyond the scope of a standard library package and reaches the point where it would be better served by a more general purpose xml parser. I suspect everyone's needs and use cases will be slightly different. However, at least providing similar standard library functionality to the json package seems reasonable.

@johnnybubonic
Copy link

We needed DisallowUnknownElements and DisallowUnknownAttributes. But adding all three sounds reasonable.

Right; I'm just trying to keep in mind feature creep -- would specifying unknown CDATA inside an element be considered structure non-conformance or data non-conformance? Keeping in mind feature-parity with JSON, is it closer to an improperly formatted JSON key, or is it closer to improper JSON value? This is where things start to get a little philosophical and weird.

Our specific use case is service configuration via xml files. Sometimes customers hand edit the configuration. Catching typos during unmarshalling is a better user experience than silently using default values for a given field.

Agreed, for sure.

Ideally we would be able to provide line/column number (or atleast byte offset which can be converted) for unknown fields and attributes. It would also be ideal to gather all of the unknown fields and attributes instead of just the first one. I think much of this quickly grows beyond the scope of a standard library package and reaches the point where it would be better served by a more general purpose xml parser. I suspect everyone's needs and use cases will be slightly different. However, at least providing similar standard library functionality to the json package seems reasonable.

Yeah, and that's my big worry here -- feature creep. Some of the above is already implemented in the various schemas I mentioned, and have analogues in corresponding libraries that DO already exist:

  • XSD (25 libraries)
  • Schematron (no libraries, but os.Exec could call to external tool)
  • RELAX NG (no libraries(?), etc.)
  • DTD (2 libraries)

But I think there needs to be some method of Golang recognizing that there are certain structural components that were otherwise silently ignored, which the DisallowUnknownElements and DisallowUnknownAttributes both would serve nicely. The question I'm wrestling with is whether an unknown CDATA qualifies as (formatted/contained) data or structure. The former which I think Golang stdlib oughtn't concern with, since there's no JSON analogue and it's going to be probably specific to implementation.

Theoretically CDATA should be allowed anywhere regular data is since it's just that -- character data; not intended to be parsed as actual XML/SGML/etc. by the parser. So if this would be implemented, it probably ought to be implemented as a more generic "must be empty" struct tag rather than a DisableUnknownCDATA option. I'd imagine a lot of that handling would be more related to content of elements (e.g. off the top of my head, #21399).
Since the purpose of CDATA is for the containing data to not be parsed, we wouldn't really have a method of even identifying what CDATA would be allowed vs. what wouldn't.

But that's why I'm interested in a use case @rsc had in mind because I don't know if I'm glossing over something mentally.

@zelch
Copy link
Author

zelch commented Feb 23, 2022

I would suggest reframing it ever so slightly.

It is about data 'which has no location in the structure'.

It's not that CDATA is bad, it's that any data that we can't actually store in the given structure is potentially bad. It most definitely means that it is data that you can't do a round trip between XML -> Go data structures -> XML on.

The key here is, of course, the 'unknown' part. This is all about unknown data. And unknown CDATA is just as much of a thing to want to reject as unknown tags or attributes.

And for many of the same reasons, it is very likely to be an error by whoever or whatever generated the XML, and using just the tools provided by the standard library you not only can't do anything with it, right now you can't even know it exists.

Sure, stepping outside of stdlib gives you more options, but sometimes what you want is the ability to easily handle XML just like you handle JSON, including whatever content validation that you use on the resulting data structures. And that's just not practical if there are ways to insert data that is just silently dropped.

@johnnybubonic
Copy link

@zelch Hrmm. Can you provide me with an example of what this would look like with some sample XML and the behavior you're looking for? I'm still having difficulty visualizing this.

I think by and large it's because I'm unclear what constitutes exactly what "unknown CDATA" actually is. I've been understanding it to mean "content within an element that is not intended to be understood as structure", which is what CDATA's intent is, in which case this can be captured with the ,chardata tag option already:

  • If the XML element contains character data, that data is
    accumulated in the first struct field that has tag ",chardata".
    The struct field may have type []byte or string.
    If there is no such field, the character data is discarded.

(https://pkg.go.dev/encoding/xml#Unmarshal)

Data "which has no location in the structure" ("structure" here meaning the Golang struct) would either be an element or an attribute and nothing else, would it not?

Conceptually I'm unclear what data which has no location in the structure, where structure is that of the XML document, would actually be as that'd be just plain invalid XML from my understanding, e.g.

somethinghereisnotvalid
<root>
  <child/>
</root>

Can you give me some sample XML you're describing and the behavior you want to see?

@zelch
Copy link
Author

zelch commented Feb 23, 2022

Entirely from memory:

<root>
  random text
  <subitem attr='value'>
    more random text
      <tag>text we want</tag>
  </subitem>
</root>

Where does the random text go in a Go Structure?

@johnnybubonic
Copy link

It isn't captured by Root.Cdata and Sub.Cdata?

package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

type Root struct {
	Name    xml.Name `xml:"root"`
	Cdata   []byte   `xml:",chardata"`
	SubItem Sub      `xml:"subitem"`
}

type Sub struct {
	Name    xml.Name `xml:"subitem"`
	Attr    string   `xml:"attr,attr"`
	Cdata   []byte   `xml:",chardata"`
	TagItem string   `xml:"tag"`
}

var xmlDoc []byte = []byte(`
<root>
  random text
  <subitem attr='value'>
    more random text
      <tag>text we want</tag>
  </subitem>
</root>`)

func main() {

	var err error
	var r Root = Root{}

	if err = xml.Unmarshal(xmlDoc, &r); err != nil {
		log.Panicln(err)
	}

	fmt.Printf("%#v\n", r)
}

@rsc
Copy link
Contributor

rsc commented Mar 2, 2022

Yes, and the question is whether, if the xml:",chardata" fields are missing and the cdata would otherwise be thrown on the floor, there should be a flag to make that an error, same as throwing an attribute or a sub-element on the floor.

@johnnybubonic
Copy link

johnnybubonic commented Mar 2, 2022

@rsc AH, okay, I think I see. My confusion was stemming from a couple things:

  • My understanding of "CDATA" used in the original proposed addition referring to CDATA rather than "content within an element ('character data') not otherwise captured".
    • The "Unknown" in the method name, which indicates a structural element to me (which confounded further my original "character data" vs. "CDATA" understanding)
  • There already being a way to capture and check for (otherwise) unserialized content within an element with the ,chardata xml tag option, of which there isn't an equivalent for elements and attributes without the proposed CL.
  • In e.g. the example above, random text is considered content of the root element per XML spec - so in the XML perspective, it's e.g. /root/text() in XPath 1.0,for instance (specifically, /root/text()[1]). In other words, it seems the purpose of this specific Disallow would be to avoid needing a ,chardata field for every struct and checking if it is nonempty?

It seems then that this particular Disallow is desired to catch content that exists that "shouldn't" due to malformatting in the original data rather than attempting to disallow mismatching structure (as random text in the above example is itself considered content rather than structure in XML; it's contained within an element within the document/DOM).

I resubmitted the original CL with the intent of getting closer feature parity with (encoding/json).Decoder (i.e. structural-only).

While I'm certain that some people may find a use for a DisallowUncapturedChardata or DisallowUncapturedText (which is what I'd recommend for naming instead of DisallowUnknownCDATA to avoid ambiguity and to better describe what it actually does), I believe this would best be addressed in a different proposal and CL as it addresses a different type of enforcement (content instead of structural) and has no JSON equivalent or analog (indeed, this particular situation itself has no JSON equivalent as JSON considers it invalid to mix content and structure within a structure field whereas XML considers it valid).

Is this acceptable?

@rsc
Copy link
Contributor

rsc commented Mar 9, 2022

Yes, ,chardata will allow accumulating unexpected text, but similarly ,any will allow accumulating unexpected elements and ,any,attr will allow accumulating unexpected attributes.

If ,chardata means that we don't need DisallowUnknownChardata, maybe ,any and ,any,attr mean we don't need these other disallows either?

@rsc
Copy link
Contributor

rsc commented Mar 23, 2022

Still would like to hear what people think in response to the questions in the previous comment.

@escholtz
Copy link
Contributor

I think the ,any approach is reasonable and readable. The other advantage is that the caller can traverse the struct tree they are unmarshalling into and know where exactly (tree-wise) the error is occurring.

The second approach would be to accumulate all unknown elements, attributes, and chardata in a custom error type that gets returned at the end of unmarshalling (assuming the flag is set). This might be easier for callers with a large tree of structs (10s or 100s of different types) since they don't need to add ,any receiver fields to each one. Ideally we could also include byte offsets for each unknown which could then be converted to line/column by the caller. This enables more useful error messages when xml is being used for configuration. One disadvantage of this approach is that the location within the tree is unavailable. I suspect line/column will be more useful than tree location but I don't have any data to back that up.

@rsc
Copy link
Contributor

rsc commented Mar 30, 2022

Okay, so given that we have ,any and ,any,attr, does that mean we don't need to add this new mode being proposed?

@rsc rsc moved this from Active to Likely Decline in Proposals (old) Apr 13, 2022
@rsc
Copy link
Contributor

rsc commented Apr 13, 2022

Based on the discussion above, this proposal seems like a likely decline.
— rsc for the proposal review group

@rsc rsc moved this from Likely Decline to Declined in Proposals (old) May 4, 2022
@rsc
Copy link
Contributor

rsc commented May 4, 2022

No change in consensus, so declined.
— rsc for the proposal review group

@rsc rsc closed this as completed May 4, 2022
@golang golang locked and limited conversation to collaborators May 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
No open projects
Development

No branches or pull requests

10 participants