Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/xml: unmarshal only processes first XML element #20754

Open
rsc opened this issue Jun 22, 2017 · 7 comments
Open

encoding/xml: unmarshal only processes first XML element #20754

rsc opened this issue Jun 22, 2017 · 7 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@rsc
Copy link
Contributor

rsc commented Jun 22, 2017

If you Marshal []int{1,2} you get <int>1</int><int>2</int>, but then if you Unmarshal it back into a new slice, you get just []int{1}. Unmarshal simply stops after the first top-most XML element, because it is implemented as NewDecoder(bytes.NewReader(data)).Decode(v). When v is not a slice, this makes sense. But if v is a slice there's an argument that maybe Unmarshal should repeat the Decode calls until it reaches the end of the data. That's maybe easier said than done, and also maybe not a compatible change, but we should at least consider it. The Decoder itself is right to process only a single element, since it is processing an arbitrary input stream that might block if one reads too far. But Unmarshal, holding a []byte, has perfect knowledge of the remainder of the stream and might be able to do better. Or perhaps Unmarshal should return an error.

package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

func main() {
	x1 := []int{1, 2}
	out, _ := xml.Marshal(x1)
	fmt.Println("XML:", string(out)) // <int>1</int><int>2</int>

	x2 := []int{3}
	if err := xml.Unmarshal(out, &x2); err != nil {
		log.Fatal(err)
	}
	fmt.Println(x2) // [3 1] not [3 1 2]
}

In contrast, the equivalent input given to json.Unmarshal produces an error:

package main

import (
	"encoding/json"
	"fmt"
	"log"
)

func main() {
	out := []byte("[1] [2]")
	x2 := []int{3}
	if err := json.Unmarshal(out, &x2); err != nil {
		log.Fatal(err) // invalid character '[' after top-level value
	}
	fmt.Println(x2)
}

This is more justified in the case of JSON, since Marshaling []int{1,2} does not produce [1] [2].

@rsc rsc added this to the Go1.10 milestone Jun 22, 2017
@rsc rsc added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Jun 22, 2017
@SamWhited
Copy link
Member

SamWhited commented Jun 22, 2017

I think there may be two issues here; the other is that the call to Marshal is producing invalid XML. All XML MUST have a root element; if we're marshaling using an Encoder and a root element has already been written, writing <int>1</int><int>2</int> for a call to Encode with a slice is probably correct, but marshal is not for writing streams, it is for marshaling an entire XML document given some data (I think? maybe that understanding is incorrect), so I would expect it to write a root element.

EDIT: To clarify, the "correct" way to write an array or slice type, assuming we were marshaling it as a complete document and not part of a stream, in XML would be something like this (names don't matter, I just stuck with "int" and arbitrarily picked "slice" as the root):

<slice>
  <int>1</int>
  <int>2</int>
</slice>

@rsc
Copy link
Contributor Author

rsc commented Jun 23, 2017

Sorry @SamWhited, but I disagree. Marshal clearly can be used to generate fragments of XML documents as well, for example if you pass it a slice of things to marshal. The docs explicitly contemplate this:

Marshal handles an array or slice by marshaling each of the elements.

We can't change that now, and I think it would be a mistake to do so anyway, since it would take control of the wrapping away from the users.

@SamWhited
Copy link
Member

SamWhited commented Jun 23, 2017

but I disagree

Marshal handles an array or slice by marshaling each of the elements.
We can't change that now

Fair enough; I did not take that statement in the docs to mean that it marshals each element and does not ensure validity or a root node, but as you said, there's probably no way to change that anyways even if we agreed.

On an interesting, but unrelated note, in the paper you recommend ("The Essence of XML; Siméon, Wadler") it says that "lists" in XML are actually space-separated (<ints>1 2 3</ints>). I've never seen this, but I also avoid XML schema as much as possible and haven't read much of the spec. Just thought it was interesting; maybe I've been doing lists wrong.

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 Jun 13, 2018
@rsc rsc added the early-in-cycle A change that should be done early in the 3 month dev cycle. label Nov 14, 2018
@rsc rsc modified the milestones: Go1.12, Go1.13 Nov 14, 2018
@webern
Copy link

webern commented Apr 29, 2019

On an interesting, but unrelated note, in the paper you recommend ("The Essence of XML; Siméon, Wadler") it says that "lists" in XML are actually space-separated (<ints>1 2 3</ints>). I've never seen this, but I also avoid XML schema as much as possible and haven't read much of the spec. Just thought it was interesting; maybe I've been doing lists wrong.

@SamWhited XSD is a rich specification language. With it, one could specify that the tag takes a space delimited list of numbers, or you could specify a comma delimited list of numbers, or whatever.

The value inside of the tag would be an xs:string with further constraint of a regex pattern like ^[0-9]{1}[\s0-9]+[0-9]{1}$ (or something like that). A validating parser would validate that the string matches the regex pattern, but that list of ints would just be the string "1 2 3".

It would be unusual to do this. I think someone would only do this in their schema if they were trying to save space.

I am saddened that XML has fallen out of favor since the XSD specification language is so powerful (it's akin strong static typing vs what feels like dynamic typing in JSON). Unfortunately there is no way that I know of to validate an XML file against an XSD with pure Go. The only way to do it that I know of is with libxml2 and cgo.

In fact, today I was musing at what it would take to port libxml2 to Go. I think cloc had libxml2 around 200k lines of C code. I image porting C to Go is somewhat 1-to-1.

@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@JoseFMP
Copy link

JoseFMP commented Jun 15, 2020

Any news about this issue? Half a year later still happening over here.

@ianlancetaylor
Copy link
Contributor

No news, sorry. Any fixes will be reported here.

@fastfading
Copy link

unmarshal only processes first XML element
is very bad for xml design.
since you defined xml:",innerxml"
some time we want to continue to parse this innerxml in another function , or another module
innerxml may contain multiple element, but it only process the first XML element.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
None yet
Development

No branches or pull requests

8 participants