x/net/html: unexpected whitespace rendering of html #30772

tcurdt · 2019-03-12T14:15:21Z

What version of Go are you using (`go version`)?

$ go version
go version go1.12 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/tcurdt/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/tcurdt/.go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.12/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.12/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/pf/7vhqx5bn41qddypw08w9jc4w0000gn/T/go-build451640899=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I am parsing and then rendering html.

package main

import (
	"bytes"
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	var input = `<!DOCTYPE html>
<html>
<head>
  <title>Title of the document</title>
</head>
<body>
   body content <p>more content</p>
</body>
</html>`
	doc, err := html.Parse(strings.NewReader(input))
	if err != nil {
		fmt.Fprintf(os.Stderr, "error parsing: %s\n", err.Error())
		os.Exit(1)
	}

	buf := bytes.NewBufferString("")
	html.Render(buf, doc)

	fmt.Println("--")
	fmt.Print(input)
	fmt.Println("--")

	fmt.Println("--")
	fmt.Print(buf.String())
	fmt.Println("--")
}

What did you expect to see?

With the docs saying:

Rendering is done on a 'best effort' basis: calling Parse on the output of Render will always result in something similar to the original tree, but it is not necessarily an exact clone unless the original tree was 'well-formed'.

Given that the HTML is well-formed I'd expect the output be the same as the input.

What did you see instead?

Instead I am seeing changes in whitespace:

--
<!DOCTYPE html>
<html>
<head>
  <title>Title of the document</title>
</head>
<body>
   body content <p>more content</p>
</body>
</html>--
--
<!DOCTYPE html><html><head>
  <title>Title of the document</title>
</head>
<body>
   body content <p>more content</p>

</body></html>--

IMO there should be a test case verifying that the output matches in input for the documented case.

The text was updated successfully, but these errors were encountered:

julieqiu · 2019-03-12T19:29:27Z

cc: @bradfitz

empijei · 2019-03-12T20:52:59Z

Nit: I think this is related to x/net/html, not x/net/template.

TomAnthony · 2019-04-18T19:22:55Z

A minimal example input:

<html><head></head><body></body>
</html>

Outputs in go:

<html><head></head><body>



</body></html>

Chrome gives identical output:

<html><head></head><body>



</body></html>

As does Firefox:

<html><head></head><body>



</body></html>

TomAnthony · 2019-04-18T19:30:34Z

Things more interesting with comment blocks, where go, Chrome and Firefox all disagree...

Input:

<html><head></head><body>
	<!--a-->

		<!--b-->
			
</body>
</html>

<!--c-->

Outputs in go, changes whitespace but keeps final comment in same relative place:

<html><head></head><body>
	<!--a-->

		<!--b-->





</body></html><!--c-->

Chrome moves the final comment inside the body:

<head></head><body>
	<!--a-->

		<!--b-->
			



<!--c--></body></html>

Firefox does same as Go with slightly different spacing:

<html><head></head><body>
	<!--a-->

		<!--b-->
			



</body></html>
<!--c-->

TomAnthony · 2019-04-18T19:42:34Z

According to Section 12.2.6.4.22, I believe Chrome is at fault with regards to the final comment positioning. However, I think that is a separate issue to the original reported issue here.

tcurdt · 2019-04-18T23:34:50Z

There are a couple of facets to this IMO:

golang.org/x/net/html does not always allow for a parsing-serialization round trip. This is sad but documented
The documentation suggests that a round trip is possible as long as the document is well-formed. This poses an expectation that is not met (as shown in the code example). Hence I opened this issue.
The result of a browser parse-serialization round trip is inconsistent across implementations and possibly non-conformant to the standard. IMO this should not be really relevant to golang.org/x/net/html though. In the end those browsers are just markup consumers. While defined in the standard how they parse and build the DOM should not be relevant to the go package. Plus we would have to compare the actual DOMs not the serialization of it. In the end the difference could be the serialization code instead.

So what to do?

Given that browser implementations are out of scope to fix, it seems sane to just ignore them. The easiest "solution" would be to just change the docs to reflect 2).

Given that I am rather an advocate of a proper parsing-serialization round trip I'd love to see at least a test case based on the above example.

But in the end it comes down to what golang.org/x/net/html wants to be and whether it wants to fulfill the contract given in the current documentation.

TomAnthony · 2019-04-19T06:53:42Z

I agree with you @tcurdt that the browser implementations shouldn't necessarily impact the approach used in golang.org/x/net/html, which should be based on the spec.

However, the spec is very complex and the browser implementations in Chrome & Firefox serve as two well-maintained implementations which can serve as a cross reference. In the original report above, the Chrome & Firefox (and also html5lib) all agree with golang.org/x/net/html.

The documentations mentions Parse on the output of Render will always result in something similar to the original tree which is Render -> Parse, but the report does Parse -> Render. If the resultant snippet above is fed back into Parse (and then Render again) the results don't change.

Are we confident there is a mistake in the implementation of the spec?

tcurdt · 2019-04-19T09:25:21Z

@TomAnthony but you left out the most relevant part for this issue when quoting

unless the original tree was 'well-formed'.

which gives this a completely different spin.

TBH I don't have enough knowledge of the spec to make any educated comments - but I do think it's surprising that there are whitespace changes happening on a Parse -> Render operation.

In my use case I wanted to parse a HTML file, change something and then write it back. Especially given that humans would not use whitespace as rendered this is a problem for that task at hand.

Ideally the round trip would not change a thing. As a second choice I would use my own renderer of the DOM tree and so be in control of the rendered output - but I didn't see an obvious way to do this yet.

TomAnthony · 2019-04-19T10:10:28Z

@tcurdt - yeah, I definitely see your point and I agree it seems surprising. However, I am unsure exactly how to define 'well-formed'...

The spec states The DOM will not let Document nodes have Text node children, so they are dropped on the floor., which indicates that text nodes are not allowed between </body> and </html>.

However, rather than being 'dropped on the floor' the white space between your </body> and </html> is being moved inside the </body> tag. The fact that 4 different implementations exhibit this same behaviour makes me think there is part of the spec we are still not understanding.

Either way, I think the white space between </body> and </html> is always going to be removed because the parsing phase will remove it according to the spec. For that reason I also think your own renderer wouldn't help (the parsing step will already have removed/moved that text node).

I think the behaviour that is interrupting your process is entirely prompted by whitespace outside of the <body>; whitespace inside the <body> shouldn't be moved. Depending on what specifically you are trying to do, you could perhaps snip the <body> our of your input file and parse it as a fragment. Then snip the resulting <body> out of the serialised output and use that to replace the <body> contents from your input file?

tcurdt · 2019-04-19T18:44:10Z

If there isn't really a good path in code to solve this I think the "well-formed" part should be removed from the documentation - especially if there is a uncertainty of the definition. That would create much less confusion given the start of the sentence.

As for the actual behaviour it kind of means there are two representations of the DOM. The input as usually given by humans and the rendered version that conforms to the spec. So unless a non-standard DOM is allowed to exist there is no way to make the roundtrip work without a hitch.

It is indeed mostly the white space outside the body element that is giving me grief. Working around this by dealing with fragments and then combining the strings feels like a hack and quite fragile.

I think there is a place for both the spec compliant and the non-spec compliant representation. But that's just not supported. I guess I either need to make my peace with the change of whitespace or write my own parser/dom. A bit of a shame.

Since I doubt the golang.org/x/net/html is willing to allow a change for a non-compliant DOM a change of documentation might be the right path forward.

TomAnthony · 2019-04-19T20:29:43Z

I'm just a 'visitor' here - I recently had an issue in parse.go and thought I may be able to help with your issue 🙂. Someone else can may offer a more 'official' response to what is considered well-formed, and as to whether golang.org/x/net/html might be change to allow a non-compliant DOM (I highly doubt it, and am unsure how this should even work).

I disagree that the docs should be updated (or at least the segment you referenced) as that talks about the rendering step, but your issue occurs in the parsing step. The output of render produces an identical tree if fed back into parse.

I actually think if you want to maintain "ill-formed" whitespace outside of the body, and your changes are all in the body, that the fragment approach would be an ok approach. I think trying to create your own parser/renderer would be a recipe for pain!

Sorry I cannot be more help! 😞

tcurdt · 2019-04-19T22:02:49Z

Well, the "how this should even work" is probably the least of the problems. Without even looking at the implementation I am pretty sure that at some part of the way of building the DOM some nodes are left out are fixed for compliance reasons. A simple KISS parser would just build the "illegal" DOM as is.

You have a point on the rendering step though.

calling Parse on the output of Render will always result in something similar to the original tree

it does not make any claims on parsing the original input but just the rendering of the input.
But then again I have missed the part in the documentation where it talks about the input of a non-compliant tree getting transformed into a compliant tree.

At hindsight looking at the documentation of Parse

It implements the HTML5 parsing algorithm, which is very complicated. The resultant tree can contain implicitly created nodes that have no explicit listed in r's data, and nodes' parents can differ from the nesting implied by a naive processing of start and end s. Conversely, explicit s in r's data can be silently dropped, with no corresponding node in the resulting tree.

From this it should be clear that a roundtrip cannot be ensured.
I just had the expectation it would.

Writing a parser that just builds the given html into a (non-compliant) DOM really is no rocket science. I assume one could even use the existing tokenizer. As hinted above it will be much simpler than what the current implementation does. I just hoped to avoid it.

TomAnthony · 2019-04-20T07:52:58Z

The problem with creating a "non-compliant DOM" is the DOM is designed to be a tree structure. Some badly formed HTML cannot form a tree:

<form><div></form></div>

This cannot be turned into a tree and re-serialised as it is. The HTML 5 spec has algorithms designed to handle this sort of case and re-nest things correctly. There are many other such cases.

However assuming your input is guaranteed to be well-formed (in terms of tag nesting, rather than what tags/tokens are allowed where), then you may find an existing parser that works (maybe lxml/BeautifulSoup?).

Good luck! 🙂

tcurdt · 2019-04-20T08:39:45Z

Yes, one has to draw a line somewhere. It's gotta be a tree :)

TomAnthony · 2019-04-20T08:52:55Z

@tcurdt lxml may work for you then; this maintains the whitespace between </body> and </html> (but ditches the whitespace after </html>):

import lxml.html

foo = """<!DOCTYPE html>
<html>
<head>
  <title>Title of the document</title>
</head>
<body>
   body content <p>more content</p>
</body>

</html>

	"""

bar = lxml.html.fromstring(foo)

print(lxml.html.tostring(bar))

tcurdt · 2019-04-20T09:02:36Z

...but that's not in go :)

TomAnthony · 2019-04-20T09:47:17Z

Maybe https://github.com/lestrrat-go/libxml2 or https://github.com/moovweb/gokogiri?

julieqiu added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 12, 2019

julieqiu changed the title ~~unexpected whitespace rendering of html~~ html/template: unexpected whitespace rendering of html Mar 12, 2019

julieqiu changed the title ~~html/template: unexpected whitespace rendering of html~~ x/net/template: unexpected whitespace rendering of html Mar 12, 2019

gopherbot added this to the Unreleased milestone Mar 12, 2019

mikioh changed the title ~~x/net/template: unexpected whitespace rendering of html~~ x/net/html/template: unexpected whitespace rendering of html Mar 13, 2019

robpike changed the title ~~x/net/html/template: unexpected whitespace rendering of html~~ x/net/html: unexpected whitespace rendering of html Mar 13, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x/net/html: unexpected whitespace rendering of html #30772

x/net/html: unexpected whitespace rendering of html #30772

tcurdt commented Mar 12, 2019

julieqiu commented Mar 12, 2019 •

edited

empijei commented Mar 12, 2019

TomAnthony commented Apr 18, 2019

TomAnthony commented Apr 18, 2019

TomAnthony commented Apr 18, 2019

tcurdt commented Apr 18, 2019

TomAnthony commented Apr 19, 2019

tcurdt commented Apr 19, 2019

TomAnthony commented Apr 19, 2019 •

edited

tcurdt commented Apr 19, 2019

TomAnthony commented Apr 19, 2019

tcurdt commented Apr 19, 2019

TomAnthony commented Apr 20, 2019

tcurdt commented Apr 20, 2019

TomAnthony commented Apr 20, 2019

tcurdt commented Apr 20, 2019

TomAnthony commented Apr 20, 2019

x/net/html: unexpected whitespace rendering of html #30772

x/net/html: unexpected whitespace rendering of html #30772

Comments

tcurdt commented Mar 12, 2019

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

julieqiu commented Mar 12, 2019 • edited

empijei commented Mar 12, 2019

TomAnthony commented Apr 18, 2019

TomAnthony commented Apr 18, 2019

TomAnthony commented Apr 18, 2019

tcurdt commented Apr 18, 2019

TomAnthony commented Apr 19, 2019

tcurdt commented Apr 19, 2019

TomAnthony commented Apr 19, 2019 • edited

tcurdt commented Apr 19, 2019

TomAnthony commented Apr 19, 2019

tcurdt commented Apr 19, 2019

TomAnthony commented Apr 20, 2019

tcurdt commented Apr 20, 2019

TomAnthony commented Apr 20, 2019

tcurdt commented Apr 20, 2019

TomAnthony commented Apr 20, 2019

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

julieqiu commented Mar 12, 2019 •

edited

TomAnthony commented Apr 19, 2019 •

edited