Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html: unexpected whitespace rendering of html #30772

Open
tcurdt opened this issue Mar 12, 2019 · 17 comments
Open

x/net/html: unexpected whitespace rendering of html #30772

tcurdt opened this issue Mar 12, 2019 · 17 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@tcurdt
Copy link

tcurdt commented Mar 12, 2019

What version of Go are you using (go version)?

$ go version
go version go1.12 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/tcurdt/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/tcurdt/.go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.12/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.12/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/pf/7vhqx5bn41qddypw08w9jc4w0000gn/T/go-build451640899=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I am parsing and then rendering html.

package main

import (
	"bytes"
	"fmt"
	"os"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	var input = `<!DOCTYPE html>
<html>
<head>
  <title>Title of the document</title>
</head>
<body>
   body content <p>more content</p>
</body>
</html>`
	doc, err := html.Parse(strings.NewReader(input))
	if err != nil {
		fmt.Fprintf(os.Stderr, "error parsing: %s\n", err.Error())
		os.Exit(1)
	}

	buf := bytes.NewBufferString("")
	html.Render(buf, doc)

	fmt.Println("--")
	fmt.Print(input)
	fmt.Println("--")

	fmt.Println("--")
	fmt.Print(buf.String())
	fmt.Println("--")
}

What did you expect to see?

With the docs saying:

Rendering is done on a 'best effort' basis: calling Parse on the output of Render will always result in something similar to the original tree, but it is not necessarily an exact clone unless the original tree was 'well-formed'.

Given that the HTML is well-formed I'd expect the output be the same as the input.

What did you see instead?

Instead I am seeing changes in whitespace:

--
<!DOCTYPE html>
<html>
<head>
  <title>Title of the document</title>
</head>
<body>
   body content <p>more content</p>
</body>
</html>--
--
<!DOCTYPE html><html><head>
  <title>Title of the document</title>
</head>
<body>
   body content <p>more content</p>

</body></html>--

IMO there should be a test case verifying that the output matches in input for the documented case.

@julieqiu
Copy link
Member

julieqiu commented Mar 12, 2019

cc: @bradfitz

@julieqiu julieqiu added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 12, 2019
@julieqiu julieqiu changed the title unexpected whitespace rendering of html html/template: unexpected whitespace rendering of html Mar 12, 2019
@julieqiu julieqiu changed the title html/template: unexpected whitespace rendering of html x/net/template: unexpected whitespace rendering of html Mar 12, 2019
@gopherbot gopherbot added this to the Unreleased milestone Mar 12, 2019
@empijei
Copy link
Contributor

empijei commented Mar 12, 2019

Nit: I think this is related to x/net/html, not x/net/template.

@mikioh mikioh changed the title x/net/template: unexpected whitespace rendering of html x/net/html/template: unexpected whitespace rendering of html Mar 13, 2019
@robpike robpike changed the title x/net/html/template: unexpected whitespace rendering of html x/net/html: unexpected whitespace rendering of html Mar 13, 2019
@TomAnthony
Copy link

A minimal example input:

<html><head></head><body></body>
</html>

Outputs in go:

<html><head></head><body>



</body></html>

Chrome gives identical output:

<html><head></head><body>



</body></html>

As does Firefox:

<html><head></head><body>



</body></html>

@TomAnthony
Copy link

Things more interesting with comment blocks, where go, Chrome and Firefox all disagree...

Input:

<html><head></head><body>
	<!--a-->

		<!--b-->
			
</body>
</html>

<!--c-->

Outputs in go, changes whitespace but keeps final comment in same relative place:

<html><head></head><body>
	<!--a-->

		<!--b-->





</body></html><!--c-->

Chrome moves the final comment inside the body:

<head></head><body>
	<!--a-->

		<!--b-->
			



<!--c--></body></html>

Firefox does same as Go with slightly different spacing:

<html><head></head><body>
	<!--a-->

		<!--b-->
			



</body></html>
<!--c-->

@TomAnthony
Copy link

According to Section 12.2.6.4.22, I believe Chrome is at fault with regards to the final comment positioning. However, I think that is a separate issue to the original reported issue here.

@tcurdt
Copy link
Author

tcurdt commented Apr 18, 2019

There are a couple of facets to this IMO:

  1. golang.org/x/net/html does not always allow for a parsing-serialization round trip. This is sad but documented
  2. The documentation suggests that a round trip is possible as long as the document is well-formed. This poses an expectation that is not met (as shown in the code example). Hence I opened this issue.
  3. The result of a browser parse-serialization round trip is inconsistent across implementations and possibly non-conformant to the standard. IMO this should not be really relevant to golang.org/x/net/html though. In the end those browsers are just markup consumers. While defined in the standard how they parse and build the DOM should not be relevant to the go package. Plus we would have to compare the actual DOMs not the serialization of it. In the end the difference could be the serialization code instead.

So what to do?

Given that browser implementations are out of scope to fix, it seems sane to just ignore them. The easiest "solution" would be to just change the docs to reflect 2).

Given that I am rather an advocate of a proper parsing-serialization round trip I'd love to see at least a test case based on the above example.

But in the end it comes down to what golang.org/x/net/html wants to be and whether it wants to fulfill the contract given in the current documentation.

@TomAnthony
Copy link

I agree with you @tcurdt that the browser implementations shouldn't necessarily impact the approach used in golang.org/x/net/html, which should be based on the spec.

However, the spec is very complex and the browser implementations in Chrome & Firefox serve as two well-maintained implementations which can serve as a cross reference. In the original report above, the Chrome & Firefox (and also html5lib) all agree with golang.org/x/net/html.

The documentations mentions Parse on the output of Render will always result in something similar to the original tree which is Render -> Parse, but the report does Parse -> Render. If the resultant snippet above is fed back into Parse (and then Render again) the results don't change.

Are we confident there is a mistake in the implementation of the spec?

@tcurdt
Copy link
Author

tcurdt commented Apr 19, 2019

@TomAnthony but you left out the most relevant part for this issue when quoting

unless the original tree was 'well-formed'.

which gives this a completely different spin.

TBH I don't have enough knowledge of the spec to make any educated comments - but I do think it's surprising that there are whitespace changes happening on a Parse -> Render operation.

In my use case I wanted to parse a HTML file, change something and then write it back. Especially given that humans would not use whitespace as rendered this is a problem for that task at hand.

Ideally the round trip would not change a thing. As a second choice I would use my own renderer of the DOM tree and so be in control of the rendered output - but I didn't see an obvious way to do this yet.

@TomAnthony
Copy link

TomAnthony commented Apr 19, 2019

@tcurdt - yeah, I definitely see your point and I agree it seems surprising. However, I am unsure exactly how to define 'well-formed'...

The spec states The DOM will not let Document nodes have Text node children, so they are dropped on the floor., which indicates that text nodes are not allowed between </body> and </html>.

However, rather than being 'dropped on the floor' the white space between your </body> and </html> is being moved inside the </body> tag. The fact that 4 different implementations exhibit this same behaviour makes me think there is part of the spec we are still not understanding.

Either way, I think the white space between </body> and </html> is always going to be removed because the parsing phase will remove it according to the spec. For that reason I also think your own renderer wouldn't help (the parsing step will already have removed/moved that text node).

I think the behaviour that is interrupting your process is entirely prompted by whitespace outside of the <body>; whitespace inside the <body> shouldn't be moved. Depending on what specifically you are trying to do, you could perhaps snip the <body> our of your input file and parse it as a fragment. Then snip the resulting <body> out of the serialised output and use that to replace the <body> contents from your input file?

@tcurdt
Copy link
Author

tcurdt commented Apr 19, 2019

If there isn't really a good path in code to solve this I think the "well-formed" part should be removed from the documentation - especially if there is a uncertainty of the definition. That would create much less confusion given the start of the sentence.

As for the actual behaviour it kind of means there are two representations of the DOM. The input as usually given by humans and the rendered version that conforms to the spec. So unless a non-standard DOM is allowed to exist there is no way to make the roundtrip work without a hitch.

It is indeed mostly the white space outside the body element that is giving me grief. Working around this by dealing with fragments and then combining the strings feels like a hack and quite fragile.

I think there is a place for both the spec compliant and the non-spec compliant representation. But that's just not supported. I guess I either need to make my peace with the change of whitespace or write my own parser/dom. A bit of a shame.

Since I doubt the golang.org/x/net/html is willing to allow a change for a non-compliant DOM a change of documentation might be the right path forward.

@TomAnthony
Copy link

I'm just a 'visitor' here - I recently had an issue in parse.go and thought I may be able to help with your issue 🙂. Someone else can may offer a more 'official' response to what is considered well-formed, and as to whether golang.org/x/net/html might be change to allow a non-compliant DOM (I highly doubt it, and am unsure how this should even work).

I disagree that the docs should be updated (or at least the segment you referenced) as that talks about the rendering step, but your issue occurs in the parsing step. The output of render produces an identical tree if fed back into parse.

I actually think if you want to maintain "ill-formed" whitespace outside of the body, and your changes are all in the body, that the fragment approach would be an ok approach. I think trying to create your own parser/renderer would be a recipe for pain!

Sorry I cannot be more help! 😞

@tcurdt
Copy link
Author

tcurdt commented Apr 19, 2019

Well, the "how this should even work" is probably the least of the problems. Without even looking at the implementation I am pretty sure that at some part of the way of building the DOM some nodes are left out are fixed for compliance reasons. A simple KISS parser would just build the "illegal" DOM as is.

You have a point on the rendering step though.

calling Parse on the output of Render will always result in something similar to the original tree

it does not make any claims on parsing the original input but just the rendering of the input.
But then again I have missed the part in the documentation where it talks about the input of a non-compliant tree getting transformed into a compliant tree.

At hindsight looking at the documentation of Parse

It implements the HTML5 parsing algorithm, which is very complicated. The resultant tree can contain implicitly created nodes that have no explicit listed in r's data, and nodes' parents can differ from the nesting implied by a naive processing of start and end s. Conversely, explicit s in r's data can be silently dropped, with no corresponding node in the resulting tree.

From this it should be clear that a roundtrip cannot be ensured.
I just had the expectation it would.

Writing a parser that just builds the given html into a (non-compliant) DOM really is no rocket science. I assume one could even use the existing tokenizer. As hinted above it will be much simpler than what the current implementation does. I just hoped to avoid it.

@TomAnthony
Copy link

The problem with creating a "non-compliant DOM" is the DOM is designed to be a tree structure. Some badly formed HTML cannot form a tree:

<form><div></form></div>

This cannot be turned into a tree and re-serialised as it is. The HTML 5 spec has algorithms designed to handle this sort of case and re-nest things correctly. There are many other such cases.

However assuming your input is guaranteed to be well-formed (in terms of tag nesting, rather than what tags/tokens are allowed where), then you may find an existing parser that works (maybe lxml/BeautifulSoup?).

Good luck! 🙂

@tcurdt
Copy link
Author

tcurdt commented Apr 20, 2019

Yes, one has to draw a line somewhere. It's gotta be a tree :)

@TomAnthony
Copy link

@tcurdt lxml may work for you then; this maintains the whitespace between </body> and </html> (but ditches the whitespace after </html>):

import lxml.html

foo = """<!DOCTYPE html>
<html>
<head>
  <title>Title of the document</title>
</head>
<body>
   body content <p>more content</p>
</body>

</html>

	"""

bar = lxml.html.fromstring(foo)

print(lxml.html.tostring(bar))

@tcurdt
Copy link
Author

tcurdt commented Apr 20, 2019

...but that's not in go :)

@TomAnthony
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants