Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html: Parse handles custom self-closing tags differently to all major browsers #30961

Closed
TomAnthony opened this issue Mar 20, 2019 · 7 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@TomAnthony
Copy link

What version of Go are you using (go version)?

> go version
go version go1.12.1 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
> go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/tom/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/tom/projects/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.12.1/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.12.1/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/hm/n8fgn9bx61z5129js8wft_1c0000gn/T/go-build225010318=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I ran this program:

https://play.golang.org/p/JfyssBdfzFH

Which parses this snippet of html then re-stringifies it:

<!doctype html>
<html>
	<head></head>
	<body>
		<tag1>
			<tag2 />
			<p>one</p>
		</tag1>
		<div>two</div>
	</body>
</html>

What did you expect to see?

I'd expect the custom tag2 tag to wrap the p tag (as a custom tag cannot be self-closing according to the spec - this is discussed in issue #26566), but everything at a higher level in the tree (including tag1) to remain in their respective locations:

<!doctype html>
<html>
	<head></head>
	<body>
		<tag1>
			<tag2>
				<p>one</p>
			</tag2>
		</tag1>
		<div>two</div>
	</body>
</html>

What did you see instead?

The p tag is wrapped as we would expect, but the div tag that was a sibling of tag1 is now a child of tag1:

<!DOCTYPE html>
<html>
	<head></head>
	<body>
		<tag1>
			<tag2>
				<p>one</p>
			</tag2>
			<div>two</div>
		</tag1>
	</body>
</html>

This doesn't make sense to me - the self-closing tag2 is in violation of the spec, but has been turned into a properly closed tag following sensible assumptions about what should become children of it (all its siblings). It makes no sense that after being closed there are further alterations to subsequent parts of the tree.

The Go output is also in disagreement to how all the major browsers handle the same HTML code:

Chrome:

chrome

Firefox:

firefox

Safari:

safari

Go script output:

golang

@gopherbot gopherbot added this to the Unreleased milestone Mar 20, 2019
@ALTree ALTree added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 21, 2019
@TomAnthony
Copy link
Author

TomAnthony commented Mar 21, 2019

An additional observation.

If you swap the illegal tag2 for an illegal self closing div instead:

<!doctype html>
<html>
	<head></head>
	<body>
		<tag1>
			<div />
			<p>one</p>
		</tag1>
		<div>two</div>
	</body>
</html>

Then Chrome/Safari/Firefox all exhibit a similar behaviour to Go (though note the 'two' div is now tag1's grandchild rather than child like above), in that they move the <div>two</div> inside the tag1 (which Go does also with the <div /> example):

chrome-div

I don't know the spec well enough to understand why that would be the case and why all the browsers agree with one another on these two use cases, but Go disagrees on one.

@TomAnthony
Copy link
Author

One final example, with the original example but now with the illegal tag1 swapped for a legal div, wrapping the illegal self-closing tag2:

<!doctype html>
<html>
	<head></head>
	<body>
		<div>
			<tag2 />
			<p>one</p>
		</div>
		<div>two</div>
	</body>
</html>

In this instance Go parses it correctly:

<!doctype html>
<html>
	<head></head>
	<body>
		<div>
			<tag2>
				<p>one</p>
			</tag2>
		</div>
		<div>two</div>
	</body>
</html>

So the erroneous handling seems to come about only when there is an non-standard tag wrapping a non-standard self-closing tag.

TomAnthony added a commit to TomAnthony/net that referenced this issue Mar 21, 2019
The existing implementation behaves differently to all major browsers, for the instance where a self-closing element of an unknown tag type is the child of another element of an unknown tag type. Unclear if the spec is specific here. Issue golang/go#30961 has several scenarios, of which Go diverges in behaviour on one (added as a test in this commit).

Fixes golang/go#30961
@gopherbot
Copy link

Change https://golang.org/cl/168638 mentions this issue: html: normalise parsing of nested tags of an unknown type with browsers

@mikioh mikioh changed the title x/net: html.Parse handles custom self-closing tags differently to all major browsers x/net/html: Parse handles custom self-closing tags differently to all major browsers Mar 21, 2019
@TomAnthony
Copy link
Author

I've submitted a PR which fixes this issue whilst passing all the existing go and webkit unit tests. From my examples above it seems the popUntil function, when run on unknown tags, should treat other unknown tags as a stop tags.

However, I don't have a detailed understanding of the spec, so this likely needs some tweaks / overhauling.

I'll keep reading the spec and update this if I get a clearer understanding.

@TomAnthony
Copy link
Author

After some more reading, I believe I have pinpointed this better.

I believe the issue is the inBodyEndTagOther where it checks if the atom types match. For unknown elements this will be 0, meaning nested unknown elements will falsely match against each other (which aligns with the manifestations we see above):

func (p *parser) inBodyEndTagOther(tagAtom a.Atom) {
	for i := len(p.oe) - 1; i >= 0; i-- {
		if p.oe[i].DataAtom == tagAtom {
			p.oe = p.oe[:i]
			break
		}
		if isSpecialElement(p.oe[i]) {
			break
		}
	}
}

I need to have a think about how best to update my PR.

@nigeltao
Copy link
Contributor

@TomAnthony
Copy link
Author

I've responded in Gerrit and updated the PR. Thanks! :)

TomAnthony added a commit to TomAnthony/net that referenced this issue Apr 16, 2019
The existing implementation behaves differently to all major browsers, for the instance where a self-closing element of an unknown tag type is the child of another element of an unknown tag type. Unclear if the spec is specific here. Issue golang/go#30961 has several scenarios, of which Go diverges in behaviour on one (added as a test in this commit).

Fixes golang/go#30961
@golang golang locked and limited conversation to collaborators Mar 23, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants