x/net/html: html.Parse chomps too much leading newline in <pre> #27807

Closed
alin04 opened this issue Sep 21, 2018 · 1 comment

alin04 commented Sep 21, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version go1.11rc2 linux/amd64

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"

What did you do?

https://play.golang.org/p/_79yZsnUz4j

package main

import (
	"fmt"
	"golang.org/x/net/html"
	"strings"
)

func parseAndPrint(s string) {
	root, _ := html.Parse(strings.NewReader(s))
	var out strings.Builder
	html.Render(&out, root)
	fmt.Printf("%q", out.String())
}

func main() {
	// Only this one produces unexpected output.
	parseAndPrint("<pre>&#13;&#10;crlf</pre>")
	fmt.Println()

	// The remaining cases show that only the first newline character, either CR or LF, is stripped.
	parseAndPrint("<pre>&#10;&#13;lfcr</pre>")
	fmt.Println()
	parseAndPrint("<pre>&#13;&#13;crcr</pre>")
	fmt.Println()
	parseAndPrint("<pre>&#10;&#10;lflf</pre>")
	fmt.Println()
}

What did you expect to see?

The spec says that a leading newline character immediately following the <pre> element start tag is stripped. https://html.spec.whatwg.org/#the-pre-element

Don't even get me started on how this makes parse and print non-idempotent.

But I would expect only the FIRST newline to be stripped.

<html><head></head><body><pre>crlf</pre></body></html>
<html><head></head><body><pre>&#13;lfcr</pre></body></html>
<html><head></head><body><pre>&#13;crcr</pre></body></html>
<html><head></head><body><pre>\n\nlflf</pre></body></html>

What did you see instead?

<html><head></head><body><pre>\ncrlf</pre></body></html>
<html><head></head><body><pre>&#13;lfcr</pre></body></html>
<html><head></head><body><pre>&#13;crcr</pre></body></html>
<html><head></head><body><pre>\n\nlflf</pre></body></html>

This bit of code strips the exact sequence of CR followed immediately by LF, so it's effectively stripping two instead of one.

Happy to submit a patch. I think the correct logic should be:

// Ignore a newline at the start of a <pre> block.
if d != "" && (d[0] == '\r' || d[0] == '\n') {
	d = d[1:]
}
@gopherbot gopherbot added this to the Unreleased milestone Sep 21, 2018

alin04 commented Sep 21, 2018

Never mind. The spec says that CR followed by LF is treated as a single newline.

https://html.spec.whatwg.org/#syntax-newlines
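
To make the difference concrete, here is a minimal standalone sketch that contrasts the two readings on the four inputs from the report. Both function names are hypothetical and neither is part of x/net/html: stripOneChar follows the patch proposed above, while stripLeadingNewline treats CR, LF, and the CR LF pair each as one newline, as the spec section linked above describes. It works on plain strings only and does not reproduce the parser or renderer.

```go
package main

import (
	"fmt"
	"strings"
)

// stripOneChar mirrors the patch proposed in the report (hypothetical name):
// it drops exactly one leading CR or LF character.
func stripOneChar(d string) string {
	if d != "" && (d[0] == '\r' || d[0] == '\n') {
		return d[1:]
	}
	return d
}

// stripLeadingNewline (hypothetical name) drops one leading newline, where a
// newline is CR LF, a lone CR, or a lone LF. CR LF is checked first so the
// pair is removed as a single unit.
func stripLeadingNewline(d string) string {
	if strings.HasPrefix(d, "\r\n") {
		return d[2:]
	}
	return stripOneChar(d)
}

func main() {
	inputs := []string{"\r\ncrlf", "\n\rlfcr", "\r\rcrcr", "\n\nlflf"}
	for _, s := range inputs {
		fmt.Printf("%q: one char -> %q, one newline -> %q\n",
			s, stripOneChar(s), stripLeadingNewline(s))
	}
}
```

Only the first input differs between the two: the one-character version leaves "\ncrlf", while the one-newline version leaves "crlf". LF followed by CR is not a pair, so only the LF is removed in both versions.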

Maybe this bug should be renamed to a request for a mode in which parse and print are idempotent.

@alin04 alin04 closed this as completed Sep 21, 2018
@golang golang locked and limited conversation to collaborators Sep 21, 2019