x/net/html: html.Parse chomps too much leading newline in <pre> #27807

alin04 · 2018-09-21T22:29:43Z

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (`go version`)?

go version go1.11rc2 linux/amd64

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (`go env`)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"

What did you do?

https://play.golang.org/p/_79yZsnUz4j

package main

import (
	"fmt"
	"golang.org/x/net/html"
	"strings"
)

func parseAndPrint(s string) {
	root, _ := html.Parse(strings.NewReader(s))
	var out strings.Builder
	html.Render(&out, root)
	fmt.Printf("%q", out.String())
}

func main() {
	// Only this one produces unexpected output.
	parseAndPrint("<pre>&#13;&#10;crlf</pre>")
	fmt.Println()

	// The remaining cases show that only the first newline character (CR or LF) is stripped.
	parseAndPrint("<pre>&#10;&#13;lfcr</pre>")
	fmt.Println()
	parseAndPrint("<pre>&#13;&#13;crcr</pre>")
	fmt.Println()
	parseAndPrint("<pre>&#10;&#10;lflf</pre>")
	fmt.Println()
}

What did you expect to see?

The spec says that a leading newline character immediately following the <pre> element start tag is stripped. https://html.spec.whatwg.org/#the-pre-element

Don't even get me started on how this makes parse and print non-idempotent.

But I would expect only the FIRST newline to be stripped.

<html><head></head><body><pre>crlf</pre></body></html>
<html><head></head><body><pre>&#13;lfcr</pre></body></html>
<html><head></head><body><pre>&#13;crcr</pre></body></html>
<html><head></head><body><pre>\n\nlflf</pre></body></html>

What did you see instead?

<html><head></head><body><pre>\ncrlf</pre></body></html>
<html><head></head><body><pre>&#13;lfcr</pre></body></html>
<html><head></head><body><pre>&#13;crcr</pre></body></html>
<html><head></head><body><pre>\n\nlflf</pre></body></html>

The parser code strips the exact sequence of CR followed immediately by LF, so it's effectively stripping two characters instead of one.

Happy to submit a patch. I think the correct logic should be:

// Ignore a newline at the start of a <pre> block.
if d != "" && (d[0] == '\r' || d[0] == '\n') {
	d = d[1:]
}


alin04 · 2018-09-21T22:37:39Z

Never mind. The spec says that CR followed by LF is treated as a single newline.

https://html.spec.whatwg.org/#syntax-newlines
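
For illustration, here is a minimal standalone sketch of that reading of the spec, where a newline is CR, LF, or the pair CR LF counted as one. The helper name stripLeadingNewline is hypothetical and this is not the x/net/html source; it only models the stripping step on a plain string.

```go
package main

import "fmt"

// stripLeadingNewline is a hypothetical helper, not part of x/net/html.
// It removes at most one leading newline from d, where a newline is
// CR, LF, or the pair CR LF (counted as a single newline).
func stripLeadingNewline(d string) string {
	if d != "" && d[0] == '\r' {
		d = d[1:]
		// A LF directly after the CR belongs to the same newline.
		if d != "" && d[0] == '\n' {
			d = d[1:]
		}
		return d
	}
	if d != "" && d[0] == '\n' {
		d = d[1:]
	}
	return d
}

func main() {
	for _, s := range []string{"\r\ncrlf", "\n\rlfcr", "\r\rcrcr", "\n\nlflf"} {
		fmt.Printf("%q -> %q\n", s, stripLeadingNewline(s))
	}
}
```

Under this reading, only the CR LF input loses two characters; the other three inputs lose exactly one.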

Maybe this bug should be renamed to "please support a mode in which parse and print are idempotent".
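
To make the idempotence concern concrete, here is a toy model using only the standard library. It is a sketch under stated assumptions, not how x/net/html works internally: parsePre and renderPre are hypothetical stand-ins, where parsing drops one leading LF after the <pre> start tag and rendering writes the text back verbatim.

```go
package main

import (
	"fmt"
	"strings"
)

// parsePre is a toy stand-in for parsing (hypothetical, not x/net/html).
// It returns the text inside "<pre>...</pre>", dropping one leading LF.
func parsePre(src string) string {
	body := strings.TrimSuffix(strings.TrimPrefix(src, "<pre>"), "</pre>")
	return strings.TrimPrefix(body, "\n")
}

// renderPre is a toy stand-in for rendering: it writes the text verbatim.
func renderPre(text string) string {
	return "<pre>" + text + "</pre>"
}

// roundTrip parses and then renders, like parseAndPrint in the report.
func roundTrip(src string) string {
	return renderPre(parsePre(src))
}

func main() {
	once := roundTrip("<pre>\n\nlflf</pre>")
	twice := roundTrip(once)
	// In this model each round trip removes another leading newline,
	// so the second pass does not reproduce the first.
	fmt.Printf("%q\n%q\n", once, twice)
}
```

In this model the output of one round trip is not a fixed point of the next, which is the non-idempotence being described.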

gopherbot added this to the Unreleased milestone Sep 21, 2018

alin04 closed this as completed Sep 21, 2018

golang locked and limited conversation to collaborators Sep 21, 2019

gopherbot added the FrozenDueToAge label Sep 21, 2019
