x/net/html: do not parse the blank line and line break as a TextNode #37466

Open
youdiandai opened this issue Feb 26, 2020 · 2 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@youdiandai

What version of Go are you using (go version)?

$ go version
go version go1.13.7 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/changxin/Library/Caches/go-build"
GOENV="/Users/changxin/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/changxin/go"
GOPRIVATE=""
GOPROXY="https://goproxy.io,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/rl/_ybfhjr975s9ldh7zjlqnlfc0000gn/T/go-build286633492=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I tried to parse the golang.org page with the golang.org/x/net/html package and print the contents of every TextNode. The output contained many blank lines. After some testing, I found that every blank line and line break in the source is parsed as a TextNode. That whitespace exists only to format the markup; it is not content the page needs to display, and I could not find a proper way to distinguish it from the real content.
My program:
`package main

import (
	"fmt"
	"os"

	"golang.org/x/net/html"
)

func main() {
	doc, err := html.Parse(os.Stdin)
	if err != nil {
		fmt.Fprintf(os.Stderr, "content print: %v\n", err)
		os.Exit(1) // doc is not usable after a failed parse
	}
	contentPrint(doc)
}

// contentPrint walks the tree and, for every element other than <script>
// and <style>, prints the element's first child if that child is a text node.
func contentPrint(n *html.Node) {
	if n.Type == html.ElementNode && n.Data != "script" && n.Data != "style" {
		if n.FirstChild != nil && n.FirstChild.Type == html.TextNode {
			fmt.Printf("content:%s\n", n.FirstChild.Data)
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		contentPrint(c)
	}
}`

What did you expect to see?

`content:The Go Programming Language
content:Documents
content:Packages
content:The Project
content:Help
content:Blog
content:Play
content:Search
content:simple
content:reliable
content:efficient
content:Try Go
content:Open in Playground
content:// You can edit this code!
// Click here and start typing.
package main

import "fmt"

func main() {
fmt.Println("Hello, 世界")
}

content:Hello, World!
content:Conway's Game of Life
content:Fibonacci Closure
content:Peano Integers
content:Concurrent pi
content:Concurrent Prime Sieve
content:Peg Solitaire Solver
content:Tree Comparison
content:Run
content:Share
content:Tour
content:Featured articles
content:Go 1.14 is released
content:download page
content:Published 25 February 2020
content:Next steps for pkg.go.dev
content:go.dev
content:Published 31 January 2020
content:Read more >
content:Featured video
content:Copyright
content:Terms of Service
content:Privacy Policy
content:Report a website issue
content:Supported by Google`

What did you see instead?

`~/go/src/go_bible/ch5/5.3  go build
~/go/src/go_bible/ch5/5.3  ./5.3 <golang_org.htm
content:The Go Programming Language
content:

content:

content:

content:

content:

content:Documents
content:Packages
content:The Project
content:Help
content:Blog
content:Play
content:

content:

content:

content:Search
content:

content:

content:

content:

content:
Go is an open source programming language that makes it easy to build

content:simple
content:reliable
content:efficient
content:

content:
Binary distributions available for
content:

content:

content:Try Go
content:Open in Playground
content:

content:// You can edit this code!
// Click here and start typing.
package main

import "fmt"

func main() {
fmt.Println("Hello, 世界")
}

content:

content:

content:Hello, World!
content:Conway's Game of Life
content:Fibonacci Closure
content:Peano Integers
content:Concurrent pi
content:Concurrent Prime Sieve
content:Peg Solitaire Solver
content:Tree Comparison
content:

content:Run
content:

content:Share
content:Tour
content:

content:Featured articles
content:Go 1.14 is released
content:Today the Go team is very happy to announce the release of Go 1.14. You can get it from the
content:download page
content:Published 25 February 2020
content:Next steps for pkg.go.dev
content:In 2019, we launched
content:go.dev
content:Published 31 January 2020
content:Read more >
content:

content:Featured video
content:

content:

content:

content:

content:Copyright
content:Terms of Service
content:Privacy Policy
content:Report a website issue
content:Supported by Google`

@youdiandai changed the title from "The net/html package's Parse also parses blank lines and line breaks as TextNode, with no way to tell them apart" to "package net/html Parse the blank line and line break as TextNode" on Feb 26, 2020
@cagedmantis changed the title from "package net/html Parse the blank line and line break as TextNode" to "x/net/html: do not parse the blank line and line break as a TextNode" on Feb 28, 2020
@gopherbot added this to the Unreleased milestone on Feb 28, 2020
@cagedmantis added the NeedsInvestigation label (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.) on Feb 28, 2020
@cagedmantis
Contributor

/cc @mikioh

@nonzerofloat

Use strings.TrimSpace. A text node that contains only whitespace trims to the empty string "", so you can tell a blank line from a content line by comparing the trimmed text with "".

if node.Type == html.TextNode {
	text := strings.TrimSpace(node.Data)
	if text != "" {
		fmt.Println(text)
	}
}
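
A self-contained variation on the snippet above, as a minimal sketch: it works on plain strings that stand in for the `Data` field of successive text nodes, so it runs without `golang.org/x/net/html`. The helper name `visibleText` is hypothetical. Instead of `strings.TrimSpace` it uses `strings.Fields`, which additionally collapses interior runs of whitespace; that is a design choice of this sketch, not something the comment above prescribes.

```go
package main

import (
	"fmt"
	"strings"
)

// visibleText reports whether data holds anything other than whitespace and,
// if so, returns it with each run of whitespace collapsed to a single space.
// Hypothetical helper; not part of golang.org/x/net/html.
func visibleText(data string) (string, bool) {
	fields := strings.Fields(data) // splits on any Unicode whitespace
	if len(fields) == 0 {
		return "", false
	}
	return strings.Join(fields, " "), true
}

func main() {
	// Sample values standing in for n.Data of successive html.TextNode nodes.
	samples := []string{
		"The Go Programming Language",
		"\n\n",
		"\nGo is an open source programming language that makes it easy to build\n",
		"\n    ",
	}
	for _, s := range samples {
		if text, ok := visibleText(s); ok {
			fmt.Printf("content:%s\n", text)
		}
	}
}
```

In the original program this check would wrap the `fmt.Printf` call in `contentPrint`. Collapsing whitespace is probably not wanted for text inside `<pre>` or `<textarea>`, where line breaks are significant (the playground code sample in the output above is one such case); plain `strings.TrimSpace` is the safer choice there.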
