x/net/html: Tokenizer could track Line/Column of token #31312

erinpentecost · 2019-04-06T23:11:58Z

What version of Go are you using (`go version`)?

$ go version
go version go1.12.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build083125343=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I wanted to get the line and column for the current Token, which forced me to fork the package. The key change is adding something like the following to the end of readByte() in token.go:

	// Increment the line and column tracker
	if x == '\n' {
		z.currentLine++
		z.currentColumn = 0
	} else {
		z.currentColumn++
	}

What did you expect to see?

I'd like to see a public method on Tokenizer that returns the starting and ending line/column of the current Token. The method could return a new struct with these four values.

What did you see instead?

There isn't a way to figure out where the token is in the input aside from its byte offset. I could feed that byte offset into user code to determine the line/column, but then I'd have to parse the input twice, the first time just to build that lookup table.


bradleypeabody · 2019-09-09T15:49:02Z

The golang.org/x/net/html package already has an option mechanism, so an option could be added to track the position. @bcmills @bradfitz How about this as a design:

- A `ParseOptionPosition() ParseOption` function is added.
- A struct `Position` is added, with three fields: `Line, Column, Offset int` (Line and Column are numbered from 1 in the usual convention; Offset is a 0-based byte offset into the original input).
- A `Position Position` field is added to both Token and Node.
- When ParseOptionPosition is used with ParseWithOptions or ParseFragmentWithOptions, it populates the Position field on Token and Node as applicable. The Position field is otherwise left at its zero value if the option is not used.
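To make the numbering convention concrete, here is a minimal standalone sketch of the proposed `Position` struct together with the per-byte bookkeeping from the original post. None of this exists in golang.org/x/net/html; `tracker`, `newTracker`, `advance` and `positionAt` are hypothetical names used only for illustration, and columns are counted in bytes, not runes.

```go
package main

import "fmt"

// Position follows the proposal: Line and Column are 1-based,
// Offset is a 0-based byte offset into the original input.
type Position struct {
	Line, Column, Offset int
}

// tracker is a hypothetical helper, not part of golang.org/x/net/html.
// It holds the position of the next byte to be read.
type tracker struct {
	pos Position
}

func newTracker() *tracker {
	return &tracker{pos: Position{Line: 1, Column: 1}}
}

// advance consumes one byte and moves the position past it.
func (t *tracker) advance(b byte) {
	t.pos.Offset++
	if b == '\n' {
		t.pos.Line++
		t.pos.Column = 1
	} else {
		t.pos.Column++ // counts bytes, so multi-byte runes span several columns
	}
}

// positionAt returns the position of the byte at the given offset in src.
func positionAt(src string, offset int) Position {
	t := newTracker()
	for i := 0; i < offset && i < len(src); i++ {
		t.advance(src[i])
	}
	return t.pos
}

func main() {
	src := "<html>\n  <p>hi</p>\n</html>"
	// Offset 9 is the '<' of "<p>" on the second line.
	fmt.Printf("%+v\n", positionAt(src, 9)) // {Line:2 Column:3 Offset:9}
}
```

A tokenizer that carried such a tracker could snapshot `pos` at the start of each token, which would cover the start position without a second pass over the input.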

I also have another option I'd like to add, which would disable the lowercasing of element and attribute names. This could be added using a ParseOptionPreserveCase() function; no other type changes would be needed.

These changes seem fairly low impact. They would not break existing users of the package, and if not used would add only a few bytes to the Token and Node structs. (If space is a concern, the field could be defined as Position *Position to be smaller in memory when it's not used.) We'll need to measure, but the performance impact should be fairly minimal when these features are enabled, and should be no different from existing code when they are not.

I will probably end up putting together a prototype of this, as I need the case preservation option (and the line numbers would definitely be nice) for some functionality being added to github.com/vugu/vugu. Any feedback on the approach would be greatly appreciated, so that we improve the chances of getting the changes merged back in at some point.

nigeltao · 2019-09-19T04:46:53Z

> I wanted to get the line and column for the current Token, which forced me to fork the package.

I'm not sure if a fork is necessary. The Tokenizer takes an io.Reader, and that io.Reader can build the mapping from byte offset to line:col numbers. If I understand #34302 correctly, its LineCounter does exactly that.

gopherbot added this to the Unreleased milestone Apr 6, 2019

erinpentecost mentioned this issue Apr 6, 2019

vugufmt - make formatter for .vugu files vugu/vugu#9

Closed

bcmills added FeatureRequest NeedsInvestigation labels Apr 10, 2019

bradleypeabody mentioned this issue Jul 29, 2019

Make the code generator emit proper line numbers and positions from .vugu files vugu/vugu#38

Open

bradleypeabody mentioned this issue Aug 19, 2019

Goodbye htmlx vugu/vugu#60

Closed

bradleypeabody mentioned this issue Sep 14, 2019

proposal: x/net/html: add offset & line tracking, case preservation #34302

Open
