Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/html: Tokenizer could track Line/Column of token #31312

Open
erinpentecost opened this issue Apr 6, 2019 · 2 comments
Open

x/net/html: Tokenizer could track Line/Column of token #31312

erinpentecost opened this issue Apr 6, 2019 · 2 comments
Labels
FeatureRequest NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@erinpentecost
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.12.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build083125343=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I wanted to get the line and column for the current Token, which forced me to fork the package. The key change is adding something like the following to the end of readByte() in token.go:

	// Increment the line and column tracker
	if x == '\n' {
		z.currentLine++
		z.currentColumn = 0
	} else {
		z.currentColumn++
	}

What did you expect to see?

I'd like to see a public method on Tokenizer that returns the starting and ending line/column of the current Token. The method could return a new struct with these four values.

What did you see instead?

There isn't a way to figure out where the token is in the input aside from byte offset. I could feed that byte offset into user code to determine where the line/column is, but then I'd have to parse the input twice and build up that lookup table first.

@bradleypeabody
Copy link

The golang.org/x/net/html package already has an option mechanism, an option could be added to track the position. @bcmills @bradfitz How about this as a design:

  • A ParseOptionPosition() ParseOption function is added.
  • A struct Position is added, with three fields: Line, Column, Offset int (Line and Column are numbered from 1 in the usual convention, Offset is a 0-based byte offset into the original input)
  • A Position Position field is added to both Token and Node.
  • When ParseOptionPosition is used with ParseWithOptions or ParseFragmentWithOptions it populates the Position field on Token and Node as applicable. The Position field is otherwise left blank if option not used.

I also have an another option I'd like to add which would be to disable the lower casing of element and attribute names. This could be added using a ParseOptionPreserveCase() function, no other type changes would be needed.

These changes seem fairly low impact. They would not break existing users of the package. And if not used would only add a relatively few bytes to the Token and Node structs. (And if space is a concern, the field could be defined as Position *Position to be smaller in memory for the case where it's not used) We'll need to see but performance impact should be fairly minimal when these features are enabled, and should be no different from existing code if not used.

I will probably end up putting together a prototype of this as I need this case preservation option (and the line numbers would definitely be nice) for some functionality being added to github.com/vugu/vugu - any feedback on the approach would be greatly appreciated, so we improve the possibility of getting the changes merged back in at some point.

@nigeltao
Copy link
Contributor

I wanted to get the line and column for the current Token, which forced me to fork the package.

I'm not sure if a fork is necessary. The Tokenizer takes an io.Reader, and that io.Reader can build the mapping from byte offset to line:col numbers. If I understand #34302 correctly, its LineCounter does exactly that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FeatureRequest NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants