Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go/scanner: Scanner.Scan sometimes returns token.ILLEGAL and empty literal string #28112

Closed
dmitshur opened this issue Oct 10, 2018 · 9 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@dmitshur
Copy link
Contributor

What version of Go are you using (go version)?

$ go version
go version devel +c870d56f98 Wed Oct 10 02:08:36 2018 +0000 darwin/amd64

Does this issue reproduce with the latest release?

Yes it does, with Go 1.11.1.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/dmitri/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/dmitri/Dropbox/Work/2013/GoLanding:/Users/dmitri/Dropbox/Work/2013/GoLand"
GOPROXY=""
GORACE=""
GOROOT="/Users/dmitri/Dropbox/Work/2015/golang-contribute/go"
GOTMPDIR=""
GOTOOLDIR="/Users/dmitri/Dropbox/Work/2015/golang-contribute/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/b8/66r1c5856mqds1mrf2tjtq8w0000gn/T/go-build878012089=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I tried scanning over the following minified Go code containing illegal tokens:

a..b

With the following program:

package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func main() {
	// src is the input that we want to tokenize.
	src := []byte("a..b")

	// Initialize the scanner.
	var s scanner.Scanner
	fset := token.NewFileSet()                      // positions are relative to fset
	file := fset.AddFile("", fset.Base(), len(src)) // register input "file"
	s.Init(file, src, nil /* no error handler */, scanner.ScanComments)

	// Repeated calls to Scan yield the token sequence found in the input.
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit)
	}
}

(Playground: https://play.golang.org/p/No77Yx0D4Ki.)

What did you expect to see?

go/scanner.Scanner.Scan documentation says:

If the returned token is token.ILLEGAL, the literal string is the offending character.

In all other cases, Scan returns an empty literal string.

Perhaps I'm misunderstanding the documentation, but I expected that if returned token is token.ILLEGAL, then the literal string should not be empty. I expected to see the literal string contain the offending character or characters.

1:1	IDENT	"a"
1:2	ILLEGAL	".."
1:4	IDENT	"b"
1:5	;	"\n"

What did you see instead?

A token.ILLEGAL token and empty literal string:

1:1	IDENT	"a"
1:2	ILLEGAL	""
1:4	IDENT	"b"
1:5	;	"\n"

/cc @griesemer

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Oct 10, 2018
@dmitshur dmitshur added this to the Go1.12 milestone Oct 10, 2018
@cznic
Copy link
Contributor

cznic commented Oct 10, 2018

The ELLIPSIS token is missing its last character, so the the "offending character" in this case is empty. The leading ".." sequence is ok, but the only valid character after that is a third '.'.

@griesemer
Copy link
Contributor

@cznic The offending character here is the 'b' - it if were a '.' it would be ok. This is clearly a bug in the scanner.

@cznic
Copy link
Contributor

cznic commented Oct 10, 2018

Heh, used a .. b in my test by mistake. Still, I'm not sure if the offending character, when ir does not start a token, should be consumed as it may start a valid token.

@griesemer
Copy link
Contributor

@cznic My apologies - I concluded a bit too quickly. It's not clear here what exactly the offending character should be. Your suggestion of an empty character might be just as valid. It's certainly problematic that the 'if' starting on scanner.go:740 has not 'else' branch and thus silently exits the big 'switch' statement. Because the result 'tok' is not set at all, it returns the initial (invalid) token.

@griesemer
Copy link
Contributor

For reference: #28128

@griesemer
Copy link
Contributor

griesemer commented Oct 10, 2018

Actually, while this is a bug in the scanner (and there are others, related to this), the spec is pretty clear:

While breaking the input into tokens, the next token is the longest sequence of characters that form a valid token.

So the longest valid sequence of characters here is a dot . followed by another one. That is, the scanner should not return an ILLEGAL token at all. That is what cmd/compile's scanner does.

I vaguely remember having been aware of this for a long time but never bothered to get it 100% correct in go/scanner because it doesn't matter in practice. There's no Go code containing two dots ".." that is valid, so as long as we get an error, we're mostly ok.

There are probably other such cases such as invalid octal or hex numbers, say: 07a. Is that the octal literal 07 followed by an a, or an invalid octal constant? Following the word of the spec exactly, it should be the former. On the other hand, an error complaining about an invalid octal constant may be clearer to a reader than an error due to invalid syntax because the octal 07 is incorrectly followed by the identifier a. Maybe the spec should be less demanding in those cases.

I will try to address the specific problem here if I can find an easy fix (which doesn't slow down the scanner everywhere because we need to keep more state around) but leave the more general problem open.

@gopherbot
Copy link

Change https://golang.org/cl/141337 mentions this issue: go/scanner: don't return token.INVALID for ".." sequence

@dmitshur
Copy link
Contributor Author

There's no Go code containing two dots ".." that is valid,

In case you're curious how I discovered this, I ran into this in issue #28082. You left a comment there containing a fenced code block marked up with "Go" syntax, but it was actually a diff of Go code.

I was viewing that comment with some software that performs syntax highlighting for Go using a highlighter implemented on top of go/scanner, and it was behaving incorrectly on the index 1de7cd81b2..38b50f2596 100644 line because I made the wrong assumption about literal being non-empty when token is token.ILLEGAL.

@griesemer
Copy link
Contributor

@dmitshur Interesting. The scanner should always return the offending character (lit) if there's an ILLEGAL token. I believe this should be the case now.

dmitshur added a commit to shurcooL/highlight_go that referenced this issue Oct 28, 2018
The previous logic did not handle the situation where token.ILLEGAL
was returned together with an empty literal string, and panicked
because it performed an out of bounds slice operation (s[low:high],
where low > high).

Improve the code so that doesn't happen, and add test cases for it.

The token.ILLEGAL with empty literal string possibility was a bug in
the go/scanner package (see golang/go#28112). It has been fixed in
https://golang.org/cl/141337, and shouldn't happen again.

Vendor a copy of go/scanner with https://golang.org/cl/141337 applied.
It can and should be removed after Go 1.12 is released with that fix.

Clean up the code in general.
@golang golang locked and limited conversation to collaborators Oct 11, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants