cmd/compile: honor the unicode classes for identifiers #12483

robpike · 2015-09-03T21:22:35Z

The code currently says (lex.go):

    if c >= utf8.RuneSelf {
        /* all multibyte runes are alpha */
        cp = &lexbuf
        cp.Reset()

        goto talph
    }

Now that the compiler is in Go, we have access to the unicode tables and should use them.

rsc · 2015-11-04T20:36:59Z

It does use them; see the code at label talph. There is one bug: leading non-ASCII Unicode digits are not rejected. That's a separate issue, and I have a CL forthcoming.

robpike · 2015-11-17T15:45:19Z

That looks like CL 16919, but the CL didn't reference this issue. This issue was triggered by a public post (Stack Overflow?) that had an example I should have included.

There should probably be tests that the compiler gets this right. It's clear it didn't before.

rsc · 2015-11-17T16:56:20Z

For the purposes of lexing byte-at-a-time, all multibyte sequences are tentatively alpha. Then we filter once we've parsed the runes. We've always* done that. I know the comment makes it sound like what the Plan 9 C compiler does, but it's really not.
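As an illustration of the two steps described above, here is a minimal sketch. It is not compiler code, and `firstRuneInfo` is a hypothetical helper introduced only for this example. The first byte alone can only say "this starts a multibyte sequence"; whether the rune is a letter is known only after decoding.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// firstRuneInfo is a hypothetical helper (not from the compiler).
// multibyte is the byte-at-a-time answer: does the first byte start a
// multibyte sequence? letter is the answer after decoding the full rune.
func firstRuneInfo(s string) (multibyte, letter bool) {
	if s == "" {
		return false, false
	}
	multibyte = s[0] >= utf8.RuneSelf
	r, _ := utf8.DecodeRuneInString(s)
	return multibyte, unicode.IsLetter(r)
}

func main() {
	for _, s := range []string{"a", "é", "⊛"} {
		mb, letter := firstRuneInfo(s)
		fmt.Printf("%q: multibyte=%v letter=%v\n", s, mb, letter)
	}
}
```

Both "é" and "⊛" look identical to the byte-level test, which is why the lexer has to treat them as tentatively alpha and filter later.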

This is from Go 1.1 (just to show that the behavior has been this way for a long time):

if(c >= Runeself) {
    /* all multibyte runes are alpha */
    cp = lexbuf;
    ep = lexbuf+sizeof lexbuf;
    goto talph;
}

if(yy_isalpha(c)) {
    cp = lexbuf;
    ep = lexbuf+sizeof lexbuf;
    goto talph;
}

if(yy_isdigit(c))
    goto tnum;

switch(c) {
case EOF:
    lineno = prevlineno;
    ungetc(EOF);
    return -1;

case '_':
    cp = lexbuf;
    ep = lexbuf+sizeof lexbuf;
    goto talph;

That's all the possible ways to start an identifier, leading to the talph label. Then at the label:

talph:
    for(;;) {
        if(cp+10 >= ep) {
            yyerror("identifier too long");
            errorexit();
        }
        if(c >= Runeself) {
            ungetc(c);
            rune = getr();
            // 0xb7 · is used for internal names
>>>         if(!isalpharune(rune) && !isdigitrune(rune) && (importpkg == nil || rune != 0xb7))
>>>             yyerror("invalid identifier character U+%04x", rune);
            cp += runetochar(cp, &rune);
>>>     } else if(!yy_isalnum(c) && c != '_')
            break;
        else
            *cp++ = c;
        c = getc();
    }
    *cp = 0;
    ungetc(c);

So any multibyte non-alphanumeric will end up at talph and then be rejected with a message about that being an invalid character for an identifier (probably the best possible message, although strictly speaking it's making an assumption; maybe the user didn't intend the non-alphanumeric as part of an identifier).

The only bug in the code (that I found) was that leading non-ASCII digits were allowed (#11359). I closed this issue without a CL because I don't see any other problems. There is now a test for leading non-ASCII digits, as part of the CL for #11359. The current Go version of the talph block is:

talph:
    for {
        if c >= utf8.RuneSelf {
            ungetc(c)
            r := rune(getr())

            // 0xb7 · is used for internal names
            if !unicode.IsLetter(r) && !unicode.IsDigit(r) && (importpkg == nil || r != 0xb7) {
                Yyerror("invalid identifier character U+%04x", r)
            }
            if cp.Len() == 0 && unicode.IsDigit(r) {
                Yyerror("identifier cannot begin with digit U+%04x", r)
            }
            cp.WriteRune(r)
        } else if !isAlnum(c) && c != '_' {
            break
        } else {
            cp.WriteByte(byte(c))
        }
        c = getc()
    }
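The per-rune rules in that loop can be tried outside the compiler with a small standalone sketch. `checkIdent` is a hypothetical helper, not part of cmd/compile. It assumes the candidate identifier has already been delimited, it skips ASCII bytes (the real loop ends the identifier at a non-alphanumeric ASCII byte), and it leaves out the importpkg exception for 0xb7.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// checkIdent is a hypothetical helper (not compiler code). It applies the
// multibyte-rune checks from the talph loop to a whole candidate identifier
// and returns the diagnostics those checks would produce.
func checkIdent(s string) []string {
	var errs []string
	for i, r := range s {
		if r < utf8.RuneSelf {
			// ASCII is handled by a separate branch in the real lexer.
			continue
		}
		if !unicode.IsLetter(r) && !unicode.IsDigit(r) {
			errs = append(errs, fmt.Sprintf("invalid identifier character U+%04x", r))
		}
		// i == 0 plays the role of cp.Len() == 0 in the lexer.
		if i == 0 && unicode.IsDigit(r) {
			errs = append(errs, fmt.Sprintf("identifier cannot begin with digit U+%04x", r))
		}
	}
	return errs
}

func main() {
	for _, s := range []string{"လ", "x⊛y", "၁x"} {
		fmt.Printf("%q: %q\n", s, checkIdent(s))
	}
}
```

In this sketch, a letter such as လ yields no diagnostics, the operator in x⊛y is reported as an invalid identifier character, and a leading non-ASCII digit is reported separately.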

There is also a test (test/fixedbugs/bug163.go):

// errorcheck

// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

func main() {
    x⊛y := 1; // ERROR "identifier"
}

I'd be happy to look again given a specific test case that is incorrectly accepted.

* "always" here means since June 2009: https://go.googlesource.com/go/+/5d5904bb4dc132e6f97ab990e0bb0c73a2af15ff

robpike · 2016-01-25T21:56:01Z

This program works and it should not.

package main

func main() {
    လ := 3
    _ = လ
}

griesemer · 2016-01-25T22:14:42Z

http://play.golang.org/p/kUuxyPC4qw says that 'လ' is a letter.

'လ' is U+101C, which is MYANMAR LETTER LA.
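The playground program itself is not reproduced here. A minimal sketch of the same kind of check, using only the standard unicode package, might look like this; `classify` is a hypothetical helper named for this example.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify is a hypothetical helper that reports how the unicode
// package categorizes r, using the same two predicates as the lexer.
func classify(r rune) string {
	switch {
	case unicode.IsLetter(r):
		return "letter"
	case unicode.IsDigit(r):
		return "digit"
	default:
		return "other"
	}
}

func main() {
	// The rune from the program above, plus the operator from bug163.go
	// and the middle dot (0xb7) for comparison.
	for _, r := range []rune{'လ', '⊛', '·'} {
		fmt.Printf("%U %q: %s\n", r, r, classify(r))
	}
}
```

Since unicode.IsLetter reports true for U+101C, the lexer's check accepts it, and the program above is a valid Go program.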

robpike added this to the Go1.6 milestone Sep 3, 2015

rsc closed this as completed Nov 4, 2015

robpike reopened this Jan 25, 2016

robpike modified the milestones: Go1.7Early, Go1.6 Jan 25, 2016

robpike closed this as completed Jan 25, 2016

golang locked and limited conversation to collaborators Jan 24, 2017

gopherbot added the FrozenDueToAge label Jan 24, 2017
