Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: honor the unicode classes for identifiers #12483

Closed
robpike opened this issue Sep 3, 2015 · 5 comments
Closed

cmd/compile: honor the unicode classes for identifiers #12483

robpike opened this issue Sep 3, 2015 · 5 comments

Comments

@robpike
Copy link
Contributor

robpike commented Sep 3, 2015

The code currently says (lex.go):

    if c >= utf8.RuneSelf {
        /* all multibyte runes are alpha */
        cp = &lexbuf
        cp.Reset()

        goto talph
    }

Now that the compiler is in Go, we have access to the unicode tables and should use them.

@robpike robpike added this to the Go1.6 milestone Sep 3, 2015
@rsc
Copy link
Contributor

rsc commented Nov 4, 2015

It does use them; see the code at label talph. There's one bug in that leading non-ASCII Unicode digits are not rejected, but that's a separate issue and I have a CL forthcoming.

@rsc rsc closed this as completed Nov 4, 2015
@robpike
Copy link
Contributor Author

robpike commented Nov 17, 2015

Looks like CL 16919 but that didn't reference this issue. This issue was triggered by a public post (stack overflow??) that had an example I should have included.

There should probably be tests that the compiler gets this right. It's clear it didn't before.

@rsc
Copy link
Contributor

rsc commented Nov 17, 2015

For the purposes of lexing byte-at-a-time, all multibyte sequences are tentatively alpha. Then we filter once we've parsed the runes. We've always* done that. I know the comment makes it sound like what the Plan 9 C compiler does, but it's really not.

This is from Go 1.1 (just to show that the behavior has been this way for a long time):

if(c >= Runeself) {
    /* all multibyte runes are alpha */
    cp = lexbuf;
    ep = lexbuf+sizeof lexbuf;
    goto talph;
}

if(yy_isalpha(c)) {
    cp = lexbuf;
    ep = lexbuf+sizeof lexbuf;
    goto talph;
}

if(yy_isdigit(c))
    goto tnum;

switch(c) {
case EOF:
    lineno = prevlineno;
    ungetc(EOF);
    return -1;

case '_':
    cp = lexbuf;
    ep = lexbuf+sizeof lexbuf;
    goto talph;

That's all the possible ways to start an identifier, leading to the talph label. Then at the label:

talph:
    for(;;) {
        if(cp+10 >= ep) {
            yyerror("identifier too long");
            errorexit();
        }
        if(c >= Runeself) {
            ungetc(c);
            rune = getr();
            // 0xb7 · is used for internal names
>>>         if(!isalpharune(rune) && !isdigitrune(rune) && (importpkg == nil || rune != 0xb7))
>>>             yyerror("invalid identifier character U+%04x", rune);
            cp += runetochar(cp, &rune);
>>>     } else if(!yy_isalnum(c) && c != '_')
            break;
        else
            *cp++ = c;
        c = getc();
    }
    *cp = 0;
    ungetc(c);

So any multibyte non-alphanumeric will end up at talph and then be rejected with a message about that being an invalid character for an identifier (probably the best possible message, although strictly speaking it's making an assumption; maybe the user didn't intend the non-alphanumeric as part of an identifier).

The only bug in the code (that I found) was that leading non-ASCII digits were allowed (#11359). I closed this issue without a CL because I don't see any other problems. There is now a test for leading non-ASCII digits, as part of the CL for #11359. The current Go version of the talph block is:

talph:
    for {
        if c >= utf8.RuneSelf {
            ungetc(c)
            r := rune(getr())

            // 0xb7 · is used for internal names
            if !unicode.IsLetter(r) && !unicode.IsDigit(r) && (importpkg == nil || r != 0xb7) {
                Yyerror("invalid identifier character U+%04x", r)
            }
            if cp.Len() == 0 && unicode.IsDigit(r) {
                Yyerror("identifier cannot begin with digit U+%04x", r)
            }
            cp.WriteRune(r)
        } else if !isAlnum(c) && c != '_' {
            break
        } else {
            cp.WriteByte(byte(c))
        }
        c = getc()
    }

There is also a test (test/fixedbugs/bug163.go):

// errorcheck

// Copyright 2009 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

package main

func main() {
    x⊛y := 1; // ERROR "identifier"
}

I'd be happy to look again given a specific test case that is incorrectly accepted.

@robpike
Copy link
Contributor Author

robpike commented Jan 25, 2016

This program works and it should not.

package main

func main() {
     := 3
    _ = 
}

@robpike robpike reopened this Jan 25, 2016
@robpike robpike modified the milestones: Go1.7Early, Go1.6 Jan 25, 2016
@griesemer
Copy link
Contributor

http://play.golang.org/p/kUuxyPC4qw says that 'လ' is a letter.

'လ' is 101C which is Myanmar Letter LA

@robpike robpike closed this as completed Jan 25, 2016
@golang golang locked and limited conversation to collaborators Jan 24, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants