go/parser: reject files with BOMs not at the beginning. #5265
The Go Programming Language Specification
Version of September 4, 2012
http://golang.org/ref/spec

Source code representation

Source code is Unicode text encoded in UTF-8. The text is not canonicalized, so a single accented code point is distinct from the same character constructed from combining an accent and a letter; those are treated as two code points. For simplicity, this document will use the unqualified term character to refer to a Unicode code point in the source text.

Each code point is distinct; for instance, upper and lower case letters are different characters.

Implementation restriction: For compatibility with other tools, a compiler may disallow the NUL character (U+0000) in the source text.

The Unicode Standard, Version 6.2
Chapter 3 Conformance
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

D95 When represented in UTF-8, the byte order mark [U+FEFF] turns into the byte sequence <EF BB BF>.

D89 In a Unicode encoding form: A Unicode string is said to be in a particular Unicode encoding form if and only if it consists of a well-formed Unicode code unit sequence of that Unicode encoding form.
• A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be in UTF-8. Such a Unicode string is referred to as a valid UTF-8 string, or a UTF-8 string for short.

D92
• Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is ill-formed.

Table 3-7 lists all of the byte sequences that are well-formed in UTF-8.

Table 3-7. Well-Formed UTF-8 Byte Sequences [in pertinent part]

Code Points       First Byte   Second Byte   Third Byte   Fourth Byte
U+E000..U+FFFF    EE..EF       80..BF        80..BF

The Unicode specification defines UTF-8. It looks to me as if the UTF-8 byte sequence <EF BB BF>, for the BOM U+FEFF code point, is defined by Unicode as a well-formed sequence of UTF-8 bytes. Therefore, I'm surprised that Go does not accept it. Are there any other well-formed sequences of UTF-8 bytes that Go does not accept, apart from the NUL character?
Does this break the Go 1 guarantee that "Source code is Unicode text encoded in UTF-8.", except that "a compiler may disallow the NUL character (U+0000)"?
Clarification of the spec in https://golang.org/cl/8649043

This isn't really a bug, just an inconsistency, since the property falls under the 'implementation restriction' clause. Retracting the issue.

Status changed to Retracted.
This issue was closed.