go/parser: reject files with BOMs not at the beginning. #5265

robpike · 2013-04-10T23:07:17Z

$ go fmt x.go #attached

You see no error.

$ go build x.go

You get an error:
    ./x.go:4: Unicode (UTF-8) BOM in middle of file

The error is correct: this is an illegal Go source file. I suspect the parser (go/parser)
isn't rejecting BOMs properly. A BOM is allowed only as the first code point in a source file.

It's a minor point but consistency among tools would be good.
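
A minimal sketch of how the inconsistency could be checked against go/parser directly, rather than through go fmt. The two source strings are hypothetical stand-ins, since the contents of the attached x.go are not shown here, and the helper name parseErr is made up for illustration. With a toolchain whose parser enforces the rule, the first call should return nil and the second a non-nil error.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
)

// parseErr parses src as a Go source file with go/parser and returns
// the parse error, or nil if the parser accepts the file.
// (Hypothetical helper, not part of any Go package.)
func parseErr(src string) error {
	fset := token.NewFileSet()
	_, err := parser.ParseFile(fset, "x.go", src, 0)
	return err
}

func main() {
	// Assumed test inputs: a BOM as the first code point, and a BOM
	// at the start of a later line. "\ufeff" is the escape for U+FEFF.
	leading := "\ufeffpackage p\n"
	middle := "package p\n\ufeffvar x int\n"

	fmt.Println("leading BOM: ", parseErr(leading))
	fmt.Println("mid-file BOM:", parseErr(middle))
}
```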

Attachments:

x.go (47 bytes)

peterGo · 2013-04-11T00:24:50Z

Comment 1:

The Go Programming Language Specification
Version of September 4, 2012
http://golang.org/ref/spec

Source code representation

Source code is Unicode text encoded in UTF-8. The text is not canonicalized, so a single
accented code point is distinct from the same character constructed from combining an
accent and a letter; those are treated as two code points. For simplicity, this document
will use the unqualified term character to refer to a Unicode code point in the source
text.

Each code point is distinct; for instance, upper and lower case letters are different
characters.

Implementation restriction: For compatibility with other tools, a compiler may disallow
the NUL character (U+0000) in the source text.

The Unicode Standard, Version 6.2
Chapter 3 Conformance
http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf

D95
When represented in UTF-8, the byte order mark [U+FEFF] turns into the byte sequence
<EF BB BF>.

D89
In a Unicode encoding form: A Unicode string is said to be in a particular Unicode
encoding form if and only if it consists of a well-formed Unicode code unit sequence of
that Unicode encoding form.
• A Unicode string consisting of a well-formed UTF-8 code unit sequence is said to be
in UTF-8. Such a Unicode string is referred to as a valid UTF-8 string, or a UTF-8
string for short.

D92
• Any UTF-8 byte sequence that does not match the patterns listed in Table 3-7 is
ill-formed.

Table 3-7 lists all of the byte sequences that are well-formed in UTF-8.

Table 3-7. Well-Formed UTF-8 Byte Sequences [in pertinent part]
Code Points        First Byte  Second Byte  Third Byte  Fourth Byte
U+E000..U+FFFF     EE..EF      80..BF       80..BF

The Unicode specification defines UTF-8. It looks to me as if the UTF-8 byte sequence
<EF BB BF>, for the BOM code point U+FEFF, is defined by Unicode as a well-formed
sequence of UTF-8 bytes. Therefore, I'm surprised that Go does not accept it. Are there
any other well-formed sequences of UTF-8 bytes that Go does not accept, apart from the NUL
character?
Does this break the Go 1 guarantee that "Source code is Unicode text encoded in UTF-8",
except that "a compiler may disallow the NUL character (U+0000)"?

robpike · 2013-04-11T02:31:33Z

Comment 2:

Clarification of the spec in https://golang.org/cl/8649043
This isn't really a bug, just an inconsistency, since the property falls under the
'implementation restriction' clause. Retracting the issue.

Status changed to Retracted.

robpike added the retracted label Apr 11, 2013

robpike assigned griesemer Apr 11, 2013

golang locked and limited conversation to collaborators Jun 24, 2016

gopherbot added the FrozenDueToAge label Jun 24, 2016

rsc unassigned griesemer Jun 22, 2022

This issue was closed.
