deal with files using \r\n or \r line endings #680
I'd lex them all to \n, including inside ``. People wanting \r in backticks can strings.Replace() them back into existence, which is already common practice in net/http/mime testing code. An additional problem is what gofmt should do. People's editors on Windows might not like having their \r\n collapsed to \n. gofmt could detect and preserve the line ending style, similar to what gri is doing with detecting 1 or 2 line spacing between top-level declarations.
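To illustrate the strings.Replace() workaround mentioned above, here is a minimal sketch. It assumes the raw string holds only \n line breaks (as it would if the lexer drops \r); the helper name withCRLF and the sample request text are made up for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// rawRequest is a raw string literal. If the lexer drops \r, its line
// breaks are plain \n no matter how the source file was saved.
const rawRequest = `GET / HTTP/1.1
Host: example.com

`

// withCRLF (hypothetical helper) puts CR LF line endings back at run time.
// It assumes s contains no \r\n pairs already; otherwise they would
// become \r\r\n.
func withCRLF(s string) string {
	return strings.Replace(s, "\n", "\r\n", -1)
}

func main() {
	fmt.Printf("%q\n", withCRLF(rawRequest))
}
```

The conversion is explicit in the code, so the resulting bytes do not depend on the line endings of the file on disk.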
Typically, languages consider the following characters or character pairs as newline indicators: LF, CR, and CR LF (C#, Java, Python). C# also specifies three additional Unicode chars (next line char U+0085, line sep char U+2028, and paragraph sep char U+2029).

Single CRs were used by old Macintosh OSs. We can probably ignore them. Programs containing only CRs are likely not going to compile, and thus at least a user is alerted to the problem. We could allow the additional Unicode chars, but I am not convinced it is important - in the interest of simplicity I would not add them.

Since CR LF contains an LF, it will be properly recognized as a newline and things work as expected with respect to parsing lines and semicolon insertion. Thus, additionally inserted CRs in CR LF sequences are treated as white space and are not visible to a program except if they occur in multi-line raw strings.

Files compiled on different platforms should all behave the same, so compiling a file on a Windows machine should not result in different raw strings than the same file compiled on a Unix machine.

It looks like there are two possible ways to go:
a) We don't change the language spec. It is the source file creator's responsibility to be aware of potential extra CRs in multi-line raw strings (and often, it won't matter).
b) We change the language spec. As proposed before, one option is to say that all CR LF sequences are replaced by LFs, in the scanning phase.

gofmt effectively does b) with every program except for raw strings, which it preserves untouched.
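Option b) can be pictured as a preprocessing pass over the source bytes. The following is only a sketch of that idea, not how any Go tool implements it; the function name normalizeNewlines is invented here.

```go
package main

import (
	"bytes"
	"fmt"
)

// normalizeNewlines (hypothetical name) replaces every CR LF pair with a
// single LF. Lone CRs are left alone, matching the suggestion above to
// ignore CR-only line endings.
func normalizeNewlines(src []byte) []byte {
	return bytes.ReplaceAll(src, []byte("\r\n"), []byte("\n"))
}

func main() {
	src := []byte("package p\r\n\r\nconst s = `a\r\nb`\r\n")
	fmt.Printf("%q\n", normalizeNewlines(src))
}
```

Because the pass runs before tokens are formed, it also rewrites CR LF inside raw strings, which is exactly the behavioral change option b) would introduce.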
On second thought, I am no longer convinced this is good enough. Should \r (the UTF-8 byte) also be stripped from interpreted string literals? And if not, why not?

In the spec we refer to "line breaks" explicitly (or mostly) as newline (which is defined as \n), but for interpreted string literals we say that they cannot span multiple lines. On a \r-based system, an interpreted string containing a \r byte will appear to span multiple lines; on a \n-based system it will appear on one. Is the string legal or not?

Similarly for comments: multi-line comments act like a newline, which matters for semicolon insertion. When is a comment multi-line? It may depend on the system (newline, on the other hand, is defined).

I think we want to be able to take a given source (\r, \r\n, or \n-based) and reproduce it as an equivalent program (e.g. with gofmt) on a system with a different line break.

One way out might be:
- We don't care about \r-based systems; a line break is present if there is a \n byte.
- Consequently, a string or comment spans multiple lines if there is a \n byte.
- \r bytes outside strings act as white space; they are ignored inside (all?) strings.

Thus, avoiding the notion of "multiple lines":
- A general comment containing newlines acts like a newline; otherwise it acts like a space.
- An interpreted string may not contain newlines.
- \r chars are ignored in all strings. (?)
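The three tentative rules above can be written as small predicates. This is a sketch of the proposal as stated in this comment only; all function names are invented, and the last rule carried a question mark in the proposal itself.

```go
package main

import (
	"fmt"
	"strings"
)

// commentActsAsNewline: a general comment acts like a newline only if it
// contains a \n byte; a \r alone does not count.
func commentActsAsNewline(comment string) bool {
	return strings.Contains(comment, "\n")
}

// interpretedStringOK: an interpreted string body is legal only if it
// contains no \n byte.
func interpretedStringOK(body string) bool {
	return !strings.Contains(body, "\n")
}

// dropCR: \r bytes are ignored inside string bodies (the rule marked
// with "(?)" in the proposal).
func dropCR(body string) string {
	return strings.ReplaceAll(body, "\r", "")
}

func main() {
	fmt.Println(commentActsAsNewline("/* a\r b */"))
	fmt.Println(commentActsAsNewline("/* a\r\n b */"))
	fmt.Println(interpretedStringOK("one\rline"))
	fmt.Printf("%q\n", dropCR("a\r\nb"))
}
```

Under these rules every decision depends only on the presence of \n, so the same source bytes are classified the same way on every platform.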
I don't think we care about systems which end lines with a plain \r. I wouldn't bother to change the rules for interpreted string literals. We set the rule for raw string literals because of the potential confusion if a file changes from \r\n lines to \n lines or vice-versa. There is no such potential confusion for an interpreted string literal, so there is no need to change anything.
The recent spec change is implemented in the compiler. I believe the only remaining change is to implement the rule in go/token.

changeset: f130f78eefa4
user: Russ Cox <rsc@golang.org>
date: Thu Dec 15 10:47:09 2011 -0500
summary: gc: implement and test \r in raw strings

changeset: 2d9ac660f013
user: Rob Pike <r@golang.org>
date: Wed Dec 14 21:52:41 2011 -0800
summary: spec: skip carriage returns in raw literals
Owner changed to @griesemer.
This was implemented on the go/scanner side as well: http://golang.org/cl/5495049 Marking as fixed. Status changed to Fixed.
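One way to observe the go/scanner behavior is to scan a small source that uses \r\n line endings and look at the raw string token. The sketch below uses the standard go/scanner and go/token packages; the helper name stringLiterals and the file name example.go are made up for the example.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// stringLiterals (hypothetical helper) scans src and returns the literal
// text of every STRING token, including the surrounding quotes or backticks.
func stringLiterals(src []byte) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("example.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, src, nil, 0) // nil error handler, default mode

	var lits []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		if tok == token.STRING {
			lits = append(lits, lit)
		}
	}
	return lits
}

func main() {
	// Source with Windows-style line endings, including inside the raw string.
	src := []byte("package p\r\nconst s = `a\r\nb`\r\n")
	for _, lit := range stringLiterals(src) {
		fmt.Printf("%q\n", lit)
	}
}
```

With the rule in place, the raw string literal returned by the scanner should contain only the \n, so the same file yields the same string value whether it was saved with \r\n or \n line endings.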
This issue was closed.