deal with files using \r\n or \r line endings #680

Closed
rsc opened this issue Mar 19, 2010 · 14 comments
@rsc
Contributor

rsc commented Mar 19, 2010

right now things misbehave.
@rsc
Contributor Author

rsc commented Mar 19, 2010

Comment 1:

by things i mean the compilers do the wrong thing
because they expect \n.  we need to decide in the
language spec what to do and then do it.
in addition to semicolon insertion problems
there is the problem that `` strings spanning
lines get different bytes on different machines.

@bradfitz
Contributor

bradfitz commented May 3, 2011

Comment 2:

I'd lex them all to \n, including inside ``. People wanting \r in backticks can
strings.Replace() them back into existence, which is already common practice in
net/http/mime testing code.
Additional problem is what gofmt should do. People's editors on Windows might not like
having their \r\n collapsed to \n.
gofmt could detect & preserve the line ending style, similar to what gri is doing with
detecting 1 or 2 line spacing between top-level declarations.
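
A minimal sketch of the strings.Replace() workaround mentioned above, assuming raw strings are normalized to \n by the lexer. The helper name withCRLF and the header text are made up for illustration; they are not taken from the net/http tests.

```go
package main

import (
	"fmt"
	"strings"
)

// withCRLF converts every LF in s to CR LF. It assumes s contains no CR
// bytes to begin with, which is what the proposed normalization would
// guarantee for a raw string literal.
func withCRLF(s string) string {
	return strings.Replace(s, "\n", "\r\n", -1)
}

func main() {
	// Hypothetical fixture: a header block written as a multi-line raw string.
	const header = `Host: example.com
Accept: text/plain
`
	// %q makes the restored \r bytes visible.
	fmt.Printf("%q\n", withCRLF(header))
}
```

With this approach the bytes of the fixture no longer depend on how the source file's line endings were saved.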

@rsc
Contributor Author

rsc commented May 3, 2011

Comment 3:

that got squashed.  content-based heuristics are a slippery slope.

@griesemer
Contributor

Comment 4:

Typically, languages consider the following characters or character pairs as new line
indicators:
LF
CR
CR LF
(C#, Java, Python). C# also specifies three additional Unicode chars (next line char
U+0085, line sep char U+2028, and paragraph sep char U+2029).
Single CRs were used by old Macintosh OSs. We can probably ignore them. Programs
containing only CRs are likely not going to compile and thus at least a user is alerted
to the problem.
We could allow the additional Unicode chars, but I am not convinced it is important - in
the interest of simplicity I would not add them.
Since CR LF contains an LF, it will be properly recognized as a newline and things work
as expected with respect to parsing lines and semicolon insertion.
Thus, additionally inserted CRs in CR LF sequences are treated as white space and are
not visible to a program except if they occur in multi-line raw strings. Files compiled
on different platforms should all behave the same, so compiling a file on a Windows
machine should not result in different raw strings than the same file compiled on a Unix
machine.
It looks like there are two possible ways to go:
a) We don't change the language spec. It is the source file creator's responsibility to
be aware of potential extra CRs in multi-line raw strings (and often, it won't matter).
b) We change the language spec. As proposed before, one option is to say that all CR LF
sequences are replaced by LFs, in the scanning phase.
gofmt effectively does b) with every program except for raw strings which it preserves
untouched.
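
The claim that CR LF already works for line parsing and semicolon insertion can be checked with a rough sketch against the go/scanner package as it exists today (which postdates this comment). The helper names tokens and equal are invented for the example.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// tokens scans src and returns one "KIND literal" string per token,
// including automatically inserted semicolons.
func tokens(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("example.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0) // nil error handler, comments skipped
	var out []string
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			return out
		}
		out = append(out, fmt.Sprintf("%s %q", tok, lit))
	}
}

// equal reports whether two token lists are identical.
func equal(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	lf := tokens("x := 1\ny := 2\n")
	crlf := tokens("x := 1\r\ny := 2\r\n")
	fmt.Println(lf)
	fmt.Println("same tokens with CR LF:", equal(lf, crlf))
}
```

The CR bytes are skipped as white space, so both inputs produce the same token stream, including the inserted semicolons.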

@rsc
Contributor Author

rsc commented May 6, 2011

Comment 5:

I would be happy to say that in a Go source file,
even inside a raw string, \r\n is treated as \n.
Russ

@rsc
Contributor Author

rsc commented Dec 9, 2011

Comment 7:

Labels changed: added priority-later.

@rsc
Contributor Author

rsc commented Dec 12, 2011

Comment 8:

Labels changed: added priority-go1.

@robpike
Contributor

robpike commented Dec 15, 2011

Comment 9:

Spec has been updated to state that \r is stripped from raw literals; compilers do not
enforce the rule yet.

@griesemer
Contributor

Comment 10:

On second thought, I am not convinced anymore this is good enough.
Should \r (the utf-8 byte) also be stripped from interpreted string literals? And if
not, why not?
In the spec we refer to "line breaks" explicitly (or mostly) as newline (which is
defined as \n), but for interpreted string literals we say that they cannot span
multiple lines. On a \r-based system, an interpreted string containing a \r byte will
make it appear on multiple lines; on a \n-based system it will appear on one. Is the
string legal or not?
Similarly for comments: Multi-line comments act like newline, which matters for
semicolon insertion. When is a comment multi-line? It may depend on the system (newline
is defined, on the other hand).
I think we want to be able to take a given source (\r, \r\n, or \n-based) and reproduce
it as an equivalent program (e.g. w/ gofmt) on a system with a different line
break.
One way out might be:
- We don't care about \r-based systems; a line break is present if there is a \n byte.
- Consequently, a string or comment spans multiple lines if there is a \n byte.
- \r bytes outside strings act as white space, they are ignored inside (all?) strings.
Thus, avoiding the notion of "multiple lines":
- A general comment containing newlines acts like a newline; otherwise it acts like a
space.
- An interpreted string may not contain newlines.
- \r chars are ignored in all strings. (?)
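
The rule that a general comment containing a newline acts like a newline can be illustrated with a small sketch using today's go/scanner, which is assumed here to follow that rule. The helper name kinds is invented for the example.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// kinds scans src with comments skipped and returns the token kinds,
// so that automatically inserted semicolons show up as ";".
func kinds(src string) []string {
	fset := token.NewFileSet()
	file := fset.AddFile("example.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)
	var out []string
	for {
		_, tok, _ := s.Scan()
		if tok == token.EOF {
			return out
		}
		out = append(out, tok.String())
	}
}

func main() {
	// The comment stays on one line: it acts like a space between x and y.
	fmt.Println(kinds("x /* one line */ y"))
	// The comment contains a \n byte: it acts like a newline, so a
	// semicolon is inserted after x.
	fmt.Println(kinds("x /* two\nlines */ y"))
}
```

Only the presence of a \n byte inside the comment changes the token stream; how the comment happens to be displayed does not matter.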

@ianlancetaylor
Contributor

Comment 11:

I don't think we care about systems which end lines with a plain \r.
I wouldn't bother to change the rules for interpreted string literals.  We set the rule
for raw string literals because of the potential confusion if a file changes from \r\n
lines to \n lines or vice-versa.  There is no such potential confusion for an
interpreted string literal, so there is no need to change anything.

@gopherbot

Comment 12 by robert.griesemer:

We should still be more precise about what the meaning of "multiple lines" is, though.

@rsc
Contributor Author

rsc commented Dec 15, 2011

Comment 13:

The recent spec change is implemented in the compiler.
I believe the only remaining change is to implement the
rule in go/token.
changeset:   f130f78eefa4
user:        Russ Cox <rsc@golang.org>
date:        Thu Dec 15 10:47:09 2011 -0500
summary:     gc: implement and test \r in raw strings
changeset:   2d9ac660f013
user:        Rob Pike <r@golang.org>
date:        Wed Dec 14 21:52:41 2011 -0800
summary:     spec: skip carriage returns in raw literals

@robpike
Contributor

robpike commented Dec 19, 2011

Comment 14:

Owner changed to @griesemer.

@griesemer
Contributor

Comment 15:

This was implemented on the go/scanner side as well:
http://golang.org/cl/5495049
Marking as fixed.
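
For reference, a short sketch of how the go/scanner behavior can be observed, assuming the current package still strips \r from raw string literals as described in this thread. The helper name rawLiteral is invented for the example.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// rawLiteral scans src and returns the literal text of the first STRING
// token, or "" if there is none.
func rawLiteral(src string) string {
	fset := token.NewFileSet()
	file := fset.AddFile("example.go", fset.Base(), len(src))
	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)
	for {
		_, tok, lit := s.Scan()
		switch tok {
		case token.STRING:
			return lit
		case token.EOF:
			return ""
		}
	}
}

func main() {
	// The same raw string as saved with CR LF and with LF line endings.
	crlf := rawLiteral("`a\r\nb`")
	lf := rawLiteral("`a\nb`")
	fmt.Printf("%q %q\n", crlf, lf)
	fmt.Println("identical:", crlf == lf)
}
```

Both inputs yield the same literal text, so tools built on go/scanner see the same raw string regardless of the file's line endings.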

Status changed to Fixed.

@rsc rsc added fixed labels Dec 19, 2011
@rsc rsc added this to the Go1 milestone Apr 10, 2015
@rsc rsc removed the priority-go1 label Apr 10, 2015
@golang golang locked and limited conversation to collaborators Jun 24, 2016
This issue was closed.