encoding/csv: skipping of empty rows leads to loss of data in single-column datasets #39119

kokes · 2020-05-17T13:11:15Z

What version of Go are you using (`go version`)?

$ go version
go version go1.14 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (`go env`)?

go env Output

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/ondrej/Library/Caches/go-build"
GOENV="/Users/ondrej/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/ondrej/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/Cellar/go/1.14/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.14/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/hp/q7nph21s1q76nw1hv1hfxv2m0000gn/T/go-build988616937=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I wrote a CSV with a single column and missing data.

What did you expect to see?

I expected to load the data back, intact.

What did you see instead?

I lost the missing values, encoding/csv skipped them as it skips blank lines. In this case, a blank line actually represents data.

I'm not sure I understand the rationale behind skipping blank lines. Neither in terms of common practice (why would I have blank lines in my CSVs?) nor in terms of standards (the closest we have is RFC 4180 and I couldn't find anything about blank lines - so I'm not sure if Go follows it).

Here's a reproduction of the problem. I wrote a dataset into a file and was unable to read it back.

package main

import (
	"encoding/csv"
	"errors"
	"log"
	"os"
	"reflect"
)

func writeData(filename string, data [][]string) error {
	f, err := os.Create(filename)
	if err != nil {
		return err
	}
	defer f.Close()
	cw := csv.NewWriter(f)
	defer cw.Flush()
	if err := cw.WriteAll(data); err != nil {
		return err
	}
	return nil
}

func readData(filename string) ([][]string, error) {
	f, err := os.Open(filename)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	cr := csv.NewReader(f)
	rows, err := cr.ReadAll()
	if err != nil {
		return nil, err
	}
	return rows, nil
}

func run() error {
	fn := "data/roundtrip.csv"
	data := [][]string{{"john"}, {"jane"}, {""}, {"jack"}}

	if err := writeData(fn, data); err != nil {
		return err
	}

	returned, err := readData(fn)
	if err != nil {
		return err
	}
	if !reflect.DeepEqual(returned, data) {
		log.Println("expected", data, "got", returned)
		return errors.New("not equal")
	}

	return nil
}

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
}

The text was updated successfully, but these errors were encountered:

cagedmantis · 2020-05-20T18:28:05Z

As mentioned, a quick look at RFC 4180 does not seem to explicitly describe what the behavior should be for empty lines.

cagedmantis · 2020-05-20T18:28:21Z

/cc @dsnet @bradfitz @rsc

dsnet · 2020-05-21T19:42:21Z

According to RFC 4180, section 2.1:

Each record is located on a separate line, delimited by a line break (CRLF).

Which suggests that this is a bug in the reader code.

However, I'm not going to be surprised if fixing this breaks lots of users due to Hyrum's law and we're forced to just document the errant behavior. Even though the package tries to follow RFC 4180, CSV files are one of those formats with many strange variants that do not folllow any specific grammar.

dsnet · 2020-05-21T19:47:31Z

Interesting that this used to work in go1.3 and stopped working in go1.4, but not because the reader behavior changed, but that the writer behavior changed.

In go1.3, it would produce:

john
jane
""
jack

but in go1.4, it would produce:

john
jane

jack

The presence of an explicit "" made it clear to the reader that there was a record present.

ianlancetaylor · 2020-05-21T20:54:05Z

The change in Go 1.4 is documented, but I guess not in user-visible documentation.

// We used to quote empty strings, but we do not anymore (as of Go 1.4).
// The two representations should be equivalent, but Postgres distinguishes
// quoted vs non-quoted empty string during database imports, and it has
// an option to force the quoted behavior for non-quoted CSV but it has
// no option to force the non-quoted behavior for quoted CSV, making
// CSV with quoted empty strings strictly less useful.
// Not quoting the empty string also makes this package match the behavior
// of Microsoft Excel and Google Drive.

See #7586.

kokes · 2020-05-22T06:13:01Z

Which suggests that this is a bug in the reader code.

However, I'm not going to be surprised if fixing this breaks lots of users due to Hyrum's law and we're forced to just document the errant behavior. Even though the package tries to follow RFC 4180, CSV files are one of those formats with many strange variants that do not folllow any specific grammar.

If we're worried about breaking existing code, we could add a boolean flag, which defaults to the current behaviour and then we could discuss flipping it (and potentially removing it) for Go 2.

If I understand the implementation correctly, then one cannot just initialise a Reader struct, one has to go through NewReader, because the underlying io.Reader is unexported. In that case, we can enforce our default in the constructor. (Or we could flip the boolean flag to mean e.g. ParseBlankLines, which has the desired default value.)

Something along the lines of this (haven't tested it, just sketching):

diff --git a/src/encoding/csv/reader.go b/src/encoding/csv/reader.go
index c40aa506b0..cd2b0ccfc1 100644
--- a/src/encoding/csv/reader.go
+++ b/src/encoding/csv/reader.go
@@ -16,8 +16,8 @@
 //
 // Carriage returns before newline characters are silently removed.
 //
-// Blank lines are ignored. A line with only whitespace characters (excluding
-// the ending newline character) is not considered a blank line.
+// Blank lines are ignored by default. A line with only whitespace characters
+// (excluding the ending newline character) is not considered a blank line.
 //
 // Fields which start and stop with the quote character " are called
 // quoted-fields. The beginning and ending quote are not part of the
@@ -142,6 +142,9 @@ type Reader struct {
 	// By default, each call to Read returns newly allocated memory owned by the caller.
 	ReuseRecord bool
 
+	// If SkipBlankLines is true (default), rows with no data are skipped.
+	SkipBlankLines bool
+
 	TrailingComma bool // Deprecated: No longer used.
 
 	r *bufio.Reader
@@ -169,8 +172,9 @@ type Reader struct {
 // NewReader returns a new Reader that reads from r.
 func NewReader(r io.Reader) *Reader {
 	return &Reader{
-		Comma: ',',
-		r:     bufio.NewReader(r),
+		Comma:          ',',
+		SkipBlankLines: true,
+		r:              bufio.NewReader(r),
 	}
 }
 
@@ -268,7 +272,7 @@ func (r *Reader) readRecord(dst []string) ([]string, error) {
 			line = nil
 			continue // Skip comment lines
 		}
-		if errRead == nil && len(line) == lengthNL(line) {
+		if r.SkipBlankLines && errRead == nil && len(line) == lengthNL(line) {
 			line = nil
 			continue // Skip empty lines
 		}

kokes · 2020-06-12T13:17:53Z

~~Here's a patch, including tests. If this is something that would be acceptable, I can go through the usual gerrit mechanism.~~ Oh, I see PRs on Github are accepted, I'll do that then.

Also, if this patch goes through, I'd suggest for Go 2 to have the default reverted - interpret blank lines by default and either add a SkipBlankLines switch or remove this functionality altogether.

gopherbot · 2020-06-12T13:42:10Z

Change https://golang.org/cl/237658 mentions this issue: encoding/csv: allow for interpretation of empty lines

ianlancetaylor · 2020-06-12T17:27:34Z

We've pretty much decided that we are not going to add additional knobs to encoding/csv. There are just too many variations in CSV files out there. Trying to handle all of them would lead to a massive proliferation of knobs and an unusable package API.

kokes · 2020-06-12T17:35:11Z

That's too bad, because two of the existing knobs are actually going against the "standard" - lazyquotes and comments, while this suggested knob actually tries to bring Go closer to standard compliance (while keeping backward compatibility).

Anyway, where can I register interest in changing this in Go 2? No knobs, just changing the default - blank lines are perfectly fine and should be interpreted as data.

ianlancetaylor · 2020-06-12T17:42:03Z

There isn't going to be a big shift to Go 2. There may never be a Go 2, except perhaps for marketing reasons. For a package like this, there may someday be a encoding/csv/v2. Although personally I think that our experience with this package is that it would be better to drop it from the standard library, and encourage a variety of third party packages instead.

msakrejda · 2020-11-13T00:21:36Z

Should the documentation be updated to indicate that RFC 4180 is not fully supported? I just ran into this myself, and reading the RFC

   record = field *(COMMA field)
   name = field
   field = (escaped / non-escaped)
   escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
   non-escaped = *TEXTDATA
   ...
   TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E

suggests strongly that blank lines should correspond to a single empty field.

I realize there's already a note a couple of paragraphs down in the encoding/csv documentation that states blank lines are ignored, but it's not made clear that this behavior contradicts the RFC.

gopherbot · 2021-11-10T18:30:17Z

Change https://golang.org/cl/363001 mentions this issue: encoding/csv: allow for interpretation of empty lines

cagedmantis added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 20, 2020

cagedmantis added this to the Backlog milestone May 20, 2020

kokes mentioned this issue Jun 12, 2020

encoding/csv: allow for interpretation of empty lines #39556

Closed

KN4CK3R mentioned this issue Mar 24, 2021

Add Tabular Diff for CSV files go-gitea/gitea#14661

Merged

kdumontnu linked a pull request Mar 25, 2021 that will close this issue

encoding/csv: allow for interpretation of empty lines #45228

Open

eatonphil mentioned this issue Jan 13, 2022

Parsing CSV "wrong number of fields" multiprocessio/dsq#6

Closed

johnkerl mentioned this issue Jan 1, 2023

CSV empty-line regression since 6.0 johnkerl/miller#1164

Closed

shenwei356 mentioned this issue Aug 9, 2023

del-header removes empty cells when the input is one column shenwei356/csvtk#238

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

encoding/csv: skipping of empty rows leads to loss of data in single-column datasets #39119

encoding/csv: skipping of empty rows leads to loss of data in single-column datasets #39119

kokes commented May 17, 2020 •

edited by ALTree

cagedmantis commented May 20, 2020

cagedmantis commented May 20, 2020

dsnet commented May 21, 2020

dsnet commented May 21, 2020

ianlancetaylor commented May 21, 2020

kokes commented May 22, 2020

kokes commented Jun 12, 2020 •

edited

gopherbot commented Jun 12, 2020

ianlancetaylor commented Jun 12, 2020

kokes commented Jun 12, 2020

ianlancetaylor commented Jun 12, 2020

msakrejda commented Nov 13, 2020 •

edited

gopherbot commented Nov 10, 2021

encoding/csv: skipping of empty rows leads to loss of data in single-column datasets #39119

encoding/csv: skipping of empty rows leads to loss of data in single-column datasets #39119

Comments

kokes commented May 17, 2020 • edited by ALTree

What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

What did you do?

What did you expect to see?

What did you see instead?

cagedmantis commented May 20, 2020

cagedmantis commented May 20, 2020

dsnet commented May 21, 2020

dsnet commented May 21, 2020

ianlancetaylor commented May 21, 2020

kokes commented May 22, 2020

kokes commented Jun 12, 2020 • edited

gopherbot commented Jun 12, 2020

ianlancetaylor commented Jun 12, 2020

kokes commented Jun 12, 2020

ianlancetaylor commented Jun 12, 2020

msakrejda commented Nov 13, 2020 • edited

gopherbot commented Nov 10, 2021

kokes commented May 17, 2020 •

edited by ALTree

What version of Go are you using (`go version`)?

What operating system and processor architecture are you using (`go env`)?

kokes commented Jun 12, 2020 •

edited

msakrejda commented Nov 13, 2020 •

edited