encoding/csv: Lazy quotes reader does not properly parse quoted fields that end in a bare quote #56329

Closed
julianedwards opened this issue Oct 19, 2022 · 6 comments
Labels
FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@julianedwards

julianedwards commented Oct 19, 2022

What version of Go are you using (go version)?

$ go version
go version go1.19.2 darwin/amd64

Does this issue reproduce with the latest release?

Yes. I also saw the same issue on 1.18.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/julianedwards/Library/Caches/go-build"
GOENV="/Users/julianedwards/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/julianedwards/gocode/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/julianedwards/gocode"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.19.2"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/gq/_pcdqwn544g188vb6mhyffhm0000gn/T/go-build1771569906=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"log"
)

const example1 = `
c0,c1,c2
abc,123,
"abc",123,
"a"b"c",123,
""ab"c",123,
""abc"",123,
"a"b"c",123,
`
const example2 = `
c0,c1,c2
abc,123,
"a"bc"",123,
a"b"c,123,
`

func main() {
	// Example 1
	csvReader := csv.NewReader(bytes.NewReader([]byte(example1)))
	// If I do not set this to a negative number, the reader can fail since
	// it fails to split by the comma delimiter and combines fields from
	// the rest of the row and the subsequent lines.
	csvReader.FieldsPerRecord = -1
	csvReader.LazyQuotes = true

	records, err := csvReader.ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("Example 1")
	for i, record := range records {
		fmt.Println("ROW ", i)
		fmt.Println(record)
	}

	// Example 2
	csvReader = csv.NewReader(bytes.NewReader([]byte(example2)))
	csvReader.FieldsPerRecord = -1
	csvReader.LazyQuotes = true

	records, err = csvReader.ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("\nExample 2")
	for i, record := range records {
		fmt.Println("ROW ", i)
		fmt.Println(record)
	}
}

What did you expect to see?

Example 1
ROW 0
[c0 c1 c2]
ROW 1
[abc 123 ]
ROW 2
[abc 123 ]
ROW 3
[a"b"c 123 ]
ROW 4
["ab"c 123 ]
ROW 5
["abc" 123 ]
ROW 6
[a"b"c 123 ]

Example 2
ROW 0
[c0 c1 c2]
ROW 1
[abc 123 ]
ROW 2
[a"bc" 123 ]
ROW 3
[a"b"c 123 ]

What did you see instead?

Example 1
ROW  0
[c0 c1 c2]
ROW  1
[abc 123 ]
ROW  2
[abc 123 ]
ROW  3
[a"b"c 123 ]
ROW  4
["ab"c 123 ]
ROW  5
["abc",123,
"a"b"c 123 ]

Example 2
ROW  0
[c0 c1 c2]
ROW  1
[abc 123 ]
ROW  2
[a"bc",123,
a"b"c,123,
]

Notes

  • Quoted fields with bare quotes starting and ending in the middle are okay ("*"*"*",)
  • Quoted fields starting with a bare quote that ends before the last character are okay (""*"*",)
  • Quoted fields that end with a bare quote are not parsed properly ("*"",)
  • Quoted fields that start and end with bare quotes are not parsed properly (""*"",); this is really a subset of the previous pattern.

I looked into the csv library code and the culprit seems to be these two lines. When finding a closing ", the line position should not increment, similar to how the lazy quotes case does not increment the line position when it finds an opening " in a quoted field. The removal of quotes should always be handled by this block of code, never in the subsequent switch statement.

The premature incrementing of the line position causes the next iteration of the loop to do one of two things: (1) hit this clause in the if/else statement because there are no more quotes in the line, so it assumes the end of the line was reached (squashing any subsequent fields in the line and, potentially, the following lines into one field), or (2) read until the next quote in the line, squashing all of the fields in between together (which can lead to case (1) if the next quote is the last rune of the field).
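The four patterns from the notes can be checked one row at a time. The following is a minimal sketch, not part of the original report: `parseLazy` is a hypothetical helper that applies the same reader settings as the program above (`LazyQuotes = true`, `FieldsPerRecord = -1`) to a single row and reports how many fields come back.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parseLazy (hypothetical helper) reads every record from input with
// LazyQuotes enabled and a variable field count, mirroring the reader
// settings used in the report.
func parseLazy(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.FieldsPerRecord = -1
	r.LazyQuotes = true
	return r.ReadAll()
}

func main() {
	rows := []string{
		`"a"b"c",123,`, // bare quotes in the middle
		`""ab"c",123,`, // bare quote at the start
		`"a"bc"",123,`, // bare quote at the end
		`""abc"",123,`, // bare quotes at both ends
	}
	for _, row := range rows {
		records, err := parseLazy(row + "\n")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		if len(records) == 0 {
			fmt.Println("no records for", row)
			continue
		}
		fmt.Printf("%-16s -> %d field(s): %q\n", row, len(records[0]), records[0])
	}
}
```

With the behavior described in this report, the first two rows should come back as three fields each, while the last two come back as a single field that swallows the rest of the row.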

@ianlancetaylor
Contributor

I didn't try to understand every case. I just note that you say that for row 2 you expect ["abc" 123 ] but you get [abc 123 ]. The latter appears consistent with the test cases in readTests in reader_test.go, which remove the quotes from a fully quoted string. It also appears consistent with the docs for LazyQuotes, which say that it permits extra quotes but imply that it still permits quoted strings (and removes the quotes).

If I'm correct--and I may not be--then we aren't going to change this because it would change the existing behavior of this package. In general we want to change this package as little as possible because experience tells us that any change will break current users. The fix for your case may be to edit the input or the output, or to copy the package (it's pretty small) and change it to suit your purposes.

@ianlancetaylor ianlancetaylor added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Oct 20, 2022
@julianedwards
Author

julianedwards commented Oct 21, 2022

Hey @ianlancetaylor, sorry for the confusion. I had some copy/paste errors in the expected output, which I have now fixed. The second row in the first example should, as you have stated, get returned as [abc 123]. It is row 5 in the first example and row 2 in the second example that cause issues.

@ianlancetaylor
Contributor

Per RFC 4180, CSV permits a " character to appear in a quoted field, by writing "". Your input is

""abc"",123,
"a"b"c",123,

The first " starts the quoted field. The second " is accepted because LazyQuotes is true. The pair "" turns into a single ". The next two " are accepted because LazyQuotes is true. The final ", alone, before a comma, ends the quoted field. The result is a value containing a newline:

"abc",123,
"a"b"c

This is weird but it seems consistent.
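The walk-through above can be reproduced with a short program. This is a sketch under the same reader settings as the original report; `readFirstRecord` is a hypothetical helper name, and `%q` is used so the embedded newline is visible in the output.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// readFirstRecord (hypothetical helper) returns the first record of input,
// read with LazyQuotes enabled and a variable field count.
func readFirstRecord(input string) ([]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.FieldsPerRecord = -1
	r.LazyQuotes = true
	return r.Read()
}

func main() {
	// The two input lines quoted in the comment above.
	const input = `""abc"",123,
"a"b"c",123,
`
	record, err := readFirstRecord(input)
	if err != nil {
		log.Fatal(err)
	}
	// The first field spans both input lines; the remaining fields come
	// from the tail of the second line.
	for i, field := range record {
		fmt.Printf("field %d: %q\n", i, field)
	}
}
```

The first field should hold the text from the first line through the `c` of the second line, newline included, followed by the fields `123` and an empty string.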

@julianedwards
Author

julianedwards commented Oct 24, 2022

Ah I see, yes, according to the spec it would require a preceding ". But it seems that other CSV readers do a better job at maintaining the rows as expected.

I wrote the following to a file example.csv—the last line is the example with double quotes from the RFC 4180 spec—and read the file using Python's built-in CSV reader:

c0,c1,c2
abc,123,
"abc",123,
"a"b"c",123,
""ab"c",123,
""abc"",123,
"a"b"c",123,
"aaa","b""bb","ccc"

>>> import csv
>>> with open('example.csv') as f:
...     r = csv.reader(f)
...     for row in r:
...             print(row)
... 
['c0', 'c1', 'c2']
['abc', '123', '']
['abc', '123', '']
['ab"c"', '123', '']
['ab"c"', '123', '']
['abc""', '123', '']
['ab"c"', '123', '']
['aaa', 'b"bb', 'ccc']

This seems to handle the double quotes more correctly.

I also imported it to a Google Sheet and got the same output:
[Image: Screen Shot 2022-10-24 at 1 53 04 PM]

@ianlancetaylor
Contributor

We've learned over the years that there is a vast variety of CSV formats out there. Rather than add knobs for all of them, we've decided that the standard library package will focus only on the RFC 4180 format. We added the configuration knob LazyQuotes before we made that decision.

For people who need a different format, we suggest copying the package--it's only a few hundred lines of code--and modifying it for your purpose.
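For cases where the producer of the data is under your control, one option that stays within RFC 4180 is to let csv.Writer do the quoting, since it doubles embedded quotes, and then read the result back with the default strict reader. The sketch below assumes that situation; `encodeRecord` and `decodeRecord` are hypothetical helper names, and this does not help with input that is already malformed.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// encodeRecord (hypothetical helper) writes one record with csv.Writer,
// which quotes fields as needed and doubles any embedded quote characters.
func encodeRecord(record []string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write(record); err != nil {
		return "", err
	}
	w.Flush()
	return buf.String(), w.Error()
}

// decodeRecord (hypothetical helper) reads one record with the default,
// strict reader; LazyQuotes is left disabled.
func decodeRecord(line string) ([]string, error) {
	return csv.NewReader(strings.NewReader(line)).Read()
}

func main() {
	// A field that ends in a quote character, like row 2 of example 2.
	original := []string{`a"bc"`, "123", ""}

	encoded, err := encodeRecord(original)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("encoded: %q\n", encoded)

	decoded, err := decodeRecord(encoded)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("decoded: %q\n", decoded)
}
```

Because the encoded line escapes every embedded quote as "", the strict reader should recover the original three fields without needing LazyQuotes.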

@seankhliao
Member

Closing as working as intended for encoding/csv.
