proposal: encoding/csv: support QUOTE_NONE in Reader #23344
Comments
The main complexity of the CSV implementation is needing to deal with multi-line quoted fields. Once you remove the semantics of quoted strings, it seems like this problem is more easily solved using bufio.Scanner and strings.Split:

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"strings"
)

func main() {
	// Tabs are written as \t so the delimiter stays visible.
	r := strings.NewReader("f1value1\tf2value1\tf3\"value1\"\n" +
		"f1value2\tf2value2\tf3val\"ue2\n" +
		"f1value3\tf2value3\t\"f3value3\n")
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		// Split each line on the delimiter; quotes get no special treatment.
		fields := strings.Split(scanner.Text(), "\t")
		fmt.Printf("Fields: %q\n", fields)
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("scanner.Err: %v", err)
	}
	// Output:
	// Fields: ["f1value1" "f2value1" "f3\"value1\""]
	// Fields: ["f1value2" "f2value2" "f3val\"ue2"]
	// Fields: ["f1value3" "f2value3" "\"f3value3"]
}
```

There is a performance detriment to
There are many, many variations of CSV out there. We decided years ago that the standard library's encoding/csv package should stick to RFC 4180. If you have a different format, I recommend that you make your own copy of encoding/csv/reader.go and modify it for your use case. Or do as @dsnet suggests. I'm going to close this proposal because, following past decisions in this area, we aren't going to do this. Please comment if you disagree, and explain why this situation is different.
I would ask: are decisions made long ago never revisited? For example, in https://golang.org/pkg/encoding/csv/#Reader the TrailingComma field is now ignored; that must have come from a revised decision.
In the same way, if the LazyQuotes design is odd (I have done CSV processing in many other languages, and their parsing libraries never had an option like LazyQuotes) and causes more confusion than it resolves, why can't it be marked as ignored as well?
Furthermore, if the built-in encoding/csv package is barely useful and people have to pull in a third-party implementation, why not adopt another CSV implementation as the new built-in encoding/csv? I know backward compatibility is a big concern here, but how many languages, libraries, and pieces of software have been discarded over the history of computing? Would you say 5 or 10 years is enough time to allow incompatible library changes? Or will some be allowed when Go 2 is released? After all, my solution is to copy the
Similar to the QUOTE_NONE option of Python's csv reader: https://docs.python.org/2/library/csv.html#csv.QUOTE_NONE

I'm currently on a data engineering project that needs to handle a huge number of big CSV files (several GB each, gzip-compressed) generated by another, probably proprietary, software system. The files use '\t' as the delimiter and have very inconsistent quoting. Python's csv reader with the csv.QUOTE_NONE option handles all of them without any problem, but it is not fast enough. I'm considering a rewrite in Go, but encoding/csv does not have an option similar to Python's QUOTE_NONE.

I tried setting LazyQuotes = true (https://golang.org/pkg/encoding/csv/#Reader). It helped somewhat, but some cases still fail. After digging hard, I found the root cause: the problem is values like `"f3value3`, which begin with a `"`. The current behavior of Go's csv reader is to read until another `"` with the delimiter as the end boundary, which makes this single field take several GB of memory, possibly more, until the process is OOM-killed.

I have made a local copy of the csv/reader.go file with some changes that give behavior similar to Python's QUOTE_NONE. It could possibly be extended to all four quote options (QUOTE_NONE plus QUOTE_NONNUMERIC, QUOTE_MINIMAL, and QUOTE_ALL). Let me know if anyone likes this approach.

BTW, after reading the code and the documentation at https://golang.org/pkg/encoding/csv/#Reader, I'm still not clear on why a LazyQuotes option is needed; it is confusing and complicates the code. Does anyone have comments?