
encoding/csv: Read returns strings that reference the whole line #30222

Open · mtgto opened this issue Feb 14, 2019 · 13 comments
Labels: NeedsInvestigation, Performance

mtgto commented Feb 14, 2019

What version of Go are you using (go version)?

$ go version
go version go1.11.5 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/user/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/user/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.11.5/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.11.5/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/lx/fcsz9swj2ysdrvthsnxnxjzr0000gn/T/go-build061484716=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

"encoding/csv" Reader.Read() returns record []string, which has reference to the string of whole line.
I wrote my code hold only one field in record, and found the usage of memory is huger than expects me.

To confirm this, I wrote some sample code:
https://play.golang.org/p/OpsuPDjMRfJ

This line shows that each field of the record holds a reference to the line string:
https://github.com/golang/go/blob/go1.11.5/src/encoding/csv/reader.go#L386

What did you expect to see?

Only the fields of the record that I actually use should be retained.
In the playground sample, the first field ("0") is kept, but the second field ("0 ... 0") should not be.

What did you see instead?

The program held over 100002*10000 bytes in memory.
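
A self-contained sketch of the same demonstration (the sizes here are chosen arbitrarily, not taken from the playground code):

package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"runtime"
	"strings"
)

func main() {
	// One record per line: a 1-byte first field, a ~100 KB second field.
	line := "0," + strings.Repeat("x", 100000) + "\n"
	r := csv.NewReader(strings.NewReader(strings.Repeat(line, 1000)))

	var kept []string
	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		kept = append(kept, record[0]) // keep only the tiny field
	}

	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Expect on the order of 100 MB, not 1 KB: each kept field is a
	// sub-slice of its whole line, so every line stays reachable.
	fmt.Printf("kept %d fields, HeapAlloc = %d MB\n", len(kept), m.HeapAlloc>>20)
}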

dsnet (Member) commented Feb 14, 2019

The current implementation deliberately pins a large string and creates sub-slices out of the string. Doing this provides a significant performance speed-up, at the expense of potentially pinning a lot of memory (as you are running into).

For a little bit of extra complexity, we can do something where it batch allocates fields up to a certain size, providing an upper bound on the maximum amount of memory that is pinned, but still getting most of the performance benefits.
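
A rough sketch of that batching idea, with invented names and an assumed chunkSize (an illustration only, not the encoding/csv implementation): fields are copied into shared chunks and handed out as sub-slices, so a retained field pins at most roughly one chunk. The caller would flush once more at end of input.

package batch

const chunkSize = 1 << 12 // assumed upper bound on pinned memory per field

type allocator struct {
	pending []byte    // raw field bytes queued in the current chunk
	ends    []int     // end offset of each queued field within pending
	dsts    []*string // destination for each queued field's string
}

func (a *allocator) add(field []byte, dst *string) {
	a.pending = append(a.pending, field...)
	a.ends = append(a.ends, len(a.pending))
	a.dsts = append(a.dsts, dst)
	if len(a.pending) >= chunkSize {
		a.flush()
	}
}

func (a *allocator) flush() {
	s := string(a.pending) // one string allocation for the whole chunk
	start := 0
	for i, end := range a.ends {
		*a.dsts[i] = s[start:end] // sub-slice of the shared chunk
		start = end
	}
	a.pending, a.ends, a.dsts = a.pending[:0], a.ends[:0], a.dsts[:0]
}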

mtgto (Author) commented Feb 14, 2019

Thanks for your quick reply, @dsnet. I understand it's a trade-off between CPU performance and memory usage.
You can close this issue if you want; I read the documentation on opening issues, but I'm not sure whether I should be the one to close it.

Below I describe the real situation in which I encountered this issue, and what I would like.

My real situation

My program reads a 16-million-line CSV file (about 1 GB).
One field of each CSV record is stored as a string; the others are converted to int.

https://github.com/mtgto/pediaroute-go/blob/9be69615ae58f3836f0a8c00115ef2dec0392b0c/internal/app/core/core.go#L20-L68

I would like...

  • The godoc for Read/ReadAll in package "encoding/csv" to mention that the returned fields pin a larger string.
  • Or an "encoding/csv" Reader flag that avoids sub-slices.

dsnet (Member) commented Feb 14, 2019

I would like...

  • The godoc for Read/ReadAll in package "encoding/csv" to mention that the returned fields pin a larger string.
  • Or an "encoding/csv" Reader flag that avoids sub-slices.

As a general principle, I think we should fight hard against documentation (or API knobs) that constrains the implementation to one particular approach.

trade-off between CPU performance and memory usage.

Correct. I think there is room for improvement here, and I would like to avoid API expansion or documentation of implementation details.

mtgto (Author) commented Feb 14, 2019

As a general principle, I think we should fight hard against documentation (or API knobs) that constrains the implementation to one particular approach.

Understood; it was only a request. I have already worked around the problem with a deep copy, but I worry that other people will run into the same situation.

Correct. I think there is room for improvement here, and I would like to avoid API expansion or documentation of implementation details.

Thanks.

bcmills (Contributor) commented Feb 28, 2019

You can always deep-copy the strings if you don't want the aliasing behavior.
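
Concretely, the caller-side copy is a short sketch like this (assuming r is a *csv.Reader in scope, inside a function that returns error, with index 0 as the field to keep):

record, err := r.Read()
if err != nil {
	return err
}
// Converting through []byte re-allocates just these bytes, so the
// copy no longer aliases the backing line.
key := string([]byte(record[0]))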

On the other hand, if we made copies unprompted, users who want aliasing would have no way to recover that behavior.

bcmills (Contributor) commented Feb 28, 2019

On the third hand, the memory behavior of string-copying is an implementation detail of the garbage collector.

In theory, the GC could notice that only a small fraction of the string is live during garbage collection and allocate smaller substrings at that point.

(CC @RLH @aclements for GC knowledge.)

bcmills added this to the Unplanned milestone Feb 28, 2019
bcmills (Contributor) commented Feb 28, 2019

I would bet that there is already an issue filed somewhere for chopping up large retained strings, but I can't find it.

mtgto (Author) commented Feb 28, 2019

@bcmills I already understand that this is a trade-off between performance and memory usage.
I also know that a deep copy helps reduce memory usage.
#30222 (comment)

My request is that the "encoding/csv" API documentation mention this, or that a new option be added so the csv Reader deep-copies strings, so that no one else runs into the same trouble.

bcmills (Contributor) commented Mar 1, 2019

My point is, at some point the runtime itself may do that deep-copy for you. That would save the expense of copying if the data turns out to be short-lived, and would not require any new options or API surface.

I'm also not sure that it merits a mention in the package documentation: string operations are pervasive in Go programs, and they all alias by default. (Allocating new strings is the exception, not the rule.) Given that, I don't think that repeating that information in every individual API would be productive either.

bcmills added the NeedsInvestigation label Mar 1, 2019
aclements (Member) commented

I would bet that there is already an issue filed somewhere for chopping up large retained strings, but I can't find it.

It was discussed a little here. I think you and I may have been talking in person about how GC could recognize that only part of a string is reachable and chop it up. As far as I know, there's no issue filed for this.

This is of course actually quite hard to do, since you need to not only recognize when you've reached a string, but keep track of extra information on what part of it you've reached. It's extra hard if you want to chop strings down from both ends, since then you need to recognize not just the string object, but the string header that points to the string object so you can read the length from the header. And even if you can do all of this, updating all of the references to point to the new string(s) is quite difficult (and probably would require specializing the write barrier).
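
A small illustration of what the collector would have to track (a Go string header is a pointer/length pair, and sub-slices share the backing bytes):

package main

import "strings"

func main() {
	s := strings.Repeat("x", 1<<20) // 1 MB backing array
	a := s[:10]                     // header: {ptr, len 10}
	b := s[500000:500100]           // header: {ptr+500000, len 100}
	_, _ = a, b
	// If only a and b stay live, reclaiming the other bytes would mean
	// allocating two small strings and rewriting the ptr/len fields in
	// every header that points into s, including interior pointers
	// such as b's.
}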

ianlancetaylor (Contributor) commented

You don't have to update all references, though. You just have to somehow decide when this is worth doing, and then let each instance take care of itself.

aclements (Member) commented

I'm not sure I follow. If you don't update all of the references to a moved string, you'll have to retain the old string, which will just increase the memory footprint rather than decreasing it.

fgimian commented Feb 10, 2020

I came across this issue today and spent many hours troubleshooting it. I wrote the exact same program in Rust and Go; the Go version was consuming 2 GB of memory while the Rust version consumed 400 MB, with virtually the same performance.

Personally, I think the current behaviour is extremely confusing.

In the end, I am working around it by doing something similar to @mtgto, but condensed to one line:

items = append(items, string([]byte(record[3])))
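
On Go 1.18 and later, strings.Clone performs the same copy and states the intent directly (assuming "strings" is imported):

items = append(items, strings.Clone(record[3])) // Go 1.18+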

tux-mind added a commit to evilsocket/sum that referenced this issue Feb 11, 2020
csv.Reader returns strings pointing to the same backing array.
Therefore, using one of those directly will hold memory for the entire line that was read, instead of just the column of interest.

ref: golang/go#30222