
encoding/csv: Read returns strings that reference the whole line #30222

Open · mtgto opened this issue Feb 14, 2019 · 13 comments
Labels: NeedsInvestigation, Performance

mtgto commented Feb 14, 2019

What version of Go are you using (go version)?

$ go version
go version go1.11.5 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/user/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/user/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/Cellar/go/1.11.5/libexec"
GOTMPDIR=""
GOTOOLDIR="/usr/local/Cellar/go/1.11.5/libexec/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/lx/fcsz9swj2ysdrvthsnxnxjzr0000gn/T/go-build061484716=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

"encoding/csv" Reader.Read() returns record []string, which has reference to the string of whole line.
I wrote my code hold only one field in record, and found the usage of memory is huger than expects me.

To confirm this, I wrote some sample code:
https://play.golang.org/p/OpsuPDjMRfJ

This line shows that each field of the record holds a reference to the line string:
https://github.com/golang/go/blob/go1.11.5/src/encoding/csv/reader.go#L386

What did you expect to see?

Only the fields of the record that I actually use should be retained.
In the playground sample, the first field ("0") is kept, but the second field ("0 ... 0") should not be.

What did you see instead?

The program held over 100002*10000 bytes in memory.
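
A self-contained sketch of the same demonstration (the sizes here are chosen arbitrarily, not taken from the playground code):

package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"runtime"
	"strings"
)

func main() {
	// One record per line: a 1-byte first field, a ~100 KB second field.
	line := "0," + strings.Repeat("x", 100000) + "\n"
	r := csv.NewReader(strings.NewReader(strings.Repeat(line, 1000)))

	var kept []string
	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		kept = append(kept, record[0]) // keep only the tiny field
	}

	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Expect on the order of 100 MB, not 1 KB: each kept field is a
	// sub-slice of its whole line, so every line stays reachable.
	fmt.Printf("kept %d fields, HeapAlloc = %d MB\n", len(kept), m.HeapAlloc>>20)
}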

dsnet (Member) commented Feb 14, 2019

The current implementation deliberately pins a large string and creates sub-slices out of the string. Doing this provides a significant performance speed-up, at the expense of potentially pinning a lot of memory (as you are running into).

For a little bit of extra complexity, we can do something where it batch allocates fields up to a certain size, providing an upper bound on the maximum amount of memory that is pinned, but still getting most of the performance benefits.
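
A rough sketch of that batching idea, with invented names and an assumed chunkSize (an illustration only, not the encoding/csv implementation): fields are copied into shared chunks and handed out as sub-slices, so a retained field pins at most roughly one chunk. The caller would flush once more at end of input.

package batch

const chunkSize = 1 << 12 // assumed upper bound on pinned memory per field

type allocator struct {
	pending []byte    // raw field bytes queued in the current chunk
	ends    []int     // end offset of each queued field within pending
	dsts    []*string // destination for each queued field's string
}

func (a *allocator) add(field []byte, dst *string) {
	a.pending = append(a.pending, field...)
	a.ends = append(a.ends, len(a.pending))
	a.dsts = append(a.dsts, dst)
	if len(a.pending) >= chunkSize {
		a.flush()
	}
}

func (a *allocator) flush() {
	s := string(a.pending) // one string allocation for the whole chunk
	start := 0
	for i, end := range a.ends {
		*a.dsts[i] = s[start:end] // sub-slice of the shared chunk
		start = end
	}
	a.pending, a.ends, a.dsts = a.pending[:0], a.ends[:0], a.dsts[:0]
}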

mtgto (Author) commented Feb 14, 2019

Thanks for your quick reply, @dsnet. I understand it's a trade-off between CPU performance and memory usage.
You can close this issue if you want; I read the documentation on opening issues, but I'm not sure whether I should be the one to close it.

Below I describe the real situation in which I encountered this issue, and what I would like.

My real situation

My program reads a 16-million-line CSV file (about 1 GB).
One field of each CSV record is stored as a string; the others are converted to int.

https://github.com/mtgto/pediaroute-go/blob/9be69615ae58f3836f0a8c00115ef2dec0392b0c/internal/app/core/core.go#L20-L68

I would like...

  • The godoc for Read/ReadAll in package "encoding/csv" to mention that the returned fields pin a larger string.
  • Or an "encoding/csv" Reader flag that avoids sub-slices.

dsnet (Member) commented Feb 14, 2019

I would like...

  • The godoc for Read/ReadAll in package "encoding/csv" to mention that the returned fields pin a larger string.
  • Or an "encoding/csv" Reader flag that avoids sub-slices.

As a general principle, I think we should fight hard against documentation (or API knobs) that constrains the implementation to one particular approach.

trade-off between CPU performance and memory usage.

Correct. I think there is room for improvement here, and I would like to avoid API expansion or documentation of implementation details.

mtgto (Author) commented Feb 14, 2019

As a general principle, I think we should fight hard against documentation (or API knobs) that constrains the implementation to one particular approach.

Understood; it was only a request. I have already worked around the problem with a deep copy, but I worry that other people will run into the same situation.

Correct. I think there is room for improvement here, and I would like to avoid API expansion or documentation of implementation details.

Thanks.

bcmills (Contributor) commented Feb 28, 2019

You can always deep-copy the strings if you don't want the aliasing behavior.
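
Concretely, the caller-side copy is a short sketch like this (assuming r is a *csv.Reader in scope, inside a function that returns error, with index 0 as the field to keep):

record, err := r.Read()
if err != nil {
	return err
}
// Converting through []byte re-allocates just these bytes, so the
// copy no longer aliases the backing line.
key := string([]byte(record[0]))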

On the other hand, if we made copies unprompted, users who want aliasing would have no way to recover that behavior.

bcmills (Contributor) commented Feb 28, 2019

On the third hand, the memory behavior of string-copying is an implementation detail of the garbage collector.

In theory, the GC could notice that only a small fraction of the string is live during garbage collection and allocate smaller substrings at that point.

(CC @RLH @aclements for GC knowledge.)

bcmills added this to the Unplanned milestone Feb 28, 2019
bcmills (Contributor) commented Feb 28, 2019

I would bet that there is already an issue filed somewhere for chopping up large retained strings, but I can't find it.

mtgto (Author) commented Feb 28, 2019

@bcmills I already understand that this is a trade-off between performance and memory usage.
I also know that a deep copy helps reduce memory usage.
#30222 (comment)

My request is that the "encoding/csv" API documentation mention this, or that a new option be added so the csv Reader deep-copies strings, so that no one else runs into the same trouble.

bcmills (Contributor) commented Mar 1, 2019

My point is, at some point the runtime itself may do that deep-copy for you. That would save the expense of copying if the data turns out to be short-lived, and would not require any new options or API surface.

I'm also not sure that it merits a mention in the package documentation: string operations are pervasive in Go programs, and they all alias by default. (Allocating new strings is the exception, not the rule.) Given that, I don't think that repeating that information in every individual API would be productive either.

bcmills added the NeedsInvestigation label Mar 1, 2019
aclements (Member) commented

I would bet that there is already an issue filed somewhere for chopping up large retained strings, but I can't find it.

It was discussed a little here. I think you and I may have been talking in person about how GC could recognize that only part of a string is reachable and chop it up. As far as I know, there's no issue filed for this.

This is of course actually quite hard to do, since you need to not only recognize when you've reached a string, but keep track of extra information on what part of it you've reached. It's extra hard if you want to chop strings down from both ends, since then you need to recognize not just the string object, but the string header that points to the string object so you can read the length from the header. And even if you can do all of this, updating all of the references to point to the new string(s) is quite difficult (and probably would require specializing the write barrier).
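
A small illustration of what the collector would have to track (a Go string header is a pointer/length pair, and sub-slices share the backing bytes):

package main

import "strings"

func main() {
	s := strings.Repeat("x", 1<<20) // 1 MB backing array
	a := s[:10]                     // header: {ptr, len 10}
	b := s[500000:500100]           // header: {ptr+500000, len 100}
	_, _ = a, b
	// If only a and b stay live, reclaiming the other bytes would mean
	// allocating two small strings and rewriting the ptr/len fields in
	// every header that points into s, including interior pointers
	// such as b's.
}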

ianlancetaylor (Contributor) commented

You don't have to update all references, though. You just have to somehow decide when this is worth doing, and then let each instance take care of itself.

aclements (Member) commented

I'm not sure I follow. If you don't update all of the references to a moved string, you'll have to retain the old string, which will just increase the memory footprint rather than decreasing it.

fgimian commented Feb 10, 2020

I came across this issue today and spent many hours troubleshooting it. I wrote the exact same program in Rust and Go; the Go version was consuming 2 GB of memory while the Rust version consumed 400 MB, with virtually the same performance.

Personally, I think the current behaviour is extremely confusing.

In the end, I am working around it by doing something similar to @mtgto, but condensed to one line:

items = append(items, string([]byte(record[3])))
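
On Go 1.18 and later, strings.Clone performs the same copy and states the intent directly (assuming "strings" is imported):

items = append(items, strings.Clone(record[3])) // Go 1.18+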

tux-mind added a commit to evilsocket/sum that referenced this issue Feb 11, 2020
csv.Reader returns strings pointing to the same backing array.
Therefore, using one of those directly will hold memory for the entire line that was read, instead of just the column of interest.

ref: golang/go#30222