Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/link: don't put StringHeader into go.string.* #7384

Closed
gopherbot opened this issue Feb 21, 2014 · 9 comments
Closed

cmd/link: don't put StringHeader into go.string.* #7384

gopherbot opened this issue Feb 21, 2014 · 9 comments
Milestone

Comments

@gopherbot
Copy link

by tjarratt@pivotallabs.com:

I was creating a small utility that would print random strings and I needed a
dictionary. On unix, this is not a problem because /usr/share/dict/words exists, but for
windows I decided to include 3.5MB of strings directly in my source code as a slice of
strings.

Source code is here: 
https://github.com/tjarratt/go-experiments/tree/master/rw

I noticed that when I compiled with the rw_windows.go source file, my binary grew from
2.1MB to 13MB. Not terrible, but still pretty outlandish.

What steps will reproduce the problem?
If possible, include a link to a program on play.golang.org.
1. Have a small go utility
2. Add 3.5MB of strings as a slice of strings
3. Compile and compare file sizes

What is the expected output?
Probably something between 3.5MB and 7MB growth in the binary.

What do you see instead?
The binary is 11MB larger when 3.5MB of strings are present.


Which compiler are you using (5g, 6g, 8g, gccgo)?
Unsure.


Which operating system are you using?
OS X

Which version are you using?  (run 'go version')
go version go1.2 darwin/amd64

Please provide any additional information below.
@remyoudompheng
Copy link
Contributor

Comment 1:

The allocation size (size of the underlying array) of a slice of 240000 strings is
5.7MB. Not having these megabytes already written in the binary (to be directly mapped)
means generating them at runtime and slowing down program startup.

@minux
Copy link
Member

minux commented Feb 22, 2014

Comment 2:

you have 235886 strings, total string size (not including overhead) is 2257223.
Then the bundledDictionary slice has 235886 entry, each entry contains a
StringHeader, which is 16-byte on amd64, so you will need to add at least
235886 * 16 bytes of overhead (which is 3774176 bytes).
so actually you bundledDictionary will use at least 7918487 bytes of space,
far larger than the string data themselves.
I suggest you concatenate the strings to make a bigger one, and split at
runtime (or save the index of each string; or even better, use trie to store
the strings, so that it doesn't need any initialization at runtime)
PS: the reason why Go binary uses ~4MB more is that every string is actually
stored as a separate StringHeader in the Go string table, so each strings adds
further 16 bytes of overhead (3774176 bytes). Taking that into account,
now the binary uses 11692663 bytes to store all the strings and the slice,
and this number explains 11MB increase in binary size.
I don't think we can do anything here. Don't use too many strings.

Status changed to WorkingAsIntended.

@remyoudompheng
Copy link
Contributor

Comment 3:

I don't think we need to store string constants as StringHeaders. The overhead should
happen only once.

@minux
Copy link
Member

minux commented Feb 22, 2014

Comment 4:

ok.
That means the linker must keep track of which stringheader is referenced,
seems doable but it will need big changes to go.string.* section.
but perhaps we should first solve the 1.5GB RSS problem of 6g first.
(also cmd/6g will generate a 33040637 bytes rw.6!)
also, it turns out our go.string.* has fairly big space inefficiency problem.
each Go string is stored like this in the binary in the go.string.* section:
  pointer *byte // points to the payload, always this address + sizeof(uintptr)*2, redundant
  len uintptr // not redundant, but the compiler could almost always inline this number.
  payload byte[len+1] // always has 0-termination. why??
  padding byte[padlen] // pad to next uintptr-aligned for next StringHeader.
which means that each (8*n+k) byte string will take (8-k) bytes more.
(e.g. a 8 byte string will take 8 extra bytes of padding. Yes. you read this correctly.)
The problem seems to be just don't store StringHeader and always either
construct in runtime or (in this case) store in the temporary statictmp_0020.
which means we can concatenate all strings in the go.string.* section,
and this also enables substring reuse optimization (e.g. if there are both "a" and "ab"
strings in a Go program, we only store "ab" in the string section)
Too big a change for the current linker, let's consider this when cmd/link takes over.

Labels changed: added release-go1.4, repo-main.

Status changed to Accepted.

@bradfitz
Copy link
Contributor

Comment 5:

Josh (copied) was looking into this recently.

@josharian
Copy link
Contributor

Comment 6:

I came to the same conclusion as minux, that is a significant change and should wait for
cmd/link. There are definitely non-trivial binary size savings to be had here for all
programs; even hello world contains lots of small strings.

@matrixik
Copy link

Comment 7:

I don't know if this also belong here:
https://github.com/fiam/gounidecode
It have big map in https://github.com/fiam/gounidecode/blob/master/unidecode/table.go:
`var transliterations = map[rune]string`
Over 750KB input file make output binary bigger by almost 5 MB.
Fell free to remove this comment.
Best regards,
Dobrosław Żybort

@rsc
Copy link
Contributor

rsc commented Sep 15, 2014

Comment 8:

Labels changed: added release-go1.5, removed release-go1.4.

@bradfitz bradfitz modified the milestone: Go1.5 Dec 16, 2014
@rsc rsc removed accepted labels Apr 14, 2015
@rsc rsc changed the title cmd/ld: don't put StringHeader into go.string.* cmd/link: don't put StringHeader into go.string.* Jun 8, 2015
@gopherbot
Copy link
Author

CL https://golang.org/cl/11698 mentions this issue.

@rsc rsc closed this as completed in e8f2eb4 Jun 30, 2015
@golang golang locked and limited conversation to collaborators Jun 29, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants