Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/link, embed: store only one copy of an embedded []byte #52533

Open
pjebs opened this issue Apr 24, 2022 · 17 comments
Open

cmd/link, embed: store only one copy of an embedded []byte #52533

pjebs opened this issue Apr 24, 2022 · 17 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@pjebs
Copy link
Contributor

pjebs commented Apr 24, 2022

I am using the caire package: https://github.com/esimov/caire/blob/master/process.go#L26
It embeds a file called facefinder required for a common dependency called pigo.

My application also uses the pigo package and independently needs to embed the same file.

Currently the same file's data is being embedded twice. Perhaps as an optimisation, it can calculate the hashes of each embedded file (including inside all the dependencies), and store the data only once for duplicates.

This proposal is only for single file embeds and not directories.

@pjebs pjebs added the Proposal label Apr 24, 2022
@gopherbot gopherbot added this to the Proposal milestone Apr 24, 2022
@ianlancetaylor
Copy link
Contributor

We can't do that for this example. When you use go:embed with a []byte, the []byte is initialized with the contents of the file. But the program can then change the contents of the []byte, as it can with any []byte value. So each separate go:embed requires a different []byte value.

@ianlancetaylor ianlancetaylor added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Apr 25, 2022
@pjebs
Copy link
Contributor Author

pjebs commented Apr 25, 2022

How about this?

The first []byte embed reference, it's stored as currently implemented. For all others, the app populates it with a new copy as part of the app loading process.

It's not a matter of doing it ourself in an func init() because we don't have access to what is going on in dependency packages.

Essentially this proposal is about reducing executable's file size in exchange for app start-up time - for cases where single files are duplicated.

@seankhliao
Copy link
Member

This doesn't sound like a very common problem, and seems it would be better served by the dependency exposing the embedded data, either there or as a separate package, so the data doesn't get out of sync.

@mvdan
Copy link
Member

mvdan commented Apr 25, 2022

As a reply to Ian: perhaps the deduplication could happen when embedded as a string. There is another advantage with strings for embedding already: they can live in read-only memory.

@randall77
Copy link
Contributor

@mvdan I believe we already deduplicate when embedding strings.

@DeedleFake
Copy link

DeedleFake commented Apr 25, 2022

@randall77

Just ran a quick test. Yeah, it does. Running strings confirms that the string only appears in the file once, and printing the string headers shows identical data pointers.

Edit: Realized that I didn't test it with identical, but separate, files embedded in two separate packages, so I did that. Same result.

@pjebs
Copy link
Contributor Author

pjebs commented Apr 29, 2022

I'm not sure if all the suggestions regarding deduplication already taking place for strings is a suggestion that the feature already exists if I choose to just embed as a string instead - OR if the suggestion is that it's already been done for strings and therefore for consistency, this feature should also be implemented for []byte as a tacit agreement of my proposal.

If it's the former, we have no control over what dependencies are doing internally. The proposal is about minimising executable file size.

@DeedleFake
Copy link

I believe that it's actually being used as evidence that it was deliberately not implemented for []byte.

@pjebs
Copy link
Contributor Author

pjebs commented Apr 29, 2022

I have updated the proposal to treat string and []byte embeds as equivalent for deduplication purposes.

As such, all []byte embeds should be embedded as a string embed "behind-the-scenes", if the file hashes are the same (since deduplication is already supported), and then at app-start-up, copied to the relevant []byte reference.

@randall77
Copy link
Contributor

It is not immediately clear to me that it is worth doing. You're basically trading off binary size for startup time and final memory usage.

If you embed the same []byte twice, you now have to copy that data twice on startup, and then you've got 3 copies of it in memory, at least until/if the original string copy gets uncached/paged out/whatever.

In the current scheme (embed it twice in the data section), then the binary is bigger but the startup time, and the ram use, is smaller.

@pjebs It sounds like you are more concerned about binary size than the other two, but maybe that's not true for everyone.

@pjebs
Copy link
Contributor Author

pjebs commented Apr 29, 2022

@randall77 That's correct. I am more concerned with file size. Perhaps it's not true for everyone. Also bear in mind that go is supporting WASM, where on the browser, file size download is probably the bigger concern.

Also in the current scheme, it could be embedded multiple times > twice. If memory is the issue, under the proposal, the developer has control over clearing the []bytes if it's not truely necessary. In the current scheme, the developer has no control over minimising file size.

@DeedleFake
Copy link

Also bear in mind that go is supporting WASM, where on the browser, file size download is probably the bigger concern.

That sounds like this should be a special optimization just for WASM, then. Or make it an option in the compiler with a different default for WASM output, perhaps.

@seankhliao
Copy link
Member

isn't wasm usually served compressed (eg gzip over http) in which case simple duplication is mostly eliminated?

@pjebs
Copy link
Contributor Author

pjebs commented May 3, 2022

I've never heard anyone complaining about Go's start-up time, but I've heard plenty complaining about file sizes.

@ianlancetaylor ianlancetaylor removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label May 4, 2022
@ianlancetaylor
Copy link
Contributor

As I understand the current proposal, there is on API change here, so this doesn't to go through the proposal process.

The suggestion here is that when the same file is embedded more than once for a []byte variable, the file contents be stored only once in the executable. Then at startup time the file contents will be copied into a separate []byte variable for each copy.

@ianlancetaylor ianlancetaylor changed the title proposal: embed: Smarter embedding cmd/link, embed: store only one copy of an embedded file May 4, 2022
@ianlancetaylor ianlancetaylor added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. and removed Proposal labels May 4, 2022
@ianlancetaylor ianlancetaylor modified the milestones: Proposal, Backlog May 4, 2022
@pjebs
Copy link
Contributor Author

pjebs commented May 5, 2022

@ianlancetaylor The idea is for Go compiler to treat all []byte embeds as string embeds for deduplication purposes (Since string embeds have built-in deduplication). Then prior to start-up, copy the data to the []byte embeds.

@ianlancetaylor
Copy link
Contributor

Thanks, I think we are saying the same thing.

@seankhliao seankhliao changed the title cmd/link, embed: store only one copy of an embedded file cmd/link, embed: store only one copy of an embedded []byte Jun 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

7 participants