
proposal: archive/zip: NewZipWriter should support Ahead-of-time CRC and Sizes #23301

Closed
steve-gray opened this issue Jan 2, 2018 · 15 comments
@steve-gray

Go's archive/zip package does not support pre-calculating the file sizes and CRC-32 values used in the PKZIP local file header, so it must set the 'Content Descriptor' (data descriptor) general-purpose flag and append a descriptor structure after each included file's data. This flag is not compatible with various older software, precluding archive/zip from being used in some application scenarios.

What version of Go are you using (go version)?

1.9

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

OSX High Sierra (Intel x64)

What did you do?

  • When writing zip files, Go always defaults to the 'Content Descriptor' method. Some legacy applications do not understand the 0x08 general-purpose flag, and when the content size is known in advance the descriptor is unnecessary.

What did you expect to see?

  • The ability to set a 'PrecalculateSize' flag (or similar) on the writer, to avoid writing the Content Descriptor structure and instead use pre-computed fields in the file header and the central directory.

What did you see instead?

  • Go's zip output differs from that of WinZip and the OS X/Linux zip command: the size and CRC fields in the local file header are set to 0 and each file is followed by a Content Descriptor.
@odeke-em odeke-em changed the title NewZipWriter - Ahead-of-time CRC and Sizes proposal: archive/zip: NewZipWriter should support Ahead-of-time CRC and Sizes Jan 2, 2018
@gopherbot gopherbot added this to the Proposal milestone Jan 2, 2018
@odeke-em (Member) commented Jan 2, 2018

/cc @dsnet @rsc

@steve-gray (Author) commented

@rsc - A variation on this idea (that works for my use case) is found at https://github.com/hidez8891/zip/

Rather than the implemented modification of CreateHeader/Create, I'd propose a new method (or method pair) that allows non-streaming creation of file entries - leaving the default behaviour as-is, but allowing the new use case.

@dsnet (Member) commented Jan 2, 2018

Any functionality that allows the user to preset the size seems brittle. Three pieces of information need to be known ahead of time: the CRC-32, the compressed size, and the uncompressed size. The compressed size is not known by the user and depends on the underlying compression implementation used.

The only API that seems to make sense to me is:

func (*Writer) AddFile(fh *Header, b []byte) error 

While I understand your use-case, it doesn't seem like this issue is prevalent enough to warrant new API being added.

@rasky (Member) commented Jan 3, 2018

@steve-gray which old software are you aware of (and which version) that doesn't support content descriptors?

@rsc (Contributor) commented Jan 8, 2018

It doesn't have to be ahead of time, we just have to seek back earlier in the file and write it out, right? So maybe we could do this transparently if the underlying writer allows seeking?

@dsnet (Member) commented Jan 8, 2018

Yes and no. If the size is less than 4GB, it's easy to seek back and re-write the fixed-width 4-byte fields. However, if the file written was large enough that it had to be upgraded to ZIP64, the writer would not be able to go back and write ZIP64 local headers. That said, if the purpose of all this is to support old readers, I doubt an old reader would understand ZIP64 anyway.

Secondly, as a user, I would be surprised if the output format differed depending on whether the underlying writer was an io.Writer or an io.WriteSeeker. At minimum, I would expect a different constructor that took in an io.WriteSeeker to make this distinction clear.

EDIT: We could always emit a ZIP64 record conservatively. This leads to an increase of 28B for every file. Old readers that don't support ZIP64 should still ignore the ZIP64 extra data.

@rsc (Contributor) commented Jan 22, 2018

@rasky's question is still unanswered: what programs care about this? It may just not be worth doing.

@mavimo commented Jan 23, 2018

@rsc I'm trying to upload a zip file to the Chrome Web Store (https://chrome.google.com/webstore/developer/dashboard) but it reports that the file is not valid. After fixing it with zip -F mypackage.zip --out mypackage-fixed.zip, it seems to work.

Difference between mypackage.zip and mypackage-fixed.zip: [hex-dump screenshots omitted]

I hope this can help.

@genez commented Jan 23, 2018

following, I have the same issue

@rsc (Contributor) commented Jan 29, 2018

@genez you mean you are using Google Chrome Webstore?

Maybe this should be a bug for Google Chrome Webstore instead. What Go generates is a valid zip file already.

@genez commented Jan 30, 2018

@rsc yes, I could submit a bug for Google Chrome Webstore.
I think it's a gray area. I mean: after "fixing" the zip file with zip -F, the Chrome Web Store successfully ingests it.
So, in my understanding, the zip files generated by Go's standard library are "ok" (I can confirm they open on Mac and Windows with no problem at all), but they could be more "compatible" just by adding some (maybe optional? I don't have deep knowledge of the zip format) information to the header.

I would see this as an improvement rather than a bug

What's your opinion?

@rsc (Contributor) commented Feb 5, 2018

The whole zip.Writer interface and design is predicated on having an io.Writer and generating the output in one pass, and the zip file format is designed to make that possible. It's far from clear that we need to take on the complexity of two passes just for the Chrome Web Store. Perhaps just knowing about zip -F is enough.

Are there any other reasons we should do this?

@steve-gray (Author) commented Feb 5, 2018 via email

@dsnet (Member) commented Feb 5, 2018

The complexity being added is only to support single-pass readers (which Chrome seems to be one). The more I think about it, the more I believe supporting such a use-case is actually a mistake.

There is nothing in the specification that forbids a valid file from being written but omitted from the trailing central directory. In fact, you can "remove" and add files to an existing ZIP file by concatenating data at the end and writing an entirely new central directory that references the newly added files and ignores the "removed" ones. When reading in a streaming manner, it is impossible for a reader to determine whether a file really exists or was "removed" until it hits the directory. Fundamentally, the ZIP format is not streaming-read friendly, and giving users the illusion that it is would be misleading in my opinion.

If we're going to add complexity, #15626 is more in line with actual properties that ZIP guarantees.

@rsc (Contributor) commented Feb 26, 2018

Per @dsnet's comment, I think we should decline to do this.
