x/vgo: use tar instead of zip for package archives #24057

Closed
stapelberg opened this issue Feb 23, 2018 · 13 comments
@stapelberg
Contributor

The loss of symbolic links and executable permissions in zip files is easily visible in repositories such as https://github.com/prometheus/procfs, which contains executable files (e.g. ttar) and symbolic links (e.g. fixtures/self):

% mkdir -p /tmp/repro
% cd /tmp/repro
% echo > go.mod
% cat >> hello.go <<'EOT'
package main // import "github.com/stapelberg/repro"

import _ "github.com/prometheus/procfs"

func main() {
}
EOT
% vgo build
% vgo test all
--- FAIL: TestNewNamespaces (0.00s)
	proc_ns_test.go:15: readlink fixtures/26231/ns/net: invalid argument
--- FAIL: TestSelf (0.00s)
	proc_test.go:18: readlink fixtures/self: invalid argument
--- FAIL: TestExecutable (0.00s)
	proc_test.go:97: readlink fixtures/26231/exe: invalid argument
--- FAIL: TestFileDescriptorTargets (0.00s)
	proc_test.go:138: want fds [../../symlinktargets/abc ../../symlinktargets/def ../../symlinktargets/ghi ../../symlinktargets/uvw ../../symlinktargets/xyz], have [    ]
FAIL
FAIL	github.com/prometheus/procfs	0.002s
?   	github.com/prometheus/procfs/internal/util	[no test files]
ok  	github.com/prometheus/procfs/nfs	0.001s
ok  	github.com/prometheus/procfs/xfs	0.001s
?   	github.com/stapelberg/repro	[no test files]

While permissions apparently can be stuffed into extended attributes (see https://stackoverflow.com/a/13633772/712014), it seems that GitHub and/or vgo don’t support that. Symlink support was added to zip years ago as per https://serverfault.com/a/265678/100772, but doesn’t seem to be used by GitHub and/or vgo either.

@gopherbot gopherbot added this to the vgo milestone Feb 23, 2018
@stapelberg
Contributor Author

If I had to make a suggestion as to what to use instead, I would recommend tar archives: they could also be imported as-is into e.g. the Debian archive (or other Linux distributions), which typically don’t support zip.

@mvdan
Member

mvdan commented Feb 23, 2018

I asked about zip and tar files too in golang-dev - search for "zip" in https://groups.google.com/forum/m/#!topic/golang-dev/MNQwgYHMEcY. I don't know how to share single messages on mobile, unfortunately.

@dsnet
Member

dsnet commented Feb 23, 2018

Having dealt extensively with both the TAR and ZIP formats, I have concluded that both are terrible formats and consistent support for them is awful. However, in terms of better world-wide support, I vote for TAR.

Here is my assessment of the advantages and disadvantages of each:

  • ZIP has its heritage in Windows and better supports Windowisms.
  • TAR has its heritage in Unix and better supports Unixisms.
  • ZIP was designed to be written in a random-access manner, but can be written in a streaming manner.
  • ZIP must be read in a random-access manner, but some readers incorrectly assume you can read in a streaming manner.
  • TAR is written in a streaming manner.
  • TAR is read in a streaming manner.
  • ZIP allows random-access reading between files.
  • ZIP does not allow random-access reading within a file (if compression is used).
  • TAR does not allow random-access reading between files.
  • TAR does not allow random-access reading within a file.
  • ZIP has one primary format which is well-specified, but attempts to be extension friendly with its "extra" fields, which has ironically led to a huge number of variants (too many to mention). Many variants conflict with each other, but nothing prevents you from placing multiple conflicting "extra" fields together. The specifications for these extensions are not always easy to find.
  • TAR has 3 main competing formats (USTAR, PAX, and GNU). USTAR is entirely a subset of PAX; so really two competing formats. The two most common tools GNU tar and BSD tar have strong support for both formats. The PAX format is standardized, and the GNU format is well-documented.
  • ZIP has issues with character encoding, making exact representation of filenames difficult (especially when it comes to foreign languages). Support for the UTF-8 flag is fairly poor.
  • TAR has better support for character encodings. The USTAR format is always ASCII, PAX format is always UTF-8, but unfortunately GNU format is specified as "local variant of 8-bit ASCII".
  • ZIP supports symlinks via certain "extra" header extensions, but I strongly discourage relying on them, as they are not widely compatible.
  • TAR supports symlinks.
  • TAR and ZIP can both support file sizes up to 16 EiB (about 18.4 EB).
  • ZIP has max path names of 64KiB.
  • TAR supports unlimited path names (via GNU or PAX formats).
  • ZIP has DEFLATE compression built-in, but wide support for other compression algorithms is poor.
  • TAR has no compression. However, it is very common to compress an archive as the GZIP, BZIP2, XZ, or (upcoming) ZSTD formats. GZIP and ZSTD are well-specified. BZIP2 and XZ are "specified" according to the reference implementation.
  • ZIP compresses on a per-file basis, while usually the entire TAR archive is compressed. Thus, TAR tends to have smaller archives. Since these Go source-code archives usually contain many small files, compressed TAR can gain a decent size reduction over compressed ZIP.
  • ZIP has poor support for Unix permissions (via the various competing Unix "extra" fields).
  • TAR has good support for Unix permissions.
  • ZIP has builtin CRC protection for the data.
  • TAR has no CRC protection for the data.
  • ZIP has poor support for accurate timestamps (the original format stored the local date at 2s resolution without storing the timezone). Various "extra" fields store the timestamps as seconds since Unix epoch.
  • TAR has good support for accurate timestamps.
  • ZIP has no support for sparse files.
  • TAR has some support for sparse files.

The main advantage of ZIP is the ability to random-access between files. I'm not sure whether lacking that feature is a deal breaker. There are ways to make a single pass through a TAR archive and build an index that provides random access between files and within a file.

@mvdan
Member

mvdan commented Feb 23, 2018

On the computer now, pasting @rsc's answer to my question about the decision to use ZIP:

I guess there are two parts to your question: which container format, and which compression algorithm? I picked zip because it's so much better specified than tar (really!) and supports random access. I think both of those are very important.

As for compression efficiency, I don't think the system is going to be bottlenecked by file transfers. Most packages will be reused from cache. If zip supported better algorithms then it would be fine if we wanted to use them, but I'm not super worried. If xzip compression were an option I could enable in the zip writer, I would, of course. But I think the better container wins over compression efficiency.

@dsnet
Member

dsnet commented Feb 23, 2018

I picked zip because it's so much better specified than tar (really!)

Somewhat disagree. Essentially:

  • ZIP has one well-specified format, and a non-trivial long-tail of extensions that exist to address the limitations of the original format. ZIP tried to be "extension-friendly" by having a vendor-specified "extra" field, which is often used to encode semantic details regarding the file. The world has abused that "feature" without end, leading to inconsistent support for these extensions. That is what makes ZIP worse in my opinion.
  • TAR has two main competing formats, both of which are well-specified, documented, and supported by common tools. There are other variants of TAR, but it seems the world-wide community has more or less converged on just the GNU and PAX formats. The fact that it takes more engineering work to come up with your own extension to TAR has prevented the world from creating so many competing variations.

and supports random access

Agreed. That's the main benefit of ZIP.

As for compression efficiency, I don't think the system is going to be bottlenecked by file transfers.

For a cold cache, it's still going to be a lot of data. Imagine building Kubernetes or Juju for the first time.

For some empirical data:

70.7MiB  go1.10.src.store.zip   // no compression
20.6MiB  go1.10.src.flate.zip   // gzip -9
19.8MiB  go1.10.src.bz2.zip     // bzip2 -9
19.4MiB  go1.10.src.zstd.zip    // zstd -19
19.1MiB  go1.10.src.xz.zip      // xz -9
19.0MiB  go1.10.src.brotli.zip  // brotli -9

74.9MiB  go1.10.src.tar         // no compression
17.3MiB  go1.10.src.tar.gz      // gzip -9
14.9MiB  go1.10.src.tar.bz2     // bzip2 -9
14.4MiB  go1.10.src.tar.br      // brotli -9
13.9MiB  go1.10.src.tar.zstd    // zstd -19
12.8MiB  go1.10.src.tar.xz      // xz -9

Some observations:

  • Without compression, ZIP is smaller (70.7 vs 74.9 MiB since TAR has relatively large headers)
  • With the GZIP (or DEFLATE) formats, TAR is smaller (17.3 vs 20.6 MiB)
  • With best compression using XZ, TAR is much smaller (12.8 vs 19.1 MiB since TAR has cross-file compression benefits)
  • For ZIP, when compressing many small files, it does not benefit from better compression algorithms. Switching to XZ from GZIP only reduces size by 7% (while TAR gets a reduction of 26%).
  • On ZIP, brotli performs better than XZ because it contains a massive static dictionary that enables it to compress small files well.
  • The fact that TAR decouples archiving from compression, provides more flexibility in adopting newer algorithms (ZSTD is one I'm excited for as it compresses almost as well as XZ but decompresses faster than GZIP).
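
The 7% and 26% figures follow directly from the table. A small sketch of the arithmetic, using the sizes in MiB quoted above (the function name reductionPercent is made up for illustration):

```go
package main

import "fmt"

// reductionPercent returns how much smaller after is than before, in percent.
func reductionPercent(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Sizes in MiB, copied from the table above.
	fmt.Printf("ZIP, DEFLATE -> XZ: %.1f%%\n", reductionPercent(20.6, 19.1)) // prints 7.3%
	fmt.Printf("TAR, GZIP -> XZ:    %.1f%%\n", reductionPercent(17.3, 12.8)) // prints 26.0%
}
```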

@dsnet
Member

dsnet commented Feb 23, 2018

Also, I don't think random-access is that important. The purpose of the container format is for the proxies to transfer the package sources+testdata across the network, which is almost certainly in a streaming fashion.

I can imagine random-access is a useful property for the cache, but that seems to be an implementation detail. The external world should not care if the internal cache implementation ends up storing them as ZIP archives, SQLite databases, protobufs, flatbuffers, regular files, etc.

@davecheney
Contributor

I want to add my voice in support of tar over zip for all the reasons @dsnet mentioned.

@dsnet dsnet changed the title x/vgo: zip files lose symbolic links and executable permission x/vgo: use tar instead of zip for package archives Feb 23, 2018
@smlx

smlx commented Feb 24, 2018

This use case sounds ideal for catar, with the down-side of obscurity.

...a well-defined, reproducible, random-access serialization format for directory trees (think: a more modern tar), to permit efficient, stable storage of complete directory trees...

@rsc
Contributor

rsc commented Mar 30, 2018

@stapelberg's original report conflates restrictions imposed by vgo on the content of a Go module with the container format. It's not the use of zip that's the problem in that transcript. It's that vgo quite intentionally supports only plain files (no devices, no symlinks, no sparse files) and no executable bits. Go modules hold Go source code for building with the Go toolchain. Basic testdata is fine, but in general, for portability, modules must not be sources for other special kinds of files. Even if vgo were storing modules as tar.gz files, it would still be using an unpacker (x/vgo/vendor/cmd/go/internal/modfetch/unzip.go) that ignores symlinks and executable bits. Those do not belong in source archives, and their presence is more likely to be portability problems or attack vectors than innocent code. These restrictions would have been in place from day 1 if Go code had been responsible for putting source code on disk; the only reason symlinks and executables work today is because we delegated that to version control tools. Not anymore.

So as far as the original report is concerned ("zip files lose symbolic links and executable permissions"), sorry, but working as intended, and not because of zip files. It looks like github.com/prometheus/procfs needs to find a different way to initialize its test environment; it can no longer assume the special files are available directly from the source code tree.
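
To illustrate the "plain files only" rule, here is a hedged sketch of an unpacker-style filter. It is not vgo's actual unzip.go; sampleZip and plainFiles are hypothetical names, and the entries merely mimic the procfs layout from the report. Only the standard archive/zip and os APIs are used.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"os"
)

// sampleZip builds an in-memory archive holding a Go file, an executable
// script, and a symlink entry (illustrative data only).
func sampleZip() []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	entries := []struct {
		name string
		mode os.FileMode
		body string
	}{
		{"proc.go", 0644, "package procfs\n"},
		{"ttar", 0755, "#!/bin/sh\n"},
		{"fixtures/self", os.ModeSymlink | 0777, "26231"},
	}
	for _, e := range entries {
		hdr := &zip.FileHeader{Name: e.name, Method: zip.Deflate}
		hdr.SetMode(e.mode) // records Unix mode bits in the entry's external attributes
		w, err := zw.CreateHeader(hdr)
		if err != nil {
			panic(err)
		}
		if _, err := w.Write([]byte(e.body)); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// plainFiles returns the names of regular-file entries. Symlinks, devices,
// and directories are skipped. A real unpacker would then write each survivor
// with a fixed, non-executable permission, which drops any executable bit.
func plainFiles(data []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var names []string
	for _, f := range zr.File {
		if !f.Mode().IsRegular() {
			continue
		}
		names = append(names, f.Name)
	}
	return names, nil
}

func main() {
	names, err := plainFiles(sampleZip())
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // fixtures/self is filtered out; ttar survives as a plain file
}
```

The same filter would apply unchanged to a tar-based unpacker, which is the point of the comment above: the restriction is a policy of the tool, not a limitation of the container.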


The discussion here then turned to merits of tar vs zip more generally.

Note that vgo downloads zip files from GitHub and tgz files from Gerrit and transcodes both into zip files of the proper format (with the right file tree structure) for saving locally. The format served by popular code hosting sites is therefore not relevant. We still need to supply a tool to turn a local git/etc repo into a tree of zip files for proxies, static hosting, etc, but that will happen too. In both cases, vgo itself is writing the zip file. The interchange format ends up being almost an internal detail.

The reasons I prefer zip instead of tar are, in order:

  1. It's simple and well-specified.
  2. You can read the table of contents without reading the entire file.
  3. You have random access to individual files.
  4. Windows and OS X have built-in support for zip,
    and zip is almost always installed on Unix or else easily installed.
    In contrast, if I want to extract a .tar.gz on Windows I never remember what the program du jour is.

Mainly these considerations are for ease of debugging and poking around, but it's easy to envision actual vgo features (like vgo verify on subsets of a module) that would require random access. Honestly, if we need to pick an archive file format in 2018, I really think one that can't list the files in the archive or get to a specific file without processing the entire archive - that is, can't do it in O(1) time instead of O(n) time - is just not in the running. The rest of my comment is really all in the margins.
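
A short sketch of what reasons 2 and 3 look like with Go's archive/zip. The helper names (sampleZip, listNames, readOne) and the module path example.com/m are assumptions for illustration. zip.NewReader parses the central directory at the end of the archive, so listing names does not decompress any file data, and opening one entry does not touch the others.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// sampleZip writes a few entries with archive/zip (illustrative data only).
func sampleZip() []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for _, f := range []struct{ name, body string }{
		{"go.mod", "module example.com/m\n"},
		{"a/a.go", "package a\n"},
		{"b/b.go", "package b\n"},
	} {
		w, err := zw.Create(f.name) // Create compresses with the Deflate method
		if err != nil {
			panic(err)
		}
		if _, err := w.Write([]byte(f.body)); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// listNames returns the table of contents without reading any file data.
func listNames(data []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(zr.File))
	for _, f := range zr.File {
		names = append(names, f.Name)
	}
	return names, nil
}

// readOne decompresses a single entry by name. The lookup scans the
// in-memory table of contents, not the data of the other entries.
func readOne(data []byte, name string) ([]byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	for _, f := range zr.File {
		if f.Name != name {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		defer rc.Close()
		return io.ReadAll(rc)
	}
	return nil, fmt.Errorf("%s: not found in archive", name)
}

func main() {
	data := sampleZip()
	names, err := listNames(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
	body, err := readOne(data, "b/b.go")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s", body)
}
```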

With lots of respect for all your work on both archive/tar and archive/zip, @dsnet, I disagree with your conclusion that tar is more well-specified or simpler:

  • The only thing vgo wants out of the zip file is the name of each file, its size, and the content. I'm confident that Go's zip reader can fetch that basic information, especially since Go's zip writer is almost certainly going to be the one writing it (see transcoding note above).
  • If it makes anyone happier, I'm fine saying that the interchange format is "zip files compatible with Go 1.10's reader, using deflate, with names encoded in UTF-8".
  • As you note, "ZIP has one well-specified format" while "TAR has two main competing formats." Given that I don't care about the extra functionality either way, I'd rather have the format with one spec instead of two.
  • I think zip does a better job of cordoning off the extra stuff in a specific place, as compared with tar's habit of stuffing new encodings into old fields.
  • Go's archive/zip has 1,642 non-test lines of code, while archive/tar has 2,904.

As far as compression ratios, I agree that tar, being uncompressed, admits more effective compression. But your empirical data shows that the cost is you must give up random access. I'm not willing to do that.
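
The trade-off can be seen in miniature with compress/flate alone. This is a rough sketch under stated assumptions: the file contents are invented, container headers are ignored, and the names sampleFiles, deflatedSize, and compare are hypothetical. Compressing each file separately mirrors what ZIP does and keeps every file independently readable; compressing one concatenated stream mirrors .tar.gz and is smaller, but a reader must decompress from the start to reach a given file.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// sampleFiles generates n small, similar Go source files (made-up content).
func sampleFiles(n int) [][]byte {
	files := make([][]byte, n)
	for i := range files {
		files[i] = []byte(fmt.Sprintf(
			"package p%d\n\nimport \"fmt\"\n\nfunc Hello() {\n\tfmt.Println(\"hello from package %d\")\n}\n", i, i))
	}
	return files
}

// deflatedSize returns the size of data after DEFLATE at the best level.
func deflatedSize(data []byte) int {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.BestCompression)
	if err != nil {
		panic(err)
	}
	if _, err := w.Write(data); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}
	return buf.Len()
}

// compare returns the total size when each file is compressed on its own
// and the size when all files share a single compressed stream.
func compare(files [][]byte) (perFile, whole int) {
	var all []byte
	for _, f := range files {
		perFile += deflatedSize(f)
		all = append(all, f...)
	}
	return perFile, deflatedSize(all)
}

func main() {
	perFile, whole := compare(sampleFiles(200))
	fmt.Printf("per-file streams: %d bytes\nsingle stream:    %d bytes\n", perFile, whole)
}
```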

I take your point about having a different on-disk-cache vs network transfer format, but that's added complexity and eliminates the simplicity of having a Go package proxy that serves out of an actual cmd/go download cache. In fact if we just write a few more metadata files in the cache (which I intend to do), GOPROXY=file://$GOPATH/src/v will work. Different context but same point as above: one format is better than two.

So as far as the new title ("use tar instead of zip for package archives"), again sorry, but working as intended.


P.S. Years ago, when I had internalized "everything associated with Windows is awful and sucks and everything associated with Unix is the one true way" (see this thread), I remember Bryan Ford showing me VXA, a really beautiful system. When he got to the part where he mentioned using the zip file format, I remember this visceral "Ugh! Why would you do that? It's awful." But in fact I'd been blinded by my priors and (as usual) Bryan's design was exactly right.

So especially since Go developers who frequent our issue tracker seem to tend toward being Unix developers, a cautionary note to anyone reading this issue: if you recognize that you're having a similar gut reaction like "clearly tar is better than zip", I'd encourage you to try to step back and examine where that's coming from. (Or maybe I was the only one who fell into that trap, in which case ignore this note.)

@rsc rsc closed this as completed Mar 30, 2018
@dsnet
Member

dsnet commented Mar 30, 2018

SGTM. Thanks for your well-written reply.

Different context but same point as above: one format is better than two.

Using the same format for both the proxy and cache certainly has its advantages. If so, then table-of-contents and random-access will be necessary. Given these constraints, I agree zip is the right choice.

If it makes anyone happier, I'm fine saying that the interchange format is "zip files compatible with Go 1.10's reader, using deflate, with names encoded in UTF-8".

SGTM.

(Or maybe I was the only one who fell into that trap, in which case ignore this note.)

My opinions on tar and zip are influenced by frustration working with both these formats. Interestingly, I actually supported zip some time ago, and flipped my opinion after working more on zip. I may just be pessimistic about the state of the world. Fortunately, I certainly don't support creating our own archive format.

Fun fact: I'm writing this from a Windows machine.

@tv42

tv42 commented Mar 30, 2018

It's that vgo quite intentionally supports only plain files (no devices, no symlinks, no sparse files) and no executable bits.

I would expect a lot of repositories to contain shell scripts for various things, including administrative actions and test helpers that are run in subprocesses.

Go itself contains 59 executable files in the repository. (git ls-tree -r HEAD|grep ^100755|wc -l). From a quick glance, x/tools, x/text (this one seems accidental and likely caused by a Windows user), x/mobile, x/sys, x/build and so on all contain executable files. A find over the local $GOPATH/src/github.com shows lots more.

Please accept that repositories want to contain executable files.

@dolmen
Contributor

dolmen commented Apr 5, 2018

@tv42 The purpose of Go modules is to provide Go source files for inclusion in a Go build. A Go build uses only Go files (and now go.mod files and zip archives) as input.
The executable files you mention are not used during the Go build. They may be used for "go generate", but the output of "go generate" (generated Go source files) is supposed to be committed to the repo, so it is provided in the module archive directly, with no additional build step. So those executable files are out of scope.

@tv42

tv42 commented Apr 5, 2018

@dolmen You've just engineered a solution where running "go generate" in vendored dependencies no longer works.

9 participants