
archive/tar: add support for writing tar containing sparse files #13548

Open
grubernaut opened this issue Dec 9, 2015 · 45 comments
Labels
FeatureRequest NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@grubernaut

I've created a GitHub repo with all the needed steps for reproducing this on Ubuntu 12.04 using Go 1.5.1. I've also verified that this still occurs with Go 1.5.2.

Run vagrant create then vagrant provision from the repository root.

vagrant create
vagrant provision

Expected Output:

$ vagrant provision
==> default: Running provisioner: shell...
    default: Running: inline script
==> default: stdin: is not a tty
==> default: go version go1.5.2 linux/amd64
==> default: Creating Sparse file
==> default: Proving file is truly sparse
==> default: 0 -rw-r--r-- 1 root root 512M Dec  9 15:26 sparse.img
==> default: Compressing in Go without sparse
==> default: Compressing in Go with sparse
==> default: FileInfo File Size: 536870912
==> default: Proving non-sparse in Go gained size on disk
==> default: 512M -rw-r--r-- 1 root root 512M Dec  9 15:26 non_sparse/sparse.img
==> default: Proving sparse in Go DID keep file size on disk
==> default: 0 -rw-r--r-- 1 root root 0 Dec  9 15:26 sparse/sparse.img
==> default: Compressing via tar w/ Sparse Flag set
==> default: Proving sparse via tar DID keep file size on disk
==> default: 0 -rw-r--r-- 1 root root 512M Dec  9 15:26 tar/sparse.img

Actual Output:

$ vagrant provision
==> default: Running provisioner: shell...
    default: Running: inline script
==> default: stdin: is not a tty
==> default: go version go1.5.2 linux/amd64
==> default: Creating Sparse file
==> default: Proving file is truly sparse
==> default: 0 -rw-r--r-- 1 root root 512M Dec  9 15:35 sparse.img
==> default: Compressing in Go without sparse
==> default: Compressing in Go with sparse
==> default: Proving non-sparse in Go gained size on disk
==> default: 513M -rw-r--r-- 1 root root 512M Dec  9 15:35 non_sparse/sparse.img
==> default: Proving sparse in Go DID NOT keep file size on disk
==> default: 512M -rw-r--r-- 1 root root 512M Dec  9 15:35 sparse/sparse.img
==> default: Compressing via tar w/ Sparse Flag set
==> default: Proving sparse via tar DID keep file size on disk
==> default: 0 -rw-r--r-- 1 root root 512M Dec  9 15:35 tar/sparse.img

The Vagrantfile supplied in the repository runs the following shell steps:

  • Installs Go
  • Creates a sparse file via truncate -s 512M sparse.img
  • Proves that the file is sparse via ls -lash sparse.img
  • Runs compress.go via go run compress.go
  • Untars the archives created by compress.go via tar -xf
  • Verifies that the extracted files did not maintain sparse files, both with and without the sparse type set in the tar file's header. ls -lash sparse.img
  • Uses GNU/Tar to compress the sparse file with the sparse flag set tar -Scf sparse.tar sparse.img
  • Extracts the archive created by GNU/Tar tar -xf sparse.tar
  • Proves that GNU/Tar maintained sparse files ls -lash sparse.img

This is somewhat related to #12594.

I could also be creating the archive incorrectly; I have tried a few different methods for creating the tar archive, but each one failed to keep the sparse files intact upon extraction of the archive. This also cannot be replicated on OS X, as HFS+ has no concept of sparse files and instantly destroys any file sparseness, hence the need for running and testing the reproduction case in a Vagrant VM.

Any thoughts or hints into this would be greatly appreciated, thanks!

@bradfitz
Contributor

bradfitz commented Dec 9, 2015

/cc @dsnet who's been going crazy on the archive/tar package in the Go 1.6 tree ("master" branch)

@dsnet
Member

dsnet commented Dec 9, 2015

This isn't a bug per se, but more of a feature request. Sparse file support is only provided for tar.Reader, not tar.Writer. Currently it's a bit asymmetrical, but supporting sparse files in tar.Writer requires an API change, which may take some time to think through.

Also, this is mostly unrelated to #12594, although that bug should definitely be fixed before any attempt at this is made. For the time being, I recommend putting this in the "unplanned" milestone; I'll revisit this issue once the other tar bugs are fixed.

@grubernaut
Author

@dsnet should I keep this here as a feature request, or is there another preferred format for those?

@dsnet
Member

dsnet commented Dec 9, 2015

The issue tracker is perfect for that. So this is just fine.

@rsc rsc changed the title archive/tar: Writing a tarfile does not maintain sparse files archive/tar: no support for writing tar containing sparse files Dec 28, 2015
@rsc rsc added this to the Unplanned milestone Dec 28, 2015
@rsc rsc changed the title archive/tar: no support for writing tar containing sparse files archive/tar: add support for writing tar containing sparse files Dec 28, 2015
@dsnet
Member

dsnet commented Feb 26, 2016

This is my proposed addition to the tar API to support sparse writing.

First, we modify tar.Header to have an extra field:

type Header struct {
    ...

    // SparseHoles represents a sequence of holes in a sparse file.
    //
    // The regions must be sorted in ascending order, must not overlap with
    // each other, and must not extend past the specified Size.
    // If len(SparseHoles) > 0 or Typeflag is TypeGNUSparse, then the file is
    // sparse. It is optional for Typeflag to be set to TypeGNUSparse.
    SparseHoles []SparseEntry
}

// SparseEntry represents a Length-sized fragment at Offset in the file.
type SparseEntry struct {
    Offset int64
    Length int64
}

On the reader side, nothing much changes. We already support sparse files. All that's being done is that we're now exporting information about the sparse file through the SparseHoles field.

On the writer side, the user must set the SparseHoles field if they intend to write a sparse file. It is optional for them to set Typeflag to TypeGNUSparse (there are multiple formats to represent sparse files so this is not important). The user then proceeds to write all the data for the file. For sparse holes, they will be required to write Length zeros for that given hole. It is a little inefficient writing zeros for the holes, but I decided on this approach because:

  • It is symmetrical with how tar.Reader already operates (which transparently expands a sparse file).
  • It is more representative of what the "end result" really looks like. For example, it allows a user to write a sparse file by just doing io.Copy(tarFile, sparseFile) and not worry about where the holes are (assuming they already populated the SparseHoles field).

I should note that the tar format represents sparse files by indicating which regions have data, treating everything else as a hole. The API exposed here does the opposite: it represents sparse files by indicating which regions are holes, treating everything else as data. The reason for this inversion is that it fits the Go philosophy that the zero value of something be meaningful. The zero value of SparseHoles indicates that there are no holes in the file, and thus it is a normal file; i.e., the default makes sense. If we were to use SparseDatas instead, its zero value would indicate that there is no data in the file, which is rather odd.
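The two views are mechanically interchangeable; a hedged sketch of the inversion (the `sparseEntry` type mirrors the proposed `SparseEntry`, and the `invertSparse` name is illustrative):

```go
package main

import "fmt"

// sparseEntry mirrors the proposed SparseEntry: a Length-sized region at Offset.
type sparseEntry struct{ Offset, Length int64 }

// invertSparse converts a sorted, non-overlapping list of holes into the
// complementary data regions for a file of the given size (the same code
// also converts data regions back into holes).
func invertSparse(holes []sparseEntry, size int64) []sparseEntry {
	var datas []sparseEntry
	pos := int64(0)
	for _, h := range holes {
		if h.Offset > pos {
			datas = append(datas, sparseEntry{pos, h.Offset - pos})
		}
		pos = h.Offset + h.Length
	}
	if pos < size {
		datas = append(datas, sparseEntry{pos, size - pos})
	}
	return datas
}

func main() {
	holes := []sparseEntry{{Offset: 2, Length: 3}}
	fmt.Println(invertSparse(holes, 10)) // [{0 2} {5 5}]
}
```

Note how the zero value falls out naturally: inverting an empty hole list yields a single data region covering the whole file, i.e. a normal file.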

Requiring users to write zeros is a little inefficient, and the bottleneck will be memory bandwidth when transferring potentially large runs of zeros. Though not strictly necessary, the following methods may be worth adding as well:

// Discard skips the next n bytes, returning the number of bytes discarded.
// This is useful when dealing with sparse files to efficiently skip holes.
func (tr *Reader) Discard(n int64) (int64, error) {}

// FillZeros writes the next n bytes by filling them in with zeros.
// It returns the number of bytes written, and an error if any.
// This is useful when dealing with sparse files to efficiently write holes.
func (tw *Writer) FillZeros(n int64) (int64, error) {}

Potential example usage: https://play.golang.org/p/Vy63LrOToO

@ianlancetaylor
Contributor

If Reader and Writer support sparse files transparently, why export SparseHoles? Is the issue that when writing you don't want to introduce a sparse hole that the caller did not explicitly request?

@dsnet
Member

dsnet commented Feb 26, 2016

The Reader expands sparse files transparently. The Writer is "transparent" in the sense that a user can just do io.Copy(tw, sparseFile), and so long as the user has already specified where the sparse holes are, it will avoid writing the long runs of zeros.

Purely transparent sparse files for Writer cannot easily be done, since the tar.Header is written before the file data. The Writer therefore cannot know what sparse map to encode in the header prior to seeing the data itself, so Writer.WriteHeader needs to be told where the sparse holes are.

I don't think tar should automatically create sparse files (for backwards compatibility). As a data point, the tar utilities do not automatically generate sparse files unless the -S flag is passed in. However, it would be nice if the user didn't need to come up with the SparseHoles themselves. Unfortunately, I don't see an easy solution to this.


There are three main ways that sparse files may be written:

  1. In the case of writing a file from the filesystem (the use case that spawned this issue), I'm not aware of any platform-independent way to easily query for all the sparse holes. There is a method to do this on Linux and Solaris with SEEK_DATA and SEEK_HOLE (see my test in CL/17692), but I'm not aware of ways to do this on other OSes like Windows or Darwin.
  2. In the case of a round-trip read-write, a tar.Header read from Reader.Next and written to Writer.WriteHeader will work just fine as expected since tar.Header will have the SparseHoles field populated.
  3. In the case of writing a file from memory, the user will need to write their own zero-detection scheme (assuming they don't already know where the holes are).

I looked at the source for GNU and BSD tar to see what they do:

  • (Source) BSD tar attempts to use FIEMAP first, then SEEK_DATA/SEEK_HOLE, then (it seems) it avoids sparse files altogether.
  • (Source) GNU tar attempts to use SEEK_DATA/SEEK_HOLE, then falls back on brute-force zero block detection.
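GNU tar's brute-force fallback amounts to scanning fixed-size blocks for runs of zeros, which is portable pure Go. A hedged sketch (the `sparseEntry` type, block size, and `findHoles` name are illustrative):

```go
package main

import "fmt"

type sparseEntry struct{ Offset, Length int64 }

// isZeroBlock reports whether b contains only zero bytes.
func isZeroBlock(b []byte) bool {
	for _, c := range b {
		if c != 0 {
			return false
		}
	}
	return true
}

// findHoles scans data in blockSize chunks and merges consecutive all-zero
// blocks into hole regions, mimicking GNU tar's brute-force fallback.
func findHoles(data []byte, blockSize int) []sparseEntry {
	var holes []sparseEntry
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		if !isZeroBlock(data[off:end]) {
			continue
		}
		if n := len(holes); n > 0 && holes[n-1].Offset+holes[n-1].Length == int64(off) {
			holes[n-1].Length += int64(end - off) // extend the previous hole
		} else {
			holes = append(holes, sparseEntry{int64(off), int64(end - off)})
		}
	}
	return holes
}

func main() {
	data := make([]byte, 12)
	data[0], data[8] = 1, 1 // blocks 0 and 2 contain data; block 1 is all zero
	fmt.Println(findHoles(data, 4)) // [{4 4}]
}
```

The trade-off is exactly the one described above: this reads every byte, whereas SEEK_DATA/SEEK_HOLE lets the filesystem answer directly.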

I'm not too fond of the OS-specific things that they do to detect holes (granted, archive/tar already has many OS-specific things in it). I think it would be nice if tar.Writer provided a way to write sparse files, but I think we should delegate detection of sparse holes to the user for now. If possible, we can try to get sparse info during FileInfoHeader, but I'm not sure that os.FileInfo has the necessary information to do the queries that are needed.

@AkihiroSuda
Contributor

@dsnet Design SGTM (non-binding), do you plan to implement that feature?

@dsnet
Member

dsnet commented Dec 1, 2016

I'll try and get this into the Go 1.9 cycle. However, a major refactoring of the tar.Writer implementation needs to happen first.

@dsnet dsnet modified the milestones: Go1.9Maybe, Unplanned Dec 1, 2016
@dsnet
Member

dsnet commented Dec 7, 2016

That being said, for all those interested in this feature, can you mention what your use case is?

For example, are you only interested in being able to write a sparse file where you have to specify explicitly where the holes in the file are? Or do you expect to pass an os.FileInfo and have the tar package figure it out (I'm not sure this is possible)?

@willglynn

My use is go_ami_tools/aws_bundle, a library which makes machine images for Amazon EC2. The inside of the Amazon bundle format is a sparse tar, which is a big advantage for machine images since there are usually lots of zeroes. go_ami_tools currently writes all the zeroes and lets them get compressed away, but a sparse tar would be better.

I'd like to leave zero specification up to the user of my library. ec2-bundle-and-upload-image – my example tool – would read zeroes straight from the host filesystem, but someone could just as easily plug the go_ami_tools library into a VMDK or QCOW reader, in which case the zeroes would be caller-specified.

@AkihiroSuda
Contributor

My use case is to solve a Docker issue, moby/moby#5419 (comment), which causes docker build to fail with ENOSPC when the container image contains a sparse file.

@grubernaut
Author

We (Hashicorp) run Packer builds for customers on our public SaaS, Atlas. We offer up an Artifact Store for Atlas customers so that they can store their created Vagrant Boxes, VirtualBox (ISO, VMX), QEMU, or other builds inside our infrastructure. If the customer specifies using the Atlas post-processor during a Packer build, we first create an archive of the resulting artifact, and then we create a POST to Atlas with the resulting archive.

Many of the resulting QEMU, VirtualBox, and VMware builds can be fairly large (10-20GB), and we've had a few customers sparsify the resulting disk image, which can lower the resulting artifact's size to ~500-1024MB. This, of course, allows for faster downloads, less bandwidth usage, and a better experience overall.

We first start to create the archive from the Atlas Post-Processor in Packer (https://github.com/mitchellh/packer/blob/master/post-processor/atlas/post-processor.go#L154).
We then archive the resulting artifact directory, and walk the directory. Finally, we write the file headers, and perform an io.Copy: (https://github.com/hashicorp/atlas-go/blob/master/archive/archive.go#L381).

In this case, we wouldn't know explicitly where the holes in the file are, and would have to rely on os.FileInfo or something similar to generate the sparsemap of the file; although I'm not entirely sure that this is possible.

@vbatts
Contributor

vbatts commented Apr 24, 2017

@dsnet the use case is largely around container images. So the Reader design you proposed SGTM, though it would be nice if the tar reader also provided an io.Seeker to accommodate the SparseHoles; that is not a terrible issue, just less than ideal.
For the Writer, either passing the FileInfo, or some quick form of detection, and perhaps an io.Writer wrapper with a type assertion?
Both sides would be useful, though. Thanks for your work on this.

@dsnet dsnet modified the milestones: Go1.10, Go1.9Maybe May 22, 2017
mdlinville pushed a commit to docker/docs that referenced this issue Jun 2, 2017
Running `useradd` without `--no-log-init` risks triggering a resource exhaustion issue:

    moby/moby#15585
    moby/moby#5419
    golang/go#13548
@dsnet
Member

dsnet commented Aug 18, 2017

Sorry this got dropped in Go1.9, I have a working solution out for review for Go1.10.

@gopherbot

Change https://golang.org/cl/56771 mentions this issue: archive/tar: refactor Reader support for sparse files

@astromechza

Came across this issue looking for sparse-file support in Go. The API looks good to me and certainly fits my use case :). Is there no sysSparsePunch needed for Unix?

@dsnet
Member

dsnet commented Sep 23, 2017

On Unix OSes that support sparse files, seeking past EOF and writing or resizing the file to be larger automatically produces a sparse file.

@astromechza

Cool, so it detects that you've skipped past a block without writing anything to it and automatically assumes it's sparse? Nice 👍

@gopherbot

Change https://golang.org/cl/78030 mentions this issue: archive/tar: partially revert sparse file support

gopherbot pushed a commit that referenced this issue Nov 16, 2017
This CL removes the following APIs:
	type SparseEntry struct{ ... }
	type Header struct{ SparseHoles []SparseEntry; ... }
	func (*Header) DetectSparseHoles(f *os.File) error
	func (*Header) PunchSparseHoles(f *os.File) error
	func (*Reader) WriteTo(io.Writer) (int, error)
	func (*Writer) ReadFrom(io.Reader) (int, error)

These APIs were added during the Go1.10 dev cycle and are safe to remove.

The rationale for reverting is that Header.DetectSparseHoles and
Header.PunchSparseHoles are functionality that probably belongs in
the os package itself.

The other APIs, like Header.SparseHoles, Reader.WriteTo, and Writer.ReadFrom,
perform no OS-specific logic and only perform the actual business logic of
reading and writing sparse archives. Since we do not know what the API added to
package os may look like, we preemptively revert these non-OS-specific changes
as well by simply commenting them out.

Updates #13548
Updates #22735

Change-Id: I77842acd39a43de63e5c754bfa1c26cc24687b70
Reviewed-on: https://go-review.googlesource.com/78030
Reviewed-by: Russ Cox <rsc@golang.org>
@rasky
Member

rasky commented Nov 17, 2017

Unfortunately, the code had to be reverted and will not be part of 1.10 anymore. This bug should probably be reopened.

@bradfitz bradfitz reopened this Nov 17, 2017
@bradfitz bradfitz modified the milestones: Go1.10, Unplanned Nov 17, 2017
@bradfitz bradfitz added the NeedsFix The path to resolution is known, but the work has not been done. label Nov 17, 2017
jcayzac added a commit to rakutentech/plantuml-docker that referenced this issue Nov 26, 2020
- Deterministic GID and UID
- Docker recommends using `--no-log-init` until [this issue](golang/go#13548) gets resolved.
@gogowitsch

Dear Go heroes, please try to get sparse support into tar.Writer. Thanks!

Deradon added a commit to Deradon/openproject that referenced this issue Jun 21, 2021
With the proposed changes the time required to run `./bin/compose setup`
is being reduced from ~18 minutes down to ~7 minutes on my machine.
In addition a workaround is applied to reduce the size of the images.

== Changes

=== Speed-Up `bundle install`

The time spent within `bundle install` accounts for a significant amount
of the time taken by `./bin/compose setup`.
We could make use of two improvements, which both allow us to
utilize multiple CPU cores:

* Make use of the bundle `--jobs` argument
* Make use of the lesser known/used `MAKE` environment variable

A significant amount of the time spent during `bundle install` is actually
compiling C extensions, which is why use of the `MAKE` variable will
drastically improve performance.

=== `useradd --no-log-init`

Unfortunately there is a nasty bug when running `useradd` for a huge
`uid`, which could result in excessive image sizes. See attached links
for more information.

=== BuildKit

BuildKit is the default builder toolkit for Docker on Windows and
DockerDesktop on Macs. Using BuildKit will greatly improve performance
when building docker images.

== Links

=== Speed-Up `bundle install`

* [One Weird Trick That Will Speed Up Your Bundle Install](https://build.betterup.com/one-weird-trick-that-will-speed-up-your-bundle-install/)

=== BuildKit

* [Build images with BuildKit](https://docs.docker.com/develop/develop-images/build_enhancements/)
* [Faster builds in Docker Compose 1.25.1 thanks to BuildKit Support](https://www.docker.com/blog/faster-builds-in-compose-thanks-to-buildkit-support/)

=== `useradd --no-log-init`

* Best practices for writing Dockerfiles: [User](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#user)
* golang/go: [archive/tar: add support for writing tar containing sparse files](golang/go#13548)
oliverguenther pushed a commit to opf/openproject that referenced this issue Jul 13, 2021
@rsc rsc unassigned dsnet Jun 23, 2022
sinopeus pushed a commit to radix-ai/poetry-cookiecutter that referenced this issue Apr 5, 2023
TLDR: Passing `--no-log-init` to `useradd` prevents an issue where the
Docker image size would potentially increase to hundreds of gigabytes
when passed a "large" UID or GID. This is apparently a side effect of
how `useradd` creates the user failure logs.

The issue is explained in more detail at
docker/docs#4754. The root cause is apparently a
combination of the following:

1. `useradd` by default allocates space for the faillog and lastlog for
   "all" users: https://unix.stackexchange.com/q/529827. If you pass it
   a high UID, e.g. 414053617, it will reserve space for all those
   414053617 user logs, which amounts to more than 260GB.
2. The first bullet wouldn't be a problem if Docker recognized the
   sparse file and compressed it efficiently. However, there is an
   unresolved issue in the Go archive/tar package (which Docker uses to
   package image layers) in its handling of sparse files:

   golang/go#13548

   Eight years unresolved and counting!

Passing `--no-log-init` to `useradd` avoids allocating space for the
faillog and lastlog and fixes the issue.

Finding out the root cause for this bug drove me loco. Reader, enjoy :-)
lsorber pushed a commit to radix-ai/poetry-cookiecutter that referenced this issue Apr 15, 2023
@realtebo

Is this bug still present?
