archive/tar: package understanding of GNU format is wrong #12594

dsnet · 2015-09-11T23:30:50Z

Using go1.5

Also discovered this while fixing other archive/tar issues (and I found a fair number of them, mostly minor). However, fixing this will change the way archive/tar reads and writes certain formats.

What the current archive/tar thinks the GNU format is:

A magic and version that forms the string "ustar\x20\x20\x00" (this is correct).
That the structure is identical to the POSIX format. That is, there is a 155-byte prefix section (this is incorrect).
That it extends the POSIX format by adding the ability to perform base-256 encoding (this is not necessarily specific to GNU format).

What the GNU manual actually says the format is:

The GNU manual says that the format for headers using this magic is the following (in Go syntax):

type headerGNU struct {
    // Original V7 header
    name     [100]byte //   0
    mode     [8]byte   // 100
    uid      [8]byte   // 108
    gid      [8]byte   // 116
    size     [12]byte  // 124
    mtime    [12]byte  // 136
    chksum   [8]byte   // 148
    typeflag [1]byte   // 156
    linkname [100]byte // 157

    // This section is based on the Posix standard.
    magic      [6]byte         // 257: "ustar "
    version    [2]byte         // 263: " \x00"
    uname      [32]byte        // 265
    gname      [32]byte        // 297
    devmajor   [8]byte         // 329
    devminor   [8]byte         // 337

    // The GNU format replaces the prefix field with this stuff.
    // The fact that GNU replaces the prefix with this makes it non-compliant.
    atime      [12]byte        // 345
    ctime      [12]byte        // 357
    offset     [12]byte        // 369
    longnames  [4]byte         // 381
    unused     [1]byte         // 385
    sparse     [4]headerSparse // 386
    isextended [1]byte         // 482
    realsize   [12]byte        // 483
                               // 495
}

type headerSparse struct {
    offset   [12]byte //  0
    numbytes [12]byte // 12
                      // 24
}

In fact, the GNU structure swaps out the POSIX prefix section for a set of extra fields for atime, ctime, and sparse file support (contrary to what Go thinks).
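As a quick sanity check on the offsets in the comments above, the same structs can be measured with the unsafe package. This is only a sketch: the gnuLayout helper is a name made up for illustration, and it relies on byte arrays having no alignment padding in Go.

```go
package main

import (
	"fmt"
	"unsafe"
)

type headerSparse struct {
	offset   [12]byte
	numbytes [12]byte
}

type headerGNU struct {
	name       [100]byte
	mode       [8]byte
	uid        [8]byte
	gid        [8]byte
	size       [12]byte
	mtime      [12]byte
	chksum     [8]byte
	typeflag   [1]byte
	linkname   [100]byte
	magic      [6]byte
	version    [2]byte
	uname      [32]byte
	gname      [32]byte
	devmajor   [8]byte
	devminor   [8]byte
	atime      [12]byte
	ctime      [12]byte
	offset     [12]byte
	longnames  [4]byte
	unused     [1]byte
	sparse     [4]headerSparse
	isextended [1]byte
	realsize   [12]byte
}

// gnuLayout (hypothetical helper) reports the total size of the struct and
// the offsets of a few GNU-specific fields.
func gnuLayout() (size, atime, sparse, realsize uintptr) {
	var h headerGNU
	return unsafe.Sizeof(h), unsafe.Offsetof(h.atime),
		unsafe.Offsetof(h.sparse), unsafe.Offsetof(h.realsize)
}

func main() {
	size, atime, sparse, realsize := gnuLayout()
	fmt.Println(size, atime, sparse, realsize) // 495 345 386 483
	fmt.Println(512 - size)                    // 17 bytes of padding remain in the 512-byte block
}
```

Note that atime starts at offset 345, which is exactly where the USTAR prefix field would begin.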

Regarding the use of base-256 encoding, it seems that GNU was the first to introduce this encoding back in 1999. Since then, pretty much every tar decoder handles reading base-256 encoding regardless of whether it is GNU format or not. Marking the format as GNU may or may not be necessary just because base-256 encoding was used.
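For readers unfamiliar with base-256, here is a minimal sketch of how a numeric field can be written and read in both encodings. The names putBase256 and parseNumeric are made up for this illustration and are not the package's API; negative values, which the real encoding also supports, are deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// putBase256 writes a non-negative x into b as big-endian binary and sets the
// top bit of the first byte, which marks the field as base-256 rather than octal.
// The field is assumed to be wide enough that x does not reach the first byte.
func putBase256(b []byte, x int64) {
	for i := len(b) - 1; i >= 0; i-- {
		b[i] = byte(x)
		x >>= 8
	}
	b[0] |= 0x80
}

// parseNumeric decodes a numeric header field in either encoding.
func parseNumeric(b []byte) (int64, error) {
	if len(b) > 0 && b[0]&0x80 != 0 {
		if b[0]&0x40 != 0 {
			return 0, errors.New("negative base-256 value: not handled in this sketch")
		}
		var x uint64
		for i, c := range b {
			if i == 0 {
				c &= 0x7f // drop the marker bit
			}
			if x>>55 != 0 {
				return 0, errors.New("base-256 value overflows int64")
			}
			x = x<<8 | uint64(c)
		}
		return int64(x), nil
	}
	// Octal ASCII, padded with NULs and/or spaces.
	s := strings.Trim(string(b), " \x00")
	if s == "" {
		return 0, nil
	}
	return strconv.ParseInt(s, 8, 64)
}

func main() {
	var size [12]byte
	putBase256(size[:], 1<<33) // 8 GiB: one more than 11 octal digits can hold
	v, err := parseNumeric(size[:])
	fmt.Println(v, err) // 8589934592 <nil>

	v, err = parseNumeric([]byte("00000001750\x00"))
	fmt.Println(v, err) // 1000 <nil>
}
```

Because the marker bit can never appear in an octal ASCII field, a decoder can accept both encodings without looking at the magic value at all.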

Problem 1:

When reading, if the decoder detects the GNU magic number, it will attempt to read 155 bytes for the prefix. This is just plain wrong: it ends up reading the atime, ctime, etc. instead, which makes the prefix incorrect.

See this playground example
The paths there have something like "12574544345" prepended to them. This is because when archive/tar tries to read the prefix, it is actually reading the atime (which is in ASCII octal and is null terminated). Thus, it incorrectly uses the atime as the prefix.

This probably went undetected for so long since the "incremental" mode of GNU tar is rarely used, and thus the atime and ctime fields are never filled out and are left as null bytes. This happens to work in the common case, since the cstring for this field ends up being an empty string.
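The effect can be reproduced without the playground link. The sketch below builds a GNU-style header block by hand and applies the USTAR interpretation to it; cString, sampleHeader, and nameWithPrefix are hypothetical helpers written for this illustration, not functions from archive/tar.

```go
package main

import (
	"bytes"
	"fmt"
)

// cString returns the bytes up to the first NUL as a string.
func cString(b []byte) string {
	if i := bytes.IndexByte(b, 0); i >= 0 {
		b = b[:i]
	}
	return string(b)
}

// sampleHeader builds a header block with the GNU magic and a populated atime.
func sampleHeader() []byte {
	hdr := make([]byte, 512)
	copy(hdr[0:], "foo.txt")            // name
	copy(hdr[257:], "ustar  \x00")      // GNU magic and version
	copy(hdr[345:], "12574544345\x00") // atime in ASCII octal
	return hdr
}

// nameWithPrefix mimics the faulty reading: it treats bytes 345..500 as the
// USTAR prefix field even though the header carries the GNU magic.
func nameWithPrefix(hdr []byte) string {
	name := cString(hdr[0:100])
	if prefix := cString(hdr[345:500]); prefix != "" {
		return prefix + "/" + name
	}
	return name
}

func main() {
	hdr := sampleHeader()
	fmt.Println(nameWithPrefix(hdr))  // 12574544345/foo.txt
	fmt.Println(cString(hdr[0:100])) // foo.txt
}
```

With atime left as NUL bytes, the "prefix" is the empty string and both readings agree, which matches the explanation of why this stayed hidden.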

Problem 2:

When writing, if a numeric field was ever too large to represent in octal format, it would trigger the usedBinary flag and cause the library to output the GNU magic numbers, but subsequently fail to encode in the GNU format. Since it believes that the GNU format has a prefix field, it erroneously tries to use it, losing some information in the process.

This is ultimately what causes #9683, but is rare in practice since the perfect conditions need to be met for GNU format to be used. There is a very narrow range between the use cases of USTAR and PAX where the logic will use GNU.

Solution:

When decoding, change it so that the reader doesn't read the 155-byte prefix field (since this is just plain wrong). Optionally, support parsing of the atime and ctime from the GNU format. Nothing needs to change for sparse file support since that logic correctly understood the GNU format.

When encoding, I propose the following order of precedence:

First, use the 1988 POSIX (USTAR) standard when possible for maximum backwards compatibility.
If any numeric field goes beyond the octal representation, or any names are longer than what is supported, just use the 2001 POSIX (PAX) standard.
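The proposed precedence could look roughly like the following. Everything here is an assumption made for illustration: the fileInfo type, the chooseFormat name, and the decision to check only the name, size, and uid. Prefix splitting for USTAR is ignored for brevity.

```go
package main

import "fmt"

// fileInfo is a hypothetical stand-in for the header fields that matter here.
type fileInfo struct {
	Name string
	Size int64
	UID  int
}

// fitsOctal reports whether x fits in a NUL-terminated octal field of n bytes.
func fitsOctal(n int, x int64) bool {
	octBits := uint(n-1) * 3
	return x >= 0 && (n >= 22 || x < 1<<octBits)
}

// isASCII reports whether s contains only 7-bit characters.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= 0x80 {
			return false
		}
	}
	return true
}

// chooseFormat prefers USTAR and falls back to PAX; it never picks GNU.
func chooseFormat(fi fileInfo) string {
	ustarOK := isASCII(fi.Name) &&
		len(fi.Name) <= 100 && // ignores prefix splitting for brevity
		fitsOctal(12, fi.Size) &&
		fitsOctal(8, int64(fi.UID))
	if ustarOK {
		return "USTAR"
	}
	return "PAX"
}

func main() {
	fmt.Println(chooseFormat(fileInfo{Name: "a.txt", Size: 10, UID: 1000}))    // USTAR
	fmt.Println(chooseFormat(fileInfo{Name: "a.txt", Size: 1 << 33, UID: 1})) // PAX
	fmt.Println(chooseFormat(fileInfo{Name: "café.txt", Size: 10, UID: 1}))   // PAX
}
```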

Let's avoid writing the GNU format. In fact, the GNU manual itself says the following under the POSIX section:

This archive format will be the default format for future versions of GNU tar.

The only advantages that GNU offers over USTAR are:

Unlimited length filenames (only ASCII)
Relatively large filesizes
Possibly atime and ctime

However, PAX offers all of these over USTAR and far more:

Unlimited-length strings (including UTF-8) for filenames, usernames, etc.
Arbitrarily large integers for filesizes, uids, etc.
Sub-second resolution times.
No need for base-256 encoding (or for assuming that decoders can handle it), since PAX has its own well-defined method of encoding arbitrarily large integers.
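To make the PAX side concrete: each extended-header record has the form "<length> <key>=<value>\n", where the length is the decimal size of the whole record, including the length digits themselves. A minimal sketch of that formatting follows; the paxRecord name is an assumption for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// paxRecord formats one extended-header record as "<len> <key>=<value>\n",
// where <len> is the decimal byte length of the whole record, itself included.
func paxRecord(key, value string) string {
	size := len(key) + len(value) + 3 // ' ', '=', and '\n'
	size += len(strconv.Itoa(size))
	rec := strconv.Itoa(size) + " " + key + "=" + value + "\n"
	if len(rec) != size {
		// Counting the length digits pushed the total to one more digit.
		size = len(rec)
		rec = strconv.Itoa(size) + " " + key + "=" + value + "\n"
	}
	return rec
}

func main() {
	fmt.Printf("%q\n", paxRecord("size", "8589934592"))            // "19 size=8589934592\n"
	fmt.Printf("%q\n", paxRecord("mtime", "1350244992.023960108")) // "30 mtime=1350244992.023960108\n"
}
```

Since values are plain decimal strings, an 8 GiB size or a sub-second timestamp needs no special binary encoding.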

Not to mention, we are already outputting PAX in many situations. What's the point of straddling 3 different output formats?

Thoughts?


ianlancetaylor · 2015-09-12T00:18:48Z

My thought is that you obviously know more about this than I do. Do you have any actual question? If not it seems like you know how to proceed.

CC @dsymonds

dsnet · 2015-09-12T00:20:48Z

Given that this is obviously a bug for the reader, the course of action there is clear.

As for the writer side, should it still output GNU format under certain conditions? Or should we just completely remove support for it and use PAX?

ianlancetaylor · 2015-09-12T00:27:05Z

I think we should always output PAX format. It's old enough that I don't think we need to worry about generating older formats.

gopherbot · 2015-09-16T07:00:38Z

CL https://golang.org/cl/14623 mentions this issue.

dsnet · 2015-12-02T01:04:06Z

Fixing this bug (for the Writer) is not trivial. I have a fix for it available, but it would require multiple CLs and I don't have the bandwidth to go through the code reviews for them.

I suggest moving this to the Go1.7 milestone.

gopherbot · 2016-05-06T00:00:16Z

CL https://golang.org/cl/14669 mentions this issue.

The Reader and Writer have hard-coded constants regarding the offsets and lengths of certain fields in the tar format sprinkled all over. This makes it harder to verify that the offsets are correct since a reviewer would need to search for them throughout the code. Instead, all information about the layout of header fields should be centralized in one single file. This has the advantage of being both centralized, and also acting as a form of documentation about the header struct format.

This method was chosen over using "encoding/binary" since that method would cause an allocation of a header struct every time binary.Read was called. This method causes zero allocations and its logic is no longer than if structs were declared.

Updates #12594

Change-Id: Ic7a0565d2a2cd95d955547ace3b6dea2b57fab34
Reviewed-on: https://go-review.googlesource.com/14669
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
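One way to centralize the layout without allocating is to define accessor methods on a fixed-size block type, each returning a slice that views the block in place. The sketch below illustrates that idea only; the method names are invented here and do not claim to match the code in the CL.

```go
package main

import "fmt"

// block is one 512-byte tar header block.
type block [512]byte

// Each accessor returns a view into the block; nothing is copied or allocated.
func (b *block) name() []byte        { return b[0:][:100] }
func (b *block) magic() []byte       { return b[257:][:6] }
func (b *block) version() []byte     { return b[263:][:2] }
func (b *block) ustarPrefix() []byte { return b[345:][:155] } // USTAR only
func (b *block) gnuAtime() []byte    { return b[345:][:12] }  // GNU only
func (b *block) gnuCtime() []byte    { return b[357:][:12] }  // GNU only

// isGNU reports whether the block carries the GNU magic and version.
func (b *block) isGNU() bool {
	return string(b.magic()) == "ustar " && string(b.version()) == " \x00"
}

func main() {
	var b block
	copy(b.name(), "foo.txt")
	copy(b.magic(), "ustar ")
	copy(b.version(), " \x00")
	copy(b.gnuAtime(), "12574544345\x00")

	fmt.Println(b.isGNU())               // true
	fmt.Println(len(b.ustarPrefix()))    // 155
	fmt.Printf("%q\n", b.gnuAtime()[:11]) // "12574544345"
}
```

Keeping every offset in one place makes the overlap between the USTAR prefix and the GNU atime and ctime fields visible at a glance.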

rsc · 2016-05-17T20:35:04Z

We're in the Go 1.7 freeze; any bugs in archive/tar not introduced in the Go 1.7 dev cycle aren't worth the risk to fix. If users do encounter problems with archive/tar, it is straightforward to run a forked, fixed copy until the standard version is fixed in a future release.

gopherbot · 2016-10-19T00:55:10Z

CL https://golang.org/cl/31444 mentions this issue.

The GNU format does not have a prefix field, so we should make no attempt to read it. It does however have atime and ctime fields. Since Go previously placed incorrect values here, we liberally read the atime and ctime fields and ignore errors so that old tar files written by Go can at least be partially read.

This fixes half of #12594. The Writer is much harder to fix.

Updates #12594

Change-Id: Ia32845e2f262ee53366cf41dfa935f4d770c7a30
Reviewed-on: https://go-review.googlesource.com/31444
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

dsnet · 2016-10-19T18:11:11Z

Moving milestone to Go1.9. The fix is performed on the Reader. The fix for the Writer is more involved.

The current logic for the Writer is to assume that it is writing in one format, and then to backtrack and switch to another if it can't use that format. However, it's complicated trying to keep track of what state needs to be undone and what writes have already occurred (or not). A better approach is to verify up-front what formats can be used for the given input file and commit to using that format. There should be no back-tracking.

gopherbot · 2016-10-27T23:30:16Z

CL https://golang.org/cl/32234 mentions this issue.

The proper fix for the Writer is too involved to be done in time for Go 1.8. Instead, we do a localized fix that simply disables the prefix encoding logic. While this will prevent some legitimate uses of prefix, it will ensure that we don't keep outputting invalid GNU format files that have the prefix field populated.

For headers with long filenames that could have used the prefix field, they will be promoted to use the PAX format, which ensures that we will still be able to encode all headers that we were able to do before.

Updates #12594
Fixes #17630
Fixes #9683

Change-Id: Ia97b524ac69865390e2ae8bb0dfb664d40a05add
Reviewed-on: https://go-review.googlesource.com/32234
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>

gopherbot · 2017-08-10T19:52:00Z

Change https://golang.org/cl/54433 mentions this issue: archive/tar: check for permissible output formats first

The current logic in writeHeader attempts to encode the Header in one format and, if it discovers that it cannot, attempts to switch to a different format mid-way through. This makes it very hard to reason about what format will be used in the end and whether it will even be a valid format.

Instead, we should verify from the start what formats are allowed to encode the given input Header. If no formats are possible, then we can return immediately, rejecting the Header.

For now, we continue on to the hairy logic in writeHeader, but a future CL can split that logic up and specialize it for each format now that we know what is possible.

Update #9683
Update #12594

Change-Id: I8406ea855dfcb8b478a03a7058ddf8b2b09d46dc
Reviewed-on: https://go-review.googlesource.com/54433
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>

gopherbot · 2017-08-14T07:18:43Z

Change https://golang.org/cl/55237 mentions this issue: archive/tar: implement specialized logic for GNU format

Rather than going through writeHeader, which attempts to handle all formats, implement writeGNUHeader, which only has an understanding of the GNU format. Currently, the implementation is nearly identical to writeUSTARHeader, except:
* formatNumeric is used instead of formatOctal
* the GNU magic value is used

This is kept as a separate method since it makes more logical sense when we add support for sparse files, long filenames, and atime/ctime fields, which do not affect USTAR.

Updates #12594

Change-Id: I76efc0b39dc649efc22646dfc9867a7c165f34a8
Reviewed-on: https://go-review.googlesource.com/55237
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>

gopherbot · 2017-08-14T22:17:29Z

Change https://golang.org/cl/55550 mentions this issue: archive/tar: remove writeHeader and writePAXHeaderLegacy

gopherbot · 2017-08-15T00:40:46Z

Change https://golang.org/cl/55574 mentions this issue: archive/tar: re-implement USTAR path splitting

The logic for USTAR was disabled because a previous implementation of Writer had a wrong understanding of the differences between USTAR and GNU, causing the prefix field to be incorrectly populated in GNU files.

Now that this issue has been fixed, we can re-enable the logic for USTAR path splitting, which allows Writer to use USTAR for a wider range of possible inputs.

Updates #9683
Updates #12594
Updates #17630

Change-Id: I9fe34e5df63f99c6dd56fee3a7e7e4d6ec3995c9
Reviewed-on: https://go-review.googlesource.com/55574
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
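For context, USTAR path splitting means cutting a long path at a '/' so that the leading part fits the 155-byte prefix field and the trailing part fits the 100-byte name field. The sketch below shows one way to do that; it is not the code from the CL, and it skips the ASCII check a real writer would also need.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	nameSize   = 100 // bytes in the USTAR name field
	prefixSize = 155 // bytes in the USTAR prefix field
)

// splitUSTARPath splits name at a '/' so that the part before it fits the
// prefix field and the part after it fits the name field. ok is false when
// the name needs no split or cannot be split.
func splitUSTARPath(name string) (prefix, suffix string, ok bool) {
	if len(name) <= nameSize {
		return "", "", false // fits in the name field alone
	}
	// The separator itself is not stored, so it may sit at index prefixSize at most.
	limit := len(name)
	if limit > prefixSize+1 {
		limit = prefixSize + 1
	} else if name[limit-1] == '/' {
		limit-- // do not split on a trailing slash
	}
	i := strings.LastIndex(name[:limit], "/")
	nlen := len(name) - i - 1 // length of the would-be name field
	if i <= 0 || nlen > nameSize || nlen == 0 {
		return "", "", false
	}
	return name[:i], name[i+1:], true
}

func main() {
	long := strings.Repeat("d", 90) + "/" + strings.Repeat("f", 50)
	prefix, suffix, ok := splitUSTARPath(long)
	fmt.Println(len(prefix), len(suffix), ok) // 90 50 true

	_, _, ok = splitUSTARPath(strings.Repeat("x", 101))
	fmt.Println(ok) // false: no '/' to split on
}
```

When no valid split exists, a writer following the precedence discussed in this thread would fall back to PAX.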

dsnet mentioned this issue Sep 11, 2015

archive/tar: do not use ustar long filename encoding and binary number encoding together #9683

Closed

ianlancetaylor added this to the Go1.6 milestone Sep 12, 2015

dsnet mentioned this issue Sep 16, 2015

archive/tar: needs hardening and refactoring #12638

Closed

ianlancetaylor modified the milestones: Go1.7, Go1.6 Dec 2, 2015

grubernaut mentioned this issue Dec 9, 2015

archive/tar: add support for writing tar containing sparse files #13548

Open

dsnet self-assigned this May 9, 2016

rsc modified the milestones: Go1.8, Go1.7 May 17, 2016

rsc added the NeedsFix label Sep 29, 2016

rsc modified the milestones: Go1.8Maybe, Go1.8 Sep 29, 2016

dsnet modified the milestones: Go1.9, Go1.8Maybe Oct 19, 2016

nmiyake mentioned this issue Oct 27, 2016

archive/tar: tar archives created with paths longer than 100 characters are not valid tar files #17630

Closed

dsnet modified the milestones: Go1.10, Go1.9 May 22, 2017

thaJeztah mentioned this issue Jul 12, 2017

Docker export creates a tar archive with a lot of files in the filesystem root moby/moby#29360

Open

gopherbot closed this as completed in 694875c Aug 14, 2017

akshayjshah mentioned this issue Sep 19, 2017

1.9beta1 fails to install with dpkg error niemeyer/godeb#29

Closed

golang locked and limited conversation to collaborators Aug 15, 2018

gopherbot added the FrozenDueToAge label Aug 15, 2018

rsc unassigned dsnet Jun 23, 2022
