archive/tar: add Reader.NextRaw method to read only one raw header #17657

wking · 2016-10-29T04:16:04Z

The current master implementation elides records, for example pax x typeflag records are collapsed into the following record with the extended headers being merged into the normal file's headers. This is useful for most consumers, who are only interested in consuming the tar content. But two use cases would benefit from more direct access to the lower-level tar details:

Accessing information that is not yet represented in the the tar structure. For example, tar.Header does not currently expose pax extended headers outside of these names and prefixes. There are plans to rectify this limitation (archive/tar: add support for arbitrary pax vendor extensions #14472), but until the tar structures expose all of these sorts of details it will be useful to have a way to directly access the header and file data. This will allow users to add their own parsers for any extension they need which is not yet supported by the stdlib, without requiring them to fork archive/tar or replace it with a completely novel parser.
Validating tar files against a particular standard. If a protocol places requirements on the tar elements, users will need more direct access to the tar details to check the tar file against those requirements. For an example along these lines, see layer: Require ustar (originally from IEEE Std 1003.1-1988 but linking IEEE Std 1003.1-2013) opencontainers/image-spec#342 and the extremely naive stub here (which currently doesn't work because Next() never returns the x typeflag headers).

I think we can address both of these cases without breaking the existing API with two changes:

Adding a tar.Reader.NextRaw() that works like tar.Reader.Next(), but always returns the next record regardless of its typeflag.
Adding tar.Header.Bytes exposing a []byte slice of the raw header data, to allow clients to access header fields that are not yet mapped to other tar.Header attributes. Alternatively, this could be a method tar.Header.Bytes() (or an attribute exposing a string or whatever) to make the data clearly read-only.

Folks who wish to control writing at a low level may want similar changes on the tar.Writer side, but I think that is orthogonal enough to punt to a separate issue (if anyone wants it).

CC @dsnet.

The text was updated successfully, but these errors were encountered:

dsnet · 2016-10-29T21:46:17Z

Thanks for filing this. I explored this somewhat and I'm opposed to providing this functionality for the following reasons:

The requested tar.Header.Bytes field has very little use other than for people who want to have low-level access to the tar file. However, this can be still achieved today by using an io.TeeWriter on the input reader (e.g., https://play.golang.org/p/xXOIJ0d63A). This work-around is hacky, but trying to get access to the low-level format is already hacky.
What is the meaning of the Bytes field when writing? Do we simply ignore it? Do we write Bytes instead of what the Writer would have generated? We could document the exact behavior, but I'm not fond of a header field that is only valid for the Reader, but not for the Writer. As an API it behaves in unexpected ways.
I believe the reason of "accessing information that is not yet represented in the tar structure" is somewhat weak since the tar archive can easily be forked to provide the new functionality. Suppose some new functionality came around, the user still needs to implement that functionality themselves. Either forking the package or using the above workaround seems sufficient.
The desire to validate the tar format seems like a legitimate need, but the proposed API doesn't easily solve this. It seems that your approach was to get low-level access to the header bytes and to validate them yourselves. The workaround I presented allows you to do this.
A better API might be to add a Format field to Header. When reading, the Format will be filled out with a enum representing the format. When writing, Format tells the Writer what format to encode the header using.
- However this is harder than it sounds. The unfortunate nature of Postel's Law is that there are many tar readers that liberally decode non-compliant tar files. Even Go's reader will liberally decode files that are somewhat invalid (See archive/tar: Writer outputs invalid PAX format after writing GNU format #17665 for an example of this). While "specifications" exist for the USTAR, GNU, and PAX formats, the world has a very liberal conformance to those said "specifications". I have even seen instances of where the canonical implementation of the specification behaves differently from what the specification even says. The reality is: the tar format (with all of its variations) is not a format friendly to formal specifications (and thus not friendly to easy validation).

wking · 2016-10-31T18:02:46Z

On Sat, Oct 29, 2016 at 02:46:22PM -0700, Joe Tsai wrote:

The requested tar.Header.Bytes field has very little use other
than for people who want to have low-level access to the tar
file. However, this can be still achieved today by using an
io.TeeWriter on the input reader (e.g.,
https://play.golang.org/p/xXOIJ0d63A). This work-around is hacky,
but trying to get access to the low-level format is already hacky.

I agree that tar.Header.Bytes is hacky and am open to the TeeWriter
approach. However, it's possible that tar.Reader.Next skips elides
two headers (e.g. an ‘x’ and a ‘g’). In that case, I'm not sure how
you'd find the middle header short of implementing your own parser.
You could drop tar.Header.Bytes (in favor of the TeeWriter approach)
if you had NextRaw to give single steps.

I believe the reason of "accessing information that is not yet
represented in the tar structure" is somewhat weak since the tar
archive can easily be forked to provide the new
functionality. Suppose some new functionality came around, the
user still needs to implement that functionality
themselves. Either forking the package or using the above
workaround seems sufficient.

Forking is so drastic, when NextRaw seems pretty straightforward.

A better API might be to add a Format field to Header… However…

I agree that this is a bad fit for the stdlib for the reasons you
give. I think NextRaw and TeeWriter would give sufficient access to
graft this sort of thing onto the stdlib's reader. Do you have
problems with the NextRaw interface? Would it help if I worked up a
patch implementing it?

dsnet · 2016-11-09T21:55:22Z

I think a Header.Bytes field is off the table given:

It's low number of use-cases
A workaround with io.TeeWriter

An API to add NextRaw is still a possibility, but I have many concerns.

What does NextRaw do when it decodes a old GNU header for a sparse file? Do we parse the extended blocks as well? (I would assume yes, otherwise the next NextRaw call will fail). This seems to break your assumption that "raw" always means 512B.
Let's suppose you decode the raw header for a PAX header, is it now your responsibility to parse the raw key-value PAX headers yourself?
What happens when some value obtained from a previous call to NextRaw is relevant to properly parse in the next call to NextRaw? For example, let's say the next header is for the PAX headers, and the "size" record is set. When we call NextRaw again, how does it know about information from a previous raw header (like the filesize) that is important to parse the next file?
What happens when you intertwine Next and NextRaw calls?

wking · 2016-11-09T23:23:22Z

On Wed, Nov 09, 2016 at 01:55:26PM -0800, Joe Tsai wrote:

I think a Header.Bytes field is off the table given:

It's low number of use-cases

A workaround with io.TeeWriter

Agreed, although the TeeWriter approach depends on NextRaw or similar
1.

An API to add NextRaw is still a possibility, but I have many
concerns…

These are good points. The main issue with Bytes access and tar
validation is that we need access to the tip of the entry point. How
about:

// Next2 is just like Next, but if callback is non-nil it is called
// after parsing a header block (512 bytes after the start of the
// record). A single Next2 call may result in multiple callback
// calls, e.g. in the case of pax extended headers. If callback
// returns an error, Next2 returns the same error and does no
// further reading. For records with GNU's "extra" or "sparse"
// header extensions, callback will be called before the additional
// headers are parsed, which means the values set in the callback's
// header argument may be incomplete. If the callback returns with
// no errors, Next2 will continue parsing as usual. Callbacks must
// not modify the passed tar reader internally unless they can do so
// without altering the state of the Reader reference assocaited
// with Next2.
func (tr _Reader) Next2(callback *HeaderCallback) (_Header, error)

type HeaderCallback(tar *Reader, header *Header) error

That seems reasonable to me, although the “don't break the reader”
warning is a bit awkward. You could avoid that with:

type HeaderCallback(header *Header) error

but folks might want to use the Reader reference in a non-desctuctive
way. For example, Python exposes global pax headers on the
Reader/Writer 2, and Go might decide to handle global pax headers
the same way (as an alternative to packing them all into the same
Header map 3). On the other hand, folks can always use a closure or
some such with the Header-only form if they do need a reference to
the reader, so I'm ok either way.

dsnet · 2016-11-10T22:06:41Z

Before we dig deeper into this API (which seems fairly complex), I want to understand the use case more. How does Next2 help achieve the goal of "validating the tar file"? Can you give practical and specific things that you would be checking for it to be "valid"?

vbatts · 2017-01-16T21:54:02Z

a possible option (and one that I would love to reconcile with golang upstream) is the approach taken for https://github.com/vbatts/tar-split.
Currently we carry archive/tar but add a functionality of enabling access to raw bytes.
See the docs: https://godoc.org/github.com/vbatts/tar-split/archive/tar
The only additions to the archive/tar API is the RawAccounting bool field and the RawBytes() func on the Reader struct.

(coincidentally I discovered this thread while preparing to open the discussion for pushing the RawBytes feature to upstream)

wking · 2017-01-17T04:45:38Z

On Mon, Jan 16, 2017 at 01:54:05PM -0800, Vincent Batts wrote: See the docs: https://godoc.org/github.com/vbatts/tar-split/archive/tar The only additions to the `archive/tar` API is the `RawAccounting bool` field and the `RawBytes()` func on the `Reader` struct.

Does this collect *all* the bytes (from both the headers and payloads) between RawBytes calls when RawAccounting is set? If so, it sounds a lot like the io.TeeWriter approach [1]. Is it just a more conventient API for that behavior, or does it give you a way to trigger on tar headers that are currently elided by the upstream Go library (e.g. pax x typeflag records)? [1]: #17657 (comment)

dsnet · 2017-01-17T05:58:38Z

It does not seem like the RawBytes API provides a way to trigger on elided headers.

As @wking mention, it seems that the RawBytes API is identical to using a TeeReader around the source io.Reader except the RawBytes does not record the body.

It seems that this could be accomplished with a modified TeeReader approach: https://play.golang.org/p/ZUanaldFcu

However, my modification falls short in the following two areas:

Consuming the extra padding for the body only occurs at Next, which incorrectly causes the TeeReader to read padding that is not part of the Header.
This requires reading the entire body before calling Next, otherwise part of the body is accidentally copied into TeeReader. This is inefficient when the body is a sparse file or the input io.Reader is actually a io.Seeker and data could have been quickly discarded.

If those are the only two deficiencies (correct me if I'm wrong), I would much rather see the following happen instead:

That the Read function guarantees that it consumes the padding when it hits io.EOF. This could be documented behavior with a test ensuring it's persistence in the standard library.
That we add a Reader.Discard method to tell the read to drop some number of bytes. A discard method is useful for the handling of sparse files (see my comment on #13548).

The first change requires no API change and is something I would want to see anyways. The later API change at least provides benefit for another functionality more popularly requested of the tar package.

dsnet · 2017-01-18T21:50:04Z

#18710 is about adding functionality to tell the Writer what format to use (and conversely for the Reader to report what format was read). @wking, I'm wondering if this will solve your original use-case for this issue where you needed to do validation of the format.

vbatts · 2017-01-19T20:34:42Z

@dsnet understandable, but the need was in accessing only the raw headers and padding, and not the body of each entry, as the tar.Reader.Read() already provides this. @wking in this way, it does allow for collecting all the bytes. That's the purpose of tar-split, such that all the non-payload bytes can be preserved for re-assembling the cryptographic-equivalent original.

There is nothing for triggering on tar headers. Though a parser of the RawBytes ought not be too terrible.

Additionally, that RawBytes approach does not affect the tar.Writer at all. It is only for reading.

Use TeeReader instead: https://play.golang.org/p/xXOIJ0d63A There's an interesting proposal currently open golang/go#17657 Signed-off-by: Simarpreet Singh <simar@linux.com>

wking mentioned this issue Oct 29, 2016

archive/tar: add support for arbitrary pax vendor extensions #14472

Closed

dsnet added this to the Unplanned milestone Oct 29, 2016

dsnet changed the title ~~archive/tar: expose low level tar properties when reading with Header.Bytes and NextRaw~~ archive/tar: add Reader.NextRaw method to read only one raw header Nov 9, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

archive/tar: add Reader.NextRaw method to read only one raw header #17657

archive/tar: add Reader.NextRaw method to read only one raw header #17657

wking commented Oct 29, 2016

dsnet commented Oct 29, 2016 •

edited

wking commented Oct 31, 2016

dsnet commented Nov 9, 2016

wking commented Nov 9, 2016

dsnet commented Nov 10, 2016

vbatts commented Jan 16, 2017

wking commented Jan 17, 2017 via email

dsnet commented Jan 17, 2017 •

edited

dsnet commented Jan 18, 2017

vbatts commented Jan 19, 2017

archive/tar: add Reader.NextRaw method to read only one raw header #17657

archive/tar: add Reader.NextRaw method to read only one raw header #17657

Comments

wking commented Oct 29, 2016

dsnet commented Oct 29, 2016 • edited

wking commented Oct 31, 2016

dsnet commented Nov 9, 2016

wking commented Nov 9, 2016

dsnet commented Nov 10, 2016

vbatts commented Jan 16, 2017

wking commented Jan 17, 2017 via email

dsnet commented Jan 17, 2017 • edited

dsnet commented Jan 18, 2017

vbatts commented Jan 19, 2017

dsnet commented Oct 29, 2016 •

edited

dsnet commented Jan 17, 2017 •

edited