Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

archive/tar: add Reader.NextRaw method to read only one raw header #17657

Open
wking opened this issue Oct 29, 2016 · 10 comments
Open

archive/tar: add Reader.NextRaw method to read only one raw header #17657

wking opened this issue Oct 29, 2016 · 10 comments
Milestone

Comments

@wking
Copy link
Contributor

wking commented Oct 29, 2016

The current master implementation elides records, for example pax x typeflag records are collapsed into the following record with the extended headers being merged into the normal file's headers. This is useful for most consumers, who are only interested in consuming the tar content. But two use cases would benefit from more direct access to the lower-level tar details:

  • Accessing information that is not yet represented in the the tar structure. For example, tar.Header does not currently expose pax extended headers outside of these names and prefixes. There are plans to rectify this limitation (archive/tar: add support for arbitrary pax vendor extensions #14472), but until the tar structures expose all of these sorts of details it will be useful to have a way to directly access the header and file data. This will allow users to add their own parsers for any extension they need which is not yet supported by the stdlib, without requiring them to fork archive/tar or replace it with a completely novel parser.
  • Validating tar files against a particular standard. If a protocol places requirements on the tar elements, users will need more direct access to the tar details to check the tar file against those requirements. For an example along these lines, see layer: Require ustar (originally from IEEE Std 1003.1-1988 but linking IEEE Std 1003.1-2013) opencontainers/image-spec#342 and the extremely naive stub here (which currently doesn't work because Next() never returns the x typeflag headers).

I think we can address both of these cases without breaking the existing API with two changes:

  1. Adding a tar.Reader.NextRaw() that works like tar.Reader.Next(), but always returns the next record regardless of its typeflag.
  2. Adding tar.Header.Bytes exposing a []byte slice of the raw header data, to allow clients to access header fields that are not yet mapped to other tar.Header attributes. Alternatively, this could be a method tar.Header.Bytes() (or an attribute exposing a string or whatever) to make the data clearly read-only.

Folks who wish to control writing at a low level may want similar changes on the tar.Writer side, but I think that is orthogonal enough to punt to a separate issue (if anyone wants it).

CC @dsnet.

@dsnet
Copy link
Member

dsnet commented Oct 29, 2016

Thanks for filing this. I explored this somewhat and I'm opposed to providing this functionality for the following reasons:

  • The requested tar.Header.Bytes field has very little use other than for people who want to have low-level access to the tar file. However, this can be still achieved today by using an io.TeeWriter on the input reader (e.g., https://play.golang.org/p/xXOIJ0d63A). This work-around is hacky, but trying to get access to the low-level format is already hacky.
  • What is the meaning of the Bytes field when writing? Do we simply ignore it? Do we write Bytes instead of what the Writer would have generated? We could document the exact behavior, but I'm not fond of a header field that is only valid for the Reader, but not for the Writer. As an API it behaves in unexpected ways.
  • I believe the reason of "accessing information that is not yet represented in the tar structure" is somewhat weak since the tar archive can easily be forked to provide the new functionality. Suppose some new functionality came around, the user still needs to implement that functionality themselves. Either forking the package or using the above workaround seems sufficient.
  • The desire to validate the tar format seems like a legitimate need, but the proposed API doesn't easily solve this. It seems that your approach was to get low-level access to the header bytes and to validate them yourselves. The workaround I presented allows you to do this.
  • A better API might be to add a Format field to Header. When reading, the Format will be filled out with a enum representing the format. When writing, Format tells the Writer what format to encode the header using.
    • However this is harder than it sounds. The unfortunate nature of Postel's Law is that there are many tar readers that liberally decode non-compliant tar files. Even Go's reader will liberally decode files that are somewhat invalid (See archive/tar: Writer outputs invalid PAX format after writing GNU format #17665 for an example of this). While "specifications" exist for the USTAR, GNU, and PAX formats, the world has a very liberal conformance to those said "specifications". I have even seen instances of where the canonical implementation of the specification behaves differently from what the specification even says. The reality is: the tar format (with all of its variations) is not a format friendly to formal specifications (and thus not friendly to easy validation).

@dsnet dsnet added this to the Unplanned milestone Oct 29, 2016
@wking
Copy link
Contributor Author

wking commented Oct 31, 2016

On Sat, Oct 29, 2016 at 02:46:22PM -0700, Joe Tsai wrote:

  • The requested tar.Header.Bytes field has very little use other
    than for people who want to have low-level access to the tar
    file. However, this can be still achieved today by using an
    io.TeeWriter on the input reader (e.g.,
    https://play.golang.org/p/xXOIJ0d63A). This work-around is hacky,
    but trying to get access to the low-level format is already hacky.

I agree that tar.Header.Bytes is hacky and am open to the TeeWriter
approach. However, it's possible that tar.Reader.Next skips elides
two headers (e.g. an ‘x’ and a ‘g’). In that case, I'm not sure how
you'd find the middle header short of implementing your own parser.
You could drop tar.Header.Bytes (in favor of the TeeWriter approach)
if you had NextRaw to give single steps.

  • I believe the reason of "accessing information that is not yet
    represented in the tar structure" is somewhat weak since the tar
    archive can easily be forked to provide the new
    functionality. Suppose some new functionality came around, the
    user still needs to implement that functionality
    themselves. Either forking the package or using the above
    workaround seems sufficient.

Forking is so drastic, when NextRaw seems pretty straightforward.

  • A better API might be to add a Format field to Header… However…

I agree that this is a bad fit for the stdlib for the reasons you
give. I think NextRaw and TeeWriter would give sufficient access to
graft this sort of thing onto the stdlib's reader. Do you have
problems with the NextRaw interface? Would it help if I worked up a
patch implementing it?

@dsnet dsnet changed the title archive/tar: expose low level tar properties when reading with Header.Bytes and NextRaw archive/tar: add Reader.NextRaw method to read only one raw header Nov 9, 2016
@dsnet
Copy link
Member

dsnet commented Nov 9, 2016

I think a Header.Bytes field is off the table given:

  • It's low number of use-cases
  • A workaround with io.TeeWriter

An API to add NextRaw is still a possibility, but I have many concerns.

  • What does NextRaw do when it decodes a old GNU header for a sparse file? Do we parse the extended blocks as well? (I would assume yes, otherwise the next NextRaw call will fail). This seems to break your assumption that "raw" always means 512B.
  • Let's suppose you decode the raw header for a PAX header, is it now your responsibility to parse the raw key-value PAX headers yourself?
  • What happens when some value obtained from a previous call to NextRaw is relevant to properly parse in the next call to NextRaw? For example, let's say the next header is for the PAX headers, and the "size" record is set. When we call NextRaw again, how does it know about information from a previous raw header (like the filesize) that is important to parse the next file?
  • What happens when you intertwine Next and NextRaw calls?

@wking
Copy link
Contributor Author

wking commented Nov 9, 2016

On Wed, Nov 09, 2016 at 01:55:26PM -0800, Joe Tsai wrote:

I think a Header.Bytes field is off the table given:

  • It's low number of use-cases
  • A workaround with io.TeeWriter

Agreed, although the TeeWriter approach depends on NextRaw or similar
1.

An API to add NextRaw is still a possibility, but I have many
concerns…

These are good points. The main issue with Bytes access and tar
validation is that we need access to the tip of the entry point. How
about:

// Next2 is just like Next, but if callback is non-nil it is called
// after parsing a header block (512 bytes after the start of the
// record). A single Next2 call may result in multiple callback
// calls, e.g. in the case of pax extended headers. If callback
// returns an error, Next2 returns the same error and does no
// further reading. For records with GNU's "extra" or "sparse"
// header extensions, callback will be called before the additional
// headers are parsed, which means the values set in the callback's
// header argument may be incomplete. If the callback returns with
// no errors, Next2 will continue parsing as usual. Callbacks must
// not modify the passed tar reader internally unless they can do so
// without altering the state of the Reader reference assocaited
// with Next2.
func (tr _Reader) Next2(callback *HeaderCallback) (_Header, error)

type HeaderCallback(tar *Reader, header *Header) error

That seems reasonable to me, although the “don't break the reader”
warning is a bit awkward. You could avoid that with:

type HeaderCallback(header *Header) error

but folks might want to use the Reader reference in a non-desctuctive
way. For example, Python exposes global pax headers on the
Reader/Writer 2, and Go might decide to handle global pax headers
the same way (as an alternative to packing them all into the same
Header map 3). On the other hand, folks can always use a closure or
some such with the Header-only form if they do need a reference to
the reader, so I'm ok either way.

@dsnet
Copy link
Member

dsnet commented Nov 10, 2016

Before we dig deeper into this API (which seems fairly complex), I want to understand the use case more. How does Next2 help achieve the goal of "validating the tar file"? Can you give practical and specific things that you would be checking for it to be "valid"?

@vbatts
Copy link
Contributor

vbatts commented Jan 16, 2017

a possible option (and one that I would love to reconcile with golang upstream) is the approach taken for https://github.com/vbatts/tar-split.
Currently we carry archive/tar but add a functionality of enabling access to raw bytes.
See the docs: https://godoc.org/github.com/vbatts/tar-split/archive/tar
The only additions to the archive/tar API is the RawAccounting bool field and the RawBytes() func on the Reader struct.

(coincidentally I discovered this thread while preparing to open the discussion for pushing the RawBytes feature to upstream)

@wking
Copy link
Contributor Author

wking commented Jan 17, 2017 via email

@dsnet
Copy link
Member

dsnet commented Jan 17, 2017

It does not seem like the RawBytes API provides a way to trigger on elided headers.

As @wking mention, it seems that the RawBytes API is identical to using a TeeReader around the source io.Reader except the RawBytes does not record the body.

It seems that this could be accomplished with a modified TeeReader approach: https://play.golang.org/p/ZUanaldFcu

However, my modification falls short in the following two areas:

  • Consuming the extra padding for the body only occurs at Next, which incorrectly causes the TeeReader to read padding that is not part of the Header.
  • This requires reading the entire body before calling Next, otherwise part of the body is accidentally copied into TeeReader. This is inefficient when the body is a sparse file or the input io.Reader is actually a io.Seeker and data could have been quickly discarded.

If those are the only two deficiencies (correct me if I'm wrong), I would much rather see the following happen instead:

  • That the Read function guarantees that it consumes the padding when it hits io.EOF. This could be documented behavior with a test ensuring it's persistence in the standard library.
  • That we add a Reader.Discard method to tell the read to drop some number of bytes. A discard method is useful for the handling of sparse files (see my comment on #13548).

The first change requires no API change and is something I would want to see anyways. The later API change at least provides benefit for another functionality more popularly requested of the tar package.

@dsnet
Copy link
Member

dsnet commented Jan 18, 2017

#18710 is about adding functionality to tell the Writer what format to use (and conversely for the Reader to report what format was read). @wking, I'm wondering if this will solve your original use-case for this issue where you needed to do validation of the format.

@vbatts
Copy link
Contributor

vbatts commented Jan 19, 2017

@dsnet understandable, but the need was in accessing only the raw headers and padding, and not the body of each entry, as the tar.Reader.Read() already provides this. @wking in this way, it does allow for collecting all the bytes. That's the purpose of tar-split, such that all the non-payload bytes can be preserved for re-assembling the cryptographic-equivalent original.

There is nothing for triggering on tar headers. Though a parser of the RawBytes ought not be too terrible.

Additionally, that RawBytes approach does not affect the tar.Writer at all. It is only for reading.

simar7 added a commit to aquasecurity/fanal that referenced this issue Dec 3, 2019
Use TeeReader instead: https://play.golang.org/p/xXOIJ0d63A

There's an interesting proposal currently open
golang/go#17657

Signed-off-by: Simarpreet Singh <simar@linux.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants