Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

archive/tar: add support for arbitrary pax vendor extensions #14472

Closed
jstarks opened this issue Feb 22, 2016 · 22 comments
Closed

archive/tar: add support for arbitrary pax vendor extensions #14472

jstarks opened this issue Feb 22, 2016 · 22 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@jstarks
Copy link

jstarks commented Feb 22, 2016

Feature request: archive/tar should support reading and writing arbitrary pax vendor extensions. This could be done either with additions to tar.Header, or with new functions on tar.Reader and tar.Writer. This would allow go code to work with metadata that archive/tar does not yet support.

Here's the motivation: Docker (written in go) uses tar as the archive format for container images in Linux. We'd like to make sure this works for Windows, too, but doing so requires extending the tar format to include some non-POSIX Windows metadata (security descriptor, creation time, etc.).

The natural way to extend the tar format is to encode this additional metadata in pax vendor extensions, e.g. "MSWINDOWS.sd" for the security descriptor. Currently, however, archive/tar does not expose pax vendor extensions other than those that start with SCHILY.xattr (used for POSIX extended attributes), which are exposed in Header.Xattrs. It's probably not reasonable to ask archive/tar to support these specific "MSWINDOWS" extensions that are neither formally defined nor used anywhere today, but archive/tar could expose the ability to read and write arbitrary vendor extensions.

This seems like it would be useful outside of this use case as well, e.g. to support POSIX ACLs (SCHILY.acl.*).

@dsnet

@ianlancetaylor ianlancetaylor added this to the Unplanned milestone Feb 22, 2016
@dsnet
Copy link
Member

dsnet commented Feb 22, 2016

I'm not the owner of archive/tar, but I personally think that this is a fairly reasonable feature request. I think it's unfortunate that tar.Header.Xattrs is specific to the SCHILY.xattrs.* headers only, which is too narrow in functionality.

If the package allows for arbitrary vendor extensions, there are some edge cases to think about. For example #13548 requests sparse file support, which can be accomplished using the GNU.sparse.* extended headers. Since these extended headers have semantic meaning in regards to how to read the file, what happens when the user overwrites GNU.sparse.realsize or something? More than likely, the user will end up creating a corrupt file.

This will take a little thought, but it's certainly solvable. If I have time, I can try and tackle this in the go1.7 cycle.

@jstarks
Copy link
Author

jstarks commented Feb 23, 2016

Yeah, I thought about that, too, but I don't have a good solution. The same might be said e.g. for SCHILY.acl.*, if someone adds support for POSIX ACLs to Header directly. If the caller to WriteHeader() fills out both the specific Header field and an entry in the PAX extension map, which takes precedence?

Thanks for considering this change. For now, I'm going to have to fork archive/tar to add the necessary functionality, but I'd like to converge back on the official version if/when this feature makes it in.

@bradfitz bradfitz modified the milestones: Go1.8Maybe, Unplanned May 9, 2016
@quentinmit quentinmit added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Oct 10, 2016
@rsc
Copy link
Contributor

rsc commented Oct 10, 2016

@dsnet:

I'm not the owner of archive/tar

I think you are now. :-)

Anyway, it sounds like you are in favor and so a CL just needs to be written. This isn't going to block the release, but if you want to write a CL or someone else wants to write it and send it to you for review, that sounds fine.

@rsc rsc added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels Oct 10, 2016
@dsnet
Copy link
Member

dsnet commented Oct 14, 2016

Pushing to Go 1.9.

I'm going to close off all Reader related issues in this cycle and let Go 1.9 be about Writer.

@dsnet dsnet modified the milestones: Go1.9, Go1.8Maybe Oct 14, 2016
@wking
Copy link
Contributor

wking commented Oct 28, 2016

Another way to deal with this (for readers anyway) is to add:

func (tr *Reader) NextRaw() (*Header, error)

(or some similar name) that gets you the next entry without invoking the current automatic pax-extention-header consumption and similar. Users who want to handle custom extended headers could call NextRaw() (or whatever) to advance one entry at a time, regardless of the entry type. This would also be useful for low-level code that wanted to validate the tar files (e.g. if wanted to make sure there were no pax extended headers because some downstream library didn't understand pax).

@dsnet
Copy link
Member

dsnet commented Oct 28, 2016

@wking

NextRaw ... gets you the next entry without invoking the current automatic pax-extention-header consumption and similar. Users who want to handle custom extended headers could call NextRaw() (or whatever) to advance one entry at a time, regardless of the entry type.

If I understand your suggestion correctly, you are proposing that we try to read the archive as straight USTAR format (since PAX is really an extension built ontop of USTAR)? What happens when the underlying format is the old GNU format, which does not conform to USTAR? It also seems pretty painful, that'd you would have to parse all of the PAX extended headers yourself.

This would also be useful for low-level code that wanted to validate the tar files (e.g. if wanted to make sure there were no pax extended headers because some downstream library didn't understand pax).

Is this a real use-case that you have? Validating that a tar file is of a given format is actually pretty hard. Do you validate it according to strict conformance to the POSIX specification or GNU manual? Do you validate it according to what the GNU implementation actually does? Do you validate it according to what the world seems to believe those formats are? If there is a problem with implementations properly conforming to HTTP specifications, then the conformance of tar implementations to "specifications" is even worse.

I'm not sure the standard library should provide this feature. On the other hand, I'm willing to explore providing an option where the Writer guarantees to output only specific formats; please file another issue if you care for it.

@wking
Copy link
Contributor

wking commented Oct 28, 2016

On Fri, Oct 28, 2016 at 03:19:18PM -0700, Joe Tsai wrote:

If I understand your suggestion correctly, you are proposing that we
try to read the archive as straight USTAR format (since PAX is
really an extension built ontop of USTAR)?

Yeah.

What happens when the underlying format is the old GNU format, which
does not conform to USTAR?

I'm not particularly familiar with the old GNU format; can you spell
out the problem? Skimming over libarchive's tar(5), 1 talks about
header entries that conflict with ustar as well as defining additional
typeflags like S for sparse. For header conflicts, you'd need to
extend Header to expose the fields, either by adding things like the
‘sparse’ structure to Header, or by parsing whatever matches the
current Header fields and providing Header.Bytes with a []byte slice
on the raw header. In the interest of providing a simple fix that
doesn't limit consumers, I'm leaning towards Header.Bytes.

This would also be useful for low-level code that wanted to
validate the tar files (e.g. if wanted to make sure there were no
pax extended headers because some downstream library didn't
understand pax).

Is this a real use-case that you have? Validating that a tar file is
of a given format is actually pretty hard. Do you validate it
according to strict conformance to the POSIX specification or GNU
manual?

At the moment, I'm just looking at typeflag values (not having realize
that Go was eliding them) 2. My goal was to validate against the
current ustar (opencontainers/image-spec#342), but I don't really care
about which standard gets picked. I'm more interested in being able
to check that a given tarball conforms to whichever standard gets
picked, otherwise interop between tar producers and consumers is
unreliable.

I also have the same “I want to access a pax extended header that Go
currently ignores” use case that opened this issue 3.

I'm not sure the standard library should provide this feature…

I agree that the standard library does not need to provide tar
compliance validation. What I'm looking for (with Reader.NextRaw and
Header.Bytes) is enough access to the low-level content so that
third-party tools can extend handling where they need to without
having to completely bypass the existing stdlib implementation.

(on the other hand, I'm willing to explore providing an option where
the Writer guarantees to output only specific formats; please file
another issue if you care for it).

I'll see how getting low-level extension points goes for readers ;).
If/when that gets straightened out, we can come back and look at
writers in a separate issue.

@dsnet
Copy link
Member

dsnet commented Oct 29, 2016

@wking, I glanced through your references and I want to make sure that we are addressing your issues appropriately. It seems from that you have two separate issues:

The first problem is irrelevant to this ticket and I am somewhat opposed to providing that amount of low-level control over the consumed and generated tar ball. I know I'm not really responding to your points. We can discuss more, but please file a separate issue for that.

The second problem is exactly what this issue is about and I think there is a sufficient need that this should be addressed. I'll present a preliminary design shortly that can possibly make it into Go 1.9.

@dsnet
Copy link
Member

dsnet commented Oct 29, 2016

@jstarks and @wking. Your thoughts on the following API?

type Header struct {
    ...

    // Records stores key-value records for a given file to be encoded as a
    // series of PAX extended header records.
    //
    // The key must take the following form:
    //  VENDOR.keyword
    // Where VENDOR is some namespace in all uppercase and keyword is any
    // arbitrary UTF-8 encoded string not containing the '=' character
    // (e.g., "GOLANG.pkg.version").
    //
    // The value has no restrictions.
    Records map[string]string

    // Xattrs stores attributes for a file under the SCHILY namespace.
    //
    // Recording a key-value pair as:
    //  hdr.Xattrs["key"] = "value"
    // Is equivalent to doing the following:
    //  hdr.Records["SCHILY.xattr.key"] = "value"
    //
    // Deprecated: Use Records instead.
    Xattrs map[string]string
}

As far as I know, the only common PAX header that has semantic meaning for the interpretation of the file is GNU.sparse.*. These will be handled specially.

Changes to Reader

When reading a file, all PAX records will be read. All records will be copied into Records except for:

  • Keys that are for GNU sparse (e.g., strings.HasPrefix(key, "GNU.sparse."))
  • Keys that are not prefixed by a vendor (e.g., strings.IndexByte(key, '.') <= 0)
  • Keys where the vendor name is not all UPPERCASE (e.g., not in the set of [A-Z])

All of the records in Records with the SCHILY.xattr. prefix will be copied to Xattrs, but with the prefix trimmed off. We continue to populate Xattrs for compatibility reasons.

Changes to Writer

When writing a file, all records in Xattrs will be copied to Records, but with the SCHILY.xattr. prefix prepended to the key. If the key-value already exists in Records and the value differs, we reject the header.

The records in Records are considered valid except for:

  • Keys that contain the '=' character
  • Keys that are for GNU sparse
  • Keys that are not prefixed by a vendor
  • Keys where the vendor name is not all UPPERCASE

If those condition passes, we will write that information into the tar file.

Note that the restrictions (except for the ones regarding GNU.sparse) come from the PAX specification, which states the following:

The standard developers have reserved keyword name space for vendor extensions. It is suggested that the format to be used is:
VENDOR.keyword
where VENDOR is the name of the vendor or organization in all uppercase letters. It is further suggested that the keyword following the be named differently than any of the standard keywords so that it could be used for future standardization, if appropriate, by omitting the VENDOR prefix.

In other places:

Keywords consisting entirely of lowercase letters, digits, and periods are reserved for future standardization. A keyword shall not include an <equals-sign>.

@jstarks
Copy link
Author

jstarks commented Oct 29, 2016

This looks good to me. I think this would cover the MSWINDOWS.* extensions I created for Windows metadata.

Looking through my fork of archive/tar, the other thing I needed to change was the handling of time. Basically I need to guarantee subsecond time granularity, which is anticipated by this comment in writer.go:

// TODO(dsnet): we might want to use PAX headers for
// subsecond time resolution, but for now let's just capture
// too long fields or non ascii characters

To avoid confusing this issue, I can open another issue to request some mechanism to encode time with subsecond resolution, if you think it's worthy of consideration.

@wking
Copy link
Contributor

wking commented Oct 29, 2016

On Fri, Oct 28, 2016 at 05:01:50PM -0700, Joe Tsai wrote:

Filed as #17657.

@wking
Copy link
Contributor

wking commented Oct 29, 2016

On Fri, Oct 28, 2016 at 05:02:14PM -0700, Joe Tsai wrote:

type Header struct {
// Records stores key-value records for a given file to be encoded as a
// series of PAX extended header records.

This looks good to me, but I'd rather have it be PaxExtendedHeaders or
some such. More to type, but clearly namespaced from whatever other
extended record types future specs decide to stuff into a tar-based
format ;).

// The key must take the following form:
// VENDOR.keyword
// Where VENDOR is some namespace in all uppercase and keyword is any
// arbitrary UTF-8 encoded string not containing the '=' character
// (e.g., "GOLANG.pkg.version").

This is stricter than the spec, which you quote. Here it is with a
bit more detail 1:

An extended header shall consist of one or more records, each
constructed as follows:

"%d %s=%s\n", <length>, <keyword>, <value>

The extended header records shall be encoded according to the
ISO/IEC 10646-1:2000 standard UTF-8 encoding… The fields
can be any UTF-8 characters…

… Keywords consisting entirely of lowercase letters, digits, and
periods are reserved for future standardization. A keyword shall
not include an …

And then later down, they suggest something that's closer to your
requirement 2:

The standard developers have reserved keyword name space for vendor
extensions. It is suggested that the format to be used is:

VENDOR.keyword

where VENDOR is the name of the vendor or organization in all
uppercase letters. It is further suggested that the keyword
following the be named differently than any of the standard
keywords so that it could be used for future standardization, if
appropriate, by omitting the VENDOR prefix.

So I think it's a good idea for users to follow those suggestions, but
I think we only want to error out if they try and put an equals sign
or non-UTF-8 in the Records (or PaxExtendedHeaders, or whatever) key.
I'm not particular enough to care about the destinction between
“ISO/IEC 10646-1:2000 standard UTF-8” and “whatever version of UTF-8
unicode/utf8 implements”, because that's well past my character
requirements ;). Folks who do care about that disctinction could use
something like #17657 do access the headers directly if they do care
about the UTF-8 flavor.

And rather than recapping all of that in the comment, I'd suggest
having the comment link out to the pax extended header spec [1,2].

// Xattrs stores attributes for a file under the SCHILY namespace.
//
// Recording a key-value pair as:
// hdr.Xattrs["key"] = "value"
// Is equivalent to doing the following:
// hdr.Records["SCHILY.xattr.key"] = "value"
//
// Deprecated: Please use Records.
Xattrs map[string]string

Looks good to me.

As far as I know, the only common PAX header that has semantic
meaning for the interpretation of the file is GNU.sparse.*. These
will be handled specially.

I'm not sure about “common”, but SCHILY.filetype [3](which I'd like
to use for whiteouts in OCI's image-spec [4]) supports a number of
exotic filetypes.

When reading a file, all PAX records will be read. All records will
be copied into Records except for:

  • Keys that are not prefixed by a vendor (e.g.,
    strings.IndexByte(key, '.') <= 0)
  • Keys where the vendor name is not all UPPERCASE (e.g., not in the
    set of [A-Z])

I'd copy anything that we don't consume into Records. For example,
maybe a future POSIX standardizes a new lowercase key, and folks run
across it in the wild before Go has been taught about it. Or the user
is running an old Go, so even if master has been taught about the new
key, their version doesn't understand it. Users should still have a
way to access the headers so they can handle them directly if they
want. Another way to think about this is that we should be able to
round-trip valid pax extended headers regardless of their keywords.

When writing a file, all records in Xattrs will be copied to
Records, but with the SCHILY.xattr. prefix prepended to the
key. If the key-value already exists in Records and the value
differs, we reject the header.

Is “reject the header” “exit WriteHeader with an error”? That makes
sense, but it's probably good to point that out in the docs and say
which error you'll return.

The records in Records are considered valid except for:

  • Keys that are not prefixed by a vendor
  • Keys where the vendor name is not all UPPERCASE

I think these are too strict; see above. For cases where the user
sets a known key (e.g. GNU.sparse.* or atime) and the value differs
from that which we would otherwise write (e.g.
Headers.Records["atime"] doesn't match Headers.AccessTime), I'd
return the same error we return for Xattrs/Records missmatches.

@dsnet
Copy link
Member

dsnet commented Oct 29, 2016

I'm not sure about “common”, but SCHILY.filetype [3](which I'd like to use for whiteouts in OCI's image-spec [4]) supports a number of exotic filetypes.

It seems that SCHILY.filetype is essentially a text-based form of the Flag field in the original V7 header. I should have clarified, but when I said "semantic meaning", I mean that the GNU.sparse headers affect how the file content itself is actually parsed and interpreted. SCHILY.filetype seems to only affect the end-user of the tar package.

Sparse file support is complex enough that it makes sense to build that into the tar package. I'm not aware of any other file semantic that is complex enough to warrant native support from the package.

I think these are too strict;

The strictness of my proposed API is entirely intentional. We can always relax the rules in future versions, but we can never make them strict again due to the compatibility agreement. The reason I chose an intentionally strict API is because I didn't want to deal with all the edge cases that occur when member fields of the Header struct conflict with the PAX headers that a user puts in. I believe that it will be too easy for the user to accidentally create a conflicting Header struct that ends up being rejected by WriteHeader.

I believe the number of users who even need arbitrary PAX headers are relatively low and it seems my current proposal is sufficient for all the known use cases. It seems to me that the use cases for complete reign over the PAX headers are mostly hypothetical. Should a need arise in the future, people are welcome to fork the tar library for their own needs until we decide whether and how the standard library should support it. It seems that @jstarks forked tar for his own purposes while we're figuring out what to do here.

One listed header that people may want access to that my API currently doesn't provide is the comment header, but there have been no reports of people needing that. I'd rather permit controlled access to comment than allow free-reign over the extended headers.

I believe the proposed API with all its strictness satisfies the real use-cases that people have without complicating the API significantly for the typical user of the tar package.

@dsnet dsnet closed this as completed Oct 29, 2016
@dsnet dsnet reopened this Oct 29, 2016
@wking
Copy link
Contributor

wking commented Oct 29, 2016

On Fri, Oct 28, 2016 at 10:48:44PM -0700, Joe Tsai wrote:

I believe that it will be too easy for the user to accidentally
create a conflicting Header struct that ends up being rejected by
WriteHeader.

This is currently true with your Xattrs / Records split. If someone
reads in a Header which sets something in Xattrs, your current
proposal requires them to update both Records and Xattrs if they want
to change the value. If they only update one location, they'll have a
WriteHeader-time conflict.

How about replacing your Records with:

func (h *Header) PaxHeaders() map[string]string
func (h *Header) SetPaxHeader(keyword, value string) error

SetPaxHeader would parse known headers (e.g. “atime”) into the
appropriate Header attribute (e.g. Header.AccessTime).

PaxHeaders() would construct the headers as writeHeader does now,
blindly clobbering any fields it understands, and return a copy of the
internal map. Because SetPaxHeader was keeping the attributes up to
date, any actual clobbering would be due to later changes to the
attribute itself. For example:

header.SetPaxHeader("atime", "1084839148.1212")
header.AccessTime // 2004-05-18T00:12:28.1212
header.Accesstime = time.Unix(12.345)
header.PaxHeaders()["atime"] // "12.345"

Then callers can update via either the Header attribute or via
SetPaxHeader without worrying about an inconsistent state. The
internal header state would be stale after the attribute write, but
we'd freshen it during PaxHeaders and WriteHeader calls.

And with that sort of consistency guarantee, I think it would be
fairly straightforward to support atime, etc. syncing vs. just
SCHILY.xattr syncing. We'd want to support both directions (pax
header ↔ attribute) for every property that had support for one
direction. But that sort of round-tripping is a good idea anyway
(e.g. to support writing times with subsecond resolution 1), and we
don't have so many currently-one-way fields that adding support for
round-tripping would be that difficult.

One interesting case (that I'm ok with but which is probably worth
documenting) is when someone uses SetPaxHeader to set a value which
can be represented without using pax headers. For example, a
whole-second time:

header.SetPaxHeader("atime", "1084839148")
header.PaxHeaders()["atime"] // nil

I'm fine with the explicitly-set header “going away”, and if that
seems too surprising we could work around it with more internal state
(e.g. keeping a list of keys which PaxHeaders() should set regardless
of whether it needs pax headers to represent their data).

@cyphar
Copy link

cyphar commented Nov 9, 2016

@wking Or we can just make it so that if a key is set in Xattrs (and even if it's in Records) we just use the Xattrs one. Please don't complicate this proposal. Xattrs is deprecated, but it is also the old interface so we should obey it over the new one.

@wking
Copy link
Contributor

wking commented Nov 9, 2016

On Wed, Nov 09, 2016 at 05:41:11AM -0800, Aleksa Sarai wrote:

Or we can just make it so that if a key is set in Xattrs (and even
if it's in Records) we just use the Xattrs one. Please don't
complicate this proposal. Xattrs is deprecated, but it is also the
old interface so we should obey it over the new one.

With an entry that sets SCHILY.xattr.key to foo, all of the proposals
support backwards compat:

header, err = tarReader.Next()
header.Xattrs["key"] = "bar"
n, err = tarWriter.WriteHeader(header)

With @dsnet's original proposal 1, folks trying to use the new API
with:

header, err = tarReader.Next()
header.Records["SCHILY.xattr.key"] = "bar"
n, err = tarWriter.WriteHeader(header)

will get an error from the WriteHeader call (“If the key-value already
exists in Records and the value differs, we reject the header”). To
get a successful write with the original proposal using the new API,
you'd need:

header, err = tarReader.Next()
header.Records["SCHILY.xattr.key"] = "bar" // you can skip this line
header.Xattrs["key"] = "bar"
n, err = tarWriter.WriteHeader(header)

That makes actually removing Xattrs (in some future Go release after a
long deprecation period) hard, because callers aren't able to ignore
the deprecated Xattrs.

With your proposal (not erroring out on conflicts and preferring
Xattrs), folks still need to touch Xattrs to get a successful update,
and the Records-only approach will silently succeed (while leaving
SCHILY.xattr.key set to foo).

With my proposal 2, callers can still use the old Xattrs-only
approach, but they can also use a new SetPaxHeader-only approach:

header, err = tarReader.Next()
header.SetPaxHeader("SCHILY.xattr.key", "bar")
n, err = tarWriter.WriteHeader(header)

That doesn't involve the deprecated Xattrs API, so you can transition
consumers to the new API and eventually (after some long deprecation
period) drop the deprecated API entirely.

If SetPaxHeader is not the API folks want (for whatever reason),
another approach would be:

PaxHeaders map[string]string // or Records, or whatever

// Create PaxHeaders entries for anything in in Xattrs and empty Xattrs.
PaxHeadersOptIn() error

// Deprecated. Call PaxHeadersOptIn() and use PaxHeaders
Xattrs map[string]string

Callers interested in the new API could:

header, err = tarReader.Next()
err = header.PaxHeadersOptIn()
header.PaxHeader["SCHILY.xattr.key"] = "bar"
n, err = tarWriter.WriteHeader(header)

Then when Xattrs is eventually dropped, PaxHeadersOptIn becomes a
no-op. And after another suitably-long deprecation period, you could
drop PaxHeadersOptIn. But during the initial transition, you'd still
have to shift the whole stack consuming those opted-in headers from
Xattrs to PaxHeaders. With my preferred (Set)PaxHeaders approach you
could make the transition one call at a time (and other calls in your
stack could continue use either Xattrs or (Set)PaxHeaders). So I
still think (Set)PaxHeaders will give the best migration experience.

@quentinmit
Copy link
Contributor

I like the idea of SetPaxHeader. I do not like PaxHeadersOptIn.

I think SetPaxHeader is better than exposing a map because it sidesteps the possible collisions.

In this proposal, how do you remove a header? Is an empty value the same as removing, or does that potentially have semantic meaning (in which case I guess we need DeletePaxHeader?)

@wking
Copy link
Contributor

wking commented Nov 9, 2016

On Wed, Nov 09, 2016 at 11:49:31AM -0800, Quentin Smith wrote:

In this proposal, how do you remove a header? Is an empty value the
same as removing, or does that potentially have semantic meaning (in
which case I guess we need DeletePaxHeader?)

Ah, good point. I don't see a “ must be non-empty” requirement
in the spec 1, so I suggest we support deletion. An alternative to
DeletePaxHeader would be:

func (h *Header) SetPaxHeader(keyword, value *string) error

where nil means “delete the header”. I don't have a strong preference
between the *string and DeletePaxHeader approaches, and I don't read
enough Go to know which is more idiomatic.

@dsnet
Copy link
Member

dsnet commented Nov 9, 2016

I'm not opposed to adding methods to Header like SetPaxHeader that helps manage the fields in the Header (similar to methods on zip.FileHeader), but I am opposed to a method that manages some unexported field as @wking's initial proposal suggests.

Here are my reasons why:

  • Unexported fields implies that the Header has hidden state. Copying a Header struct from one variable to another incurs a coupling between the two versions. In the current version, only the Xattrs field is aliased between two copies. The proposed unexported field will be another aliased field, but is worse since it is hidden from the user.
  • The API is non-performant since the PaxHeaders method creates a map copy every time.
  • The way to check if any PAX headers exist is to call the above method, which is non-performant.
  • The SetPaxHeader doesn't actually sidestep possible collisions, it just makes the decision to replace the value in the struct field in the event of a collision. This policy could easily be applied by WriteHeader on the exported map approach (if we support non-namespaced keys).
  • The main advantage of SetPaxHeader is that it keeps the "basic" fields like Name, Uid, etc in sync with what is in exported map. I find this to be of dubious benefit since a person can just easily set those fields in the struct directly (e.g., it is easier to just do h.Name = "myfile" rather than h.SetPaxHeader("path", "myfile")). Why present two approaches to do the same thing? For that reason, I forbid the ability to set PAX headers for basic fields.
  • Header deletion is more obvious in the exported map approach (just use the builtin delete). I'm not worried about providing the user with the ability to encode the actual PAX deletion record since that is only relevant if we support global PAX headers, which is a different discussion.

I'm not opposed to a SetPaxHeader method, but I believe it should be in addition to, and not replacing the exported map.

Responses to other statements made:

will get an error from the WriteHeader call (“If the key-value already exists in Records and the value differs, we reject the header”).

@wking, The semantic meaning of my proposal is only if the existing value differs. Thus, the following is valid:

header, err = tarReader.Next()
header.Records["SCHILY.xattr.key"] = "bar"
n, err = tarWriter.WriteHeader(header)

I think SetPaxHeader is better than exposing a map because it sidesteps the possible collisions.

@quentinmit, It doesn't sidestep collisions, they still occur. The specific policy chosen (i.e., always replace) is not specific to the SetPaxHeader method.

I don't see a “ must be non-empty” requirement in the spec.

@wking, Actually, an empty value means deletion: If the <value> field is zero length, it shall delete any header block field, previously entered extended header value, or global extended header value of the same name.. So an empty string is totally fine to represent deletion.

@wking
Copy link
Contributor

wking commented Nov 9, 2016

On Wed, Nov 09, 2016 at 01:39:51PM -0800, Joe Tsai wrote:

  • Unexported fields implies that the Header has hidden
    state. Copying a Header struct from one variable to another
    incurs a coupling between the two versions. In the current
    version, only the Xattrs field is aliased between two
    copies. The proposed unexported field will be another aliased
    field, but is worse since it is hidden from the user.

I'm not familiar with “aliased field”, can you unpack that for me? Do
you mean “deep copied” [1](copied Header gets a new pax-header map)
or “shallow copied” (copied Header shares the original pax-header map)
2? I certainly prefer deep-copies here. Folks who want a shared
pax-header map should be passing *Header references. I don't see a
use case for “I want copies of all of Header but not the pax headers”.

  • The API is non-performant since the PaxHeaders method creates a
    map copy every time.

Every time you call it. Folks who need to access a number of headers
at once should use:

paxHeaders := header.PaxHeaders()
fmt.Printf("some headers: %s %s\n", paxHeaders["foo"], paxHeaders["bar"])

I agree that it would be nice to avoid this copy cost, but it's a
price I'm willing to pay to draw a clear synchronization boundary (the
map returned by PaxHeaders will not be updated by future changes to
header, and header will not be updated by future changes to the
returned map).

  • The way to check if any PAX headers exist is to call the above
    method, which is non-performant.

I think the local-variable approach mitigates this, but I'm also
comfortable adding a ‘PaxHeader(key string) string, ok’ getter.

  • The SetPaxHeader doesn't actually sidestep possible collisions,
    it just makes the decision to replace the value in the struct
    field in the event of a collision. This policy could easily be
    applied by WriteHeader on the exported map approach (if we
    support non-namespaced keys).

With the atime examples I floated in 3, the newer value clobbers the
older, regardless of whether you're coming at it via header.AccessTime
or header.SetPaxHeader. You cannot replicate that approach in
WriteHeader, because the “which value is newer” information is lost.

  • The main advantage of SetPaxHeader is that it keeps the "basic"
    fields like Name, Uid, etc in sync with what is in exported
    map.

It keeps all the fields in sync, since it will be updating Xattrs
too.

I find this to be of dubious benefit since a person can just
easily set those fields in the struct directly (e.g., it is easier
to just do h.Name = "myfile" rather than h.SetPaxHeader("path", "myfile")). Why present two approaches to do the same thing?

Because “returns the pax headers that WriteHeaders will write” is
easier to explain without the “but not for fields that are otherwise
represented in Header” caveat. However, you still get all the
synchronization and migration benefits of SetPaxHeader even if your
implementation errors in “this field is represented on another,
non-deprecated Header field” cases. So I'd accept a SetPaxHeader that
didn't support atime, etc., if you really felt they were adding
confusion.

I'm not opposed to a SetPaxHeader method, but I believe it should
be in addition to, and not replacing the exported map.

As soon as you allow an exported map, you lose the ability to
synchronize the other fields.

@wking, The semantic meaning of my proposal is only if the
existing value differs. Thus, the following is valid:

header, err = tarReader.Next()
header.Records["SCHILY.xattr.key"] = "bar"
n, err = tarWriter.WriteHeader(header)

But the existing value was “bar” (as loaded by Next() 1).

I don't see a “ must be non-empty” requirement in the spec.

@wking, Actually, an empty value means deletion: If the <value> field is zero length, it shall delete any header block field, previously entered extended header value, or global extended header value of the same name.. So an empty string is totally fine to
represent deletion.

Hmm. An empty-string value will have a length of one (“The
field shall be the decimal length of the extended header record in
octets, including the trailing .”). So thinking through our
global pax header handling, reading will look like:

  1. Hit a global (g) record and store it in tarReader's state.
  2. Hit a non-global (x) record, create a per-file pax-header map, seed
    it with the current global header entries, and then read the
    current record's entries into the per-file pax-header map,
    clobbering any existing entries (and removing them if the new entry
    has zero length.
  3. Hit another non-global (x) record (are multiple x records for a
    single file legal?). Grab the per-file pax-header map from (2) and
    continue clobbering existing entries.
  4. Hit the “real” record. Attach the pax-header map to the returned
    object.

Writing gets tricky. There doesn't seem to be a tailored API for
managing global headers, but if someone writes one by hand they'll
want a way do all of the following on subsequent non-global (x)
records:

a. Set an empty-string value (e.g. value = ""; SetPaxHeader("foo", &value)).
b. Remove a value (e.g. DeletePaxHeader("foo")).
c. Set a zero-length header (SetPaxHeader("foo", nil)?).

So my current preferred API is:

// PaxHeaders returns a copy of the pax extended headers which would
// currently be written by a WriteHeader call.
func (h _Header) PaxHeaders() map[string]_string

// SetPaxHeader sets a pax extended header. If the keyword
// corresponds to another header field (e.g. AccessTime or the
// deprecated Xattrs), the corresponding field will also be updated.
// If the value is nil, calls to WriteHeader will generate a
// zero-length masking entry.
func (h *Header) SetPaxHeader(keyword, value *string) error

// DeletePaxHeader removes a pax header from future PaxHeaders or
// WriteHeader output. If the field is represented by another
// Header field which requires the pax extended header (e.g. an
// AccessTime with subsecond precision), DeletePaxHeader will return an
// error (TODO: which error?).
func (h *Header) DeletePaxHeader(keyword) error

@wking
Copy link
Contributor

wking commented Nov 9, 2016

On Wed, Nov 09, 2016 at 02:28:56PM -0800, W. Trevor King wrote:

Wed, Nov 09, 2016 at 01:39:51PM -0800, Joe Tsai:

  • The SetPaxHeader doesn't actually sidestep possible
    collisions, it just makes the decision to replace the value in
    the struct field in the event of a collision. This policy could
    easily be applied by WriteHeader on the exported map approach
    (if we support non-namespaced keys).

With the atime examples I floated in [3], the newer value clobbers
the older, regardless of whether you're coming at it via
header.AccessTime or header.SetPaxHeader. You cannot replicate that
approach in WriteHeader, because the “which value is newer”
information is lost.

Another important distinction is that WriteHeader is not the only
possible consumer of this information. Unpacking my earlier “you'd
still have to shift the whole stack” 1, consider the workflow where:

  1. FuncA gets a Header with SCHILY.xattr.key set to “foo”.
  2. FuncA sets SCHILY.xattr.key to “bar” using whatever new API we agree on
    here, but does not set it via Xattrs.
  3. FuncA passes that Header on to FuncB for further processing.
  4. FuncB (which has not been updated to use the new API) looks in
    Xattrs and sees $VALUE, when FuncA presumably expected it to see
    the newer “bar”.

With the SetPaxHeader approach, $VALUE will be the new “bar”. With
the Records approach, $VALUE will be the old “foo”. You can bump
FuncA to the new API (removing all Xattrs references) now, and wait
until next month before you get around to updating FuncB.

To avoid confusion with the Records approach, as soon as one element
in your stack (e.g. FuncA) switches to the new approach, the whole
stack (including FuncB) must also be immediately transitioned. Where
portions of the stack are shared (e.g. via a library), that means that
Records would lead to consumer flag days as library consumers updated
all access from Xattrs to Records before they could bump the library.

An alternative Xattrs → Records migration plan would be the parallel
updates I sketched out in 1, and have FuncA explicitly update both
Records and Xattrs. That allows you to migrate gradually, but means
that once the migration is complete you have to go back in a second
pass and remove the now-dummy Xattrs bumps.

@dsnet dsnet modified the milestones: Go1.10, Go1.9 May 22, 2017
@gopherbot
Copy link

Change https://golang.org/cl/58390 mentions this issue: archive/tar: support arbitrary PAX records

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

9 participants