Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: io/fs.Hash and Hasher #43076

Closed
earthboundkid opened this issue Dec 8, 2020 · 19 comments
Closed

proposal: io/fs.Hash and Hasher #43076

earthboundkid opened this issue Dec 8, 2020 · 19 comments

Comments

@earthboundkid
Copy link
Contributor

Background:

  • When serving an embed.FS as an http.FileSystem, it would be nice to have etag support.
  • Many file stores (e.g. S3) provide a way to get the MD5 hash of file.

Proposal:

Add a func to io/fs to support getting the hash of file.

Perhaps this?

func Hash(fs FS, h hash.Hash, path string) ([]byte, error)

type Hasher interface {
   Hash(h hash.Hash, path string) ([]byte, error)
}

If there's a dependency import issue, maybe put it into package hash.

Ref. #41191 (comment)

@gopherbot gopherbot added this to the Proposal milestone Dec 8, 2020
@earthboundkid
Copy link
Contributor Author

On further consideration, crypto.Hash is better than hash.Hash. But that would almost certainly require not putting it into io/fs.

@tv42
Copy link

tv42 commented Dec 8, 2020

hash.Hash is an interface (https://golang.org/pkg/hash/#Hash), is this supposed to be used like h := fs.Hash(fsys, fnv.New128a(), "index.html")? There's some cost to such a New call, more than passing a string identifier or such.

If you meant crypto.Hash (https://golang.org/pkg/crypto/#Hash), I'd like to protest against a central registry that doesn't cater to non-stdlib use.

Also, I think we should cater for a "just give me whatever hash you happen to have" use; e.g. ETags don't care.

@earthboundkid
Copy link
Contributor Author

I was thinking there was a way to look at a hash.Hash and figure out that it was an MD5 hasher, but I think I was thinking of crypto.Hash. The fact that cypto.Hash can't be extended outside stdlib is bad, yes. I guess the question is whether just having any hash is good enough or users will care to be able to ensure that the hash is a particular one.

https://godoc.org/gocloud.dev/blob#Attributes is probably what should be used as prior art and copied:

    // CreateTime is the time the blob was created, if available. If not available,
    // CreateTime will be the zero time.
    CreateTime time.Time
    // ModTime is the time the blob was last modified.
    ModTime time.Time
    // Size is the size of the blob's content in bytes.
    Size int64
    // MD5 is an MD5 hash of the blob contents or nil if not available.
    MD5 []byte
    // ETag for the blob; see https://en.wikipedia.org/wiki/HTTP_ETag.
    ETag string

@tv42
Copy link

tv42 commented Dec 8, 2020

Some alternatives:

  1. Identify hash algorithms by strings. Empty string means "anything".
    1a. Just let chaos reign, like database/sql.Register
    1b. stdlib gets dotless crypto.Hash.String names, hash/fnv gets "FNV-1", "FNV-1a", "FNV-1a-64" and such. Non-stdlib hash implementations like Blake3 are required to use their import paths. (edit: as prefixes, with variants allowed much like Go identifiers)
    1c. (edited to add) Even stdlib hashes are identified by their import paths.

  2. Identify hash algorithms by an interface{} value exported by the relevant hash library, with the same rules about uniqueness (edit: and comparability) as context.WithValue. nil means "anything".

@earthboundkid
Copy link
Contributor Author

earthboundkid commented Dec 8, 2020

So what if the proposal is instead http.ReadETagFS:

type ReadETagFS interface {
    fs.FS
    ReadETag(path string) (string, error)
}

and embed.FS gets a ReadETag method that just returns the MD5? ISTM that solves the major use case and is compatible with what you get from S3 and other blob storage.

@tv42
Copy link

tv42 commented Dec 8, 2020

I wouldn't tie fs.FS to specifically ETags or MD5.

type ShortHashFS interface {
    fs.FS

    // ShortHash returns a short hash digest summarizing the file.
    // The returned digest is expected to be different if the file content is different,
    // but the hash algorithm used is not required to be cryptographically strong
    // and the length of the digest may not be long enough to provide perfect
    // uniqueness.
    // Length of the digest must be at least 8 bytes.
    ShortHash(path string) (digest []byte, err error)
}

(And then error can be whatever ErrNotSupported convention wins.)

I'm returning []byte not string as I'd rather have encoding the digest be the caller's responsibility, if and when that is needed, as 1) it's not always needed, e.g. for use in protobuf 2) the allowed character set varies, and e.g. ETag has a bunch of extra rules on top.

I think the use case for asking for a specific hash is valid, but that can be a separate concern. Example: To upload to S3, I can ensure integrity by providing an MD5 as Content-MD5, but it's not mandatory. Except if I'm using signed urls (AWS sig v4), then I need SHA256 in x-amz-content-sha256 instead. In 2022, FooCloud might prefer Blake3.

@tv42
Copy link

tv42 commented Dec 8, 2020

Requirements for an ideal "do you have a shortcut for my custom hash" extension interface:

  1. Return a hash.Hash so caller can e.g. say r := h.(*blake3.Hasher).XOF() https://godoc.org/lukechampine.com/blake3#Hasher.XOF for arbitrary-size streaming output instead of a predetermined []byte.

  2. Take a "configurable hash algorithm request" as argument, to support keyed (personalized/salted) hashes and e.g. hashes where the result depends on parallelization. E.g. a content-addressed storage mechanism might not want to store a SHA256 that allows a confirmation-of-a-file attack, but use a hash that is meaningful only if you already know the key. The earlier idea of taking interface{} much like context.WithValue works here.

type PrehashedFS interface {
    fs.FS

    // Prehashed returns a hasher that has already digested the file contents,
    // if one has been precomputed.
    // Calling hash.Hash.Write on the returned hasher MAY return an error.
    // algo follows the same rules as context keys, and for parameterized
    // hashes can also be an instance of a configuration data type.
    // If algo is not recognized by the FS, or the digest is not already available,
    // returns ErrNotSupported.
    Prehashed(path string, algo interface{}) (hash.Hash, error)
}

// Prehashed ...
// Prehashed will not compute a hash digest if one was not already available.
func Prehashed(fsys FS, path string, algo interface{}) (hash.Hash, error) {
    ...
}

Note that there's no guarantee that the returned hash.Hash has the same underlying type as e.g. sha256.New(). Many hash implementations don't support that. To unlock special features like XOF, new hash implementations can support marshaling+unmarshaling their state, and then the caller can type assert the returned hash.Hash to something meaningful.

Note that the hash.Hash returned may have to break https://golang.org/pkg/hash/#Hash rules that say Write never returns an error. If all I have is a []byte result of an earlier SHA256, there's no way I can continue it. If this feels too ugly, then this needs its own interface defined that's a subset of hash.Hash, or something like that. That doesn't change the big idea: return something that can be type asserted for functionality beyond the basic.

Example simple use:

h, err := fs.Prehashed(fsys, "index.html", crypto.SHA256)
if err != nil { ... }
fmt.Printf("hash is %x\n", h.Sum(nil))

Example advanced use:

import "imaginary.example.com/blake4"

algo := &blake4.Config{
    // Hash is personalized with this key, see https://godoc.org/lukechampine.com/blake3#New
    Key: []byte("example.com/frobnicator 2020-12-08 blob hash"),
}
h, err := fs.Prehashed(fsys, "index.html", algo)
if err != nil { .... }
buf := make([]byte, something)
reader := h.(*blake4.Hasher).XOF()
_, err := io.ReadFull(reader, buf)

(That would need an extra non-stdlib API that https://godoc.org/lukechampine.com/blake3 doesn't provide, to construct a keyed hasher from its marshaled internal state.)

@tv42
Copy link

tv42 commented Dec 8, 2020

And the alternative for ReadETagFS or ShortHashFS using just PrehashedFS, without those special case interfaces but just nil arg to say "anything goes", would look like

h, err := fs.Prehashed(fsys, "index.html", nil)
if err != nil { ... }
buf := h.Sum(nil)

I'd still document some base assumptions, like minimum 8 byte length, in the PrehashedFS interface.

@ianlancetaylor ianlancetaylor added this to Incoming in Proposals (old) Dec 8, 2020
@ianlancetaylor ianlancetaylor added Proposal-Crypto Proposal related to crypto packages or other security issues and removed Proposal-Crypto Proposal related to crypto packages or other security issues labels Dec 8, 2020
@rsc rsc changed the title proposal: io/fs.Hash/er func/interface proposal: io/fs.Hash and Hasher Dec 9, 2020
@rsc rsc moved this from Incoming to Active in Proposals (old) Dec 9, 2020
@Merovius
Copy link
Contributor

Personally, I don't like the idea of identifying the hash algorithm via a string. It's too easy to mistype and there is no compiler-check. I think a better solution would be an exported constant, so you'd pass sha256.Hash (or something). Or just accept an interface { New() hash.Hash }, which a) can still be a symbolic constant implementing that interface and b) in particular, could be a crypto.Hash (but the fs package doesn't have to actually mention crypto).

However, really, I don't know if it even makes sense to pass a hash algorithm. From what I can tell, a fs.FS could only really provide useful support, by having hashes pre-computed - that is, //go:embed would precompute a sha256 hash and embed it and an S3-FS can retrieve an MD5 hash via the API. But in that case, the choice of hash algorithm is fixed by the filesystem, so it doesn't really make sense for the user of the fs.FS to pass a hash algo. Even in the "integrity-protect an S3 upload" use-case, there only really is a benefit if the FS you are uploading from already has that specific hash computed - so again, the filesystem itself chooses the Algorithm it supports.

So, personally, I don't really think it makes a lot of sense to pass a hash function of any sorts to the FS. At least not as part of a generalized filesystem API. For the ETag use-case, an optional interface like @tv42's ShortHashFS seems much more reasonable.

@tv42
Copy link

tv42 commented Dec 10, 2020

Yes, to use a precomputed hash, a hash has to be precomputed. I'm not sure what alternative there is. To continue the upload to S3 example: not having a prehashed MD5 just means you don't get end-to-end protection against corruption.

The source of the data may have precomputed multiple hashes, so it's not just "filesystem itself chooses the Algorithm". Consider e.g. serving HTTP with assets in a cloud object store. Object stores today might already store MD5 and SHA256. And moving forward, they are likely to deprecate MD5 and add something else. Plus one can annotate the objects with extra metadata, storing the hash du jour there even if the object store itself doesn't understand it.

Yes there is no magic to transmogrify MD5 into a SHA256, you only get end-to-end guarantees or save work if the marketplace of ideas settles on a small handful of popular hashes. I'd hate to end with a design that rules out the possibility of using much faster hashes like Blake3, when all steps in the processing chain agree.

Consider a content management system with assets in an object store, and adding Subresource Integrity to links based on what hashes are available: https://w3c.github.io/webappsec-subresource-integrity/#agility

ShortHashFS serves only ETag(-like) use cases. I think designing for only ETag, or designing for only MD5, was silly already a decade ago.

All of this can be done without using fs.FS. All of this can be done with FS extension interfaces outside of the stdlib. But choosing either of those opinions means ETag support is duplicate work.

Writing the above raised a new thought: The caller might want to enumerate what hashes the pathname has precomputed. Not sure how to formulate that into an API well, at this time. The SRI use case needs to map them to standardized name prefixes anyway, so it has a list of what it knows how to handle.

So maybe we flip the API around: let the filesystem tell the caller what hashes it has precomputed:

// ReadOnlyHash is a subset of hash.Hash without the continuation or reset methods.
// Otherwise, it behaves like hash.Hash.
type ReadOnlyHash interface {
    Sum(b []byte) []byte
    Size() int
}

type PrehashedFS interface {
    fs.FS

    // Prehashed returns the precomputed hash digests available for path.
   //
    // Map keys follow the same rules as context keys, and for parameterized
    // hashes can also be an instance of a configuration data type. Common
    // standard library hashes are identified by crypto.Hash values.
    //
    // If no precomputed hash digests are available, it may return either an
    // empty (potentially nil) map or ErrNotSupported.
    //
    // Caller must not mutate the returned map.
    Prehashed(path string) (map[interface{}]ReadOnlyHash, error)
}

// Prehashed ...
// Prehashed will not compute a hash digest if one was not already available.
func Prehashed(fsys FS, path string) (map[interface{}]ReadOnlyHash, error) {
    ...
}

Example use:

m, err := fs.Prehashed(fsys, "index.html")
if err != nil { ... }
if h, ok := m[crypto.SHA256]
if !ok { ... }
fmt.Printf("hash is %x\n", h.Sum(nil))

Does that please you more? Now the FS just says what it knows, yet it's still extensible the same way.

The biggest downside of that is that now the FS has to construct the full map, even if caller only cares about just SHA256.

ETag use case is a little awkward with a map, it can't just pick first thing range iterates because map order is not guaranteed, and we want predictability. Some alternatives:

  1. The awkwardness of ETag might be enough to justify not using a map but an interface, but I couldn't come up with comfortable enough methods for it writing this. Avoiding the map might make this less wasteful, as such an interface can be "lazy loading" (implemented as just a struct with a pointer to the file), and not make a map every time. Something like the overly bureaucratic
type Prehash interface {
    Best() ReadOnlyHash
    Get(algo interface{}) ReadOnlyHash
    Iter() PrehashIterator
}

type PrehashIterator interface {
    // Algorithm identifiers follow the same rules as context keys
    Algorithm() interface{}
    Hash() ReadOnlyHash
    Next() bool
}

(EDIT: that API can be minimized a little by removing Best and dictating best algo is first one from Iter. Now the interface is just representing a lazy map-like thing but with order.)

  1. Keep using a map, but require mandatory key (e.g. nil, or some constant somewhere) to be present that has the "preferred hash", whatever that means, and use that for ETag.
  2. ETag code just iterates through popular hashes, if none are in the map then no etag for you.

@Merovius
Copy link
Contributor

ShortHashFS serves only ETag(-like) use cases. I think designing for only ETag, or designing for only MD5, was silly already a decade ago.

Then we disagree. I think ShortHashFS provides an easy enough to specify interface that solves a concrete and important problem. And I think this discussion shows pretty compellingly, that there is a problem mismatch that makes an API to select hash algorithms hilariously complex for very little demonstrable benefit.

Also note, that while the interface doesn't guarantee the hash, there is nothing preventing an fs.FS from returning a consistent, well-specified hash. To allow the consumer to distinguish what hash was used, you could tag the output of fs.FS with the hash that was actually used. e.g. do

type HashFS interface {
    fs.FS
    Hash(path string) (digest []byte, algo HashAlgorithm, err error)
}

where HashAlgorithm is an appropriate tag-type (for example a string-type that would contain "md5", or it could be an interface type and you'd have a md5.Tag constant implementing it, or…). And if a filesystem can provide different hashes, it could provide different constructors (or an argument to the constructor), fixing the algorithm used.

And yes. I am fully aware that even that still leaves some cases open (e.g. it doesn't allow a filesystem to gradually migrate hash algorithms file-by-file). But at some point, we have to ask ourselves if the extra use-cases we cover really justify the extra complexity we add to the API.

As a person who primarily consumes Go APIs and doesn't provide them, I can say that I tend to pass on APIs that try to solve all problems. Because that tends to make it very hard to solve my problem.

@earthboundkid
Copy link
Contributor Author

earthboundkid commented Dec 10, 2020

It's worth taking a step back and asking what the problem is and why it needs to be solved.

I see the problem as "some filesystems have precomputed hashes but http.FileSystem cannot use those hashes for setting an ETag". (There's another question of file integrity checks, but fs.FS doesn't allow writing files, so I don't think that's really relevant now.) As a side note, an interesting question is whether there would be some way of exposing e.g. ZFS integrity hashes from the on disk filesystem.

Okay, so then there should be some way of asking whether for a given path, a hash exists and secondarily if the hash is the right one for a given purpose. How about something like:

type ReadOnlyHash interface {
    Sum(b []byte) []byte
    Size() int
    AlgoName string()
}

type PrecomputedHashFS interface {
   fs.FS
   ReadHash(path string) (ReadOnlyHash, error)
}

Do we need anything more complex than that?

@Merovius
Copy link
Contributor

Merovius commented Dec 10, 2020

@carlmjohnson I think that's more complicated than needed. ReadOnlyHash would, realistically, just be a thin wrapper around []byte, so just returning that seems better. And if the ETag usecase (i.e. deriving a short, unique identifier for the file) is what we are after, the actual hash used is also irrelevant. So, IMO, Hash(path string) ([]byte, error) is really all that's needed.

But is that the only use-case? It certainly is the one I'd care about, personally. But I might be biased by my own experience. I agree that talking about the actual concrete use-cases we want to solve should come first, though, not the solution. And FTR, even "add an optional fs.FS interface with hashes" is a solution as well (though I'd agree that it's a likely candidate to solve ETags).

@tv42
Copy link

tv42 commented Dec 10, 2020

@carlmjohnson That fixes the FS to providing just one hash algorithm. Also, I'm not really happy with communicating the hash algorithm by string: 1) there are no standard names for hashes, it's "SHA-384" in one place and "sha384" in another, so anyone communicating the hash in some protocol needs a mapping anyway; the returned value can implement Stringer for human output purposes 2) forcing a parameterized hash description to a string means next you're parsing that string, and that's just silly within the same program.

@Merovius I like the idea of returning an interface, because that allows modern sponge-design hashes to provide an underlying type that implements arbitrary-length output. Modern hashes are all migrating that way: https://godoc.org/golang.org/x/crypto/sha3#ShakeHash https://godoc.org/lukechampine.com/blake3#Hasher.XOF
And no, I don't love the hash.Hash.Sum API either, but that's what's in stdlib, and it provides an allocation-free calling convention.

Beyond those three concerns, please do come up with a simpler API. I've tried.

Here's one more attempt, taking inspiration from errors.As, but requiring new types to identify hash digests. (No bool return because these things have transient errors.)

package sha256 // crypto/sha256

type Digest []byte
package fs

// ShortHash contains a hash digest suitable for use as a simple
// change detector, such as HTTP ETag.
// It is in raw binary form, and must still be encoded by caller where necessary.
type ShortHash []byte

varr ErrNotPrehashed = errors.New("hash not precomputed")

type PrehashedFS interface {
    fs.FS

    // Prehashed returns a precomputed hash digest available for path, if
    // one is available.
    //
    // If the requested hash is not available, returns ErrNotPrehashed.
    // Panics if target is not a non-nil pointer.
    //
    // See also ShortHash.
    Prehashed(path string, target interface{}) error
}

// Prehashed ...
// Prehashed will not compute a hash digest if one was not already available.
func Prehashed(fsys FS, path string, target interface{}) error {
    ...
}

Example use:

var h sha256.Digest
if err := fs.Prehashed(fsys, "index.html", &h); err != nil {
    ...
}
fmt.Printf("sha256=%x\n", h)
func addETag(w http.ResponseWriter, fsys fs.FS, path string) error {
    var h fs.ShortHash
    err := fs.Prehashed(fsys, path, &h)
    if err == fs.NotPrehashed {
        return nil
    }
    if err != nil {
        return err
    }
    w.Header().Set("Etag", formatETag(h))
    return nil
}

@Merovius
Copy link
Contributor

Beyond those three concerns, please do come up with a simpler API. I've tried.

Assuming those concerns, there might not be. Which is why, to repeat myself, I think there should be some discussion of what we actually need. The only real tangible use-case I have a grasp on, right now, is ETags (and similar) and for those, I'd argue we don't need arbitrary length hash outputs - any collision resistant hash with at least, say, 128 bit will be sufficient. And we'd also wouldn't need to introduce all the complexity of negotiating hash algorithms.

I don't see the point arguing about the best APIs, if we don't agree on what the requirements are first. Once that has happened, the API should come pretty naturally.

@tv42
Copy link

tv42 commented Dec 10, 2020

If we just want to solve ETag "give me something that changes if the file content changes, I don't care what", then #43076 (comment) is all that's needed.

That means we will not have a standard solution that can cope with e.g. Subresource Integrity.

It's all trade-offs.

@Merovius
Copy link
Contributor

That means we will not have a standard solution that can cope with e.g. Subresource Integrity.

That would be an example of a concrete use-case to discuss, thank you. I agree that this would require you to know what the hash is and limit it to a specific set of options.

@earthboundkid
Copy link
Contributor Author

earthboundkid commented Dec 11, 2020

FWIW, @benbjohnson just dropped this: https://github.com/benbjohnson/hashfs

@earthboundkid
Copy link
Contributor Author

Closing as a duplicate of #43223. We can keep discussing there.

@rsc rsc moved this from Active to Declined in Proposals (old) Dec 16, 2020
@golang golang locked and limited conversation to collaborators Dec 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
No open projects
Development

No branches or pull requests

5 participants