Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proxy.golang.org: support git-lfs #47241

Open
jpap opened this issue Jul 16, 2021 · 14 comments
Open

proxy.golang.org: support git-lfs #47241

jpap opened this issue Jul 16, 2021 · 14 comments
Labels
FeatureRequest NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. proxy.golang.org

Comments

@jpap
Copy link
Contributor

jpap commented Jul 16, 2021

The Google-hosted proxy.golang.org does not smudge git-lfs pointer files in public repos. This results in broken builds when modules include binary syso files that are hosted under git-lfs. (The linker sees a text file instead of the object file, ignores it, then explodes with a lot of missing symbol errors which isn't very intuitive.)

While it might be unreasonable to expect every git extension be supported on proxy.golang.org, git-lfs is quite common: it's supported by many of the major git hosts (e.g. GitHub, GitLab, Bitbucket, Azure Repos) with many others having a pending feature-request for it.

Workarounds

  1. Use GOPRIVATE for all packages that need git-lfs. This can result in a complex configuration environment for the developer who now needs to track this across possibly many dependencies, some of which may be transitory.

  2. The nuclear option: turn off the proxy with GOPROXY=direct.

  3. Change all modules to avoid git-lfs and commit all binary syso files directly into each repo.

None of these options are great...

Similar issues

Issue #41708 talks about a mod checksum mismatch between proxy and direct. One solution is to ignore the smudge and get matching checksums, which can work in that case because the git-lfs file(s) are inconsequential to the build. That approach doesn't work in the scenario described above.

The following issues are related to git-lfs, but aren't as relevant:

@cherrymui cherrymui added FeatureRequest NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jul 16, 2021
@cherrymui cherrymui added this to the Unreleased milestone Jul 16, 2021
@cherrymui
Copy link
Member

@heschi
Copy link
Contributor

heschi commented Jul 16, 2021

As a matter of policy, proxy.golang.org doesn't deviate from the default behavior of the go command. #39720 makes it sound like the go command invokes git in a way that disables git-lfs. If that's correct, then the go command needs to change before proxy.golang.org and you need a resolution to that bug.

However, as a practical matter, changing the way Git behaves while downloading modules is a dangerous thing to do, since sum.golang.org cryptographically guarantees module contents stay stable over time. We would have to keep track of whether particular versions should be downloaded with lfs enabled or disabled. So this would be a fairly large project.

@jpap
Copy link
Contributor Author

jpap commented Jul 16, 2021

As a matter of policy, proxy.golang.org doesn't deviate from the default behavior of the go command.

As suggested in the Workarounds section above, using the go command without the proxy does correctly yield smudged git-lfs files on modules, whether public or privately hosted. The module checksum is of course different, which is why #41708 and #38941 exist.

If the proxy is just using go and git, and not using a pure-Go implementation like go-git †, then it isn't processing git-lfs files simply because the machine doesn't have git-lfs installed. I can't tell if this is the case, because I can't find the source for the Google-sponsored proxy. Is it available somewhere?

† The go-git project is close to supporting git-lfs by way of filters.

#39720 makes it sound like the go command invokes git in a way that disables git-lfs. If that's correct, then the go command needs to change before proxy.golang.org and you need a resolution to that bug.

I suspect the OP in #39720 used an old git version, in which git archive did not process hooks that allow git-lfs to smudge. When I perform the following actions on my machine (git version 2.28.0) I get smudged files, and not the git-lfs pointers,

git init --bare
git remote add origin $PUBLIC_REPO_URL
git fetch
git archive remotes/origin/master | tar -tvf -

and is easily confirmed by some console messages about downloading the git-lfs files and the listed file sizes.

However, as a practical matter, changing the way Git behaves while downloading modules is a dangerous thing to do, since sum.golang.org cryptographically guarantees module contents stay stable over time. We would have to keep track of whether particular versions should be downloaded with lfs enabled or disabled. So this would be a fairly large project.

I understand your concern. One idea is to:

  1. Have proxy.golang.org smudge git-lfs files. This might be as simple as installing git-lfs on your infra.

  2. Make a change to the way go drives git and how the module hash is computed, so that it always occurs before any git hooks are executed. This would address cmd/go: checksum mismatch for module containing Git LFS files #41708 and cmd/go: checksum mismatch with private dependent module with GIT LFS files #38941. For git-lfs files, that hash includes the pointer file contents which contains a hash of the referenced git-lfs content, so you are implicitly including the full file contents in the final go module checksum.

    This step should not prevent git from executing hooks after the hash is computed, otherwise it will break the "don't use proxy.golang.org" workaround and we will be forced to abandon git-lfs completely.

This approach means that all existing checksums previously computed by proxy.golang.org remain valid, while enabling the proxy to (correctly) deliver git-lfs smudged files in the module ZIP.

The hash computed above is over the "raw repository": you might call it a "repo" checksum. You might then consider adding a second "content" hash to the GOPROXY protocol for sumdb which is over the ZIP file being vended by the proxy, if a user wants to verify that the git hooks have not mutated the content differently at a later date.

@heschi
Copy link
Contributor

heschi commented Jul 19, 2021

simply because the machine doesn't have git-lfs installed

The module services are closed source, but the part you're interested in just calls go mod download, so you're probably right. That's a bit worrying, since it means that if we accidentally install git-lfs at some point we'll be in trouble.

hash computed above is over the "raw repository"

This distinction doesn't exist in the go command; not all modules come from VCS repositories. Checksums are calculated after some filtering and preprocessing steps, based on the contents of the zip files that are actually the canonical representation of a module version. What you're calling a "repo" checksum would be a new concept.

This is a question for the cmd/go team, @bcmills @jayconrod @matloob. But my guess is that we want to actively disable LFS support unless a user manually opts in. The module ecosystem really isn't set up to have multiple valid views of the same repository. I understand this is going to make life hard for LFS users and I don't know of a good answer to that.

@bcmills
Copy link
Contributor

bcmills commented Jul 19, 2021

We're still figuring out the path forward for Git LFS; see #41708 and #39720. Unfortunately, as Heschi notes we don't have a good solution yet.

@bcmills
Copy link
Contributor

bcmills commented Jul 19, 2021

We certainly cannot have the go command execute git hooks after checksumming. Part of the point of modules is to reduce the go command's reliance on VCS tools, and especially its exposure to VCS vulnerabilities. Executing arbitrary (or semi-arbitrary) hooks after checksums are verified would undermine a lot of the benefits that led us to use checksums in the first place.

@jpap
Copy link
Contributor Author

jpap commented Jul 20, 2021

But my guess is that we want to actively disable LFS support unless a user manually opts in.

I would argue that LFS support should be on by default, with an opt out for cases like #29987. As it stands, GOPROXY can break a build that otherwise works without: it seems odd to opt-in for non-breaking behavior.

An alternative which I would support is a new VCS called "git-lfs". In this case the "opt in" is by the package author (in their go-import meta tag; or VCS qualifier and import comment), and not the package user. In this way, checksums will be consistent because all repository accesses would require LFS and therefore a single-view of it.

The module ecosystem really isn't set up to have multiple valid views of the same repository.
We're still figuring out the path forward for Git LFS; see #41708 and #39720.

These issues were cited in the top-comment -- all of them are about a year old now and I've not seen any discussion or proposals on how LFS might be properly supported. What is the process for seeing it through?

We certainly cannot have the go command execute git hooks after checksumming. Part of the point of modules is to reduce the go command's reliance on VCS tools, and especially its exposure to VCS vulnerabilities. Executing arbitrary (or semi-arbitrary) hooks after checksums are verified would undermine a lot of the benefits that led us to use checksums in the first place.

I am all for this: the git-lfs project is implemented in go, licensed under MIT, and can be incorporated into cmd/go and any proxy implementation (including Google's closed-source server).

Implementing a new "git-lfs" VCS would satisfy all the needs brought up in this thread so far:

  • No changes to the GOPROXY protocol
  • A consistent single-view into a package repository
  • Opt-in behavior (by the package author)
  • No (semi or full) arbitrary execution of Git hooks

@bcmills
Copy link
Contributor

bcmills commented Jul 20, 2021

Implementing a new "git-lfs" VCS would satisfy all the needs brought up in this thread so far:

Maybe! Treating git-lfs as a completely different VCS could be one of the more viable paths forward, but it too would break existing checksums for modules that use git-lfs. But perhaps that's the least-bad option, since it would only break the existing versions if the module's owner explicitly switches to the git-lfs protocol. And given module proxies (to serve the old versions) and the fact that the go.mod files themselves presumably wouldn't change due to LFS (so that the damage is contained and easy to upgrade past), the blast radius might not actually be that bad.

(Unfortunately, the go-import protocol that the go command uses to bootstrap VCS information over HTTPS doesn't tell the server which version of the module the go command is after, so the server can't just choose a specific version to cut over.)

@bcmills
Copy link
Contributor

bcmills commented Jul 20, 2021

An explicit git-lfs protocol would also have the advantage of being something we can detect and report explicitly.

If go get is asked to fetch a git-lfs module itself (via GOPRIVATE or GOPROXY=direct), and if we can detect that git-lfs is not installed on the user's machine, then we can give a useful failure message instead of a checksum error. (If we fell back to “Git without LFS” implicitly, we wouldn't be able to distinguish between “fetched the correct non-LFS contents” and “fetched something totally bogus”.)

@heschi
Copy link
Contributor

heschi commented Jul 20, 2021

It also seems like a reasonable option to me. But anyone who isn't using a vanity server is stuck -- we can't switch the protocol for all of github.com (etc) without causing trouble. I don't know if that's good enough.

@bcmills
Copy link
Contributor

bcmills commented Jul 20, 2021

It would presumably get its own VCS path extension too. So you could do something like:

module github.com/bcmills/example.git-lfs

and then go get github.com/bcmills/example.git-lfs would work, but it would have to encode the protocol as part of the module path.

Actually, that's a really nice property to have, because it gives different module paths (with different checksums) for github.com/bcmills/example and github.com/bcmills/example.git-lfs.

@bcmills
Copy link
Contributor

bcmills commented Jul 20, 2021

(Then we would just need some mechanism for forwarding references from the old path to the one with the git-lfs extension, which I'm actively thinking about anyway. Stay tuned!)

@bcmills
Copy link
Contributor

bcmills commented Jul 20, 2021

@jpap, I think that idea (an explicit git-lfs protocol) is solid enough to file a proposal for cmd/go support. Do you want to file the proposal, or shall I? 😁

@jpap
Copy link
Contributor Author

jpap commented Jul 21, 2021

I've added a proposal here. Feedback in that issue is welcome.

...but it too would break existing checksums for modules that use git-lfs. But perhaps that's the least-bad option, since it would only break the existing versions if the module's owner explicitly switches to the git-lfs protocol. And given module proxies (to serve the old versions) and the fact that the go.mod files themselves presumably wouldn't change due to LFS (so that the damage is contained and easy to upgrade past), the blast radius might not actually be that bad.

(Unfortunately, the go-import protocol that the go command uses to bootstrap VCS information over HTTPS doesn't tell the server which version of the module the go command is after, so the server can't just choose a specific version to cut over.)

If a module owner opts in "new versions" to Git LFS, old (tagged) commits remain available via the original import path (lacking the .git-lfs suffix). This is because the git repository remains at the same repo-url whether you're doing a regular Git fetch or a Git+LFS fetch.

If the module user requests "old versions" but now using the new VCS qualifier, old (tagged) commits will now be fetched with Git LFS. Whether or not those old versions hosted Git LFS content, each checksum is still unique because it associates with a module path that now includes the .git-lfs suffix.

The only downside to the last scenario is that a GOPROXY might end up storing the same content twice, if there was no LFS content in the old version: once for the original module path, then again for the new module path with .git-lfs suffix.

(That duplication depends on the implementation: if the proxy stores all of its downloaded module data keyed on checksum, it could effectively deduplicate them. Pity about the ZIP contents filename $module@$version prefix: it would need to re-generate the ZIP for each, which can be done cheaply on the fly if the stored data is DEFLATE'ed in advance.)

The checksum mismatch between the current GOPROXY and using cmd/go with Git LFS installed goes away with Part 6 of the new proposal.

There is one more scenario whereby the module user requests one of the "new versions" of the module that opts into Git LFS, but does so without specifying the new VCS qualifier (explicitly or via go-import). This scenario might be a module user updating their dependency.

In this last case, we want the user to use Git LFS because that's what the module owner wants for that version. This is handled by the "upgrade" scheme in Part 5 of the new proposal: if that proves too complex or magic, I am happy to replace it with "all references to a module (package) that has opted into Git LFS must use the git-lfs qualifier". This makes importing a GOPATH package using Git LFS less clean, but I could live with that given that use case is being deprecated.

This last part may relate to the "forwarding references" you mentioned. Your feedback is welcome.

EDIT: I've removed the "upgrade" scheme in the proposal and replaced it with the error out clause...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
FeatureRequest NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. proxy.golang.org
Projects
None yet
Development

No branches or pull requests

6 participants