x/pkgsite: invalid links for internal v2+ github enterprise modules #61404

Open
redloaf opened this issue Jul 17, 2023 · 2 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. pkgsite

Comments

redloaf commented Jul 17, 2023

What is the URL of the page with the issue?

This is a bug report about a v2+ version of a non-public repository on a self-hosted pkgsite. The URL would be of this form:
https://my-pkgsite.mycompany.internal/github.mycompany.internal/myorg/myrepo/v2

What is your user agent?

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36

Screenshot

As this is an internal code-base, we would prefer to not share a screenshot.
The primary bug is also not in how the page renders but in the link to the source code associated with a symbol.
For example, for the hypothetical module above, the page would include a link to https://github.mycompany.internal/myorg/myrepo.git/blob/v2.0.0/v2/mypkg/myfile.go

Note specifically the .git and /v2/ in this link. The latter is included unconditionally, regardless of whether it uses a major-branch or a major subdirectory pattern.

What did you do?

Running a local instance of pkgsite, I requested indexing of a v2 package on a private github repository of the form github.mycompany.internal/myorg/myrepo/v2 that follows the major branch layout. The links to the code were broken, following a pattern such as https://github.mycompany.internal/myorg/myrepo.git/blob/v2.0.0/v2/mypkg/myfile.go, which has some problems:

  • GitHub Enterprise has a redirect for repo URLs (e.g. URLs of the form https://github.mycompany.internal/myorg/myrepo.git) but issues 404s for any subpaths, including the /blob/ subpath.
  • The URLs imply that pkgsite expects the repo to follow the major subdirectory layout (v2/go.mod), and therefore expects to find mypkg/myfile.go at v2/mypkg/myfile.go, even though the repo uses the major branch layout.
  • Transient errors may result in pkgsite caching incorrect results because adjustVersionedModuleDirectory treats any error as if it were a 404.
  • When fetching the go.mod BlobURL to distinguish between the major branch and major subdirectory layouts, pkgsite does not verify that the content at the subdirectory is a valid go.mod file for the package being queried. It treats any 200 response as confirmation that the go.mod file exists in the major version subdirectory, even if the 200 response results from a redirect to a login page.
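
The last two points could be addressed by interpreting the probe result three ways instead of two. The following is a minimal sketch, not pkgsite's actual code: `probeResult` and `classifyProbe` are hypothetical names, and the rule "a 200 reached through a redirect is inconclusive" is an assumption about what a stricter check might look like.

```go
package main

import (
	"fmt"
	"net/http"
)

// probeResult is a hypothetical three-way outcome for the go.mod probe.
type probeResult int

const (
	probeFound   probeResult = iota // go.mod confirmed at the probed URL
	probeMissing                    // server answered 404: the file is not there
	probeUnknown                    // anything else: do not cache a conclusion
)

// classifyProbe interprets the final response of a request for v2/go.mod.
// redirected reports whether the final URL differs from the requested one,
// which is how a bounce to a login page would show up.
func classifyProbe(status int, redirected bool) probeResult {
	switch {
	case status == http.StatusOK && !redirected:
		return probeFound
	case status == http.StatusNotFound:
		return probeMissing
	default:
		// Timeouts mapped to 5xx, 401/403, and redirected 200s all land here.
		return probeUnknown
	}
}

func main() {
	fmt.Println(classifyProbe(200, false) == probeFound)
	fmt.Println(classifyProbe(200, true) == probeUnknown)
	fmt.Println(classifyProbe(503, false) == probeUnknown)
}
```

With a separate "unknown" outcome, a transient failure can be retried later rather than being recorded as if the file were absent.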

In our internal deployment, pkgsite pulls Go modules from an internal deployment of Athens and does not have direct access to the internal git repositories. We would be happy to give it this access, but that does not appear to be an option today.

What did you expect to see?

FileURL should link to https://github.mycompany.internal/myorg/myrepo/blob/v2.0.0/mypkg/myfile.go

What did you see instead?

https://github.mycompany.internal/myorg/myrepo.git/blob/v2.0.0/v2/mypkg/myfile.go

Analysis

It appears that the regular expressions intended to match internal GitHub (and GitLab?) instances do not match module paths with a major version >= 2. As a result, the repository metadata is fetched dynamically and its VCS suffix (.git) is not stripped. ModuleInfo then calls adjustVersionedModuleDirectory, which performs a HEAD request on the go.mod file and treats any 200 response as success, even if it is a login page reached after a redirect. Whether the repository uses the major branch or the major subdirectory layout can only be determined by querying the repository itself, and for private repositories there does not seem to be a way to configure authentication so that pkgsite can answer that question accurately.
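
As a rough illustration of the link this report expects, the sketch below builds a GitHub-style blob URL with the `.git` suffix stripped and the major directory included only when the repository actually uses one. `fileURL` and its parameters are hypothetical and do not correspond to pkgsite's internal API.

```go
package main

import (
	"fmt"
	"strings"
)

// fileURL builds a GitHub-style blob link (hypothetical helper).
//
//	repoURL:  repository URL as discovered; may carry a ".git" suffix
//	tag:      version tag, e.g. "v2.0.0"
//	majorDir: "v2" for the major subdirectory layout, "" for major branch
//	file:     path relative to the module root
func fileURL(repoURL, tag, majorDir, file string) string {
	repo := strings.TrimSuffix(strings.TrimSuffix(repoURL, "/"), ".git")
	parts := []string{repo, "blob", tag}
	if majorDir != "" {
		parts = append(parts, majorDir)
	}
	parts = append(parts, file)
	return strings.Join(parts, "/")
}

func main() {
	// Major branch layout: no v2/ path element in the link.
	fmt.Println(fileURL("https://github.mycompany.internal/myorg/myrepo.git", "v2.0.0", "", "mypkg/myfile.go"))
	// Major subdirectory layout: v2/ is part of the path.
	fmt.Println(fileURL("https://github.mycompany.internal/myorg/myrepo.git", "v2.0.0", "v2", "mypkg/myfile.go"))
}
```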

Proposed solutions

There are many ways to proceed here, one of which is to permit pkgsite to be configured with specific "code hosts."
In this case, we would configure a "code host" for github.mycompany.internal and its configuration would supersede the regular expression matching. This "code host" could be configured with its type (e.g. GitHub Enterprise) and API credentials, which would let pkgsite query the API directly (to see if the file v2/go.mod exists) rather than relying on HTTP. Some code hosts may offer a standard means of serving a single file.

Another approach would be to use the existing RawURL, if available, so that pkgsite can affirmatively parse the go.mod file (after redirects) to ensure it is a valid go.mod for the package being fetched. However, this would still require solving the authentication issue.
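
A sketch of what "affirmatively parse" could mean, assuming the raw body has already been fetched: check that the content declares the expected module path before trusting it. The helpers `declaredModulePath` and `isGoModFor` are hypothetical, and the parsing is deliberately simplified to the standard library. A real implementation would more likely use `modfile.ModulePath` from `golang.org/x/mod/modfile`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// declaredModulePath returns the path from the first single-line "module"
// directive in data, or "" if none is found. Simplified: it does not handle
// every form the go.mod grammar allows.
func declaredModulePath(data string) string {
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "//"); i >= 0 {
			line = line[:i] // drop trailing comment
		}
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "module" {
			return strings.Trim(fields[1], `"`)
		}
	}
	return ""
}

// isGoModFor reports whether data looks like a go.mod for modulePath.
func isGoModFor(data, modulePath string) bool {
	return modulePath != "" && declaredModulePath(data) == modulePath
}

func main() {
	const want = "github.mycompany.internal/myorg/myrepo/v2"
	fmt.Println(isGoModFor("module "+want+"\n\ngo 1.20\n", want))
	fmt.Println(isGoModFor("<html><body>Sign in</body></html>", want))
}
```

A login page served with status 200 fails this check because it contains no matching module directive.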

@gopherbot gopherbot added this to the Unreleased milestone Jul 17, 2023
@jamalc jamalc added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 20, 2023
@jamalc jamalc modified the milestones: Unreleased, website/unplanned Jul 20, 2023
jamalc commented Jul 20, 2023

The second solution, with an environment variable to add a GitHub auth token to requests, sounds reasonable. Would that be sufficient for your use case?

redloaf commented Jul 21, 2023

Thanks for looking into this. For that to work, I think we would need to use the GitHub API or a git operation when fetching the go.mod file. pkgsite currently uses a HEAD request to the /blob/ URL on the GitHub web interface, but I don't believe access tokens are valid in that context. For GitHub Enterprise I've found that something like this would work:

curl -L \
  -H "Accept: application/vnd.github.raw" \
  -H "Authorization: Bearer $GITHUB_AUTH_TOKEN" \
  "https://github.mycompany.internal/api/v3/repos/myorg/myrepo/contents/v2/go.mod?ref=v2.0.0"

And from the docs it seems github.com wants something more like this:

curl -L \
  -H "Accept: application/vnd.github.raw" \
  -H "Authorization: Bearer $GITHUB_AUTH_TOKEN" \
  -H "X-GitHub-Api-Version: 2022-11-28" \
  "https://api.github.com/repos/myorg/myrepo/contents/v2/go.mod?ref=v2.0.0"

We would also need to ensure that the auth token is sent only to the intended code host, perhaps with the ability to set a different token per code host.
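
A minimal Go sketch of that idea follows, under stated assumptions: `contentsRequest` and the `tokens` map are hypothetical, the Enterprise API is assumed to live under `/api/v3/repos/...` as in GitHub's REST documentation, and the token is attached only when the target host has its own entry.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// contentsRequest builds a GET request for a file through the contents API.
// tokens maps a code host name to its token (hypothetical configuration).
func contentsRequest(host, owner, repo, path, ref string, tokens map[string]string) (*http.Request, error) {
	base := "https://" + host + "/api/v3" // assumed GitHub Enterprise layout
	if host == "github.com" {
		base = "https://api.github.com"
	}
	u := fmt.Sprintf("%s/repos/%s/%s/contents/%s?ref=%s",
		base, owner, repo, path, url.QueryEscape(ref))
	req, err := http.NewRequest(http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.github.raw")
	// Send a token only to the host it was configured for.
	if tok, ok := tokens[host]; ok && tok != "" {
		req.Header.Set("Authorization", "Bearer "+tok)
	}
	return req, nil
}

func main() {
	tokens := map[string]string{"github.mycompany.internal": "example-token"}
	req, err := contentsRequest("github.mycompany.internal", "myorg", "myrepo", "v2/go.mod", "v2.0.0", tokens)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.String())
}
```

Keying tokens by host means a request to any other host, including github.com, carries no Authorization header unless a token was configured for it explicitly.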
