proxy.golang.org: Unusual traffic to git hosting service from Go #44577

Closed
ddevault opened this issue Feb 24, 2021 · 38 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@ddevault

Following up from #44468: I run a git hosting service which has, in the past few weeks, received elevated levels of traffic from Google-owned IP addresses performing git clones. The shape of the requests is something like this:

74.125.182.164 - - [23/Feb/2021:00:41:34 +0000] "GET /~yoink00/zaplog/info/refs?service=git-upload-pack HTTP/2.0" 200 553 "-" "git/2.30.0" "-"
74.125.182.161 - - [23/Feb/2021:00:41:34 +0000] "GET /~yoink00/zaplog/info/refs?service=git-upload-pack HTTP/2.0" 200 553 "-" "git/2.30.0" "-"
74.125.182.161 - - [23/Feb/2021:00:41:34 +0000] "POST /~yoink00/zaplog/git-upload-pack HTTP/2.0" 200 56 "-" "git/2.30.0" "-"
74.125.182.161 - - [23/Feb/2021:00:41:34 +0000] "POST /~yoink00/zaplog/git-upload-pack HTTP/2.0" 200 9434 "-" "git/2.30.0" "-"

What is the purpose of this traffic?

If it's crawling, it should set an appropriate user-agent and respect robots.txt. The traffic is coming in at a rate which I would not consider reasonable for a crawler, up to several times per second - and git clones are more expensive than other HTTP requests.

@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 24, 2021
@ianlancetaylor
Contributor

CC @hyangah @heschik

@katiehockman
Contributor

katiehockman commented Feb 24, 2021

Hi @ddevault

Thanks for filing the issue.

What is the purpose of this traffic?

There could be several sources of this traffic, or a combination of sources, so we will work to narrow this down. It sounds like this issue is focused on lowering the traffic, or at a minimum, documenting it so that the purpose of the traffic is clearer. We'll leave the User-Agent discussion in #44468. I have a few questions for you which can help us narrow down the cause.

You mentioned "in the past few weeks" that you've seen this behavior change. Was there a huge spike in traffic that started on a particular day that you know of, or was it gradual to the point where things are now? For example, was it right when Go 1.16 came out on Feb 16, or were you seeing this earlier than that? This can help us narrow down the root cause.

Are you seeing this across all of the code on git.sr.ht, or is it focused on a specific module or set of modules?

Additionally, do you have any data about the total volume of requests over a typical hour, or patterns you're seeing? You noted that the traffic you are seeing now is not reasonable. It would be helpful to hear your opinion of what an acceptable volume of requests would look like.

Some thoughts about where this traffic could be coming from:

  • module resolution from the go command, which executes several fetches at once
  • some new Go command behavior from 1.16
  • proxy.golang.org may retry requests for transient errors
  • proxy.golang.org runs regular jobs that refresh data for modules we already know about, e.g. what's @latest, what's the current list of versions, etc. If we do this work upfront, it makes the experience a lot faster (i.e. a database lookup is a lot cheaper than a full fetch from the go command). And if we do these refreshes often enough, then this speed doesn't come at the cost of stale data. However, how and when we do these refreshes is something we may re-evaluate.

I also want to note that proxy.golang.org isn't trying to do any crawling. It fetches and pre-caches code that we can be confident users will want, fetching and refreshing data that users have already asked for.
Even if hosting services see a large amount of traffic from proxy.golang.org, their total traffic might actually be higher if proxy.golang.org weren't acting as a mirror for many users: even though the sites see a spike in traffic from Google IPs, their total traffic volume could be less than it otherwise would be if users weren't contacting proxy.golang.org first.
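
For illustration, a refresh job along these lines might look like the following minimal sketch. The scratch module directory, the module list, and the hourly interval are assumptions for illustration, not the service's actual implementation:

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"time"
)

// refreshKnownModules periodically re-resolves @latest for modules the
// mirror already knows about, so cached version lists stay fresh.
// scratchDir is assumed to contain a throwaway go.mod, since "go mod
// download" needs a module context to run in; GOPROXY=direct forces a
// real fetch from the origin server rather than a proxy round-trip.
func refreshKnownModules(scratchDir string, modules []string, interval time.Duration) {
	for range time.Tick(interval) {
		for _, mod := range modules {
			cmd := exec.Command("go", "mod", "download", mod+"@latest")
			cmd.Dir = scratchDir
			cmd.Env = append(os.Environ(), "GOPROXY=direct")
			if out, err := cmd.CombinedOutput(); err != nil {
				log.Printf("refresh %s: %v: %s", mod, err, out)
			}
		}
	}
}

func main() {
	// Hypothetical module list and interval, for illustration only.
	refreshKnownModules("/tmp/scratchmod", []string{"git.sr.ht/~sircmpwn/example"}, time.Hour)
}
```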

@ddevault
Author

I'll give approximate answers, but knowing that these are the questions you want answered, I will be able to collect more specific answers when the behavior is next observable.

You mentioned "in the past few weeks" that you've seen this behavior change. Was there a huge spike in traffic that started on a particular day that you know of, or was it gradual to the point where things are now? For example, was it right when Go 1.16 came out on Feb 16, or were you seeing this earlier than that? This can help us narrow down the root cause.

This occurred on February 10th and 14th, then at least once per day from the 18th onwards. It seems likely that this coincides with the Go 1.16 release, which notably made some major changes to how modules are used.

Are you seeing this across all of the code on git.sr.ht, or is it focused on a specific module or set of modules?

If there's any discernible pattern to the repositories affected, it's hard for me to tell. It would help if you could characterize the IP address I gave (74.125.182.164) and share a range of IPs which are also likely to be implicated in the same behavior, so I can extract just the relevant part of the logs.

Additionally, do you have any data about the total volume of requests over a typical hour, or patterns you're seeing? You noted that the traffic you are seeing now is not reasonable. It would be helpful to hear your opinion of what an acceptable volume of requests would look like.

Hm, in my experience, it happens for a few tens of minutes at a rate of between 2 and 10 requests per second. Googlebot on its default settings, for example, only does one request every 5 seconds or something like that. In any case, if it's an automated process, it should be fetching robots.txt and letting me, the sysadmin of the remote host, configure its crawling parameters. The precise rate is not especially important, but I would like to tune it to a predictable value and then update our assumptions about network monitoring so we aren't getting a bunch of false alarms from Go crawling our servers.

Some thoughts about where this traffic could be coming from:

This list can be narrowed down: every request has appeared to come from a Google-owned IP address, so we can eliminate behavior from end users (and also hold Google accountable for whatever their servers are up to). I know that godoc.org has had (:frowning_face:) a crawling feature; I have no clue what proxy.golang.org or pkg.go.dev does.

@ddevault
Author

Note: it would be helpful if Go's release notes were dated.

@ALTree
Member

ALTree commented Feb 24, 2021

Note: it would be helpful if Go's release notes were dated.

This page: https://golang.org/doc/devel/release.html has dates for every release, if you need them.

@ddevault
Author

Thanks!

@katiehockman
Contributor

@ddevault Thanks for your fast response. To make it easier to figure out exactly which traffic is coming from us, we're going to go ahead and work on setting the User-Agent for proxy.golang.org requests, per #44468. Then you'll be able to more easily discern which requests are coming from us (rather than filtering by IP), and the logs will hopefully be easier to collect. It's a good thing to do either way, so we'll prioritize doing this.

I'll follow back up here once those changes are in production.

@ddevault
Author

Sounds good, thanks!

@ddevault
Author

With the new User-Agent in place, I can characterize the behavior more concretely now. Over the past hour, I've received 1,912 requests from proxy.golang.org, from IP blocks 74.125.0.0/16 and 173.194.0.0/16. The full list of requests is available here, with columns for the request IP, date, and hash of the module URL.

Redundant requests per IP address are somewhat reasonable:

https://paste.sr.ht/~sircmpwn/986d4c2e3f5909385b19adf6fa15bc789bff8708

But redundant requests across all IPs are less so:

https://paste.sr.ht/~sircmpwn/b46ad0b13e864923df80cb8e8285bf1661e6f872

There is some room for improvement here.

@ddevault
Author

ddevault commented Feb 26, 2021

Some specific recommendations:

  • Reduce redundant requests across nodes

proxy.golang.org runs regular jobs that refresh data for modules we already know about, e.g. what's @latest, what's the current list of versions, etc. If we do this work upfront, it makes the experience a lot faster (i.e. a database lookup is a lot cheaper than a full fetch from the go command). And if we do these refreshes often enough, then this speed doesn't come at the cost of stale data. However, how and when we do these refreshes is something we may re-evaluate.

  • This process should obey robots.txt so the sysadmin can control the request rate or opt out (for some or all packages).

I don't mind the volume if it's legitimate traffic, but an effort should be made to reduce redundant load, and to give the sysadmins some knobs to control the load.

Another question: do you make a fresh clone every time, or do you keep the repo around and fetch only the difference? You should maintain an LRU cache of clones and freshen them up on subsequent requests.
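
For illustration, a minimal sketch of that LRU-of-clones idea (the bare-repository layout, naming scheme, and eviction policy here are assumptions, not a description of anything the proxy actually does):

```go
package clonecache

import (
	"container/list"
	"os"
	"os/exec"
	"path/filepath"
)

// Cache keeps up to max bare repositories on disk, evicting the least
// recently used. A hit runs an incremental "git fetch"; a miss pays for
// a full "git clone --bare". The on-disk naming scheme is a placeholder
// (filepath.Base collides for repos sharing a final path element).
type Cache struct {
	dir   string
	max   int
	order *list.List               // front = most recently used
	repos map[string]*list.Element // repo URL -> element in order
}

func New(dir string, max int) *Cache {
	return &Cache{dir: dir, max: max, order: list.New(), repos: make(map[string]*list.Element)}
}

// Get returns a local bare-repo path for url, fetching or cloning as needed.
func (c *Cache) Get(url string) (string, error) {
	path := filepath.Join(c.dir, filepath.Base(url)+".git")
	if el, ok := c.repos[url]; ok {
		c.order.MoveToFront(el)
		// Freshen the existing clone instead of re-downloading everything.
		return path, exec.Command("git", "-C", path, "fetch", "--prune", "origin").Run()
	}
	if c.order.Len() >= c.max {
		// Evict the least recently used clone to bound disk usage.
		oldest := c.order.Back()
		old := oldest.Value.(string)
		os.RemoveAll(filepath.Join(c.dir, filepath.Base(old)+".git"))
		delete(c.repos, old)
		c.order.Remove(oldest)
	}
	if err := exec.Command("git", "clone", "--bare", url, path).Run(); err != nil {
		return "", err
	}
	c.repos[url] = c.order.PushFront(url)
	return path, nil
}
```

With something like this, a repeat request costs one incremental fetch rather than a full clone.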

@katiehockman
Contributor

@ddevault
Thanks. We're going to take a look at reducing the load of our refresh jobs. I will respond back here with updates.

@ddevault
Author

ddevault commented Mar 2, 2021

Thanks. Would also appreciate answers to my more specific questions when you have the opportunity.

@katiehockman
Contributor

@ddevault
Sure, happy to answer those.

do you make a fresh clone every time, or do you keep the repo around and fetch only the difference?

The short answer is that yes, we make a fresh clone every time. I want to provide more information below to clarify this:

It's not the case that a go get will always translate to a single git clone or even a single HTTP request. For example, if someone requests a new version that we don't already have, e.g. github.com/a/b/c@v0.0.1, then this may require module resolution if github.com/a/b/c is not a module. So the go command performs several requests, e.g. one for "github.com/a/b/c" then "github.com/a/b", then "github.com/a", etc.
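
For illustration, that prefix walk might be sketched as follows. This is a simplified model of the ?go-get=1 resolution convention, not the go command's actual code:

```go
package resolve

import (
	"fmt"
	"net/http"
	"strings"
)

// resolveModuleRoot probes each path prefix with ?go-get=1, longest
// first, until one responds successfully; each probe is one request the
// origin server sees. A real client would also parse the response HTML
// for the <meta name="go-import"> tag identifying the module root.
func resolveModuleRoot(path string) (string, error) {
	parts := strings.Split(path, "/")
	for i := len(parts); i > 0; i-- {
		prefix := strings.Join(parts[:i], "/")
		resp, err := http.Get("https://" + prefix + "?go-get=1")
		if err != nil {
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return prefix, nil
		}
	}
	return "", fmt.Errorf("no module root found for %s", path)
}
```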

However, we suspect that the majority of the traffic you are experiencing is not due to module resolution of new modules, but instead comes from our refresh jobs. These already know the resolved path, so the go command doesn't need to do this resolution. For example, go mod download module@version (which is what proxies are expected to use to fetch a module at a particular version) isn't guaranteed to perform a single HTTP request. It may do things like first request a list of all the tags of that module, then look for the go.mod file or other metadata, then do a shallow fetch to get the contents.

From proxy.golang.org's perspective, we shell out work to the go command, and it's up to the go command to decide how best to retrieve the information and pass it back to us. So the idea of keeping a cache of clones around isn't practical, nor would it help the go command.

However, something we can do right now is improve our refresh jobs to help with load, so we're going to look into that.

@ddevault
Author

ddevault commented Mar 2, 2021

From proxy.golang.org's perspective, we shell out work to the go command, and it's up to the go command to decide how best to retrieve the information and pass it back to us. So the idea of keeping a cache of clones around isn't practical, nor would it help the go command.

It doesn't seem reasonable to change the go command, but I would argue that proxy.golang.org is in a unique position among users of the go command, and as such it seems reasonable to suggest that it would be open to an improved implementation which better suits its unique setting. If not, Go is just shoving the complexity burden onto software hosting services, with wasteful and redundant requests.

However, something we can do right now is improve our refresh jobs to help with load, so we're going to look into that.

Sounds like a good start.

@katiehockman
Contributor

Hi @ddevault. I wanted to give an update that we've gone ahead and improved our refresh jobs to hopefully lead to less duplication of request traffic to origin servers. This could yield a 2-3x drop in requests.

@ddevault
Author

Thanks!

@ddevault
Author

I can confirm that the load is reduced, but it is still a bit heavy all things considered.

@ddevault
Author

Actually, I went to quantify my impression of a reduced load and found that it has not changed much at all. It has gotten worse in some respects.

Fresh data: in the past hour, we received about 2500 requests from a GoModuleMirror User-Agent. The new breakdowns by IP and module are here:

https://paste.sr.ht/~sircmpwn/8636039e4bff971f8b9028d22ad05984f4e7a24c

https://paste.sr.ht/~sircmpwn/4f4636fed5f672aa3cccca527b95476fddef3ca5


@ddevault
Author

ddevault commented May 23, 2021

Gah, I jumped the gun here. Sorry. This log spans over an hour, not a minute.

Update: it looks like the traffic was a huge burst from an unrelated source and the Go proxy had the bad luck to submit a large batch of requests on the tail end of those logs. Then I collected all of the recent requests from that IP range and misread the timestamps on the logs. Sorry for the noise.

@BenLubar

BenLubar commented May 30, 2021

Yesterday, GoModuleMirror downloaded 4 gigabytes of data from my server requesting a single module over 500 times (log attached). As far as I know, I am the only person in the world using this Go module. I would greatly appreciate some caching or at least rate limiting.

gomodulemirror.log

For reference, here's a bandwidth graph. It seems to have started about a week ago.

[attached: bandwidth graph]

@katiehockman
Contributor

Thanks for letting us know @BenLubar. We're continuing to look into this, and appreciate the extra details.

@ddevault
Author

ddevault commented Jun 8, 2021

Please re-prioritize this: if any organization with more accountability than Google had been DDoSing hosting providers since February, it would be front-page news and their ISP would have cut them off.

@katiehockman
Contributor

@ddevault We are taking this seriously, and have been taking strides to improve this since it was reported to us. For transparency: we spent a while discussing options for how we can approach this internally on Friday of last week. The fix for this isn't straightforward, and it impacts all users of the proxy, not just origin servers. We have a job that runs at regular intervals and fetches fresh data from origin servers to make sure that the end user has a good experience when they try to download Go source code. If we change this, users will have to wait longer while we fetch things on the fly, or receive stale data, which can greatly slow down developers' builds.

Currently, a single request for a module may cause refetch traffic for several days after. That may be what you are experiencing. One idea we've been discussing is to make it such that our job only makes refresh requests if the module is deemed "popular" enough (e.g. the module has been requested 100 times in the last week). However, this is going to require some re-architecting and database changes, so it is taking some time to work through.
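
As a sketch, that gating idea might look something like the following; the in-memory request log is purely illustrative, since a real service would need persistent counters:

```go
package popularity

import "time"

// requestLog records when each module was requested. A real service
// would persist these counters; an in-memory map is used here only to
// illustrate the gating logic.
type requestLog map[string][]time.Time

// Record notes a user-initiated request for module.
func (r requestLog) Record(module string) {
	r[module] = append(r[module], time.Now())
}

// shouldRefresh reports whether module was requested at least minHits
// times within the past window, e.g. 100 times in the last week per the
// example above.
func (r requestLog) shouldRefresh(module string, window time.Duration, minHits int) bool {
	cutoff := time.Now().Add(-window)
	n := 0
	for _, t := range r[module] {
		if t.After(cutoff) {
			n++
		}
	}
	return n >= minHits
}
```

A background refresh job would then skip any module for which shouldRefresh(mod, 7*24*time.Hour, 100) is false.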

In the meantime, if you would prefer, we can turn off all refresh traffic for your domain while we continue to improve this on our end. That would mean that the only traffic you would receive from us would be the result of a request directly from a user. This may impact the freshness of your domain's data which users receive from our servers, since we need to have some caching on our end to prevent too frequent fetches. We can do the same thing for @BenLubar's domain if preferred.

@ddevault
Author

ddevault commented Jun 8, 2021

Greater transparency and communication on the issue would be appreciated and would go far towards improving the optics of this problem.

Have you considered the robots.txt approach, which would simply allow the sysadmin to tune the rate at which you will scrape their service? The best option puts the controls in the hands of the sysadmins you're affecting. This is what the rest of the internet does.

Also, this probably isn't what you want to hear, but maybe the proxy is a bad idea in the first place. For my part, I use GOPROXY=direct for privacy/cache-breaking reasons, and I have found that many Go projects actually have broken dependencies that are only held up because they're in the Go proxy cache — which is an accident waiting to happen. Privacy concerns, engineering problems like this, and DDoSing hosting providers: this doesn't look like the best rap sheet for GOPROXY in general.

@BenLubar

BenLubar commented Jun 8, 2021

In the meantime, if you would prefer, we can turn off all refresh traffic for your domain while we continue to improve this on our end. That would mean that the only traffic you would receive from us would be the result of a request directly from a user. This may impact the freshness of your domain's data which users receive from our servers, since we need to have some caching on our end to prevent too frequent fetches. We can do the same thing for @BenLubar's domain if preferred.

That would be helpful, thanks.

@BenLubar

Was the change applied on your end? I disabled the part of my server configuration that returns a 429 response for GoModuleProxy git requests, and traffic seems to have shot back up to the level it was at before.
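
For reference, a server-side rule of that kind might look roughly like the following Go middleware sketch; the agent substring is the one reported earlier in this thread, and the upload-pack check is an assumption about where git clone traffic lands:

```go
package gitserver

import (
	"net/http"
	"strings"
)

// throttleModuleMirror answers 429 Too Many Requests to git smart-HTTP
// requests whose User-Agent identifies the Go module mirror. It matches
// both "GET .../info/refs?service=git-upload-pack" and
// "POST .../git-upload-pack", the two request shapes shown in the logs
// at the top of this issue.
func throttleModuleMirror(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.Contains(r.UserAgent(), "GoModuleMirror") &&
			strings.Contains(r.URL.String(), "git-upload-pack") {
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

It would wrap the git HTTP handler, e.g. http.ListenAndServe(addr, throttleModuleMirror(gitHandler)).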

[attached: traffic graph screenshot, 2021-06-14]

@katiehockman
Contributor

@BenLubar - no it hasn't been applied yet. It required some code changes on our end which took a few days. You should be able to expect it to be applied by the end of this week (likely sooner than that). I'll let you know if that's not the case.

@katiehockman
Contributor

The changes are now in prod to no longer send refresh traffic to the git.lubar.me domain. Please let us know if you are still experiencing higher traffic than you would expect. Thanks.

@seankhliao seankhliao changed the title Unusual traffic to git hosting service from Go proxy.golang.org: Unusual traffic to git hosting service from Go Jun 18, 2021
@Anachron

What's the state here? I read that sr.ht is still experiencing high traffic caused by Go's "proxy" feature?

Just thinking out loud: Can the crawlers not use a redis DB or similar to store the URL and datetime of the last clone, with entries cleaned up after 24 hours?

For a big project like Go, this fix should be trivial, and it would save resources on Go's side as well.
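
For illustration, a minimal sketch of that dedup idea, with an in-memory map standing in for the suggested redis store (with redis, key expiry would handle the 24-hour cleanup instead):

```go
package dedup

import (
	"sync"
	"time"
)

// cloneDedup remembers when each repository URL was last cloned and
// suppresses re-clones within the TTL (24 hours in the suggestion
// above).
type cloneDedup struct {
	mu   sync.Mutex
	seen map[string]time.Time
	ttl  time.Duration
}

func newCloneDedup(ttl time.Duration) *cloneDedup {
	return &cloneDedup{seen: make(map[string]time.Time), ttl: ttl}
}

// ShouldClone reports whether url is due for a fresh clone, recording
// the clone time when it is.
func (d *cloneDedup) ShouldClone(url string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if t, ok := d.seen[url]; ok && time.Since(t) < d.ttl {
		return false
	}
	d.seen[url] = time.Now()
	return true
}
```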

@heschi
Contributor

heschi commented May 25, 2022

Anyone who's receiving too much traffic from proxy.golang.org can request that they be excluded from the refresh traffic, as we did for git.lubar.me. Nobody asked for sr.ht to be added to the exclusion set, so as far as it's concerned nothing has changed.

We did consider caching clones, but it has security implications and adds complexity, so we decided not to. It is certainly not trivial to do and not something we are likely to do based on this issue.

Since there hasn't been activity on this issue in nearly a year, I'm going to close it. Anyone who wants to be excluded from refresh traffic can file a new issue.

@heschi heschi closed this as completed May 25, 2022
@FiloSottile
Contributor

FiloSottile commented May 25, 2022

[Without any Google hat, since I left the company earlier this month.]

I believe the operator of git.sr.ht is currently banned from this issue tracker, but I believe an email to golang-dev to request an exclusion would be accepted by the team as well. If not, they can email me personally at my golang.org email (filippo@) and I'll relay a request to be excluded from refresh traffic to the issue tracker.

@tomberek

If it's crawling, it should set an appropriate user-agent and respect robots.txt.

This seems to be a reasonable mechanism to provide some control of the refresh rate, rather than a binary choice.

@guillaume-uH57J9

guillaume-uH57J9 commented May 25, 2022

@tomberek Using robots.txt would indeed scale better than a Go-specific, manual exclusion process (i.e. filing a bug to ask the devs to manually update an exclusion list).

It's ultimately up to Google and the Go devs to pick a process, but one that scales well would be in the mutual interest of both module hosts and the Go devs.

Also, a heads up: this issue was just featured on YCombinator/HackerNews: https://news.ycombinator.com/item?id=31508000

@heschi
Contributor

heschi commented May 25, 2022

I see. I wasn't aware of the blog post or the HN item, thank you. That explains the sudden attention.

The proxy performs a mix of user-initiated traffic, for which robots.txt is not applicable, and refresh traffic, which it might perhaps be. For boring technical reasons, it would be a fair bit of extra work for us to read robots.txt, so rather than going to a bunch of work to do that, we implemented a trivial list and offered to add sr.ht to it. That offer was ignored at the time, but is still open.

Subsequent to this bug, there has been exactly one more request to suppress traffic to a host, #51284. It doesn't make sense to me to scale or otherwise improve a process with that little usage. That opinion could of course change if we get more requests.

@u9000-Nine

The proxy performs a mix of user-initiated traffic, for which robots.txt is not applicable, and refresh traffic, which it might perhaps be. For boring technical reasons, it would be a fair bit of extra work for us to read robots.txt, so rather than going to a bunch of work to do that, we implemented a trivial list

Standardization is worth extra work. Putting that aside, however, this could be trivially hacked on by periodically checking the robots.txt of domains in the refresh queue, then conditionally adding them to the list.
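
For illustration, such a periodic check might look like the following crude sketch. The agent name is the one reported earlier in the thread, the parsing is deliberately naive, and a real implementation would use a proper robots.txt parser:

```go
package robotscheck

import (
	"bufio"
	"net/http"
	"strings"
)

// disallowsMirror fetches a domain's robots.txt and reports whether a
// blanket "Disallow: /" applies to the given user-agent (or to *).
// Path- and rate-specific rules are ignored in this sketch.
func disallowsMirror(domain, agent string) bool {
	resp, err := http.Get("https://" + domain + "/robots.txt")
	if err != nil || resp.StatusCode != http.StatusOK {
		return false // no robots.txt: assume fetching is permitted
	}
	defer resp.Body.Close()
	applies := false
	sc := bufio.NewScanner(resp.Body)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "User-agent:"):
			ua := strings.TrimSpace(strings.TrimPrefix(line, "User-agent:"))
			applies = ua == "*" || strings.EqualFold(ua, agent)
		case applies && strings.HasPrefix(line, "Disallow:"):
			if strings.TrimSpace(strings.TrimPrefix(line, "Disallow:")) == "/" {
				return true
			}
		}
	}
	return false
}
```

A refresh scheduler could run this over the domains in its queue, e.g. disallowsMirror("git.example.org", "GoModuleMirror"), and add matches to the existing exclusion list.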

Subsequent to this bug, there has been exactly one more request to suppress traffic to a host, #51284. It doesn't make sense to me to scale or otherwise improve a process with that little usage. That opinion could of course change if we get more requests.

The fact that it's a Go-specific process could contribute to the lack of requests, which would significantly skew this metric.


@golang golang locked as resolved and limited conversation to collaborators May 26, 2022