Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/net/publicsuffix: automate periodic regeneration #15518

Open
yaslama opened this issue May 3, 2016 · 13 comments
Open

x/net/publicsuffix: automate periodic regeneration #15518

yaslama opened this issue May 3, 2016 · 13 comments
Assignees
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@yaslama
Copy link

yaslama commented May 3, 2016

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?
    1.6.1
  2. What operating system and processor architecture are you using (go env)?
    linux/amd64
  3. What did you do?

I used golang.org/x/net/publicsuffix and it doesn't contain the last updated list

  1. What did you expect to see?

An updated list

  1. What did you see instead?

A list from 2016-03-01

The list is changing more than before. The trigger was LetsEncrypt but people begin to understand the security implications beside Letsencrypt.

@bradfitz bradfitz changed the title Please update golang.org/x/net/publicsuffix/table.go from github.com/publicsuffix/list x/net/publicsuffix: update May 3, 2016
@bradfitz
Copy link
Contributor

bradfitz commented May 3, 2016

@nigeltao, we should talk about this today.

@bradfitz bradfitz added this to the Unreleased milestone May 3, 2016
@bradfitz bradfitz self-assigned this May 3, 2016
@yaslama
Copy link
Author

yaslama commented May 5, 2016

@bradfitz and @nigeltao, did you reach any conclusion about the update ?
Thanks a lot.

@bradfitz
Copy link
Contributor

bradfitz commented May 5, 2016

We talked about making some online serving mode where your program could fetch the latest data at runtime and in the background periodically, relieving of us of the need to regularly rebuild this and litter the git repo's history with huge binary blob updates every week.

If somebody wants to write that mode, that'd be great.

@weppos
Copy link

weppos commented May 26, 2016

FYI to solve this problem I made available an alternative implementation that can parse a list dynamically: https://github.com/weppos/publicsuffix-go

There is an open ticket in the Let's Encrypt repo about using a different way to pull the PSL changes.
letsencrypt/boulder#1479

@yaslama
Copy link
Author

yaslama commented May 26, 2016

To @Webpos, I think you mean https://github.com/weppos/publicsuffix-go

@weppos
Copy link

weppos commented May 26, 2016

Indeed, wrong buffer. Thanks for pointing it out, I've updated the link.

@bradfitz bradfitz changed the title x/net/publicsuffix: update x/net/publicsuffix: automate periodic regeneration Apr 24, 2017
@bradfitz bradfitz removed their assignment Apr 24, 2017
@bradfitz bradfitz added Builders x/build issues (builders, bots, dashboards) help wanted labels Apr 24, 2017
@nigeltao
Copy link
Contributor

This came up again in a private conversation with a colleague.

As I understand the original request, "automate periodic regeneration", the idea is to subscribe for changes to the canonical https://publicsuffix.org/ data. When it changes, generate and commit the x/net/publicsuffix compiled form automatically. I don't know if there are any operational security or other administrative concerns about having a bot (or automated processes in general) committing code. I don't know if we already have bots pushing to go.googlesource.com. Perhaps @bradfitz would know??

To echo an earlier comment by @bradfitz, another cause for pushback is that every update causes non-trivial growth in the x/net git repository. Even if an individual upstream commit's diff is relatively small, the downstream x/net/publicsuffix commit's diff can be relatively large, since we pack the implicit tree structure in that text file into a dense index of nodes and a dense index of strings. Deleting one entry in the upstream text file will re-index every subsequent entry in the downstream form. As a data point, picking up only 7 weeks worth of upstream changes:

$ git clone https://go.googlesource.com/net
Cloning into 'net'...
remote: Total 7752 (delta 5384), reused 7752 (delta 5384)
Receiving objects: 100% (7752/7752), 5.73 MiB | 8.30 MiB/s, done.
Resolving deltas: 100% (5384/5384), done.
$ du --bytes --summarize net
10841140	net
$ cd net/publicsuffix/
$ time go generate

real	0m4.048s
user	0m2.288s
sys	0m0.336s
$ git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   table.go
	modified:   table_test.go

no changes added to commit (use "git add" and/or "git commit -a")
$ git commit -a -m blahblahblah
[master c7fbea3] blahblahblah
 2 files changed, 9743 insertions(+), 9749 deletions(-)
 rewrite publicsuffix/table.go (68%)
$ cd ../..
$ du --bytes --summarize net
10995803	net

The disk usage difference, 10995803 - 10841140, is 154663 bytes, or 1.4% of 10841140.

One idea (not necessarily a good one, but I don't want to forget it) to cap git repo size concerns is to keep x/net/publicsuffix as the package name that callers should import, for back compat, but have its implementation delegate to x/publicsuffixvolume000 (yes, there's probably a better name), which is updated more frequently. Whenever that gets too big (for whatever the value of "big" is), roll it over to volume001, then volume002, etc. Over time, programmers who go get golang.org/x/net/publicsuffix might end up also cloning a moderately large git repo, but they shouldn't need to clone very large ones, nor should x/net grow too large with old PSL data, and they might be able to garbage collect no longer needed 'volumes'.

Back to solving the original problem, x/net/publicsuffix being stale, I don't know what e.g. Let's Encrypt does these days. Perhaps @weppos can say.

Again, as @bradfitz suggested above, we could add API where the caller brings their own copy of the PSL-as-a-text-file for the library to compile. Compilation used to take minutes, but now takes seconds, so that might be more feasible than it was a few years ago. We might have to cut the overall feature into multiple packages, so that people who just want the pre-compiled version don't have to import all the dependencies for dynamic generation, but if we want to do dynamic generation, that ought to be a solvable problem.

To be clear, I don't have much spare time to work on this myself, but I'm writing this all down before I forget the conversation.

@weppos
Copy link

weppos commented Jul 18, 2018

Back to solving the original problem, x/net/publicsuffix being stale, I don't know what e.g. Let's Encrypt does these days. Perhaps @weppos can say.

They currently use my Go library
https://github.com/weppos/publicsuffix-go

I setup a process that does exactly what you described. It checks for changes, and automatically creates a PR for review.
https://github.com/weppos/publicsuffix-go/search?q=autopull&type=Issues

In my case, this is trivial because I don't generate a blob. I cleanup, pre-process the list and pre-compile a Go file for optimal load, but I don't go any further. As a result, diffs are small:
https://github.com/weppos/publicsuffix-go/pull/128/files

Furthermore, weppos/publicsuffix-go#128 was designed to allow to load an external list with the exact intent of allowing the consumer to bypass the shipped version of the list. There's a couple of cases where this is a good idea: if the packaged list is outdated, or if you want to use a custom list (after all, the PSL is a format as well so you can submit your own list).

@nigeltao
Copy link
Contributor

@weppos my colleague had some concerns about https://github.com/weppos/publicsuffix-go performance. Every List.Find call does a linear iteration over the entire list of rules, right? (Over 8000, by default?) Have any of your other users raised performance concerns? Is there any intention (an issue?) to build a fancier data structure (i.e. a tree)?

@weppos
Copy link

weppos commented Jul 18, 2018

@nigeltao the concern is absolutely valid and the only reason I haven't flagged the lib as 1.x (it's still 0.x) is exactly because I want to change the lookup strategy.

My plan was to switch to the same Hash-based lookup I implemented in the Ruby library (weppos/publicsuffix-ruby#133) that proved to be extremely efficient. I also made tests and benchmarks with a Trie based lookup (weppos/publicsuffix-ruby#134) but there was little if no gain. I also made some research to use a DAWG/DAFSA but it turns out while it is extremely memory efficient (I tried to balance lookup performance with memory size), it works well only in cases where the list is not intended to be dynamically modified. In fact, it works well for the Chromium use case (the folks at Chrome implemented the lookup with a DAFSA) but not in the case of a more general-purpose library.

Peter from Amazon also reported another proposal that seems to provide further efficiency over the current version that I still need to review weppos/publicsuffix-ruby#143

All these versions provide good performance still adopting a representation that is diff-efficient.

Long story short, I definitely want to use a more efficient data structure. To be fair, I did not put too much effort on it lately as nobody really complained about performance. Still, it's on the roadmap. I should probably create a ticket on the repo and dump this info there (so that it can also inspire contributions).

@vanbroup
Copy link
Contributor

vanbroup commented Aug 9, 2018

You might want to try https://github.com/globalsign/publicsuffix which aims for performance close to https://golang.org/x/net/publicsuffix whilst adding the capability to update the PSL similar to https://github.com/weppos/publicsuffix-go

@gopherbot
Copy link

Change https://go.dev/cl/450935 mentions this issue: publicsuffix: embed table data

@neild
Copy link
Contributor

neild commented Nov 16, 2022

Regenerating the publicsuffix package every time we tagged a new x/net version (monthly) wouldn't add all that much repo growth. Is monthly good enough? Or do we ideally want something that tracks the upstream list more closely? If we want a schedule faster than the x/net tagging one, that's another reason to move to a new module.

At least half the time spent generating the publicsuffix repo right now is text compression, where we take a list of label strings and crush them into a single string containing all the individual labels, possibly overlapping. On my laptop, this step takes 100ms and reduces the text size by 33%, saving 16KiB. (For comparison, compress/gzip takes 1.5ms and reduces the text size by 47%, although its eventual form is less indexable.)

Maybe we could stop compressing the list? 16KiB isn't a lot these days, and storing the list uncompressed would reduce commit deltas.

gopherbot pushed a commit to golang/net that referenced this issue Nov 16, 2022
Use //go:embed to embed the public suffix tables,
rather than generating .go files containing the data.

Creating an empty git repo and generating commits for the
last 20 updates to the public suffix list, the total size
of the repository directory as measured by "du -sh" decreases
from 2.2M to 668K when using embedding.

For golang/go#15518.

Change-Id: Id71759765831a7699e7a182937095b3820bb643b
Reviewed-on: https://go-review.googlesource.com/c/net/+/450935
Run-TryBot: Damien Neil <dneil@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Nigel Tao <nigeltao@golang.org>
Reviewed-by: Nigel Tao (INACTIVE; USE @golang.org INSTEAD) <nigeltao@google.com>
@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

8 participants