
proposal: runtime: optionally allow callers to intern strings #5160

Closed
bradfitz opened this issue Mar 29, 2013 · 30 comments
Labels
FrozenDueToAge, GarbageCollector, NeedsDecision (feedback is required from experts, contributors, and/or the community before a change can be made), Performance

Comments

@bradfitz
Contributor

Duplicate strings waste memory and cause lots of allocations, but they are cheaper at allocation time.

Interning strings has a lookup cost, but it can save memory and even CPU by reducing the number of GCs.

I'd like to have this choice, for when I know my strings are large and likely duplicates.

Manual interning in user code,

  var internedStrings struct {
     sync.RWMutex
     m map[string]string
  }

... leads to lock contention and a map that grows forever.
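For illustration, the lookup/insert path over such a table typically looks like the sketch below (an assumed example of common user code, not code from this issue); every caller funnels through the one RWMutex, and nothing is ever evicted:

func intern(s string) string {
  internedStrings.RLock()
  c, ok := internedStrings.m[s]
  internedStrings.RUnlock()
  if ok {
    return c
  }
  internedStrings.Lock()
  defer internedStrings.Unlock()
  if c, ok := internedStrings.m[s]; ok { // re-check under the write lock
    return c
  }
  if internedStrings.m == nil {
    internedStrings.m = make(map[string]string)
  }
  internedStrings.m[s] = s
  return s
}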

The runtime could have a less contentious map and could integrate with the garbage
collector, such that strings referenced only from the intern table are still collected
and removed from the internal map.
@bradfitz
Contributor Author

Comment 1:

And I'd like to be able to look up the interned strings from a []byte or a string.
Related: issue #3512

@ianlancetaylor
Contributor

Comment 2:

You could do this in user space if the runtime provided weak pointers: pointers that are
set to nil if the object to which they point is garbage collected.
(I don't see why the runtime package has any advantage in making access to the map less
contentious, but perhaps I am missing something.)

@bradfitz
Contributor Author

Comment 3:

Even if the runtime provided weak pointers, I'm still not exactly sure how I would
implement it.  What would my map key be?  I might also need the runtime to give me the
string's hash value, since doing my own hash function would probably be too slow.
The runtime could do a better job since it could use a less contentious hashmap
(non-iterable, pushing down locks further, aware of resizes, etc) without having to
offer that to the language.

@dvyukov
Member

dvyukov commented Mar 31, 2013

Comment 4:

Agree with Brad: even if it uses weak pointers internally, it's very difficult to
implement and optimize.
It doesn't necessarily have to live in the runtime; it could be in strings/bytes but use
some runtime helpers.
Btw, wrt []byte, do you mean that there will be a bold caution comment saying that you
must not mutate them?

@bradfitz
Contributor Author

Comment 5:

Answered []byte question here: https://golang.org/issue/3512?c=5

@bradfitz
Contributor Author

bradfitz commented Apr 3, 2013

Comment 6:

Labels changed: added performance.

@bradfitz
Contributor Author

bradfitz commented Apr 3, 2013

Comment 7:

Labels changed: removed optimization.

@bradfitz
Contributor Author

Comment 8:

Labels changed: added garbage.

@rsc
Contributor

rsc commented Nov 27, 2013

Comment 9:

Labels changed: added go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 10:

Labels changed: added release-none, removed go1.3maybe.

@rsc
Contributor

rsc commented Dec 4, 2013

Comment 11:

Labels changed: added repo-main.

@josharian
Contributor

@bradfitz I made you a thing: https://github.com/josharian/intern/. It's not exactly what you asked for, and it's a kinda horrible abuse of sync.Pool, and it relies on the gc-specific []byte map key optimization...but it's short and simple and is kinda somewhat like what you asked for. :)

I fiddled with a bunch of non-short, non-simple alternatives and didn't like any of them.
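For readers following along, the core trick is roughly the sketch below (an approximation of the idea, not the package's actual source): the map[string]string lives in a sync.Pool so the GC can discard it, and the lookup is written as m[string(b)] so the gc compiler's map-key optimization avoids allocating when the string is already interned.

package intern

import "sync"

var pool = sync.Pool{
  New: func() interface{} { return make(map[string]string) },
}

// Bytes returns a string equal to b, preferring a previously interned copy.
// Because the map lives in a sync.Pool, interned strings are only retained
// until the pool is cleared at the next GC.
func Bytes(b []byte) string {
  m := pool.Get().(map[string]string)
  s, ok := m[string(b)] // no allocation: the gc compiler special-cases map lookups keyed by string(b)
  if !ok {
    s = string(b)
    m[s] = s
  }
  pool.Put(m)
  return s
}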

@CAFxX
Contributor

CAFxX commented Sep 27, 2018

@josharian thanks for https://github.com/josharian/intern! In one of our internal apps that has to parse and process a lot of JSON (with a lot of repeated string values), using it let us slash memory usage significantly (30% on the whole app, >75% just for the in-memory data coming from the parsed JSON).

I know I'm jumping the gun on this one, but +1 for having interning - ideally even as an option when unmarshaling serialized payloads, i.e. something along the lines of:

type Collection struct {
  Objects []Object `json:"objects"`
}

type Object struct {
  GUID string `json:"guid"`
  ParentGUID string `json:"parent_guid,intern"` // perform interning during unmarshaling
  // ...
}

(The rationale for interning during unmarshaling: if the Objects array contains many elements and interning only happens after unmarshaling, we have to pay for a lot of allocations that could have been avoided.)
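To make that concrete, without decoder support the deduplication has to be a second pass along these lines (a hypothetical sketch using the Collection and Object types above plus encoding/json); every ParentGUID has already been allocated before it can be shared:

func unmarshalAndIntern(data []byte) (Collection, error) {
  var c Collection
  if err := json.Unmarshal(data, &c); err != nil {
    return Collection{}, err
  }
  seen := make(map[string]string)
  for i := range c.Objects {
    g := c.Objects[i].ParentGUID // already a fresh allocation per element
    if canon, ok := seen[g]; ok {
      c.Objects[i].ParentGUID = canon // duplicates now share one backing array
    } else {
      seen[g] = g
    }
  }
  return c, nil
}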

@josharian
Contributor

@CAFxX I'm pleasantly surprised that it ended up being useful.

I think prior to exposing any API, it’d be worth experimenting with doing this automatically in the runtime. Do it:

  • per-P
  • only for very small strings (to avoid pinning large chunks of memory for arbitrary lengths of time)
  • using a cheap hash (or go fully associative)
  • using something simple like random replacement for the cache

I can't predict offhand how that would turn out, but it seems a simple enough experiment to run.
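In ordinary Go code, the shape of such a cache might look like the sketch below (illustration only: the hash, the 256-slot table, and the 32-byte cutoff are made-up numbers, and a real runtime version would keep one table per P so no locking is needed, which a global variable like this cannot express):

const (
  internSlots  = 256 // power of two
  internMaxLen = 32  // only intern very small strings
)

var internTable [internSlots]string // the runtime version would be per-P

func internSmall(b []byte) string {
  if len(b) == 0 || len(b) > internMaxLen {
    return string(b)
  }
  h := uint32(2166136261) // cheap FNV-1a hash
  for _, c := range b {
    h ^= uint32(c)
    h *= 16777619
  }
  slot := &internTable[h&(internSlots-1)]
  if *slot == string(b) { // the gc compiler avoids allocating for string(b) used only in a comparison
    return *slot
  }
  s := string(b)
  *slot = s // replace on collision
  return s
}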

@CAFxX
Contributor

CAFxX commented Oct 9, 2018

I'm pleasantly surprised that it ended up being useful

Full disclosure: the JSON structures we parse, which contain significant amounts of repeated strings, are kept in memory as a form of local LRU cache to avoid lookups on a remote API. So this is probably a very favorable scenario for interning. At the same time, it's a scenario that is common enough to probably warrant consideration.

I think prior to exposing any API, it’d be worth experimenting with doing this automatically in the runtime

Absolutely agree, as long as this won't significantly impact regular string ops. I'm not an expert in GC/runtime, but it would be neat (not for the PoC) if this form of automatic interning was done only for strings that are either likely to survive GC or likely to be already in the interning table. And ideally the interning table should not keep strings alive during GC (like you do in intern).

Random idea for future consideration if this ever gets implemented: maybe interning could be done in multiple steps:

  • first, opportunistically (small, per-P table with random replacement), during string creation
  • then, more aggressively, during concurrent GC for non-garbage strings

only for very small strings

Just as a data point, in our case most of the strings were GUIDs... so it would have helped if the upper limit for "very small" was at least 36 bytes.

@gopherbot

Change https://golang.org/cl/141641 mentions this issue: runtime: intern some small strings

@josharian
Contributor

CL 141641 is a quick-and-dirty experiment with doing this in the runtime.

@CAFxX
Contributor

CAFxX commented Oct 13, 2018

@josharian I actually tried this as well (and got a little bit farther: I also hooked the GC up in clearpools and implemented interning in concatstrings; btw, I instead went for a hash-based approach to allow bigger per-P tables), but I tested it only on the go1 benchmarks, and the results were not bad but not too good either. I was planning to work on it more today. Maybe I can submit an early CL in the meantime for feedback.

@josharian
Contributor

Cool!

I recommend trying a handful of std package benchmarks as well as go1; the go1 benchmarks can be narrow in scope. And of course definitely run the runtime string benchmarks. :)

@josharian
Contributor

@CAFxX see also discussion with @randall77 in
https://go-review.googlesource.com/c/go/+/141641.

@CAFxX
Contributor

CAFxX commented Oct 15, 2018

I have pushed my current experiment on my fork (see master...CAFxX:intern). It is not pretty as it's meant to be an experiment; specifically, there's a lot of duplication that would need to be sorted out. I'm also not 100% sure the GC part is done correctly (specifically, is it correct to assume that we don't need locking in clearpools?).

What is implemented:

  • per-P interning tables with 1024 slots, hashing via memhash and per-P seeds, random replacement policy (specifically, replacement on hash collision)
  • intern all escaping strings shorter than 64 bytes created either from []byte, []rune or by concatenating strings
  • per-P tables are cleared and per-P seeds are updated at start of GC

The magic numbers 1024 and 64 are arbitrary; I haven't really attempted any tuning.

Performance is a mixed bag; some benchmarks improve, others slow down. Allocation benchmarks strictly improve. I suspect some of the regressions could be avoided or mitigated by refactoring some logic in the standard library (e.g. in strings.Builder), by changes in the compiler, or by deferring interning to the GC mark phase. I will post the current benchmark results ASAP.

I read @randall77's comments and I can say that yes, we can try to start smaller (e.g. only do interning in encoding/json and then move from there). I still stand by what I suggested above, i.e. that if the runtime does the right thing with no knobs, it's better.

@CAFxX
Contributor

CAFxX commented Oct 16, 2018

I managed to improve things a bit (removed code duplication and lowered the overhead for when interning is not supposed to happen) and now there are almost no regressions apart from 3 benchmarks that I still need to analyze in detail (they could very well be artifacts). These are the current results for the strings, strconv, unicode/utf8 and encoding/json benchmarks (comparing master to my branch; benchmarks run on my laptop): https://gist.github.com/CAFxX/5b057d40e9d61dab54ffd26f54181f0d

I plan to run some additional benchmarks simulating our production workloads to validate the results. If you have suggestions about additional benchmarks to run, please let me know. It would also be good for somebody else to run the benchmarks on different hardware.

I pushed the latest version here: master...CAFxX:intern

Assuming we manage to contain all regressions, would such an approach (which has the benefit of not requiring new APIs or modifications to existing ones) be considered for inclusion?

@josharian josharian added the NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. label Nov 10, 2018
@josharian josharian modified the milestones: Unplanned, Go1.13 Nov 10, 2018
@josharian josharian changed the title runtime: optionally allow callers to intern strings proposal: runtime: optionally allow callers to intern strings Dec 14, 2018
@josharian
Contributor

Given the recent flurry of activity, and given that this is a significant change, and given that there are a variety of options (automatic interning, package runtime API, etc.), I've changed this to a proposal so that we can get some input from @golang/proposal-review.

@rsc
Contributor

rsc commented Dec 19, 2018

@josharian What is the exact API proposal? I see a very long issue thread here and lots of links, but could you please post a note here with the API in the text of the note?

@rsc
Contributor

rsc commented Dec 19, 2018

OK, it looks like https://github.com/josharian/intern/blob/master/intern.go is the API but I don't see what the proposed implementation is. (Probably not what's in that file.) Please help me.

@josharian
Contributor

@rsc there are several different paths we could take here. I'm not confident that I have clarity about optimal path(s) forward; I'm hoping to restart discussion.

One option is to automatically intern some strings. This has the benefit of no new API, but the downside of making program performance harder to predict. @CAFxX has experimented with this a bunch (above).

If we are going to do manual interning, this looks a lot like free lists and sync.Pool. If you have a way to handle string interning with the correct lifetime and control over concurrency, you should just do it there (much like with free lists). The harder case is where you don't have a good way to manage the string pool's lifetime or concurrency.

For the easy case, we might consider adding to package strings:

type Interner struct {
  // ...
}

func (i *Interner) Intern(s string) string
func (i *Interner) InternBytes(b []byte) string

The argument for adding API for the easy case is that, unlike with free lists, the implementation is non-obvious and non-portable (it requires specific compiler optimization #3512). (Maybe that means it belongs in package runtime as runtime.StringInterner, although that is uglier.)
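For the easy case, one plausible user-space shape is the sketch below (my own illustration, not the proposed standard-library implementation, assuming the struct's elided fields are just m map[string]string). The InternBytes path is exactly where the non-portable part shows up, because it relies on the gc compiler not allocating for a map lookup keyed by string(b) (#3512):

func (i *Interner) Intern(s string) string {
  if c, ok := i.m[s]; ok {
    return c
  }
  if i.m == nil {
    i.m = make(map[string]string)
  }
  i.m[s] = s
  return s
}

func (i *Interner) InternBytes(b []byte) string {
  if c, ok := i.m[string(b)]; ok { // gc-only: this lookup does not allocate
    return c
  }
  s := string(b)
  if i.m == nil {
    i.m = make(map[string]string)
  }
  i.m[s] = s
  return s
}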

The harder case needs to handle lifetime as well as concurrency. The analogy with sync.Pool suggests that package sync might be a good home for it, with the API spelled the same as above in package strings. We document that interning is best effort, just like re-using objects from sync.Pool. When sync.Pools get flushed during GC, so does the set of interned strings (possibly using a similar mechanism to #22950). And the sync version provides concurrency-safety, perhaps just using a plain mutex. (Or maybe using #18802; this is another reason to make interning best-effort.)

Does this help, Russ?

josharian added a commit to josharian/go that referenced this issue Dec 30, 2018
DO NOT SUBMIT

Updates golang#5160

This is the simplest thing I could think of to experiment with.

It's incomplete: It doesn't cover all the ways to create strings.
It's not optimized at all. I picked the magic numbers haphazardly.

Nevertheless, it doesn't do too badly, when applied to the compiler:

name        old time/op       new time/op       delta
Template          187ms ± 3%        184ms ± 2%  -1.47%  (p=0.000 n=97+95)
Unicode          86.9ms ± 5%       87.3ms ± 4%    ~     (p=0.065 n=99+94)
GoTypes           658ms ± 2%        659ms ± 2%    ~     (p=0.614 n=99+97)
Compiler          2.94s ± 2%        2.94s ± 2%    ~     (p=0.945 n=95+95)
SSA               8.53s ± 1%        8.54s ± 1%    ~     (p=0.276 n=97+98)
Flate             125ms ± 3%        124ms ± 4%  -0.78%  (p=0.000 n=99+97)
GoParser          149ms ± 4%        149ms ± 3%    ~     (p=0.595 n=100+95)
Reflect           410ms ± 3%        412ms ± 4%  +0.48%  (p=0.047 n=97+96)
Tar               167ms ± 3%        166ms ± 5%    ~     (p=0.078 n=99+98)
XML               227ms ± 3%        227ms ± 4%    ~     (p=0.723 n=96+95)
[Geo mean]        388ms             387ms       -0.17%

name        old user-time/op  new user-time/op  delta
Template          223ms ± 3%        221ms ± 3%  -0.78%  (p=0.000 n=99+95)
Unicode           109ms ± 8%        111ms ± 6%  +1.36%  (p=0.001 n=99+99)
GoTypes           846ms ± 2%        848ms ± 2%    ~     (p=0.092 n=97+97)
Compiler          3.91s ± 2%        3.91s ± 2%    ~     (p=0.666 n=97+96)
SSA               12.1s ± 2%        12.1s ± 1%    ~     (p=0.128 n=92+96)
Flate             145ms ± 3%        144ms ± 4%    ~     (p=0.157 n=93+99)
GoParser          180ms ± 5%        181ms ± 5%  +0.63%  (p=0.004 n=90+94)
Reflect           522ms ± 3%        524ms ± 4%    ~     (p=0.055 n=95+96)
Tar               203ms ± 5%        203ms ± 5%    ~     (p=0.880 n=100+99)
XML               280ms ± 4%        281ms ± 4%    ~     (p=0.170 n=99+98)
[Geo mean]        487ms             488ms       +0.19%

name        old alloc/op      new alloc/op      delta
Template         36.3MB ± 0%       36.2MB ± 0%  -0.23%  (p=0.008 n=5+5)
Unicode          29.7MB ± 0%       29.7MB ± 0%  -0.07%  (p=0.008 n=5+5)
GoTypes           126MB ± 0%        126MB ± 0%  -0.27%  (p=0.008 n=5+5)
Compiler          537MB ± 0%        536MB ± 0%  -0.21%  (p=0.008 n=5+5)
SSA              2.00GB ± 0%       2.00GB ± 0%  -0.12%  (p=0.008 n=5+5)
Flate            24.6MB ± 0%       24.6MB ± 0%  -0.28%  (p=0.008 n=5+5)
GoParser         29.4MB ± 0%       29.4MB ± 0%  -0.20%  (p=0.008 n=5+5)
Reflect          87.3MB ± 0%       86.9MB ± 0%  -0.52%  (p=0.008 n=5+5)
Tar              35.6MB ± 0%       35.5MB ± 0%  -0.28%  (p=0.008 n=5+5)
XML              48.4MB ± 0%       48.4MB ± 0%  -0.16%  (p=0.008 n=5+5)
[Geo mean]       83.3MB            83.1MB       -0.23%

name        old allocs/op     new allocs/op     delta
Template           352k ± 0%         347k ± 0%  -1.16%  (p=0.008 n=5+5)
Unicode            341k ± 0%         339k ± 0%  -0.76%  (p=0.008 n=5+5)
GoTypes           1.28M ± 0%        1.26M ± 0%  -1.48%  (p=0.008 n=5+5)
Compiler          4.97M ± 0%        4.90M ± 0%  -1.38%  (p=0.008 n=5+5)
SSA               15.6M ± 0%        15.3M ± 0%  -2.11%  (p=0.016 n=4+5)
Flate              233k ± 0%         228k ± 0%  -1.92%  (p=0.008 n=5+5)
GoParser           294k ± 0%         290k ± 0%  -1.32%  (p=0.008 n=5+5)
Reflect           1.04M ± 0%        1.03M ± 0%  -1.73%  (p=0.008 n=5+5)
Tar                343k ± 0%         337k ± 0%  -1.62%  (p=0.008 n=5+5)
XML                432k ± 0%         426k ± 0%  -1.28%  (p=0.008 n=5+5)
[Geo mean]         813k              801k       -1.48%

Change-Id: I4cd95bf4a74479b0e8a8d339d77e248d1467a6e0
@rsc
Contributor

rsc commented Jan 9, 2019

Now that the compiler is better about not copying for string(b) in various cases, is there any reason this needs to be in the standard library? Is there any runtime access it really needs?

@bradfitz says he'd be OK with closing this.

@josharian
Contributor

  1. If we are going to automatically intern strings in the runtime, then this obviously needs runtime access. But maybe that doesn't need a proposal?

  2. The argument for adding std API to assist with interning strings is given explicitly in my comment above. See "The argument for adding API for the easy case is...". However, I can see that this is not particularly compelling, and a third party package would be fine. Maybe I'll write one.

  3. The hard case above needs runtime support for the same reason sync.Pool does: the ability to respond to GC needs and not grow forever. (sync.Pool also benefits from per-P storage, which interning would as well, but that seems less critical.) Maybe we can address that independently, which would benefit everyone. I'll write up a new issue to discuss that, although I anticipate that there will be tricky design issues.

@rsc
Contributor

rsc commented Jan 23, 2019

It still seems too specialized. You can do interning with a plain map[string]string today, but then you have to know when to drop the map. It sounds like this API contributes "you don't have to think about when to drop the map". But then the runtime does, and that's a lot of work, and it's unclear that we know how to do it well. We're not even managing sync.Pool that well. This is much more complicated, since it essentially depends on weak-reference logic to decide whether a string is needed or not. I'm skeptical we should spend the time to get this right versus other work we could be doing.

Is it really worth the cost of a general solution here, versus people who have this need just using a map[string]string?

Brad suggests maybe there's some way to "just mark these intern-string pools last" in order to automatically evict strings from the pool once all the references are gone, but it's not obvious to me how to implement that with the distributed mark termination detection and lazy sweeping. It seems like possibly adding a lot of complexity that would otherwise go unused.

/cc @RLH @aclements

@josharian
Contributor

Re-reading this, in an ideal world, we'd be able to build something sophisticated and sync.Pool-like as a third party package. I tried this, using the finalizer hack from @CAFxX to detect GC cycles. The performance was an order of magnitude worse than my sync.Pool-based implementation, due to mutex overhead; we need #18802.
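(For context, GC-cycle detection via finalizers is roughly the pattern below; this is a general sketch of the technique, not the actual code referenced above. A finalizer on a throwaway sentinel fires after each collection, re-arms itself, and the callback clears or swaps the intern map.)

import "runtime"

type gcSentinel struct{ _ *gcSentinel } // a pointer field keeps it out of the tiny allocator

// onGC calls fn (on the runtime's finalizer goroutine) after each GC cycle.
func onGC(fn func()) {
  runtime.SetFinalizer(&gcSentinel{}, func(*gcSentinel) {
    fn()     // e.g. drop or swap the intern map
    onGC(fn) // re-arm with a fresh sentinel for the next cycle
  })
}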

I'm going to close this proposal.

For the simple case, it is easy enough to build yourself or have as a third party package.

For the harder case, I think we should provide the tools for people to build this well themselves: #18802 and #29696.

As for doing it automatically in the runtime, I guess we'll just decide on a CL-by-CL basis.
