cmd/link: optimize de-duplication #14648

Closed
bradfitz opened this issue Mar 4, 2016 · 29 comments

@bradfitz
Contributor

bradfitz commented Mar 4, 2016

Using suffixarray in the linker to de-duplicate strings saves a bunch of space in the binaries. @crawshaw did this in https://golang.org/cl/19987 but it burns a lot of CPU.

Running the benchmarks in index/suffixarray,

$ go test -v -bench=NewIndex -benchtime=3s -cpuprofile=prof.cpu index/suffixarray

shows that 43% of the CPU time (50% before SSA!) is spent on a single line, the suffixSortable.Less function:

(pprof) top
7470ms of 7960ms total (93.84%)
Dropped 39 nodes (cum <= 39.80ms)
Showing top 10 nodes out of 56 (cum >= 110ms)
      flat  flat%   sum%        cum   cum%
    3470ms 43.59% 43.59%     3470ms 43.59%  index/suffixarray.(*suffixSortable).Less
    1900ms 23.87% 67.46%     5370ms 67.46%  sort.doPivot
     820ms 10.30% 77.76%      990ms 12.44%  index/suffixarray.(*suffixSortable).updateGroups
     310ms  3.89% 81.66%      310ms  3.89%  runtime.duffcopy
     210ms  2.64% 84.30%     7220ms 90.70%  index/suffixarray.qsufsort
     200ms  2.51% 86.81%      200ms  2.51%  runtime.deductSweepCredit
     190ms  2.39% 89.20%      220ms  2.76%  index/suffixarray.initGroups
     130ms  1.63% 90.83%      130ms  1.63%  runtime.usleep
     130ms  1.63% 92.46%      280ms  3.52%  sort.insertionSort
     110ms  1.38% 93.84%      110ms  1.38%  index/suffixarray.(*suffixSortable).Swap

(pprof) list Less
Total: 7.96s
ROUTINE ======================== index/suffixarray.(*suffixSortable).Less in /Users/bradfitz/go/src/index/suffixarray/qsufsort.go
     3.47s      3.47s (flat, cum) 43.59% of Total
         .          .    137:   h   int
         .          .    138:   buf []int // common scratch space
         .          .    139:}
         .          .    140:
         .          .    141:func (x *suffixSortable) Len() int           { return len(x.sa) }
     3.47s      3.47s    142:func (x *suffixSortable) Less(i, j int) bool { return x.inv[x.sa[i]+x.h] < x.inv[x.sa[j]+x.h] }
         .          .    143:func (x *suffixSortable) Swap(i, j int)      { x.sa[i], x.sa[j] = x.sa[j], x.sa[i] }
         .          .    144:
         .          .    145:func (x *suffixSortable) updateGroups(offset int) {
         .          .    146:   bounds := x.buf[0:0]
         .          .    147:   group := x.inv[x.sa[0]+x.h]


(pprof) disasm Less
Total: 7.96s
ROUTINE ======================== index/suffixarray.(*suffixSortable).Less
     3.47s      3.47s (flat, cum) 43.59% of Total
      90ms       90ms      71210: GS MOVQ GS:0x8a0, CX
      70ms       70ms      71219: CMPQ 0x10(CX), SP
         .          .      7121d: JBE 0x7127e
     160ms      160ms      7121f: MOVQ 0x8(SP), CX
     100ms      100ms      71224: MOVQ 0(CX), DX
     100ms      100ms      71227: MOVQ 0x18(CX), BX
     130ms      130ms      7122b: MOVQ 0x20(CX), BP
      20ms       20ms      7122f: MOVQ 0x8(CX), SI
      80ms       80ms      71233: MOVQ 0x10(SP), DI
      80ms       80ms      71238: CMPQ SI, DI
         .          .      7123b: JAE 0x71277
     200ms      200ms      7123d: MOVQ 0(DX)(DI*8), AX
     140ms      140ms      71241: MOVQ 0x30(CX), CX
      80ms       80ms      71245: LEAQ 0(AX)(CX*1), DI
     110ms      110ms      71249: CMPQ BP, DI
         .          .      7124c: JAE 0x71277
     120ms      120ms      7124e: MOVQ 0(BX)(DI*8), DI
     320ms      320ms      71252: MOVQ 0x18(SP), R8
      30ms       30ms      71257: CMPQ SI, R8
         .          .      7125a: JAE 0x71277
      30ms       30ms      7125c: MOVQ 0(DX)(R8*8), AX
      50ms       50ms      71260: ADDQ AX, CX
     140ms      140ms      71263: CMPQ BP, CX
         .          .      71266: JAE 0x71277
      50ms       50ms      71268: MOVQ 0(BX)(CX*8), CX
     770ms      770ms      7126c: CMPQ CX, DI
     240ms      240ms      7126f: SETL AL
     160ms      160ms      71272: MOVL AL, 0x20(SP)
     200ms      200ms      71276: RET
         .          .      71277: CALL runtime.panicindex(SB)
         .          .      7127c: UD2
         .          .      7127e: CALL runtime.morestack_noctxt(SB)
         .          .      71283: JMP index/suffixarray.(*suffixSortable).Less(SB)
         .          .      71285: INT $0x3
         .          .      71286: INT $0x3
         .          .      71287: INT $0x3
         .          .      71288: INT $0x3
         .          .      71289: INT $0x3
         .          .      7128a: INT $0x3
         .          .      7128b: INT $0x3
         .          .      7128c: INT $0x3
         .          .      7128d: INT $0x3
         .          .      7128e: INT $0x3

Are there any Go and/or SSA tweaks we can make to speed up suffixarray? Robert, how much has this been tuned over time?

/cc @crawshaw @randall77 @tzneal @davecheney @ianlancetaylor @griesemer

@bradfitz bradfitz added this to the Unplanned milestone Mar 4, 2016
@griesemer
Contributor

@bradfitz The suffixarray implementation's performance was improved a lot by Eric Eisner a while back and is based on a paper from 1999 (http://www.larsson.dogma.net/ssrev-tr.pdf). I'm no expert in the field, but my understanding is that the new algorithm is quite efficient. I'm not sure we can get a lot out of this.

I was a bit surprised by the change. Suffix arrays are great when the computation is done essentially once (because it tends to be slow) but lookups must be fast (fast binary search over a vast quantity of data). That's not really the case here, is it?

@crawshaw
Member

crawshaw commented Mar 4, 2016

@griesemer happy to discuss the algorithm choice if you like (I picked it because it was a small amount of readable code that was significantly faster than the obvious quadratic algorithm; I'm sure it can be improved on), but that discussion should probably happen somewhere else. Brad's point here is that 50% of the time spent in the benchmarks is in this function:

func (x *suffixSortable) Less(i, j int) bool {
        return x.inv[x.sa[i]+x.h] < x.inv[x.sa[j]+x.h]
}

...which is probably doing more bounds checking than necessary and maybe SSA can help.
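
For reference, the disassembly above has four conditional branches to runtime.panicindex, which line up with the four index expressions on that one line. Annotating the function with my reading of where they come from (not authoritative, just to make the cost visible):

func (x *suffixSortable) Less(i, j int) bool {
        // Four bounds checks the compiler apparently cannot prove away,
        // because i, j, and x arrive through the sort.Interface call and
        // nothing relates them to len(x.sa) or len(x.inv):
        //   x.sa[i]            -> i < len(x.sa)
        //   x.inv[x.sa[i]+x.h] -> x.sa[i]+x.h < len(x.inv)
        //   x.sa[j]            -> j < len(x.sa)
        //   x.inv[x.sa[j]+x.h] -> x.sa[j]+x.h < len(x.inv)
        return x.inv[x.sa[i]+x.h] < x.inv[x.sa[j]+x.h]
}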

@bradfitz
Contributor Author

bradfitz commented Mar 4, 2016

Yeah, I'm sure the suffix array creation algorithm is good. I'm talking about microoptimizations in the generated code or the Go code implementing the algorithm.

Considering how hot that Less function is, any SSA and/or bounds check removal would help a lot.

And independently and elsewhere, feel free to tell David what algorithm to actually use for de-duping the strings. :)

@griesemer
Contributor

This function is called via sort.Interface. Unless that interface call is inlined, opening up the callee (the Less function) to index analysis, I suspect it's going to be tough to eliminate the bounds checks (assuming we refrain from using unsafe ops instead). Happy to be proved otherwise.

@randall77
Contributor

I agree with Robert: I don't see any obvious way to improve the generated code. It is almost all bounds checks, but I see no way of eliminating them.

@OneOfOne
Contributor

OneOfOne commented Mar 4, 2016

@randall77 is there any way to run a pass that converts interfaces to their actual values so calls can be optimized/inlined better?

@randall77
Contributor

I don't see how that would be done. You are trying to infer types through a generic sort.Sort funnel, so without making a copy of sort.Sort for each callsite you're out of luck.
Even if you could solve that somehow, there is still the requirement of associating the length of these various slices with each other so we know the accesses are in bounds. That is also a hard problem.

@OneOfOne
Contributor

OneOfOne commented Mar 4, 2016

Well, there's always unsafe.Pointer. I used that to solve a similar performance issue in one of my projects. I can submit a CL if there's interest, but I know unsafe usage is looked down upon.

@bradfitz
Contributor Author

bradfitz commented Mar 4, 2016

No, we don't want unsafe.Pointer. If @randall77 doesn't see anything bad in the codegen, I guess there's nothing easy to be done.

I'll close this.

@bradfitz bradfitz closed this as completed Mar 4, 2016
@josharian
Contributor

One could precompute x.inv[x.sa[i]+x.h] for all relevant values of i. I mailed CL 20233. I really need to be heads down elsewhere, but if anyone else wants to hijack that CL and see it through, go for it.
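
Roughly, the idea is something like the sketch below (not necessarily what the CL does; the keys field and the precompute method are illustrative names). Less then touches a single slice, and the double indirection through inv moves out of the comparison loop:

type suffixSortable struct {
        sa   []int
        inv  []int
        h    int
        keys []int // precomputed sort keys: keys[k] == inv[sa[k]+h]
}

// precompute fills keys before sort.Sort is called for the current h.
func (x *suffixSortable) precompute() {
        if cap(x.keys) < len(x.sa) {
                x.keys = make([]int, len(x.sa))
        }
        x.keys = x.keys[:len(x.sa)]
        for k, s := range x.sa {
                x.keys[k] = x.inv[s+x.h]
        }
}

func (x *suffixSortable) Len() int           { return len(x.sa) }
func (x *suffixSortable) Less(i, j int) bool { return x.keys[i] < x.keys[j] }
func (x *suffixSortable) Swap(i, j int) {
        // Keep sa and keys in sync so the precomputed keys stay valid.
        x.sa[i], x.sa[j] = x.sa[j], x.sa[i]
        x.keys[i], x.keys[j] = x.keys[j], x.keys[i]
}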

@bradfitz
Contributor Author

bradfitz commented Mar 4, 2016

Resurrecting this bug in light of https://golang.org/cl/20233 which @davecheney loves. :)

@gopherbot

CL https://golang.org/cl/20241 mentions this issue.

@dsnet
Member

dsnet commented Mar 5, 2016

What are people's thoughts on using a different algorithm? It's significantly more work, but has the possibility to run significantly faster.

The current algorithm, qsufsort, runs in O(n log n), but relies on the standard library sort, which incurs some performance penalty.

Possible alternatives:

  • Improve qsufsort by inlining the sort implementation directly into it. My guess is that it will become ~3x faster, but will still be ~40% slower than SAIS.
  • Use SAIS, which runs in O(n), but is ~10% slower than divsufsort.
  • Use divsufsort, which runs in O(n log n), but is generally considered the fastest where n is not ridiculously large.

Although divsufsort seems to be state-of-the-art, it is also fairly complicated. SAIS, however, is a significantly simpler algorithm comparable to qsufsort and may be worth investigating. A port of SAIS that I did was >5x faster than the standard library's version of qsufsort.

Some performance numbers for the C libraries are here: https://github.com/y-256/libdivsufsort/blob/wiki/SACA_Benchmarks.md

P.S. I'm interested in suffix arrays because they are the easiest way of doing the Burrows-Wheeler transform needed for a bzip2 encoder.

@crawshaw
Member

crawshaw commented Mar 5, 2016

@dsnet divsufsort sounds like a good fit for the use in cmd/link. I'm not sure about replacing what's in the suffixarray package with it; Robert is suggesting that moving from n to 2n memory use in cl/20241 is already too expensive for some uses, and divsufsort comes in at 5n.

I have some other work I need to focus on, but if you end up with some code I'd be happy to try it out in the linker for you.

@dsnet
Member

dsnet commented Mar 5, 2016

@crawshaw, if you want to give my port of SAIS a try, I'd be interested in how it affects the linker time. I am aware of a bug in my implementation that causes it to crash on some rare inputs (I will investigate in the future). I should mention that SAIS increases memory by 2n, which is an advantage over divsufsort's 5n.

I have not ported divsufsort yet, but it is something I was planning on doing.

@crawshaw
Member

crawshaw commented Mar 5, 2016

I hacked it in and I see juju link times drop from 6.4s to 5.2s. So as expected, it looks like a better fit than the standard library's suffixarray package.

Up to you what you want to do. We can wait until you port divsufsort and try that (I expect it's a little better for the job; string data seems to cap out at a few megabytes in the binaries I've looked at, so memory use is not a concern compared to the other things the linker does).

Or you could send a CL importing it as cmd/link/internal/sais, and we can hook it up now.

Or, more radically, you could propose that it be added as an option to index/suffixarray. I tend to stay out of stdlib additions, so I'm not going to weigh in on that.

@minux
Member

minux commented Mar 5, 2016 via email

@dgryski
Contributor

dgryski commented Mar 7, 2016

@mwhudson mwhudson changed the title from "index/suffixarray: optimize NewIndex" to "cmd/link: optimize de-duplication" Mar 7, 2016
@mwhudson
Contributor

mwhudson commented Mar 7, 2016

Re-opening and re-titling the issue. Hope I'm not treading on anyone's toes!

@mwhudson mwhudson reopened this Mar 7, 2016
@rsc
Contributor

rsc commented Mar 7, 2016

As I commented on https://go-review.googlesource.com/#/c/19987, I don't believe suffix array code (whether index/suffixarray or the more complex variants being suggested) has any place in a linker. Let's find a way to get the benefit without such heavy machinery. The cost here is enormous for very little benefit: 50% longer link time to reduce output size by 1% is not the tradeoff we want to encourage.

@crawshaw
Member

crawshaw commented Mar 7, 2016

While improved performance benefits me personally (I sit around waiting for make.bash all day), the uses I'm trying to enable are blocked by binary size, not link time. So if someone else wants to jump on this, please do.

@dsnet
Member

dsnet commented Mar 7, 2016

I don't know much about linker code, but the problem in CL 19987 seems to be string deduplication, which is really just data compression. Suffix arrays are certainly fairly heavyweight and known to be fairly slow (probably a reason for bzip2's decline in recent years).

Perhaps it would be better to try to solve the string-deduplication problem with a different approach? I was thinking an LZ77-inspired approach might work. It probably won't catch as many deduplication opportunities, but it will probably be significantly faster. I don't have the bandwidth to take this on right now, but it's something I can look into later.

@OneOfOne
Contributor

OneOfOne commented Mar 7, 2016

If allocation count doesn't matter, using a naive parallel sort based on the stdlib's quicksort improves NewIndexRandom by 12% and NewIndexRepeat by almost 18%.

If there is interest, I can try to optimize it and submit a CL.

@crawshaw
Member

crawshaw commented Mar 7, 2016

@dsnet finding substrings has the advantage that there's no runtime decoding, just link-time string header rewriting.

I experimented a couple of weeks ago with dumping the string data from some large binaries and running LZ77-like compression on it. It was far more successful at compression than this deduplication, but would require more runtime memory and CPU.
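
For the substring case, the rewriting idea is roughly the toy sketch below (not the linker's code; the real CL builds a suffix array instead of this naive O(n^2) containment scan). Each string that turns out to be contained in an already-kept string just gets a (parent, offset) pair, i.e. its string header is rewritten to point into the parent's data:

package main

import (
        "fmt"
        "sort"
        "strings"
)

type ref struct {
        parent string // kept string whose data we reuse
        off    int    // byte offset of this string inside parent
}

func dedup(strs []string) map[string]ref {
        // Longest strings first, so containers are kept before their substrings.
        sorted := append([]string(nil), strs...)
        sort.Slice(sorted, func(i, j int) bool { return len(sorted[i]) > len(sorted[j]) })

        var kept []string
        out := map[string]ref{}
        for _, s := range sorted {
                found := false
                for _, k := range kept {
                        if off := strings.Index(k, s); off >= 0 {
                                out[s] = ref{parent: k, off: off}
                                found = true
                                break
                        }
                }
                if !found {
                        kept = append(kept, s)
                        out[s] = ref{parent: s, off: 0}
                }
        }
        return out
}

func main() {
        // "json" and "encoding" both end up pointing into the data of "encoding/json".
        fmt.Println(dedup([]string{"encoding/json", "json", "encoding"}))
}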

@dsnet
Member

dsnet commented Mar 7, 2016

The advantage of LZ77 is that it's pretty tunable. Which LZ77 algorithm were you running on it? Would it be worth just compressing the binaries with snappy? If that means embedding a snappy decoder into the binary, it's not that bad, since a snappy decoder is fairly simple.

@nigeltao

@rsc
Contributor

rsc commented Mar 8, 2016

@crawshaw, I understand the constraints for your project, but you don't need 1%. You need many%. We're certainly not going to pay a 2500% cost in link time for a 50% binary size reduction, so I don't think we should pay a 50% cost in link time for a 1% binary size reduction. I remain happy to work with you to find other ways to reduce space, but there are many competing interests in this project, and we can't overfit to the one goal of binary size reduction at significant cost to the other goals of compile and link time.

Put another way, suppose you'd come to us and said "I can make the linker 33% faster at the cost of making binaries 1% larger." That would be an obvious win. Basically everyone would be cheering. The CL in question does the opposite.

Suffix arrays in particular have significant time cost, significant memory cost, and significant code complexity cost. The suggestions being made on this issue are that we pay an even greater code complexity cost to reduce the time and hopefully memory cost (though no one seems to be even talking about that). This seems to me excessive for 1% binary size. I am sure there are ways to get that 1% and more, without the triple whammy of costs incurred by suffix arrays. So let's work on other ways to get that 1% instead. But not by resorting to other complex things, such as LZ77.

This is the classic compiler writer's trap, which you see over and over again:

  1. Do something terribly inefficient.
  2. Do something terribly complex to clean up the inefficiency.

As an experiment, the suffix array is great: it shows that the Go toolchain is doing something terribly inefficient. Instead of debating various options for step 2, we should be focusing on not doing step 1 in the first place.

Let's use juju as an example like the CL does: what are the actual duplicated strings? Where do they come from? From the CL it sounds like the duplicated strings are the string field for the reflect.Type struct implementations.

How much smaller does the binary get if you just don't put any string there at all? Forget the fact that the binary won't run; just how much is saved by not writing those strings? If it's not significant, then what else is also contributing to the win for deduplication? Characterizing what is being deduplicated is the first step.

If the deduplication is winning primarily due to the reflect.Type strings, that says to me that we need to do a better job of not generating them in the first place. They are there for debugging, and they are undeniably convenient, but the vast majority of them are never looked at. A precise characterization of the storage cost of that convenience might make clear that the cost is too high and that we should deal with the complexity of generating those strings at run-time as needed.

@rsc
Contributor

rsc commented Mar 8, 2016

I looked into this. While removing all the type strings would in fact make mergestrings unnecessary, there are simpler ways to make it mostly unnecessary. In my tests I am using the github.com/davecheney/benchjuju tree and I'm leaving DWARF in the binary (because I was lazy). If you want numbers without DWARF, multiply the space percentages by 1.5X.

I measure the benefit of calling mergestrings as 2.5% in jujud.
However, after changing go.string symbols to have default alignment 1 byte (not maxalign for most strings) and changing the reflect type information for non-pointer "T" to record its name as "*T"[1:], forcing the sharing of the names for "T" and "*T", the benefit of calling mergestrings drops to 0.6%. At that reduced effectiveness, it seems even more clearly not worth a 50% time overhead. I prepared CL 20334 making these two optimizations and deleting mergestrings. That will close this issue.
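
The "*T"[1:] trick is plain string slicing: the binary stores the bytes for "*T" once, and the name of T becomes a view into the same data. A tiny illustration (not the actual compiler/linker code):

package main

import "fmt"

func main() {
        ptrName := "*bytes.Buffer" // one copy of the name data in the binary
        name := ptrName[1:]        // "bytes.Buffer": shares ptrName's bytes, no extra storage

        fmt.Println(ptrName, name)
}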

However, I didn't want to make jujud even 0.6% larger, because I do understand that it's the wrong direction. So I also prepared CL 20335 to change functype to merge the in and out slices, saving one slice header per type. That saved 1.35%, so after both CLs, instead of jujud being +0.6% in size it is -0.75% in size. I know @crawshaw has a more aggressive functype CL pending, but this one was a trivial first step and preserves downward binary size monotonicity.
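
The functype saving comes from replacing two slice headers with one slice plus two small counts. A rough sketch of the layout idea (illustrative field names, not the real runtime structs):

type rtype struct{} // stand-in for the real runtime type descriptor

// Before: two slice headers, three words each on 64-bit.
type funcTypeBefore struct {
        in  []*rtype
        out []*rtype
}

// After: the in parameters followed by the out parameters in one slice,
// plus two counts, saving one slice header per function type.
type funcTypeAfter struct {
        inCount  uint16
        outCount uint16
        params   []*rtype // params[:inCount] are inputs, params[inCount:] are outputs
}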

@gopherbot

CL https://golang.org/cl/20334 mentions this issue.

@gopherbot

CL https://golang.org/cl/20335 mentions this issue.

@golang golang locked and limited conversation to collaborators Mar 13, 2017