Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crypto/md5: remove assembly cores #49248

Open
FiloSottile opened this issue Oct 31, 2021 · 3 comments
Open

crypto/md5: remove assembly cores #49248

FiloSottile opened this issue Oct 31, 2021 · 3 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Milestone

Comments

@FiloSottile
Copy link
Contributor

crypto/md5 has a series of assembly cores, all looking pretty much the same and very similar to the generic code.

The generic code looks like something that should intrinsify well, and the assembly doesn't use any special platform instructions as far as I can tell.

https://github.com/golang/go/blob/2ebe77a2fda1ee9ff6fd9a3e08933ad1ebaea039/src/crypto/md5/md5block.go

MD5 is broken and obsolete, so ideally we'd reduce the complexity budget we spend on it over time.

If we can get the generic code within 80% of the performance of the assembly, I'd be happy to remove the cores.

@FiloSottile FiloSottile added Performance NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Oct 31, 2021
@andig
Copy link
Contributor

andig commented Nov 2, 2021

Here's a data point for the M1. You get a ~30% decrease, go binary built with 1.17.2.

optimized:

❯ ~/htdocs/go/bin/go test -bench=.
goos: darwin
goarch: arm64
pkg: example/md5
BenchmarkHash8Bytes-8            	10756149	       110.7 ns/op	  72.25 MB/s
BenchmarkHash64-8                	 6370714	       191.2 ns/op	 334.65 MB/s
BenchmarkHash128-8               	 4167610	       288.5 ns/op	 443.69 MB/s
BenchmarkHash256-8               	 2517034	       475.5 ns/op	 538.41 MB/s
BenchmarkHash512-8               	 1397494	       870.0 ns/op	 588.50 MB/s
BenchmarkHash1K-8                	  720913	      1638 ns/op	 625.02 MB/s
BenchmarkHash8K-8                	   96699	     12347 ns/op	 663.47 MB/s
BenchmarkHash1M-8                	     760	   1578441 ns/op	 664.31 MB/s
BenchmarkHash8M-8                	      86	  12689257 ns/op	 661.08 MB/s
BenchmarkHash8BytesUnaligned-8   	10780814	       111.5 ns/op	  71.76 MB/s
BenchmarkHash1KUnaligned-8       	  719868	      1635 ns/op	 626.49 MB/s
BenchmarkHash8KUnaligned-8       	   96816	     12342 ns/op	 663.75 MB/s
PASS
ok  	example/md5	17.895s

generic:

❯ ~/htdocs/go/bin/go test -bench=.
goos: darwin
goarch: arm64
pkg: example/md5
BenchmarkHash8Bytes-8            	 6348889	       157.7 ns/op	  50.74 MB/s
BenchmarkHash64-8                	 4332680	       276.5 ns/op	 231.47 MB/s
BenchmarkHash128-8               	 2835121	       422.6 ns/op	 302.88 MB/s
BenchmarkHash256-8               	 1677094	       714.6 ns/op	 358.24 MB/s
BenchmarkHash512-8               	  897469	      1300 ns/op	 393.85 MB/s
BenchmarkHash1K-8                	  476378	      2470 ns/op	 414.65 MB/s
BenchmarkHash8K-8                	   63402	     18837 ns/op	 434.89 MB/s
BenchmarkHash1M-8                	     499	   2394526 ns/op	 437.91 MB/s
BenchmarkHash8M-8                	      57	  19171187 ns/op	 437.56 MB/s
BenchmarkHash8BytesUnaligned-8   	 7573111	       158.0 ns/op	  50.65 MB/s
BenchmarkHash1KUnaligned-8       	  476732	      2478 ns/op	 413.31 MB/s
BenchmarkHash8KUnaligned-8       	   63505	     18832 ns/op	 434.99 MB/s
PASS
ok  	example/md5	16.650s

@cespare
Copy link
Contributor

cespare commented Nov 2, 2021

It's not currently within 80% on this machine either.

goos: linux
goarch: amd64
pkg: std/crypto/md5
cpu: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz

name                    old speed      new speed      delta
Hash8Bytes-12           84.7MB/s ± 1%  61.7MB/s ± 1%  -27.17%  (p=0.008 n=5+5)
Hash64-12                402MB/s ± 1%   280MB/s ± 2%  -30.32%  (p=0.008 n=5+5)
Hash128-12               562MB/s ± 1%   386MB/s ± 1%  -31.31%  (p=0.008 n=5+5)
Hash256-12               699MB/s ± 1%   471MB/s ± 1%  -32.62%  (p=0.008 n=5+5)
Hash512-12               790MB/s ± 1%   531MB/s ± 1%  -32.75%  (p=0.008 n=5+5)
Hash1K-12                848MB/s ± 1%   556MB/s ± 4%  -34.38%  (p=0.008 n=5+5)
Hash8K-12                908MB/s ± 1%   602MB/s ± 0%  -33.70%  (p=0.008 n=5+5)
Hash1M-12                923MB/s ± 0%   607MB/s ± 1%  -34.28%  (p=0.008 n=5+5)
Hash8M-12                917MB/s ± 1%   607MB/s ± 1%  -33.78%  (p=0.008 n=5+5)
Hash8BytesUnaligned-12  85.0MB/s ± 1%  62.3MB/s ± 1%  -26.78%  (p=0.008 n=5+5)
Hash1KUnaligned-12       845MB/s ± 2%   563MB/s ± 1%  -33.38%  (p=0.008 n=5+5)
Hash8KUnaligned-12       912MB/s ± 0%   592MB/s ± 2%  -35.17%  (p=0.008 n=5+5)

@josharian
Copy link
Contributor

I poked at this a bit as a palate cleanser after working on package net/syscall. There are several things going wrong here.

The most important by is that the compiler takes a+b+c+d and generates a+(b+(c+d)). The hand-written assembly does (a+b)+(c+d). The assembly version is amenable to superscalar processing; the compiler's code is not. This will require compiler improvements. Filed #49331.

Also, the register allocator puts too many things in registers in the outer loop, so by the time it gets to the heavy numeric lifting, it doesn't have enough to work with. This is fixable by massaging the code a bit in some (admittedly weird) ways. See https://gist.github.com/josharian/42e66bf022f32da3da0a9b1bdf0a974b. Without filing the other issues mentioned here, this rewrite itself doesn't improve performance.

Also, the CSE pass combines a bunch of the xNN memory loads. These then require registers, and the register allocator decides to spill them rather than reload from memory. Filed #49332.

I'm done investigating this for now. The next step is to fix #49331, or find some clever way to write the code to force the compiler to generate more balanced computation trees.

@seankhliao seankhliao added this to the Unplanned milestone Aug 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance
Projects
None yet
Development

No branches or pull requests

5 participants