math/big: mulAddVWW too slow #22643

Open · TuomLarsen opened this issue Nov 9, 2017 · 17 comments

Labels: NeedsInvestigation, Performance
Milestone: Unplanned

@TuomLarsen commented Nov 9, 2017

What version of Go are you using?

go1.9.2 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using?

darwin/amd64

What did you do?

The following Go program repeatedly multiplies a big.Int by a single-Word constant:

package main

import (
	"fmt"
	"math/big"
)

func main() {
	a := big.NewInt(1)
	b := big.NewInt(1)
	a.Lsh(a, 8096)
	for i := uint64(1); i <= 10000000; i++ {
		a.Mul(a, b)
	}
	fmt.Println(a.BitLen())
}

What did you expect to see?

The following C program uses GMP 6.1.2:

#include <stdio.h>
#include <stdint.h>
#include "gmp.h"
 
int main(int argc, char *argv[]) {
    mpz_t a, b;
    mpz_init_set_ui(a, 1);
    mpz_init_set_ui(b, 1);
    mpz_mul_2exp(a, a, 8096);
    for (uint64_t i = 1; i <= 10000000ULL; i++) {
        mpz_mul(a, a, b);
    }
    printf("%lu\n", mpz_sizeinbase(a, 2));
    return 0;
}

and it terminates in just under 1 s.

What did you see instead?

The Go program completes on my machine in 1.87 s, i.e. almost twice as slow as the C version.

mulAddVWW is a building block for multiplication (by one word or more), so better performance here would benefit many applications.
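
For reference, the contract of this primitive is z = x*y + r over little-endian word slices, returning the final carry. A rough pure-Go sketch of that contract (illustrative only, written with math/bits; the real amd64 hot path is the hand-written assembly in arith_amd64.s):

package main

import (
	"fmt"
	"math/bits"
)

// mulAddVWW computes z = x*y + r over little-endian word slices and
// returns the final carry word. Sketch of the contract only, not the
// actual math/big implementation.
func mulAddVWW(z, x []uint, y, r uint) (c uint) {
	c = r
	for i := range z {
		hi, lo := bits.Mul(x[i], y)  // two-word product of x[i] and y
		lo, cc := bits.Add(lo, c, 0) // add the incoming carry
		z[i], c = lo, hi+cc          // outgoing carry feeds the next word
	}
	return c
}

func main() {
	x := []uint{^uint(0), ^uint(0)} // 2^128 - 1 on a 64-bit platform
	z := make([]uint, len(x))
	fmt.Println(mulAddVWW(z, x, 10, 7), z)
}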

@vielmetti

See also https://go-review.googlesource.com/c/go/+/76270 (for arm64) where it is reported:

The lack of proper addMulVVW implementation for arm64 hurts RSA
performance

This is an optimized implementation, it improves RSA2048 performance
by 10X to 15X on ARMv8 based server processors

@griesemer (Contributor)

Duplicate of #9245.

@randall77 (Contributor)

I'm not seeing the performance difference you're seeing. With your two example programs:

$ go build issue22643.go
$ gcc -O3 -lgmp issue22643.c 
$ time ./a.out 
8097

real	0m1.063s
user	0m1.059s
sys	0m0.002s
$ time ./issue22643
8097

real	0m1.106s
user	0m1.102s
sys	0m0.004s

That's only a few percent slower.
This is with Go 1.9 and GMP 6.1.2, on darwin/amd64, with a "gcc" that is Apple LLVM version 9.0.0 (clang-900.0.38), on an Intel Xeon E5.

@ALTree (Member) commented Nov 9, 2017

To add another data point: on linux/amd64 on a Haswell i7, with Go 1.9, GMP 6.1, and GCC 7.2, I get, over 5 runs:

Go:
real	0m1.243s
real	0m1.244s
real	0m1.268s
real	0m1.261s
real	0m1.248s

C+gmp:
real	0m1.151s
real	0m1.137s
real	0m1.191s
real	0m1.141s
real	0m1.141s

So Go is within ~8% of GMP.

@vielmetti

Results for arm64, specifically on a HiSilicon Hi1616 (Packet "Type 2A2"):

root@2a2-neon:~/src/golang# time ./a.out
8097

real    0m3.930s
user    0m3.926s
sys     0m0.004s
root@2a2-neon:~/src/golang# go version
go version go1.9.2 linux/arm64
root@2a2-neon:~/src/golang# go build issue22643.go 
root@2a2-neon:~/src/golang# time ./issue22643
8097

real    0m4.150s
user    0m4.145s
sys     0m0.009s

Again, the Go version is within a few percent of the C results.

@TuomLarsen (Author) commented Nov 9, 2017

That's really strange; I triple-checked but am still getting the same 1.8x difference.
CPU: Intel Core M, 1.1 GHz, 256 KB L2 cache (per core).

$ cat issue22643.go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	a := big.NewInt(1)
	b := big.NewInt(1)
	a.Lsh(a, 8096)
	for i := uint64(1); i <= 10000000; i++ {
		a.Mul(a, b)
	}
	fmt.Println(a.BitLen())
}
$ go build issue22643.go
$ time ./issue22643
8097

real	0m1.814s
user	0m1.791s
sys	0m0.013s

And:

$ cat issue22643.c 
#include <stdio.h>
#include <stdint.h>
#include "gmp.h"
 
int main(int argc, char *argv[]) {
    mpz_t a, b;
    mpz_init_set_ui(a, 1);
    mpz_init_set_ui(b, 1);
    mpz_mul_2exp(a, a, 8096);
    for (uint64_t i = 1; i <= 10000000ULL; i++) {
        mpz_mul(a, a, b);
    }
    printf("%lu\n", mpz_sizeinbase(a, 2));
    return 0;
}
$ gcc -O3 -o issue22643 issue22643.c -lgmp -lm
$ time ./issue22643
8097

real	0m1.082s
user	0m1.067s
sys	0m0.007s

By the way, I installed Go from the official downloads (go1.9.2.darwin-amd64.pkg):

$ shasum darwin_amd64/math/big.a 
9ad030b2c3f78119d15c85d2241a77e680612b10  big.a
$ shasum bin/go
1976f6cd942b2c4b903ee7523dd1f889a161188e  go

@randall77 (Contributor)

Can you profile both programs (Go and C) and see where the time goes? We can then compare your profiles to profiles on our machines.
I suspect a difference between chips.

@TuomLarsen (Author)

I used pkg/profile; is that enough?

package main

import (
    "fmt"
    "github.com/pkg/profile"
    "math/big"
)

func main() {
    defer profile.Start().Stop()
    a := big.NewInt(1)
    b := big.NewInt(1)
    a.Lsh(a, 8096)
    for i := uint64(1); i <= 10000000; i++ {
        a.Mul(a, b)
    }
    fmt.Println(a.BitLen())
}

And pprof top:

$ go tool pprof [...]/cpu.pprof
Duration: 1.92s, Total samples = 1.71s (89.28%)
(pprof) top
Showing nodes accounting for 1.71s, 100% of 1.71s total
Showing top 10 nodes out of 14
      flat  flat%   sum%        cum   cum%
     1.42s 83.04% 83.04%      1.42s 83.04%  math/big.mulAddVWW /usr/local/go/src/math/big/arith_amd64.s
     0.10s  5.85% 88.89%      1.57s 91.81%  math/big.nat.mulAddWW /usr/local/go/src/math/big/nat.go
     0.06s  3.51% 92.40%      1.69s 98.83%  math/big.(*Int).Mul /usr/local/go/src/math/big/int.go
     0.06s  3.51% 95.91%      1.63s 95.32%  math/big.nat.mul /usr/local/go/src/math/big/nat.go
     0.05s  2.92% 98.83%      0.05s  2.92%  math/big.nat.norm /usr/local/go/src/math/big/nat.go
     0.01s  0.58% 99.42%      0.01s  0.58%  runtime.duffcopy /usr/local/go/src/runtime/duff_amd64.s
     0.01s  0.58%   100%      0.01s  0.58%  runtime.usleep /usr/local/go/src/runtime/sys_darwin_amd64.s
         0     0%   100%      1.69s 98.83%  main.main /Users/tuom/go/issue22643.go
         0     0%   100%      1.69s 98.83%  runtime.main /usr/local/go/src/runtime/proc.go
         0     0%   100%      0.01s  0.58%  runtime.morestack /usr/local/go/src/runtime/asm_amd64.s

@randall77 (Contributor)

@TuomLarsen I need to see a profile of the C+GMP program as well. Without it, it is hard to tell whether Go is slow on your system or GMP is fast.

Using your profiled program, my machine spends 97+% of its time in mulAddVWW, whereas yours spends only 83% there. I'm not sure whether that has anything to do with the issue at hand, but it seems odd.

@TuomLarsen (Author)

From Instruments:

              Weight       Self Weight      Symbol Name
  1.26 s      99.8%          0 s            start
  1.25 s      99.1%         14.00 ms        main
  1.11 s      88.2%          1.11 s         __gmpn_mul_1
123.00 ms      9.7%        123.00 ms        __gmpz_mul
  7.00 ms      0.5%          7.00 ms        DYLD-STUB$$__gmpz_mul
  2.00 ms      0.1%          0 s            <Unknown Address>
  2.00 ms      0.1%          2.00 ms        __gmpz_mul

@randall77 (Contributor)

Thanks for that data. Unfortunately, no smoking gun.
I don't know what is going on with your chip that makes it so different from ours.

For the record, here is the inner loop in GMP:

   18bf0:       4e 89 14 df             mov    %r10,(%rdi,%r11,8)
   18bf4:       49 01 c1                add    %rax,%r9
   18bf7:       4a 8b 04 de             mov    (%rsi,%r11,8),%rax
   18bfb:       49 11 d0                adc    %rdx,%r8
   18bfe:       41 ba 00 00 00 00       mov    $0x0,%r10d
   18c04:       48 f7 e1                mul    %rcx
   18c07:       4e 89 4c df 08          mov    %r9,0x8(%rdi,%r11,8)
   18c0c:       49 01 c0                add    %rax,%r8
   18c0f:       48 11 d3                adc    %rdx,%rbx
   18c12:       4a 8b 44 de 08          mov    0x8(%rsi,%r11,8),%rax
   18c17:       48 f7 e1                mul    %rcx
   18c1a:       4e 89 44 df 10          mov    %r8,0x10(%rdi,%r11,8)
   18c1f:       48 01 c3                add    %rax,%rbx
   18c22:       49 11 d2                adc    %rdx,%r10
   18c25:       4a 8b 44 de 10          mov    0x10(%rsi,%r11,8),%rax
   18c2a:       48 f7 e1                mul    %rcx
   18c2d:       4a 89 5c df 18          mov    %rbx,0x18(%rdi,%r11,8)
   18c32:       41 b8 00 00 00 00       mov    $0x0,%r8d
   18c38:       4c 89 c3                mov    %r8,%rbx
   18c3b:       49 01 c2                add    %rax,%r10
   18c3e:       4a 8b 44 de 18          mov    0x18(%rsi,%r11,8),%rax
   18c43:       4d 89 c1                mov    %r8,%r9
   18c46:       49 11 d1                adc    %rdx,%r9
   18c49:       48 f7 e1                mul    %rcx
   18c4c:       49 83 c3 04             add    $0x4,%r11
   18c50:       78 9e                   js     18bf0 <___gmpn_mul_1+0xa7>

compared with the inner loop in math/big:

U5:	// i+4 <= n
	// regular loop body unrolled 4x
	MOVQ (0*8)(R8)(BX*8), AX
	MULQ R9
	ADDQ CX, AX
	ADCQ $0, DX
	MOVQ AX, (0*8)(R10)(BX*8)
	MOVQ DX, CX
	MOVQ (1*8)(R8)(BX*8), AX
	MULQ R9
	ADDQ CX, AX
	ADCQ $0, DX
	MOVQ AX, (1*8)(R10)(BX*8)
	MOVQ DX, CX
	MOVQ (2*8)(R8)(BX*8), AX
	MULQ R9
	ADDQ CX, AX
	ADCQ $0, DX
	MOVQ AX, (2*8)(R10)(BX*8)
	MOVQ DX, CX
	MOVQ (3*8)(R8)(BX*8), AX
	MULQ R9
	ADDQ CX, AX
	ADCQ $0, DX
	MOVQ AX, (3*8)(R10)(BX*8)
	MOVQ DX, CX
	ADDQ $4, BX		// i += 4
	
	LEAQ 4(BX), DX
	CMPQ DX, R11
	JLE U5

They look remarkably similar. The GMP library goes through some hoops to do mov $0, x; adc y, x instead of adc $0, x; that is possibly relevant, but I can't see why.

@TuomLarsen (Author)

@randall77 Thanks for looking into it! I'm certainly no expert, but if that U5 loop is indeed the bottleneck, maybe GMP is trying to avoid a data dependency?
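
To illustrate what I mean (a rough sketch in Go, not the actual assembly): the two-word MULs of different words are independent, but the carry chains each iteration's add to the previous one, just like MOVQ DX, CX feeds the next ADDQ CX, AX in the U5 loop.

package main

import (
	"fmt"
	"math/bits"
)

func main() {
	x := []uint{3, 5, 7}
	z := make([]uint, len(x))
	y, c := uint(9), uint(0)
	for i := range x {
		hi, lo := bits.Mul(x[i], y)  // independent across iterations
		lo, cc := bits.Add(lo, c, 0) // must wait for the previous carry
		z[i] = lo
		c = hi + cc // consumed by iteration i+1: a serial chain
	}
	fmt.Println(z, c) // [27 45 63] 0
}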

By the way, should this issue be reopened, as it seems it is not about the general improvements tracked by #9245?

@griesemer (Contributor)

Reopened, as this is a more specific case of #9245.

@griesemer reopened this Nov 11, 2017
@gopherbot

Change https://golang.org/cl/77371 mentions this issue: math/big: improve mulAddVWW inner loop

@randall77 (Contributor)

@TuomLarsen: could you try patching in CL 77371 and see if that improves performance on your machine?

@TuomLarsen (Author)

@randall77 Unfortunately no, it does not seem to improve things. I created a stub library with both the new and the old mulAddVWW and replaced a.Mul(a, b) above with the corresponding call; both versions took around 1.58 s to finish.

@randall77 (Contributor)

Then I'm out of ideas. Without a copy of your machine, it's hard to debug.
You might try copying the exact inner loop from GMP into math/big and seeing what happens.
Then mutate it one or a few instructions at a time until the performance degrades.
That might give us a reasonable starting point.
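
If someone tries that, a Go benchmark of this single-word path (a hypothetical issue22643_test.go, mirroring the operand size from the programs above; run with go test -bench=.) would make the comparison less noisy than shell timing:

package main

import (
	"math/big"
	"testing"
)

func BenchmarkMulOneWord(b *testing.B) {
	x := new(big.Int).Lsh(big.NewInt(1), 8096) // same operand size as above
	y := big.NewInt(1)                         // single-word multiplier
	z := new(big.Int)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		z.Mul(x, y) // exercises mulAddVWW via nat.mulAddWW, per the profile above
	}
}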

@ianlancetaylor added the Performance and NeedsInvestigation labels Mar 29, 2018
@ianlancetaylor added this to the Unplanned milestone Mar 29, 2018