|
|
Created:
12 years ago by minux1 Modified:
12 years ago Reviewers:
CC:
dave_cheney.net, remyoudompheng, mtj1, rsc, golang-dev Visibility:
Public. |
Descriptionmath: ARM assembly implementation for Abs
Obtained on 700MHz OMAP4460:
benchmark old ns/op new ns/op delta
BenchmarkAbs 61 23 -61.63%
Patch Set 1 #Patch Set 2 : diff -r 23c94e7b3fd6 https://code.google.com/p/go/ #Patch Set 3 : diff -r 23c94e7b3fd6 https://code.google.com/p/go/ #
Total comments: 1
Patch Set 4 : diff -r 69a5c6d983b4 https://code.google.com/p/go/ #Patch Set 5 : diff -r 69a5c6d983b4 https://code.google.com/p/go/ #MessagesTotal messages: 13
Hello golang-dev@googlegroups.com (cc: golang-dev@googlegroups.com), I'd like you to review this change to https://code.google.com/p/go/
Sign in to reply to this message.
I can't speak to the accuracy of this change, but the results are equally impressive for those of us without and FPU. stora(~/go/src/pkg/math) % ~/go/misc/benchcmp {old,new}.txt benchmark old ns/op new ns/op delta BenchmarkAbs 29069 5556 -80.89% On Sat, Apr 21, 2012 at 9:13 PM, <minux.ma@gmail.com> wrote: > Reviewers: golang-dev_googlegroups.com, > > Message: > Hello golang-dev@googlegroups.com (cc: golang-dev@googlegroups.com), > > I'd like you to review this change to > https://code.google.com/p/go/ > > > Description: > math: ARM assembly implementation for Abs > > Obtained on 700MHz OMAP4460: > benchmark old ns/op new ns/op delta > BenchmarkAbs 61 24 -60.39% > > > Please review this at http://codereview.appspot.com/6094047/ > > Affected files: > M src/pkg/math/abs_arm.s > > > Index: src/pkg/math/abs_arm.s > =================================================================== > --- a/src/pkg/math/abs_arm.s > +++ b/src/pkg/math/abs_arm.s > @@ -2,5 +2,12 @@ > // Use of this source code is governed by a BSD-style > // license that can be found in the LICENSE file. > > +// func Abs(x float64) float64 > TEXT ·Abs(SB),7,$0 > - B ·abs(SB) > + MOVW 0(FP), R0 > + MOVW R0, 8(FP) > + MOVW $((1<<31)-1), R2 > + MOVW 4(FP), R1 > + AND R2, R1 > + MOVW R1, 12(FP) > + RET > >
Sign in to reply to this message.
I can squeeze one more ns out if we use ldrd and strd. TEXT ·Abs(SB),7,$0 WORD $0xe1cd00d4 // ldrd r0, [sp, #4] MOVW $((1<<31)-1), R2 AND R2, R1 WORD $0xe1cd00fc // strd r0, [sp, #12] RET benchmark old ns/op new ns/op delta BenchmarkAbs 24 23 -4.53% benchmark old ns/op new ns/op delta BenchmarkAbs 61 23 -62.34% But, to make it acceptable, I must first add LDRD and STRD to 5a and 5l. @dfc, could you please benchmark this new version on ARMv5 hardware?
Sign in to reply to this message.
How does this compare to the following Go code? func abs(x float64) (y float64) { *(*uint64)(unsafe.Pointer(&y)) = *(*uint64)(unsafe.Pointer(&x)) &^ uint64(1<<63) return }
Sign in to reply to this message.
On Sun, Apr 22, 2012 at 1:28 AM, <remyoudompheng@gmail.com> wrote: > How does this compare to the following Go code? > > func abs(x float64) (y float64) { > *(*uint64)(unsafe.Pointer(&y)) = *(*uint64)(unsafe.Pointer(&x)) &^ > uint64(1<<63) > return > } > 5g generates this for this function: 0000 (flt.go:3) TEXT abs+0(SB),$0-16 0001 (flt.go:3) MOVD $(0.00000000000000000e+00),F0 0002 (flt.go:3) MOVD F0,y+8(FP) 0003 (flt.go:5) MOVW $y+8(FP),R5 0004 (flt.go:5) MOVW $x+0(FP),R0 0005 (flt.go:5) MOVW 0(R0),R4 0006 (flt.go:5) MOVW 4(R0),R1 0007 (flt.go:5) MOVW $-1,R2 0008 (flt.go:5) AND R2,R4,R0 0009 (flt.go:5) MOVW $2147483647,R2 0010 (flt.go:5) AND R2,R1,R3 0011 (flt.go:5) MOVW R0,R1 0012 (flt.go:5) MOVW R5,R0 0013 (flt.go:5) MOVW R3,R2 0014 (flt.go:5) MOVW R1,0(R5) 0015 (flt.go:5) MOVW R3,4(R5) 0016 (flt.go:6) RET , it will be much slower than the assembly version, and what's worse, it unnecessarily load and store 0.0 into y, which will trigger a much more expensive soft float emulation on systems without FPU.
Sign in to reply to this message.
Which is the key to all of this tuning -- it is "non floating point" operations on "floating point" values -- and that's why you can't so easily get to the optimum in any language that has opinions about the meaning of storage. This is a job for assembler or BCPL or PL/360 or ... On Sat, Apr 21, 2012 at 7:45 PM, minux <minux.ma@gmail.com> wrote: > > On Sun, Apr 22, 2012 at 1:28 AM, <remyoudompheng@gmail.com> wrote: > >> How does this compare to the following Go code? >> >> func abs(x float64) (y float64) { >> *(*uint64)(unsafe.Pointer(&y)) = *(*uint64)(unsafe.Pointer(&x)) &^ >> uint64(1<<63) >> return >> } >> > 5g generates this for this function: > 0000 (flt.go:3) TEXT abs+0(SB),$0-16 > 0001 (flt.go:3) MOVD $(0.00000000000000000e+00),F0 > 0002 (flt.go:3) MOVD F0,y+8(FP) > 0003 (flt.go:5) MOVW $y+8(FP),R5 > 0004 (flt.go:5) MOVW $x+0(FP),R0 > 0005 (flt.go:5) MOVW 0(R0),R4 > 0006 (flt.go:5) MOVW 4(R0),R1 > 0007 (flt.go:5) MOVW $-1,R2 > 0008 (flt.go:5) AND R2,R4,R0 > 0009 (flt.go:5) MOVW $2147483647,R2 > 0010 (flt.go:5) AND R2,R1,R3 > 0011 (flt.go:5) MOVW R0,R1 > 0012 (flt.go:5) MOVW R5,R0 > 0013 (flt.go:5) MOVW R3,R2 > 0014 (flt.go:5) MOVW R1,0(R5) > 0015 (flt.go:5) MOVW R3,4(R5) > 0016 (flt.go:6) RET , > > it will be much slower than the assembly version, and what's worse, it > unnecessarily > load and store 0.0 into y, which will trigger a much more expensive soft > float emulation > on systems without FPU. > -- Michael T. Jones | Chief Technology Advocate | mtj@google.com | +1 650-335-5765
Sign in to reply to this message.
On Sun, Apr 22, 2012 at 1:45 AM, minux <minux.ma@gmail.com> wrote: > > On Sun, Apr 22, 2012 at 1:28 AM, <remyoudompheng@gmail.com> wrote: > >> How does this compare to the following Go code? >> >> func abs(x float64) (y float64) { >> *(*uint64)(unsafe.Pointer(&y)) = *(*uint64)(unsafe.Pointer(&x)) &^ >> uint64(1<<63) >> return >> } >> > 5g generates this for this function: > 0000 (flt.go:3) TEXT abs+0(SB),$0-16 > 0001 (flt.go:3) MOVD $(0.00000000000000000e+00),F0 > 0002 (flt.go:3) MOVD F0,y+8(FP) > 0003 (flt.go:5) MOVW $y+8(FP),R5 > 0004 (flt.go:5) MOVW $x+0(FP),R0 > 0005 (flt.go:5) MOVW 0(R0),R4 > 0006 (flt.go:5) MOVW 4(R0),R1 > 0007 (flt.go:5) MOVW $-1,R2 > 0008 (flt.go:5) AND R2,R4,R0 > 0009 (flt.go:5) MOVW $2147483647,R2 > 0010 (flt.go:5) AND R2,R1,R3 > 0011 (flt.go:5) MOVW R0,R1 > 0012 (flt.go:5) MOVW R5,R0 > 0013 (flt.go:5) MOVW R3,R2 > 0014 (flt.go:5) MOVW R1,0(R5) > 0015 (flt.go:5) MOVW R3,4(R5) > 0016 (flt.go:6) RET , > > it will be much slower than the assembly version, and what's worse, it > unnecessarily > load and store 0.0 into y, which will trigger a much more expensive soft > float emulation > on systems without FPU. > I also tried a C version: void abs(long long in, long long out) { out = in & ~(1ULL<<63); USED(out); } 5c generates: TEXT abs+0(SB),0,$20-16 MOVW $out+8(FP),R1 MOVW R1,4(R13) MOVW R13,R1 MOVW $in+0(FP),R2 MOVM.U 0(R2),[R3,R4] MOVW R3,8(R1) MOVW R4,12(R1) MOVW R13,R1 MOVW $-1,R2 MOVW R2,16(R1) MOVW $2147483647,R2 MOVW R2,20(R1) BL ,_andv+0(SB) NOP out+8(FP), NOP ,R0 NOP ,F0 RET , END , It calls _andv, when it should have done it in line. For comparison, 8c generate this for the same function: (flt.c:1) TEXT abs+0(SB),(gok(71)) (flt.c:2) MOVL in+0(FP),AX (flt.c:2) ANDL $-1,AX (flt.c:2) MOVL in+4(FP),CX (flt.c:2) ANDL $2147483647,CX (flt.c:2) MOVL AX,out+8(FP) (flt.c:2) MOVL CX,out+12(FP) (flt.c:3) RET , (flt.c:3) END , This is much better, and approaches the quality hand written assembly. For the record, 8g generate this for the Go version: 0000 (flt.go:3) TEXT abs+0(SB),$0-16 0001 (flt.go:3) FMOVD $(0.00000000000000000e+00),F0 0002 (flt.go:3) FMOVDP F0,y+8(FP) 0003 (flt.go:5) LEAL y+8(FP),BX 0004 (flt.go:5) MOVL BX,BP 0005 (flt.go:5) LEAL x+0(FP),BX 0006 (flt.go:5) MOVL (BX),AX 0007 (flt.go:5) MOVL 4(BX),CX 0008 (flt.go:5) ANDL $2147483647,CX 0009 (flt.go:5) MOVL AX,(BP) 0010 (flt.go:5) MOVL CX,4(BP) 0011 (flt.go:6) RET , Again, a much better job than 5g.
Sign in to reply to this message.
Le 21 avril 2012 19:51, Michael Jones <mtj@google.com> a écrit : > Which is the key to all of this tuning -- it is "non floating point" > operations on "floating point" values -- and that's why you can't so easily > get to the optimum in any language that has opinions about the meaning of > storage. This is a job for assembler or BCPL or PL/360 or ... I'd say the use of unsafe and a more mature compiler (that eliminate dead stores and doesn't move a value through all registers) should already give good results. It's just a bitwise operation and there isn't any assembly tricks to perform here. Or, the assembly could jump to this function instead: func unsafeabs(x uint64) uint64 { return x &^ (1<<63) } But I agree that in the current state of 5g, the proposal is acceptable. Rémy.
Sign in to reply to this message.
> But, to make it acceptable, I must first add LDRD and STRD to 5a and 5l. > @dfc, could you please benchmark this new version on ARMv5 hardware? stora(~/go/src/pkg/math) % ~/go/misc/benchcmp {old,new}.txt benchmark old ns/op new ns/op delta BenchmarkAbs 29069 6386 -78.03% This form is slightly slower on a Feroceon 88FR131 rev 1 (v5l), but I don't think the difference is worth worrying about. It's still a significant improvement.
Sign in to reply to this message.
http://codereview.appspot.com/6094047/diff/5001/src/pkg/math/abs_arm.s File src/pkg/math/abs_arm.s (right): http://codereview.appspot.com/6094047/diff/5001/src/pkg/math/abs_arm.s#newcode7 src/pkg/math/abs_arm.s:7: MOVW 0(FP), R0 please use variable names and try to do all the loads/stores in blocks (easier to see they're all handled). MOVW lo+0(FP), R0 MOVW hi+4(FP), R1 MOVW $((1<<31)-1), R2 AND R2, R1 MOVW R0, resultlo+8(FP) MOVW R1, resulthi+12(FP) separately, the arm will do the MOVW for you if you use an immediate argument, so you can the center two instructions just one: AND $((1<<31)-1), R1
Sign in to reply to this message.
Hello dave@cheney.net, remyoudompheng@gmail.com, mtj@google.com, rsc@golang.org (cc: golang-dev@googlegroups.com), Please take another look.
Sign in to reply to this message.
*** Submitted as http://code.google.com/p/go/source/detail?r=5a1d471de6d2 *** math: ARM assembly implementation for Abs Obtained on 700MHz OMAP4460: benchmark old ns/op new ns/op delta BenchmarkAbs 61 23 -61.63% R=dave, remyoudompheng, mtj, rsc CC=golang-dev http://codereview.appspot.com/6094047
Sign in to reply to this message.
|