cmd/compile: optimizations to generated code on amd64 #1766
You can see the assembly by using a standard tool like gdb or objdump, but also by running 6l with the -a flag. On my Linux system, I get:

addloop1_c (0.24s)

```asm
.L10:
	addl %ecx, %edx
	addl $1, %ecx
	cmpl %ecx, %esi
	jg   .L10
```

addloop1_modular (0.95s)

```
421734 01c1     | (6) ADDL AX,CX
421736 ffc0     | (5) INCL ,AX
421738 39d0     | (5) CMPL AX,DX
42173a 7cf8     | (5) JLT ,421734
```

It looks like the only difference is the incl vs addl $1. Should that really cause such a huge time difference?

addloop1_monolithic (2.83s)

```
400d36 0144244c | (12) ADDL AX,main.t+76(SP)
400d3a ffc0     | (11) INCL ,AX
400d3c 418b18   | (11) MOVL (R8),BX
400d3f 39c3     | (11) CMPL BX,AX
400d41 7ff3     | (11) JGT ,400d36
```

Here the difference is the use of (R8) as the limit. Because the call to fmt.Sscan took the address of limit, the code is being overly conservative about making sure it sees changes to limit. I will take a note to fix that: it only really needs to reload from memory after an assignment through a pointer or a function call.

Owner changed to @rsc.
Status changed to Accepted.
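[Editor's note: a minimal sketch of the kind of loop under discussion, assuming the benchmark looked roughly like this; the actual addloop1 source was distributed in the attached tarball and is not reproduced here. Because fmt.Sscan receives &limit, the compiler of that era treated limit as potentially aliased and reloaded it from memory on every iteration; current compilers may well keep it in a register.]

```go
package main

import "fmt"

func main() {
	var limit int
	// Taking &limit here is what made the old compiler treat limit as
	// addressable and reload it inside the loop below.
	fmt.Sscan("1000000000", &limit)

	t := 0
	for i := 0; i < limit; i++ { // limit reloaded from memory each iteration
		t += i
	}
	fmt.Println(t)
}
```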
According to Agner Fog's assembly optimization manual [1], add is faster than inc: inc costs an extra micro-op on some architectures and always carries a false dependency on the carry flag from a previous instruction -- in this case the instruction directly preceding it. I'm also surprised it would make this much of a difference, though.

[1] http://www.agner.org/optimize/optimizing_assembly.pdf, section 16.2
Thanks for the 6l -a tip; my gdb seems unable to disassemble gc binaries.

Russ, what are your CPU specs? I am surprised that you got a billion loop iterations to run in 0.24 seconds; that would require your CPU to be running at a bit over 4.0 GHz at minimum. Also, the code you cite from addloop1_c and addloop1_modular has essentially the same first three instructions, but then addloop1_c branches with jg while addloop1_modular says JLT. Those don't appear to mean the same thing.

Note that addloop1_monolithic differs from addloop1_modular not only in reloading limit from memory, but also in keeping the accumulator t in memory. The accumulation into t is a loop-carried dependence chain; routing that through memory is bound to be expensive in a short loop like this. I bet the performance is hurt more by t being in memory than by reloading BX through R8, which is not involved in a loop-carried dependence chain.

It is interesting that t was put in memory during the loop: it is local to the function, and nothing has happened to it yet but := and += statements. The only thing I can think of is that the later call to fmt.Printf() might be taking the address of t under the hood (I don't know what the compiler does to make an interface{} out of an int for that call).

Regarding inc versus add, Agner Fog's excellent manuals do warn that using inc foo instead of add $1, foo can cause stalls or extra dependencies in some cases, but I don't have a good understanding of when it hurts. I have attached another tarball that runs four different assembly implementations of the addition loop. On my chip it doesn't seem to matter in this case whether I use add $1, foo or inc foo. But it does matter whether I use jg or jne as the loop branch instruction: jne is better here. I think that is because my chip (running in 32-bit mode for this executable) can macro-fuse a cmp with a jne but not with a jg. I wouldn't have expected that to hurt this much, though:

```
% ./addloop1_asm-run.sh
inc + jg:  -1243309312   0.86 real   0.86 user   0.00 sys
add + jg:  -1243309312   0.86 real   0.85 user   0.00 sys
inc + jne: -1243309312   0.50 real   0.50 user   0.00 sys
add + jne: -1243309312   0.50 real   0.50 user   0.00 sys
```

In the numbers I posted originally, my gcc figured out that, provided limit is positive, it could use jne instead of jg or the like in the arithmetic loop; that's why the binary it compiled ran in 0.50 seconds instead of 0.86.

Regarding branch target alignment: it can matter, but I wouldn't expect Russ's listed addloop1_modular code to get hurt, because despite the branch target's misalignment the whole loop fits into one naturally aligned 16-byte code block.
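[Editor's note: to make the interface-conversion point above concrete, here is a hypothetical restructuring; the names and shape are assumptions, not the original addloop1 source. If the accumulator is returned from a small function and only the result is passed to fmt.Printf, nothing ever takes the accumulator's address, so the compiler is free to keep it in a register.]

```go
package main

import "fmt"

// sum accumulates into a plain local whose address is never taken,
// so the compiler can hold t in a register for the whole loop.
func sum(limit int) int {
	t := 0
	for i := 0; i < limit; i++ {
		t += i
	}
	return t
}

func main() {
	var limit int
	fmt.Sscan("1000000000", &limit)
	// The interface{} conversion for Printf now happens on the returned
	// value after the loop, not on the loop's accumulator.
	fmt.Printf("%d\n", sum(limit))
}
```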
I think this can be closed: the "modular" vs "monolithic" performance issue, which was the main problem here, is a thing of the past. I tested the monolithic version and it's as fast as I'd expect. We're still slower than
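[Editor's note: for anyone wanting to reproduce the comparison today, a minimal benchmark sketch using the standard testing package; addLoop is a hypothetical stand-in for the original addloop1 body, which was only distributed as a tarball attachment.]

```go
package addloop_test

import "testing"

// addLoop is a hypothetical stand-in for the loop from the original report.
func addLoop(limit int) int {
	t := 0
	for i := 0; i < limit; i++ {
		t += i
	}
	return t
}

// sink keeps the compiler from optimizing the loop result away.
var sink int

func BenchmarkAddLoop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = addLoop(1 << 20)
	}
}
```

Run with `go test -bench=AddLoop` and compare against the equivalent C loop compiled at -O2.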