New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/compile: optimization of constant pool on arm #19844
Comments
Also, a | 0xffff -> a & 0xffff0 -> |
CL https://golang.org/cl/39552 mentions this issue. |
The above attachment is a rough test, the output log on my raspberry pi 2 is it means, However, the constant pool is in the data cache. If not, the cache miss will cause much more inefficiency. |
@josharian, can you point @benshi001 to directions on how to run compiler benchmark tests? @benshi001, you want to make pretty commit messages using https://godoc.org/golang.org/x/perf/cmd/benchstat like this one 50688fc |
The time has come for me to clean up compiler benchmarking a bit. Let me do that first. |
Not done, but cleaned up enough for the moment. Do:
Make sure all resulting binaries are in your $PATH. Commit your work. For memory benchmarking:
For exection time benchmarking (with everything else closed):
These will compare master to HEAD. compilecmp supports lots of other variations, run with -h for more. Ask if you have questions or feature requests. |
I have troubles with my benchmark test that @josharian suggested. The reason is due to the network security policy. I ssh connected to a remote host locates in USA, and did git clone https://go.googlesource.com/go Then I got the expected result on the remote host via ssh terminal. I copy the /root/go and the /root/gopath from the remote host to my local raspberry pi 2. And did then the tools are built into ARM ELF. But when I did name old data-bytes new data-bytes delta name old bss-bytes new bss-bytes delta name old exe-bytes new exe-bytes delta The others like are missing. what are wrong with my operations? |
The network security policy forbids my raspberry pi board to access the internet, so I have to do it via tar cfz/ scp / tar xfz |
Does any of the three tools need internet access when running? I do not think there is any difference between my board and the remote host. |
No. I suspect that the problem is that compilecmp uses \r to update a running estimate of when the task will be done, and that your terminal didn't like it. But (unless the tmp dir has been emptied), the results are still there. Just manually run:
You could also tweak compilecmp to remove the live update. If that is in fact the problem, I'd be happy to either add a flag to remove it or (better) do some terminal sniffing to decide when not to use it. (Suggestions welcome on the latter front.) |
I encountered two issues when running the benchmark test. And they were not related to the '\r'.
The first one can be fixed by changing line 234 of src/golang.org/x/tools/cmd/compilebench/main.go BTW: I am using golang 1.6.2 as bootstrap and build golang.org/x/tools/cmd/compilebench. However I got the benchmark result, thank you. Here is the log (/tmp/185436984) |
The golang 1.6.2 is installed to /usr/lib/go-1.6, where root privilege is required. |
Can I treat the following output as "there is improvement after applying my patch CL 39552" ? pi@raspberrypi:~/pending/go/src $ compilecmp -n 20 -cpu name old time/op new time/op delta name old user-ns/op new user-ns/op delta name old text-bytes new text-bytes delta name old data-bytes new data-bytes delta name old bss-bytes new bss-bytes delta name old exe-bytes new exe-bytes delta |
I also attached the log to CL 39552's commit message. I will try |
I doubt -n50 will help, those p values are quite high. But it doesn't really matter, does it? Correct me if i'm wrong, but the goal of your CL was to make the generated code faster, and not to make the compiler faster. So now you've verified that your CL does not make the compiler slower (in fact, it's slightly faster now), but you still have to write benchmarks that show that the code we generate on ARM is executed faster. @bradfitz Am I missing something? Why are we making OP run the compiler benchmarks? Were you expecting the change to slow down the compiler? |
1. My original intention is to make the generated arm code both faster and shorter.
2. I am not familiar with this benchmark system. How to write a proper test program to verify my change? I have attached a test above, it showed 30% improvement in execution time.
|
For benchmarks, you can use some of the go1 benchmarks as a model: Your benchmark program should contain a loop that runs b.N times so that the benchmarks harness can determine an appropriate run count:
Compile-and-test is To compile into a binary so you can examine the generated code, rerun later for testing against a different version, use We use benchstat (
I tend to run the benchmarks for a count of 25 to be sure that I get good numbers, and of course the more you can reduce changes in machine performance during a benchmark run, the better, but benchstat will help mitigate that somewhat or at least let you know that you have a problem. |
@benshi001 if you think your change will benefit ARM code in general, you can start from the go1 benchmark suite mentioned by @dr2chase, i.e. do:
from the current master, then again from your patch:
and then run
Otherwise (if you see no change on the go1 benchmark suite) you'll have to write a go function that your CL makes faster and benchmark it as @dr2chase explained. The archive you posted is not a valid benchmark because it's c and assembly code. Benchmarks improvements needs to be measured on go code. |
Thanks to all of your helps. I will try it later. |
With patch set 5 of CL 39552, I did the go1 benchmark test. Here are my operation steps and results. Steps:
How to understand the following results? name old time/op new time/op delta name old speed new speed delta |
I also attached the results to commit message of CL 39552. |
Mixed results I'd say? You made 5 benchmarks slightly faster, but 10 are slightly slower (and a dozen are unchanged). You can also pass the Also please change the commit message in the CL to make clear that At this point I'd say the effects of the change are documented, so you'll have to wait for the opinion of whoever review the change. |
Helpful to include the -geomean option (why is it not the default? oh well) so you get the aggregate change. In practice, for other optimizations, we look for a geomean improvement, and we look at the outliers, and we look skeptically at the inner loops of some of the frequent offenders in benchmark noise (Revcomp in particular). |
I updated my patch, did a new benchmark test, and attached the result in the commit message of patch set 7 of CL 39552 (https://go-review.googlesource.com/?polygerrit=0#/c/39552/). How to understand the "[Geo mean] 410µs 586µs +42.99%" ? A very bad result? |
That geomean can't be correct, either there's something wrong with the files you passed to |
Even the go1 benchmark run two times with the same go build, there are difference between the results. name old time/op new time/op delta name old speed new speed delta |
Here is a contrast, can I treated the result as " go VS go My patch VS my patch go VS my patch |
I also attached a detailed log to the commit message of CL 39552 (https://go-review.googlesource.com/?polygerrit=0#/c/39552/ ). |
I made go1 benchmark test two times for the original go and two times for my patch. Then 6 comparison are made. (old means original go and new means my patch) The conclusion is,
The attachment is the detailed comparison. |
Current status, Keith Randall has a better solution https://go-review.googlesource.com/41612 And here is a supplement https://go-review.googlesource.com/#/c/41679/ |
update: a = b + 0x00ffff00 -> a = (b + 0x01000000) - 0x00000100 |
CL https://golang.org/cl/42430 mentions this issue. |
For the following code
func ass12345678(a int) int {
return a + 0xffff
}
Currently go will store 0xffff to the constant pool and load it in runtime.
a.go:6 0x95a04 e59fb06c MOVW 0x6c(R15), R11
a.go:6 0x95a08 e080000b ADD R11, R0, R0
.................................
a.go:9 0x95a78 0000ffff STRD.EQ [R0], -PC, R15, R15
But gcc optimized it to
a = a + 0x10000
a = a - 1
Both 1 and 0x10000 can be directly encoded to $immediate-12 into the instructions without any access to memory.
The text was updated successfully, but these errors were encountered: