cmd/compile: greedy basic block layout #66420

y1yang0 · 2024-03-20T06:28:34Z

Proposal Details

Fine-grained performance metrics (such as L1-iTLB, IPC, branch-load, branch-load-miss, etc.) and optimization experiments(#63670) have repeatedly shown that block layout can noticably impact code execution efficiency.

In contrast to the current case-by-case tuning approach, I propose adopting the traditional and time-tested Pettis-Hansen (PH) basic block layout algorithm. Its core concept is to place hot block as closely together as possible, allowing basic blocks to fall through whenever feasible, thereby reducing jump instructions. This principle is considered the golden rule in the field of block layout, and state-of-the-art algorithms like extTSP and almost all variants are based on this idea, incorporating advanced heuristic techniques.

The PH algorithm relies on a weighted chain graph, where the weights represent the frequency of edges. In the absence of PGO information, we can only resort to branch prediction results from likelyadjust pass. In the future, we can incorporate PGO data as weights to make the algorithm even more effective.

Experiment Results

The x-axis represents testcase id, the y-axis indicates the performance change in percentage points, and negative values denote performance improvement.

The text was updated successfully, but these errors were encountered:

Implement Pettis&Hanse's greedy algorithm, i.e. bottom-up variant Fixes golang#66420

gopherbot · 2024-03-20T06:59:22Z

Change https://go.dev/cl/572975 mentions this issue: cmd/compile: greedy basic block layout

mdempsky · 2024-03-20T20:53:09Z

What does the graph represent? What tests were run? What is a "testcase ID"? What's the difference between the orange and blue lines? How do you interpret the graph to conclude the CL is net positive? Thanks.

mknyszek · 2024-03-20T20:55:33Z

https://go.dev/cl/c/go/+/571535 is another CL related to basic block reordering.

y1yang0 · 2024-03-21T02:23:13Z

@mknyszek Thanks for your reminder, I'll take a closer look at it.

@mdempsky Hi,

What does the graph represent?

A Chain Graph is composed of chains and edges. It aggregates a series of blocks into chains, which are then interconnected by edges. The purpose of this is to minimize jmp instructions and maximize fallthrough occurrences as much as possible.

A vivid and real-life example can be referenced in the image below.

What tests were run? What is a "testcase ID"? What's the difference between the orange and blue lines? How do you interpret the graph to conclude the CL is net positive?

id means testcase1pkg: archive/tar testcase2 pkg: archive/zip etc. Due to the length of the package names, I've used IDs to represent them on the x-axis. orange means round1 test and blue indicates round2.

Each round runs all go tests with count=10 at given CPU set

taskset -c 0-32 $GOROOT/bin/go test -bench=. -timeout=99h -count=10  ./...

Changes within 2% are considered reasonable and tolerable fluctuations, which allows us to disregard the majority of cases. However, it has notably improved several cases, specifically those with sharp fluctuations exceeding 6%. On the other hand, this is a simple and time-tested algorithm; variations of it are used worldwide for block layout implementation, and I believe Go is also an excellent candidate for its application. Furthermore, the current graph does not contain PGO information; each edge is either 100% taken or 0% taken. Yet, it still manages to achieve such results. We can optimistically and reasonably expect that a graph augmented with PGO information will yield very favorable outcomes.

mdempsky · 2024-03-21T19:44:03Z

Sorry, I was asking what the graph in your "experimental results" indicated. Your argument seems to be that the graph demonstrates an experimental improvement on the benchmarks, but it looks like noise to me.

Typically we look at the results from running benchstat.

orange means round1 test and blue indicates round2.

Okay. What are "round1" and "round2"?

Can you please instead use "old" and "new"? Or "baseline" and "experiment"? These are unambiguous terms.

Thanks.

alexanius · 2024-03-22T11:21:06Z

In this CL https://go.dev/cl/c/go/+/ the basic block counters are implemented and the likely information is corrected. You can use it in your algorithm.

y1yang0 · 2024-03-22T11:25:10Z

Can you please instead use "old" and "new"? Or "baseline" and "experiment"? These are unambiguous terms.

The raw benchstat results are as follow

round2.log
round1.log

They are too big for inspection, that's why I draw the line char. Each point on x-xis represents geomean of bench result for each package. "Round" inicates completing a full run of all go tests. Please let me know if there is nything else that is unclear to you. Thanks!

In this CL https://go.dev/cl/c/go/+/ the basic block counters are implemented and the likely information is corrected. You can use it in your algorithm.

Thanks, this could be a follow-up enhancement for PH block lyout IMHO. I'll take a closer look on Monday

y1yang0 · 2024-03-27T07:46:42Z

What does the graph represent? What tests were run? What is a "testcase ID"? What's the difference between the orange and blue lines? How do you interpret the graph to conclude the CL is net positive? Thanks.

Hi @mdempsky , do you have any plan on reviewing this patch? Thanks

thanm · 2024-03-27T18:07:28Z

For performance testing best practice is currently to run the "sweet" and "bent" benchmarks, e.g.

git clone https://go.googlesource.com/benchmarks
cd benchmarks/cmd/bench
go build -o bench.exe .
./bench.exe -goroot `go env GOROOT`

then capture the output of the "bench.exe" run. Once you have two runs worth (one with your change and one without), then create a final report based on

benchstat base-output.txt new-output.txt

This will give you a better picture overall than running the std library package benchmarks (which is what it sounds like you are doing?).

y1yang0 added the Proposal label Mar 20, 2024

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 20, 2024

y1yang0 added a commit to y1yang0/go that referenced this issue Mar 20, 2024

cmd/compile: greedy basic block layout

be2f319

Implement Pettis&Hanse's greedy algorithm, i.e. bottom-up variant Fixes golang#66420

y1yang0 linked a pull request Mar 20, 2024 that will close this issue

cmd/compile: greedy basic block layout #66421

Open

mknyszek modified the milestones: Backlog, Unplanned Mar 20, 2024

mknyszek added the Performance label Mar 20, 2024

ianlancetaylor removed the Proposal label Mar 27, 2024

thanm added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: greedy basic block layout #66420

cmd/compile: greedy basic block layout #66420

y1yang0 commented Mar 20, 2024 •

edited

gopherbot commented Mar 20, 2024

mdempsky commented Mar 20, 2024 •

edited

mknyszek commented Mar 20, 2024

y1yang0 commented Mar 21, 2024

mdempsky commented Mar 21, 2024 •

edited

alexanius commented Mar 22, 2024

y1yang0 commented Mar 22, 2024 •

edited

y1yang0 commented Mar 27, 2024

thanm commented Mar 27, 2024

cmd/compile: greedy basic block layout #66420

cmd/compile: greedy basic block layout #66420

Comments

y1yang0 commented Mar 20, 2024 • edited

Proposal Details

Experiment Results

gopherbot commented Mar 20, 2024

mdempsky commented Mar 20, 2024 • edited

mknyszek commented Mar 20, 2024

y1yang0 commented Mar 21, 2024

mdempsky commented Mar 21, 2024 • edited

alexanius commented Mar 22, 2024

y1yang0 commented Mar 22, 2024 • edited

y1yang0 commented Mar 27, 2024

thanm commented Mar 27, 2024

y1yang0 commented Mar 20, 2024 •

edited

mdempsky commented Mar 20, 2024 •

edited

mdempsky commented Mar 21, 2024 •

edited

y1yang0 commented Mar 22, 2024 •

edited