
runtime: goroutines blocked in mallocgc triggering gcStart #72932

Open
dpolansky opened this issue Mar 18, 2025 · 4 comments
Assignees: mknyszek
Labels: BugReport, compiler/runtime, NeedsInvestigation
Milestone: Go1.25

dpolansky commented Mar 18, 2025

Go version

go version go1.23.0 linux/amd64

Output of go env in your module/workspace:

GOOS=linux
GOARCH=amd64

What did you do?

I have a production service that became slow during a non-prod load test. The service is a WebRTC media forwarding server with relatively high network write throughput (~20 Gbps per machine). The server takes inbound media packets from one source connection and fans them out to thousands of outbound connection sinks.

The app has a set of worker goroutines, each with a list of packet observers. When a packet arrives, the worker goroutines do some work for each observer in the list: allocate heap memory, do crypto, and write to the network.
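For illustration, here is a minimal sketch of that worker pattern (the names and types are invented for this example; this is not the actual service code):

// Hypothetical sketch of the fanout worker described above.
package fanout

import "io"

// observer represents one outbound connection sink.
type observer struct {
	sink io.Writer
}

// worker blocks until an inbound packet arrives, then fans it out to every
// observer: a per-sink heap allocation, a crypto step, and a network write.
func worker(packets <-chan []byte, observers []*observer) {
	for pkt := range packets { // the "select block time" seen in the trace
		for _, o := range observers {
			buf := make([]byte, len(pkt)) // heap allocation (may trigger gcStart)
			copy(buf, pkt)                // stand-in for the crypto work
			o.sink.Write(buf)             // network write
		}
	}
}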

The load test involved ramping up the number of outbound connections to a specific goal, followed by a period of steady state work at that load. The test monitors the packet delay observed by receivers.

The load test was performed on a machine with the following system topology (24 cores, 2 hyperthreads per core):

$ hwloc-ls

Machine (377GB total)
  Package L#0
    NUMANode L#0 (P#0 377GB)
    L3 L#0 (36MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#24)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#25)
      [... the same L2/L1d/L1i/Core pattern repeats for cores L#2 through L#22, each with two PUs ...]
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
        PU L#46 (P#23)
        PU L#47 (P#47)
(additional output elided)

What did you see happen?

After some time at steady state, the load test showed packets arriving late by a few hundred milliseconds. An execution trace of the test, spanning ~5 seconds of wall clock time, shows the app becoming busy during a GC cycle triggered by reaching the heap goal.

[image]
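(Aside: the heap goal that this trigger compares against can be observed at runtime via runtime/metrics; a minimal, self-contained sketch:)

// Minimal sketch: read the GC heap goal the pacer is currently targeting.
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{{Name: "/gc/heap/goal:bytes"}}
	metrics.Read(samples)
	fmt.Printf("current heap goal: %d bytes\n", samples[0].Value.Uint64())
}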

During the busy window, the count of runnable goroutines grows and scheduler delay grows. Worker goroutines responsible for packet fanout become blocked for several hundred milliseconds when an allocation triggers the start of a GC cycle (blocking to acquire work.startSema in gcStart).

Below is a view of a worker goroutine that was blocked for ~500ms during the congestion window. When this goroutine does not run, the app falls behind on its work to fan out packets to connections. The periods of select block time (light red in this gotraceui view) represent the worker waiting for inbound packets to arrive.

[image]

What did you expect to see?

I expected the app to become congested during a GC cycle due to the CPU time dedicated to mark and GC assist work. I expected that this CPU starvation would lead to increased scheduler delay and observable delay in the load test.

I was surprised to see the worker goroutines blocked for several hundred milliseconds in mallocgc on the semaphore in gcStart. The trace’s sync block profile, filtered to the worker goroutines, shows this block time in aggregate:

[image]
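(For anyone reproducing this: the sync-blocking profile can be extracted from the execution trace with go tool trace’s pprof mode, for example:)

$ go tool trace -pprof=sync trace.out > sync.pprof
$ go tool pprof -http=:8080 sync.pprof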

To be clear, I do not know if any fix is warranted for this observation. It is possible that my interpretation of the trace is incorrect, that the gcStart block time is a red herring, and that the scheduler delay caused by the app’s workload is the right root cause to focus on.

For this service, there is room for improvement: we could reduce heap allocations and use a heap ballast to limit GC frequency, at the cost of higher memory utilization. However, it seemed worthwhile to document this case to see if anything productive can come from it, particularly because I can reproduce the issue consistently in the service.


I attempted to write a reproducer benchmark that produces similar blocking behavior. I don’t know that the benchmark is a fair enough approximation of the service, but hopefully it is a useful starting point.

Benchmark code

package main

import (
	"context"
	"flag"
	"log"
	"os"
	"runtime"
	"runtime/trace"
	"time"

	"github.com/aclements/go-gcbench/gcbench"
)

var (
	workerSinks              [][]interface{} // per-worker retained allocations
	ballast                  []byte          // kept live but never touched; raises the heap goal
	lat                      = gcbench.LatencyDist{}
	traceFileName            = flag.String("trace-out", "trace.out", "trace output file name")
	workersPerProc           = flag.Int("workers-per-proc", 2, "number of worker goroutines per GOMAXPROCS")
	ballastSizeInGiB         = flag.Int("ballast-gib", 1, "ballast size in GiB")
	workIterations           = flag.Int("work-iterations", 20000, "number of loop iterations for a worker goroutine")
	durationSeconds          = flag.Int("duration-sec", 10, "run duration in seconds")
	tickIntervalMilliseconds = flag.Int("tick-ms", 100, "work tick interval in milliseconds")
)

func main() {
	flag.Parse()

	workerCount := *workersPerProc * runtime.GOMAXPROCS(-1)
	workerSinks = make([][]interface{}, workerCount)
	ballast = make([]byte, *ballastSizeInGiB<<30)

	f, err := os.Create(*traceFileName)
	if err != nil {
		log.Fatalf("create trace file: %v", err)
	}
	defer func() {
		err := f.Close()
		if err != nil {
			log.Fatalf("close trace file: %v", err)
		}
	}()

	wake := make(chan struct{}, workerCount)
	for i := 0; i < workerCount; i++ {
		go work(i, wake)
	}
	runtime.Gosched()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second*time.Duration(*durationSeconds))
	defer cancel()
	done := false
	tick := time.NewTicker(time.Millisecond * time.Duration(*tickIntervalMilliseconds))
	defer tick.Stop()

	if err := trace.Start(f); err != nil {
		log.Fatalf("start trace: %v", err)
	}
	for !done {
		select {
		case <-ctx.Done():
			done = true
		case <-tick.C:
			for i := 0; i < workerCount; i++ {
				wake <- struct{}{}
			}
		}
	}

	trace.Stop()

	lat.Fprint(os.Stdout)
	gcbench.ReportExtra("P99.9-latency-ms", float64(lat.Quantile(0.999).Milliseconds()))
	gcbench.ReportExtra("P99.99-latency-ms", float64(lat.Quantile(0.9999).Milliseconds()))
	gcbench.ReportExtra("P99.999-latency-ms", float64(lat.Quantile(0.99999).Milliseconds()))
	gcbench.ReportExtra("max-latency-ms", float64(lat.Max.Milliseconds()))
}

func work(id int, wake chan struct{}) {
	for range wake {
		for j := 0; j < *workIterations; j++ {
			before := time.Now()
			// Allocate a 4 KiB buffer and keep it live via workerSinks so the
			// heap grows toward the goal; record how long the append took.
			workerSinks[id] = append(workerSinks[id], make([]byte, 4<<10))
			lat.Add(time.Since(before))
		}
	}
}

I ran the reproducer on a separate machine with the following system topology (64 cores, 1 hyperthread per core):

go1.23.3
GOOS=linux
GOARCH=arm64

$ hwloc-ls

Machine (126GB total)
  Package L#0
    NUMANode L#0 (P#0 126GB)
    L3 L#0 (32MB)
      L2 L#0 (1024KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1)
      [... the same L2/L1d/L1i/Core/PU line repeats for cores L#2 through L#62 ...]
      L2 L#63 (1024KB) + L1d L#63 (64KB) + L1i L#63 (64KB) + Core L#63 + PU L#63 (P#63)
(additional output elided)

$  go run main.go -trace-out trace.out -workers-per-proc 2 -ballast-gib 1 -duration-sec 10 -tick-ms 100
Output

[ 72ns, 79ns) 17573 [ 79ns, 85ns) 216591 [ 85ns, 93ns) 116850 [ 93ns, 100ns) 236451 [ 100ns, 109ns) 314753 [ 109ns, 118ns) 230927 [ 118ns, 128ns) 109421 [ 128ns, 139ns) 59321 [ 139ns, 151ns) 38202 [ 151ns, 163ns) 28376 [ 163ns, 177ns) 21861 [ 177ns, 192ns) 18295 [ 192ns, 209ns) 13143 [ 209ns, 226ns) 7696 [ 226ns, 245ns) 4588 [ 245ns, 266ns) 2483 [ 266ns, 289ns) 6813 [ 289ns, 313ns) 18756 [ 313ns, 339ns) 40835 [ 339ns, 368ns) 30730 [ 368ns, 399ns) 12849 [ 399ns, 433ns) 25823 [ 433ns, 469ns) 23884 [ 469ns, 509ns) 7659 [ 509ns, 552ns) 33934 [ 552ns, 598ns) 55478 [ 598ns, 649ns) 62495 [ 649ns, 704ns) 34499 [ 704ns, 763ns) 14653 [ 763ns, 827ns) 9812 [ 827ns, 897ns) 19002 [ 897ns, 973ns) 25270
[ 973ns, 1.055µs) 50580 [ 1.055µs, 1.144µs) 92187 [ 1.144µs, 1.24µs) 127457 [ 1.24µs, 1.345µs) 140571 [ 1.345µs, 1.459µs) 139210 [ 1.459µs, 1.582µs) 132822 [ 1.582µs, 1.715µs) 125200 [ 1.715µs, 1.86µs) 115555 [ 1.86µs, 2.016µs) 107404 [ 2.016µs, 2.186µs) 105898 [ 2.186µs, 2.371µs) 113228 [ 2.371µs, 2.571µs) 117703 [ 2.571µs, 2.788µs) 119107 [ 2.788µs, 3.023µs) 119796 [ 3.023µs, 3.278µs) 126588 [ 3.278µs, 3.554µs) 138539 [ 3.554µs, 3.854µs) 159116 [ 3.854µs, 4.179µs) 189187 [ 4.179µs, 4.531µs) 233557 [ 4.531µs, 4.913µs) 301022 [ 4.913µs, 5.327µs) 398887 [ 5.327µs, 5.777µs) 505051 [ 5.777µs, 6.264µs) 541441 [ 6.264µs, 6.792µs) 634854 [ 6.792µs, 7.365µs) 881924 [ 7.365µs, 7.986µs) 1185808 [ 7.986µs, 8.659µs) 1446630 [ 8.659µs, 9.389µs) 1626775 [ 9.389µs, 10.181µs) 2031474 [ 10.181µs, 11.039µs) 2094121 [ 11.039µs, 11.97µs) 1807589 [ 11.97µs, 12.98µs) 1624230 [ 12.98µs, 14.074µs) 1056720 [ 14.074µs, 15.261µs) 646032 [ 15.261µs, 16.548µs) 305557 [ 16.548µs, 17.943µs) 153696 [ 17.943µs, 19.456µs) 114515 [ 19.456µs, 21.096µs) 98403 [ 21.096µs, 22.875µs) 104676 [ 22.875µs, 24.804µs) 94487 [ 24.804µs, 26.895µs) 78385 [ 26.895µs, 29.163µs) 70610 [ 29.163µs, 31.622µs) 57449 [ 31.622µs, 34.289µs) 56891 [ 34.289µs, 37.18µs) 70099 [ 37.18µs, 40.315µs) 101161 [ 40.315µs, 43.714µs) 154279 [ 43.714µs, 47.4µs) 223443 [ 47.4µs, 51.396µs) 278641 [ 51.396µs, 55.73µs) 287298 [ 55.73µs, 60.429µs) 240337 [ 60.429µs, 65.524µs) 150835 [ 65.524µs, 71.049µs) 72459 [ 71.049µs, 77.04µs) 38615 [ 77.04µs, 83.536µs) 21532 [ 83.536µs, 90.579µs) 13442 [ 90.579µs, 98.217µs) 10747 [ 98.217µs, 106.498µs) 11481 [ 106.498µs, 115.478µs) 13177 [ 115.478µs, 125.214µs) 15959 [ 125.214µs, 135.772µs) 16279 [ 135.772µs, 147.22µs) 14439 [ 147.22µs, 159.633µs) 13646 [ 159.633µs, 173.093µs) 12696 [ 173.093µs, 187.688µs) 11684 [ 187.688µs, 203.513µs) 11598 [ 203.513µs, 220.673µs) 12099 [ 220.673µs, 239.279µs) 13443 [ 239.279µs, 259.455µs) 15340 [ 259.455µs, 281.331µs) 15861 [ 281.331µs, 305.052µs) 15143 [ 305.052µs, 330.773µs) 12819 [ 330.773µs, 358.663µs) 11203 [ 358.663µs, 388.905µs) 10457 [ 388.905µs, 421.696µs) 8777 [ 421.696µs, 457.252µs) 7415 [ 457.252µs, 495.806µs) 6004 [ 495.806µs, 537.611µs) 5317 [ 537.611µs, 582.941µs) 4360 [ 582.941µs, 632.093µs) 3718 [ 632.093µs, 685.389µs) 3143 [ 685.389µs, 743.179µs) 2605 [ 743.179µs, 805.842µs) 2406 [ 805.842µs, 873.788µs) 2153 [ 873.788µs, 947.463µs) 6832
[ 947.463µs, 1.02735ms) 88088 [ 1.02735ms, 1.113973ms) 16948 [ 1.113973ms, 1.2079ms) 11312 [ 1.2079ms, 1.309747ms) 8098 [ 1.309747ms, 1.420181ms) 9853 [ 1.420181ms, 1.539926ms) 10967 [ 1.539926ms, 1.669768ms) 11771 [ 1.669768ms, 1.810558ms) 11627 [ 1.810558ms, 1.963219ms) 9232 [ 1.963219ms, 2.128751ms) 5193 [ 2.128751ms, 2.308241ms) 2422 [ 2.308241ms, 2.502865ms) 1374 [ 2.502865ms, 2.713899ms) 979 [ 2.713899ms, 2.942727ms) 1307 [ 2.942727ms, 3.190848ms) 3215 [ 3.190848ms, 3.459891ms) 1285 [ 3.459891ms, 3.751619ms) 1236 [ 3.751619ms, 4.067944ms) 657 [ 4.067944ms, 4.410941ms) 479 [ 4.410941ms, 4.782858ms) 369 [ 4.782858ms, 5.186134ms) 226 [ 5.186134ms, 5.623413ms) 234 [ 5.623413ms, 6.097562ms) 335 [ 6.097562ms, 6.61169ms) 257 [ 6.61169ms, 7.169167ms) 116 [ 7.169167ms, 7.77365ms) 104 [ 7.77365ms, 8.4291ms) 128 [ 8.4291ms, 9.139816ms) 161 [ 9.139816ms, 9.910458ms) 78 [ 9.910458ms, 10.746078ms) 1201 [ 10.746078ms, 11.652154ms) 798 [ 11.652154ms, 12.634629ms) 622 [ 12.634629ms, 13.699942ms) 371 [ 13.699942ms, 14.85508ms) 180 [ 14.85508ms, 16.107615ms) 150 [ 16.107615ms, 17.46576ms) 342 [ 17.46576ms, 18.93842ms) 305 [ 18.93842ms, 20.53525ms) 11388 [ 20.53525ms, 22.26672ms) 6810 [ 22.26672ms, 24.144182ms) 894 [ 24.144182ms, 26.179946ms) 510 [ 26.179946ms, 28.387359ms) 441 [ 28.387359ms, 30.780895ms) 443 [ 30.780895ms, 33.376246ms) 553 [ 33.376246ms, 36.19043ms) 256 [ 36.19043ms, 39.241897ms) 279 [ 39.241897ms, 42.550655ms) 1322 [ 42.550655ms, 46.138396ms) 459 [ 46.138396ms, 50.028646ms) 486 [ 50.028646ms, 54.246909ms) 440 [ 54.246909ms, 58.820843ms) 342 [ 58.820843ms, 63.780438ms) 347 [ 63.780438ms, 69.15821ms) 398 [ 69.15821ms, 74.98942ms) 350 [ 74.98942ms, 81.3123ms) 222 [ 81.3123ms, 88.168306ms) 111 [ 88.168306ms, 95.60239ms) 37 [ 95.60239ms,103.663292ms) 23 [103.663292ms,112.403866ms) 10 [112.403866ms,121.881418ms) 10 [121.881418ms,132.158088ms) 7 [132.158088ms,143.301257ms) 9 [143.301257ms,155.383983ms) 11 [155.383983ms,168.485487ms) 13 [168.485487ms,182.691671ms) 6 [182.691671ms,198.095677ms) 7 [198.095677ms,214.798502ms) 4 [214.798502ms,232.909659ms) 6 [232.909659ms,252.547893ms) 11 [252.547893ms,273.841963ms) 8 [273.841963ms,296.931484ms) 12 [296.931484ms,321.967844ms) 45 [321.967844ms,349.115193ms) 13 [349.115193ms,378.551524ms) 11 [378.551524ms,410.469838ms) 15 [410.469838ms,445.079406ms) 13 [445.079406ms,482.607147ms) 16 [482.607147ms,523.299114ms) 16 [523.299114ms,567.422104ms) 21 [567.422104ms, 615.26541ms) 19 [ 615.26541ms,667.142718ms) 37 [667.142718ms,723.394162ms) 14 [723.394162ms,784.388558ms) 3 [784.388558ms,850.525815ms) 1 [850.525815ms,922.239565ms) 3 [922.239565ms, 1s) 34
metric 19 P99.9-latency-ms
metric 52 P99.99-latency-ms
metric 309 P99.999-latency-ms
metric 1285 max-latency-ms

[image]

bench.trace.zip

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Mar 18, 2025
@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label Mar 18, 2025
ianlancetaylor (Member) commented

CC @golang/runtime @mknyszek

mknyszek (Contributor) commented

Hm, that is a lot of time spent blocked. Normally, here is what I would expect to happen:

  • We've allocated up to the heap goal.
  • G1 wins the race to start the GC (the trigger condition tests true, and it acquires work.startSema first).
  • Until the STW, a few other goroutines slip into gcStart to block.
  • G1 successfully stops the world.
  • G1 changes gcphase to _GCmark.
  • G1 restarts the world, and releases work.startSema. This enqueues the next goroutine blocked on work.startSema.
  • Everybody blocked on work.startSema, in order, wakes up and acquires it, checks the trigger condition, and releases it, then continues on its merry way.

The only problem I can see with this is that if it takes too long to wake a goroutine blocked on this semaphore, an entire GC cycle passes and it wakes up only to get stuck again. This seems like it would be incredibly rare, but one thing we can try (that we've done elsewhere) is a direct G handoff on semrelease, which does a bit more to guarantee that the goroutines we wake up make progress.

I can suggest a patch, if you'd like to try it.
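(For concreteness, a sketch of the kind of change being described, assuming the runtime's internal semrelease1 helper; the actual patch may differ:)

// In gcStart, where the GC-starting goroutine currently releases the
// semaphore with semrelease(&work.startSema), a direct-handoff variant
// would hand the current P straight to the next waiter (via goyield)
// instead of merely making it runnable:
semrelease1(&work.startSema, true, 0) // sketch only; handoff=true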

dpolansky (Author) commented Mar 18, 2025

Hey Michael, thanks for your quick reply.

Thanks for that context as well. Given that the STW time for that GC cycle is small (~300 microseconds), the goroutines waiting on work.startSema should, in theory, be able to progress quickly after blocking on it. A direct G handoff sounds promising, and I'd be happy to test it on the service.

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 19, 2025
@mknyszek mknyszek self-assigned this Mar 19, 2025
@mknyszek mknyszek added this to the Go1.25 milestone Mar 19, 2025