runtime: Getting interrupts on isolated cores (caused by simple Go app) during low-latency work #60735

Open
p-zak opened this issue Jun 12, 2023 · 15 comments
Labels
compiler/runtime (Issues related to the Go compiler and/or runtime.), help wanted, NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.)
Milestone: Backlog

Comments

@p-zak

p-zak commented Jun 12, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20.5 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.5"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1121239308=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Details below.

What did you expect to see?

I expect to see no interrupts on the isolated cores; I want to be able to run Go-based applications in environments where low-latency workloads are executed.

What did you see instead?

Interrupts on the isolated cores on which the low-latency work is performed. The interrupts are caused by a simple Go application (printing "Hello world" every 1 second) which runs on different, non-isolated cores.

Summary

LOC, IWI, RES and CAL interrupts are observed on the isolated cores on which the low-latency benchmark is performed. The interrupts are caused by a simple Go application (printing "Hello world" every 1 second) which runs on different, non-isolated cores. A similar Python application doesn't cause such problems.

Tested on Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-68-lowlatency x86_64) (compiled with: "Full Dynticks System (tickless)" and "No Forced Preemption (Server)").

Hardware:
2 x Intel(R) Xeon(R) Gold 6438N (32 cores each)

BIOS:
Hyperthreading disabled

OS and configuration:
Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-68-lowlatency x86_64) (compiled with: "Full Dynticks System (tickless)" and "No Forced Preemption (Server)" from https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/log/?h=lowlatency-next)

irqbalance stopped and disabled:

$ systemctl stop irqbalance.service
$ systemctl disable irqbalance.service

Based on the workload type, experiments, and knowledge found on the Internet, the following kernel parameters were used:

$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.0-68-lowlatency root=UUID=5c9c2ea3-e0c6-4dd8-ae70-57e0c0af20d3 ro ro rhgb quiet ipv6.disable=1 audit=0 selinux=0 hugepages=256 hugepagesz=1G intel_iommu=on iommu=pt nmi_watchdog=0 mce=off tsc=reliable nosoftlockup hpet=disable skew_tick=1 acpi_pad.disable=1 nowatchdog nomce numa_balancing=disable irqaffinity=0 rcu_nocb_poll processor.max_cstate=0 clocksource=tsc nosmt nohz=on nohz_full=20-23 rcu_nocbs=20-23 isolcpus=nohz,domain,managed_irq,20-23

For every core/socket: Cx states (for x > 0) were disabled, a particular power governor was used, and fixed uncore values were set.
To achieve that, the power.py script from https://github.com/intel/CommsPowerManagement was used.

prepare_cpus.sh script for setting this up and results
#!/bin/bash

#Disable all possible Cx states (x > 0)
./power.py -d C1_ACPI -r 0-63
./power.py -d C2_ACPI -r 0-63
./power.py -d C1 -r 0-63
./power.py -d C1E -r 0-63
./power.py -d C6 -r 0-63

#Powersave governor
./power.py -g powersave

#Set uncore min and max freq to 1100
./power.py -u 1100
./power.py -U 1100

sleep 3

#Show current status
./power.py -l


cpu_prepared

CPUs 20-23 are "isolated" (thanks to the kernel parameters above); the benchmark/workload will be run on them.

Kernel threads were moved away from CPUs 20-23:

$ cat /sys/devices/virtual/workqueue/cpumask
ffffffff,ff0fffff

(In 0xff0fffff, bits 20-23 are cleared, so workqueues avoid the isolated cores.)
get_irqs.sh - script which checks which target CPUs are permitted for given IRQ sources
#!/bin/bash

for I in $(ls /proc/irq)
do
    if [[ -d "/proc/irq/$I" ]]
    then
        echo -n "$I:"
        cat "/proc/irq/$I/smp_affinity_list"
    fi
done

Output of the above script:
0:0
1:0
10:0
11:0
12:0
124:0
13:0
133:0
134:0
135:0
136:0
137:0
138:0
14:0
15:0
16:0
2:0
203:0
212:0
24:0
25:0
26:0
27:0
277:0
278:0
279:0
28:0
280:1
281:2
282:3
283:4
284:5
285:6
286:7
287:8
288:9
289:10
29:0
290:11
291:12
292:13
293:14
294:15
295:16
296:17
297:18
298:19
299:20
3:0
30:0
300:21
301:22
302:23
303:24
304:25
305:26
306:27
307:28
308:29
309:30
31:0
310:31
311:32
312:33
313:34
314:35
315:36
316:37
317:38
318:39
319:40
32:0
320:41
321:42
322:43
323:44
324:45
325:46
326:47
327:48
328:49
329:50
33:0
330:51
331:52
332:53
333:54
334:55
335:56
336:57
337:58
338:59
339:60
34:0
340:61
341:62
342:63
343:0
344:0
345:0
347:0
348:0
349:0
35:0
350:0
351:0
352:0
353:0
354:0
355:0
356:0
357:0
358:0
359:0
36:0
360:0
361:0
362:0
363:0
364:0
365:0
37:0
38:0
39:0
4:0
40:0
41:0
42:0
43:0
44:0
45:0
46:0
47:0
48:0
49:0
5:0
50:0
51:0
52:0
53:0
54:0
55:0
56:0
57:0
58:0
59:0
6:0
7:0
8:0
9:0
lscpu output:
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6438N
    CPU family:          6
    Model:               143
    Thread(s) per core:  1
    Core(s) per socket:  32
    Socket(s):           2
    Stepping:            8
    CPU max MHz:         3600.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4000.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pn
                         i pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd 
                         mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsav
                         ec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni
                          avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   3 MiB (64 instances)
  L1i:                   2 MiB (64 instances)
  L2:                    128 MiB (64 instances)
  L3:                    120 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-31
  NUMA node1 CPU(s):     32-63
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

JITTER tool - Baseline

jitter is a benchmarking tool for measuring the "jitter" in execution time caused by the OS and/or the underlying architecture.

git clone https://github.com/FDio/archived-pma_tools
cd archived-pma_tools/jitter
Put "run_jitter.sh" script inside above directory.
#!/bin/bash

printf "Start time:\n"
date
#Print current state of interrupts (for: cpu20, cpu21, cpu22, cpu23, cpu24), discard interrupts with only zeros (or empty)
printf "\nInterrupts before:\n"
sed '1s/^/INT:/' /proc/interrupts | awk '{print $1 " " $22 " " $23 " " $24 " " $25 " " $26}' | grep -v "0 0 0 0 0" | grep -v ": "

printf "\nRun jitter... "
#Start on cpu21, benchmark thread will be run on isolated cpu20
taskset -c 21 ./jitter -c 20 -f -i 1450 -l 15500 > jit.log

printf "finished!\n"

printf "\nInterrupts after:\n"
sed '1s/^/INT:/' /proc/interrupts | awk '{print $1 " " $22 " " $23 " " $24 " " $25 " " $26}' | grep -v "0 0 0 0 0" | grep -v ": "

printf "\nEnd time:\n"
date

printf "\nResults:\n"
cat jit.log

printf "\n"

Run:

make
./run_jitter.sh

Results:

  • output from "run_jitter.sh" script: jitter_base.txt
  • chart created from above output:

jitter_base

Comment:
The jitter tool reports intervals and jitter in CPU core cycles. The benchmark runs on a 2000 MHz core, so for the graph the values are divided by 2 to convert them to nanoseconds (at 2000 MHz, 2 cycles elapse per nanosecond; e.g. 102 cycles / 2 = 51 ns).
Very stable results: no significant jitter (max jitter: 51 ns) during 335 seconds.
No interrupts hit isolated CPU20 during the benchmark.

JITTER tool - Python

hello.py - simple Python app which prints "Hello world" every 1 second
import time

while True:
    print("Hello world")
    time.sleep(1)

run_python_hello.sh - script to run python app on particular (non-isolated) core
#!/bin/bash

taskset -a -c 8 python3 hello.py

$ python3 --version
Python 3.10.6

In the first console ./run_python_hello.sh was started; in the second console ./run_jitter.sh was run.

Results:

  • output from "run_jitter.sh" script: jitter_python.txt
  • chart created from above output:

jitter_python

Comment:
An acceptable result: one noticeable jitter (1190 ns); the remaining jitters did not exceed 60 ns during 336 seconds.
No interrupts hit isolated CPU20 during the benchmark.

JITTER tool - Golang

hello.go - simple Golang app which prints "Hello world" every 1 second
package main

import (
	"fmt"
	"time"
)

func main() {
	for {
		fmt.Println("Hello world")
		time.Sleep(1 * time.Second)
	}
}

go.mod - go module definition
module hello

go 1.20

run_go_hello.sh - script to run Go app on particular (non-isolated) core
#!/bin/bash

taskset -a -c 8 ./hello

$ go version
go version go1.20.5 linux/amd64

In the first console the Go app was built (go build) and started (./run_go_hello.sh); in the second console ./run_jitter.sh was run.

Results:

  • output from "run_jitter.sh" script: jitter_go.txt
  • chart created from above output:

jitter_go

Comment:
34 significant jitters (the worst: 44961 ns) during 335 seconds.
The following interrupts hit isolated CPU20 during the benchmark:
LOC: 67
IWI: 34
RES: 34
CAL: 34

It seems that a jitter occurs roughly every 10 seconds.

It is also interesting that the idle, isolated CPU22 and CPU23 received no interrupts during the benchmark. CPU24 (not isolated) received only LOC interrupts (335283 of them).

Notes:

  1. Instead of static isolation (using kernel parameters) I also tried cpuset with its shield turned on. Unfortunately, the results were even worse (the jitters were bigger and more interrupts hit the shielded cores); moreover, cset was not able to move kernel threads out of the shielded pool.
  2. I also checked on a real-time kernel (GNU/Linux 5.15.65-rt49 x86_64 -> https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.15.65.tar.gz patched with https://cdn.kernel.org/pub/linux/kernel/projects/rt/5.15/older/patch-5.15.65-rt49.patch.gz) and the problem with interrupts and jitters caused by the Go app doesn't exist there. However, the RT kernel is not the best solution for everyone, and it would be great not to have jitters on the low-latency tickless kernel either.
  3. I also ran a lot of experiments with different kernel parameters; this combination seemed the best (though maybe I missed something).
  4. Same situation with the Go app built using 1.19.x and 1.20.2.
  5. I'm aware that this kind of benchmark should run for hours, but for now these results are pretty meaningful.
@seankhliao seankhliao changed the title go: Getting interrupts on isolated cores (caused by simple Go app) during low-latency work runtime: Getting interrupts on isolated cores (caused by simple Go app) during low-latency work Jun 12, 2023
@seankhliao
Member

cc @golang/runtime

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jun 12, 2023
@mknyszek mknyszek added this to the Backlog milestone Jun 12, 2023
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 12, 2023
@prattmic
Member

Thanks for the report. Since there is a lot of information, here is my paraphrased understanding. Please let me know if I misunderstood anything.

  • jitter is a C program that looks for jitter in execution time of a tight loop (due to OS interference, e.g.).
  • The jitter program is running on dedicated cores (no other threads allowed).
  • A Go program is running on other, non-dedicated cores.
  • The jitter program measures elevated jitter, including additional CPU interrupts, compared to an otherwise idle system and compared to a Python program running on non-dedicated cores.

@p-zak
Author

p-zak commented Jun 12, 2023

@prattmic Thanks for the quick response!

  • jitter is a C program that looks for jitter in execution time of a tight loop (due to OS interference, e.g.).

Yes!
Details from https://github.com/FDio/archived-pma_tools/blob/master/jitter/jitter.c#L21:

 *   The program does the following:
 *		- Calls a dummy function which consists of a small piece of code block being executed in a loop. The loop count is user defined (default 80,000).
 *		- Measures the Execution time for this function in Core cycles
 *		- Compares the Execution time with Instantaneous Minimum and Maximum execution time values tracked during the display update interval. These values are adjusted based on the comparison.
 *		- Compares the execution time with Absolute Minimum and maximum execution values since the program started. These values are adjusted based on the comparison.
 *		- Repeats above operation.
  • The jitter program is running on dedicated cores (no other threads allowed).

Yes, and isolated.

  • A Go program is running on other, non-dedicated cores.

Exactly.

  • The jitter program measures elevated jitter, including additional CPU interrupts, compared to an otherwise idle system and compared to a Python program running on non-dedicated cores.

That's right. The Python scenario is for reference. During the tests with the Go app, the Python app was of course not running.
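
To make the jitter.c description above concrete, here is a rough Go rendering of the measurement loop. This is only a minimal sketch: the real tool is written in C, reads the TSC directly, and pins itself to a core, none of which is shown here, and the names are my own.

package main

import (
	"fmt"
	"time"
)

// dummyWork stands in for jitter's fixed block of work (default 80,000 loops).
func dummyWork(loops int) int {
	s := 0
	for i := 0; i < loops; i++ {
		s += i
	}
	return s
}

func main() {
	const loops = 80000
	sink := 0
	minD, maxD := time.Duration(1<<62), time.Duration(0)
	for i := 0; i < 200000; i++ {
		start := time.Now()
		sink += dummyWork(loops)
		d := time.Since(start)
		// Track the fastest and slowest execution of the identical workload.
		if d < minD {
			minD = d
		}
		if d > maxD {
			maxD = d
		}
	}
	// The "jitter" is the spread between min and max; on a quiet,
	// isolated core it should be tiny.
	fmt.Println(sink, "min:", minD, "max:", maxD, "jitter:", maxD-minD)
}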

@prattmic
Member

prattmic commented Jun 12, 2023

This sounds like a limitation in the CPU isolation functionality in the Linux kernel, not a Go issue.

If the Go program is isolated away from a CPU, then its behaviors should presumably not affect that CPU, but it seems like it is. With regards to the interrupts, some possibilities I could imagine:

  • LOC (Local timer interrupts): Linux is handling Go's timers on the isolated CPU?
  • RES (Rescheduling interrupts): Linux is delivering preemptive scheduling tick interrupts on the isolated CPU? Even if there are no additional threads eligible to run on that CPU?
  • IWI (IRQ work interrupt): I'm not particularly familiar with IRQ work, but it sounds like some work generated as a result of the Go program is ending up waking up the isolated CPU.

@p-zak
Author

p-zak commented Jun 12, 2023

This sounds like a limitation in the CPU isolation functionality in the Linux kernel, not a Go issue.

If the Go program is isolated away from a CPU, then its behaviors should presumably not affect that CPU, but it seems like it is.

@prattmic Generally, I agree. However, please look closely at the Python scenario :)

As you can see in run_jitter.sh, I run the tool using: taskset -c 21 ./jitter -c 20 -f -i 1450 -l 15500. It means that initially the thread affinity is set to (isolated) CPU21, but -c 20 changes it to (isolated) CPU20 inside the jitter tool (I can run it using taskset -c 20 ./jitter -c 20 and the results are the same for every scenario). The main point is that jitter uses only isolated cores.

Both the Python and Go apps are run by: taskset -a -c 8. It means that all threads of the given command should run on CPU8 (non-isolated).

jitter runs for about 335 s, and I collect the interrupt counts from /proc/interrupts just before jitter starts and just after it ends, for isolated CPU20-23 and non-isolated CPU24 (for reference). Only these kinds of interrupts appear on these cores in all scenarios: LOC, IWI, RES, CAL, TLB (I believe the last one is caused by jitter itself when it ends). No other types appear on these cores.

So here is the difference in these interrupts for "BASELINE" (only jitter running):
[image: interrupt deltas for the baseline]

Here it is for jitter with the Python app:
[image: interrupt deltas with the Python app]

And here it is for jitter with the Go app:
[image: interrupt deltas with the Go app]

We can see that the Python app doesn't influence the isolated cores at all, while the Go equivalent unfortunately does.
As a Go engineer, I've been trying to find out how to write applications in Go that impact the isolated cores as little as the Python app does ;)

What and where should be tuned/configured/changed/fixed to achieve that?

@prattmic
Member

I acknowledge that there is a difference between your Python and Go cases, but the Python case is also simpler; e.g., with CPython 3.11 on my machine, that program is single-threaded. Go programs, on the other hand, are always multi-threaded, even with only one goroutine. So there is more interesting surface in the Go case.

Given the connection with isolated CPUs (a kernel feature) and interrupts (which Go has no direct control over), I think this issue would be better served on a Linux kernel mailing list / issue tracker, where there are more experts in these areas. If there are concrete things we are doing wrong that makes us a bad neighbor, then we can consider whether they should change.
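For illustration, a quick Linux-only way to see that even the one-goroutine hello program is multi-threaded (a minimal sketch, not from the original report):

package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	time.Sleep(100 * time.Millisecond) // give the runtime a moment to start its worker threads

	// /proc/self/status reports the number of OS threads in this process.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		panic(err)
	}
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "Threads:") {
			fmt.Println(line) // several threads, even for a trivial program
		}
	}
}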

@mknyszek
Contributor

mknyszek commented Jun 12, 2023

+1 to what Michael said. To add on, here are a few things you can try in order to isolate the issue:

  • Run the Go program with GODEBUG=asyncpreemptoff=1.
  • Run the Go program with GOGC=off (given that you're only printing "hello world" every second, I doubt you'll actually run out of memory in any reasonable amount of time... however I also doubt you're triggering the GC at all).
  • Buffer the prints to stdout (os.Stdout and os.Stderr are not buffered by default; IIRC they are with Python). See the sketch after this list.
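
For the last suggestion, a buffered variant of the test program could look like this (a minimal sketch; bufio is the standard-library way to batch writes):

package main

import (
	"bufio"
	"fmt"
	"os"
	"time"
)

func main() {
	// os.Stdout is not buffered; wrap it so writes are batched.
	w := bufio.NewWriter(os.Stdout)
	defer w.Flush()

	for {
		fmt.Fprintln(w, "Hello world")
		w.Flush() // at most one write syscall per iteration
		time.Sleep(1 * time.Second)
	}
}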

However, given that you're seeing jitter on the order of 10 seconds, I don't think there's any obvious thing (to us, anyway) the Go runtime is doing that would cause the jitter. Honestly, I don't really expect any of the 3 things above to have an effect, but it's something to rule out.

Beyond that, I'm not sure we can provide more help here at this time.

@p-zak
Author

p-zak commented Jun 12, 2023

@mknyszek

  • Buffer the print to stdout (os.Stdout and os.Stderr are not buffered by default, IIRC they are with Python.)

Even without printing anything (just a for loop with a 1s sleep) I have the same results.

  • Run the Go program with GOGC=off (given that you're only printing "hello world" every second, I doubt you'll actually run out of memory in any reasonable amount of time... however I also doubt you're triggering the GC at all).

Forgot to mention that I've tried that already. I also tried GOMAXPROCS=1. No success.

  • Run the Go program with GODEBUG=asyncpreemptoff=1.

This looked promising; unfortunately, it didn't change anything :/

@andig
Contributor

andig commented Jun 13, 2023

Is the Python program even a good comparison to a multithreaded program, given the GIL?

@p-zak
Author

p-zak commented Jun 13, 2023

Is the Python program even a good comparison to a multithreaded program, given the GIL?

@andig From a low-level perspective (like the GIL) such a comparison may not be fair, I agree.

However, it makes sense for someone who wants to choose a language for an application based on how good or bad a neighbor it will be in the OS.

I did two more tests...

JITTER tool - RUST

hello.rs - simple RUST app which prints "Hello world" every 1 second
use std::{thread, time::Duration};

fn main() {
    loop {
        println!("Hello world");
        thread::sleep(Duration::from_secs(1));
    }
}

run_rust_hello.sh - script to run RUST app on particular (non-isolated) core
#!/bin/bash

taskset -a -c 8 ./hello

$ rustc --version
rustc 1.70.0 (90c541806 2023-05-31)

In the first console the RUST app was built (rustc hello.rs) and started (./run_rust_hello.sh); in the second console ./run_jitter.sh was run.

Results:

  • output from "run_jitter.sh" script: jitter_rust.txt
  • chart created from above output:
    jitter_rust

Comment:
1 significant jitter (62154 ns); the remaining jitters did not exceed 106 ns during 335 seconds.
The following interrupts occurred during the benchmark:
int_rust

JITTER tool - C++

hello.cpp - simple C++ app which prints "Hello world" every 1 second
#include <chrono>
#include <iostream>
#include <thread>

int main() {
    while (true) {
        std::cout << "Hello world" << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

run_cpp_hello.sh - script to run C++ app on particular (non-isolated) core
#!/bin/bash

taskset -a -c 8 ./hello

$ g++ --version
g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

In the first console the C++ app was built (g++ hello.cpp -o hello) and started (./run_cpp_hello.sh); in the second console ./run_jitter.sh was run.

Results:

  • output from "run_jitter.sh" script: jitter_cpp.txt
  • chart created from above output:
    jitter_cpp

Comment:
1 significant jitter (70972 ns); the remaining jitters did not exceed 59 ns during 335 seconds.
The following interrupts occurred during the benchmark:
int_cpp

@p-zak
Author

p-zak commented Sep 20, 2023

@mknyszek @andig @prattmic

Originally, I used go1.20.5 linux/amd64 to build the simple Golang app for this issue.

Later, I also built it using:

  • go1.20.6 linux/amd64
  • go1.20.7 linux/amd64
  • go1.20.8 linux/amd64
  • go1.21.0 linux/amd64 (with go 1.21 in go.mod)

and results were the same.

But look at the results I got when I built this app using go1.21.1 linux/amd64 (with go 1.21 in go.mod):

jitter_go 1 21 1

Comment:
0 significant jitters (the biggest jitter did not exceed 50 ns) during 335 seconds.
The following interrupts occurred during the benchmark:

int_go 1 21 1

Notes:

  • The same HW and OS configuration was used for testing the same Golang app (the only change was the version of Go used to build it). So still the low-latency kernel with the same configuration.
  • No jitters and no interrupts were observed on isolated CPU20 during the benchmark.
  • I also ran more complicated Golang apps (built using go1.21.1 linux/amd64) and didn't observe any jitters or interrupts on isolated CPU20 during the benchmark. There were a lot of jitters and interrupts when these apps were built with previous versions of Go.

It would be great to find out which commit from https://github.com/golang/go/issues?q=milestone%3AGo1.21.1+label%3ACherryPickApproved actually fixed this problem, and to root-cause why.
Can you help me with that?

@prattmic
Member

IIRC, nothing in 1.21.1 intentionally affected scheduling. My best guess would be #62329. To be sure, you should bisect on branch release-branch.go1.21 (the individual releases are tagged).

@p-zak
Author

p-zak commented Sep 20, 2023

@prattmic Right, #62329 (runtime: MADV_HUGEPAGE causes stalls when allocating memory) was also my best guess :)

These are the commits between tag go1.21.0 and tag go1.21.1 on the release-branch.go1.21 branch:

2c1e5b05fe (tag: go1.21.1) [release-branch.go1.21] go1.21.1
bbd043ff0d [release-branch.go1.21] html/template: properly handle special tags within the script context
b0e1d3ea26 [release-branch.go1.21] html/template: support HTML-like comments in script contexts
d25a935574 [release-branch.go1.21] cmd/go: reject toolchain directives containing path separators
e3ba569c78 [release-branch.go1.21] net/http: revert "support streaming POST content in wasm"
8dc6ad1c61 [release-branch.go1.21] runtime: restore caller's frame pointer when recovering from panic
06df3292a8 [release-branch.go1.21] runtime: avoid MADV_HUGEPAGE for heap memory
b120517ffd [release-branch.go1.21] go/types, types2: remove order dependency in inference involving channels
0a9582163c [release-branch.go1.21] cmd/go: retry ETXTBSY errors when running test binaries
91a4e74b98 [release-branch.go1.21] crypto/tls: QUIC: fix panics when processing post-handshake messages
6385a6fb18 [release-branch.go1.21] cmd/go: find GOROOT using os.Executable when installed to GOROOT/bin/GOOS_GOARCH
2d07bb86f0 [release-branch.go1.21] encoding/xml: overriding by empty namespace when no new name declaration
745b81b6e6 [release-branch.go1.21] encoding/gob: prevent panic from index out of range in Decoder.typeString
13339c75b8 [release-branch.go1.21] runtime: fix maps.Clone bug when cloning a map mid-grow
2977709875 [release-branch.go1.21] context: fix synchronization in ExampleAfterFunc_cond
2d4746f37b [release-branch.go1.21] go/types, types2: disable interface inference for versions before Go 1.21
2b8026f025 [release-branch.go1.21] cmd/compile: in expandCalls, move all arg marshalling into call block
7c97cc7d97 [release-branch.go1.21] Revert "os: use handle based APIs to read directories on windows"
cb6ea94996 [release-branch.go1.21] Revert "cmd/compile: omit redundant sign/unsign extension on arm64"
45b98bfb79 [release-branch.go1.21] path/filepath: don't drop .. elements when cleaning invalid Windows paths
bac083a584 [release-branch.go1.21] cmd/link: don't mangle string symbol names
70aa116c4a [release-branch.go1.21] runtime/internal/wasitest: skip racy TCP echo test
31c5a236bc [release-branch.go1.21] runtime: mark traceEnabled and traceShuttingDown as nosplit
25ec110e51 [release-branch.go1.21] cmd/compile: ensure empty blocks in write barriers are marked unpreemptible
6634ce2f41 [release-branch.go1.21] runtime: profiling on Darwin cannot use blocking reads
25c6dce188 [release-branch.go1.21] cmd/compile: make backingArrayPtrLen to return typecheck-ed nodes
4e34f2e81d [release-branch.go1.21] go/types, types2: don't panic during interface completion
d91843ff67 [release-branch.go1.21] go/types, types2: use correct parameter list when checking argument passing
7437db1085 [release-branch.go1.21] go/types, types2: use exact unification when comparing interface methods
ed527ecfb2 [release-branch.go1.21] cmd/api: rename api.go to main_test.go
b78e8cc145 [release-branch.go1.21] crypto/tls: add GODEBUG to control max RSA key size
3475e6af4c [release-branch.go1.21] cmd/go: fix missing case checking for empty slice
179821c9e1 [release-branch.go1.21] net/http: permit requests with invalid Host headers
9398951479 [release-branch.go1.21] cmd/distpack: include directory entries in tar files
75d8be5fb4 [release-branch.go1.21] cmd/go/internal/web: release the net token when an HTTP request fails due to CheckRedirect
1755d14559 [release-branch.go1.21] cmd/compile: fix missing init nodes for len(string([]byte)) optimization
c19c4c566c (tag: go1.21.0) [release-branch.go1.21] go1.21.0

So I compiled Go from source at commit 06df3292a8 [release-branch.go1.21] runtime: avoid MADV_HUGEPAGE for heap memory and compiled my simple app using that Go.
Results:
No jitters and no interrupts were observed.

After that, I compiled Go from source at the commit just before it, b120517ffd [release-branch.go1.21] go/types, types2: remove order dependency in inference involving channels, and compiled my simple app using that Go.
Results:
Jitters and interrupts were observed.

So we can assume that a side effect of fixing #62329 (original PR: #61718, created by @dominikh) is that the problem described in this issue was also fixed.

Here is the commit: 06df329

And I'm still very curious what the root cause of the problem described in this issue was.

@mknyszek
Contributor

And I'm still very curious what the root cause of the problem described in this issue was.

It's a long story. https://go.dev/doc/gc-guide#Linux_transparent_huge_pages has some background, but from there I think most of the detail is all in #61718. Going all the way back, this started with https://bugzilla.kernel.org/show_bug.cgi?id=93111, which is why the runtime (for releases prior to Go 1.21.0) was calling madvise(MADV_HUGEPAGE) and madvise(MADV_NOHUGEPAGE) at all.

Coming back to the current issue, @prattmic theorizes that forcing huge pages on and off via madvise(MADV_HUGEPAGE) and madvise(MADV_NOHUGEPAGE) results in TLB shootdowns down the line that end up causing other programs to stall when they access memory and take a very slow path through the kernel and hardware.

The TL;DR is that Go 1.21.1 doesn't make these calls anymore. There are a handful of cases where they're still used, but the maximum number of calls to these syscalls in any given program is constant and small.
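
For concreteness, the pair of calls in question looks roughly like this; a minimal sketch using golang.org/x/sys/unix (the runtime itself issues the raw syscalls, and 2 MiB is just the x86-64 huge page size):

package main

import "golang.org/x/sys/unix"

func main() {
	// Map a 2 MiB anonymous region.
	b, err := unix.Mmap(-1, 0, 2<<20, unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_ANON|unix.MAP_PRIVATE)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(b)

	// Pre-1.21.1 runtimes toggled these hints as heap regions became dense
	// or sparse. Splitting a huge page back into 4 KiB pages can trigger
	// TLB-shootdown IPIs to other cores that still have the mapping cached.
	_ = unix.Madvise(b, unix.MADV_HUGEPAGE)
	_ = unix.Madvise(b, unix.MADV_NOHUGEPAGE)
}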

@p-zak
Author

p-zak commented Sep 26, 2023

@mknyszek Thank you very much for the detailed response! I have one more question.

For applications built with Go 1.21.1, MADV_COLLAPSE works on Linux 6.1 and higher, right? So in my case (Linux 5.15), madvise most likely doesn't do anything and returns an error, which is intentionally not handled in the Go runtime. Am I correct?
[image]

I'm wondering (and I don't have a way to check it right now) what impact the correct execution of the madvise(v, n, _MADV_COLLAPSE) operation on Linux 6.1+ may have on my workloads, their latency, and interrupts.
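
For anyone who wants to check, a small probe for MADV_COLLAPSE support might look like this (a minimal sketch, assuming a golang.org/x/sys/unix recent enough to define unix.MADV_COLLAPSE; on pre-6.1 kernels the call should fail with EINVAL):

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Map and touch a 2 MiB anonymous region so there is something to collapse.
	b, err := unix.Mmap(-1, 0, 2<<20, unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_ANON|unix.MAP_PRIVATE)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(b)
	for i := 0; i < len(b); i += 4096 {
		b[i] = 1
	}

	// Linux 6.1+ accepts MADV_COLLAPSE; older kernels return EINVAL,
	// which callers (including the Go runtime) can simply ignore.
	if err := unix.Madvise(b, unix.MADV_COLLAPSE); err != nil {
		fmt.Println("MADV_COLLAPSE not supported or failed:", err)
		return
	}
	fmt.Println("MADV_COLLAPSE succeeded")
}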
