runtime: Getting interrupts on isolated cores (caused by simple Go app) during low-latency work #60735
Comments
cc @golang/runtime
Thanks for the report. Since there is a lot of information, here is my paraphrased understanding. Please let me know if I misunderstood anything.
@prattmic Thanks for quick response!
Yes!
Yes, and isolated.
Exactly.
That's right. The Python scenario is for reference; during tests with the Go app, the Python app was of course disabled.
This sounds like a limitation in the CPU isolation functionality in the Linux kernel, not a Go issue. If the Go program is isolated away from a CPU, then its behavior should presumably not affect that CPU, but it seems that it does. With regard to the interrupts, some possibilities I could imagine: [list elided]
@prattmic Generally, I agree. However, please look closely at the Python scenario :) As you can see in both [outputs elided from this excerpt]:
Here is the difference in these interrupts for "BASELINE" (only [elided]). Here is [the Python run; output elided]. And here is [the Go run; output elided]. We can see that [elided]. What and where should be tuned/configured/changed/fixed to achieve that?
I acknowledge that there is a difference between your Python and Go cases, but the Python case is also simpler. e.g., with CPython 3.11 on my machine, that program is single-threaded. On the other hand, Go programs are always multi-threaded, even with only one goroutine. So there is more interesting surface in the Go case. Given the connection with isolated CPUs (a kernel feature) and interrupts (which Go has no direct control over), I think this issue would be better served on a Linux kernel mailing list / issue tracker, where there are more experts in these areas. If there are concrete things we are doing wrong that makes us a bad neighbor, then we can consider whether they should change.
+1 to what Michael said, and to add on, here are a few things you can try to attempt to isolate the issue:
However, given that you're seeing jitter on the order of 10 seconds, I don't think there's any obvious thing (to us, anyway) the Go runtime is doing that would cause the jitter. Honestly, I don't really expect any of the 3 things above to have an effect, but it's something to rule out. Beyond that, I'm not sure we can provide more help here at this time.
Even without printing anything (just [code elided]).
Forgot to mention that I've tried that already. Also tried [options elided].
This looked promising; unfortunately it didn't change anything :/
Is the Python program even a good comparison regarding a multithreaded program, given the GIL?
@andig From a low-level perspective (e.g. the GIL) such a comparison may not be fair, I agree. However, it makes sense for someone who wants to choose a language for an application based on how good or bad a neighbor it will be in the OS. I did two more tests...
JITTER tool - RUST
Originally, I used [toolchain elided]. Later, I also built it using [flags elided],
and the results were the same. But look at the results I got when I built this app using [elided]:
Comment: [elided]
Notes: [elided]
It would be great to find which commit from these: https://github.com/golang/go/issues?q=milestone%3AGo1.21.1+label%3ACherryPickApproved actually fixed this problem, and the root cause of why.
IIRC, nothing in 1.21.1 intentionally affected scheduling. My best guess would be #62329. To be sure, you should bisect on the release branch.
@prattmic Right, #62329 ([elided]). These are the commits between [the two releases; list elided]:
So I compiled [candidate commits; details elided]. After that I compiled [elided]. So we can assume that a side effect of fixing #62329 (original PR: #61718 created by @dominikh) is that the problems described in this issue were also fixed. Here is the commit: 06df329
And I'm still very curious what the root cause of the problem described in this issue was.
It's a long story. https://go.dev/doc/gc-guide#Linux_transparent_huge_pages has some background, but from there I think most of the detail is in #61718. Going all the way back, this started with https://bugzilla.kernel.org/show_bug.cgi?id=93111, which is why the runtime (for releases prior to Go 1.21.0) was calling [the madvise calls discussed in #61718; details elided].
Coming back to the current issue, @prattmic theorizes that forcing huge pages on and off via [those madvise calls; details elided].
The TL;DR is that Go 1.21.1 doesn't make these calls anymore. There are a handful of cases where they're still used, but the maximum number of calls to these syscalls in any given program is constant and small.
@mknyszek Thank you very much for the detailed response! I have one more question about applications built with Go 1.21.1: I'm wondering (and I don't have a way to check it right now) what impact the correct execution of the [elided] would have.
What version of Go are you using (go version)?
[output elided]
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
go env output: [elided]
What did you do?
Details below.
What did you expect to see?
I expect to see no interrupts on isolated cores, and I want to run Go-based applications in environments where low-latency workloads are executed.
What did you see instead?
Interrupts on isolated cores on which low-latency work is performed. The interrupts are caused by a simple Go application (printing "Hello world" every 1 second) which runs on different, non-isolated cores.
Summary
LOC, IWI, RES and CAL interrupts are observed on isolated cores on which a low-latency benchmark is performed. The interrupts are caused by a simple Go application (printing "Hello world" every 1 second) running on different, non-isolated cores. A similar Python application doesn't cause such problems.
Tested on
Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-68-lowlatency x86_64)
(compiled with: "Full Dynticks System (tickless)" and "No Forced Preemption (Server)").
Hardware:
2 x Intel(R) Xeon(R) Gold 6438N (32 cores each)
BIOS:
Hyperthreading disabled
OS and configuration:
Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-68-lowlatency x86_64)
(compiled with: "Full Dynticks System (tickless)" and "No Forced Preemption (Server)" from https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/log/?h=lowlatency-next)
irqbalance stopped and disabled.
Based on the workload type, experiments, and knowledge found on the Internet, the following kernel parameters were used: [elided]
For every core/socket: Cx states (for x > 0) were disabled, a particular power governor was used, and fixed uncore values were set. To achieve that, the power.py script from https://github.com/intel/CommsPowerManagement was used.
prepare_cpus.sh - script for setting this up, and its results: [elided]
CPUs 20-23 are "isolated" (thanks to proper kernel parameters) - the benchmark/workload will be run on them.
Kernel threads were moved from CPUs 20-23.
get_irqs.sh - script which checks which target CPUs are permitted for a given IRQ source; output of the mentioned script: [elided]
lscpu output: [elided]
JITTER tool - Baseline
jitter is a benchmarking tool meant for measuring the "jitter" in execution time caused by the OS and/or the underlying architecture.
Put the run_jitter.sh script inside the above directory.
Run:
Results:
Comment:
The jitter tool shows intervals and jitter in CPU core cycles. The benchmark is done on a 2000 MHz core, so on the graph the values are divided by 2 and presented in nanoseconds.
Very stable results, no significant jitters (max jitter: 51 ns) during 335 seconds.
No interrupts were delivered to isolated CPU20 during the benchmark.
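The cycles-to-nanoseconds conversion described above is simple arithmetic. A sketch (the helper name is mine): at 2000 MHz, one nanosecond is exactly two cycles, hence "divided by 2".

```go
package main

import "fmt"

// cyclesToNanos converts CPU core cycles to nanoseconds for a core running
// at the given frequency in MHz. At 2000 MHz this is a division by 2.
func cyclesToNanos(cycles uint64, mhz uint64) float64 {
	return float64(cycles) * 1000.0 / float64(mhz)
}

func main() {
	// 102 cycles on a 2000 MHz core correspond to 51 ns.
	fmt.Println(cyclesToNanos(102, 2000))
}
```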
JITTER tool - Python
hello.py - simple Python app which prints "Hello world" every 1 second
run_python_hello.sh - script to run the Python app on a particular (non-isolated) core
In the first console ./run_python_hello.sh was started; in the second console ./run_jitter.sh was run.
Results:
Comment:
An acceptable result: one noticeable jitter (1190 ns); the remaining jitters did not exceed 60 ns during 336 seconds.
No interrupts were delivered to isolated CPU20 during the benchmark.
JITTER tool - Golang
hello.go - simple Golang app which prints "Hello world" every 1 second
go.mod - Go module definition
run_go_hello.sh - script to run the Go app on a particular (non-isolated) core
In the first console the Go app was built (go build) and started (./run_go_hello.sh); in the second console ./run_jitter.sh was run.
Results:
Comment:
34 significant jitters (the worst: 44961 ns) during 335 seconds.
The following interrupts were delivered to isolated CPU20 during the benchmark:
LOC: 67
IWI: 34
RES: 34
CAL: 34
It seems that a jitter occurs roughly every ~10 s.
What is also interesting is that no interrupts were delivered to the idle, isolated CPU22 and CPU23 during the benchmark. On CPU24 (not isolated) only LOC interrupts were delivered (335283 of them).
Notes:
I also tried cpuset with its shield turned on. Unfortunately, the results were even worse (jitters were "bigger" and more interrupts were delivered to shielded cores); moreover, cset was not able to move kernel threads outside of the shielded pool.
I also tried an RT kernel (GNU/Linux 5.15.65-rt49 x86_64 -> https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.15.65.tar.gz patched with https://cdn.kernel.org/pub/linux/kernel/projects/rt/5.15/older/patch-5.15.65-rt49.patch.gz), and the problem with interrupts and jitters caused by the Go app doesn't exist there. However, an RT kernel is not the best solution for everyone, and it would be great to not have jitters on the lowlatency tickless kernel either.
Tested Go versions: 1.19.x and 1.20.2.