Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: FreeBSD nanotime performance regression with different CPU setups #50947

Open
TriAnMan opened this issue Feb 1, 2022 · 24 comments
Open
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-FreeBSD Performance
Milestone

Comments

@TriAnMan
Copy link

TriAnMan commented Feb 1, 2022

The setup

$ go version
go version go1.17.6 freebsd/amd64
$ freebsd-version
12.2-RELEASE

1.17.6 is currently the latest release.

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="freebsd"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="freebsd"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/freebsd_amd64"
GOVCS=""
GOVERSION="go1.17.6"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1526234500=/tmp/go-build -gno-record-gcc-switches"

The case

I've found that the performance of the scheduler on FreeBSD highly depends on the CPU setup I have. While it is OK for a one-CPU VirtualBox instance running on my Mac, it decreases a lot either for more-than-one-CPU VirtualBox or for any-number-of-CPUs DigitalOcean instances (as far as I know DigitalOcean uses KVM as a hypervisor for their instances).

This benchmark can be used as a minimal reproducer

$ cat scheduler_test.go
package whatever

import (
	"testing"
)

func BenchmarkScheduler(b *testing.B) {
	ch := make(chan struct{})

	go func() {
		for range ch {}
	}()

	for i := 0; i < b.N; i++ {
		ch <- struct{}{}
	}
}
go test -bench=. scheduler_test.go

While this benchmark runs on a single-CPU VirtualBox instance it performs quite convenient.

$ sysctl hw.ncpu
hw.ncpu: 1

$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkScheduler 	 7658449	       152.8 ns/op
PASS
ok  	command-line-arguments	1.331s

However if the benchmark was run on more-than-one-CPU VirtualBox or on a DigitalOcean instance with any number of CPUs the significant decrease in performance emerges.

$ sysctl hw.ncpu
hw.ncpu: 2

$ go test -bench=. scheduler_test.go
goos: freebsd
goarch: amd64
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkScheduler-2   	 1108549	      1163 ns/op
PASS
ok  	command-line-arguments	2.213s
$ sysctl hw.ncpu
hw.ncpu: 1

$ go test -bench=. scheduler_test.go

goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkScheduler 	  706951	      1949 ns/op
PASS
ok  	command-line-arguments	1.405s

Profiling shows that the scheduler code for FreeBSD relies highly on time machinery. And runtime.nanotime() performance highly affects the scheduler.

Execution flow for 1CPU VirtualBox 1CPU VirtualBox
Execution flow for 2CPUs VirtualBox 2CPUs VirtualBox
Execution flow for 1CPU DigitalOcean 1CPU DigitalOcean

I've also made benchmarks for the time.Now() function in the same setups. The results of them correlates with scheduler ones.

@TriAnMan
Copy link
Author

TriAnMan commented Feb 1, 2022

#17712 may be somehow related to this.

@ianlancetaylor
Copy link
Contributor

CC @aclements @mknyszek @prattmic

@ianlancetaylor ianlancetaylor added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 1, 2022
@ianlancetaylor ianlancetaylor added this to the Go1.19 milestone Feb 1, 2022
@prattmic
Copy link
Member

prattmic commented Feb 1, 2022

Your profiles show the three different cases of time source:

1CPU VirtualBox -> TSC
2CPU VirtualBox -> HPET (VMM emulated? I'm not too familiar with HPET virtualization)
1CPU DigitalOcean -> Neither, fall back to system call

I am not very familiar with FreeBSD timekeeping, but from my experience with Linux timekeeping your results match my expectations: using the TSC is most efficient, followed by userspace-accessible HPET, following by actually making a system call.

Regarding how to handle this, for the HPET case the resolution should be in figuring out why this system is in HPET mode. There shouldn't be much reason for a hypervisor on a CPU as modern as you have to not support use of the TSC on SMP VMs. Either something in the hypervisor is disabling access to the TSC, or something in the FreeBSD kernel is deciding the TSC isn't reliable.

For the DigitalOcean case, the first step is figuring out which clock mode the kernel is putting the VDSO in (perhaps print from the function linked above). I see that FreeBSD supports a few more modes including a KVM paravirtualized clock. If it is in one of those modes, perhaps our VDSO code needs to add support.

@prattmic prattmic changed the title runtime: FreeBSD scheduler performance regression with different CPU setups runtime: FreeBSD nanotime performance regression with different CPU setups Feb 1, 2022
@prattmic
Copy link
Member

prattmic commented Feb 1, 2022

Based on freebsd/freebsd-src@8cc15b0, it looks like FreeBSD doesn't trust the TSC on VirtualBox for some reason (N.B. test_tsc only runs on SMP systems).

@TriAnMan
Copy link
Author

TriAnMan commented Feb 1, 2022

I've got sysctl kern.timecounter outputs for my setups.

1CPU VirtualBox
kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: ACPI-fast(900) i8254(0) TSC-low(1000) dummy(-1000000)
kern.timecounter.hardware: TSC-low
kern.timecounter.alloweddeviation: 5
kern.timecounter.timehands_count: 2
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 295440892
kern.timecounter.tc.ACPI-fast.mask: 4294967295
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 50397
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.TSC-low.quality: 1000
kern.timecounter.tc.TSC-low.frequency: 1296029345
kern.timecounter.tc.TSC-low.counter: 3889432018
kern.timecounter.tc.TSC-low.mask: 4294967295
2CPU VirtualBox
kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: ACPI-fast(900) i8254(0) TSC-low(-100) dummy(-1000000)
kern.timecounter.hardware: ACPI-fast
kern.timecounter.alloweddeviation: 5
kern.timecounter.timehands_count: 2
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 3480961343
kern.timecounter.tc.ACPI-fast.mask: 4294967295
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 35839
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.TSC-low.quality: -100
kern.timecounter.tc.TSC-low.frequency: 1296127369
kern.timecounter.tc.TSC-low.counter: 2127372050
kern.timecounter.tc.TSC-low.mask: 4294967295
1CPU DigitalOcean
kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 0
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: i8254(0) ACPI-fast(900) HPET(950) TSC-low(800) dummy(-1000000)
kern.timecounter.hardware: HPET
kern.timecounter.alloweddeviation: 5
kern.timecounter.timehands_count: 2
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.counter: 43479
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.ACPI-fast.quality: 900
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.counter: 7730510
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.HPET.quality: 950
kern.timecounter.tc.HPET.frequency: 100000000
kern.timecounter.tc.HPET.counter: 2395020808
kern.timecounter.tc.HPET.mask: 4294967295
kern.timecounter.tc.TSC-low.quality: 800
kern.timecounter.tc.TSC-low.frequency: 1247085954
kern.timecounter.tc.TSC-low.counter: 3806080953
kern.timecounter.tc.TSC-low.mask: 4294967295

@prattmic
Copy link
Member

prattmic commented Feb 1, 2022

Oops, this makes me realized I swapped VirtualBox and DigitalOcean. The VB profile shows fallback to syscall, while DigitalOcean shows use of HPET.

@prattmic
Copy link
Member

prattmic commented Feb 1, 2022

Summary of what I see from that:

1CPU VirtualBox: TSC is fine.
2CPU VirtualBox: FreeBSD doesn't trust TSC on VirtualBox, and there is no HPET -> Syscall fallback is only option.
1CPU DigitalOcean: TSC-low(800) implies this path. Perhaps DigitalOcean is mucking with the vendor name in CPUID and breaking FreeBSD's detection -> HPET.

I don't think there is a lot for Go to do here.

For VirtualBox, FreeBSD should consider trusting their TSC, or perhaps more likely, convince them to fix their TSC.

For DigitalOcean, FreeBSD should look at how DigitalOcean reports CPU vendor so they can detect it as Intel and use TSC if appropriate.

The one thing we may want to do is drop HPET entirely if system calls are faster, as your benchmark implies. Of course, your benchmarks are on completely different machines, so we'd want benchmarks of syscall vs HPET on the same machine to see if that is actually true. In theory it shouldn't be, as the system call probably results in HPET use in the kernel.

@TriAnMan
Copy link
Author

TriAnMan commented Feb 2, 2022

Got benchmarks and profiles for different kern.timecounter.hardware on 1CPU DigitalOcean

HPET
$ sysctl kern.timecounter.hardware=HPET
kern.timecounter.hardware: HPET -> HPET

$ go test -bench=. scheduler_test.go goos: freebsd goarch: amd64 cpu: DO-Premium-Intel BenchmarkScheduler 616752 1954 ns/op PASS ok command-line-arguments 2.228s
$ go test -bench=. -cpuprofile hpet.out scheduler_test.go goos: freebsd goarch: amd64 cpu: DO-Premium-Intel BenchmarkScheduler 161212 6563 ns/op PASS ok command-line-arguments 1.325s
hpet
ACPI-fast
$ sysctl kern.timecounter.hardware=ACPI-fast
kern.timecounter.hardware: HPET -> ACPI-fast

$ go test -bench=. scheduler_test.go goos: freebsd goarch: amd64 cpu: DO-Premium-Intel BenchmarkScheduler 752794 1663 ns/op PASS ok command-line-arguments 1.273s
$ go test -bench=. -cpuprofile acpi.out scheduler_test.go goos: freebsd goarch: amd64 cpu: DO-Premium-Intel BenchmarkScheduler 190075 5578 ns/op PASS ok command-line-arguments 1.317s
acpi
TSC-low
$ sysctl kern.timecounter.hardware=TSC-low
kern.timecounter.hardware: ACPI-fast -> TSC-low

$ go test -bench=. scheduler_test.go goos: freebsd goarch: amd64 cpu: DO-Premium-Intel BenchmarkScheduler 3701788 313.4 ns/op PASS ok command-line-arguments 1.494s
$ go test -bench=. -cpuprofile tsc.out scheduler_test.go goos: freebsd goarch: amd64 cpu: DO-Premium-Intel BenchmarkScheduler 4263708 296.1 ns/op PASS ok command-line-arguments 1.715s
tsc

As an interesting fact, a profiling with TSC-low seems not consuming additional resources.

@TriAnMan
Copy link
Author

TriAnMan commented Feb 2, 2022

I have also benchmarked time.Now() for different kern.timecounter.hardware on 1CPU DigitalOcean

package whatever

import (
	"testing"
	"time"
)

func BenchmarkTime(b *testing.B) {
	for i := 0; i < b.N; i++ {
		time.Now()
	}
}
$ sysctl kern.timecounter.hardware=HPET
kern.timecounter.hardware: HPET -> HPET

$ go test -bench=. time_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkTime 	  130621	      8971 ns/op
PASS
ok  	command-line-arguments	1.276s

$ sysctl kern.timecounter.hardware=ACPI-fast
kern.timecounter.hardware: HPET -> ACPI-fast

$ go test -bench=. time_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkTime 	  160593	      7183 ns/op
PASS
ok  	command-line-arguments	1.239s

$ sysctl kern.timecounter.hardware=TSC-low
kern.timecounter.hardware: ACPI-fast -> TSC-low

$ go test -bench=. time_test.go
goos: freebsd
goarch: amd64
cpu: DO-Premium-Intel
BenchmarkTime 	16510144	        75.84 ns/op
PASS
ok  	command-line-arguments	1.330s

Notice huge advantage of TSC-low.

@koobs
Copy link

koobs commented Feb 2, 2022

Edit: The reverse is true, see: #50947 (comment)


Original Comment:

FWIW, and hopefully it's helpful... with the following simple benchmark, I found HPET significantly faster than TSC-Low, which was being selected for my FreeBSD Guests (12.x and CURRENT, 2 core, 8gb each) on VirtualBox (Win64, Intel 2600K) prompting setting a HPET value in /etc/sysctl.conf. I had been testing to try to identify performance differences between VirtualBox's Paravirtualization provider settings.

If I also recall correctly, there were some timecounter changes around that time (benchmark result files are dated Apr 18 2021), and/or discussion about TSC/ACPI-Fast performance particularly in FreeBSD's hypervisor detection code (so its worth going over those logs, if one hasn't already).

I seem to recall that detection was 'incorrect' (or not optimal) for a time, and then updated as the HV strings had changed or something.

sysctl.conf and benchmark results

# [koobs@140-CURRENT-amd64-564d:~/timecounter] ministat -s -w80 hpet tsc-low acpi-fast
# x hpet
# + tsc-low
# * acpi-fast
# +--------------------------------------------------------------------------------+
# | x                                 *                                         +  |
# |xx                                **                                         ++ |
# |xx                               ***                                         ++ |
# |xx                               ****                                     + ++++|
# ||A                                                                              |
# |                                                                            |A_||
# |                                  |A|                                           |
# +--------------------------------------------------------------------------------+
#     N           Min           Max        Median           Avg        Stddev
# x  10        139595        143041        142209      141704.5     1204.2239
# +  10        355667        371410      366164.5      366057.4       4360.75
# Difference at 95.0% confidence
#         224353 +/- 3005.7
#         158.324% +/- 2.51149%
#         (Student's t, pooled s = 3198.93)
# *  10        237116        244564      241635.5      241141.3     2735.9878
# Difference at 95.0% confidence
#         99436.8 +/- 1986.06
#         70.1719% +/- 1.60272%
#         (Student's t, pooled s = 2113.74)

kern.timecounter.hardware=HPET # TSC-Low is selected but is much slower

Bench

#include <sys/time.h>
#include <stdio.h>
#include <time.h>

#define COUNT 1000000

int main() {
    struct timespec ts_start, ts_stop, ts_read;
    double time;
    int i;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);
    for(i = 0; i < COUNT; i++) {
        clock_gettime(CLOCK_MONOTONIC, &ts_read);
    }
    clock_gettime(CLOCK_MONOTONIC, &ts_stop);
    time = (ts_stop.tv_sec - ts_start.tv_sec) + (ts_stop.tv_nsec
    - ts_start.tv_nsec) * 1E-9;
    printf("%.0f\n", COUNT / time);
}

@TriAnMan
Copy link
Author

TriAnMan commented Feb 2, 2022

Enabling HPET for my VirtualBox and using @koobs benchmark along with time.Now() benchmark, got this results for different kern.timecounter.hardware and setups:

1CPU VBox

$ sysctl kern.timecounter.choice
kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(1000) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ ./clock_test
580179
$ go test -bench=. time_test.go
BenchmarkTime 	  312913	      3474 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ ./clock_test
727058
$ go test -bench=. time_test.go
BenchmarkTime 	  427713	      2771 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
$ ./clock_test
30036568
$ go test -bench=. time_test.go
BenchmarkTime 	24549586	        45.58 ns/op

2CPU VBox

$ sysctl kern.timecounter.choice
kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(-100) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ ./clock_test
576028
$ go test -bench=. time_test.go
BenchmarkTime-2   	  333536	      3542 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ ./clock_test
828282
$ go test -bench=. time_test.go
BenchmarkTime-2   	  425350	      2443 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
$ ./clock_test
1138499
$ go test -bench=. time_test.go
BenchmarkTime-2   	  575733	      1812 ns/op

1CPU DO

$ sysctl kern.timecounter.choice
kern.timecounter.choice: i8254(0) ACPI-fast(900) HPET(950) TSC-low(800) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ ./clock_test
232181
$ go test -bench=. time_test.go
BenchmarkTime 	  123722	      9597 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ ./clock_test
262493
$ go test -bench=. time_test.go
BenchmarkTime 	  185848	      7891 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
$ ./clock_test
28790175
$ go test -bench=. time_test.go
BenchmarkTime 	14390761	        80.54 ns/op

2CPU DO

$ sysctl kern.timecounter.choice
kern.timecounter.choice: i8254(0) ACPI-fast(900) HPET(950) TSC-low(-100) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ ./clock_test
251691
$ go test -bench=. time_test.go
BenchmarkTime-2   	  126259	      9750 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ ./clock_test
307210
$ go test -bench=. time_test.go
BenchmarkTime-2   	  125708	      8008 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
kern.timecounter.hardware: ACPI-fast -> TSC-low
$ ./clock_test
29018696
$ go test -bench=. time_test.go
BenchmarkTime-2   	15485086	        80.60 ns/op

@TriAnMan
Copy link
Author

TriAnMan commented Feb 2, 2022

@koobs looking at your output of minstat I wonder which data was in the hpet tsc-low acpi-fast files?
If the data was results of multiple runs of your benchmark then + tsc-low shows better results than either x hpet or * acpi-fast: see Median or Average columns.

@prattmic
Copy link
Member

prattmic commented Feb 2, 2022

FWIW, and hopefully it's helpful... with the following simple benchmark, I found HPET significantly faster than TSC-Low, which was being selected for my FreeBSD Guests (12.x and CURRENT, 2 core, 8gb each) on VirtualBox (Win64, Intel 2600K)

@koobs Looking into VirtualBox a little more, it seems that they have a variety of conditions to select between hardware TSC (no VM exit) and emulated TSC (VM exit on every RDTSC), most notable of which is that it sounds like (not 100% sure) they require VMX TSC offsetting support to use hardware TSC. Intel 2600K is a older Sandy Bridge core and consumer SKU, so perhaps it doesn't support TSC offsetting, hence a fallback to emulated TSC with poor performance. I think you'd need to read various VMX capability MSRs to verify.

@TriAnMan Very interesting that you consistently find ACPI-fast via a syscall faster than userspace HPET. Though it is still not clear if the syscall would still be faster if the system was still in kern.timecounter.hardware=HPET, but Go made a syscall anyways. I suspect that the kernel would use the same HPET source and still be slow.

I still don't think that there is anything for Go itself to do here. It is clear that FreeBSD's kernel clock selection is not always optimal. Go could second guess the kernel's choice in some cases (HPET vs syscall), but even if that helps, it would only help Go programs. If the kernel itself improves its clock selection then all programs on the system will benefit.

@mpx
Copy link
Contributor

mpx commented Feb 2, 2022

Perhaps some documentation could be considered if Go is particularly affected? Eg, platform caveats/notes.

This seems like something people might assume id Go's fault (and is unlikely to be widely understood).

@koobs
Copy link

koobs commented Feb 2, 2022

@koobs looking at your output of minstat I wonder which data was in the hpet tsc-low acpi-fast files? If the data was results of multiple runs of your benchmark then + tsc-low shows better results than either x hpet or * acpi-fast: see Median or Average columns.

Ugh, thanks to you, I just realised the bench reports syscall counts and not latencies. I must have been drowning in yaks that day. My apologies :|

So contrary to my original: #50947 (comment) TSC-Low is the fastest, and HPET the slowest.

The numbers for acpi-fast, hpet and tsc-low are (with a 1 million rounds per run, on 2-core vbox guest)

ACPI-fast 237301 243384 242953 244564 243750 242997 240318 239218 239812 237116
HPET 140750 142011 142703 142432 139595 140107 143041 142407 142729 141270
TSC-Low 366357 365972 369583 364411 369548 371410 367791 364049 365786 355667

@koobs
Copy link

koobs commented Feb 2, 2022

This was the relevent commit I referred to earlier re TSC & VirtualBox:

freebsd/freebsd-src@8cc15b0

Note: Things may have changed since, I have not reviewed TSC or other timecounter changes since then. I've not observed issues with TSC myself, but this is what FreeBSD is seeing and doing currently (CURRENT)

Timecounter "HPET" frequency 14318180 Hz quality 950
Timecounter "i8254" frequency 1193182 Hz quality 0
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
Timecounter "TSC-low" frequency 1700022709 Hz quality 0

kern.timecounter.choice: TSC-low(0) ACPI-fast(900) i8254(0) HPET(950) dummy(-1000000)

On 12.3 (the above commit was not merged to stable/12)

Timecounter "HPET" frequency 14318180 Hz quality 950
Timecounter "i8254" frequency 1193182 Hz quality 0
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900

kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(-100) dummy(-1000000)

Additional References:

@TriAnMan
Copy link
Author

TriAnMan commented Feb 3, 2022

I've made some benchmarks with the commit switching off TSC and HPET paths in Go: TriAnMan@6160d56

1CPU VBox

$ go version
go version devel go1.18-6160d56d11 Thu Feb 3 21:58:18 2022 +0300 freebsd/amd64

$ sysctl kern.timecounter.choice
kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(1000) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ go test -bench=. time_test.go
BenchmarkTime 	  303352	      3896 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ go test -bench=. time_test.go
BenchmarkTime 	  425329	      2832 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
$ go test -bench=. time_test.go
BenchmarkTime 	 3125544	       374.6 ns/op

2CPU VBox

$ go version
go version devel go1.18-6160d56d11 Thu Feb 3 21:58:18 2022 +0300 freebsd/amd64

$ sysctl kern.timecounter.choice
kern.timecounter.choice: ACPI-fast(900) i8254(0) HPET(950) TSC-low(-100) dummy(-1000000)

$ sysctl kern.timecounter.hardware=HPET
$ go test -bench=. time_test.go
BenchmarkTime-2   	  272592	      3955 ns/op

$ sysctl kern.timecounter.hardware=ACPI-fast
$ go test -bench=. time_test.go
BenchmarkTime-2   	  442884	      2477 ns/op

$ sysctl kern.timecounter.hardware=TSC-low
$ go test -bench=. time_test.go
BenchmarkTime-2   	  504488	      2195 ns/op

So it looks like syscall uses a timecounter configured with sysctl. The results for HPET and TSC-low are consistent with previous ones #50947 (comment) except having some overhead on making a syscall.

@gopherbot
Copy link

Change https://go.dev/cl/391856 mentions this issue: runtime: fast clock_gettime on FreeBSD, split getHPETTimecounter

@paulzhol
Copy link
Member

paulzhol commented Mar 11, 2022

I've sent a small improvement above, for the HPET timecounter path to switch to the system stack only once on first call. My system (AMD FX-8300 hardware) shows a small improvement in BenchmarkNow. However it behaves very differently than yours.

kern.timecounter.tsc_shift: 1
kern.timecounter.smp_tsc_adjust: 0
kern.timecounter.smp_tsc: 1
kern.timecounter.invariant_tsc: 1
kern.timecounter.fast_gettime: 1
kern.timecounter.tick: 1
kern.timecounter.choice: ACPI-safe(850) HPET(950) i8254(0) TSC-low(1000) dummy(-1000000)
kern.timecounter.hardware: TSC-low

For example forcing HPET, then disabling kern.timecounter.fast_gettime (which forces Go to fallback to the regular syscall)

if timekeepSharedPage.enabled == 0 {
return zeroBintime
}

//go:nosplit
func nanotime1() int64 {
bt := vdsoClockGettime(_CLOCK_MONOTONIC)
if bt == zeroBintime {
return fallback_nanotime()
}
return int64((1e9 * uint64(bt.sec)) + ((1e9 * uint64(bt.frac>>32)) >> 32))
}
func walltime() (sec int64, nsec int32) {
bt := vdsoClockGettime(_CLOCK_REALTIME)
if bt == zeroBintime {
return fallback_walltime()
}
return int64(bt.sec), int32((1e9 * uint64(bt.frac>>32)) >> 32)
}

There's a noticeable difference:

root@relic:~ # sysctl -w kern.timecounter.hardware=HPET
kern.timecounter.hardware: TSC-low -> HPET

paulzhol@relic:~/go/src/time % ../../bin/go test -run=NONE -bench=BenchmarkNow ./... > old_hpet.txt

root@relic:/tmp # sysctl kern.timecounter.fast_gettime=0
kern.timecounter.fast_gettime: 1 -> 0

paulzhol@relic:~/go/src/time % ../../bin/go test -run=NONE -bench=BenchmarkNow ./... > old_hpet_no_fast.txt

paulzhol@relic:~/go/src/time % ~/gocode/bin/benchcmp old_hpet_no_fast.txt old_hpet.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                   old ns/op     new ns/op     delta
BenchmarkNow-8              1832          1420          -22.49%
BenchmarkNowUnixNano-8      1834          1421          -22.52%
BenchmarkNowUnixMilli-8     1835          1423          -22.45%
BenchmarkNowUnixMicro-8     1833          1423          -22.37%

Specifically ACPI-safe is much slower:

paulzhol@relic:~/go/src/time % ~/gocode/bin/benchcmp baseline_acpi.txt old_hpet.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                   old ns/op     new ns/op     delta
BenchmarkNow-8              6843          1420          -79.25%
BenchmarkNowUnixNano-8      6946          1421          -79.54%
BenchmarkNowUnixMilli-8     6914          1423          -79.42%
BenchmarkNowUnixMicro-8     6923          1423          -79.45%

FreeBSD 12.3 added support for kvmclock, I believe it should be much faster than using HPET (which is essentially a non-paravirtualized emulated hardware device). I'll try to prepare a PR for it when I have the time.

@ianlancetaylor
Copy link
Contributor

What is the current status here? This issue is currently in the 1.19 milestone. Should it move to 1.20? To Backlog? Thanks.

@paulzhol
Copy link
Member

I did a small improvement on the HPET code path, it has a +2 review and wait-release in the PR. I'll do the kvmclock after I've finished with removal of the COMPAT_FREEBSD11. Moving to 1.20.

@paulzhol paulzhol modified the milestones: Go1.19, Go1.20 Jun 24, 2022
@paulzhol paulzhol self-assigned this Jun 24, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
@gopherbot
Copy link

Change https://go.dev/cl/422977 mentions this issue: runtime: fast clock_gettime on FreeBSD, use physical/virtual ARM timer

@gopherbot
Copy link

Change https://go.dev/cl/422978 mentions this issue: runtime: fast clock_gettime on FreeBSD, add support for pvclock(kvmclock) and Hyper-V reference TSC

gopherbot pushed a commit that referenced this issue Aug 19, 2022
Call only initHPETTimecounter on the system stack.
Use O_CLOEXEC flag when opening the HPET device.

FreeBSD 12.3-RELEASE-p2, AMD FX-8300
paulzhol@relic:~/go/src/time % ~/gocode/bin/benchcmp old_hpet.txt new_hpet.txt
benchcmp is deprecated in favor of benchstat: https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
benchmark                   old ns/op     new ns/op     delta
BenchmarkNow-8              1420          1397          -1.62%
BenchmarkNowUnixNano-8      1421          1404          -1.20%
BenchmarkNowUnixMilli-8     1423          1405          -1.26%
BenchmarkNowUnixMicro-8     1423          1404          -1.34%

Update #50947

Change-Id: I553b5427fb0b86d7e070af4516b36326bc0aaf00
Reviewed-on: https://go-review.googlesource.com/c/go/+/391856
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Sep 4, 2022
as setup by the kernel on GOARCH=arm64.

Update #50947

Change-Id: I2f44be9b36e9ce8d264eccc0aa3df10825c5f4f9
Reviewed-on: https://go-review.googlesource.com/c/go/+/422977
Reviewed-by: Heschi Kreinick <heschi@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Meng Zhuo <mzh@golangcn.org>
Reviewed-by: Dmitri Goutnik <dgoutnik@gmail.com>
Run-TryBot: Yuval Pavel Zholkover <paulzhol@gmail.com>
@gopherbot gopherbot modified the milestones: Go1.20, Go1.21 Feb 1, 2023
@mknyszek
Copy link
Contributor

mknyszek commented Jun 9, 2023

What else is left to do here? There appears to be one outstanding CL. Since this was already moved forward a few times, I'm moving it to the backlog for now. Thanks.

@mknyszek mknyszek modified the milestones: Go1.21, Backlog Jun 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-FreeBSD Performance
Projects
Status: In Progress
Status: No status
Development

No branches or pull requests

8 participants