runtime: TestRuntimeLockMetricsAndProfile failures #64253
Found new dashboard test flakes for:
2023-11-17 23:04 linux-ppc64-sid-power10 go@0b31a46f runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-17 23:04 windows-386-2016 go@0b31a46f runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-11-17 23:16 windows-amd64-longtest go@f664031b runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-11-17 23:16 linux-ppc64-sid-power10 go@f664031b runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-17 23:34 linux-ppc64-sid-power10 go@631a6c2a runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-11-17 23:15 windows-arm64-11 go@f67b2d8f runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-17 23:16 windows-arm64-11 go@f664031b runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-17 23:34 windows-arm64-11 go@631a6c2a runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-11-19 02:15 windows-amd64-2016 go@aa9dd500 runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-11-19 07:31 linux-ppc64-sid-power10 go@d67ac938 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-19 07:31 windows-arm64-11 go@d67ac938 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-19 15:24 linux-ppc64-sid-power10 go@1c15291f runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-19 15:24 windows-arm64-11 go@1c15291f runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-19 17:06 linux-ppc64-sid-power10 go@2551fffd runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-11-19 21:11 windows-arm64-11 go@06145fe0 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-19 22:05 windows-arm64-11 go@63828938 runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-11-21 21:30 linux-ppc64le-buildlet go@41f58b22 runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-11-21 21:27 linux-386-longtest go@48a6362d runtime.TestRuntimeLockMetricsAndProfile (log)
Change https://go.dev/cl/544375 mentions this issue:
Most of the failures are from before https://go.dev/cl/544195. Since then, I've seen four failures (via fetchlogs/greplogs). The failure at https://build.golang.org/log/c733e91b4693774eb42501bbe93a8ee071ef312e is addressed by https://go.dev/cl/544375.
Found new dashboard test flakes for:
2023-11-21 21:29 linux-ppc64-sid-buildlet go@90ba4452 runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-11-21 21:29 aix-ppc64 go@71052169 runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-11-22 18:12 linux-arm64-longtest go@67600298 runtime.TestRuntimeLockMetricsAndProfile (log)
That's two more ppc64 failures from a clock mismatch (not sure what's up with ppc64's clocks), and one more of the type that https://go.dev/cl/544375 addresses.
Found new dashboard test flakes for:
2023-11-27 17:23 solaris-amd64-oraclerel go@e158cb21 runtime.TestRuntimeLockMetricsAndProfile (log)
Most contention on the runtime locks inside semaphores is observed in runtime.semrelease1, but it can also appear in runtime.semacquire1. When examining contention profiles in TestRuntimeLockMetricsAndProfile, allow call stacks that include either.

For #64253

Change-Id: Id4f16af5e9a28615ab5032a3197e8df90f7e382f
Reviewed-on: https://go-review.googlesource.com/c/go/+/544375
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Rhys Hiltner <rhys@justin.tv>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Note that there are some additional timeouts at #55308 (comment). I'm not sure how to update the watchflakes rule to match these timeouts.
From that issue comment:

Previously, the test involved three threads and required one of them to hold the semaphore's lock, one to contend for it, and the third to notice that contention and mark the test as complete. My interpretation of the

In https://go.dev/cl/544195, I relaxed that requirement: now the test can finish early if one of the threads notices the other two have created the necessary contention; otherwise it gives up after 10,000,000 iterations. (My darwin/arm64 system typically needed 500,000–4,000,000 iterations during the

I continue to struggle with the balance between wanting thorough tests for this part of the runtime (involving concurrency and clocks) and the need for 0% flakes. Maybe the performance dashboard sets an example there, including an opportunity for more output than
Found new dashboard test flakes for:
2023-11-29 19:13 freebsd-arm64-dmgk go@636c6e35 runtime.TestRuntimeLockMetricsAndProfile (log)
This failure appeared on a first-class port on https://go.dev/cl/546635. @bcmills, was that CL based on a commit before https://go.dev/cl/544375 landed, by any chance? EDIT: Yes, it was. Here's the failure: https://ci.chromium.org/ui/p/golang/builders/try-workers/gotip-linux-arm64-test_only/b8762881407384401921/overview
From https://ci.chromium.org/ui/p/golang/builders/try-workers/gotip-linux-arm64-test_only/b8762881407384401921/overview, I get https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8762881407384401921/+/u/step/11/log/2, which shows two "metrics_test.go:1103: want stack" lines, indicating that it includes https://go.dev/cl/544375.

CL 546635 at PS 4 is https://go.googlesource.com/go/+/30e6fc629529abf8da4528f4fdbb5a78363624fb, with parent https://go.googlesource.com/go/+/2e6387cbec924dbd01007421d7442125037c66b2. I don't see a line in the failing build log matching https://go.googlesource.com/go/+/2e6387cbec924dbd01007421d7442125037c66b2/src/runtime/metrics_test.go#1271, which means the test ran for 10,000,000 iterations without the test code noticing contention on the semaphore lock. The mutex profile result shows that the runtime itself didn't encounter contention either. (In #55160, the contention was present but the test had been unable to notice it and so would run forever.)

The "TestRuntimeLockMetricsAndProfile/runtime.lock" test verifies that runtime-internal lock contention can be reported with the correct count, magnitude, and call stack. The role of the "TestRuntimeLockMetricsAndProfile/runtime.semrelease" test is to check that the call stack ends at a particular depth, mostly so we can notice when that changes (so we can update the skip parameter, for example). It's proven trickier to test, since the lock itself isn't under the test's control. Do you have advice on how to test semacquire/semrelease, or is the best option to
Found new dashboard test flakes for:
2023-12-01 21:50 linux-ppc64-sid-buildlet go@3220bbe1 runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-12-08 03:28 aix-ppc64 go@78b42a53 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-12-08 18:34 aix-ppc64 go@6cdf2cca runtime.TestRuntimeLockMetricsAndProfile (log)
Updated the
Found new dashboard test flakes for:
2023-11-19 02:15 linux-arm64-longtest go@aa9dd500 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-21 21:29 linux-s390x-ibm go@71052169 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-21 22:06 linux-amd64-longtest go@5a6f1b35 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-21 22:58 linux-s390x-ibm go@e6b76bfc runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-22 01:59 linux-s390x-ibm go@29ca9cc4 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-22 15:50 linux-s390x-ibm go@37971b31 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-30 13:03 linux-s390x-ibm-race go@b4fa5b16 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-11-30 16:41 linux-s390x-ibm go@76d90a34 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-12-02 03:15 linux-s390x-ibm go@de5b418b runtime.TestRuntimeLockMetricsAndProfile (log)
2023-12-06 21:50 linux-s390x-ibm go@ff9269ee runtime.TestRuntimeLockMetricsAndProfile (log)
2023-12-07 19:13 linux-s390x-ibm-race go@77e76c43 runtime.TestRuntimeLockMetricsAndProfile (log)
2023-12-08 23:05 linux-s390x-ibm go@fe1b2f95 runtime.TestRuntimeLockMetricsAndProfile (log)
@rhysh, note that the failure in #64253 (comment) is on
Found new dashboard test flakes for:
2023-12-11 17:49 linux-ppc64le-buildlet go@20a03fc7 runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-12-11 23:01 linux-s390x-ibm-race go@f4ff0e25 runtime.TestRuntimeLockMetricsAndProfile (log)
Found new dashboard test flakes for:
2023-12-12 17:12 aix-ppc64 go@ce43fb80 runtime.TestRuntimeLockMetricsAndProfile (log)
https://ci.chromium.org/ui/p/golang/builders/try-workers/gotip-linux-arm64-test_only/b8761883502571661809/test-results?sortby=&groupby= on a
Alrighty, I think we need to do something about this test. @rhysh, from your last comment I take it your conclusion is that this is a test-only issue, right? For the moment, maybe let's skip the flakiest of the tests, which seems to be the semrelease one.
Looking at the test, it seems like it just has 3 workers acquiring and releasing the same semaphore. Could you force them to block on the semaphore and operate in lockstep? Is the problem that it never takes the slow path, or that nothing ever blocks on the lock in the slow path? If it's the latter, then operating in lockstep doesn't really help. Would it be possible to manufacture something that also expects similar call stack pruning but is basically just a
Found new dashboard test flakes for:
2023-12-12 19:57 linux-ppc64-sid-buildlet go@4daf924f runtime.TestRuntimeLockMetricsAndProfile (log)
Hm, it seems like the
For
Right, the critical section for manipulating semaphore sleeps is pretty small and not under the control of the (runtime-internal) user of the semaphore. There's also the problem that the runtime has 251 locks, with the hash of the semaphore address determining which to use. So observing contention on one of those shared locks (as the current test does) isn't proof that the test's own semaphore use is the one causing/experiencing the contention.

The failure in 2023-11-29T19:13:04-636c6e3/freebsd-arm64-dmgk saw contention (its output includes

Maybe the semrelease test could look for a particular kind of failure rather than requiring success: if the profile includes the target function (

Or, outright skip the test for Go 1.22. It's unlikely we'll land a change that would accidentally break it further, and the level of detail this is trying to test isn't visible without an opt-in GODEBUG flag anyway.
In triage, we decided to just skip these tests for now. That'll let us mark this okay-after-rc1, but with a goal of further investigation.
Found new dashboard test flakes for:
2023-12-13 20:40 linux-ppc64-sid-buildlet go@d95e25e8 runtime.TestRuntimeLockMetricsAndProfile (log)
More of an explanation: these tests are checking behavior that's behind a

We won't delete the tests because that functionality is going to be revisited for Go 1.23 anyway (by @rhysh, I believe).
Change https://go.dev/cl/550775 mentions this issue:
This test was added to check new mutex profile functionality. Specifically, it checks to make sure that the functionality behind GODEBUG=runtimecontentionstacks works. The runtime currently tracks contention from runtime-internal mutexes in mutex profiles, but it does not record stack traces for them, attributing the time to a dummy symbol. This GODEBUG enables collecting stacks.

Just disable the test. Even if this functionality breaks, it won't affect Go users and it'll help keep the builders green. It's fine to leave the test because this will be revisited in the next dev cycle.

For #64253.

Change-Id: I7938fe0f036fc4e4a0764f030e691e312ec2c9b5
Reviewed-on: https://go-review.googlesource.com/c/go/+/550775
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Eli Bendersky <eliben@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
I think this can still be worked on in Go 1.22 as a test-only fix, but there's really no reason this has to remain a release blocker, even okay-after-rc1. Moving to Go 1.23. |
Found new dashboard test flakes for:
2023-12-06 21:50 gotip-linux-arm64 go@e914671f runtime.TestRuntimeLockMetricsAndProfile/runtime.semrelease (log)
2023-12-06 22:04 gotip-darwin-amd64_11 go@0ff2b33e runtime.TestRuntimeLockMetricsAndProfile/runtime.lock/sample-2 (log)
2023-12-08 00:50 gotip-darwin-amd64_11 go@4bf1ca4b runtime.TestRuntimeLockMetricsAndProfile/runtime.lock/sample-2 (log)
2023-12-08 20:07 gotip-darwin-amd64_13 go@e1cba47e runtime.TestRuntimeLockMetricsAndProfile/runtime.lock/sample-2 (log)
2023-12-11 22:29 gotip-linux-arm64 go@a839ec1e runtime.TestRuntimeLockMetricsAndProfile/runtime.semrelease (log)
2023-12-12 22:10 gotip-darwin-amd64_14 go@dce7f335 runtime.TestRuntimeLockMetricsAndProfile/runtime.lock/sample-2 (log)
Found new dashboard test flakes for:
2023-12-11 23:29 linux-s390x-ibm go@b18b0588 runtime.TestRuntimeLockMetricsAndProfile (log)
Change https://go.dev/cl/581296 mentions this issue:
Fully testing the runtime's profiles and metrics for contention on its internal mutex values involves comparing two separate clocks (cputicks for the profile and nanotime for the metric), verifying its fractional sampling (when MutexProfileRate is greater than 1), and observing a very small critical section outside of the test's control (semrelease). Flakiness (#64253) from those parts of the test has led to skipping it entirely.

But there are portions of the mutex profiling behavior that should be more consistent: for a mutex under the test's control, the test and the runtime should be able to agree that the test successfully induced contention, and should agree on the call stack that caused the contention. Allow those more consistent parts to run.

For #64253

Change-Id: I7f368d3265a5c003da2765164276fab616eb9959
Reviewed-on: https://go-review.googlesource.com/c/go/+/581296
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Joedian Reid <joedian@google.com>
Auto-Submit: Rhys Hiltner <rhys.hiltner@gmail.com>
Issue created automatically to collect these failures.
Example (log):
— watchflakes