Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/http/pprof: TestDeltaProfile failures missing mutexHog2 on ARM architectures #50218

Open
bcmills opened this issue Dec 16, 2021 · 19 comments
Assignees
Labels
arch-arm Issues solely affecting the 32-bit arm architecture. arch-arm64 NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@bcmills
Copy link
Contributor

bcmills commented Dec 16, 2021

--- FAIL: TestDeltaProfile (53.30s)
    pprof_test.go:206: want mutexHog2 but no mutexHog1 in the profile, and non-zero p.DurationNanos, got PeriodType: contentions count
        Period: 1
        Time: 2021-12-16 19:51:59.42219359 +1100 AEDT
        Duration: 32.041535781s
        Samples:
        contentions/count delay/nanoseconds
        Locations
        Mappings
        1: 0x0/0x0/0x0   [FN]
    pprof_test.go:214: want both mutexHog1 and mutexHog2 in the profile, got PeriodType: contentions count
        Period: 1
        Time: 2021-12-16 19:51:59.43107526 +1100 AEDT
        Samples:
        contentions/count delay/nanoseconds
                  9   17621016: 1 2 3 
                  1     149081: 1 4 3 
        Locations
             1: 0x851d3 M=1 sync.(*Mutex).Unlock /home/gopher/build/go/src/sync/mutex.go:214 s=0
             2: 0x2d249f M=1 net/http/pprof.mutexHog1 /home/gopher/build/go/src/net/http/pprof/pprof_test.go:107 s=0
             3: 0x2d2937 M=1 net/http/pprof.mutexHog.func1 /home/gopher/build/go/src/net/http/pprof/pprof_test.go:148 s=0
             4: 0x2d24ab M=1 net/http/pprof.mutexHog1 /home/gopher/build/go/src/net/http/pprof/pprof_test.go:108 s=0
        Mappings
        1: 0x0/0x0/0x0   [FN]
FAIL
FAIL	net/http/pprof	57.491s

greplogs --dashboard -md -l -e 'FAIL: TestDeltaProfile ' --since=2021-01-01

2021-12-16T00:34:10-7f23145/openbsd-arm-jsing
2021-11-25T00:02:52-b2a5a37/openbsd-arm-jsing
2021-06-04T17:33:24-831f937/openbsd-arm-jsing

(Forked from #38544, which was mitigated in CL 229498.)

CC @4a6f656c @hyangah

@bcmills bcmills added arch-arm Issues solely affecting the 32-bit arm architecture. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-OpenBSD labels Dec 16, 2021
@bcmills bcmills added this to the Backlog milestone Dec 16, 2021
@gopherbot
Copy link

Change https://golang.org/cl/372794 mentions this issue: net/http/pprof: skip TestDeltaProfile on openbsd/arm

@bcmills bcmills changed the title net/http/pprof: TestDeltaProfile failures with "got PeriodType: contentions count" on openbsd-arm-jsing net/http/pprof: TestDeltaProfile failures missing mutexHog2 on openbsd/arm Dec 16, 2021
gopherbot pushed a commit that referenced this issue Dec 16, 2021
It is observed to be flaky on the only openbsd/arm builder.
Skipping on that platform until someone can investigate.

For #50218

Change-Id: Id3a6dc12b93b3cec67870d8d81bd608c4589c952
Reviewed-on: https://go-review.googlesource.com/c/go/+/372794
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
Reviewed-by: Hyang-Ah Hana Kim <hyangah@gmail.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@bcmills
Copy link
Contributor Author

bcmills commented Feb 7, 2022

greplogs --dashboard -md -l -e 'FAIL: TestDeltaProfile ' --since=2021-12-17

2022-02-04T22:34:05-f9763a6/android-arm64-corellium

@bcmills bcmills changed the title net/http/pprof: TestDeltaProfile failures missing mutexHog2 on openbsd/arm net/http/pprof: TestDeltaProfile failures missing mutexHog2 on ARM architectures Feb 7, 2022
@bcmills
Copy link
Contributor Author

bcmills commented Feb 7, 2022

Given the android/arm64 failure, this does not seem specific to OpenBSD.
(CC @cherrymui)

@bcmills
Copy link
Contributor Author

bcmills commented Feb 8, 2022

Another android/arm64 today:

greplogs --dashboard -md -l -e 'FAIL: TestDeltaProfile ' --since=2022-02-05

2022-02-07T21:00:02-7db75b3/android-arm64-corellium

@gopherbot
Copy link

Change https://go.dev/cl/383997 mentions this issue: net/http/pprof: skip TestDeltaProfile on all arm and arm64 architectures

@bcmills
Copy link
Contributor Author

bcmills commented Feb 10, 2022

I suspect it may be possible to use something along the lines of the awaitBlockedGoroutine function from CL 384534 to eliminate the timing-dependence of this test.

(I do not plan to follow up on that myself.)

@bcmills bcmills modified the milestones: Backlog, Go1.19 Mar 2, 2022
@bcmills
Copy link
Contributor Author

bcmills commented Mar 2, 2022

Marking as release-blocker for Go 1.19 due to the high failure rate on the builders (CC @golang/release).

If we don't care about this test, we can merge CL 383997 (mailed ~3 weeks ago) to encode that decision as a test-skip. Otherwise, this test failure needs to be addressed.

greplogs --dashboard -md -l -e 'FAIL: TestDeltaProfile ' --since=2022-02-08

2022-03-01T20:52:30-f4722d8/android-arm-corellium
2022-02-28T21:56:43-f04d5c1/android-arm-corellium

gopherbot pushed a commit that referenced this issue Mar 8, 2022
Given that we have seen failures with the same failure mode on both
openbsd/arm and android/arm64, it seems likely that the underlying bug
affects at least all ARM-based architectures.

It appears that either these architectures are not able to sample at
the frequency expected by the test, or the samples are for some reason
being dropped.

For #50218

Change-Id: I42a6c8ecda57448f8068e8facb42a4a2cecbbb37
Reviewed-on: https://go-review.googlesource.com/c/go/+/383997
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@aclements
Copy link
Member

@golang/runtime needs to decide whether to fix and unskip this test, or to drop the test.

@prattmic prattmic assigned prattmic and unassigned jeremyfaller Mar 22, 2022
@dmitshur dmitshur added NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 18, 2022
@mknyszek
Copy link
Contributor

We should decide what to do here. I'll take a look at this.

@mknyszek mknyszek self-assigned this May 18, 2022
@dmitshur
Copy link
Contributor

dmitshur commented Jun 1, 2022

@mknyszek Have you had a chance to revisit this decision? Do we need to make it before beta 1 (aiming for next week)?

@mknyszek
Copy link
Contributor

mknyszek commented Jun 1, 2022

Sorry, I didn't get around to it. I'll do it right now.

@mknyszek
Copy link
Contributor

mknyszek commented Jun 1, 2022

So, this hasn't happened since March. I have a feeling this is related to a few things that were fixed with respect to profiles recently. Curiously, the failures stopped happening on March 7th, and on March 8th https://go.dev/cl/384239 landed to fix #50996. There aren't any other pprof or profile related CLs around this one. The previous one landed in February, and the following one in May. That CL, I think, only affected proto output, but then again that's exactly what the net/http/pprof test is checking, because it obtains the profile via query.

This test is checking a mutex profile, but I imagine it uses the same stack frame information that anything else does. I think it's possible that what's happening in this test is call frames are dropped because these NOPs are actually (created when inlining for a "fake" PC for the inlined function) weren't handled correctly.

However, I'm not totally clear on what the failure mode for #50996 is. Specifically, neither mutexHop1 nor mutexHog2 are inline-able, but they have a few calls that are (like the Lock calls). So is it the parent frame that could get lost? Just the inlined call? The entire stack? I suspect it's the entire stack because if we land on one of these NOPs and don't handle them correctly (but only sometimes?). It's also not clear whether these NOPs play a role here. They probably play some role, given that this is a mutex profile and the Lock and Unlock calls are inlined.

I'm inclined to call this fixed. I haven't fully figured out the details, but I think there is a plausible connection here.

@mknyszek
Copy link
Contributor

mknyszek commented Jun 1, 2022

CC @rhysh @prattmic @cherrymui who were involved to varying degrees in https://go.dev/cl/384239. It might be clearer to y'all what exactly was the problem here and whether that CL fixed this particular issue. What's also puzzling me is, if there was a problem handling these NOPs, the failure should intuitively be more deterministic for a mutex profile, but I'm also not terribly familiar with how the mutex profile works.

@mknyszek
Copy link
Contributor

mknyszek commented Jun 1, 2022

At the very least, I'm moving this into the backlog optimistically to keep release progress moving.

@mknyszek mknyszek modified the milestones: Go1.19, Backlog Jun 1, 2022
@bcmills
Copy link
Contributor Author

bcmills commented Jun 1, 2022

I'm inclined to call this fixed. I haven't fully figured out the details, but I think there is a plausible connection here.

In that case, shall we close the issue until / unless it recurs?

@mknyszek
Copy link
Contributor

mknyszek commented Jun 1, 2022

I'm not opposed, I wanted to give others a chance to chime in, but it's probably less work to just close it and move on. :) If it happens again we can take some of this information forward.

@mknyszek mknyszek closed this as completed Jun 1, 2022
@prattmic prattmic removed their assignment Jun 1, 2022
@prattmic
Copy link
Member

prattmic commented Jun 1, 2022

If I recall correctly, https://go.dev/cl/384239 shouldn't affect this failure. In that case, the failure was that you'd end up with multiple Locations for things that should be the same Location. e.g., one Location including all of the inlined frames, and 2 separate Locations with one containing the inlined frame and the other containing the parent frame.

But all of the frames should still appear in the profile. In this case, the mutexHog2 frame is missing entirely, so this seems different.

Still, I think it is OK to wait and see if this occurs again.

@rhysh
Copy link
Contributor

rhysh commented Jun 1, 2022

https://go.dev/cl/383997 landed on March 8 to skip the test on arm and arm64. So the bug is probably not fixed.

Strange that mutexHog2 doesn't show up even once (in the non-delta profile printed at the end of each failure), since it must run at least once in order for that goroutine to check the done channel. But, if there's something that causes a function to not be reported some of the time (as the comments about gcc's optimizations hint), we wouldn't see all the times mutexHog1 is missing because those cause the test to skip.

@mknyszek
Copy link
Contributor

mknyszek commented Jun 1, 2022

Ha, OK. I missed that when I was walking back over the commit log. Thanks @rhysh. I'm going to reopen this. Still not certain this should block the release at this point, but it looks like there's a real problem and it's currently being masked.

@mknyszek mknyszek reopened this Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch-arm Issues solely affecting the 32-bit arm architecture. arch-arm64 NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
Development

No branches or pull requests

8 participants