runtime: rare stack corruption on Go 1.20 and later #64781
Comments
cc @golang/runtime
Hi @klauspost, others will likely have more useful comments, but two initial questions that spring to mind — are you able to gather exact kernel versions for the different machines where this is observed, and is there an opportunity to try some triage debug settings (for example, along the lines of those outlined in #58212 (comment))?
Hi Go team, thanks for looking into this. I've been able to reproduce it (rarely); here is the kernel used right now:
Customer 1: CGO is disabled for the build, FWIW. Updated the disassembly; it was accidentally compiled for Windows and now shows Linux.
Kernel:
Two places where mistakes tend to happen: One thing I did encounter is that races aren't instrumented in assembly, so if you have a data race, that may cause some pain; annotate your assembly (or anything that might call assembly in the stdlib). The other is that assembly code might write outside of its bounds, in which case, write exhaustive tests across a range of data sizes/offsets, with memory-protected byte slices.
@egonelbre As I wrote, I am pretty sure this isn't a buffer write issue. I have fuzz tests in place that already check for out-of-buffer writes: allocating a bigger buffer, writing random data there, truncating the capacity of the buffer given to compression as the destination, and after compression checking whether the data after the buffer has been touched. Since it (to me at least) appears to be a stack issue, I don't think there are races going on, since stacks are of course never shared. I do run all tests through the race detector as part of CI. However, instrumenting sounds like a good idea anyway - do you have any docs/examples? Google doesn't really turn up anything useful.
Here's a quick demo: https://github.com/egonelbre/compress/commit/c7df670870dc509127572281d79075a7faac8f13. Instrumenting around buffers and pools is also a good idea.
I added
@dctrwatson, that's great, thanks for trying that! What do you think about trying those one at a time, maybe in the reverse order they are listed in #58212 (comment)?
Switching to the next flag.
So looking at the generated assembly, one thing that sticks out is that
Since I don't know anything about GC preemption, it is hard for me to tell whether this is a problem with my code or with the runtime.
What generated assembly, in which function? Thanks.
@cherrymui The avo-generated assembly (example), meaning the assembly being called and presumably preempted.
Oh, that is fine. The Go assembler itself will insert the prologue and epilogue to adjust SP and BP, as long as the frame size (589848 in
Then unfortunately that isn't where the problem lies.
I'll switch to
Well that was fast... got one already with
Interesting finding! I see that
But I see here that we are not changing the status to _Gcopystack: https://go.googlesource.com/go/+/adec22b9f7c9d9cfb95ff6af1c63ec489d6e9bb8/src/runtime/stack.go#1214. Could that be the issue? If it is not in _Gcopystack, might the concurrent GC try to scan the stack while we are doing the copy? cc @cherrymui
@mauri870 Thanks for catching that. I agree that it seems inconsistent with the comment, but I don't see anything obviously wrong. The G (and its stack) is either owned by the scan bit, or it is the running G itself, so other goroutines shouldn't be doing anything with its stack. Also, #64781 (comment) seems to suggest that stack shrinking isn't the cause. That said, if you'd like, you can try whether this patch makes any difference.
From what I understood, setting this, paired with the comment that we should be changing the status to _Gcopystack, raised some eyebrows. Plus there is this comment in the other usage of copystack:
Okay, I misread the comment in #64781 (comment) -- it is gccheckmark, not gcshrinkstackoff... Then please try whether that patch makes any difference. Thanks.
@dctrwatson If you are able to, can you compile your binary with a custom Go toolchain with cherrymui's patch and then run the program without any GODEBUG env set, to see if that fixes the issue? Thanks.
@dctrwatson As far as I can tell the workflow would be:
AFAICT it shouldn't be necessary to recompile the Go compiler, but someone can correct me if I am wrong.
You're right that there is no need to recompile the Go compiler. You can also find your Go installation by running
I'm still not sure what could go wrong. For the concurrent GC to scan its stack, it needs to preempt the goroutine. But if we're copying the stack, we cannot be preempted until the copying is done.
I don't know if it is relevant, but at least two of three (and probably also the third) of the affected customers have dual-CPU systems. It may not be statistically significant, though, since probably about 40% of our customers use such systems. It seems to be somewhat specific for certain machines to be "vulnerable" to this. If it were something that happened on all installs, I'd expect it to happen a lot more often than what we are observing.
If you're OK with log spew in the name of debugging, it might also be worthwhile to turn that on. If this is some stack-shrink-at-a-bad-time bug, then we should at least be able to identify which stack shrink messed things up. And if we can see exactly when that stack shrink happened (i.e. what the stack looked like at the time of the shrink), then that might give us a smoking gun.
Regarding the patch in #64781 (comment): go1.19 (https://github.com/golang/go/blob/release-branch.go1.19/src/runtime/stack.go#L1160) doesn't have such a patch, and it is the same as go1.20 (https://github.com/golang/go/blob/release-branch.go1.20/src/runtime/stack.go#L1162). As for this error: it also happens on 1.19 (#56552) and 1.14 (#37968). In this issue's case,
Is it possible that customers using different systems, with or without C compilers, are triggering this error?
@mauri870, I followed the steps by @klauspost in #64781 (comment). After a couple of hours I got a panic, though:
Perhaps, as @mknyszek suggested, we can add some debug prints. That way I can also validate that I'm building properly with the patch applied.
@klauspost @dctrwatson hey, do you have a minimal code sample to set up in a local environment to reproduce this error?
This is with Thanos v0.32.5 running the
@smiletrl As I wrote, it seems to only trigger on certain setups. Unfortunately these are all production setups running on customer hardware to which we don't have access.
We are also seeing this corruption issue in our production, building with Go 1.20.11, using Klauspost v1.17.3 on Debian 10 running kernel 5.10.0-25-amd64 on dual-CPU Intel Xeon Gold 6348/6230R.
@klauspost This ended up back in the compiler/runtime triage queue; do you have any updates from your end on trying it out with debug settings enabled?
@mknyszek We also see the issue being worked around with that setting. For a short while it looked like "Intel SDE" could reproduce the issue, but it turned out that it was probably just emulation quirks. So I still don't have a reliable reproducer. It does match quite well with the code using quite a bit of stack, probably more than most other code out there.
Hi all, I want to add one more case to the issue with some additional details, in case it helps the debugging. Here are the last log lines from three different occurrences of fatal errors (without GODEBUG): Pod 1:
Pod 2:
Pod 3:
I hope this helps.
@karsov, thanks! Are you able to share the stack tracebacks as well? You can strip out the symbols, paths, and arguments, though if you could keep the runtime functions, that would probably be helpful. I mostly want to see the goroutines and the sp/fp printed on each frame.
That's the strange part (at least for me).
Oh, hmm. I guess I would expect there to be no stack trace for the "fatal: bad g in signal handler" ones; I would have expected a traceback from pod 2, though. The fact that two of them print "fatal: bad g in signal handler" is really strange. We don't manipulate workbufs or scan stacks from signal handlers. This suggests we somehow 1) crashed for one reason and then 2) somehow set g to nil, which would immediately cause a signal and print that message. I don't see how we get from 1 to 2.
@karsov We also encountered the same error. Can you tell me how to set the GODEBUG=gcshrinkstackoff=1 environment variable in a k8s deployment? Thanks.
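For reference, a generic Kubernetes sketch (container name, image, and structure are placeholders, not specific to any deployment in this thread): GODEBUG is an ordinary environment variable, so it can be injected through the container's `env` section.

```yaml
# Fragment of a Deployment spec; names are placeholders.
spec:
  template:
    spec:
      containers:
        - name: server
          image: example/server:latest
          env:
            - name: GODEBUG
              value: gcshrinkstackoff=1
```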
If someone has the possibility, please test the latest release without the GODEBUG switch: v1.17.7 contains an OOB write fix. The fix isn't stack-related, however, so I am unsure whether it resolves this issue; we have been unable to confirm either way internally.
Go version
Go 1.20.x and later
What operating system and processor architecture are you using (go env)?
What did you do?
We at MinIO have been experiencing runtime crashes since the release of Go 1.20.
The issues only appear to occur for a very small number of our customers, and supplying them with a Go 1.19.x compiled binary always solves the issue.
The issue appears slightly differently each time, but all instances indicate some sort of corruption. Since none of them had any "smoking gun", I held off on submitting an issue.
Here are some samples of customer crashes.
Same hardware:
Second customer:
Third customer:
It does, however, seem like the "Thanos" project has supplied what looks to be a smoking gun (the issue contains additional traces).
So the crash occurs in a goroutine that compresses a block.
The only really interesting thing that goes on is that it calls assembler code to compress the block. Looking at the assembly caller, it is pretty straightforward. This is the disassembly for the caller:
This is the assembly function called signature:
... and the function definition:
The reason I included the stack size check is that the function uses quite a bit of stack, so there is a chance that the check is triggered.
The stack is used for a dynamic lookup table. I am fairly sure there are no writes outside the stack, and I am also pretty confident there are no writes outside the provided slices (that would likely also give different errors).
I do not use the BP register, so it is not clobbered, and only SSE2 registers are used, so no VZEROUPPER weirdness. The stack is managed by avo, so it is less likely there is a bug with that.
So my questions are:
A) Am I doing something obviously wrong?
B) What would be a typical reason for this error to show up?
C) This seems related to GC, so is there a window where the goroutine could be preempted in an unsafe state?
D) Are there any Go 1.20 changes that seem likely to be triggering this?
Keep in mind that this doesn't appear to happen on very many machines. Thanos reported that it seems to happen more often if a lot of memory is allocated.
I will of course assist with any information that may be needed, but I feel at this point I need some pointers from people with a deeper understanding of the runtime to get much further.
Also note that we have tested CPU+RAM on some of these customer systems extensively, since that seemed like a possibility at first. The crashes could be completely unrelated, but the coincidence seems too big.
What did you expect to see?
No crash
What did you see instead?
Rare, random runtime crashes
Edit: the assembly is now Linux-compiled.