-
Notifications
You must be signed in to change notification settings - Fork 18k
runtime: crash with "invalid pc-encoded table" in TSAN mode #27540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Yeah, this is pretty similar. I don't think I can do anything at all about this without seeing the part of the error you've hidden behind [...] and 0xXXXXXX though. |
...actually, reading back through the internal bug where we diagnosed this, I think that error might not be relevant. (All the logs I was looking have expired so I'm trying to piece it back together from memory.) What we need is the stack that was being scanned by gentraceback. So, starting with a core file and a crash dump like the above:
|
Sorry about the 0xXX stuff - trying to avoid exposing memory with ascii. |
No problem. This looks exactly the same as the previous bug, which is worrying. Lines 3822 to 3823 in 41e62b8
go/src/runtime/sys_linux_amd64.s Line 243 in 41e62b8
so I don't get it. Maybe the pcsp table is wrong even before the stack alignment? What does disass 0x45d5d5 look like? This is unmodified go 1.11? |
yes this is unmodified go1.11 |
Cool! This makes absolutely no sense to me. Can you print/x *mp in the sigprof frame? I'd like to confirm that it generally matches the other args passed to sigprof. |
I have not been able to drop into the sigprof frame (which is weird to me a |
Okay, then please do (from memory):
which should print the structure of an m with contents that make sense. |
|
Okay. I don't understand how vdsoSP can possibly be 0 given that it would have been set immediately before the point at which the signal was caught. @aclements, @ianlancetaylor, any ideas? |
What is the function at address |
This is very unlikely, but does your program ever call |
|
This program doesn't use cgo, but it does mmap a file. |
The value of My conclusion so far is like @heschik 's: this is impossible. Unless somehow |
I definitely would have told you if I found something else printed out. :) |
Since nobody else has asked yet: are any data races detected if you run your executable built with |
no races detected. the frequency of crashes observed so far is low (only 1 per week so far under reasonable load) as well. |
I noticed that in sys_linux_amd64.s,
the But even so, I don't understand how vdsoSP could be 0. |
I didn't spell it out above, but The problem is that no matter what I don't see any way that |
Sorry, I think it was a mistake to ask for the printout of mp, it's a red herring. I believe this is a core dump for a process that threw with GOTRACEBACK=crash, so all the output we're getting is after the crash dump was printed. I didn't realize that the process of crashing would make a vdso call -- since it does, everything in m will have been overwritten. |
We recently updated one of our production data analysis servers from 1.10.x to 1.11.5 (then reverted to 1.11.0) and seem to have encountered either this or a similar issue. As with the original reporter, we are running a complex system (~35k go routines) and have continuous background profiling enabled which we capture and save out for diagnostic when prompted by certain conditions in our software. We are reading data files from disk, processing, and saving results at the maximum possible speed, generally achieving >70% total processor utilization. We typically run an analysis case for hours to days. What version of Go are you using (
|
One other data point that may be helpful, 1.11.0 ran for just over 30 minutes (about 30 min, 20 sec). I record system resource state snapshots to logs every 5 minutes and the last capture I had was: 1.11.5 crashed after ~20 seconds and I don't have a snapshot, but it would have been most of the way piped up to this level, both were running on the same data set. |
FYI; Still present with 1.12.0, same story as others, service with profiling (cloud.google.com/go/profiler) Some tracebacks from the failure if it helps:
|
Hi to everyone, I catched the same problem:
I used go 1.12.7 version |
@scarbo87 I was running into this issue regularly when I had a background profiling process running periodically on a long-running process. My only fix so far has been to shift to only collecting profiles on demand or when prompted by specific anomalous events. Does not really solve the issue but significantly reduces the chance of tripping it.... |
Yep, go1.16.5 (╥﹏╥) |
2021-11-19T18:23:43-80cb59c/linux-amd64-unified |
|
This failure mode reproduces on the builders for This doesn't reproduce frequently enough to be a blocker for Go 1.18, but it isn't appropriate in the Unplanned milestone either. If there isn't enough information in the tracebacks we're getting to diagnose the problem, then a good next step might be to improve the diagnostics so that we can make progress the next time it happens on the builders. |
There are different failure modes, albeit all being related "invalid pc-encoded table" and profiling signals. The last two seem specific to TSAN, due to its delayed delivery of signals. The problem is that when the signal is delivered the signal context no longer matches the thread's state, so unwinding would fail. The earlier ones are probably related to VDSO. |
We fixed several issues related to VDSOs that could cause this. We haven't done anything about the TSAN issue (we're not sure what to do). We should probably look at more recent builder failures and categorize them to see if they're all TSAN or if we still have some other issue. |
For TSAN, given the delayed signal issue, should we just disallow profiling signals in TSAN builds? Is there any way to work around this? |
It looks like the (Or, could we do something like ignore the signal context when handling the profiling signal under TSAN?) |
The Were the VDSO fixes landed around that timeframe? If so, then it seems plausible that TSAN is the only remaining cause for this failure mode. |
( |
cc @golang/runtime |
For what it's worth, the original bug report did not involve TSAN. That app was (is) an RPC server that serves requests with net/http and calls some backends with net/http. It appeared to be crashing due to its combination of regular CPU profiling (receiving SIGPROF) and enabling the block/mutex profiles (frequent vDSO calls to find the current time). I checked in with that team a couple weeks ago, and we weren't able to find any recent crashes (though the machines the app runs on no longer report in the same way they used to when a process dumps core, so it's a bit harder to tell). |
Indeed. I think this issue has ended up tracking something like three distinct but time-overlapping bugs with similar symptoms. (Hopefully we'll eventually reach a state where we don't have so many time-overlapping bugs and they'll be easier to distinguish! 😅) |
Change https://go.dev/cl/408218 mentions this issue: |
What version of Go are you using (
go version
)?go 1.11
Does this issue reproduce with the latest release?
yes
What operating system and processor architecture are you using (
go env
)?What did you do?
I am running a fairly run-of-the-mill production api server. I have observed one crash so far on this code since upgrading to go1.11 with
fatal error: invalid runtime symbol table
. I am running a library in this server code that periodically captures a cpu, block, and mutex profile every few minutes. The sample rates on the mutex profile is 1/10 and 1 block profile every second.What did you expect to see?
No crashes
What did you see instead?
This crash is reminiscent of a bug I was hoping would be fixed with the 1.11 release related to #24925.
The crashes do seem less frequent than when I was running the same code under 1.10.
CC @heschik and @rhysh
The text was updated successfully, but these errors were encountered: