runtime: occasional hard lockup / 100% CPU usage in go applications #56424
Comments
CC @golang/runtime
Could you collect a core dump (e.g. by sending SIGABRT)? Then in gdb, running a backtrace should show where it is stuck.
Will do. Just to make sure I'm doing this correctly:
I tested out attaching to it from gdb (see below) and the output from the backtrace is minimal and not very useful as far as I can tell. Is there something else I need to do, either in my application or from gdb, to make the backtrace more verbose / useful? Here's what I got. Note that this is in the normal state, not the runaway state.
Apologies if this is a basic question, but I'm not at all familiar with using gdb on a Go app.
That sounds right. Now when you send a SIGABRT it should generate a core file (though I'm not sure where systemd will put it). That said, if you can simply attach to the PID with gdb (as you did above) when it is stuck, that is fine too.
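For reference, that workflow looks roughly like this; a systemd-managed service is assumed, and the binary name, PID, and core path are placeholders rather than names from this thread:

```sh
# Allow core files; for a systemd service the equivalent is LimitCORE=infinity
# in the unit file.
ulimit -c unlimited

# Run the service with GOTRACEBACK=crash so a fatal signal makes the runtime
# abort in an OS-specific way and leave a core file instead of just exiting.
GOTRACEBACK=crash ./broadcaster &

# When the process wedges, ask it to abort and dump core.
kill -ABRT <pid>

# If systemd-coredump is in use, the dump is easiest to reach via:
coredumpctl gdb <pid>

# Otherwise open the core file directly and dump all goroutine/thread stacks:
gdb ./broadcaster /path/to/core
# inside gdb:
#   thread apply all bt
```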
This is odd, it should show the symbols there. This isn't a stripped binary, is it? You can at least double-check that these addresses are in the range of the binary.
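Two generic ways to check that, with the binary name and PID as placeholders:

```sh
# A stripped binary reports "stripped" instead of "not stripped" here.
file ./broadcaster

# Compare the addresses gdb printed against the executable mapping of the
# running process.
grep ' r-xp ' /proc/<pid>/maps | grep broadcaster
```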
Ah, it was a stripped binary. Rebuilt and running now with all the debugging symbols. Here's the output from another test (again in the nominal state, not the runaway state). Looks much better:
Yup, that looks correct!
It took several days, but the runaway happened again. I tried to attach to the process from gdb without success - gdb simply hung at "Attaching to process 2709" and never gave me the opportunity to request the backtrace. I also tried sending SIGABRT, with no result. Any additional suggestions?
Tried again today with gcore. Similar failure:
Attempting to attach somehow killed the process. What on earth is up with this?
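Not something that was run in this thread, but a couple of generic checks when a ptrace attach hangs or kills the target (PID is a placeholder):

```sh
# A target stuck in uninterruptible sleep ("D") or already being traced can
# make gdb/gcore hang on attach.
grep -E 'State|TracerPid' /proc/<pid>/status

# Kernel messages sometimes explain a failed or fatal attach.
dmesg | tail -n 50
```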
gdb being unable to attach makes this sound like an OS bug; Go shouldn't do anything that prevents attaching. You could try using the perf tool to collect a system-wide profile and see where all that CPU time is going.
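A typical system-wide perf session looks roughly like this (the 30-second duration is arbitrary):

```sh
# Sample all CPUs with call graphs for 30 seconds while the process is spinning.
sudo perf record -a -g -- sleep 30

# Summarize where the CPU time went.
sudo perf report
```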
Ok, here's what perf shows:
I've no idea what all that means. Any thoughts?
Drilling down into the top item in that list:
Interesting, it seems to be spinning on some atomics. Could you expand that entry?
I think so - not at all familiar with perf, but I'll try... Here you go:
If that's not what you need, please let me know and I'll try again. Specific instructions would be great!
I can also upload the perf.data file if that would help.
Yeah, it doesn't look quite right. The perf UI is honestly pretty confusing w.r.t. expanding entries. Try this:
If for some reason that doesn't match the results, just remove the
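One common way to dump expanded call chains without fighting the interactive UI (not necessarily the exact command meant here):

```sh
# Text-mode report with full call chains; --no-children attributes samples to
# the function itself rather than accumulating its callees.
sudo perf report --stdio --no-children -g graph,0.5,caller
```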
Here's what that outputs:
Unfortunately the perf.data file doesn't include the symbol information (perf report finds the right binaries on the local machine and reads them), so it is difficult to send it around. Sorry for all the questions.
Hrm, it seems perf didn't find/record the callers for some reason. Could you try a few things?
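One common reason for missing callers is failed frame-pointer unwinding; recording with DWARF-based unwinding is a typical fallback (again only a sketch, not necessarily what was suggested here):

```sh
# Capture stack snapshots and unwind them offline with DWARF info instead of
# relying on frame pointers, which can be absent in 32-bit ARM builds.
sudo perf record -a --call-graph dwarf -- sleep 30
sudo perf report --stdio
```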
Not a problem. Just want to get to the bottom of this if possible.
I rebuilt with "GOARM=7" and that seems to have resolved the issue. I've run the rebuilt version for the past week without a lockup / runaway. The only problem is, I'm not sure if this truly fixed the problem, or simply made it much less likely to happen. The theory from a friend who's much better versed in the internals of Linux:
Unfortunately, I was never able to get a core dump from the process. His theory on that:
Not sure if this should be closed or left open for someone with better debugging tools to take a look. At the very least it might be good to note this compatibility step (the "GOARM=7" bit) somewhere.
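For anyone hitting this, the workaround amounts to building with GOARM=7; a minimal example, with the output name as a placeholder:

```sh
# Building natively on the Pi:
GOARM=7 go build -o broadcaster .

# Cross-compiling from another machine for a Pi 4 / CM4:
GOOS=linux GOARCH=arm GOARM=7 go build -o broadcaster .
```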
That is certainly possible. It is also possible that there is a bug in our GOARM<7 locking path, which uses the kuser mappings, or in the kuser code itself. What kernel version is this system using?
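Gathering that looks roughly like this; the kernel-config check for CONFIG_KUSER_HELPERS is an assumed extra step for confirming the kuser helper page the GOARM<7 path depends on, not something explicitly asked for above:

```sh
# Kernel version and architecture.
uname -a

# Whether the kernel exposes the kuser helper page used by the GOARM<7
# atomics path (on Raspberry Pi OS you may need `sudo modprobe configs`
# first so /proc/config.gz exists).
zgrep KUSER_HELPERS /proc/config.gz
```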
I'm hitting the same issue. Is there any update on this?
@coolonion2000 If you are using a Raspberry Pi, try setting GOARM=7 when building.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Unknown. I've been having this issue since I started working on this project roughly 24 months ago. Same behavior on all go versions.
What operating system and processor architecture are you using (go env)?
Raspberry Pi OS v10.11 on armv7l
go env Output
What did you do?
I'm using Go to build a set of services on the Raspberry Pi platform (Pi 4, CM4). Several of the higher duty-cycle services occasionally "run away", consuming upwards of 400% (100% x 4 cores) of available processing power. This usually happens after several hours or even a few days of running. When this happens, all of the goroutines in the application stop running, remaining in "runtime.gopark". (See the stack trace below.)
The program most given to this behavior is a network broadcaster. It subscribes to and receives data from a set of Redis pubsub channels and forwards that data to clients over UDP. The typical throughput is relatively low - only about 34 KB/second. It never gets above 38 KB/sec. It is run at a higher priority level, but it is not run as "realtime".
I've added a watchdog goroutine to try to detect the runaway condition and terminate the service, but the watchdog timer never fires.
I tried pulling a profile using pprof but when the problem happens the http listener fails to respond.
I've tried tracing with strace but it doesn't return any data.
I've monitored open files with lsof and it does not appear to be leaking handles - open file count remains constant at 369.
I kicked it around with the Redigo team, and from what we can tell it's not something going on in Redigo - all the pubsub listeners simply get stuck in their waiting state.
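Two generic ways to get goroutine dumps in this situation, assuming a net/http/pprof listener on port 6060 and a systemd unit named broadcaster.service (both placeholders): pull the dump over HTTP while the listener still responds, or fall back to SIGQUIT, which makes the Go runtime print all goroutine stacks to stderr before exiting.

```sh
# While the pprof HTTP listener is still responsive:
curl -s "http://localhost:6060/debug/pprof/goroutine?debug=2" > goroutines.txt

# When the listener is wedged: SIGQUIT dumps all goroutine stacks to stderr
# and then terminates the process; journald captures it for a systemd service.
kill -QUIT <pid>
journalctl -u broadcaster.service -n 200
```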
What did you expect to see?
The application runs at its normal modest CPU load (5% - 8%)
What did you see instead?
The application sucks down as much CPU power as possible (395%)
Stack Trace
Too long to embed in this issue. Here it is as a gist:
https://gist.github.com/ssokol/b168de8b4546efd9b43a9d6af8538de9