Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: exceeded thread wakeup limitation on iOS #29284

Open
hajimehoshi opened this issue Dec 15, 2018 · 27 comments
Open

runtime: exceeded thread wakeup limitation on iOS #29284

hajimehoshi opened this issue Dec 15, 2018 · 27 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. mobile Android, iOS, and x/mobile
Milestone

Comments

@hajimehoshi
Copy link
Member

hajimehoshi commented Dec 15, 2018

We found my gomobile iOS application crash with the below message:

Event:           wakeups
Action taken:    none
Wakeups:         45001 wakeups over the last 64 seconds (704 wakeups per second average), exceeding limit of 150 wakeups per second over 300 seconds
Wakeups limit:   45000
Limit duration:  300s
Wakeups caused:  45001
Duration:        17.04s
Steps:           18

Hardware model:  iPhone8,1
Active cpus:     2
Boot args: 

I guess Go runtime switched threads so many times that iOS complained, but I am not sure. It is hard to create a minimal code to reproduce this problem, but I'd like to report this (potential) problem in gomobile.

@gopherbot gopherbot added this to the Unreleased milestone Dec 15, 2018
@gopherbot gopherbot added the mobile Android, iOS, and x/mobile label Dec 15, 2018
@eliasnaur
Copy link
Contributor

Please do try to figure out a way to reproduce the error; fixing the problem will be much more likely if you do.

@eliasnaur
Copy link
Contributor

I found a thread with a similar issue: https://forum.unity.com/threads/the-weird-ios8-unity3d-hiccup.273186/. It seems that Unity apps don't actually crash, only freezes while the exception report is generated.

@g3devv
Copy link

g3devv commented Jul 14, 2022

Hey I want to work on this issue since I "translated" my app to go - and now I face problems here.

I found this thread at https://developer.apple.com/forums/thread/124180 and user eskimo has some insights about what to look for.

In short, it seems to count the number of times that a blocked thread was unblocked because of an ‘interrupt’

At first I tried to reproduce the issue in a minified example by doing thousands of Gosched() calls in multiple threads - unfortunately without success. My next attempt was mixing up some sync.Mutex .Lock() .Unlock() calls in multiple threads. This has lead to a lot of context switches in the Xcode profiler, however it has only produced reports for cpu_resource ("90 seconds cpu time over ...") but not the desired wakeups_resource report.

Now I'm kind of out of ideas what to look out for there, I'm trying to think of other design patterns that I use in my main project that could potentially fall into that "thread unblock or interrupt" territory but the Gosched and the Mutex calls are the only things I can come up with now. Maybe also channels but my understanding of these is quite limited

@hajimehoshi did make any further findings back then?

@eliasnaur has something like that ever been brought up again? Any info that could help me showcase a minimal example of the issue to debug or fix?

@hajimehoshi
Copy link
Member Author

I have not seen this issue recently.

@g3devv
Copy link

g3devv commented Jul 14, 2022

@hajimehoshi I do 🙈 Is the project where you have experienced the problem a few years ago open source by any chance or would you open-source it (of course only if it has no further value for you that would be destroyed by open sourcing)? Maybe that would make it easier to find the reason for the problem

@hajimehoshi
Copy link
Member Author

I'm afraid it is not an open source...

@g3devv
Copy link

g3devv commented Jul 14, 2022

Oh too bad but don't worry about it, if at any time you remember something you stumbled across when debugging please let me know. And maybe Elias knows something too haha

@g3devv
Copy link

g3devv commented Jul 14, 2022

It's the channels. My example project is quite messy, I am going minify my example project and let you know. So happy I found a way to reproduce the error :)

@g3devv
Copy link

g3devv commented Jul 14, 2022

Please do try to figure out a way to reproduce the error; fixing the problem will be much more likely if you do.

@eliasnaur I have just made a project that reliably produces that error. https://github.com/g3devv/channelwakeups

It would be very exciting to hear from you if you have an idea how we could approach getting this resolved in gomobile

@eliasnaur
Copy link
Contributor

This is unlikely to be a gomobile issue, and more likely to be an iOS limitation that affects Go runtime scheduling. If you have a reliable reproducer, the Go runtime team may be able to help. I can come up with one clarifying question: is this still an issue on the very latest iOS version?

CC @golang/runtime .

@g3devv
Copy link

g3devv commented Jul 17, 2022

Dear Elias, thank you for your response!

I completely agree that this very well may not be an issue with gomobile intrinsically, however I fear that some "specialties" that Go comes with make this issue more likely to come up in a Go app compared to native apps or apps developed with other frameworks.

What I mean by that (and I'm a complete novice when it comes to golang so maybe it's a bad example) is that for example I once read that Go is super fast at context switching and for that reason does it at a higher frequency than other language runtimes do.

So maybe, if we are able to find a singular "feature" to blame, it would be possible to pass a build flag (or compile flag or what it's called) to optionally disable (or throttle) such a feature; in a way that the generated framework is less likely at hitting these iOS restrictions.

Over the weekend, I was able to come up with a more concise example (5 lines of code) to reproduce the issue, which should make it easier to pin down the underlying cause. The error rate with the new code is still a bit temperamental (either because it's not hitting the nail OR iOS doesn't emit the crash info on each run). Once I was able to confirm my new demo is better, I will ping you again if that's okay with you.

About the iOS version, my production app showcased that issue on 15.2.1 and I currently test my demo project on 14.3 but I will update the phone tomorrow just to make sure.

@eliasnaur
Copy link
Contributor

You're welcome to ping me, but I'm unlikely to be able to do much more for this issue.

So maybe, if we are able to find a singular "feature" to blame, it would be possible to pass a build flag (or compile flag or what it's called) to optionally disable (or throttle) such a feature; in a way that the generated framework is less likely at hitting these iOS restrictions.

It's very unlikely an explicit flag will be accepted by the Go team, especially for a workaround for what seems to be a misfeature of a secondary port. What may be possible is that the Go runtime could somehow automatically limit thread switching when runtime.GOOS == "ios".

@prattmic
Copy link
Member

Without knowing more context about the application, it is hard to provide specific advice, but one thing you could try is lowering GOMAXPROCS.

Why? GOMAXPROCS represents the parallelism available to the Go scheduler. The scheduler will attempt to maximize utilization of these resources (work conservation). In practice, this means (simplified) that there will be GOMAXPROCS OS threads running Go code. If there are fewer than GOMAXPROCS runnable goroutines, then extra threads will sleep. Then when new work is available, we will wake up one of those extra threads to run the work (this will count as a wakeup).

If GOMAXPROCS is usually less than the number of runnable goroutines, then threads won't sleep and thus won't be woken either. Of course, this will also hurt the scheduling latency of the application, since there are not enough CPU resources for all of the work.

This is not the only possible source of excessive wakeups, but it is certainly the top one that comes to my mind.

@prattmic prattmic changed the title x/mobile: thread wakeup limitation on iOS runtime: exceeded thread wakeup limitation on iOS Jul 19, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 19, 2022
@prattmic prattmic modified the milestones: Unreleased, Backlog Jul 19, 2022
@prattmic
Copy link
Member

Ah, I see you provided a reproducer in https://github.com/g3devv/channelwakeups/blob/main/goproj/goprojlib/goprojlib.go.

I don't think that there is anything we can do here. If I am reading correctly, RunChannelConciseExample wakes every 4ms from sleep, which is 250 wakeups[1] per second. Even ignoring the rest of the program, this already far exceeds a 100 wakeup/s limit.

In a real application, the use of an unbuffered channel could exacerbate the issue because the producer side must sleep if the consumer isn't waiting for the next value. Adding a buffer would allow both sides to execute concurrently.

[1] Disclaimer, I don't know exactly what iOS considers a "wakeup", but I would certainly consider wake from sleep to be a wakeup.

@mknyszek
Copy link
Contributor

This is kind of unfortunate, and we're not sure what to do about it. We could enforce coarse-grained wakeups on iOS, so that there isn't more than one wakeup every 10 ms (either spinning or waiting to wake up another thread).

The immediate workaround is to just have user application enforce this, but that can be difficult. It's surprising that iOS crashes the process instead of just refusing to wake up the process, though. That's a pretty hard hammer for what seems like a fixable problem.

@mknyszek
Copy link
Contributor

CC @golang/ios

@mknyszek
Copy link
Contributor

Assigning to @changkun as the port maintainer, but feel free to respond and unassign.

@iamcalledrob
Copy link

We're seeing this issue on macOS too, when running an app that links against a framework built with go.

It's extremely difficult to reproduce, and is only occurring on certain devices (of course, never our own).

The process is killed, and a diagnostic report is written to /Library/Logs/DiagnosticReports staying "Action taken: none", yet the process does appear to either crash or be killed by the OS.

Here's an excerpt from a diagnostic report when this happens:

Event:            wakeups
Action taken:     none
Wakeups:          45001 wakeups over the last 7 seconds (6142 wakeups per second average), exceeding limit of 150 wakeups per second over 300 seconds
Wakeups limit:    45000
Limit duration:   300s
Wakeups caused:   45001
Wakeups duration: 7s
Duration:         7.33s
Duration Sampled: 0.00s
Steps:            1

Hardware model:   Mac13,1
Active cpus:      10
HW page size:     16384
VM page size:     16384

Fan speed:        1328 rpm
Advisory levels:  Battery -> 2, User -> 2, ThermalPressure -> 0, Combined -> 2
Free disk space:  725.52 GB/3721.87 GB, low space threshold 3072 MB
Vnodes Available: 86.91% (228724/263168, 132392 allocated)

Preferred User Language: en-CA
Country Code:     CA
OS Cryptex File Extents: 1983

Heaviest stack for the target process:
  1  runtime.mcall + 84 (MyFramework + 3958644) [0x10369e774]
  1  runtime.park_m + 268 (MyFramework + 3785276) [0x10367423c]
  1  runtime.schedule + 152 (MyFramework + 3783896) [0x103673cd8]
  1  runtime.findRunnable + 1096 (MyFramework + 3777384) [0x103672368]
  1  runtime.stealWork + 440 (MyFramework + 3780744) [0x103673088]

@iamcalledrob
Copy link

Also to note, the framework is not built using gomobile, it's built directly with go build -buildmode=c-archive -ldflags '-s -w'

@iamcalledrob
Copy link

@prattmic
Copy link
Member

@iamcalledrob Thanks for the references. For posterity, the reproducer from https://groups.google.com/g/golang-nuts/c/yywUWpRA1EE is:

func main() {
    time.Sleep(10*time.Second) // added sleep to make sure it run enough to show in mac activities monitor
    for count < 10000 {
        count += 1
        netConn, err := dialer.Dial("tcp", ":8080")
        if err == nil {
            netConn.Close()
        }
    }

    time.Sleep(10*time.Second) // added sleep to make sure it run enough to show in mac activities monitor

}

@iamcalledrob does this program reproduce the problem for you as well?

@aclements
Copy link
Member

@iamcalledrob what version of Go are you running? Can you paste go version and go env? Thanks!

@prattmic
Copy link
Member

prattmic commented Feb 1, 2024

The process is killed, and a diagnostic report is written to /Library/Logs/DiagnosticReports staying "Action taken: none", yet the process does appear to either crash or be killed by the OS.

FWIW, on https://developer.apple.com/forums/thread/124180?answerId=742350022#742350022 and following posts, it seems that this report is not supposed to actually kill the process, though, like you, the user seems to report that their process is dying.

@iamcalledrob
Copy link

@aclements

$ go version
go version go1.21.6 darwin/arm64
$ go env
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/rob/Library/Caches/go-build'
GOENV='/Users/rob/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/rob/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/rob/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/opt/homebrew/Cellar/go/1.21.6/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.21.6/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.21.6'
GCCGO='gccgo'
AR='ar'
CC='cc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/9x/tzryjjm93zvcvhmn517n9t3h0000gn/T/go-build633570136=/tmp/go-build -gno-record-gcc-switches -fno-common'

I'm able to (unreliably) reproduce an "excessive wakeups" diagnostic report in a plain go application:

package main

func main() {
    for i := 0; i < 10000; i++ {
        go func() {
            for {
                <-time.After(10 * time.Millisecond)
            }
        }()
    }

    select {}
}

Note: I say unreliably, because macOS seems to be applying some magical logic to what it does when it encounters "excessive wakeups". Sometimes it writes a DiagnosticReport, sometimes it doesn't. Never when you want it to, of course. And it appears that sometimes it kills the process, and sometimes it doesn't. In my observations, it suppresses subsequent reports from the same process name. And it never writes reports when the app is "Run" from Xcode.

It does seem like there's some special condition where the OS does choose to kill the process, since the crashes I (and others) are experiencing occur immediately after this diagnostic report is written, with no additional crash logs.

System log caused by excessive wakeups:

default	13:20:43.067351-0600	kernel	process example[2403] caught waking the CPU 45003 times over ~45 seconds, averaging 993 wakes / second and violating a limit of 45000 wakes over 300 seconds.
default	13:20:43.067445-0600	symptomsd	Received CPU wakes trigger:
  example[2403] () woke the CPU 45003 times over 45.29 seconds (average 993/sec), violating a CPU wakes limit of 45000 over 300 seconds.
default	13:20:43.068397-0600	symptomsd	RESOURCE_NOTIFY trigger for example [2403] (45003 CPU wakes over 45.00s seconds, violating limit of 45000 CPU wakes over 300.00s seconds)

When it chooses to write a diagnostic report, there are entries relating to spindump in the system log:

ReportCrash[95787]: Invoking spindump for pid=95661 wakeups_rate=7041 duration=7 because of excessive wakeups

(Note: I failed to copy+paste the exact log line when I was getting these, but it's similar to this)

Here's a .diag Diagnostic Report from starting many timers (snippet above)
Date/Time:        2024-02-02 12:32:28.938 -0600
End time:         2024-02-02 12:32:31.329 -0600
OS Version:       macOS 12.6.1 (Build 21G217)
Architecture:     arm64e
Report Version:   35.1
Incident Identifier: 7683982F-244A-41E1-9775-2ED69E7E4321
Share With Devs:  Yes

Data Source:      Microstackshots
Shared Cache:     7CF73F63-0640-33EC-8444-138D9834ECB5 slid base address 0x1992b8000, slide 0x192b8000

Command:          example2
Path:             /Users/USER/Desktop/*/example2
Architecture:     arm64
Parent:           UNKNOWN [889]
PID:              3435

Event:            wakeups
Action taken:     none
Wakeups:          45003 wakeups over the last 2 seconds (18828 wakeups per second average), exceeding limit of 150 wakeups per second over 300 seconds
Wakeups limit:    45000
Limit duration:   300s
Wakeups caused:   45003
Wakeups duration: 2s
Duration:         2.39s
Duration Sampled: 0.00s
Steps:            1

Hardware model:   MacBookPro18,3
Active cpus:      8
HW page size:     16384
VM page size:     16384

Fan speed:        0 rpm
Advisory levels:  Battery -> 3, User -> 2, ThermalPressure -> 0, Combined -> 2
Free disk space:  61.35 GB/460.43 GB, low space threshold 3072 MB

Heaviest stack for the target process:
  1  runtime.systemstack.abi0 + 108 (example2 + 411772) [0x10474887c]
  1  runtime.gcAssistAlloc.func1 + 36 (example2 + 124788) [0x104702774]
  1  runtime.gcAssistAlloc1 + 292 (example2 + 125108) [0x1047028b4]
  1  runtime.gcDrainN + 404 (example2 + 130420) [0x104703d74]
  1  runtime.markroot + 460 (example2 + 121788) [0x104701bbc]
  1  runtime.markroot.func1 + 180 (example2 + 122612) [0x104701ef4]
  1  runtime.scanstack + 564 (example2 + 127332) [0x104703164]
  1  runtime.scanframeworker + 112 (example2 + 128448) [0x1047035c0]
  1  runtime.(*stkframe).getStackMap + 372 (example2 + 329716) [0x1047347f4]
  1  runtime.(*stkframe).argMapInternal + 32 (example2 + 328656) [0x1047343d0]


Powerstats for:   example2 [3435]
UUID:             C7B3F2A5-76DA-3196-84C4-6B21897A2221
Path:             /Users/USER/Desktop/*/example2
Architecture:     arm64
Parent:           UNKNOWN [889]
UID:              501
Footprint:        41.08 MB
Start time:       2024-02-02 12:32:29.251 -0600
End time:         2024-02-02 12:32:29.251 -0600
Num samples:      1 (100%)
Primary state:    1 samples Non-Frontmost App, Non-Suppressed, User mode, Effective Thread QoS Default, Requested Thread QoS Default, Override Thread QoS Unspecified
User Activity:    0 samples Idle, 1 samples Active
Power Source:     0 samples on Battery, 1 samples on AC
  1  runtime.systemstack.abi0 + 108 (example2 + 411772) [0x10474887c]
    1  runtime.gcAssistAlloc.func1 + 36 (example2 + 124788) [0x104702774]
      1  runtime.gcAssistAlloc1 + 292 (example2 + 125108) [0x1047028b4]
        1  runtime.gcDrainN + 404 (example2 + 130420) [0x104703d74]
          1  runtime.markroot + 460 (example2 + 121788) [0x104701bbc]
            1  runtime.markroot.func1 + 180 (example2 + 122612) [0x104701ef4]
              1  runtime.scanstack + 564 (example2 + 127332) [0x104703164]
                1  runtime.scanframeworker + 112 (example2 + 128448) [0x1047035c0]
                  1  runtime.(*stkframe).getStackMap + 372 (example2 + 329716) [0x1047347f4]
                    1  runtime.(*stkframe).argMapInternal + 32 (example2 + 328656) [0x1047343d0]

  Binary Images:
           0x1046e4000 -        0x10491ffff  example2 (0) <C7B3F2A5-76DA-3196-84C4-6B21897A2221>  /Users/USER/Desktop/*/example2

If I could understand the logic being used for when the high wakes trigger spindump (or not) and when the process is killed (or not) then this would feel like something that might be addressable. Perhaps the answer lies in the darwin source somewhere?

@prattmic
Copy link
Member

prattmic commented Feb 2, 2024

Thanks for the additional details.

At a high level, I am still unsure what, if anything, we should do about this at the Go runtime level. The program that you shared is requesting 1 million goroutine wakeups per second (10k goroutines waking 100 times/s). So it doesn't seem fundamentally incorrect for the OS to warn about this.

The program in #29284 (comment) doesn't obviously contain sleeps, but 45k wakeups during 10k dials implies each dial sleeps at least 4.5 times on average. Perhaps we could do something about that?

@iamcalledrob
Copy link

iamcalledrob commented Feb 3, 2024

Right, I totally get that the "reliable repro" is a bit extremely contrived.

The challenge we are facing is that this "excessive wakeups -> crash" problem is occurring seemingly randomly in non-contrived production code. Code that isn't starting thousands of timers :)

I'm curious if there may be paths occasionally taken by the Go runtime that might be leading to an unexpectedly high number of wakeups?

We do a lot of network and disk IO, and these crashes are happening "in the wild" during these operations.

@g3devv
Copy link

g3devv commented Apr 23, 2024

With the last iOS update I've seen virtually zero crashes to my app anymore. Maybe the crashes we've experienced came from another reason which added to the wakeup limitations or Apple has made changes on their end, or the thread wakeups have became less for some reason. But what I can tell is that it's not really an issue for us any more, it seemed to be a lot more critical when I first noticed it but now it's mostly fine ig :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. mobile Android, iOS, and x/mobile
Projects
Development

No branches or pull requests

9 participants