runtime: piController can produce NaNs if the error term grows in an unbounded manner #51061
Labels: FrozenDueToAge, NeedsFix (the path to resolution is known, but the work has not been done), release-blocker
If something goes horribly wrong with the assumptions surrounding the PI controller in the runtime, its internal error state might accumulate in an unbounded manner. In practice this means unexpected `Inf` and `NaN` values.

This is more likely to happen in the scavenger, because the controller there assumes that measured CPU usage responds proportionally to adjustments in sleep time, and our measurements of time here have some significant caveats. Namely, we use `nanotime` for everything and make a pretty naive assumption about what CPU time looks like (which mostly holds). These assumptions break down in, for example, over-subscribed systems (even if they're only transiently over-subscribed).

I think we should just handle this case and fall back to a conservative setting. I also think that we should try to come back to using the controller after some time, because it really does pay to be more aggressive.
The GC pacer also uses this controller, but for a somewhat different purpose. AFAICT this is not an issue there because the controlled value is the output of the controller, so the input and output are always very directly correlated. Also, I was unable to break the controller with the same parameters under a fuzzer.
We've seen this happen very occasionally in internal Google services, where it leads to a runtime crash: the runtime converts a NaN to an int and then tries to set a timer with the resulting garbage value.