runtime: multi-ms sweep termination pauses #17831

Closed · rhysh opened this issue Nov 7, 2016 · 32 comments
Labels: FrozenDueToAge, NeedsFix

rhysh commented Nov 7, 2016

What version of Go are you using (go version)?

go version devel +3df059e Mon Nov 7 06:46:01 2016 +0000

What operating system and processor architecture are you using (go env)?

linux/amd64

What did you do?

I have a server program that makes http requests to collect a large volume of data which it then unmarshals and transforms. It runs with around 1000 active goroutines, has a live heap of around 30–100MB and executes around 50 GCs per second. The GC does a fine job keeping up with the program's throughput demands.

What did you expect to see?

I expected sub-millisecond stop-the-world times during garbage collections; per #17503 (comment), "The hybrid barrier is now live on master."

What did you see instead?

I see mark termination STW times of over 1ms somewhat regularly (which I thought the issue 17503 hybrid barrier would eliminate).

I see sweep termination STW times of over 1ms in the majority of GC cycles, which I find particularly surprising since I thought there was very little work required in that phase.

gc 20072 @698.871s 14%: 0.14+5.1+0.35 ms clock, 4.7+14/41/1.1+11 ms cpu, 104->104->46 MB, 107 MB goal, 36 P
gc 20073 @698.886s 14%: 3.1+4.5+6.6 ms clock, 90+57/30/0.74+191 ms cpu, 91->92->61 MB, 92 MB goal, 36 P
gc 20074 @698.917s 14%: 1.5+4.3+0.64 ms clock, 46+47/35/0+19 ms cpu, 121->122->69 MB, 123 MB goal, 36 P
gc 20075 @698.935s 14%: 7.9+4.5+1.0 ms clock, 230+35/30/0+31 ms cpu, 142->143->85 MB, 143 MB goal, 36 P
gc 20076 @698.967s 14%: 6.2+10+1.2 ms clock, 180+29/34/0+35 ms cpu, 160->164->86 MB, 170 MB goal, 36 P
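
For reference, the three clock numbers in these gctrace lines are the STW sweep termination pause, the concurrent mark and scan phase, and the STW mark termination pause (per the runtime's documented GODEBUG=gctrace=1 format). Annotating gc 20075 as an example:

gc 20075 @698.935s 14%: 7.9+4.5+1.0 ms clock, ...
                        ^   ^   ^
                        |   |   '-- STW mark termination pause
                        |   '------ concurrent mark and scan (not a pause)
                        '---------- STW sweep termination pause (the multi-ms numbers at issue here)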

It looks like sometimes the runtime stops most of the goroutines in preparation for a GC, but a few remain running. See, for instance, Proc 15 around 285–289ms (Proc 28, offscreen, is also running), or Proc 0 and Proc 19 around 312–314ms.

[screenshot: execution trace, 2016-11-07 5:22:40 am]

/cc @aclements @RLH

rhysh commented Nov 7, 2016

Below are the distributions of GC STW pauses for the program (in milliseconds), via https://godoc.org/github.com/aclements/go-moremath/cmd/dist

Sweep termination:

N 9991  sum 31376.8  mean 3.1405  gmean 2.22543  std dev 1.93111  variance 3.7292

     min 0.023
   1%ile 0.0772467
   5%ile 0.25
  25%ile 1.5
  median 3.3
  75%ile 4.6
  95%ile 6.2
  99%ile 8.1
     max 11

⠀⠀⠀⠀⠀⠀⠀⡰⠒⢦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⡤⠤⠤⠤⠤⢤⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.247
⠀⠀⠀⠀⠀⠀⣰⠁⠀⠀⠓⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠒⠋⠉⠉⠉⠉⠉⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠒⠦⠤⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠴⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠓⠒⠒⠒⠒⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
      0                               5

Mark termination:

N 9991  sum 6526.75  mean 0.653263  gmean 0.502061  std dev 0.848719  variance 0.720324

     min 0.19
   1%ile 0.26
   5%ile 0.3
  25%ile 0.36
  median 0.42
  75%ile 0.54
  95%ile 1.8
  99%ile 5.2
     max 13

⠀⠀⠀⠀⠀⠀⠠⢲⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 4.273
⠀⠀⠀⠀⠀⠀⠨⠀⢇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠎⠀⠈⠳⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
    0                                            5

RLH commented Nov 7, 2016

Do we know what the running goroutines are doing? Could they be in a tight, long-running loop that doesn't involve a preemption point such as a call?


rhysh commented Nov 7, 2016

Hmm, yes. It looks like every time the beginning of a GC cycle is delayed, one of the running goroutines appears to be in a single expensive call to encoding/base64.StdEncoding.DecodeString; the "End Stack Trace" indicates a line right after that kind of call.

The base64 messages that are decoded are around 1MB, which could keep a goroutine busy for several milliseconds.

RLH commented Nov 7, 2016

Can that code be rewritten so that it makes a preemptable call once every 100µs, or whatever your tolerance level is?


rhysh commented Nov 7, 2016

Are you suggesting a change to the encoding/base64 package, or to all applications that use it for large slices?

My application uses encoding/base64 indirectly via a third-party package, so making a localized change isn't a straightforward solution.

Fixing #10958 would be pretty helpful here. The program I've encountered this in is very throughput sensitive and somewhat latency sensitive, which goes against the assumptions used to de-prioritize that issue.

Gradually disabling each P in preparation for a GC pause means that my 36-core server is effectively a 1- or 2-core server for several milliseconds each cycle.

RLH commented Nov 7, 2016

The appropriate place would seem to be the encoding/base64 package.


bradfitz commented Nov 7, 2016

The appropriate place would seem to be the encoding/base64 package.

I strongly disagree.

We are not going to litter the standard library and every package containing a backwards branch with calls to help the GC or scheduler.

Maybe @rhysh can hack up a local fork of the standard library as a short-term workaround, but fixing #10958 has to be the right long-term answer.

@aclements (Member)

Fixing #10958 would be pretty helpful here. The program I've encountered this in is very throughput sensitive and somewhat latency sensitive, which goes against the assumptions used to de-prioritize that issue.

The main reason that issue has been de-prioritized is not because of a concern over throughput (though that does complicate the implementation), but just because we haven't seen it arise in "real" code. So, this is actually an excellent example to motivate re-prioritizing it.

Unfortunately, fixing this is rather complicated (and it's clearly not a regression), so I think it's too late in Go 1.8 to put a real fix in the compiler.

@rhysh, to confirm that we really do understand the issue (and as a possible temporary fix), could you modify encoding/base64.Encoding.decode to force a preemption check at the top of the outer loop? It's slightly annoying to do this because our compiler is getting more clever, but calling the following function seems to do it:

var alwaysFalse = false

// preempt is never inlined, so calling it from a loop body forces a function
// call (and with it a preemption check) on each iteration. The never-taken
// branch keeps the compiler from optimizing the call away entirely.
//go:noinline
func preempt(x int) {
    if alwaysFalse {
        println()
    }
}
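
For illustration, here is how such a call might sit in a long, call-free loop; sumBytes is a stand-in for something like the base64 decode loop, not the actual library code:

// sumBytes stands in for a long-running, call-free loop (like a large decode);
// the preempt call gives the runtime a preemption point on every iteration.
func sumBytes(src []byte) int {
    total := 0
    for i := 0; i < len(src); i++ {
        preempt(i)
        total += int(src[i])
    }
    return total
}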

RLH commented Nov 7, 2016

Whether the compiler puts it in or it is put in by hand, the appropriate place is in encoding/base64. We have avoided putting it into the compiler due to the (anticipated) performance issues. This bug is interesting since it argues for adding the check to improve performance.


@ianlancetaylor (Contributor)

I think a simpler way to add a preemption point would be to call runtime.Gosched(). That seems like the right approach for a long-running loop anyhow. (And then if necessary we should speed up Gosched.)

@aclements (Member)

I agree that that's simpler, but Gosched() there will force a preemption every 4 characters consumed from the buffer. The cost of entering the scheduler for that, even if we just decide to return to the same goroutine, will surely significantly outweigh the cost of a loop iteration.

@ianlancetaylor (Contributor)

Sorry, my unstated assumption was that you would not do anything on every single loop iteration. Just every 1024 iterations or something.
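
A minimal sketch of that pattern, with a hypothetical per-byte processing loop (process and the 1024-iteration interval are illustrative, not from the standard library):

import "runtime"

// process stands in for a long, call-free loop; yielding every 1024
// iterations keeps the loop preemptible without entering the scheduler
// on every iteration.
func process(src []byte) {
    for i := range src {
        _ = src[i] // real per-element work goes here
        if i&1023 == 1023 {
            runtime.Gosched() // explicit preemption point
        }
    }
}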

@aclements (Member)

Oh, I see. That would be fine as a temporary solution, of course. Though in terms of an experiment with adding implicit preemption points in this loop (and in general), I'd rather try something closer to what the compiler would presumably do.

@quentinmit quentinmit added the NeedsFix The path to resolution is known, but the work has not been done. label Nov 7, 2016
@quentinmit quentinmit added this to the Go1.8 milestone Nov 7, 2016
@ianlancetaylor (Contributor)

Even if and when the compiler does this, I think the compiler should not add a check to every loop iteration. Any non-preemptible loop is by definition quickly executing. Better to keep a counter in a register for when to check for preemption.

rhysh commented Nov 8, 2016

Thank you all for the pointers on this—it looks like we're on the right track.

@aclements I added a preemption point to encoding/base64 as you suggested, and that stack no longer shows up as the ending stack of goroutines that continue running while the runtime is trying to stop the world.

I've added three additional preemption points to encoding/json and a handful to github.com/golang/protobuf/proto where it looked like they were required based on the other end stack traces that showed up. And still, the 99th percentile STW times are close to 1ms since I haven't successfully tracked them all down.

Below are updated STW time distributions for the program.

Sweep termination:

N 9992  sum 2570.14  mean 0.25722  gmean 0.211735  std dev 0.204902  variance 0.0419848

     min 0.022
   1%ile 0.042
   5%ile 0.067
  25%ile 0.14
  median 0.24
  75%ile 0.31
  95%ile 0.5
  99%ile 0.9323
     max 4.7

⠀⠀⠀⠀⠀⠀⠀⠀⣠⡀⠀⠀⠀⢀⡔⠒⠢⣄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 3.991
⠀⠀⠀⠀⠀⠀⢀⡜⠁⠈⠲⠴⠞⠉⠀⠀⠀⠈⠦⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠎⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠙⠒⠒⠒⠒⠦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
     0                                      1

Mark termination:

N 9992  sum 4144.63  mean 0.414795  gmean 0.404515  std dev 0.0998096  variance 0.00996196

     min 0.16
   1%ile 0.26
   5%ile 0.29
  25%ile 0.35
  median 0.4
  75%ile 0.47
  95%ile 0.59
  99%ile 0.7
     max 2.9

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡠⡴⡴⡶⠲⠲⠲⢦⡤⣤⢀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 5.463
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⠖⠋⠁⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠋⠛⠳⠲⠲⠦⠤⢤⣀⣀⣀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠔⠒⠚⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠉⠙⠛⠛⠒⠒⠒⠒⠦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
                                 0.5

@aclements (Member)

@ianlancetaylor, we keep assuming that we need a sophisticated solution to keep the costs low, but as far as I know, we haven't actually tried the simple solution and proven it insufficient. For example, it's not clear to me that checking and maintaining a counter in a loop is necessarily cheaper than the two instructions it would take to check the stack guard (at least on x86). I'd like us to try this experiment.

@rhysh, thanks for trying that out. It looks like that significantly improved the tails, at least. Do you still see the regular gaps in the execution trace with just one P running during sweep termination and mark termination?

RLH commented Nov 8, 2016

We will be looking to see if we can find a solution that is safe and has minimal impact.

Rhys, you have inspected the routines that need to have backward branches checked. Could you summarize the nature of these routines? In particular, were the loops large enough that loading the stack guard and doing a compare and branch will be lost in the noise, or are these tight 10-instruction loops where spilling a register will be noticeable? I ask because one possible solution would be for the compiler to only check backward branches in larger loops for now.


rhysh commented Nov 8, 2016

@aclements I'm happy to test a simple solution if you've got one, even if it has up to ~50% overhead. We'd definitely want something better for the release, but I could at least help validate an approach.

I'm no longer sure how to interpret the execution tracer output. Below is a screenshot of the execution tracer for the program after I added preemption points to several deserialization routines.

It looks like Proc 31 is the last to stop running before the collection cycle is able to begin, but Proc 35 looks like it never stops at all. How was the collection cycle able to run without stopping G92 on Proc 35?

Also, the GC row claims that the cycle started around 545.4ms, but there are goroutines on Procs 15, 20, 22, and 26 that appear to be making syscalls and then continuing their work even after the "GC" row at the top indicates there's a collection going on. Procs 1, 3, 7, 10, 14, 16, 19, 24, 25, 28, 30, and 32 appear to schedule some goroutines for execution after 545.4ms. Comparing the "GC" and "Threads" rows, what is the runtime up to between the start of the "GC" row's "GC" span at 545.4ms and the beginning of the decline of "Threads" row's count of active threads at 545.7ms?

@aclements does the runtime stop the world one P/M at a time in some order, or is each P/M able to learn independently that the world should be stopped? The execution traces make it look like there's some order so that if one goroutine is running a non-preemptable loop, any Ps "behind" it in line will continue scheduling goroutines as usual.

[screenshot: execution trace, 2016-11-08 8:41:53 am]

@RLH I haven't been looking at the generated instructions; I've been looking at GC cycles with the execution tracer, finding goroutines that appear to have continued running when it looks like the scheduler is trying to stop the world, reading the end stack trace they list, and reading through the Go code to try to find the last non-preemptable loop that would have run right before it. Seeing samples of running stack traces (#16895) would be handy here.

The loops I've found have been around 20–60 lines of Go code, so they'd probably count as "larger loops".

@dr2chase (Contributor)

@rhysh Here's a likely simple solution; there are loops that will evade its notice if they're cleverly nested or contain calls to nosplit functions, but otherwise, it does the job.

https://go-review.googlesource.com/c/33093/

rhysh commented Nov 11, 2016

Thanks @dr2chase — I tried CL 33093 PS 2 and it didn't have much of an impact on median STW times, likely within the noise.

Looking at the disassembly of encoding/base64.(*Encoding).decode, it looks like CL 33093 works as the commit message describes, adding calls to runtime.goschedguarded in three innermost loops. Unfortunately these fall under case 1 of 33093/2's commit message: in my program they're executed zero times so the code on the backedge doesn't trigger, since the base64 data I'm decoding doesn't contain newlines.

Why are the inserted calls limited to innermost loops? Is it for performance, code size, or correctness?

The three lines where calls to runtime.goschedguarded are inserted:

for si < len(src) && (src[si] == '\n' || src[si] == '\r') {

for si < len(src) && (src[si] == '\n' || src[si] == '\r') {

for si < len(src) && (src[si] == '\n' || src[si] == '\r') {

I'll see what I can do for a small reproducer, though it'll necessarily only cover functions I know are related to the problem.
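
To make the structure concrete, here is a schematic of the loop shape (this is not the real encoding/base64 decoder, just an illustration of where the inner and outer backedges sit):

// decodeish mirrors the loop structure under discussion; it is not the real
// encoding/base64 decode function.
func decodeish(src []byte) {
    si := 0
    for si < len(src) {
        // Inner newline-skipping loop: PS 2 inserts runtime.goschedguarded
        // here, but with newline-free input it runs zero iterations, so the
        // check never fires.
        for si < len(src) && (src[si] == '\n' || src[si] == '\r') {
            si++
        }
        if si >= len(src) {
            break
        }
        // ... decode one 4-byte quantum starting at src[si] ...
        si += 4 // outer backedge: taken once per quantum, but unchecked in PS 2
    }
}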

@dr2chase (Contributor)

First-implementation sanity. I will take a shot at getting the outer loops right, too.

I could make a version that just stomped on all the backedges.

@dr2chase (Contributor)

I made a version (PS 3) that stomps on all the backedges; I will be benchmarking it for overhead, but wanted to see if it is good enough for your purposes. The only exceptions are (1) loops inside nosplit functions and (2) irreducible loops, which cannot be written without goto.

@dr2chase (Contributor)

Benchmarking results for checking all the backedges are more noticeable.
On AMD64, the geomean is +5%; the worst are RevComp +45%, MandelBrot +20%, and Gzip +11%.
On PPC64, the geomean is +4%; the worst are MandelBrot +67%, Fannkuch +16%, and RevComp +9%.

rhysh commented Nov 11, 2016

@dr2chase CL 33093 PS 3 has a very good effect on sweep termination pauses for my server, reducing their median duration by a factor of 20, from 4ms to 0.17ms. This matches the gains I was able to get from manually inserting these checks into encoding/base64, encoding/json, and github.com/golang/protobuf/proto (though other programs would have a different set of packages in need of such modification), as I described in #17831 (comment).

The median duration of the STW pauses is still longer than the 100µs goal. I much prefer CL 33093 PS 3 to the manual modifications—it'll allow work to continue on diagnosing any other causes of long pauses without needing to worry about tight loops that weren't hand-modified. The overhead hasn't been a problem for my server.

Thank you!


Here's sweep termination time for my server:

With 3df059e:

N 9982  sum 37262  mean 3.73292  gmean 3.1269  std dev 1.71427  variance 2.93872

     min 0.034
   1%ile 0.22
   5%ile 0.6445
  25%ile 2.6
  median 4
  75%ile 4.8
  95%ile 6.4
  99%ile 8.4
     max 15

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣠⠤⠖⠒⢆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.377
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣀⣀⣠⠤⠤⠤⠴⠒⠒⠉⠉⠀⠀⠀⠀⠀⠈⠳⣄⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠚⠉⠉⠓⠚⠉⠉⠉⠉⠉⠉⠉⠉⠉⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠙⠒⠒⠒⠒⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
     0                                5

With CL 33093 PS 3, checked out (not cherry-picked):

N 9990  sum 2198.54  mean 0.220074  gmean 0.171482  std dev 0.253005  variance 0.0640117

     min 0.03
   1%ile 0.049
   5%ile 0.07
  25%ile 0.11
  median 0.17
  75%ile 0.25
  95%ile 0.47
  99%ile 1.4
     max 4.5

⠀⠀⠀⠀⠀⠀⠀⠰⠢⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 5.963
⠀⠀⠀⠀⠀⠀⢠⠏⠀⠙⠒⠖⢢⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠞⠀⠀⠀⠀⠀⠀⠉⠓⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠁
     0                            1                             2

Here's mark termination time for my server:

With 3df059e:

N 9984  sum 7957.11  mean 0.796986  gmean 0.574812  std dev 0.991652  variance 0.983374

     min 0.17
   1%ile 0.27
   5%ile 0.3
  25%ile 0.36
  median 0.44
  75%ile 0.72
  95%ile 2.1
  99%ile 5.9
     max 11

⠀⠀⠀⠀⠀⠀⢠⢲⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 3.617
⠀⠀⠀⠀⠀⠀⢐⠈⢆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠆⠀⠈⠳⠤⠤⠤⠤⠤⠤⠦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
    0                                          5

With CL 33093 PS 3, checked out (not cherry-picked):

N 9992  sum 4078.36  mean 0.408163  gmean 0.397218  std dev 0.118707  variance 0.0140915

     min 0.18
   1%ile 0.26
   5%ile 0.29
  25%ile 0.34
  median 0.38
  75%ile 0.46
  95%ile 0.59
  99%ile 0.73
     max 6.7

⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⡤⡔⠔⠖⠖⠶⣦⣄⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 6.128
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⡖⠋⠁⠀⠀⠀⠀⠀⠀⠀⠈⠙⠙⠲⠢⠤⠤⣤⣀⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠴⠒⠋⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠁⠉⠉⠉⠙⠓⠒⠒⠒⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
                               0.5

@dr2chase (Contributor)

Do you think you're going to want this as more than just a diagnostic tool?

rhysh commented Nov 17, 2016

@dr2chase are you asking me, or asking @aclements / @RLH ?

If the change can ensure that all loops are preemptible within some roughly-bounded time without adding too much overhead, I think it would make a good solution to #10958. I'm not in a position to judge the cost/benefit of that work.

The change provides a good reduction in one source of long STW pauses. It looks like there are still other factors that can lead to STW pauses of >100µs; I have some other programs that this change didn't help much, which I have not yet diagnosed.

I've updated to PS 4, which continues to work as expected. Could it be included in the releases guarded by the GOEXPERIMENT value, as framepointers were for Go 1.5 and 1.6?

@dr2chase (Contributor)

The intent is for this to land in go1.8 under GOEXPERIMENT, hopefully to make it into the first beta.

I'm looking into reducing the overhead to make it less performance-risky for people to use or try, because there's a worry that you may not be the only person who has this problem; it may just be that you're the person who's looking carefully.

@aclements (Member)

Since we've now confirmed that this is because of a lack of preemption points in loops, I'm going to close this as a dup of #10958 so the general discussion stays in one place. It's fine to keep discussing things specific to this application here.

If the change can ensure that all loops are preemptible within some roughly-bounded time without adding too much overhead, I think it would make a good solution to #10958.

The cost is definitely too high to turn on by default in its current form, but the current prototype is really an upper bound on the cost. There are many optimizations we can apply but haven't yet, so I suspect there's a lot of headroom. We'll probably just go down the list of optimizations until the cost is under 1% and then we'll stop. :) But some of the optimizations are fairly intrusive, so they wouldn't happen until the 1.9 time frame.

It looks like there are still other factors that can lead to STW pauses of >100µs; I have some other programs that this change didn't help much, which I have not yet diagnosed.

How much in excess of 100µs? I only tested up to 12 cores, so it's possible there are things remaining in mark termination that aren't perfectly scalable with the number of cores. I have a modified version that logs precise time stamps of every step of mark termination to stderr if you'd be interested in trying that with one of these applications. If it's not much more than 100µs, though, I'm not too worried.

Replying to some of your earlier questions:

It looks like Proc 31 is the last to stop running before the collection cycle is able to begin .. but Proc 35 looks like it never stops at all. How was the collection cycle able to run without stopping G92 on Proc 35?

Whichever goroutine triggered the GC cycle is responsible for the process of stopping the world, so it keeps running in some sense, though it's all in the runtime.

Comparing the "GC" and "Threads" rows, what is the runtime up to between the start of the "GC" row's "GC" span at 545.4ms and the beginning of the decline of "Threads" row's count of active threads at 545.7ms?

It's hard to say for sure, but there are a few things that happen between emitting the trace event for GC start and actually stopping the world, so I'm not surprised that there are 0.3 ms there, especially if there are a lot of Ps and goroutines.

@aclements does the runtime stop the world one P/M at a time in some order, or is each P/M able to learn independently that the world should be stopped?

The order is technically non-deterministic, but can appear to be sequential. Stopping the world sets a global flag that any P can observe when it passes through the scheduler, so any of them could stop at that point. It then attempts to forcibly preempt the running goroutine on each P in order so that it passes through the scheduler. This is a non-blocking operation (it just sets a flag on the g), but if goroutines are responding to preemption quickly, you'll see them descheduling roughly in P order. I don't think an unpreemptible P will do anything to prevent us from preempting later Ps.
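
A heavily simplified sketch of that sequence (the types and names below are invented for illustration; this is not the runtime's actual code):

// Invented types for illustration only.
type g struct{ preemptFlag bool }
type p struct{ runningG *g }

var (
    gcWaiting bool // any P passing through the scheduler parks when this is set
    allPs     []*p
)

// stopTheWorldSketch mirrors the sequence described above: publish a global
// "please stop" flag, then request preemption of whatever is running on each
// P. The request is non-blocking, so one unpreemptible loop doesn't delay
// flagging the remaining Ps; it only delays the point at which every P has
// actually stopped.
func stopTheWorldSketch() {
    gcWaiting = true
    for _, pp := range allPs {
        if gp := pp.runningG; gp != nil {
            gp.preemptFlag = true // noticed at the goroutine's next preemption point
        }
    }
    // ... wait for all Ps to park before sweep/mark termination work begins ...
}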

@gopherbot

CL https://golang.org/cl/33093 mentions this issue.

rhysh commented Nov 17, 2016

Gating via GOEXPERIMENT for Go 1.8 sounds great.

How much in excess of 100µs?

I have a program with a 95th percentile sweep termination time of 60ms, and 95th percentile mark termination time of around 30ms. This is measured with go version devel +1e3c57c Wed Nov 16 20:31:40 2016 +0000, with CL 33093 PS 4 cherry-picked on top, with GOEXPERIMENT=preemptibleloops. The hosts running that program have 8 cores and each run around 40 copies of the program. I haven't looked at execution traces for the program yet, I'll file a separate bug when I have more information. Here's an example gctrace line from the program:

gc 11249 @74577.976s 0%: 11+185+35 ms clock, 90+143/358/582+280 ms cpu, 112->114->56 MB, 115 MB goal, 8 P

If it's not much more than 100µs, though, I'm not too worried.

It seems like pauses of more than 1ms are worth hunting down—are pauses of a few hundred microseconds worth reporting as well? I suppose anything more than 100µs is worth reporting even if addressing it isn't a priority for a release or two.

Whichever goroutine triggered the GC cycle is responsible for the process of stopping the world, so it keeps running in some sense, though it's all in the runtime.

Ah! This makes a lot of sense, thank you. It means that the goroutine that looks like it keeps running is in fact the one that stopped (running user code) the earliest.

... so I'm not surprised that there are .3ms there, especially if there are a lot of Ps and goroutines.

I don't think an unpreemptible P will do anything to prevent us from preempting later Ps.

I see, the GC execution trace event is emitted early but sweep termination doesn't begin until later. When I look at the beginning of the decline in active Ps in the Threads row, the execution trace makes a lot more sense—although goroutines continue to be scheduled onto Ps after the GC row shows a collection is in progress, they are not scheduled once the Threads row begins to decline.

Thank you for explaining!

rhysh commented Nov 17, 2016

I filed #17969 for the other program that sometimes sees hundred-millisecond pauses.

@gopherbot

CL https://golang.org/cl/33910 mentions this issue.

gopherbot pushed a commit that referenced this issue Jan 9, 2017
Loop breaking with a counter.  Benchmarked (see comments),
eyeball checked for sanity on popular loops.  This code
ought to handle loops in general, and properly inserts phi
functions in cases where the earlier version might not have.

Includes test, plus modifications to test/run.go to deal with
timeout and killing looping test.  Tests broken by the addition
of extra code (branch frequency and live vars) for added
checks turn the check insertion off.

If GOEXPERIMENT=preemptibleloops, the compiler inserts reschedule
checks on every backedge of every reducible loop.  Alternately,
specifying GO_GCFLAGS=-d=ssa/insert_resched_checks/on will
enable it for a single compilation, but because the core Go
libraries contain some loops that may run long, this is less
likely to have the desired effect.

This is intended as a tool to help in the study and diagnosis
of GC and other latency problems, now that goal STW GC latency
is on the order of 100 microseconds or less.

Updates #17831.
Updates #10958.

Change-Id: I6206c163a5b0248e3f21eb4fc65f73a179e1f639
Reviewed-on: https://go-review.googlesource.com/33910
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@golang golang locked and limited conversation to collaborators Dec 2, 2017