
runtime: performance degradation in go 1.12 #36521

Closed
interviewQ opened this issue Jan 13, 2020 · 25 comments
Labels
FrozenDueToAge, NeedsInvestigation, WaitingForInfo
Comments

@interviewQ

What version of Go are you using (go version)?

go 1.11.12, planning to upgrade to go 1.12.15

Does this issue reproduce with the latest release?

Yes, with go 1.12.15

What operating system and processor architecture are you using (go env)?

Ubuntu 18.04

What did you do?

Since go 1.11 is about to go EOL, we are planning to upgrade to the next version. I attempted to upgrade to go 1.12.15, but I am seeing a 15-20% performance degradation. A similar result was seen with go 1.13.6.

What did you expect to see?

I was hoping to see a performance improvement with a newer go compiler.

What did you see instead?

Instead of a performance improvement, I saw a performance degradation.

Is there anything in newer go versions (1.12 and beyond) that could explain the above performance degradation? I read online that the scheduler was modified to allow preemption of long-running goroutines. While I do not know if this could cause the performance issue, is there any way to revert this scheduler behavior just for test purposes? Any other suggestions are welcome.

@ALTree
Member

ALTree commented Jan 13, 2020

Hi,

thanks for reporting this, but your description of the issue is far too vague to be actionable.

Please provide a runnable, self-contained benchmark that demonstrates the problem on the latest 1.13 release, together with some benchmarking results on the platform you are running on.

@ALTree added the WaitingForInfo label Jan 13, 2020
@ALTree changed the title from "Performance degradation in go 1.12" to "runtime: performance degradation in go 1.12" Jan 13, 2020
@interviewQ
Author

Thanks for the response. I do see the performance issue with 1.13 also. Give me a few days to come up with sample code that demonstrates the issue.

@interviewQ
Author

interviewQ commented Feb 4, 2020

Sorry for the delay, it took me a while to narrow down the problem code path. Attached is a sample benchmark test. I am running this test on my Mac and here are the results -

1.11.13 -

goos: darwin
goarch: amd64
BenchmarkBlockSplit-12 1 2506669772 ns/op
PASS

1.13.7 -

goos: darwin
goarch: amd64
BenchmarkBlockSplit-12 1 2945086182 ns/op
PASS

So, 2.5 sec -> 2.9 sec equates to a 16% drop in performance.
perf.txt

@randall77
Contributor

Here's a simple reproducer. I can reproduce on Linux.

package main

import (
	"fmt"
	"sync"
	"time"
)

var sink []byte // global sink keeps the allocations from being optimized away

func main() {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			for i := 0; i < 16000; i++ {
				sink = make([]byte, 64*1024)
			}
			wg.Done()
		}()
	}
	wg.Wait()
	fmt.Printf("time: %f\n", time.Since(start).Seconds())
}

This test is really allocation heavy. It's going to depend on the precise details of how allocation is done. I get these times:

go1.11.13: 0.689 sec
go1.12.16: 0.973 sec
go1.13.7: 1.151 sec
tip: 1.099 sec

I think most of the difference has to do with how big the heap grows. Here's my eyeball estimate of the maximum RSS (adding lots of iterations to the innermost loop so I can watch top while it runs):

go1.11.13: 310MB
go1.12.16: 202MB
go1.13.7: 158MB
tip: 98MB

This test is stressing the allocator so much that it goes way beyond 200% of live heap data, which should only be 640KB or so. You'll see that we've made steady improvements on how much over we go. Of course, that means that we'll run garbage collections more often, which will hurt the run time.

I'm not sure there's any bug here. It just seems like a different speed/memory tradeoff. To actually demonstrate a regression, I think we'd need to hold heap size constant somehow.
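For anyone who wants to see the heap behavior directly, here's a minimal sketch (not part of the reproducer above) that appends a runtime.ReadMemStats dump after the workload; the fields printed are just the ones relevant to this discussion:

package main

import (
	"fmt"
	"runtime"
	"sync"
)

var sink []byte

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 16000; i++ {
				sink = make([]byte, 64*1024)
			}
		}()
	}
	wg.Wait()

	// Dump a few allocator/GC statistics accumulated during the run.
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("heap obtained from OS: %d MiB\n", ms.HeapSys>>20)
	fmt.Printf("total allocated:       %d MiB\n", ms.TotalAlloc>>20)
	fmt.Printf("completed GC cycles:   %d\n", ms.NumGC)
}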

@mknyszek

@interviewQ
Author

Our app, which is showing the performance degradation, is also allocation heavy, which is presumably why we are seeing this with the newer go runtime. Having said that, over the last few weeks I have made lots of changes to improve memory management and did see improvements that reduce the performance drop. From what you are saying above, it seems memory-allocation-heavy apps will have to take a hit when they upgrade to a newer go runtime.

In your test above, the time taken increased from 0.689 sec to 1.151 sec, which is roughly a 65% increase. Is this big drop in performance solely attributable to running GC more often?

@mknyszek
Contributor

mknyszek commented Feb 4, 2020

@interviewQ Some background: there was a known regression in the slow path of the allocator in Go 1.12, which was intentional in order to support returning memory to the OS more eagerly. For the vast majority of programs, this wasn't noticed. The regression only showed up in microbenchmarks, not in any real production services or applications (as far as we could see/find). You need to be really allocation bound to notice this, as @randall77's benchmark shows. Note that @randall77's reproducer makes many 64 KiB allocations without ever touching that memory (aside from zeroing), which would already meaningfully reduce the load on the allocator.

However, it's also been known that allocating heavily and in parallel has had serious lock contention issues since at least Go 1.11. This was noticed when we worked on the regression in Go 1.12. In Go 1.14 we worked to fix these issues (#35112) and it went reasonably well. I'm fairly confident that the allocator is now faster, so this seems to me like it's the change in the default trade-off in these "way past the heap goal" cases as @randall77 says.

With that being said, @interviewQ could you:

  1. Check if using the tip of the master branch for Go results in less memory used for your application.
  2. If so, try cranking up GOGC to the point where you're using the same amount of memory as before, then measure CPU performance again.

Hopefully the net effect will be that we made performance better. :) If not, we can move on from there.

@interviewQ
Author

interviewQ commented Feb 4, 2020

First of all thanks a lot for these quick replies and very useful info.

I had tested with 1.14beta1 in December and that still showed the performance issue.
Did the fix for #35112 go in after 1.14beta1 was released? I was planning on retesting with 1.14 when it gets released (it was due on Feb 1). However, I could try the tip of the master branch.

In our app, we have 1000 goroutines, all allocating memory in parallel, so from what you are saying above "allocating heavily and in parallel" is what applies to our app.

Using sync.Pool is something I have thought about, but it seems non-trivial since I will need to figure out when to put the buffer back into the pool. It is not impossible, just hard to figure out with sync.Pool.
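For reference, the basic pattern I have in mind looks roughly like this (a minimal sketch with a hypothetical process function; the hard part is knowing when it is actually safe to call Put):

package main

import "sync"

// bufPool hands out reusable 64 KiB buffers; New runs only when the pool is empty.
var bufPool = sync.Pool{
	New: func() interface{} {
		return make([]byte, 64*1024)
	},
}

// process is a hypothetical unit of work that needs a scratch buffer.
func process() {
	buf := bufPool.Get().([]byte)
	defer bufPool.Put(buf) // safe here only because buf does not outlive this call
	// ... use buf ...
	_ = buf
}

func main() {
	process()
}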

What if I have one goroutine whose only job is to allocate memory? These 1000 goroutines can talk to this "allocator" goroutine (via channels). That way these huge allocations happen from only one goroutine and contention is eliminated. Please let me know if this makes sense. Also, I come from the C++ world and am new to Go, so please pardon my ignorance.

@mknyszek

@mknyszek
Contributor

mknyszek commented Feb 4, 2020

First of all thanks a lot for these quick replies and very useful info.

I had tested with 1.14beta1 in December and that still showed the performance issue.
Did the fix for #35112 go in after 1.14beta1 was released? I was planning on retesting with 1.14 when it gets released (it was due on Feb 1). However, I could try the tip of the master branch.

There may be a regression in the default configuration, but have you looked at memory use like I mentioned? If it went down, you could increase GOGC from the default until your memory use is the same as before (in the steady-state) and see if it performs better? Since you mentioned you're relatively new to Go, check out the comment at the top of https://golang.org/pkg/runtime for an explanation of GOGC, and also see https://golang.org/pkg/runtime/debug/#SetGCPercent for more information.
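For example, a minimal sketch of the in-code form (the value 300 here is just a placeholder to tune; it corresponds to running with GOGC=300):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// With GOGC=300 the heap is allowed to grow to roughly 4x the live set
	// between collections, instead of the default 2x (GOGC=100), trading
	// memory for fewer GC cycles.
	old := debug.SetGCPercent(300)
	fmt.Println("previous GOGC setting:", old)

	// ... rest of the application ...
}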

In our app, we have 1000 goroutines, all allocating memory in parallel, so from what you are saying above "allocating heavily and in parallel" is what applies to our app.

The number of goroutines doesn't really matter since that's just concurrency. The GOMAXPROCS value is your actual level of parallelism (up to the number of independent CPU cores on your machine). What is GOMAXPROCS for your application?
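As a quick sanity check, these values can be printed at startup (a small sketch, for reference):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) queries the current setting without changing it.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
	// NumCPU reports the number of logical CPUs usable by this process.
	fmt.Println("NumCPU:", runtime.NumCPU())
	// NumGoroutine is concurrency, not parallelism.
	fmt.Println("goroutines:", runtime.NumGoroutine())
}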

Using sync.Pool is something I have thought about, but it seems non-trivial since I will need to figure out when to put the buffer back into the pool. It is not impossible, just hard to figure out with sync.Pool.

What if I have one goroutine whose only job is to allocate memory? These 1000 goroutines can talk to this "allocator" goroutine (via channels). That way these huge allocations happen from only one goroutine and contention is eliminated. Please let me know if this makes sense. Also, I come from the C++ world and am new to Go, so please pardon my ignorance.

Contention isn't really eliminated; you're just moving it from the allocator to the channels in this case. Before trying to restructure your code for this, please give my suggestion above a try. You might just have to increase GOGC to get the same level of CPU performance for the same memory usage.

@interviewQ
Author

GOMAXPROCS = number of CPU cores. Our app runs on EC2 instances, so it depends on which instance we are running on. In my test GOMAXPROCS = 72. I will try out your suggestions and will get back to you (most likely tomorrow).

@interviewQ
Author

interviewQ commented Feb 4, 2020

I ran tests with various go versions and checked RSS usage.

1.11.13 - 3.3 GB
1.13.7 - 3.4 GB
1.14 (tip of master) - 4.6 GB

Actually, RSS increased with 1.14; because of this I did not do the GOGC adjustment. On this note, I see mention of a SetMaxHeap function for the runtime (https://blog.golang.org/ismmkeynote). Is this something that would be included in future versions of Go?

Relative to 1.11.13, performance-wise I see a 25% drop with 1.13.7 and a 12% drop with 1.14.

By reducing heap allocation in the app I do see a performance improvement, so it is clear the allocator is the bottleneck.

@mknyszek

@mknyszek
Contributor

mknyszek commented Feb 5, 2020

Just out of curiosity, how did you measure RSS?

I just want to be very precise on this because the virtual memory usage of Go increased significantly with 1.14 (around 600 MiB, which admittedly doesn't account for everything you're seeing), but most of that memory is not mapped in/committed.

Thanks for the quick turnaround on this and for your cooperation. 72 cores is not a level of parallelism I've personally ever tested; it's possible we have some scalability problems at that level that we haven't seen before.

As another experiment, can you try setting GOMAXPROCS to 48 or so? There are other things I'd be interested in looking at as well, but I'll wait for your reply first.

@mknyszek
Contributor

mknyszek commented Feb 5, 2020

@interviewQ Also, if you'd be willing to share a GC trace, that would give us a lot more insight into what the runtime is doing. You can collect one by running your application with GODEBUG=gctrace=1 set and capturing STDERR.

@interviewQ
Author

interviewQ commented Feb 5, 2020

@mknyszek
I measured RSS by looking at "top" output; did you mean something else when you mentioned the memory usage of the application? I had collected a gctrace but did not save it. I will run the test again and give you the output. I am planning to run the following tests -

  • GOMAXPROCS=72, gctrace=1, go 1.11.13
  • GOMAXPROCS=72, gctrace=1, go 1.13.7
  • GOMAXPROCS=48, gctrace=1, go 1.11.13
  • GOMAXPROCS=48, gctrace=1, go 1.13.7

Do you also need results from 1.14 (tip of the master branch)?

@mknyszek
Contributor

mknyszek commented Feb 5, 2020

@interviewQ I did not mean anything else, that's exactly it. Thank you for confirming.

Running those sounds great to me. If you have the time/resources to try tip of the master branch as well that would be very helpful too.

@interviewQ
Author

interviewQ commented Feb 6, 2020

@mknyszek
Finished tests and am attaching results. I saw go1.14rc1 was released yesterday, so tested with that.

Keeping 1.11.13 as the baseline, with GOMAXPROCS=72, I see a huge perf drop (25% or so) with 1.13.7. There is a perf drop with 1.14rc1 also, but it is much smaller (around 5%).

Keeping 1.11.13 as the baseline, with GOMAXPROCS=48, I see a huge perf drop (20% or so) with 1.13.7. With 1.14rc1, perf seems the same as with 1.11.13.

Comparing GOMAXPROCS=72 with GOMAXPROCS=48, we are reducing CPU resources by a third, but perf does not drop that much.
1.11.13 - perf drops around 7%
1.13.7 - perf remains the same
1.14rc1 - perf drops around 3%

The app in question is CPU bound, there is no real i/o to slow it down. It is also memory alloc heavy.

In the attached logs, if you look at the runs for GOMAXPROCS=48, you will see they start with 72 Ps but after half a second or so change to 48 Ps. I am changing this programmatically by calling runtime.GOMAXPROCS, roughly as sketched below. The real test starts only after this setting has been made.
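A minimal sketch of that call (the value 48 is just the cap used for this particular run):

package main

import "runtime"

func main() {
	// Cap the number of Ps at 48 for this run; the previous setting is returned.
	prev := runtime.GOMAXPROCS(48)
	_ = prev

	// ... the real test starts only after this point ...
}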

Archive.zip

@interviewQ
Author

If you have any code changes in the Go runtime, I can test them with our app and give them a whirl.

@interviewQ
Author

@mknyszek
I ran tests on an EC2 instance with a 36-core CPU. With this, I see similar performance with go 1.11.13 vs 1.13.7. Go 1.14rc1 is actually slightly faster than go 1.11.13. So, the performance issue I am seeing only shows up on CPUs with a large number of cores. This points to the lock contention issue in the memory allocator that you mention above. I do see this has improved considerably in Go 1.14, but it is still slower than Go 1.11 (only with a large number of cores). Is there any way to improve this any further in Go 1.14?

@mknyszek
Contributor

@mknyszek
I ran tests on an EC2 instance with a 36-core CPU. With this, I see similar performance with go 1.11.13 vs 1.13.7. Go 1.14rc1 is actually slightly faster than go 1.11.13. So, the performance issue I am seeing only shows up on CPUs with a large number of cores. This points to the lock contention issue in the memory allocator that you mention above. I do see this has improved considerably in Go 1.14, but it is still slower than Go 1.11 (only with a large number of cores). Is there any way to improve this any further in Go 1.14?

Thinking about this more, it doesn't make sense that, if Go 1.11 collapses at 48 cores, it would do better than Go 1.14 at 72 (in terms of allocations). One case I can think of is that the Go 1.14 allocator just completely breaks down at 72 cores, doing worse than Go 1.11.

So I set up a 72-core VM and ran some allocator scalability microbenchmarks. These show that the 1.14 allocator scales much better than Go 1.11: around 1.5-2x throughput improvement at 72 cores for a range of allocation sizes (anywhere from 1 KiB to 64 KiB).
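(The microbenchmarks themselves aren't attached here; a sketch of their general shape, assuming a fixed 64 KiB allocation size, would be something like the following.)

package allocbench

import (
	"sync/atomic"
	"testing"
)

// sink keeps each allocation reachable so the compiler cannot elide it;
// atomic.Value avoids introducing a data race between the parallel workers.
var sink atomic.Value

func BenchmarkParallelAlloc64K(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			sink.Store(make([]byte, 64*1024))
		}
	})
}

Running it with go test -bench=ParallelAlloc -cpu=36,48,72 repeats the benchmark once per GOMAXPROCS value, which is how the scaling comparison above was framed.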

I think that perhaps the scalability of another part of the runtime got worse in Go 1.12, and that perhaps the allocator scalability improvements are carrying things a bit. I don't know what that part is, but now that I have a 72-core VM I can experiment more deeply.

I pored over the GC traces for Go 1.11 and Go 1.14 for several hours, doing a bunch of aggregate analysis, and things have certainly changed a lot (e.g. much more time spent in assists in Go 1.14 vs. older versions; this is not totally unexpected), but nothing stands out as "this is obviously the cause" to me. Though, since I don't think this relates to #35112 anymore, I'm going to take a closer look at the Go 1.13 GC traces and see what I can glean from them.

@networkimprov

@gopherbot remove WaitingForInfo

@gopherbot removed the WaitingForInfo label Feb 10, 2020
@interviewQ
Author

So I set up a 72-core VM and ran some allocator scalability microbenchmarks. These show that the 1.14 allocator scales much better than Go 1.11: around 1.5-2x throughput improvement at 72 cores for a range of allocation sizes (anywhere from 1 KiB to 64 KiB).

@mknyszek
In the app I am testing, we have several large block allocations. A lot of the allocations are fixed 64K blocks. Other allocations are for blocks sized 50K-120K. I just thought I would provide this info. Maybe for larger block allocations, done in parallel with a large number of cores (72), we have hit some scalability limit with go 1.14 vs go 1.11.

@mknyszek
Contributor

@interviewQ Right, I figured based on the original benchmark you provided, which is why I explicitly tried out 64 KiB allocations. I can try out larger ones, but up to 128 KiB they all have the potential to be satisfied out of the page cache, so the scalability of the allocator should be fine.

@aclements suggested to me that if there is some scalability problem in the runtime here, and your application is indeed CPU-bound, we should be able to figure out what it is (or get a strong hint) by just looking at a differential profile for the same version of Go but for different GOMAXPROCS. So for example, for Go 1.14, if you could collect a pprof CPU profile of your application running with GOMAXPROCS=36 and GOMAXPROCS=48 and GOMAXPROCS=72, you can look at the diff in pprof and see if anything is growing. If anything is, that will point heavily to whatever the bottleneck is.

You can collect a profile by using https://golang.org/pkg/runtime/pprof/ and then running go tool pprof --diff_base=low_count.cpu high_count.cpu then typing the command "top 20" in the interactive input.
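For reference, a minimal sketch of wiring that up with runtime/pprof (the file name cpu.out is just an example; use a different name per GOMAXPROCS run):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Profile the whole run; write the result to a file that can later be
	// fed to go tool pprof.
	f, err := os.Create("cpu.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... run the CPU-bound workload here ...
}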

@dmitshur added the NeedsInvestigation label Feb 12, 2020
@dmitshur added this to the Go1.14 milestone Feb 12, 2020
@dmitshur
Contributor

It's not clear to me yet if this should be a release blocker for 1.14, but I'll tentatively add it for visibility so we can discuss it in a release meeting.

@dmitshur
Contributor

We've looked over this, and it should not be a release blocker for Go 1.14, because it's not a regression that is specific to 1.14 (it affects previous releases too). Removing release-blocker.

@dmitshur modified the milestones: Go1.14, Go1.15 Feb 14, 2020
@mknyszek added the WaitingForInfo label Mar 12, 2020
@mknyszek self-assigned this Mar 12, 2020
@gopherbot

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@networkimprov

@interviewQ I think this was waiting on pprof results from you, described in #36521 (comment).

@golang locked and limited conversation to collaborators Apr 12, 2021