proposal: database/sql: use a stack-based instead of a queue-based approach for connection pool #43173

agnivade · 2020-12-14T15:21:47Z

What version of Go are you using (`go version`)?

$ go version
go version devel +f5978a0958 Tue Dec 1 05:11:44 2020 +0000 linux/amd64

Problem statement

Currently, the database/sql implementation of connection pool uses a queue based approach where it pops a connection from the front of the freeConn slice while using it, and then adds it at the end of the slice when putting it back to the pool.

There are some tradeoffs with this approach. The current approach basically does a round-robin of trying to use all connections in the pool, so a single connection never gets a chance to remain idle for a long time. However, if we always push and pop from the top of the slice, then we always work with the recent connections, allowing the older connections to expire and get reaped.

The new approach allows us to use the set of connections more efficiently, allowing unnecessary connections to expire faster.

Analysis

The main tradeoff that I see is that the old approach load balances queries amongst all connections. So if the user has a very good estimate of the steady-state number of connections required for the app, then the approach does a good job of using all the connections, and avoids the connection creation overhead.

However, if the application can have temporary spikes, or we don't know up-front what can be the max number of connections, and we want to keep some amount of buffer space in case of spikes, then a small spike can use upto max connections, and then it will never go down, until it hits the ConnMaxLifetime. So the application would be potentially wasting a lot of connections, even if it can do with a lot less.

Another advantage of the stack-based approach is that since we would always be operating from the top of the stack, in connectionCleanerRunLocked, the slice basically remains sorted by c.returnedAt and c.createdAt. So we can just iterate from the back of the slice and stop when we have passed expiredSince. This locks the cleaner for a shorter duration and scales better, essentially reducing the lock time complexity from O(n) to O(k).

Real-life necessity and demo

The need for this came up when I was investigating why the Mattermost app's DB connection usage does not come down when the number of queries have come down. Mattermost is a typical chat application where users will login in the morning, do some work throughout the day, and then logoff.

This effectively means that there will be a connection spike at the start of the day, and then it will wear off. But with the current implementation, the issue is that, even when the RPS drop off, the connection count doesn't drop until the ConnMaxLifetime hits, which is a significantly larger time than the idle timeout.

This a more acute problem in a multi-tenant database where we want to pack as many customers as possible in a single DB.

The following graph shows good improvement when I made the change, and ran a load test again. (The change is essentially a one-line change which pushes to the front of the slice instead of appending to the back).

Old: We can see that the connections doesn't drop off until ConnMaxLifetime.

New: We can see that the connections nicely goes down as the idle timeout hits.

Wondering if there has been any thoughts given to this, and using a queue-based approach was a deliberate design decision?

/cc @kardianos

The text was updated successfully, but these errors were encountered:

kardianos · 2020-12-14T16:04:50Z

@agnivade I'm not aware of any initial design docs on this aspect of database/sql. It is my impression when Brad created the first version of this his intent was was to make the simplest implementation.

How large is the impact on the code base?

In a v2, I would really like to separate out the pool implementation, though have a default pool.

agnivade · 2020-12-14T16:41:32Z

How large is the impact on the code base?

My MVP implementation is literally a one-line change. And with the connectionCleanerRunLocked improvements, it should be a few more lines.

I don't expect it to be more than a 20 line change. Unless there are other things which I am missing at the moment.

CAFxX · 2020-12-23T06:05:39Z

But with the current implementation, the issue is that, even when the RPS drop off, the connection count doesn't drop until the ConnMaxLifetime hits, which is a significantly larger time than the idle timeout.

Isn't https://golang.org/pkg/database/sql/#DB.SetConnMaxIdleTime designed expressly for this use case?

For the record I think it would be a grave mistake to give up the load balancing property.

agnivade · 2020-12-23T06:22:33Z

As I mentioned above, due to the round-robin property of choosing connections, a connection never really gets a chance to be idle, and hence the idle timeout never hits. Therefore, the connection doesn't drop until ConnMaxLifetime hits.

In the graphs that I attached, a maxIdleTime of 5 minutes was already set, but still as you can see, nothing happens.

It is this problem that the proposal tries to improve. This is more acutely felt in a multi-tenant database, where connections are at a premium.

The load balancing property is indeed something that breaks with this proposal. But it's not a either-or choice. Choosing an appropriate idletime and lifetime will still allow an administrator to load balance.

If the admin wants to spread the load amongst a large number of connections, they can set a large idle time to handle spikes better. But if number of connections is scarce, then setting a low idle time should reduce the number. Currently, as things are, reducing the idle time effectively is a noop in steady state usage.

It's a matter of choosing the right tradeoff of course. My proposal is that this change allows more a more intuitive implementation of the connection pool that the user expects, and which can be tuned according to their needs.

ulikunitz · 2020-12-23T07:11:36Z

Round-robin works well for load-balanced database clusters. Idle time control is not sufficient to ensure load distribution, because the stack approach causes the nodes with connections at the top of the stack to get the majority of the queries. What you would need is a group of stacks where load is distributed round-robin, with the group size being equal to a multiple of the number of nodes in the cluster.

CAFxX · 2020-12-23T10:53:56Z

@agnivade sorry, you're right. I missed that bit. I understand the problem now.

Even with that, removing the load balancing property is still something I strongly wish we will not do. It's going to be a surprising change, that will likely cause pain to users with certain usage patterns (full disclosure: including us).

I would be fine though making it configurable so that people can opt-in to the new behavior if needed. Or, even better, do what @ulikunitz is suggesting (with the group size being set by default to the number of endpoints that the address resolves to).

dolmen · 2021-01-12T11:48:30Z

Issue #31708 (database/sql: connection pool was originally FIFO, is now random, but should be LIFO) looks related.

agnivade · 2021-11-04T04:21:18Z

@stevenh - It looks like #39471 solves this exact problem. Could you confirm please?

stevenh · 2021-11-04T22:09:48Z

Yes that solves the problem described by ensuring that connections can be cleared down as soon as possible.

agnivade · 2021-11-05T04:31:48Z

Perfect, thanks @stevenh !

rsc · 2021-12-01T20:51:51Z

This proposal has been declined as retracted.
— rsc for the proposal review group

gopherbot added this to the Proposal milestone Dec 14, 2020

gopherbot added the Proposal label Dec 14, 2020

ianlancetaylor added this to Incoming in Proposals (old) Dec 14, 2020

agnivade mentioned this issue Oct 3, 2021

database/sql: idle connections are not cleaned up properly #39471

Closed

agnivade closed this as completed Nov 5, 2021

rsc moved this from Incoming to Declined in Proposals (old) Dec 1, 2021

golang locked and limited conversation to collaborators Dec 1, 2022

gopherbot added the FrozenDueToAge label Dec 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: database/sql: use a stack-based instead of a queue-based approach for connection pool #43173

proposal: database/sql: use a stack-based instead of a queue-based approach for connection pool #43173

agnivade commented Dec 14, 2020

kardianos commented Dec 14, 2020

agnivade commented Dec 14, 2020

CAFxX commented Dec 23, 2020 •

edited

agnivade commented Dec 23, 2020

ulikunitz commented Dec 23, 2020 •

edited

CAFxX commented Dec 23, 2020 •

edited

dolmen commented Jan 12, 2021 •

edited

agnivade commented Nov 4, 2021

stevenh commented Nov 4, 2021

agnivade commented Nov 5, 2021

rsc commented Dec 1, 2021

proposal: database/sql: use a stack-based instead of a queue-based approach for connection pool #43173

proposal: database/sql: use a stack-based instead of a queue-based approach for connection pool #43173

Comments

agnivade commented Dec 14, 2020

What version of Go are you using (go version)?

Problem statement

Analysis

Real-life necessity and demo

kardianos commented Dec 14, 2020

agnivade commented Dec 14, 2020

CAFxX commented Dec 23, 2020 • edited

agnivade commented Dec 23, 2020

ulikunitz commented Dec 23, 2020 • edited

CAFxX commented Dec 23, 2020 • edited

dolmen commented Jan 12, 2021 • edited

agnivade commented Nov 4, 2021

stevenh commented Nov 4, 2021

agnivade commented Nov 5, 2021

rsc commented Dec 1, 2021

What version of Go are you using (`go version`)?

CAFxX commented Dec 23, 2020 •

edited

ulikunitz commented Dec 23, 2020 •

edited

CAFxX commented Dec 23, 2020 •

edited

dolmen commented Jan 12, 2021 •

edited