database/sql: rate limit creation of new connections #57248

Open · pfreixes opened this issue Dec 11, 2022 · 9 comments
Labels: NeedsInvestigation, Performance

@pfreixes commented Dec 11, 2022

I would like to propose a small change to the connection pool implementation in database/sql. The change addresses an issue with the current implementation that can, in some situations, overwhelm the database when many connections are created at once; the goal is to protect the underlying database from a connection stampede.

There is a draft of the change that can be reviewed here [1]. It is just a draft and still needs proper test coverage, but before going further I would like to know whether there is any chance this proposal could be accepted.

Connection stampede issue

The current implementation creates as many new connections as needed, as long as there is still room for them, whenever new DB operations are handled and no free connections are available; see the code here [2]. This allows the database to be overwhelmed by having to handle many connections created concurrently, which also hurts the latency of the operations that are already in flight.
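To illustrate, here is a toy model of that behaviour (a paraphrase for illustration only, not the real database/sql source):

```go
package main

import "fmt"

// pool is a toy model of the connection-opening logic referenced in [2]:
// every queued operation may trigger a new connection, bounded only by maxOpen.
type pool struct {
	maxOpen int // 0 means unlimited
	numOpen int // connections currently open
	waiting int // operations queued waiting for a connection
}

// dialsTriggered mirrors how many new connections the current behaviour would
// start dialing at once.
func (p pool) dialsTriggered() int {
	n := p.waiting
	if p.maxOpen > 0 && n > p.maxOpen-p.numOpen {
		n = p.maxOpen - p.numOpen
	}
	return n
}

func main() {
	p := pool{maxOpen: 32, numOpen: 2, waiting: 40}
	fmt.Println(p.dialsTriggered()) // 30 concurrent dials fired at once: the stampede
}
```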

I have observed that this situation quickly drives the database CPU to 100%, impacting latency (average and P99); see the following screenshot to understand the impact of this situation.

The following screenshot shows a synthetic workload of 1500 ops/sec running a simple select value from table where key = key, returning at most 1 row per query. For that test the pool was configured with at most 32 open connections, leaving the max idle connections at the default value of 2, with 8 instances of a Go service accessing a Postgres DB.

[Screenshot: current_pool_2_max_idle]

This issue can happen at any time, not only during unexpected traffic spikes but also when traffic is "stable", since the number of outstanding operations can fluctuate for different reasons (GC pauses, etc.). In the previous example the traffic was sustained at 1500 ops/sec, yet it was unstable enough to trigger the undesired behaviour: many connections were created at once, and most of them were later removed because there were "enough" idle connections, leaving the DB in a partial outage the whole time.

A workaround is to configure enough idle connections beforehand, which circumvents the connection stampede because enough connections are already in place; however, it forces the system to run with a lot of spare connections, which are not a cheap resource for databases like Postgres.
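With today's API the workaround looks roughly like this (a sketch assuming "database/sql" is imported; the values are only illustrative):

```go
// configurePool applies the workaround: keep the idle limit close to the open
// limit so connections returned to the pool are kept instead of being closed
// and immediately re-dialed on the next spike.
func configurePool(db *sql.DB) {
	db.SetMaxOpenConns(32)
	db.SetMaxIdleConns(32) // the default of 2 makes the pool churn connections
}
```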

I repeated the same test with 6 and 12 max idle connections, and found that with 6 the DB was still overwhelmed, while with 12 the system was able to operate with an average P99 of 32 ms and with the CPU at the expected levels.

Avoiding the create connection stampede

I'm proposing a small change [1] to the current implementation that prevents creating multiple connections at once, instead creating connections one by one while there is still a need for them: operations are still queued and the system has not reached the maximum number of open connections.
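Extending the toy model above, the proposed rule would look roughly like this (a sketch of the idea in the draft [1], assuming a hypothetical cap of 1 in-flight open; it is not the exact patch):

```go
// dialsTriggeredProposed opens at most one new connection at a time, and only
// while operations are still queued and maxOpen has not been reached.
// inflight counts connection creations currently in progress; maxInflight is
// the hypothetical cap (1 in the draft [1]).
func (p pool) dialsTriggeredProposed(inflight, maxInflight int) int {
	if p.waiting == 0 || inflight >= maxInflight {
		return 0
	}
	if p.maxOpen > 0 && p.numOpen+inflight >= p.maxOpen {
		return 0
	}
	return 1
}
```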

The following screenshot shows how the same workload that previously overwhelmed the DB now lets the system keep working, still with a noticeable impact on latencies but able to keep serving the operations.

[Screenshot: new_pool_2_max_idle]

When the number of idle connections is increased to 6 (the previous example had 2), the system operates normally, with latency within our SLO (20 ms) and, most importantly, with the DB CPU usage under 40%. With the current code, even with 6 idle connections we were still overwhelming the DB.

[Screenshot: new_pool_6_max_idle]

One of the benefits of not allowing many connections to be created at once is that queued operations can quickly start using connections released back to the pool, whereas in the current implementation this does not happen while there is still room in the pool.

Other interesting changes

The adoption of this change could come together with other changes that could also be beneficial.

Configuring the maximum number of in-flight connection creation operations

The current proposal limits this to 1 in-flight connection creation [4], but it could be made configurable. Indeed, backwards compatibility with the current behaviour could be achieved by setting it to the maximum number of open connections, or to 0 (meaning unlimited) if no maximum is configured.

Configuring the minimum number of idle connections

While there is a configuration for the max idle connections, the current pool offers no way to configure it so that there are always some spare idle connections, proactively kept ready to be used in case of connection spikes. With the proposed implementation this new condition can "easily" be added where maybeOpenNewConnectionsDBLocked [5] is called; an application-level approximation is sketched below.
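As an application-level stand-in for that missing setting, the pool can be pre-warmed with today's API (a sketch; warmPool is a hypothetical helper and assumes SetMaxIdleConns is at least n so the warmed connections are kept):

```go
import (
	"context"
	"database/sql"
)

// warmPool checks out n connections (dialing new ones if needed) and then
// returns them all, leaving n idle connections ready before a spike arrives.
func warmPool(ctx context.Context, db *sql.DB, n int) error {
	conns := make([]*sql.Conn, 0, n)
	var dialErr error
	for i := 0; i < n; i++ {
		c, err := db.Conn(ctx)
		if err != nil {
			dialErr = err
			break
		}
		conns = append(conns, c)
	}
	for _, c := range conns {
		c.Close() // Close returns the connection to the idle pool
	}
	return dialErr
}
```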

Minimum time for considering a connection idle

The current implementation considers a connection idle from the moment it is released back to the pool, and if the number of connections considered free is larger than the max idle connections, they are immediately discarded. This does not play well with fluctuations over short periods of time. The connection pool could instead consider a connection idle only after some period without traffic (for example 60 seconds by default), so that unnecessary connections are evicted only after that period.

New metric for knowing how many connections have been opened

Today there is no such metric. For troubleshooting issues like this, or simply for understanding how many connections are being created at a given moment, this metric would definitely help. Today it can be approximated with the metric for closed connections, which is only a rough proxy.
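Until such a metric exists, a wrapper Connector can count successful dials itself (a sketch; how the counter is exported as a metric is left to the application):

```go
package connmetrics

import (
	"context"
	"database/sql/driver"
	"sync/atomic"
)

// CountingConnector wraps a driver.Connector and counts successful dials,
// approximating the missing "connections opened" metric.
type CountingConnector struct {
	driver.Connector              // embedded so Driver() is forwarded
	opened           atomic.Int64 // total connections opened so far
}

func (c *CountingConnector) Connect(ctx context.Context) (driver.Conn, error) {
	conn, err := c.Connector.Connect(ctx)
	if err == nil {
		c.opened.Add(1)
	}
	return conn, err
}

// Opened reports how many connections have been created by the pool.
func (c *CountingConnector) Opened() int64 { return c.opened.Load() }
```

The database handle would be built with sql.OpenDB(&connmetrics.CountingConnector{Connector: base}), so every dial goes through the wrapper.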

[1] pfreixes@506ae65
[2] https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1387
[3] https://pkg.go.dev/database/sql#DB.SetMaxIdleConns
[4] pfreixes@506ae65#diff-5a8509ddb343ed538034d07d446b921410102a666e7e9715f13137c0e8690dbaR779
[5] pfreixes@506ae65#diff-5a8509ddb343ed538034d07d446b921410102a666e7e9715f13137c0e8690dbaR1201

@gopherbot added this to the Proposal milestone Dec 11, 2022
@seankhliao changed the title from "proposal: affected/package: database/sql" to "database/sql: rate limit creation of new connections" Dec 11, 2022
@seankhliao added the Performance and NeedsInvestigation labels and removed the Proposal label Dec 11, 2022
@seankhliao modified the milestones: Proposal → Backlog Dec 11, 2022
@seankhliao (Member)

cc @bradfitz @kardianos

@kardianos (Contributor)

@pfreixes The preferred way to handle this is to wrap the Connector with one that rate limits the Connect method. Is this possible for you to implement?
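For reference, such a wrapper could look roughly like this (a sketch, assuming golang.org/x/time/rate; the limiter settings are only illustrative):

```go
package connlimit

import (
	"context"
	"database/sql/driver"

	"golang.org/x/time/rate"
)

// RateLimitedConnector wraps a driver.Connector so that new connections are
// dialed at a bounded rate; everything else about the pool stays the same.
type RateLimitedConnector struct {
	driver.Connector               // embedded so Driver() is forwarded
	Limiter          *rate.Limiter // e.g. rate.NewLimiter(5, 1): at most 5 dials/sec
}

func (c *RateLimitedConnector) Connect(ctx context.Context) (driver.Conn, error) {
	// Wait blocks until a token is available or ctx is canceled.
	if err := c.Limiter.Wait(ctx); err != nil {
		return nil, err
	}
	return c.Connector.Connect(ctx)
}
```

It would be wired in with sql.OpenDB(&connlimit.RateLimitedConnector{Connector: base, Limiter: rate.NewLimiter(5, 1)}).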

@pfreixes (Author) commented Dec 15, 2022

Yes, we can definitely use our own Connector to rate limit connection creation; this would solve the main issue presented.

However, there are other benefits we would not get by overriding the connector. For example, a request that found no free connection but saw that there was still room and asked for a new one could, while it waits, pick up a connection that is returned to the pool in the meantime; at the Connector level it has no chance to reuse it. The proposed solution applies the rate limit while still allowing the system to reuse connections that are put back into the pool.

Also notice that this is a situation anyone using this connection pooling implementation might run into, so I would not expect users to have to implement their own Connector to apply the proper throttling. Yes, a third-party package that everybody can use could be created outside the standard library, but it still feels odd to have to do that. I also still think that the throttling plus reusing connections returned to the pool is a good combination that needs to go together.

I'm not sure whether there is any way to move forward with this proposal without creating too much friction, or do you think the benefits of the proposal would not justify a change in the current implementation?

On the other hand, the other suggested improvements could also be added on top of the current implementation, so maybe they are worth considering regardless of the original discussion: a way to configure the minimum idle connections, the time before a connection is considered idle, and a counter for the number of connections opened.

WDYT?

@kardianos (Contributor)

Yes, we can definitely use our own Connector to rate limit connection creation; this would solve the main issue presented.

Great.

However, there are other benefits we would not get by overriding the connector. For example, a request that found no free connection but saw that there was still room and asked for a new one could, while it waits, pick up a connection that is returned to the pool in the meantime; at the Connector level it has no chance to reuse it. The proposed solution applies the rate limit while still allowing the system to reuse connections that are put back into the pool.

I don't understand what you are trying to communicate in this paragraph.

Yes, a third-party package that everybody can use could be created outside the standard library, but it still feels odd to have to do that.

I hear that this feels weird to you. It doesn't feel weird to me.

I'm not sure whether there is any way to move forward with this proposal without creating too much friction,

I don't understand what you are trying to communicate here. We agree that limiting connection opening can be done in the Connector.

What I need to proceed

Please explain what your desired outcome for interacting with database/sql is. Separate out different concerns.

@rittneje

I don't understand what you are trying to communicate in this paragraph.

@kardianos Suppose goroutine A needs a database connection, but all the ones in the pool are in use. So it asks the Connector for a new one. But (as per your suggestion), Connector.Connect itself is going to block in the middle via time.Sleep or similar in order to rate limit new connections. Meanwhile, goroutine B finishes what it was doing and returns its connection to the pool. Ideally goroutine A should take that connection instead of continuing to wait for no reason.

If database/sql were to cancel the context it passes to Connector.Connect when this happens then it would probably work. Based on a cursory evaluation it doesn't appear to do so. Another option is to allow Connect to return a channel or duration or something (perhaps wrapped in some error type) telling database/sql when to try again. Then database/sql can honor that while still accounting for things getting returned to the pool, but does not have to get into the details of how that specific delay was calculated.

@kardianos (Contributor)

@rittneje Thank you. However, this example assumes the max connections limit hasn't been reached yet. If you set it at some reasonable limit, goroutine A will just wait for B's connection to be returned and use that. If the limit hasn't been reached, then the most expedient thing is indeed to open a new connection. If you truly apply a rate limit in the Connector, then under normal operation there should be no sleep anyway. The only time the rate limit would kick in is to prevent a stampede, which is a very different situation from the one you presented in this example.

So while I believe I understand what you are saying, a rate limiter in that situation wouldn't actually take effect, because we are using a rate limit, not a fixed delay. Unless everyone needed a new connection, in which case the first few would be issued right away and the others would need to wait, and would keep waiting even as existing connections were returned. But in that case what I would want would NOT be a rate limit per se, but an adjustable limit on the number of active openers at a time. I wouldn't want to serialize it completely; when starting a new application I would like to open MANY new connections at the same time, even if that number is bounded. Further, I would want the number of open connection requests in flight to be a fraction of the existing connections relative to some max, so the closer the pool is to the max, the more likely it is to reuse a connection.

But...

Are you sure that the rate of new database connections is really your problem? I am inclined to think you are addressing the wrong thing: perhaps you are closing database connections too aggressively? In the charts you posted, the number of connections closed every minute is way too high: about 10 connections are open at a time, yet several thousand connections are closed per minute. Maybe I'm misreading those charts.

@pfreixes (Author) commented Dec 15, 2022

@rittneje thanks for helping me clarify; an example is better than a thousand words. Regarding this:

If database/sql were to cancel the context it passes to Connector.Connect when this happens then it would probably work. Based on a cursory evaluation it doesn't appear to do so. Another option is to allow Connect to return a channel or duration or something (perhaps wrapped in some error type) telling database/sql when to try again. Then database/sql can honor that while still accounting for things getting returned to the pool, but does not have to get into the details of how that specific delay was calculated.

I'm not sure I understand this properly, but I have the feeling it would be better to keep the throttling complexity inside the connection pool, as the POC I shared does [1]. That way the Connector can remain as it is today, without being burdened with this complexity.

@kardianos regarding this

Are you sure that the rate of new database connections is really your problem? I am inclined to think you are addressing the wrong thing: perhaps you are closing database connections too aggressively? In the charts you posted, the number of connections closed every minute is way too high: about 10 connections are open at a time, yet several thousand connections are closed per minute. Maybe I'm misreading those charts.

That's a different problem, which also happens. I'm not sure whether you read my first post, or maybe I did not explain myself well, so let me try again.

As part of the connection stampede problem, which is still an issue by itself, the implementation today is too eager to close connections: any connection returned to the pool that would push the count of free connections above MaxIdleConns is automatically removed. To avoid this we are forced to configure MaxIdleConns ~ MaxOpenConns, which IMO is just another workaround.

As part of the proposal for solving the connection stampede issue I am also proposing other changes, one of which is making the closing of idle connections less eager. This would be done by allowing the minimum time before a connection is considered idle to be configured (for example 60 seconds by default); until that time has passed, the connection can still be reused as a free connection and is not considered idle. Together with this, the pool could also avoid removing all idle connections in a single shot. Why? When a traffic spike happens, all connections tend to be created within a short window of time, so without any safeguard they will all expire together. With a minimum idle time plus not removing all of them at once, the connection pool has a better chance of coping with traffic that varies over short periods of time.

Another option I'm proposing is adding a min idle connections option. This would give the pool some headroom for unexpected traffic: for example, with a max of 32 connections, you could keep a minimum idle of, say, 10%, that is 3-4 connections just sitting there in case new traffic comes in.

The overall proposal is multi-factorial and is composed of the following:

  • Avoid the connection stampede issue (what is in the POC)
  • Avoid having the system purge all idle connections immediately
  • Allow configuring a minimum number of idle connections

And regarding this

So while I believe I understand what you are saying, a rate limiter in that situation wouldn't actually take effect, because we are using a rate limit, not a fixed delay. Unless everyone needed a new connection, in which case the first few would be issued right away and the others would need to wait, and would keep waiting even as existing connections were returned. But in that case what I would want would NOT be a rate limit per se, but an adjustable limit on the number of active openers at a time. I wouldn't want to serialize it completely; when starting a new application I would like to open MANY new connections at the same time, even if that number is bounded. Further, I would want the number of open connection requests in flight to be a fraction of the existing connections relative to some max, so the closer the pool is to the max, the more likely it is to reuse a connection.

In my original post I already mention that the current constant [2] of at most 1 in-flight connection creation could be turned into a config parameter that lets the user choose how many connection creations are allowed in flight, where 2 would be a totally acceptable value.

[1] pfreixes@506ae65
[2] pfreixes@506ae65#diff-5a8509ddb343ed538034d07d446b921410102a666e7e9715f13137c0e8690dbaR779

@kardianos (Contributor)

To rephrase, you are proposing:

  1. Limit the number of in-flight database connections that can be initiated concurrently.
  2. Add a minimum time duration to wait before a connection is considered idle (for the connection closer).
  3. Add a minimum number of idle connections parameter, to automatically open idle connections in the background as more connections are used.

Is this correct?

@pfreixes (Author)

yes, good summary @kardianos
