Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database/sql: make maxBadConnRetries configurable #48309

Open
thameezb opened this issue Sep 10, 2021 · 8 comments
Open

database/sql: make maxBadConnRetries configurable #48309

thameezb opened this issue Sep 10, 2021 · 8 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@thameezb
Copy link

What version of Go are you using (go version)?

v1.16.2

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

n/a

What did you do?

The value of maxBadConnRetries is hardcoded to 2 -> https://github.com/golang/go/blob/master/src/database/sql/sql.go#L1511.
I was wondering if there is a specific reason for this, as there seems to be attempts to make this variable in the past which were abandoned -> #15608

Making it variable would be very much appreciated in our particular use-case.

Scenario

  • Running a k8s cluster with various services running as pods in the cluster. Each of these pods has a sidecar Health-Monitor container which simply pings certain dependencies on a regular interval (10s). Should the health-check fail, the dependency is marked as down and potentially the pod can be marked as not-ready.
  • One of our dependency types is SQL, and the go-sql lib is used.
  • There are times where we need to bring down our DB (during patching etc)
    • and during these times we see connections to the DB spike to triple in traffic (makes sense now that we have looked at the code)
    • this causes the DB instance to be hit hard and load to increase, during a period where load is already high due to an external process
    • this also means that the HM acts differently according to different scenarios (happy/sad)

Note that we cannot simply block connections to the entire DB instance as it needs to be contactable during this period.

@seankhliao seankhliao changed the title Make maxBadConnRetries configurable database/sql: make maxBadConnRetries configurable Sep 10, 2021
@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 10, 2021
@seankhliao
Copy link
Member

cc @bradfitz @kardianos

@thameezb
Copy link
Author

any update?

@kardianos
Copy link
Contributor

There are two ways to approach this, especially that we do a better job of marking connections as bad before they go into the pool to begin with.

  1. Add a configuration with the driver to not return driver.ErrBadConn
  2. Add a configuration with database/sql to not retry.

I'm open to (2), but I would prefer it not to introduce more locks and shared state. Want to suggest a change?

@sjmudd
Copy link

sjmudd commented Jan 25, 2023

I'd like to chip in as I recently opened go-sql-driver/mysql#1385. I think this comes down to a similar problem space.
I manage a system that manages MySQL infrastructure so is continually polling and checking health of a large number of MySQL servers. I recently noticed some driver: bad connection errors which provide no information to the caller on what the problem actually was and then saw that this is related to "transparent retries at the database/sql layer" which from an architectural point of view is undesirable if monitoring the heath of the backend database servers.

So I would like this setting to be configurable so I can set it to 1 and directly experience any failure.
I would also like the low level driver to be able to return a more detailed error so the caller can see this and maybe use Is(whateverLowLevelError, ErrBadConn) or similar if needed, to catch the current behaviour but also provide detail in a way that the application can use to record "database problems" and the type of issue seen. As it stands at least with the MySQL driver layer there are about 6 different error cases that trigger ErrBadConn to be returned so it's really hard to know exactly which cause of error happened. Also if I see this problem now it implies there was a similar problem, a retry took place and that failed too, so the underlying issue, whatever that might be is actually worse than shown.

I would expect normal applications to use the current behaviour and not touch any special configurations unless they specifically want to. However, I would like to be able to turn this off, and thus get a much better idea of issues seen when monitoring the health of my database servers. I would imagine others may have a similar issue which is partly where this issue comes from.

I also notice that sql.Stats provide some very simple statistics and one of the issues I notice is that we can't see how frequently connections are retried, so there's no visibility in the "stability" of the backend servers being talked to. Basically the current retry logic while good from the application perspective hides issues which we might want to know about.

Is there a chance that this could be corrected and you handle one of the 2 points above and also perhaps provide a global counter for deliberate retries or some non-logical errors which perhaps the low level driver can return to you?

It seems clear that co-ordination with the low level driver authors is needed so that things work as required. I'm not sure how such co-ordination is best handled, but I do see the low level driver layer can not change things that database/sql expects and in the end the application is expected to talk to database/sql and not try to use the lower level driver directly. So is there a more appropriate way to discuss such issues other than opening separate tickets in different places?

@kardianos
Copy link
Contributor

@sjmudd database/sql already uses isBadConn = errors.Is(err, driver.ErrBadConn) [1]. It is encouraged for database drivers to wrap BadConn.

[1] https://cs.opensource.google/go/go/+/refs/tags/go1.19.5:src/database/sql/sql.go;l=852

@sjmudd
Copy link

sjmudd commented Jan 25, 2023

Yes, and I would like it to be incorporated into https://github.com/go-sql-driver/mysql but believe there are some concerns about compatibility issues with the different versions of go that is supported.

I would like also to see exposed something like:

  • a counter of non-logical errors, perhaps like DBStats which I guess you could use, and maybe counted by the type of error, so the user can identify both the fact an error has occurred and also which types they were.
  • the errors could be handled by the low level drivers but that does not make much sense to me. It feels better that they can return them to database/sql and your package makes them visible in a standard way to the user.
  • e.g. something like DBErrors being a map[errorType] -> struct { Count int, LastError: error, RetriedCount int } or similar.

I believe that there was some concern about making maxBadConnRetries configurable due to potential locking issues. Is that still the case? If so you may not be keen to have to maintain and additional DBErrors structure. However, under normal circumstances the actual number of errors I'd expect to see of type driver.ErrBadConn would be rather small so I would expect locking issues to not be a major concern.

What I see now is about 1 in 100,000 connections failing though that's after the retry having been applied so I would expect the actual error count to be higher. Given I'm polling a large number of servers all the time this number is actually significant as some of these connections are actually to healthy servers which need to be fixed when another server fails. My primary interest is to be able to collect statistics from database/sql and optionally be able to turn off the retries. With that in place I can then get the detail necessary back from a modified The MySQL driver through database/sql to identify what specific conditions are generating these errors.

@sjmudd
Copy link

sjmudd commented Apr 10, 2023

Could I ask where maxBadConnRetries came from?

I know database/sql is backend agnostic layer, but I'm also aware of MySQL being notorious for "MySQL gone away errors" often triggered by the server disconnecting connections which were idle for too long.

So the reason for this setting or at least the attempt to reconnect if you get an error would seem to match that scenario. Perhaps other backends suffer from similar behaviour? This just hides the problem from the user "most of the time" which is good in some ways.

However, the lack of visibility as to how often this is taking place is a concern as it is hiding a reliability problem which perhaps needs addressing outside of the database/sql layer.

@sjmudd
Copy link

sjmudd commented Apr 10, 2023

Slightly related but of relevance is that MySQL had a patch provided by a colleague for this specific problem which now makes "updated clients" see an error message in such a case (e.g. adjust your MySQL setting of wait_timeout as you've been idle for too long...). I believe the go MySQL driver layer does not handle this (and I need to make bug report for that, official MySQL clients/libraries from Oracle do but they don't provide a go driver) and as things stand even if the error was reported up to the database/sql layer in the way it's done now it would trigger a reconnect to the database server and the error would never be seen.

So having a way to configure the retries, and having visibility on the problem when all things are in place would allow people monitoring errors more closely to see errors and then act upon them and thus have a better experience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants