database/sql: retry logic is unsafe #11978

jackc · 2015-08-01T16:08:58Z

What version of Go are you using (go version)?

jack@edi:~$ go version
go version go1.5beta3 linux/amd64

What operating system and processor architecture are you using?

jack@edi:~$ uname -a
Linux edi 3.13.0-61-generic #100-Ubuntu SMP Wed Jul 29 11:21:34 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

What did you do?

Execute a query with database/sql over an unreliable connection. For example:

update t set n=n+1;

What did you expect to see?

An error message that means the query failed and n is unchanged, or no error meaning the query executed exactly one time and n has been incremented by 1.

What did you see instead?

The query may have actually executed multiple times and thereby n may have been incremented multiple times.

Reproducing the error

It was necessary to build some infrastructure to reproduce this error. I wrote cavein, a TCP tunnel server that purposely breaks connections to aid in reproduction. That, combined with go_database_sql_retry_bug, a test application that executes many updates over the unreliable connections provided by cavein can reliably reproduce queries being executed extra times.

Suggested solution

Automatic retry should be removed from Exec and Query. It is impossible at the library layer to know if it is safe to retry a query.

The text was updated successfully, but these errors were encountered:

gopherbot · 2015-08-07T01:00:21Z

CL https://golang.org/cl/13350 mentions this issue.

johto · 2015-08-07T07:42:57Z

Automatic retry should be removed from Exec and Query. It is impossible at the library layer to know if it is safe to retry a query.

The driver documentation sayeth the following:

ErrBadConn should be returned by a driver to signal to the sql package
that a driver.Conn is in a bad state (such as the server having earlier
closed the connection) and the sql package should retry on a new
connection.

To prevent duplicate operations, ErrBadConn should NOT be returned
if there's a possibility that the database server might have performed
the operation. Even if the server sends back an error, you shouldn't
return ErrBadConn.

If this test case results in duplicate operations, it's the driver that's at fault, not database/sql's. Not a bug.

bradfitz · 2015-08-07T07:45:53Z

Yes, what @johto said.

jackc · 2015-08-07T12:20:08Z

Under what conditions could an Exec or Query correctly return ErrBadConn?

Once the query command is written to the socket, the driver has to consider that the query could have been executed. The only way it can safely assert that the query did not execute is if it receives a query failure response from the database server -- but if it receives a query failure message then the connection would still be usable -- so ErrBadConn would not be appropriate.

So it really seems the only situation where ErrBadConn is the correct response is on an error prior to socket write.

I suppose the driver can be blamed, but it feels like retry on Query and Exec is a foot-gun waiting to happen -- for minimal convenience improvement. This is not a theoretical assertion either, both drivers I tested, pq and my own pgx, trigger this undesirable behaviour.

johto · 2015-08-07T12:28:26Z

So it really seems the only situation where ErrBadConn is the correct response is on an error prior to socket write.

Right; you need to return it to get a bad connection to be discarded from the pool. So the order of events for e.g. Exec should be:

See if the connection is already marked bad. If so, return ErrBadConn
Write the query, its parameters etc. into the socket
If a network error happens before you're completely "done" with processing the resulting row set or database-level error, mark the connection bad, and return the network error directly back to database/sql
If the query was processed completely (in postgres' case, that would mean you saw ReadyForQuery), return either the result or the error from the database without marking the connection bad

Note that even without this issue, you already have to have the "bad" marker on a connection implemented because of #11264, so implementing it like this should not result in any unreasonable extra work.

I suppose the driver can be blamed, but it feels like retry on Query and Exec is a foot-gun waiting to happen -- for minimal convenience improvement.

Fine, maybe, but how would you discard bad connections from the pool in whatever it is you're suggesting?

This is not a theoretical assertion either, both drivers I tested, pq and my own pgx, trigger this undesirable behaviour.

Last time this came up on pq's issue tracker, I argued that pq's current behavior is broken, but lost that argument.

jackc · 2015-08-07T14:02:19Z

Right; you need to return it to get a bad connection to be discarded from the pool.

Okay, I wish it didn't work this way, but I understand the now.

Fine, maybe, but how would you discard bad connections from the pool in whatever it is you're suggesting?

The bad connection would have been caused by a previous query. The connection is marked as bad then. I would suggest checking for connection health on releasing the connection back to the pool instead of on next attempted use. This is what pgx's native connection pool does (when not using database/sql). Though I suppose that would require changing the database/sql driver interface, so that is not possible for Go 1.x.

Last time this came up on pq's issue tracker, I argued that pq's current behavior is broken, but lost that argument.

I've re-examined my test app and pgx, and determined I was incorrect in my assertion that it was vulnerable to this issue. However, pq definitely is. Perhaps this test app could be useful in persuading pq's maintainers.

jackc mentioned this issue Aug 1, 2015

handling dead connections and broken pipes jackc/pgx#74

Closed

ianlancetaylor added this to the Go1.6 milestone Aug 3, 2015

mikioh changed the title ~~database/sql retry logic is unsafe~~ database/sql: retry logic is unsafe Aug 3, 2015

arnehormann mentioned this issue Aug 3, 2015

fix driver.ErrBadConn behavior, stop repeated queries go-sql-driver/mysql#302

Merged

bradfitz closed this as completed Aug 7, 2015

golang locked and limited conversation to collaborators Aug 8, 2016

gopherbot added the FrozenDueToAge label Aug 8, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

database/sql: retry logic is unsafe #11978

database/sql: retry logic is unsafe #11978

jackc commented Aug 1, 2015

gopherbot commented Aug 7, 2015

johto commented Aug 7, 2015

bradfitz commented Aug 7, 2015

jackc commented Aug 7, 2015

johto commented Aug 7, 2015

jackc commented Aug 7, 2015

database/sql: retry logic is unsafe #11978

database/sql: retry logic is unsafe #11978

Comments

jackc commented Aug 1, 2015

What version of Go are you using (go version)?

What operating system and processor architecture are you using?

What did you do?

What did you expect to see?

What did you see instead?

Reproducing the error

Suggested solution

gopherbot commented Aug 7, 2015

johto commented Aug 7, 2015

bradfitz commented Aug 7, 2015

jackc commented Aug 7, 2015

johto commented Aug 7, 2015

jackc commented Aug 7, 2015