Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database/sql: add extra locking when calling driver #21117

Closed
blinsay opened this issue Jul 21, 2017 · 31 comments
Closed

database/sql: add extra locking when calling driver #21117

blinsay opened this issue Jul 21, 2017 · 31 comments
Labels
FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Milestone

Comments

@blinsay
Copy link

blinsay commented Jul 21, 2017

What version of Go are you using (go version)?

go version go1.8.3 darwin/amd64

What did you do?

Unlike Conn or Stmt there's no guidance on Tx indicating how the stdlib uses implementations. This seems to cause bugs in driver implementations, as a Tx is expected to handle concurrent calls from multiple goroutines unlike most of the other interfaces in the package.

This came up as I was debugging a panic inside of lib/pq and discovered it was a data race due to the authors not assuming that their Tx would have to deal with calls from multiple goroutines (the repro is here. It slowly adjusts timeouts to try and make the Context time out as results are coming back). Out of curiosity, I looked at go-sql-driver/mysql and the lack of a race there seems like a happy accident - the Context expiring occasionally causes an internal error to bubble up, since the authors appear to aggressively check their invariants.

I'm not sure whether this is a documentation problem or if there's something database/sql could do to make life easier for driver implementors.

@odeke-em odeke-em changed the title unclear docs causing races in sql/driver implementations database/sql: unclear docs for Tx cause races in sql/driver implementations Jul 21, 2017
@odeke-em
Copy link
Member

/cc @kardianos @bradfitz

@chrispassas
Copy link

In my case I have a pool of database connections. I open a transaction and run queries. If I use BeginTx() I get an crash panic overtime.

Part of why I need a transaction is so I can use postgresql statement_timeout. The only way to use it with a connection pool is to use a transaction.

If I just use Begin() without context it never crashes. In my case though I need to be able to cancel the transaction if conditions are met in another goroutine.

@blinsay
Copy link
Author

blinsay commented Aug 30, 2017

Hey y'all,

It looks like driver libs are still tossing out panics in Go 1.9. The underlying problem with the way the stdlib and the drivers are misunderstanding each other seems like it's still there. This comment has some more info (race detector output, panic traces, and a reliable repro) on how lib/pq and database/sql interact to panic.

@kardianos
Copy link
Contributor

This might not be a documentation issue but a driver issue.
row.Next locks on closemu, but doesn't lock on the driver connection lock. If DB.Next operations also locked on the dc, then we would probably correct this issue. Could someone apply the following diff locally and see if it fixes it? I need to do a fuller evaluation of what is going on as well.

In database/sql/sql.go

diff --git a/src/database/sql/sql.go b/src/database/sql/sql.go
index c609fe4..4f2c9f9 100644
--- a/src/database/sql/sql.go
+++ b/src/database/sql/sql.go
@@ -2454,6 +2454,9 @@ func (rs *Rows) nextLocked() (doClose, ok bool) {
        if rs.lastcols == nil {
                rs.lastcols = make([]driver.Value, len(rs.rowsi.Columns()))
        }
+       rs.dc.Lock()
+       defer rs.dc.Unlock()
+
        rs.lasterr = rs.rowsi.Next(rs.lastcols)
        if rs.lasterr != nil {
                // Close the connection if there is a driver error.
@@ -2503,6 +2506,8 @@ func (rs *Rows) NextResultSet() bool {
                doClose = true
                return false
        }
+       rs.dc.Lock()
+       defer rs.dc.Unlock()
        rs.lasterr = nextResultSet.NextResultSet()
        if rs.lasterr != nil {
                doClose = true

I need to do a full evaluation of this solution. But you could try it out and see if the issue goes away.

@blinsay
Copy link
Author

blinsay commented Sep 1, 2017

I applied the patch and that seems to fix the panic and satisfy the race detector!

Now that the log lines aren't mostly race detector output/panic stacktrace I've started noticing that quite a lot of the errors that come back are sql.ErrNoRows. That seems surprising to me.

Here's some sample data after running the repro for a few minutes:

$ ~/oss/go/bin/go run -race repro.go &> repro.log

# wait a while, grab some coffee

$ grep -E -o "error=.*" < repro.log | sort | uniq -c
  46 error="context deadline exceeded"
   1 error="driver: bad connection"
3017 error="pq: canceling statement due to user request"
  42 error="sql: Rows are closed"
1369 error="sql: no rows in result set"

@kardianos
Copy link
Contributor

@blinsay Good catch. Next step will be to determine why it is reporting no rows sometimes. I suspect it may be driver issue of returning the wrong error (NoRows rather then ctx.Error), but I haven't looked.

@gopherbot
Copy link

Change https://golang.org/cl/65731 mentions this issue: database/sql: prevent race in driver by locking dc in Next

@chrispassas
Copy link

chrispassas commented Sep 25, 2017

Once fixed can this be released in go 1.9.1 or 1.9.2? This is causing me problems in my application and waiting for 1.10 would be hard.

Thank you

@chrispassas
Copy link

I'm not that familiar with the Go release process. This issue appears to be fixed but sitting in Code Review. What comes next? Is there any hope of getting this in a 1.9.x release?

https://go-review.googlesource.com/c/go/+/65731

@ianlancetaylor ianlancetaylor added NeedsFix The path to resolution is known, but the work has not been done. and removed Documentation labels Oct 3, 2017
@ianlancetaylor ianlancetaylor added this to the Go1.9.1 milestone Oct 3, 2017
@ianlancetaylor
Copy link
Contributor

I've marked this for consideration for a 1.9 release.

@rsc
Copy link
Contributor

rsc commented Oct 13, 2017

Every time we touch the locking in database/sql it seems like we introduce a deadlock. What is the condition under which the problem being fixed happens? Are we sure that the expected benefit here outweighs the very real chance of introducing a deadlock in some other case?

@rsc rsc reopened this Oct 13, 2017
@rsc rsc added NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. and removed NeedsFix The path to resolution is known, but the work has not been done. Documentation labels Oct 13, 2017
@rsc
Copy link
Contributor

rsc commented Oct 13, 2017

Also, is this a regression from Go 1.8? If not, then we really should leave it alone and let the ordinary Go 1.10 testing happen.

@rsc rsc changed the title database/sql: unclear docs for Tx cause races in sql/driver implementations database/sql: add extra locking when calling driver Oct 13, 2017
@kardianos
Copy link
Contributor

@rsc I don't believe this is a regression from go1.8.

The benefit is it prevents real races in real programs.

In this case the risk of deadlock seems non-existent. Unlike previous issues, this just using the existing locking in *driverConn as it is meant to be used, locking for the short period of time around driver connection calls. The only way this could introduce a deadlock is if a driver calls some public sql API referencing the same connection from within the driver.Next, which would be a blatant driver issue.

I offer no opinion on when it should be released.

@gopherbot
Copy link

Change https://golang.org/cl/71450 mentions this issue: Revert "database/sql: prevent race in driver by locking dc in Next"

@gopherbot
Copy link

Change https://golang.org/cl/71433 mentions this issue: database/sql: ensure all driver interfaces are called under single lock

gopherbot pushed a commit that referenced this issue Oct 18, 2017
This reverts commit 897080d.

Reason for revert: Fails to fix all the locking issues.

Updates #21117

Change-Id: I6fc9cb7897244d6e1af78c089a2bf383258ec049
Reviewed-on: https://go-review.googlesource.com/71450
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@gopherbot
Copy link

Change https://golang.org/cl/71510 mentions this issue: [release-branch.go1.9] database/sql: prevent race in driver by locking dc in Next

@rsc
Copy link
Contributor

rsc commented Oct 19, 2017

@blinsay and @chrispassas and anyone else seeing these crashes: I am assuming you are using Linux. Can you please try https://golang-release-staging.storage.googleapis.com/go1.9.2/go1.9.2rc2.linux-amd64.tar.gz and see if the problem disappears? Thank you.

@chrispassas
Copy link

@rsc I just tried with 1.9.1 and can reproduce the crash reliably in a few seconds.

Using 1.9.2rc2 I've let it run for a few minutes 3 times with 0 crashes.

@rsc
Copy link
Contributor

rsc commented Oct 19, 2017

@chrispassas, thanks very much. Based on that, I'm comfortable with including that fix in the final Go 1.9.2.

@chrispassas
Copy link

@rsc Thank you, I'm looking forward to having it!

@rsc
Copy link
Contributor

rsc commented Oct 19, 2017

OK, I am going to approve the cherry-pick to Go 1.9.2 over in #22314 and leave this bug open to track additional locking fixes in Go 1.10.

@rsc rsc modified the milestones: Go1.9.2, Go1.10 Oct 19, 2017
gopherbot pushed a commit that referenced this issue Oct 25, 2017
…g dc in Next

Database drivers should be called from a single goroutine to ease
driver's design. If a driver chooses to handle context
cancels internally it may do so.

The sql package violated this agreement when calling Next or
NextResultSet. It was possible for a concurrent rollback
triggered from a context cancel to call a Tx.Rollback (which
takes a driver connection lock) while a Rows.Next is in progress
(which does not tack the driver connection lock).

The current internal design of the sql package is each call takes
roughly two locks: a closemu lock which prevents an disposing of
internal resources (assigning nil or removing from lists)
and a driver connection lock that prevents calling driver code from
multiple goroutines.

Fixes #21117

Change-Id: Ie340dc752a503089c27f57ffd43e191534829360
Reviewed-on: https://go-review.googlesource.com/65731
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-on: https://go-review.googlesource.com/71510
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Daniel Theophanes <kardianos@gmail.com>
@cespare
Copy link
Contributor

cespare commented Dec 21, 2017

I'm reopening this because the commit that auto-closed it was reverted afterwards by 292366e

Reason for revert: Fails to fix all the locking issues.

AFAICT this issue has not been addressed for 1.10. Feel free to close if that's not correct. (I was brought here by some serious Tx+context races/crashes in both 1.9.2 and 1.10beta1.)

@cespare cespare reopened this Dec 21, 2017
@chrispassas
Copy link

The fix that went into version 1.9.2 did fix the crashes for QueryContext. If I use ExecContext and PrepareContext I do get crashes fairly often. It looked like right after 1.9.2 was tagged they went back and fixed the muteness for 1.10.

If I can get some time this week I'll do some test with 1.10 beta 1 to see if it still crashes.

@kardianos
Copy link
Contributor

The fix for this did get in shortly after that revert. Yes, please verify @chrispassas , but as far as I know this is fixed.

@cespare
Copy link
Contributor

cespare commented Dec 21, 2017

@kardianos can you please link the fix here?

@kardianos
Copy link
Contributor

@cespare
Copy link
Contributor

cespare commented Dec 22, 2017

@kardianos thanks, I missed that.

After running my crashing code with on tip + CL 85175 (for #23208) I don't see any issues.

I'm going to re-close this as we don't seem to have any reason to believe that there's still a locking issue here. We can revisit if @chrispassas finds something.

@cespare cespare closed this as completed Dec 22, 2017
@chrispassas
Copy link

I have tested version 1.9.2 and QueryContext is good but ExecContext will trigger a depending how often your calling it.

I still need to test that 1.10-beta1 ExecContext is OK.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made.
Projects
None yet
Development

No branches or pull requests

8 participants