Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database/sql: data race in runtime string copy #43576

Closed
kevinburke1 opened this issue Jan 7, 2021 · 8 comments
Closed

database/sql: data race in runtime string copy #43576

kevinburke1 opened this issue Jan 7, 2021 · 8 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@kevinburke1
Copy link

kevinburke1 commented Jan 7, 2021

I ran the company's test suite and hit a data race. Unfortunately I am having trouble reproducing it - I ran the specific failing tests 10,000 times, and could not produce the error again. I am pasting this because I am concerned about the potential for memory corruption and in the hopes that the stack trace can pinpoint the source of the problem. I'm unsure what layer it's at.

What version of Go are you using (go version)?

$ go version
go version devel +fa90aaca7d Thu Jan 7 19:31:03 2021 +0000 darwin/amd64

Does this issue reproduce with the latest release?

I can't reproduce this reliably enough to determine that its absence is proof it was fixed, see the issue summary.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="off"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/kevin/Library/Caches/go-build"
GOENV="/Users/kevin/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/kevin/pkg/mod"
GONOPROXY="github.com/meterup"
GONOSUMDB="github.com/meterup"
GOOS="darwin"
GOPATH="/Users/kevin"
GOPRIVATE="github.com/meterup"
GOPROXY="direct"
GOROOT="/Users/kevin/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/kevin/go/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="devel +fa90aaca7d Thu Jan 7 19:31:03 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/s1/909yt58s4wj8h_7v7frr8rkm0000gn/T/go-build1691940976=/tmp/go-build -gno-record-gcc-switches -fno-common"
GOROOT/bin/go version: go version devel +fa90aaca7d Thu Jan 7 19:31:03 2021 +0000 darwin/amd64
GOROOT/bin/go tool compile -V: compile version devel +fa90aaca7d Thu Jan 7 19:31:03 2021 +0000
uname -v: Darwin Kernel Version 19.6.0: Tue Nov 10 00:10:30 PST 2020; root:xnu-6153.141.10~1/RELEASE_X86_64
ProductName:	Mac OS X
ProductVersion:	10.15.7
BuildVersion:	19H114
lldb --version: lldb-1200.0.44.2
Apple Swift version 5.3.2 (swiftlang-1200.0.45 clang-1200.0.32.28)

Here is the data race stack trace:

==================
WARNING: DATA RACE
Write at 0x00c000473450 by goroutine 132:
  runtime.slicecopy()
      /Users/kevin/go/src/runtime/slice.go:247 +0x0
  github.com/lib/pq.(*writeBuf).string()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/buf.go:69 +0x39a
  github.com/lib/pq.(*conn).prepareTo()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn.go:769 +0x2f8
  github.com/lib/pq.(*conn).query()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn.go:850 +0x3e8
  github.com/lib/pq.(*conn).QueryContext()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn_go18.go:20 +0x207
  database/sql.ctxDriverQuery()
      /Users/kevin/go/src/database/sql/ctxutil.go:48 +0x2f2
  database/sql.(*DB).queryDC.func1()
      /Users/kevin/go/src/database/sql/sql.go:1650 +0x2c8
  database/sql.withLock()
      /Users/kevin/go/src/database/sql/sql.go:3294 +0x7e
  database/sql.(*DB).queryDC()
      /Users/kevin/go/src/database/sql/sql.go:1645 +0x833
  database/sql.(*DB).query()
      /Users/kevin/go/src/database/sql/sql.go:1628 +0x18a
  database/sql.(*DB).QueryContext()
      /Users/kevin/go/src/database/sql/sql.go:1605 +0xe9
  database/sql.(*DB).QueryRowContext()
      /Users/kevin/go/src/database/sql/sql.go:1706 +0xb0
  github.com/meterup/api/src/internal/db.(*Queries).queryRow()
      /Users/kevin/src/github.com/meterup/api/src/internal/db/db.go:1302 +0x12f
  github.com/meterup/api/src/internal/db.(*Queries).GetCompanyByController()
      /Users/kevin/src/github.com/meterup/api/src/internal/db/companies.sql.go:152 +0x170
  github.com/meterup/api/src/api.getCompany.func2()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:186 +0x157
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x94

Previous read at 0x00c000473455 by goroutine 152:
  runtime.slicebytetostring()
      /Users/kevin/go/src/runtime/string.go:80 +0x0
  github.com/meterup/api/src/internal/db.(*ControllerStatus).Scan()
      /Users/kevin/src/github.com/meterup/api/src/internal/db/models.go:108 +0x158
  database/sql.convertAssignRows()
      /Users/kevin/go/src/database/sql/convert.go:385 +0x1e73
  database/sql.(*Rows).Scan()
      /Users/kevin/go/src/database/sql/sql.go:3144 +0x37d
  database/sql.(*Row).Scan()
      /Users/kevin/go/src/database/sql/sql.go:3237 +0x244
  github.com/meterup/api/src/internal/db.(*Queries).GetControllerByHostname()
      /Users/kevin/src/github.com/meterup/api/src/internal/db/controllers.sql.go:248 +0x36a
  github.com/meterup/api/src/api.getCompany.func1()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:163 +0x110
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x94

We're using sqlc v1.6.0 as an ORM layer, and github.com/lib/pq at commit 083382b7e6fc. It's possible the error is in those layers; I can't reproduce it frequently enough, so I can't say for sure whether updating to the latest commit fixes the error. I'm not sure whether a driver is responsible for marking memory as dirty or not.

Here are some of the relevant parts of the source code:

v1_network_access.go line 163 and 186
	group, errctx := errgroup.WithContext(ctx)
	var foundCompany *db.Company
	var foundError error
	var mu sync.Mutex
	group.Go(func() error {
		controller, err := qs.GetControllerByHostname(errctx, cname)
		if err == sql.ErrNoRows {
			// this is the legacy way, the other goroutine will try the normal
			// way.
			return nil
		}
		if err != nil {
			mu.Lock()
			foundError = err
			mu.Unlock()
			return nil
		}
		company, err := qs.GetCompanyByController(errctx, controller.Name)
		mu.Lock()
		defer mu.Unlock()
		if err != nil { // ErrNoRows is also a problem here; every controller should be attached to a company.
			foundError = err
			return nil
		}
		foundCompany = &company
		return errStopRacing
	})
	group.Go(func() error {
		company, err := qs.GetCompanyByController(errctx, cname)
		mu.Lock()
		defer mu.Unlock()
		if err != nil {
			foundError = err
			return nil
		}
		foundCompany = &company
		return errStopRacing
	})
	group.Wait() // either nil or errStopRacing
	if foundError != nil && foundCompany == nil {
		return nil, foundError
	}
controllers.sql.go line 248
func (q *Queries) GetControllerByHostname(ctx context.Context, hostname string) (Controller, error) {
	row := q.queryRow(ctx, q.getControllerByHostnameStmt, getControllerByHostname, hostname)
	var i Controller
	err := row.Scan(
		&i.ID,
                 // other fields omitted for brevity
	)
	return i, err
}
db/models.go line 108
type ControllerStatus string

const (
	ControllerStatusOnline  ControllerStatus = "online"
	ControllerStatusOffline ControllerStatus = "offline"
)

func (e *ControllerStatus) Scan(src interface{}) error {
	switch s := src.(type) {
	case []byte:
		*e = ControllerStatus(s)
	case string:
		*e = ControllerStatus(s)
	default:
		return fmt.Errorf("unsupported scan type for ControllerStatus: %T", src)
	}
	return nil
}
companies.sql.go line 152
	row := q.queryRow(ctx, q.getCompanyByControllerStmt, getCompanyByController, name)

which calls

internal/db/db.go line 1302
		return q.db.QueryRowContext(ctx, query, args...)

I can send a full version of the code privately or try out different things if you have ideas about how to reproduce it.

@toothrot toothrot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 8, 2021
@toothrot toothrot added this to the Backlog milestone Jan 8, 2021
@toothrot
Copy link
Contributor

toothrot commented Jan 8, 2021

Interesting! Also cc'ing your co-owners: @bradfitz @kardianos

@kyleconroy
Copy link

Author of sqlc here, very interested in following along.

@kevinburke1
Copy link
Author

Update, I'm able to consistently reproduce this error now.

@kevinburke1
Copy link
Author

kevinburke1 commented Jan 10, 2021

This is my current minimal test case

func getCompany(ctx context.Context, qs *db.ManagedDB, cname string) (*db.Company, error) {
	group, errctx := errgroup.WithContext(ctx)
	group.Go(func() error {
		qs.GetControllerByHostname(errctx, cname)
		return errStopRacing
	})
	group.Go(func() error {
		qs.GetCompanyByController(errctx, cname)
		return errStopRacing
	})
	group.Wait() // either nil or errStopRacing
	return nil, nil
}

I can't reproduce if i replace errStopRacing with nil

@kevinburke1
Copy link
Author

Updated minimal test case, I think this is actually to the point where it's reasonable for someone else to try to reproduce. I am running lib/pq at tip and Go at tip. sqlc is not involved anymore.

func TestGetCompany(t *testing.T) {
	t.Parallel()
	conn := testkit.OpenDB(t)
	t.Cleanup(func() { conn.Close() })
	for j := 0; j < 20; j++ {
		var wg sync.WaitGroup
		for i := 0; i < 100; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				err := getCompany(conn.DB)
				if err != nil {
					panic(err)
				}
			}()
		}
		wg.Wait()
	}
}

Source code:

func getCompany(qs *sql.DB) error {
	group, errctx := errgroup.WithContext(context.Background())
	group.Go(func() error {
		var s string
		qs.QueryRowContext(errctx, "SELECT 'online'::controller_status").Scan(&s)
		return errStopRacing
	})
	group.Go(func() error {
		var s string
		qs.QueryRowContext(errctx, "SELECT 'online'::controller_status").Scan(&s)
		return errStopRacing
	})
	group.Wait()
	return nil
}

And then in the database (Postgres), you need:

CREATE TYPE controller_status AS ENUM ('online', 'offline');

I am running with the following test invocation:

GORACE="halt_on_error=1" command time go test -race -count 500 -run 'TestGetCompany'

Here's the current data race stack trace.

==================
WARNING: DATA RACE
Write at 0x00c00027f628 by goroutine 129:
  runtime.slicecopy()
      /Users/kevin/go/src/runtime/slice.go:247 +0x0
  github.com/lib/pq.(*writeBuf).string()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/buf.go:69 +0x1ae
  github.com/lib/pq.(*conn).simpleQuery()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn.go:665 +0x114
  github.com/lib/pq.(*conn).query()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn.go:862 +0x61c
  github.com/lib/pq.(*conn).QueryContext()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn_go18.go:21 +0x207
  database/sql.ctxDriverQuery()
      /Users/kevin/go/src/database/sql/ctxutil.go:48 +0x2f2
  database/sql.(*DB).queryDC.func1()
      /Users/kevin/go/src/database/sql/sql.go:1651 +0x2d5
  database/sql.withLock()
      /Users/kevin/go/src/database/sql/sql.go:3295 +0x7e
  database/sql.(*DB).queryDC()
      /Users/kevin/go/src/database/sql/sql.go:1645 +0x833
  database/sql.(*DB).query()
      /Users/kevin/go/src/database/sql/sql.go:1628 +0x18a
  database/sql.(*DB).QueryContext()
      /Users/kevin/go/src/database/sql/sql.go:1605 +0xe9
  database/sql.(*DB).QueryRowContext()
      /Users/kevin/go/src/database/sql/sql.go:1707 +0xf6
  github.com/meterup/api/src/api.getCompany.func2()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:163 +0x135
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x94

Previous read at 0x00c00027f62b by goroutine 55:
  runtime.slicebytetostring()
      /Users/kevin/go/src/runtime/string.go:80 +0x0
  database/sql.convertAssignRows()
      /Users/kevin/go/src/database/sql/convert.go:249 +0x2e84
  database/sql.(*Rows).Scan()
      /Users/kevin/go/src/database/sql/sql.go:3145 +0x37d
  database/sql.(*Row).Scan()
      /Users/kevin/go/src/database/sql/sql.go:3238 +0x244
  github.com/meterup/api/src/api.getCompany.func2()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:163 +0x18b
  database/sql.(*DB).QueryRowContext()
      /Users/kevin/go/src/database/sql/sql.go:1707 +0xf6
  github.com/meterup/api/src/api.getCompany.func2()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:163 +0x135
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x94

Goroutine 129 (running) created at:
  golang.org/x/sync/errgroup.(*Group).Go()
      /Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x73
  github.com/meterup/api/src/api.getCompany()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:161 +0x244
  github.com/meterup/api/src/api.TestGetCompany.func2()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access_test.go:190 +0x99

Goroutine 55 (finished) created at:
  golang.org/x/sync/errgroup.(*Group).Go()
      /Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x73
  github.com/meterup/api/src/api.getCompany()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:161 +0x244
  github.com/meterup/api/src/api.TestGetCompany.func2()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access_test.go:190 +0x99

@kevinburke1
Copy link
Author

I'm 95% convinced that this is an error in github.com/lib/pq.

@kevinburke1
Copy link
Author

OK, I tracked this down to a race condition in lib/pq - basically, pq wasn't prepared for database/sql to send a Close() call from one goroutine (triggered by a context cancelation) while another goroutine was handling Next()/Scan() in the course of ordinary database processing.

That is probably a fairly common case and I'm wondering if we could add some sort of driver compatibility layer to test that out.

@kardianos
Copy link
Contributor

I think there is a separate issue for such a test suite.
I haven't had time / funding to setup such a suite. But it would be a good idea.

I'm closing because this doesn't seem a problem for database/sql.

@golang golang locked and limited conversation to collaborators Jan 11, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

5 participants