Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/go: Random failures building/running some go1.15.7 unit tests on high-core machines #43907

Closed
somersf opened this issue Jan 25, 2021 · 5 comments
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@somersf
Copy link
Contributor

somersf commented Jan 25, 2021

On high-core machines (eg 40-core), there seems to be a slice append issue in the latest GCCGO-related fixes to exec.go:cgo() when processing the new cgoEnv() cached slice.

What version of Go are you using (go version)?

This is the latest version which builds, but fails some tests randomly:

$ go version
go version go1.15.7 linux/386

It was built using:

$ go version
go version go1.15.6 linux/386

Does this issue reproduce with the latest release?

This issue occurs while running the golang toolchain unit tests, building/testing the latest (go1.15.7) release.

What operating system and processor architecture are you using (go env)?

These are the env results for the go1.15.7 toolchain which was built, but is randomly failing some tests.

go env Output
$ go env
GO111MODULE=""
GOARCH="386"
GOBIN=""
GOCACHE="/tmp/build/cache/build"
GOENV="/home/user/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="386"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go:/usr/share/gocode"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/tmp/build/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/tmp/build/go/pkg/tool/linux_386"
GCCGO="gccgo"
GO386="sse2"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/tmp/build/go/misc/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m32 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build969928004=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I picked one test that was failing randomly - TestCrossPackageTests, and ran it in a stress test loop.

#!/bin/bash

set -ex -o pipefail

build_dir=/tmp/build

# Setup
rm -rf $build_dir
mkdir -p $build_dir
cd $build_dir
#wget https://golang.org/dl/go1.15.7.src.tar.gz
cp /tmp/go1.15.7.src.tar.gz .
tar -zxf go1.15.7.src.tar.gz

# Build go
export GOCACHE=$build_dir/cache/build
cd $build_dir/go/src
./make.bash --no-clean -v

# Run extract from tests
export GOROOT=$build_dir/go
export PATH=$GOROOT/bin:$PATH
cd $build_dir/go/misc/cgo/test

for iter in {1..100} ; do
   echo "=== $iter"

   # Clean gocache for each test run
   export GOCACHE=$build_dir/cache/test
   rm -rf $GOCACHE/*

   go test . -work -v -short -count=1 -tags= -run=TestCrossPackageTests
done

The failure seems to be in the go build stage which the test runs, rather than the sub-tests it runs.

Clearing the GOCACHE before each build causes it to occur more frequently.
exporting GOMAXPROCS=2 dramatically reduced the likelihood of it occurring.

What did you expect to see?

The stress test script above should continue to pass.

What did you see instead?

At some random iteration (often in the first 5-15), the stress test script above fails as follows:

...
=== 3
...
+ go test . -work -v -short -count=1 -tags= -run=TestCrossPackageTests
WORK=/tmp/go-build217270159
=== RUN   TestCrossPackageTests
    pkg_test.go:67: go test -v -short: exit status 2
        panic: runtime error: index out of range [2] with length 2
        
        goroutine 476 [running]:
        cmd/go/internal/work.(*Builder).cgo(0x9fb5c80, 0xa123970, 0x9f75da0, 0x24, 0x9fadc00, 0x1c, 0x0, 0x0, 0x0, 0x0, ...)
                /tmp/build/go/src/cmd/go/internal/work/exec.go:2794 +0x3a3c
        cmd/go/internal/work.(*Builder).build(0x9fb5c80, 0xa123970, 0x0, 0x0)
                /tmp/build/go/src/cmd/go/internal/work/exec.go:602 +0xdba
        cmd/go/internal/work.(*Builder).Do.func2(0xa123970)
                /tmp/build/go/src/cmd/go/internal/work/exec.go:119 +0x7c
        cmd/go/internal/work.(*Builder).Do.func3(0x98bca50, 0x9fb5c80, 0x9cba9c0)
                /tmp/build/go/src/cmd/go/internal/work/exec.go:179 +0x63
        created by cmd/go/internal/work.(*Builder).Do
                /tmp/build/go/src/cmd/go/internal/work/exec.go:166 +0x2dd
--- FAIL: TestCrossPackageTests (0.63s)
FAIL
FAIL    misc/cgo/test   0.641s
FAIL

The panic is in the We expect to find the contents of cgoLDFLAGS in flags check at the end of cgo.
When I added some trace statements, I found that when it was about to index-out-of-range, it was looking for a longer cgoLDFLAGS than was possible with the extracted flags; the case that I found repeated over all failed runs was when:

        flags = []string{"-g", "-O2"}
        cgoLDFLAGS = []string{"-g", "-O2", "-lm"}

git bisecting go1.15.6..go1.15.7 lead me to commit e8e7fac [release-branch.go1.15-security] cmd/go: pass resolved CC, GCCGO to cgo.
(see also https://go-review.googlesource.com/c/go/+/284780/)

Examining this, I noticed some new append()s in the diff of the form a = append(b, c) that looked unsafe (and an append to the returned, cached slice).

For testing purposes, I modified go/src/cmd/go/internal/work/exec.go to always return a copy.

func (b *Builder) cgoEnv() []string {
	b.cgoEnvOnce.Do(func() {
...
	})
	// Make sure to return a separate copy; the caller sometimes append()s to it. 
	// return b.cgoEnvCache
	return append([]string{}, b.cgoEnvCache...)
}

With that change, the stress test completed successfully.

I noticed there is a later change https://go-review.googlesource.com/c/go/+/285873 which reverts much of the GCCGO related change in 1.15.7, so there may already be a fix for this.

@AlexRouSg
Copy link
Contributor

I noticed there is a later change https://go-review.googlesource.com/c/go/+/285873 which reverts much of the GCCGO related change in 1.15.7, so there may already be a fix for this.

Can you try with tip? CL 285873 gets rid of the cgoEnvCache too so if returning a copy works then it should be already fixed.

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jan 26, 2021
@bcmills bcmills added this to the Backlog milestone Jan 26, 2021
@bcmills
Copy link
Contributor

bcmills commented Jan 26, 2021

CC @jayconrod @matloob

@bcmills bcmills added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jan 26, 2021
@somersf
Copy link
Contributor Author

somersf commented Jan 26, 2021

Can you try with tip? CL 285873 gets rid of the cgoEnvCache too so if returning a copy works then it should be already fixed.

Sure - tip passed two runs of the repro script (each with 100x invocations of go test TestCrossPackageTests), so the follow-on changes appear to have addressed the problem, thanks.

@AlexRouSg
Copy link
Contributor

Then I guess there's nothing to do here?

You can follow #43860 for the 1.15 backport

@somersf
Copy link
Contributor Author

somersf commented Jan 27, 2021

You can follow #43860 for the 1.15 backport

Excellent - thanks for the backport

@somersf somersf closed this as completed Jan 27, 2021
@golang golang locked and limited conversation to collaborators Jan 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

4 participants