Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/go: TLS handshake timeout running in qemu #50584

Closed
ppenguin opened this issue Jan 12, 2022 · 18 comments
Closed

cmd/go: TLS handshake timeout running in qemu #50584

ppenguin opened this issue Jan 12, 2022 · 18 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@ppenguin
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.17.2 linux/arm
# in a build container based on debian stretch

Does this issue reproduce with the latest release?

untested

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm"
GOBIN=""
GOCACHE="/builds/r/rcnode/.gocache-arm/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/builds/r/rcnode/.gomodcache-arm/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/builds/r/rcnode/.go-arm"
GOPRIVATE=""
GOPROXY="https://goproxy.io,https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_arm"
GOVCS=""
GOVERSION="go1.17.2"
GCCGO="gccgo"
GOARM="6"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -marm -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3509342896=/tmp/go-build -gno-record-gcc-switches"

(Note the directories are in a container, and /builds is mounted into it from the project dir.)

What did you do?

I built a program with a largish number of dependencies.

The build consistently fails during getting these with go get -v -u ./..., notably mostly on

cloud.google.com/go/storage@v1.14.0: Get "https://goproxy.io/cloud.google.com/go/storage/@v/v1.14.0.mod": net/http: TLS handshake timeout

and

pkg/config/config.go:12:2: gopkg.in/yaml.v2@v2.2.1: Get "https://goproxy.io/gopkg.in/yaml.v2/@v/v2.2.1.zip": net/http: TLS handshake timeout

It is suspicious that this only happens with arm32 builds (likely only with with CGO=1), and it can indeed be seen that the qemu container process has a high load before go get fails.

My suspicion is that go get can't find a binary package for arm32 arch and starts building it automatically, which is of course good and expected behaviour. However: it seems that go get wrongly applies a networking timeout to this situation, where it should handle package builds from source with a different timeout (these usually take much longer, especially under qemu, which should be allowed).

What did you expect to see?

A successful build, independent of architecture. (As a reference: other non-native builds, e.g. arm64, in a container which is built in the same way except for the arch, do not have this issue, but they are also significantly faster, which leads to the suspicion that the issue is indeed the application of an unreasonable timeout).

What did you see instead?

Networking timeouts during go get, where the bottleneck appears to be not the network but a local package build (high processor load).

@ppenguin ppenguin changed the title affected/package: cmd/go affected/package: cmd/go (consistent networking timeout during go get for certain packages in qemu container build, suspected bug in timeout behaviour) Jan 12, 2022
@seankhliao
Copy link
Member

what if you only download without building? -d

@ppenguin
Copy link
Author

I tried with -d, and still see a high processor load (specifically qemu-arm-static /usr/local/go/bin/go mod tidy and later qemu-arm-static /usr/local/go/bin/go install -v) during go: downloading ..., during which I get the same mentioned net/http: TLS handshake timeout messages as before.

I guess the reason it's building in the first place is because for my arch (arm) the/some binary packages are not cached, so there wouldn't really be a choice to "not build" anyway.

@seankhliao
Copy link
Member

go get -d ./... will only resolve and download the dependencies without building. If it still fails there, then it's a network error.
There are no such things as binary packages (anymore) all dependencies are downloaded as source and compiled locally.
Also, does it still occur if you use a different GOPROXY?

@ppenguin
Copy link
Author

Ah, thanks for clarifying.
If I do a vanilla go get -v -d ./... without anything else, I still get those timeouts.

What I don't understand is the high processor load of the parallel qemu-arm-static /usr/local/go/bin/go get -v -d ./... and qemu-arm-static /usr/local/go/bin/go install -v processes if they are "just" downloading... (Also: keep in mind that this only happens for my non-native arm container build, which is by itself not network related?)

I have tried with different values for GOPROXY, (GOPROXY=direct to GOPROXY=https://goproxy.io,https://proxy.golang.org,direct and all variations in between), the same packages fail in all cases, but with a different message:

  • for GOPROXY=direct:
    github.com/go-delve/delve/pkg/config imports
         gopkg.in/yaml.v2: unrecognized import path "gopkg.in/yaml.v2": https fetch: Get "https://gopkg.in/yaml.v2?go-get=1": net/http: TLS handshake timeout
    
  • for other GOPROXY:
    github.com/go-delve/delve/pkg/config imports
         gopkg.in/yaml.v2: gopkg.in/yaml.v2@v2.2.1: Get "https://proxy.golang.org/gopkg.in/yaml.v2/@v/v2.2.1.zip": net/http: TLS handshake timeout
    

@seankhliao seankhliao changed the title affected/package: cmd/go (consistent networking timeout during go get for certain packages in qemu container build, suspected bug in timeout behaviour) cmd/go: TLS handshake timeout running in qemu Jan 13, 2022
@seankhliao
Copy link
Member

go install does compile, but I haven't asked you to do that.
You say parallel, what do you mean by that?
How much resources are these processes given?

@ppenguin
Copy link
Author

ppenguin commented Jan 13, 2022

go install appears to be triggered by go get in more involved cases.
go get appears to behave exactly the same in terms of processor load as go mod tidy and go install processes I have observed for different variations of (part of) the container builds I have been testing on.

For go get specifically (in the simpler case where I did go get -d ./... in a clone of this one inside an arm build container) the load is between 30-50% (according to htop) during about 1m30s on 24 cores, or in summary around "800% sustained CPU load" during this time (each process qemu-arm-static /usr/local/go/bin/go get -d ./...):

root@4527a4c64888:/builds# time go get -d ./...
go: downloading gopkg.in/yaml.v2 v2.2.1
go: downloading golang.org/x/arch v0.0.0-20190927153633-4e8777c89be4
go: downloading go.starlark.net v0.0.0-20200821142938-949cc6f4b097
go: downloading golang.org/x/sys v0.0.0-20200625212154-ddb9806d33ae
github.com/go-delve/delve/pkg/config imports
        gopkg.in/yaml.v2: gopkg.in/yaml.v2@v2.2.1: Get "https://goproxy.io/gopkg.in/yaml.v2/@v/v2.2.1.zip": net/http: TLS handshake timeout
github.com/go-delve/delve/pkg/proc imports
        golang.org/x/arch/arm/armasm: golang.org/x/arch@v0.0.0-20190927153633-4e8777c89be4: Get "https://goproxy.io/golang.org/x/arch/@v/v0.0.0-20190927153633-4e8777c89be4.zip": net/http: TLS handshake timeout
github.com/go-delve/delve/pkg/proc imports
        golang.org/x/arch/arm64/arm64asm: golang.org/x/arch@v0.0.0-20190927153633-4e8777c89be4: Get "https://goproxy.io/golang.org/x/arch/@v/v0.0.0-20190927153633-4e8777c89be4.zip": net/http: TLS handshake timeout
github.com/go-delve/delve/pkg/proc imports
        golang.org/x/arch/x86/x86asm: golang.org/x/arch@v0.0.0-20190927153633-4e8777c89be4: Get "https://goproxy.io/golang.org/x/arch/@v/v0.0.0-20190927153633-4e8777c89be4.zip": net/http: TLS handshake timeout
github.com/go-delve/delve/pkg/proc/native imports
        golang.org/x/sys/unix: golang.org/x/sys@v0.0.0-20200625212154-ddb9806d33ae: Get "https://goproxy.io/golang.org/x/sys/@v/v0.0.0-20200625212154-ddb9806d33ae.zip": net/http: TLS handshake timeout
github.com/go-delve/delve/pkg/terminal/starbind imports
        go.starlark.net/resolve: go.starlark.net@v0.0.0-20200821142938-949cc6f4b097: Get "https://goproxy.io/go.starlark.net/@v/v0.0.0-20200821142938-949cc6f4b097.zip": net/http: TLS handshake timeout
github.com/go-delve/delve/pkg/terminal/starbind imports
        go.starlark.net/starlark: go.starlark.net@v0.0.0-20200821142938-949cc6f4b097: Get "https://goproxy.io/go.starlark.net/@v/v0.0.0-20200821142938-949cc6f4b097.zip": net/http: TLS handshake timeout
github.com/go-delve/delve/pkg/terminal/starbind imports
        go.starlark.net/syntax: go.starlark.net@v0.0.0-20200821142938-949cc6f4b097: Get "https://goproxy.io/go.starlark.net/@v/v0.0.0-20200821142938-949cc6f4b097.zip": net/http: TLS handshake timeout

real    1m26.730s
user    1m27.045s
sys     9m10.221s

That seems pretty fishy if it would be only network related...

@bcmills
Copy link
Contributor

bcmills commented Jan 13, 2022

Note that we currently lack a QEMU builder (#1508), and per https://go.dev/wiki/PortingPolicy, supported platforms require builders. (User-mode QEMU is not supported, although it may work in some cases.)

Does this failure mode also reproduce on physical ARM hardware?

@ppenguin
Copy link
Author

ppenguin commented Jan 13, 2022

I only have "semi-embedded" arm hardware (which are my targets), on which I can't easily run full builds.

Interestingly though, the same repo with the same Makefile and the same build-image (arm) does build in a gitlab CI. I'm even using the rest of my go env (incl. e.g. GOMODCACHE) in the same way in my CI and for the local dev build, and the runner-host is also very similar to my dev host (both nixos based). So I'm really at a loss what is causing this...

I managed to consistently build relatively complex go arm binaries on that gitlab-runner with container builds, but they also consistently fail with the mentioned errors with "local" container builds.

For reference: for amd64, arm64, arm the duration of the CI is respectively a few minutes for a*64 vs. approx. >30min for arm. Sometimes (luckily seldom) it even times out above 90min.

@ppenguin
Copy link
Author

ppenguin commented Jan 13, 2022

@bcmills oh wait, I just checked and since linux/arm is a first class port, and I'm building within an qemu-arm container, isn't that completely transparent in that case (i.e. as if I'm running on arm hardware)? FYI: my build-image contains a linux/arm binary of go itself, so considering a "cross-builder" in this context would be moot? Or am I missing something?

@bcmills
Copy link
Contributor

bcmills commented Jan 13, 2022

@ppenguin, QEMU sometimes differs in substantial ways from a native kernel running on native hardware (see, for example, #33746, #42080, #44572, #50188).

@bcmills bcmills added this to the Backlog milestone Jan 13, 2022
@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jan 13, 2022
@ppenguin
Copy link
Author

ppenguin commented Jan 13, 2022

@bcmills thanks for the info, I guess I'm lucky that I'm getting away with it pretty ok, except for this strange issue (though note that my CI is also qemu-arm and works). The image is "native" pretty far down (down to ld and libc), but indeed anything directly kernel related would be different...

I'm wondering though whether it could make sense to somehow set the network timeout for go get to a much higher value, and then check whether the error disappears. That would at least eliminate the uncertainty whether (slow emulated) pkg build could be the actual cause of the network timeouts.

@jauderho
Copy link

jauderho commented Jan 20, 2022

Commenting here since my issue was closed. linux/ppc64le is not a first class port AFAIK and neither is linux/s390x. But it seems strange to me that I only ever see issues with ppc64le but not s390x.

Specifically, this is with GitHub hosted runners.

@rzr
Copy link

rzr commented Feb 3, 2022

Hi,
What makes you think it's arm or qemu specific because I am also facing this on x86_64 in docker env (x86_64 not arm):

I am able to reproduce a similar failure in yocto environment with go-1.17.5-r0:
If it help I can share a dockerfile to reproduce the problem:

ERROR: sysota-git-r0 do_compile: ExecutionError('/local/src/oniro/oe-core/build/tmp/work/qemux86_64-oniro-linux/sysota/git-r0/temp/run.do_compile.2661763', 1, None, None)
ERROR: Logfile of failure stored in: /local/src/oniro/oe-core/build/tmp/work/qemux86_64-oniro-linux/sysota/git-r0/temp/log.do_compile.2661763
Log data follows:
| DEBUG: Executing shell function do_compile
| go: github.com/godbus/dbus/v5@v5.0.4: Get "https://proxy.golang.org/github.com/godbus/dbus/v5/@v/v5.0.4.mod": net/http: TLS handshake timeout
| go: downloading github.com/godbus/dbus/v5 v5.0.4
| go: downloading gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c
| go: github.com/godbus/dbus/v5@v5.0.4: Get "https://proxy.golang.org/github.com/godbus/dbus/v5/@v/v5.0.4.mod": net/http: TLS handshake timeout
| no Go files in /local/src/oniro/oe-core/build/tmp/work/qemux86_64-oniro-linux/sysota/git-r0/build/src/booting.oniroproject.org/distro/components/sysota
| WARNING: exit code 1 from a shell command.
NOTE: recipe sysota-git-r0: task do_compile: Failed
ERROR: Task (/local/src/oniro/oe-core/../oniro/meta-oniro-core/recipes-core/sysota/sysota_git.bb:do_compile) failed with exit code '1'

Relate-to: https://booting.oniroproject.org/distro/oniro/-/merge_requests/524

@ppenguin
Copy link
Author

@rzr for me it only happens on arm, where by arm I mean self-built arm images based on older debian (jessie for glibc-2.19, for backward compatibility). The timeouts happen consistently during/after the build is running (go build) under considerable CPU load:

image

Interestingly, a CI that is executing the same build script in the same container image on the same (type of) OS (nixos) doesn't have this issue, although it takes 5 times as long as an arm64 build and about 10 times as long as a amd64 build (all in respective containers).

That is the reason that I'm suspecting that network timeouts are reported due to build duration and not actual connection errors.
In fact, since your errors occur under qemu, it might mean that the build is also "slower than native" so in that respect there seem to be similarities...

@bcmills regarding the label WaitingForInfo: what more info could I supply to help confronting the issue further?

@bcmills
Copy link
Contributor

bcmills commented Feb 23, 2022

@ppenguin, it would help to know how far we are from the TLS timeout when running on real ARM hardware. IIRC the timeout is on the order of 10s, so the emulation would have to be heavily overloaded to miss it. (And it that case on real hardware I might suggest lowering GOMAXPROCS so that the CPU has less other work to do at the same time.)

@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Jun 12, 2022
@gopherbot
Copy link

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@0xdeface
Copy link

0xdeface commented Jan 24, 2023

Hello, today i have a strange issue. In some containers which used http.Get i received a tls timeout. also i receive go mod tidy : net/http: TLS handshake timeout on my gitlab runner. All pc where i receive a tls timeout work into proxmox qemu.
if i use the same docker image on host hypervisor or local machine on the same subnet all work right, but if it quemu virtualized machine i have a timeout. maybe its relative to this issue. thank you.

@proofrock
Copy link

Similar to this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

8 participants