net/http: Segmentation violation when using an IPSec VPN Tunnel (Go 1.19.8/Go 1.20.4/Go 1.21.4) #64020

Open
yznima opened this issue Nov 8, 2023 · 9 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@yznima

yznima commented Nov 8, 2023

This is continuation of #61552 since that one is closed.

What version of Go are you using (go version)?

$ go version
go version go1.20.4 darwin/arm64

Does this issue reproduce with the latest release?

Yes. All versions from Go 1.19 onward, inclusive, reproduce the issue.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="arm64"
GOBIN=""
GOCACHE="{HOME}/Library/Caches/go-build"
GOENV="{HOME}/Library/Application Support/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="{HOME}/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/nima.yahyazadeh/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/opt/homebrew/opt/go/libexec"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/opt/homebrew/opt/go/libexec/pkg/tool/darwin_arm64"
GOVCS=""
GOVERSION="go1.20.4"
GCCGO="gccgo"
AR="ar"
CC="cc"
CXX="c++"
CGO_ENABLED="1"
GOMOD="{XXX}/go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/s2/3dxjlr5n0d910kqqgq_lcp1w0000gp/T/go-build3602641218=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I compiled the client and server code as follows on my M1 Mac and uploaded the binaries to a Linux server.

For the client I compiled using GOOS=linux GOARCH=amd64 go build -o client

package main

import (
	"bytes"
	"context"
	"flag"
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"strings"
	"time"
)

var hostsFlag = flag.String("hosts", "localhost", "")
var portFlag = flag.Int("port", 3000, "")

func main() {
	flag.Parse()

	hosts := strings.Split(*hostsFlag, ",")
	ports := []int{*portFlag}

	<-time.After(2 * time.Second)

	for _, h := range hosts {
		for _, p := range ports {
			client := http.DefaultClient
			for i := 0; i < 100; i++ {
				go func(host string, port int) {
					n := 0
					for {
						n++
						if n%10 == 0 {
							println(host, n)
						}
						write(client, host, port)
						writeCancel(client, host, port)
					}
				}(h, p)
			}
		}
	}

	select {}
}

func write(client *http.Client, host string, port int) {
	req, err := http.NewRequest(http.MethodPost, fmt.Sprintf("http://%s:%d/hello", host, port), bytes.NewReader([]byte(`{}`)))
	if err != nil {
		panic(err)
	}

	resp, err := client.Do(req)
	defer CloseSafe(resp)
	if err != nil {
		println(err.Error())
		return
	}

	handleError(resp)
}

func writeCancel(client *http.Client, host string, port int) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, fmt.Sprintf("http://%s:%d/hello", host, port), bytes.NewReader([]byte(`{}`)))
	if err != nil {
		panic(err)
	}

	go cancel()

	resp, err := client.Do(req)
	defer CloseSafe(resp)
}

func handleError(resp *http.Response) {
	if resp.StatusCode != http.StatusOK {
		_, err := io.ReadAll(resp.Body)
		if err != nil {
			panic(err)
		}
	}
}

func CloseSafe(resp *http.Response) {
	if resp == nil || resp.Body == nil {
		return
	}

	_, _ = io.Copy(ioutil.Discard, resp.Body)
	_ = resp.Body.Close()
}

For the server I compiled using GOOS=linux GOARCH=amd64 go build -o server

package main

import (
	"fmt"
	"net/http"
)

func main() {
	err := http.ListenAndServe("0.0.0.0:3000", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, http.StatusText(http.StatusUnauthorized), http.StatusUnauthorized)
	}))

	fmt.Println(err)
}

I ran the server on 3 different EC2 instances, inside Docker.

FROM scratch

COPY server .

ENTRYPOINT ["./server"]

I used the following commands to build and run it.

docker build . -t test:latest
docker run --rm --network host test:latest

When I ran the server on the three different machines using host networking and then executed the client, the client panicked with the following stack traces:

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x44bb66]

goroutine 305 [running]:
runtime.throw({0x6d35fe?, 0xc000217428?})
        /opt/homebrew/opt/go/libexec/src/runtime/panic.go:1047 +0x5d fp=0xc000217368 sp=0xc000217338 pc=0x43439d
runtime.sigpanic()
        /opt/homebrew/opt/go/libexec/src/runtime/signal_unix.go:821 +0x3e9 fp=0xc0002173c8 sp=0xc000217368 pc=0x44a429
runtime.(*waitq).enqueue(...)
        /opt/homebrew/opt/go/libexec/src/runtime/chan.go:766
runtime.selectgo(0xc000217750, 0xc000217558, 0x6c828e?, 0x0, 0xc000217540?, 0x1)
        /opt/homebrew/opt/go/libexec/src/runtime/select.go:316 +0x6c6 fp=0xc000217508 sp=0xc0002173c8 pc=0x4467e6
net/http.(*persistConn).roundTrip(0xc000a7b9e0, 0xc001027e40)
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:2638 +0x994 fp=0xc0002177c0 sp=0xc000217508 pc=0x6461b4
net/http.(*Transport).roundTrip(0x8cdda0, 0xc000223500)
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:603 +0x7fa fp=0xc0002179e8 sp=0xc0002177c0 pc=0x63a2da
net/http.(*Transport).RoundTrip(0x26?, 0x73aea0?)
        /opt/homebrew/opt/go/libexec/src/net/http/roundtrip.go:17 +0x19 fp=0xc000217a08 sp=0xc0002179e8 pc=0x6247d9
net/http.send(0xc000223500, {0x73aea0, 0x8cdda0}, {0x8?, 0x6bac80?, 0x0?})
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:252 +0x5f7 fp=0xc000217c00 sp=0xc000217a08 pc=0x5edef7
net/http.(*Client).send(0x8d3d40, 0xc000223500, {0xc001095520?, 0x6?, 0x0?})
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:176 +0x9b fp=0xc000217c78 sp=0xc000217c00 pc=0x5ed77b
net/http.(*Client).do(0x8d3d40, 0xc000223500)
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:716 +0x8fb fp=0xc000217e80 sp=0xc000217c78 pc=0x5ef99b
net/http.(*Client).Do(...)
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:582
main.write(0x73b140?, {0x6ca23e?, 0xc000217f98?}, 0x1?, {0x1?, 0x0?})
....
fatal error: runtime: sudog with non-nil next

goroutine 140 [running]:
runtime.throw({0x6cf467?, 0x4e636c?})
        /opt/homebrew/opt/go/libexec/src/runtime/panic.go:1047 +0x5d fp=0xc00016f1b8 sp=0xc00016f188 pc=0x43439d
runtime.releaseSudog(0xc00016f2e0?)
        /opt/homebrew/opt/go/libexec/src/runtime/proc.go:443 +0x316 fp=0xc00016f228 sp=0xc00016f1b8 pc=0x4377f6
runtime.semacquire1(0xc0011169a8, 0xa0?, 0x1, 0x0, 0x40?)
        /opt/homebrew/opt/go/libexec/src/runtime/sema.go:168 +0x2b9 fp=0xc00016f290 sp=0xc00016f228 pc=0x447a79
internal/poll.runtime_Semacquire(0x7f75faa8b418?)
        /opt/homebrew/opt/go/libexec/src/runtime/sema.go:67 +0x27 fp=0xc00016f2c8 sp=0xc00016f290 pc=0x461227
internal/poll.(*FD).Close(0xc001116980)
        /opt/homebrew/opt/go/libexec/src/internal/poll/fd_unix.go:116 +0x6d fp=0xc00016f2f0 sp=0xc00016f2c8 pc=0x4ad58d
net.(*netFD).Close(0xc001116980)
        /opt/homebrew/opt/go/libexec/src/net/fd_posix.go:37 +0x38 fp=0xc00016f320 sp=0xc00016f2f0 pc=0x4dde18
net.(*conn).Close(0xc0005e0340)
        /opt/homebrew/opt/go/libexec/src/net/net.go:207 +0x45 fp=0xc00016f350 sp=0xc00016f320 pc=0x4e8fc5
net.(*TCPConn).Close(0x445f93?)
        <autogenerated>:1 +0x25 fp=0xc00016f368 sp=0xc00016f350 pc=0x4f47a5
net/http.(*persistConn).closeLocked(0xc000b9e120, {0x73ac60, 0xc00008ca40})
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:2732 +0x130 fp=0xc00016f3f0 sp=0xc00016f368 pc=0x646bb0
net/http.(*persistConn).cancelRequest(0x4112e4?, {0x73ac60?, 0xc00008c170?})
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:1991 +0xe5 fp=0xc00016f450 sp=0xc00016f3f0 pc=0x642f65
net/http.(*persistConn).cancelRequest-fm({0x73ac60?, 0xc00008c170?})
        <autogenerated>:1 +0x39 fp=0xc00016f478 sp=0xc00016f450 pc=0x651f19
net/http.(*Transport).cancelRequest(0x8cdda0, {0xc00016f528?}, {0x73ac60, 0xc00008c170})
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:810 +0xee fp=0xc00016f4d8 sp=0xc00016f478 pc=0x63b36e
net/http.(*persistConn).roundTrip(0xc000b9e120, 0xc000cc2380)
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:2684 +0xbac fp=0xc00016f790 sp=0xc00016f4d8 pc=0x6463cc
net/http.(*Transport).roundTrip(0x8cdda0, 0xc0007d1a00)
        /opt/homebrew/opt/go/libexec/src/net/http/transport.go:603 +0x7fa fp=0xc00016f9b8 sp=0xc00016f790 pc=0x63a2da
net/http.(*Transport).RoundTrip(0xc00016f9f8?, 0x73aea0?)
        /opt/homebrew/opt/go/libexec/src/net/http/roundtrip.go:17 +0x19 fp=0xc00016f9d8 sp=0xc00016f9b8 pc=0x6247d9
net/http.send(0xc0007d1a00, {0x73aea0, 0x8cdda0}, {0x8?, 0x6bac80?, 0x0?})
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:252 +0x5f7 fp=0xc00016fbd0 sp=0xc00016f9d8 pc=0x5edef7
net/http.(*Client).send(0x8d3d40, 0xc0007d1a00, {0xc000ded2b0?, 0x6?, 0x0?})
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:176 +0x9b fp=0xc00016fc48 sp=0xc00016fbd0 pc=0x5ed77b
net/http.(*Client).do(0x8d3d40, 0xc0007d1a00)
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:716 +0x8fb fp=0xc00016fe50 sp=0xc00016fc48 pc=0x5ef99b
net/http.(*Client).Do(...)
        /opt/homebrew/opt/go/libexec/src/net/http/client.go:582
main.writeCancel(0x73b140?, {0x6ca22a, 0x14}, 0x1?, {0x1?, 0x0?})
....

What did you expect to see?

I expected not to see a panic.

What did you see instead?

I saw a panic from within the standard library. This issue occurred when one of the servers was down.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 8, 2023
@yznima yznima changed the title runtime: Panics and segmentation violation (Go 1.19.8/Go 1.20.4) net/http: Panics and segmentation violation (Go 1.19.8/Go 1.20.4) Nov 8, 2023
@randall77
Contributor

This is strange.

runtime.(*waitq).enqueue(...)
/opt/homebrew/opt/go/libexec/src/runtime/chan.go:766

At that line, the pointer being dereferenced can't possibly be nil.

This looks like memory corruption. Have you tried running with the race detector?
In particular, the go cancel() line looks strange. Isn't the request still using that context? Perhaps you mean defer cancel()?
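
For reference, a minimal sketch of the defer-based variant being suggested here (a hypothetical rewrite of the reporter's writeCancel, assuming the same imports and the CloseSafe helper from the client above). With defer, cancel only runs after client.Do has returned, so the request is never cancelled mid-flight:

func writeCancelDeferred(client *http.Client, host string, port int) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	// Cancel only once the request has completed, instead of racing with it.
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodPost, fmt.Sprintf("http://%s:%d/hello", host, port), bytes.NewReader([]byte(`{}`)))
	if err != nil {
		panic(err)
	}

	resp, err := client.Do(req)
	defer CloseSafe(resp)
	if err != nil {
		println(err.Error())
	}
}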

@yznima
Author

yznima commented Nov 8, 2023

@randall77

Isn't the request still using that context? Perhaps you mean defer cancel()?

No, that is on purpose, to simulate a context cancellation.

At that line, the pointer dereference can't possibly be nil.

That's what I thought. I was able to capture a core dump of a similar process that was panicking with the same error. The sudog points to an invalid address, so it is definitely memory corruption. It also seems like the sudog struct is getting released or overwritten. In fact, if you look at the two stack traces, it seems like one is trying to release the sudog while the other is trying to add it to the waitq, and I suspect those two calls are racing with each other. I've tried running with -race, but that doesn't reproduce the issue; I don't know why.

See the core dump

[Screenshot: core dump (Screen Shot 2023-11-08 at 5 59 12 PM)]
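
For anyone trying to capture a similar core dump, one possible approach (an assumption; not necessarily how it was captured here) is to add something like the following to the client's main package, which has the same effect as running with GOTRACEBACK=crash:

import "runtime/debug"

func init() {
	// On a fatal error the runtime dumps all goroutine stacks and then aborts
	// with SIGABRT, letting the kernel write a core file (provided the core
	// size ulimit on the host allows it).
	debug.SetTraceback("crash")
}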

@heschi heschi added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Nov 9, 2023
@heschi heschi added this to the Backlog milestone Nov 9, 2023
@yznima
Author

yznima commented Nov 9, 2023

Adding the last GC and SCHED trace lines emitted before the panic:

SCHED

SCHED 24409ms: gomaxprocs=2 idleprocs=0 threads=13 spinningthreads=1 needspinning=1 idlethreads=9 runqueue=0 [0 161]
fatal error: unexpected signal during runtime execution
SCHED 24409ms: gomaxprocs=2 idleprocs=0 threads=13 spinningthreads=1 needspinning=1 idlethreads=9 runqueue=0 gcwaiting=false nmidlelocked=0 stopwait=0 sysmonwait=false
  P0: status=1 schedtick=96440 syscalltick=121539 m=0 runqsize=0 gfreecnt=61 timerslen=21
  P1: status=1 schedtick=100650 syscalltick=126085 m=5 runqsize=163 gfreecnt=52 timerslen=9
  M12: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M11: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=nil
  M10: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M9: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M8: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M7: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M6: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M5: p=1 curg=194 mallocing=1 throwing=2 preemptoff= locks=3 dying=1 spinning=false blocked=false lockedg=nil
  M4: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M3: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M2: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=true lockedg=nil
  M1: p=nil curg=nil mallocing=0 throwing=0 preemptoff= locks=0 dying=0 spinning=false blocked=false lockedg=nil
  M0: p=0 curg=166125 mallocing=0 throwing=0 preemptoff= locks=3 dying=0 spinning=false blocked=false lockedg=nil
  
  ...
  G194: status=2(sync.Mutex.Lock) m=5 lockedm=nil
  G166125: status=2(sync.Mutex.Lock) m=0 lockedm=nil
  ...
 

GC

gc 96 @24.262s 6%: 0.082+33+0.003 ms clock, 0.16+14/10/0+0.006 ms cpu, 8->8->3 MB, 8 MB goal, 1 MB stacks, 0 MB globals, 2 P

@yznima yznima changed the title net/http: Panics and segmentation violation (Go 1.19.8/Go 1.20.4) net/http: Panics and segmentation violation (Go 1.19.8/Go 1.20.4/Go 1.21.4) Nov 15, 2023
@mknyszek
Contributor

Sorry for the lack of follow-up here; it got into our triage queue last week, but we didn't get to it. We looked at it this week, but probably no one will follow up until next week because of the US holiday.

I tried to reproduce at tip-of-tree and with go1.21.0 (just what I had lying around) with some slight modifications (3 servers all running on localhost on 3 different ports), but I haven't been able to yet. I'll leave it running for a while. How quickly does this reproduce for you?

This is mysterious enough that maybe we want to take some shots in the dark:

  • What version of the Linux kernel are you running?
  • Can you reproduce if you build your programs from a Linux machine? What about the same machine you're testing on?
  • Does it reproduce if GODEBUG=asyncpreemptoff=1 is set?

I also noticed that quite a few people have given this issue a thumbs-up; could anyone who did so briefly follow up? Are you also affected? Is this happening often? Is there any data you can share about the execution environment?

Thanks.

@yznima
Author

yznima commented Nov 27, 2023

Hey @mknyszek, I've gathered some of the information you were looking for. I'll get back to you about the kernel version.

How quickly does this reproduce for you?

If I run the program about 10 times and stop it every time after 1 minute, I'm guaranteed to see it at least once.

What version of the Linux kernel are you running?

This is happening on CentOS 7.9.

Can you reproduce if you build your programs from a Linux machine? What about the same machine you're testing on?

I rebuilt on the same machine using the latest Go version and still reproduced the issue.

Does it reproduce if GODEBUG=asyncpreemptoff=1 is set?

Yes, it still reproduces the issue.

@mknyszek
Contributor

Thanks. Does this fail on other Linux machines? Perhaps with different Linux distros and/or different Linux kernel versions? I haven't been able to reproduce so far.

@mknyszek mknyszek self-assigned this Nov 29, 2023
@yznima
Author

yznima commented Nov 30, 2023

@mknyszek I've been able to reproduce it on other Linux machines as long as they were running the same OS. I haven't been able to reproduce it on any other distro or kernel version; other than that, I've only tested it on Amazon AL2.

@yznima
Author

yznima commented Jan 11, 2024

Hi @mknyszek,

Happy New Year 🎉. I apologize for the delay in responding, but I haven't forgotten about this issue, nor have I stopped working on it. Today I was actually able to put together much more detailed instructions and information that point more closely to why this issue occurs.

In summary, the problem seems to arise specifically when there is an IPSec tunnel between two nodes. I have consistently and quickly reproduced this bug when the connection between the two nodes is secured using an IPSec tunnel. Below are detailed instructions on how to reproduce this issue. Please note that my testing has been in an AWS environment, so I've tailored the instructions to align closely with AWS. Feel free to make adjustments to suit your specific environment.

Create two EC2 instances

  • Use the public CentOS 7.9 AMI (e.g., ami-0aedf6b1cb669b4c7)
  • Create the EC2 instances in the same VPC, subnet, and security group
  • Ensure you have SSH key access for the remaining steps

Setup the VPN Tunnel

  • On both instances, run the following commands to install and enable IPSec
sudo yum install -y libreswan # Using Libreswan version 3.25
sudo systemctl enable ipsec --now
  • Create a pre-shared key using the following command
openssl rand -base64 128 # The output should be put on one line when used in the subsequent files
  • On instance 1, create the following files with this content
    • /etc/ipsec.d/instance-2.conf
conn instance-2
    type=transport
    left=<INSTANCE2_IP>
    right=<INSTANCE1_IP>
    authby=secret
    auto=start
  • /etc/ipsec.d/instance-2.secrets
<INSTANCE2_IP> <INSTANCE1_IP> : PSK "<PRE_SHARED_KEY>"
  • On instance 2, repeat the same process and create the same files, but with the instance 1 and instance 2 IPs reversed.
  • Now that the files are in place, start the VPN tunnel by running the following (on instance 1)
ipsec auto --add instance-1
/sbin/service ipsec restart
/usr/sbin/ipsec auto --up instance-1
  • On instance 2
ipsec auto --add instance-2
/sbin/service ipsec restart
/usr/sbin/ipsec auto --up instance-2

You should see a log line such as the following, indicating the VPN tunnel has been created

STATE_V2_IPSEC_I: IPsec SA established transport mode

One quick note: your security group should be configured as follows to allow the IPSec tunnel to be created.

[Screenshot: security group configuration (Screen Shot 2024-01-10 at 9 55 59 PM)]

Run the executables

  • Now that the connection is set up, upload the server and client binaries to the instances. I've modified the client and server slightly; feel free to recompile.
    • On instance 2, run ./server
    • On instance 1, run ./client --hosts <INSTANCE2_IP>

You should be able to see the issue reproduce within a minute.

Hope this helps. Looking forward to hearing back from you.

@yznima yznima changed the title net/http: Panics and segmentation violation (Go 1.19.8/Go 1.20.4/Go 1.21.4) net/http: Segmentation violation when using an IPSec VPN Tunnel (Go 1.19.8/Go 1.20.4/Go 1.21.4) Jan 11, 2024
@yznima
Author

yznima commented Jan 18, 2024

@mknyszek I wanted to add a new piece of information I've discovered. It might be useful in your investigation, and is also a note for anyone else who might run into this in the future: using StrongSWAN instead of LibreSWAN resolves the issue. In addition, the issue doesn't reproduce when using the AL2 AMI.
