Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kernel crash when closing device beginning in Go 1.9 #43906

Closed
NateThornton opened this issue Jan 25, 2021 · 5 comments
Closed

Kernel crash when closing device beginning in Go 1.9 #43906

NateThornton opened this issue Jan 25, 2021 · 5 comments
Labels
FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@NateThornton
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.15.7 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=
GONOSUMDB="
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build761147876=/tmp/go-build -gno-record-gcc-switches"
GOROOT/bin/go version: go version go1.15.7 linux/amd64
GOROOT/bin/go tool compile -V: compile version go1.15.7
uname -sr: Linux 4.18.0-240.10.1.el8_3.x86_64
/lib64/libc.so.6: GNU C Library (GNU libc) stable release version 2.28.
gdb --version: GNU gdb (GDB) Red Hat Enterprise Linux 8.2-12.el8

What did you do?

Open switchtec device to perform basic identify of the device - this is a Go port of the switchtec-user code from https://github.com/Microsemi/switchtec-user

The following code results in a kernel panic beginning in go1.9

package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

func main() {
	f, err := os.OpenFile("/dev/switchtec0", os.O_RDWR, 0)
	if err != nil {
		fmt.Printf("open file failure")
	}
	req := make([]byte, 4)
	binary.LittleEndian.PutUint32(req[:], uint32(129))

	_, err = f.Write(req)
	if err != nil {
		fmt.Printf("Write failed")
	}

	rsp := make([]byte, 8)
	_, err = f.Read(rsp)
	if err != nil {
		fmt.Printf("Read failed")
	}

	ret := binary.LittleEndian.Uint32(rsp[0:4])
	if ret != 0 {
		fmt.Printf("Read Response %d", ret)
	}

	fmt.Printf("Response: %+v", rsp[4:])
	err = f.Close() // <----------------- THIS WILL CAUSE KERNEL CRASH
	if err != nil {
		fmt.Printf("Close error")
	}
}

go1.8.7 has no issues

What did you expect to see?

No kernel crash observed and a valid Response: [0 0 0 0] from the device

What did you see instead?

Beginning in Go1.9 the above code crashes the kernel with the kdump info as show below

kdump Output
[  669.797781] list_del corruption. prev->next should be ffff9546e2ccd618, but was ffff954759b32020
[  669.797793] ------------[ cut here ]------------
[  669.797794] kernel BUG at lib/list_debug.c:53!
[  669.797801] invalid opcode: 0000 [#1] SMP PTI
[  669.797804] CPU: 4 PID: 16007 Comm: main Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-240.10.1.el8_3.x86_64 #1
[  669.797805] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0011.C0001.032620200659 03/26/2020
[  669.797812] RIP: 0010:__list_del_entry_valid.cold.1+0x34/0x4c
[  669.797814] Code: cd 4e 8a e8 b8 44 cd ff 0f 0b 48 c7 c7 60 ce 4e 8a e8 aa 44 cd ff 0f 0b 48 89 f2 48 89 fe 48 c7 c7 20 ce 4e 8a e8 96 44 cd ff <0f> 0b 48 89 fe 48 c7 c7 e8 cd 4e 8a e8 85 44 cd ff 0f 0b 90 90 90
[  669.797816] RSP: 0018:ffffa479c7a03c70 EFLAGS: 00010046
[  669.797818] RAX: 0000000000000054 RBX: ffff9546e2ccd600 RCX: 0000000000000000
[  669.797820] RDX: 0000000000000000 RSI: ffff95475ff16a08 RDI: ffff95475ff16a08
[  669.797821] RBP: ffff954759b32018 R08: 0000000000000564 R09: 0000000000000063
[  669.797822] R10: 0000000000000000 R11: ffffa479c7a03b20 R12: 0000000000000286
[  669.797824] R13: dead000000000200 R14: dead000000000100 R15: ffff95471efff860
[  669.797826] FS:  0000000000000000(0000) GS:ffff95475ff00000(0000) knlGS:0000000000000000
[  669.797827] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  669.797829] CR2: 00000000039e6b90 CR3: 0000000142a0a001 CR4: 00000000007606e0
[  669.797830] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  669.797831] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  669.797832] PKRU: 55555554
[  669.797834] Call Trace:
[  669.797840]  remove_wait_queue+0x20/0x60
[  669.797845]  ep_unregister_pollwait.isra.14+0x68/0x90
[  669.797848]  ep_free+0x4c/0xd0
[  669.797851]  ep_eventpoll_release+0x16/0x20
[  669.797854]  __fput+0xb7/0x230
[  669.797859]  task_work_run+0x8a/0xb0
[  669.797863]  do_exit+0x38a/0xae0
[  669.797866]  do_group_exit+0x3a/0xa0
[  669.797869]  get_signal+0x158/0x870
[  669.797874]  do_signal+0x36/0x660
[  669.797878]  ? __switch_to_asm+0x35/0x70
[  669.797880]  ? __switch_to_asm+0x41/0x70
[  669.797882]  ? __switch_to_asm+0x35/0x70
[  669.797884]  ? __switch_to_asm+0x41/0x70
[  669.797886]  ? __switch_to_asm+0x35/0x70
[  669.797889]  ? __audit_syscall_exit+0x249/0x2a0
[  669.797894]  exit_to_usermode_loop+0x89/0xf0
[  669.797897]  do_syscall_64+0x198/0x1a0
[  669.797899]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  669.797902] RIP: 0033:0x465043
[  669.797907] Code: Bad RIP value.
[  669.797908] RSP: 002b:000000c000051e58 EFLAGS: 00000286 ORIG_RAX: 00000000000000ca
[  669.797910] RAX: fffffffffffffe00 RBX: 000000c000046800 RCX: 0000000000465043
[  669.797912] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000c000046948
[  669.797913] RBP: 000000c000051ea0 R08: 0000000000000000 R09: 0000000000000000
[  669.797914] R10: 0000000000000000 R11: 0000000000000286 R12: 00000000000000f2
[  669.797915] R13: 0000000000000000 R14: 00000000004da7a2 R15: 0000000000000000
[  669.797917] Modules linked in: switchtec(OE) vxlan ip6_udp_tunnel udp_tunnel xt_nat xt_statistic ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_comment xt_mark xt_CHECKSUM xt_conntrack ipt_REJECT nf_nat_tftp nft_objref nf_conntrack_tftp ipt_MASQUERADE nf_conntrack_netlink nft_counter xt_addrtype tun br_netfilter bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables nft_compat ip_set overlay nf_tables nfnetlink sunrpc vfat fat intel_rapl_msr intel_rapl_common isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm iTCO_wdt iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore pcspkr intel_rapl_perf joydev mei_me i2c_i801 lpc_ich mei ioatdma wmi dca ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter ip_tables xfs libcrc32c
[  669.797953]  sd_mod sg ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper i40e syscopyarea ahci sysfillrect sysimgblt fb_sys_fops libahci crc32c_intel drm libata uas usb_storage dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: switchtec]
@ianlancetaylor
Copy link
Contributor

If the kernel is crashing, this sounds like a bug in the kernel, not a bug in Go, and it sounds like something that should be fixed in the kernel, not something that should be fixed in Go.

If you want to find the difference between Go 1.8 and Go 1.9, try running the program under strace -f to see what changed in how the system calls were made.

@ianlancetaylor ianlancetaylor added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Jan 25, 2021
@NateThornton
Copy link
Author

I agree @ianlancetaylor - the issue should be that Go 1.9 exposed this kernel bug. But I still would like to understand the change that broke.

The OpenFile results in a *file.nonblock: true and I can trace the call to set nonblock fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE) = 0

I'm wondering if its possible to turn off epoll for this character device.

strace's are below, but it just confirms that 1.9 is dropping into the O_NONBLOCK/epoll code as suspect.

go1.8.7_strace.txt
go1.9_strace.txt

@ianlancetaylor
Copy link
Contributor

Use syscall.Open and then os.NewFile.

I'm going to close this issue because I don't think there is anything for us to do in Go.

@NateThornton
Copy link
Author

Thanks @ianlancetaylor - appreciate the help. I confirmed the suggested workaround works. I will pursue root cause with the kernel contributors.

I wonder if this behavior might be more baked in to os.OpenFile(), particularly for use cases like mine where you are interfacing with a device driver in simple request/response fashion. Something like an O_ flag the opposite of O_NONBLOCK so os.OpenFile(name, O_RDWR | O_BLOCK, 0)

@ianlancetaylor
Copy link
Contributor

These are highly unusual cases, and they normally indicate kernel bugs. I don't see a reason to have two workarounds.

@golang golang locked and limited conversation to collaborators Jan 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

3 participants