Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: intermittent slice bounds out of range panic on linux/mipsle #58212

Closed
racingmars opened this issue Feb 1, 2023 · 10 comments
Closed
Assignees
Labels
arch-mips compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@racingmars
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.19.5 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

I'm building from linux/amd64; I'm cross-compiling to GOOS=linux GOARCH=mipsle

The target device is running Linux kernel version 3.15.0.

The target device is relatively memory-constrained:

$ free
             total         used         free       shared      buffers
Mem:        254268       184452        69816            0            0
-/+ buffers:             184452        69816
Swap:            0            0            0

However, if I intentionally exhaust the memory from my Go code, I get an "out of memory" panic, as expected, not the panic I am reporting in this bug.

On the target device, /proc/cpuinfo lists the CPU as (repeated four times for processors/cores 0-3):

processor               : 0
cpu model               : MIPS 1004Kc V2.12
BogoMIPS                : 266.64
wait instruction        : yes
microsecond timers      : yes
tlb_entries             : 32
extra interrupt vector  : yes
hardware watchpoint     : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
isa                     : mips1 mips2 mips32r1 mips32r2
ASEs implemented        : mips16 dsp mt
shadow register sets    : 1
kscratch registers      : 0
core                    : 1
VCED exceptions         : not available
VCEI exceptions         : not available
VPE                     : 1

What did you do?

Using Go 1.19.5 on Linux/amd64, I built the following program using:
$ GOOS=linux GOARCH=mipsle CGO_ENABLED=0 go build -o mipstest -ldflags="-s -w"

package main

import (
	"bufio"
	"log"
	"os"
	"regexp"
)

func main() {
	var fileData string

	scanner := bufio.NewScanner(os.Stdin)
	gotBegin := false

	for scanner.Scan() {
		line := scanner.Text()
		if match, _ := regexp.MatchString("----CMDDATA END----", line); match {
			break
		}

		if gotBegin {
			fileData += line
		}

		if !gotBegin {
			if match, _ := regexp.MatchString("----CMDDATA BEGIN----", line); match {
				gotBegin = true
			}
		}
	}

	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}

I built an ~800k, 28,000 line sample input file for the program with the following commands:

$ echo "----CMDDATA BEGIN----" > samplefile.txt
$ i=0
$ while [ $i -lt 28000 ]; do
    echo "abcdefghijklmnopqrstuvwxyz12" >> samplefile.txt
    i=$(expr $i + 1)
  done
$ echo "----CMDDATA END----" >> samplefile.txt

I then transferred the binary and the samplefile.txt to the MIPS system and run with:
$ ./mipstest < samplefile.txt

What did you expect to see?

I expected the program to run to completion. Or, if this ultimately is related to a low-memory condition, I'd expect something other than a slice bounds out of range panic.

What did you see instead?

Out of 100 runs, 41 of the runs resulted in the program terminating with a panic:

fatal error: slice bounds out of range

runtime stack:
runtime.throw({0x109c2e, 0x19})
        /usr/lib/go/src/runtime/panic.go:1047 +0x54 fp=0x839f7c sp=0x839f68 pc=0x52208
runtime.panicCheck1(0x4c3e0, {0x109c2e, 0x19})
        /usr/lib/go/src/runtime/panic.go:57 +0xfc fp=0x839f90 sp=0x839f7c pc=0x4f3f0
runtime.goPanicSliceAcap(0x800138, 0x200)
        /usr/lib/go/src/runtime/panic.go:138 +0x54 fp=0x839fb0 sp=0x839f90 pc=0x4f7e0
runtime.wbBufFlush1(0x81a000)
        /usr/lib/go/src/runtime/mwbbuf.go:287 +0x16c fp=0x839fe0 sp=0x839fb0 pc=0x4c3e0
runtime.wbBufFlush.func1()
        /usr/lib/go/src/runtime/mwbbuf.go:201 +0x44 fp=0x839fe8 sp=0x839fe0 pc=0x4c1a8
runtime.systemstack()
        /usr/lib/go/src/runtime/asm_mipsx.s:188 +0x7c fp=0x839fec sp=0x839fe8 pc=0x88224

[...]

I have attached a couple complete representative examples of the full stack traces. In one case, I got a very different looking panic -- see panic.different.txt, which has "fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)".

A successful run of the program with the samplefile.txt takes a couple minutes on this platform. Sometimes the panics occur very quickly after beginning execution, but other times they occur farther into the execution time.

Attachments

main.go.txt
samplefile.txt

panic.1.txt
panic.2.txt
panic.different.txt

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 1, 2023
@bcmills
Copy link
Contributor

bcmills commented Feb 1, 2023

(attn @golang/mips; CC @golang/runtime)

If the machine is memory constrained, it seems likely that the runtime is missing an error check somewhere in a memory-allocation path.

@mknyszek
Copy link
Contributor

mknyszek commented Feb 1, 2023

CC @golang/runtime

That's odd. That might be some kind of memory corruption, since I don't see how this slicing operation could fail like that. (The other kind of failure is interesting, too.) This is a relatively simple program, so it's not great that it's falling over so easily.

Here's a few things to try:

  • GODEBUG=gccheckmark=1 (looks for bugs in concurrent GC)
  • GODEBUG=gcshrinkstackoff=1 (disables stack shrinking, since this looks potentially like some kind of stack corruption)
  • GODEBUG=asyncpreemptoff=1 (disables asynchronous preemption signals, just because)

@mknyszek
Copy link
Contributor

mknyszek commented Feb 1, 2023

@bcmills To me, this doesn't look like a symptom of running out of memory, and in general we do a decent job of surfacing out of memory errors that we can catch (after some improvements in that area, some years ago). What we can't catch is something like the Linux OOM killer turning on when a page is demanded, but I don't think we'd crash like this.

Perhaps I'm missing your point though.

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Feb 1, 2023
@mknyszek mknyszek added this to the Backlog milestone Feb 1, 2023
@cherrymui
Copy link
Member

runtime.goPanicSliceAcap(0x800138, 0x200)

Here the slice index is 0x800138, which looks like an address. (The cap 0x200 looks correct, which matches the write barrier buffer size.) Looks like a memory corruption. Agreed that this doesn't look like running out of memory.

The target device is running Linux kernel version 3.15.0.

Would it be possible to try newer kernels? Thanks.

@racingmars
Copy link
Author

The target device is running Linux kernel version 3.15.0.

Would it be possible to try newer kernels? Thanks.

Yes, there are newer firmware versions available which I suspect upgrade the kernel. However, we only have one of these devices and a firmware update would be a one-way operation, so I'd lose the ability to further investigate or test potential fixes against this current kernel version.

@Rongronggg9
Copy link
Member

Rongronggg9 commented Feb 1, 2023

I agree that this doesn't look like OOM too. 69MiB of free memory is enough for your case. I can't reproduce the issue at all (no panic or OOM) on Loongson 3A4000 with a 16MiB memory limit.

The most probable cause is memory corruption, either a software/kernel problem or a hardware one. As upgrading the kernel is impossible, though broken RAM seems not very common, could you do some memory tests (e.g. https://pyropus.ca./software/memtester/)?

The target device is running Linux kernel version 3.15.0.

firmware

I suspect that the manufacturer offers firmware with their downstream kernel, which is expected to be not so robust. In such a case, only knowing that its version is, sometimes, not enough. Could you please tell that what the device is (MT7621?) and do the manufacturer offers the source code of their downstream kernel?

a firmware update would be a one-way operation, so I'd lose the ability to further investigate or test potential fixes against this current kernel version.

I don't think the manufacturer will apply a strict downgrade restriction (e.g. e-fuse). If it deserves a try to you, dump the ROM with a flasher and it should be able to be used to flash it back at any time.

@racingmars
Copy link
Author

Trying the GODEBUG options suggested by @mknyszek:

GODEBUG=gccheckmark=1 (looks for bugs in concurrent GC)

No change in behavior -- 48% failure rate across 25 test runs; same panics as before.

GODEBUG=gcshrinkstackoff=1 (disables stack shrinking, since this looks potentially like some kind of stack corruption)

No change in behavior -- 32% failure rate across 25 test runs; same panics as before.

GODEBUG=asyncpreemptoff=1 (disables asynchronous preemption signals, just because)

Aha! No panics across 25 test runs. Whatever this is doing, it seems to avoid causing the conditions that lead to the panic.

@mknyszek
Copy link
Contributor

mknyszek commented Feb 1, 2023

@racingmars That suggests to me a possible kernel bug. We've seen both Linux and Illumos (Illumos-derivative? and maybe some BSD variant but I forget) kernel bugs with issues in their signal delivery implementation. Usually something very subtle like a page fault in the signal handler causing a register's state to not get properly restored.

What GODEBUG=asyncpreemptoff=1 is disabling is a mechanism that uses OS signals to preempt running goroutines. This makes Go programs deliver signals to themselves much more frequently than other applications typically would. As a result, it's had a habit of triggering latent bugs in kernel implementations.

It's possible it's still an issue on the Go side, as we have of course discovered bugs in this path since it was released in Go 1.14. But a firmware/kernel upgrade might be sufficient to just solve your problem. Also, if GODEBUG=asyncpreemptoff=1 is a sufficient workaround, I suggest following through with that in the meantime. The downside is you might lose some performance.

@racingmars
Copy link
Author

racingmars commented Feb 1, 2023

Thanks @mknyszek -- (and cc @Rongronggg9 regarding firmware/kernel update)

I've ordered a second device so I'll be able to upgrade the firmware on that one without risking losing the reproducible test case on this 3.15.0 kernel version in case that's useful in the future.

It'll be about a week before it arrives, probably, but I'll follow up with a report on whether or not upgrading makes the problem go away.

Thanks!

@racingmars
Copy link
Author

Sorry for the delay in the follow-up, but I was able to get more hardware and investigate this further.

On another device with the same kernel version (3.15.0), the problem was reproducible, so it wasn't due to bad hardware on the first device.

After upgrading to the next oldest firmware version available to me, which uses kernel version 4.4.27, the problem is no longer reproducible. The test Go program and my real application run 100% reliably. So it does appear something was fixed in the mipsle kernel along the way related to how Go is depending on signals.

If folks come across this issue by searching for the error message in the future, the summary is:

  1. Upgrade to a newer kernel to fix the problem, if possible.
  2. Otherwise, setting the environment variable GODEBUG=asyncpreemptoff=1 as a work-around will allow the program to run reliably but with a performance penalty on older kernel versions.

@golang golang locked and limited conversation to collaborators Mar 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-mips compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants