Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: Crash in autogenerated function #43934

Open
hagarjibm opened this issue Jan 26, 2021 · 14 comments
Open

runtime: Crash in autogenerated function #43934

hagarjibm opened this issue Jan 26, 2021 · 14 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@hagarjibm
Copy link

hagarjibm commented Jan 26, 2021

What version of Go are you using (go version)?

$ go version
go version go1.15.2 linux/amd64

Does this issue reproduce with the latest release?

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/phuser/.cache/go-build"
GOENV="/home/phuser/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/phuser/go/pkg/mod"
GONOPROXY="github.ibm.com"
GONOSUMDB="github.ibm.com"
GOOS="linux"
GOPATH="/home/phuser/go"
GOPRIVATE="github.ibm.com"
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/phuser/goroot"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/phuser/goroot/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g  "
CGO_ENABLED="1"
GOMOD="/code/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build256493267=/tmp/go-build -gno-record-gcc-switches"
GOROOT/bin/go version: go version go1.15.2 linux/amd64
GOROOT/bin/go tool compile -V: compile version go1.15.2
uname -sr: Linux 4.19.76-linuxkit
/lib64/libc.so.6: GNU C Library (GNU libc) stable release version 2.17, by Roland McGrath et al.
gdb --version: GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-119.el7

What did you do?

We have a project that consumes a number of external projects. As part of this, we log key points of interest using Zap (https://github.com/uber-go/zap) and Lumberjack (https://github.com/natefinch/lumberjack) during the execution of the program.

We received a report of a crash from a tester with the following call stack:

go.uber.org/zap/zapcore.(*writerWrapper).Write(0xc000205640, 0xc01ce21c00, 0xd5, 0x400, 0x1370200, 0x0, 0x0)
        <autogenerated>:1 +0x5a fp=0xc01d705890 sp=0xc01d705848 pc=0x90a69a
go.uber.org/zap/zapcore.(*ioCore).Write(0xc00039e180, 0x0, 0xbff8a186aa6a005f, 0xa711142ecab, 0x1370200, 0x0, 0x0, 0xc04697ad70, 0x10, 0x1, ...)
        /home/phuser/go/pkg/mod/go.uber.org/zap@v1.16.0/zapcore/core.go:90 +0x106 fp=0xc01d705978 sp=0xc01d705890 pc=0x8f94c6
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00039de00, 0xc046952900, 0x2, 0x2)
        /home/phuser/go/pkg/mod/go.uber.org/zap@v1.16.0/zapcore/entry.go:220 +0x12d fp=0xc01d705b38 sp=0xc01d705978 pc=0x8fb86d
go.uber.org/zap.(*Logger).Info(0xc0003d8000, 0xc04697ad70, 0x10, 0xc046952900, 0x2, 0x2)
        /home/phuser/go/pkg/mod/go.uber.org/zap@v1.16.0/logger.go:188 +0x85 fp=0xc01d705b70 sp=0xc01d705b38 pc=0x9137e5
(omitted)
Aborted (core dumped)

Unfortunately, no core dump was found, despite ulimit -c unlimited having been set prior, as well as a call to the following upon startup of our program:

debug.SetTraceback("crash")

Using go tool objdump the compiler generated the following assembly in which we crashed:

TEXT go.uber.org/zap/zapcore.(*writerWrapper).Write(SB) <autogenerated>
  <autogenerated>:1     0x90a640                64488b0c25f8ffffff      MOVQ FS:0xfffffff8, CX
  <autogenerated>:1     0x90a649                483b6110                CMPQ 0x10(CX), SP
  <autogenerated>:1     0x90a64d                7673                    JBE 0x90a6c2
  <autogenerated>:1     0x90a64f                4883ec40                SUBQ $0x40, SP
  <autogenerated>:1     0x90a653                48896c2438              MOVQ BP, 0x38(SP)
  <autogenerated>:1     0x90a658                488d6c2438              LEAQ 0x38(SP), BP
  <autogenerated>:1     0x90a65d                488b5920                MOVQ 0x20(CX), BX
  <autogenerated>:1     0x90a661                4885db                  TESTQ BX, BX
  <autogenerated>:1     0x90a664                7566                    JNE 0x90a6cc
  <autogenerated>:1     0x90a666                488b442448              MOVQ 0x48(SP), AX
  <autogenerated>:1     0x90a66b                488b08                  MOVQ 0(AX), CX
  <autogenerated>:1     0x90a66e                488b4008                MOVQ 0x8(AX), AX
  <autogenerated>:1     0x90a672                488b4918                MOVQ 0x18(CX), CX
  <autogenerated>:1     0x90a676                48890424                MOVQ AX, 0(SP)
  <autogenerated>:1     0x90a67a                488b442450              MOVQ 0x50(SP), AX
  <autogenerated>:1     0x90a67f                4889442408              MOVQ AX, 0x8(SP)
  <autogenerated>:1     0x90a684                488b442458              MOVQ 0x58(SP), AX
  <autogenerated>:1     0x90a689                4889442410              MOVQ AX, 0x10(SP)
  <autogenerated>:1     0x90a68e                488b442460              MOVQ 0x60(SP), AX
  <autogenerated>:1     0x90a693                4889442418              MOVQ AX, 0x18(SP)
  <autogenerated>:1     0x90a698                ffd1                    CALL CX
  <autogenerated>:1     0x90a69a                488b442420              MOVQ 0x20(SP), AX
  <autogenerated>:1     0x90a69f                488b4c2428              MOVQ 0x28(SP), CX
  <autogenerated>:1     0x90a6a4                488b542430              MOVQ 0x30(SP), DX
  <autogenerated>:1     0x90a6a9                4889442468              MOVQ AX, 0x68(SP)
  <autogenerated>:1     0x90a6ae                48894c2470              MOVQ CX, 0x70(SP)
  <autogenerated>:1     0x90a6b3                4889542478              MOVQ DX, 0x78(SP)
  <autogenerated>:1     0x90a6b8                488b6c2438              MOVQ 0x38(SP), BP
  <autogenerated>:1     0x90a6bd                4883c440                ADDQ $0x40, SP
  <autogenerated>:1     0x90a6c1                c3                      RET
  <autogenerated>:1     0x90a6c2                e8194eb6ff              CALL runtime.morestack_noctxt(SB)
  <autogenerated>:1     0x90a6c7                e974ffffff              JMP go.uber.org/zap/zapcore.(*writerWrapper).Write(SB)
  <autogenerated>:1     0x90a6cc                488d7c2448              LEAQ 0x48(SP), DI
  <autogenerated>:1     0x90a6d1                48393b                  CMPQ DI, 0(BX)
  <autogenerated>:1     0x90a6d4                7590                    JNE 0x90a666
  <autogenerated>:1     0x90a6d6                488923                  MOVQ SP, 0(BX)
  <autogenerated>:1     0x90a6d9                eb8b                    JMP 0x90a666

So it looks to me like we crashed on: MOVQ 0x20(SP), AX

We create the zap logger with:

	lLogger := &lumberjack.Logger{
		Filename:   fn,
		MaxSize:    500, // In megabytes.
		MaxAge:     1,   // Maximum age of log.
		MaxBackups: 4,   // Maximum amount of logs.
		LocalTime:  true,
	}

	config := zap.NewDevelopmentEncoderConfig()
	config.FunctionKey = "F"
	core := zapcore.NewTee(
		zapcore.NewCore(
			zapcore.NewConsoleEncoder(config),
			zapcore.AddSync(lLogger),
			zap.LevelEnablerFunc(func(lvl zapcore.Level) bool {
				return lvl <= zapcore.ErrorLevel
			})),
		zapcore.NewCore(
			zapcore.NewConsoleEncoder(config),
			zapcore.AddSync(os.Stderr),
			zap.LevelEnablerFunc(func(lvl zapcore.Level) bool {
				return lvl >= zapcore.WarnLevel
			})),
	)

Thus according to the zap code, zapcore.AddSync will wrap the Lumberjack logger with a writerWrapper which is merely an interface comprised of io.Writer plus a Sync method, which is supplied by zap.

It looks to me like the Go compiler is autogenerating some glue to make the writerWrapper interface compatible with the desired WriteSyncer which is an interface specifying an io.Writer and Sync method.

I'm opening the issue here because it dies in between Zap and Lumberjack and seems like we have stack corruption somehow, rather than squarely in one of those two projects. It also looks like, although this is not PPC64LE, Go has previously had issues with stack corruption in or nearby autogenerated functions on #10628.

If it matters, we have attempted to log a string, a zap.Uint64 (7 character key and uint64 value), and a zap.Bool (7 character key and bool value).

What did you expect to see?

No crash.

What did you see instead?

Crash without a core file.

@davecheney
Copy link
Contributor

Does this issue reproduce with the latest release?

Can you please test with Go 1.15.7 or 1.16beta1. Thank you

@davecheney
Copy link
Contributor

Have you tried running your program under the race detector? See https://blog.golang.org/race-detector .

@hagarjibm
Copy link
Author

@davecheney We'll go up to 1.16beta1 and then if that doesn't produce a crash we'll also run it with the race detector. Any other tricks we need to do to get a core dump in case that would help?

@randall77
Copy link
Contributor

So it looks to me like we crashed on: MOVQ 0x20(SP), AX

That almost certainly means the thing being called corrupted the stack pointer somehow. This is the first use of SP after the call. Do you know what code it called?

You show the stack trace, but what was the error that was printed above the trace ("segmentation fault at ..." or something)?

@hagarjibm
Copy link
Author

hagarjibm commented Jan 27, 2021

That's a good question. I didn't get that from our tester and the system has been cleaned up since. If we can recreate the crash I will be sure they capture it this time.

Do you know what code it called?

Yes, the thing being called is Lumberjack's logger's write function:

// Write implements io.Writer.  If a write would cause the log file to be larger
// than MaxSize, the file is closed, renamed to include a timestamp of the
// current time, and a new log file is created using the original log file name.
// If the length of the write is greater than MaxSize, an error is returned.
func (l *Logger) Write(p []byte) (n int, err error) {
	l.mu.Lock()
	defer l.mu.Unlock()

	writeLen := int64(len(p))
	if writeLen > l.max() {
		return 0, fmt.Errorf(
			"write length %d exceeds maximum file size %d", writeLen, l.max(),
		)
	}

	if l.file == nil {
		if err = l.openExistingOrNew(len(p)); err != nil {
			return 0, err
		}
	}

	if l.size+writeLen > l.max() {
		if err := l.rotate(); err != nil {
			return 0, err
		}
	}

	n, err = l.file.Write(p)
	l.size += int64(n)

	return n, err
}

I don't see anything special here that should be corrupting the stack. I also thought the stack was generally inaccessible to Go programs and isn't generally susceptible to corruption.

@davecheney
Copy link
Contributor

Any other tricks we need to do to get a core dump in case that would help?

try env GOTRACEBACK=crash

@hagarjibm
Copy link
Author

Any other tricks we need to do to get a core dump in case that would help?

try env GOTRACEBACK=crash

Isn't that what happens when the code calls debug.SetTraceback("crash")?

@toothrot toothrot added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 29, 2021
@toothrot toothrot added this to the Backlog milestone Jan 29, 2021
@toothrot toothrot changed the title Crash in autogenerated function runtime: Crash in autogenerated function Jan 29, 2021
@hagarjibm
Copy link
Author

hagarjibm commented Feb 5, 2021

I have finally been able to consistently reproduce our code crashing. This time though it crashes on:

> [unrecovered-panic] runtime.fatalpanic() /home/phuser/goroot1.16/src/runtime/panic.go:1190 (hits goroutine(768):1 total:1) (PC: 0x44aa60)
Warning: debugging optimized function
	runtime.curg._panic.arg: interface {}(string) "runtime error: invalid memory address or nil pointer dereference"

The block of code it crashes in does not seem like it should be possible to crash in this way:

		if p.h == nil {
			p.h = h
		}

		plog.LogDebug(class, "ready",
			zap.Int64("idx", p.idx),
			zap.Int("num", num),
crash here >>>			zap.Int("remaining", remaining),
			zap.Int("offset", offset),
		)

Because I can consistently crash now, I ran under dlv with the race detector on, and we have no races prior to the crash. However, this is interesting:

(dlv) p &remaining
(*int)(0xc01542b008)
(dlv) p remaining
131072
(dlv) p num
36
(dlv) p p
*(unreadable read out of bounds)
(dlv) p &p
("**github.ibm.com/Spectrum-Protect/Phusion/phusion.omittedType")(0xbeef0000)
(dlv)

It seems impossible to me that our code should be able to do this given that we pass the if statement which dereferences the pointer that is now pointing at 0xbeef0000. However, interestingly to me clobberfree sets memory to 0xdeadbeef which is suggestive of some sort of overlay somewhere that bleeds into our page pointer. Asking delve to emit goroutines also looks mangled, as there's several entries like this:

  Goroutine 852 - User: :0 ??? (0xe9fd9c) (thread 98019)
  Goroutine 853 - User: :0 ??? (0xe9fd9c) (thread 98024)

Given that the race detector didn't trip leading up to the crash, I really don't know where to take this next.

@randall77
Copy link
Contributor

  	zap.Int64("pageIdx", page.pageIdx),

If this is the last reference to page in your function, then it is quite possible that the pointer is overwritten / stale by the point where the crash happens.
It's not clear from your output whether you're debugging an optimized binary or not. If dlv compiled your binary for you then I think it turns optimizations off, so maybe an overwrite couldn't happen? In which case, maybe the weird address does mean something.

However, interestingly to me clobberfree sets memory to 0xdeadbeef

Did you explicitly request clobberfree? It isn't normally run unless you set GODEBUG=clobberfree=1.

@hagarjibm
Copy link
Author

hagarjibm commented Feb 5, 2021

It's not the last reference, there's many more references beyond that line. The binary is compiled with go build -race and then run with dlv exec, I wasn't sure how to turn on the race detector when delve builds the code.

I did find some sequence of events in our code and how we receive data from FUSE, along with a subtle bug in our code which leads to a situation I didn't think possible with Go. The function we crash in goes roughly like this

Call helper function to find pointer to appropriate data structure or build a new one
Set reference to the receiver of the crashing function in the data structure from above if not set
Emit the logging message above
Perform operations on the data input into the function that crashed and store results into the data structure obtained above
Perform some tracking operations data structure
Loop back if there's more work to be done

The subtle bug in our code lies in the helper function which prepares a blank data structure if there isn't a matching one already. What we weren't doing was heading down the path that prepared the blank data structure in some situations, leading the crashing function to think it had a valid structure when it did not.

Here's the situation I didn't think possible: the helper function returns several pointers, including this necessary data structure. When this data structure creation path wasn't taken, the page output variable wasn't assigned by the helper function, which I had thought would result in return of a nil pointer. When I fix the helper function to always assign a valid data structure, the crash goes away.

So now my question is, when you have a function of the form:

func (t *T) helper(... input params) (page *ioPage, t2 *T2, ... err error)

Shouldn't page and t2 be nil until otherwise assigned? By naming them in the function declaration we should be able to rely on naked returns throughout the function always either returning the last assigned value or nil if never assigned. It seems like we're getting garbage that always equals 0xbeef000 for this data structure and the assigned result in the crashing function for t2 is always unreadable when the creation path isn't taken.

Did you explicitly request clobberfree? It isn't normally run unless you set GODEBUG=clobberfree=1.

No, I didn't dig too deep in that so I didn't realize it was behind an environment variable. Wonder where the beef is coming from then?

@randall77
Copy link
Contributor

The binary is compiled with go build -race and then run with dlv exec, I wasn't sure how to turn on the race detector when delve builds the code.

Ok, that's an optimized binary. Try using go build -race -gcflags=-N and see if that changes anything.

So now my question is, when you have a function of the form:

func (t *T) helper(... input params) (page *ioPage, t2 *T2, ... err error)
Shouldn't page and t2 be nil until otherwise assigned?

Yes.

Can you share the code and the assembly for that helper function?

@hagarjibm
Copy link
Author

Ok, that's an optimized binary. Try using go build -race -gcflags=-N and see if that changes anything.

I will give that a try later on this afternoon.

Can you share the code and the assembly for that helper function?

I'll need to explore if that's a possibility and what approvals I might need. I will get back soon hopefully on that.

@hagarjibm
Copy link
Author

We need to work on paring this down for a minimal recreation so that we don't inadvertently expose confidential information before we can do that. I'm booked for the rest of this sprint, however we'll begin next week with our next sprint, so hopefully in the next few weeks we'll have what you're requesting ready.

@hagarjibm
Copy link
Author

Also, nothing changed when I used the gcflags argument.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: Triage Backlog
Development

No branches or pull requests

5 participants