Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime/cgo: cgo Process crashing due to fatal error in cgo calls during Go and Python interaction #58290

Closed
writetonikhil opened this issue Feb 3, 2023 · 13 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.

Comments

@writetonikhil
Copy link

writetonikhil commented Feb 3, 2023

What version of Go are you using (go version)?

$ go version
go version go1.17.8 linux/amd64

Does this issue reproduce with the latest release?

Yes, tried with 1.18.5

What operating system and processor architecture are you using (go env)?

Ubuntu 20.04 and 22.04

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.8"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3248353624=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We are using CGO package (https://pkg.go.dev/cmd/cgo) to communicate from our Go code to Python code. This code is deployed in a Docker container where the container restarts as the interaction between Go and Python throws a fatal error. The tracebacks are attached.
CGOTracebacks.txt

These are the code snippets failing:

Case 1: It gives fatal error in C.PyFinalize() call

    // Init python env
defer C.Py_Finalize()
if !utils.PyIsInitialized() {
	C.Py_Initialize()
}

Case 2: fails in PyErrPrintEx() method

   oImport := PyImportImportModule("recommendations.src.models.pod_usage_recommendations")
if !(oImport != nil && PyErrOccurred() == nil) {
	utils.PyErrPrintEx(false)
	tenant.Log(dashfw.Error, "failed to import module 'recommendations'")
	return nil
}
 func PyImportImportModule(name string) *PyObject {
     cname := C.CString(name)
     defer C.free(unsafe.Pointer(cname))
     return togo(C.PyImport_ImportModule(cname))
  }

Case 3: Code same as case 2 but fails in PyImportImportModule() module
There is no sequence but randomly fails in any of the above.

What did you expect to see?

We should not get fatal errors.

What did you see instead?

Tracebacks as attached.

@writetonikhil writetonikhil changed the title affected/package: cgo affected/package: cgo Process crashing due to fatal error in cgo calls during Go and Python interaction Feb 3, 2023
@seankhliao
Copy link
Member

with the 1.20 release, our current supported versions are 1.19/1.20

@seankhliao seankhliao changed the title affected/package: cgo Process crashing due to fatal error in cgo calls during Go and Python interaction runtime/cgo: cgo Process crashing due to fatal error in cgo calls during Go and Python interaction Feb 3, 2023
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Feb 3, 2023
@writetonikhil
Copy link
Author

writetonikhil commented Feb 3, 2023

Could you please let me know the compatible versions in this format?

Ubuntu:
Python:
Golang:
GCC:
GLIBC:

We are using the current versions:
Ubuntu: 20.04
Python: 3.8
Golang: 1.17.8
GCC: 9.4.0
GLIBC: 2.31

root@ccm-insights-controller-app-5cb78c7ffb-24f2t:/# gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@ccm-insights-controller-app-5cb78c7ffb-24f2t:/# ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
root@ccm-insights-controller-app-5cb78c7ffb-24f2t:/#

@writetonikhil
Copy link
Author

I just tried with golang 1.19 and the issue is still getting reproduced.

@bcmills
Copy link
Contributor

bcmills commented Feb 3, 2023

From the stack traces those are segfaults with very low addresses (0x38). That suggests that something in the C code is attempting to access a field of a struct or object through a null pointer.

@writetonikhil, this looks to me like a bug in the C / Python part of your program. I would suggest using runtime.SetCgoTraceback (perhaps with github.com/ianlancetaylor/cgosymbolizer to get better stack traces so that you can diagnose it.

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Feb 3, 2023
@writetonikhil
Copy link
Author

Ok let me try using runtime.SetCgoTraceback. Will let you know after the testing.

@writetonikhil
Copy link
Author

@bcmills I tried adding your suggested code. But the only things gets added is some "non-go function" lines and other Py operations(like PyArray_Item_INCREF or PyArray_FromScalar or _PyLong_FromNbInt) which I am not able to understand why they are getting called while doing C.PyFinalize() call. There are some hash codes as well. Could you please point me to the documentation of how to read this trace? The trace is like this:

2023-02-06T07:00:00.233Z | 2023/02/06 07:00:00 recommendations_jobs.go:465: tenant=[63db6335995067deeaa84ced - tenant2] failed to import python module 'recommendations'
2023-02-06T07:00:00.233Z | fatal error: unexpected signal during runtime execution
2023-02-06T07:00:00.237Z | [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x7f97402e9e28]
2023-02-06T07:00:00.237Z |
2023-02-06T07:00:00.237Z | runtime stack:
2023-02-06T07:00:00.237Z | runtime.throw({0x111f181?, 0x0?})
2023-02-06T07:00:00.237Z | runtime/panic.go:992 +0x71
2023-02-06T07:00:00.237Z | runtime.sigpanic()
2023-02-06T07:00:00.237Z | runtime/signal_unix.go:802 +0x389
2023-02-06T07:00:00.237Z |
2023-02-06T07:00:00.237Z | goroutine 2659 [syscall, locked to thread]:
2023-02-06T07:00:00.290Z | PyArray_Item_INCREF
2023-02-06T07:00:00.290Z | pc=0x7f97402e9e28
2023-02-06T07:00:00.290Z | PyArray_FromScalar
2023-02-06T07:00:00.290Z | pc=0x7f97402ed406
2023-02-06T07:00:00.290Z | gentype_int
2023-02-06T07:00:00.290Z | pc=0x7f97402ee37b
2023-02-06T07:00:00.290Z | _PyLong_FromNbInt
2023-02-06T07:00:00.290Z | pc=0x7f97791fe23d
2023-02-06T07:00:00.290Z | _PyLong_FromNbIndexOrNbInt
2023-02-06T07:00:00.290Z | pc=0x7f97791fe3e3
2023-02-06T07:00:00.290Z | PyLong_AsLongAndOverflow
2023-02-06T07:00:00.290Z | pc=0x7f97791feba4
2023-02-06T07:00:00.290Z | _PyLong_AsInt
2023-02-06T07:00:00.290Z | pc=0x7f97791fecc1
2023-02-06T07:00:00.290Z | non-Go function
2023-02-06T07:00:00.290Z | pc=0x7f977906e510
2023-02-06T07:00:00.290Z | non-Go function
2023-02-06T07:00:00.290Z | pc=0x7f97791c41b8
2023-02-06T07:00:00.290Z | _PyObject_MakeTpCall
2023-02-06T07:00:00.290Z | pc=0x7f9779223b2a
2023-02-06T07:00:00.290Z | non-Go function
2023-02-06T07:00:00.290Z | pc=0x7f9779223fb1
2023-02-06T07:00:00.290Z | _PyObject_CallFunction_SizeT
2023-02-06T07:00:00.290Z | pc=0x7f9779224f11
2023-02-06T07:00:00.290Z | non-Go function
2023-02-06T07:00:00.290Z | pc=0x7f97790bf68a
2023-02-06T07:00:00.290Z | runtime.cgocall(0xe7f300, 0xc00082cad8)
2023-02-06T07:00:00.290Z | runtime/cgocall.go:157 +0x5c fp=0xc00082cab0 sp=0xc00082ca78 pc=0x40933c
2023-02-06T07:00:00.290Z | github.kyndryl.net/MCMP-Cluster-Management/ccm-insights-controller/worker._Cfunc_Py_Finalize()
2023-02-06T07:00:00.290Z | _cgo_gotypes.go:43 +0x45 fp=0xc00082cad8 sp=0xc00082cab0 pc=0xe7ad25
2023-02-06T07:00:00.290Z | github.kyndryl.net/MCMP-Cluster-Management/ccm-insights-controller/worker.feedStatsToModel.func1()
2023-02-06T07:00:00.290Z | github.kyndryl.net/MCMP-Cluster-Management/ccm-insights-controller/worker/recommendations_jobs.go:436 +0x25 fp=0xc00082cae8 sp=0xc00082cad8 pc=0xe7e485

@writetonikhil
Copy link
Author

RCA:
After adding the runtime.SetCgoTraceback to the code as suggested by you, we got more detailed traceback. After analyzing this we found that we were initializing/un initializing/reinitializing the Python env every time the Recommendation job ran.
This behavior is not supported by numpy as it wants the env to be initialized only once, it reuses the resources. With multiple initializations, it tries to use resources from the first interpreter with the second interpreter and gives segmentation fault.
We moved out python env init logic to run only once during the lifetime of the service inside the container and that resolved the problem. Thanks for the help.
Please keep this ticket open for a couple of days as we are still monitoring to see if the issue reappears.

@bcmills bcmills added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Feb 6, 2023
@writetonikhil
Copy link
Author

The issue is fixed. Thanks for the help.

@writetonikhil
Copy link
Author

writetonikhil commented Feb 23, 2023

The issue started reappearing without a known reason. We haven't made any change to the code but upgraded go version to 1.19.5 because our vulnerability check tool raised vulnerability issues. Any clues would be appreciated.
Here is the CGO traceback:

2023-02-22T09:00:03.373554953Z Fatal Python error: _PyInterpreterState_Get(): no current thread state
2023-02-22T09:00:03.373585403Z Python runtime state: initialized
2023-02-22T09:00:03.373590223Z 
2023-02-22T09:00:03.373594543Z Thread 0x00007f255b2b8740 (most recent call first):
2023-02-22T09:00:03.373598473Z   File "<frozen importlib._bootstrap>", line 98 in acquire
2023-02-22T09:00:03.373602744Z   File "<frozen importlib._bootstrap>", line 202 in _lock_unlock_module
2023-02-22T09:00:03.373606654Z   File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
2023-02-22T09:00:03.373610464Z   File "<frozen importlib._bootstrap_external>", line 1166 in create_module
2023-02-22T09:00:03.373613974Z   File "<frozen importlib._bootstrap>", line 556 in module_from_spec
2023-02-22T09:00:03.373617374Z   File "<frozen importlib._bootstrap>", line 657 in _load_unlocked
2023-02-22T09:00:03.373621024Z   File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
2023-02-22T09:00:03.373624554Z   File "<frozen importlib._bootstrap>", line 991 in _find_and_load
2023-02-22T09:00:03.373627984Z   File "/usr/local/lib/python3.8/dist-packages/numpy/core/overrides.py", line 7 in <module>
2023-02-22T09:00:03.373631884Z   File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
2023-02-22T09:00:03.373635384Z   File "<frozen importlib._bootstrap_external>", line 848 in exec_module
2023-02-22T09:00:03.373639144Z   File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
2023-02-22T09:00:03.373642564Z   File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
2023-02-22T09:00:03.373653954Z   File "<frozen importlib._bootstrap>", line 991 in _find_and_load
2023-02-22T09:00:03.373657674Z   File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
2023-02-22T09:00:03.373661385Z   File "<frozen importlib._bootstrap>", line 1042 in _handle_fromlist
2023-02-22T09:00:03.373664705Z   File "/usr/local/lib/python3.8/dist-packages/numpy/core/multiarray.py", line 12 in <module>
2023-02-22T09:00:03.373668255Z   File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
2023-02-22T09:00:03.373671535Z   File "<frozen importlib._bootstrap_external>", line 848 in exec_module
2023-02-22T09:00:03.373674845Z   File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
2023-02-22T09:00:03.373678095Z   File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
2023-02-22T09:00:03.373681405Z   File "<frozen importlib._bootstrap>", line 991 in _find_and_load
2023-02-22T09:00:03.373697215Z   File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
2023-02-22T09:00:03.373700865Z   File "<frozen importlib._bootstrap>", line 1042 in _handle_fromlist
2023-02-22T09:00:03.373704435Z   File "/usr/local/lib/python3.8/dist-packages/numpy/core/__init__.py", line 22 in <module>
2023-02-22T09:00:03.373709785Z   File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
2023-02-22T09:00:03.373713455Z   File "<frozen importlib._bootstrap_external>", line 848 in exec_module
2023-02-22T09:00:03.373717455Z   File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
2023-02-22T09:00:03.373720806Z   File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
2023-02-22T09:00:03.373724036Z   File "<frozen importlib._bootstrap>", line 991 in _find_and_load
2023-02-22T09:00:03.373727496Z   File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
2023-02-22T09:00:03.373730906Z   File "<frozen importlib._bootstrap>", line 1042 in _handle_fromlist
2023-02-22T09:00:03.373734716Z   File "/usr/local/lib/python3.8/dist-packages/numpy/__init__.py", line 140 in <module>
2023-02-22T09:00:03.373738256Z   File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
2023-02-22T09:00:03.373741506Z   File "<frozen importlib._bootstrap_external>", line 848 in exec_module
2023-02-22T09:00:03.373744886Z   File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
2023-02-22T09:00:03.373748306Z   File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
2023-02-22T09:00:03.373751636Z   File "<frozen importlib._bootstrap>", line 991 in _find_and_load
2023-02-22T09:00:03.373754916Z   File "/usr/local/lib/python3.8/dist-packages/pandas/__init__.py", line 11 in <module>
2023-02-22T09:00:03.373762256Z   File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
2023-02-22T09:00:03.373766096Z   File "<frozen importlib._bootstrap_external>", line 848 in exec_module
2023-02-22T09:00:03.373769726Z   File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
2023-02-22T09:00:03.373773476Z   File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
2023-02-22T09:00:03.373777046Z   File "<frozen importlib._bootstrap>", line 991 in _find_and_load
2023-02-22T09:00:03.373780277Z   File "/recommendations/src/models/pod_usage_recommendations.py", line 12 in <module>
2023-02-22T09:00:03.373784077Z   File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
2023-02-22T09:00:03.373795467Z   File "<frozen importlib._bootstrap_external>", line 848 in exec_module
2023-02-22T09:00:03.373799077Z   File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
2023-02-22T09:00:03.373805457Z   File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
2023-02-22T09:00:03.373838657Z   File "<frozen importlib._bootstrap>", line 991 in _find_and_load
2023-02-22T09:00:03.376243778Z SIGABRT: abort
2023-02-22T09:00:03.376264118Z PC=0x7f255b4a400b m=12 sigcode=18446744073709551610
2023-02-22T09:00:03.376269998Z signal arrived during cgo execution
2023-02-22T09:00:03.376274078Z 
2023-02-22T09:00:03.376277448Z goroutine 22475 [syscall, locked to thread]:
2023-02-22T09:00:03.376280938Z runtime.cgocall(0xe9f7e0, 0xc0009f0ad0)
2023-02-22T09:00:03.376284448Z 	runtime/cgocall.go:158 +0x5c fp=0xc0009f0aa8 sp=0xc0009f0a70 pc=0x408abc
2023-02-22T09:00:03.376287779Z github.kyndryl.net/MCMP-Cluster-Management/ccm-insights-controller/utils._Cfunc_PyImport_ImportModule(0x7f2504002e40)
2023-02-22T09:00:03.376291239Z 	_cgo_gotypes.go:654 +0x49 fp=0xc0009f0ad0 sp=0xc0009f0aa8 pc=0xe018a9

@writetonikhil writetonikhil reopened this Feb 23, 2023
@ianlancetaylor
Copy link
Contributor

I don't know much about Python but my understanding is that it runs in a single thread. It may be necessary to prevent concurrent calls from Go into Python. Or it may be necessary to ensure that all calls from Go to Python run on the same thread.

In general mixing Go and Python in the same program is not well supported, at least not by us.

@bcmills
Copy link
Contributor

bcmills commented Feb 24, 2023

Python 3 uses thread-local state. Calls from Python to Go are generally fine (because they stay on the Python thread), but calls from Go to Python need to be careful to lock the goroutine to its thread (or make all of the Python calls within a single C function call) and ensure that the thread-local state has been initialized on it.

See https://docs.python.org/3/c-api/init.html#non-python-created-threads.

@writetonikhil
Copy link
Author

Python 3 uses thread-local state. Calls from Python to Go are generally fine (because they stay on the Python thread), but calls from Go to Python need to be careful to lock the goroutine to its thread (or make all of the Python calls within a single C function call) and ensure that the thread-local state has been initialized on it.

See https://docs.python.org/3/c-api/init.html#non-python-created-threads.

Thank for the clue. I am trying this suggestion and well let you know the results.

@writetonikhil
Copy link
Author

Its not working. We are splitting the go and python code into separate microservices interacting with each other using REST APIs.

@golang golang locked and limited conversation to collaborators Mar 22, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

5 participants