
context: panic in parentCancelCtx when context timeout #64307

Closed
wawababe opened this issue Nov 21, 2023 · 5 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@wawababe

wawababe commented Nov 21, 2023

What version of Go are you using (go version)?

$ go version
1.17.3 

Does this issue reproduce with the latest release?

Yes. It panics in version 1.21 too.

What operating system and processor architecture are you using (go env)?

uname -a Output

Linux Tencent-SNG 5.4.119-1-tlinux4-0010.3 #1 SMP Thu Jan 5 17:31:23 CST 2023 x86_64 x86_64 x86_64 GNU/Linux

What did you do?

The service is a typical TCP RPC service.
In my business handler, I copy the context passed in from the RPC handler or from the redo machine and pass it to RPC calling functions, both synchronously and asynchronously (a sketch of this usage follows the function below). It panics occasionally in our production environment, and I haven't found the exact path. But when NewThsContext creates a new context from context.Background instead, the panic no longer happens.

Here is the function I used to copy the context.
func NewThsContext(ctx context.Context, uid int64) *ths.Context {
	nctx := ths.NewContext(ctx)
	nctx.Uin = uint64(uid)
	h := th.NewTmeHeader()
	nauth := h.GetAuthInfo()
	nauth.UID = uint64(uid)
	h.SetAuthInfo(&nauth)
	if h.GetLocalServiceLen() == 0 {
		h.SetLocalServiceName("go_unified_ktv_chorus_game_svr")
	}
	nctx.Head = h
	return nctx
}
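
To make the description above concrete, here is a minimal sketch of how the copy is passed downstream. Only NewThsContext comes from the service code; handleBusiness and callDownstream are made-up names, and it assumes *ths.Context satisfies context.Context (which the framework code further below suggests, since it wraps a parent context).

func handleBusiness(ctx context.Context, uid int64) {
	nctx := NewThsContext(ctx, uid) // copy of the RPC/redo context

	// sync path: the downstream call observes the parent's deadline and cancelation
	callDownstream(nctx)

	// async path: the copied context keeps being read after the handler returns
	go callDownstream(nctx)
}

// callDownstream stands in for the real RPC calling function.
func callDownstream(ctx context.Context) {
	select {
	case <-ctx.Done():
	case <-time.After(10 * time.Millisecond):
	}
}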

The context is passed along many call paths, and I haven't found the one that causes the panic.
The paths mainly include:

1. RPC timeout context passed to the business handler
func (c *conn) serve(ctx context.Context) {
	ctx = context.WithValue(ctx, ClientAddr, c.remoteAddr)
	ctx, cancelCtx := context.WithCancel(ctx)
	c.cancelCtx = cancelCtx
	defer c.rwc.Close()

	c.wg.Add(3)
	go c.readRequests(ctx)
	go c.handle(ctx)
	go c.writeResponses(ctx)
	c.wg.Wait()
}

func (c *conn) handle(ctx context.Context) {
	defer func() {
		c.cancelCtx()
		c.wg.Done()
		if err := recover(); err != nil {
			c.rwc.SetDeadline(time.Now())
			buf := make([]byte, RecoverStackSize)
			buf = buf[:runtime.Stack(buf, false)]
			log.Output(1, fmt.Sprintf("[PANIC] %v: %v\n%s", c.remoteAddr, err, buf))
		}
	}()

	for {
		select {
		case <-ctx.Done():
			return
		case req := <-c.cin:
			go func() {
				subCtx, cancel := context.WithTimeout(ctx, c.server.msgTimeout)
				rsp, err := c.server.handler.Serve(subCtx, req)
				cancel()
				if err != nil {
					SysLog(err)
					return
				}
				select {
				case <-ctx.Done():
					SysLog("tcp handle business goroutine context done")
					return
				case c.cout <- rsp:
					SysLog("tcp handle business goroutine write rsp to cout channel")
				}
			}()
		}
	}
}

func (sm ServeMutex) Serve(baseCtx context.Context, reqBuf []byte) ([]byte, error) {
	conf := cat.GetConfig()
	defer func() {
		if e := recover(); e != nil {
			dataBuf := make([]byte, 4*1024*1024)
			dataBuf = dataBuf[:runtime.Stack(dataBuf, false)]
			cat.Logf("[PANIC]%v\n>> %s", e, dataBuf)
		}
	}()

	// parse the business protocol head
	head := parseHead(reqBuf)

	// set the context with info from the head
	baseCtx = context.WithValue(baseCtx, RemoteServiceName, head.GetLocalServiceName())
	baseCtx = context.WithValue(baseCtx, LocalServiceName, head.GetServiceName())
	baseCtx = context.WithValue(baseCtx, TraceID, head.GetTraceID())
	baseCtx = context.WithValue(baseCtx, SpanID, head.GetSpanID())
	baseCtx = context.WithValue(baseCtx, ParentSpanID, head.GetParentSpanID())
	baseCtx = context.WithValue(baseCtx, Env, head.GetEnv())

	ctx := NewContext(baseCtx)
	defer ctx.WriteLog()

	// set the protocol head
	ctx.Head = head

	// find the handler
	entry, ok := sm.mapEntries[head.GetCmdPattern()]
	if !ok {
		return nil, fmt.Errorf("invalid cmd pattern:%s", head.GetCmdPattern())
	}

	// set the request and response types
	ctx.Req = reflect.New(entry.reqType.Elem()).Interface()
	ctx.Rsp = reflect.New(entry.rspType.Elem()).Interface()

	// decode the request
	if err := BodyUnmarshal(head, reqBuf, ctx.Req); err != nil {
		cat.Logf("protocol Decode buf failed, msg:%v", err)
		return nil, err
	}

	// run the business handler
	entry.h.Process(ctx)

	ctx.ErrCode = ctx.Head.GetResultCode()

	// pack header and response
	// .....
	return pkgBuf, nil
}

// NewContext creates a new context.
func NewContext(ctx context.Context) *Context {
	newCtx := Context{
		Context:   cat.NewContext(ctx, cat.GetConfig().Log.LogWriter),
		startTime: time.Now(),
	}
	newCtx.LogLevel = int(cat.GetConfig().Log.Level)
	newCtx.Head = NewHeader()
	return &newCtx
}

2. Redo timeout context

func (r *Redo) ExecAsync(obj interface{}, opts ...ExecOptionFunc) {
	r.execWithWAL(true, r.ctx, obj, opts...)
}

func (r *Redo) execWithWAL(async bool, ctx context.Context, obj interface{}, opts ...ExecOptionFunc) error {
	// .....
	f := func() error {
		err := r.exec(ctx, key, obj)
		if requireCommit {
			r.wal.Commit(key)
		}
		return err
	}

	if async {
		go f()
		return nil
	}
	return f()
}

The default redo context is created in function New.

func New(path string, h Handler, fs ...optionFunc) (*Redo, error) {

	r := Redo{
		h:      h,
		keyGen: newKeyGenerator(),
	}
	...
	r.ctx, r.cancel = context.WithCancel(context.Background())

	if r.opts.Retry.Thread > 1 {
		r.retryCh = make(chan []byte)
		for i := 0; i < int(r.opts.Retry.Thread); i++ {
			r.wg.Go(r.retryLoop)
		}
	}

	r.wg.Go(r.mainLoop)
	return &r, nil
}

What did you expect to see?

What did you see instead?

Here is a recent stack trace dumped with Go 1.17.3:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x4bc9c9]

goroutine 1684597594 [running]:
context.(*valueCtx).Done(0xc000042068)
        :1 +0x29
context.parentCancelCtx({0x1b9d600, 0xc0121dbc20})
        /usr/local/go/src/context/context.go:298 +0x2b
context.removeChild({0x1b9d600, 0xc0121dbc20}, {0x1b7fe30, 0xc0121dbce0})
        /usr/local/go/src/context/context.go:315 +0x31
context.(*timerCtx).cancel(0xc0121dbce0, 0x1, {0x1b731a0, 0x69709e0})
        /usr/local/go/src/context/context.go:486 +0x55
context.WithDeadline.func2()
        /usr/local/go/src/context/context.go:456 +0x30
created by time.goFunc
        /usr/local/go/src/time/sleep.go:180 +0x31

In both Go 1.17.3 and Go 1.21, our services (running in Linux VMs and in containers) panic at the parent.Done() call in parentCancelCtx in context/context.go:

// parentCancelCtx returns the underlying *cancelCtx for parent.
// It does this by looking up parent.Value(&cancelCtxKey) to find
// the innermost enclosing *cancelCtx and then checking whether
// parent.Done() matches that *cancelCtx. (If not, the *cancelCtx
// has been wrapped in a custom implementation providing a
// different done channel, in which case we should not bypass it.)
func parentCancelCtx(parent Context) (*cancelCtx, bool) {
	done := parent.Done()
	if done == closedchan || done == nil {
		return nil, false
	}
	p, ok := parent.Value(&cancelCtxKey).(*cancelCtx)
	if !ok {
		return nil, false
	}
	pdone, _ := p.done.Load().(chan struct{})
	if pdone != done {
		return nil, false
	}
	return p, true
}
@bcmills bcmills changed the title affected/context: panic in parentCancelCtx when context timeout context: panic in parentCancelCtx when context timeout Nov 21, 2023
@bcmills
Contributor

bcmills commented Nov 21, 2023

> What did you do?

Please fill out the complete issue template. What is the surrounding code that creates the context.Context involved in the panic?

If possible, provide a runnable example using https://go.dev/play/.

@bcmills bcmills added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Nov 21, 2023
@bcmills bcmills added this to the Backlog milestone Nov 21, 2023
@wawababe
Author

I have added more details to the "What did you do?" section. The context passes through too many call paths, and the service code cannot be run in the Go playground.

> What did you do?
>
> Please fill out the complete issue template. What is the surrounding code that creates the context.Context involved in the panic?
>
> If possible, provide a runnable example using https://go.dev/play/.

@bcmills bcmills removed the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Nov 27, 2023
@bcmills
Contributor

bcmills commented Nov 27, 2023

Somehow the context is becoming corrupted. Any uses of unsafe in the surrounding code, or data races? If you build and run the program with the race detector enabled (the -race flag), does it report anything?
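
For what it's worth, here is a minimal, playground-style sketch (not the service's code; all names are made up) of the kind of data race that can corrupt a context this way: one goroutine replaces a struct's embedded Context while other goroutines derive timeout contexts from it through the context.Context interface. Running it with go run -race flags the pattern immediately.

package main

import (
	"context"
	"time"
)

type ctxKey struct{}

// sharedCtx mimics a request-scoped wrapper whose embedded Context gets
// replaced without synchronization (hypothetical, for illustration only).
type sharedCtx struct {
	context.Context
}

func main() {
	sc := &sharedCtx{Context: context.Background()}

	// writer: keeps swapping the embedded parent context
	go func() {
		for {
			sc.Context = context.WithValue(context.Background(), ctxKey{}, "v")
		}
	}()

	// readers: derive timeout contexts from the shared wrapper; WithTimeout
	// reads sc's embedded Context through the promoted methods, racing with
	// the writer above. A torn interface value here is the kind of corruption
	// that could later surface as a nil dereference inside parentCancelCtx.
	for i := 0; i < 4; i++ {
		go func() {
			for {
				ctx, cancel := context.WithTimeout(sc, time.Millisecond)
				<-ctx.Done()
				cancel()
			}
		}()
	}

	time.Sleep(2 * time.Second)
}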

@bcmills bcmills added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Nov 27, 2023
@wawababe
Author

> Somehow the context is becoming corrupted. Any uses of unsafe in the surrounding code, or data races? If you build and run the program with the race detector enabled (the -race flag), does it report anything?

I see no uses of unsafe in the surrounding code. The panic only occurs in the production environment, so only one machine can be used to run the old panicking service with the race flag enabled. It may take a while to detect the data race.
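
If running the whole service under -race is too slow, a focused test on a development machine might surface it sooner. A minimal sketch, assuming NewThsContext is reachable from a test in the same package and that *ths.Context exposes Done via its embedded context; the fan-out of 100 goroutines is arbitrary. Run it with go test -race.

func TestNewThsContextConcurrent(t *testing.T) {
	parent, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			nctx := NewThsContext(parent, 42) // the copy path from the report
			<-nctx.Done()                     // wait for the inherited timeout
		}()
	}
	wg.Wait()
}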

@seankhliao seankhliao added WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Dec 12, 2023
@gopherbot

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@gopherbot gopherbot closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 12, 2024