Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/internal/workflow: TaskContext.ResetWatchdog is willing to reset expired timer #60444

Open
dmitshur opened this issue May 25, 2023 · 0 comments
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@dmitshur
Copy link
Contributor

dmitshur commented May 25, 2023

watchdogTimer is currently initialized to a timer that cancels the TaskContext's context when it times out:

func runTask(ctx context.Context, workflowID uuid.UUID, listener Listener, state taskState, args []reflect.Value) taskState {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	tctx := &TaskContext{
		Context:       ctx,
		Logger:        listener.Logger(workflowID, state.def.name),
		TaskName:      state.def.name,
		WorkflowID:    workflowID,
		watchdogTimer: time.AfterFunc(WatchdogDelay, cancel),
	}

	// ...

ResetWatchdog exists to extend the lifetime of the timer to prevent it from ticking and doing the task's context cancellation. It's currently willing to reset the timer even after the timer has expired and the task's context got cancelled. This alone is odd but fine, but it can affect other places in unexpected ways.

There's at least one place that determines whether the task hasn't yet hung by checking if the timer is still running:

if !tctx.watchdogTimer.Stop() {
	state.err = fmt.Errorf("task did not log for %v, assumed hung", WatchdogDelay)

Logging via TaskContext.Printf calls ResetWatchdog.

As a result some tasks end up having logging statements that are slightly load-bearing. If a Printf happens after a task has timed out and started to abort (via context cancellation), it'll at minimum cause the state.err = fmt.Errorf("task did not log for ..., assumed hung", ...) branch not to trigger. (Maybe that's the entire extent of the side-effects, I hadn't looked further.)

All this is to say the logic here can be reworked to be slightly less fragile, and this intersects a bit with #60443.

I spotted this while looking at the code some time ago. Filing a tracking issue so I can close some open windows, but it's low priority. CC @golang/release.

@dmitshur dmitshur added Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 25, 2023
@dmitshur dmitshur added this to the Backlog milestone May 25, 2023
@dmitshur dmitshur modified the milestones: Backlog, Unreleased May 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Status: Planned
Development

No branches or pull requests

1 participant