supervisor: never give up
This fixes T756, in which supervised processes would end up with a
negative backoff value. This seems to be caused by the backoff library's
ExponentialBackOff having a default MaxElapsedTime of 15 minutes, after
which NextBackOff returns Stop, a negative sentinel duration (-1).
Setting MaxElapsedTime to 0 makes the backoff never stop.
Test Plan: There is no easy way to test this automatically. The library
returns Stop once a period of time has elapsed, not after a number of
calls. We don't want a test that waits 15 minutes, and we have no easy
way to mock time either. I tested this manually and no longer observe
negative backoffs after 15 minutes.
Bug: T756
X-Origin-Diff: phab/D619
GitOrigin-RevId: 49d8617bcf2c8b36127cb43acde8afb7cc35c99f
diff --git a/core/internal/common/supervisor/supervisor_node.go b/core/internal/common/supervisor/supervisor_node.go
index 32f9720..e2af62c 100644
--- a/core/internal/common/supervisor/supervisor_node.go
+++ b/core/internal/common/supervisor/supervisor_node.go
@@ -148,11 +148,16 @@
// newNode creates a new node with a given parent. It does not register it with the parent (as that depends on group
// placement).
func newNode(name string, runnable Runnable, sup *supervisor, parent *node) *node {
+ // We use exponential backoff for failed runnables, with each interval capped at MaxInterval.
+ // Setting MaxElapsedTime to 0 means the backoff never returns Stop, so it keeps retrying at that cap.
+ bo := backoff.NewExponentialBackOff()
+ bo.MaxElapsedTime = 0
+
n := &node{
name: name,
runnable: runnable,
- bo: backoff.NewExponentialBackOff(),
+ bo: bo,
sup: sup,
parent: parent,