osbase/supervisor: only enter DONE state after runnable returns

Previously, a node was marked DONE as soon as it signaled DONE. If a GC
run happens between the time when the runnable signals DONE, and when
the runnable exit is processed, this causes problems. The test which I
added panics without the other changes:

panic: could not find [inner] (root.inner) in root (NODE_STATE_NEW)

If the delay is long enough that the inner node has already restarted,
then this could even end up with multiple instances of the same runnable
running simultaneously.

I fixed this problem by only entering the DONE state after the runnable
has returned.

Change-Id: If73b73f104c4cc204bce4374f4ba5f7e163e4a0b
Reviewed-on: https://review.monogon.dev/c/monogon/+/3715
Tested-by: Jenkins CI
Reviewed-by: Lorenz Brun <lorenz@monogon.tech>
diff --git a/osbase/supervisor/supervisor_node.go b/osbase/supervisor/supervisor_node.go
index 272b650..4f4d180 100644
--- a/osbase/supervisor/supervisor_node.go
+++ b/osbase/supervisor/supervisor_node.go
@@ -56,6 +56,10 @@
 	// The current state of the runnable in this node.
 	state NodeState
 
+	// signaledDone is set when the runnable has signaled Done. The transition to
+	// DONE state only happens after the runnable returns.
+	signaledDone bool
+
 	// Backoff used to keep runnables from being restarted too fast.
 	bo *backoff.ExponentialBackOff
 
@@ -211,6 +215,7 @@
 
 	// Clear children and state
 	n.state = NodeStateNew
+	n.signaledDone = false
 	n.children = make(map[string]*node)
 	n.reserved = make(map[string]bool)
 	n.groups = nil
@@ -307,8 +312,10 @@
 		if n.state != NodeStateHealthy {
 			panic(fmt.Errorf("node %s signaled done", n))
 		}
-		n.state = NodeStateDone
-		n.sup.metrics.NotifyNodeState(n.dn(), n.state)
+		if n.signaledDone {
+			panic(fmt.Errorf("node %s signaled done twice", n))
+		}
+		n.signaledDone = true
 		n.bo.Reset()
 	}
 }