m/n/core/consensus: work around etcd dial timeout
Observed in an E2E test:
consensus ready to serve client requests
supervisor Runnable root.role.controlplane.launcher.consensus died: returned
error when NODE_STATE_NEW: bootstrap failed: when getting bootstrap
client: context deadline exceeded
supervisor rescheduling supervised node root.role.controlplane.launcher.consensus
with backoff 681.402139ms
consensus data absent, bootstrapping.
consensus Bootstrapping PKI: starting etcd...
supervisor Runnable root.role.controlplane.launcher.consensus died: returned
error when NODE_STATE_NEW: bootstrap failed: failed to start etcd:
listen tcp 127.0.0.1:7834: bind: address already in use
I'm not sure what caused the original client timeout. Let's bump it
from one second to two.
In addition, let's properly stop the bootstrap etcd server on failure.
Previously it kept running forever and held on to its listen address,
so every subsequent attempt to start an etcd server failed with
'address already in use', as seen in the log above.
Change-Id: Icbcc31cb1e0b9e619360cbd71c5ee81396c79724
Reviewed-on: https://review.monogon.dev/c/monogon/+/1352
Tested-by: Jenkins CI
Reviewed-by: Lorenz Brun <lorenz@monogon.tech>
diff --git a/metropolis/node/core/consensus/consensus.go b/metropolis/node/core/consensus/consensus.go
index 5daa02f..fff95b0 100644
--- a/metropolis/node/core/consensus/consensus.go
+++ b/metropolis/node/core/consensus/consensus.go
@@ -341,8 +341,9 @@
 	// Start the bootstrap etcd instance...
 	server, err := embed.StartEtcd(cfg)
 	if err != nil {
-		return fmt.Errorf("failed to start etcd: %w", err)
+		return fmt.Errorf("failed to start bootstrap etcd: %w", err)
 	}
+	defer server.Close()
 	// ... wait for it to run ...
 	select {