Diff - b76b8d19c05e5df546e2b2dc08f6cdbec2a9ead0^! - monogon

commit	b76b8d19c05e5df546e2b2dc08f6cdbec2a9ead0
author	Serge Bazanski <serge@monogon.tech>	Thu Mar 16 00:46:56 2023 +0100
committer	Serge Bazanski <serge@monogon.tech>	Thu Mar 16 21:04:59 2023 +0000
tree	d04ffe4c6866be5139dbc87424d14cac2baea6cd
parent	05f813bf2d311f94dbc8021a85b37ff7c2e33242

m/n/core/consensus: work around etcd dial timeout

Observed in an E2E test:

  consensus  ready to serve client requests
  supervisor Runnable root.role.controlplane.launcher.consensus died: returned
             error when NODE_STATE_NEW: bootstrap failed: when getting bootstrap
             client: context deadline exceeded
  supervisor rescheduling supervised node root.role.controlplane.launcher.consensus
             with backoff 681.402139ms
  consensus  data absent, bootstrapping.
  consensus  Bootstrapping PKI: starting etcd...
  supervisor Runnable root.role.controlplane.launcher.consensus died: returned
             error when NODE_STATE_NEW: bootstrap failed: failed to start etcd:
             listen tcp 127.0.0.1:7834: bind: address already in use

I'm not sure what caused the original timeout of the client. Let's bump
it to two seconds instead of one.

In addition, let's also properly stop the bootstrap etcd server on
failure, instead of letting it run forever and preventing any subsequent
etcd server from starting up.

Change-Id: Icbcc31cb1e0b9e619360cbd71c5ee81396c79724
Reviewed-on: https://review.monogon.dev/c/monogon/+/1352
Tested-by: Jenkins CI
Reviewed-by: Lorenz Brun <lorenz@monogon.tech>

diff --git a/metropolis/node/core/consensus/consensus.go b/metropolis/node/core/consensus/consensus.go
index 5daa02f..fff95b0 100644
--- a/metropolis/node/core/consensus/consensus.go
+++ b/metropolis/node/core/consensus/consensus.go

@@ -341,8 +341,9 @@
 	// Start the bootstrap etcd instance...
 	server, err := embed.StartEtcd(cfg)
 	if err != nil {
-		return fmt.Errorf("failed to start etcd: %w", err)
+		return fmt.Errorf("failed to start bootstrap etcd: %w", err)
 	}
+	defer server.Close()
 
 	// ... wait for it to run ...
 	select {