m/n/c/consensus: fix startup after removing a cluster node

The consensus service was waiting for all initial peers to be DNS
resolvable before starting etcd. However, the list of initial peers is
never updated. If an etcd member is removed from the cluster, it is no
longer resolvable, but may still be contained in initial peer lists. The
consensus service then fails to start, as it is blocked forever waiting
for the removed peer to become resolvable.

The wait for resolvability was added in c1cb37ce9c43 with this
explanation:

> It also makes the consensus service wait for DNS resolvability before
> attempting to join an existing cluster, which makes etcd startup much
> cleaner (as etcd will itself crash if it cannot immediately resolve
> its ExistingPeers in startup).

This wait no longer appears to be needed: I did not observe any etcd
crashes after removing it.

I extended the e2e test to test this scenario. After removing the
consensus role, it also deletes the node and reboots the remaining
nodes. I moved these tests to the ha_cold suite, because with encryption
enabled, we currently cannot reboot a node in a 2-node cluster.

Change-Id: If811c79ea127550fa9ca750014272fa885767c77
Reviewed-on: https://review.monogon.dev/c/monogon/+/3454
Tested-by: Jenkins CI
Reviewed-by: Serge Bazanski <serge@monogon.tech>
5 files changed
tree: fd718931af36798650ddbf1a8978a94220994e82
README.md

Monogon Monorepo

This is the main repository containing the source code for the Monogon Platform.

This is pre-release software - take a look, and check back later! In the meantime, join us on Matrix (#monogon-os-community:matrix.org) or Discord.

Environment

Our build environment is self-contained and requires only minimal host dependencies:

  • A Linux machine or VM.
  • Bazelisk >= v1.15.0 (or a working Nix environment).
  • A reasonably recent kernel with user namespaces enabled.
  • Working KVM with access to /dev/kvm (if you want to run tests).

Our docs assume that Bazelisk is available as bazel on your PATH.

Refer to SETUP.md for detailed instructions.

Monogon OS

The source code lives in //metropolis (Metropolis is the codename of Monogon OS).

See //metropolis/README.md for a developer quick start guide, or the Monogon OS Handbook for user documentation.