m/node/core: remove etcd membership before removing consensus role
When removing the consensus role, we also need to remove etcd
membership. It is safer to remove membership first, and then the role.
Otherwise, the etcd cluster is in a degraded state for as long as etcd
on the node is stopped while etcd still counts the node as a voting
member.
If the membership is removed, but then removing the role fails, the
cluster ends up in an inconsistent state. This failure is almost certain
if the affected node was the curator or etcd leader. In this case, the
request can simply be retried until it succeeds, at which point etcd
membership and roles are consistent with each other again.
Change-Id: I1ab526470a4201e76817e8ca0a597996fb903d1f
Reviewed-on: https://review.monogon.dev/c/monogon/+/3437
Tested-by: Jenkins CI
Reviewed-by: Serge Bazanski <serge@monogon.tech>
diff --git a/metropolis/node/core/consensus/status.go b/metropolis/node/core/consensus/status.go
index 55b4339..ee3efbc 100644
--- a/metropolis/node/core/consensus/status.go
+++ b/metropolis/node/core/consensus/status.go
@@ -189,3 +189,21 @@
externalAddress string
externalPort int
}
+
+// RemoveNode removes the etcd member with the given node ID, if it is currently
+// a member. Etcd fails this operation if it is not safe to perform.
+func (s *Status) RemoveNode(ctx context.Context, nodeID string) error {
+	members, err := s.cl.MemberList(ctx)
+	if err != nil {
+		return fmt.Errorf("could not retrieve existing members: %w", err)
+	}
+	for _, m := range members.Members {
+		if GetEtcdMemberNodeId(m) == nodeID {
+			_, err := s.cl.MemberRemove(ctx, m.ID)
+			if err != nil {
+				return fmt.Errorf("could not remove member: %w", err)
+			}
+		}
+	}
+	return nil
+}