# Recovering the control plane

To recover from broken nodes in the control plane, use the "recover\-control\-plane.yml" playbook.

Examples of what broken means in this context:

* One or more bare metal node(s) suffer from unrecoverable hardware failure
* One or more node(s) fail during patching or upgrading
* Etcd database corruption
* Other node-related failures leaving your control plane degraded or nonfunctional

__Note that you need at least one functional node to be able to recover using this method.__

## Runbook

* Back up what you can
* Provision new nodes to replace the broken ones
* Copy any broken etcd nodes into the "broken\_etcd" group and make sure the "etcd\_member\_name" variable is set
* Copy any broken control plane nodes into the "broken\_kube\_control\_plane" group
* Place the surviving nodes of the control plane first in the "etcd" and "kube\_control\_plane" groups
* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube\_control\_plane" groups
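
As a rough illustration of the group layout described above, the following sketch writes a small INI inventory. It is one reading of the steps, not a verified inventory. The host names (`node1` to `node5`), the member names, and the file name `recovery-inventory.ini` are hypothetical: `node1` is assumed to be the only survivor, `node2` and `node3` are broken, and `node4` and `node5` are their replacements.

```shell
# Hypothetical file name; in practice, merge these groups into your real inventory.
INV_FILE="recovery-inventory.ini"

cat > "$INV_FILE" <<'EOF'
# Surviving node first, replacement nodes below it.
[kube_control_plane]
node1
node4
node5

[etcd]
node1
node4
node5

# Broken control plane nodes.
[broken_kube_control_plane]
node2
node3

# Broken etcd nodes; etcd_member_name must be set.
[broken_etcd]
node2 etcd_member_name=etcd2
node3 etcd_member_name=etcd3
EOF

echo "wrote $INV_FILE"
```

The order within "etcd" and "kube\_control\_plane" is the important part of this sketch: the survivor comes first, followed by the new nodes.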

Then run the playbook with ```--limit etcd,kube_control_plane``` and increase the number of etcd retries by setting ```-e etcd_retries=10``` or something even larger. The number of retries required is difficult to predict.
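
Put together, a full invocation might look like the sketch below. The inventory path `inventory/mycluster/hosts.yaml` is an assumption, as is running from the root of the Kubespray checkout; the `--limit` and `etcd_retries` values are the ones given above. The command is printed rather than executed so that it can be reviewed first.

```shell
# Assumed inventory path; replace it with the inventory you edited.
INVENTORY="inventory/mycluster/hosts.yaml"
# Retry count for etcd operations; raise it if the etcd tasks run out of retries.
ETCD_RETRIES=10

CMD="ansible-playbook -i $INVENTORY recover-control-plane.yml --limit etcd,kube_control_plane -e etcd_retries=$ETCD_RETRIES"

# Print the command for review before running it.
echo "$CMD"

# Uncomment to run it (assumes the current directory is the Kubespray checkout):
# $CMD
```

Depending on your setup, you may also need the privilege escalation or SSH options you normally pass to `ansible-playbook`.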

When finished you should have a fully working control plane again.

## Recover from lost quorum

The playbook attempts to figure out if the etcd quorum is intact. If quorum is lost, it will attempt to take a snapshot from the first node in the "etcd" group and restore from that. If you would like to restore from an alternate snapshot, set the path to that snapshot in the "etcd\_snapshot" variable.

```-e etcd_snapshot=/tmp/etcd_snapshot```
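
For context, the variable is passed alongside the other recovery options. The sketch below assumes the same hypothetical inventory path `inventory/mycluster/hosts.yaml` and reuses the snapshot path from the example above; it only prints the resulting command.

```shell
# Assumed inventory path; replace it with your own.
INVENTORY="inventory/mycluster/hosts.yaml"
# Path to the alternate snapshot to restore from.
SNAPSHOT="/tmp/etcd_snapshot"

CMD="ansible-playbook -i $INVENTORY recover-control-plane.yml --limit etcd,kube_control_plane -e etcd_retries=10 -e etcd_snapshot=$SNAPSHOT"

# Print the command for review; run it with: $CMD
echo "$CMD"
```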

## Caveats

* The playbook has only been tested with fairly small etcd databases.
* There may be disruptions while running the playbook.
* There are absolutely no guarantees.

If possible, break a test cluster in the same way that your target cluster is broken and practice the recovery there before attempting it on the real target cluster.