Offline control plane recovery (#10660)
* ignore_unreachable for etcd dir cleanup

ignore_errors only ignores errors that occur within the "file" module. When the target node is offline, the playbook still fails at this task with the node in an "unreachable" state. Setting "ignore_unreachable: true" lets the playbook bypass offline nodes and proceed with the recovery tasks on the remaining online nodes.

* Re-arrange control plane recovery runbook steps

* Remove suggestion to manually update IP addresses

The suggestion was added in pull/10448/head (48a182844c) 4 years ago, but a task added 2 years ago in ee0f1e9d58 automatically updates the API server arguments with the updated etcd node IP addresses, so the suggestion is no longer needed.
parent 4e52fb7a1f
commit 0e971a37aa
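As context for the first change: in Ansible, "ignore_errors" only suppresses failures reported by the task itself, while a host that cannot be contacted is marked "unreachable", which "ignore_errors" does not cover. A minimal sketch of the combination (a hypothetical cleanup task with an illustrative path, not the playbook's actual task, which appears in the diff below):

```yaml
# Hypothetical task; the path is illustrative, not taken from the playbook.
- name: Clean up the etcd data directory
  file:
    path: /var/lib/etcd
    state: absent
  ignore_errors: true       # tolerates failures raised by the "file" module
  ignore_unreachable: true  # tolerates hosts that cannot be reached at all
```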
@ -3,11 +3,6 @@
To recover from broken nodes in the control plane, use the "recover-control-plane.yml" playbook.
* Backup what you can
* Provision new nodes to replace the broken ones
* Place the surviving nodes of the control plane first in the "etcd" and "kube_control_plane" groups
* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube_control_plane" groups
Examples of what broken means in this context:
* One or more bare-metal nodes suffer from unrecoverable hardware failure
@ -19,8 +14,12 @@ __Note that you need at least one functional node to be able to recover using th
## Runbook
* Backup what you can (for etcd, see the snapshot sketch after this list)
* Provision new nodes to replace the broken ones
* Move any broken etcd nodes into the "broken_etcd" group, and make sure the "etcd_member_name" variable is set for them (see the example inventory below).
* Move any broken control plane nodes into the "broken_kube_control_plane" group.
* Place the surviving nodes of the control plane first in the "etcd" and "kube_control_plane" groups
* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube_control_plane" groups
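For the backup step, one option is to snapshot etcd from a surviving member before touching anything. A minimal sketch, assuming "etcdctl" is available on the node; the endpoint and certificate paths are illustrative, not prescribed by the runbook:

```yaml
# Hypothetical ad-hoc play. Because surviving nodes are listed first,
# etcd[0] is a surviving member. Paths and filenames below are assumptions.
- hosts: etcd[0]
  become: true
  tasks:
    - name: Snapshot etcd before attempting recovery
      command: etcdctl snapshot save /tmp/etcd-backup.db
      environment:
        ETCDCTL_API: "3"
        ETCDCTL_ENDPOINTS: "https://127.0.0.1:2379"
        ETCDCTL_CACERT: /etc/ssl/etcd/ssl/ca.pem
        ETCDCTL_CERT: /etc/ssl/etcd/ssl/admin-node1.pem
        ETCDCTL_KEY: /etc/ssl/etcd/ssl/admin-node1-key.pem
```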
Then run the playbook with ```--limit etcd,kube_control_plane``` and increase the number of etcd retries by setting ```-e etcd_retries=10``` or something even larger. The number of retries required is difficult to predict.
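To make the grouping and ordering concrete, here is a hedged sketch of what the inventory could look like (the hostnames, the "etcd2" member name, and the use of the YAML inventory format are all illustrative):

```yaml
# Illustrative inventory fragment: surviving nodes first, new nodes below,
# broken nodes isolated in the broken_* groups.
all:
  children:
    etcd:
      hosts:
        surviving-cp1: {}   # surviving member, listed first
        new-cp2: {}         # replacement node, listed after
    kube_control_plane:
      hosts:
        surviving-cp1: {}
        new-cp2: {}
    broken_etcd:
      hosts:
        broken-cp2:
          etcd_member_name: etcd2   # must be set for broken etcd members
    broken_kube_control_plane:
      hosts:
        broken-cp2: {}
```

With an inventory along those lines, the invocation would be something like ```ansible-playbook -i inventory/mycluster/hosts.yml recover-control-plane.yml --limit etcd,kube_control_plane -e etcd_retries=10``` (the inventory path is illustrative).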
@ -35,7 +34,6 @@ The playbook attempts to figure out if the etcd quorum is intact. If quorum is l
## Caveats
* The playbook has only been tested with fairly small etcd databases.
* If your new control plane nodes have new IP addresses you may have to change settings in various places.
* There may be disruptions while running the playbook.
* There are absolutely no guarantees.
@ -39,6 +39,7 @@
delegate_to: "{{ item }}"
with_items: "{{ groups['broken_etcd'] }}"
ignore_errors: true # noqa ignore-errors
ignore_unreachable: true
when:
- groups['broken_etcd']
- has_quorum