Fix recover-control-plane to work with etcd 3.3.x and add CI (#5500)

* Fix recover-control-plane to work with etcd 3.3.x and add CI

* Set default values for testcase

* Add actual test jobs

* Attempt to satisfy gitlab ci linter

* Fix ansible targets

* Set etcd_member_name as stated in the docs...

* Recovering from 0 masters is not supported yet

* Add other master to broken_kube-master group as well

* Increase number of retries to see if etcd needs more time to heal

* Make number of retries for ETCD loops configurable, increase it for recovery CI and document it
qvicksilver 2020-02-11 10:38:01 +01:00 committed by GitHub
parent 68c8c05775
commit ac2135e450
23 changed files with 204 additions and 134 deletions

View File

@@ -26,6 +26,8 @@ variables:
   RESET_CHECK: "false"
   UPGRADE_TEST: "false"
   LOG_LEVEL: "-vv"
+  RECOVER_CONTROL_PLANE_TEST: "false"
+  RECOVER_CONTROL_PLANE_TEST_GROUPS: "etcd[2:],kube-master[1:]"
 
 before_script:
   - ./tests/scripts/rebase.sh

View File

@@ -124,3 +124,19 @@ packet_amazon-linux-2-aio:
   stage: deploy-part2
   extends: .packet
   when: manual
+
+packet_ubuntu18-calico-ha-recover:
+  stage: deploy-part2
+  extends: .packet
+  when: on_success
+  variables:
+    RECOVER_CONTROL_PLANE_TEST: "true"
+    RECOVER_CONTROL_PLANE_TEST_GROUPS: "etcd[2:],kube-master[1:]"
+
+packet_ubuntu18-calico-ha-recover-noquorum:
+  stage: deploy-part2
+  extends: .packet
+  when: on_success
+  variables:
+    RECOVER_CONTROL_PLANE_TEST: "true"
+    RECOVER_CONTROL_PLANE_TEST_GROUPS: "etcd[1:],kube-master[1:]"
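The two jobs differ only in the host slice that later gets reset. Ansible group subscripts are zero-indexed, so against a three-member etcd group `etcd[2:]` breaks one node (quorum survives) while `etcd[1:]` breaks two (quorum is lost). A quick way to preview what a slice matches, sketched against a hypothetical local inventory file:

```sh
# With hosts etcd1 etcd2 etcd3 in [etcd] (inventory.ini is illustrative):
#   --limit 'etcd[2:]' -> etcd3          (1 of 3 broken, quorum intact)
#   --limit 'etcd[1:]' -> etcd2, etcd3   (2 of 3 broken, quorum lost)
ansible all -i inventory.ini --limit 'etcd[2:]' --list-hosts
```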

View File

@@ -17,37 +17,23 @@ Examples of what broken means in this context:
 
 __Note that you need at least one functional node to be able to recover using this method.__
 
-## If etcd quorum is intact
+## Runbook
 
-* Set the etcd member names of the broken node(s) in the variable "old\_etcd\_members", this variable is used to remove the broken nodes from the etcd cluster.
-```old_etcd_members=etcd2,etcd3```
-* If you reuse identities for your etcd nodes add the inventory names for those nodes to the variable "old\_etcds". This will remove any previously generated certificates for those nodes.
-```old_etcds=etcd2.example.com,etcd3.example.com```
-* If you would like to remove the broken node objects from the kubernetes cluster add their inventory names to the variable "old\_kube\_masters"
-```old_kube_masters=master2.example.com,master3.example.com```
+* Move any broken etcd nodes into the "broken\_etcd" group, make sure the "etcd\_member\_name" variable is set.
+* Move any broken master nodes into the "broken\_kube-master" group.
 
-Then run the playbook with ```--limit etcd,kube-master```
+Then run the playbook with ```--limit etcd,kube-master``` and increase the number of ETCD retries by setting ```-e etcd_retries=10``` or something even larger. The number of retries required is difficult to predict.
 
-When finished you should have a fully working and highly available control plane again.
+When finished you should have a fully working control plane again.
 
-## If etcd quorum is lost
+## Recover from lost quorum
 
-* If you reuse identities for your etcd nodes add the inventory names for those nodes to the variable "old\_etcds". This will remove any previously generated certificates for those nodes.
-```old_etcds=etcd2.example.com,etcd3.example.com```
-* If you would like to remove the broken node objects from the kubernetes cluster add their inventory names to the variable "old\_kube\_masters"
-```old_kube_masters=master2.example.com,master3.example.com```
-
-Then run the playbook with ```--limit etcd,kube-master```
-
-When finished you should have a fully working and highly available control plane again.
-
-The playbook will attempt to take a snapshot from the first node in the "etcd" group and restore from that. If you would like to restore from an alternate snapshot set the path to that snapshot in the "etcd\_snapshot" variable.
-
-```etcd_snapshot=/tmp/etcd_snapshot```
+The playbook attempts to figure out if the etcd quorum is intact. If quorum is lost it will attempt to take a snapshot from the first node in the "etcd" group and restore from that. If you would like to restore from an alternate snapshot set the path to that snapshot in the "etcd\_snapshot" variable.
+```-e etcd_snapshot=/tmp/etcd_snapshot```
 
 ## Caveats
 
 * The playbook has only been tested on control planes where the etcd and kube-master nodes are the same, the playbook will warn if run on a cluster with separate etcd and kube-master nodes.
 * The playbook has only been tested with fairly small etcd databases.
 * If your new control plane nodes have new ip addresses you may have to change settings in various places.
 * There may be disruptions while running the playbook.
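For reference, the new runbook boils down to the following, a minimal sketch assuming node2 is the single broken host (host and member names are illustrative, not from this PR):

```sh
# In the inventory, move the dead host into the broken groups:
#   [broken_etcd]
#   node2 etcd_member_name=etcd2
#
#   [broken_kube-master]
#   node2
# Then run recovery from the surviving nodes:
ansible-playbook -i inventory/mycluster/hosts.ini recover-control-plane.yml \
  --limit etcd,kube-master -e etcd_retries=10
```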

View File

@@ -22,7 +22,6 @@
 - hosts: "{{ groups['etcd'] | first }}"
   roles:
     - { role: kubespray-defaults}
-    - { role: recover_control_plane/pre-recover }
     - { role: recover_control_plane/etcd }
 
 - hosts: "{{ groups['kube-master'] | first }}"

View File

@@ -62,3 +62,6 @@ etcd_secure_client: true
 
 # Enable peer client cert authentication
 etcd_peer_client_auth: true
+
+# Number of loop retries
+etcd_retries: 4
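Being an ordinary role default, the new knob can be raised for a single run instead of editing group vars, e.g. for a recovery where etcd is slow to heal:

```sh
ansible-playbook recover-control-plane.yml --limit etcd,kube-master -e etcd_retries=10
```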

View File

@@ -67,7 +67,7 @@
   shell: "{{ bin_dir }}/etcdctl --no-sync --endpoints={{ etcd_client_url }} cluster-health | grep -q 'cluster is healthy'"
   register: etcd_cluster_is_healthy
   until: etcd_cluster_is_healthy.rc == 0
-  retries: 4
+  retries: "{{ etcd_retries }}"
   delay: "{{ retry_stagger | random + 3 }}"
   ignore_errors: false
   changed_when: false
@@ -88,7 +88,7 @@
   shell: "{{ bin_dir }}/etcdctl --no-sync --endpoints={{ etcd_events_client_url }} cluster-health | grep -q 'cluster is healthy'"
   register: etcd_events_cluster_is_healthy
   until: etcd_events_cluster_is_healthy.rc == 0
-  retries: 4
+  retries: "{{ etcd_retries }}"
   delay: "{{ retry_stagger | random + 3 }}"
   ignore_errors: false
   changed_when: false

View File

@@ -6,7 +6,7 @@
     {{ docker_bin_dir }}/docker rm -f etcdctl-binarycopy"
   register: etcd_task_result
   until: etcd_task_result.rc == 0
-  retries: 4
+  retries: "{{ etcd_retries }}"
   delay: "{{ retry_stagger | random + 3 }}"
   changed_when: false
   when: etcd_cluster_setup

View File

@@ -3,7 +3,7 @@
   shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_events_access_addresses }} member add {{ etcd_member_name }} {{ etcd_events_peer_url }}"
   register: member_add_result
   until: member_add_result.rc == 0
-  retries: 4
+  retries: "{{ etcd_retries }}"
   delay: "{{ retry_stagger | random + 3 }}"
   when: target_node == inventory_hostname
   environment:

View File

@@ -3,7 +3,7 @@
   shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} member add {{ etcd_member_name }} {{ etcd_peer_url }}"
   register: member_add_result
   until: member_add_result.rc == 0
-  retries: 4
+  retries: "{{ etcd_retries }}"
   delay: "{{ retry_stagger | random + 3 }}"
   when: target_node == inventory_hostname
   environment:

View File

@@ -1,7 +1,78 @@
 ---
-- include_tasks: prepare.yml
+- name: Get etcd endpoint health
+  shell: "{{ bin_dir }}/etcdctl --cacert {{ etcd_cert_dir }}/ca.pem --cert {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem --endpoints={{ etcd_access_addresses }} endpoint health"
+  register: etcd_endpoint_health
+  ignore_errors: true
+  changed_when: false
+  check_mode: no
+  environment:
+    - ETCDCTL_API: 3
+  when:
+    - groups['broken_etcd']
+
+- name: Set healthy fact
+  set_fact:
+    healthy: "{{ etcd_endpoint_health.stderr | match('Error: unhealthy cluster') }}"
+  when:
+    - groups['broken_etcd']
+
+- name: Set has_quorum fact
+  set_fact:
+    has_quorum: "{{ etcd_endpoint_health.stdout_lines | select('match', '.*is healthy.*') | list | length >= etcd_endpoint_health.stderr_lines | select('match', '.*is unhealthy.*') | list | length }}"
 
 - include_tasks: recover_lost_quorum.yml
   when:
-    - has_etcdctl
-    - not etcd_cluster_is_healthy
+    - groups['broken_etcd']
+    - not has_quorum
+
+- name: Remove etcd data dir
+  file:
+    path: "{{ etcd_data_dir }}"
+    state: absent
+  delegate_to: "{{ item }}"
+  with_items: "{{ groups['broken_etcd'] }}"
+  when:
+    - groups['broken_etcd']
+    - has_quorum
+
+- name: Delete old certificates
+  # noqa 302 - rm is ok here for now
+  shell: "rm {{ etcd_cert_dir }}/*{{ item }}*"
+  with_items: "{{ groups['broken_etcd'] }}"
+  register: delete_old_cerificates
+  ignore_errors: true
+  when: groups['broken_etcd']
+
+- name: Fail if unable to delete old certificates
+  fail:
+    msg: "Unable to delete old certificates for: {{ item.item }}"
+  loop: "{{ delete_old_cerificates.results }}"
+  changed_when: false
+  when:
+    - groups['broken_etcd']
+    - "item.rc != 0 and not 'No such file or directory' in item.stderr"
+
+- name: Get etcd cluster members
+  shell: "{{ bin_dir }}/etcdctl --cacert {{ etcd_cert_dir }}/ca.pem --cert {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem member list"
+  register: member_list
+  changed_when: false
+  check_mode: no
+  environment:
+    - ETCDCTL_API: 3
+  when:
+    - groups['broken_etcd']
+    - not healthy
+    - has_quorum
+
+- name: Remove broken cluster members
+  shell: "{{ bin_dir }}/etcdctl --cacert {{ etcd_cert_dir }}/ca.pem --cert {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem --endpoints={{ etcd_access_addresses }} member remove {{ item[1].replace(' ','').split(',')[0] }}"
+  environment:
+    - ETCDCTL_API: 3
+  with_nested:
+    - "{{ groups['broken_etcd'] }}"
+    - "{{ member_list.stdout_lines }}"
+  when:
+    - groups['broken_etcd']
+    - not healthy
+    - has_quorum
+    - hostvars[item[0]]['etcd_member_name'] == item[1].replace(' ','').split(',')[2]
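The removal task string-parses `etcdctl member list` output rather than using JSON. A sketch of what it extracts, with an illustrative member line (ID, name, and URLs are made up):

```sh
# ETCDCTL_API=3 member list prints one comma-separated line per member, e.g.:
#   1609b5a3a2413fc5, started, etcd2, https://10.0.0.2:2380, https://10.0.0.2:2379
# item[1].replace(' ','').split(',')[0]  -> member ID   (argument to `member remove`)
# item[1].replace(' ','').split(',')[2]  -> member name (compared to etcd_member_name)
ETCDCTL_API=3 etcdctl --cacert ca.pem --cert admin.pem --key admin-key.pem member list
```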

View File

@@ -1,48 +0,0 @@
----
-- name: Delete old certificates
-  # noqa 302 - rm is ok here for now
-  shell: "rm /etc/ssl/etcd/ssl/*{{ item }}* /etc/kubernetes/ssl/etcd/*{{ item }}*"
-  with_items: "{{ old_etcds.split(',') }}"
-  register: delete_old_cerificates
-  ignore_errors: true
-  when: old_etcds is defined
-
-- name: Fail if unable to delete old certificates
-  fail:
-    msg: "Unable to delete old certificates for: {{ item.item }}"
-  loop: "{{ delete_old_cerificates.results }}"
-  changed_when: false
-  when:
-    - old_etcds is defined
-    - "item.rc != 0 and not 'No such file or directory' in item.stderr"
-
-- name: Get etcd cluster members
-  shell: "{{ bin_dir }}/etcdctl member list"
-  register: member_list
-  changed_when: false
-  check_mode: no
-  environment:
-    - ETCDCTL_API: 3
-    - ETCDCTL_CA_FILE: /etc/ssl/etcd/ssl/ca.pem
-    - ETCDCTL_CERT: "/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}.pem"
-    - ETCDCTL_KEY: "/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}-key.pem"
-  when:
-    - has_etcdctl
-    - etcd_cluster_is_healthy
-    - old_etcd_members is defined
-
-- name: Remove old cluster members
-  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} member remove {{ item[1].replace(' ','').split(',')[0] }}"
-  environment:
-    - ETCDCTL_API: 3
-    - ETCDCTL_CA_FILE: /etc/ssl/etcd/ssl/ca.pem
-    - ETCDCTL_CERT: "/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}.pem"
-    - ETCDCTL_KEY: "/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}-key.pem"
-  with_nested:
-    - "{{ old_etcd_members.split(',') }}"
-    - "{{ member_list.stdout_lines }}"
-  when:
-    - has_etcdctl
-    - etcd_cluster_is_healthy
-    - old_etcd_members is defined
-    - item[0] == item[1].replace(' ','').split(',')[2]

View File

@@ -1,11 +1,8 @@
 ---
 - name: Save etcd snapshot
-  shell: "{{ bin_dir }}/etcdctl snapshot save /tmp/snapshot.db"
+  shell: "{{ bin_dir }}/etcdctl --cacert {{ etcd_cert_dir }}/ca.pem --cert {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem snapshot save /tmp/snapshot.db"
   environment:
     - ETCDCTL_API: 3
-    - ETCDCTL_CA_FILE: /etc/ssl/etcd/ssl/ca.pem
-    - ETCDCTL_CERT: "/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}.pem"
-    - ETCDCTL_KEY: "/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}-key.pem"
   when: etcd_snapshot is not defined
 
 - name: Transfer etcd snapshot to host
@@ -25,12 +22,9 @@
     state: absent
 
 - name: Restore etcd snapshot
-  shell: "{{ bin_dir }}/etcdctl snapshot restore /tmp/snapshot.db --name {{ etcd_member_name }} --initial-cluster {{ etcd_member_name }}={{ etcd_peer_url }} --initial-cluster-token k8s_etcd --initial-advertise-peer-urls {{ etcd_peer_url }} --data-dir {{ etcd_data_dir }}"
+  shell: "{{ bin_dir }}/etcdctl --cacert {{ etcd_cert_dir }}/ca.pem --cert {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem snapshot restore /tmp/snapshot.db --name {{ etcd_member_name }} --initial-cluster {{ etcd_member_name }}={{ etcd_peer_url }} --initial-cluster-token k8s_etcd --initial-advertise-peer-urls {{ etcd_peer_url }} --data-dir {{ etcd_data_dir }}"
   environment:
     - ETCDCTL_API: 3
-    - ETCDCTL_CA_FILE: /etc/ssl/etcd/ssl/ca.pem
-    - ETCDCTL_CERT: "/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}.pem"
-    - ETCDCTL_KEY: "/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}-key.pem"
 
 - name: Remove etcd snapshot
   file:
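The same save/restore cycle can be reproduced by hand when debugging a failed recovery; the endpoints, member name, peer URL, and cert paths below are illustrative stand-ins for the templated values:

```sh
ETCDCTL_API=3 etcdctl --cacert /etc/ssl/etcd/ssl/ca.pem \
  --cert /etc/ssl/etcd/ssl/admin-node1.pem \
  --key /etc/ssl/etcd/ssl/admin-node1-key.pem \
  snapshot save /tmp/snapshot.db
ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
  --name etcd1 \
  --initial-cluster etcd1=https://10.0.0.1:2380 \
  --initial-cluster-token k8s_etcd \
  --initial-advertise-peer-urls https://10.0.0.1:2380 \
  --data-dir /var/lib/etcd
```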

View File

@@ -8,21 +8,22 @@
   retries: 6
   delay: 10
   changed_when: false
+  when: groups['broken_kube-master']
 
-- name: Delete old kube-master nodes from cluster
+- name: Delete broken kube-master nodes from cluster
   shell: "{{ bin_dir }}/kubectl delete node {{ item }}"
   environment:
     - KUBECONFIG: "{{ ansible_env.HOME | default('/root') }}/.kube/config"
-  with_items: "{{ old_kube_masters.split(',') }}"
-  register: delete_old_kube_masters
+  with_items: "{{ groups['broken_kube-master'] }}"
+  register: delete_broken_kube_masters
   failed_when: false
-  when: old_kube_masters is defined
+  when: groups['broken_kube-master']
 
-- name: Fail if unable to delete old kube-master nodes from cluster
+- name: Fail if unable to delete broken kube-master nodes from cluster
   fail:
-    msg: "Unable to delete old kube-master node: {{ item.item }}"
-  loop: "{{ delete_old_kube_masters.results }}"
+    msg: "Unable to delete broken kube-master node: {{ item.item }}"
+  loop: "{{ delete_broken_kube_masters.results }}"
   changed_when: false
   when:
-    - old_kube_masters is defined
+    - groups['broken_kube-master']
     - "item.rc != 0 and not 'NotFound' in item.stderr"

View File

@@ -1,2 +0,0 @@
----
-control_plane_is_converged: "{{ groups['etcd'] | sort == groups['kube-master'] | sort | bool }}"

View File

@@ -1,36 +0,0 @@
----
-- name: Check for etcdctl binary
-  raw: "test -e {{ bin_dir }}/etcdctl"
-  register: test_etcdctl
-
-- name: Set has_etcdctl fact
-  set_fact:
-    has_etcdctl: "{{ test_etcdctl.rc == 0 | bool }}"
-
-- name: Check if etcd cluster is healthy
-  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} cluster-health | grep -q 'cluster is healthy'"
-  register: etcd_cluster_health
-  ignore_errors: true
-  changed_when: false
-  check_mode: no
-  environment:
-    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
-    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"
-    ETCDCTL_CA_FILE: "{{ etcd_cert_dir }}/ca.pem"
-  when: has_etcdctl
-
-- name: Set etcd_cluster_is_healthy fact
-  set_fact:
-    etcd_cluster_is_healthy: "{{ etcd_cluster_health.rc == 0 | bool }}"
-
-- name: Abort if etcd cluster is healthy and old_etcd_members is undefined
-  assert:
-    that: "{{ old_etcd_members is defined }}"
-    msg: "'old_etcd_members' must be defined when the etcd cluster has quorum."
-  when: etcd_cluster_is_healthy
-
-- name: Warn for untested recovery
-  debug:
-    msg: Control plane recovery of split control planes is UNTESTED! Abort or continue at your own risk.
-  delay: 30
-  when: not control_plane_is_converged

View File

@@ -5,7 +5,7 @@
 
 - name: Set VM count needed for CI test_id
   set_fact:
-    vm_count: "{%- if mode in ['separate', 'separate-scale', 'ha', 'ha-scale'] -%}{{ 3|int }}{%- elif mode == 'aio' -%}{{ 1|int }}{%- else -%}{{ 2|int }}{%- endif -%}"
+    vm_count: "{%- if mode in ['separate', 'separate-scale', 'ha', 'ha-scale', 'ha-recover', 'ha-recover-noquorum'] -%}{{ 3|int }}{%- elif mode == 'aio' -%}{{ 1|int }}{%- else -%}{{ 2|int }}{%- endif -%}"
 
 - import_tasks: create-vms.yml
   when:

View File

@@ -45,6 +45,45 @@ instance-1
 
 [vault]
 instance-1
+{% elif mode == "ha-recover" %}
+[kube-master]
+instance-1
+instance-2
+
+[kube-node]
+instance-3
+
+[etcd]
+instance-3
+instance-1
+instance-2
+
+[broken_kube-master]
+instance-2
+
+[broken_etcd]
+instance-2 etcd_member_name=etcd3
+{% elif mode == "ha-recover-noquorum" %}
+[kube-master]
+instance-3
+instance-1
+instance-2
+
+[kube-node]
+instance-3
+
+[etcd]
+instance-3
+instance-1
+instance-2
+
+[broken_kube-master]
+instance-1
+instance-2
+
+[broken_etcd]
+instance-1 etcd_member_name=etcd2
+instance-2 etcd_member_name=etcd3
 {% endif %}
 
 [k8s-cluster:children]
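These layouts line up with the job names: ha-recover breaks one of three etcd members (instance-2), so quorum holds; ha-recover-noquorum breaks two, so quorum is lost and the snapshot-restore path is exercised. A surviving member can be checked directly, sketched here with illustrative endpoint and cert paths:

```sh
ETCDCTL_API=3 etcdctl --endpoints https://instance-3:2379 \
  --cacert /etc/ssl/etcd/ssl/ca.pem \
  --cert /etc/ssl/etcd/ssl/admin-instance-3.pem \
  --key /etc/ssl/etcd/ssl/admin-instance-3-key.pem endpoint health
```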

View File

@@ -0,0 +1,10 @@
+---
+# Instance settings
+cloud_image: ubuntu-1804
+mode: ha-recover-noquorum
+vm_memory: 1600Mi
+
+# Kubespray settings
+kube_network_plugin: calico
+deploy_netchecker: true
+dns_min_replicas: 1

View File

@@ -0,0 +1,10 @@
+---
+# Instance settings
+cloud_image: ubuntu-1804
+mode: ha-recover
+vm_memory: 1600Mi
+
+# Kubespray settings
+kube_network_plugin: calico
+deploy_netchecker: true
+dns_min_replicas: 1

View File

@@ -47,6 +47,12 @@ if [ "${UPGRADE_TEST}" != "false" ]; then
   ansible-playbook ${LOG_LEVEL} -e @${CI_TEST_VARS} -e local_release_dir=${PWD}/downloads -e ansible_python_interpreter=${PYPATH} --limit "all:!fake_hosts" $PLAYBOOK
 fi
 
+# Test control plane recovery
+if [ "${RECOVER_CONTROL_PLANE_TEST}" != "false" ]; then
+  ansible-playbook ${LOG_LEVEL} -e @${CI_TEST_VARS} -e local_release_dir=${PWD}/downloads -e ansible_python_interpreter=${PYPATH} --limit "${RECOVER_CONTROL_PLANE_TEST_GROUPS}:!fake_hosts" -e reset_confirmation=yes reset.yml
+  ansible-playbook ${LOG_LEVEL} -e @${CI_TEST_VARS} -e local_release_dir=${PWD}/downloads -e ansible_python_interpreter=${PYPATH} -e etcd_retries=10 --limit etcd,kube-master:!fake_hosts recover-control-plane.yml
+fi
+
 # Tests Cases
 ## Test Master API
 ansible-playbook -e ansible_python_interpreter=${PYPATH} --limit "all:!fake_hosts" tests/testcases/010_check-apiserver.yml $LOG_LEVEL
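The flow is thus: reset only the hosts named in RECOVER_CONTROL_PLANE_TEST_GROUPS to simulate failure, then recover from the surviving first etcd/kube-master host. For the noquorum job, for example, the reset limit expands to:

```sh
# noquorum job: the reset limit expands to the two "broken" hosts
ansible-playbook reset.yml -e reset_confirmation=yes \
  --limit 'etcd[1:],kube-master[1:]:!fake_hosts'
```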

View File

@@ -25,3 +25,9 @@ kube-master
 calico-rr
 
 [calico-rr]
+
+[broken_kube-master]
+node2
+
+[broken_etcd]
+node2

View File

@@ -29,6 +29,12 @@
 [vault]
 {{droplets.results[1].droplet.name}}
 {{droplets.results[2].droplet.name}}
+
+[broken_kube-master]
+{{droplets.results[1].droplet.name}}
+
+[broken_etcd]
+{{droplets.results[2].droplet.name}}
 {% else %}
 [kube-master]
 {{droplets.results[0].droplet.name}}

View File

@@ -37,6 +37,13 @@
 {{node1}}
 {{node2}}
 {{node3}}
+
+[broken_kube-master]
+{{node2}}
+
+[broken_etcd]
+{{node2}}
+{{node3}}
 {% elif mode == "default" %}
 [kube-master]
 {{node1}}