Large deployments of K8s
========================
For a large scale deployment, consider the following configuration changes:
* Tune [ansible settings](https://docs.ansible.com/ansible/latest/intro_configuration.html)
  for the `forks` and `timeout` vars to fit large numbers of nodes being deployed.
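
  As a sketch, a minimal `ansible.cfg` tuned for a large inventory might look
  like this (the values shown are illustrative starting points, not
  recommendations):

  ```ini
  [defaults]
  # Number of parallel host connections; raise from the default of 5
  # so hundreds of nodes are not processed in tiny batches.
  forks = 50

  # Connection timeout in seconds; a generous value avoids spurious
  # failures on slow or congested management networks.
  timeout = 600
  ```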
* Override containers' `foo_image_repo` vars to point to an intranet registry.
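
  For example, a group_vars override might look like the following sketch.
  `registry.example.local` is a placeholder, and the exact set of
  `*_image_repo` vars depends on your Kubespray version:

  ```yaml
  # Redirect image pulls to an intranet mirror instead of the public
  # registries. Hostname and var names below are illustrative.
  kube_image_repo: "registry.example.local/registry.k8s.io"
  docker_image_repo: "registry.example.local/docker.io"
  quay_image_repo: "registry.example.local/quay.io"
  ```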
* Override ``download_run_once: true`` and/or ``download_localhost: true``.
  See download modes for details.
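
  A sketch of the relevant group_vars (both settings are optional and
  independent; pick what fits your environment):

  ```yaml
  # Download artifacts only once, on a single delegate node, then push
  # them to the other nodes instead of downloading on every host.
  download_run_once: true
  # Optionally perform that single download on the Ansible host itself.
  download_localhost: true
  ```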
* Adjust the `retry_stagger` global var as appropriate. It should provide a
  sane load on the delegate (the first K8s control plane node) when retrying
  failed push or download operations.
* Tune parameters for DNS-related applications. Those are ``dns_replicas``,
  ``dns_cpu_limit``, ``dns_cpu_requests``, ``dns_memory_limit``,
  ``dns_memory_requests``. Please note that limits must always be greater than
  or equal to requests.
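
  For instance, a group_vars override could look like this (the numbers are
  purely illustrative and should be sized to your cluster and query load):

  ```yaml
  # Illustrative DNS sizing for a large cluster; note that each limit
  # is greater than or equal to its corresponding request.
  dns_replicas: 4
  dns_cpu_requests: 100m
  dns_cpu_limit: 300m
  dns_memory_requests: 70Mi
  dns_memory_limit: 170Mi
  ```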
* Tune CPU/memory limits and requests. Those are located in roles' defaults
  and named like ``foo_memory_limit``, ``foo_memory_requests`` and
  ``foo_cpu_limit``, ``foo_cpu_requests``. Note that 'Mi' memory units for K8s
  will be submitted as 'M', if applied for ``docker run``, and K8s CPU units
  will end up with the 'm' dropped for docker as well. This is required as
  docker does not understand K8s units well.
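
  A sketch of such overrides in group_vars (``foo`` stands in for a real
  component name, exactly as in the var naming pattern above, and the numbers
  are purely illustrative):

  ```yaml
  # Hypothetical overrides -- replace "foo" with an actual component and
  # size the values to your hardware. Limits must be >= requests.
  foo_cpu_requests: 150m
  foo_cpu_limit: 300m
  foo_memory_requests: 256Mi
  foo_memory_limit: 512Mi
  ```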
* Tune ``kubelet_status_update_frequency`` to increase the reliability of the
  kubelet, and ``kube_controller_node_monitor_grace_period``,
  ``kube_controller_node_monitor_period``,
  ``kube_apiserver_pod_eviction_not_ready_timeout_seconds`` and
  ``kube_apiserver_pod_eviction_unreachable_timeout_seconds`` for better
  Kubernetes reliability. Check out [Kubernetes Reliability](kubernetes-reliability.md)
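
  As one illustrative combination (these values are examples, not tuned
  recommendations; see the reliability doc above for how they interact):

  ```yaml
  # How often the kubelet posts node status, and how long the control
  # plane waits before marking a node unhealthy / evicting its pods.
  kubelet_status_update_frequency: 10s
  kube_controller_node_monitor_period: 5s
  kube_controller_node_monitor_grace_period: 40s
  kube_apiserver_pod_eviction_not_ready_timeout_seconds: 300
  kube_apiserver_pod_eviction_unreachable_timeout_seconds: 300
  ```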
* Tune network prefix sizes. Those are ``kube_network_node_prefix``,
  ``kube_service_addresses`` and ``kube_pods_subnet``.
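
  As a sizing sketch (the CIDRs below are illustrative, not defaults): each
  node is carved a ``kube_network_node_prefix``-sized range out of
  ``kube_pods_subnet``, so a /16 pods subnet split into /24 per-node ranges
  accommodates 2^(24-16) = 256 nodes, each with roughly 254 pod IPs.

  ```yaml
  # Illustrative sizing for a few hundred nodes; ensure the ranges do
  # not overlap each other or your host networks.
  kube_service_addresses: 10.233.0.0/18
  kube_pods_subnet: 10.234.0.0/16   # room for 2^(24-16) = 256 node ranges
  kube_network_node_prefix: 24      # each node gets a /24 (~254 pod IPs)
  ```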
* Add calico_rr nodes if you are deploying with Calico or Canal. Nodes recover
  from host/network interruption much quicker with calico_rr. Note that the
  calico_rr role must be on a host without the kube_control_plane or kube_node
  role (but the etcd role is okay).
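
  An illustrative inventory fragment (``rr0``/``rr1`` are placeholder
  hostnames for dedicated route-reflector hosts):

  ```ini
  # These hosts must not appear under [kube_control_plane] or
  # [kube_node]; sharing them with the [etcd] group is fine.
  [calico_rr]
  rr0
  rr1
  ```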
* Check out the
  [Inventory](getting-started.md#building-your-own-inventory)
  section of the Getting started guide for tips on creating a large-scale
  Ansible inventory.
* Override ``etcd_events_cluster_setup: true`` to store events in a separate
  dedicated etcd instance.
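
  A sketch of the relevant group_vars (whether a separate enable flag is
  also needed depends on your Kubespray version):

  ```yaml
  # Stand up a dedicated etcd cluster for Kubernetes events, so heavy
  # event churn on a large cluster does not compete with the main state store.
  etcd_events_cluster_setup: true
  etcd_events_cluster_enabled: true
  ```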
For example, when deploying 200 nodes, you may want to run Ansible with
``--forks=50``, ``--timeout=600`` and define ``retry_stagger: 60``.
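
Combined into a single invocation, that might look like the following sketch
(the inventory path is a placeholder, and ``retry_stagger`` could equally be
set in group_vars instead of as an extra var):

```shell
ansible-playbook -i inventory/mycluster/hosts.yaml \
  --forks=50 --timeout=600 \
  -e retry_stagger=60 \
  cluster.yml
```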