mirror of https://github.com/easzlab/kubeasz.git
fix: kube-prometheus-stack installation
parent b006dcff20
commit 356a98ed48
**Prometheus guide (docs)**

````diff
@@ -1,24 +1,20 @@
 # Prometheus

-With the `heapster` project discontinued and gradually replaced by `metrics-server`, the job of cluster monitoring will eventually migrate as well. The monitoring philosophy and data-structure design of `prometheus` are actually quite lean, as is its very flexible query language; but for beginners, standing up a reasonably usable deployment on a k8s cluster is rather involved, which has even spawned dedicated projects (e.g. [prometheus-operator](https://github.com/coreos/prometheus-operator)). This document describes deploying cluster prometheus monitoring with a `helm chart`.
+`prometheus` has become the default monitoring solution for k8s clusters. Its monitoring philosophy and data-structure design are actually quite lean, as is its very flexible query language; but for beginners, standing up a reasonably usable deployment on a k8s cluster is rather involved. This project (3.x) deploys it via helm chart, using the charts at: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack

-- `helm` has become an independently hosted `CNCF` project and is expected to keep gaining popularity
-
-## Prerequisites
-
-- install helm
-- install [kube-dns](kubedns.md)
-
 ## Installation

-Charts used by the 3.x deployment: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
-
 Integrated installation with kubeasz

-- 1. edit clusters/xxxx/config.yml and set prom_install: "yes"
+- 1. edit /etc/kubeasz/clusters/xxxx/config.yml and set prom_install: "yes"
-- 2. install: ezctl setup xxxx 07
+- 2. install: /etc/kubeasz/ezctl setup xxxx 07

-Note: the images involved must be pulled from quay.io, which is slow from mainland China; the project's helper script tools/imgutils can help
+The generated custom chart settings are written to /etc/kubeasz/clusters/xxxx/yml/prom-values.yaml

---- the content below has not been updated yet
+Note 1: to change the configuration, edit roles/cluster-addon/templates/prometheus/values.yaml.j2 and re-run the install command
+
+Note 2: if cluster nodes are added or removed, re-run the install command
+
+Note 3: many of the related images are slow to download; some k8s.gcr.io images have been replaced with easzlab mirror images

 ## Verify the installation
````
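A minimal sketch of the integrated install flow above, assuming a cluster named `xxxx`, a stock config.yml that still reads `prom_install: "no"`, and the `monitor` namespace used in the verification section below:

``` bash
# enable the kube-prometheus-stack addon for cluster "xxxx"
sed -i 's/prom_install: "no"/prom_install: "yes"/' /etc/kubeasz/clusters/xxxx/config.yml

# run step 07 (cluster-addon) of the kubeasz setup
/etc/kubeasz/ezctl setup xxxx 07

# the rendered chart overrides land here
cat /etc/kubeasz/clusters/xxxx/yml/prom-values.yaml

# verify the helm release (kubeasz 3.x uses helm v3)
helm list -n monitor
```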
|
````diff
@@ -26,45 +22,38 @@ Integrated installation with kubeasz
 # check the related pods and svc
 $ kubectl get pod,svc -n monitor
 NAME READY STATUS RESTARTS AGE
-pod/alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 3m11s
+pod/alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 160m
-pod/prometheus-grafana-6d6d47996f-7xlpt 2/2 Running 0 3m14s
+pod/prometheus-grafana-69f88948bc-7hnbp 3/3 Running 0 160m
-pod/prometheus-kube-prometheus-operator-5f6774b747-bpktd 1/1 Running 0 3m14s
+pod/prometheus-kube-prometheus-operator-f8f4758cb-bm6gs 1/1 Running 0 160m
-pod/prometheus-kube-state-metrics-95d956569-dhlkx 1/1 Running 0 3m14s
+pod/prometheus-kube-state-metrics-74b8f49c6c-f9wgg 1/1 Running 0 160m
-pod/prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 1 3m11s
+pod/prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 160m
-pod/prometheus-prometheus-node-exporter-d9m7j 1/1 Running 0 3m14s
+pod/prometheus-prometheus-node-exporter-6nfb4 1/1 Running 0 160m
+pod/prometheus-prometheus-node-exporter-q4qq2 1/1 Running 0 160m

 NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
-service/alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 3m12s
+service/alertmanager-operated ClusterIP None <none> 9093/TCP,9094/TCP,9094/UDP 160m
-service/prometheus-grafana NodePort 10.68.31.225 <none> 80:30903/TCP 3m14s
+service/prometheus-grafana NodePort 10.68.253.23 <none> 80:30903/TCP 160m
-service/prometheus-kube-prometheus-alertmanager NodePort 10.68.212.136 <none> 9093:30902/TCP 3m14s
+service/prometheus-kube-prometheus-alertmanager NodePort 10.68.125.191 <none> 9093:30902/TCP 160m
-service/prometheus-kube-prometheus-operator NodePort 10.68.226.171 <none> 443:30900/TCP 3m14s
+service/prometheus-kube-prometheus-operator NodePort 10.68.161.218 <none> 443:30900/TCP 160m
-service/prometheus-kube-prometheus-prometheus NodePort 10.68.100.42 <none> 9090:30901/TCP 3m14s
+service/prometheus-kube-prometheus-prometheus NodePort 10.68.64.217 <none> 9090:30901/TCP 160m
-service/prometheus-kube-state-metrics ClusterIP 10.68.80.70 <none> 8080/TCP 3m14s
+service/prometheus-kube-state-metrics ClusterIP 10.68.111.106 <none> 8080/TCP 160m
-service/prometheus-operated ClusterIP None <none> 9090/TCP 3m12s
+service/prometheus-operated ClusterIP None <none> 9090/TCP 160m
-service/prometheus-prometheus-node-exporter ClusterIP 10.68.64.56 <none> 9100/TCP 3m14s
+service/prometheus-prometheus-node-exporter ClusterIP 10.68.252.83 <none> 9100/TCP 160m
 ```

 - open the prometheus web UI: `http://$NodeIP:30901`
 - open the alertmanager web UI: `http://$NodeIP:30902`
 - open the grafana web UI: `http://$NodeIP:30903` (default username/password admin:Admin1234!)
````
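A quick command-line check of the three NodePort UIs listed above; `/-/healthy` is the standard health endpoint of prometheus and alertmanager, and `/api/health` is grafana's, with `$NodeIP` standing for any node address:

``` bash
NodeIP=192.168.1.1   # replace with any cluster node IP

# prometheus and alertmanager expose /-/healthy
curl -s http://$NodeIP:30901/-/healthy
curl -s http://$NodeIP:30902/-/healthy

# grafana exposes a JSON health endpoint
curl -s http://$NodeIP:30903/api/health
```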
````diff
-## Management operations
+## Other operations

-## Verify alerting
+-- the content below has not been re-tested

-- edit the e-mail alerting settings in `prom-alertsmanager.yaml` to a valid configuration, then update the installation with helm upgrade
+### [Optional] Configure DingTalk alerting
-- manually stop the kubelet service on a master node for a while, then wait a few minutes to see whether an alert e-mail is sent
-
-``` bash
-# run on the master node
-$ systemctl stop kubelet
-```
-
-## [Optional] Configure DingTalk alerting

 - Create a DingTalk group and obtain the group robot webhook URL

-After creating a group chat in DingTalk you can easily set up a group robot: [Group Settings] - [Group Robot] - [Add] - [Custom] - [Add], then follow the prompts; see https://open-doc.dingtalk.com/docs/doc.htm?spm=a219a.7629140.0.0.666d4a97eCG7XA&treeId=257&articleId=105735&docType=1
+After creating a group chat in DingTalk you can easily set up a group robot: [Group Settings] - [Group Robot] - [Add] - [Custom] - [Add], then follow the prompts; see https://open.dingtalk.com/document/group/custom-robot-access

 Once the group robot is configured, you get its Webhook URL; record it, as it is needed later when configuring the DingTalk alert plugin. The format is:
````
````diff
@@ -72,35 +61,27 @@ $ systemctl stop kubelet
 https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
 ```

-- Create the DingTalk alert plugin, see http://theo.im/blog/2017/10/16/release-prometheus-alertmanager-webhook-for-dingtalk/
+- Create the DingTalk alert plugin, see:
+  - https://github.com/timonwong/prometheus-webhook-dingtalk
+  - http://theo.im/blog/2017/10/16/release-prometheus-alertmanager-webhook-for-dingtalk/

 ``` bash
 # edit the file, replacing access_token=xxxxxx with the robot token obtained in the previous step
-$ vi /etc/ansible/manifests/prometheus/dingtalk-webhook.yaml
+$ vi /etc/kubeasz/roles/cluster-addon/templates/prometheus/dingtalk-webhook.yaml
 # run the plugin
-$ kubectl apply -f /etc/ansible/manifests/prometheus/dingtalk-webhook.yaml
+$ kubectl apply -f /etc/kubeasz/roles/cluster-addon/templates/prometheus/dingtalk-webhook.yaml
 ```

-- After changing the alertmanager alert configuration, update the helm prometheus release; once that succeeds, test alert delivery as in the section above
+- Edit the alertmanager alert configuration and re-run the install command /etc/kubeasz/ezctl setup xxxx 07; once that succeeds, test alert delivery as in the section above

 ``` bash
 # edit the alertmanager alert configuration
-$ cd /etc/ansible/manifests/prometheus
-$ vi prom-alertsmanager.yaml
+$ vi /etc/kubeasz/roles/cluster-addon/templates/prometheus/values.yaml.j2
 # add a receiver dingtalk, then set receiver: dingtalk in the route configuration
 receivers:
 - name: dingtalk
   webhook_configs:
   - send_resolved: false
-    url: http://webhook-dingtalk.monitoring.svc.cluster.local:8060/dingtalk/webhook1/send
+    url: http://webhook-dingtalk.monitor.svc.cluster.local:8060/dingtalk/webhook1/send
 # ...
-# update the helm prometheus release
-$ helm upgrade --tls monitor -f prom-settings.yaml -f prom-alertsmanager.yaml -f prom-alertrules.yaml prometheus
 ```
````
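To sanity-check the robot URL itself before wiring it into alertmanager, the DingTalk custom-robot API accepts a plain JSON text message; `xxxxxxxx` stands for your real token, and if the robot was created with a keyword filter the message content must contain that keyword:

``` bash
curl -s -H 'Content-Type: application/json' \
  -d '{"msgtype": "text", "text": {"content": "alert test"}}' \
  "https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx"
# a working robot replies with {"errcode":0,"errmsg":"ok"}
```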
````diff
-## Next steps
-
-- learn more about the prometheus query language and configuration files
-- learn more about prometheus alerting rules and write rules suited to your business applications
-- learn more about writing grafana dashboards; this project borrows from some of [feisky's dashboards](https://grafana.com/orgs/feisky/dashboards)
-
-If you have lessons learned on any of the above, contributions to the project are welcome.
````
**roles/cluster-addon/tasks/main.yml**

````diff
@@ -32,7 +32,7 @@
   when: '"kubernetes-dashboard" not in pod_info.stdout and dashboard_install == "yes"'

 - import_tasks: prometheus.yml
-  when: '"kube-prometheus-operator" not in pod_info.stdout and prom_install == "yes"'
+  when: 'prom_install == "yes"'

 - import_tasks: nfs-provisioner.yml
   when: '"nfs-client-provisioner" not in pod_info.stdout and nfs_provisioner_install == "yes"'
````
**roles/cluster-addon/tasks/prometheus.yml**

````diff
@@ -33,10 +33,7 @@
     --from-file=etcd-client-key=etcd-client-key.pem"
   when: '"etcd-client-cert" not in secrets_info.stdout'

-# determine the kubernetes version
-- name: register the variable K8S_VER
-  shell: "{{ base_dir }}/bin/kube-apiserver --version|cut -d' ' -f2|cut -d'v' -f2"
-  register: K8S_VER
+- debug: var="K8S_VER"

 - name: create the customized prom chart settings
   template: src=prometheus/values.yaml.j2 dest={{ cluster_dir }}/yml/prom-values.yaml
````
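The removed shell task shows how the k8s version used to be derived; the same pipeline still works at the command line (the sample output assumes a v1.23.1 binary), while `K8S_VER` is now expected to be set elsewhere as a plain variable, hence the `{{ K8S_VER }}` reference without `.stdout` in the values template below:

``` bash
$ /etc/kubeasz/bin/kube-apiserver --version
Kubernetes v1.23.1
$ /etc/kubeasz/bin/kube-apiserver --version | cut -d' ' -f2 | cut -d'v' -f2
1.23.1
```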
**roles/cluster-addon/templates/prometheus/dingtalk-webhook.yaml**

````diff
@@ -5,9 +5,12 @@ metadata:
   labels:
     run: dingtalk
   name: webhook-dingtalk
-  namespace: monitoring
+  namespace: monitor
 spec:
   replicas: 1
+  selector:
+    matchLabels:
+      run: dingtalk
   template:
     metadata:
       labels:
````
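apps/v1 Deployments require `spec.selector` to match the pod template labels, which is why this block was added; after editing the manifest, a client-side dry run is a cheap way to validate it:

``` bash
kubectl apply --dry-run=client \
  -f /etc/kubeasz/roles/cluster-addon/templates/prometheus/dingtalk-webhook.yaml
```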
````diff
@@ -31,7 +34,7 @@ metadata:
   labels:
     run: dingtalk
   name: webhook-dingtalk
-  namespace: monitoring
+  namespace: monitor
 spec:
   ports:
   - port: 8060
````
**roles/cluster-addon/templates/prometheus/values.yaml.j2**

````diff
@@ -1,18 +1,17 @@
 ## Provide a k8s version to auto dashboard import script example: kubeTargetVersionOverride: 1.16.6
-kubeTargetVersionOverride: "{{ K8S_VER.stdout }}"
+kubeTargetVersionOverride: "{{ K8S_VER }}"

 ## Configuration for alertmanager
 alertmanager:
   enabled: true
-  config:
-    route:
-      receiver: 'null'
-      routes:
-      - receiver: 'null'
-        matchers:
-        - alertname =~ "InfoInhibitor|Watchdog"
-    receivers:
-    - name: 'null'
+  #config:
+  #  route:
+  #    receiver: dingtalk
+  #  receivers:
+  #  - name: dingtalk
+  #    webhook_configs:
+  #    - send_resolved: false
+  #      url: http://webhook-dingtalk.monitor.svc.cluster.local:8060/dingtalk/webhook1/send

 ## Configuration for Alertmanager service
   service:
````
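If the commented block is re-enabled (and step 07 re-run), the operator stores the resulting alertmanager configuration in a secret; a sketch for inspecting it, assuming the default kube-prometheus-stack naming seen in the pod listing above (the secret name and key can vary across chart versions):

``` bash
kubectl -n monitor get secret alertmanager-prometheus-kube-prometheus-alertmanager \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
```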
````diff
@@ -26,6 +25,8 @@ grafana:
   service:
     nodePort: 30903
     type: NodePort
+  sidecar:
+    skipTlsVerify: true

 ## Component scraping the kube api server
 kubeApiServer:
````
````diff
@@ -42,6 +43,13 @@ kubeControllerManager:
 {% for h in groups['kube_master'] %}
   - {{ h }}
 {% endfor %}
+  service:
+    port: 10257
+    targetPort: 10257
+  serviceMonitor:
+    https: true
+    insecureSkipVerify: true
+    serverName: localhost

 ## Component scraping coreDns. Use either this or kubeDns
 coreDns:
````
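Ports 10257 and 10259 are the secure serving ports of kube-controller-manager and kube-scheduler (recent k8s releases removed the insecure 10252/10251 ports), which is why scraping switches to https with relaxed certificate checks; the same pattern is applied to kubeScheduler below. A quick on-node sanity check, sketched under the assumption that both components bind to 0.0.0.0 as configured in the service template at the end:

``` bash
# on a master node: the secure ports answer /healthz over TLS (self-signed)
curl -sk https://127.0.0.1:10257/healthz   # kube-controller-manager
curl -sk https://127.0.0.1:10259/healthz   # kube-scheduler
# both should print: ok
```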
````diff
@@ -54,9 +62,6 @@ kubeEtcd:
 {% for h in groups['etcd'] %}
   - {{ h }}
 {% endfor %}

-## Configure secure access to the etcd cluster by loading a secret into prometheus and
-## specifying security configuration below. For example, with a secret named etcd-client-cert
   serviceMonitor:
     scheme: https
     insecureSkipVerify: true
````
````diff
@@ -73,6 +78,12 @@ kubeScheduler:
 {% for h in groups['kube_master'] %}
   - {{ h }}
 {% endfor %}
+  service:
+    port: 10259
+    targetPort: 10259
+  serviceMonitor:
+    https: true
+    insecureSkipVerify: true

 ## Component scraping kube proxy
 kubeProxy:
````
````diff
@@ -87,9 +98,19 @@ kubeProxy:
 {% endif %}
 {% endfor %}

+## Configuration for kube-state-metrics subchart
+kube-state-metrics:
+  image:
+    repository: easzlab/kube-state-metrics
+
 ## Manages Prometheus and Alertmanager components
 prometheusOperator:
   enabled: true
+  admissionWebhooks:
+    enabled: true
+    patch:
+      image:
+        repository: easzlab/kube-webhook-certgen
   service:
     nodePort: 30899
     nodePortTls: 30900
````
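The two `repository:` overrides above implement Note 3 from the guide: images that normally live on k8s.gcr.io are swapped for easzlab mirrors on Docker Hub. For any remaining slow images, a sketch of the usual pull-and-retag workaround (the v2.3.0 tag is only an example):

``` bash
# pull the mirror from Docker Hub, then retag it to the name the chart expects
docker pull easzlab/kube-state-metrics:v2.3.0
docker tag easzlab/kube-state-metrics:v2.3.0 \
  k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.3.0
```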
**kube-controller-manager.service.j2 template**

````diff
@@ -4,8 +4,10 @@ Documentation=https://github.com/GoogleCloudPlatform/kubernetes

 [Service]
 ExecStart={{ bin_dir }}/kube-controller-manager \
-  --bind-address=0.0.0.0 \
   --allocate-node-cidrs=true \
+  --authentication-kubeconfig=/etc/kubernetes/kube-controller-manager.kubeconfig \
+  --authorization-kubeconfig=/etc/kubernetes/kube-controller-manager.kubeconfig \
+  --bind-address=0.0.0.0 \
   --cluster-cidr={{ CLUSTER_CIDR }} \
   --cluster-name=kubernetes \
   --cluster-signing-cert-file={{ ca_dir }}/ca.pem \
````
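The two new kubeconfig flags let the controller-manager authenticate and authorize metrics clients (prometheus) against the API server, which the https serviceMonitor above depends on. After the template is re-rendered onto a master node, the unit has to be reloaded; a sketch:

``` bash
# on a master node, after the unit file has been updated
systemctl daemon-reload
systemctl restart kube-controller-manager

# confirm the secure port is listening on all interfaces
ss -lntp | grep 10257
```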