Early versions of Kubernetes relied on Heapster for performance data collection and monitoring. Starting with version 1.8, Kubernetes exposes performance data through the standardized Metrics API, and starting with version 1.10 Heapster was replaced by Metrics Server. In the new Kubernetes monitoring architecture, Metrics Server provides the core metrics (Core Metrics), including CPU and memory usage for Nodes and Pods. Monitoring of other custom metrics (Custom Metrics) is handled by components such as Prometheus.
Monitoring CPU and Memory Usage of Pods and Nodes with Metrics Server
Once deployed, Metrics Server exposes Pod and Node monitoring data through the Kubernetes core API Server at the "/apis/metrics.k8s.io/v1beta1" path. The Metrics Server source code and deployment manifests can be found in the GitHub repository (https://github.com/kubernetes-incubator/metrics-server).
First, deploy a Metrics Server instance. The following YAML configuration contains the ServiceAccount, Deployment, and Service definitions:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-server
  namespace: kube-system

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  selector:
    matchLabels:
      k8s-app: metrics-server
  template:
    metadata:
      name: metrics-server
      labels:
        k8s-app: metrics-server
    spec:
      serviceAccountName: metrics-server
      containers:
      - name: metrics-server
        image: k8s.gcr.io/metrics-server-amd64:v0.3.1   # the image can be pulled into a local registry first
        imagePullPolicy: IfNotPresent
        command:
        - /metrics-server
        - --kubelet-insecure-tls
        - --kubelet-preferred-address-types=InternalIP
        volumeMounts:
        - name: tmp-dir
          mountPath: /tmp
      volumes:
      - name: tmp-dir
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    kubernetes.io/name: "Metrics-server"
spec:
  selector:
    k8s-app: metrics-server
  ports:
  - port: 443
    protocol: TCP
    targetPort: 443
Finally, create an APIService resource so that the monitoring data is served at the "/apis/metrics.k8s.io/v1beta1" path:
apiVersion: apiregistration.k8s.io/v1beta1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  service:
    name: metrics-server
    namespace: kube-system
  group: metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
After the deployment completes, verify that the metrics-server Pod has started successfully.
Then use the kubectl top nodes and kubectl top pods commands to view CPU and memory usage for Nodes and Pods.
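For example, the following commands can be used to confirm that metric data is flowing; the last one queries the Metrics API directly through the API Server (the namespace used here is just an example):

```shell
# View aggregate CPU/memory usage for each Node
kubectl top nodes

# View usage for Pods in a given namespace (kube-system here)
kubectl top pods -n kube-system

# Query the Metrics API directly through the API Server
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
```

The actual values returned depend on the cluster.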
The metrics provided by Metrics Server can also be consumed by the HPA controller to automatically scale Pods up or down based on CPU utilization or memory usage.
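As a sketch of how the HPA consumes these metrics, the following manifest (the target Deployment name "myweb", the replica bounds, and the 50% threshold are all illustrative assumptions) scales on average CPU utilization reported by Metrics Server:

```yaml
# Illustrative HPA: scales the hypothetical Deployment "myweb"
# between 1 and 5 replicas, targeting 50% average CPU utilization.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: myweb
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myweb
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 50
```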
Building a Cluster Performance Monitoring Platform with Prometheus + Grafana
Prometheus is an open-source monitoring system originally developed at SoundCloud. It was the second project, after Kubernetes, to graduate from the CNCF, and it is widely used in the container and microservices space. Its main features are as follows.
◎ A multi-dimensional data model in which time series are identified by a metric name and key-value label pairs.
◎ A flexible query language, PromQL.
◎ No reliance on distributed storage; single server nodes are autonomous.
◎ Metric collection over HTTP using a pull model.
◎ Support for pushing time-series data through an intermediary gateway.
◎ Support for multiple graphing and dashboarding tools, such as Grafana.
The Prometheus ecosystem consists of a number of components that extend its functionality.
◎ Prometheus Server: collects monitoring data, stores it as time series, and provides query capabilities.
◎ Client SDKs: libraries for instrumenting application code for Prometheus.
◎ Push Gateway: a gateway component for pushing metrics.
◎ Third-party Exporters: external metric collection systems whose data can be scraped by Prometheus.
◎ AlertManager: the alert manager.
◎ Various other supporting tools.
The main functions of the core component, Prometheus Server, include: obtaining the resources and services to be monitored from the Kubernetes Master; pulling metric data from the various Exporters and storing it in a time-series database (TSDB); serving an HTTP API through which other systems can run queries; providing PromQL-based data queries; and pushing alert data to AlertManager; among others.
The following describes how to deploy the Prometheus service.
First, create a ConfigMap to hold Prometheus's main configuration file, prometheus.yml, in which the Kubernetes resource objects and services to be monitored (Services, Pods, Nodes, and so on) are configured:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  prometheus.rules.yml: |
    groups:
    - name: recording_rules
      rules:
      - record: service_myweb:container_memory_working_set_bytes:sum
        expr: sum by (namespace, label_service_myweb) (sum(container_memory_working_set_bytes{image!=""}) by (pod_name, namespace) * on (namespace, pod_name) group_left(service_myweb, label_service_myweb) label_replace(kube_pod_labels, "pod_name", "$1", "pod", "(.*)"))
  prometheus.yml: |
    global:
      scrape_interval: 30s
    rule_files:
    - 'prometheus.rules.yml'
    scrape_configs:
    - job_name: prometheus
      static_configs:
      - targets:
        - localhost:9090
    - job_name: kubernetes-apiservers
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: default;kubernetes;https
        source_labels:
        - __meta_kubernetes_namespace
        - __meta_kubernetes_service_name
        - __meta_kubernetes_endpoint_port_name
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    - job_name: kubernetes-nodes-kubelet
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    - job_name: kubernetes-nodes-cadvisor
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __metrics_path__
        replacement: /metrics/cadvisor
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scrape
      - action: replace
        regex: (https?)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_scheme
        target_label: __scheme__
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_service_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
    - job_name: kubernetes-services
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module:
        - http_2xx
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_service_annotation_prometheus_io_probe
      - source_labels:
        - __address__
        target_label: __param_target
      - replacement: blackbox
        target_label: __address__
      - source_labels:
        - __param_target
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - source_labels:
        - __meta_kubernetes_service_name
        target_label: kubernetes_name
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    alerting:
      alertmanagers:
      - kubernetes_sd_configs:
        - role: pod
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace]
          regex: kube-system
          action: keep
        - source_labels: [__meta_kubernetes_pod_label_k8s_app]
          regex: alertmanager
          action: keep
        - source_labels: [__meta_kubernetes_pod_container_port_number]
          regex:
          action: drop
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    k8s-app: prometheus
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: prometheus
  template:
    metadata:
      labels:
        k8s-app: prometheus
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      initContainers:
      - name: "init-chown-data"
        image: "busybox:latest"
        imagePullPolicy: "IfNotPresent"
        command: ["chown", "-R", "65534:65534", "/data"]
        volumeMounts:
        - name: storage-volume
          mountPath: /data
          subPath: ""
      containers:
      - name: prometheus-server-configmap-reload
        image: "jimmidyson/configmap-reload:v0.1"
        imagePullPolicy: "IfNotPresent"
        args:
        - --volume-dir=/etc/config
        - --webhook-url=http://localhost:9090/-/reload
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
          readOnly: true
      - name: prometheus-server
        image: "prom/prometheus:v2.8.0"
        imagePullPolicy: "IfNotPresent"
        args:
        - --config.file=/etc/config/prometheus.yml
        - --storage.tsdb.path=/data
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --web.console.templates=/etc/prometheus/consoles
        - --web.enable-lifecycle
        ports:
        - containerPort: 9090
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 30
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
        - name: storage-volume
          mountPath: /data
          subPath: ""
      terminationGracePeriodSeconds: 300
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
      - name: storage-volume
        hostPath:
          path: /prometheus-data
          type: Directory

---
kind: Service
apiVersion: v1
metadata:
  name: prometheus
  namespace: kube-system
  labels:
    kubernetes.io/name: "Prometheus"
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  type: NodePort
  ports:
  - name: http
    port: 9090
    nodePort: 9090
    protocol: TCP
    targetPort: 9090
  selector:
    k8s-app: prometheus
Prometheus provides a simple web page for browsing the collected monitoring data. The Service above defines a NodePort of 9090, so the page can be reached through port 9090 on any Node. Note that a nodePort of 9090 lies outside the default NodePort range (30000-32767), so the cluster must be configured to allow it.
On this web page you can enter a PromQL query to retrieve metric data, or select a metric to inspect; for example, select the container_network_receive_bytes_total metric to view container network traffic.
Click the Graph tab to view the metric as a time-series chart.
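Building on the container_network_receive_bytes_total metric mentioned above, PromQL queries such as the following can be entered on the Graph page (the 5-minute rate window and the pod_name label, which cAdvisor attached in this Kubernetes era, are illustrative):

```promql
# Per-second network receive rate over the last 5 minutes, per Pod
sum(rate(container_network_receive_bytes_total[5m])) by (pod_name)

# Total receive rate across the cluster
sum(rate(container_network_receive_bytes_total[5m]))
```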
Next, Exporters can be deployed for various systems and services to collect their metrics. Prometheus currently supports Exporters for many kinds of open-source software, including databases, hardware, messaging systems, storage systems, HTTP servers, logging services, and more; information about the available Exporters can be found on the Prometheus website at https://prometheus.io/docs/instrumenting/exporters/.
The officially maintained node_exporter is deployed below as an example. node_exporter collects host-level performance metrics; its project page is https://github.com/prometheus/node_exporter. Its YAML configuration file is as follows:
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
  labels:
    k8s-app: node-exporter
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    version: v0.17.0
spec:
  updateStrategy:
    type: OnDelete
  template:
    metadata:
      labels:
        k8s-app: node-exporter
        version: v0.17.0
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-node-critical
      containers:
      - name: prometheus-node-exporter
        image: "prom/node-exporter:v0.17.0"
        imagePullPolicy: "IfNotPresent"
        args:
        - --path.procfs=/host/proc
        - --path.sysfs=/host/sys
        ports:
        - name: metrics
          containerPort: 9100
          hostPort: 9100
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
        resources:
          limits:
            cpu: 1
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 50Mi
      hostNetwork: true
      hostPID: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys

---
apiVersion: v1
kind: Service
metadata:
  name: node-exporter
  namespace: kube-system
  annotations:
    prometheus.io/scrape: "true"
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "NodeExporter"
spec:
  clusterIP: None
  ports:
  - name: metrics
    port: 9100
    protocol: TCP
    targetPort: 9100
  selector:
    k8s-app: node-exporter
Because the Service carries the prometheus.io/scrape: "true" annotation, the kubernetes-service-endpoints job discovers it automatically, and the Node metric data collected by node-exporter can then be viewed on the Prometheus web page, as shown in the figure.
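As a quick check, queries such as the following summarize per-node CPU and memory from node_exporter data (the metric names match node_exporter v0.17; the 5-minute window is illustrative):

```promql
# Fraction of CPU time spent non-idle over the last 5 minutes, per node
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)

# Available memory in bytes, per node
node_memory_MemAvailable_bytes
```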
Finally, deploy Grafana to present professional monitoring dashboards. Its YAML configuration file is as follows:
---
kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: grafana
  namespace: kube-system
  labels:
    k8s-app: grafana
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: grafana
  template:
    metadata:
      labels:
        k8s-app: grafana
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ''
    spec:
      priorityClassName: system-cluster-critical
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      - key: "CriticalAddonsOnly"
        operator: "Exists"
      containers:
      - name: grafana
        image: grafana/grafana:6.0.1
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 100Mi
        env:
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          value: /api/v1/namespaces/kube-system/services/grafana/proxy/
        ports:
        - name: ui
          containerPort: 3000

---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "Grafana"
spec:
  ports:
  - port: 80
    protocol: TCP
    targetPort: ui
  selector:
    k8s-app: grafana
After the deployment completes, access the Grafana page through the Kubernetes Master's proxy URL, for example http://ip:8080/api/v1/namespaces/kube-system/services/grafana/proxy.
On Grafana's settings page, add a data source of type Prometheus, enter the URL of the Prometheus service (for example, http://prometheus:9090), and save it.
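Instead of clicking through the UI, the data source can also be provisioned declaratively using Grafana's provisioning mechanism; the following is a sketch of such a file (the file path and the fully qualified service DNS name are assumptions for this setup), which would be placed under /etc/grafana/provisioning/datasources/:

```yaml
# Hypothetical /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
- name: Prometheus
  type: prometheus
  access: proxy
  url: http://prometheus.kube-system.svc:9090
  isDefault: true
```

In a cluster deployment this file could be delivered via a ConfigMap mounted into the Grafana container.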
From Grafana's Dashboard panel, import a prebuilt Dashboard template to display the various monitoring charts. The Grafana website (https://grafana.com/dashboards) provides many Dashboard templates for Kubernetes cluster monitoring, which can be downloaded, imported, and used directly. The figure below shows a Dashboard that monitors cluster CPU, memory, filesystem, and network throughput.
Summary:
With this, the Prometheus + Grafana monitoring system for the Kubernetes cluster is complete.
Thank you for reading.