Kubernetes' Horizontal Pod Autoscaler (HPA) is an extremely useful feature.

Our services run on Alibaba Cloud ACK. Based on CPU or memory usage, Kubernetes automatically scales the number of key pods up or down to handle heavy traffic. Even better, the dynamically added pods do not run on our own fixed servers; they run on Alibaba's elastic ECI virtual nodes, so they really are created on demand and destroyed when no longer needed. You pay in proportion to the traffic you serve, and nothing sits idle.

Let's first get the concepts straight.

Kubernetes obtains resource metrics through APIs, and there are two kinds: core metrics and custom metrics.

  • Core metrics: provided by metrics-server through the metrics.k8s.io API; it only covers CPU and memory usage for Nodes and Pods.

  • Custom metrics: provided by the Prometheus Adapter through the custom.metrics.k8s.io and external.metrics.k8s.io APIs; any metric scraped by Prometheus can be exposed this way. (A quick way to see which of these APIs are registered is shown just below.)
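Listing the aggregated API services shows which of these endpoints a cluster actually serves; the grep is only there to narrow the output:

kubectl get apiservices | grep metrics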

1. Core metrics: metrics-server

Alibaba's ACK ships with metrics-server installed by default; take a look and you will find a metrics-server pod in the system:

kubectl get pods -n kube-system


Now let's see what the core metrics API returns. Node metrics first:

kubectl get --raw "/apis/metrics.k8s.io" | jq .
kubectl get --raw "/apis/metrics.k8s.io/v1beta1" | jq .
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq . 

(screenshot: Node metrics returned by metrics.k8s.io, including the ECI virtual node)

You can see the CPU and memory of the Alibaba ECI virtual nodes.

Now the Pod metrics:

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" | jq . 

(screenshot: Pod metrics returned by metrics.k8s.io)

You can clearly see the CPU and memory usage of the Pods scheduled onto the virtual nodes. Note that only CPU and memory are available; for core metrics, that is all there is, and it is enough.
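The same metrics.k8s.io data is what kubectl top reads, so it is a quick way to confirm metrics-server is healthy without raw API calls:

kubectl top nodes
kubectl top pods -A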

Here is an example that scales entirely on our own physical worker nodes.

php-hpa.yaml (scale when average CPU utilization reaches 80% or average memory usage reaches 200Mi):

php-hpa specifies how php-deploy scales: a minimum of 2 replicas and a maximum of 10:

apiVersion: autoscaling/v2beta1 
kind: HorizontalPodAutoscaler 
metadata: 
  name: php-hpa
  namespace: default
spec: 
  scaleTargetRef: 
    apiVersion: apps/v1
    kind: Deployment 
    name: php-deploy
  minReplicas: 2 
  maxReplicas: 10 
  metrics: 
  - type: Resource 
    resource: 
      name: cpu 
      targetAverageUtilization: 80 
  - type: Resource 
    resource: 
      name: memory 
      targetAverageValue: 200Mi
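Two practical notes not visible in the manifest itself: the Utilization target is computed against the containers' declared resource requests, so php-deploy must set CPU and memory requests for this to work; and after applying php-hpa.yaml the HPA's state can be inspected directly:

kubectl apply -f php-hpa.yaml
kubectl get hpa php-hpa
kubectl describe hpa php-hpa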

Now an example on Alibaba ACK that scales out onto ECI virtual nodes.

First we define an ordinary Deployment, php-deploy; there is nothing special about it.

Then we define an ElasticWorkload, elastic-php, which controls how php-deploy expands onto the ECI virtual nodes.

The one below has 6 fixed replicas and up to 24 elastic ones, 30 in total:

apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: ElasticWorkload
metadata:
  name: elastic-php
spec:
  sourceTarget:
    name: php-deploy
    kind: Deployment
    apiVersion: apps/v1
    min: 0
    max: 6
  elasticUnit:
  - name: virtual-kubelet
    labels:
      virtual-kubelet: "true"
      alibabacloud.com/eci: "true"
    annotations:
      virtual-kubelet: "true"
    nodeSelector:
      type: "virtual-kubelet"
    tolerations:
    - key: "virtual-kubelet.io/provider"
      operator: "Exists"
    min: 0     # replica range handled by this elastic unit (the ECI virtual nodes)
    max: 24
  replicas: 30
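Once applied, the ElasticWorkload controller should spawn a separate Deployment for the elastic unit; a rough way to check (the file name is assumed, and the resource name comes from the ACK CRD, so adjust if yours differs):

kubectl apply -f elastic-php.yaml
kubectl get elasticworkload elastic-php -o yaml
kubectl get deploy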

Then we define the HPA php-hpa to drive the autoscaling of elastic-php:

apiVersion: autoscaling/v2beta2 
kind: HorizontalPodAutoscaler 
metadata: 
  name: php-hpa 
  namespace: default 
spec: 
  scaleTargetRef: 
    apiVersion: autoscaling.alibabacloud.com/v1beta1 
    kind: ElasticWorkload 
    name: elastic-php 
  minReplicas: 6 
  maxReplicas: 30 
  metrics: 
  - type: Resource 
    resource: 
      name: cpu 
      target: 
        type: Utilization 
        averageUtilization: 90 
  behavior: 
    scaleUp: 
      policies: 
      #- type: Percent 
      #  value: 500 
      - type: Pods 
        value: 5 
        periodSeconds: 180 
    scaleDown: 
      policies: 
      - type: Pods 
        value: 1 
        periodSeconds: 600 

The ElasticWorkload above deserves a careful explanation. The sourceTarget caps php-deploy at 6 replicas, and those 6 run on our own nodes at no extra cost. The ECI replica count ranges from 0 to 24, so the overall ceiling is 6 + 24 = 30 pods, of which 24 can scale elastically on virtual nodes. php-hpa sets the scaling range to 6 to 30, which means that when traffic is low only the 6 fixed pods on our own servers are running; when traffic grows, the workload expands onto the ECI virtual nodes, up to 24 extra pods; and when traffic falls it shrinks back to the 6 pods on our own servers. This gives precise control over cost.

The behavior section at the bottom of php-hpa says: when CPU utilization hits 90%, the HPA may add at most 5 pods every 180 seconds, while on the way down it removes at most 1 pod every 600 seconds. Scaling down gradually avoids traffic glitches caused by abrupt changes.
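To watch these decisions happen in real time:

kubectl get hpa php-hpa --watch
kubectl describe hpa php-hpa    # current metrics plus recent scale-up/scale-down events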

2. Custom metrics: Prometheus

The above already covers most needs, but suppose we want to go further, for example driving HPA scaling from metrics collected by Prometheus.

That is where things get more involved.

(figure: Pods are scraped by Prometheus via the Prometheus Operator; the Prometheus Adapter exposes the stored data through the custom metrics API)

As the figure shows, Prometheus (installed via the Prometheus Operator) scrapes the pods' metrics over HTTP, and the Prometheus Adapter then queries the data stored in Prometheus and exposes it to the custom metrics API. Why are both pieces needed? Because the metrics Prometheus collects cannot be consumed by Kubernetes directly: the data formats are incompatible, so a separate component, the Prometheus Adapter, converts Prometheus metrics into the format the Kubernetes metrics APIs understand. And since this is a custom API, it also has to be registered with the main API server through the Kubernetes aggregation layer so that other programs can reach it directly under /apis/.
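After the adapter is installed (next section), that registration is visible as an APIService object; v1beta1.custom.metrics.k8s.io is the name the prometheus-adapter chart conventionally registers:

kubectl get apiservices | grep custom.metrics
kubectl get apiservice v1beta1.custom.metrics.k8s.io -o yaml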

Let's first look at how to install these two components.

Start with the Prometheus Operator. It pulls in a whole bundle of things automatically, Prometheus, Grafana, Alertmanager and so on, so it is best to give it its own namespace.

#Install
helm install --name prometheus --namespace monitoring  stable/prometheus-operator

#Forward a port so the Prometheus UI can be reached locally: curl http://localhost:9090
kubectl port-forward --namespace monitoring svc/prometheus-operator-prometheus 9090:9090


#See which pods were created
kubectl get pod -n monitoring
NAME                                                          READY   STATUS    RESTARTS   AGE
pod/alertmanager-prometheus-operator-alertmanager-0           2/2     Running   0          98m
pod/prometheus-operator-grafana-857dfc5fc8-vdnff              2/2     Running   0          99m
pod/prometheus-operator-kube-state-metrics-66b4c95cd9-mz8nt   1/1     Running   0          99m
pod/prometheus-operator-operator-56964458-8sspk               2/2     Running   0          99m
pod/prometheus-operator-prometheus-node-exporter-dcf5p        1/1     Running   0          99m
pod/prometheus-operator-prometheus-node-exporter-nv6ph        1/1     Running   0          99m
pod/prometheus-prometheus-operator-prometheus-0               3/3     Running   1          98m

#See which services were created
kubectl get svc -n monitoring
NAME                                           TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                          ClusterIP   None           <none>        9093/TCP,9094/TCP,9094/UDP   100m
prometheus-operated                            ClusterIP   None           <none>        9090/TCP                     100m
prometheus-operator-alertmanager               NodePort    10.1.238.78    <none>        9093:31765/TCP               102m
prometheus-operator-grafana                    NodePort    10.1.125.228   <none>        80:30284/TCP                 102m
prometheus-operator-kube-state-metrics         ClusterIP   10.1.187.129   <none>        8080/TCP                     102m
prometheus-operator-operator                   ClusterIP   10.1.242.61    <none>        8080/TCP,443/TCP             102m
prometheus-operator-prometheus                 NodePort    10.1.156.181   <none>        9090:30268/TCP               102m
prometheus-operator-prometheus-node-exporter   ClusterIP   10.1.226.134   <none>        9100/TCP                     102m

Note the ClusterIP service prometheus-operated. Given how Kubernetes CoreDNS naming works, and with our cluster's internal domain being hbb.local, the service's full hostname is prometheus-operated.monitoring.svc.hbb.local, and from inside the cluster it can also be reached as prometheus-operated.monitoring or prometheus-operated.monitoring.svc.
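To confirm that name resolves, one option is a throwaway pod hitting Prometheus' health endpoint (the curl image here is just a convenient choice, not something the chart installs):

kubectl run tmp-curl --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s http://prometheus-operated.monitoring.svc:9090/-/healthy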

Next install the Prometheus Adapter, setting prometheus.url according to the service above:

helm install --name prometheus-adapter stable/prometheus-adapter --set prometheus.url="http://prometheus-operated.monitoring.svc",prometheus.port="9090" --set image.tag="v0.4.1" --set rbac.create="true" --namespace custom-metrics

Query the external and custom metrics APIs to verify:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": []
}

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "*/agent.googleapis.com|agent|api_request_count",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    },
[...lots more metrics...]
    {
      "name": "*/vpn.googleapis.com|tunnel_established",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}

Now for the main event. Out of the box, the Prometheus Adapter only exposes a limited set of the metrics held in Prometheus; if you have custom metrics, or simply want more of them, you have to extend its configuration.

The quickest way is to edit the ConfigMap named prometheus-adapter in the custom-metrics namespace and add a rule with a seriesQuery.

A seriesQuery rule looks like this; the one below sums the 5-minute rate of change across all pods labeled app=shopping-kart:

apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app: prometheus-adapter
    chart: prometheus-adapter-v1.2.0
    heritage: Tiller
    release: prometheus-adapter
  name: prometheus-adapter
data:
  config.yaml: |
    rules:
    - seriesQuery: '{app="shopping-kart",kubernetes_namespace!="",kubernetes_pod_name!=""}'
      seriesFilters: []
      resources:
        overrides:
          kubernetes_namespace:
            resource: namespace
          kubernetes_pod_name:
            resource: pod
      name:
        matches: ""
        as: ""
      metricsQuery: sum(rate(<<.Series>>{<<.LabelMatchers>>}[5m])) by (<<.GroupBy>>)
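After the adapter reloads this config (see the restart note further down), every series carrying the app="shopping-kart" label should appear in the custom metrics API under its own metric name, since the "as" field is left empty. A quick way to list what is exposed:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq '.resources[].name'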

The fields of an adapter rule mean the following:

  1. seriesQuery tells the adapter which Prometheus series (metric names) to fetch.

  2. resources tells the adapter which Kubernetes resources each metric is associated with, i.e. which labels map to namespace, pod and so on.

  3. metricsQuery is the actual Prometheus query, built on top of seriesQuery, that is executed to compute the final metric values.

  4. name controls the name under which the metric is exposed through the custom metrics API.

For example, to compute container_network_receive_packets_total, in the Prometheus UI we would run the following query:

sum(rate(container_network_receive_packets_total{namespace="default",pod=~"php-deploy.*"}[10m])) by (pod)

Translated into the adapter's metricsQuery template it becomes the following, which is admittedly hard to read:

metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[10m])) by (<<.GroupBy>>)'

Another example:

rate(gorush_total_push_count{instance="push.server.com:80",job="push-server"}[5m])

(screenshot: the 5-minute rate of gorush_total_push_count graphed in the Prometheus UI)

Converted into the adapter's ConfigMap it becomes:

apiVersion: v1
data:
  config.yaml: |
    rules:
    - seriesQuery: '{__name__=~"gorush_total_push_count"}'
      seriesFilters: []
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: ""
        as: "gorush_push_per_second"
      metricsQuery: rate(<<.Series>>{<<.LabelMatchers>>}[5m])

After modifying the ConfigMap, you must restart the prometheus-adapter pod so that it reloads the configuration!
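Concretely, something like this (the deployment name and pod label follow what the helm chart above creates; adjust if yours differ):

kubectl -n custom-metrics rollout restart deployment prometheus-adapter
# or simply delete the pod and let the deployment recreate it:
kubectl -n custom-metrics delete pod -l app=prometheus-adapter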

An example of using it in an HPA:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: gorush-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gorush
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metricName: gorush_push_per_second
      targetAverageValue: 1m
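Before relying on this HPA, it is worth confirming that the renamed metric is really served by the custom metrics API (namespace assumed to be default):

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/gorush_push_per_second" | jq .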

One more example, for a Prometheus metric named myapp_client_connected:

apiVersion: v1
data:
  config.yaml: |
    rules:
    - seriesQuery: '{__name__= "myapp_client_connected"}'
      seriesFilters: []
      resources:
        overrides:
          k8s_namespace:
            resource: namespace
          k8s_pod_name:
            resource: pod
      name:
        matches: "myapp_client_connected"
        as: ""
      metricsQuery: <<.Series>>{<<.LabelMatchers>>,container_name!="POD"}

And the HPA that uses it:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-sim
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hpa-sim
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metricName: myapp_client_connected
      targetAverageValue: 20

Quite involved, right? Below is a complete worked example.

3. A complete custom-metrics example

We first define a Deployment running an nginx-vts pod; this image already exposes Prometheus metrics on its own.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deploy
spec:
  selector:
    matchLabels:
      app: nginx-deploy
  template:
    metadata:
      labels:
        app: nginx-deploy
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "80"
        prometheus.io/path: "/status/format/prometheus"
    spec:
      containers:
      - name: nginx-deploy
        image: cnych/nginx-vts:v1.0
        resources:
          limits:
            cpu: 50m
          requests:
            cpu: 50m
        ports:
        - containerPort: 80
          name: http

Then define a Service to expose port 80:

apiVersion: v1
kind: Service
metadata:
  name: nginx-svc
spec:
  ports:
  - port: 80
    targetPort: 80
    name: http
  selector:
    app: nginx-deploy
  type: ClusterIP
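Apply both and make sure the pod is running before moving on (the file names here are assumed):

kubectl apply -f nginx-deploy.yaml -f nginx-svc.yaml
kubectl get pods -l app=nginx-deploy
kubectl get svc nginx-svc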

Prometheus discovers scrape targets automatically, so these annotations on the pod template are what trigger it to start collecting the nginx metrics.

Start a shell inside the cluster and hit the endpoint:

$ curl nginx-svc.default.svc.hbb.local/status/format/prometheus
# HELP nginx_vts_info Nginx info
# TYPE nginx_vts_info gauge
nginx_vts_info{hostname="nginx-deployment-65d8df7488-c578v",version="1.13.12"} 1
# HELP nginx_vts_start_time_seconds Nginx start time
# TYPE nginx_vts_start_time_seconds gauge
nginx_vts_start_time_seconds 1574283147.043
# HELP nginx_vts_main_connections Nginx connections
# TYPE nginx_vts_main_connections gauge
nginx_vts_main_connections{status="accepted"} 215
nginx_vts_main_connections{status="active"} 4
nginx_vts_main_connections{status="handled"} 215
nginx_vts_main_connections{status="reading"} 0
nginx_vts_main_connections{status="requests"} 15577
nginx_vts_main_connections{status="waiting"} 3
nginx_vts_main_connections{status="writing"} 1
# HELP nginx_vts_main_shm_usage_bytes Shared memory [ngx_http_vhost_traffic_status] info
# TYPE nginx_vts_main_shm_usage_bytes gauge
nginx_vts_main_shm_usage_bytes{shared="max_size"} 1048575
nginx_vts_main_shm_usage_bytes{shared="used_size"} 3510
nginx_vts_main_shm_usage_bytes{shared="used_node"} 1
# HELP nginx_vts_server_bytes_total The request/response bytes
# TYPE nginx_vts_server_bytes_total counter
# HELP nginx_vts_server_requests_total The requests counter
# TYPE nginx_vts_server_requests_total counter
# HELP nginx_vts_server_request_seconds_total The request processing time in seconds
# TYPE nginx_vts_server_request_seconds_total counter
# HELP nginx_vts_server_request_seconds The average of request processing times in seconds
# TYPE nginx_vts_server_request_seconds gauge
# HELP nginx_vts_server_request_duration_seconds The histogram of request processing time
# TYPE nginx_vts_server_request_duration_seconds histogram
# HELP nginx_vts_server_cache_total The requests cache counter
# TYPE nginx_vts_server_cache_total counter
nginx_vts_server_bytes_total{host="_",direction="in"} 3303449
nginx_vts_server_bytes_total{host="_",direction="out"} 61641572
nginx_vts_server_requests_total{host="_",code="1xx"} 0
nginx_vts_server_requests_total{host="_",code="2xx"} 15574
nginx_vts_server_requests_total{host="_",code="3xx"} 0
nginx_vts_server_requests_total{host="_",code="4xx"} 2
nginx_vts_server_requests_total{host="_",code="5xx"} 0
nginx_vts_server_requests_total{host="_",code="total"} 15576
nginx_vts_server_request_seconds_total{host="_"} 0.000
nginx_vts_server_request_seconds{host="_"} 0.000
nginx_vts_server_cache_total{host="_",status="miss"} 0
nginx_vts_server_cache_total{host="_",status="bypass"} 0
nginx_vts_server_cache_total{host="_",status="expired"} 0
nginx_vts_server_cache_total{host="_",status="stale"} 0
nginx_vts_server_cache_total{host="_",status="updating"} 0
nginx_vts_server_cache_total{host="_",status="revalidated"} 0
nginx_vts_server_cache_total{host="_",status="hit"} 0
nginx_vts_server_cache_total{host="_",status="scarce"} 0
nginx_vts_server_bytes_total{host="*",direction="in"} 3303449
nginx_vts_server_bytes_total{host="*",direction="out"} 61641572
nginx_vts_server_requests_total{host="*",code="1xx"} 0
nginx_vts_server_requests_total{host="*",code="2xx"} 15574
nginx_vts_server_requests_total{host="*",code="3xx"} 0
nginx_vts_server_requests_total{host="*",code="4xx"} 2
nginx_vts_server_requests_total{host="*",code="5xx"} 0
nginx_vts_server_requests_total{host="*",code="total"} 15576
nginx_vts_server_request_seconds_total{host="*"} 0.000
nginx_vts_server_request_seconds{host="*"} 0.000
nginx_vts_server_cache_total{host="*",status="miss"} 0
nginx_vts_server_cache_total{host="*",status="bypass"} 0
nginx_vts_server_cache_total{host="*",status="expired"} 0
nginx_vts_server_cache_total{host="*",status="stale"} 0
nginx_vts_server_cache_total{host="*",status="updating"} 0
nginx_vts_server_cache_total{host="*",status="revalidated"} 0
nginx_vts_server_cache_total{host="*",status="hit"} 0
nginx_vts_server_cache_total{host="*",status="scarce"} 0

Then hammer it with requests using wrk (see the command sketch just below) and check in the Prometheus dashboard whether the metrics are being collected.
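Something along these lines, run from a pod inside the cluster; the thread and connection counts are arbitrary, just enough to move the counters:

wrk -t2 -c10 -d300s http://nginx-svc.default.svc.hbb.local/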

(screenshot: the nginx_vts request metrics spiking in the Prometheus UI under load)

Quite a spike. Now edit the prometheus-adapter ConfigMap and add the following rule:

rules:
- seriesQuery: 'nginx_vts_server_requests_total'
  seriesFilters: []
  resources:
    overrides:
      kubernetes_namespace:
        resource: namespace
      kubernetes_pod_name:
        resource: pod
  name:
    matches: "^(.*)_total"
    as: "${1}_per_second"
  metricsQuery: (sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>))

Then kill the prometheus-adapter pod so it restarts and reloads the configuration. Wait a bit, then query the API and check the value; here it is 527m, i.e. about 0.527 requests per second:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/nginx_vts_server_requests_per_second" | jq .
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/%2A/nginx_vts_server_requests_per_second"
  },
  "items": [
    {
      "describedObject": {
        "kind": "Pod",
        "namespace": "default",
        "name": "hpa-prom-demo-755bb56f85-lvksr",
        "apiVersion": "/v1"
      },
      "metricName": "nginx_vts_server_requests_per_second",
      "timestamp": "2020-04-07T09:45:45Z",
      "value": "527m",
      "selector": null
    }
  ]
}

OK, that works. Now define an HPA that scales on this metric:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deploy
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metricName: nginx_vts_server_requests_per_second
      targetAverageValue: 10

And that's it.
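With the wrk load still running, you can watch the autoscaler react and the pod count climb toward maxReplicas:

kubectl get hpa nginx-hpa --watch
kubectl get pods -l app=nginx-deploy -w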

If a pod cannot expose metrics by itself, we can run an exporter as a sidecar container to collect the data and expose it, roughly as sketched below.
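A rough sketch of that sidecar pattern; the application and exporter images, ports and path here are placeholders rather than any specific exporter:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9100"          # scrape the sidecar, not the app container
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: myapp
        image: myapp:latest                 # the app itself exposes no metrics
        ports:
        - containerPort: 8080
      - name: metrics-exporter              # sidecar that translates app stats into Prometheus format
        image: some-exporter:latest         # placeholder exporter image
        ports:
        - containerPort: 9100
          name: metrics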