Monitoring

Overview

Kyma comes bundled with third-party applications, such as Prometheus, Alertmanager, and Grafana, which provide monitoring functionality for all Kyma resources. These applications are deployed during the Kyma cluster installation, along with a set of pre-defined alerting rules, Grafana dashboards, and Prometheus configuration.

The whole installation package provides end-to-end Kubernetes cluster monitoring that allows you to:

  • View metrics exposed by the Pods.
  • Use the metrics to create descriptive dashboards that monitor any Pod anomalies.
  • Manage the default alert rules and create new ones.
  • Set up channels for notifications informing of any detected alerts.

NOTE: The monitoring functionality is available by default in the cluster installation, but it is disabled in the Kyma Lite local installation on Minikube. See the installation documentation to learn how to enable the monitoring and prometheus-operator modules for the local installation.

Architecture

Before you learn what the complete metric flow looks like in Kyma, read about the components and resources that are crucial elements of the monitoring flow in Kyma.

Components

The main monitoring components include:

  • Prometheus Operator that creates a Prometheus instance, manages its deployment, and provides configuration for it. It also manages ServiceMonitor custom resources that specify monitoring definitions for groups of services. Prometheus Operator is a prerequisite for installing other core monitoring components, such as Alertmanager and Grafana.

    For more details, read the Prometheus Operator documentation.

  • Prometheus that collects metrics from Pods. The metrics are time-stamped data that provide information on the running jobs, workload, CPU consumption, memory usage, and more. Pods can also contain applications with custom metrics, such as the total storage space available in the MinIO server. Prometheus stores this polled data in a time-series database (TSDB) and runs rules over it to generate alerts if it detects any metric anomalies.

    For more details, read the Prometheus documentation.

  • Grafana that provides a dashboard and a graph editor to visualize metrics collected from the Prometheus API. Grafana uses the PromQL query language to select and aggregate metrics data from the Prometheus database. To access the Grafana UI, use the https://grafana.{DOMAIN} address, where {DOMAIN} is the domain of your Kyma cluster.

    For more details, read the Grafana documentation.

  • Alertmanager that receives alerts from Prometheus and forwards this data to the configured Slack or VictorOps channels.

    NOTE: There are no notification channels configured in the default monitoring installation. The current configuration allows you to add either Slack or VictorOps channels.

    For more details, read the Alertmanager documentation.

Monitoring in Kyma also relies heavily on these custom resources:

  • PrometheusRules define alert conditions for metrics. They are configured in Prometheus as PrometheusRule custom resource definitions (CRDs). Kyma provides a set of out-of-the-box alerting rules that are passed from Prometheus to Alertmanager. The definitions of such rules specify the alert logic, the value at which alerts are triggered, the alerts' severity, and more. If you define specific Slack or VictorOps channels, Alertmanager sends the alerts to those channels each time they are triggered.

  • ServiceMonitors are CRDs that specify the endpoints from which Prometheus should poll the metrics. Even if your application exposes many metrics, Prometheus polls only those available at the /metrics endpoints of the ports specified in ServiceMonitor CRDs.
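
For illustration, a minimal ServiceMonitor can look as follows. This is a sketch; the resource name and the port name are illustrative and do not come from a specific Kyma chart:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  # Illustrative name; Kyma expects ServiceMonitors in the kyma-system Namespace by default.
  name: sample-service-monitor
  namespace: kyma-system
spec:
  selector:
    matchLabels:
      # Prometheus scrapes Services that carry this label.
      k8s-app: metrics
  endpoints:
  # Poll the /metrics path of the named Service port every 10 seconds.
  - port: web
    path: /metrics
    interval: 10s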

End-to-end monitoring flow

The complete monitoring flow in Kyma comes down to these components and steps:

  1. Upon Kyma installation on a cluster, Prometheus Operator creates a Prometheus instance with default configuration.
  2. The Prometheus server periodically polls all metrics exposed on /metrics endpoints of ports specified in ServiceMonitor CRDs. Prometheus stores these metrics in a time-series database.
  3. If Prometheus detects any metric values matching the logic of alerting rules, it triggers the alerts and passes them to Alertmanager.
  4. If you manually configure a notification channel, you can instantly receive detailed information on metric alerts detected by Prometheus.
  5. You can visualize metrics and track their historical data on Grafana dashboards.

Details

Alertmanager

Alertmanager receives and manages alerts coming from Prometheus. It can then forward the notifications about fired alerts to specific channels, such as Slack or VictorOps.

Alertmanager configuration

Use the following files to configure and manage Alertmanager:

  • alertmanager.yaml, which deploys the Alertmanager Pod.
  • alertmanager.config.yaml, which you can use to define the core Alertmanager configuration and alerting channels. For details on the configuration elements, see the Alertmanager configuration documentation.
  • alertmanager.rules, which lists alerting rules used to monitor Alertmanager's health.

Additionally, Alertmanager instances require a Secret resource which contains the encoded alertmanager.yaml.tpl file. This Secret is picked up during Pod deployment and mounted as alertmanager.config.yaml, which allows you to configure alert settings and notifications.

The Secret resource looks as follows:

apiVersion: v1
kind: Secret
metadata:
  labels:
    alertmanager: {{ .Release.Name }}
    app: {{ template "alertmanager.name" . }}
    chart: {{ .Chart.Name }}-{{ .Chart.Version }}
    heritage: {{ .Release.Service }}
    release: {{ .Release.Name }}
  name: alertmanager-{{ .Release.Name }}
data:
  alertmanager.yaml: |-
    {{ include "alertmanager.yaml.tpl" . | b64enc }}
{{- range $key, $val := .Values.templateFiles }}
  {{ $key }}: {{ $val | b64enc | quote }}
{{- end }}

To configure alerts and forward them to different channels, define these parameters:

Parameter | Description
name | Specifies the name of the Secret. The name must follow the alertmanager-{ALERTMANAGER_NAME} format.
data | Contains the encoded alertmanager.yaml.tpl file, which holds the alerting notification configuration provided in the alertmanager.config.yaml file.
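
To illustrate, the decoded configuration rendered from alertmanager.yaml.tpl can look similar to this sketch. It assumes a single Slack receiver; the route and receiver names are illustrative, not the exact defaults shipped with Kyma:

global:
  resolve_timeout: 5m
route:
  receiver: "null"
  group_by: ["alertname"]
  routes:
  # Illustrative route that sends critical alerts to the Slack receiver.
  - match:
      severity: critical
    receiver: slack-notifications
receivers:
- name: "null"
- name: slack-notifications
  slack_configs:
  - channel: "{CHANNEL_NAME}"
    api_url: "{WEBHOOK_URL}"
    send_resolved: true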

Alerting rules

Kyma comes with a set of alerting rules provided out of the box. You can find them in the Kyma monitoring chart. These rules provide alerting configuration for logging, web apps, REST services, and custom Kyma rules.

You can also define your own alerting rule. To learn how, see the Define alerting rules tutorial.

Configuration

Alertmanager sub-chart

To configure the Alertmanager sub-chart, override the default values of its values.yaml file. This document describes parameters that you can configure.

TIP: To learn more about how to use overrides in Kyma, see the Kyma configuration documentation.

Configurable parameters

This table lists the configurable parameters, their descriptions, and default values:

Parameter | Description | Default value
global.alertTools.credentials.slack.apiurl | Specifies the Slack webhook URL endpoint to which alerts triggered by Prometheus rules are sent. | None
global.alertTools.credentials.slack.channel | Refers to the Slack channel which receives notifications on new alerts. | None
global.alertTools.credentials.victorOps.routingkey | Defines the team routing key in VictorOps. | None
global.alertTools.credentials.victorOps.apikey | Defines the team API key in VictorOps. | None

NOTE: Override all configurable values for the Alertmanager sub-chart using Secrets (kind: Secret).
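
For example, a Secret that overrides the Slack-related values can look as follows. This sketch assumes the installer override mechanism shown in the Send notifications to Slack tutorial; the same approach applies to the Grafana and Prometheus sub-charts with their respective parameters:

apiVersion: v1
kind: Secret
metadata:
  # The Secret name is illustrative.
  name: monitoring-overrides
  namespace: kyma-installer
  labels:
    installer: overrides
    component: monitoring
    kyma-project.io/installation: ""
type: Opaque
stringData:
  global.alertTools.credentials.slack.channel: "{CHANNEL_NAME}"
  global.alertTools.credentials.slack.apiurl: "{WEBHOOK_URL}"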

Grafana sub-chart

To configure the Grafana sub-chart, override the default values of its values.yaml file. This document describes parameters that you can configure.

TIP: To learn more about how to use overrides in Kyma, see the Kyma configuration documentation.

Configurable parameters

This table lists the configurable parameters, their descriptions, and default values:

Parameter | Description | Default value
users.default.theme | Specifies the background colour of the Grafana UI. You can change it to dark. | light

Prometheus sub-chart

To configure the Prometheus sub-chart, override the default values of its values.yaml file. This document describes parameters that you can configure.

TIP: To learn more about how to use overrides in Kyma, see the Kyma configuration documentation.

Configurable parameters

This table lists the configurable parameters, their descriptions, and default values:

Parameter | Description | Default value
retention | Specifies the period for which Prometheus stores metrics in memory. Prometheus keeps the recent data in memory for the specified amount of time to avoid reading all of it from disk. | 2h
storageSpec.volumeClaimTemplate.spec.resources.requests.storage | Specifies the size of the Persistent Volume Claim (PVC). | 4Gi

Tutorials

Overview

The set of monitoring tutorials you are about to read describes the complete monitoring flow for your services in Kyma. As you go through the tutorials, you get to know Kyma's built-in monitoring applications, such as Prometheus, Grafana, and Alertmanager. This hands-on experience helps you understand how and where you can observe and visualize your service metrics, and how to monitor them for alerting conditions.

All the tutorials use the monitoring-custom-metrics example and one of its services called sample-metrics-8081. This service exposes the cpu_temperature_celsius custom metric on the /metrics endpoint. This custom metric is the central element of the whole tutorial set. The metric value simulates the current processor temperature and changes randomly between 60 and 90 degrees Celsius. The alerting threshold in these tutorials is 75 degrees Celsius. If the temperature reaches or exceeds this value, the Grafana dashboard, Prometheus rule, and Alertmanager notifications you create clearly inform you about it.

The tutorial set consists of these documents:

  1. Observe application metrics in which you port-forward the cpu_temperature_celsius metric to localhost and view it in the Prometheus UI. You then observe how the metric value changes in the predefined 10-second interval at which Prometheus scrapes the metric values from the service's /metrics endpoint.

  2. Create a Grafana dashboard in which you create a Grafana dashboard of a Gauge type for the cpu_temperature_celsius metric. This dashboard shows explicitly when the CPU temperature is equal to or higher than the predefined threshold of 75 degrees Celsius, at which point the dashboard turns red.

  3. Define alerting rules in which you define the CPUTempHigh alerting rule by creating a PrometheusRule resource. Prometheus accesses the /metrics endpoint every 10 seconds and validates the current value of the cpu_temperature_celsius metric. If the value is equal to or higher than 75 degrees Celsius, Prometheus waits for 10 seconds to check it again. If the value still meets the threshold, Prometheus fires the alert. You can observe both the rule and the alert it generates on the Prometheus dashboard.

  4. Send notifications to Slack in which you configure Alertmanager to send notifications on Prometheus alerts to a Slack channel. This way, whenever Prometheus triggers or resolves the CPUTempHigh alert, Alertmanager sends a notification to the test-monitoring-alerts Slack channel defined for the purpose of the tutorial.

See the diagram for an overview of the purpose of the tutorials and the tools used in them:

Monitoring tutorials

Observe application metrics

This tutorial shows how you can observe your application metrics. Learn how to list all metrics exposed by a sample Go service and watch their changing values by redirecting the metrics port and the default Prometheus server port to localhost.

This tutorial uses the monitoring-custom-metrics example and one of its services named sample-metrics-8081. The service exposes its metrics on the standard /metrics endpoint, available on port 8081. You deploy the service (deployment.yaml) along with the ServiceMonitor custom resource (service-monitor.yaml) that instructs Prometheus to scrape metrics:

  • From the service with the k8s-app: metrics label
  • From the /metrics endpoint
  • At 10s interval

This tutorial focuses on the cpu_temperature_celsius metric that is one of the custom metrics exposed by the sample-metrics-8081 service. Using the metric logic implemented in the example, you can observe how the CPU temperature changes in the range between 60 and 90 degrees Celsius when Prometheus calls the /metrics endpoint.
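
To show how the pieces connect, the following sketch outlines the Service part of such a deployment. It is not the exact content of deployment.yaml; it only illustrates how the k8s-app: metrics label and the named port tie the Service to the ServiceMonitor:

apiVersion: v1
kind: Service
metadata:
  name: sample-metrics-8081
  labels:
    # The label that the ServiceMonitor selects on.
    k8s-app: metrics
spec:
  selector:
    # Illustrative Pod selector; the example may use different labels.
    app: sample-metrics
  ports:
  # Illustrative port name; the ServiceMonitor references the Service port.
  - name: web
    port: 8081
    targetPort: 8081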

Prerequisites

To complete the tutorial, you must have one of the following:

  • A cluster with Kyma 1.3 or higher
  • Kyma 1.3 or higher installed locally with the Monitoring module

NOTE: The Monitoring module is not installed by default as a part of the Kyma Lite package.

Steps

Follow this tutorial to:

  • Deploy the sample service with its default configuration.
  • Redirect the metrics to localhost.
  • Redirect the metrics to the Prometheus server to observe the metrics in the Prometheus UI.
  • Clean up the deployed example.

Deploy the example configuration

Follow these steps:

  1. Create the testing-monitoring Namespace.

    kubectl create namespace testing-monitoring
  2. Deploy the sample service in the testing-monitoring Namespace.

    kubectl create -f https://raw.githubusercontent.com/kyma-project/examples/master/monitoring-custom-metrics/deployment/deployment.yaml --namespace=testing-monitoring
  3. Deploy the ServiceMonitor custom resource definition (CRD) in the kyma-system Namespace, which is the default Namespace for all ServiceMonitor CRDs.

    kubectl apply -f https://raw.githubusercontent.com/kyma-project/examples/master/monitoring-custom-metrics/deployment/service-monitor.yaml
  4. Test your deployment.

    kubectl get pods -n testing-monitoring

    You should get a result similar to this one:

    NAME                              READY   STATUS    RESTARTS   AGE
    sample-metrics-6f7c8fcf4b-mlgbx   2/2     Running   0          26m

View metrics on a localhost

Follow these steps:

  1. Run the port-forward command on the sample-metrics-8081 service for port 8081 to check the metrics.

    kubectl port-forward svc/sample-metrics-8081 -n testing-monitoring 8081:8081
  2. Open a browser and access http://localhost:8081/metrics.

You can see the cpu_temperature_celsius metric and its current value of 62 on the list of all metrics exposed by the sample-metrics-8081 service.

metrics on port 8081

Thanks to the example logic, the custom metric value changes each time you refresh the localhost address.
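
In the raw Prometheus exposition format, the relevant lines look similar to this sketch. The HELP and TYPE comments are illustrative, and the value changes with every refresh:

# HELP cpu_temperature_celsius Current CPU temperature in degrees Celsius.
# TYPE cpu_temperature_celsius gauge
cpu_temperature_celsius 62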

View metrics on the Prometheus UI

You can also observe the cpu_temperature_celsius metric in the Prometheus UI and see how its value changes in the pre-defined 10s interval in which Prometheus scrapes the metric values from the service endpoint.

Follow these steps to redirect the metrics:

  1. Run the port-forward command on the monitoring-prometheus service.

    kubectl port-forward svc/monitoring-prometheus -n kyma-system 9090:9090
  2. Access the Prometheus UI to see the service endpoint and its details on the Targets list.

    Prometheus Dashboard

  3. Click the Graph tab, search for the cpu_temperature_celsius metric in the Expression search box, and click the Execute button to check the last value scraped by Prometheus.

    Prometheus Dashboard

    The Prometheus UI shows a new value every 10 seconds upon refreshing the page.

Clean up the configuration

When you finish the tutorial, remove the deployed example and all its resources from the cluster.

Follow these steps:

  1. Remove the deployed ServiceMonitor CRD from the kyma-system Namespace.

    kubectl delete servicemonitor -l example=monitoring-custom-metrics -n kyma-system
  2. Remove the example deployment from the testing-monitoring Namespace.

    kubectl delete all -l example=monitoring-custom-metrics -n testing-monitoring

Create a Grafana dashboard

This tutorial shows how to create and configure a basic Grafana dashboard of a Gauge type. The dashboard shows how the values of the cpu_temperature_celsius metric change in time, representing the current processor temperature ranging from 60 to 90 degrees Celsius. The dashboard shows explicitly when the CPU temperature exceeds the pre-defined threshold of 75 degrees Celsius.

Prerequisites

This tutorial is a follow-up of the Observe application metrics tutorial that uses the monitoring-custom-metrics example. This example deploys the sample-metrics-8081 service which exposes the cpu_temperature_celsius metric. That configuration is required to complete this tutorial.

Steps

Follow these sections to create a Gauge-type dashboard for the cpu_temperature_celsius metric.

Create the dashboard

  1. Navigate to Grafana. It is available under the https://grafana.{DOMAIN} address, where {DOMAIN} is the domain of your Kyma cluster, such as https://grafana.34.63.57.190.xip.io or https://grafana.example.com/. To access it from the Console UI, click Stats & Metrics on the left navigation menu.

    Stats and Metrics

  2. Click the + icon on the left sidebar and select Dashboard from the Create menu.

    Create a dashboard

  3. Select Add Query.

    Add Query

  4. Select Prometheus data source from the Queries to drop-down list and pick the cpu_temperature_celsius metric.

    New dashboard

  5. Toggle the Instant query to be able to retrieve the latest metric value on demand.

    Instant option

  6. Switch to the Visualization section and select the Gauge dashboard type.

    Gauge dashboard type

  7. Click the disk icon in the top right corner of the page to save the changes. Provide a name for the dashboard.

    Save the dashboard

Configure the dashboard

  1. To edit the dashboard settings, go to the Panel Title options and select Edit.

    Edit the dashboard

  2. Back in the Visualization section, set the unit to degrees Celsius to reflect the metric data type.

    Temperature

  3. Set the minimum metric value to 60 and the maximum value to 90 to reflect the cpu_temperature_celsius metric value range. Enable the Show labels option to display this range on the dashboard.

    Minimum and maximum values

  4. Set a red color threshold to 75 for the dashboard to turn red once the CPU temperature reaches and exceeds this value.

    Threshold

  5. Go to the General section and give a title to the dashboard.

    Panel title

  6. Click the disk icon in the top right corner of the page to save the changes. Add an optional note to describe the changes made.

    Note

Verify the dashboard

Refresh the browser to see how the dashboard changes according to the current value of the cpu_temperature_celsius metric.

  • It turns green if the current metric value ranges from 60 to 74 degrees Celsius:

    Green dashboard

  • It turns red if the current metric value ranges from 75 to 90 degrees Celsius:

    Red dashboard

Define alerting rules

This tutorial shows you how to define alerting rules to monitor the health status of your resources. In this example, you will write an alerting rule based on the cpu_temperature_celsius metric. The alert defined in the rule will fire whenever the CPU temperature is equal to or greater than 75 degrees Celsius.

Prerequisites

This tutorial is a follow-up of the Observe application metrics tutorial that uses the monitoring-custom-metrics example. Follow that tutorial to deploy the sample-metrics-8081 service, which exposes the cpu_temperature_celsius metric. That configuration is required to complete this tutorial.

Steps

Follow these steps to create an alerting rule:

  1. Create the PrometheusRule resource holding the configuration of your alerting rule.

    NOTE: Prometheus requires a specific label to identify PrometheusRule definitions. Make sure you set role to alert-rules.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: cpu.temp.rules
      namespace: kyma-system
      labels:
        app: cpu.temp.rules
        prometheus: monitoring
        release: monitoring
        role: alert-rules
    spec:
      groups:
      - name: cpu.temp.rules
        rules:
        - alert: CPUTempHigh
          expr: cpu_temperature_celsius >= 75
          for: 10s
          labels:
            severity: critical
          annotations:
            description: "CPU temperature is equal to or greater than 75 degrees Celsius"
            summary: "CPU temperature is too high"

    Configure your alert rule using the following parameters:

    Parameter | Description | Example value
    groups.name | Specifies the name of the group listing the rules. | cpu.temp.rules
    rules.alert | Specifies the name of the alert. | CPUTempHigh
    rules.expr | A PromQL expression that specifies the condition that must be met for the alert to fire. Build it from PromQL functions and the metrics that Prometheus scrapes. | cpu_temperature_celsius >= 75
    rules.for | Specifies the time period between encountering an active alert for the first time during rule evaluation and firing the alert. | 10s
    rules.labels.severity | Specifies the severity of the alert. | critical
    rules.annotations.description | Provides the alert details. | CPU temperature is equal to or greater than 75 degrees Celsius
    rules.annotations.summary | Provides a short alert summary. | CPU temperature is too high

    For more details on defining alerting rules, see the Prometheus documentation. For a variation of this rule that averages the metric over time, see the sketch after these steps.

  2. Deploy the alerting rule:

    kubectl apply -f {FILE_NAME}.yaml
  3. Run the port-forward command on the Prometheus Pod to access the Prometheus dashboard:

    kubectl port-forward pod/prometheus-monitoring-0 -n kyma-system 9090:9090
  4. Go to http://localhost:9090/rules to view the rule in the dashboard.

  5. Go to http://localhost:9090/alerts to see if the alert fires properly.
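
As a variation of the rule defined in step 1, you can base the expression on an averaged value to reduce flapping when the temperature only briefly crosses the threshold. This sketch uses the standard PromQL avg_over_time function and is not part of the monitoring-custom-metrics example:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  # Illustrative name; reuse the labels required by Prometheus, as in step 1.
  name: cpu.temp.avg.rules
  namespace: kyma-system
  labels:
    prometheus: monitoring
    release: monitoring
    role: alert-rules
spec:
  groups:
  - name: cpu.temp.avg.rules
    rules:
    - alert: CPUTempHighAvg
      # The average of the metric over the last minute must stay at or above 75 for 10 seconds.
      expr: avg_over_time(cpu_temperature_celsius[1m]) >= 75
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: "Average CPU temperature over the last minute is too high"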

Send notifications to Slack

This tutorial shows you how to configure Alertmanager to send notifications. Alertmanager supports several notification receivers, but this tutorial only focuses on sending notifications to Slack.

Prerequisites

This tutorial is a follow-up of the Observe application metrics and the Define alerting rules tutorials that use the monitoring-custom-metrics example. Follow those tutorials to deploy the sample-metrics-8081 service, which exposes the cpu_temperature_celsius metric, and to create an alert based on it. That configuration is required to complete this tutorial.

Steps

Follow these steps to configure notifications for Slack every time Alertmanager triggers and resolves the CPUTempHigh alert.

  1. Install the Incoming WebHooks application using Slack App Directory.

    NOTE: The approval of your Slack workspace administrator may be necessary to install the application.

  2. Configure the application to receive notifications coming from third-party services. See the Slack Incoming WebHooks documentation to find out how to set up the configuration for Slack.

    The integration settings should look similar to the following:

    Integration Settings

  3. Override the Alertmanager configuration. The configuration for notification receivers is located in the alertmanager.yaml.tpl template. By default, it contains settings for VictorOps, Slack, and webhooks. Define a Secret to override the default values used by the chart.

    apiVersion: v1
    kind: Secret
    metadata:
      name: monitoring-config-overrides
      namespace: kyma-installer
      labels:
        kyma-project.io/installation: ""
        installer: overrides
        component: monitoring
    type: Opaque
    stringData:
      global.alertTools.credentials.slack.channel: "{CHANNEL_NAME}"
      global.alertTools.credentials.slack.apiurl: "{WEBHOOK_URL}"

    Use the following parameters:

    Parameter | Description
    global.alertTools.credentials.slack.channel | Specifies the Slack channel which receives notifications on new alerts, such as test-monitoring-alerts.
    global.alertTools.credentials.slack.apiurl | Specifies the Slack Incoming Webhook URL to which alerts triggered by Prometheus rules are sent. The Incoming WebHooks application provides the Webhook URL, such as https://hooks.slack.com/services/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, which you paste into this configuration.

    For details on the Alertmanager chart configuration and parameters, see the Alertmanager sub-chart section in Configuration.

  4. Deploy the Secret. Use this command:

    kubectl apply -f {FILE_NAME}.yaml
  5. Proceed with Kyma installation.

    NOTE: If you add the overrides at runtime, trigger the update process using this command:

    kubectl label installation/kyma-installation action=install
  6. Verify that your Slack channel receives notifications about firing and resolved alerts. See the example:

    Alert Notifications