Monitoring

Overview

Kyma comes bundled with third-party applications like Prometheus, Alertmanager, and Grafana, which offer monitoring functionality for all Kyma resources. These applications are deployed during the Kyma cluster installation, along with a set of pre-defined alerting rules, Grafana dashboards, and Prometheus configuration.

The whole installation package provides end-to-end Kubernetes cluster monitoring that allows you to:

  • View metrics exposed by the Pods.
  • Use the metrics to create descriptive dashboards that monitor any Pod anomalies.
  • Manage the default alert rules and create new ones.
  • Set up channels for notifications informing of any detected alerts.

NOTE: The monitoring component is available by default in the cluster installation, but disabled in the Kyma Lite local installation on Minikube. Enable the component to install it with the local profile.

Architecture

Before you go into component details, find out more about the end-to-end monitoring flow in Kyma.

End-to-end monitoring flow

The complete monitoring flow in Kyma comes down to these components and steps:

End-to-end monitoring flow

  1. Upon Kyma installation on a cluster, Prometheus Operator creates a Prometheus instance with the default configuration.
  2. The Prometheus server periodically polls all metrics exposed on /metrics endpoints of ports specified in ServiceMonitor custom resources. Prometheus stores these metrics in a time-series database. A minimal ServiceMonitor sketch follows this list.
  3. If Prometheus detects any metric values matching the logic of alerting rules, it triggers the alerts and passes them to Alertmanager.
  4. If you manually configure a notification channel, you can instantly receive detailed information on metric alerts detected by Prometheus.
  5. You can visualize metrics and track their historical data on Grafana dashboards.
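
For reference, this is a minimal sketch of the ServiceMonitor resource kind referenced in step 2. The name, label selector, and port name are illustrative placeholders, not the Kyma defaults:

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: sample-service-monitor
      namespace: kyma-system
    spec:
      selector:
        matchLabels:
          k8s-app: metrics
      endpoints:
      - port: web
        interval: 10s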

Monitoring components

The diagram presents monitoring components and the way they interact with one another.

Monitoring components

  1. Prometheus Operator creates a Prometheus instance, manages its deployment, and provides configuration for it. It also deploys Alertmanager and watches ServiceMonitor custom resources, which specify monitoring definitions for groups of services.

  2. Prometheus collects metrics from Pods. Metrics are time-stamped data that provide information on the running jobs, workload, CPU consumption, memory usage, and more. To obtain such metrics, Prometheus uses the kube-state-metrics service, which generates the metrics from Kubernetes API objects and exposes them on the /metrics HTTP endpoint.
    Pods can also contain applications with custom metrics, such as the total storage space available in the MinIO server. Prometheus stores this polled data in a time-series database (TSDB) and runs rules over it to generate alerts if it detects any metric anomalies. It also scrapes metrics provided by Node Exporter, which exposes hardware and operating-system metrics of the cluster nodes as Prometheus metrics.

  3. ServiceMonitors monitor services and specify the endpoints from which Prometheus should poll the metrics. Even if your application exposes many metrics, Prometheus polls only those available at the /metrics endpoints of ports specified in ServiceMonitor custom resources.

  4. Alertmanager receives alerts from Prometheus and forwards this data to the configured Slack or VictorOps channels. You can use PrometheusRules to define alert conditions for metrics. Kyma provides a set of out-of-the-box alerting rules that are passed from Prometheus to Alertmanager. The definitions of such rules specify the alert logic, the value at which alerts are triggered, the alerts' severity, and more.

    NOTE: There are no notification channels configured in the default monitoring installation. The current configuration allows you to add either Slack or VictorOps channels.

  5. Grafana provides a dashboard and a graph editor to visualize metrics collected from the Prometheus API. Grafana uses the query language called PromQL to select and aggregate metrics data from the Prometheus database. To access the Grafana UI, use the https://grafana.{DOMAIN} address, where {DOMAIN} is the domain of your Kyma cluster.
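
For example, these are valid PromQL expressions you can run in Grafana's query editor or the Prometheus UI. The cpu_temperature_celsius metric comes from the tutorials later in this document; the other metrics are standard kube-state-metrics and cAdvisor metrics, and label names can vary between Kubernetes versions:

    # Current value of the custom metric used in the monitoring tutorials
    cpu_temperature_celsius

    # Number of Pods currently in the Running phase, per Namespace
    sum(kube_pod_status_phase{phase="Running"}) by (namespace)

    # CPU usage rate over the last 5 minutes, aggregated per Namespace
    sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)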

Details

Alertmanager

Alertmanager receives and manages alerts coming from Prometheus. It can then forward the notifications about fired alerts to specific channels, such as Slack or VictorOps.

Alertmanager configuration

Use the following files to configure and manage Alertmanager:

  • alertmanager.yaml which deploys the Alertmanager Pod.
  • values.yaml which you can use to define core Alertmanager configuration and alerting channels. For details on configuration elements, see the Prometheus documentation.
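
For orientation, the rendered Alertmanager configuration typically contains a route and receiver section similar to this sketch; the receiver name and channel are placeholders, and the exact structure Kyma renders from values.yaml may differ:

    route:
      receiver: slack-notifications
      routes:
      - match:
          severity: critical
        receiver: slack-notifications
    receivers:
    - name: slack-notifications
      slack_configs:
      - channel: '#test-monitoring-alerts'
        api_url: 'https://hooks.slack.com/services/{WEBHOOK_PATH}'
        send_resolved: true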

Alerting rules

Kyma comes with a set of alerting rules provided out of the box. These rules provide alerting configuration for logging, web apps, REST services, and custom Kyma rules. You can also define your own alerting rules. To learn how, see the tutorial.

Configuration

Alertmanager sub-chart

To configure the Alertmanager sub-chart, override the default values of its values.yaml file. This document describes parameters that you can set.

TIP: To learn more about how to use overrides in Kyma, see the documents on Helm overrides for Kyma installation.

Configurable parameters

This table lists the configurable parameters, their descriptions, and default values:

| Parameter | Description | Default value |
|---|---|---|
| global.alertTools.credentials.slack.apiurl | Specifies the URL endpoint which sends alerts triggered by Prometheus rules. | None |
| global.alertTools.credentials.slack.channel | Refers to the Slack channel which receives notifications on new alerts. | None |
| global.alertTools.credentials.slack.matchExpression | Notifications are sent only for those alerts whose labels match the specified expression. | "severity: critical" |
| global.alertTools.credentials.slack.sendResolved | Specifies whether to notify about resolved alerts. | true |
| global.alertTools.credentials.victorOps.routingkey | Defines the team routing key in VictorOps. | None |
| global.alertTools.credentials.victorOps.apikey | Defines the team API key in VictorOps. | None |
| global.alertTools.credentials.victorOps.matchExpression | Notifications are sent only for those alerts whose labels match the specified expression. | "severity: critical" |
| global.alertTools.credentials.victorOps.sendResolved | Specifies whether to notify about resolved alerts. | true |

NOTE: Override all configurable values for the Alertmanager sub-chart using Secrets (kind: Secret).
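
For example, the Slack parameters from the table can be set in a Secret like this one, which mirrors the Secret used in the Send notifications to Slack tutorial later in this document:

    apiVersion: v1
    kind: Secret
    metadata:
      name: monitoring-config-overrides
      namespace: kyma-installer
      labels:
        kyma-project.io/installation: ""
        installer: overrides
        component: monitoring
    type: Opaque
    stringData:
      global.alertTools.credentials.slack.channel: "{CHANNEL_NAME}"
      global.alertTools.credentials.slack.apiurl: "{WEBHOOK_URL}"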

Grafana sub-chart

To configure the Grafana sub-chart, override the default values of its values.yaml file. This document describes parameters that you can set.

TIP: To learn more about how to use overrides in Kyma, see the documents on Helm overrides for Kyma installation.

Configurable parameters

This table lists the configurable parameters, their descriptions, and default values:

| Parameter | Description | Default value |
|---|---|---|
| env.GF_USERS_DEFAULT_THEME | Specifies the background theme of the Grafana UI. You can change it to dark. | light |
| env.GF_AUTH_GENERIC_OAUTH_ENABLED | Enables the generic OAuth plugin for Grafana, which is pre-configured based on the in-cluster Dex setup. | true |
| env.GF_USERS_AUTO_ASSIGN_ORG_ROLE | Specifies the role automatically assigned to a user authenticated by Grafana. You can change the value to Viewer or Admin. | Editor |
| env.GF_AUTH_ANONYMOUS_ENABLED | Enables anonymous login to Grafana. | false |
| env.GF_AUTH_ANONYMOUS_ORG_ROLE | Specifies the role automatically assigned to an anonymous user. You can change the value to Viewer or Admin. | Editor |
| env.GF_LOG_LEVEL | Specifies the log level used by Grafana. Be aware that logs at the info level may print logins, which can potentially be users' email addresses. | warn |
| persistence.enabled | Specifies whether the user and dashboard data used by Grafana is durably persisted. If enabled, the Grafana database is mounted on a PersistentVolume and survives restarts. If you use Grafana in a highly available setup with an external database, set this flag to false. | true |
| env.GF_ALERTING_ENABLED | Enables Grafana's alerting feature. | true |
| env.GF_DASHBOARDS_MIN_REFRESH_INTERVAL | Specifies the minimum refresh interval for dashboards to prevent using a lower value. Use the official Grafana syntax to set the value. | 10s |
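
These parameters are not credentials, so, assuming the same override mechanism shown for the Alertmanager sub-chart, they can be provided in a ConfigMap. The grafana. key prefix below is an assumption about how the sub-chart values map to override keys; check the chart for the exact paths:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: monitoring-grafana-overrides
      namespace: kyma-installer
      labels:
        kyma-project.io/installation: ""
        installer: overrides
        component: monitoring
    data:
      grafana.env.GF_USERS_DEFAULT_THEME: "dark"
      grafana.env.GF_USERS_AUTO_ASSIGN_ORG_ROLE: "Viewer"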

Prometheus sub-chart

To configure the Prometheus sub-chart, override the default values of its values.yaml file. This document describes parameters that you can set.

TIP: To learn more about how to use overrides in Kyma, see the documents on Helm overrides for Kyma installation.

Configurable parameters

This table lists the configurable parameters, their descriptions, and default values:

| Parameter | Description | Default value |
|---|---|---|
| retention | Specifies the period for which Prometheus keeps recent metrics in memory, so that it can avoid reading all data from disk. This retention time applies to in-memory storage only. | 2h |
| storageSpec.volumeClaimTemplate.spec.resources.requests.storage | Specifies the size of the PersistentVolumeClaim (PVC). | 4Gi |

Monitoring profiles

Overview

To ensure optimal performance and avoid high memory and CPU consumption, you can install Kyma with one of the Monitoring profiles.

Default profile

The default profile is used in the Kyma cluster installation when you deploy Kyma with Monitoring enabled. You can use it for development purposes, but bear in mind that it is not production-ready. The profile defines a short data retention time (1 day), which may not be enough to identify and solve issues in case of prolonged troubleshooting. To make Monitoring production-ready and avoid potential issues, configure Monitoring to use the production profile.

Production profile

To make sure Monitoring runs in a production environment, this profile introduces the following changes:

  • Increased retention time to prevent data loss in case of prolonged troubleshooting
  • Increased memory and CPU values to ensure stable performance

Local profile

If you install Kyma locally on Minikube, Monitoring uses a lightweight configuration by default to avoid high memory and CPU consumption.

Parameters

The table shows the parameters of each profile and their values:

| Parameter | Description | Default profile | Production profile | Local profile |
|---|---|---|---|---|
| retentionSize | Maximum number of bytes that storage blocks can use. The oldest data is removed first. | 2GB | 15GB | 500MB |
| retention | Period for which Prometheus stores metrics in the in-memory database. Prometheus keeps the recent data for the specified amount of time to avoid reading all data from disk. This parameter applies to in-memory storage only. | 1d | 30d | 2h |
| prometheusSpec.volumeClaimTemplate.spec.resources.requests.storage | Amount of storage requested by the Prometheus Pod. | 10Gi | 20Gi | 1Gi |
| prometheusSpec.resources.limits.cpu | Maximum amount of CPU available for the Prometheus Pod to use. | 600m | 1 | 150m |
| prometheusSpec.resources.limits.memory | Maximum amount of memory available for the Prometheus Pod to use. | 1500Mi | 3Gi | 800Mi |
| prometheusSpec.resources.requests.cpu | Amount of CPU requested by the Prometheus Pod to operate. | 300m | 300m | 100m |
| prometheusSpec.resources.requests.memory | Amount of memory requested by the Prometheus Pod to operate. | 1000Mi | 1Gi | 200Mi |
| alertmanager.alertmanagerSpec.retention | Period for which Alertmanager retains data. | 120h | 240h | 1h |
| grafana.persistence.enabled | Specifies whether the Grafana database is stored on a PersistentVolume. | true | true | false |

Use profiles

The default and local profiles are installed automatically during cluster and local installation respectively. The production profile is a Helm override that you can apply before Kyma installation or at runtime.

Production profile

You can deploy a Kyma cluster with Monitoring configured to use the production profile, or add the configuration at runtime. Follow these steps (see the override sketch after the list):

  • Install Kyma with production-ready Monitoring
  • Enable configuration in a running cluster
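
As a sketch of the override route, and assuming the parameters from the profiles table map to override keys under the prometheus sub-chart, a production-profile ConfigMap could look like this; the key prefixes are an assumption, so check the chart's values.yaml for the exact paths:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: monitoring-production-overrides
      namespace: kyma-installer
      labels:
        kyma-project.io/installation: ""
        installer: overrides
        component: monitoring
    data:
      prometheus.prometheusSpec.retention: "30d"
      prometheus.prometheusSpec.retentionSize: "15GB"
      prometheus.prometheusSpec.resources.limits.memory: "3Gi"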

Tutorials

Overview

The set of monitoring tutorials you are about to read describes the complete monitoring flow for your services in Kyma. Going through the tutorials, you become familiar with Kyma's built-in monitoring applications, such as Prometheus, Grafana, and Alertmanager. This hands-on experience helps you understand how and where you can observe and visualize your service metrics, and how to monitor them for alerting values.

All the tutorials use the monitoring-custom-metrics example and one of its services called sample-metrics-8081. This service exposes the cpu_temperature_celsius custom metric on the /metrics endpoint. This custom metric is the central element of the whole tutorial set. The metric value simulates the current processor temperature and changes randomly between 60 and 90 degrees Celsius. The alerting threshold in these tutorials is 75 degrees Celsius. If the temperature reaches or exceeds this value, the Grafana dashboard, Prometheus rule, and Alertmanager notifications you create inform you about it.

The tutorial set consists of these documents:

  1. Observe application metrics in which you expose the cpu_temperature_celsius metric on localhost and in the Prometheus UI. You then observe how the metric value changes at the predefined 10-second interval at which Prometheus scrapes the metric values from the service's /metrics endpoint.

  2. Create a Grafana dashboard in which you create a Grafana dashboard of a Gauge type for the cpu_temperature_celsius metric. This dashboard shows explicitly when the CPU temperature is equal to or higher than the predefined threshold of 75 degrees Celsius, at which point the dashboard turns red.

  3. Define alerting rules in which you define the CPUTempHigh alerting rule by creating a PrometheusRule resource. Prometheus accesses the /metrics endpoint every 10 seconds and validates the current value of the cpu_temperature_celsius metric. If the value is equal to or higher than 75 degrees Celsius, Prometheus waits for 10 seconds and rechecks it. If the value still meets the alert condition, Prometheus triggers the rule. You can observe both the rule and the alert it generates on the Prometheus dashboard.

  4. Send notifications to Slack in which you configure Alertmanager to send notifications on Prometheus alerts to a Slack channel. This way, whenever Prometheus triggers or resolves the CPUTempHigh alert, Alertmanager sends a notification to the test-monitoring-alerts Slack channel defined for the tutorial.

See the diagram for an overview of the purpose of the tutorials and the tools used in them:

Monitoring tutorials

Observe application metrics

This tutorial shows how you can observe your application metrics. Learn how to list all metrics exposed by a sample Go service and watch their changing values by redirecting the metrics port and the default Prometheus server port to localhost.

This tutorial uses the monitoring-custom-metrics example and one of its services named sample-metrics-8081. The service exposes its metrics on the standard /metrics endpoint, available on port 8081. You deploy the service (deployment.yaml) along with the ServiceMonitor custom resource (service-monitor.yaml) that instructs Prometheus to scrape metrics (see the inspection command after the list):

  • From the service with the k8s-app: metrics label
  • From the /metrics endpoint
  • At 10s interval
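
Once you deploy the resources in the steps that follow, you can print the actual ServiceMonitor to inspect this scrape configuration; the label selector matches the one used in the cleanup step of this tutorial:

    kubectl get servicemonitors -l example=monitoring-custom-metrics -n kyma-system -o yaml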

This tutorial focuses on the cpu_temperature_celsius metric, which is one of the custom metrics exposed by the sample-metrics-8081 service. Using the metric logic implemented in the example, you can observe how the CPU temperature changes in the range between 60 and 90 degrees Celsius whenever Prometheus calls the /metrics endpoint.

Prerequisites

To complete the tutorial, you must meet one of these prerequisites and have:

  • A cluster with Kyma 1.3 or higher
  • Kyma 1.3 or higher installed locally with the Monitoring module

NOTE: The monitoring module is not installed by default as a part of the Kyma Lite package.

Steps

Follow this tutorial to:

  • Deploy the sample service with its default configuration.
  • Redirect the metrics to the localhost.
  • Redirect the metrics to the Prometheus server to observe the metrics in the Prometheus UI.
  • Clean up the deployed example.

Deploy the example configuration

Follow these steps:

  1. Create the testing-monitoring Namespace.

    kubectl create namespace testing-monitoring
  2. Deploy the sample service in the testing-monitoring Namespace.

    kubectl create -f https://raw.githubusercontent.com/kyma-project/examples/master/monitoring-custom-metrics/deployment/deployment.yaml --namespace=testing-monitoring
  3. Deploy the ServiceMonitor custom resource in the kyma-system Namespace, which is the default Namespace for all ServiceMonitor resources.

    kubectl apply -f https://raw.githubusercontent.com/kyma-project/examples/master/monitoring-custom-metrics/deployment/service-monitor.yaml
  4. Test your deployment.

    kubectl get pods -n testing-monitoring

    You should get a result similar to this one:

    NAME                              READY   STATUS    RESTARTS   AGE
    sample-metrics-6f7c8fcf4b-mlgbx   2/2     Running   0          26m

View metrics on localhost

Follow these steps:

  1. Run the port-forward command on the sample-metrics-8081 service for port 8081 to check the metrics.

    kubectl port-forward svc/sample-metrics-8081 -n testing-monitoring 8081:8081
  2. Open a browser and access http://localhost:8081/metrics.

You can see the cpu_temperature_celsius metric and its current value of 62 on the list of all metrics exposed by the sample-metrics-8081 service.
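
The metric is served in the Prometheus text exposition format, similar to this sketch; the HELP text and the value shown here are illustrative:

    # HELP cpu_temperature_celsius Current temperature of the CPU.
    # TYPE cpu_temperature_celsius gauge
    cpu_temperature_celsius 62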

metrics on port 8081

Thanks to the example logic, the custom metric value changes each time you refresh the localhost address.

View metrics on the Prometheus UI

You can also observe the cpu_temperature_celsius metric in the Prometheus UI and see how its value changes in the pre-defined 10s interval in which Prometheus scrapes the metric values from the service endpoint.

Follow these steps to redirect the metrics:

  1. Run the port-forward command on the monitoring-prometheus service.

    kubectl port-forward svc/monitoring-prometheus -n kyma-system 9090:9090
  2. Access the Prometheus UI to see the service endpoint and its details on the Targets list.

    Prometheus Dashboard

  3. Click the Graph tab, search for the cpu_temperature_celsius metric in the Expression search box, and click the Execute button to check the last value scraped by Prometheus.

    Prometheus Dashboard

    The Prometheus UI shows a new value every 10 seconds upon refreshing the page.

Clean up the configuration

When you finish the tutorial, remove the deployed example and all its resources from the cluster.

NOTE: Do not clean up the resources if you want to continue with the next tutorial as these resources are used there as well.

Follow these steps:

  1. Remove the deployed ServiceMonitor custom resource from the kyma-system Namespace.

    kubectl delete servicemonitor -l example=monitoring-custom-metrics -n kyma-system
  2. Remove the example deployment from the testing-monitoring Namespace.

    kubectl delete all -l example=monitoring-custom-metrics -n testing-monitoring
  3. Remove the testing-monitoring Namespace.

    kubectl delete namespace testing-monitoring

Create a Grafana dashboard

This tutorial shows how to create and configure a basic Grafana dashboard of a Gauge type. The dashboard shows how the values of the cpu_temperature_celsius metric change over time, representing the current processor temperature ranging from 60 to 90 degrees Celsius. The dashboard shows explicitly when the CPU temperature reaches or exceeds the pre-defined threshold of 75 degrees Celsius.

Prerequisites

This tutorial is a follow-up of the Observe application metrics tutorial that uses the monitoring-custom-metrics example. This example deploys the sample-metrics-8081 service which exposes the cpu_temperature_celsius metric. That configuration is required to complete this tutorial.

Steps

Follow these sections to create the Gauge dashboard type for the cpu_temperature_celsius metric.

Create the dashboard

  1. Navigate to Grafana. It is available under the https://grafana.{DOMAIN} address, where {DOMAIN} is the domain of your Kyma cluster, such as https://grafana.34.63.57.190.xip.io or https://grafana.example.com/. To access it from the Console UI, click Stats & Metrics on the left navigation menu.

    Stats and Metrics

  2. Click the + icon on the left sidebar and select Dashboard from the Create menu.

    Create a dashboard

  3. Select Add Query.

    Add Query

  4. Select Prometheus data source from the Queries to drop-down list and pick the cpu_temperature_celsius metric.

    New dashboard

  5. Toggle the Instant query to be able to retrieve the latest metric value on demand.

    Instant option

  6. Switch to the Visualization section and select the Gauge dashboard type.

    Gauge dashboard type

  7. Click the disk icon in the top right corner of the page to save the changes. Provide a name for the dashboard.

    Save the dashboard

Configure the dashboard

  1. To edit the dashboard settings, go to the Panel Title options and select Edit.

    Edit the dashboard

  2. Back in the Visualization section, set the measuring unit to degrees Celsius to reflect the metric data type.

    Temperature

  3. Set the minimum metric value to 60 and the maximum value to 90 to reflect the cpu_temperature_celsius metric value range. Enable the Labels option to display this range on the dashboard.

    Minimum and maximum values

  4. Set a red color threshold at 75 for the dashboard to turn red once the CPU temperature reaches or exceeds this value.

    Threshold

  5. Go to the General section and give a title to the dashboard.

    Panel title

  6. Click the disk icon in the top right corner of the page to save the changes. Add an optional note to describe the changes made.

    Note

Verify the dashboard

Refresh the browser to see how the dashboard changes according to the current value of the cpu_temperature_celsius metric.

  • It turns green if the current metric value ranges from 60 to 74 degrees Celsius:

    Green dashboard

  • It turns red if the current metric value ranges from 75 to 90 degrees Celsius:

    Red dashboard

NOTE: You can also define the dashboard's ConfigMap and add it to the resources folder under the given component's chart. To make the dashboard visible, simply use the kubectl apply command to deploy it. For details on adding monitoring to components, see the README.md document.

Define alerting rules

This tutorial shows you how to define alerting rules to monitor the health status of your resources. In this example, you will write an alerting rule based on the cpu_temperature_celsius metric. The alert defined in the rule will fire whenever the CPU temperature is equal to or greater than 75 degrees Celsius.

Prerequisites

This tutorial is a follow-up of the Observe application metrics tutorial that uses the monitoring-custom-metrics example. Follow that tutorial to deploy the sample-metrics-8081 service which exposes the cpu_temperature_celsius metric. That configuration is required to complete this tutorial.

Steps

Follow these steps to create an alerting rule:

  1. Create the PrometheusRule resource holding the configuration of your alerting rule.

    NOTE: Prometheus requires specific labels to identify PrometheusRule definitions. Make sure you set app and release to monitoring.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: cpu.temp.rules
      namespace: kyma-system
      labels:
        app: monitoring
        release: monitoring
    spec:
      groups:
      - name: cpu.temp.rules
        rules:
        - alert: CPUTempHigh
          expr: cpu_temperature_celsius >= 75
          for: 10s
          labels:
            severity: critical
          annotations:
            description: "CPU temperature is equal to or greater than 75 degrees Celsius"
            summary: "CPU temperature is too high"

    Configure your alert rule using the following parameters:

    | Parameter | Description | Example value |
    |---|---|---|
    | groups.name | Specifies the name of the group listing the rules. | cpu.temp.rules |
    | rules.alert | Specifies the name of the alert. | CPUTempHigh |
    | rules.expr | A PromQL expression that specifies the conditions that must be met for the alert to fire. | cpu_temperature_celsius >= 75 |
    | rules.for | Specifies the time period between encountering an active alert for the first time during rule evaluation and firing the alert. | 10s |
    | rules.labels.severity | Specifies the severity of the alert. | critical |
    | rules.annotations.description | Provides the alert details. | CPU temperature is equal to or greater than 75 degrees Celsius |
    | rules.annotations.summary | Provides a short alert summary. | CPU temperature is too high |

    For more details on defining alerting rules, see the Prometheus documentation.

  2. Deploy the alerting rule:

    kubectl apply -f {FILE_NAME}.yaml
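
    To confirm that the rule was created, you can list PrometheusRule resources in the Namespace; prometheusrules is the resource name registered by Prometheus Operator:

    kubectl get prometheusrules cpu.temp.rules -n kyma-system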
  3. Run the port-forward command on the monitoring-prometheus service to access the Prometheus dashboard:

    kubectl port-forward svc/monitoring-prometheus -n kyma-system 9090:9090
  4. Go to http://localhost:9090/rules to view the rule in the dashboard.

    Rule on the dashboard

  5. Go to http://localhost:9090/alerts to see if the alert fires appropriately.

    Alert on the dashboard

Send notifications to Slack

This tutorial shows you how to configure Alertmanager to send notifications. Alertmanager supports several notification receivers, but this tutorial only focuses on sending notifications to Slack.

Prerequisites

This tutorial is a follow-up of the Observe application metrics and the Define alerting rules tutorials that use the monitoring-custom-metrics example. Follow those tutorials to deploy the sample-metrics-8081 service, which exposes the cpu_temperature_celsius metric, and to create an alert based on it. That configuration is required to complete this tutorial.

Steps

Follow these steps to configure notifications for Slack every time Alertmanager triggers and resolves the CPUTempHigh alert.

  1. Install the Incoming WebHooks application from the Slack App Directory.

    NOTE: The approval of your Slack workspace administrator may be necessary to install the application.

  2. Configure the application to receive notifications coming from third-party services. Read the instructions to find out how to set up the configuration for Slack.

    The integration settings should look similar to the following:

    Integration Settings

  3. Override Alertmanager configuration. The configuration for notification receivers is located in the template. By default, it contains settings for VictorOps and Slack. Define a Secret to override default values used by the chart.

    apiVersion: v1
    kind: Secret
    metadata:
      name: monitoring-config-overrides
      namespace: kyma-installer
      labels:
        kyma-project.io/installation: ""
        installer: overrides
        component: monitoring
    type: Opaque
    stringData:
      global.alertTools.credentials.slack.channel: "{CHANNEL_NAME}"
      global.alertTools.credentials.slack.apiurl: "{WEBHOOK_URL}"

    Use the following parameters:

    | Parameter | Description |
    |---|---|
    | global.alertTools.credentials.slack.channel | Specifies the Slack channel which receives notifications on new alerts, such as test-monitoring-alerts. |
    | global.alertTools.credentials.slack.apiurl | Specifies the URL endpoint which sends alerts triggered by Prometheus rules. The Incoming WebHooks application provides the Webhook URL, such as https://hooks.slack.com/services/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, which you paste into this configuration. |

    For details on Alertmanager chart configuration and parameters, see the configuration document.
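
    Before wiring the webhook into Alertmanager, you can verify it directly with a test message, following Slack's Incoming WebHooks usage; replace {WEBHOOK_URL} with the URL from your integration settings:

    curl -X POST -H 'Content-type: application/json' --data '{"text": "Alertmanager webhook test"}' {WEBHOOK_URL}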

  4. Deploy the Secret. Use this command:

    kubectl apply -f {FILE_NAME}.yaml
  5. Proceed with Kyma installation.

    NOTE: If you add the overrides at runtime, trigger the update process using this command:

    kubectl -n default label installation/kyma-installation action=install

    NOTE: If the rule you created is removed during the update, re-apply it following the Define alerting rules tutorial.

  6. Verify if your Slack channel receives alert notifications about firing and resolved alerts. See the example:

    Alert Notifications