Monitoring the DevWorkspace operator

This chapter describes how to configure an example monitoring stack to process metrics exposed by the DevWorkspace operator. You must enable the DevWorkspace operator to follow the instructions in this chapter. See Enabling DevWorkspace operator.

Collecting DevWorkspace operator metrics with Prometheus

This section describes how to use Prometheus to collect, store, and query metrics about the DevWorkspace operator.

Prerequisites
  • The devworkspace-controller-metrics service is exposing metrics on port 8443.

  • The devworkspace-webhookserver service is exposing metrics on port 9443. This is the default port.

  • Prometheus 2.26.0 or later is running. The Prometheus console is running on port 9090 with a corresponding service and route. See First steps with Prometheus.

Procedure
  1. Create a ClusterRoleBinding to bind the ServiceAccount associated with Prometheus to the devworkspace-controller-metrics-reader ClusterRole. Without the ClusterRoleBinding, you cannot access DevWorkspace metrics because they are protected with role-based access control (RBAC).

    Example 1. ClusterRole example
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: devworkspace-controller-metrics-reader
    rules:
    - nonResourceURLs:
      - /metrics
      verbs:
      - get
    Example 2. ClusterRoleBinding example
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: devworkspace-controller-metrics-binding
    subjects:
      - kind: ServiceAccount
        name: <ServiceAccount name associated with the Prometheus Pod>
        namespace: <Prometheus namespace>
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: devworkspace-controller-metrics-reader
  2. Configure Prometheus to scrape metrics from port 8443, exposed by the devworkspace-controller-metrics service, and from port 9443, exposed by the devworkspace-webhookserver service.

    Example 3. Prometheus configuration example
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
    data:
      prometheus.yml: |-
          global:
            scrape_interval:     5s             # (1)
            evaluation_interval: 5s             # (2)
          scrape_configs:                       # (3)
            - job_name: 'DevWorkspace'
              authorization:
                type: Bearer
                credentials_file: '/var/run/secrets/kubernetes.io/serviceaccount/token'
              tls_config:
                insecure_skip_verify: true
              static_configs:
                - targets: ['devworkspace-controller-metrics:8443']  # (4)
            - job_name: 'DevWorkspace webhooks'
              authorization:
                type: Bearer
                credentials_file: '/var/run/secrets/kubernetes.io/serviceaccount/token'
              tls_config:
                insecure_skip_verify: true
              static_configs:
                - targets: ['devworkspace-webhookserver:9443']  # (5)
(1) Rate at which a target is scraped.
(2) Rate at which recording and alerting rules are re-checked.
(3) Resources that Prometheus monitors. In the default configuration, two jobs, DevWorkspace and DevWorkspace webhooks, scrape the time series data exposed by the devworkspace-controller-metrics and devworkspace-webhookserver services.
(4) Scrapes metrics from port 8443.
(5) Scrapes metrics from port 9443.
Verification steps

DevWorkspace-specific metrics

This section describes the DevWorkspace-specific metrics exposed by the devworkspace-controller-metrics service.

Table 1. Metrics
Name                                Type       Description                                                       Labels
devworkspace_started_total          Counter    Number of DevWorkspace starting events.                           source, routingclass
devworkspace_started_success_total  Counter    Number of DevWorkspaces successfully entering the Running phase.  source, routingclass
devworkspace_fail_total             Counter    Number of failed DevWorkspaces.                                   source, reason
devworkspace_startup_time           Histogram  Total time taken to start a DevWorkspace, in seconds.             source, routingclass
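
The metrics in Table 1 can be combined into derived series. The following is a minimal sketch of a Prometheus rule file, not part of the DevWorkspace operator itself. The group name, the recorded series names, and the 5m window are illustrative assumptions. The sketch relies on the standard _sum, _count, and _bucket series that Prometheus client libraries expose for a histogram such as devworkspace_startup_time.

```yaml
# Hypothetical rule file, for example devworkspace.rules.yml.
# Group name, recorded series names, and the 5m window are assumptions.
groups:
  - name: devworkspace-startup
    rules:
      # Average DevWorkspace startup time, in seconds, over the last 5 minutes.
      - record: devworkspace:startup_time_seconds:avg5m
        expr: |
          sum(rate(devworkspace_startup_time_sum[5m]))
          /
          sum(rate(devworkspace_startup_time_count[5m]))
      # Estimated 95th percentile of startup time, per routingclass.
      - record: devworkspace:startup_time_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le, routingclass) (rate(devworkspace_startup_time_bucket[5m]))
          )
      # Ratio of failed DevWorkspaces to DevWorkspace starting events.
      - record: devworkspace:fail_ratio:rate5m
        expr: |
          sum(rate(devworkspace_fail_total[5m]))
          /
          sum(rate(devworkspace_started_total[5m]))
```

Prometheus evaluates such a file only if it is listed under rule_files in prometheus.yml. The ratio rules return no data while no DevWorkspace has started in the window, because the denominator is then zero.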

Table 2. Labels
Name          Description                                                               Values
source        The controller.devfile.io/devworkspace-source label of the DevWorkspace.  string
routingclass  The spec.routingclass of the DevWorkspace.                                "basic|cluster|cluster-tls|web-terminal"
reason        The workspace startup failure reason.                                     "BadRequest|InfrastructureFailure|Unknown"

Table 3. Startup failure reasons
Name                   Description
BadRequest             Startup failure due to an invalid devfile used to create a DevWorkspace.
InfrastructureFailure  Startup failure due to the following errors: CreateContainerError, RunContainerError, FailedScheduling, FailedMount.
Unknown                Unknown failure reason.
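
The reason label makes it possible to alert on one class of startup failure. The following Prometheus alerting rule is a sketch only: the alert name, the threshold of 5 failures, the 10m and 5m durations, and the severity label are hypothetical values chosen for illustration and need tuning for a real cluster.

```yaml
# Hypothetical alerting rule; name, threshold, and durations are assumptions.
groups:
  - name: devworkspace-alerts
    rules:
      - alert: DevWorkspaceInfrastructureFailures
        # More than 5 startup failures with reason InfrastructureFailure
        # in the last 10 minutes, summed over all sources.
        expr: |
          sum(increase(devworkspace_fail_total{reason="InfrastructureFailure"}[10m])) > 5
        # The condition must hold for 5 minutes before the alert fires.
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: DevWorkspaces are failing to start because of infrastructure errors.
```

Filtering on reason="BadRequest" instead would surface invalid devfiles rather than cluster problems.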

Viewing DevWorkspace operator metrics on Grafana dashboards

This section describes how to view DevWorkspace operator metrics on Grafana with the example dashboard. Grafana version 7.5.3 or later is required to support all panels in the example dashboard.

Prerequisites
Procedure
  1. Add the data source for the Prometheus instance. See Creating a Prometheus data source.

  2. Import the example grafana-dashboard.json dashboard.
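
As an alternative to adding the data source in the Grafana UI in step 1, Grafana can load data sources from a provisioning file. The following is a minimal sketch that assumes Prometheus is reachable inside the cluster at http://prometheus:9090. The service name prometheus and the data source name are assumptions; port 9090 matches the Prometheus prerequisite earlier in this chapter.

```yaml
# Hypothetical Grafana data source provisioning file, placed in the
# provisioning/datasources directory of the Grafana installation.
apiVersion: 1
datasources:
  - name: Prometheus              # assumed data source name
    type: prometheus
    access: proxy                 # Grafana server proxies the queries
    url: http://prometheus:9090   # assumed in-cluster service URL
    isDefault: true
```

When the dashboard is imported, select the data source that has the name given in this file.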

Verification steps

Grafana dashboards for the DevWorkspace operator

This section describes the example Grafana dashboard (see grafana-dashboard.json), which displays metrics collected from the DevWorkspace operator.

Figure 1. The DevWorkspace-specific metrics panel

The DevWorkspace-specific metrics panel contains information related to DevWorkspace startup.

Average workspace start time

The average start time of a workspace.

Workspace starts

The number of successful and failed workspace starts.

Workspace startup duration

A heatmap that displays workspace startup duration.

DevWorkspace successes / failures

A comparison between successful and failed DevWorkspace startups.

DevWorkspace failure rate

The ratio of failed workspace startups to the total number of workspace startups.

DevWorkspace startup failure reasons

A pie chart that displays the distribution of workspace startup failures. The possible failure reasons are:

  • BadRequest

  • InfrastructureFailure

  • Unknown

Figure 2. The Operator metrics panel, part 1

Webhooks in flight

A comparison between the number of different webhook requests.

Work queue duration

A heatmap that displays how long the reconcile requests stay in the work queue before they are handled.

Webhooks latency (/mutate)

A heatmap that displays /mutate webhook latency.

Reconcile time

A heatmap that displays the reconcile duration.

Figure 3. The Operator metrics panel, part 2

Webhooks latency (/convert)

A heatmap that displays /convert webhook latency.

Work queue depth

The number of reconcile requests that are in the work queue.

Memory

Memory usage for the DevWorkspace controller and the DevWorkspace webhook server.

Reconcile counts (DWO)

The average number of reconciles per second for the DevWorkspace controller.