Monitoring the Dev Workspace Operator

You can configure an example monitoring stack to process metrics exposed by the Dev Workspace Operator.

Collecting Dev Workspace Operator metrics with Prometheus

To use Prometheus to collect, store, and query metrics about the Dev Workspace Operator:

Prerequisites
  • The devworkspace-controller-metrics Service is exposing metrics on port 8443. This is preconfigured by default.

  • The devworkspace-webhookserver Service is exposing metrics on port 9443. This is preconfigured by default.

  • Prometheus 2.26.0 or later is running. The Prometheus console is running on port 9090 with a corresponding Service. See First steps with Prometheus.

Procedure
  1. Create a ClusterRoleBinding to bind the ServiceAccount associated with Prometheus to the devworkspace-controller-metrics-reader ClusterRole. For the example monitoring stack, the name of the ServiceAccount is prometheus.

    Without the ClusterRoleBinding, you cannot access Dev Workspace metrics because access is protected with role-based access control (RBAC).
    Example 1. ClusterRoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: devworkspace-controller-metrics-binding
    subjects:
      - kind: ServiceAccount
        name: prometheus
        namespace: monitoring
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: devworkspace-controller-metrics-reader
  2. Configure Prometheus to scrape metrics from port 8443 exposed by the devworkspace-controller-metrics Service and from port 9443 exposed by the devworkspace-webhookserver Service.

    The example monitoring stack already creates the prometheus-config ConfigMap with an empty configuration. To provide the Prometheus configuration details, edit the data field of the ConfigMap.
    Example 2. Prometheus configuration
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
      namespace: monitoring
    data:
      prometheus.yml: |-
          global:
            scrape_interval: 5s # (1)
            evaluation_interval: 5s # (2)
          scrape_configs: # (3)
            - job_name: 'DevWorkspace'
              scheme: https
              authorization:
                type: Bearer
                credentials_file: '/var/run/secrets/kubernetes.io/serviceaccount/token'
              tls_config:
                insecure_skip_verify: true
              static_configs:
                - targets: ['devworkspace-controller-metrics.<DWO_namespace>:8443'] # (4)
            - job_name: 'DevWorkspace webhooks'
              scheme: https
              authorization:
                type: Bearer
                credentials_file: '/var/run/secrets/kubernetes.io/serviceaccount/token'
              tls_config:
                insecure_skip_verify: true
              static_configs:
                - targets: ['devworkspace-webhookserver.<DWO_namespace>:9443'] # (5)
    1 The rate at which a target is scraped.
    2 The rate at which the recording and alerting rules are re-checked.
    3 The resources that Prometheus monitors. In the default configuration, two jobs, DevWorkspace and DevWorkspace webhooks, scrape the time series data exposed by the devworkspace-controller-metrics and devworkspace-webhookserver Services.
    4 The scrape target for the metrics from port 8443. Replace <DWO_namespace> with the namespace where the devworkspace-controller-metrics Service is located.
    5 The scrape target for the metrics from port 9443. Replace <DWO_namespace> with the namespace where the devworkspace-webhookserver Service is located.
  3. Scale the Prometheus Deployment down and up to read the updated ConfigMap from the previous step.

    $ kubectl scale --replicas=0 deployment/prometheus -n monitoring && kubectl scale --replicas=1 deployment/prometheus -n monitoring
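Before moving on to verification, it can help to confirm that the RBAC binding and the restart both took effect. The following is a minimal sketch, not part of the example monitoring stack: the function names are hypothetical, and it assumes that the ServiceAccount is prometheus in the monitoring namespace, that the Deployment is named prometheus as in the command above, and that the devworkspace-controller-metrics-reader ClusterRole grants get on the /metrics non-resource URL.

```shell
# Build the impersonation identity for a ServiceAccount.
sa_user() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Ask the API server whether the ServiceAccount may GET the /metrics path.
# kubectl prints "yes" or "no".
check_metrics_access() {
  ns="${1:-monitoring}"
  sa="${2:-prometheus}"
  kubectl auth can-i get /metrics --as="$(sa_user "$ns" "$sa")"
}

# Wait until the restarted Prometheus Deployment is available again.
wait_for_prometheus() {
  ns="${1:-monitoring}"
  kubectl rollout status deployment/prometheus -n "$ns" --timeout=120s
}

# Example usage (requires cluster access and permission to impersonate):
#   check_metrics_access monitoring prometheus
#   wait_for_prometheus monitoring
```

If check_metrics_access prints no, the ClusterRoleBinding from step 1 is the first thing to re-check.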
Verification
  1. Use port forwarding to access the Prometheus Service locally:

    $ kubectl port-forward svc/prometheus 9090:9090 -n monitoring
  2. Verify that all targets are up by viewing the targets endpoint at localhost:9090/targets.

  3. Use the Prometheus console at localhost:9090 to view and query metrics.
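As a command-line alternative to the console, the same check can be made through the Prometheus HTTP API while the port-forward is active. This is a sketch under stated assumptions: the PROM_URL variable and the prom_query helper are hypothetical names, and the job name pattern matches the two job_name values from the example configuration.

```shell
# Base URL of the port-forwarded Prometheus Service (assumption: localhost:9090).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query through the Prometheus HTTP API
# and print the JSON response.
prom_query() {
  curl -s --get "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# The built-in "up" series is 1 for each target whose last scrape succeeded.
# The pattern matches the 'DevWorkspace' and 'DevWorkspace webhooks' jobs.
UP_QUERY='up{job=~"DevWorkspace.*"}'

# Example usage (requires the port-forward from step 1 to be running):
#   prom_query "$UP_QUERY"
```

A value of 0 for either job usually points at the scrape configuration or at the RBAC binding rather than at the Operator itself.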

Dev Workspace-specific metrics

The following tables describe the Dev Workspace-specific metrics exposed by the devworkspace-controller-metrics Service.

Table 1. Metrics
Name                                Type       Description                                                        Labels
devworkspace_started_total          Counter    Number of Dev Workspace starting events.                           source, routingclass
devworkspace_started_success_total  Counter    Number of Dev Workspaces successfully entering the Running phase.  source, routingclass
devworkspace_fail_total             Counter    Number of failed Dev Workspaces.                                   source, reason
devworkspace_startup_time           Histogram  Total time taken to start a Dev Workspace, in seconds.             source, routingclass

Table 2. Labels
Name          Description                                                                Values
source        The controller.devfile.io/devworkspace-source label of the Dev Workspace.  string
routingclass  The spec.routingclass of the Dev Workspace.                                "basic|cluster|cluster-tls|web-terminal"
reason        The workspace startup failure reason.                                      "BadRequest|InfrastructureFailure|Unknown"

Table 3. Startup failure reasons
Name                   Description
BadRequest             Startup failure due to an invalid devfile used to create a Dev Workspace.
InfrastructureFailure  Startup failure due to the following errors: CreateContainerError, RunContainerError, FailedScheduling, FailedMount.
Unknown                Unknown failure reason.
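The counters and the histogram above can be combined in PromQL to answer common questions. The queries below are an illustrative sketch, not expressions taken from the Operator or its dashboard: the variable names, the one-hour window, and the prom_query helper are assumptions, and the _bucket, _sum, and _count suffixes are the series that Prometheus histograms expose by convention.

```shell
# Fraction of workspace starts that failed during the last hour.
FAILURE_RATE_QUERY='sum(increase(devworkspace_fail_total[1h])) / sum(increase(devworkspace_started_total[1h]))'

# Failures during the last hour, broken down by the "reason" label.
FAILURES_BY_REASON_QUERY='sum by (reason) (increase(devworkspace_fail_total[1h]))'

# Mean startup time in seconds, from the histogram _sum and _count series.
AVG_STARTUP_QUERY='sum(rate(devworkspace_startup_time_sum[1h])) / sum(rate(devworkspace_startup_time_count[1h]))'

# Estimated 95th percentile of startup time in seconds.
P95_STARTUP_QUERY='histogram_quantile(0.95, sum by (le) (rate(devworkspace_startup_time_bucket[1h])))'

# Send an instant query to a Prometheus server (assumption: port-forwarded
# to localhost:9090) and print the JSON response.
prom_query() {
  curl -s --get "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Example usage:
#   prom_query "$FAILURE_RATE_QUERY"
```

The ratio queries return no data while the denominator is zero, for example before any Dev Workspace has been started in the selected window.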

Viewing Dev Workspace Operator metrics on Grafana dashboards

To view the Dev Workspace Operator metrics on Grafana with the example dashboard:

Prerequisites
Procedure
  1. Add the data source for the Prometheus instance. See Creating a Prometheus data source.

  2. Import the example grafana-dashboard.json dashboard.

Verification steps

Grafana dashboard for the Dev Workspace Operator

The example Grafana dashboard based on grafana-dashboard.json displays the following metrics from the Dev Workspace Operator.

The Dev Workspace-specific metrics panel

Grafana dashboard panels that contain metrics related to DevWorkspace startup
Figure 1. The Dev Workspace-specific metrics panel
Average workspace start time

The average workspace startup duration.

Workspace starts

The number of successful and failed workspace startups.

Workspace startup duration

A heatmap that displays workspace startup duration.

Dev Workspace successes / failures

A comparison between successful and failed Dev Workspace startups.

Dev Workspace failure rate

The ratio between the number of failed workspace startups and the number of total workspace startups.

Dev Workspace startup failure reasons

A pie chart that displays the distribution of workspace startup failures:

  • BadRequest

  • InfrastructureFailure

  • Unknown

The Operator metrics panel (part 1)

Grafana dashboard panels that contain Operator metrics part 1
Figure 2. The Operator metrics panel (part 1)
Webhooks in flight

A comparison between the number of different webhook requests.

Work queue duration

A heatmap that displays how long the reconcile requests stay in the work queue before they are handled.

Webhooks latency (/mutate)

A heatmap that displays the /mutate webhook latency.

Reconcile time

A heatmap that displays the reconcile duration.

The Operator metrics panel (part 2)

Grafana dashboard panels that contain Operator metrics part 2
Figure 3. The Operator metrics panel (part 2)
Webhooks latency (/convert)

A heatmap that displays the /convert webhook latency.

Work queue depth

The number of reconcile requests that are in the work queue.

Memory

Memory usage for the Dev Workspace controller and the Dev Workspace webhook server.

Reconcile counts (DWO)

The average number of reconciles per second for the Dev Workspace controller.
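The Operator panels can be approximated with ad hoc queries when the dashboard is not available. The sketch below assumes that the controller exposes the standard controller-runtime series (controller_runtime_reconcile_total, controller_runtime_reconcile_time_seconds, and workqueue_depth); the exact expressions in grafana-dashboard.json may differ, and the variable names and five-minute window are illustrative.

```shell
# Reconciles per second, averaged over five minutes, per controller.
RECONCILE_RATE_QUERY='sum by (controller) (rate(controller_runtime_reconcile_total[5m]))'

# Current number of reconcile requests waiting in each work queue.
WORKQUEUE_DEPTH_QUERY='sum by (name) (workqueue_depth)'

# Estimated 95th percentile of reconcile duration in seconds.
RECONCILE_P95_QUERY='histogram_quantile(0.95, sum by (le) (rate(controller_runtime_reconcile_time_seconds_bucket[5m])))'

# Print the queries so they can be pasted into the Prometheus console
# or into a Grafana panel.
for q in "$RECONCILE_RATE_QUERY" "$WORKQUEUE_DEPTH_QUERY" "$RECONCILE_P95_QUERY"; do
  printf '%s\n' "$q"
done
```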