Monitoring Che

This chapter describes how to configure Che to expose metrics and how to build an example monitoring stack with external tools to process data exposed as metrics by Che.

Enabling and exposing Che metrics

This section describes how to enable and expose Che metrics.

Procedure
  1. Set the CHE_METRICS_ENABLED environment variable to true. This exposes port 8087 as a service on the che-master host.

When Eclipse Che is installed from the OperatorHub, the environment variable is set automatically if the default CheCluster CR is used:

spec:
  metrics:
    enable: true

Collecting Che metrics with Prometheus

This section describes how to use the Prometheus monitoring system to collect, store and query metrics about Che.

Prerequisites
Procedure
  • Configure Prometheus to scrape metrics from the 8087 port:

    Prometheus configuration example
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-config
    data:
      prometheus.yml: |-
          global:
            scrape_interval:     5s             (1)
            evaluation_interval: 5s             (2)
          scrape_configs:                       (3)
            - job_name: 'che'
              static_configs:
                - targets: ['[che-host]:8087']  (4)
    1 Rate at which a target is scraped.
    2 Rate at which recording and alerting rules are re-checked (not used in the system at the moment).
    3 Resources Prometheus monitors. In the default configuration, there is a single job called che, which scrapes the time series data exposed by the Che server.
    4 Scrape metrics from the 8087 port.
Verification steps
  • Use the Prometheus console to query and view metrics.

    Metrics are available at: http://<che-server-url>:9090/metrics.

    For more information, see Using the expression browser in the Prometheus documentation.
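As a rough illustration of what Prometheus reads from port 8087, the following Java sketch parses a small sample of the Prometheus text exposition format. The sample body and the che_user_total name are assumptions made for illustration (a gauge registered as che.user.total would typically be exposed under that name). The parser handles only samples without labels and is not part of Che or Prometheus.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal reader for label-free samples in the Prometheus text exposition format. */
final class MetricsTextParser {

  /** Returns metric name -> value for every sample line that has no labels. */
  static Map<String, Double> parse(String body) {
    Map<String, Double> samples = new LinkedHashMap<>();
    for (String raw : body.split("\n")) {
      String line = raw.trim();
      // Skip blank lines and "# HELP" / "# TYPE" comment lines.
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      // Labelled samples (name{label="x"} value) are ignored in this sketch.
      if (line.contains("{")) {
        continue;
      }
      String[] parts = line.split("\\s+");
      if (parts.length >= 2) {
        samples.put(parts[0], parseValue(parts[1]));
      }
    }
    return samples;
  }

  /** Prometheus writes infinities as +Inf and -Inf, which Double.parseDouble rejects. */
  static double parseValue(String text) {
    switch (text) {
      case "+Inf":
        return Double.POSITIVE_INFINITY;
      case "-Inf":
        return Double.NEGATIVE_INFINITY;
      default:
        return Double.parseDouble(text);
    }
  }

  public static void main(String[] args) {
    // Hypothetical response body, as it might be served on port 8087.
    String body =
        "# HELP che_user_total Total amount of users\n"
            + "# TYPE che_user_total gauge\n"
            + "che_user_total 3.0\n";
    System.out.println(parse(body)); // prints {che_user_total=3.0}
  }
}
```

A parser like this is only useful for quick checks of the endpoint output. For storage and querying, rely on Prometheus itself.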

Viewing Che metrics on Grafana dashboards

This section describes how to view Che metrics on Grafana dashboards.

Prerequisites
Procedure
  1. Deploy Che-specific dashboards on Grafana using the che-monitoring.yaml configuration file.

    Three ConfigMaps are used to configure Grafana:

    • grafana-datasources — configuration for Grafana datasource, a Prometheus endpoint

    • grafana-dashboards — configuration of Grafana dashboards and panels

    • grafana-dashboard-provider  — configuration of the Grafana dashboard provider API object, which tells Grafana where to look in the file system for pre-provisioned dashboards

Verification steps
  • Use the Grafana console to view Che metrics.

Grafana dashboards for Che

This section describes the Grafana dashboards that display metrics collected from Che.

Figure 1. The General panel

The General panel contains basic information, such as the total number of users and workspaces in the Che database.

Figure 2. The Workspaces panel
  • Workspace start rate — the ratio between successful and failed started workspaces

  • Workspace stop rate — the ratio between successful and failed stopped workspaces

  • Workspace Failures — the number of workspace failures shown on the graph

  • Starting Workspaces — the gauge that shows the number of currently starting workspaces

  • Average Workspace Start Time — 1-hour average of workspace starts or fails

  • Average Workspace Stop Time — 1-hour average of workspace stops

  • Running Workspaces — the gauge that shows the number of currently running workspaces

  • Stopping Workspaces — the gauge that shows the number of currently stopping workspaces

  • Workspaces started under 60 seconds — the percentage of workspaces started under 60 seconds

  • Number of Workspaces — the number of workspaces created over time

  • Workspace start attempts — the number of attempts to start a workspace comparing regular attempts with start-debug mode

Figure 3. The Users panel
  • Number of Users — the number of users known to Che over time

Figure 4. The Tomcat panel
  • Max number of active sessions — the max number of active sessions that have been active at the same time

  • Number of current active sessions — the number of currently active sessions

  • Total sessions — the total number of sessions

  • Expired sessions — the number of sessions that have expired

  • Rejected sessions — the number of sessions that were not created because the maximum number of active sessions was reached

  • Longest time of an expired session — the longest time (in seconds) that an expired session had been alive

Figure 5. The Requests panel

The Requests panel displays HTTP requests in a graph that shows the average number of requests per minute.

Figure 6. The Executors panel, part 1
  • Threads running - the number of threads that are alive, that is, not terminated. This may include threads in a waiting or blocked state.

  • Threads terminated - the number of threads that have finished execution.

  • Threads created - the number of threads created by the thread factory for the given executor service.

  • Created thread/minute - the rate of thread creation for the given executor service.

Figure 7. The Executors panel, part 2
  • Executor threads active - the number of threads that are actively executing tasks.

  • Executor pool size - the current number of threads in the pool.

  • Queued task - the approximate number of tasks that are queued for execution.

  • Queued occupancy - the percentage of the queue used by tasks that are waiting for execution.
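To make the executor values above more concrete, the following Java sketch reads the corresponding numbers from a plain java.util.concurrent.ThreadPoolExecutor. It assumes that the dashboard values correspond to what a standard thread pool reports. The class name, pool sizes, and queue capacity are hypothetical and chosen only for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Reads active threads, pool size, queued tasks, and queue occupancy from a thread pool. */
final class ExecutorStats {

  /** Percentage of the work queue that is currently occupied by waiting tasks. */
  static double queueOccupancyPercent(ThreadPoolExecutor pool) {
    int queued = pool.getQueue().size();
    int capacity = queued + pool.getQueue().remainingCapacity();
    return capacity == 0 ? 0.0 : 100.0 * queued / capacity;
  }

  public static void main(String[] args) throws InterruptedException {
    // Two worker threads and a queue with room for four waiting tasks.
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(2, 2, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(4));
    CountDownLatch started = new CountDownLatch(2);
    CountDownLatch release = new CountDownLatch(1);

    // Submit four blocking tasks: two start running, two wait in the queue.
    for (int i = 0; i < 4; i++) {
      pool.execute(
          () -> {
            started.countDown();
            try {
              release.await();
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          });
    }
    started.await(); // wait until both workers are busy

    System.out.println("active threads:  " + pool.getActiveCount());
    System.out.println("pool size:       " + pool.getPoolSize());
    System.out.println("queued tasks:    " + pool.getQueue().size());
    System.out.println("queue occupancy: " + queueOccupancyPercent(pool) + "%");

    // Let all tasks finish, then report the completed-task counter.
    release.countDown();
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    System.out.println("completed tasks: " + pool.getCompletedTaskCount());
  }
}
```

With two workers and a queue capacity of four, the sketch should report two active threads, two queued tasks, and a queue occupancy of 50%. Active threads and pool size differ whenever some pool threads are idle.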

Figure 8. The Executors panel, part 3
  • Rejected task - the number of tasks that were rejected from execution.

  • Rejected task/minute - the rate of task rejections.

  • Completed tasks - the number of completed tasks.

  • Completed tasks/minute - the rate of task completion.

Figure 9. The Executors panel, part 4
  • Task execution seconds max - 5-minute moving maximum of task execution time.

  • Tasks execution seconds avg - 1-hour moving average of task execution time.

  • Executor idle seconds max - 5-minute moving maximum of executor idle time.

  • Executor idle seconds avg - 1-hour moving average of executor idle time.
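An average over a time window, such as the 1-hour averages above, can be derived from two readings of a timer's cumulative sum and count. The source does not show the actual dashboard queries, so the following Java sketch is an assumption about how such an average is commonly computed from cumulative counters. The class names and sample numbers are hypothetical.

```java
/** Derives a windowed average duration from two readings of cumulative timer counters. */
final class WindowAverage {

  /** One reading of a timer: total recorded seconds and total number of events so far. */
  static final class Reading {
    final double sumSeconds;
    final long count;

    Reading(double sumSeconds, long count) {
      this.sumSeconds = sumSeconds;
      this.count = count;
    }
  }

  /** Average duration of the events recorded between two readings, or NaN if there were none. */
  static double averageSeconds(Reading earlier, Reading later) {
    long events = later.count - earlier.count;
    if (events <= 0) {
      return Double.NaN;
    }
    return (later.sumSeconds - earlier.sumSeconds) / events;
  }

  public static void main(String[] args) {
    // Hypothetical readings taken one hour apart.
    Reading hourAgo = new Reading(120.0, 40);
    Reading now = new Reading(180.0, 60);
    // 60 seconds spent on 20 tasks within the window.
    System.out.println(averageSeconds(hourAgo, now)); // prints 3.0
  }
}
```

Working with differences between readings, rather than with the raw totals, keeps the average tied to the chosen window instead of the whole lifetime of the process.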

Figure 10. The Traces panel, part 1
  • Workspace start Max - maximum workspace start time

  • Workspace start Avg - 1h moving average of the workspace start time components

  • Workspace stop Max - maximum workspace stop time

  • Workspace stop Avg - 1h moving average of the workspace stop time components

Figure 11. The Traces panel, part 2
  • OpenShiftInternalRuntime#start Max - maximum time of OpenShiftInternalRuntime#start operation

  • OpenShiftInternalRuntime#start Avg - 1h moving average time of OpenShiftInternalRuntime#start operation

  • Plugin Brokering Execution Max - maximum time of PluginBrokerManager#getTooling operation

  • Plugin Brokering Execution Avg - 1h moving average of PluginBrokerManager#getTooling operation

Figure 12. The Traces panel, part 3
  • OpenShiftEnvironmentProvisioner#provision Max - maximum time of OpenShiftEnvironmentProvisioner#provision operation

  • OpenShiftEnvironmentProvisioner#provision Avg - 1h moving average of OpenShiftEnvironmentProvisioner#provision operation

  • Plugin Brokering Execution Max - maximum execution time of PluginBrokerManager#getTooling components

  • Plugin Brokering Execution Avg - 1h moving average of the execution time of PluginBrokerManager#getTooling components

Figure 13. The Traces panel, part 4
  • WaitMachinesStart Max - maximum time of WaitMachinesStart operations

  • WaitMachinesStart Avg - 1h moving average time of WaitMachinesStart operations

  • OpenShiftInternalRuntime#startMachines Max - maximum time of OpenShiftInternalRuntime#startMachines operations

  • OpenShiftInternalRuntime#startMachines Avg - 1h moving average of the time of OpenShiftInternalRuntime#startMachines operations

Figure 14. The Workspace detailed panel, part 1

The Workspace Detailed panel contains heat maps that illustrate the average time of workspace starts or failures. Each row represents a period of time.

Figure 15. The Workspace detailed panel, part 2
  • Messages sent to runtime log - Number of messages sent to the workspace startup log.

  • Bytes sent to runtime log - Number of bytes of the messages sent to the workspace startup log.

  • Current Log Watchers - Number of container logs currently being watched.

Che server JVM dashboard

Use case: JVM metrics of the Che server, such as JVM memory or classloading.

Figure 16. Che server JVM dashboard
Figure 17. Quick Facts
Figure 18. JVM Memory
Figure 19. JVM Misc
Figure 20. JVM Memory Pools (heap)
Figure 21. JVM Memory Pools (Non-Heap)
Figure 22. Garbage Collection
Figure 23. Classloading
Figure 24. Buffer Pools

Developing Grafana dashboards

Grafana offers the possibility to add custom panels.

Procedure

To add a custom panel, use the New dashboard view.

  1. In the first section, labeled "Queries to", define the queries. Use the Prometheus Query Language (PromQL) to select a specific metric and to modify it with aggregation operators.

    Figure 25. New Grafana dashboard: Queries to
  2. In the Visualization section, choose how the metric is displayed: as a graph, gauge, heatmap, or another visual.

    Figure 26. New Grafana dashboard: Visualization
  3. Click the Save button to save the dashboard, then copy the dashboard JSON code and paste it into the deployment configuration.

  4. Load changes in the configuration of a running Grafana deployment. First remove the deployment:

    $ oc process -f che-monitoring.yaml | oc delete -f -

    Then redeploy your Grafana with the new configuration:

    $ oc process -f che-monitoring.yaml | oc apply -f - && oc rollout latest grafana

Extending Che monitoring metrics

This section describes how to create a metric or a group of metrics to extend the monitoring metrics that Che is exposing.

Che has two major metrics modules:

  • che-core-metrics-core — contains core metrics module

  • che-core-api-metrics — contains metrics that are dependent on core Che components, such as workspace or user managers

Procedure
  • Create a class that implements the MeterBinder interface. This allows you to register the created metric in the overridden bindTo(MeterRegistry registry) method.

    The following example metric uses a function that supplies its value:

    Example metric
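    // Imports assume the Micrometer library and the standard Che server packages.
    import io.micrometer.core.instrument.Gauge;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.binder.MeterBinder;
    import javax.inject.Inject;
    import org.eclipse.che.api.core.ServerException;
    import org.eclipse.che.api.user.server.UserManager;

    public class UserMeterBinder implements MeterBinder {

      private final UserManager userManager;

      @Inject
      public UserMeterBinder(UserManager userManager) {
        this.userManager = userManager;
      }

      // Registers a gauge whose value is read from count() each time it is sampled.
      @Override
      public void bindTo(MeterRegistry registry) {
        Gauge.builder("che.user.total", this::count)
            .description("Total amount of users")
            .register(registry);
      }

      // Returns the current number of users, or NaN if the lookup fails.
      private double count() {
        try {
          return userManager.getTotalCount();
        } catch (ServerException e) {
          return Double.NaN;
        }
      }
    }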

    Alternatively, store a reference to the metric and update it manually elsewhere in the code.
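The stored-reference approach can be sketched as follows. To keep the example runnable without extra dependencies, a plain Map serves as a hypothetical stand-in for a meter registry. In Che, the value would be registered with a MeterRegistry inside a MeterBinder instead. The class name, the metric name che.example.failed.starts, and the onWorkspaceStartFailed method are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Keeps a reference to the metric value and updates it where the event happens. */
final class StoredReferenceMetric {

  // Hypothetical metric name, used only in this sketch.
  static final String NAME = "che.example.failed.starts";

  // The stored reference: other code updates it, the registry only reads it.
  private final AtomicLong failedStarts = new AtomicLong();

  /** Registers the value once; the registry map stands in for a real meter registry. */
  StoredReferenceMetric(Map<String, Supplier<Number>> registry) {
    registry.put(NAME, failedStarts::get);
  }

  /** Called from the place in the code where the event occurs. */
  void onWorkspaceStartFailed() {
    failedStarts.incrementAndGet();
  }

  public static void main(String[] args) {
    Map<String, Supplier<Number>> registry = new ConcurrentHashMap<>();
    StoredReferenceMetric metric = new StoredReferenceMetric(registry);

    metric.onWorkspaceStartFailed();
    metric.onWorkspaceStartFailed();

    // Reading the registry returns the latest value of the stored reference.
    System.out.println(registry.get(NAME).get()); // prints 2
  }
}
```

Compared with a supplier function that queries a manager on every read, a stored reference moves the cost to the moment of the event and keeps reads cheap.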