Monitoring Che
This chapter describes how to configure Che to expose metrics and how to build an example monitoring stack with external tools to process data exposed as metrics by Che.
Enabling and exposing Che metrics
This section describes how to enable and expose Che metrics.
- Set the CHE_METRICS_ENABLED=true environment variable, which exposes the 8087 port as a service on the che-master host.
When Eclipse Che is installed from the OperatorHub, the environment variable is set automatically if the default CheCluster CR is used:

spec:
  metrics:
    enable: true
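If Che is already deployed without metrics enabled, one way to turn them on afterwards is to update the CheCluster CR and let the Operator reconcile the change. The command below is a minimal sketch; the CR name eclipse-che and the namespace eclipse-che are assumptions and may differ in your installation:

$ oc patch checluster/eclipse-che -n eclipse-che \
    --type=merge -p '{"spec":{"metrics":{"enable":true}}}'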
Collecting Che metrics with Prometheus
This section describes how to use the Prometheus monitoring system to collect, store, and query metrics about Che.
- Che is exposing metrics on port 8087. See Enabling and exposing Che metrics.
- Prometheus 2.9.1 or later is running. The Prometheus console is running on port 9090 with a corresponding service and route. See First steps with Prometheus.
- Configure Prometheus to scrape metrics from the 8087 port:

Example 1. Prometheus configuration example

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |-
    global:
      scrape_interval: 5s     (1)
      evaluation_interval: 5s (2)
    scrape_configs:           (3)
      - job_name: 'che'
        static_configs:
          - targets: ['[che-host]:8087'] (4)

1 Rate at which a target is scraped.
2 Rate at which recording and alerting rules are re-checked (not used in the system at the moment).
3 Resources Prometheus monitors. In the default configuration, a single job called che scrapes the time series data exposed by the Che server.
4 Scrape metrics from the 8087 port.
- Use the Prometheus console to query and view metrics.

Metrics are available at: http://<che-server-url>:9090/metrics.

For more information, see Using the expression browser.
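As a quick check that Prometheus is scraping the Che endpoint, query the built-in up series in the expression browser. This is only a sketch; the job label value matches the che job defined in the scrape configuration above:

up{job="che"}

The query returns 1 while the Che target is reachable and being scraped. Similarly, scrape_duration_seconds{job="che"} shows how long each scrape takes. The names of the Che application metrics themselves depend on the Che server version, so consult the raw output of the 8087 metrics endpoint for the exact series names.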
Viewing Che metrics on Grafana dashboards
This section describes how to view Che metrics on Grafana dashboards.
- Prometheus is collecting metrics on the Che cluster. See Collecting Che metrics with Prometheus.
- Grafana 6.0 or above is running on port 3000 with a corresponding service and route. See Installing Grafana.
- Deploy Che-specific dashboards on Grafana using the che-monitoring.yaml configuration file.

Three ConfigMaps are used to configure Grafana:
- grafana-datasources - configuration of the Grafana data source, a Prometheus endpoint (a sketch follows this procedure)
- grafana-dashboards - configuration of Grafana dashboards and panels
- grafana-dashboard-provider - configuration of the Grafana dashboard provider API object, which tells Grafana where to look in the file system for pre-provisioned dashboards
- Use the Grafana console to view Che metrics.
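For illustration, the grafana-datasources ConfigMap typically carries a standard Grafana provisioning file that points at the Prometheus service. The snippet below is only a sketch: the key name datasources.yaml and the URL http://prometheus:9090 are assumptions and must match the names used in your che-monitoring.yaml and your Prometheus service.

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
data:
  datasources.yaml: |-
    apiVersion: 1
    datasources:
      - name: Prometheus            # data source name shown in Grafana
        type: prometheus
        access: proxy               # Grafana proxies queries to the URL below
        url: http://prometheus:9090 # assumed Prometheus service URL
        isDefault: true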
Grafana dashboards for Che
This section describes the Grafana dashboards that are displaying metrics collected from Che.

The General panel contains basic information, such as the total number of users and workspaces in the Che database.
- Workspace start rate - the ratio between successful and failed workspace starts
- Workspace stop rate - the ratio between successful and failed workspace stops
- Workspace Failures - the number of workspace failures shown on the graph
- Starting Workspaces - the gauge that shows the number of currently starting workspaces
- Average Workspace Start Time - 1-hour average of workspace starts or fails
- Average Workspace Stop Time - 1-hour average of workspace stops
- Running Workspaces - the gauge that shows the number of currently running workspaces
- Stopping Workspaces - the gauge that shows the number of currently stopping workspaces
- Workspaces started under 60 seconds - the percentage of workspaces started in under 60 seconds
- Number of Workspaces - the number of workspaces created over time
- Workspace start attempts - the number of attempts to start a workspace, comparing regular attempts with the start-debug mode
- Number of Users - the number of users known to Che over time
- Max number of active sessions - the maximum number of sessions that have been active at the same time
- Number of current active sessions - the number of currently active sessions
- Total sessions - the total number of sessions
- Expired sessions - the number of sessions that have expired
- Rejected sessions - the number of sessions that were not created because the maximum number of active sessions was reached
- Longest time of an expired session - the longest time (in seconds) that an expired session had been alive
The Requests panel displays HTTP requests in a graph that shows the average number of requests per minute.
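A per-minute request rate like the one on this panel is normally derived from a counter metric with the PromQL rate() function. The query below is a sketch only; the metric name tomcat_global_request_seconds_count is an assumption (a common Micrometer counter for Tomcat-based servers) and should be replaced with the request counter that the Che server actually exposes:

rate(tomcat_global_request_seconds_count[5m]) * 60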
- Threads running - the number of threads that are alive (not terminated); this may include threads in a waiting or blocked state.
- Threads terminated - the number of threads that have finished their execution.
- Threads created - the number of threads created by the thread factory for the given executor service.
- Created thread per minute - the rate at which threads are created for the given executor service.
- Executor threads active - the number of threads that actively execute tasks.
- Executor pool size - the current number of threads in the executor pool.
- Queued task - the approximate number of tasks that are queued for execution.
- Queued occupancy - the percentage of the queue occupied by tasks that are waiting for execution.
- Rejected task - the number of tasks that were rejected from execution.
- Rejected task per minute - the rate at which tasks are rejected.
- Completed tasks - the number of completed tasks.
- Completed tasks per minute - the rate at which tasks are completed.
- Task execution seconds max - the 5-minute moving maximum of task execution time.
- Tasks execution seconds avg - the 1-hour moving average of task execution time.
- Executor idle seconds max - the 5-minute moving maximum of executor idle time.
- Executor idle seconds avg - the 1-hour moving average of executor idle time.
- Workspace start Max - the maximum workspace start time
- Workspace start Avg - the 1-hour moving average of the workspace start time components
- Workspace stop Max - the maximum workspace stop time
- Workspace stop Avg - the 1-hour moving average of the workspace stop time components
- OpenShiftInternalRuntime#start Max - the maximum time of the OpenShiftInternalRuntime#start operation
- OpenShiftInternalRuntime#start Avg - the 1-hour moving average time of the OpenShiftInternalRuntime#start operation
- Plugin Brokering Execution Max - the maximum time of the PluginBrokerManager#getTooling operation
- Plugin Brokering Execution Avg - the 1-hour moving average of the PluginBrokerManager#getTooling operation
- OpenShiftEnvironmentProvisioner#provision Max - the maximum time of the OpenShiftEnvironmentProvisioner#provision operation
- OpenShiftEnvironmentProvisioner#provision Avg - the 1-hour moving average of the OpenShiftEnvironmentProvisioner#provision operation
- Plugin Brokering Execution Max - the maximum execution time of the PluginBrokerManager#getTooling components
- Plugin Brokering Execution Avg - the 1-hour moving average of the execution time of the PluginBrokerManager#getTooling components
- WaitMachinesStart Max - the maximum time of WaitMachinesStart operations
- WaitMachinesStart Avg - the 1-hour moving average time of WaitMachinesStart operations
- OpenShiftInternalRuntime#startMachines Max - the maximum time of OpenShiftInternalRuntime#startMachines operations
- OpenShiftInternalRuntime#startMachines Avg - the 1-hour moving average of the time of OpenShiftInternalRuntime#startMachines operations
The Workspace Detailed panel contains heat maps that illustrate the average time of workspace starts or failures. Each row shows a period of time.
- Messages sent to runtime log - the number of messages sent to the workspace startup log.
- Bytes sent to runtime log - the number of bytes of the messages sent to the workspace startup log.
- Current Log Watchers - the number of container logs that are currently being watched.
Developing Grafana dashboards
Grafana makes it possible to add custom panels.
To add a custom panel, use the New dashboard view.
- In the first section, define the Queries to a data source. Use the Prometheus Query Language to construct a specific metric and to modify it with various aggregation operators.
- In the Visualisation section, choose how the metric is displayed: as a graph, gauge, heatmap, or another visualization.
- Save changes to the dashboard using the Save button, and copy and paste the JSON code into the deployment (a sketch follows this procedure).
- Load the changes into the configuration of the running Grafana deployment. First remove the deployment:

$ oc process -f che-monitoring.yaml | oc delete -f -

Then redeploy Grafana with the new configuration:

$ oc process -f che-monitoring.yaml | oc apply -f - | oc rollout latest grafana
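For illustration, a dashboard exported from the Save dialog usually ends up as an additional key in the grafana-dashboards ConfigMap inside che-monitoring.yaml, so that the dashboard provider can load it from the file system. The layout below is a sketch only; the key name and the minimal JSON stub are placeholders for the JSON copied from Grafana.

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
data:
  custom-dashboard.json: |-
    {
      "title": "Custom Che dashboard",
      "panels": []
    }

After redeploying, oc rollout status dc/grafana can be used to wait for the new Grafana pods, assuming Grafana runs as a DeploymentConfig, as the oc rollout latest command above implies.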