[hono-dev] Metrics for message processing

Hi list,

I am currently thinking about the metrics that we maintain for the messages we
process. In the original design, Hono Messaging was the central component that
all protocol adapters were connected to and that the adapters had to send all
downstream messages to. It therefore felt like the right place to implement the
messaging metrics, e.g. to count the number of messages that have been forwarded
successfully vs. the messages that had to be discarded due to a lack of credit.

With the deprecation of Hono Messaging, we are now maintaining the metrics in the
protocol adapters directly. IMHO this is a good opportunity to think a little
about the metrics we are maintaining as well.

Currently, we record metrics for "processed", "discarded" and "undeliverable"
messages. However, we have never clearly defined these terms. That was probably
because the only place where it was relevant was Hono Messaging and the way it
was implemented there served as the "definition".

As such, we currently use something like the following in Hono Messaging:

"processed": message from device complies with all requirements and has been
successfully forwarded to the downstream consumer

"discarded": message has been sent pre-settled (by the adapter) and there is no
credit available for the message to be forwarded. The message is then silently
discarded, i.e. the sender is not informed about the failure to deliver the
message. The device cannot distinguish this case from the "processed" case.

"undeliverable": message has been sent unsettled (by the adapter) and there is no
credit available for the message to be forwarded. The message is then released
and the adapter will signal the failure to deliver to the device (if the
transport protocol allows for it). The device may or may not be able to
distinguish this case from the "processed" case.

The first metric clearly is of interest in order to see the current throughput of
the system. The other two metrics, however, are harder to understand in the
context of a particular protocol adapter because they require an understanding of
how the transport protocol is mapped to AMQP 1.0. For example, a telemetry
message that is published using QoS 0 for a tenant for which no consumer is
connected will end up in the "discarded" metric, whereas the same message
published using QoS 1 would end up in the "undeliverable" metric, despite the
fact that the reason for the failure to deliver is the same in both cases: no
consumer is connected.
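
To make the current behavior a bit more tangible, here is a rough sketch of how
the three existing outcomes relate to settlement mode and credit. It is not the
actual Hono Messaging code: the names LegacyOutcome and LegacyClassifier are made
up, and the mapping of MQTT QoS 0 to pre-settled and QoS 1 to unsettled transfers
is an assumption based on the example above.

```java
/** The three outcomes that are currently recorded. */
enum LegacyOutcome { PROCESSED, DISCARDED, UNDELIVERABLE }

/** Hypothetical helper, for illustration only. */
final class LegacyClassifier {

    private LegacyClassifier() {
    }

    /** Assumption: QoS 0 is forwarded pre-settled, QoS 1 unsettled. */
    static boolean isPreSettled(final int mqttQos) {
        return mqttQos == 0;
    }

    /**
     * Classifies a (valid) message the way the current metrics do.
     *
     * @param mqttQos the QoS level used by the device (0 or 1)
     * @param creditAvailable true if there is credit for forwarding downstream
     */
    static LegacyOutcome classify(final int mqttQos, final boolean creditAvailable) {
        if (creditAvailable) {
            return LegacyOutcome.PROCESSED;
        }
        // same root cause (no credit), but two different metrics
        return isPreSettled(mqttQos)
                ? LegacyOutcome.DISCARDED
                : LegacyOutcome.UNDELIVERABLE;
    }
}
```

Note that the lack of credit ends up in two different metrics depending only on
the QoS level chosen by the device.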

After some discussion about this with our operations team, it became clear that
from their perspective it is actually more interesting to get an indication of
the reason for a problem in the metric itself. In particular, it is of interest
to distinguish between cases where messages cannot be processed due to errors
caused by the device, e.g. malformed headers, versus errors where a message
cannot be processed due to problems in the back end infrastructure, e.g. a
service not being available or the aforementioned lack of credit. In the former
case we need to advise device developers how to fix the problem, in the latter
case the ops team needs to get going themselves.

In addition to this coarse distinction, it is still helpful to know the ratio of
credit used vs. credit available because this may serve as an indicator for
scaling the infrastructure up or down.
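
As a minimal sketch of what such a ratio could look like: the class and method
names below are hypothetical, and I simply assume that an adapter knows both the
credit it has been granted and the credit it has left.

```java
/** Hypothetical helper, for illustration only. */
final class CreditUsage {

    private CreditUsage() {
    }

    /**
     * Computes the fraction of granted credit that has already been used.
     *
     * @param creditGranted the credit granted by the peer, must be > 0
     * @param creditRemaining the credit still available, 0..creditGranted
     * @return a value between 0.0 (nothing used) and 1.0 (all credit used)
     */
    static double utilization(final int creditGranted, final int creditRemaining) {
        if (creditGranted <= 0) {
            throw new IllegalArgumentException("granted credit must be positive");
        }
        if (creditRemaining < 0 || creditRemaining > creditGranted) {
            throw new IllegalArgumentException("remaining credit out of range");
        }
        return (creditGranted - creditRemaining) / (double) creditGranted;
    }
}
```

A value that stays close to 1.0 over time would hint at the need to scale up,
a value close to 0.0 at room for scaling down.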

I would therefore like to introduce additional (adapter specific) metrics that
are better suited to cover these requirements. These metrics should be tagged
with the protocol, host, tenant and message type (e.g. telemetry, event ...)
where possible. This will not always be possible: a message might, for example,
be unprocessable because it lacks tenant information. In such a case the problem
could be recorded using the "UNKNOWN" tenant ...
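
A rough sketch of the tagging with a fallback to "UNKNOWN" could look like this.
MetricTags is a hypothetical class, not an existing Hono type, and applying the
fallback to tags other than the tenant is my own assumption.

```java
/** Hypothetical holder for the tags of a message metric. */
final class MetricTags {

    static final String UNKNOWN = "UNKNOWN";

    final String protocol;
    final String host;
    final String tenant;
    final String type;

    private MetricTags(final String protocol, final String host,
            final String tenant, final String type) {
        this.protocol = protocol;
        this.host = host;
        this.tenant = tenant;
        this.type = type;
    }

    /** Creates tags, replacing missing values with "UNKNOWN". */
    static MetricTags of(final String protocol, final String host,
            final String tenant, final String type) {
        return new MetricTags(orUnknown(protocol), orUnknown(host),
                orUnknown(tenant), orUnknown(type));
    }

    private static String orUnknown(final String value) {
        return value == null || value.isEmpty() ? UNKNOWN : value;
    }
}
```

A message without tenant information would then be recorded with
MetricTags.of("mqtt", "adapter-1", null, "telemetry"), which yields the "UNKNOWN"
tenant while keeping the other tags intact.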

"meter.hono.messages.processed" - message has been successfully processed.

"meter.hono.messages.unprocessable" - message cannot be processed because the
message does not contain all required information, e.g. malformed topic name,
missing header, not authorized etc. This metric is used by an adapter to record a
message that it either discards silently or rejects (signaling the problem to the
device). In no case will the message be processed.

"meter.hono.messages.undeliverable" -  message cannot be processed because of a
problem not caused by the sender of the message (the device), e.g. Tenant service
is not available, no credit available, etc. This metric is used by an adapter
regardless of whether the transport protocol allows for signaling back the
problem to the device or not. For instance, an MQTT message published using QoS 0
doesn't allow for signaling back the failure, whereas HTTP allows for sending
back a status code in the HTTP response. In no case will the message be processed.

"counter|meter.hono.messages.capacity" - the number of credits remaining for
sending messages. TODO determine if a counter or a meter is more reasonable to use.
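
To illustrate the idea behind the three message metrics above, here is a minimal
sketch of how an adapter could pick the metric based on who caused the problem
rather than on settlement mode. The metric names are the ones proposed above;
the Cause enum and the ProposedMetrics class are hypothetical.

```java
/** Who (if anyone) is responsible for a message not being processed. */
enum Cause { NONE, DEVICE, INFRASTRUCTURE }

/** Hypothetical helper, for illustration only. */
final class ProposedMetrics {

    static final String PROCESSED = "meter.hono.messages.processed";
    static final String UNPROCESSABLE = "meter.hono.messages.unprocessable";
    static final String UNDELIVERABLE = "meter.hono.messages.undeliverable";

    private ProposedMetrics() {
    }

    /**
     * Selects the metric to record for a message.
     * Whether the failure can be signaled back to the device
     * (e.g. MQTT QoS 0 vs. HTTP) deliberately plays no role here.
     */
    static String metricFor(final Cause cause) {
        switch (cause) {
        case NONE:
            return PROCESSED;
        case DEVICE:
            // e.g. malformed topic name, missing header, not authorized
            return UNPROCESSABLE;
        case INFRASTRUCTURE:
            // e.g. Tenant service not available, no credit available
            return UNDELIVERABLE;
        default:
            throw new IllegalArgumentException("unsupported cause: " + cause);
        }
    }
}
```

With this, a lack of credit always ends up in the same metric, no matter which
QoS level or transport protocol the device has used.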

We could then deprecate the existing protocol adapter specific metric(s) and
eventually remove them together with Hono Messaging.


