Re: [hono-dev] Metrics for message processing

Hi Kai,

I like your idea a lot. I was also confused from time to time by the "old" messaging metrics. Do you see any way to distinguish between undeliverable messages that were dropped because of "internal" communication issues (e.g. no credit for the connection to the tenant manager) and undeliverable messages that were dropped because no consumer is connected to the messaging network? This would help us a lot to see whether there is a problem within our system or whether the user simply has no consumer connected. Perhaps the target system of the failure could be added as a tag to the metric. With this we could at least tell whether the problem was related to the messaging network or, for example, to the tenant manager.
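To make the tag idea a bit more concrete, here is a minimal sketch in plain Python (not Hono or metrics library code). The tag name "target" and the values "messaging-network" and "tenant-service" are made up for illustration only.

```python
from collections import Counter

# Hypothetical tag values, chosen only for this sketch
MESSAGING_NETWORK = "messaging-network"
TENANT_SERVICE = "tenant-service"


def record_undeliverable(counts: Counter, tenant: str, target: str) -> None:
    """Count one undeliverable message, tagged with the system that failed."""
    counts[(tenant, target)] += 1


def undeliverable_for(counts: Counter, tenant: str, target: str) -> int:
    """Number of undeliverable messages of a tenant for one target system."""
    return counts[(tenant, target)]


counts = Counter()
record_undeliverable(counts, "tenant-a", MESSAGING_NETWORK)
record_undeliverable(counts, "tenant-a", MESSAGING_NETWORK)
record_undeliverable(counts, "tenant-a", TENANT_SERVICE)
```

Querying the counter per target would then show whether the drops point at the messaging network (possibly just no consumer) or at one of our own services.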

Best regards

 Daniel Maier

Cloud Services LWM2M (INST/ECS4) 
Bosch Software Innovations GmbH | Stuttgarter Straße 130 | 71332 Waiblingen | GERMANY | www.bosch-si.com
daniel.maier4@xxxxxxxxxxxx

Registered Office: Berlin, Registration Court: Amtsgericht Charlottenburg; HRB 148411 B
Chairman of the Supervisory Board: Dr.-Ing. Thorsten Lücke; Managing Directors: Dr. Stefan Ferber, Michael Hahn



-----Original Message-----
From: hono-dev-bounces@xxxxxxxxxxx <hono-dev-bounces@xxxxxxxxxxx> On Behalf Of Hudalla Kai (INST/ECS4)
Sent: Monday, 20 August 2018 17:11
To: hono-dev@xxxxxxxxxxx
Subject: [hono-dev] Metrics for message processing

Hi list,

I am currently thinking about the metrics that we maintain for the messages we process. In the original design we had Hono Messaging as the central component that all protocol adapters were connected to and that all downstream messages from the adapters had to be sent to. It therefore felt like the right place to implement the messaging metrics and, e.g., count the number of messages that have been forwarded successfully vs. the messages that had to be discarded due to a lack of credit.

With the deprecation of Hono Messaging, we are now maintaining the metrics in the protocol adapters directly. IMHO this is a good opportunity to think a little about the metrics we are maintaining as well.

Currently, we record metrics for "processed", "discarded" and "undeliverable" messages. However, we have never clearly defined these terms. That was probably because the only place where it was relevant was Hono Messaging and the way it was implemented there served as the "definition".

As such, we currently use something like the following in Hono Messaging:

"processed": message from device complies with all requirements and has been successfully forwarded to the downstream consumer

"discarded": message has been sent pre-settled (by the adapter) and there is no credit available for the message to be forwarded. The message is then silently discarded, i.e. the sender is not informed about the failure to deliver the message. The device cannot distinguish this case from the "processed" case.

"undeliverable": message has been sent unsettled (by the adapter) and there is no credit available for the message to be forwarded. The message is then released and the adapter will signal the failure to deliver to the device (if the transport protocol allows it). The device may or may not be able to distinguish this case from the "processed" case.

The first metric clearly is of interest in order to see the current throughput of the system. The other two metrics, however, are harder to understand in the context of a particular protocol adapter because they require an understanding of how the transport protocol is mapped to AMQP 1.0. For example, a telemetry message that is published using QoS 0 for a tenant for which no consumer is connected will end up in the "discarded" metric, whereas the same message published using QoS 1 would end up in the "undeliverable" metric, despite the fact that the reason for the failure to deliver is the same in both cases: no credit.
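The current behavior described above can be sketched roughly as follows. This is a simplified illustration in plain Python, not the Hono Messaging code: it assumes the message is otherwise valid, and the mapping of QoS 0 to pre-settled and QoS 1 to unsettled is inferred from the example above.

```python
from enum import Enum


class LegacyMetric(Enum):
    PROCESSED = "processed"
    DISCARDED = "discarded"
    UNDELIVERABLE = "undeliverable"


def classify_legacy(pre_settled: bool, credit_available: bool) -> LegacyMetric:
    """Pick the legacy metric for a valid message, following the definitions above."""
    if credit_available:
        return LegacyMetric.PROCESSED
    # Without credit the metric depends only on how the message was sent,
    # not on the reason for the failure.
    return LegacyMetric.DISCARDED if pre_settled else LegacyMetric.UNDELIVERABLE


def sent_pre_settled(mqtt_qos: int) -> bool:
    """Assumed mapping for this sketch: QoS 0 is sent pre-settled, QoS 1 unsettled."""
    return mqtt_qos == 0


# Same cause (no credit), two different metrics:
qos0_result = classify_legacy(sent_pre_settled(0), credit_available=False)
qos1_result = classify_legacy(sent_pre_settled(1), credit_available=False)
```

Here qos0_result is "discarded" and qos1_result is "undeliverable" although the cause is identical, which is exactly what makes the two metrics hard to interpret.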

After some discussion about this with our operations team, it became clear that from their perspective it is actually more interesting to get an indication of the reason for a problem in the metric itself. In particular, it is of interest to distinguish between cases where messages cannot be processed due to errors caused by the device, e.g. malformed headers, and cases where a message cannot be processed due to problems in the back end infrastructure, e.g. a service not being available or the aforementioned lack of credit. In the former case we need to advise device developers how to fix the problem, in the latter case the ops team needs to get going themselves.

In addition to this coarse distinction, it is still helpful to know the ratio of credit used vs. credit available because this may serve as an indicator for scaling the infrastructure up or down.
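One possible reading of this ratio, sketched in plain Python: the function names, the treatment of "no credit granted" and the thresholds 0.8 and 0.2 are assumptions for illustration, not agreed values.

```python
def credit_utilization(used: int, granted: int) -> float:
    """Fraction of the granted credit that has been used (0.0 to 1.0)."""
    if granted <= 0:
        # Assumption for this sketch: no granted credit means no headroom at all
        return 1.0
    return min(used, granted) / granted


def scaling_hint(utilization: float, high: float = 0.8, low: float = 0.2) -> str:
    """Turn the utilization into a coarse hint; the thresholds are hypothetical."""
    if utilization >= high:
        return "scale up"
    if utilization <= low:
        return "scale down"
    return "hold"
```

For example, 90 of 100 credits used would yield "scale up" with these thresholds, while 10 of 100 would yield "scale down".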

I would therefore like to introduce additional (adapter specific) metrics that are better suited to cover these requirements. These metrics should be tagged with the protocol, host, tenant and message type (e.g. telemetry, event ...) where possible. A tag value may not always be available, e.g. a message might be unprocessable because it lacks tenant information. In such a case the problem could be recorded using the "UNKNOWN" tenant ...

"meter.hono.messages.processed" - message has been successfully processed.

"meter.hono.messages.unprocessable" - message cannot be processed because the message does not contain all required information, e.g. malformed topic name, missing header, not authorized etc. This metric is used by an adapter to record a message that it either discards silently or rejects (signaling the problem to the device). In no case will the message be processed.

"meter.hono.messages.undeliverable" - message cannot be processed because of a problem not caused by the sender of the message (the device), e.g. Tenant service is not available, no credit available, etc. This metric is used by an adapter regardless of whether the transport protocol allows for signaling the problem back to the device or not. For instance, an MQTT message published using QoS 0 does not allow signaling the failure back, whereas HTTP allows sending back a status code in the HTTP response. In no case will the message be processed.

"counter|meter.hono.messages.capacity" - the number of credits remaining for sending messages. TODO determine if a counter or a meter is more reasonable to use.
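To illustrate how an adapter might record the first three metrics with the proposed tags, here is a rough sketch in plain Python. The class MessageMetrics and its methods are hypothetical and only stand in for whatever the metrics library offers; the metric names and the "UNKNOWN" tenant are taken from the proposal above.

```python
from collections import Counter
from typing import Optional

UNKNOWN_TENANT = "UNKNOWN"


class MessageMetrics:
    """Hypothetical per-adapter recorder; protocol and host are fixed per adapter."""

    def __init__(self, protocol: str, host: str) -> None:
        self._protocol = protocol
        self._host = host
        self.counts: Counter = Counter()

    def _record(self, name: str, tenant: Optional[str], msg_type: str) -> None:
        # Fall back to the UNKNOWN tenant if the message carries no tenant information
        key = (name, self._protocol, self._host, tenant or UNKNOWN_TENANT, msg_type)
        self.counts[key] += 1

    def processed(self, tenant: str, msg_type: str) -> None:
        self._record("meter.hono.messages.processed", tenant, msg_type)

    def unprocessable(self, tenant: Optional[str], msg_type: str) -> None:
        # Problem caused by the device, e.g. malformed topic name or missing header
        self._record("meter.hono.messages.unprocessable", tenant, msg_type)

    def undeliverable(self, tenant: Optional[str], msg_type: str) -> None:
        # Problem in the back end, e.g. Tenant service unavailable or no credit
        self._record("meter.hono.messages.undeliverable", tenant, msg_type)


metrics = MessageMetrics(protocol="mqtt", host="adapter-1")
metrics.processed("tenant-a", "telemetry")
metrics.undeliverable("tenant-a", "event")
metrics.unprocessable(None, "telemetry")  # tenant could not be determined
```

Note that the adapter chooses the metric by cause only; whether the failure can be signaled back to the device plays no role in this sketch.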


We could then deprecate the existing protocol adapter specific metric(s) and eventually remove them together with Hono Messaging.


WDYT?

--
Best regards

Kai Hudalla
Chief Software Architect

Bosch Software Innovations GmbH
Ullsteinstr. 128
12109 Berlin
GERMANY
www.bosch-si.com

Registered Office: Berlin, Registration Court: Amtsgericht Charlottenburg; HRB
148411 B
Chairman of the Supervisory Board: Dr.-Ing. Thorsten Lücke; Managing Directors:
Dr. Stefan Ferber, Michael Hahn
_______________________________________________
hono-dev mailing list
hono-dev@xxxxxxxxxxx
To change your delivery options, retrieve your password, or unsubscribe from this list, visit https://dev.eclipse.org/mailman/listinfo/hono-dev