Re: [hono-dev] Metrics for message processing
On Wed, 2018-08-22 at 07:38 +0000, Frank Karsten (INST/ECS4) wrote:
> Hi Kai,
>
> Thanks for the IMHO good summary.
>
> I think the separation of "DevOps related" and "client/devices related" metrics
> to record the overall operational state of Hono based installations
> makes a lot of sense.
>
> Thinking about the different reasons for "no credit available" I propose to
> extend the tags of the metrics:
>
> From "host, tenant, type, protocol"
> To "host, tenant, type, protocol, api".
>
> Where api should be set to the Hono-API that caused a problem (e.g. Tenant-API,
> Credentials-API, Event-API, Registration-API, and so on).
>
> Like with the tenant being set to "UNKNOWN" if no tenant was provided by a
> device, the api could be set to "UNKNOWN", if the reason for not being able to
> deliver a message
> is not related to any Hono-API.
>
> In case of "no credit available", we should always have one of Hono's APIs
> involved, so this tag could always be filled.
>
> By this means a DevOps team could better judge which part of the system might
> have to be scaled up.
> WDYT?
>
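For illustration, the proposed tag extension could be sketched roughly as follows. This is a minimal sketch, not Hono code: the helper metric_key, the host and tenant values and the use of a plain Counter are assumptions made for the example; only the tag names and the "UNKNOWN" fallback come from the proposal above.

```python
from collections import Counter

UNKNOWN = "UNKNOWN"


def metric_key(name, host, tenant, msg_type, protocol, api=None):
    """Build a hashable key from a metric name and its tags (hypothetical helper)."""
    tags = (
        ("host", host),
        ("tenant", tenant or UNKNOWN),   # no tenant provided by the device
        ("type", msg_type),
        ("protocol", protocol),
        ("api", api or UNKNOWN),         # failure not related to any Hono API
    )
    return (name, tags)


undeliverable = Counter()

# No credit available for invoking the Credentials API.
no_credit = metric_key("hono.messages.undeliverable", "adapter-1", "tenant-a",
                       "telemetry", "mqtt", api="Credentials-API")
undeliverable[no_credit] += 1

# Device sent no tenant and the failure is not tied to a particular API.
no_tenant = metric_key("hono.messages.undeliverable", "adapter-1", None,
                       "event", "http")
undeliverable[no_tenant] += 1
```

Each distinct combination of tag values becomes its own time series, so adding the api tag multiplies the number of series per adapter.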
I do understand the desire to be able to identify the root cause for a message
not being delivered to the consumer. However, I think we need to be careful not
to confuse metrics with tracing. The former is intended to track the "number of
occurrences" of a particular situation or the current value of a measured
dimension in a single component whereas the latter is intended to provide insight
into the processing of an individual message as it flows through the system.

That said, I can imagine introducing metrics for the adapters' interactions with
the other services, e.g. we could add a metric for an adapter instance's
remaining credits for invoking other Hono services (or the AMQP 1.0 Messaging
Network), e.g. "hono.mqtt.credentials.credit" for tracking the MQTT adapter's
remaining credits for invoking the Credentials service. That way, we would not
mix up the semantics of the (generic) "hono.messages.undeliverable" metric with
the metrics for the adapters' (outbound) communication links to other services.
You will still need to use tracing in order to find out the reason for an
individual message's failure to be delivered downstream, but the metrics of the
adapters would be much clearer in their scope FMPOV.
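
A minimal sketch of what such per-service credit metrics might look like, assuming a simple in-memory gauge holder. The class CreditGauges and its methods are hypothetical and not part of Hono; the name "hono.mqtt.tenant.credit" is formed by analogy with the example name above and is an assumption as well.

```python
class CreditGauges:
    """Holds the remaining credit of each outbound link of one adapter instance."""

    def __init__(self, protocol):
        self.protocol = protocol
        self._values = {}

    def name(self, service):
        # e.g. protocol "mqtt" and service "credentials"
        # yield "hono.mqtt.credentials.credit"
        return f"hono.{self.protocol}.{service}.credit"

    def set_credit(self, service, remaining):
        # A gauge reports the current value, it does not count occurrences.
        self._values[self.name(service)] = remaining

    def snapshot(self):
        return dict(self._values)


gauges = CreditGauges("mqtt")
gauges.set_credit("credentials", 12)
gauges.set_credit("tenant", 0)   # assumed name, by analogy
```

The generic "hono.messages.undeliverable" metric stays untouched in this sketch; the credit values are reported separately, one gauge per outbound link.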

A problem with this approach is that increasing the number of service instances
based on these "remaining credit" metrics of the adapters would not really help
with alleviating the problem. Our current approach of establishing (and
maintaining) a single AMQP connection to a service instance supports a "static"
way of balancing the load to the service instances only, because the (service)
instance to invoke is determined during connection establishment only. When the
adapter runs out of credits, it will _not_ try to invoke another instance (which
might be less strained) but instead simply fail the processing of the message.
However, this is another problem that we should discuss in a different mail
thread ...
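
The effect described above can be sketched with a toy model. It is a deliberate simplification under stated assumptions: credits are modelled as a plain per-instance integer rather than AMQP link credit, the instance is picked by taking the first entry of a list, and the classes ServiceInstance and Adapter are hypothetical.

```python
class ServiceInstance:
    def __init__(self, name, credits):
        self.name = name
        self.credits = credits


class Adapter:
    """Picks one service instance when the connection is established."""

    def __init__(self, instances):
        # The instance is determined once and never changed afterwards.
        self.instance = instances[0]
        self.failed = 0

    def process(self, message):
        # The message content is irrelevant for this model.
        if self.instance.credits == 0:
            # No attempt to invoke another, possibly less strained, instance.
            self.failed += 1
            return False
        self.instance.credits -= 1
        return True


instances = [ServiceInstance("credentials-0", 2),
             ServiceInstance("credentials-1", 5)]
adapter = Adapter(instances)
results = [adapter.process(m) for m in range(4)]
# results is [True, True, False, False]; credentials-1 still has 5 credits
```

In this model, adding a third instance would change nothing for the adapter that is already connected, which is the limitation of the static approach.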
> All other points: +1 from me.
>
> Mit freundlichen Grüßen / Best regards
>
> Karsten Frank
>
> (INST/ECS4)
> Bosch Software Innovations GmbH | Ullsteinstr. 128 | 12109 Berlin | GERMANY |
> www.bosch-si.com
>
> Registered Office: Berlin, Registration Court: Amtsgericht Charlottenburg; HRB 148411 B
> Chairman of the Supervisory Board: Dr.-Ing. Thorsten Lücke; Managing Directors: Dr.
> Stefan Ferber, Michael Hahn
>
>
>
> > -----Original Message-----
> > From: hono-dev-bounces@xxxxxxxxxxx <hono-dev-bounces@xxxxxxxxxxx> On
> > Behalf Of Hudalla Kai (INST/ECS4)
> > Sent: Monday, 20 August 2018 17:11
> > To: hono-dev@xxxxxxxxxxx
> > Subject: [hono-dev] Metrics for message processing
> >
> > Hi list,
> >
> > I am currently thinking about the metrics that we maintain for the messages
> > we
> > process. In the original design we had Hono Messaging as the central
> > component
> > that all protocol adapters had been connected to and which all downstream
> > messages had to be sent to from the adapters. It therefore felt like the
> > right place
> > to implement the messaging metrics in and e.g. count the number of messages
> > that have been forwarded successfully vs. the messages that had to be
> > discarded
> > due to a lack of credit.
> >
> > With the deprecation of Hono Messaging, we are now maintaining the metrics in
> > the protocol adapters directly. IMHO this is a good opportunity to think a
> > little about
> > the metrics we are maintaining as well.
> >
> > Currently, we record metrics for "processed", "discarded" and "undeliverable"
> > messages. However, we have never clearly defined these terms. That was
> > probably because the only place where it was relevant was Hono Messaging and
> > the way it was implemented there served as the "definition".
> >
> > As such, we currently use something like the following in Hono Messaging:
> >
> > "processed": message from device complies with all requirements and has been
> > successfully forwarded to the downstream consumer
> >
> > "discarded": message has been sent pre-settled (by the adapter) and there is
> > no
> > credit available for the message to be forwarded. The message is then
> > silently
> > discarded, i.e. the sender is not informed about the failure to deliver the
> > message.
> > The device cannot distinguish this case from the "processed" case.
> >
> > "undeliverable": message has been sent unsettled (by the adapter) and there
> > is no
> > credit available for the message to be forwarded. The message is then
> > released
> > and the adapter will signal the failure to deliver to the device (if the
> > transport protocol allows it). The device may or may not be able to
> > distinguish this case from the "processed" case.
> >
> > The first metric clearly is of interest in order to see the current
> > throughput of the
> > system. The other two metrics, however, are harder to understand in the
> > context of a particular protocol adapter because they require an
> > understanding of how the transport protocol is mapped to AMQP 1.0. For
> > example, a telemetry message that is
> > published for a tenant using QoS 0 and for which no consumer is connected,
> > will
> > end up in the "discarded" metric whereas the same message published using QoS
> > 1 would end up in the "undeliverable" metric, despite the fact that the
> > reason for
> > the failure to deliver is the same in both cases: no credit.
> >
> > After some discussion about this with our operations team, it became clear that
> > from their perspective it is actually more interesting to get an indication
> > of the
> > reason for a problem in the metric itself. In particular, it is of interest
> > to distinguish
> > between cases where messages cannot be processed due to errors caused by the
> > device, e.g. malformed headers, versus errors where a message cannot be
> > processed due to problems in the back end infrastructure, e.g. a service not
> > being
> > available or the aforementioned lack of credit. In the former case we need to
> > advise device developers how to fix the problem, in the latter case the ops
> > team
> > needs to get going themselves.
> >
> > In addition to this coarse distinction, it is still helpful to know the ratio
> > of credit used
> > vs. credit available because this may serve as an indicator for scaling the
> > infrastructure up or down.
> >
> > I would therefore like to introduce additional (adapter specific) metrics
> > that are
> > better suited to cover these requirements. These metrics should be tagged
> > with the
> > protocol, host, tenant and message type (e.g. telemetry, event ...) if
> > possible, e.g.
> > a message might be unprocessable because it lacks tenant information. In such
> > a
> > case the problem could be recorded using the "UNKNOWN"
> > tenant ...
> >
> > "meter.hono.messages.processed" - message has been successfully processed.
> >
> > "meter.hono.messages.unprocessable" - message cannot be processed because
> > the message does not contain all required information, e.g. malformed topic
> > name,
> > missing header, not authorized etc. This metric is used by an adapter to
> > record a
> > message that it either discards silently or rejects (signaling the problem to
> > the
> > device). In no case will the message be processed.
> >
> > "meter.hono.messages.undeliverable" - message cannot be processed because of
> > a problem not caused by the sender of the message (the device), e.g. Tenant
> > service is not available, no credit available, etc. This metric is used by an
> > adapter
> > regardless of whether the transport protocol allows for signaling back the
> > problem
> > to the device or not. For instance, an MQTT message published using QoS 0
> > doesn't allow signaling back the failure whereas HTTP allows sending back a
> > status code in the HTTP response. In no case will the message be processed.
> >
> > "counter|meter.hono.messages.capacity" - the number of credits remaining for
> > sending messages. TODO determine if a counter or a meter is more reasonable
> > to
> > use.
> >
> >
> > We could then deprecate the existing protocol adapter specific metric(s) and
> > eventually remove them together with Hono Messaging.
> >
> >
> > WDYT?
> >
> > --
> > Mit freundlichen Grüßen / Best regards
> >
> > Kai Hudalla
> > Chief Software Architect
> >
> > Bosch Software Innovations GmbH
> > Ullsteinstr. 128
> > 12109 Berlin
> > GERMANY
> > www.bosch-si.com
> >
> > Registered Office: Berlin, Registration Court: Amtsgericht Charlottenburg;
> > HRB
> > 148411 B
> > Chairman of the Supervisory Board: Dr.-Ing. Thorsten Lücke; Managing
> > Directors:
> > Dr. Stefan Ferber, Michael Hahn
> > _______________________________________________
> > hono-dev mailing list
> > hono-dev@xxxxxxxxxxx
> > To change your delivery options, retrieve your password, or unsubscribe from
> > this
> > list, visit https://dev.eclipse.org/mailman/listinfo/hono-dev