Re: [mosquitto-dev] libmosquitto fails to do backoff

Hi Abilio,

Thanks for the code and the bisecting, it's very much appreciated. I
think your proposed solution is correct. The interruptible sleep
function is only used after a disconnection. If there is already a
PUBLISH message pending, then sockpairR has a value ready for reading,
which is what your code simulates, so the select() returns straight
away. We don't want an immediate reconnection in that case. If
sockpairR is cleared at the start of the sleep, a client can still
break out of it with a new PUBLISH, although it's not clear that is
what we actually want anyway. The main purpose of the sockpair is
during normal operation, so that a threaded client can call publish
continuously and have the messages delivered promptly rather than
being tied to the select() timeout. In other words, I don't think it
is particularly critical for this part.
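
For illustration only, here is a minimal sketch of the "clear the read end before sleeping" idea discussed above. It is not the actual libmosquitto code: the function names (set_nonblocking, drain_pending, interruptible_sleep_sketch, sleep_with_stale_wakeup) are hypothetical, and a plain AF_UNIX socketpair stands in for mosq->sockpairR / sockpairW on the assumption that a POSIX system is used.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: switch a descriptor to non-blocking mode. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: discard wake-up bytes that are already queued.
 * fd must be non-blocking, so read() fails with EAGAIN once it is empty. */
static void drain_pending(int fd)
{
    char buf[64];
    assert(fd >= 0);
    while (read(fd, buf, sizeof(buf)) > 0) {
        /* stale wake-ups from before the disconnect are thrown away */
    }
}

/* Sleep for up to timeout_ms, waking early if fd becomes readable.
 * Returns 1 if interrupted, 0 if the whole timeout elapsed, -1 on error. */
static int interruptible_sleep_sketch(int fd, int timeout_ms, int drain_first)
{
    fd_set readfds;
    struct timeval tv;
    int rc;

    if (drain_first) drain_pending(fd);

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (rc < 0) return -1;
    return rc > 0 ? 1 : 0;
}

/* Simulates a PUBLISH that was queued before the sleep started:
 * one byte is written to the pair, then a 50 ms sleep is attempted. */
static int sleep_with_stale_wakeup(int drain_first)
{
    int sp[2];
    int rc;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp) != 0) return -1;
    if (set_nonblocking(sp[0]) != 0 || write(sp[1], "x", 1) != 1) {
        rc = -1;
    } else {
        rc = interruptible_sleep_sketch(sp[0], 50, drain_first);
    }
    close(sp[0]);
    close(sp[1]);
    return rc;
}
```

Calling sleep_with_stale_wakeup(0) from a small main() shows the sleep being cut short by the byte that was already pending, whereas sleep_with_stale_wakeup(1) waits out the full timeout. A byte written after the drain would still interrupt the sleep, which matches the behaviour described above.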

Regards,

Roger

On Thu, 11 Mar 2021 at 20:01, Abilio Marques <abiliojr@xxxxxxxxx> wrote:
>
> Hello guys,
>
> When using mosquitto_loop_forever, if the broker closes the connection while libmosquitto is actively sending data, the backoff never happens. This can end up causing a very fast connect/disconnect loop that can bring down both parties. More specifically, this is a problem with QoS > 0, where the message gets buffered and retried right after reconnecting.
>
> I managed to produce a simple piece of code that reproduces it.
>
> Here is the C portion:
>>
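>> #include <mosquitto.h>
>>
>> int main() {
>>     mosquitto_lib_init();
>>     /* fixed client id, clean_session = false */
>>     struct mosquitto *m = mosquitto_new("test_id", false, NULL);
>>
>>     mosquitto_connect(m, "localhost", 1885, 60);
>>     /* QoS 1 message: it stays buffered and is retried after each reconnect */
>>     mosquitto_publish(m, NULL, "test", 3, "hey", 1, false);
>>     /* timeout = 1 ms, max_packets = 1 */
>>     mosquitto_loop_forever(m, 1, 1);
>> }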
>
>
> I also wrote a fake "broker" that only accepts the connection and then drops it. My real scenario is not this crude, but this version makes the problem predictable, and the result is the same. Here is the code:
>
>> import socket
>>
>> with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
>>     s.bind(('', 1885))
>>     s.listen(1)
>>
>>     while True:
>>         conn, _ = s.accept()
>>         with conn:
>>             print("connected")
>>             conn.send(b"\x20\x02\x01\x00") # CONNACK
>>         # leaving the "with" block closes the connection right away
>
>
> With these two pieces of code I was able to bisect and pinpoint commit a3ebeff9d732458a4dac7513fac10a52a97cf4d1 as the one that broke the library (sometime between 1.6.9 and 1.6.10).
>
> The issue seems to be caused by interruptible_sleep calling select on a socket, mosq->sockpairR, that is being written to at the same time as the connection drops. The select returns immediately, so a new connection starts right away.
>
> I confirmed that this is the place with a quick change that empties mosq->sockpairR before going into the select. This made the problem go away.
>
> I'm not sure how to proceed from here. I feel this code is quite delicate to touch, and while fixing this bug I might introduce another one.
>
> If someone knows how to fix it, or can at least provide a suggestion, please let me know.
>
> Regards,
> Abilio

