Hi,
I’ve been working on a project at my workplace that uses Mosquitto on an embedded ARM system with no RTC directly attached to the Linux system. This has revealed an interesting bug of sorts in a recent change to the db__new_msg_id() function in src/database.c.
To explain: the Linux system has no RTC device. It can obtain the time, but not until a user-space application starts up and establishes a link with another device. Until then, the kernel runs with the clock set according to some very obscure rules that essentially come down to the modification timestamp of one of the files in the kernel source tree at the time the kernel was built.
In this instance, that date is December 16th, 2020.
In the db__new_msg_id() function, a message id is generated from the system clock (realtime, at nanosecond resolution) so that it is reasonably unique. The problem begins when the seconds value (i.e. the current unixtime) has MOSQ_UUID_EPOCH subtracted from it. This is #defined as a unixtime value that corresponds to November 17th 2021.
Because MOSQ_UUID_EPOCH is greater than the current unixtime at this point, the subtraction produces a very large number. So far there is no issue, because no previous db_id exists.
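To illustrate why the result is very large rather than negative, here is a minimal sketch. It is not the actual Mosquitto code: EXAMPLE_EPOCH and seconds_since_example_epoch() are hypothetical names, the epoch value is simply midnight UTC on November 17th 2021, and I am assuming the difference ends up in an unsigned 64-bit value.

```c
#include <stdint.h>

/* Hypothetical stand-in for MOSQ_UUID_EPOCH: 2021-11-17 00:00:00 UTC.
 * The real #define in Mosquitto may use a different value. */
#define EXAMPLE_EPOCH INT64_C(1637107200)

/* Seconds since EXAMPLE_EPOCH, stored unsigned (assumption).
 * If now < EXAMPLE_EPOCH, the negative difference wraps modulo 2^64
 * and comes out as a huge positive number. */
uint64_t seconds_since_example_epoch(int64_t now)
{
    return (uint64_t)(now - EXAMPLE_EPOCH);
}
```

Called with 1608076800 (midnight UTC on December 16th 2020), this returns a value just below 2^64. The real function presumably folds in more than the seconds value, so the magnitudes it produces will differ from this sketch.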
Some time after the startup of Mosquitto, our system connects to the clock source and sets the system time. Now the clock jumps forward from December 16th 2020 to (e.g.) December 8th 2021.
Now the calculation is different: MOSQ_UUID_EPOCH is /less/ than the current unixtime and the result is a small number. This in itself is not a problem, but then the function executes this while loop:
    while ( id <= db.last_db_id ){
        id++;
    }
This loop goes through about 17 quadrillion iterations trying to increment id from around 15,000,000 to 17,000,000,000,000,000. On a 500MHz single-core embedded processor, this takes a long time. In fact, it looks a lot like a lock-up.
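To put a rough number on "a long time", the following back-of-the-envelope sketch converts the iteration count into days. The function name estimated_loop_days() is hypothetical, and the cycles-per-iteration figure is an assumption, not something I measured.

```c
#include <stdint.h>

/* Rough estimate, in days, of how long it takes to increment
 * from `from` up to `to` one step at a time.
 * cycles_per_iteration is an assumed figure, not a measurement. */
double estimated_loop_days(uint64_t from, uint64_t to,
                           double clock_hz, double cycles_per_iteration)
{
    double iterations = (double)(to - from);
    double seconds = iterations * cycles_per_iteration / clock_hz;
    return seconds / 86400.0; /* 86400 seconds per day */
}
```

With from = 15,000,000, to = 17,000,000,000,000,000, a 500MHz clock and an optimistic one cycle per iteration, this comes to a little over 393 days. A real loop iteration will likely cost more than one cycle, so the true figure would be longer still.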
I’ve made a temporary change in our local build system to replace the loop above with the following:
    if ( id <= db.last_db_id )
    {
        id = db.last_db_id + 1;
    }
This has the same effect, but executes in considerably fewer instruction cycles. Since Mosquitto runs in a single thread, there are no concurrency issues around db.last_db_id. I suspect the real problem is the subtraction of MOSQ_UUID_EPOCH, but without knowing the consequences of making more drastic changes I wasn’t willing to poke that bear.
The change above resolves the apparent lock.
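For anyone who wants to check the equivalence outside of Mosquitto, here is a standalone sketch of the two approaches side by side. The function names next_id_by_loop() and next_id_by_jump() are hypothetical, and last_id stands in for db.last_db_id.

```c
#include <stdint.h>

/* Original approach: step one at a time until id passes last_id.
 * Takes (last_id - id + 1) iterations when id <= last_id. */
uint64_t next_id_by_loop(uint64_t id, uint64_t last_id)
{
    while(id <= last_id){
        id++;
    }
    return id;
}

/* Replacement: jump straight past last_id in constant time.
 * Both versions assume last_id < UINT64_MAX; at UINT64_MAX the loop
 * never terminates and the jump wraps around to 0. */
uint64_t next_id_by_jump(uint64_t id, uint64_t last_id)
{
    if(id <= last_id){
        id = last_id + 1;
    }
    return id;
}
```

For any id and any last_id below UINT64_MAX the two return the same value; only the jump version is practical when the gap is as large as the one described above.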
Regards
Rebecca Gellman