I'm still working on implement a MQTT-SN gateway using the Paho 'MQTT
C Client for Posix and Windows'. And now had another problem. This
time I think it is a concurrency bug of the MQTT.C library:
My gateway run for more than a week an crashed with a Segmentation
fault. My guess is that the Segmentation fault is due to a race
condition caused by a missing lock(mutex) for s.cur_clientsds inside
of Socket.c.
Followed I provide the information and reasoning for my guess:
Info of the segmentation fault:
   Program received signal SIGSEGV, Segmentation fault.
   [Switching to Thread 0xb4509460 (LWP 17845)]
   0x00080750 in Socket_getReadySocket (more_work=0, tp=0xb4508d88)
       at source/lib/Mqtt.C/src/Socket.c:278
the code there looks like this:
   274 if (s.cur_clientsds == NULL)
   275    rc = 0;
   276 else
   277 {
   278    rc = *((int*)(s.cur_clientsds->content));
   279    ListNextElement(s.clientsds, &s.cur_clientsds);
   280 }
the backtrace of the crash is:
   (gdb) bt
   #0  0x00080750 in Socket_getReadySocket (more_work=0, tp=0xb4508d88)
       at source/lib/Mqtt.C/src/Socket.c:278
   #1  0x00077518 in MQTTClient_cycle (sock=0xb4508dc4, timeout=1000,
rc=0xb4508dc8)
       at source/lib/Mqtt.C/src/MQTTClient.c:1547
   #2  0x00074a90 in MQTTClient_run (n=0x12a9b4)
       at source/lib/Mqtt.C/src/MQTTClient.c:483
   #3  0xb6e40bfc in start_thread (arg=0xb4509460)
       at pthread_create.c:306
   #4  0xb6dd5968 in ?? ()
       at ../ports/sysdeps/unix/sysv/linux/arm/nptl/../clone.S:116
and a the variables are:
   (gdb) p s
   $1 = {rset = {__fds_bits = {2048, 0 <repeats 31 times>}}, rset_saved = {
       __fds_bits = {1024, 0 <repeats 31 times>}}, maxfdp1 = 11,
     clientsds = 0x12a774, cur_clientsds = 0x0, connect_pending = 0x12a804,
     write_pending = 0x12a894, pending_wset = {__fds_bits = {
         0 <repeats 32 times>}}}
   (gdb) p s.cur_clientsds
   $2 = (ListElement *) 0x0
The Segmentation fault is because s.cur_clientsds is NULL when line
278 is executed. Since line 278 is actually inside an else of a
(s.cur_clientsds == NULL) check I guess that  s.cur_clientsds must
have be changed by an other thread in between the check (line 274) and
the access (line 278). So I searched where s.cur_clientsds is changed
and found that the Socket_close() function sets it to NULL:
   void Socket_close(int socket)
   {
      FUNC_ENTRY;
      Socket_close_only(socket);
      FD_CLR(socket, &(s.rset_saved));
      if (FD_ISSET(socket, &(s.pending_wset)))
         FD_CLR(socket, &(s.pending_wset));
      if (s.cur_clientsds != NULL && *(int*)(s.cur_clientsds->content)
== socket)
         s.cur_clientsds = s.cur_clientsds->next;
      ...
Therefore s.cur_clientsds is accessed and changed by multiple threads:
By the libraries 'worker thread' during the MQTTClient_run() and by a
client thread when the client calls MQTTClient_disconnect() since this
calls Socket_close(). So I checked the call trees for mutex locks: The
Socket_close() is inside a Thread_lock_mutex(mqttclient_mutex) but the
Socket_getReadySocket() is not inside any mutex lock since
MQTTClient_cycle() is explicitly called outside of the
mqttclient_mutex:
   Thread_unlock_mutex(mqttclient_mutex);
   pack = MQTTClient_cycle(&sock, timeout, &rc);  ==> Socket_getReadySocket()
   Thread_lock_mutex(mqttclient_mutex);
Therefore I think my guess is correct that a race condition, because
of missing synchronization for s.cur_clientsds, caused the
Segmentation fault.
My quick solution would be to introduce a new socket_mutex inside
Socket.c to protect accesses to :
   /**
   * Structure to hold all socket data for the module
   */
   Sockets s;
but I would need to invest some more time to fully grasp the locking
strategy of the library. Therefore should I try to invest some time to
understand the internals and introduce this mutex to create a patch?
Or is it not worth my effort because one of the commiters quickly can
confirm and fix this problem.