<?xml version="1.0" encoding="UTF-8"?>
<chapter id="sync">
  <?dbhtml filename="sync.html"?>

  <title>Mutual exclusion and synchronization</title>

  <section>
    <title>Introduction</title>

    <para>The HelenOS operating system is designed to make use of parallelism
    offered by hardware and to exploit concurrency of both the kernel and
    userspace tasks. This is achieved through multiprocessor support and
    several levels of multiprogramming (i.e. multitasking, multithreading and
    through userspace pseudo threads). However, such a highly concurrent
    environment needs safe and efficient ways to handle mutual exclusion and
    synchronization of many execution flows.</para>
  </section>

  <section>
    <title>Active kernel primitives</title>

    <section>
      <title>Spinlocks</title>

      <para>The basic mutual exclusion primitive is the spinlock. Spinlock
      implements busy waiting for an availability of a memory lock (i.e.
      simple variable) in a multiprocessor-safe manner. This safety is
      achieved through the use of a specialized, architecture-dependent,
      atomic test-and-set operation which either locks the spinlock (i.e. sets
      the variable) or, provided that it is already locked, leaves it
      unaltered. In any case, the test-and-set operation returns a value, thus
      signalling either success (i.e. zero return value) or failure (i.e.
      non-zero value) in acquiring the lock. Note that this makes the
      fundamental difference between the naive algorithm that doesn't use the
      atomic operation and the spinlock algortihm. While the naive algorithm
      is prone to race conditions on SMP configuratinos and thus is completely
      SMP-unsafe, the spinlock algorithm eliminates the possibility of race
      conditions and is suitable for mutual exclusion use.</para>

      <para>The semantics of the test-and-set operation is that the spinlock
      remains unavailable until this operation called on the respective
      spinlock returns zero. HelenOS builds two functions on top of
      test-and-set operation. The first is the unconditional attempt to
      acquire the spinlock and is called <emphasis>spinlock_lock</emphasis>.
      It simply loops until test-and-set returns zero. The other operation,
      <emphasis>spinlock_trylock</emphasis>, is the conditional lock operation
      and calls the test-and-set only once to find out wheter it managed to
      acquire the spinlock or not. The conditional operation is useful in
      situations in which an algorithm cannot acquire more spinlocks in the
      proper order and a deadlock cannot be avoided. In such a case, the
      algorithm would detect the danger and instead of possibly deadlocking
      the system it would simply release some spinlocks it already holds and
      retry the whole operation with the hope that it will succeed next time.
      The unlock operation, <emphasis>spinlock_unlock</emphasis>, is quite
      easy - it merely clears the spinlock variable.</para>

      <para>Nevertheless, there is a special issue related to hardware
      optimizations that modern processors implement. Particularily
      problematic is the out-of-order execution of instructions within the
      critical section protected by a spinlock. The processors are always
      self-consistent so that they can carry out speculatively executed
      instructions in the right order with regard to dependencies among those
      instructions. However, the dependency between instructions inside the
      critical section and those that implement locking and unlocking of the
      respective spinlock is not implicit on some processor architectures and
      the processor needs to be explicitly told about each occurrence of such
      a dependency. Therefore, HelenOS adds architecture-specific hooks to all
      <emphasis>spinlock_lock</emphasis>,
      <emphasis>spinlock_trylock</emphasis> and
      <emphasis>spinlock_unlock</emphasis> to prevent the instructions inside
      the critical section from bleeding out. On some architectures, these
      hooks can be a no-op because the dependencies are implicitly there
      because of the special properties of locking and unlocking instructions.
      However, other architectures need to instrument these hooks with
      different memory barriers, depending on what operations can bleed
      out.</para>

      <para>Spinlocks have one significant drawback: when held for longer time
      periods, they harm both parallelism and concurrency. Processor executing
      <emphasis>spinlock_lock</emphasis> does not do any fruitful work and is
      effectively halted until it can grab the lock and proceed. Similarily,
      other threads cannot execute on the processor that holds the spinlock
      because the kernel disables preemption on that processor when a spinlock
      is held. The reason behind disabling preemption is priority inversion
      problem avoidance. For the same reason, threads are strongly discouraged
      from sleeping when they hold a spinlock.</para>

      <para>To summarize, spinlocks represent very simple and essential mutual
      exclusion primitive for SMP systems. On the other hand, spinlocks scale
      poorly because of the active loop they are based on. Therefore,
      spinlocks are used in HelenOS only for a short-time mutual exclusion and
      in cases where the mutual exclusion is required out of thread context.
      Lastly, spinlocks are used in the construction of passive
      synchronization primitives.</para>
    </section>
  </section>

  <section>
    <title>Passive kernel synchronization</title>

    <section>
      <title>Wait queues</title>

      <para>A wait queue is the basic passive synchronization primitive on
      which all other passive synchronization primitives build. Simply put, it
      allows a thread to sleep until an event associated with the particular
      wait queue occurs. Multiple threads are notified about incoming events
      in first come, first served fashion. Moreover, should the event come
      before any thread waits for it, it is recorded in the wait queue as a
      missed wakeup and later forwarded to the first thread that decides to
      wait in the queue. The inner structures of the wait queue are protected
      by a spinlock.</para>

      <para>The thread that wants to wait for a wait queue event uses the
      <emphasis>waitq_sleep_timeout</emphasis> function. The algorithm then
      checks the wait queue's counter of missed wakeups and if there are any
      missed wakeups, the call returns immediately. The call also returns
      immediately if only a conditional wait was requested. Otherwise the
      thread is enqueued in the wait queue's list of sleeping threads and its
      state is changed to <emphasis>Sleeping</emphasis>. It then sleeps until
      one of the following events happens:</para>

      <orderedlist>
        <listitem>
          <para>another thread calls <emphasis>waitq_wakeup</emphasis> and the
          thread is the first thread in the wait queue's list of sleeping
          threads</para>
        </listitem>

        <listitem>
          <para>another thread calls
          <emphasis>waitq_interrupt_sleep</emphasis> on the sleeping
          thread</para>
        </listitem>

        <listitem>
          <para>the sleep timeouts provided that none of the previous occurred
          within a specified time limit; the limit can be infinity</para>
        </listitem>
      </orderedlist>

      <para>All five possibilities (immediate return on success, immediate
      return on failure, wakeup after sleep, interruption and timeout) are
      distinguishable by the return value of
      <emphasis>waitq_sleep_timeout</emphasis>. The ability to interrupt a
      sleeping thread is essential for externally initiated thread termination
      and the ability to wait only for a certain amount of time is used, for
      instance, to passively delay thread execution by several microseconds or
      even seconds in <emphasis>thread_sleep</emphasis> function. Because all
      other passive kernel synchronization primitives are based on wait
      queues, they also have the option of being interrutped and, more
      importantly, can timeout. All of them also implement the conditional
      operation. Furthemore, this very fundamental interface reaches up to the
      implementation of futexes - userspace synchronization primitive, which
      makes it possible for a userspace thread to request synchronization
      operation with a timeout or a conditional operation.</para>

      <para>From the description above, it should be apparent, that when a
      sleeping thread is woken by <emphasis>waitq_wakeup</emphasis> or when
      <emphasis>waitq_sleep_timeout</emphasis> succeeds immediatelly, the
      thread can be sure the event has come and the thread need not and should
      not verify this fact. This approach is called direct hand-off and is
      characteristic for all passive HelenOS synchronization primitives with
      one exception described below.</para>
    </section>

    <section>
      <title>Semaphores</title>

      <para>The interesting point about wait queues is that the number of
      missed wakeups is equal to the number of threads that will not block in
      <emphasis>watiq_sleep_timeout</emphasis> and would immediately succeed
      instead. On the other hand, semaphores are synchronization primitives
      that will let predefined amount of threads in its critical section and
      block any other threads above this count. However, these two cases are
      exactly the same. Semaphores in HelenOS are therefore implemented as
      wait queues with a single semantic change: their wait queue is
      initialized to have so many missed wakeups as is the number of threads
      that the semphore intends to let into its critical section
      simultaneously.</para>

      <para>In the semaphore language, the wait queue operation
      <emphasis>waitq_sleep_timeout</emphasis> corresponds to
      <emphasis><emphasis>semaphore</emphasis> down</emphasis> operation,
      represented by the function <emphasis>semaphore_down_timeout</emphasis>
      and by way of similitude the wait queue operation waitq_wakeup
      corresponds to semaphore <emphasis>up</emphasis> operation, represented
      by the function <emphasis>sempafore_up</emphasis>. The conditional down
      operation is called <emphasis>semaphore_trydown</emphasis>.</para>
    </section>

    <section>
      <title>Mutexes</title>

      <para>Mutexes are are sometimes referred to as binary sempahores. That
      means that mutexes are like semaphores that allow only one thread in its
      critical section. Indeed, mutexes in HelenOS are implemented exactly in
      this way: they are built atop semaphores. From another point of view,
      they can be viewed as spinlocks without busy waiting. Their semaphore
      heritage provides good basics for both conditional operation and
      operation with timeout. The locking operation is called
      <emphasis>mutex_lock</emphasis>, the conditional locking operation is
      called <emphasis>mutex_trylock</emphasis> and the unlocking operation is
      called <emphasis>mutex_unlock</emphasis>.</para>
    </section>

    <section>
      <title>Reader/writer locks</title>

      <para>Reader/writer locks, or rwlocks, are by far the most complicated
      synchronization primitive within the kernel. The goal of these locks is
      to improve concurrency of applications in which threads need to
      synchronize access to a shared resource and that access can be
      partitioned into a read-only mode and a write mode. Reader/writer locks
      should make it possible for several, possibly many, readers to enter the
      critical section, provided that no writer is currently in the critical
      section, or to be in the critical section contemporarily. Writers are
      allowed to enter the critical section only individually, provided that
      no reader is in the critical section already. Applications in which the
      majority of operations can be done in the read-only mode can benefit
      from increased concurrency created by reader/writer locks.</para>

      <para>During reader/writer locks construction, a decision should be made
      whether readers will be prefered over writers or whether writers will be
      prefered over readers in cases when the lock is not currently held and
      both a reader and a writer want to gain the lock. Some operating systems
      prefer one group over the other, creating thus a possibility for
      starving the unprefered group. In the HelenOS operating system, none of
      the two groups is prefered. The lock is granted on the first come, first
      served basis with the additional note that readers are granted the lock
      in biggest possible batches.</para>

      <para>With this policy and the timeout modes of operation, the direct
      hand-off becomes much more complicated. For instance, a writer leaving
      the critical section must wake up all leading readers in the rwlock's
      wait queue or one leading writer or no-one if no thread is waiting.
      Similarily, the last reader leaving the critical section must wakeup the
      sleeping writer, if there are any sleeping threads at all. As another
      example, if a writer at the beginning of the rwlock's wait queue
      timeouts and the lock is held by at least one reader, the timeouting
      writer must first wake up all readers that follow him in the queue prior
      to signalling the timeout itself and giving up.</para>

      <para>Because of the issues mentioned in the previous paragraph, the
      reader/writer locks imlpementation needs to walk the rwlock wait queue's
      list of sleeping threads directly in order to find out the type of
      access that the queueing threads demand. This makes the code difficult
      to understand and dependent on the internal implementation of the wait
      queue. Nevertheless, it remains unclear to the authors of HelenOS
      whether a simpler but equivalently fair solution exists.</para>

      <para>The implementation of rwlocks as it has been already put, makes
      use of one single wait queue for both readers and writers, thus avoiding
      any possibility of starvation. In fact, rwlocks use a mutex rather than
      a bare wait queue. This mutex is called <emphasis>exclusive</emphasis>
      and is used to synchronize writers. The writer's lock operation,
      <emphasis>rwlock_write_lock_timeout</emphasis>, simply tries to acquire
      the exclusive mutex. If it succeeds, the writer is granted the rwlock.
      However, if the operation fails (e.g. times out), the writer must check
      for potential readers at the head of the list of sleeping threads
      associated with the mutex's wait queue and proceed according to the
      procedure outlined above.</para>

      <para>The exclusive mutex plays an important role in reader
      synchronization as well. However, a reader doing the reader's lock
      operation, <emphasis>rwlock_read_lock_timeout</emphasis>, may bypass
      this mutex when it detects that:</para>

      <orderedlist>
        <listitem>
          <para>there are other readers in the critical section</para>
        </listitem>

        <listitem>
          <para>there are no sleeping threads waiting for the exclusive
          mutex</para>
        </listitem>
      </orderedlist>

      <para>If both conditions are true, the reader will bypass the mutex,
      increment the number of readers in the critical section and enter the
      critical section. Note that if there are any sleeping threads at the
      beginning of the wait queue, the first of them must be a writer. If the
      conditions are not fulfilled, the reader normally waits until the
      exclusive mutex is granted to it.</para>
    </section>

    <section>
      <title>Condition variables</title>

      <para>Condvars explanation</para>
    </section>
  </section>

  <section>
    <title>Userspace synchronization</title>

    <section>
      <title>Futexes</title>

      <para></para>
    </section>
  </section>
</chapter>