<?xml version="1.0" encoding="UTF-8"?>
<chapter id="sync">
  <?dbhtml filename="sync.html"?>

  <title>Synchronization</title>

  <section>
    <title>Introduction</title>

    <para>The HelenOS operating system is designed to make use of the
    parallelism offered by the hardware and to exploit concurrency of both the
    kernel and userspace tasks. This is achieved through multiprocessor
    support and several levels of multiprogramming such as multitasking,
    multithreading and also through userspace pseudo threads. However, such a
    highly concurrent environment needs safe and efficient ways to handle
    mutual exclusion and synchronization of many execution flows.</para>
  </section>

  <section>
    <title>Active Kernel Primitives</title>

    <section>
      <indexterm>
        <primary>synchronization</primary>

        <secondary>- spinlock</secondary>
      </indexterm>

      <title>Spinlocks</title>

      <para>The basic mutual exclusion primitive is the spinlock. The spinlock
      implements active waiting for the availability of a memory lock (i.e.
      simple variable) in a multiprocessor-safe manner. This safety is
      achieved through the use of a specialized, architecture-dependent,
      atomic test-and-set operation which either locks the spinlock (i.e. sets
      the variable) or, provided that it is already locked, leaves it
      unaltered. In any case, the test-and-set operation returns a value, thus
      signalling either success (i.e. zero return value) or failure (i.e.
      non-zero value) in acquiring the lock. Note that this makes a
      fundamental difference between the naive algorithm that doesn't use the
      atomic operation and the spinlock algortihm. While the naive algorithm
      is prone to race conditions on SMP configurations and thus is completely
      SMP-unsafe, the spinlock algorithm eliminates the possibility of race
      conditions and is suitable for mutual exclusion use.</para>

      <para>The semantics of the test-and-set operation is that the spinlock
      remains unavailable until this operation called on the respective
      spinlock returns zero. HelenOS builds two functions on top of the
      test-and-set operation. The first function is the unconditional attempt
      to acquire the spinlock and is called <code>spinlock_lock()</code>. It
      simply loops until the test-and-set returns a zero value. The other
      function, <code>spinlock_trylock()</code>, is the conditional lock
      operation and calls the test-and-set only once to find out whether it
      managed to acquire the spinlock or not. The conditional operation is
      useful in situations in which an algorithm cannot acquire more spinlocks
      in the proper order and a deadlock cannot be avoided. In such a case,
      the algorithm would detect the danger and instead of possibly
      deadlocking the system it would simply release some spinlocks it already
      holds and retry the whole operation with the hope that it will succeed
      next time. The unlock function, <code>spinlock_unlock()</code>, is quite
      easy - it merely clears the spinlock variable.</para>

      <para>Nevertheless, there is a special issue related to hardware
      optimizations that modern processors implement. Particularly problematic
      is the out-of-order execution of instructions within the critical
      section protected by a spinlock. The processors are always
      self-consistent so that they can carry out speculatively executed
      instructions in the right order with regard to dependencies among those
      instructions. However, the dependency between instructions inside the
      critical section and those that implement locking and unlocking of the
      respective spinlock is not implicit on some processor architectures. As
      a result, the processor needs to be explicitly told about each
      occurrence of such a dependency. Therefore, HelenOS adds
      architecture-specific hooks to all <code>spinlock_lock()</code>,
      <code>spinlock_trylock()</code> and <code>spinlock_unlock()</code>
      functions to prevent the instructions inside the critical section from
      permeating out. On some architectures, these hooks can be void because
      the dependencies are implicitly there because of the special properties
      of locking and unlocking instructions. However, other architectures need
      to instrument these hooks with different memory barriers, depending on
      what operations could permeate out.</para>

      <para>Spinlocks have one significant drawback: when held for longer time
      periods, they harm both parallelism and concurrency. The processor
      executing <code>spinlock_lock()</code> does not do any fruitful work and
      is effectively halted until it can grab the lock and proceed.
      Similarily, other execution flows cannot execute on the processor that
      holds the spinlock because the kernel disables preemption on that
      processor when a spinlock is held. The reason behind disabling
      preemption is priority inversion problem avoidance. For the same reason,
      threads are strongly discouraged from sleeping when they hold a
      spinlock.</para>

      <para>To summarize, spinlocks represent very simple and essential mutual
      exclusion primitive for SMP systems. On the other hand, spinlocks scale
      poorly because of the active loop they are based on. Therefore,
      spinlocks are used in HelenOS only for short-time mutual exclusion and
      in cases where the mutual exclusion is required out of thread context.
      Lastly, spinlocks are used in the construction of passive
      synchronization primitives.</para>
    </section>
  </section>

  <section>
    <title>Passive Kernel Synchronization</title>

    <section>
      <indexterm>
        <primary>synchronization</primary>

        <secondary>- wait queue</secondary>
      </indexterm>

      <title>Wait Queues</title>

      <para>A wait queue is the basic passive synchronization primitive on
      which all other passive synchronization primitives are built. Simply
      put, it allows a thread to sleep until an event associated with the
      particular wait queue occurs. Multiple threads are notified about
      incoming events in a first come, first served fashion. Moreover, should
      the event come before any thread waits for it, it is recorded in the
      wait queue as a missed wakeup and later forwarded to the first thread
      that decides to wait in the queue. The inner structures of the wait
      queue are protected by a spinlock.</para>

      <para>The thread that wants to wait for a wait queue event uses the
      <code>waitq_sleep_timeout()</code> function. The algorithm then checks
      the wait queue's counter of missed wakeups and if there are any missed
      wakeups, the call returns immediately. The call also returns immediately
      if only a conditional wait was requested. Otherwise the thread is
      enqueued in the wait queue's list of sleeping threads and its state is
      changed to <constant>Sleeping</constant>. It then sleeps until one of
      the following events happens:</para>

      <orderedlist>
        <listitem>
          <para>another thread calls <code>waitq_wakeup()</code> and the
          thread is the first thread in the wait queue's list of sleeping
          threads;</para>
        </listitem>

        <listitem>
          <para>another thread calls <code>waitq_interrupt_sleep()</code> on
          the sleeping thread;</para>
        </listitem>

        <listitem>
          <para>the sleep times out provided that none of the previous
          occurred within a specified time limit; the limit can be
          infinity.</para>
        </listitem>
      </orderedlist>

      <para>All five possibilities (immediate return on success, immediate
      return on failure, wakeup after sleep, interruption and timeout) are
      distinguishable by the return value of
      <code>waitq_sleep_timeout()</code>. Being able to interrupt a sleeping
      thread is essential for externally initiated thread termination. The
      ability to wait only for a certain amount of time is used, for instance,
      to passively delay thread execution by several microseconds or even
      seconds in <code>thread_sleep()</code> function. Due to the fact that
      all other passive kernel synchronization primitives are based on wait
      queues, they also have the option of being interrutped and, more
      importantly, can timeout. All of them also implement the conditional
      operation. Furthemore, this very fundamental interface reaches up to the
      implementation of futexes - userspace synchronization primitive, which
      makes it possible for a userspace thread to request a synchronization
      operation with a timeout or a conditional operation.</para>

      <para>From the description above, it should be apparent, that when a
      sleeping thread is woken by <code>waitq_wakeup()</code> or when
      <code>waitq_sleep_timeout()</code> succeeds immediately, the thread can
      be sure that the event has occurred. The thread need not and should not
      verify this fact. This approach is called direct hand-off and is
      characteristic for all passive HelenOS synchronization primitives, with
      the exception as described below.</para>
    </section>

    <section>
      <indexterm>
        <primary>synchronization</primary>

        <secondary>- semaphore</secondary>
      </indexterm>

      <title>Semaphores</title>

      <para>The interesting point about wait queues is that the number of
      missed wakeups is equal to the number of threads that will not block in
      <code>watiq_sleep_timeout()</code> and would immediately succeed
      instead. On the other hand, semaphores are synchronization primitives
      that will let predefined amount of threads into their critical section
      and block any other threads above this count. However, these two cases
      are exactly the same. Semaphores in HelenOS are therefore implemented as
      wait queues with a single semantic change: their wait queue is
      initialized to have so many missed wakeups as is the number of threads
      that the semphore intends to let into its critical section
      simultaneously.</para>

      <para>In the semaphore language, the wait queue operation
      <code>waitq_sleep_timeout()</code> corresponds to semaphore
      <code>down</code> operation, represented by the function
      <code>semaphore_down_timeout()</code> and by way of similitude the wait
      queue operation waitq_wakeup corresponds to semaphore <code>up</code>
      operation, represented by the function <code>sempafore_up()</code>. The
      conditional down operation is called
      <code>semaphore_trydown()</code>.</para>
    </section>

    <section>
      <title>Mutexes</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- mutex</secondary>
      </indexterm>

      <para>Mutexes are sometimes referred to as binary sempahores. That means
      that mutexes are like semaphores that allow only one thread in its
      critical section. Indeed, mutexes in HelenOS are implemented exactly in
      this way: they are built on top of semaphores. From another point of
      view, they can be viewed as spinlocks without busy waiting. Their
      semaphore heritage provides good basics for both conditional operation
      and operation with timeout. The locking operation is called
      <code>mutex_lock()</code>, the conditional locking operation is called
      <code>mutex_trylock()</code> and the unlocking operation is called
      <code>mutex_unlock()</code>.</para>
    </section>

    <section>
      <title>Reader/Writer Locks</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- read/write lock</secondary>
      </indexterm>

      <para>Reader/writer locks, or rwlocks, are by far the most complicated
      synchronization primitive within the kernel. The goal of these locks is
      to improve concurrency of applications, in which threads need to
      synchronize access to a shared resource, and that access can be
      partitioned into a read-only mode and a write mode. Reader/writer locks
      should make it possible for several, possibly many, readers to enter the
      critical section, provided that no writer is currently in the critical
      section, or to be in the critical section contemporarily. Writers are
      allowed to enter the critical section only individually, provided that
      no reader is in the critical section already. Applications, in which the
      majority of operations can be done in the read-only mode, can benefit
      from increased concurrency created by reader/writer locks.</para>

      <para>During reader/writer lock construction, a decision should be made
      whether readers will be prefered over writers or whether writers will be
      prefered over readers in cases when the lock is not currently held and
      both a reader and a writer want to gain the lock. Some operating systems
      prefer one group over the other, creating thus a possibility for
      starving the unprefered group. In the HelenOS operating system, none of
      the two groups is prefered. The lock is granted on a first come, first
      served basis with the additional note that readers are granted the lock
      in the biggest possible batch.</para>

      <para>With this policy and the timeout modes of operation, the direct
      hand-off becomes much more complicated. For instance, a writer leaving
      the critical section must wake up all leading readers in the rwlock's
      wait queue or one leading writer or no-one if no thread is waiting.
      Similarily, the last reader leaving the critical section must wakeup the
      sleeping writer if there are any sleeping threads left at all. As
      another example, if a writer at the beginning of the rwlock's wait queue
      times out and the lock is held by at least one reader, the writer which
      has timed out must first wake up all readers that follow him in the
      queue prior to signalling the timeout itself and giving up.</para>

      <para>Due to the issues mentioned in the previous paragraph, the
      reader/writer lock imlpementation needs to walk the rwlock wait queue's
      list of sleeping threads directly, in order to find out the type of
      access that the queueing threads demand. This makes the code difficult
      to understand and dependent on the internal implementation of the wait
      queue. Nevertheless, it remains unclear to the authors of HelenOS
      whether a simpler but equivalently fair solution exists.</para>

      <para>The implementation of rwlocks as it has been already put, makes
      use of one single wait queue for both readers and writers, thus avoiding
      any possibility of starvation. In fact, rwlocks use a mutex rather than
      a bare wait queue. This mutex is called <emphasis>exclusive</emphasis>
      and is used to synchronize writers. The writer's lock operation,
      <code>rwlock_write_lock_timeout()</code>, simply tries to acquire the
      exclusive mutex. If it succeeds, the writer is granted the rwlock.
      However, if the operation fails (e.g. times out), the writer must check
      for potential readers at the head of the list of sleeping threads
      associated with the mutex's wait queue and then proceed according to the
      procedure outlined above.</para>

      <para>The exclusive mutex plays an important role in reader
      synchronization as well. However, a reader doing the reader's lock
      operation, <code>rwlock_read_lock_timeout()</code>, may bypass this
      mutex when it detects that:</para>

      <orderedlist>
        <listitem>
          <para>there are other readers in the critical section and</para>
        </listitem>

        <listitem>
          <para>there are no sleeping threads waiting for the exclusive
          mutex.</para>
        </listitem>
      </orderedlist>

      <para>If both conditions are true, the reader will bypass the mutex,
      increment the number of readers in the critical section and then enter
      the critical section. Note that if there are any sleeping threads at the
      beginning of the wait queue, the first must be a writer. If the
      conditions are not fulfilled, the reader normally waits until the
      exclusive mutex is granted to it.</para>
    </section>

    <section>
      <title>Condition Variables</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- condition variable</secondary>
      </indexterm>

      <para>Condition variables can be used for waiting until a condition
      becomes true. In this respect, they are similar to wait queues. But
      contrary to wait queues, condition variables have different semantics
      that allows events to be lost when there is no thread waiting for them.
      In order to support this, condition variables don't use direct hand-off
      and operate in a way similar to the example below. A thread waiting for
      the condition becoming true does the following:</para>

      <example>
        <title>Use of <code>condvar_wait_timeout()</code>.</title>

        <programlisting language="C"><function>mutex_lock</function>(<varname>mtx</varname>);
while (!<varname>condition</varname>)
        <function>condvar_wait_timeout</function>(<varname>cv</varname>, <varname>mtx</varname>); /* <remark>the condition is true, do something</remark> */
<function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting>
      </example>

      <para>A thread that causes the condition become true signals this event
      like this:</para>

      <example>
        <title>Use of <code>condvar_signal()</code>.</title>

        <programlisting language="C"><function>mutex_lock</function>(<varname>mtx</varname>);
<varname>condition</varname> = <constant>true</constant>;
<function>condvar_signal</function>(<varname>cv</varname>);  /* <remark>condvar_broadcast(cv);</remark> */
<function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting>
      </example>

      <para>The wait operation, <code>condvar_wait_timeout()</code>, always
      puts the calling thread to sleep. The thread then sleeps until another
      thread invokes <code>condvar_broadcast()</code> on the same condition
      variable or until it is woken up by <code>condvar_signal()</code>. The
      <code>condvar_signal()</code> operation unblocks the first thread
      blocking on the condition variable while the
      <code>condvar_broadcast()</code> operation unblocks all threads blocking
      there. If there are no blocking threads, these two operations have no
      efect.</para>

      <para>Note that the threads must synchronize over a dedicated mutex. To
      prevent race condition between <code>condvar_wait_timeout()</code> and
      <code>condvar_signal()</code> or <code>condvar_broadcast()</code>, the
      mutex is passed to <code>condvar_wait_timeout()</code> which then
      atomically puts the calling thread asleep and unlocks the mutex. When
      the thread eventually wakes up, <code>condvar_wait()</code> regains the
      mutex and returns.</para>

      <para>Also note, that there is no conditional operation for condition
      variables. Such an operation would make no sence since condition
      variables are defined to forget events for which there is no waiting
      thread and because <code>condvar_wait()</code> must always go to sleep.
      The operation with timeout is supported as usually.</para>

      <para>In HelenOS, condition variables are based on wait queues. As it is
      already mentioned above, wait queues remember missed events while
      condition variables must not do so. This is reasoned by the fact that
      condition variables are designed for scenarios in which an event might
      occur very many times without being picked up by any waiting thread. On
      the other hand, wait queues would remember any event that had not been
      picked up by a call to <code>waitq_sleep_timeout()</code>. Therefore, if
      wait queues were used directly and without any changes to implement
      condition variables, the missed_wakeup counter would hurt performance of
      the implementation: the <code>while</code> loop in
      <code>condvar_wait_timeout()</code> would effectively do busy waiting
      until all missed wakeups were discarded.</para>

      <para>The requirement on the wait operation to atomically put the caller
      to sleep and release the mutex poses an interesting problem on
      <code>condvar_wait_timeout()</code>. More precisely, the thread should
      sleep in the condvar's wait queue prior to releasing the mutex, but it
      must not hold the mutex when it is sleeping.</para>

      <para>Problems described in the two previous paragraphs are addressed in
      HelenOS by dividing the <code>waitq_sleep_timeout()</code> function into
      three pieces:</para>

      <orderedlist>
        <listitem>
          <para><code>waitq_sleep_prepare()</code> prepares the thread to go
          to sleep by, among other things, locking the wait queue;</para>
        </listitem>

        <listitem>
          <para><code>waitq_sleep_timeout_unsafe()</code> implements the core
          blocking logic;</para>
        </listitem>

        <listitem>
          <para><code>waitq_sleep_finish()</code> performs cleanup after
          <code>waitq_sleep_timeout_unsafe()</code>; after this call, the wait
          queue spinlock is guaranteed to be unlocked by the caller</para>
        </listitem>
      </orderedlist>

      <para>The stock <code>waitq_sleep_timeout()</code> is then a mere
      wrapper that calls these three functions. It is provided for convenience
      in cases where the caller doesn't require such a low level control.
      However, the implementation of <code>condvar_wait_timeout()</code> does
      need this finer-grained control because it has to interleave calls to
      these functions by other actions. It carries its operations out in the
      following order:</para>

      <orderedlist>
        <listitem>
          <para>calls <code>waitq_sleep_prepare()</code> in order to lock the
          condition variable's wait queue,</para>
        </listitem>

        <listitem>
          <para>releases the mutex,</para>
        </listitem>

        <listitem>
          <para>clears the counter of missed wakeups,</para>
        </listitem>

        <listitem>
          <para>calls <code>waitq_sleep_timeout_unsafe()</code>,</para>
        </listitem>

        <listitem>
          <para>retakes the mutex,</para>
        </listitem>

        <listitem>
          <para>calls <code>waitq_sleep_finish()</code>.</para>
        </listitem>
      </orderedlist>
    </section>
  </section>

  <section>
    <title>Userspace Synchronization</title>

    <section>
      <title>Futexes</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- futex</secondary>
      </indexterm>

      <para>In a multithreaded environment, userspace threads need an
      efficient way to synchronize. HelenOS borrows an idea from Linux<xref
      linkend="futex" /> to implement lightweight userspace synchronization
      and mutual exclusion primitive called futex. The key idea behind futexes
      is that non-contended synchronization is very fast and takes place only
      in userspace without kernel's intervention. When more threads contend
      for a futex, only one of them wins; other threads go to sleep via a
      dedicated syscall.</para>

      <para>The userspace part of the futex is a mere integer variable, a
      counter, that can be atomically incremented or decremented. The kernel
      part is rather more complicated. For each userspace futex counter, there
      is a kernel structure describing the futex. This structure
      contains:</para>

      <itemizedlist>
        <listitem>
          <para>number of references,</para>
        </listitem>

        <listitem>
          <para>physical address of the userspace futex counter,</para>
        </listitem>

        <listitem>
          <para>hash table link and</para>
        </listitem>

        <listitem>
          <para>a wait queue.</para>
        </listitem>
      </itemizedlist>

      <para>The reference count helps to find out when the futex is no longer
      needed and can be deallocated. The physical address is used as a key for
      the global futex hash table. Note that the kernel has to use physical
      address to identify the futex beacause one futex can be used for
      synchronization among different address spaces and can have different
      virtual addresses in each of them. Finally, the kernel futex structure
      includes a wait queue. The wait queue is used to put threads that didn't
      win the futex to sleep until the winner wakes one of them up.</para>

      <para>A futex should be initialized by setting its userspace counter to
      one before it is used. When locking the futex via userspace library
      function <code>futex_down_timeout()</code>, the library code atomically
      decrements the futex counter and tests if it dropped below zero. If it
      did, then the futex is locked by another thread and the library uses the
      <constant>SYS_FUTEX_SLEEP</constant> syscall to put the thread asleep.
      If the counter decreased to 0, then there was no contention and the
      thread can enter the critical section protected by the futex. When the
      thread later leaves that critical section, it, using library function
      <code>futex_up()</code>, atomically increments the counter. If the
      counter value increased to one, then there again was no contention and
      no action needs to be done. However, if it increased to zero or even a
      smaller number, then there are sleeping threads waiting for the futex to
      become available. In that case, the library has to use the
      <constant>SYS_FUTEX_WAKEUP</constant> syscall to wake one sleeping
      thread.</para>

      <para>So far, futexes are very elegant in that they don't interfere with
      the kernel when there is no contention for them. Another nice aspect of
      futexes is that they don't need to be registered anywhere prior to the
      first kernel intervention.</para>

      <para>Both futex related syscalls, <constant>SYS_FUTEX_SLEEP</constant>
      and <constant>SYS_FUTEX_WAKEUP</constant>, respectivelly, are mere
      wrappers for <code>waitq_sleep_timeout()</code> and
      <code>waitq_sleep_wakeup()</code>, respectively, with some housekeeping
      functionality added. Both syscalls need to translate the userspace
      virtual address of the futex counter to physical address in order to
      support synchronization accross shared memory. Once the physical address
      is known, the kernel checks whether the futexes are already known to it
      by searching the global futex hash table for an item with the physical
      address of the futex counter as a key. When the search is successful, it
      returns an address of the kernel futex structure associated with the
      counter. If the hash table does not contain the key, the kernel creates
      it and inserts it into the hash table. At the same time, the the current
      task's B+tree of known futexes is searched in order to find out if the
      task already uses the futex. If it does, no action is taken. Otherwise
      the reference count of the futex is incremented, provided that the futex
      already existed.</para>
    </section>
  </section>
</chapter>