Subversion Repositories HelenOS-doc


  1. <?xml version="1.0" encoding="UTF-8"?>
  2. <chapter id="sync">
  3.   <?dbhtml filename="sync.html"?>
  4.  
  5.   <title>Synchronization</title>
  6.  
  7.   <section>
  8.     <title>Introduction</title>
  9.  
  10.     <para>The HelenOS operating system is designed to make use of the
  11.     parallelism offered by the hardware and to exploit concurrency of both the
  12.     kernel and userspace tasks. This is achieved through multiprocessor
  13.     support and several levels of multiprogramming such as multitasking,
  14.     multithreading and userspace pseudo threads. However, such a
  15.     highly concurrent environment needs safe and efficient ways to handle
  16.     mutual exclusion and synchronization of many execution flows.</para>
  17.   </section>
  18.  
  19.   <section>
  20.     <title>Active Kernel Primitives</title>
  21.  
  22.     <section>
  23.       <indexterm>
  24.         <primary>synchronization</primary>
  25.  
  26.         <secondary>- spinlock</secondary>
  27.       </indexterm>
  28.  
  29.       <title>Spinlocks</title>
  30.  
  31.       <para>The basic mutual exclusion primitive is the spinlock. The spinlock
  32.       implements active waiting for the availability of a memory lock (i.e.
  33.       simple variable) in a multiprocessor-safe manner. This safety is
  34.       achieved through the use of a specialized, architecture-dependent,
  35.       atomic test-and-set operation which either locks the spinlock (i.e. sets
  36.       the variable) or, provided that it is already locked, leaves it
  37.       unaltered. In any case, the test-and-set operation returns a value, thus
  38.       signalling either success (i.e. zero return value) or failure (i.e.
  39.       non-zero value) in acquiring the lock. Note that this makes a
  40.       fundamental difference between the naive algorithm that doesn't use the
  41.      atomic operation and the spinlock algorithm. While the naive algorithm
  42.      is prone to race conditions on SMP configurations and thus is completely
  43.      SMP-unsafe, the spinlock algorithm eliminates the possibility of race
  44.      conditions and is suitable for mutual exclusion use.</para>
  45.  
  46.      <para>The semantics of the test-and-set operation is that the spinlock
  47.      remains unavailable until this operation called on the respective
  48.      spinlock returns zero. HelenOS builds two functions on top of the
  49.      test-and-set operation. The first function is the unconditional attempt
  50.      to acquire the spinlock and is called <code>spinlock_lock()</code>. It
  51.      simply loops until the test-and-set returns a zero value. The other
  52.      function, <code>spinlock_trylock()</code>, is the conditional lock
  53.      operation and calls the test-and-set only once to find out whether it
  54.      managed to acquire the spinlock or not. The conditional operation is
  55.      useful in situations in which an algorithm cannot acquire more spinlocks
  56.      in the proper order and deadlock could not otherwise be avoided. In such a case,
  57.      the algorithm would detect the danger and instead of possibly
  58.      deadlocking the system it would simply release some spinlocks it already
  59.      holds and retry the whole operation with the hope that it will succeed
  60.      next time. The unlock function, <code>spinlock_unlock()</code>, is quite
  61.      simple - it merely clears the spinlock variable.</para>
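The lock/trylock/unlock trio described above can be sketched in portable C11. This is an illustrative model only, not the HelenOS implementation, whose test-and-set operation and lock type are architecture-specific; the explicit memory orders stand in for the architecture-dependent barrier hooks.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative spinlock; the real HelenOS test-and-set is
 * architecture-dependent. */
typedef struct {
    atomic_flag flag;
} spinlock_t;

void spinlock_initialize(spinlock_t *sl)
{
    atomic_flag_clear(&sl->flag);
}

/* Conditional lock: one test-and-set; returns true on success.
 * (The text's test-and-set returns zero on success; C11's
 * atomic_flag_test_and_set() returns the previous value.) */
bool spinlock_trylock(spinlock_t *sl)
{
    return !atomic_flag_test_and_set_explicit(&sl->flag,
        memory_order_acquire);
}

/* Unconditional lock: loop until the test-and-set succeeds. */
void spinlock_lock(spinlock_t *sl)
{
    while (atomic_flag_test_and_set_explicit(&sl->flag,
        memory_order_acquire))
        ;  /* active waiting */
}

/* Unlock merely clears the lock variable; the release order
 * keeps critical-section stores from permeating out. */
void spinlock_unlock(spinlock_t *sl)
{
    atomic_flag_clear_explicit(&sl->flag, memory_order_release);
}
```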
  62.  
  63.      <para>Nevertheless, there is a special issue related to hardware
  64.      optimizations that modern processors implement. Particularly problematic
  65.      is the out-of-order execution of instructions within the critical
  66.      section protected by a spinlock. The processors are always
  67.      self-consistent so that they can carry out speculatively executed
  68.      instructions in the right order with regard to dependencies among those
  69.      instructions. However, the dependency between instructions inside the
  70.      critical section and those that implement locking and unlocking of the
  71.      respective spinlock is not implicit on some processor architectures. As
  72.      a result, the processor needs to be explicitly told about each
  73.      occurrence of such a dependency. Therefore, HelenOS adds
  74.      architecture-specific hooks to all <code>spinlock_lock()</code>,
  75.      <code>spinlock_trylock()</code> and <code>spinlock_unlock()</code>
  76.      functions to prevent the instructions inside the critical section from
  77.      permeating out. On some architectures, these hooks can be void because
  78.      the dependencies are implicit, thanks to the special properties
  79.      of locking and unlocking instructions. However, other architectures need
  80.      to instrument these hooks with different memory barriers, depending on
  81.      what operations could permeate out.</para>
  82.  
  83.      <para>Spinlocks have one significant drawback: when held for longer time
  84.      periods, they harm both parallelism and concurrency. The processor
  85.      executing <code>spinlock_lock()</code> does not do any fruitful work and
  86.      is effectively halted until it can grab the lock and proceed.
  87.      Similarly, other execution flows cannot execute on the processor that
  88.      holds the spinlock because the kernel disables preemption on that
  89.      processor when a spinlock is held. The reason for disabling
  90.      preemption is to avoid the priority inversion problem. For the same reason,
  91.      threads are strongly discouraged from sleeping when they hold a
  92.      spinlock.</para>
  93.  
  94.      <para>To summarize, spinlocks represent a very simple yet essential mutual
  95.      exclusion primitive for SMP systems. On the other hand, spinlocks scale
  96.      poorly because of the active loop they are based on. Therefore,
  97.      spinlocks are used in HelenOS only for short-time mutual exclusion and
  98.      in cases where the mutual exclusion is required out of thread context.
  99.      Lastly, spinlocks are used in the construction of passive
  100.      synchronization primitives.</para>
  101.    </section>
  102.  </section>
  103.  
  104.  <section>
  105.    <title>Passive Kernel Synchronization</title>
  106.  
  107.    <section>
  108.      <indexterm>
  109.        <primary>synchronization</primary>
  110.  
  111.        <secondary>- wait queue</secondary>
  112.      </indexterm>
  113.  
  114.      <title>Wait Queues</title>
  115.  
  116.      <para>A wait queue is the basic passive synchronization primitive on
  117.      which all other passive synchronization primitives are built. Simply
  118.      put, it allows a thread to sleep until an event associated with the
  119.      particular wait queue occurs. Multiple threads are notified about
  120.      incoming events in a first come, first served fashion. Moreover, should
  121.      the event come before any thread waits for it, it is recorded in the
  122.      wait queue as a missed wakeup and later forwarded to the first thread
  123.      that decides to wait in the queue. The inner structures of the wait
  124.      queue are protected by a spinlock.</para>
  125.  
  126.      <para>The thread that wants to wait for a wait queue event uses the
  127.      <code>waitq_sleep_timeout()</code> function. The algorithm then checks
  128.      the wait queue's counter of missed wakeups and if there are any missed
  129.       wakeups, the call returns immediately. The call also returns immediately
  130.       if only a conditional wait was requested. Otherwise the thread is
  131.       enqueued in the wait queue's list of sleeping threads and its state is
  132.      changed to <constant>Sleeping</constant>. It then sleeps until one of
  133.      the following events happens:</para>
  134.  
  135.      <orderedlist>
  136.        <listitem>
  137.          <para>another thread calls <code>waitq_wakeup()</code> and the
  138.          thread is the first thread in the wait queue's list of sleeping
  139.           threads;</para>
  140.         </listitem>
  141.  
  142.         <listitem>
  143.           <para>another thread calls <code>waitq_interrupt_sleep()</code> on
  144.           the sleeping thread;</para>
  145.         </listitem>
  146.  
  147.         <listitem>
  148.           <para>the sleep times out, provided that none of the previous
  149.           events occurred within a specified time limit; the limit can be
  150.           infinity.</para>
  151.         </listitem>
  152.       </orderedlist>
  153.  
  154.       <para>All five possibilities (immediate return on success, immediate
  155.       return on failure, wakeup after sleep, interruption and timeout) are
  156.       distinguishable by the return value of
  157.       <code>waitq_sleep_timeout()</code>. Being able to interrupt a sleeping
  158.       thread is essential for externally initiated thread termination. The
  159.       ability to wait only for a certain amount of time is used, for instance,
  160.       to passively delay thread execution by several microseconds or even
  161.       seconds in the <code>thread_sleep()</code> function. Because all other
  162.       passive kernel synchronization primitives are based on wait queues,
  163.       they also have the option of being interrupted and, more importantly,
  164.       can time out. All of them also implement the conditional operation.
  165.       Furthermore, this very fundamental interface reaches up to the
  166.       implementation of futexes - the userspace synchronization primitive -
  167.       which makes it possible for a userspace thread to request a
  168.       synchronization operation with a timeout or a conditional operation.</para>
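The entry logic that yields the immediate-return outcomes can be modeled in a few lines of C. The outcome names and fields below are illustrative; the real <code>waitq_sleep_timeout()</code> also enqueues the calling thread, suspends it and later reports the remaining outcomes.

```c
#include <stdbool.h>

/* Single-threaded model of the entry logic of waitq_sleep_timeout();
 * identifiers and the blocking path are illustrative only. */
typedef enum {
    OUTCOME_OK_ATOMIC,     /* missed wakeup consumed, immediate success */
    OUTCOME_WOULD_BLOCK,   /* conditional wait, immediate failure */
    OUTCOME_OK_BLOCKED,    /* woken by waitq_wakeup() */
    OUTCOME_INTERRUPTED,   /* woken by waitq_interrupt_sleep() */
    OUTCOME_TIMEOUT        /* the time limit expired */
} sleep_outcome_t;

typedef struct {
    int missed_wakeups;    /* events that arrived with no waiter */
} waitq_t;

/* A wakeup with no sleeper is recorded, not lost. */
void waitq_wakeup_model(waitq_t *wq)
{
    wq->missed_wakeups++;
}

sleep_outcome_t waitq_sleep_entry(waitq_t *wq, bool conditional)
{
    if (wq->missed_wakeups > 0) {
        wq->missed_wakeups--;
        return OUTCOME_OK_ATOMIC;
    }
    if (conditional)
        return OUTCOME_WOULD_BLOCK;
    /* A real implementation would now enqueue the thread, put it
     * to sleep and eventually return one of the other outcomes. */
    return OUTCOME_TIMEOUT;
}
```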
  169.  
  170.       <para>From the description above, it should be apparent that when a
  171.       sleeping thread is woken by <code>waitq_wakeup()</code> or when
  172.       <code>waitq_sleep_timeout()</code> succeeds immediately, the thread can
  173.       be sure that the event has occurred. The thread need not and should not
  174.       verify this fact. This approach is called direct hand-off and is
  175.       characteristic of all passive HelenOS synchronization primitives, with
  176.       the exception described below.</para>
  177.     </section>
  178.  
  179.     <section>
  180.       <indexterm>
  181.         <primary>synchronization</primary>
  182.  
  183.         <secondary>- semaphore</secondary>
  184.       </indexterm>
  185.  
  186.       <title>Semaphores</title>
  187.  
  188.       <para>The interesting point about wait queues is that the number of
  189.       missed wakeups is equal to the number of threads that will not block in
  190.       <code>waitq_sleep_timeout()</code> and will immediately succeed
  191.       instead. On the other hand, semaphores are synchronization primitives
  192.       that let a predefined number of threads into their critical section
  193.       and block any threads above this count. These two cases are in fact
  194.       exactly the same. Semaphores in HelenOS are therefore implemented as
  195.       wait queues with a single semantic change: their wait queue is
  196.       initialized with as many missed wakeups as the number of threads
  197.       that the semaphore intends to let into its critical section
  198.       simultaneously.</para>
  199.  
  200.       <para>In the semaphore language, the wait queue operation
  201.       <code>waitq_sleep_timeout()</code> corresponds to the semaphore
  202.       <code>down</code> operation, represented by the function
  203.       <code>semaphore_down_timeout()</code>. Analogously, the wait queue
  204.       operation <code>waitq_wakeup()</code> corresponds to the semaphore
  205.       <code>up</code> operation, represented by <code>semaphore_up()</code>. The
  206.       conditional down operation is called
  207.       <code>semaphore_trydown()</code>.</para>
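The wait-queue/semaphore correspondence can be condensed into a single-threaded sketch, assuming a simplified wait queue reduced to its missed-wakeup counter; a blocking down appears only as a failed conditional attempt, since no real threads sleep here.

```c
#include <stdbool.h>

/* Single-threaded sketch of the wait-queue/semaphore
 * correspondence described above. */
typedef struct {
    int missed_wakeups;        /* the wait queue's counter */
} waitq_t;

typedef struct {
    waitq_t wq;
} semaphore_t;

/* Pre-load as many missed wakeups as threads the semaphore
 * should admit into the critical section simultaneously. */
void semaphore_initialize(semaphore_t *s, int value)
{
    s->wq.missed_wakeups = value;
}

/* semaphore_trydown(): succeed iff a missed wakeup is available. */
bool semaphore_trydown(semaphore_t *s)
{
    if (s->wq.missed_wakeups > 0) {
        s->wq.missed_wakeups--;
        return true;
    }
    return false;              /* a blocking down would sleep here */
}

/* semaphore_up(): waitq_wakeup() either wakes a sleeper or
 * records a missed wakeup. */
void semaphore_up(semaphore_t *s)
{
    s->wq.missed_wakeups++;
}
```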
  208.     </section>
  209.  
  210.     <section>
  211.       <title>Mutexes</title>
  212.  
  213.       <indexterm>
  214.         <primary>synchronization</primary>
  215.  
  216.         <secondary>- mutex</secondary>
  217.       </indexterm>
  218.  
  219.       <para>Mutexes are sometimes referred to as binary semaphores. That means
  220.       that mutexes are like semaphores that allow only one thread into their
  221.       critical section. Indeed, mutexes in HelenOS are implemented exactly in
  222.       this way: they are built on top of semaphores. From another point of
  223.       view, they can be viewed as spinlocks without busy waiting. Their
  224.       semaphore heritage provides a good basis for both the conditional
  225.       operation and the operation with a timeout. The locking operation is called
  226.       <code>mutex_lock()</code>, the conditional locking operation is called
  227.       <code>mutex_trylock()</code> and the unlocking operation is called
  228.       <code>mutex_unlock()</code>.</para>
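The "semaphore admitting exactly one thread" construction can be sketched as follows; the plain-counter <code>semaphore_t</code> is a stand-in for the real wait-queue-based semaphore, and the sketch is single-threaded for illustration.

```c
#include <stdbool.h>

/* Sketch: a mutex as a binary semaphore. */
typedef struct {
    int count;                 /* simplified semaphore */
} semaphore_t;

typedef struct {
    semaphore_t sem;
} mutex_t;

void mutex_initialize(mutex_t *mtx)
{
    mtx->sem.count = 1;        /* admit exactly one thread */
}

/* mutex_trylock() maps to semaphore_trydown(). */
bool mutex_trylock(mutex_t *mtx)
{
    if (mtx->sem.count > 0) {
        mtx->sem.count--;
        return true;
    }
    return false;
}

/* mutex_unlock() maps to semaphore_up(). */
void mutex_unlock(mutex_t *mtx)
{
    mtx->sem.count++;
}
```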
  229.     </section>
  230.  
  231.     <section>
  232.       <title>Reader/Writer Locks</title>
  233.  
  234.       <indexterm>
  235.         <primary>synchronization</primary>
  236.  
  237.         <secondary>- read/write lock</secondary>
  238.       </indexterm>
  239.  
  240.       <para>Reader/writer locks, or rwlocks, are by far the most complicated
  241.       synchronization primitive within the kernel. The goal of these locks is
  242.       to improve concurrency of applications, in which threads need to
  243.       synchronize access to a shared resource, and that access can be
  244.       partitioned into a read-only mode and a write mode. Reader/writer locks
  245.       should make it possible for several, possibly many, readers to be in
  246.       the critical section at the same time, provided that no writer is
  247.       currently in the critical section. Writers are
  248.       allowed to enter the critical section only individually, provided that
  249.       no reader is in the critical section already. Applications, in which the
  250.       majority of operations can be done in the read-only mode, can benefit
  251.       from increased concurrency created by reader/writer locks.</para>
  252.  
  253.       <para>During reader/writer lock construction, a decision should be made
  254.       whether readers will be preferred over writers or whether writers will
  255.       be preferred over readers in cases when the lock is not currently held
  256.       and both a reader and a writer want to gain the lock. Some operating
  257.       systems prefer one group over the other, thus creating a possibility of
  258.       starving the unpreferred group. In the HelenOS operating system,
  259.       neither group is preferred. The lock is granted on a first come, first
  260.       served basis with the additional note that readers are granted the lock
  261.       in the largest possible batches.</para>
  262.  
  263.       <para>With this policy and the timeout modes of operation, the direct
  264.       hand-off becomes much more complicated. For instance, a writer leaving
  265.       the critical section must wake up all leading readers in the rwlock's
  266.      wait queue, or one leading writer, or no one if no thread is waiting.
  267.      Similarly, the last reader leaving the critical section must wake up the
  268.      sleeping writer, if there are any sleeping threads left at all. As
  269.      another example, if a writer at the beginning of the rwlock's wait queue
  270.       times out and the lock is held by at least one reader, the writer which
  271.       has timed out must first wake up all readers that follow it in the
  272.       queue prior to signalling the timeout itself and giving up.</para>
  273.  
  274.       <para>Due to the issues mentioned in the previous paragraph, the
  275.       reader/writer lock implementation needs to walk the rwlock wait queue's
  276.      list of sleeping threads directly, in order to find out the type of
  277.      access that the queueing threads demand. This makes the code difficult
  278.      to understand and dependent on the internal implementation of the wait
  279.      queue. Nevertheless, it remains unclear to the authors of HelenOS
  280.      whether a simpler but equivalently fair solution exists.</para>
  281.  
  282.      <para>As already mentioned, the implementation of rwlocks makes
  283.      use of a single wait queue for both readers and writers, thus avoiding
  284.      any possibility of starvation. In fact, rwlocks use a mutex rather than
  285.      a bare wait queue. This mutex is called <emphasis>exclusive</emphasis>
  286.      and is used to synchronize writers. The writer's lock operation,
  287.       <code>rwlock_write_lock_timeout()</code>, simply tries to acquire the
  288.       exclusive mutex. If it succeeds, the writer is granted the rwlock.
  289.       However, if the operation fails (e.g. times out), the writer must check
  290.       for potential readers at the head of the list of sleeping threads
  291.       associated with the mutex's wait queue and then proceed according to the
  292.      procedure outlined above.</para>
  293.  
  294.      <para>The exclusive mutex plays an important role in reader
  295.      synchronization as well. However, a reader doing the reader's lock
  296.       operation, <code>rwlock_read_lock_timeout()</code>, may bypass this
  297.       mutex when it detects that:</para>
  298.  
  299.       <orderedlist>
  300.         <listitem>
  301.           <para>there are other readers in the critical section and</para>
  302.         </listitem>
  303.  
  304.         <listitem>
  305.           <para>there are no sleeping threads waiting for the exclusive
  306.           mutex.</para>
  307.         </listitem>
  308.       </orderedlist>
  309.  
  310.       <para>If both conditions are true, the reader will bypass the mutex,
  311.       increment the number of readers in the critical section and then enter
  312.       the critical section. Note that if there are any sleeping threads at the
  313.       beginning of the wait queue, the first must be a writer. If the
  314.       conditions are not fulfilled, the reader normally waits until the
  315.       exclusive mutex is granted to it.</para>
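The reader's fast path reduces to a two-condition check. The structure and names below are assumptions made for illustration, not the HelenOS data layout; in the real kernel the check and the increment happen under the rwlock's internal spinlock.

```c
#include <stdbool.h>

/* Illustrative model of the reader bypass in
 * rwlock_read_lock_timeout(); field names are hypothetical. */
typedef struct {
    int readers_in;   /* readers inside the critical section */
    int sleepers;     /* threads sleeping on the exclusive mutex */
} rwlock_model_t;

/* A reader may bypass the exclusive mutex only when both
 * conditions from the text hold. */
bool rwlock_reader_can_bypass(const rwlock_model_t *rwl)
{
    return rwl->readers_in > 0 && rwl->sleepers == 0;
}

bool rwlock_try_read_fastpath(rwlock_model_t *rwl)
{
    if (!rwlock_reader_can_bypass(rwl))
        return false;   /* wait for the exclusive mutex instead */
    rwl->readers_in++;  /* enter the critical section */
    return true;
}
```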
  316.     </section>
  317.  
  318.     <section>
  319.       <title>Condition Variables</title>
  320.  
  321.       <indexterm>
  322.         <primary>synchronization</primary>
  323.  
  324.         <secondary>- condition variable</secondary>
  325.       </indexterm>
  326.  
  327.       <para>Condition variables can be used for waiting until a condition
  328.       becomes true. In this respect, they are similar to wait queues. But
  329.       contrary to wait queues, condition variables have different semantics
  330.       that allow events to be lost when there is no thread waiting for them.
  331.       In order to support this, condition variables don't use direct hand-off
  332.      and operate in a way similar to the example below. A thread waiting for
  333.      the condition to become true does the following:</para>
  334.  
  335.      <example>
  336.        <title>Use of <code>condvar_wait_timeout()</code>.</title>
  337.  
  338.        <programlisting language="C"><function>mutex_lock</function>(<varname>mtx</varname>);
  339. while (!<varname>condition</varname>)
  340.        <function>condvar_wait_timeout</function>(<varname>cv</varname>, <varname>mtx</varname>); /* <remark>the condition is true, do something</remark> */
  341. <function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting>
  342.      </example>
  343.  
  344.      <para>A thread that causes the condition to become true signals this event
  345.      like this:</para>
  346.  
  347.      <example>
  348.        <title>Use of <code>condvar_signal</code>.</title>
  349.  
  350.        <programlisting language="C"><function>mutex_lock</function>(<varname>mtx</varname>);
  351. <varname>condition</varname> = <constant>true</constant>;
  352. <function>condvar_signal</function>(<varname>cv</varname>);  /* <remark>condvar_broadcast(cv);</remark> */
  353. <function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting>
  354.      </example>
  355.  
  356.      <para>The wait operation, <code>condvar_wait_timeout()</code>, always
  357.      puts the calling thread to sleep. The thread then sleeps until another
  358.      thread invokes <code>condvar_broadcast()</code> on the same condition
  359.      variable or until it is woken up by <code>condvar_signal()</code>. The
  360.      <code>condvar_signal()</code> operation unblocks the first thread
  361.      blocking on the condition variable while the
  362.      <code>condvar_broadcast()</code> operation unblocks all threads blocking
  363.      there. If there are no blocking threads, these two operations have no
  364.      effect.</para>
  365.  
  366.      <para>Note that the threads must synchronize over a dedicated mutex. To
  367.      prevent a race condition between <code>condvar_wait_timeout()</code> and
  368.      <code>condvar_signal()</code> or <code>condvar_broadcast()</code>, the
  369.      mutex is passed to <code>condvar_wait_timeout()</code> which then
  370.      atomically puts the calling thread to sleep and unlocks the mutex. When
  371.      the thread eventually wakes up, <code>condvar_wait()</code> regains the
  372.      mutex and returns.</para>
  373.  
  374.      <para>Also note that there is no conditional operation for condition
  375.      variables. Such an operation would make no sense since condition
  376.      variables are defined to forget events for which there is no waiting
  377.      thread and because <code>condvar_wait()</code> must always go to sleep.
  378.      The operation with a timeout is supported as usual.</para>
  379.  
  380.      <para>In HelenOS, condition variables are based on wait queues. As
  381.      already mentioned above, wait queues remember missed events while
  382.      condition variables must not do so. The reason is that
  383.      condition variables are designed for scenarios in which an event might
  384.      occur many times without being picked up by any waiting thread. On
  385.      the other hand, wait queues would remember any event that had not been
  386.      picked up by a call to <code>waitq_sleep_timeout()</code>. Therefore, if
  387.      wait queues were used directly and without any changes to implement
  388.      condition variables, the missed_wakeup counter would hurt performance of
  389.      the implementation: the <code>while</code> loop in
  390.      <code>condvar_wait_timeout()</code> would effectively do busy waiting
  391.      until all missed wakeups were discarded.</para>
  392.  
  393.      <para>The requirement on the wait operation to atomically put the caller
  394.      to sleep and release the mutex poses an interesting problem for
  395.      <code>condvar_wait_timeout()</code>. More precisely, the thread should
  396.      sleep in the condvar's wait queue prior to releasing the mutex, but it
  397.       must not hold the mutex when it is sleeping.</para>
  398.  
  399.       <para>Problems described in the two previous paragraphs are addressed in
  400.       HelenOS by dividing the <code>waitq_sleep_timeout()</code> function into
  401.       three pieces:</para>
  402.  
  403.       <orderedlist>
  404.         <listitem>
  405.           <para><code>waitq_sleep_prepare()</code> prepares the thread to go
  406.           to sleep by, among other things, locking the wait queue;</para>
  407.         </listitem>
  408.  
  409.         <listitem>
  410.           <para><code>waitq_sleep_timeout_unsafe()</code> implements the core
  411.           blocking logic;</para>
  412.         </listitem>
  413.  
  414.         <listitem>
  415.           <para><code>waitq_sleep_finish()</code> performs cleanup after
  416.           <code>waitq_sleep_timeout_unsafe()</code>; after this call, the wait
  417.           queue spinlock is guaranteed to be unlocked by the caller.</para>
  418.         </listitem>
  419.       </orderedlist>
  420.  
  421.       <para>The stock <code>waitq_sleep_timeout()</code> is then a mere
  422.       wrapper that calls these three functions. It is provided for convenience
  423.       in cases where the caller doesn't require such low-level control.
  424.      However, the implementation of <code>condvar_wait_timeout()</code> does
  425.      need this finer-grained control because it has to interleave calls to
  426.      these functions with other actions. It carries out its operations in the
  427.      following order:</para>
  428.  
  429.      <orderedlist>
  430.        <listitem>
  431.          <para>calls <code>waitq_sleep_prepare()</code> in order to lock the
  432.          condition variable's wait queue,</para>
  433.         </listitem>
  434.  
  435.         <listitem>
  436.           <para>releases the mutex,</para>
  437.         </listitem>
  438.  
  439.         <listitem>
  440.           <para>clears the counter of missed wakeups,</para>
  441.         </listitem>
  442.  
  443.         <listitem>
  444.           <para>calls <code>waitq_sleep_timeout_unsafe()</code>,</para>
  445.         </listitem>
  446.  
  447.         <listitem>
  448.           <para>retakes the mutex,</para>
  449.         </listitem>
  450.  
  451.         <listitem>
  452.           <para>calls <code>waitq_sleep_finish()</code>.</para>
  453.         </listitem>
  454.       </orderedlist>
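The six steps above can be sketched with stubs that merely record the order of calls; the stub bodies are placeholders standing in for the real kernel and mutex operations, not their implementations.

```c
#include <string.h>

/* Order-of-operations sketch of condvar_wait_timeout(); each
 * stub only records its position in the sequence. */
static const char *log_buf[8];
static int log_len = 0;
static void record(const char *step) { log_buf[log_len++] = step; }

static void waitq_sleep_prepare(void)        { record("prepare"); }
static void mutex_unlock(void)               { record("unlock"); }
static void clear_missed_wakeups(void)       { record("clear"); }
static void waitq_sleep_timeout_unsafe(void) { record("sleep"); }
static void mutex_lock(void)                 { record("lock"); }
static void waitq_sleep_finish(void)         { record("finish"); }

/* The interleaving described in the text. */
void condvar_wait_timeout_sketch(void)
{
    waitq_sleep_prepare();        /* lock the condvar's wait queue */
    mutex_unlock();               /* release the caller's mutex */
    clear_missed_wakeups();       /* condvars must forget events */
    waitq_sleep_timeout_unsafe(); /* core blocking logic */
    mutex_lock();                 /* retake the mutex */
    waitq_sleep_finish();         /* clean up, unlock the wait queue */
}
```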
  455.     </section>
  456.   </section>
  457.  
  458.   <section>
  459.     <title>Userspace Synchronization</title>
  460.  
  461.     <section>
  462.       <title>Futexes</title>
  463.  
  464.       <indexterm>
  465.         <primary>synchronization</primary>
  466.  
  467.         <secondary>- futex</secondary>
  468.       </indexterm>
  469.  
  470.       <para>In a multithreaded environment, userspace threads need an
  471.       efficient way to synchronize. HelenOS borrows an idea from Linux<xref
  472.       linkend="futex" /> to implement a lightweight userspace synchronization
  473.       and mutual exclusion primitive called a futex. The key idea behind futexes
  474.       is that non-contended synchronization is very fast and takes place only
  475.       in userspace without the kernel's intervention. When more threads contend
  476.      for a futex, only one of them wins; the other threads go to sleep via a
  477.      dedicated syscall.</para>
  478.  
  479.      <para>The userspace part of the futex is a mere integer variable, a
  480.      counter, that can be atomically incremented or decremented. The kernel
  481.      part is rather more complicated. For each userspace futex counter, there
  482.      is a kernel structure describing the futex. This structure
  483.      contains:</para>
  484.  
  485.      <itemizedlist>
  486.        <listitem>
  487.          <para>number of references,</para>
  488.        </listitem>
  489.  
  490.        <listitem>
  491.          <para>physical address of the userspace futex counter,</para>
  492.        </listitem>
  493.  
  494.        <listitem>
  495.          <para>hash table link and</para>
  496.        </listitem>
  497.  
  498.        <listitem>
  499.          <para>a wait queue.</para>
  500.        </listitem>
  501.      </itemizedlist>
  502.  
  503.      <para>The reference count helps to find out when the futex is no longer
  504.      needed and can be deallocated. The physical address is used as a key for
  505.      the global futex hash table. Note that the kernel has to use the physical
  506.      address to identify the futex because one futex can be used for
  507.      synchronization among different address spaces and can have different
  508.      virtual addresses in each of them. Finally, the kernel futex structure
  509.      includes a wait queue. The wait queue is used to put threads that didn't
  510.       win the futex to sleep until the winner wakes one of them up.</para>
  511.  
  512.       <para>A futex should be initialized by setting its userspace counter to
  513.       one before it is used. When locking the futex via the userspace
  514.       library function <code>futex_down_timeout()</code>, the library code
  515.       atomically decrements the futex counter and tests whether it dropped
  516.       below zero. If it did, then the futex is locked by another thread and
  517.       the library uses the <constant>SYS_FUTEX_SLEEP</constant> syscall to
  518.       put the thread to sleep. If the counter decreased to 0, then there was
  519.       no contention and the thread can enter the critical section protected
  520.       by the futex. When the thread later leaves that critical section, it
  521.       atomically increments the counter using <code>futex_up()</code>. If the
  522.       counter value increased to one, then there again was no contention and
  523.       no action needs to be done. However, if it increased to zero or even a
  524.       smaller number, then there are sleeping threads waiting for the futex to
  525.       become available. In that case, the library has to use the
  526.       <constant>SYS_FUTEX_WAKEUP</constant> syscall to wake one sleeping
  527.       thread.</para>
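The counter protocol above can be modeled in a few lines of C11. The syscall wrappers here are stubs that only count how often the kernel would be entered; the atomic operations, not the real HelenOS library, carry the illustration.

```c
#include <stdatomic.h>

/* Illustrative model of the userspace futex counter protocol. */
typedef struct {
    atomic_int count;          /* 1 = free */
} futex_t;

static int sleep_calls = 0;    /* would-be SYS_FUTEX_SLEEP calls */
static int wakeup_calls = 0;   /* would-be SYS_FUTEX_WAKEUP calls */

static void sys_futex_sleep(futex_t *f)  { (void) f; sleep_calls++; }
static void sys_futex_wakeup(futex_t *f) { (void) f; wakeup_calls++; }

void futex_initialize(futex_t *f)
{
    atomic_init(&f->count, 1);
}

/* Down: atomically decrement; if the counter dropped below zero,
 * the futex is held and the caller must sleep in the kernel. */
void futex_down(futex_t *f)
{
    if (atomic_fetch_sub(&f->count, 1) - 1 < 0)
        sys_futex_sleep(f);
}

/* Up: atomically increment; if the counter is still not positive,
 * somebody is sleeping on the futex and must be woken up. */
void futex_up(futex_t *f)
{
    if (atomic_fetch_add(&f->count, 1) + 1 <= 0)
        sys_futex_wakeup(f);
}
```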
  528.  
  529.       <para>So far, futexes are very elegant in that they don't interfere with
  530.      the kernel when there is no contention for them. Another nice aspect of
  531.      futexes is that they don't need to be registered anywhere prior to the
  532.       first kernel intervention.</para>
  533.  
  534.       <para>Both futex-related syscalls, <constant>SYS_FUTEX_SLEEP</constant>
  535.       and <constant>SYS_FUTEX_WAKEUP</constant>, are mere
  536.       wrappers for <code>waitq_sleep_timeout()</code> and
  537.       <code>waitq_wakeup()</code>, respectively, with some housekeeping
  538.       functionality added. Both syscalls need to translate the userspace
  539.       virtual address of the futex counter to a physical address in order to
  540.       support synchronization across shared memory. Once the physical address
  541.       is known, the kernel checks whether the futexes are already known to it
  542.       by searching the global futex hash table for an item with the physical
  543.       address of the futex counter as a key. When the search is successful, it
  544.       returns an address of the kernel futex structure associated with the
  545.       counter. If the hash table does not contain the key, the kernel
  546.       creates a new futex structure and inserts it into the hash table. At
  547.       the same time, the current task's B+tree of known futexes is searched
  548.      in order to find out if the task already uses the futex. If it does,
  549.      no action is taken. Otherwise, the reference count of the futex is
  550.      incremented, provided that the futex already existed.</para>
  551.    </section>
  552.  </section>
  553. </chapter>