<?xml version="1.0" encoding="UTF-8"?>
<chapter id="sync">
<?dbhtml filename="sync.html"?>
<title>Mutual exclusion and synchronization</title>
<section>
<title>Introduction</title>
<para>The HelenOS operating system is designed to make use of the
parallelism offered by the hardware and to exploit concurrency of both the
kernel and userspace tasks. This is achieved through multiprocessor
support and several levels of multiprogramming such as multitasking,
multithreading and also through userspace pseudo threads. However, such a
highly concurrent environment needs safe and efficient ways to handle
mutual exclusion and synchronization of many execution flows.</para>
</section>
<section>
<title>Active kernel primitives</title>
<section>
<title>Spinlocks</title>
<para>The basic mutual exclusion primitive is the spinlock. The spinlock
implements active waiting for the availability of a memory lock (i.e.
simple variable) in a multiprocessor-safe manner. This safety is
achieved through the use of a specialized, architecture-dependent,
atomic test-and-set operation which either locks the spinlock (i.e. sets
the variable) or, provided that it is already locked, leaves it
unaltered. In any case, the test-and-set operation returns a value, thus
signalling either success (i.e. zero return value) or failure (i.e.
non-zero value) in acquiring the lock. Note that this makes a
fundamental difference between the naive algorithm that doesn't use the
atomic operation and the spinlock algortihm. While the naive algorithm
is prone to race conditions on SMP configurations and thus is completely
SMP-unsafe, the spinlock algorithm eliminates the possibility of race
conditions and is suitable for mutual exclusion use.</para>
<para>The semantics of the test-and-set operation is that the spinlock
remains unavailable until this operation called on the respective
spinlock returns zero. HelenOS builds two functions on top of the
test-and-set operation. The first function is the unconditional attempt
to acquire the spinlock and is called <code>spinlock_lock</code>. It
simply loops until the test-and-set returns a zero value. The other
function, <code>spinlock_trylock</code>, is the conditional lock
operation and calls the test-and-set only once to find out whether it
managed to acquire the spinlock or not. The conditional operation is
useful in situations in which an algorithm cannot acquire more spinlocks
in the proper order and a deadlock cannot be avoided. In such a case,
the algorithm would detect the danger and instead of possibly
deadlocking the system it would simply release some spinlocks it already
holds and retry the whole operation with the hope that it will succeed
next time. The unlock function, <code>spinlock_unlock</code>, is quite
easy - it merely clears the spinlock variable.</para>
<para>Nevertheless, there is a special issue related to hardware
optimizations that modern processors implement. Particularly problematic
is the out-of-order execution of instructions within the critical
section protected by a spinlock. The processors are always
self-consistent so that they can carry out speculatively executed
instructions in the right order with regard to dependencies among those
instructions. However, the dependency between instructions inside the
critical section and those that implement locking and unlocking of the
respective spinlock is not implicit on some processor architectures. As
a result, the processor needs to be explicitly told about each
occurrence of such a dependency. Therefore, HelenOS adds
architecture-specific hooks to all <code>spinlock_lock</code>,
<code>spinlock_trylock</code> and <code>spinlock_unlock</code> functions
to prevent the instructions inside the critical section from permeating
out. On some architectures, these hooks can be void because the
dependencies are implicitly there because of the special properties of
locking and unlocking instructions. However, other architectures need to
instrument these hooks with different memory barriers, depending on what
operations could permeate out.</para>
<para>Spinlocks have one significant drawback: when held for longer time
periods, they harm both parallelism and concurrency. The processor
executing <code>spinlock_lock</code> does not do any fruitful work and
is effectively halted until it can grab the lock and proceed.
Similarily, other execution flows cannot execute on the processor that
holds the spinlock because the kernel disables preemption on that
processor when a spinlock is held. The reason behind disabling
preemption is priority inversion problem avoidance. For the same reason,
threads are strongly discouraged from sleeping when they hold a
spinlock.</para>
<para>To summarize, spinlocks represent very simple and essential mutual
exclusion primitive for SMP systems. On the other hand, spinlocks scale
poorly because of the active loop they are based on. Therefore,
spinlocks are used in HelenOS only for short-time mutual exclusion and
in cases where the mutual exclusion is required out of thread context.
Lastly, spinlocks are used in the construction of passive
synchronization primitives.</para>
</section>
</section>
<section>
<title>Passive kernel synchronization</title>
<section>
<title>Wait queues</title>
<para>A wait queue is the basic passive synchronization primitive on
which all other passive synchronization primitives are built. Simply
put, it allows a thread to sleep until an event associated with the
particular wait queue occurs. Multiple threads are notified about
incoming events in a first come, first served fashion. Moreover, should
the event come before any thread waits for it, it is recorded in the
wait queue as a missed wakeup and later forwarded to the first thread
that decides to wait in the queue. The inner structures of the wait
queue are protected by a spinlock.</para>
<para>The thread that wants to wait for a wait queue event uses the
<code>waitq_sleep_timeout</code> function. The algorithm then checks the
wait queue's counter of missed wakeups and if there are any missed
wakeups, the call returns immediately. The call also returns immediately
if only a conditional wait was requested. Otherwise the thread is
enqueued in the wait queue's list of sleeping threads and its state is
changed to <constant>Sleeping</constant>. It then sleeps until one of
the following events happens:</para>
<orderedlist>
<listitem>
<para>another thread calls <code>waitq_wakeup</code> and the thread
is the first thread in the wait queue's list of sleeping
threads;</para>
</listitem>
<listitem>
<para>another thread calls <code>waitq_interrupt_sleep</code> on the
sleeping thread;</para>
</listitem>
<listitem>
<para>the sleep times out provided that none of the previous
occurred within a specified time limit; the limit can be
infinity.</para>
</listitem>
</orderedlist>
<para>All five possibilities (immediate return on success, immediate
return on failure, wakeup after sleep, interruption and timeout) are
distinguishable by the return value of <code>waitq_sleep_timeout</code>.
Being able to interrupt a sleeping thread is essential for externally
initiated thread termination. The ability to wait only for a certain
amount of time is used, for instance, to passively delay thread
execution by several microseconds or even seconds in
<code>thread_sleep</code> function. Due to the fact that all other
passive kernel synchronization primitives are based on wait queues, they
also have the option of being interrutped and, more importantly, can
timeout. All of them also implement the conditional operation.
Furthemore, this very fundamental interface reaches up to the
implementation of futexes - userspace synchronization primitive, which
makes it possible for a userspace thread to request a synchronization
operation with a timeout or a conditional operation.</para>
<para>From the description above, it should be apparent, that when a
sleeping thread is woken by <code>waitq_wakeup</code> or when
<code>waitq_sleep_timeout</code> succeeds immediately, the thread can be
sure that the event has occurred. The thread need not and should not
verify this fact. This approach is called direct hand-off and is
characteristic for all passive HelenOS synchronization primitives, with
the exception as described below.</para>
</section>
<section>
<title>Semaphores</title>
<para>The interesting point about wait queues is that the number of
missed wakeups is equal to the number of threads that will not block in
<code>watiq_sleep_timeout</code> and would immediately succeed instead.
On the other hand, semaphores are synchronization primitives that will
let predefined amount of threads into their critical section and block
any other threads above this count. However, these two cases are exactly
the same. Semaphores in HelenOS are therefore implemented as wait queues
with a single semantic change: their wait queue is initialized to have
so many missed wakeups as is the number of threads that the semphore
intends to let into its critical section simultaneously.</para>
<para>In the semaphore language, the wait queue operation
<code>waitq_sleep_timeout</code> corresponds to semaphore
<code>down</code> operation, represented by the function
<code>semaphore_down_timeout</code> and by way of similitude the wait
queue operation waitq_wakeup corresponds to semaphore <code>up</code>
operation, represented by the function <code>sempafore_up</code>. The
conditional down operation is called
<code>semaphore_trydown</code>.</para>
</section>
<section>
<title>Mutexes</title>
<para>Mutexes are sometimes referred to as binary sempahores. That means
that mutexes are like semaphores that allow only one thread in its
critical section. Indeed, mutexes in HelenOS are implemented exactly in
this way: they are built on top of semaphores. From another point of
view, they can be viewed as spinlocks without busy waiting. Their
semaphore heritage provides good basics for both conditional operation
and operation with timeout. The locking operation is called
<code>mutex_lock</code>, the conditional locking operation is called
<code>mutex_trylock</code> and the unlocking operation is called
<code>mutex_unlock</code>.</para>
</section>
<section>
<title>Reader/writer locks</title>
<para>Reader/writer locks, or rwlocks, are by far the most complicated
synchronization primitive within the kernel. The goal of these locks is
to improve concurrency of applications, in which threads need to
synchronize access to a shared resource, and that access can be
partitioned into a read-only mode and a write mode. Reader/writer locks
should make it possible for several, possibly many, readers to enter the
critical section, provided that no writer is currently in the critical
section, or to be in the critical section contemporarily. Writers are
allowed to enter the critical section only individually, provided that
no reader is in the critical section already. Applications, in which the
majority of operations can be done in the read-only mode, can benefit
from increased concurrency created by reader/writer locks.</para>
<para>During reader/writer lock construction, a decision should be made
whether readers will be prefered over writers or whether writers will be
prefered over readers in cases when the lock is not currently held and
both a reader and a writer want to gain the lock. Some operating systems
prefer one group over the other, creating thus a possibility for
starving the unprefered group. In the HelenOS operating system, none of
the two groups is prefered. The lock is granted on a first come, first
served basis with the additional note that readers are granted the lock
in the biggest possible batch.</para>
<para>With this policy and the timeout modes of operation, the direct
hand-off becomes much more complicated. For instance, a writer leaving
the critical section must wake up all leading readers in the rwlock's
wait queue or one leading writer or no-one if no thread is waiting.
Similarily, the last reader leaving the critical section must wakeup the
sleeping writer if there are any sleeping threads left at all. As
another example, if a writer at the beginning of the rwlock's wait queue
times out and the lock is held by at least one reader, the writer which
has timed out must first wake up all readers that follow him in the
queue prior to signalling the timeout itself and giving up.</para>
<para>Due to the issues mentioned in the previous paragraph, the
reader/writer lock imlpementation needs to walk the rwlock wait queue's
list of sleeping threads directly, in order to find out the type of
access that the queueing threads demand. This makes the code difficult
to understand and dependent on the internal implementation of the wait
queue. Nevertheless, it remains unclear to the authors of HelenOS
whether a simpler but equivalently fair solution exists.</para>
<para>The implementation of rwlocks as it has been already put, makes
use of one single wait queue for both readers and writers, thus avoiding
any possibility of starvation. In fact, rwlocks use a mutex rather than
a bare wait queue. This mutex is called <code>exclusive</code> and is
used to synchronize writers. The writer's lock operation,
<code>rwlock_write_lock_timeout</code>, simply tries to acquire the
exclusive mutex. If it succeeds, the writer is granted the rwlock.
However, if the operation fails (e.g. times out), the writer must check
for potential readers at the head of the list of sleeping threads
associated with the mutex's wait queue and then proceed according to the
procedure outlined above.</para>
<para>The exclusive mutex plays an important role in reader
synchronization as well. However, a reader doing the reader's lock
operation, <code>rwlock_read_lock_timeout</code>, may bypass this mutex
when it detects that:</para>
<orderedlist>
<listitem>
<para>there are other readers in the critical section and</para>
</listitem>
<listitem>
<para>there are no sleeping threads waiting for the exclusive
mutex.</para>
</listitem>
</orderedlist>
<para>If both conditions are true, the reader will bypass the mutex,
increment the number of readers in the critical section and then enter
the critical section. Note that if there are any sleeping threads at the
beginning of the wait queue, the first must be a writer. If the
conditions are not fulfilled, the reader normally waits until the
exclusive mutex is granted to it.</para>
</section>
<section>
<title>Condition variables</title>
<para>Condition variables can be used for waiting until a condition
becomes true. In this respect, they are similar to wait queues. But
contrary to wait queues, condition variables have different semantics
that allows events to be lost when there is no thread waiting for them.
In order to support this, condition variables don't use direct hand-off
and operate in a way similar to the example below. A thread waiting for
the condition becoming true does the following:</para>
<para><programlisting language="C"><function>mutex_lock</function>(<varname>mtx</varname>);
while (!<varname>condition</varname>)
<function>condvar_wait_timeout</function>(<varname>cv</varname>, <varname>mtx</varname>);
/* <remark>the condition is true, do something</remark> */
<function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting></para>
<para>A thread that causes the condition become true signals this event
like this:</para>
<para><programlisting><function>mutex_lock</function>(<varname>mtx</varname>);
<varname>condition</varname> = <constant>true</constant>;
<function>condvar_signal</function>(<varname>cv</varname>); /* <remark>condvar_broadcast(cv);</remark> */
<function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting></para>
<para>The wait operation, <code>condvar_wait_timeout</code>, always puts
the calling thread to sleep. The thread then sleeps until another thread
invokes <code>condvar_broadcast</code> on the same condition variable or
until it is woken up by <code>condvar_signal</code>. The
<code>condvar_signal</code> operation unblocks the first thread blocking
on the condition variable while the <code>condvar_broadcast</code>
operation unblocks all threads blocking there. If there are no blocking
threads, these two operations have no efect.</para>
<para>Note that the threads must synchronize over a dedicated mutex. To
prevent race condition between <code>condvar_wait_timeout</code> and
<code>condvar_signal</code> or <code>condvar_broadcast</code>, the mutex
is passed to <code>condvar_wait_timeout</code> which then atomically
puts the calling thread asleep and unlocks the mutex. When the thread
eventually wakes up, <code>condvar_wait</code> regains the mutex and
returns.</para>
<para>Also note, that there is no conditional operation for condition
variables. Such an operation would make no sence since condition
variables are defined to forget events for which there is no waiting
thread and because <code>condvar_wait</code> must always go to sleep.
The operation with timeout is supported as usually.</para>
<para>In HelenOS, condition variables are based on wait queues. As it is
already mentioned above, wait queues remember missed events while
condition variables must not do so. This is reasoned by the fact that
condition variables are designed for scenarios in which an event might
occur very many times without being picked up by any waiting thread. On
the other hand, wait queues would remember any event that had not been
picked up by a call to <code>waitq_sleep_timeout</code>. Therefore, if
wait queues were used directly and without any changes to implement
condition variables, the missed_wakeup counter would hurt performance of
the implementation: the <code>while</code> loop in
<code>condvar_wait_timeout</code> would effectively do busy waiting
until all missed wakeups were discarded.</para>
<para>The requirement on the wait operation to atomically put the caller
to sleep and release the mutex poses an interesting problem on
<code>condvar_wait_timeout</code>. More precisely, the thread should
sleep in the condvar's wait queue prior to releasing the mutex, but it
must not hold the mutex when it is sleeping.</para>
<para>Problems described in the two previous paragraphs are addressed in
HelenOS by dividing the <code>waitq_sleep_timeout</code> function into
three pieces:</para>
<orderedlist>
<listitem>
<para><code>waitq_sleep_prepare</code> prepares the thread to go to
sleep by, among other things, locking the wait queue;</para>
</listitem>
<listitem>
<para><code>waitq_sleep_timeout_unsafe</code> implements the core
blocking logic;</para>
</listitem>
<listitem>
<para><code>waitq_sleep_finish</code> performs cleanup after
<code>waitq_sleep_timeout_unsafe</code>; after this call, the wait
queue spinlock is guaranteed to be unlocked by the caller</para>
</listitem>
</orderedlist>
<para>The stock <code>waitq_sleep_timeout</code> is then a mere wrapper
that calls these three functions. It is provided for convenience in
cases where the caller doesn't require such a low level control.
However, the implementation of <code>condvar_wait_timeout</code> does
need this finer-grained control because it has to interleave calls to
these functions by other actions. It carries its operations out in the
following order:</para>
<orderedlist>
<listitem>
<para>calls <code>waitq_sleep_prepare</code> in order to lock the
condition variable's wait queue,</para>
</listitem>
<listitem>
<para>releases the mutex,</para>
</listitem>
<listitem>
<para>clears the counter of missed wakeups,</para>
</listitem>
<listitem>
<para>calls <code>waitq_sleep_timeout_unsafe</code>,</para>
</listitem>
<listitem>
<para>retakes the mutex,</para>
</listitem>
<listitem>
<para>calls <code>waitq_sleep_finish</code>.</para>
</listitem>
</orderedlist>
</section>
</section>
<section>
<title>Userspace synchronization</title>
<section>
<title>Futexes</title>
<para></para>
</section>
</section>
</chapter>