Rev 141 | Details | Compare with Previous | Last modification | View Log | RSS feed
| Rev | Author | Line No. | Line |
|---|---|---|---|
| 9 | bondari | 1 | <?xml version="1.0" encoding="UTF-8"?> |
| 41 | jermar | 2 | <chapter id="sync"> |
| 3 | <?dbhtml filename="sync.html"?> |
||
| 9 | bondari | 4 | |
| 82 | jermar | 5 | <title>Synchronization</title> |
| 9 | bondari | 6 | |
| 41 | jermar | 7 | <section> |
| 8 | <title>Introduction</title> |
||
| 9 | bondari | 9 | |
| 45 | jermar | 10 | <para>The HelenOS operating system is designed to make use of the |
| 11 | parallelism offered by the hardware and to exploit concurrency of both the |
||
| 12 | kernel and userspace tasks. This is achieved through multiprocessor |
||
| 13 | support and several levels of multiprogramming such as multitasking, |
||
| 171 | jermar | 14 | multithreading and also through userspace fibrils. However, such a highly |
| 15 | concurrent environment needs safe and efficient ways to handle mutual |
||
| 16 | exclusion and synchronization of many execution flows.</para> |
||
| 41 | jermar | 17 | </section> |
| 18 | |||
| 19 | <section> |
||
| 131 | jermar | 20 | <title>Active Kernel Primitives</title> |
| 41 | jermar | 21 | |
| 9 | bondari | 22 | <section> |
| 72 | bondari | 23 | <indexterm> |
| 24 | <primary>synchronization</primary> |
||
| 25 | |||
| 26 | <secondary>- spinlock</secondary> |
||
| 27 | </indexterm> |
||
| 28 | |||
| 41 | jermar | 29 | <title>Spinlocks</title> |
| 9 | bondari | 30 | |
| 45 | jermar | 31 | <para>The basic mutual exclusion primitive is the spinlock. The spinlock |
| 32 | implements active waiting for the availability of a memory lock (i.e. |
||
| 41 | jermar | 33 | simple variable) in a multiprocessor-safe manner. This safety is |
| 34 | achieved through the use of a specialized, architecture-dependent, |
||
| 35 | atomic test-and-set operation which either locks the spinlock (i.e. sets |
||
| 36 | the variable) or, provided that it is already locked, leaves it |
||
| 37 | unaltered. In any case, the test-and-set operation returns a value, thus |
||
| 38 | signalling either success (i.e. zero return value) or failure (i.e. |
||
| 45 | jermar | 39 | non-zero value) in acquiring the lock. Note that this makes a |
| 41 | jermar | 40 | fundamental difference between the naive algorithm that doesn't use the |
| 41 | atomic operation and the spinlock algortihm. While the naive algorithm |
||
| 45 | jermar | 42 | is prone to race conditions on SMP configurations and thus is completely |
| 41 | jermar | 43 | SMP-unsafe, the spinlock algorithm eliminates the possibility of race |
| 44 | conditions and is suitable for mutual exclusion use.</para> |
||
| 9 | bondari | 45 | |
| 41 | jermar | 46 | <para>The semantics of the test-and-set operation is that the spinlock |
| 47 | remains unavailable until this operation called on the respective |
||
| 45 | jermar | 48 | spinlock returns zero. HelenOS builds two functions on top of the |
| 49 | test-and-set operation. The first function is the unconditional attempt |
||
| 86 | bondari | 50 | to acquire the spinlock and is called <code>spinlock_lock()</code>. It |
| 57 | jermar | 51 | simply loops until the test-and-set returns a zero value. The other |
| 86 | bondari | 52 | function, <code>spinlock_trylock()</code>, is the conditional lock |
| 57 | jermar | 53 | operation and calls the test-and-set only once to find out whether it |
| 54 | managed to acquire the spinlock or not. The conditional operation is |
||
| 55 | useful in situations in which an algorithm cannot acquire more spinlocks |
||
| 56 | in the proper order and a deadlock cannot be avoided. In such a case, |
||
| 57 | the algorithm would detect the danger and instead of possibly |
||
| 58 | deadlocking the system it would simply release some spinlocks it already |
||
| 59 | holds and retry the whole operation with the hope that it will succeed |
||
| 86 | bondari | 60 | next time. The unlock function, <code>spinlock_unlock()</code>, is quite |
| 57 | jermar | 61 | easy - it merely clears the spinlock variable.</para> |
| 9 | bondari | 62 | |
| 41 | jermar | 63 | <para>Nevertheless, there is a special issue related to hardware |
| 45 | jermar | 64 | optimizations that modern processors implement. Particularly problematic |
| 65 | is the out-of-order execution of instructions within the critical |
||
| 66 | section protected by a spinlock. The processors are always |
||
| 41 | jermar | 67 | self-consistent so that they can carry out speculatively executed |
| 68 | instructions in the right order with regard to dependencies among those |
||
| 69 | instructions. However, the dependency between instructions inside the |
||
| 70 | critical section and those that implement locking and unlocking of the |
||
| 45 | jermar | 71 | respective spinlock is not implicit on some processor architectures. As |
| 72 | a result, the processor needs to be explicitly told about each |
||
| 73 | occurrence of such a dependency. Therefore, HelenOS adds |
||
| 86 | bondari | 74 | architecture-specific hooks to all <code>spinlock_lock()</code>, |
| 131 | jermar | 75 | <code>spinlock_trylock()</code> and <code>spinlock_unlock()</code> |
| 76 | functions to prevent the instructions inside the critical section from |
||
| 77 | permeating out. On some architectures, these hooks can be void because |
||
| 78 | the dependencies are implicitly there because of the special properties |
||
| 79 | of locking and unlocking instructions. However, other architectures need |
||
| 80 | to instrument these hooks with different memory barriers, depending on |
||
| 81 | what operations could permeate out.</para> |
||
| 9 | bondari | 82 | |
| 41 | jermar | 83 | <para>Spinlocks have one significant drawback: when held for longer time |
| 45 | jermar | 84 | periods, they harm both parallelism and concurrency. The processor |
| 86 | bondari | 85 | executing <code>spinlock_lock()</code> does not do any fruitful work and |
| 57 | jermar | 86 | is effectively halted until it can grab the lock and proceed. |
| 45 | jermar | 87 | Similarily, other execution flows cannot execute on the processor that |
| 88 | holds the spinlock because the kernel disables preemption on that |
||
| 89 | processor when a spinlock is held. The reason behind disabling |
||
| 90 | preemption is priority inversion problem avoidance. For the same reason, |
||
| 91 | threads are strongly discouraged from sleeping when they hold a |
||
| 92 | spinlock.</para> |
||
| 9 | bondari | 93 | |
| 41 | jermar | 94 | <para>To summarize, spinlocks represent very simple and essential mutual |
| 95 | exclusion primitive for SMP systems. On the other hand, spinlocks scale |
||
| 96 | poorly because of the active loop they are based on. Therefore, |
||
| 45 | jermar | 97 | spinlocks are used in HelenOS only for short-time mutual exclusion and |
| 41 | jermar | 98 | in cases where the mutual exclusion is required out of thread context. |
| 99 | Lastly, spinlocks are used in the construction of passive |
||
| 100 | synchronization primitives.</para> |
||
| 101 | </section> |
||
| 102 | </section> |
||
| 9 | bondari | 103 | |
| 41 | jermar | 104 | <section> |
| 131 | jermar | 105 | <title>Passive Kernel Synchronization</title> |
| 9 | bondari | 106 | |
| 41 | jermar | 107 | <section> |
| 72 | bondari | 108 | <indexterm> |
| 109 | <primary>synchronization</primary> |
||
| 110 | |||
| 73 | bondari | 111 | <secondary>- wait queue</secondary> |
| 72 | bondari | 112 | </indexterm> |
| 113 | |||
| 131 | jermar | 114 | <title>Wait Queues</title> |
| 9 | bondari | 115 | |
| 43 | jermar | 116 | <para>A wait queue is the basic passive synchronization primitive on |
| 45 | jermar | 117 | which all other passive synchronization primitives are built. Simply |
| 118 | put, it allows a thread to sleep until an event associated with the |
||
| 119 | particular wait queue occurs. Multiple threads are notified about |
||
| 120 | incoming events in a first come, first served fashion. Moreover, should |
||
| 121 | the event come before any thread waits for it, it is recorded in the |
||
| 122 | wait queue as a missed wakeup and later forwarded to the first thread |
||
| 123 | that decides to wait in the queue. The inner structures of the wait |
||
| 124 | queue are protected by a spinlock.</para> |
||
| 43 | jermar | 125 | |
| 126 | <para>The thread that wants to wait for a wait queue event uses the |
||
| 131 | jermar | 127 | <code>waitq_sleep_timeout()</code> function. The algorithm then checks |
| 128 | the wait queue's counter of missed wakeups and if there are any missed |
||
| 57 | jermar | 129 | wakeups, the call returns immediately. The call also returns immediately |
| 130 | if only a conditional wait was requested. Otherwise the thread is |
||
| 131 | enqueued in the wait queue's list of sleeping threads and its state is |
||
| 132 | changed to <constant>Sleeping</constant>. It then sleeps until one of |
||
| 133 | the following events happens:</para> |
||
| 43 | jermar | 134 | |
| 135 | <orderedlist> |
||
| 136 | <listitem> |
||
| 131 | jermar | 137 | <para>another thread calls <code>waitq_wakeup()</code> and the |
| 138 | thread is the first thread in the wait queue's list of sleeping |
||
| 45 | jermar | 139 | threads;</para> |
| 43 | jermar | 140 | </listitem> |
| 141 | |||
| 142 | <listitem> |
||
| 131 | jermar | 143 | <para>another thread calls <code>waitq_interrupt_sleep()</code> on |
| 144 | the sleeping thread;</para> |
||
| 43 | jermar | 145 | </listitem> |
| 146 | |||
| 147 | <listitem> |
||
| 45 | jermar | 148 | <para>the sleep times out provided that none of the previous |
| 149 | occurred within a specified time limit; the limit can be |
||
| 150 | infinity.</para> |
||
| 43 | jermar | 151 | </listitem> |
| 152 | </orderedlist> |
||
| 153 | |||
| 154 | <para>All five possibilities (immediate return on success, immediate |
||
| 155 | return on failure, wakeup after sleep, interruption and timeout) are |
||
| 131 | jermar | 156 | distinguishable by the return value of |
| 157 | <code>waitq_sleep_timeout()</code>. Being able to interrupt a sleeping |
||
| 158 | thread is essential for externally initiated thread termination. The |
||
| 159 | ability to wait only for a certain amount of time is used, for instance, |
||
| 160 | to passively delay thread execution by several microseconds or even |
||
| 161 | seconds in <code>thread_sleep()</code> function. Due to the fact that |
||
| 162 | all other passive kernel synchronization primitives are based on wait |
||
| 163 | queues, they also have the option of being interrutped and, more |
||
| 164 | importantly, can timeout. All of them also implement the conditional |
||
| 165 | operation. Furthemore, this very fundamental interface reaches up to the |
||
| 57 | jermar | 166 | implementation of futexes - userspace synchronization primitive, which |
| 167 | makes it possible for a userspace thread to request a synchronization |
||
| 168 | operation with a timeout or a conditional operation.</para> |
||
| 43 | jermar | 169 | |
| 170 | <para>From the description above, it should be apparent, that when a |
||
| 86 | bondari | 171 | sleeping thread is woken by <code>waitq_wakeup()</code> or when |
| 131 | jermar | 172 | <code>waitq_sleep_timeout()</code> succeeds immediately, the thread can |
| 173 | be sure that the event has occurred. The thread need not and should not |
||
| 57 | jermar | 174 | verify this fact. This approach is called direct hand-off and is |
| 175 | characteristic for all passive HelenOS synchronization primitives, with |
||
| 176 | the exception as described below.</para> |
||
| 41 | jermar | 177 | </section> |
| 9 | bondari | 178 | |
| 41 | jermar | 179 | <section> |
| 72 | bondari | 180 | <indexterm> |
| 181 | <primary>synchronization</primary> |
||
| 182 | |||
| 73 | bondari | 183 | <secondary>- semaphore</secondary> |
| 72 | bondari | 184 | </indexterm> |
| 185 | |||
| 41 | jermar | 186 | <title>Semaphores</title> |
| 9 | bondari | 187 | |
| 43 | jermar | 188 | <para>The interesting point about wait queues is that the number of |
| 189 | missed wakeups is equal to the number of threads that will not block in |
||
| 131 | jermar | 190 | <code>watiq_sleep_timeout()</code> and would immediately succeed |
| 191 | instead. On the other hand, semaphores are synchronization primitives |
||
| 192 | that will let predefined amount of threads into their critical section |
||
| 193 | and block any other threads above this count. However, these two cases |
||
| 194 | are exactly the same. Semaphores in HelenOS are therefore implemented as |
||
| 195 | wait queues with a single semantic change: their wait queue is |
||
| 196 | initialized to have so many missed wakeups as is the number of threads |
||
| 197 | that the semphore intends to let into its critical section |
||
| 198 | simultaneously.</para> |
||
| 43 | jermar | 199 | |
| 200 | <para>In the semaphore language, the wait queue operation |
||
| 86 | bondari | 201 | <code>waitq_sleep_timeout()</code> corresponds to semaphore |
| 57 | jermar | 202 | <code>down</code> operation, represented by the function |
| 86 | bondari | 203 | <code>semaphore_down_timeout()</code> and by way of similitude the wait |
| 57 | jermar | 204 | queue operation waitq_wakeup corresponds to semaphore <code>up</code> |
| 86 | bondari | 205 | operation, represented by the function <code>sempafore_up()</code>. The |
| 57 | jermar | 206 | conditional down operation is called |
| 86 | bondari | 207 | <code>semaphore_trydown()</code>.</para> |
| 41 | jermar | 208 | </section> |
| 9 | bondari | 209 | |
| 41 | jermar | 210 | <section> |
| 43 | jermar | 211 | <title>Mutexes</title> |
| 9 | bondari | 212 | |
| 72 | bondari | 213 | <indexterm> |
| 214 | <primary>synchronization</primary> |
||
| 215 | |||
| 216 | <secondary>- mutex</secondary> |
||
| 217 | </indexterm> |
||
| 218 | |||
| 45 | jermar | 219 | <para>Mutexes are sometimes referred to as binary sempahores. That means |
| 220 | that mutexes are like semaphores that allow only one thread in its |
||
| 43 | jermar | 221 | critical section. Indeed, mutexes in HelenOS are implemented exactly in |
| 45 | jermar | 222 | this way: they are built on top of semaphores. From another point of |
| 223 | view, they can be viewed as spinlocks without busy waiting. Their |
||
| 224 | semaphore heritage provides good basics for both conditional operation |
||
| 225 | and operation with timeout. The locking operation is called |
||
| 86 | bondari | 226 | <code>mutex_lock()</code>, the conditional locking operation is called |
| 227 | <code>mutex_trylock()</code> and the unlocking operation is called |
||
| 228 | <code>mutex_unlock()</code>.</para> |
||
| 41 | jermar | 229 | </section> |
| 9 | bondari | 230 | |
| 41 | jermar | 231 | <section> |
| 131 | jermar | 232 | <title>Reader/Writer Locks</title> |
| 9 | bondari | 233 | |
| 72 | bondari | 234 | <indexterm> |
| 235 | <primary>synchronization</primary> |
||
| 236 | |||
| 237 | <secondary>- read/write lock</secondary> |
||
| 238 | </indexterm> |
||
| 239 | |||
| 43 | jermar | 240 | <para>Reader/writer locks, or rwlocks, are by far the most complicated |
| 241 | synchronization primitive within the kernel. The goal of these locks is |
||
| 45 | jermar | 242 | to improve concurrency of applications, in which threads need to |
| 243 | synchronize access to a shared resource, and that access can be |
||
| 43 | jermar | 244 | partitioned into a read-only mode and a write mode. Reader/writer locks |
| 245 | should make it possible for several, possibly many, readers to enter the |
||
| 246 | critical section, provided that no writer is currently in the critical |
||
| 247 | section, or to be in the critical section contemporarily. Writers are |
||
| 248 | allowed to enter the critical section only individually, provided that |
||
| 45 | jermar | 249 | no reader is in the critical section already. Applications, in which the |
| 250 | majority of operations can be done in the read-only mode, can benefit |
||
| 43 | jermar | 251 | from increased concurrency created by reader/writer locks.</para> |
| 252 | |||
| 45 | jermar | 253 | <para>During reader/writer lock construction, a decision should be made |
| 43 | jermar | 254 | whether readers will be prefered over writers or whether writers will be |
| 255 | prefered over readers in cases when the lock is not currently held and |
||
| 256 | both a reader and a writer want to gain the lock. Some operating systems |
||
| 257 | prefer one group over the other, creating thus a possibility for |
||
| 258 | starving the unprefered group. In the HelenOS operating system, none of |
||
| 45 | jermar | 259 | the two groups is prefered. The lock is granted on a first come, first |
| 43 | jermar | 260 | served basis with the additional note that readers are granted the lock |
| 45 | jermar | 261 | in the biggest possible batch.</para> |
| 43 | jermar | 262 | |
| 263 | <para>With this policy and the timeout modes of operation, the direct |
||
| 264 | hand-off becomes much more complicated. For instance, a writer leaving |
||
| 265 | the critical section must wake up all leading readers in the rwlock's |
||
| 266 | wait queue or one leading writer or no-one if no thread is waiting. |
||
| 267 | Similarily, the last reader leaving the critical section must wakeup the |
||
| 45 | jermar | 268 | sleeping writer if there are any sleeping threads left at all. As |
| 269 | another example, if a writer at the beginning of the rwlock's wait queue |
||
| 270 | times out and the lock is held by at least one reader, the writer which |
||
| 271 | has timed out must first wake up all readers that follow him in the |
||
| 272 | queue prior to signalling the timeout itself and giving up.</para> |
||
| 43 | jermar | 273 | |
| 45 | jermar | 274 | <para>Due to the issues mentioned in the previous paragraph, the |
| 275 | reader/writer lock imlpementation needs to walk the rwlock wait queue's |
||
| 276 | list of sleeping threads directly, in order to find out the type of |
||
| 43 | jermar | 277 | access that the queueing threads demand. This makes the code difficult |
| 278 | to understand and dependent on the internal implementation of the wait |
||
| 279 | queue. Nevertheless, it remains unclear to the authors of HelenOS |
||
| 280 | whether a simpler but equivalently fair solution exists.</para> |
||
| 281 | |||
| 282 | <para>The implementation of rwlocks as it has been already put, makes |
||
| 283 | use of one single wait queue for both readers and writers, thus avoiding |
||
| 284 | any possibility of starvation. In fact, rwlocks use a mutex rather than |
||
| 131 | jermar | 285 | a bare wait queue. This mutex is called <emphasis>exclusive</emphasis> |
| 286 | and is used to synchronize writers. The writer's lock operation, |
||
| 86 | bondari | 287 | <code>rwlock_write_lock_timeout()</code>, simply tries to acquire the |
| 57 | jermar | 288 | exclusive mutex. If it succeeds, the writer is granted the rwlock. |
| 44 | jermar | 289 | However, if the operation fails (e.g. times out), the writer must check |
| 290 | for potential readers at the head of the list of sleeping threads |
||
| 45 | jermar | 291 | associated with the mutex's wait queue and then proceed according to the |
| 44 | jermar | 292 | procedure outlined above.</para> |
| 43 | jermar | 293 | |
| 294 | <para>The exclusive mutex plays an important role in reader |
||
| 295 | synchronization as well. However, a reader doing the reader's lock |
||
| 131 | jermar | 296 | operation, <code>rwlock_read_lock_timeout()</code>, may bypass this |
| 297 | mutex when it detects that:</para> |
||
| 43 | jermar | 298 | |
| 299 | <orderedlist> |
||
| 300 | <listitem> |
||
| 45 | jermar | 301 | <para>there are other readers in the critical section and</para> |
| 43 | jermar | 302 | </listitem> |
| 303 | |||
| 304 | <listitem> |
||
| 305 | <para>there are no sleeping threads waiting for the exclusive |
||
| 45 | jermar | 306 | mutex.</para> |
| 43 | jermar | 307 | </listitem> |
| 308 | </orderedlist> |
||
| 309 | |||
| 310 | <para>If both conditions are true, the reader will bypass the mutex, |
||
| 45 | jermar | 311 | increment the number of readers in the critical section and then enter |
| 312 | the critical section. Note that if there are any sleeping threads at the |
||
| 313 | beginning of the wait queue, the first must be a writer. If the |
||
| 43 | jermar | 314 | conditions are not fulfilled, the reader normally waits until the |
| 315 | exclusive mutex is granted to it.</para> |
||
| 41 | jermar | 316 | </section> |
| 9 | bondari | 317 | |
| 318 | <section> |
||
| 131 | jermar | 319 | <title>Condition Variables</title> |
| 9 | bondari | 320 | |
| 72 | bondari | 321 | <indexterm> |
| 322 | <primary>synchronization</primary> |
||
| 323 | |||
| 324 | <secondary>- condition variable</secondary> |
||
| 325 | </indexterm> |
||
| 326 | |||
| 48 | jermar | 327 | <para>Condition variables can be used for waiting until a condition |
| 328 | becomes true. In this respect, they are similar to wait queues. But |
||
| 329 | contrary to wait queues, condition variables have different semantics |
||
| 330 | that allows events to be lost when there is no thread waiting for them. |
||
| 331 | In order to support this, condition variables don't use direct hand-off |
||
| 332 | and operate in a way similar to the example below. A thread waiting for |
||
| 333 | the condition becoming true does the following:</para> |
||
| 334 | |||
| 62 | jermar | 335 | <example> |
| 86 | bondari | 336 | <title>Use of <code>condvar_wait_timeout()</code>.</title> |
| 72 | bondari | 337 | |
| 338 | <programlisting language="C"><function>mutex_lock</function>(<varname>mtx</varname>); |
||
| 48 | jermar | 339 | while (!<varname>condition</varname>) |
| 102 | bondari | 340 | <function>condvar_wait_timeout</function>(<varname>cv</varname>, <varname>mtx</varname>); /* <remark>the condition is true, do something</remark> */ |
| 62 | jermar | 341 | <function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting> |
| 72 | bondari | 342 | </example> |
| 48 | jermar | 343 | |
| 344 | <para>A thread that causes the condition become true signals this event |
||
| 345 | like this:</para> |
||
| 346 | |||
| 62 | jermar | 347 | <example> |
| 141 | jermar | 348 | <title>Use of <code>condvar_signal()</code>.</title> |
| 72 | bondari | 349 | |
| 102 | bondari | 350 | <programlisting language="C"><function>mutex_lock</function>(<varname>mtx</varname>); |
| 48 | jermar | 351 | <varname>condition</varname> = <constant>true</constant>; |
| 352 | <function>condvar_signal</function>(<varname>cv</varname>); /* <remark>condvar_broadcast(cv);</remark> */ |
||
| 72 | bondari | 353 | <function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting> |
| 354 | </example> |
||
| 48 | jermar | 355 | |
| 131 | jermar | 356 | <para>The wait operation, <code>condvar_wait_timeout()</code>, always |
| 357 | puts the calling thread to sleep. The thread then sleeps until another |
||
| 358 | thread invokes <code>condvar_broadcast()</code> on the same condition |
||
| 359 | variable or until it is woken up by <code>condvar_signal()</code>. The |
||
| 360 | <code>condvar_signal()</code> operation unblocks the first thread |
||
| 361 | blocking on the condition variable while the |
||
| 362 | <code>condvar_broadcast()</code> operation unblocks all threads blocking |
||
| 363 | there. If there are no blocking threads, these two operations have no |
||
| 364 | efect.</para> |
||
| 48 | jermar | 365 | |
| 366 | <para>Note that the threads must synchronize over a dedicated mutex. To |
||
| 86 | bondari | 367 | prevent race condition between <code>condvar_wait_timeout()</code> and |
| 131 | jermar | 368 | <code>condvar_signal()</code> or <code>condvar_broadcast()</code>, the |
| 369 | mutex is passed to <code>condvar_wait_timeout()</code> which then |
||
| 370 | atomically puts the calling thread asleep and unlocks the mutex. When |
||
| 371 | the thread eventually wakes up, <code>condvar_wait()</code> regains the |
||
| 372 | mutex and returns.</para> |
||
| 48 | jermar | 373 | |
| 374 | <para>Also note, that there is no conditional operation for condition |
||
| 375 | variables. Such an operation would make no sence since condition |
||
| 376 | variables are defined to forget events for which there is no waiting |
||
| 86 | bondari | 377 | thread and because <code>condvar_wait()</code> must always go to sleep. |
| 57 | jermar | 378 | The operation with timeout is supported as usually.</para> |
| 48 | jermar | 379 | |
| 380 | <para>In HelenOS, condition variables are based on wait queues. As it is |
||
| 381 | already mentioned above, wait queues remember missed events while |
||
| 382 | condition variables must not do so. This is reasoned by the fact that |
||
| 383 | condition variables are designed for scenarios in which an event might |
||
| 384 | occur very many times without being picked up by any waiting thread. On |
||
| 385 | the other hand, wait queues would remember any event that had not been |
||
| 86 | bondari | 386 | picked up by a call to <code>waitq_sleep_timeout()</code>. Therefore, if |
| 57 | jermar | 387 | wait queues were used directly and without any changes to implement |
| 388 | condition variables, the missed_wakeup counter would hurt performance of |
||
| 389 | the implementation: the <code>while</code> loop in |
||
| 86 | bondari | 390 | <code>condvar_wait_timeout()</code> would effectively do busy waiting |
| 57 | jermar | 391 | until all missed wakeups were discarded.</para> |
| 48 | jermar | 392 | |
| 393 | <para>The requirement on the wait operation to atomically put the caller |
||
| 394 | to sleep and release the mutex poses an interesting problem on |
||
| 86 | bondari | 395 | <code>condvar_wait_timeout()</code>. More precisely, the thread should |
| 57 | jermar | 396 | sleep in the condvar's wait queue prior to releasing the mutex, but it |
| 397 | must not hold the mutex when it is sleeping.</para> |
||
| 48 | jermar | 398 | |
| 399 | <para>Problems described in the two previous paragraphs are addressed in |
||
| 86 | bondari | 400 | HelenOS by dividing the <code>waitq_sleep_timeout()</code> function into |
| 57 | jermar | 401 | three pieces:</para> |
| 48 | jermar | 402 | |
| 403 | <orderedlist> |
||
| 404 | <listitem> |
||
| 131 | jermar | 405 | <para><code>waitq_sleep_prepare()</code> prepares the thread to go |
| 406 | to sleep by, among other things, locking the wait queue;</para> |
||
| 48 | jermar | 407 | </listitem> |
| 408 | |||
| 409 | <listitem> |
||
| 86 | bondari | 410 | <para><code>waitq_sleep_timeout_unsafe()</code> implements the core |
| 57 | jermar | 411 | blocking logic;</para> |
| 48 | jermar | 412 | </listitem> |
| 413 | |||
| 414 | <listitem> |
||
| 86 | bondari | 415 | <para><code>waitq_sleep_finish()</code> performs cleanup after |
| 416 | <code>waitq_sleep_timeout_unsafe()</code>; after this call, the wait |
||
| 57 | jermar | 417 | queue spinlock is guaranteed to be unlocked by the caller</para> |
| 48 | jermar | 418 | </listitem> |
| 419 | </orderedlist> |
||
| 420 | |||
| 131 | jermar | 421 | <para>The stock <code>waitq_sleep_timeout()</code> is then a mere |
| 422 | wrapper that calls these three functions. It is provided for convenience |
||
| 423 | in cases where the caller doesn't require such a low level control. |
||
| 86 | bondari | 424 | However, the implementation of <code>condvar_wait_timeout()</code> does |
| 57 | jermar | 425 | need this finer-grained control because it has to interleave calls to |
| 426 | these functions by other actions. It carries its operations out in the |
||
| 427 | following order:</para> |
||
| 48 | jermar | 428 | |
| 429 | <orderedlist> |
||
| 430 | <listitem> |
||
| 86 | bondari | 431 | <para>calls <code>waitq_sleep_prepare()</code> in order to lock the |
| 57 | jermar | 432 | condition variable's wait queue,</para> |
| 48 | jermar | 433 | </listitem> |
| 434 | |||
| 435 | <listitem> |
||
| 436 | <para>releases the mutex,</para> |
||
| 437 | </listitem> |
||
| 438 | |||
| 439 | <listitem> |
||
| 440 | <para>clears the counter of missed wakeups,</para> |
||
| 441 | </listitem> |
||
| 442 | |||
| 443 | <listitem> |
||
| 86 | bondari | 444 | <para>calls <code>waitq_sleep_timeout_unsafe()</code>,</para> |
| 48 | jermar | 445 | </listitem> |
| 446 | |||
| 447 | <listitem> |
||
| 448 | <para>retakes the mutex,</para> |
||
| 449 | </listitem> |
||
| 450 | |||
| 451 | <listitem> |
||
| 86 | bondari | 452 | <para>calls <code>waitq_sleep_finish()</code>.</para> |
| 48 | jermar | 453 | </listitem> |
| 454 | </orderedlist> |
||
| 9 | bondari | 455 | </section> |
| 41 | jermar | 456 | </section> |
| 9 | bondari | 457 | |
| 41 | jermar | 458 | <section> |
| 131 | jermar | 459 | <title>Userspace Synchronization</title> |
| 9 | bondari | 460 | |
| 41 | jermar | 461 | <section> |
| 462 | <title>Futexes</title> |
||
| 463 | |||
| 72 | bondari | 464 | <indexterm> |
| 465 | <primary>synchronization</primary> |
||
| 466 | |||
| 467 | <secondary>- futex</secondary> |
||
| 468 | </indexterm> |
||
| 469 | |||
| 81 | jermar | 470 | <para>In a multithreaded environment, userspace threads need an |
| 471 | efficient way to synchronize. HelenOS borrows an idea from Linux<xref |
||
| 472 | linkend="futex" /> to implement lightweight userspace synchronization |
||
| 473 | and mutual exclusion primitive called futex. The key idea behind futexes |
||
| 474 | is that non-contended synchronization is very fast and takes place only |
||
| 475 | in userspace without kernel's intervention. When more threads contend |
||
| 476 | for a futex, only one of them wins; other threads go to sleep via a |
||
| 477 | dedicated syscall.</para> |
||
| 478 | |||
| 479 | <para>The userspace part of the futex is a mere integer variable, a |
||
| 480 | counter, that can be atomically incremented or decremented. The kernel |
||
| 481 | part is rather more complicated. For each userspace futex counter, there |
||
| 482 | is a kernel structure describing the futex. This structure |
||
| 483 | contains:</para> |
||
| 484 | |||
| 485 | <itemizedlist> |
||
| 486 | <listitem> |
||
| 487 | <para>number of references,</para> |
||
| 488 | </listitem> |
||
| 489 | |||
| 490 | <listitem> |
||
| 491 | <para>physical address of the userspace futex counter,</para> |
||
| 492 | </listitem> |
||
| 493 | |||
| 494 | <listitem> |
||
| 495 | <para>hash table link and</para> |
||
| 496 | </listitem> |
||
| 497 | |||
| 498 | <listitem> |
||
| 499 | <para>a wait queue.</para> |
||
| 500 | </listitem> |
||
| 501 | </itemizedlist> |
||
| 502 | |||
| 503 | <para>The reference count helps to find out when the futex is no longer |
||
| 504 | needed and can be deallocated. The physical address is used as a key for |
||
| 505 | the global futex hash table. Note that the kernel has to use physical |
||
| 506 | address to identify the futex beacause one futex can be used for |
||
| 507 | synchronization among different address spaces and can have different |
||
| 508 | virtual addresses in each of them. Finally, the kernel futex structure |
||
| 509 | includes a wait queue. The wait queue is used to put threads that didn't |
||
| 510 | win the futex to sleep until the winner wakes one of them up.</para> |
||
| 511 | |||
| 512 | <para>A futex should be initialized by setting its userspace counter to |
||
| 513 | one before it is used. When locking the futex via userspace library |
||
| 86 | bondari | 514 | function <code>futex_down_timeout()</code>, the library code atomically |
| 82 | jermar | 515 | decrements the futex counter and tests if it dropped below zero. If it |
| 81 | jermar | 516 | did, then the futex is locked by another thread and the library uses the |
| 517 | <constant>SYS_FUTEX_SLEEP</constant> syscall to put the thread asleep. |
||
| 518 | If the counter decreased to 0, then there was no contention and the |
||
| 519 | thread can enter the critical section protected by the futex. When the |
||
| 520 | thread later leaves that critical section, it, using library function |
||
| 131 | jermar | 521 | <code>futex_up()</code>, atomically increments the counter. If the |
| 522 | counter value increased to one, then there again was no contention and |
||
| 523 | no action needs to be done. However, if it increased to zero or even a |
||
| 524 | smaller number, then there are sleeping threads waiting for the futex to |
||
| 525 | become available. In that case, the library has to use the |
||
| 81 | jermar | 526 | <constant>SYS_FUTEX_WAKEUP</constant> syscall to wake one sleeping |
| 527 | thread.</para> |
||
| 528 | |||
| 529 | <para>So far, futexes are very elegant in that they don't interfere with |
||
| 530 | the kernel when there is no contention for them. Another nice aspect of |
||
| 531 | futexes is that they don't need to be registered anywhere prior to the |
||
| 532 | first kernel intervention.</para> |
||
| 533 | |||
| 534 | <para>Both futex related syscalls, <constant>SYS_FUTEX_SLEEP</constant> |
||
| 535 | and <constant>SYS_FUTEX_WAKEUP</constant>, respectivelly, are mere |
||
| 86 | bondari | 536 | wrappers for <code>waitq_sleep_timeout()</code> and |
| 537 | <code>waitq_sleep_wakeup()</code>, respectively, with some housekeeping |
||
| 81 | jermar | 538 | functionality added. Both syscalls need to translate the userspace |
| 539 | virtual address of the futex counter to physical address in order to |
||
| 540 | support synchronization accross shared memory. Once the physical address |
||
| 541 | is known, the kernel checks whether the futexes are already known to it |
||
| 542 | by searching the global futex hash table for an item with the physical |
||
| 543 | address of the futex counter as a key. When the search is successful, it |
||
| 544 | returns an address of the kernel futex structure associated with the |
||
| 545 | counter. If the hash table does not contain the key, the kernel creates |
||
| 546 | it and inserts it into the hash table. At the same time, the the current |
||
| 547 | task's B+tree of known futexes is searched in order to find out if the |
||
| 548 | task already uses the futex. If it does, no action is taken. Otherwise |
||
| 549 | the reference count of the futex is incremented, provided that the futex |
||
| 550 | already existed.</para> |
||
| 41 | jermar | 551 | </section> |
| 552 | </section> |
||
| 553 | </chapter> |