<?xml version="1.0" encoding="UTF-8"?>
<chapter id="sync">
  <?dbhtml filename="sync.html"?>

  <title>Synchronization</title>

  <section>
    <title>Introduction</title>

    <para>The HelenOS operating system is designed to make use of the
    parallelism offered by the hardware and to exploit the concurrency of both
    the kernel and userspace tasks. This is achieved through multiprocessor
    support and several levels of multiprogramming: multitasking,
    multithreading and userspace pseudo threads. However, such a highly
    concurrent environment needs safe and efficient ways to handle mutual
    exclusion and synchronization of many execution flows.</para>
  </section>

  <section>
    <title>Active kernel primitives</title>

    <section>
      <indexterm>
        <primary>synchronization</primary>

        <secondary>- spinlock</secondary>
      </indexterm>

      <title>Spinlocks</title>

      <para>The basic mutual exclusion primitive is the spinlock. The spinlock
      implements active waiting for the availability of a memory lock (i.e. a
      simple variable) in a multiprocessor-safe manner. This safety is
      achieved through the use of a specialized, architecture-dependent,
      atomic test-and-set operation which either locks the spinlock (i.e. sets
      the variable) or, provided that it is already locked, leaves it
      unaltered. In any case, the test-and-set operation returns a value, thus
      signalling either success (i.e. zero return value) or failure (i.e.
      non-zero value) in acquiring the lock. This is the fundamental
      difference between the naive algorithm that doesn't use the atomic
      operation and the spinlock algorithm. While the naive algorithm is prone
      to race conditions on SMP configurations and thus is completely
      SMP-unsafe, the spinlock algorithm eliminates the possibility of race
      conditions and is suitable for mutual exclusion use.</para>

      <para>The semantics of the test-and-set operation is that the spinlock
      remains unavailable until this operation called on the respective
      spinlock returns zero. HelenOS builds two functions on top of the
      test-and-set operation. The first function is the unconditional attempt
      to acquire the spinlock and is called <code>spinlock_lock()</code>. It
      simply loops until the test-and-set returns a zero value. The other
      function, <code>spinlock_trylock()</code>, is the conditional lock
      operation and calls the test-and-set only once to find out whether it
      managed to acquire the spinlock or not. The conditional operation is
      useful in situations in which an algorithm cannot acquire more spinlocks
      in the proper order and a deadlock cannot be avoided. In such a case,
      the algorithm would detect the danger and instead of possibly
      deadlocking the system it would simply release some spinlocks it already
      holds and retry the whole operation with the hope that it will succeed
      next time. The unlock function, <code>spinlock_unlock()</code>, is quite
      easy - it merely clears the spinlock variable.</para>
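      <para>As an illustration, the algorithm described above can be sketched
      in portable C, with C11 atomics standing in for the architecture-specific
      test-and-set instruction. This is a hypothetical model, not the HelenOS
      kernel code; only the function names mirror the kernel API.</para>

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative spinlock model: atomic_flag_test_and_set() returns the
 * previous value, so "false" (i.e. zero) signals a successful acquisition,
 * matching the test-and-set semantics described in the text. */
typedef struct {
    atomic_flag flag;   /* clear = unlocked, set = locked */
} spinlock_t;

void spinlock_initialize(spinlock_t *sl)
{
    atomic_flag_clear(&sl->flag);
}

/* Unconditional acquire: actively loop until test-and-set succeeds. */
void spinlock_lock(spinlock_t *sl)
{
    while (atomic_flag_test_and_set(&sl->flag))
        ;  /* busy waiting */
}

/* Conditional acquire: a single test-and-set attempt; true on success. */
bool spinlock_trylock(spinlock_t *sl)
{
    return !atomic_flag_test_and_set(&sl->flag);
}

/* Release: merely clear the spinlock variable. */
void spinlock_unlock(spinlock_t *sl)
{
    atomic_flag_clear(&sl->flag);
}
```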

      <para>Nevertheless, there is a special issue related to hardware
      optimizations that modern processors implement. Particularly problematic
      is the out-of-order execution of instructions within the critical
      section protected by a spinlock. The processors are always
      self-consistent, so that they can carry out speculatively executed
      instructions in the right order with regard to dependencies among those
      instructions. However, the dependency between instructions inside the
      critical section and those that implement locking and unlocking of the
      respective spinlock is not implicit on some processor architectures. As
      a result, the processor needs to be explicitly told about each
      occurrence of such a dependency. Therefore, HelenOS adds
      architecture-specific hooks to the <code>spinlock_lock()</code>,
      <code>spinlock_trylock()</code> and <code>spinlock_unlock()</code>
      functions to prevent the instructions inside the critical section from
      permeating out. On some architectures, these hooks can be void because
      the dependencies are implicit, thanks to the special properties of the
      locking and unlocking instructions. However, other architectures need to
      instrument these hooks with different memory barriers, depending on what
      operations could permeate out.</para>

      <para>Spinlocks have one significant drawback: when held for longer
      periods of time, they harm both parallelism and concurrency. The
      processor executing <code>spinlock_lock()</code> does not do any
      fruitful work and is effectively halted until it can grab the lock and
      proceed. Similarly, other execution flows cannot execute on the
      processor that holds the spinlock because the kernel disables preemption
      on that processor when a spinlock is held. The reason behind disabling
      preemption is avoidance of the priority inversion problem. For the same
      reason, threads are strongly discouraged from sleeping when they hold a
      spinlock.</para>

      <para>To summarize, spinlocks represent a very simple and essential
      mutual exclusion primitive for SMP systems. On the other hand, spinlocks
      scale poorly because of the active loop they are based on. Therefore,
      spinlocks are used in HelenOS only for short-term mutual exclusion and
      in cases where the mutual exclusion is required out of thread context.
      Lastly, spinlocks are used in the construction of passive
      synchronization primitives.</para>
    </section>
  </section>

  <section>
    <title>Passive kernel synchronization</title>

    <section>
      <indexterm>
        <primary>synchronization</primary>

        <secondary>- wait queue</secondary>
      </indexterm>

      <title>Wait queues</title>

      <para>A wait queue is the basic passive synchronization primitive on
      which all other passive synchronization primitives are built. Simply
      put, it allows a thread to sleep until an event associated with the
      particular wait queue occurs. Multiple threads are notified about
      incoming events in a first come, first served fashion. Moreover, should
      the event come before any thread waits for it, it is recorded in the
      wait queue as a missed wakeup and later forwarded to the first thread
      that decides to wait in the queue. The inner structures of the wait
      queue are protected by a spinlock.</para>

      <para>The thread that wants to wait for a wait queue event uses the
      <code>waitq_sleep_timeout()</code> function. The algorithm then checks
      the wait queue's counter of missed wakeups and if there are any missed
      wakeups, the call returns immediately. The call also returns immediately
      if only a conditional wait was requested. Otherwise the thread is
      enqueued in the wait queue's list of sleeping threads and its state is
      changed to <constant>Sleeping</constant>. It then sleeps until one of
      the following events happens:</para>

      <orderedlist>
        <listitem>
          <para>another thread calls <code>waitq_wakeup()</code> and the
          thread is the first thread in the wait queue's list of sleeping
          threads;</para>
        </listitem>

        <listitem>
          <para>another thread calls <code>waitq_interrupt_sleep()</code> on
          the sleeping thread;</para>
        </listitem>

        <listitem>
          <para>the sleep times out, provided that none of the previous
          events occurred within the specified time limit; the limit can be
          infinity.</para>
        </listitem>
      </orderedlist>
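      <para>The entry logic of the sleep operation described above can be
      sketched as a simple decision function. The outcome names and the
      single-threaded model below are illustrative assumptions, not the
      kernel's actual return codes or data structures.</para>

```c
#include <stdbool.h>

/* Hypothetical sketch of the entry decision of waitq_sleep_timeout():
 * consume a missed wakeup if one is recorded, fail immediately for a
 * conditional wait, otherwise the caller would block. */
typedef enum {
    OUTCOME_OK_ATOMIC,    /* missed wakeup consumed, immediate success */
    OUTCOME_WOULD_BLOCK,  /* conditional wait requested, no event yet */
    OUTCOME_BLOCKED       /* caller would be enqueued and put to sleep */
} sleep_outcome_t;

typedef struct {
    int missed_wakeups;   /* events recorded before any thread waited */
} waitq_t;

sleep_outcome_t waitq_sleep_entry(waitq_t *wq, bool conditional)
{
    if (wq->missed_wakeups > 0) {
        wq->missed_wakeups--;       /* forward a missed wakeup */
        return OUTCOME_OK_ATOMIC;
    }
    if (conditional)
        return OUTCOME_WOULD_BLOCK;
    /* real kernel: enqueue in the list of sleeping threads, change the
     * state to Sleeping and wait for wakeup, interruption or timeout */
    return OUTCOME_BLOCKED;
}
```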

      <para>All five possibilities (immediate return on success, immediate
      return on failure, wakeup after sleep, interruption and timeout) are
      distinguishable by the return value of
      <code>waitq_sleep_timeout()</code>. Being able to interrupt a sleeping
      thread is essential for externally initiated thread termination. The
      ability to wait only for a certain amount of time is used, for instance,
      to passively delay thread execution by several microseconds or even
      seconds in the <code>thread_sleep()</code> function. Because all other
      passive kernel synchronization primitives are based on wait queues, they
      also have the option of being interrupted and, more importantly, can
      time out. All of them also implement the conditional operation.
      Furthermore, this very fundamental interface reaches up to the
      implementation of futexes, the userspace synchronization primitive,
      which makes it possible for a userspace thread to request a
      synchronization operation with a timeout or a conditional
      operation.</para>

      <para>From the description above, it should be apparent that when a
      sleeping thread is woken by <code>waitq_wakeup()</code> or when
      <code>waitq_sleep_timeout()</code> succeeds immediately, the thread can
      be sure that the event has occurred. The thread need not and should not
      verify this fact. This approach is called direct hand-off and is
      characteristic of all passive HelenOS synchronization primitives, with
      the exception described below.</para>
    </section>

    <section>
      <indexterm>
        <primary>synchronization</primary>

        <secondary>- semaphore</secondary>
      </indexterm>

      <title>Semaphores</title>

      <para>The interesting point about wait queues is that the number of
      missed wakeups is equal to the number of threads that will not block in
      <code>waitq_sleep_timeout()</code> and will immediately succeed instead.
      On the other hand, semaphores are synchronization primitives that admit
      a predefined number of threads into their critical section and block
      any other threads above this count. These two cases are exactly the
      same. Semaphores in HelenOS are therefore implemented as wait queues
      with a single semantic change: their wait queue is initialized to have
      as many missed wakeups as is the number of threads that the semaphore
      intends to let into its critical section simultaneously.</para>

      <para>In the semaphore language, the wait queue operation
      <code>waitq_sleep_timeout()</code> corresponds to the semaphore
      <code>down</code> operation, represented by the function
      <code>semaphore_down_timeout()</code>, and the wait queue operation
      <code>waitq_wakeup()</code> corresponds to the semaphore
      <code>up</code> operation, represented by the function
      <code>semaphore_up()</code>. The conditional down operation is called
      <code>semaphore_trydown()</code>.</para>
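      <para>The semaphore-as-wait-queue idea can be sketched as follows:
      initializing the wait queue with <varname>n</varname> missed wakeups
      lets the first <varname>n</varname> down operations succeed without
      blocking. The code is an illustrative single-threaded model, not the
      kernel implementation.</para>

```c
#include <stdbool.h>

/* Simplified wait queue: only the missed-wakeup counter is modeled. */
typedef struct {
    int missed_wakeups;
} waitq_t;

typedef struct {
    waitq_t wq;
} semaphore_t;

void semaphore_initialize(semaphore_t *s, int value)
{
    /* as many missed wakeups as threads allowed in simultaneously */
    s->wq.missed_wakeups = value;
}

/* Conditional down: succeeds iff a missed wakeup is available. */
bool semaphore_trydown(semaphore_t *s)
{
    if (s->wq.missed_wakeups > 0) {
        s->wq.missed_wakeups--;
        return true;
    }
    return false;
}

/* Up: records a wakeup (the real kernel would wake a sleeper first). */
void semaphore_up(semaphore_t *s)
{
    s->wq.missed_wakeups++;
}
```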
    </section>

    <section>
      <title>Mutexes</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- mutex</secondary>
      </indexterm>

      <para>Mutexes are sometimes referred to as binary semaphores. That means
      that mutexes are like semaphores that allow only one thread into their
      critical section. Indeed, mutexes in HelenOS are implemented exactly in
      this way: they are built on top of semaphores. From another point of
      view, they can be viewed as spinlocks without busy waiting. Their
      semaphore heritage provides a good basis for both the conditional
      operation and the operation with timeout. The locking operation is
      called <code>mutex_lock()</code>, the conditional locking operation is
      called <code>mutex_trylock()</code> and the unlocking operation is
      called <code>mutex_unlock()</code>.</para>
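      <para>The mutex-on-top-of-semaphore construction can be sketched in a
      few lines; the semaphore stand-in below is a plain counter, an
      assumption made so the sketch stays self-contained.</para>

```c
#include <stdbool.h>

/* Simplified semaphore stand-in (counter only, no sleeping). */
typedef struct { int count; } semaphore_t;

/* A mutex is just a semaphore initialized to 1: at most one thread
 * may be inside the critical section. */
typedef struct { semaphore_t sem; } mutex_t;

void mutex_initialize(mutex_t *m)
{
    m->sem.count = 1;   /* binary semaphore */
}

bool mutex_trylock(mutex_t *m)   /* conditional semaphore down */
{
    if (m->sem.count > 0) {
        m->sem.count--;
        return true;
    }
    return false;
}

void mutex_unlock(mutex_t *m)    /* semaphore up */
{
    m->sem.count++;
}
```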
    </section>

    <section>
      <title>Reader/writer locks</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- read/write lock</secondary>
      </indexterm>

      <para>Reader/writer locks, or rwlocks, are by far the most complicated
      synchronization primitive within the kernel. The goal of these locks is
      to improve the concurrency of applications in which threads need to
      synchronize access to a shared resource and that access can be
      partitioned into a read-only mode and a write mode. Reader/writer locks
      should make it possible for several, possibly many, readers to be in
      the critical section simultaneously, provided that no writer is
      currently in the critical section. Writers are allowed to enter the
      critical section only individually, provided that no reader is in the
      critical section already. Applications in which the majority of
      operations can be done in the read-only mode can benefit from the
      increased concurrency created by reader/writer locks.</para>

      <para>During reader/writer lock construction, a decision should be made
      whether readers will be preferred over writers or whether writers will
      be preferred over readers in cases when the lock is not currently held
      and both a reader and a writer want to gain the lock. Some operating
      systems prefer one group over the other, thus creating a possibility of
      starving the unpreferred group. In the HelenOS operating system, neither
      of the two groups is preferred. The lock is granted on a first come,
      first served basis with the additional note that readers are granted
      the lock in the biggest possible batch.</para>

      <para>With this policy and the timeout modes of operation, the direct
      hand-off becomes much more complicated. For instance, a writer leaving
      the critical section must wake up all leading readers in the rwlock's
      wait queue or one leading writer or no-one if no thread is waiting.
      Similarly, the last reader leaving the critical section must wake up
      the sleeping writer if there are any sleeping threads left at all. As
      another example, if a writer at the beginning of the rwlock's wait
      queue times out and the lock is held by at least one reader, the writer
      which has timed out must first wake up all readers that follow it in
      the queue prior to signalling the timeout itself and giving up.</para>

      <para>Due to the issues mentioned in the previous paragraph, the
      reader/writer lock implementation needs to walk the rwlock wait queue's
      list of sleeping threads directly in order to find out the type of
      access that the queueing threads demand. This makes the code difficult
      to understand and dependent on the internal implementation of the wait
      queue. Nevertheless, it remains unclear to the authors of HelenOS
      whether a simpler but equivalently fair solution exists.</para>

      <para>The implementation of rwlocks, as already outlined, makes use of
      a single wait queue for both readers and writers, thus avoiding any
      possibility of starvation. In fact, rwlocks use a mutex rather than a
      bare wait queue. This mutex is called <emphasis>exclusive</emphasis>
      and is used to synchronize writers. The writer's lock operation,
      <code>rwlock_write_lock_timeout()</code>, simply tries to acquire the
      exclusive mutex. If it succeeds, the writer is granted the rwlock.
      However, if the operation fails (e.g. times out), the writer must check
      for potential readers at the head of the list of sleeping threads
      associated with the mutex's wait queue and then proceed according to
      the procedure outlined above.</para>

      <para>The exclusive mutex plays an important role in reader
      synchronization as well. However, a reader doing the reader's lock
      operation, <code>rwlock_read_lock_timeout()</code>, may bypass this
      mutex when it detects that:</para>

      <orderedlist>
        <listitem>
          <para>there are other readers in the critical section and</para>
        </listitem>

        <listitem>
          <para>there are no sleeping threads waiting for the exclusive
          mutex.</para>
        </listitem>
      </orderedlist>

      <para>If both conditions are true, the reader will bypass the mutex,
      increment the number of readers in the critical section and then enter
      the critical section. Note that if there are any sleeping threads at
      the beginning of the wait queue, the first must be a writer. If the
      conditions are not fulfilled, the reader normally waits until the
      exclusive mutex is granted to it.</para>
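      <para>The reader fast path decision above reduces to a two-condition
      predicate. The structure and field names below are illustrative
      assumptions, not the kernel's actual rwlock layout.</para>

```c
#include <stdbool.h>

/* Hypothetical rwlock state: only the two fields the bypass decision
 * needs are modeled. */
typedef struct {
    int readers_in;        /* readers currently in the critical section */
    int sleeping_threads;  /* threads queued for the exclusive mutex */
} rwlock_t;

/* A reader may bypass the exclusive mutex only when other readers are
 * already inside AND no thread (the first of which would be a writer)
 * is sleeping in the mutex's wait queue. */
bool rwlock_reader_can_bypass(const rwlock_t *rw)
{
    return (rw->readers_in > 0) && (rw->sleeping_threads == 0);
}
```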
    </section>

    <section>
      <title>Condition variables</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- condition variable</secondary>
      </indexterm>

      <para>Condition variables can be used for waiting until a condition
      becomes true. In this respect, they are similar to wait queues. But
      contrary to wait queues, condition variables have different semantics
      that allow events to be lost when there is no thread waiting for them.
      In order to support this, condition variables don't use direct hand-off
      and operate in a way similar to the example below. A thread waiting for
      the condition to become true does the following:</para>

      <example>
        <title>Use of <code>condvar_wait_timeout()</code>.</title>

        <programlisting language="C"><function>mutex_lock</function>(<varname>mtx</varname>);
while (!<varname>condition</varname>)
        <function>condvar_wait_timeout</function>(<varname>cv</varname>, <varname>mtx</varname>);
/* <remark>the condition is true, do something</remark> */
<function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting>
      </example>

      <para>A thread that causes the condition to become true signals this
      event like this:</para>

      <example>
        <title>Use of <code>condvar_signal()</code>.</title>

        <programlisting language="C"><function>mutex_lock</function>(<varname>mtx</varname>);
<varname>condition</varname> = <constant>true</constant>;
<function>condvar_signal</function>(<varname>cv</varname>);  /* <remark>condvar_broadcast(cv);</remark> */
<function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting>
      </example>

      <para>The wait operation, <code>condvar_wait_timeout()</code>, always
      puts the calling thread to sleep. The thread then sleeps until another
      thread invokes <code>condvar_broadcast()</code> on the same condition
      variable or until it is woken up by <code>condvar_signal()</code>. The
      <code>condvar_signal()</code> operation unblocks the first thread
      blocking on the condition variable while the
      <code>condvar_broadcast()</code> operation unblocks all threads
      blocking there. If there are no blocking threads, these two operations
      have no effect.</para>

      <para>Note that the threads must synchronize over a dedicated mutex. To
      prevent a race condition between <code>condvar_wait_timeout()</code>
      and <code>condvar_signal()</code> or <code>condvar_broadcast()</code>,
      the mutex is passed to <code>condvar_wait_timeout()</code>, which then
      atomically puts the calling thread to sleep and unlocks the mutex. When
      the thread eventually wakes up, <code>condvar_wait()</code> regains the
      mutex and returns.</para>

      <para>Also note that there is no conditional operation for condition
      variables. Such an operation would make no sense since condition
      variables are defined to forget events for which there is no waiting
      thread and because <code>condvar_wait()</code> must always go to sleep.
      The operation with timeout is supported as usual.</para>

      <para>In HelenOS, condition variables are based on wait queues. As
      mentioned above, wait queues remember missed events while condition
      variables must not do so. The reason is that condition variables are
      designed for scenarios in which an event might occur very many times
      without being picked up by any waiting thread. On the other hand, wait
      queues would remember any event that had not been picked up by a call
      to <code>waitq_sleep_timeout()</code>. Therefore, if wait queues were
      used directly and without any changes to implement condition variables,
      the missed_wakeup counter would hurt performance of the implementation:
      the <code>while</code> loop in <code>condvar_wait_timeout()</code>
      would effectively do busy waiting until all missed wakeups were
      discarded.</para>

      <para>The requirement on the wait operation to atomically put the
      caller to sleep and release the mutex poses an interesting problem for
      <code>condvar_wait_timeout()</code>. More precisely, the thread should
      be put to sleep in the condvar's wait queue prior to releasing the
      mutex, but it must not hold the mutex while it is sleeping.</para>

      <para>Problems described in the two previous paragraphs are addressed
      in HelenOS by dividing the <code>waitq_sleep_timeout()</code> function
      into three pieces:</para>

      <orderedlist>
        <listitem>
          <para><code>waitq_sleep_prepare()</code> prepares the thread to go
          to sleep by, among other things, locking the wait queue;</para>
        </listitem>

        <listitem>
          <para><code>waitq_sleep_timeout_unsafe()</code> implements the core
          blocking logic;</para>
        </listitem>

        <listitem>
          <para><code>waitq_sleep_finish()</code> performs cleanup after
          <code>waitq_sleep_timeout_unsafe()</code>; after this call, the
          wait queue spinlock is guaranteed to be unlocked by the
          caller.</para>
        </listitem>
      </orderedlist>

      <para>The stock <code>waitq_sleep_timeout()</code> is then a mere
      wrapper that calls these three functions. It is provided for
      convenience in cases where the caller doesn't require such low-level
      control. However, the implementation of
      <code>condvar_wait_timeout()</code> does need this finer-grained
      control because it has to interleave calls to these functions with
      other actions. It carries its operations out in the following
      order:</para>

      <orderedlist>
        <listitem>
          <para>calls <code>waitq_sleep_prepare()</code> in order to lock the
          condition variable's wait queue,</para>
        </listitem>

        <listitem>
          <para>releases the mutex,</para>
        </listitem>

        <listitem>
          <para>clears the counter of missed wakeups,</para>
        </listitem>

        <listitem>
          <para>calls <code>waitq_sleep_timeout_unsafe()</code>,</para>
        </listitem>

        <listitem>
          <para>retakes the mutex,</para>
        </listitem>

        <listitem>
          <para>calls <code>waitq_sleep_finish()</code>.</para>
        </listitem>
      </orderedlist>
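      <para>The step order above can be sketched as a skeleton in which the
      kernel functions are replaced by stubs that record the call order; the
      real functions live in the HelenOS kernel and are not reproduced
      here.</para>

```c
#include <string.h>

/* Order-of-operations sketch of condvar_wait_timeout(). Each stub
 * appends its name to a log so the sequence can be inspected. */
static char order[64];

static void step(const char *s)
{
    strcat(order, s);
    strcat(order, ";");
}

static void waitq_sleep_prepare(void)        { step("prepare"); }
static void mutex_unlock_stub(void)          { step("unlock"); }
static void clear_missed_wakeups(void)       { step("clear"); }
static void waitq_sleep_timeout_unsafe(void) { step("sleep"); }
static void mutex_lock_stub(void)            { step("lock"); }
static void waitq_sleep_finish(void)         { step("finish"); }

void condvar_wait_timeout_sketch(void)
{
    waitq_sleep_prepare();        /* 1. lock the condvar's wait queue */
    mutex_unlock_stub();          /* 2. release the mutex */
    clear_missed_wakeups();       /* 3. forget recorded events */
    waitq_sleep_timeout_unsafe(); /* 4. core blocking logic */
    mutex_lock_stub();            /* 5. retake the mutex */
    waitq_sleep_finish();         /* 6. cleanup, unlock the wait queue */
}
```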
    </section>
  </section>

  <section>
    <title>Userspace synchronization</title>

    <section>
      <title>Futexes</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- futex</secondary>
      </indexterm>

      <para>In a multithreaded environment, userspace threads need an
      efficient way to synchronize. HelenOS borrows an idea from Linux<xref
      linkend="futex" /> to implement a lightweight userspace synchronization
      and mutual exclusion primitive called a futex. The key idea behind
      futexes is that non-contended synchronization is very fast and takes
      place only in userspace without the kernel's intervention. When several
      threads contend for a futex, only one of them wins; the other threads
      go to sleep via a dedicated syscall.</para>

      <para>The userspace part of the futex is a mere integer variable, a
      counter, that can be atomically incremented or decremented. The kernel
      part is rather more complicated. For each userspace futex counter,
      there is a kernel structure describing the futex. This structure
      contains:</para>

      <itemizedlist>
        <listitem>
          <para>the number of references,</para>
        </listitem>

        <listitem>
          <para>the physical address of the userspace futex counter,</para>
        </listitem>

        <listitem>
          <para>a hash table link and</para>
        </listitem>

        <listitem>
          <para>a wait queue.</para>
        </listitem>
      </itemizedlist>

      <para>The reference count helps to find out when the futex is no
      longer needed and can be deallocated. The physical address is used as
      a key for the global futex hash table. Note that the kernel has to use
      the physical address to identify the futex because one futex can be
      used for synchronization among different address spaces and can have
      different virtual addresses in each of them. Finally, the kernel futex
      structure includes a wait queue. The wait queue is used to put threads
      that didn't win the futex to sleep until the winner wakes one of them
      up.</para>

      <para>A futex should be initialized by setting its userspace counter
      to one before it is used. When locking the futex via the userspace
      library function <code>futex_down_timeout()</code>, the library code
      atomically decrements the futex counter and tests if it dropped below
      zero. If it did, then the futex is locked by another thread and the
      library uses the <constant>SYS_FUTEX_SLEEP</constant> syscall to put
      the thread to sleep. If the counter decreased to zero, then there was
      no contention and the thread can enter the critical section protected
      by the futex. When the thread later leaves that critical section, it
      atomically increments the counter using the library function
      <code>futex_up()</code>. If the counter value increased to one, then
      there again was no contention and no action needs to be done. However,
      if it increased to zero or even a smaller number, then there are
      sleeping threads waiting for the futex to become available. In that
      case, the library has to use the
      <constant>SYS_FUTEX_WAKEUP</constant> syscall to wake one sleeping
      thread.</para>
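      <para>The userspace half of this protocol can be sketched with C11
      atomics, with the two syscalls replaced by counting stubs. This is an
      illustrative model of the counter discipline described above, not the
      HelenOS library code.</para>

```c
#include <stdatomic.h>

/* Futex model: the counter starts at one (unlocked); the stub fields
 * count how often the syscalls would be invoked. */
typedef struct {
    atomic_int count;
    int sleep_calls;    /* stands in for SYS_FUTEX_SLEEP invocations */
    int wakeup_calls;   /* stands in for SYS_FUTEX_WAKEUP invocations */
} futex_t;

void futex_initialize(futex_t *f)
{
    atomic_init(&f->count, 1);
    f->sleep_calls = 0;
    f->wakeup_calls = 0;
}

void futex_down(futex_t *f)
{
    /* atomically decrement; dropping below zero means contention */
    if (atomic_fetch_sub(&f->count, 1) - 1 < 0)
        f->sleep_calls++;   /* real code: SYS_FUTEX_SLEEP syscall */
}

void futex_up(futex_t *f)
{
    /* atomically increment; a result of zero or less means sleepers */
    if (atomic_fetch_add(&f->count, 1) + 1 <= 0)
        f->wakeup_calls++;  /* real code: SYS_FUTEX_WAKEUP syscall */
}
```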

      <para>So far, futexes are very elegant in that they don't interfere
      with the kernel when there is no contention for them. Another nice
      aspect of futexes is that they don't need to be registered anywhere
      prior to the first kernel intervention.</para>

      <para>Both futex related syscalls, <constant>SYS_FUTEX_SLEEP</constant>
      and <constant>SYS_FUTEX_WAKEUP</constant>, are mere wrappers for
      <code>waitq_sleep_timeout()</code> and <code>waitq_wakeup()</code>,
      respectively, with some housekeeping functionality added. Both syscalls
      need to translate the userspace virtual address of the futex counter
      to a physical address in order to support synchronization across
      shared memory. Once the physical address is known, the kernel checks
      whether the futex is already known to it by searching the global futex
      hash table for an item with the physical address of the futex counter
      as a key. When the search is successful, it returns an address of the
      kernel futex structure associated with the counter. If the hash table
      does not contain the key, the kernel creates a new kernel futex
      structure and inserts it into the hash table. At the same time, the
      current task's B+tree of known futexes is searched in order to find
      out if the task already uses the futex. If it does, no action is
      taken. Otherwise the reference count of the futex is incremented,
      provided that the futex already existed.</para>
    </section>
  </section>
</chapter>