<?xml version="1.0" encoding="UTF-8"?>
<chapter id="sync">
  <?dbhtml filename="sync.html"?>

  <title>Synchronization</title>

  <section>
    <title>Introduction</title>

    <para>The HelenOS operating system is designed to make use of the
    parallelism offered by the hardware and to exploit concurrency of both the
    kernel and userspace tasks. This is achieved through multiprocessor
    support and several levels of multiprogramming such as multitasking and
    multithreading, and also through userspace fibrils. However, such a highly
    concurrent environment needs safe and efficient ways to handle mutual
    exclusion and synchronization of many execution flows.</para>
  </section>

  <section>
    <title>Active Kernel Primitives</title>

    <section>
      <indexterm>
        <primary>synchronization</primary>

        <secondary>- spinlock</secondary>
      </indexterm>

      <title>Spinlocks</title>

      <para>The basic mutual exclusion primitive is the spinlock. The spinlock
      implements active waiting for the availability of a memory lock (i.e.
      simple variable) in a multiprocessor-safe manner. This safety is
      achieved through the use of a specialized, architecture-dependent,
      atomic test-and-set operation which either locks the spinlock (i.e. sets
      the variable) or, provided that it is already locked, leaves it
      unaltered. In any case, the test-and-set operation returns a value, thus
      signalling either success (i.e. zero return value) or failure (i.e.
      non-zero value) in acquiring the lock. Note that this is the fundamental
      difference between the naive algorithm that doesn't use the atomic
      operation and the spinlock algorithm. While the naive algorithm is prone
      to race conditions on SMP configurations and thus is completely
      SMP-unsafe, the spinlock algorithm eliminates the possibility of race
      conditions and is suitable for mutual exclusion use.</para>
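      <para>The test-and-set behaviour described above can be sketched with
      C11 atomics. This is a minimal illustration only; the names below are
      hypothetical and do not belong to the HelenOS API.</para>

```c
#include <stdatomic.h>

/* Hypothetical spinlock sketch. atomic_flag_test_and_set() plays the
 * role of the architecture-dependent test-and-set instruction: it sets
 * the lock variable and returns its previous value in one atomic step. */
typedef struct {
    atomic_flag flag;
} sl_t;

static void sl_initialize(sl_t *sl)
{
    atomic_flag_clear(&sl->flag);
}

/* Returns zero on success (the lock was free) and non-zero on failure,
 * mirroring the convention described in the text. */
static int sl_test_and_set(sl_t *sl)
{
    return atomic_flag_test_and_set(&sl->flag) ? 1 : 0;
}

/* Active waiting: loop until test-and-set signals success. */
static void sl_lock(sl_t *sl)
{
    while (sl_test_and_set(sl) != 0)
        ;  /* spin */
}

static void sl_unlock(sl_t *sl)
{
    atomic_flag_clear(&sl->flag);
}
```

      <para>Without the atomic exchange, a second thread could observe the
      variable as free between another thread's test and its set, which is
      exactly the race the naive algorithm suffers from.</para>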

      <para>The semantics of the test-and-set operation is that the spinlock
      remains unavailable until this operation called on the respective
      spinlock returns zero. HelenOS builds two functions on top of the
      test-and-set operation. The first function is the unconditional attempt
      to acquire the spinlock and is called <code>spinlock_lock()</code>. It
      simply loops until the test-and-set returns a zero value. The other
      function, <code>spinlock_trylock()</code>, is the conditional lock
      operation and calls the test-and-set only once to find out whether it
      managed to acquire the spinlock or not. The conditional operation is
      useful in situations in which an algorithm cannot acquire multiple
      spinlocks in the proper order and a deadlock cannot be avoided. In such
      a case, the algorithm would detect the danger and, instead of possibly
      deadlocking the system, simply release some spinlocks it already holds
      and retry the whole operation with the hope that it will succeed next
      time. The unlock function, <code>spinlock_unlock()</code>, is
      straightforward: it merely clears the spinlock variable.</para>
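      <para>The back-off pattern that <code>spinlock_trylock()</code> enables
      can be sketched as follows, again with C11 atomics and illustrative
      names rather than the HelenOS functions: take the first lock
      unconditionally, try the second conditionally, and on failure release
      everything and retry.</para>

```c
#include <stdatomic.h>

typedef atomic_flag lock_t;

static int try_lock(lock_t *l)
{
    return atomic_flag_test_and_set(l) ? 1 : 0;  /* 0 = acquired */
}

static void lock(lock_t *l)
{
    while (try_lock(l) != 0)
        ;  /* spin */
}

static void unlock(lock_t *l)
{
    atomic_flag_clear(l);
}

/* Acquire both locks without risking deadlock against a thread that
 * holds 'second' and wants 'first'. */
static void lock_both(lock_t *first, lock_t *second)
{
    for (;;) {
        lock(first);
        if (try_lock(second) == 0)
            return;          /* got both locks */
        unlock(first);       /* danger detected: back off and retry */
    }
}
```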

      <para>Nevertheless, there is a special issue related to hardware
      optimizations that modern processors implement. Particularly problematic
      is the out-of-order execution of instructions within the critical
      section protected by a spinlock. The processors are always
      self-consistent so that they can carry out speculatively executed
      instructions in the right order with regard to dependencies among those
      instructions. However, the dependency between instructions inside the
      critical section and those that implement locking and unlocking of the
      respective spinlock is not implicit on some processor architectures. As
      a result, the processor needs to be explicitly told about each
      occurrence of such a dependency. Therefore, HelenOS adds
      architecture-specific hooks to all <code>spinlock_lock()</code>,
      <code>spinlock_trylock()</code> and <code>spinlock_unlock()</code>
      functions to prevent the instructions inside the critical section from
      permeating out. On some architectures, these hooks can be empty because
      the dependencies are implicitly enforced by the special properties of
      the locking and unlocking instructions. However, other architectures
      need to instrument these hooks with different memory barriers, depending
      on what operations could permeate out.</para>
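      <para>As a sketch of what such hooks can look like, the following uses
      C11 fences in place of the architecture-specific barriers. The macro
      and variable names are illustrative only; the actual HelenOS hooks are
      defined per architecture.</para>

```c
#include <stdatomic.h>

/* Illustrative barrier hooks. On an architecture whose locking
 * instructions already order memory accesses, they expand to nothing;
 * otherwise they expand to acquire and release fences. */
#ifdef ARCH_ORDERING_IMPLICIT
#define CS_ENTER_BARRIER()  /* empty: ordering is implicit */
#define CS_LEAVE_BARRIER()  /* empty */
#else
#define CS_ENTER_BARRIER()  atomic_thread_fence(memory_order_acquire)
#define CS_LEAVE_BARRIER()  atomic_thread_fence(memory_order_release)
#endif

static atomic_flag cs_lock = ATOMIC_FLAG_INIT;
static int shared_data;

/* A critical section instrumented with the hooks: the fences keep the
 * store to shared_data from permeating out past the lock operations. */
static void protected_store(int value)
{
    while (atomic_flag_test_and_set_explicit(&cs_lock,
                                             memory_order_relaxed))
        ;  /* spin */
    CS_ENTER_BARRIER();
    shared_data = value;
    CS_LEAVE_BARRIER();
    atomic_flag_clear_explicit(&cs_lock, memory_order_relaxed);
}

static int read_shared(void)
{
    return shared_data;
}
```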

      <para>Spinlocks have one significant drawback: when held for longer time
      periods, they harm both parallelism and concurrency. The processor
      executing <code>spinlock_lock()</code> does not do any fruitful work and
      is effectively halted until it can grab the lock and proceed.
      Similarly, other execution flows cannot execute on the processor that
      holds the spinlock because the kernel disables preemption on that
      processor when a spinlock is held. The reason behind disabling
      preemption is to avoid the priority inversion problem. For the same
      reason, threads are strongly discouraged from sleeping when they hold a
      spinlock.</para>

      <para>To summarize, spinlocks represent a very simple yet essential
      mutual exclusion primitive for SMP systems. On the other hand, spinlocks
      scale poorly because of the active loop they are based on. Therefore,
      spinlocks are used in HelenOS only for short-term mutual exclusion and
      in cases where the mutual exclusion is required out of thread context.
      Lastly, spinlocks are used in the construction of passive
      synchronization primitives.</para>
    </section>
  </section>

  <section>
    <title>Passive Kernel Synchronization</title>

    <section>
      <indexterm>
        <primary>synchronization</primary>

        <secondary>- wait queue</secondary>
      </indexterm>

      <title>Wait Queues</title>

      <para>A wait queue is the basic passive synchronization primitive on
      which all other passive synchronization primitives are built. Simply
      put, it allows a thread to sleep until an event associated with the
      particular wait queue occurs. Multiple threads are notified about
      incoming events in a first come, first served fashion. Moreover, should
      the event come before any thread waits for it, it is recorded in the
      wait queue as a missed wakeup and later forwarded to the first thread
      that decides to wait in the queue. The inner structures of the wait
      queue are protected by a spinlock.</para>

      <para>The thread that wants to wait for a wait queue event uses the
      <code>waitq_sleep_timeout()</code> function. The algorithm then checks
      the wait queue's counter of missed wakeups and if there are any missed
      wakeups, the call returns immediately. The call also returns immediately
      if only a conditional wait was requested. Otherwise the thread is
      enqueued in the wait queue's list of sleeping threads and its state is
      changed to <constant>Sleeping</constant>. It then sleeps until one of
      the following events happens:</para>

      <orderedlist>
        <listitem>
          <para>another thread calls <code>waitq_wakeup()</code> and the
          thread is the first thread in the wait queue's list of sleeping
          threads;</para>
        </listitem>

        <listitem>
          <para>another thread calls <code>waitq_interrupt_sleep()</code> on
          the sleeping thread;</para>
        </listitem>

        <listitem>
          <para>the sleep times out, provided that none of the previous
          events occurred within a specified time limit; the limit can be
          infinity.</para>
        </listitem>
      </orderedlist>

      <para>All five possibilities (immediate return on success, immediate
      return on failure, wakeup after sleep, interruption and timeout) are
      distinguishable by the return value of
      <code>waitq_sleep_timeout()</code>. Being able to interrupt a sleeping
      thread is essential for externally initiated thread termination. The
      ability to wait only for a certain amount of time is used, for instance,
      to passively delay thread execution by several microseconds or even
      seconds in the <code>thread_sleep()</code> function. Because all other
      passive kernel synchronization primitives are based on wait queues, they
      also have the option of being interrupted and, more importantly, can
      time out. All of them also implement the conditional operation.
      Furthermore, this very fundamental interface reaches up to the
      implementation of futexes, the userspace synchronization primitive,
      which makes it possible for a userspace thread to request a
      synchronization operation with a timeout or a conditional
      operation.</para>
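      <para>The five outcomes can be modelled by a small enumeration. The
      names below are illustrative (the actual HelenOS return codes may
      differ); the point is that a caller can tell from the return value
      alone whether the event was really delivered to it.</para>

```c
#include <stdbool.h>

/* The five outcomes listed above, modelled as an enum with
 * illustrative names. */
typedef enum {
    SLEEP_OK_ATOMIC,    /* missed wakeup consumed, returned immediately */
    SLEEP_WOULD_BLOCK,  /* conditional request, no missed wakeup */
    SLEEP_OK_BLOCKED,   /* slept and was woken by waitq_wakeup() */
    SLEEP_INTERRUPTED,  /* sleep interrupted from the outside */
    SLEEP_TIMED_OUT     /* the time limit expired first */
} sleep_result_t;

/* Direct hand-off: only these two outcomes guarantee that the event
 * has occurred and was handed to this thread. */
static bool event_received(sleep_result_t rc)
{
    return rc == SLEEP_OK_ATOMIC || rc == SLEEP_OK_BLOCKED;
}
```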

      <para>From the description above, it should be apparent that when a
      sleeping thread is woken by <code>waitq_wakeup()</code> or when
      <code>waitq_sleep_timeout()</code> succeeds immediately, the thread can
      be sure that the event has occurred. The thread need not and should not
      verify this fact. This approach is called direct hand-off and is
      characteristic of all passive HelenOS synchronization primitives, with
      the exception described below.</para>
    </section>

    <section>
      <indexterm>
        <primary>synchronization</primary>

        <secondary>- semaphore</secondary>
      </indexterm>

      <title>Semaphores</title>

      <para>The interesting point about wait queues is that the number of
      missed wakeups is equal to the number of threads that will not block in
      <code>waitq_sleep_timeout()</code> and will immediately succeed instead.
      On the other hand, semaphores are synchronization primitives that will
      let a predefined number of threads into their critical section and block
      any other threads above this count. These two cases are, in fact,
      exactly the same. Semaphores in HelenOS are therefore implemented as
      wait queues with a single semantic change: their wait queue is
      initialized to have as many missed wakeups as the number of threads that
      the semaphore intends to let into its critical section
      simultaneously.</para>

      <para>In the semaphore language, the wait queue operation
      <code>waitq_sleep_timeout()</code> corresponds to the semaphore
      <code>down</code> operation, represented by the function
      <code>semaphore_down_timeout()</code>, and by way of similitude the wait
      queue operation <code>waitq_wakeup()</code> corresponds to the semaphore
      <code>up</code> operation, represented by the function
      <code>semaphore_up()</code>. The conditional down operation is called
      <code>semaphore_trydown()</code>.</para>
    </section>

    <section>
      <title>Mutexes</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- mutex</secondary>
      </indexterm>

      <para>Mutexes are sometimes referred to as binary semaphores. That means
      that mutexes are like semaphores that allow only one thread into their
      critical section. Indeed, mutexes in HelenOS are implemented exactly in
      this way: they are built on top of semaphores. From another point of
      view, they can be viewed as spinlocks without busy waiting. Their
      semaphore heritage provides a good basis for both the conditional
      operation and the operation with a timeout. The locking operation is
      called <code>mutex_lock()</code>, the conditional locking operation is
      called <code>mutex_trylock()</code> and the unlocking operation is
      called <code>mutex_unlock()</code>.</para>
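      <para>The relationship can be sketched in a few lines: a mutex is a
      semaphore whose counter starts at one. The counter below stands in for
      the underlying semaphore and the names are illustrative.</para>

```c
/* A mutex modelled as a binary semaphore. */
typedef struct {
    int sem_count;  /* the underlying semaphore's counter */
} mutex_t;

static void mutex_initialize(mutex_t *m)
{
    m->sem_count = 1;  /* exactly one thread may enter */
}

/* Conditional lock, i.e. semaphore_trydown() on the underlying
 * semaphore: 0 on success, -1 if the caller would have to block. */
static int mutex_trylock(mutex_t *m)
{
    if (m->sem_count > 0) {
        m->sem_count--;
        return 0;
    }
    return -1;
}

/* Unlock, i.e. semaphore_up() on the underlying semaphore. */
static void mutex_unlock(mutex_t *m)
{
    m->sem_count++;
}
```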
    </section>

    <section>
      <title>Reader/Writer Locks</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- read/write lock</secondary>
      </indexterm>

      <para>Reader/writer locks, or rwlocks, are by far the most complicated
      synchronization primitive within the kernel. The goal of these locks is
      to improve concurrency of applications in which threads need to
      synchronize access to a shared resource, and that access can be
      partitioned into a read-only mode and a write mode. Reader/writer locks
      should make it possible for several, possibly many, readers to enter the
      critical section, provided that no writer is currently in the critical
      section, or to be in the critical section simultaneously. Writers are
      allowed to enter the critical section only individually, provided that
      no reader is in the critical section already. Applications in which the
      majority of operations can be done in the read-only mode can benefit
      from the increased concurrency created by reader/writer locks.</para>

      <para>During reader/writer lock construction, a decision should be made
      whether readers will be preferred over writers or whether writers will
      be preferred over readers in cases when the lock is not currently held
      and both a reader and a writer want to gain the lock. Some operating
      systems prefer one group over the other, thus creating a possibility for
      starving the unpreferred group. In the HelenOS operating system, neither
      of the two groups is preferred. The lock is granted on a first come,
      first served basis with the additional note that readers are granted the
      lock in the biggest possible batch.</para>

      <para>With this policy and the timeout modes of operation, the direct
      hand-off becomes much more complicated. For instance, a writer leaving
      the critical section must wake up all leading readers in the rwlock's
      wait queue or one leading writer or no-one if no thread is waiting.
      Similarly, the last reader leaving the critical section must wake up the
      sleeping writer if there are any sleeping threads left at all. As
      another example, if a writer at the beginning of the rwlock's wait queue
      times out and the lock is held by at least one reader, the writer which
      has timed out must first wake up all readers that follow it in the
      queue prior to signalling the timeout itself and giving up.</para>

      <para>Due to the issues mentioned in the previous paragraph, the
      reader/writer lock implementation needs to walk the rwlock wait queue's
      list of sleeping threads directly, in order to find out the type of
      access that the queueing threads demand. This makes the code difficult
      to understand and dependent on the internal implementation of the wait
      queue. Nevertheless, it remains unclear to the authors of HelenOS
      whether a simpler but equivalently fair solution exists.</para>

      <para>The implementation of rwlocks, as already described, makes use of
      a single wait queue for both readers and writers, thus avoiding any
      possibility of starvation. In fact, rwlocks use a mutex rather than
      a bare wait queue. This mutex is called <emphasis>exclusive</emphasis>
      and is used to synchronize writers. The writer's lock operation,
      <code>rwlock_write_lock_timeout()</code>, simply tries to acquire the
      exclusive mutex. If it succeeds, the writer is granted the rwlock.
      However, if the operation fails (e.g. times out), the writer must check
      for potential readers at the head of the list of sleeping threads
      associated with the mutex's wait queue and then proceed according to the
      procedure outlined above.</para>

      <para>The exclusive mutex plays an important role in reader
      synchronization as well. However, a reader doing the reader's lock
      operation, <code>rwlock_read_lock_timeout()</code>, may bypass this
      mutex when it detects that:</para>

      <orderedlist>
        <listitem>
          <para>there are other readers in the critical section and</para>
        </listitem>

        <listitem>
          <para>there are no sleeping threads waiting for the exclusive
          mutex.</para>
        </listitem>
      </orderedlist>

      <para>If both conditions are true, the reader will bypass the mutex,
      increment the number of readers in the critical section and then enter
      the critical section. Note that if there are any sleeping threads at the
      beginning of the wait queue, the first must be a writer. If the
      conditions are not fulfilled, the reader normally waits until the
      exclusive mutex is granted to it.</para>
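      <para>The two-condition check can be sketched as a single fast-path
      function. The rwlock state is reduced to the two facts the reader
      inspects; the field and function names are illustrative, not the
      HelenOS structures.</para>

```c
#include <stdbool.h>

/* An rwlock reduced to the two facts the reader's fast path checks.
 * The real structure holds the exclusive mutex and its wait queue. */
typedef struct {
    int readers;          /* readers inside the critical section */
    int sleeping_threads; /* sleepers in the exclusive mutex's waitq */
} rwlock_t;

/* Returns true if the reader may bypass the exclusive mutex and enter
 * immediately; false if it must queue up for the mutex instead. */
static bool reader_fast_path(rwlock_t *rw)
{
    if (rw->readers > 0 && rw->sleeping_threads == 0) {
        rw->readers++;    /* join the current batch of readers */
        return true;
    }
    return false;         /* fall back to taking the mutex */
}
```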
    </section>

    <section>
      <title>Condition Variables</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- condition variable</secondary>
      </indexterm>

      <para>Condition variables can be used for waiting until a condition
      becomes true. In this respect, they are similar to wait queues. But
      contrary to wait queues, condition variables have different semantics
      that allow events to be lost when there is no thread waiting for them.
      In order to support this, condition variables don't use direct hand-off
      and operate in a way similar to the example below. A thread waiting for
      the condition to become true does the following:</para>

      <example>
        <title>Use of <code>condvar_wait_timeout()</code>.</title>

        <programlisting language="C"><function>mutex_lock</function>(<varname>mtx</varname>);
while (!<varname>condition</varname>)
        <function>condvar_wait_timeout</function>(<varname>cv</varname>, <varname>mtx</varname>);
/* <remark>the condition is true, do something</remark> */
<function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting>
      </example>

      <para>A thread that causes the condition to become true signals this
      event like this:</para>

      <example>
        <title>Use of <code>condvar_signal()</code>.</title>

        <programlisting language="C"><function>mutex_lock</function>(<varname>mtx</varname>);
<varname>condition</varname> = <constant>true</constant>;
<function>condvar_signal</function>(<varname>cv</varname>);  /* <remark>condvar_broadcast(cv);</remark> */
<function>mutex_unlock</function>(<varname>mtx</varname>);</programlisting>
      </example>

      <para>The wait operation, <code>condvar_wait_timeout()</code>, always
      puts the calling thread to sleep. The thread then sleeps until another
      thread invokes <code>condvar_broadcast()</code> on the same condition
      variable or until it is woken up by <code>condvar_signal()</code>. The
      <code>condvar_signal()</code> operation unblocks the first thread
      blocking on the condition variable while the
      <code>condvar_broadcast()</code> operation unblocks all threads blocking
      there. If there are no blocking threads, these two operations have no
      effect.</para>

      <para>Note that the threads must synchronize over a dedicated mutex. To
      prevent a race condition between <code>condvar_wait_timeout()</code> and
      <code>condvar_signal()</code> or <code>condvar_broadcast()</code>, the
      mutex is passed to <code>condvar_wait_timeout()</code> which then
      atomically puts the calling thread to sleep and unlocks the mutex. When
      the thread eventually wakes up, <code>condvar_wait()</code> regains the
      mutex and returns.</para>

      <para>Also note that there is no conditional operation for condition
      variables. Such an operation would make no sense since condition
      variables are defined to forget events for which there is no waiting
      thread and because <code>condvar_wait()</code> must always go to sleep.
      The operation with a timeout is supported as usual.</para>

      <para>In HelenOS, condition variables are based on wait queues. As
      mentioned above, wait queues remember missed events while condition
      variables must not do so. The reason is that condition variables are
      designed for scenarios in which an event might occur many times without
      being picked up by any waiting thread. On the other hand, wait queues
      would remember any event that had not been picked up by a call to
      <code>waitq_sleep_timeout()</code>. Therefore, if wait queues were used
      directly and without any changes to implement condition variables, the
      missed_wakeup counter would hurt performance of the implementation: the
      <code>while</code> loop in <code>condvar_wait_timeout()</code> would
      effectively do busy waiting until all missed wakeups were
      discarded.</para>

      <para>The requirement on the wait operation to atomically put the caller
      to sleep and release the mutex poses an interesting problem on
      <code>condvar_wait_timeout()</code>. More precisely, the thread should
      sleep in the condvar's wait queue prior to releasing the mutex, but it
      must not hold the mutex when it is sleeping.</para>

      <para>Problems described in the two previous paragraphs are addressed in
      HelenOS by dividing the <code>waitq_sleep_timeout()</code> function into
      three pieces:</para>

      <orderedlist>
        <listitem>
          <para><code>waitq_sleep_prepare()</code> prepares the thread to go
          to sleep by, among other things, locking the wait queue;</para>
        </listitem>

        <listitem>
          <para><code>waitq_sleep_timeout_unsafe()</code> implements the core
          blocking logic;</para>
        </listitem>

        <listitem>
          <para><code>waitq_sleep_finish()</code> performs cleanup after
          <code>waitq_sleep_timeout_unsafe()</code>; after this call, the wait
          queue spinlock is guaranteed to be unlocked by the caller.</para>
        </listitem>
      </orderedlist>

      <para>The stock <code>waitq_sleep_timeout()</code> is then a mere
      wrapper that calls these three functions. It is provided for convenience
      in cases where the caller doesn't require such low-level control.
      However, the implementation of <code>condvar_wait_timeout()</code> does
      need this finer-grained control because it has to interleave calls to
      these functions with other actions. It carries its operations out in the
      following order:</para>

      <orderedlist>
        <listitem>
          <para>calls <code>waitq_sleep_prepare()</code> in order to lock the
          condition variable's wait queue,</para>
        </listitem>

        <listitem>
          <para>releases the mutex,</para>
        </listitem>

        <listitem>
          <para>clears the counter of missed wakeups,</para>
        </listitem>

        <listitem>
          <para>calls <code>waitq_sleep_timeout_unsafe()</code>,</para>
        </listitem>

        <listitem>
          <para>retakes the mutex,</para>
        </listitem>

        <listitem>
          <para>calls <code>waitq_sleep_finish()</code>.</para>
        </listitem>
      </orderedlist>
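      <para>The interleaving above can be sketched with stubs that merely
      record the order in which the pieces run. Everything except the six
      steps themselves is illustrative.</para>

```c
#include <string.h>

/* A trace of the order in which the stubbed operations run. */
enum { MAX_STEPS = 8 };
static const char *trace[MAX_STEPS];
static int steps;

static void step(const char *name) { trace[steps++] = name; }

/* Stubs standing in for the real waitq and mutex operations. */
static void waitq_sleep_prepare(void)        { step("prepare"); }
static void mutex_unlock_(void)              { step("unlock"); }
static void clear_missed_wakeups(void)       { step("clear"); }
static void waitq_sleep_timeout_unsafe(void) { step("sleep"); }
static void mutex_lock_(void)                { step("relock"); }
static void waitq_sleep_finish(void)         { step("finish"); }

/* The body of condvar_wait_timeout(): the mutex is dropped only after
 * the wait queue is locked, so a wakeup cannot slip in between. */
static void condvar_wait_timeout_sketch(void)
{
    waitq_sleep_prepare();        /* 1. lock the condvar's wait queue  */
    mutex_unlock_();              /* 2. release the user's mutex       */
    clear_missed_wakeups();       /* 3. forget pre-recorded events     */
    waitq_sleep_timeout_unsafe(); /* 4. the core blocking logic        */
    mutex_lock_();                /* 5. retake the mutex               */
    waitq_sleep_finish();         /* 6. clean up, unlock the waitq     */
}
```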
    </section>
  </section>

  <section>
    <title>Userspace Synchronization</title>

    <section>
      <title>Futexes</title>

      <indexterm>
        <primary>synchronization</primary>

        <secondary>- futex</secondary>
      </indexterm>

      <para>In a multithreaded environment, userspace threads need an
      efficient way to synchronize. HelenOS borrows an idea from Linux<xref
      linkend="futex" /> to implement a lightweight userspace synchronization
      and mutual exclusion primitive called a futex. The key idea behind
      futexes is that non-contended synchronization is very fast and takes
      place only in userspace without the kernel's intervention. When several
      threads contend for a futex, only one of them wins; the other threads go
      to sleep via a dedicated syscall.</para>

      <para>The userspace part of the futex is a mere integer variable, a
      counter, that can be atomically incremented or decremented. The kernel
      part is rather more complicated. For each userspace futex counter, there
      is a kernel structure describing the futex. This structure
      contains:</para>

      <itemizedlist>
        <listitem>
          <para>the number of references,</para>
        </listitem>

        <listitem>
          <para>the physical address of the userspace futex counter,</para>
        </listitem>

        <listitem>
          <para>a hash table link and</para>
        </listitem>

        <listitem>
          <para>a wait queue.</para>
        </listitem>
      </itemizedlist>

      <para>The reference count helps to find out when the futex is no longer
      needed and can be deallocated. The physical address is used as a key for
      the global futex hash table. Note that the kernel has to use the
      physical address to identify the futex because one futex can be used for
      synchronization among different address spaces and can have different
      virtual addresses in each of them. Finally, the kernel futex structure
      includes a wait queue. The wait queue is used to put threads that didn't
      win the futex to sleep until the winner wakes one of them up.</para>

      <para>A futex should be initialized by setting its userspace counter to
      one before it is used. When locking the futex via the userspace library
      function <code>futex_down_timeout()</code>, the library code atomically
      decrements the futex counter and tests if it dropped below zero. If it
      did, then the futex is locked by another thread and the library uses the
      <constant>SYS_FUTEX_SLEEP</constant> syscall to put the thread to sleep.
      If the counter decreased to 0, then there was no contention and the
      thread can enter the critical section protected by the futex. When the
      thread later leaves that critical section, it atomically increments the
      counter using the library function <code>futex_up()</code>. If the
      counter value increased to one, then there again was no contention and
      no action needs to be taken. However, if it increased to zero or even a
      smaller number, then there are sleeping threads waiting for the futex to
      become available. In that case, the library has to use the
      <constant>SYS_FUTEX_WAKEUP</constant> syscall to wake one sleeping
      thread.</para>
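      <para>The counter manipulation on the fast path can be sketched with
      C11 atomics. The syscalls are replaced here by boolean return values
      that say whether the kernel would have to be entered; the names are
      illustrative, not the actual library code.</para>

```c
#include <stdatomic.h>
#include <stdbool.h>

/* The userspace part of a futex: just an atomic counter. */
typedef struct {
    atomic_int count;
} futex_t;

static void futex_initialize(futex_t *f)
{
    atomic_store(&f->count, 1);  /* one = unlocked */
}

/* Lock fast path: decrement and test whether the counter dropped below
 * zero. Returns true if the lock was taken without kernel help; false
 * means the caller would have to issue SYS_FUTEX_SLEEP. */
static bool futex_down_fast(futex_t *f)
{
    return atomic_fetch_sub(&f->count, 1) - 1 >= 0;
}

/* Unlock fast path: increment and test whether the counter reached
 * exactly one. Returns true if there was no contention; false means
 * the caller would have to issue SYS_FUTEX_WAKEUP. */
static bool futex_up_fast(futex_t *f)
{
    return atomic_fetch_add(&f->count, 1) + 1 == 1;
}
```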

      <para>So far, futexes are very elegant in that they don't interfere with
      the kernel when there is no contention for them. Another nice aspect of
      futexes is that they don't need to be registered anywhere prior to the
      first kernel intervention.</para>

      <para>Both futex-related syscalls, <constant>SYS_FUTEX_SLEEP</constant>
      and <constant>SYS_FUTEX_WAKEUP</constant>, are mere wrappers for
      <code>waitq_sleep_timeout()</code> and <code>waitq_wakeup()</code>,
      respectively, with some housekeeping functionality added. Both syscalls
      need to translate the userspace virtual address of the futex counter to
      a physical address in order to support synchronization across shared
      memory. Once the physical address is known, the kernel checks whether
      the futex is already known to it by searching the global futex hash
      table for an item with the physical address of the futex counter as a
      key. When the search is successful, it returns the address of the kernel
      futex structure associated with the counter. If the hash table does not
      contain the key, the kernel creates a new futex structure and inserts it
      into the hash table. At the same time, the current task's B+tree of
      known futexes is searched in order to find out if the task already uses
      the futex. If it does, no action is taken. Otherwise the reference count
      of the existing futex is incremented.</para>
    </section>
  </section>
</chapter>