<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <para>In previous chapters, this book described the scheduling subsystem as
  the creator of the impression that threads execute in parallel. The memory
  management subsystem, on the other hand, creates the impression that there
  is enough physical memory for the kernel and that userspace tasks have the
  entire address space only for themselves.</para>

  <section>
    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <title>Zones and frames</title>

      <para>HelenOS represents continuous areas of physical memory in
      structures called frame zones (abbreviated as zones). Each zone contains
      information about the number of allocated and unallocated physical
      memory frames as well as the physical base address of the zone and the
      number of frames contained in it. A zone also contains an array of frame
      structures describing each frame of the zone and, last but not least,
      each zone is equipped with a buddy system that facilitates effective
      allocation of power-of-two sized blocks of frames.</para>

      <para>This organization of physical memory provides good preconditions
      for hot-plugging of more zones. There is also one currently unused zone
      attribute: <code>flags</code>. The attribute could be used to give a
      special meaning to some zones in the future.</para>

      <para>The zones are linked in a doubly-linked list. This might seem a
      bit inefficient because the zone list is walked every time a frame is
      allocated or deallocated. However, this does not represent a significant
      performance problem as it is expected that the number of zones will be
      rather low. Moreover, most architectures merge all zones into
      one.</para>

      <para>Every physical memory frame in a zone is described by a structure
      that contains the number of references and other data used by the buddy
      system.</para>
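
      <para>The following sketch illustrates in simplified C how a zone and
      the frame structures describing its frames might be laid out. The field
      and type names are illustrative only and are not the literal HelenOS
      definitions.</para>

      <programlisting><![CDATA[
/* Simplified sketch of zone and frame bookkeeping; the real
 * structures contain additional fields (locks, buddy data etc.). */
typedef struct {
	size_t refcount;      /* number of references to this frame */
	link_t buddy_link;    /* link to the buddy system lists */
	uint8_t buddy_order;  /* order of the block this frame heads */
} frame_t;

typedef struct {
	uintptr_t base;       /* physical base address of the zone */
	size_t count;         /* number of frames in the zone */
	size_t free_count;    /* number of unallocated frames */
	size_t busy_count;    /* number of allocated frames */
	int flags;            /* currently unused zone attribute */
	frame_t *frames;      /* array describing each frame */
} zone_t;
]]></programlisting>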
    </section>

    <section id="frame_allocator">
      <indexterm>
        <primary>frame allocator</primary>
      </indexterm>

      <title>Frame allocator</title>

      <para>The frame allocator satisfies kernel requests to allocate
      power-of-two sized blocks of physical memory. Because of the zonal
      organization of physical memory, the frame allocator is always working
      within the context of a particular frame zone. In order to carry out the
      allocation requests, the frame allocator is tightly integrated with the
      buddy system belonging to the zone. The frame allocator is also
      responsible for updating information about the number of free and busy
      frames in the zone. <figure float="1">
          <mediaobject id="frame_alloc">
            <imageobject role="pdf">
              <imagedata fileref="images/frame_alloc.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/frame_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/frame_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Frame allocator scheme.</title>
        </figure></para>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request via the function
        <code>frame_alloc()</code>, the frame allocator first tries to find a
        zone that can satisfy the request (i.e. has the required amount of
        free frames). Once a suitable zone is found, the frame allocator uses
        the buddy allocator on the zone's buddy system to perform the
        allocation. If no suitable zone is found, the frame allocator tries to
        reclaim slab memory.</para>

        <para>During deallocation, which is triggered by a call to
        <code>frame_free()</code>, the frame allocator looks up the respective
        zone that contains the frame being deallocated. Afterwards, it calls
        the buddy allocator again, this time to take care of deallocation
        within the zone's buddy system.</para>
      </formalpara>
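
      <para>As an illustration, a kernel client might use the allocator along
      the lines of the following sketch. The exact prototypes differ between
      HelenOS versions, so the flag name and the conversion macro below are
      assumptions.</para>

      <programlisting><![CDATA[
/* Hypothetical usage sketch; the real frame_alloc()/frame_free()
 * prototypes may differ. */
#define ORDER  3  /* request a block of 2^3 = 8 frames */

void example(void)
{
	/* Allocate a power-of-two sized block of physical frames. */
	void *frames = frame_alloc(ORDER, FRAME_KA /* assumed flag */);
	if (frames == NULL)
		return;  /* allocation failed even after slab reclaim */

	/* ... use the frames ... */

	/* Return the block; the allocator finds the owning zone and
	 * lets its buddy system coalesce the freed block. */
	frame_free(KA2PA(frames) /* assumed conversion macro */);
}
]]></programlisting>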
    </section>

    <section id="buddy_allocator">
      <indexterm>
        <primary>buddy system</primary>
      </indexterm>

      <title>Buddy allocator</title>

      <para>In the buddy system, the memory is broken down into power-of-two
      sized naturally aligned blocks. These blocks are organized in an array
      of lists, in which the list with index <emphasis>i</emphasis> contains
      all unallocated blocks of size
      <emphasis>2<superscript>i</superscript></emphasis>. The index
      <emphasis>i</emphasis> is called the order of the block. Should there be
      two adjacent equally sized blocks in the list <emphasis>i</emphasis>
      (i.e. buddies), the buddy allocator would coalesce them and put the
      resulting block in the list <emphasis>i + 1</emphasis>, provided that
      the resulting block would be naturally aligned. Similarly, when the
      allocator is asked to allocate a block of size
      <emphasis>2<superscript>i</superscript></emphasis>, it first tries to
      satisfy the request from the list with index <emphasis>i</emphasis>. If
      the request cannot be satisfied (i.e. the list <emphasis>i</emphasis> is
      empty), the buddy allocator will try to allocate and split a larger
      block from the list with index <emphasis>i + 1</emphasis>. Both of these
      algorithms are recursive. The recursion ends either when there are no
      blocks to coalesce in the former case or when there are no blocks that
      can be split in the latter case.</para>
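
      <para>Thanks to natural alignment, the buddy of a block can be found
      with simple bit arithmetic: the indices of two buddies of order
      <emphasis>i</emphasis> differ exactly in bit <emphasis>i</emphasis>. A
      minimal sketch with illustrative names:</para>

      <programlisting><![CDATA[
/* Index of the buddy of the block that starts at frame index 'idx'
 * and has the given order. Natural alignment guarantees that the
 * two buddies differ exactly in bit 'order' of the index. */
size_t buddy_index(size_t idx, uint8_t order)
{
	return idx ^ ((size_t) 1 << order);
}

/* Example: an order-2 block (4 frames) starting at index 8 has its
 * buddy at index 8 ^ 4 = 12; coalescing them yields an order-3
 * block starting at index 8. */
]]></programlisting>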

      <para>This approach greatly reduces external fragmentation of memory and
      helps in allocating bigger continuous blocks of memory aligned to their
      size. On the other hand, the buddy allocator suffers from increased
      internal fragmentation of memory and is not suitable for general kernel
      allocations. This purpose is better addressed by the <link
      linkend="slab">slab allocator</link>.<figure float="1">
          <mediaobject id="buddy_alloc">
            <imageobject role="pdf">
              <imagedata fileref="images/buddy_alloc.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/buddy_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/buddy_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Buddy system scheme.</title>
        </figure></para>

      <section>
        <title>Implementation</title>

        <para>The buddy allocator is, in fact, an abstract framework which can
        be easily specialized to serve one particular task. It knows nothing
        about the nature of the memory it helps to allocate. To overcome this
        lack of knowledge, the buddy allocator exports an interface that each
        of its clients is required to implement. When supplied with an
        implementation of this interface, the buddy allocator can use
        specialized external functions to find a buddy for a block, split and
        coalesce blocks, manipulate block order and mark blocks busy or
        available.</para>
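
        <para>A sketch of what such a client interface can look like follows.
        The structure and operation names are illustrative, not the literal
        HelenOS declarations.</para>

        <programlisting><![CDATA[
/* Hypothetical client interface of the buddy framework. A client
 * (e.g. a frame zone) supplies implementations of these operations
 * for its own kind of blocks, identified here by list links. */
typedef struct {
	/* Find the available buddy of 'block', or return NULL. */
	link_t *(*find_buddy)(link_t *block);
	/* Split 'block' and return its second half. */
	link_t *(*bisect)(link_t *block);
	/* Coalesce two buddies into one larger block. */
	link_t *(*coalesce)(link_t *a, link_t *b);
	/* Get and set the order stored in the block. */
	uint8_t (*get_order)(link_t *block);
	void (*set_order)(link_t *block, uint8_t order);
	/* Mark the block busy or available. */
	void (*mark_busy)(link_t *block);
	void (*mark_available)(link_t *block);
} buddy_ops_t;
]]></programlisting>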

        <formalpara>
          <title>Data organization</title>

          <para>Each entity allocable by the buddy allocator is required to
          contain space for storing the block order number and a link variable
          used to interconnect blocks within the same order.</para>

          <para>Whatever entities are allocated by the buddy allocator, the
          first entity within a block is used to represent the entire block.
          The first entity keeps the order of the whole block. Other entities
          within the block are assigned the magic value
          <constant>BUDDY_SYSTEM_INNER_BLOCK</constant>. This is especially
          important for effective identification of buddies in a
          one-dimensional array because the entity that represents a potential
          buddy cannot be associated with
          <constant>BUDDY_SYSTEM_INNER_BLOCK</constant> (i.e. if it is
          associated with <constant>BUDDY_SYSTEM_INNER_BLOCK</constant> then
          it is not a buddy).</para>
        </formalpara>
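
        <para>For instance, a buddy check over the one-dimensional frame array
        might look like the following sketch, which reuses the illustrative
        <code>frame_t</code> from above and assumes the magic value is stored
        in the order field:</para>

        <programlisting><![CDATA[
/* Sketch: decide whether the entity at index 'idx' can be a buddy
 * of the given order. An entity marked as an inner part of a larger
 * block can never represent a buddy. */
bool is_potential_buddy(frame_t *frames, size_t idx, uint8_t order)
{
	if (frames[idx].buddy_order == BUDDY_SYSTEM_INNER_BLOCK)
		return false;  /* inner entity, not a block head */

	return frames[idx].buddy_order == order;
}
]]></programlisting>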
      </section>
    </section>

    <section id="slab">
      <indexterm>
        <primary>slab allocator</primary>
      </indexterm>

      <title>Slab allocator</title>

      <para>The majority of memory allocation requests in the kernel are for
      small, frequently used data structures. The basic idea behind the slab
      allocator is that commonly used objects are preallocated in continuous
      areas of physical memory called slabs<footnote>
          <para>Slabs are in fact blocks of physical memory frames allocated
          from the frame allocator.</para>
        </footnote>. Whenever an object is to be allocated, the slab allocator
      returns the first available item from a suitable slab corresponding to
      the object type<footnote>
          <para>The mechanism is rather more complicated, see the next
          paragraph.</para>
        </footnote>. Because the sizes of the requested and the allocated
      object match, the slab allocator significantly reduces internal
      fragmentation.</para>

      <indexterm>
        <primary>slab allocator</primary>

        <secondary>- slab cache</secondary>
      </indexterm>

      <para>Slabs of one object type are organized in a structure called slab
      cache. There are usually several slabs in the slab cache, depending on
      previous allocations. If the slab cache runs out of available slabs, new
      slabs are allocated. In order to exploit parallelism and to avoid
      locking of shared spinlocks, slab caches are equipped with
      processor-private caches of objects called magazines. On each processor,
      there is a two-magazine cache. Full magazines that are not part of any
      per-processor magazine cache are stored in a global list of full
      magazines.</para>
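
      <para>The magazine layer can be pictured with the following simplified C
      sketch; the names are illustrative:</para>

      <programlisting><![CDATA[
/* A magazine is a fixed-size stack of pointers to constructed
 * objects. */
typedef struct {
	size_t busy;   /* number of objects currently stored */
	size_t size;   /* capacity of the magazine */
	void *objs[];  /* stack of object pointers */
} magazine_t;

/* Each processor caches two magazines: allocations pop from
 * 'current'; when it empties, 'current' and 'last' are swapped. */
typedef struct {
	magazine_t *current;
	magazine_t *last;
} mag_cache_t;
]]></programlisting>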

      <indexterm>
        <primary>slab allocator</primary>

        <secondary>- magazine</secondary>
      </indexterm>

      <para>Each object begins its life in a slab. When it is allocated from
      there, the slab allocator calls a constructor that is registered in the
      respective slab cache. The constructor initializes and brings the object
      into a known state. The object is then used by the user. When the user
      later frees the object, the slab allocator puts it into a processor
      private <indexterm>
          <primary>slab allocator</primary>

          <secondary>- magazine</secondary>
        </indexterm>magazine cache, from where it can be allocated again
      later. Note that allocations satisfied from a magazine are already
      initialized by the constructor. When both of the processor-cached
      magazines get full, the allocator will move one of the magazines to the
      list of full magazines. Similarly, when allocating from an empty
      processor magazine cache, the kernel will reload only one magazine from
      the list of full magazines. In other words, the slab allocator tries to
      keep the processor magazine cache only half-full in order to prevent
      thrashing when allocations and deallocations interleave on magazine
      boundaries. The advantage of this setup is that during most of the
      allocations, no global spinlock needs to be held.</para>
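
      <para>A hypothetical creation of a slab cache with a constructor and a
      destructor might look as follows. The creation function's name and
      parameters are assumptions modelled on the description above, not the
      exact HelenOS prototype, and <code>some_obj_t</code> is a made-up object
      type.</para>

      <programlisting><![CDATA[
/* A made-up object type for the example. */
typedef struct {
	size_t refcount;
	void *buf;
} some_obj_t;

/* The constructor brings every object into a known state before its
 * first use; objects freed into magazines stay constructed, so
 * allocations satisfied from a magazine skip this call. */
static int some_obj_ctor(void *obj, unsigned int flags)
{
	some_obj_t *o = (some_obj_t *) obj;

	o->refcount = 0;
	o->buf = NULL;
	return 0;
}

/* The destructor releases whatever the constructor set up; it is
 * called when objects are evicted from magazines back into slabs. */
static int some_obj_dtor(void *obj)
{
	return 0;
}

static slab_cache_t *some_obj_cache;

void some_obj_init(void)
{
	some_obj_cache = slab_cache_create("some_obj",
	    sizeof(some_obj_t), 0 /* alignment */,
	    some_obj_ctor, some_obj_dtor, 0 /* flags */);
}
]]></programlisting>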

      <para>Should HelenOS run short of memory, it would start deallocating
      objects from magazines, calling the slab cache destructor on them and
      putting them back into slabs. When a slab contains no allocated object,
      it is immediately freed.</para>

      <para>
        <figure float="1">
          <mediaobject id="slab_alloc">
            <imageobject role="pdf">
              <imagedata fileref="images/slab_alloc.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/slab_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/slab_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Slab allocator scheme.</title>
        </figure>
      </para>

      <section>
        <title>Implementation</title>

        <para>The slab allocator is closely modelled after <xref
        linkend="Bonwick01" /> with the following exceptions:<itemizedlist>
            <listitem>
              <para>empty slabs are immediately deallocated and</para>
            </listitem>

            <listitem>
              <para>empty magazines are deallocated when not needed.</para>
            </listitem>
          </itemizedlist>The following features are not currently supported
        but would be easy to add: <itemizedlist>
            <listitem>cache coloring and</listitem>

            <listitem>dynamic magazine growth (different magazine sizes are
            already supported, but the allocation strategy would need to be
            adjusted).</listitem>
          </itemizedlist></para>

        <section>
          <title>Allocation/deallocation</title>

          <para>The following two paragraphs summarize and complete the
          description of the slab allocator operation (i.e. the
          <code>slab_alloc()</code> and <code>slab_free()</code>
          functions).</para>

          <formalpara>
            <title>Allocation</title>

            <para><emphasis>Step 1.</emphasis> When an allocation request
            comes, the slab allocator checks availability of memory in the
            current magazine of the local processor magazine cache. If an
            object is available there, the allocator just pops it from the
            magazine and returns it.</para>

            <para><emphasis>Step 2.</emphasis> If the current magazine in the
            processor magazine cache is empty, the allocator will attempt to
            swap it with the last magazine from the cache and return to the
            first step. If the last magazine is also empty, the algorithm will
            fall through to Step 3.</para>

            <para><emphasis>Step 3.</emphasis> Now the allocator is in the
            situation when both magazines in the processor magazine cache are
            empty. The allocator reloads one magazine from the shared list of
            full magazines. If the reload is successful (i.e. there are full
            magazines in the list), the algorithm continues with Step
            1.</para>

            <para><emphasis>Step 4.</emphasis> In this fail-safe step, an
            object is allocated from the conventional slab layer and a pointer
            to it is returned. If no suitable partially full slab is
            available, a new slab is allocated.</para>
          </formalpara>
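
          <para>The four steps translate into roughly the following control
          flow. This is a sketch built on the illustrative magazine structures
          from above, not the literal <code>slab_alloc()</code>
          implementation.</para>

          <programlisting><![CDATA[
/* Sketch of the magazine-layer allocation path. */
void *slab_alloc_sketch(slab_cache_t *cache)
{
	mag_cache_t *mc = &cache->mag_cache[CPU_ID];  /* assumed */

	/* Step 1: pop from the current magazine if possible. */
	if (mc->current->busy > 0)
		return mc->current->objs[--mc->current->busy];

	/* Step 2: swap with the last magazine and retry. */
	if (mc->last->busy > 0) {
		swap_magazines(&mc->current, &mc->last);
		return mc->current->objs[--mc->current->busy];
	}

	/* Step 3: reload one full magazine from the shared list. */
	if (reload_full_magazine(cache, mc))
		return mc->current->objs[--mc->current->busy];

	/* Step 4: fall back to the conventional slab layer. */
	return slab_layer_alloc(cache);
}
]]></programlisting>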

          <formalpara>
            <title>Deallocation</title>

            <para><emphasis>Step 1.</emphasis> During a deallocation request,
            the slab allocator checks whether the current magazine of the
            local processor magazine cache has spare capacity. If it does, the
            pointer to the object is just pushed into the magazine and the
            algorithm returns.</para>

            <para><emphasis>Step 2.</emphasis> If the current magazine is
            full, the allocator will attempt to swap it with the last magazine
            from the cache and return to the first step. If the last magazine
            is also full, the algorithm will fall through to Step 3.</para>

            <para><emphasis>Step 3.</emphasis> Now the allocator is in the
            situation when both magazines in the processor magazine cache are
            full. The allocator tries to allocate a new empty magazine and
            flush one of the full magazines to the shared list of full
            magazines. If it is successful, the algorithm continues with Step
            1.</para>

            <para><emphasis>Step 4.</emphasis> In case of a low memory
            condition, when the allocation of an empty magazine fails, the
            object is moved directly into its slab. Thus, even in the worst
            case, object deallocation does not need to allocate any additional
            memory.</para>
          </formalpara>
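
          <para>Symmetrically, deallocation pushes the object into the current
          magazine and flushes full magazines to the shared list. Again a
          sketch with illustrative helper names:</para>

          <programlisting><![CDATA[
/* Sketch of the magazine-layer deallocation path. */
void slab_free_sketch(slab_cache_t *cache, void *obj)
{
	mag_cache_t *mc = &cache->mag_cache[CPU_ID];  /* assumed */

	/* Steps 1 and 2: find room in one of the two magazines. */
	if (mc->current->busy == mc->current->size)
		swap_magazines(&mc->current, &mc->last);

	if (mc->current->busy < mc->current->size) {
		mc->current->objs[mc->current->busy++] = obj;
		return;
	}

	/* Step 3: flush one full magazine and install an empty one. */
	if (flush_and_replace_magazine(cache, mc)) {
		mc->current->objs[mc->current->busy++] = obj;
		return;
	}

	/* Step 4: low memory; return the object directly to its slab. */
	slab_layer_free(cache, obj);
}
]]></programlisting>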
        </section>
      </section>
    </section>
  </section>

  <section>
    <title>Virtual memory management</title>

    <para>Virtual memory is essential for an operating system because it makes
    several things possible. First, it helps to isolate tasks from each other
    by encapsulating them in their private address spaces. Second, virtual
    memory can give tasks the impression of more memory than is actually
    available. And third, by using virtual memory, there might be multiple
    copies of the same program, linked to the same addresses, running in the
    system. There are at least two known mechanisms for implementing virtual
    memory: segmentation and paging. Even though some processor architectures
    supported by HelenOS<footnote>
        <para>ia32 has full-fledged segmentation.</para>
      </footnote> provide both mechanisms, the kernel makes use solely of
    paging.</para>

    <section id="paging">
      <title>VAT subsystem</title>

      <para>In a paged virtual memory, the entire virtual address space is
      divided into small power-of-two sized naturally aligned blocks called
      pages. The processor implements a translation mechanism that allows the
      operating system to manage mappings between sets of pages and sets of
      identically sized and identically aligned pieces of physical memory
      called frames. As a result, references to continuous virtual memory
      areas don't necessarily need to reference continuous areas of physical
      memory. Supported page sizes usually range from several kilobytes to
      several megabytes. Each page that takes part in the mapping is
      associated with certain attributes that further describe the mapping
      (e.g. access rights, dirty and accessed bits and present bit).</para>

      <para>When the processor accesses a page that is not present (i.e. its
      present bit is not set), the operating system is notified through a
      special exception called page fault. It is then up to the operating
      system to service the page fault. In HelenOS, some page faults are fatal
      and result in either task termination or, in the worst case, kernel
      panic<footnote>
          <para>Such a condition would be either caused by a hardware failure
          or a bug in the kernel.</para>
        </footnote>, while other page faults are used to load memory on demand
      or to notify the kernel about certain events.</para>

      <indexterm>
        <primary>page tables</primary>
      </indexterm>

      <para>The set of all page mappings is stored in a memory structure
      called page tables. Some architectures have no hardware support for page
      tables<footnote>
          <para>On mips32, the TLB-only model is used and the operating system
          is responsible for managing software defined page tables.</para>
        </footnote> while other processor architectures<footnote>
          <para>Like amd64 and ia32.</para>
        </footnote> understand the whole in-memory format thereof. Despite all
      the possible differences in page table formats, the HelenOS VAT
      subsystem<footnote>
          <para>Virtual Address Translation subsystem.</para>
        </footnote> unifies the page table operations under one programming
      interface. For all parts of the kernel, three basic functions are
      provided:</para>

      <itemizedlist>
        <listitem>
          <para><code>page_mapping_insert()</code>,</para>
        </listitem>

        <listitem>
          <para><code>page_mapping_find()</code> and</para>
        </listitem>

        <listitem>
          <para><code>page_mapping_remove()</code>.</para>
        </listitem>
      </itemizedlist>

      <para>The <code>page_mapping_insert()</code> function is used to
      introduce a mapping for one virtual memory page belonging to a
      particular address space into the page tables. Once the mapping is in
      the page tables, it can be searched by <code>page_mapping_find()</code>
      and removed by <code>page_mapping_remove()</code>. All of these
      functions internally select the page table mechanism specific functions
      that carry out the operation itself.</para>
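
      <para>The interface can be used along the lines of the following sketch.
      The parameter lists and flag names are simplified assumptions for
      illustration; the real prototypes may take further arguments.</para>

      <programlisting><![CDATA[
/* Sketch: map one page, look the mapping up and remove it again. */
void mapping_example(as_t *as)
{
	uintptr_t page = 0x10000;    /* virtual page address */
	uintptr_t frame = 0x800000;  /* physical frame address */

	page_mapping_insert(as, page, frame, PAGE_READ | PAGE_WRITE);

	pte_t *pte = page_mapping_find(as, page);
	if (pte != NULL) {
		/* The mapping exists; the mechanism-specific functions
		 * (hierarchical page tables or the global page hash
		 * table) were selected internally. */
	}

	page_mapping_remove(as, page);
}
]]></programlisting>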

      <para>There are currently two supported mechanisms: generic 4-level
      hierarchical page tables and global page hash table. Both of the
      mechanisms are generic as they cover several hardware platforms. For
      instance, the 4-level hierarchical page table mechanism is used by
      amd64, ia32, mips32 and ppc32. These architectures have the following
      native page table formats: 4-level, 2-level, TLB-only and hardware hash
      table, respectively. On the other hand, the global page hash table is
      used on ia64, which can be TLB-only or use a hardware hash table.
      Although only two mechanisms are currently implemented, other mechanisms
      (e.g. B+tree) can be easily added.</para>

      <section id="page_tables">
        <indexterm>
          <primary>page tables</primary>

          <secondary>- hierarchical</secondary>
        </indexterm>

        <title>Hierarchical 4-level page tables</title>

        <para>Hierarchical 4-level page tables are a generalization of the
        frequently used hierarchical model of page tables. In this mechanism,
        each address space has its own page tables. To avoid confusion in
        terminology used by hardware vendors, in HelenOS, the root level page
        table is called PTL0, the two middle levels are called PTL1 and PTL2,
        and, finally, the leaf level is called PTL3. All architectures using
        this mechanism are required to use PTL0 and PTL3. However, the middle
        levels can be left out, depending on the hardware hierarchy or
        structure of software-only page tables. The genericity is achieved
        through a set of macros that define transitions from one level to
        another. Unused levels are optimised out by the compiler.
        <figure float="1">
          <mediaobject id="mm_pt">
            <imageobject role="pdf">
              <imagedata fileref="images/mm_pt.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/mm_pt.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/mm_pt.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Hierarchical 4-level page tables.</title>
        </figure>
        </para>
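
        <para>The idea behind the level-transition macros can be pictured with
        the following sketch for a hypothetical two-level architecture. The
        macro names mimic the scheme described above, but the definitions are
        illustrative.</para>

        <programlisting><![CDATA[
/* On a two-level architecture, only PTL0 and PTL3 exist; the unused
 * middle levels PTL1 and PTL2 become identity transitions that the
 * compiler optimizes away. */
#define PTL0_INDEX(vaddr)  (((vaddr) >> 22) & 0x3ff)
#define PTL1_INDEX(vaddr)  0                   /* unused level */
#define PTL2_INDEX(vaddr)  0                   /* unused level */
#define PTL3_INDEX(vaddr)  (((vaddr) >> 12) & 0x3ff)

#define GET_PTL1_ADDRESS(ptl0, i)  (((uintptr_t *) (ptl0))[(i)])
#define GET_PTL2_ADDRESS(ptl1, i)  (ptl1)      /* pass through */
#define GET_PTL3_ADDRESS(ptl2, i)  (ptl2)      /* pass through */

/* The generic walk is then written only once: */
uintptr_t walk(uintptr_t ptl0, uintptr_t page)
{
	uintptr_t ptl1 = GET_PTL1_ADDRESS(ptl0, PTL0_INDEX(page));
	uintptr_t ptl2 = GET_PTL2_ADDRESS(ptl1, PTL1_INDEX(page));
	return GET_PTL3_ADDRESS(ptl2, PTL2_INDEX(page));
}
]]></programlisting>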
      </section>

      <section>
        <indexterm>
          <primary>page tables</primary>

          <secondary>- hashing</secondary>
        </indexterm>

        <title>Global page hash table</title>

        <para>Implementation of the global page hash table was encouraged by
        64-bit architectures that can have rather sparse address spaces. The
        hash table contains valid mappings only. Each entry of the hash table
        contains an address space pointer, virtual memory page number (VPN),
        physical memory frame number (PFN) and a set of flags. The pair of the
        address space pointer and the virtual memory page number is used as a
        key for the hash table. One of the major differences between the
        global page hash table and hierarchical 4-level page tables is that
        there is only a single global page hash table in the system while
        hierarchical page tables exist per address space. Thus, the global
        page hash table contains information about mappings of all address
        spaces in the system.
        <figure float="1">
          <mediaobject id="mm_hash">
            <imageobject role="pdf">
              <imagedata fileref="images/mm_hash.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/mm_hash.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/mm_hash.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Global page hash table.</title>
        </figure>
        </para>

        <para>The global page hash table mechanism uses the generic hash table
        type as described in the chapter dedicated to <link
        linkend="hashtables">data structures</link> earlier in this
        book.</para>
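
        <para>A sketch of a hash table entry and of one possible hash function
        follows; the field names and the hash are illustrative assumptions,
        not the HelenOS definitions.</para>

        <programlisting><![CDATA[
/* Sketch of a global page hash table entry; the pair (address
 * space, VPN) forms the lookup key. */
typedef struct {
	link_t link;    /* collision chain link */
	as_t *as;       /* address space the mapping belongs to */
	uintptr_t vpn;  /* virtual page number (key, with 'as') */
	uintptr_t pfn;  /* physical frame number */
	int flags;      /* access rights, dirty, accessed, ... */
} pht_entry_t;

/* One possible hash: mix the page number with the address space
 * pointer and reduce the result to the number of buckets. */
size_t pht_hash(as_t *as, uintptr_t vpn)
{
	return ((uintptr_t) as ^ vpn) % PHT_BUCKETS;
}
]]></programlisting>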
      </section>
    </section>
  </section>

  <section id="tlb">
    <indexterm>
      <primary>TLB</primary>
    </indexterm>

    <title>Translation lookaside buffer</title>

    <para>Due to the extensive overhead of the several extra memory accesses
    that a page table lookup would require on every instruction, modern
    architectures deploy a fast associative cache of recently used page
    mappings. This cache is called TLB (Translation Lookaside Buffer) and is
    present on every processor in the system. As has already been pointed out,
    the TLB is the only page translation mechanism for some
    architectures.</para>

    <section id="tlb_shootdown">
      <indexterm>
        <primary>TLB</primary>

        <secondary>- TLB shootdown</secondary>
      </indexterm>

      <title>TLB consistency</title>

      <para>The operating system is responsible for keeping the TLB consistent
      with the page tables. Whenever mappings are modified or purged from the
      page tables, or when an address space identifier is reused, the kernel
      needs to invalidate the respective contents of the TLB. Some TLB types
      support partial invalidation of their content (e.g. ranges of pages or
      address spaces) while other types can be invalidated only entirely. The
      invalidation must be done on all processors because there is one TLB per
      processor. Maintaining TLB consistency on multiprocessor configurations
      is not as trivial as it might seem at first glance.</para>

      <para>The remote TLB invalidation is called TLB shootdown. HelenOS uses
      a simplified variant of the algorithm described in <xref
      linkend="Black89" />.</para>

      <para>TLB shootdown is performed in three phases.</para>

      <formalpara>
        <title>Phase 1.</title>

        <para>The initiator clears its TLB flag and locks the global TLB
        spinlock. The request is then enqueued into all other processors' TLB
        shootdown message queues. When the TLB shootdown message queue is full
        on any processor, the queue is purged and a single request to
        invalidate the entire TLB is stored there. Once all the TLB shootdown
        messages have been dispatched, the initiator sends all other
        processors an interrupt to notify them about the incoming TLB
        shootdown message. It then spins until all processors accept the
        interrupt and clear their TLB flags.</para>
      </formalpara>

      <formalpara>
        <title>Phase 2.</title>

        <para>Except for the initiator, all other processors are spinning on
        the TLB spinlock. The initiator is now free to modify the page tables
        and purge its own TLB. The initiator then unlocks the global TLB
        spinlock and sets its TLB flag.</para>
      </formalpara>

      <formalpara>
        <title>Phase 3.</title>

        <para>When the spinlock is unlocked by the initiator, other processors
        are sequentially granted the spinlock. However, once they manage to
        lock it, they immediately release it. Each processor invalidates its
        TLB according to messages found in its TLB shootdown message queue. In
        the end, each processor sets its TLB flag and resumes its previous
        operation.</para>
      </formalpara>
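
      <para>The initiator's side of the protocol can be sketched as follows.
      The helper names, including the iteration macro, are illustrative; the
      real code additionally degrades to full invalidation when a message
      queue overflows, as described above.</para>

      <programlisting><![CDATA[
/* Sketch of the TLB shootdown initiator (phases 1 and 2). */
void tlb_shootdown_sketch(as_t *as, uintptr_t page, size_t count)
{
	CPU->tlb_active = false;       /* clear own TLB flag */
	spinlock_lock(&tlblock);       /* global TLB spinlock */

	/* Phase 1: enqueue the request and interrupt the others. */
	for_each_other_cpu(cpu) {
		enqueue_shootdown_message(cpu, as, page, count);
		send_tlb_shootdown_interrupt(cpu);
	}
	while (any_other_cpu_tlb_active())
		;  /* spin until everyone has accepted the interrupt */

	/* Phase 2: safely modify the page tables, purge own TLB. */
	modify_page_tables(as, page, count);
	tlb_invalidate_pages(as, page, count);

	spinlock_unlock(&tlblock);     /* phase 3 runs on the others */
	CPU->tlb_active = true;        /* set own TLB flag */
}
]]></programlisting>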
    </section>
  </section>

  <section>
    <title>Address spaces</title>

    <para>In HelenOS, address spaces are objects that encapsulate the
    following items:</para>

    <itemizedlist>
      <listitem>
        <para>address space identifier,</para>
      </listitem>

      <listitem>
        <para>page table PTL0 pointer and</para>
      </listitem>

      <listitem>
        <para>a set of mutually disjoint address space areas.</para>
      </listitem>
    </itemizedlist>

    <para>Address space identifiers will be discussed later in this section.
    The address space contains a pointer to PTL0, provided that the
    architecture uses per address space page tables such as the hierarchical
    4-level page tables. The most interesting component is the B+tree of
    address space areas belonging to the address space.</para>

    <section>
      <title>Address space areas</title>

      <para>Because an address space can be composed of heterogeneous mappings
      such as userspace code, data, read-only data and kernel memory, it is
      further broken down into smaller homogeneous units called address space
      areas. An address space area represents a continuous piece of userspace
      virtual memory associated with common flags. Kernel memory mappings do
      not take part in address space areas because they are hardwired either
      into TLBs or page tables and are thus shared by all address spaces. The
      flags are a combination of:</para>

      <itemizedlist>
        <listitem>
          <para><constant>AS_AREA_READ</constant>,</para>
        </listitem>

        <listitem>
          <para><constant>AS_AREA_WRITE</constant>,</para>
        </listitem>

        <listitem>
          <para><constant>AS_AREA_EXEC</constant> and</para>
        </listitem>

        <listitem>
          <para><constant>AS_AREA_CACHEABLE</constant>.</para>
        </listitem>
      </itemizedlist>

      <para>The <constant>AS_AREA_READ</constant> flag is implicit and cannot
      be removed. The <constant>AS_AREA_WRITE</constant> flag denotes a
      writable address space area and <constant>AS_AREA_EXEC</constant> is
      used for areas containing code. The combination of
      <constant>AS_AREA_WRITE</constant> and <constant>AS_AREA_EXEC</constant>
      is not allowed. Some architectures don't differentiate between
      executable and non-executable mappings. In that case,
      <constant>AS_AREA_EXEC</constant> has no effect on mappings created for
      the address space area in the page tables. If the flags don't have
      <constant>AS_AREA_CACHEABLE</constant> set, the page table entries of
      the area are created with caching disabled. This is useful for address
      space areas containing memory of some memory-mapped device.</para>
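
      <para>For instance, an area for a memory-mapped device could combine the
      flags as in the following sketch; the creation function and its
      parameters are assumptions for illustration only.</para>

      <programlisting><![CDATA[
/* Sketch: create a writable, non-executable, uncacheable area for
 * device registers. AS_AREA_READ is implicit; AS_AREA_WRITE with
 * AS_AREA_EXEC would be rejected. Leaving AS_AREA_CACHEABLE out
 * disables caching for the area. */
void map_device_example(as_t *as, uintptr_t base, size_t size)
{
	unsigned int flags = AS_AREA_READ | AS_AREA_WRITE;

	as_area_t *area = as_area_create(as, flags, size, base
	    /* ..., further arguments omitted */);
	(void) area;
}
]]></programlisting>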

      <para>Address space areas can be backed by a backend that provides
      virtual functions for servicing page faults that occur within the
      address space area, releasing memory allocated by the area and sharing
      the area. Currently, there are three backends supported by HelenOS:
      anonymous memory backend, ELF image backend and physical memory
      backend.</para>

      <formalpara>
        <title>Anonymous memory backend</title>

        <para>Anonymous memory is memory that has no predefined contents, such
        as the userspace stack or heap. Anonymous address space areas are
        backed by memory allocated from the frame allocator. Areas backed by
        this backend can be resized as long as they are not shared.</para>
      </formalpara>

      <formalpara>
        <title>ELF image backend</title>

        <para>Areas backed by the ELF backend are composed of memory that can
        be either initialized, partially initialized or completely anonymous.
        Initialized portions of ELF backend address space areas are those that
        are entirely physically present in the executable image (e.g. code and
        initialized data). Anonymous portions are those pages of the
        <emphasis>bss</emphasis> section that exist entirely outside the
        executable image. Lastly, pages that don't fit into the previous two
        categories are partially initialized as they are both part of the
        image and the <emphasis>bss</emphasis> section. The initialized
        portion does not need any memory from the allocator unless it is
        writable. In that case, pages are duplicated on demand during page
        fault and memory for the copy is allocated from the frame allocator.
        The remaining two categories always require memory from the frame
        allocator. Non-shared address space areas backed by the ELF image
        backend can be resized.</para>
      </formalpara>

      <formalpara>
        <title>Physical memory backend</title>

        <para>The physical memory backend is used by device drivers to access
        physical memory. No additional memory needs to be allocated when
        servicing a page fault in such an area or when sharing it. Areas
        backed by this backend cannot be resized.</para>
      </formalpara>

      <section>
        <title>Memory sharing</title>

        <para>Address space areas can be shared provided that their backend
        supports sharing<footnote>
            <para>Which is the case for all currently supported
            backends.</para>
          </footnote>. When the kernel calls <code>as_area_share()</code>, a
        check is made to see whether the area is already being shared. If the
        area is already shared, it contains a pointer to the share info
        structure. The pointer is then simply copied into the new address
        space area and a reference count in the share info structure is
        incremented. Otherwise a new address space share info structure needs
        to be created. The backend is then called to duplicate the mapping of
        pages for which a frame is allocated. The duplicated mapping is stored
        in the share info structure B+tree called <varname>pagemap</varname>.
        Note that the reference count of the frames put into the
        <varname>pagemap</varname> must be incremented in order to avoid a
        race condition. If the originating address space area had been
        destroyed before the <varname>pagemap</varname> information made it to
        the page tables of other address spaces that take part in the sharing,
        the reference count of the respective frames would have dropped to
        zero and some of them could have been allocated again.</para>
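
        <para>The decision made in <code>as_area_share()</code> can be
        summarized by this sketch; the structure and helper names are
        illustrative.</para>

        <programlisting><![CDATA[
/* Sketch of the sharing decision. */
void share_sketch(as_area_t *area, as_area_t *new_area)
{
	share_info_t *si = area->sh_info;

	if (si != NULL) {
		/* Already shared: just bump the reference count. */
		si->refcount++;
	} else {
		/* First sharing: create the share info structure and
		 * let the backend duplicate the present mappings into
		 * the pagemap B+tree, incrementing each frame's
		 * reference count on the way. */
		si = share_info_create();
		area->sh_info = si;
		area->backend->share(area);  /* fills si->pagemap */
	}

	new_area->sh_info = si;  /* both areas use the same info */
}
]]></programlisting>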
      </section>

      <section>
        <title>Page faults</title>

        <para>When a page fault is encountered in the address space area, the
        address space page fault handler, <code>as_page_fault()</code>,
        invokes the corresponding backend page fault handler to resolve the
        situation. The backend might either confirm the page fault or perform
        a remedy. In the non-shared case, depending on the backend, the page
        fault can usually be remedied by allocating some memory on demand or
        by looking up the frame for the faulting translation in the ELF
        image.</para>

        <para>Shared address space areas need to consider the
        <varname>pagemap</varname> B+tree. The handler first needs to check
        whether the mapping is already present in the
        <varname>pagemap</varname>. If it is there, the frame reference count
        is increased and the page fault is resolved. Otherwise the handler
        proceeds similarly to the non-shared case. If it allocates a physical
        memory frame, it must increment the frame's reference count and add it
        to the <varname>pagemap</varname>.</para>
      </section>
    </section>

    <section>
      <indexterm>
        <primary>address space</primary>

        <secondary>- ASID</secondary>
      </indexterm>

      <title>Address Space ID (ASID)</title>

      <para>Modern processor architectures optimize TLB utilization by
      associating TLB entries with address spaces through assigning
      identification numbers to them. In HelenOS, the term ASID, originally
      taken from the mips32 terminology, is used to refer to the address space
      identification number. The advantage of having ASIDs is that the TLB
      does not have to be invalidated on thread context switch as long as
      ASIDs are unique. Unfortunately, the architectures supported by HelenOS
      use ASID numbers of different widths<footnote>
          <para>amd64 and ia32 don't use a similar abstraction at all, mips32
          has 8-bit ASIDs and ia64 can have ASIDs between 18 and 24 bits
          wide.</para>
        </footnote>, none of which is sufficient. The amd64 and ia32
      architectures cannot make use of ASIDs as their TLB doesn't recognize
      such an abstraction. Other architectures have support for ASIDs, but for
      instance ppc32 doesn't make use of them in the current version of
      HelenOS. The remaining architectures do use ASIDs. However, even on the
      ia64 architecture, the minimal supported width of ASID<footnote>
          <para>RID in ia64 terminology.</para>
        </footnote> is insufficient to provide a unique integer identifier to
      all address spaces that might hypothetically coexist in the running
      system. The situation on mips32 is even worse: the architecture has only
      256 unique identifiers.</para>

      <indexterm>
        <primary>address space</primary>

        <secondary>- ASID stealing</secondary>
      </indexterm>

      <para>To mitigate the shortage of ASIDs, HelenOS uses the following
      strategy. When the system initializes, a FIFO queue<footnote>
          <para>Note that architecture-specific measures are taken to avoid an
          overly large FIFO queue. For instance, seven consecutive ia64 RIDs
          are grouped to form one HelenOS ASID.</para>
        </footnote> is created and filled with all available ASIDs. Moreover,
      every address space remembers the number of processors on which it is
      active. Address spaces that have a valid ASID and that are not active on
      any processor are appended to the list of inactive address spaces with
      valid ASID. When an address space needs to be assigned a valid ASID, it
      first checks the FIFO queue. If it contains at least one ASID, the ASID
      is allocated. If the queue is empty, an ASID is simply stolen from the
      first address space in the list. In that case, the address space that
      loses the ASID in favor of another address space is removed from the
      list. After the stolen ASID is purged from all TLBs, it can be used by
      the new address space. Note that this approach works due to the fact
      that the number of ASIDs is greater than the maximal number of
      processors supported by HelenOS and that there can be only one active
      address space per processor. In other words, when the FIFO queue is
      empty, there must be address spaces that are not active on any
      processor.</para>
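
      <para>The allocation strategy condenses into the following sketch; the
      names are illustrative, not the literal HelenOS code.</para>

      <programlisting><![CDATA[
/* Sketch of ASID allocation with stealing. */
asid_t asid_get_sketch(void)
{
	asid_t asid;

	if (!fifo_empty(&free_asids)) {
		/* Common case: take an ASID from the FIFO queue. */
		asid = fifo_pop(&free_asids);
	} else {
		/* Steal from the first inactive address space with a
		 * valid ASID. Such an address space must exist because
		 * there are more ASIDs than processors and only one
		 * address space can be active per processor. */
		as_t *victim = list_first(&inactive_as_with_asid);
		asid = victim->asid;
		victim->asid = ASID_INVALID;
		list_remove(&victim->inactive_link);
		purge_asid_from_all_tlbs(asid);
	}

	return asid;
}
]]></programlisting>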
    </section>
  </section>
</chapter>