<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <section>
    <title>Virtual memory management</title>

    <section>
      <title>Introduction</title>

      <para>Virtual memory is a memory management technique used by the
      kernel to achieve several mission-critical goals: <itemizedlist>
          <listitem>
             Isolate each task from the other tasks running on the system at the same time.
          </listitem>

          <listitem>
             Allow allocation of more memory than the physical memory size of the machine.
          </listitem>

          <listitem>
             Allow, in general, two programs linked at the same address to be loaded and executed without complicated relocations.
          </listitem>
        </itemizedlist></para>

      <para><!--

                TLB shootdown ASID/ASID:PAGE/ALL.
                TLB shootdown requests can come in asynchronously,
                so there is a cache of TLB shootdown requests. Upon cache overflow, TLB shootdown ALL is executed.

                <para>
                        Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
                        A special address space area type (device) prohibits calling the shrink/extend syscalls on it.
                        An address space has a link to the mapping tables (hierarchical: per address space; hash: global tables).
                </para>

      --></para>
    </section>

    <section>
      <title>Paging</title>

      <para>Virtual memory usually uses a paged memory model, in which the
      virtual address space is divided into <emphasis>pages</emphasis>
      (usually 4096 bytes in size) and physical memory is divided into
      frames of the same size. Each page may be mapped to a frame; on
      every memory access to a virtual address, the CPU then performs
      <emphasis>address translation</emphasis> during instruction
      execution. Accessing a page with no mapping generates a page fault
      exception, which invokes the kernel exception handler and thus lets
      the kernel control the rules of memory access. The kernel stores the
      page mapping information in the
      <link linkend="page_tables">page tables</link>.</para>
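
      <para>The translation step can be illustrated with a small model. The
      following is a minimal Python sketch, not HelenOS code: it assumes a
      single-level page table represented as a dictionary, and the names
      <varname>PAGE_SIZE</varname>, <varname>PageFault</varname> and
      <varname>translate</varname> are hypothetical.</para>

<programlisting language="python"><![CDATA[
```python
PAGE_SIZE = 4096  # assumed page size, matching the usual 4096-byte pages


class PageFault(Exception):
    """Raised when no mapping exists for the accessed page."""


def translate(page_table, vaddr):
    """Translate a virtual address using a {page number: frame number} map."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    if page not in page_table:
        # A real CPU would trap into the kernel exception handler here.
        raise PageFault(hex(vaddr))
    return page_table[page] * PAGE_SIZE + offset
```
]]></programlisting>

      <para>In a real system the lookup is performed by the CPU and the
      exception is delivered to the kernel; the dictionary only stands in
      for that machinery.</para>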

      <para>The majority of architectures use multi-level page tables,
      which means that physical memory must be accessed several times
      before the physical address is known. This would cause serious
      performance overhead in virtual memory management. To avoid it, the
      <link linkend="tlb">Translation Lookaside Buffer (TLB)</link> is
      used.</para>
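
      <para>The cost that a TLB saves can be made visible with a toy model.
      The sketch below assumes a 2-level table built from nested
      dictionaries with 10-bit indices and 12-bit offsets; the function
      names and the <varname>stats</varname> counter are illustrative
      assumptions, not kernel interfaces.</para>

<programlisting language="python"><![CDATA[
```python
LEVEL_BITS = 10   # assumed index width per level
OFFSET_BITS = 12  # 4096-byte pages


def walk(root, vaddr, stats):
    """Walk a 2-level table built from nested dicts; count memory accesses."""
    idx1 = (vaddr >> (OFFSET_BITS + LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
    idx2 = (vaddr >> OFFSET_BITS) & ((1 << LEVEL_BITS) - 1)
    stats["memory_accesses"] += 1      # read the first-level entry
    second = root[idx1]
    stats["memory_accesses"] += 1      # read the second-level entry
    return second[idx2]


def lookup(tlb, root, vaddr, stats):
    """Return the frame number, consulting the TLB before the page tables."""
    page = vaddr >> OFFSET_BITS
    if page not in tlb:
        tlb[page] = walk(root, vaddr, stats)   # TLB miss: walk and cache
    return tlb[page]
```
]]></programlisting>

      <para>A second access to the same page is served from
      <varname>tlb</varname> and adds nothing to the access counter. Missing
      mappings are not handled in this sketch and surface as a
      <literal>KeyError</literal>.</para>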

      <para>At the moment HelenOS does not support swapping.</para>

      <para>Page faults are used to allocate frames on demand within an
      as_area. On architectures that support it, non-executable pages are
      supported.</para>
    </section>

    <section>
      <title>Address spaces</title>

      <section>
        <title>Address spaces and areas</title>

        <para>An address space consists of so-called address space areas,
        organized in a B+tree. These areas describe the parts of the
        address space that are in use and belong to the user address
        space. Each area is defined by its base address, size and flags
        (rwx/dd).</para>

        <para>User threads can manipulate their address space (create,
        resize or share as_areas) by means of syscalls.</para>
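
        <para>The bookkeeping of areas can be sketched as follows. This is a
        simplified Python model under stated assumptions: a sorted list
        searched with <literal>bisect</literal> stands in for the B+tree,
        flags are plain strings, and the class and method names are
        hypothetical rather than the actual syscall interface.</para>

<programlisting language="python"><![CDATA[
```python
import bisect


class AddressSpace:
    """Keeps non-overlapping areas sorted by base address."""

    def __init__(self):
        self.bases = []   # sorted base addresses (the search keys)
        self.areas = {}   # base -> (size, flags)

    def overlaps(self, base, size):
        return any(b < base + size and base < b + s
                   for b, (s, _) in self.areas.items())

    def create_area(self, base, size, flags):
        if self.overlaps(base, size):
            raise ValueError("area overlaps an existing one")
        bisect.insort(self.bases, base)
        self.areas[base] = (size, flags)

    def find_area(self, addr):
        """Return (base, size, flags) of the area containing addr, or None."""
        i = bisect.bisect_right(self.bases, addr) - 1
        if i < 0:
            return None
        base = self.bases[i]
        size, flags = self.areas[base]
        return (base, size, flags) if addr < base + size else None

    def resize_area(self, base, new_size):
        """Resize an area; refuse if it would collide with a neighbour."""
        size, flags = self.areas.pop(base)      # exclude self from the check
        ok = not self.overlaps(base, new_size)
        self.areas[base] = (new_size if ok else size, flags)
        return ok
```
]]></programlisting>

        <para>A page fault handler could use a lookup such as
        <varname>find_area</varname> to decide whether a faulting address
        belongs to any area at all.</para>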
      </section>

      <section>
        <title>Address Space ID (ASID)</title>

        <para>Some hardware is able to tell different address spaces apart
        (the goal being to maximize TLB utilization). It does so by
        associating an address space identifier (ASID, RID, ppc32 ???)
        with each TLB or page table entry. The width of these identifiers
        varies from 8 bits to 24 bits (how many bits does ppc32
        have?).</para>

        <para>The kernel is aware of this and uses its own ASID abstraction
        (on ia64 it is, for example, a number derived from the RID; on
        mips32 it is the ASID itself). The existence of ASIDs is a
        necessary condition for using the <emphasis>global</emphasis> page
        hash table mechanism.</para>

        <para>On all architectures, there are far fewer ASIDs than the
        theoretical number of concurrently running tasks (that is, address
        spaces). A mechanism is therefore implemented that allows an ASID
        to be taken away from one address space and assigned to
        another.</para>
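
        <para>One possible shape of such a mechanism is sketched below. It
        is an illustration only: the victim policy (steal from the address
        space that has held its ASID longest), the class name and the
        dictionary representation of an address space are all assumptions,
        not a description of the HelenOS implementation.</para>

<programlisting language="python"><![CDATA[
```python
from collections import OrderedDict


class AsidAllocator:
    def __init__(self, asid_count):
        self.free = list(range(asid_count))
        self.owners = OrderedDict()   # asid -> address space, oldest first

    def get(self, address_space):
        """Assign an ASID to address_space, stealing one if none is free."""
        if self.free:
            asid = self.free.pop(0)
        else:
            # Steal from the address space that has held its ASID longest.
            asid, victim = self.owners.popitem(last=False)
            victim["asid"] = None
            # A real kernel must also invalidate TLB entries tagged with asid.
        self.owners[asid] = address_space
        address_space["asid"] = asid
        return asid
```
]]></programlisting>

        <para>The victim loses its identifier and would have to request a new
        one the next time it is scheduled.</para>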

        <para>Relationship between tasks and address spaces: in theory, an
        address space could be shared by several tasks, but this
        possibility is not used and is not handled in any way. It
        therefore holds that each task has its own address space.</para>
      </section>

    </section>

    <section>
      <title>Virtual address translation</title>

      <section id="page_tables">
        <title>Page tables</title>

        <para>The HelenOS kernel has two different approaches to the paging
        implementation: <emphasis>4-level page tables</emphasis> and
        <emphasis>global hash tables</emphasis>, both of which are
        accessible via a generic paging abstraction layer. This division
        was caused by the major architectural differences between the
        supported platforms.</para>

        <formalpara>
          <title>4-level page tables</title>

          <para>4-level page tables are a generalization of the hardware
          capabilities of certain platforms. <itemizedlist>
              <listitem>
                 ia32 uses 2-level page tables, with full hardware support.
              </listitem>

              <listitem>
                 amd64 uses 4-level page tables, also with full hardware support.
              </listitem>

              <listitem>
                 mips and ppc32 use 2-level page tables with software-simulated support.
              </listitem>
            </itemizedlist></para>
        </formalpara>
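
        <para>One way such a generalization can work is to give unused
        levels a width of zero bits, so that they always select index 0.
        The sketch below demonstrates this idea in Python; the bit widths,
        the choice of which levels collapse and the name
        <varname>split_vaddr</varname> are assumptions made for
        illustration.</para>

<programlisting language="python"><![CDATA[
```python
def split_vaddr(vaddr, level_bits, offset_bits=12):
    """Split vaddr into per-level table indices (top level first) and offset."""
    offset = vaddr & ((1 << offset_bits) - 1)
    rest = vaddr >> offset_bits
    indices = []
    for bits in reversed(level_bits):      # lowest level is extracted first
        indices.append(rest & ((1 << bits) - 1))
        rest >>= bits
    return list(reversed(indices)), offset


# Assumed layouts for illustration: a zero width makes a level degenerate,
# so it always yields index 0.
FOUR_LEVEL = (9, 9, 9, 9)      # amd64-like: four 9-bit indices
TWO_LEVEL = (10, 0, 0, 10)     # ia32-like: two 10-bit indices
```
]]></programlisting>

        <para>With this convention the same walking code can serve both
        layouts; only the table of widths differs per architecture.</para>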

        <formalpara>
          <title>Global hash tables</title>

          <para>There is only one global page hash table in the whole
          system (it is used by ia64). Note that ia64 currently has VHPT
          turned off. A generic hash table with separate collision chains
          is used.</para>
        </formalpara>
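
        <para>A single system-wide table has to distinguish pages of
        different address spaces, which is why it depends on ASIDs. The
        following Python sketch shows a table with separate chaining keyed
        by an (ASID, page) pair; the class name, bucket count and entry
        layout are hypothetical.</para>

<programlisting language="python"><![CDATA[
```python
class PageHashTable:
    """One system-wide table keyed by (asid, page) with separate chaining."""

    def __init__(self, buckets=64):
        self.chains = [[] for _ in range(buckets)]

    def _chain(self, asid, page):
        return self.chains[hash((asid, page)) % len(self.chains)]

    def insert(self, asid, page, frame):
        chain = self._chain(asid, page)
        for entry in chain:
            if entry[0] == (asid, page):
                entry[1] = frame          # remap an existing page
                return
        chain.append([(asid, page), frame])

    def find(self, asid, page):
        for key, frame in self._chain(asid, page):
            if key == (asid, page):
                return frame
        return None                        # no mapping: would be a page fault

    def remove(self, asid, page):
        chain = self._chain(asid, page)
        chain[:] = [e for e in chain if e[0] != (asid, page)]
```
]]></programlisting>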

        <para>Thanks to the abstract paging interface, it remains possible
        to add more paging implementations, for example B-tree page
        tables.</para>
      </section>

      <section id="tlb">
        <title>Translation Lookaside Buffer</title>

        <para>TLBs cache the information stored in the page tables.
        Alternatively, the page tables (or various hardware extensions,
        e.g. VHPT or the ppc32 hardware hash table) can be viewed as large
        TLBs.</para>

        <para>When a mapping is modified or removed from the page tables,
        consistency between the TLBs and these tables must be ensured, and
        this has to be done on all CPUs. For this purpose, HelenOS has a
        simplified version of the TLB shootdown mechanism. It is a
        variation on the algorithm described in D. Black et al.,
        "Translation Lookaside Buffer Consistency: A Software Approach,"
        Proc. Third Int'l Conf. Architectural Support for Programming
        Languages and Operating Systems, 1989, pp. 113-122.</para>

        <para>It should be noted that more lightweight versions of the TLB
        shootdown algorithm exist.</para>
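
        <para>The general idea of invalidating stale entries on every CPU can
        be modelled as below. This Python sketch does not reproduce the
        algorithm of Black et al. or the HelenOS code. It assumes three
        invalidation scopes (ALL, ASID, single PAGE) and a bounded per-CPU
        request queue that degrades to one full flush on overflow; all names
        and the queue limit are hypothetical.</para>

<programlisting language="python"><![CDATA[
```python
from collections import namedtuple

Shootdown = namedtuple("Shootdown", "scope asid page")  # scope: ALL/ASID/PAGE
QUEUE_LIMIT = 4   # assumed capacity of the per-CPU request queue


class Cpu:
    def __init__(self):
        self.tlb = {}       # (asid, page) -> frame
        self.pending = []   # shootdown requests not yet processed

    def post(self, request):
        """Queue a request; on overflow fall back to one full flush."""
        flush_all = Shootdown("ALL", None, None)
        if self.pending == [flush_all]:
            return                         # already subsumed by a full flush
        if len(self.pending) >= QUEUE_LIMIT:
            self.pending = [flush_all]     # overflow: degrade to a full flush
        else:
            self.pending.append(request)

    def process(self):
        """Apply all pending invalidations to the local TLB."""
        for req in self.pending:
            if req.scope == "ALL":
                self.tlb.clear()
            elif req.scope == "ASID":
                self.tlb = {k: v for k, v in self.tlb.items()
                            if k[0] != req.asid}
            else:
                self.tlb.pop((req.asid, req.page), None)
        self.pending = []


def shootdown(cpus, request):
    """Deliver the request to every CPU so that no stale entry survives."""
    for cpu in cpus:
        cpu.post(request)
        cpu.process()   # in a real kernel each CPU does this itself
```
]]></programlisting>

        <para>Falling back to a full flush trades precision for a bounded
        queue: it is always safe, only more expensive for the affected
        CPU.</para>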
      </section>
    </section>
  </section>

  <!-- End of VM -->

  <section>
    <!-- Phys mem -->

    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <title>Zones and frames</title>

      <para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>

      <para>On some architectures, not all physical memory is available for
      conventional use. This limitation requires the kernel to maintain a
      table of available and unavailable ranges of physical memory
      addresses. The main idea of zones is to create a memory zone entity
      that represents a contiguous chunk of memory available for
      allocation. If a chunk is not available, it is simply not put in any
      zone.</para>

      <para>A zone also serves informational purposes: it records the
      number of free and busy frames. Physical memory allocation is always
      done within a particular zone. Allocation of frames from a zone is
      organized by the <link linkend="frame_allocator">frame
      allocator</link> associated with that zone.</para>

      <para>Some architectures (mips32, ppc32) have only one zone that
      covers the whole physical memory, while others (such as ia32) may
      have multiple zones. Information about the zones on the current
      machine is stored in BIOS hardware tables or can be hardcoded into
      the kernel at compile time.</para>
    </section>

    <section id="frame_allocator">
      <title>Frame allocator</title>

      <formalpara>
        <title>Overview</title>

        <para>The frame allocator provides physical memory allocation for
        the kernel. Because physical memory is organized into zones, the
        frame allocator always works in the context of one zone, so it
        cannot allocate a piece of memory that spans two zones. This is
        not a limitation in practice, because two adjacent zones can be
        merged into one. The frame allocator is also responsible for
        updating the number of free and busy frames in the zone. Physical
        memory allocation inside one <link
        linkend="zones_and_frames">memory zone</link> is handled by an
        instance of the <link linkend="buddy_allocator">buddy
        allocator</link> tailored to allocate blocks of physical memory
        frames.</para>
      </formalpara>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request, the frame allocator tries to find
        the first zone that can satisfy the request (that is, a zone with
        the required number of free frames). During deallocation, the
        frame allocator needs to find the zone that contains the
        deallocated frame. This approach raises two potential problems:
        <itemizedlist>
            <listitem>
               A linear search of zones does not help performance, but the number of zones is not expected to be high. If it does become high, the list of zones can be replaced with a more time-efficient B-tree.
            </listitem>

            <listitem>
               The allocator must quickly find out whether a zone contains the required number of frames and whether this chunk of memory is properly aligned. This issue is solved by the buddy allocator.
            </listitem>
          </itemizedlist></para>
      </formalpara>
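
      <para>The two zone searches described above can be sketched as plain
      linear scans. The Python model below is illustrative only; the
      <varname>Zone</varname> fields and the function names are assumptions
      and frames are identified by simple frame numbers.</para>

<programlisting language="python"><![CDATA[
```python
class Zone:
    def __init__(self, base, count):
        self.base = base          # first frame number of the zone
        self.count = count        # number of frames in the zone
        self.free_count = count
        self.busy_count = 0

    def contains(self, frame):
        return self.base <= frame < self.base + self.count


def find_zone_for_alloc(zones, frames_needed):
    """Linear search for the first zone with enough free frames."""
    for zone in zones:
        if zone.free_count >= frames_needed:
            return zone
    return None


def find_zone_of_frame(zones, frame):
    """Linear search for the zone that contains a frame being freed."""
    for zone in zones:
        if zone.contains(frame):
            return zone
    return None
```
]]></programlisting>

      <para>A sufficient free count alone does not guarantee that a
      contiguous, aligned block exists; in this sketch that question is
      left to the per-zone allocator.</para>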
    </section>

    <section id="buddy_allocator">
      <title>Buddy allocator</title>

      <section>
        <title>Overview</title>

        <para>In the buddy allocator, memory is broken down into
        power-of-two sized, naturally aligned blocks. These blocks are
        organized in an array of lists, in which the list with index i
        contains all unallocated blocks of size
        <mathphrase>2<superscript>i</superscript></mathphrase>. The index i
        is called the order of the block. Should there be two adjacent
        equally sized blocks in list <mathphrase>i</mathphrase> (i.e.
        buddies), the buddy allocator coalesces them and puts the resulting
        block in list <mathphrase>i + 1</mathphrase>, provided that the
        resulting block would be naturally aligned. Similarly, when the
        allocator is asked to allocate a block of size
        <mathphrase>2<superscript>i</superscript></mathphrase>, it first
        tries to satisfy the request from the list with index i. If the
        request cannot be satisfied (i.e. list i is empty), the buddy
        allocator tries to allocate and split a larger block from the list
        with index i + 1. Both of these algorithms are recursive. The
        recursion ends either when there are no blocks to coalesce in the
        former case or when there are no blocks that can be split in the
        latter case.</para>

        <!--graphic fileref="images/mm1.png" format="EPS" /-->

        <para>This approach greatly reduces external fragmentation of memory
        and helps in allocating bigger contiguous blocks of memory aligned
        to their size. On the other hand, the buddy allocator suffers from
        increased internal fragmentation of memory and is not suitable for
        general kernel allocations. That purpose is better addressed by the
        <link linkend="slab">slab allocator</link>.</para>
      </section>

      <section>
        <title>Implementation</title>

        <para>The buddy allocator is, in fact, an abstract framework which
        can be easily specialized to serve one particular task. It knows
        nothing about the nature of the memory it helps to allocate. To
        compensate for this lack of knowledge, the buddy allocator exports
        an interface that each of its clients is required to implement.
        When supplied with an implementation of this interface, the buddy
        allocator can use specialized external functions to find the buddy
        of a block, split and coalesce blocks, manipulate block order and
        mark blocks busy or available. For precise documentation of this
        interface, refer to <link linkend="???">HelenOS Generic Kernel
        Reference Manual</link>.</para>

        <formalpara>
          <title>Data organization</title>

          <para>Each entity allocable by the buddy allocator is required to
          contain space for storing the block order number and a link
          variable used to interconnect blocks within the same
          order.</para>

          <para>Whatever entities are allocated by the buddy allocator, the
          first entity within a block is used to represent the entire
          block. The first entity keeps the order of the whole block. Other
          entities within the block are assigned the magic value
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially
          important for effective identification of buddies in a
          one-dimensional array, because the entity that represents a
          potential buddy cannot be associated with
          <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is associated
          with <constant>BUDDY_INNER_BLOCK</constant>, then it is not a
          buddy).</para>

          <para>The buddy allocator always uses the first frame to represent
          a frame block. This frame contains the
          <varname>buddy_order</varname> variable, which records the size of
          the block it represents (a block of
          <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
          frames). Other frames in the block have this value set to the
          magic <constant>BUDDY_INNER_BLOCK</constant>, which is much greater
          than the buddy <varname>max_order</varname> value.</para>

          <para>Each <varname>frame_t</varname> also contains a pointer
          member that links the frame structure into the linked list of its
          order.</para>
        </formalpara>

        <formalpara>
          <title>Allocation algorithm</title>

          <para>Upon a request to allocate a block of
          <mathphrase>2<superscript>i</superscript></mathphrase> frames, the
          allocator checks whether any blocks are available in the order
          list <varname>i</varname>. If so, it removes a block from the
          order list and returns its address. If not, it recursively
          allocates a block of
          <mathphrase>2<superscript>i+1</superscript></mathphrase> frames and
          splits it into two blocks of
          <mathphrase>2<superscript>i</superscript></mathphrase> frames. It
          then adds one of the blocks to the <varname>i</varname> order list
          and returns the address of the other.</para>
        </formalpara>
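
        <para>The recursive allocation can be written down compactly. The
        following Python sketch is a model rather than the kernel code: it
        assumes that blocks are identified by the number of their first
        frame and that <varname>free_lists</varname> is a plain list of
        lists indexed by order; both names are hypothetical.</para>

<programlisting language="python"><![CDATA[
```python
def buddy_alloc(free_lists, order):
    """Allocate a block of 2**order frames; return its first frame or None.

    free_lists[i] holds the first-frame numbers of free blocks of order i.
    """
    if order >= len(free_lists):
        return None                       # nothing larger left to split
    if free_lists[order]:
        return free_lists[order].pop()
    block = buddy_alloc(free_lists, order + 1)
    if block is None:
        return None
    # Split: keep the upper half free, hand out the lower half.
    free_lists[order].append(block + 2 ** order)
    return block
```
]]></programlisting>

        <para>Starting from a single free block of 8 frames, a request for
        one frame splits the block three times and leaves one free block in
        each of the orders 0, 1 and 2.</para>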

        <formalpara>
          <title>Deallocation algorithm</title>

          <para>The allocator checks whether the freed block has a so-called
          buddy (another free block of
          <mathphrase>2<superscript>i</superscript></mathphrase> frames that
          can be joined with the freed block into a block of
          <mathphrase>2<superscript>i+1</superscript></mathphrase> frames).
          Technically, the buddy is the odd block for an even block and the
          even block for an odd block. In addition, there is an extra
          requirement that the resulting block must be aligned to its size.
          This requirement guarantees natural alignment of the blocks coming
          out of the allocation system.</para>

          <para>Using direct pointer arithmetic and the
          <varname>frame_t::ref_count</varname> and
          <varname>frame_t::buddy_order</varname> variables, the buddy is
          found in constant time.</para>
        </formalpara>
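
        <para>The constant-time buddy lookup and the subsequent coalescing
        can be illustrated as follows. This Python sketch makes several
        assumptions: frames are dictionaries in one array, only the first
        frame of a block carries a meaningful <varname>ref_count</varname>,
        the numeric value of <constant>BUDDY_INNER_BLOCK</constant> is made
        up, and the free lists are left out for brevity.</para>

<programlisting language="python"><![CDATA[
```python
BUDDY_INNER_BLOCK = 0xFF   # assumed magic value, larger than any valid order


def find_buddy(frames, index, order):
    """Return the index of a free buddy of the block at index, or None."""
    buddy = index ^ (1 << order)          # flips odd/even at this order
    if buddy >= len(frames):
        return None
    f = frames[buddy]
    if f["ref_count"] == 0 and f["buddy_order"] == order:
        return buddy
    return None


def buddy_free(frames, index, max_order):
    """Free the block starting at index and coalesce with free buddies."""
    order = frames[index]["buddy_order"]
    frames[index]["ref_count"] = 0
    while order < max_order:
        buddy = find_buddy(frames, index, order)
        if buddy is None:
            break
        upper = max(index, buddy)
        index = min(index, buddy)         # merged block keeps natural alignment
        frames[upper]["buddy_order"] = BUDDY_INNER_BLOCK
        order += 1
        frames[index]["buddy_order"] = order
    return index, order
```
]]></programlisting>

        <para>A frame marked <constant>BUDDY_INNER_BLOCK</constant> never
        matches a valid order, so the interior of a larger block is never
        mistaken for a buddy.</para>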
      </section>
    </section>

    <section id="slab">
      <title>Slab allocator</title>

      <section>
        <title>Overview</title>

        <para><termdef><glossterm>Slab</glossterm> represents a contiguous
        piece of memory, usually made of several physically contiguous
        pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists
        of one or more slabs.</termdef></para>

        <para>The majority of memory allocation requests in the kernel are
        for small, frequently used data structures. The slab allocator is
        well suited to this purpose. The basic idea behind the slab
        allocator is to keep lists of commonly used objects packed into
        pages. This avoids the overhead of allocating and destroying
        commonly used types of objects such as threads, virtual memory
        structures etc. Also, because the allocated size matches the object
        size exactly, slab allocation eliminates the internal fragmentation
        issue.</para>
      </section>

      <section>
        <title>Implementation</title>

        <para>The slab allocator is closely modelled after the <ulink
        url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
        OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams </ulink>
        with the following exceptions: <itemizedlist>
            <listitem>
               empty slabs are deallocated immediately (in Linux they are kept in a linked list, in Solaris ???)
            </listitem>

            <listitem>
               empty magazines are deallocated when not needed (in Solaris they are held in a linked list in the slab cache)
            </listitem>
          </itemizedlist> The following features are not currently supported
        but would be easy to add: <itemizedlist>
            <listitem>
               cache coloring
            </listitem>

            <listitem>
               dynamic magazine growth (different magazine sizes are already supported, but the allocation strategy would need to be adjusted)
            </listitem>
          </itemizedlist></para>

        <section>
          <title>Magazine layer</title>

          <para>The classic slab allocator design relies on a global slab
          locking mechanism, which serializes the processing of all slab
          allocation requests and becomes a serious bottleneck on SMP
          architectures. To achieve good SMP scaling, the slab allocator
          was therefore extended with per-CPU caches called magazines.
          <termdef>The slab SMP performance bottleneck was resolved by
          introducing a per-CPU caching scheme called the
          <glossterm>magazine layer</glossterm></termdef>.</para>

          <para>A magazine is an N-element cache of objects, so each
          magazine can satisfy N allocations. A magazine behaves like the
          magazine of an automatic weapon (LIFO, stack), so allocation and
          deallocation become simple pop and push pointer operations. The
          trick is that a CPU does not access global slab allocator data
          when allocating from its magazine, which makes parallel
          allocations on different CPUs possible.</para>
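
          <para>A magazine is simple enough to model in a few lines. The
          Python class below is a sketch with hypothetical names; it only
          shows the fixed capacity and the LIFO behaviour and ignores
          locking and per-CPU placement.</para>

<programlisting language="python"><![CDATA[
```python
class Magazine:
    """A fixed-capacity LIFO stack of object references."""

    def __init__(self, size):
        self.size = size
        self.objects = []

    def is_empty(self):
        return not self.objects

    def is_full(self):
        return len(self.objects) == self.size

    def push(self, obj):
        """Return an object to the magazine (deallocation)."""
        if self.is_full():
            raise OverflowError("magazine is full")
        self.objects.append(obj)

    def pop(self):
        """Take the most recently freed object (allocation)."""
        if self.is_empty():
            raise IndexError("magazine is empty")
        return self.objects.pop()
```
]]></programlisting>

          <para>Because the most recently freed object is handed out first,
          it is likely to still be warm in the CPU cache.</para>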

          <para>The implementation also requires one more feature: the
          CPU-bound magazine is actually a pair of magazines, which avoids
          thrashing when single items are repeatedly allocated and
          deallocated at the magazine size boundary. LIFO order is
          enforced, which should avoid fragmentation as much as
          possible.</para>

          <para>Another important entity of the magazine layer is the full
          magazine depot, which stores full magazines that any of the CPU
          magazine caches can use to reload its active CPU magazine. The
          magazine depot could be pre-filled with full magazines during
          initialization, but in the current implementation it is filled
          during object deallocation, when a CPU magazine becomes
          full.</para>

          <para>Slab allocator control structures are allocated from
          special slabs, marked by a special flag indicating that they
          should not be used for the slab magazine layer. This is done to
          avoid possible infinite recursion and deadlock during
          conventional slab allocation requests.</para>
        </section>

        <section>
          <title>Allocation/deallocation</title>

          <para>Every cache contains a list of full slabs and a list of
          partially full slabs. Empty slabs are freed immediately
          (thrashing is avoided thanks to the magazines).</para>

          <para>The slab allocator allocates a lot of space and does not
          free it. When the frame allocator fails to allocate a frame, it
          calls slab_reclaim(). It tries a 'light reclaim' first, then a
          brutal reclaim. The light reclaim releases slabs from the
          CPU-shared magazine list until at least one slab is deallocated
          in each cache (this algorithm should probably change). The brutal
          reclaim removes all cached objects, even from CPU-bound
          magazines.</para>

          <formalpara>
            <title>Allocation</title>

            <para><emphasis>Step 1.</emphasis> On an allocation request,
            the slab allocator first checks whether an object is available
            in the local CPU-bound magazine. If so, it simply pops the
            object from the CPU magazine and returns a pointer to
            it.</para>

            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
            empty, the allocator attempts to reload it by swapping it with
            the second CPU magazine and returns to the first step.</para>

            <para><emphasis>Step 3.</emphasis> If both CPU-bound magazines
            are empty, the allocator accesses the shared depot of full
            magazines to reload the CPU-bound magazines. If the reload
            succeeds (meaning there are full magazines in the depot), the
            algorithm continues at Step 1.</para>

            <para><emphasis>Step 4.</emphasis> This is the final step of
            the allocation: the object is allocated from the conventional
            slab layer and a pointer to it is returned.</para>
          </formalpara>
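
          <para>The allocation steps can be followed in a small model. In
          this Python sketch, which is not the kernel implementation,
          magazines are plain lists, the per-CPU pair is a dictionary with
          the hypothetical keys <literal>current</literal> and
          <literal>last</literal>, and
          <varname>slab_layer_alloc</varname> stands in for the conventional
          slab layer.</para>

<programlisting language="python"><![CDATA[
```python
def slab_alloc(cpu, depot, slab_layer_alloc):
    """Allocate one object following Steps 1 to 4.

    cpu is {"current": list, "last": list}; depot is a list of full
    magazines (each a list); slab_layer_alloc is the fallback allocator.
    """
    while True:
        if cpu["current"]:                        # Step 1: pop from magazine
            return cpu["current"].pop()
        if cpu["last"]:                           # Step 2: swap the pair
            cpu["current"], cpu["last"] = cpu["last"], cpu["current"]
            continue
        if depot:                                 # Step 3: reload from depot
            cpu["current"] = depot.pop()
            continue
        return slab_layer_alloc()                 # Step 4: slab layer
```
]]></programlisting>

          <para>Only Steps 3 and 4 touch data shared between CPUs, so they
          are the only places where a real implementation would need
          locking.</para>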

          <formalpara>
            <title>Deallocation</title>

            <para><emphasis>Step 1.</emphasis> On a deallocation request,
            the slab allocator checks whether the local CPU-bound magazine
            is not full. If it is not, the pointer is simply pushed to this
            magazine.</para>

            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
            full, the allocator attempts to reload it by swapping it with
            the second CPU magazine and returns to the first step.</para>

            <para><emphasis>Step 3.</emphasis> If both CPU-bound magazines
            are full, the allocator accesses the shared depot of full
            magazines, puts one of the magazines into the depot and creates
            a new empty magazine. The algorithm continues at Step 1.</para>
          </formalpara>
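
          <para>Deallocation mirrors the allocation path. The Python sketch
          below uses the same simplified representation (lists as magazines,
          a dictionary with the hypothetical keys
          <literal>current</literal> and <literal>last</literal> for the
          per-CPU pair); <varname>MAGAZINE_SIZE</varname> is an assumed
          capacity chosen only for the example.</para>

<programlisting language="python"><![CDATA[
```python
MAGAZINE_SIZE = 2   # assumed capacity N of one magazine


def slab_free(cpu, depot, obj):
    """Return obj to the magazine layer following Steps 1 to 3."""
    while True:
        if len(cpu["current"]) < MAGAZINE_SIZE:   # Step 1: push
            cpu["current"].append(obj)
            return
        if len(cpu["last"]) < MAGAZINE_SIZE:      # Step 2: swap the pair
            cpu["current"], cpu["last"] = cpu["last"], cpu["current"]
            continue
        depot.append(cpu["current"])              # Step 3: full one to depot
        cpu["current"] = []                       # start a new empty magazine
```
]]></programlisting>

          <para>This is also the point at which the depot gets filled: a
          magazine reaches it only after both magazines of a CPU have become
          full.</para>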
        </section>
      </section>
    </section>

    <!-- End of Physmem -->
  </section>

  <section>
    <title>Memory sharing</title>

    <para>Not implemented yet(?)</para>
  </section>
</chapter>