Subversion Repositories HelenOS-doc

Rev

Rev 35 | Rev 38 | Go to most recent revision | Details | Compare with Previous | Last modification | View Log | RSS feed

Rev Author Line No. Line
9 bondari 1
<?xml version="1.0" encoding="UTF-8"?>
11 bondari 2
<chapter id="mm">
3
  <?dbhtml filename="mm.html"?>
9 bondari 4
 
11 bondari 5
  <title>Memory management</title>
9 bondari 6
 
26 bondari 7
  <section>
11 bondari 8
    <title>Virtual memory management</title>
9 bondari 9
 
10
    <section>
35 bondari 11
      <title>Introduction</title>
12
 
13
      <para>Virtual memory is a special memory management technique, used by
14
      kernel to achieve a bunch of mission critical goals. <itemizedlist>
15
          <listitem>
16
             Isolate each task from other tasks that are running on the system at the same time.
17
          </listitem>
18
 
19
          <listitem>
20
             Allow to allocate more memory, than is actual physical memory size of the machine.
21
          </listitem>
22
 
23
          <listitem>
24
             Allowing, in general, to load and execute two programs that are linked on the same address without complicated relocations.
25
          </listitem>
26
        </itemizedlist></para>
27
    </section>
28
 
29
    <section>
30
 
31
 
32
      <title>Paging</title>
33
 
34
 
35
 
36
      <para>Virtual memory is usually using paged memory model, where virtual
37
      memory address space is divided into the <emphasis>pages</emphasis>
38
      (usually having size 4096 bytes) and physical memory is divided into the
37 bondari 39
      frames (same sized as a page, of course). Each page may be mapped to some
35 bondari 40
      frame and then, upon memory access to the virtual address, CPU performs
41
      <emphasis>address translation</emphasis> during the instruction
42
      execution. Non-existing mapping generates page fault exception, calling
43
      kernel exception handler, thus allowing kernel to manipulate rules of
44
      memory access. Information for pages mapping is stored by kernel in the
45
      <link linkend="page_tables">page tables</link></para>
46
 
47
 
48
 
49
      <para>The majority of the architectures use multi-level page tables,
50
      which means need to access physical memory several times before getting
51
      physical address. This fact would make serios performance overhead in
52
      virtual memory management. To avoid this <link linkend="tlb">Traslation
53
      Lookaside Buffer (TLB)</link> is used.</para>
54
 
55
 
56
 
57
      <para>At the moment HelenOS does not support swapping.</para>
58
 
37 bondari 59
      <para>- pouzivame vypadky stranky k alokaci ramcu on-demand v ramci as_area - na architekturach, ktere to podporuji, podporujeme non-exec stranky </para>
35 bondari 60
    </section>
61
 
62
    <section>
11 bondari 63
      <title>Address spaces</title>
9 bondari 64
 
35 bondari 65
      <section>
66
        <title>Address spaces and areas</title>
67
 
37 bondari 68
        <para>
69
 
70
    - adresovy prostor se sklada z tzv. address space areas
35 bondari 71
        usporadanych v B+stromu; tyto areas popisuji vyuzivane casti
72
        adresoveho prostoru patrici do user address space. Kazda cast je dana
37 bondari 73
        svoji bazovou adresou, velikosti a flagy (rwx/dd).
35 bondari 74
 
37 bondari 75
    </para>
76
 
35 bondari 77
        <para>- uzivatelske thready maji moznost manipulovat se svym adresovym
78
        prostorem (vytvaret/resizovat/sdilet) as_areas pomoci syscallu</para>
79
      </section>
80
 
81
      <section>
82
        <title>Address Space ID (ASID)</title>
83
 
84
        <para>- nektery hardware umoznuje rozlisit ruzne adresove prostory od
85
        sebe (cilem je maximalizovat vyuziti TLB); dela to tak, ze s kazdou
86
        polozkou TLB/strankovacich tabulek sdruzi identifikator adresoveho
87
        prostoru (ASID, RID, ppc32 ???). Tyto id mivaji ruznou sirku: 8-bitu
88
        az 24-bitu (kolik ma ppc32?)</para>
89
 
90
        <para>- kernel tomu rozumi a sam pouziva abstrakci ASIDu (na ia64 to
91
        je napr. cislo odvozene od RIDu, na mips32 to je ASID samotny);
92
        existence ASIDu je nutnou podminkou pouziti _global_ page hash table
93
        mechanismu.</para>
94
 
95
        <para>- na vsech arch. plati, ze asidu je mnohem mene, nez teoreticky
96
        pocet soucasne bezicich tasku ~ adresovych prostoru, takze je
97
        implementovan mechanismus, ktery umoznuje jednomu adresovemu prostoru
98
        ASID odebrat a pridelit ho jinemu</para>
99
 
100
        <para>- vztah task ~ adresovy prostor: teoreticky existuje moznost, ze
101
        je adresovy prostor sdilen vice tasky, avsak tuto moznost nepouzivame
102
        a neni ani nijak osetrena. Tim padem plati, ze kazdy task ma vlastni
103
        adresovy prostor</para>
104
      </section>
9 bondari 105
    </section>
106
 
107
    <section>
11 bondari 108
      <title>Virtual address translation</title>
9 bondari 109
 
35 bondari 110
      <section id="page_tables">
111
        <title>Page tables</title>
34 bondari 112
 
35 bondari 113
        <para>HelenOS kernel has two different approaches to the paging
114
        implementation: <emphasis>4 level page tables</emphasis> and
115
        <emphasis>global hash tables</emphasis>, which are accessible via
116
        generic paging abstraction layer. This division was caused by the
117
        major architectural differences between different platforms.</para>
34 bondari 118
 
35 bondari 119
        <formalpara>
120
          <title>4-level page tables</title>
34 bondari 121
 
35 bondari 122
          <para>4-level page tables are the generalization of the hardware
123
          capabilities of the certain platforms. <itemizedlist>
124
              <listitem>
125
                 ia32 uses 2-level page tables, with full hardware support.
126
              </listitem>
34 bondari 127
 
35 bondari 128
              <listitem>
129
                 amd64 uses 4-level page tables, also coming with full hardware support.
130
              </listitem>
131
 
132
              <listitem>
133
                 mips and ppc32 have 2-level tables, software simulated support.
134
              </listitem>
135
            </itemizedlist></para>
136
        </formalpara>
137
 
138
        <formalpara>
139
          <title>Global hash tables</title>
140
 
141
          <para>- global page hash table: existuje jen jedna v celem systemu
142
          (vyuziva ji ia64), pozn. ia64 ma zatim vypnuty VHPT. Pouziva se
143
          genericke hash table s oddelenymi collision chains</para>
144
        </formalpara>
145
 
146
        <para>Thanks to the abstract paging interface, there is possibility
147
        left have more paging implementations, for example B-Tree page
148
        tables.</para>
149
      </section>
150
 
151
      <section id="tlb">
152
        <title>Translation Lookaside buffer</title>
153
 
154
        <para>- TLB cachuji informace ve strankovacich tabulkach; alternativne
155
        se lze na strankovaci tabulky (ci ruzne hw rozsireni [e.g. VHPT, ppc32
156
        hw hash table]) divat jako na velke TLB</para>
157
 
158
        <para>- pri modifikaci mapovani nebo odstraneni mapovani ze
159
        strankovacich tabulek je potreba zajistit konsistenci TLB a techto
160
        tabulek; nutne delat na vsech CPU; na to mame zjednodusenou verzi TLB
161
        shootdown mechanismu; je to variace na algoritmus popsany zde: D.
162
        Black et al., "Translation Lookaside Buffer Consistency: A Software
163
        Approach," Proc. Third Int'l Conf. Architectural Support for
164
        Programming Languages and Operating Systems, 1989, pp. 113-122.</para>
165
 
166
        <para>- nutno poznamenat, ze existuji odlehcenejsi verze TLB shootdown
167
        algoritmu</para>
168
      </section>
169
    </section>
26 bondari 170
  </section>
9 bondari 171
 
26 bondari 172
  <!-- End of VM -->
24 bondari 173
 
26 bondari 174
  <section>
175
    <!-- Phys mem -->
176
 
11 bondari 177
    <title>Physical memory management</title>
9 bondari 178
 
24 bondari 179
    <section id="zones_and_frames">
180
      <title>Zones and frames</title>
181
 
34 bondari 182
      <para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>
26 bondari 183
 
184
      <para>On some architectures not whole physical memory is available for
185
      conventional usage. This limitations require from kernel to maintain a
186
      table of available and unavailable ranges of physical memory addresses.
187
      Main idea of zones is in creating memory zone entity, that is a
188
      continuous chunk of memory available for allocation. If some chunk is
189
      not available, we simply do not put it in any zone.</para>
190
 
191
      <para>Zone is also serves for informational purposes, containing
192
      information about number of free and busy frames. Physical memory
193
      allocation is also done inside the certain zone. Allocation of zone
194
      frame must be organized by the <link linkend="frame_allocator">frame
195
      allocator</link> associated with the zone.</para>
196
 
197
      <para>Some of the architectures (mips32, ppc32) have only one zone, that
198
      covers whole physical memory, and the others (like ia32) may have
199
      multiple zones. Information about zones on current machine is stored in
200
      BIOS hardware tables or can be hardcoded into kernel during compile
201
      time.</para>
24 bondari 202
    </section>
203
 
204
    <section id="frame_allocator">
205
      <title>Frame allocator</title>
206
 
26 bondari 207
      <formalpara>
208
        <title>Overview</title>
24 bondari 209
 
26 bondari 210
        <para>Frame allocator provides physical memory allocation for the
211
        kernel. Because of zonal organization of physical memory, frame
212
        allocator is always working in context of some zone, thus making
213
        impossible to allocate a piece of memory, which lays in different
214
        zone, which cannot happen, because two adjacent zones can be merged
215
        into one. Frame allocator is also being responsible to update
216
        information on the number of free/busy frames in zone. Physical memory
217
        allocation inside one <link linkend="zones_and_frames">memory
218
        zone</link> is being handled by an instance of <link
219
        linkend="buddy_allocator">buddy allocator</link> tailored to allocate
220
        blocks of physical memory frames.</para>
221
      </formalpara>
24 bondari 222
 
26 bondari 223
      <formalpara>
224
        <title>Allocation / deallocation</title>
24 bondari 225
 
26 bondari 226
        <para>Upon allocation request, frame allocator tries to find first
227
        zone, that can satisfy the incoming request (has required amount of
228
        free frames to allocate). During deallocation, frame allocator needs
229
        to find zone, that contain deallocated frame. This approach could
230
        bring up two potential problems: <itemizedlist>
231
            <listitem>
232
               Linear search of zones does not any good to performance, but number of zones is not expected to be high. And if yes, list of zones can be replaced with more time-efficient B-tree.
233
            </listitem>
24 bondari 234
 
26 bondari 235
            <listitem>
236
               Quickly find out if zone contains required number of frames to allocate and if this chunk of memory is properly aligned. This issue is perfectly solved bu the buddy allocator.
237
            </listitem>
238
          </itemizedlist></para>
239
      </formalpara>
240
    </section>
17 jermar 241
 
34 bondari 242
    <section id="buddy_allocator">
243
      <title>Buddy allocator</title>
17 jermar 244
 
34 bondari 245
      <section>
246
        <title>Overview</title>
17 jermar 247
 
34 bondari 248
        <para>In buddy allocator, memory is broken down into power-of-two
249
        sized naturally aligned blocks. These blocks are organized in an array
250
        of lists in which list with index i contains all unallocated blocks of
251
        the size <mathphrase>2<superscript>i</superscript></mathphrase>. The
252
        index i is called the order of block. Should there be two adjacent
253
        equally sized blocks in list <mathphrase>i</mathphrase> (i.e.
254
        buddies), the buddy allocator would coalesce them and put the
255
        resulting block in list <mathphrase>i + 1</mathphrase>, provided that
256
        the resulting block would be naturally aligned. Similarily, when the
257
        allocator is asked to allocate a block of size
258
        <mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
259
        to satisfy the request from list with index i. If the request cannot
260
        be satisfied (i.e. the list i is empty), the buddy allocator will try
261
        to allocate and split larger block from list with index i + 1. Both of
262
        these algorithms are recursive. The recursion ends either when there
263
        are no blocks to coalesce in the former case or when there are no
264
        blocks that can be split in the latter case.</para>
17 jermar 265
 
34 bondari 266
        <!--graphic fileref="images/mm1.png" format="EPS" /-->
17 jermar 267
 
34 bondari 268
        <para>This approach greatly reduces external fragmentation of memory
269
        and helps in allocating bigger continuous blocks of memory aligned to
270
        their size. On the other hand, the buddy allocator suffers increased
271
        internal fragmentation of memory and is not suitable for general
272
        kernel allocations. This purpose is better addressed by the <link
273
        linkend="slab">slab allocator</link>.</para>
274
      </section>
17 jermar 275
 
34 bondari 276
      <section>
277
        <title>Implementation</title>
17 jermar 278
 
34 bondari 279
        <para>The buddy allocator is, in fact, an abstract framework wich can
280
        be easily specialized to serve one particular task. It knows nothing
281
        about the nature of memory it helps to allocate. In order to beat the
282
        lack of this knowledge, the buddy allocator exports an interface that
283
        each of its clients is required to implement. When supplied an
284
        implementation of this interface, the buddy allocator can use
285
        specialized external functions to find buddy for a block, split and
286
        coalesce blocks, manipulate block order and mark blocks busy or
287
        available. For precize documentation of this interface, refer to <link
288
        linkend="???">HelenOS Generic Kernel Reference Manual</link>.</para>
17 jermar 289
 
34 bondari 290
        <formalpara>
291
          <title>Data organization</title>
17 jermar 292
 
34 bondari 293
          <para>Each entity allocable by the buddy allocator is required to
294
          contain space for storing block order number and a link variable
295
          used to interconnect blocks within the same order.</para>
15 bondari 296
 
34 bondari 297
          <para>Whatever entities are allocated by the buddy allocator, the
298
          first entity within a block is used to represent the entire block.
299
          The first entity keeps the order of the whole block. Other entities
300
          within the block are assigned the magic value
301
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially important
302
          for effective identification of buddies in one-dimensional array
303
          because the entity that represents a potential buddy cannot be
304
          associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
305
          is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
306
          not a buddy).</para>
15 bondari 307
 
34 bondari 308
          <para>Buddy allocator always uses first frame to represent frame
309
          block. This frame contains <varname>buddy_order</varname> variable
310
          to provide information about the block size it actually represents (
311
          <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
312
          frames block). Other frames in block have this value set to magic
313
          <constant>BUDDY_INNER_BLOCK</constant> that is much greater than
314
          buddy <varname>max_order</varname> value.</para>
15 bondari 315
 
34 bondari 316
          <para>Each <varname>frame_t</varname> also contains pointer member
317
          to hold frame structure in the linked list inside one order.</para>
318
        </formalpara>
15 bondari 319
 
34 bondari 320
        <formalpara>
321
          <title>Allocation algorithm</title>
15 bondari 322
 
34 bondari 323
          <para>Upon <mathphrase>2<superscript>i</superscript></mathphrase>
324
          frames block allocation request, allocator checks if there are any
325
          blocks available at the order list <varname>i</varname>. If yes,
326
          removes block from order list and returns its address. If no,
327
          recursively allocates
328
          <mathphrase>2<superscript>i+1</superscript></mathphrase> frame
329
          block, splits it into two
330
          <mathphrase>2<superscript>i</superscript></mathphrase> frame blocks.
331
          Then adds one of the blocks to the <varname>i</varname> order list
332
          and returns address of another.</para>
333
        </formalpara>
15 bondari 334
 
34 bondari 335
        <formalpara>
336
          <title>Deallocation algorithm</title>
17 jermar 337
 
34 bondari 338
          <para>Check if block has so called buddy (another free
339
          <mathphrase>2<superscript>i</superscript></mathphrase> frame block
340
          that can be linked with freed block into the
341
          <mathphrase>2<superscript>i+1</superscript></mathphrase> block).
342
          Technically, buddy is a odd/even block for even/odd block
343
          respectively. Plus we can put an extra requirement, that resulting
344
          block must be aligned to its size. This requirement guarantees
345
          natural block alignment for the blocks coming out the allocation
346
          system.</para>
9 bondari 347
 
34 bondari 348
          <para>Using direct pointer arithmetics,
349
          <varname>frame_t::ref_count</varname> and
350
          <varname>frame_t::buddy_order</varname> variables, finding buddy is
351
          done at constant time.</para>
352
        </formalpara>
353
      </section>
26 bondari 354
    </section>
355
 
15 bondari 356
    <section id="slab">
11 bondari 357
      <title>Slab allocator</title>
9 bondari 358
 
26 bondari 359
      <section>
34 bondari 360
        <title>Overview</title>
9 bondari 361
 
34 bondari 362
        <para><termdef><glossterm>Slab</glossterm> represents a contiguous
363
        piece of memory, usually made of several physically contiguous
364
        pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists
365
        of one or more slabs.</termdef></para>
366
 
26 bondari 367
        <para>The majority of memory allocation requests in the kernel are for
368
        small, frequently used data structures. For this purpose the slab
34 bondari 369
        allocator is a perfect solution. The basic idea behind the slab
26 bondari 370
        allocator is to have lists of commonly used objects available packed
371
        into pages. This avoids the overhead of allocating and destroying
34 bondari 372
        commonly used types of objects such threads, virtual memory structures
373
        etc. Also due to the exact allocated size matching, slab allocation
374
        completely eliminates internal fragmentation issue.</para>
26 bondari 375
      </section>
24 bondari 376
 
26 bondari 377
      <section>
34 bondari 378
        <title>Implementation</title>
9 bondari 379
 
26 bondari 380
        <para>The SLAB allocator is closely modelled after <ulink
381
        url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
382
        OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams </ulink>
383
        with the following exceptions: <itemizedlist>
384
            <listitem>
385
               empty SLABS are deallocated immediately (in Linux they are kept in linked list, in Solaris ???)
386
            </listitem>
387
 
388
            <listitem>
389
               empty magazines are deallocated when not needed (in Solaris they are held in linked list in slab cache)
390
            </listitem>
391
          </itemizedlist> Following features are not currently supported but
392
        would be easy to do: <itemizedlist>
393
            <listitem>
394
               - cache coloring
395
            </listitem>
396
 
397
            <listitem>
34 bondari 398
               - dynamic magazine grow (different magazine sizes are already supported, but we would need to adjust allocation strategy)
26 bondari 399
            </listitem>
400
          </itemizedlist></para>
401
 
34 bondari 402
        <section>
403
          <title>Magazine layer</title>
26 bondari 404
 
34 bondari 405
          <para>Due to the extensive bottleneck on SMP architures, caused by
406
          global SLAB locking mechanism, making processing of all slab
407
          allocation requests serialized, a new layer was introduced to the
408
          classic slab allocator design. Slab allocator was extended to
409
          support per-CPU caches 'magazines' to achieve good SMP scaling.
410
          <termdef>Slab SMP perfromance bottleneck was resolved by introducing
411
          a per-CPU caching scheme called as <glossterm>magazine
412
          layer</glossterm></termdef>.</para>
26 bondari 413
 
34 bondari 414
          <para>Magazine is a N-element cache of objects, so each magazine can
415
          satisfy N allocations. Magazine behaves like a automatic weapon
416
          magazine (LIFO, stack), so the allocation/deallocation become simple
417
          push/pop pointer operation. Trick is that CPU does not access global
418
          slab allocator data during the allocation from its magazine, thus
419
          making possible parallel allocations between CPUs.</para>
26 bondari 420
 
34 bondari 421
          <para>Implementation also requires adding another feature as the
422
          CPU-bound magazine is actually a pair of magazines to avoid
423
          thrashing when during allocation/deallocatiion of 1 item at the
424
          magazine size boundary. LIFO order is enforced, which should avoid
425
          fragmentation as much as possible.</para>
26 bondari 426
 
34 bondari 427
          <para>Another important entity of magazine layer is a full magazine
428
          depot, that stores full magazines which are used by any of the CPU
429
          magazine caches to reload active CPU magazine. Magazine depot can be
430
          pre-filled with full magazines during initialization, but in current
431
          implementation it is filled during object deallocation, when CPU
432
          magazine becomes full.</para>
26 bondari 433
 
34 bondari 434
          <para>Slab allocator control structures are allocated from special
435
          slabs, that are marked by special flag, indicating that it should
436
          not be used for slab magazine layer. This is done to avoid possible
437
          infinite recursions and deadlock during conventional slab allocaiton
438
          requests.</para>
439
        </section>
26 bondari 440
 
34 bondari 441
        <section>
442
          <title>Allocation/deallocation</title>
26 bondari 443
 
34 bondari 444
          <para>Every cache contains list of full slabs and list of partialy
445
          full slabs. Empty slabs are immediately freed (thrashing will be
446
          avoided because of magazines).</para>
26 bondari 447
 
34 bondari 448
          <para>The SLAB allocator allocates lots of space and does not free
449
          it. When frame allocator fails to allocate the frame, it calls
450
          slab_reclaim(). It tries 'light reclaim' first, then brutal reclaim.
451
          The light reclaim releases slabs from cpu-shared magazine-list,
452
          until at least 1 slab is deallocated in each cache (this algorithm
453
          should probably change). The brutal reclaim removes all cached
454
          objects, even from CPU-bound magazines.</para>
455
 
456
          <formalpara>
457
            <title>Allocation</title>
458
 
459
            <para><emphasis>Step 1.</emphasis> When it comes to the allocation
460
            request, slab allocator first of all checks availability of memory
461
            in local CPU-bound magazine. If it is there, we would just "pop"
462
            the CPU magazine and return the pointer to object.</para>
463
 
464
            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
465
            empty, allocator will attempt to reload magazin, swapping it with
466
            second CPU magazine and returns to the first step.</para>
467
 
468
            <para><emphasis>Step 3.</emphasis> Now we are in the situation
469
            when both CPU-bound magazines are empty, which makes allocator to
470
            access shared full-magazines depot to reload CPU-bound magazines.
471
            If reload is succesful (meaning there are full magazines in depot)
472
            algoritm continues at Step 1.</para>
473
 
474
            <para><emphasis>Step 4.</emphasis> Final step of the allocation.
475
            In this step object is allocated from the conventional slab layer
476
            and pointer is returned.</para>
477
          </formalpara>
478
 
479
          <formalpara>
480
            <title>Deallocation</title>
481
 
482
            <para><emphasis>Step 1.</emphasis> During deallocation request,
483
            slab allocator will check if the local CPU-bound magazine is not
484
            full. In this case we will just push the pointer to this
485
            magazine.</para>
486
 
487
            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
488
            full, allocator will attempt to reload magazin, swapping it with
489
            second CPU magazine and returns to the first step.</para>
490
 
491
            <para><emphasis>Step 3.</emphasis> Now we are in the situation
492
            when both CPU-bound magazines are full, which makes allocator to
493
            access shared full-magazines depot to put one of the magazines to
494
            the depot and creating new empty magazine. Algoritm continues at
495
            Step 1.</para>
496
          </formalpara>
497
        </section>
26 bondari 498
      </section>
15 bondari 499
    </section>
26 bondari 500
 
501
    <!-- End of Physmem -->
502
  </section>
503
 
504
  <section>
505
    <title>Memory sharing</title>
506
 
507
    <para>Not implemented yet(?)</para>
508
  </section>
11 bondari 509
</chapter>