<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <section>
    <title>Virtual memory management</title>

    <section>
      <title>Introduction</title>

      <para>Virtual memory is a memory management technique that the kernel
      uses to achieve several critical goals: <itemizedlist>
          <listitem>
             Isolate each task from the other tasks running on the system at the same time.
          </listitem>

          <listitem>
             Allow tasks to allocate more memory than the physical memory installed in the machine.
          </listitem>

          <listitem>
             Allow, in general, two programs linked at the same address to be loaded and executed without complicated relocations.
          </listitem>
        </itemizedlist></para>
    </section>

    <section>
      <title>Paging</title>

      <para>Virtual memory usually uses a paged memory model, in which the
      virtual address space is divided into <emphasis>pages</emphasis>
      (usually 4096 bytes in size) and physical memory is divided into
      frames of the same size. Each page may be mapped to a frame. When the
      CPU accesses a virtual address, it performs <emphasis>address
      translation</emphasis> as part of instruction execution. An access to
      a page that has no mapping raises a page fault exception, which
      invokes the kernel exception handler and thus lets the kernel control
      the rules of memory access. The kernel stores the page mapping
      information in the <link linkend="page_tables">page
      tables</link>.</para>
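
      <para>The page and offset arithmetic behind address translation can be
      illustrated with a minimal sketch. It assumes 4096-byte pages and a
      flat, single-level table; the names <varname>page_to_frame</varname>
      and <function>translate</function> are hypothetical and are not taken
      from the HelenOS sources.</para>

      <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

#define PAGE_WIDTH 12                            /* log2(4096) */
#define PAGE_SIZE  ((uintptr_t) 1 << PAGE_WIDTH)
#define PAGES      8                             /* size of the toy address space */
#define NO_FRAME   UINTPTR_MAX                   /* marks a page without a mapping */

/* Hypothetical flat mapping: page number -> frame number. */
static uintptr_t page_to_frame[PAGES] = {
	NO_FRAME, 5, NO_FRAME, 2, NO_FRAME, NO_FRAME, NO_FRAME, NO_FRAME
};

/*
 * Translate a virtual address. Returns 0 and stores the physical address
 * in *pa on success, or -1 when the page has no mapping (the situation in
 * which real hardware would raise a page fault).
 */
int translate(uintptr_t va, uintptr_t *pa)
{
	uintptr_t page = va >> PAGE_WIDTH;        /* upper bits select the page */
	uintptr_t offset = va & (PAGE_SIZE - 1);  /* lower 12 bits pass through */

	if (page >= PAGES || page_to_frame[page] == NO_FRAME)
		return -1;

	*pa = (page_to_frame[page] << PAGE_WIDTH) | offset;
	return 0;
}
```
]]></programlisting>

      <para>With this table, virtual address 0x1234 lies in page 1, which is
      mapped to frame 5, so it translates to 0x5234. Address 0x0004 lies in
      the unmapped page 0 and the function reports a fault.</para>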

      <para>Most architectures use multi-level page tables, which means that
      physical memory must be accessed several times before the physical
      address is known. This would impose a serious performance overhead on
      virtual memory management. To avoid it, the <link
      linkend="tlb">Translation Lookaside Buffer (TLB)</link> is used.</para>

      <para>At the moment HelenOS does not support swapping.</para>

      <para>Page faults are used to allocate frames on demand within an
      address space area (as_area). On architectures that support it,
      HelenOS also supports non-executable pages.</para>
    </section>

    <section>
      <title>Address spaces</title>

      <section>
        <title>Address spaces and areas</title>

        <para>An address space consists of so-called address space areas
        organized in a B+tree. These areas describe the parts of the address
        space that are in use and belong to the user address space. Each area
        is defined by its base address, size and flags (rwx/dd).</para>

        <para>User threads can manipulate their address space (create, resize
        and share as_areas) by means of syscalls.</para>
      </section>

      <section>
        <title>Address Space ID (ASID)</title>

        <para>Some hardware can tell different address spaces apart, the goal
        being to maximize TLB utilization. It does so by associating an
        address space identifier (ASID, RID, ppc32 ???) with every TLB or
        page table entry. These identifiers vary in width from 8 to 24 bits
        (how many does ppc32 have?).</para>

        <para>The kernel understands this and itself uses the ASID
        abstraction (on ia64 it is, for example, a number derived from the
        RID; on mips32 it is the ASID itself). The existence of ASIDs is a
        necessary condition for using the <emphasis>global</emphasis> page
        hash table mechanism.</para>

        <para>On all architectures there are far fewer ASIDs than the
        theoretical number of simultaneously running tasks (that is, address
        spaces), so a mechanism is implemented that allows an ASID to be
        taken away from one address space and assigned to another.</para>

        <para>Relationship between tasks and address spaces: in theory, an
        address space could be shared by several tasks, but this possibility
        is not used and is not handled in any way. Consequently, each task
        has its own address space.</para>
      </section>
    </section>

    <section>
      <title>Virtual address translation</title>

      <section id="page_tables">
        <title>Page tables</title>

        <para>The HelenOS kernel has two different paging implementations:
        <emphasis>4-level page tables</emphasis> and <emphasis>global hash
        tables</emphasis>. Both are accessible through a generic paging
        abstraction layer. The division was motivated by major architectural
        differences between the supported platforms.</para>

        <formalpara>
          <title>4-level page tables</title>

          <para>4-level page tables generalize the hardware capabilities of
          certain platforms: <itemizedlist>
              <listitem>
                 ia32 uses 2-level page tables with full hardware support.
              </listitem>

              <listitem>
                 amd64 uses 4-level page tables, also with full hardware support.
              </listitem>

              <listitem>
                 mips and ppc32 use 2-level page tables with software-simulated support.
              </listitem>
            </itemizedlist></para>
        </formalpara>
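
        <para>One way to picture how a single 4-level interface can cover
        both 2-level and 4-level hardware is to describe each layout by the
        index width of every level and to let unused levels have width zero.
        The sketch below is illustrative only: the type
        <varname>pt_layout_t</varname>, the function
        <function>pt_index</function> and the choice of which generic levels
        collapse on ia32 are assumptions of this example. The bit widths
        themselves (9 bits per level on amd64, 10 bits per level on ia32,
        12-bit page offset) follow the hardware formats.</para>

        <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout description: index width (in bits) of each of the
 * four generic levels, level 0 being the root. A width of 0 means the
 * level is folded away on that architecture.
 */
typedef struct {
	unsigned width[4];
	unsigned offset_width;   /* width of the in-page offset */
} pt_layout_t;

/* amd64: four 9-bit indices and a 12-bit offset. */
const pt_layout_t layout_amd64 = { { 9, 9, 9, 9 }, 12 };

/* ia32: two real levels with 10-bit indices; the middle levels collapse. */
const pt_layout_t layout_ia32 = { { 10, 0, 0, 10 }, 12 };

/* Return the index into the table of the given level for address va. */
unsigned pt_index(const pt_layout_t *l, uint64_t va, unsigned level)
{
	unsigned shift = l->offset_width;

	assert(level < 4);

	/* Skip the bits consumed by all deeper levels. */
	for (unsigned i = 3; i > level; i--)
		shift += l->width[i];

	/* A zero-width level yields a zero mask and therefore index 0. */
	return (unsigned) ((va >> shift) &
	    ((UINT64_C(1) << l->width[level]) - 1));
}
```
]]></programlisting>

        <para>Generic code can then always walk four levels. On a 2-level
        architecture the two collapsed levels contribute index 0 and no
        additional memory access.</para>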

        <formalpara>
          <title>Global hash tables</title>

          <para>There is only one global page hash table in the whole system.
          It is used by ia64 (note that ia64 currently has the VHPT
          disabled). The implementation uses the generic hash table with
          separate collision chains.</para>
        </formalpara>
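
        <para>Because a single table serves every address space, a lookup
        key has to combine the page address with the ASID. The following
        sketch shows that idea with a small chained hash table. All names
        (<varname>pht_entry_t</varname>, <function>pht_insert</function>,
        <function>pht_find</function>), the bucket count and the hash
        function are hypothetical. The kernel's generic hash table is not
        reproduced here.</para>

        <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BUCKETS 64

/* One mapping; entries that hash to the same bucket form a chain. */
typedef struct pht_entry {
	unsigned asid;            /* address space the mapping belongs to */
	uintptr_t page;           /* virtual page address */
	uintptr_t frame;          /* physical frame address */
	struct pht_entry *next;   /* next entry in the collision chain */
} pht_entry_t;

static pht_entry_t *buckets[BUCKETS];

/* Hypothetical hash: mix the page number (4096-byte pages) with the ASID. */
static size_t pht_hash(unsigned asid, uintptr_t page)
{
	return ((page >> 12) ^ asid) % BUCKETS;
}

/* Insert a mapping at the head of its chain; returns 0 or -1 on failure. */
int pht_insert(unsigned asid, uintptr_t page, uintptr_t frame)
{
	pht_entry_t *e = malloc(sizeof(*e));
	if (e == NULL)
		return -1;

	size_t h = pht_hash(asid, page);
	e->asid = asid;
	e->page = page;
	e->frame = frame;
	e->next = buckets[h];
	buckets[h] = e;
	return 0;
}

/* Find the mapping of a page in the given address space, or NULL. */
pht_entry_t *pht_find(unsigned asid, uintptr_t page)
{
	pht_entry_t *e;

	for (e = buckets[pht_hash(asid, page)]; e != NULL; e = e->next) {
		if (e->asid == asid && e->page == page)
			return e;
	}
	return NULL;
}
```
]]></programlisting>

        <para>The same virtual page can be present once per address space,
        which is why both the ASID and the page must match during the chain
        walk.</para>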

        <para>Thanks to the abstract paging interface, it remains possible to
        add further paging implementations, for example B-tree page
        tables.</para>
      </section>

      <section id="tlb">
        <title>Translation Lookaside Buffer</title>

        <para>The TLB caches information stored in the page tables.
        Alternatively, the page tables (or various hardware extensions such
        as the VHPT or the ppc32 hardware hash table) can be viewed as large
        TLBs.</para>

        <para>Whenever a mapping is modified in or removed from the page
        tables, the TLB must be kept consistent with these tables, and this
        has to be done on all CPUs. For this purpose HelenOS has a simplified
        version of the TLB shootdown mechanism. It is a variation on the
        algorithm described in D. Black et al., "Translation Lookaside Buffer
        Consistency: A Software Approach," Proc. Third Int'l Conf.
        Architectural Support for Programming Languages and Operating
        Systems, 1989, pp. 113-122.</para>

        <para>Note that more lightweight versions of the TLB shootdown
        algorithm exist.</para>
      </section>
    </section>
  </section>

  <!-- End of VM -->

  <section>
    <!-- Phys mem -->

    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <title>Zones and frames</title>

      <para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>

      <para>On some architectures, not all of physical memory is available
      for conventional use. This limitation requires the kernel to maintain a
      table of available and unavailable ranges of physical memory addresses.
      The main idea behind zones is to create an entity, the memory zone,
      that represents a contiguous chunk of memory available for allocation.
      A chunk that is not available is simply not put into any zone.</para>

      <para>A zone also serves informational purposes: it keeps track of the
      number of free and busy frames. Physical memory allocation always takes
      place within a particular zone. Frames of a zone are allocated by the
      <link linkend="frame_allocator">frame allocator</link> associated with
      that zone.</para>

      <para>Some architectures (mips32, ppc32) have only one zone that covers
      all of physical memory, while others (such as ia32) may have multiple
      zones. Information about the zones on the current machine is stored in
      BIOS hardware tables or can be hardcoded into the kernel at compile
      time.</para>
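
      <para>The rule that unavailable chunks never become part of a zone can
      be sketched as follows. The example assumes a firmware-style memory map
      expressed in frames; the types <varname>mem_range_t</varname> and
      <varname>zone_t</varname>, their fields and the function
      <function>zones_init</function> are hypothetical and only illustrate
      the idea.</para>

      <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ZONES 8

/* Hypothetical memory map entry, measured in frames. */
typedef struct {
	size_t base;       /* number of the first frame in the range */
	size_t count;      /* number of frames in the range */
	bool available;    /* false for reserved or missing memory */
} mem_range_t;

/* Hypothetical zone: a contiguous run of allocatable frames. */
typedef struct {
	size_t base;
	size_t count;
	size_t free_count;   /* bookkeeping kept for informational purposes */
	size_t busy_count;
} zone_t;

zone_t zones[MAX_ZONES];
size_t zone_count;

/*
 * Create one zone per available range. Unavailable ranges are skipped,
 * so their frames can never be handed out. Returns the number of zones.
 */
size_t zones_init(const mem_range_t *map, size_t n)
{
	zone_count = 0;

	for (size_t i = 0; i < n && zone_count < MAX_ZONES; i++) {
		if (!map[i].available || map[i].count == 0)
			continue;

		zones[zone_count].base = map[i].base;
		zones[zone_count].count = map[i].count;
		zones[zone_count].free_count = map[i].count;
		zones[zone_count].busy_count = 0;
		zone_count++;
	}
	return zone_count;
}
```
]]></programlisting>

      <para>A map with an available range, a reserved hole and another
      available range therefore produces two zones, and the hole is not
      represented anywhere.</para>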
    </section>

    <section id="frame_allocator">
      <title>Frame allocator</title>

      <formalpara>
        <title>Overview</title>

        <para>The frame allocator provides physical memory allocation for the
        kernel. Because physical memory is organized into zones, the frame
        allocator always works in the context of one zone. It is therefore
        impossible to allocate a block of memory that spans two zones. In
        practice this is not a limitation, because two adjacent zones can be
        merged into one. The frame allocator is also responsible for updating
        the number of free and busy frames in the zone. Physical memory
        allocation inside one <link linkend="zones_and_frames">memory
        zone</link> is handled by an instance of the <link
        linkend="buddy_allocator">buddy allocator</link> tailored to allocate
        blocks of physical memory frames.</para>
      </formalpara>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request, the frame allocator looks for the
        first zone that can satisfy the request (that is, a zone with the
        required number of free frames). During deallocation, the frame
        allocator needs to find the zone that contains the deallocated frame.
        This approach raises two potential problems: <itemizedlist>
            <listitem>
               A linear search of zones does not help performance, but the number of zones is not expected to be high. If it turns out to be, the list of zones can be replaced with a more time-efficient B-tree.
            </listitem>

            <listitem>
               The allocator must quickly find out whether a zone contains the required number of frames and whether that chunk of memory is properly aligned. This problem is solved by the buddy allocator.
            </listitem>
          </itemizedlist></para>
      </formalpara>
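
      <para>The two linear searches described above can be sketched as
      follows. This is an illustration under assumptions, not the kernel
      code: the structure <varname>zone_t</varname> and the functions
      <function>zone_find_for_alloc</function> and
      <function>zone_find_by_frame</function> are hypothetical names. Note
      that having enough free frames does not by itself guarantee a
      contiguous, aligned block; that check is left to the buddy
      allocator.</para>

      <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zone descriptor, measured in frames. */
typedef struct {
	size_t base;         /* number of the first frame */
	size_t count;        /* total frames in the zone */
	size_t free_count;   /* frames currently free */
} zone_t;

/* Allocation: return the first zone with enough free frames, or NULL. */
zone_t *zone_find_for_alloc(zone_t *zones, size_t n, size_t frames)
{
	for (size_t i = 0; i < n; i++) {
		if (zones[i].free_count >= frames)
			return &zones[i];
	}
	return NULL;
}

/* Deallocation: return the zone that contains the given frame, or NULL. */
zone_t *zone_find_by_frame(zone_t *zones, size_t n, size_t frame)
{
	for (size_t i = 0; i < n; i++) {
		if (frame >= zones[i].base &&
		    frame - zones[i].base < zones[i].count)
			return &zones[i];
	}
	return NULL;
}
```
]]></programlisting>

      <para>Both functions visit the zones in order, so their cost grows
      linearly with the number of zones, which is the first of the two
      concerns listed above.</para>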
    </section>

    <section id="buddy_allocator">
      <title>Buddy allocator</title>

      <section>
        <title>Overview</title>

        <para>In the buddy allocator, memory is broken down into power-of-two
        sized, naturally aligned blocks. These blocks are organized in an
        array of lists in which the list with index i contains all
        unallocated blocks of size
        <mathphrase>2<superscript>i</superscript></mathphrase>. The index i
        is called the order of the block. Should there be two adjacent
        equally sized blocks in list <mathphrase>i</mathphrase> (i.e.
        buddies), the buddy allocator coalesces them and puts the resulting
        block in list <mathphrase>i + 1</mathphrase>, provided that the
        resulting block would be naturally aligned. Similarly, when the
        allocator is asked to allocate a block of size
        <mathphrase>2<superscript>i</superscript></mathphrase>, it first
        tries to satisfy the request from the list with index i. If the
        request cannot be satisfied (i.e. list i is empty), the buddy
        allocator tries to allocate and split a larger block from the list
        with index i + 1. Both of these algorithms are recursive. The
        recursion ends either when there are no blocks to coalesce in the
        former case or when there are no blocks that can be split in the
        latter case.</para>

        <!--graphic fileref="images/mm1.png" format="EPS" /-->

        <para>This approach greatly reduces external fragmentation of memory
        and helps in allocating bigger contiguous blocks of memory aligned to
        their size. On the other hand, the buddy allocator suffers from
        increased internal fragmentation of memory and is not suitable for
        general kernel allocations. That purpose is better served by the
        <link linkend="slab">slab allocator</link>.</para>
      </section>

      <section>
        <title>Implementation</title>

        <para>The buddy allocator is, in fact, an abstract framework which
        can be easily specialized to serve one particular task. It knows
        nothing about the nature of the memory it helps to allocate. To make
        up for this lack of knowledge, the buddy allocator exports an
        interface that each of its clients is required to implement. When
        supplied with an implementation of this interface, the buddy
        allocator can use specialized external functions to find the buddy of
        a block, split and coalesce blocks, manipulate block order and mark
        blocks busy or available. For precise documentation of this
        interface, refer to <link linkend="???">HelenOS Generic Kernel
        Reference Manual</link>.</para>

        <formalpara>
          <title>Data organization</title>

          <para>Each entity allocable by the buddy allocator is required to
          contain space for storing the block order number and a link
          variable used to interconnect blocks within the same order.</para>

          <para>Whatever entities are allocated by the buddy allocator, the
          first entity within a block is used to represent the entire block.
          The first entity keeps the order of the whole block. Other entities
          within the block are assigned the magic value
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially important
          for efficient identification of buddies in a one-dimensional array
          because the entity that represents a potential buddy cannot be
          associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
          is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
          not a buddy).</para>

          <para>The buddy allocator always uses the first frame to represent
          a block of frames. This frame contains the
          <varname>buddy_order</varname> variable, which gives the size of
          the block it represents (a block of
          <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
          frames). The other frames in the block have this value set to the
          magic <constant>BUDDY_INNER_BLOCK</constant>, which is much greater
          than the buddy <varname>max_order</varname> value.</para>

          <para>Each <varname>frame_t</varname> also contains a pointer member
          that links the frame structure into the list of its order.</para>
        </formalpara>

        <formalpara>
          <title>Allocation algorithm</title>

          <para>Upon a request to allocate a block of
          <mathphrase>2<superscript>i</superscript></mathphrase> frames, the
          allocator checks whether there are any blocks available in the
          order list <varname>i</varname>. If so, it removes a block from the
          order list and returns its address. If not, it recursively
          allocates a block of
          <mathphrase>2<superscript>i+1</superscript></mathphrase> frames and
          splits it into two blocks of
          <mathphrase>2<superscript>i</superscript></mathphrase> frames. It
          then adds one of the blocks to the order list
          <varname>i</varname> and returns the address of the other.</para>
        </formalpara>
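
        <para>The recursive allocation with splitting can be sketched on a
        toy zone of 16 frames. The names <varname>buddy_order</varname> and
        <constant>BUDDY_INNER_BLOCK</constant> come from the text above;
        everything else (the array <varname>frames</varname>, the index-based
        <varname>next</varname> link, <varname>order_list</varname>,
        <function>buddy_init</function> and <function>buddy_alloc</function>)
        is an assumption made for this example, which works with frame
        indices rather than with pointers and external callbacks.</para>

        <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER          4                 /* toy zone of 2^4 = 16 frames */
#define FRAMES             (1u << MAX_ORDER)
#define BUDDY_INNER_BLOCK  0xff              /* much greater than MAX_ORDER */
#define NIL                (-1)              /* end of list / no block */

typedef struct {
	unsigned char buddy_order;   /* block order, valid in the first frame */
	int next;                    /* next free block of the same order */
} frame_t;

frame_t frames[FRAMES];
int order_list[MAX_ORDER + 1];       /* heads of the per-order free lists */

/* Start with the whole zone as a single free block of maximal order. */
void buddy_init(void)
{
	for (unsigned i = 0; i < FRAMES; i++) {
		frames[i].buddy_order = BUDDY_INNER_BLOCK;
		frames[i].next = NIL;
	}
	for (unsigned i = 0; i <= MAX_ORDER; i++)
		order_list[i] = NIL;

	frames[0].buddy_order = MAX_ORDER;
	order_list[MAX_ORDER] = 0;
}

/* Allocate 2^i frames; returns the index of the first frame or NIL. */
int buddy_alloc(unsigned i)
{
	if (i > MAX_ORDER)
		return NIL;

	/* A block of the right order is available: unlink and return it. */
	if (order_list[i] != NIL) {
		int block = order_list[i];
		order_list[i] = frames[block].next;
		frames[block].next = NIL;
		return block;
	}

	/* Otherwise recursively obtain a block twice as large and split it. */
	int block = buddy_alloc(i + 1);
	if (block == NIL)
		return NIL;

	int upper = block + (1 << i);
	frames[block].buddy_order = (unsigned char) i;
	frames[upper].buddy_order = (unsigned char) i;

	/* The upper half goes to list i, the lower half is returned. */
	frames[upper].next = order_list[i];
	order_list[i] = upper;
	return block;
}
```
]]></programlisting>

        <para>After <function>buddy_init()</function>, a single request for
        one frame splits the 16-frame block four times: frame 0 is returned
        and free blocks starting at frames 1, 2, 4 and 8 end up in the lists
        of orders 0, 1, 2 and 3.</para>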

        <formalpara>
          <title>Deallocation algorithm</title>

          <para>The allocator checks whether the freed block has a so-called
          buddy (another free block of
          <mathphrase>2<superscript>i</superscript></mathphrase> frames that
          can be joined with the freed block into a block of
          <mathphrase>2<superscript>i+1</superscript></mathphrase> frames).
          Technically, the buddy of an even block is an odd block and the
          buddy of an odd block is an even block. An extra requirement is
          that the resulting block must be aligned to its size. This
          requirement guarantees natural alignment of the blocks coming out
          of the allocation system.</para>

          <para>Using direct pointer arithmetic together with the
          <varname>frame_t::ref_count</varname> and
          <varname>frame_t::buddy_order</varname> variables, the buddy is
          found in constant time.</para>
        </formalpara>
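
        <para>A common way to obtain the buddy in constant time is to flip
        one bit of the block's frame index, which also guarantees that the
        merged block is naturally aligned. The sketch below assumes this
        technique and uses frame indices instead of pointers. The field names
        <varname>ref_count</varname> and <varname>buddy_order</varname>
        follow the text, while <function>buddy_index</function>,
        <function>buddy_is_free</function> and
        <function>buddy_free</function> are hypothetical. Free lists are
        omitted for brevity.</para>

        <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ORDER          3                 /* toy zone of 2^3 = 8 frames */
#define FRAMES             ((size_t) 1 << MAX_ORDER)
#define BUDDY_INNER_BLOCK  0xff

typedef struct {
	unsigned ref_count;          /* 0 means the frame is free */
	unsigned char buddy_order;   /* order, or BUDDY_INNER_BLOCK */
} frame_t;

frame_t frames[FRAMES];

/* The only possible buddy: flip the bit that selects the even/odd block. */
size_t buddy_index(size_t index, unsigned order)
{
	return index ^ ((size_t) 1 << order);
}

/* The buddy is usable only if it is free and heads a block of equal order. */
bool buddy_is_free(size_t index, unsigned order)
{
	size_t b = buddy_index(index, order);

	return b < FRAMES && frames[b].ref_count == 0 &&
	    (unsigned) frames[b].buddy_order == order;
}

/*
 * Free the block starting at index and coalesce it with its buddy for as
 * long as possible. Returns the index of the resulting free block.
 */
size_t buddy_free(size_t index)
{
	unsigned order = frames[index].buddy_order;

	assert(order <= MAX_ORDER);
	frames[index].ref_count = 0;

	while (order < MAX_ORDER && buddy_is_free(index, order)) {
		size_t b = buddy_index(index, order);
		size_t first = index < b ? index : b;
		size_t second = index < b ? b : index;

		/* The lower block represents the merged block. */
		frames[second].buddy_order = BUDDY_INNER_BLOCK;
		frames[first].buddy_order = (unsigned char) ++order;
		index = first;
	}
	return index;
}
```
]]></programlisting>

        <para>Because an inner frame carries
        <constant>BUDDY_INNER_BLOCK</constant>, which never equals a valid
        order, a frame in the middle of a larger block is never mistaken for
        a buddy.</para>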
      </section>
    </section>

    <section id="slab">
      <title>Slab allocator</title>

      <section>
        <title>Overview</title>

        <para><termdef><glossterm>Slab</glossterm> represents a contiguous
        piece of memory, usually made of several physically contiguous
        pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists
        of one or more slabs.</termdef></para>

        <para>The majority of memory allocation requests in the kernel are for
        small, frequently used data structures. The slab allocator is well
        suited to this purpose. The basic idea behind the slab allocator is
        to keep lists of commonly used objects packed into pages. This avoids
        the overhead of allocating and destroying commonly used types of
        objects such as threads, virtual memory structures etc. Also, because
        the allocated size matches the object size exactly, slab allocation
        eliminates the internal fragmentation issue.</para>
      </section>

      <section>
        <title>Implementation</title>

        <para>The slab allocator is closely modelled after the <ulink
        url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
        OpenSolaris slab allocator by Jeff Bonwick and Jonathan Adams </ulink>
        with the following exceptions: <itemizedlist>
            <listitem>
               empty slabs are deallocated immediately (in Linux they are kept in a linked list, in Solaris ???)
            </listitem>

            <listitem>
               empty magazines are deallocated when not needed (in Solaris they are held in a linked list in the slab cache)
            </listitem>
          </itemizedlist> The following features are not currently supported
        but would be easy to add: <itemizedlist>
            <listitem>
               cache coloring
            </listitem>

            <listitem>
               dynamic magazine growth (different magazine sizes are already supported, but the allocation strategy would need to be adjusted)
            </listitem>
          </itemizedlist></para>

        <section>
          <title>Magazine layer</title>

          <para>In the classic slab allocator design, a global slab locking
          mechanism serializes the processing of all slab allocation
          requests, which creates a severe bottleneck on SMP architectures.
          To achieve good SMP scaling, the slab allocator was extended with a
          new layer of per-CPU caches called magazines. <termdef>The slab SMP
          performance bottleneck was resolved by introducing a per-CPU
          caching scheme called the <glossterm>magazine
          layer</glossterm></termdef>.</para>

          <para>A magazine is an N-element cache of objects, so each magazine
          can satisfy N allocations. A magazine behaves like the magazine of
          an automatic weapon (LIFO, stack), so allocation and deallocation
          become simple pop and push pointer operations. The trick is that a
          CPU does not access global slab allocator data while allocating
          from its magazine, which makes parallel allocations on different
          CPUs possible.</para>
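
          <para>The stack-like behaviour of a magazine can be sketched in a
          few lines. The structure <varname>magazine_t</varname>, its fields,
          the capacity <constant>MAG_SIZE</constant> and the functions
          <function>mag_push</function> and <function>mag_pop</function> are
          hypothetical names chosen for this example.</para>

          <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAG_SIZE 4                /* N, an arbitrary capacity for the sketch */

typedef struct {
	size_t busy;              /* number of objects currently cached */
	void *objs[MAG_SIZE];     /* cached object pointers, used as a stack */
} magazine_t;

/* Deallocation path: cache the object; returns false if the magazine is full. */
bool mag_push(magazine_t *mag, void *obj)
{
	if (mag->busy == MAG_SIZE)
		return false;

	mag->objs[mag->busy++] = obj;
	return true;
}

/* Allocation path: return the most recently cached object, or NULL if empty. */
void *mag_pop(magazine_t *mag)
{
	if (mag->busy == 0)
		return NULL;

	return mag->objs[--mag->busy];
}
```
]]></programlisting>

          <para>Neither operation touches shared state, so a CPU that owns
          the magazine needs no global lock on this path.</para>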

          <para>The implementation requires one more feature: the CPU-bound
          magazine is actually a pair of magazines, which avoids thrashing
          when single items are repeatedly allocated and deallocated at the
          magazine size boundary. LIFO order is enforced, which should avoid
          fragmentation as much as possible.</para>

          <para>Another important entity of the magazine layer is the depot
          of full magazines. It stores full magazines that any of the CPU
          magazine caches can use to reload its active CPU magazine. The
          magazine depot could be pre-filled with full magazines during
          initialization, but in the current implementation it is filled
          during object deallocation, when a CPU magazine becomes full.</para>

          <para>Slab allocator control structures are allocated from special
          slabs that are marked with a special flag indicating that they
          should not be used by the slab magazine layer. This avoids possible
          infinite recursion and deadlock during conventional slab allocation
          requests.</para>
        </section>

        <section>
          <title>Allocation/deallocation</title>

          <para>Every cache contains a list of full slabs and a list of
          partially full slabs. Empty slabs are freed immediately (thrashing
          is avoided thanks to the magazines).</para>

          <para>The slab allocator allocates a lot of space and does not free
          it. When the frame allocator fails to allocate a frame, it calls
          slab_reclaim(). It tries a 'light reclaim' first and then a 'brutal
          reclaim'. The light reclaim releases slabs from the CPU-shared
          magazine list until at least one slab is deallocated in each cache
          (this algorithm should probably change). The brutal reclaim removes
          all cached objects, even from the CPU-bound magazines.</para>

          <formalpara>
            <title>Allocation</title>

            <para><emphasis>Step 1.</emphasis> Upon an allocation request,
            the slab allocator first checks whether an object is available in
            the local CPU-bound magazine. If so, it simply pops the object
            from the CPU magazine and returns a pointer to it.</para>

            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
            empty, the allocator attempts to reload it by swapping it with
            the second CPU magazine and returns to Step 1.</para>

            <para><emphasis>Step 3.</emphasis> If both CPU-bound magazines
            are empty, the allocator accesses the shared depot of full
            magazines to reload the CPU-bound magazines. If the reload
            succeeds (meaning that there are full magazines in the depot),
            the algorithm continues at Step 1.</para>

            <para><emphasis>Step 4.</emphasis> In the final step of the
            allocation, the object is allocated from the conventional slab
            layer and a pointer to it is returned.</para>
          </formalpara>

          <formalpara>
            <title>Deallocation</title>

            <para><emphasis>Step 1.</emphasis> Upon a deallocation request,
            the slab allocator checks whether the local CPU-bound magazine is
            full. If it is not, the pointer is simply pushed to this
            magazine.</para>

            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
            full, the allocator attempts to reload it by swapping it with the
            second CPU magazine and returns to Step 1.</para>

            <para><emphasis>Step 3.</emphasis> If both CPU-bound magazines
            are full, the allocator accesses the shared depot of full
            magazines, puts one of the magazines into the depot and creates a
            new empty magazine. The algorithm then continues at Step
            1.</para>
          </formalpara>
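
          <para>The allocation and deallocation steps can be sketched for a
          single CPU as follows. This is a simplified model under several
          assumptions: the names <varname>cpu_cache_t</varname>,
          <function>cache_alloc</function>, <function>cache_free</function>
          and the fixed-size <varname>depot</varname> array are hypothetical,
          the conventional slab layer is replaced by
          <function>malloc()</function> and <function>free()</function>,
          magazines are expected to be heap-allocated, and all locking is
          left out.</para>

          <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAG_SIZE   2
#define DEPOT_SIZE 8

typedef struct {
	size_t busy;
	void *objs[MAG_SIZE];
} magazine_t;

/* Hypothetical per-CPU state: the pair of CPU-bound magazines. */
typedef struct {
	magazine_t *current;
	magazine_t *last;
} cpu_cache_t;

/* Shared depot of full magazines (a real kernel would protect it by a lock). */
magazine_t *depot[DEPOT_SIZE];
size_t depot_count;

/* Stand-ins for the conventional slab layer. */
void *slab_layer_alloc(void) { return malloc(16); }
void slab_layer_free(void *obj) { free(obj); }

static void swap_magazines(cpu_cache_t *cc)
{
	magazine_t *tmp = cc->current;
	cc->current = cc->last;
	cc->last = tmp;
}

void *cache_alloc(cpu_cache_t *cc)
{
	/* Step 1: pop from the current magazine. */
	if (cc->current->busy > 0)
		return cc->current->objs[--cc->current->busy];

	/* Step 2: current is empty, try the second magazine. */
	if (cc->last->busy > 0) {
		swap_magazines(cc);
		return cc->current->objs[--cc->current->busy];
	}

	/* Step 3: both empty, trade the empty magazine for a full one. */
	if (depot_count > 0) {
		free(cc->current);
		cc->current = depot[--depot_count];
		return cc->current->objs[--cc->current->busy];
	}

	/* Step 4: fall back to the conventional slab layer. */
	return slab_layer_alloc();
}

void cache_free(cpu_cache_t *cc, void *obj)
{
	if (cc->current->busy == MAG_SIZE) {
		if (cc->last->busy < MAG_SIZE) {
			/* Step 2: current is full, the second magazine has room. */
			swap_magazines(cc);
		} else {
			/* Step 3: both full, move one to the depot, start an empty one. */
			magazine_t *empty = calloc(1, sizeof(*empty));
			if (empty == NULL || depot_count == DEPOT_SIZE) {
				free(empty);
				slab_layer_free(obj);   /* bypass the magazine layer */
				return;
			}
			depot[depot_count++] = cc->last;
			cc->last = cc->current;
			cc->current = empty;
		}
	}

	/* Step 1: push to the current magazine. */
	cc->current->objs[cc->current->busy++] = obj;
}
```
]]></programlisting>

          <para>With two-element magazines, freeing five objects fills both
          CPU-bound magazines and pushes one full magazine to the depot.
          Subsequent allocations then return the objects in reverse order of
          their deallocation without reaching the slab layer.</para>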
        </section>
      </section>
    </section>

    <!-- End of Physmem -->
  </section>

  <section>
    <title>Memory sharing</title>

    <para>Not implemented yet(?)</para>
  </section>
</chapter>