Subversion Repositories HelenOS-doc

Rev

Rev 27 | Rev 35 | Go to most recent revision | Details | Compare with Previous | Last modification | View Log | RSS feed

Rev Author Line No. Line
9 bondari 1
<?xml version="1.0" encoding="UTF-8"?>
11 bondari 2
<chapter id="mm">
3
  <?dbhtml filename="mm.html"?>
9 bondari 4
 
11 bondari 5
  <title>Memory management</title>
9 bondari 6
 
26 bondari 7
  <section>
11 bondari 8
    <title>Virtual memory management</title>
9 bondari 9
 
10
    <section>
11 bondari 11
      <title>Address spaces</title>
9 bondari 12
 
34 bondari 13
      <para />
9 bondari 14
    </section>
15
 
16
    <section>
11 bondari 17
      <title>Virtual address translation</title>
9 bondari 18
 
34 bondari 19
      <para />
9 bondari 20
    </section>
34 bondari 21
 
22
    <para>Page tables. 4 level hierarchical and hash directly supported. B+
23
    Tree can be implemented.</para>
24
 
25
    <para>For paging there is an abstract layer</para>
26
 
27
    <para>TLB shootdown implementation (update TLB upon mapping
28
    update/remove). TLB shootdown ASID/ASID:PAGE/ALL. TLB shootdown requests
29
    can come in asynchroniously so there is a cache of TLB shootdown requests.
30
    Upon cache overflow TLB shootdown ALL is executed</para>
31
 
32
    <para>Address spaces. Address space area (B+ tree). Only for uspace. Set
33
    of syscalls (shrink/extend etc). Special address space area type - device
34
    - prohibits shrink/extend syscalls to call on it. Address space has link
35
    to mapping tables (hierarchical - per Address space, hash - global
36
    tables).</para>
26 bondari 37
  </section>
9 bondari 38
 
26 bondari 39
  <!-- End of VM -->
24 bondari 40
 
26 bondari 41
  <section>
42
    <!-- Phys mem -->
43
 
11 bondari 44
    <title>Physical memory management</title>
9 bondari 45
 
24 bondari 46
    <section id="zones_and_frames">
47
      <title>Zones and frames</title>
48
 
34 bondari 49
      <para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>
26 bondari 50
 
51
      <para>On some architectures not whole physical memory is available for
52
      conventional usage. This limitations require from kernel to maintain a
53
      table of available and unavailable ranges of physical memory addresses.
54
      Main idea of zones is in creating memory zone entity, that is a
55
      continuous chunk of memory available for allocation. If some chunk is
56
      not available, we simply do not put it in any zone.</para>
57
 
58
      <para>Zone is also serves for informational purposes, containing
59
      information about number of free and busy frames. Physical memory
60
      allocation is also done inside the certain zone. Allocation of zone
61
      frame must be organized by the <link linkend="frame_allocator">frame
62
      allocator</link> associated with the zone.</para>
63
 
64
      <para>Some of the architectures (mips32, ppc32) have only one zone, that
65
      covers whole physical memory, and the others (like ia32) may have
66
      multiple zones. Information about zones on current machine is stored in
67
      BIOS hardware tables or can be hardcoded into kernel during compile
68
      time.</para>
24 bondari 69
    </section>
70
 
71
    <section id="frame_allocator">
72
      <title>Frame allocator</title>
73
 
26 bondari 74
      <formalpara>
75
        <title>Overview</title>
24 bondari 76
 
26 bondari 77
        <para>Frame allocator provides physical memory allocation for the
78
        kernel. Because of zonal organization of physical memory, frame
79
        allocator is always working in context of some zone, thus making
80
        impossible to allocate a piece of memory, which lays in different
81
        zone, which cannot happen, because two adjacent zones can be merged
82
        into one. Frame allocator is also being responsible to update
83
        information on the number of free/busy frames in zone. Physical memory
84
        allocation inside one <link linkend="zones_and_frames">memory
85
        zone</link> is being handled by an instance of <link
86
        linkend="buddy_allocator">buddy allocator</link> tailored to allocate
87
        blocks of physical memory frames.</para>
88
      </formalpara>
24 bondari 89
 
26 bondari 90
      <formalpara>
91
        <title>Allocation / deallocation</title>
24 bondari 92
 
26 bondari 93
        <para>Upon allocation request, frame allocator tries to find first
94
        zone, that can satisfy the incoming request (has required amount of
95
        free frames to allocate). During deallocation, frame allocator needs
96
        to find zone, that contain deallocated frame. This approach could
97
        bring up two potential problems: <itemizedlist>
98
            <listitem>
99
               Linear search of zones does not any good to performance, but number of zones is not expected to be high. And if yes, list of zones can be replaced with more time-efficient B-tree.
100
            </listitem>
24 bondari 101
 
26 bondari 102
            <listitem>
103
               Quickly find out if zone contains required number of frames to allocate and if this chunk of memory is properly aligned. This issue is perfectly solved bu the buddy allocator.
104
            </listitem>
105
          </itemizedlist></para>
106
      </formalpara>
107
    </section>
17 jermar 108
 
34 bondari 109
    <section id="buddy_allocator">
110
      <title>Buddy allocator</title>
17 jermar 111
 
34 bondari 112
      <section>
113
        <title>Overview</title>
17 jermar 114
 
34 bondari 115
        <para>In buddy allocator, memory is broken down into power-of-two
116
        sized naturally aligned blocks. These blocks are organized in an array
117
        of lists in which list with index i contains all unallocated blocks of
118
        the size <mathphrase>2<superscript>i</superscript></mathphrase>. The
119
        index i is called the order of block. Should there be two adjacent
120
        equally sized blocks in list <mathphrase>i</mathphrase> (i.e.
121
        buddies), the buddy allocator would coalesce them and put the
122
        resulting block in list <mathphrase>i + 1</mathphrase>, provided that
123
        the resulting block would be naturally aligned. Similarily, when the
124
        allocator is asked to allocate a block of size
125
        <mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
126
        to satisfy the request from list with index i. If the request cannot
127
        be satisfied (i.e. the list i is empty), the buddy allocator will try
128
        to allocate and split larger block from list with index i + 1. Both of
129
        these algorithms are recursive. The recursion ends either when there
130
        are no blocks to coalesce in the former case or when there are no
131
        blocks that can be split in the latter case.</para>
17 jermar 132
 
34 bondari 133
        <!--graphic fileref="images/mm1.png" format="EPS" /-->
17 jermar 134
 
34 bondari 135
        <para>This approach greatly reduces external fragmentation of memory
136
        and helps in allocating bigger continuous blocks of memory aligned to
137
        their size. On the other hand, the buddy allocator suffers increased
138
        internal fragmentation of memory and is not suitable for general
139
        kernel allocations. This purpose is better addressed by the <link
140
        linkend="slab">slab allocator</link>.</para>
141
      </section>
17 jermar 142
 
34 bondari 143
      <section>
144
        <title>Implementation</title>
17 jermar 145
 
34 bondari 146
        <para>The buddy allocator is, in fact, an abstract framework wich can
147
        be easily specialized to serve one particular task. It knows nothing
148
        about the nature of memory it helps to allocate. In order to beat the
149
        lack of this knowledge, the buddy allocator exports an interface that
150
        each of its clients is required to implement. When supplied an
151
        implementation of this interface, the buddy allocator can use
152
        specialized external functions to find buddy for a block, split and
153
        coalesce blocks, manipulate block order and mark blocks busy or
154
        available. For precize documentation of this interface, refer to <link
155
        linkend="???">HelenOS Generic Kernel Reference Manual</link>.</para>
17 jermar 156
 
34 bondari 157
        <formalpara>
158
          <title>Data organization</title>
17 jermar 159
 
34 bondari 160
          <para>Each entity allocable by the buddy allocator is required to
161
          contain space for storing block order number and a link variable
162
          used to interconnect blocks within the same order.</para>
15 bondari 163
 
34 bondari 164
          <para>Whatever entities are allocated by the buddy allocator, the
165
          first entity within a block is used to represent the entire block.
166
          The first entity keeps the order of the whole block. Other entities
167
          within the block are assigned the magic value
168
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially important
169
          for effective identification of buddies in one-dimensional array
170
          because the entity that represents a potential buddy cannot be
171
          associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
172
          is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
173
          not a buddy).</para>
15 bondari 174
 
34 bondari 175
          <para>Buddy allocator always uses first frame to represent frame
176
          block. This frame contains <varname>buddy_order</varname> variable
177
          to provide information about the block size it actually represents (
178
          <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
179
          frames block). Other frames in block have this value set to magic
180
          <constant>BUDDY_INNER_BLOCK</constant> that is much greater than
181
          buddy <varname>max_order</varname> value.</para>
15 bondari 182
 
34 bondari 183
          <para>Each <varname>frame_t</varname> also contains pointer member
184
          to hold frame structure in the linked list inside one order.</para>
185
        </formalpara>
15 bondari 186
 
34 bondari 187
        <formalpara>
188
          <title>Allocation algorithm</title>
15 bondari 189
 
34 bondari 190
          <para>Upon <mathphrase>2<superscript>i</superscript></mathphrase>
191
          frames block allocation request, allocator checks if there are any
192
          blocks available at the order list <varname>i</varname>. If yes,
193
          removes block from order list and returns its address. If no,
194
          recursively allocates
195
          <mathphrase>2<superscript>i+1</superscript></mathphrase> frame
196
          block, splits it into two
197
          <mathphrase>2<superscript>i</superscript></mathphrase> frame blocks.
198
          Then adds one of the blocks to the <varname>i</varname> order list
199
          and returns address of another.</para>
200
        </formalpara>
15 bondari 201
 
34 bondari 202
        <formalpara>
203
          <title>Deallocation algorithm</title>
17 jermar 204
 
34 bondari 205
          <para>Check if block has so called buddy (another free
206
          <mathphrase>2<superscript>i</superscript></mathphrase> frame block
207
          that can be linked with freed block into the
208
          <mathphrase>2<superscript>i+1</superscript></mathphrase> block).
209
          Technically, buddy is a odd/even block for even/odd block
210
          respectively. Plus we can put an extra requirement, that resulting
211
          block must be aligned to its size. This requirement guarantees
212
          natural block alignment for the blocks coming out the allocation
213
          system.</para>
9 bondari 214
 
34 bondari 215
          <para>Using direct pointer arithmetics,
216
          <varname>frame_t::ref_count</varname> and
217
          <varname>frame_t::buddy_order</varname> variables, finding buddy is
218
          done at constant time.</para>
219
        </formalpara>
220
      </section>
26 bondari 221
    </section>
222
 
15 bondari 223
    <section id="slab">
11 bondari 224
      <title>Slab allocator</title>
9 bondari 225
 
26 bondari 226
      <section>
34 bondari 227
        <title>Overview</title>
9 bondari 228
 
34 bondari 229
        <para><termdef><glossterm>Slab</glossterm> represents a contiguous
230
        piece of memory, usually made of several physically contiguous
231
        pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists
232
        of one or more slabs.</termdef></para>
233
 
26 bondari 234
        <para>The majority of memory allocation requests in the kernel are for
235
        small, frequently used data structures. For this purpose the slab
34 bondari 236
        allocator is a perfect solution. The basic idea behind the slab
26 bondari 237
        allocator is to have lists of commonly used objects available packed
238
        into pages. This avoids the overhead of allocating and destroying
34 bondari 239
        commonly used types of objects such threads, virtual memory structures
240
        etc. Also due to the exact allocated size matching, slab allocation
241
        completely eliminates internal fragmentation issue.</para>
26 bondari 242
      </section>
24 bondari 243
 
26 bondari 244
      <section>
34 bondari 245
        <title>Implementation</title>
9 bondari 246
 
26 bondari 247
        <para>The SLAB allocator is closely modelled after <ulink
248
        url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
249
        OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams </ulink>
250
        with the following exceptions: <itemizedlist>
251
            <listitem>
252
               empty SLABS are deallocated immediately (in Linux they are kept in linked list, in Solaris ???)
253
            </listitem>
254
 
255
            <listitem>
256
               empty magazines are deallocated when not needed (in Solaris they are held in linked list in slab cache)
257
            </listitem>
258
          </itemizedlist> Following features are not currently supported but
259
        would be easy to do: <itemizedlist>
260
            <listitem>
261
               - cache coloring
262
            </listitem>
263
 
264
            <listitem>
34 bondari 265
               - dynamic magazine grow (different magazine sizes are already supported, but we would need to adjust allocation strategy)
26 bondari 266
            </listitem>
267
          </itemizedlist></para>
268
 
34 bondari 269
        <section>
270
          <title>Magazine layer</title>
26 bondari 271
 
34 bondari 272
          <para>Due to the extensive bottleneck on SMP architures, caused by
273
          global SLAB locking mechanism, making processing of all slab
274
          allocation requests serialized, a new layer was introduced to the
275
          classic slab allocator design. Slab allocator was extended to
276
          support per-CPU caches 'magazines' to achieve good SMP scaling.
277
          <termdef>Slab SMP perfromance bottleneck was resolved by introducing
278
          a per-CPU caching scheme called as <glossterm>magazine
279
          layer</glossterm></termdef>.</para>
26 bondari 280
 
34 bondari 281
          <para>Magazine is a N-element cache of objects, so each magazine can
282
          satisfy N allocations. Magazine behaves like a automatic weapon
283
          magazine (LIFO, stack), so the allocation/deallocation become simple
284
          push/pop pointer operation. Trick is that CPU does not access global
285
          slab allocator data during the allocation from its magazine, thus
286
          making possible parallel allocations between CPUs.</para>
26 bondari 287
 
34 bondari 288
          <para>Implementation also requires adding another feature as the
289
          CPU-bound magazine is actually a pair of magazines to avoid
290
          thrashing when during allocation/deallocatiion of 1 item at the
291
          magazine size boundary. LIFO order is enforced, which should avoid
292
          fragmentation as much as possible.</para>
26 bondari 293
 
34 bondari 294
          <para>Another important entity of magazine layer is a full magazine
295
          depot, that stores full magazines which are used by any of the CPU
296
          magazine caches to reload active CPU magazine. Magazine depot can be
297
          pre-filled with full magazines during initialization, but in current
298
          implementation it is filled during object deallocation, when CPU
299
          magazine becomes full.</para>
26 bondari 300
 
34 bondari 301
          <para>Slab allocator control structures are allocated from special
302
          slabs, that are marked by special flag, indicating that it should
303
          not be used for slab magazine layer. This is done to avoid possible
304
          infinite recursions and deadlock during conventional slab allocaiton
305
          requests.</para>
306
        </section>
26 bondari 307
 
34 bondari 308
        <section>
309
          <title>Allocation/deallocation</title>
26 bondari 310
 
34 bondari 311
          <para>Every cache contains list of full slabs and list of partialy
312
          full slabs. Empty slabs are immediately freed (thrashing will be
313
          avoided because of magazines).</para>
26 bondari 314
 
34 bondari 315
          <para>The SLAB allocator allocates lots of space and does not free
316
          it. When frame allocator fails to allocate the frame, it calls
317
          slab_reclaim(). It tries 'light reclaim' first, then brutal reclaim.
318
          The light reclaim releases slabs from cpu-shared magazine-list,
319
          until at least 1 slab is deallocated in each cache (this algorithm
320
          should probably change). The brutal reclaim removes all cached
321
          objects, even from CPU-bound magazines.</para>
322
 
323
          <formalpara>
324
            <title>Allocation</title>
325
 
326
            <para><emphasis>Step 1.</emphasis> When it comes to the allocation
327
            request, slab allocator first of all checks availability of memory
328
            in local CPU-bound magazine. If it is there, we would just "pop"
329
            the CPU magazine and return the pointer to object.</para>
330
 
331
            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
332
            empty, allocator will attempt to reload magazin, swapping it with
333
            second CPU magazine and returns to the first step.</para>
334
 
335
            <para><emphasis>Step 3.</emphasis> Now we are in the situation
336
            when both CPU-bound magazines are empty, which makes allocator to
337
            access shared full-magazines depot to reload CPU-bound magazines.
338
            If reload is succesful (meaning there are full magazines in depot)
339
            algoritm continues at Step 1.</para>
340
 
341
            <para><emphasis>Step 4.</emphasis> Final step of the allocation.
342
            In this step object is allocated from the conventional slab layer
343
            and pointer is returned.</para>
344
          </formalpara>
345
 
346
          <formalpara>
347
            <title>Deallocation</title>
348
 
349
            <para><emphasis>Step 1.</emphasis> During deallocation request,
350
            slab allocator will check if the local CPU-bound magazine is not
351
            full. In this case we will just push the pointer to this
352
            magazine.</para>
353
 
354
            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
355
            full, allocator will attempt to reload magazin, swapping it with
356
            second CPU magazine and returns to the first step.</para>
357
 
358
            <para><emphasis>Step 3.</emphasis> Now we are in the situation
359
            when both CPU-bound magazines are full, which makes allocator to
360
            access shared full-magazines depot to put one of the magazines to
361
            the depot and creating new empty magazine. Algoritm continues at
362
            Step 1.</para>
363
          </formalpara>
364
        </section>
26 bondari 365
      </section>
15 bondari 366
    </section>
26 bondari 367
 
368
    <!-- End of Physmem -->
369
  </section>
370
 
371
  <section>
372
    <title>Memory sharing</title>
373
 
374
    <para>Not implemented yet(?)</para>
375
  </section>
11 bondari 376
</chapter>