<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <section>
    <!-- VM -->

    <title>Virtual memory management</title>

    <section>
      <title>Address spaces</title>

      <para></para>
    </section>

    <section>
      <title>Virtual address translation</title>

      <para></para>
    </section>
  </section>

  <!-- End of VM -->

  <section>
    <!-- Phys mem -->

    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <title>Zones and frames</title>

      <para>
      <!--graphic fileref="images/mm2.png" /-->
      <!--graphic fileref="images/buddy_alloc.svg" format="SVG" /-->
      </para>

      <para>On some architectures, not all physical memory is available
      for conventional use. This limitation requires the kernel to
      maintain a table of available and unavailable ranges of physical
      memory addresses. The main idea behind zones is to create a memory
      zone entity, that is, a contiguous chunk of memory available for
      allocation. If some chunk is not available, it is simply not put
      into any zone.</para>

      <para>A zone also serves informational purposes, keeping track of
      the number of free and busy frames. Physical memory allocation is
      always done within a certain zone. Allocation of a frame in a zone
      must be handled by the <link linkend="frame_allocator">frame
      allocator</link> associated with the zone.</para>

      <para>Some architectures (mips32, ppc32) have only one zone that
      covers the whole physical memory, while others (like ia32) may
      have multiple zones. Information about the zones on the current
      machine is read from BIOS hardware tables or hardcoded into the
      kernel at compile time.</para>
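<para>The per-zone bookkeeping described above can be sketched as a
small C structure. The names below are hypothetical illustrations, not
the actual HelenOS definitions:</para>

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of per-zone bookkeeping; field names are
 * hypothetical, not the actual HelenOS definitions. */
typedef struct {
    uintptr_t base;        /* physical address of the first frame */
    size_t    frame_count; /* total frames in this contiguous chunk */
    size_t    free_count;  /* frames currently available */
    size_t    busy_count;  /* frames currently allocated */
} zone_sketch_t;

/* A frame address belongs to the zone iff it falls within
 * [base, base + frame_count * frame_size). */
static int zone_contains(const zone_sketch_t *z, uintptr_t addr,
                         size_t frame_size)
{
    return addr >= z->base &&
           addr < z->base + z->frame_count * frame_size;
}
```

<para>An unavailable chunk of physical memory simply has no such
structure describing it, so no allocation can ever land there.</para>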
    </section>

    <section id="frame_allocator">
      <title>Frame allocator</title>

      <formalpara>
        <title>Overview</title>

        <para>The frame allocator provides physical memory allocation
        for the kernel. Because of the zonal organization of physical
        memory, the frame allocator always operates in the context of
        some zone; a single allocation therefore cannot span two
        different zones. When that would be needed, two adjacent zones
        can instead be merged into one. The frame allocator is also
        responsible for updating information about the number of free
        and busy frames in the zone. Physical memory allocation inside
        one <link linkend="zones_and_frames">memory zone</link> is
        handled by an instance of the <link linkend="buddy_allocator">buddy
        allocator</link> tailored to allocate blocks of physical memory
        frames.</para>
      </formalpara>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request, the frame allocator tries to
        find the first zone that can satisfy the request (i.e. has the
        required amount of free frames). During deallocation, the frame
        allocator needs to find the zone that contains the deallocated
        frame. This approach brings up two potential problems:
        <itemizedlist>
            <listitem>
               A linear search of the zones does not help performance,
               but the number of zones is not expected to be high. If it
               is, the list of zones can be replaced with a more
               time-efficient B-tree.
            </listitem>

            <listitem>
               The allocator must quickly find out whether a zone
               contains the required number of free frames and whether
               that chunk of memory is properly aligned. This issue is
               solved by the buddy allocator.
            </listitem>
          </itemizedlist></para>
      </formalpara>
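<para>The first-fit zone search described above can be sketched in a
few lines of C. This is an illustrative model only, not the HelenOS
implementation; <varname>zone_sketch_t</varname> and
<varname>find_zone</varname> are hypothetical names:</para>

```c
#include <stddef.h>

/* Hypothetical model of a zone, reduced to the one field the search
 * needs; not the actual HelenOS zone structure. */
typedef struct {
    size_t free_count; /* frames currently available in this zone */
} zone_sketch_t;

/* Linear first-fit search: return the index of the first zone with
 * enough free frames, or -1 if no zone can satisfy the request. */
static int find_zone(const zone_sketch_t *zones, size_t nzones,
                     size_t frames_wanted)
{
    for (size_t i = 0; i < nzones; i++) {
        if (zones[i].free_count >= frames_wanted)
            return (int) i;
    }
    return -1;
}
```

<para>As the text notes, the linear scan is acceptable only because the
number of zones stays small; a B-tree keyed by base address would make
both the allocation search and the deallocation lookup logarithmic.</para>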
    </section>
  </section>

  <section id="buddy_allocator">
    <title>Buddy allocator</title>

    <section>
      <title>Overview</title>

      <para>In the buddy allocator, memory is broken down into
      power-of-two sized, naturally aligned blocks. These blocks are
      organized in an array of lists, in which the list with index i
      contains all unallocated blocks of size
      <mathphrase>2<superscript>i</superscript></mathphrase>. The index
      i is called the order of the block. Should there be two adjacent
      equally sized blocks in list <mathphrase>i</mathphrase> (i.e.
      buddies), the buddy allocator would coalesce them and put the
      resulting block in list <mathphrase>i + 1</mathphrase>, provided
      that the resulting block would be naturally aligned. Similarly,
      when the allocator is asked to allocate a block of size
      <mathphrase>2<superscript>i</superscript></mathphrase>, it first
      tries to satisfy the request from the list with index i. If the
      request cannot be satisfied (i.e. the list i is empty), the buddy
      allocator will try to allocate and split a larger block from the
      list with index i + 1. Both of these algorithms are recursive. The
      recursion ends either when there are no blocks to coalesce in the
      former case or when there are no blocks that can be split in the
      latter case.</para>
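<para>For naturally aligned power-of-two blocks, the buddy of a block
and the coalesced parent block can be computed with simple bit
arithmetic on frame indices. The following C sketch is illustrative
only; the function names are hypothetical and do not appear in
HelenOS:</para>

```c
#include <stdint.h>

/* For a naturally aligned block of order `order` starting at frame
 * index `idx`, the buddy starts at the index with bit `order`
 * flipped: the two halves of the parent block differ only there. */
static uintptr_t buddy_index(uintptr_t idx, unsigned order)
{
    return idx ^ ((uintptr_t) 1 << order);
}

/* Coalescing two buddies of order `order` yields the naturally
 * aligned block of order `order + 1`; its index is the smaller of
 * the two, i.e. the index with bit `order` cleared. */
static uintptr_t parent_index(uintptr_t idx, unsigned order)
{
    return idx & ~((uintptr_t) 1 << order);
}
```

<para>Natural alignment is what makes this work: because every block of
order i starts at a multiple of 2<superscript>i</superscript>, flipping
a single bit is guaranteed to land on the neighbouring block of the
same size.</para>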

      <!--graphic fileref="images/mm1.png" format="EPS" /-->

      <para>This approach greatly reduces external fragmentation of
      memory and helps in allocating bigger contiguous blocks of memory
      aligned to their size. On the other hand, the buddy allocator
      suffers from increased internal fragmentation of memory and is not
      suitable for general kernel allocations. This purpose is better
      addressed by the <link linkend="slab">slab allocator</link>.</para>
    </section>

    <section>
      <title>Implementation</title>

      <para>The buddy allocator is, in fact, an abstract framework which
      can be easily specialized to serve one particular task. It knows
      nothing about the nature of the memory it helps to allocate. To
      overcome this lack of knowledge, the buddy allocator exports an
      interface that each of its clients is required to implement. When
      supplied with an implementation of this interface, the buddy
      allocator can use specialized external functions to find the buddy
      of a block, split and coalesce blocks, manipulate block order and
      mark blocks busy or available. For precise documentation of this
      interface, refer to the <link linkend="???">HelenOS Generic Kernel
      Reference Manual</link>.</para>

      <formalpara>
        <title>Data organization</title>

        <para>Each entity allocable by the buddy allocator is required
        to contain space for storing the block order number and a link
        variable used to interconnect blocks within the same
        order.</para>

        <para>Whatever entities are allocated by the buddy allocator,
        the first entity within a block is used to represent the entire
        block. The first entity keeps the order of the whole block.
        Other entities within the block are assigned the magic value
        <constant>BUDDY_INNER_BLOCK</constant>. This is especially
        important for effective identification of buddies in a
        one-dimensional array, because the entity that represents a
        potential buddy cannot be associated with
        <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is associated
        with <constant>BUDDY_INNER_BLOCK</constant> then it is not a
        buddy).</para>
      </formalpara>

      <formalpara>
        <title>Data organization in the frame allocator</title>

        <para>The buddy allocator always uses the first frame of a block
        to represent the whole block. This frame contains the
        <varname>buddy_order</varname> variable, which records the size
        of the block it represents (a block of
        <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
        frames). The other frames in the block have this value set to
        the magic constant <constant>BUDDY_INNER_BLOCK</constant>, which
        is much greater than the buddy <varname>max_order</varname>
        value.</para>

        <para>Each <varname>frame_t</varname> also contains a pointer
        member to hold the frame structure in the linked list of its
        order.</para>
      </formalpara>
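<para>A minimal sketch of this per-frame metadata, assuming only the
two members discussed above (the real HelenOS
<varname>frame_t</varname> has more fields, and the magic value shown
is hypothetical):</para>

```c
#include <stdint.h>

/* Hypothetical magic value: marks a frame that lies inside a larger
 * block; any value above the maximum order would do. */
#define BUDDY_INNER_BLOCK 0xFF

/* Illustrative per-frame metadata; not the actual HelenOS frame_t. */
typedef struct frame_sketch {
    uint8_t buddy_order;             /* order of the block this frame
                                      * heads, or BUDDY_INNER_BLOCK */
    struct frame_sketch *buddy_link; /* next free block of same order */
} frame_sketch_t;

/* Only the first frame of a block can be a buddy candidate. */
static int is_block_head(const frame_sketch_t *f)
{
    return f->buddy_order != BUDDY_INNER_BLOCK;
}
```

<para>This is why buddy identification in the one-dimensional frame
array is cheap: a frame marked <constant>BUDDY_INNER_BLOCK</constant>
can be rejected immediately without inspecting anything else.</para>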

      <formalpara>
        <title>Allocation algorithm</title>

        <para>Upon a request to allocate a block of
        <mathphrase>2<superscript>i</superscript></mathphrase> frames,
        the allocator checks whether any blocks are available in the
        order list <varname>i</varname>. If so, it removes a block from
        the list and returns its address. If not, it recursively
        allocates a block of
        <mathphrase>2<superscript>i+1</superscript></mathphrase> frames
        and splits it into two
        <mathphrase>2<superscript>i</superscript></mathphrase> frame
        blocks. It then adds one of the blocks to the order list
        <varname>i</varname> and returns the address of the
        other.</para>
      </formalpara>
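<para>The recursion above can be sketched in C. To keep the sketch
self-contained, the order lists are modelled here as plain counters of
free blocks per order; a real allocator keeps linked lists of block
addresses. All names are illustrative, not HelenOS code:</para>

```c
#include <stddef.h>

#define MAX_ORDER 8 /* hypothetical upper bound on block order */

/* free_blocks[i] counts free blocks of order i. Returns 0 on success,
 * -1 when no block of sufficient size exists. */
static int alloc_block(size_t free_blocks[MAX_ORDER + 1], unsigned order)
{
    if (order > MAX_ORDER)
        return -1;                 /* recursion ran out of orders */
    if (free_blocks[order] > 0) {
        free_blocks[order]--;      /* take a block from list `order` */
        return 0;
    }
    /* List `order` is empty: recursively obtain a block of order + 1
     * and split it into two halves of order `order`. */
    if (alloc_block(free_blocks, order + 1) != 0)
        return -1;                 /* nothing left to split */
    free_blocks[order]++;          /* one half goes back on the list */
    return 0;                      /* the other half is the result */
}
```

<para>The recursion terminates exactly as the text describes: either a
list is non-empty at some order, or the top order is exceeded and there
is no block left to split.</para>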

      <formalpara>
        <title>Deallocation algorithm</title>

        <para>The allocator checks whether the freed block has a
        so-called buddy (another free
        <mathphrase>2<superscript>i</superscript></mathphrase> frame
        block that can be linked with the freed block into a
        <mathphrase>2<superscript>i+1</superscript></mathphrase> frame
        block). Technically, the buddy is the odd block for an even
        block and vice versa. In addition, we impose the requirement
        that the resulting block must be aligned to its size. This
        requirement guarantees natural block alignment for the blocks
        coming out of the allocation system.</para>

        <para>Using direct pointer arithmetic and the
        <varname>frame_t::ref_count</varname> and
        <varname>frame_t::buddy_order</varname> variables, the buddy is
        found in constant time.</para>
      </formalpara>
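<para>A toy model of the deallocation path, coalescing a freed block
with its buddy as long as the buddy is also free and of the same order.
The arrays stand in for the per-frame metadata; sizes and names are
illustrative, not the HelenOS implementation:</para>

```c
#define FRAMES 16 /* hypothetical toy zone of 16 frames */

/* free_map[idx] is nonzero when a free block of order order_map[idx]
 * starts at frame index idx. Globals start zeroed (all busy). */
static unsigned char free_map[FRAMES];
static unsigned char order_map[FRAMES];

/* Free the block of the given order starting at idx, coalescing
 * upward while the buddy is free and of matching order. */
static void free_block(unsigned idx, unsigned order)
{
    for (;;) {
        unsigned buddy = idx ^ (1u << order); /* constant-time lookup */
        if (buddy >= FRAMES || !free_map[buddy] || order_map[buddy] != order)
            break;                 /* no mergeable buddy: stop */
        free_map[buddy] = 0;       /* remove buddy from its list */
        idx &= ~(1u << order);     /* merged block is naturally aligned */
        order++;                   /* ...and one order larger */
    }
    free_map[idx] = 1;
    order_map[idx] = order;
}
```

<para>Note how the alignment requirement is enforced by construction:
the merged block's index always has bit <varname>order</varname>
cleared, so every block handed back to the lists stays aligned to its
size.</para>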
    </section>

    <section id="slab">
      <title>Slab allocator</title>

      <section>
        <title>Introduction</title>

        <para>The majority of memory allocation requests in the kernel
        are for small, frequently used data structures. For this purpose
        the slab allocator is a perfect solution. The basic idea behind
        a slab allocator is to have lists of commonly used objects
        available packed into pages. This avoids the overhead of
        allocating and destroying commonly used types of objects such as
        inodes, threads, virtual memory structures etc.</para>

        <para>The original slab allocator locking mechanism has become a
        significant performance bottleneck on SMP architectures.
        <termdef>The slab SMP performance bottleneck was resolved by
        introducing a per-CPU caching scheme called the
        <glossterm>magazine layer</glossterm></termdef>.</para>
      </section>

      <section>
        <title>Implementation details (needs revision)</title>

        <para>The SLAB allocator is closely modelled after the <ulink
        url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
        OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams
        </ulink> with the following exceptions: <itemizedlist>
            <listitem>
               empty SLABs are deallocated immediately (in Linux they are kept in a linked list, in Solaris ???)
            </listitem>

            <listitem>
               empty magazines are deallocated when not needed (in Solaris they are held in a linked list in the slab cache)
            </listitem>
          </itemizedlist> The following features are not currently
        supported but would be easy to add: <itemizedlist>
            <listitem>
               cache coloring
            </listitem>

            <listitem>
               dynamic magazine growing (different magazine sizes are already supported, but we would need to adjust the allocation strategy)
            </listitem>
          </itemizedlist></para>

        <para>The SLAB allocator supports per-CPU caches ('magazines')
        to facilitate good SMP scaling.</para>

        <para>When a new object is being allocated, the allocator first
        checks whether it is available in the CPU-bound magazine. If it
        is not found there, it is allocated from a CPU-shared slab: if a
        partially full slab is found, it is used; otherwise a new one is
        allocated.</para>

        <para>When an object is being deallocated, it is put into the
        CPU-bound magazine. If there is no such magazine, a new one is
        allocated (if the allocation fails, the object is deallocated
        into the slab). If the magazine is full, it is put onto the
        CPU-shared list of magazines and a new one is allocated.</para>
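<para>The fast paths above can be sketched with a magazine modelled as
a small LIFO stack of object pointers. The names and the fixed size are
hypothetical illustrations, not the HelenOS definitions; the slab layer
is reduced to a caller-supplied fallback object:</para>

```c
#include <stddef.h>

#define MAG_SIZE 4 /* hypothetical magazine capacity */

/* Toy per-CPU magazine: a LIFO stack of object pointers. */
typedef struct {
    void  *objs[MAG_SIZE];
    size_t count;
} magazine_t;

/* Allocation path: prefer the CPU-bound magazine; on a miss, fall
 * back to the slab layer (modelled by slab_fallback here). */
static void *mag_alloc(magazine_t *mag, void *slab_fallback)
{
    if (mag->count > 0)
        return mag->objs[--mag->count]; /* hit in CPU-bound magazine */
    return slab_fallback;               /* miss: go to shared slabs */
}

/* Deallocation path: push into the magazine; report overflow so the
 * caller can swap in an empty magazine or free into the slab. */
static int mag_free(magazine_t *mag, void *obj)
{
    if (mag->count == MAG_SIZE)
        return -1;                      /* magazine full */
    mag->objs[mag->count++] = obj;
    return 0;
}
```

<para>Both fast paths touch only CPU-local state, which is the whole
point of the magazine layer: no lock shared between CPUs is taken
unless the magazine misses or overflows.</para>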

        <para>The CPU-bound magazine is actually a pair of magazines, to
        avoid thrashing when somebody is allocating and deallocating one
        item at the magazine size boundary. LIFO order is enforced,
        which should avoid fragmentation as much as possible.</para>

        <para>Every cache contains a list of full slabs and a list of
        partially full slabs. Empty slabs are freed immediately
        (thrashing is avoided because of the magazines).</para>

        <para>The SLAB information structure is kept inside the data
        area, if possible. A cache can be marked so that it does not use
        magazines. This is used only for SLAB-related caches, to avoid
        deadlocks and infinite recursion (the SLAB allocator uses itself
        for allocating all of its control structures).</para>

        <para>The SLAB allocator allocates lots of space and does not
        free it. When the frame allocator fails to allocate a frame, it
        calls slab_reclaim(). It tries 'light reclaim' first, then
        brutal reclaim. The light reclaim releases slabs from the
        CPU-shared magazine list until at least one slab is deallocated
        in each cache (this algorithm should probably change). The
        brutal reclaim removes all cached objects, even from CPU-bound
        magazines.</para>

        <para>TODO: <itemizedlist>
            <listitem>
               For better CPU scaling, the magazine allocation strategy should be extended. Currently, if a cache does not have a magazine, it asks the non-CPU-cached magazine cache to provide one. It might be feasible to add a CPU-cached magazine cache (which would allocate its magazines from the non-CPU-cached magazine cache). This would provide a nice per-CPU buffer. The other possibility is to use the per-cache 'empty-magazine-list', which decreases contention for the one per-system magazine cache.
            </listitem>

            <listitem>
               It might be good to add lock granularity down to the slab level; we could then try_spinlock over all partial slabs and thus improve scalability even at the slab level.
            </listitem>
          </itemizedlist></para>
      </section>
    </section>

    <!-- End of Physmem -->
  </section>

  <section>
    <title>Memory sharing</title>

    <para>Not implemented yet(?)</para>
  </section>
</chapter>