<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <section>
    <!-- VM -->

    <title>Virtual memory management</title>

    <section>
      <title>Address spaces</title>

      <para></para>
    </section>

    <section>
      <title>Virtual address translation</title>

      <para></para>
    </section>
  </section>

  <!-- End of VM -->

  <section>
    <!-- Phys mem -->

    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <title>Zones and frames</title>

      <!--graphic fileref="images/mm2.png" /-->
      <!--graphic fileref="images/buddy_alloc.svg" format="SVG" /-->

      <para>On some architectures, not the whole physical memory is available
      for conventional usage. This limitation requires the kernel to maintain
      a table of available and unavailable ranges of physical memory
      addresses. The main idea behind zones is to create a memory zone
      entity, which is a contiguous chunk of memory available for allocation.
      If some chunk is not available, we simply do not put it in any
      zone.</para>

      <para>A zone also serves informational purposes, keeping track of the
      number of free and busy frames. Physical memory allocation is always
      done within a certain zone, and allocation of zone frames must be
      organized by the <link linkend="frame_allocator">frame allocator</link>
      associated with the zone.</para>

      <para>Some architectures (mips32, ppc32) have only one zone that covers
      the whole physical memory, while others (like ia32) may have multiple
      zones. Information about the zones on the current machine is stored in
      BIOS hardware tables or can be hardcoded into the kernel at compile
      time.</para>
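
      <para>To illustrate the concept, a zone descriptor can be thought of as
      a small bookkeeping structure. The following sketch is only
      illustrative; the member names are hypothetical and simplified compared
      to the real HelenOS structures.</para>

      <programlisting><![CDATA[
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of a zone descriptor; names are hypothetical. */
typedef struct zone {
	uintptr_t base;        /* physical address of the first frame in the zone */
	size_t count;          /* total number of frames covered by the zone */
	size_t free_count;     /* informational: currently free frames */
	size_t busy_count;     /* informational: currently allocated frames */
	struct frame *frames;  /* per-frame descriptors used by the frame allocator */
} zone_t;
]]></programlisting>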
    </section>

    <section id="frame_allocator">
      <title>Frame allocator</title>

      <formalpara>
        <title>Overview</title>

        <para>The frame allocator provides physical memory allocation for the
        kernel. Because of the zonal organization of physical memory, the
        frame allocator always operates in the context of some zone, which
        makes it impossible to allocate a piece of memory spanning two
        different zones. This is not a real limitation, because two adjacent
        zones can be merged into one. The frame allocator is also responsible
        for updating the information on the number of free and busy frames in
        the zone. Physical memory allocation inside one <link
        linkend="zones_and_frames">memory zone</link> is handled by an
        instance of the <link linkend="buddy_allocator">buddy
        allocator</link> tailored to allocate blocks of physical memory
        frames.</para>
      </formalpara>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request, the frame allocator tries to find
        the first zone that can satisfy the request (i.e. has the required
        amount of free frames). During deallocation, the frame allocator
        needs to find the zone that contains the deallocated frame. This
        approach brings up two potential problems: <itemizedlist>
            <listitem>
               A linear search of the zones is not good for performance, but the number of zones is not expected to be high. Should it grow, the list of zones can be replaced with a more time-efficient B-tree. (A sketch of this linear search follows below.)
            </listitem>

            <listitem>
               Quickly finding out whether a zone contains the required number of free frames and whether such a chunk of memory is properly aligned. This issue is solved by the buddy allocator.
            </listitem>
          </itemizedlist></para>
      </formalpara>
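
      <para>The zone search described above could be outlined as follows.
      This is only a sketch: the helper names
      (<varname>zone_can_satisfy</varname>,
      <varname>zone_frame_alloc</varname>) and the global zone list are
      hypothetical stand-ins for the real interface.</para>

      <programlisting><![CDATA[
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct zone;  /* zone descriptor, see the sketch in the previous section */

/* Hypothetical helpers standing in for the real zone/buddy interface. */
extern struct zone *zones[];
extern size_t zone_count;
bool zone_can_satisfy(struct zone *z, size_t frames);
uintptr_t zone_frame_alloc(struct zone *z, size_t frames);

/* Linear search: take the first zone that has enough free frames. */
uintptr_t frame_alloc_sketch(size_t frames)
{
	for (size_t i = 0; i < zone_count; i++)
		if (zone_can_satisfy(zones[i], frames))
			return zone_frame_alloc(zones[i], frames);

	return 0;  /* allocation failed: no suitable zone */
}
]]></programlisting>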
    </section>

    <section id="buddy_allocator">
      <title>Buddy allocator</title>

      <section>
        <title>Overview</title>

        <para>In the buddy allocator, memory is broken down into power-of-two
        sized naturally aligned blocks. These blocks are organized in an
        array of lists in which the list with index i contains all
        unallocated blocks of size
        <mathphrase>2<superscript>i</superscript></mathphrase>. The index i
        is called the order of the block. Should there be two adjacent
        equally sized blocks in list <mathphrase>i</mathphrase> (i.e.
        buddies), the buddy allocator would coalesce them and put the
        resulting block in list <mathphrase>i + 1</mathphrase>, provided that
        the resulting block would be naturally aligned. Similarly, when the
        allocator is asked to allocate a block of size
        <mathphrase>2<superscript>i</superscript></mathphrase>, it first
        tries to satisfy the request from the list with index i. If the
        request cannot be satisfied (i.e. the list i is empty), the buddy
        allocator will try to allocate and split a larger block from the list
        with index i + 1. Both of these algorithms are recursive. The
        recursion ends either when there are no blocks to coalesce in the
        former case or when there are no blocks that can be split in the
        latter case.</para>
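
        <para>The allocation side of the algorithm can be sketched as
        follows. This is a simplified model with hypothetical list helpers,
        not the actual HelenOS code.</para>

        <programlisting><![CDATA[
#include <stddef.h>

#define MAX_ORDER 20  /* illustrative upper bound on the block order */

struct block {
	struct block *next;  /* link within the free list of one order */
	unsigned order;      /* log2 of the block size */
};

static struct block *free_list[MAX_ORDER + 1];  /* one free list per order */

/* Hypothetical helpers: pop/push a block, split a block into two halves. */
struct block *list_pop(struct block **list);
void list_push(struct block **list, struct block *b);
struct block *block_split(struct block *b);  /* returns the second half */

/* Allocate a block of order i, splitting a larger block if necessary. */
struct block *buddy_alloc(unsigned i)
{
	if (free_list[i])
		return list_pop(&free_list[i]);

	if (i == MAX_ORDER)
		return NULL;  /* nothing larger left to split */

	/* Recursively allocate a block of order i + 1 and split it. */
	struct block *big = buddy_alloc(i + 1);
	if (!big)
		return NULL;

	struct block *second = block_split(big);
	big->order = second->order = i;
	list_push(&free_list[i], second);  /* keep one half for later requests */
	return big;                        /* hand out the other half */
}
]]></programlisting>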

        <graphic fileref="images/mm1.png" format="EPS" />

        <para>This approach greatly reduces external fragmentation of memory
        and helps in allocating bigger contiguous blocks of memory aligned to
        their size. On the other hand, the buddy allocator suffers from
        increased internal fragmentation of memory and is not suitable for
        general kernel allocations. This purpose is better addressed by the
        <link linkend="slab">slab allocator</link>.</para>
      </section>

      <section>
        <title>Implementation</title>

        <para>The buddy allocator is, in fact, an abstract framework which
        can be easily specialized to serve one particular task. It knows
        nothing about the nature of the memory it helps to allocate. To make
        up for this lack of knowledge, the buddy allocator exports an
        interface that each of its clients is required to implement. When
        supplied with an implementation of this interface, the buddy
        allocator can use specialized external functions to find the buddy
        for a block, split and coalesce blocks, manipulate block order and
        mark blocks busy or available. For precise documentation of this
        interface, refer to the <link linkend="???">HelenOS Generic Kernel
        Reference Manual</link>.</para>
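
        <para>Such a client interface might be expressed as a structure of
        function pointers along the following lines. This is an illustrative
        sketch only; the names and exact signatures are assumptions, and the
        authoritative declarations are those in the reference manual.</para>

        <programlisting><![CDATA[
#include <stdint.h>

struct buddy_system;         /* opaque allocator instance */
typedef struct link link_t;  /* list link representing one block */

/*
 * Illustrative client interface of the buddy framework: instead of
 * touching the underlying memory itself, the framework calls back into
 * these client-supplied operations.
 */
typedef struct {
	/* Return the buddy of the given block, or NULL if it is not free. */
	link_t *(*find_buddy)(struct buddy_system *b, link_t *block);

	/* Split the block into two halves and return the second half. */
	link_t *(*bisect)(struct buddy_system *b, link_t *block);

	/* Coalesce two buddies into one block of the next higher order. */
	link_t *(*coalesce)(struct buddy_system *b, link_t *block, link_t *buddy);

	/* Read and write the order stored inside a block. */
	uint8_t (*get_order)(struct buddy_system *b, link_t *block);
	void (*set_order)(struct buddy_system *b, link_t *block, uint8_t order);

	/* Mark a block busy or available. */
	void (*mark_busy)(struct buddy_system *b, link_t *block);
	void (*mark_available)(struct buddy_system *b, link_t *block);
} buddy_operations_sketch_t;
]]></programlisting>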

        <formalpara>
          <title>Data organization</title>

          <para>Each entity allocable by the buddy allocator is required to
          contain space for storing its block order number and a link
          variable used to interconnect blocks within the same order.</para>

          <para>Whatever entities are allocated by the buddy allocator, the
          first entity within a block is used to represent the entire block.
          The first entity keeps the order of the whole block. Other entities
          within the block are assigned the magic value
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially
          important for effective identification of buddies in a
          one-dimensional array, because the entity that represents a
          potential buddy cannot be associated with
          <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is associated
          with <constant>BUDDY_INNER_BLOCK</constant> then it is not a
          buddy).</para>
        </formalpara>

        <formalpara>
          <title>Frame allocator data organization</title>

          <para>The buddy allocator always uses the first frame to represent
          a frame block. This frame contains the
          <varname>buddy_order</varname> variable providing information about
          the size of the block it actually represents (a block of
          <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
          frames). Other frames in the block have this value set to the magic
          value <constant>BUDDY_INNER_BLOCK</constant>, which is much greater
          than the buddy <varname>max_order</varname> value.</para>

          <para>Each <varname>frame_t</varname> also contains a pointer
          member used to keep the frame structure in the linked list of its
          order.</para>
        </formalpara>
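
        <para>A simplified frame descriptor could therefore look like the
        sketch below. The member set is deliberately reduced and the concrete
        value of <constant>BUDDY_INNER_BLOCK</constant> is an assumption; the
        real <varname>frame_t</varname> has additional members.</para>

        <programlisting><![CDATA[
#include <stdint.h>

/* Illustrative magic order value marking frames that are not block heads. */
#define BUDDY_INNER_BLOCK  0xff

/* Simplified sketch of a frame descriptor; the real frame_t differs. */
typedef struct frame {
	uint16_t ref_count;   /* reference counter; 0 means the frame is free */
	uint8_t buddy_order;  /* order of the block this frame heads, or
	                         BUDDY_INNER_BLOCK for inner frames */
	struct frame *next;   /* link in the free list of its order
	                         (the real code uses a generic list link) */
} frame_t;
]]></programlisting>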

        <formalpara>
          <title>Allocation algorithm</title>

          <para>Upon a request to allocate a block of
          <mathphrase>2<superscript>i</superscript></mathphrase> frames, the
          allocator checks whether there are any blocks available in the
          order list <varname>i</varname>. If yes, it removes a block from
          that list and returns its address. If not, it recursively allocates
          a <mathphrase>2<superscript>i+1</superscript></mathphrase> frame
          block, splits it into two
          <mathphrase>2<superscript>i</superscript></mathphrase> frame
          blocks, adds one of the blocks to the order list
          <varname>i</varname> and returns the address of the other.</para>
        </formalpara>

        <formalpara>
          <title>Deallocation algorithm</title>

          <para>Upon deallocation, the allocator checks whether the freed
          block has a buddy (i.e. another free
          <mathphrase>2<superscript>i</superscript></mathphrase> frame block
          that can be linked with the freed block into a
          <mathphrase>2<superscript>i+1</superscript></mathphrase> block).
          Technically, the buddy is the odd/even block for an even/odd block
          respectively. In addition, we can put an extra requirement that the
          resulting block must be aligned to its size. This requirement
          guarantees natural block alignment for the blocks coming out of the
          allocation system.</para>

          <para>Using direct pointer arithmetics and the
          <varname>frame_t::ref_count</varname> and
          <varname>frame_t::buddy_order</varname> variables, finding the
          buddy is done in constant time.</para>
        </formalpara>
   
-
 
214
      </section>
218
    </section>
215
 
219
 
216
 
-
 
217
    <section id="slab">
220
    <section id="slab">
218
      <title>Slab allocator</title>
221
      <title>Slab allocator</title>
219
 
222
 
-
 
223
      <section>
-
 
224
        <title>Introduction</title>
-
 
225
 
220
      <para>Kernel memory allocation is handled by slab.</para>
226
        <para>The majority of memory allocation requests in the kernel are for
-
 
227
        small, frequently used data structures. For this purpose the slab
-
 
228
        allocator is a perfect solution. The basic idea behind a slab
-
 
229
        allocator is to have lists of commonly used objects available packed
-
 
230
        into pages. This avoids the overhead of allocating and destroying
-
 
231
        commonly used types of objects such as inodes, threads, virtual memory
-
 
232
        structures etc.</para>
-
 
233
 
-
 
234
        <para>Original slab allocator locking mechanism has become a
-
 
235
        significant preformance bottleneck on SMP architectures. <termdef>Slab
-
 
236
        SMP perfromance bottleneck was resolved by introducing a per-CPU
-
 
237
        caching scheme called as <glossterm>magazine
-
 
238
        layer</glossterm></termdef>.</para>
-
 
239
      </section>
-
 
240
 
-
 
241
      <section>
-
 
242
        <title>Implementation details (needs revision)</title>
-
 
243
 
-
 
244
        <para>The SLAB allocator is closely modelled after <ulink
-
 
245
        url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
-
 
246
        OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams </ulink>
-
 
247
        with the following exceptions: <itemizedlist>
-
 
248
            <listitem>
-
 
249
               empty SLABS are deallocated immediately (in Linux they are kept in linked list, in Solaris ???)
-
 
250
            </listitem>
-
 
251
 
-
 
252
            <listitem>
-
 
253
               empty magazines are deallocated when not needed (in Solaris they are held in linked list in slab cache)
-
 
254
            </listitem>
-
 
255
          </itemizedlist> Following features are not currently supported but
221
    </section><!-- End of Physmem -->
256
        would be easy to do: <itemizedlist>
-
 
257
            <listitem>
-
 
258
               - cache coloring
-
 
259
            </listitem>
-
 
260
 
-
 
261
            <listitem>
-
 
262
               - dynamic magazine growing (different magazine sizes are already supported, but we would need to adjust allocation strategy)
-
 
263
            </listitem>
-
 
264
          </itemizedlist></para>
222
 
265
 
-
 
266
        <para>The SLAB allocator supports per-CPU caches ('magazines') to
-
 
267
        facilitate good SMP scaling.</para>
-
 
268
 
-
 
269
        <para>When a new object is being allocated, it is first checked, if it
-
 
270
        is available in CPU-bound magazine. If it is not found there, it is
-
 
271
        allocated from CPU-shared SLAB - if partial full is found, it is used,
-
 
272
        otherwise a new one is allocated.</para>
-
 
273
 
-
 
274
        <para>When an object is being deallocated, it is put to CPU-bound
-
 
275
        magazine. If there is no such magazine, new one is allocated (if it
-
 
276
        fails, the object is deallocated into SLAB). If the magazine is full,
-
 
277
        it is put into cpu-shared list of magazines and new one is
-
 
278
        allocated.</para>
-
 
279
 
-
 
280
        <para>The CPU-bound magazine is actually a pair of magazines to avoid
-
 
281
        thrashing when somebody is allocating/deallocating 1 item at the
-
 
282
        magazine size boundary. LIFO order is enforced, which should avoid
-
 
283
        fragmentation as much as possible.</para>
-
 
284
 
-
 
285
        <para>Every cache contains list of full slabs and list of partialy
-
 
286
        full slabs. Empty SLABS are immediately freed (thrashing will be
-
 
287
        avoided because of magazines).</para>
-
 
288
 
-
 
289
        <para>The SLAB information structure is kept inside the data area, if
-
 
290
        possible. The cache can be marked that it should not use magazines.
-
 
291
        This is used only for SLAB related caches to avoid deadlocks and
-
 
292
        infinite recursion (the SLAB allocator uses itself for allocating all
-
 
293
        it's control structures).</para>
-
 
294
 
-
 
295
        <para>The SLAB allocator allocates lots of space and does not free it.
-
 
296
        When frame allocator fails to allocate the frame, it calls
-
 
297
        slab_reclaim(). It tries 'light reclaim' first, then brutal reclaim.
-
 
298
        The light reclaim releases slabs from cpu-shared magazine-list, until
-
 
299
        at least 1 slab is deallocated in each cache (this algorithm should
-
 
300
        probably change). The brutal reclaim removes all cached objects, even
-
 
301
        from CPU-bound magazines.</para>
-
 
302
 
-
 
303
        <para>TODO: <itemizedlist>
-
 
304
            <listitem>
-
 
305
               For better CPU-scaling the magazine allocation strategy should be extended. Currently, if the cache does not have magazine, it asks for non-cpu cached magazine cache to provide one. It might be feasible to add cpu-cached magazine cache (which would allocate it's magazines from non-cpu-cached mag. cache). This would provide a nice per-cpu buffer. The other possibility is to use the per-cache 'empty-magazine-list', which decreases competing for 1 per-system magazine cache.
-
 
306
            </listitem>
-
 
307
 
-
 
308
            <listitem>
-
 
309
               - it might be good to add granularity of locks even to slab level, we could then try_spinlock over all partial slabs and thus improve scalability even on slab level
-
 
310
            </listitem>
-
 
311
          </itemizedlist></para>
-
 
312
      </section>
223
  </section>
313
    </section>
224
 
314
 
-
 
315
    <!-- End of Physmem -->
-
 
316
  </section>
225
 
317
 
226
    <section>
318
  <section>
227
      <title>Memory sharing</title>
319
    <title>Memory sharing</title>
228
 
320
 
229
      <para>Not implemented yet(?)</para>
321
    <para>Not implemented yet(?)</para>