<?dbhtml filename="mm.html"?>

<title>Memory management</title>

<section>
  <title>Virtual memory management</title>

  <section>
    <title>Address spaces</title>

    <para />
  </section>

  <section>
    <title>Virtual address translation</title>

    <para />
  </section>

  <para>Page tables are used for virtual address translation. Four-level
  hierarchical page tables and hash page tables are supported directly; a
  B+ tree based implementation is also possible.</para>

  <para>Paging is accessed through an abstract layer.</para>

  <para>TLB shootdown is implemented (the TLB is updated upon every mapping
  update or removal) and can target a single ASID, an ASID:PAGE pair, or
  ALL entries. Because TLB shootdown requests can come in asynchronously,
  there is a cache of TLB shootdown requests; upon cache overflow, a TLB
  shootdown of ALL entries is executed.</para>
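  <para>The shootdown request cache can be sketched as follows. This is a
  toy model in C: the type names, the queue length and the overflow
  convention are assumptions for illustration, not the actual HelenOS
  identifiers.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sketch only: names and sizes are assumptions, not the
 * actual HelenOS identifiers. */
typedef enum { TLB_INVL_ASID, TLB_INVL_PAGE, TLB_INVL_ALL } tlb_invl_type_t;

typedef struct {
    tlb_invl_type_t type;
    unsigned asid;
    unsigned long page;
} tlb_shootdown_msg_t;

#define TLB_MSG_QUEUE_LEN 4      /* toy capacity of the request cache */

typedef struct {
    tlb_shootdown_msg_t msg[TLB_MSG_QUEUE_LEN];
    size_t count;                /* count > TLB_MSG_QUEUE_LEN means "flush ALL" */
} tlb_shootdown_cache_t;

/* Record an asynchronously arriving request; on overflow, degrade to a
 * single TLB shootdown of ALL entries. */
static void tlb_shootdown_enqueue(tlb_shootdown_cache_t *c,
                                  tlb_shootdown_msg_t m)
{
    if (c->count < TLB_MSG_QUEUE_LEN)
        c->msg[c->count++] = m;
    else
        c->count = TLB_MSG_QUEUE_LEN + 1;   /* overflow: remember only "ALL" */
}

static bool tlb_shootdown_is_all(const tlb_shootdown_cache_t *c)
{
    return c->count > TLB_MSG_QUEUE_LEN;
}
```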

  <para>Address spaces consist of address space areas, kept in a B+ tree,
  and exist only for userspace. A set of syscalls (shrink, extend etc.)
  operates on them. A special address space area type, device, prohibits
  the shrink and extend syscalls from being called on it. Each address
  space has a link to its mapping tables (hierarchical tables are per
  address space, hash tables are global).</para>
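  <para>The device-area restriction can be illustrated with a minimal
  sketch; the names below are hypothetical, chosen only to show the check
  a shrink/extend syscall would have to make.</para>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: the enum and function names are assumptions chosen
 * for illustration, not the actual HelenOS identifiers. */
typedef enum { AS_AREA_ANONYMOUS, AS_AREA_DEVICE } as_area_type_t;

typedef struct {
    as_area_type_t type;
    unsigned long pages;         /* current size of the area in pages */
} as_area_t;

/* A shrink/extend syscall must refuse to resize a device area. */
static bool as_area_resize(as_area_t *area, unsigned long new_pages)
{
    if (area->type == AS_AREA_DEVICE)
        return false;            /* prohibited by the area type */
    area->pages = new_pages;
    return true;
}
```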
</section>

<!-- End of VM -->

<section>
  <title>Physical memory management</title>

  <section id="zones_and_frames">
    <title>Zones and frames</title>

    <para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>

    <para>On some architectures, not the whole of physical memory is
    available for conventional use. This limitation requires the kernel
    to maintain a table of the available and unavailable ranges of
    physical memory addresses. The main idea behind zones is to create a
    memory zone entity, that is a …
        Quickly find out whether a zone contains the required number of
        frames to allocate and whether that chunk of memory is properly
        aligned. This issue is perfectly solved by the buddy
        allocator.</listitem>
      </itemizedlist></para>
    </formalpara>
  </section>

  <section id="buddy_allocator">
    <title>Buddy allocator</title>

    <section>
      <title>Overview</title>

      <para>In the buddy allocator, memory is broken down into
      power-of-two sized, naturally aligned blocks. These blocks are
      organized in an array of lists in which the list with index i
      contains all unallocated blocks of size
      <mathphrase>2<superscript>i</superscript></mathphrase>. The index i
      is called the order of the block. Should there be two adjacent,
      equally sized blocks in list <mathphrase>i</mathphrase> (i.e.
      buddies), the buddy allocator coalesces them and puts the resulting
      block in list <mathphrase>i + 1</mathphrase>, provided that the
      resulting block would be naturally aligned. Similarly, when the
      allocator is asked to allocate a block of size
      <mathphrase>2<superscript>i</superscript></mathphrase>, it first
      tries to satisfy the request from the list with index i. If the
      request cannot be satisfied (i.e. list i is empty), the buddy
      allocator will try to allocate and split a larger block from the
      list with index i + 1. Both of these algorithms are recursive. The
      recursion ends either when there are no blocks to coalesce in the
      former case or when there are no blocks that can be split in the
      latter case.</para>

      <!--graphic fileref="images/mm1.png" format="EPS" /-->

      <para>This approach greatly reduces external fragmentation of
      memory and helps in allocating bigger continuous blocks of memory
      aligned to their size. On the other hand, the buddy allocator
      suffers from increased internal fragmentation of memory and is not
      suitable for general kernel allocations. This purpose is better
      addressed by the <link linkend="slab">slab allocator</link>.</para>
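      <para>A minimal helper makes the power-of-two sizing concrete: a
      request for n frames is served by the smallest order whose block
      covers it, and the unused tail of that block is exactly the internal
      fragmentation mentioned above. The function is illustrative, not a
      kernel routine.</para>

```c
#include <assert.h>

/* Smallest order i with 2^i >= n: a request for n frames is served by an
 * order-i block; the gap 2^i - n is internal fragmentation.
 * (Illustrative helper, not a HelenOS function.) */
static unsigned buddy_order_for(unsigned long n)
{
    unsigned i = 0;
    while ((1UL << i) < n)
        i++;
    return i;
}
```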
    </section>

    <section>
      <title>Implementation</title>

      <para>The buddy allocator is, in fact, an abstract framework which
      can easily be specialized to serve one particular task. It knows
      nothing about the nature of the memory it helps to allocate. To
      overcome this lack of knowledge, the buddy allocator exports an
      interface that each of its clients is required to implement. When
      supplied with an implementation of this interface, the buddy
      allocator can use specialized external functions to find the buddy
      of a block, split and coalesce blocks, manipulate block order and
      mark blocks busy or available. For precise documentation of this
      interface, refer to the <link linkend="???">HelenOS Generic Kernel
      Reference Manual</link>.</para>

      <formalpara>
        <title>Data organization</title>

        <para>Each entity allocable by the buddy allocator is required to
        contain space for storing its block order number and a link
        variable used to interconnect blocks within the same
        order.</para>

        <para>Whatever entities are allocated by the buddy allocator, the
        first entity within a block is used to represent the entire
        block. The first entity keeps the order of the whole block. Other
        entities within the block are assigned the magic value
        <constant>BUDDY_INNER_BLOCK</constant>. This is especially
        important for effective identification of buddies in a
        one-dimensional array, because an entity that represents a
        potential buddy cannot be associated with
        <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is associated
        with <constant>BUDDY_INNER_BLOCK</constant>, then it is not a
        buddy).</para>

        <para>The buddy allocator always uses the first frame to
        represent a frame block. This frame contains the
        <varname>buddy_order</varname> variable, which provides
        information about the block size it actually represents (a block
        of
        <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
        frames). Other frames in the block have this value set to the
        magic <constant>BUDDY_INNER_BLOCK</constant>, which is much
        greater than the buddy <varname>max_order</varname> value.</para>

        <para>Each <varname>frame_t</varname> also contains a pointer
        member to hold the frame structure in the linked list inside one
        order.</para>
      </formalpara>
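      <para>The data organization above can be sketched as a cut-down
      structure. The real <varname>frame_t</varname> carries more members,
      and the magic value here is only an assumed placeholder standing in
      for a value above any real <varname>max_order</varname>.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BUDDY_INNER_BLOCK 0xff   /* assumed magic, far above any real max_order */

/* Cut-down frame_t: only the members discussed in the text; the real
 * HelenOS structure carries more fields. */
typedef struct frame {
    uint8_t ref_count;
    uint8_t buddy_order;         /* order of the block this frame heads, or magic */
    struct frame *buddy_link;    /* linkage within one order's free list */
} frame_t;

/* Only the first frame of a block can represent a buddy; inner frames
 * carry the magic value and are therefore never buddies. */
static bool frame_is_block_head(const frame_t *f)
{
    return f->buddy_order != BUDDY_INNER_BLOCK;
}
```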

      <formalpara>
        <title>Allocation algorithm</title>

        <para>Upon a
        <mathphrase>2<superscript>i</superscript></mathphrase> frame
        block allocation request, the allocator checks whether any blocks
        are available on the order list <varname>i</varname>. If so, it
        removes a block from that order list and returns its address. If
        not, it recursively allocates a
        <mathphrase>2<superscript>i+1</superscript></mathphrase> frame
        block and splits it into two
        <mathphrase>2<superscript>i</superscript></mathphrase> frame
        blocks. It then adds one of the blocks to the
        <varname>i</varname> order list and returns the address of the
        other.</para>
      </formalpara>
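      <para>The recursion can be modelled with free-list lengths alone;
      this toy sketch shows the split step and is not kernel code (the
      order limit is an arbitrary illustrative value):</para>

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER 5

/* Toy model: free_count[i] stands in for the length of the order-i free
 * list; tracking counts only is enough to show the split recursion. */
static int free_count[MAX_ORDER + 1];

static bool buddy_alloc(int i)
{
    if (i > MAX_ORDER)
        return false;            /* nothing larger left to split */
    if (free_count[i] > 0) {
        free_count[i]--;         /* satisfied directly from order list i */
        return true;
    }
    if (!buddy_alloc(i + 1))     /* recursively obtain an order i+1 block */
        return false;
    free_count[i]++;             /* split: one half goes back on list i ... */
    return true;                 /* ... the other half is handed out */
}
```

      <para>Starting from a single block of the highest order, the first
      order-0 request cascades down, leaving one free block on every list
      below the top.</para>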

      <formalpara>
        <title>Deallocation algorithm</title>

        <para>Check whether the block has a so-called buddy (another free
        <mathphrase>2<superscript>i</superscript></mathphrase> frame
        block that can be linked with the freed block into a
        <mathphrase>2<superscript>i+1</superscript></mathphrase> block).
        Technically, the buddy is the odd/even block for an even/odd
        block respectively. In addition, we can impose the extra
        requirement that the resulting block must be aligned to its size.
        This requirement guarantees natural block alignment for the
        blocks coming out of the allocation system.</para>

        <para>Using direct pointer arithmetic and the
        <varname>frame_t::ref_count</varname> and
        <varname>frame_t::buddy_order</varname> variables, finding a
        buddy is done in constant time.</para>
      </formalpara>
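      <para>The odd/even pairing and the alignment requirement can be
      expressed with constant-time index arithmetic. In this illustrative
      sketch, frame indices stand in for the pointer arithmetic described
      above:</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The buddy of the order-i block starting at frame index idx is the other
 * order-i half of the same naturally aligned order-(i+1) block: flip bit i
 * of the index (the odd/even pairing from the text). Illustrative only. */
static size_t buddy_of(size_t idx, unsigned i)
{
    return idx ^ ((size_t) 1 << i);
}

/* The coalesced block starts at the lower of the two indices and must be
 * aligned to its new size 2^(i+1). */
static bool merge_is_aligned(size_t idx, unsigned i)
{
    size_t start = idx & ~((size_t) 1 << i);
    return start % ((size_t) 1 << (i + 1)) == 0;
}
```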
    </section>
  </section>

  <section id="slab">
    <title>Slab allocator</title>

    <section>
      <title>Overview</title>

      <para><termdef>A <glossterm>slab</glossterm> represents a
      contiguous piece of memory, usually made up of several physically
      contiguous pages.</termdef> <termdef>A <glossterm>slab
      cache</glossterm> consists of one or more slabs.</termdef></para>

      <para>The majority of memory allocation requests in the kernel are
      for small, frequently used data structures. For this purpose the
      slab allocator is a perfect solution. The basic idea behind the
      slab allocator is to have lists of commonly used objects available
      packed into pages. This avoids the overhead of allocating and
      destroying commonly used types of objects such as threads, virtual
      memory structures etc. Also, because the allocated size matches the
      object size exactly, slab allocation completely eliminates the
      internal fragmentation issue.</para>
    </section>

    <section>
      <title>Implementation</title>

      <para>The SLAB allocator is closely modelled after the <ulink
      url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">OpenSolaris
      SLAB allocator by Jeff Bonwick and Jonathan Adams</ulink> with the
      following exceptions: <itemizedlist>
          <!-- … -->

          <listitem>
            cache coloring
          </listitem>

          <listitem>
            dynamic magazine growing (different magazine sizes are
            already supported, but the allocation strategy would need to
            be adjusted)
          </listitem>
        </itemizedlist></para>

      <section>
        <title>Magazine layer</title>

        <para>Due to the extensive bottleneck on SMP architectures caused
        by the global SLAB locking mechanism, which serialized the
        processing of all slab allocation requests, a new layer was
        introduced into the classic slab allocator design. The slab
        allocator was extended to support per-CPU caches ('magazines') in
        order to achieve good SMP scaling. <termdef>The slab SMP
        performance bottleneck was resolved by introducing a per-CPU
        caching scheme called the <glossterm>magazine
        layer</glossterm></termdef>.</para>

        <para>A magazine is an N-element cache of objects, so each
        magazine can satisfy N allocations. A magazine behaves like an
        automatic weapon magazine (LIFO, stack), so allocation and
        deallocation become simple push/pop pointer operations. The trick
        is that the CPU does not access global slab allocator data when
        allocating from its magazine, thus making parallel allocations
        between CPUs possible.</para>

        <para>The implementation also requires another feature: the
        CPU-bound magazine is actually a pair of magazines, to avoid
        thrashing during the allocation/deallocation of one item at the
        magazine size boundary. LIFO order is enforced, which should
        avoid fragmentation as much as possible.</para>

        <para>Another important entity of the magazine layer is the full
        magazine depot, which stores full magazines used by any of the
        CPU magazine caches to reload the active CPU magazine. The
        magazine depot can be pre-filled with full magazines during
        initialization, but in the current implementation it is filled
        during object deallocation, when a CPU magazine becomes
        full.</para>

        <para>Slab allocator control structures are allocated from
        special slabs that are marked by a special flag indicating that
        they should not use the slab magazine layer. This is done to
        avoid possible infinite recursion and deadlock during
        conventional slab allocation requests.</para>
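        <para>The push/pop behaviour of a single magazine can be sketched
        as a small LIFO stack. The names and the capacity are
        illustrative, not the HelenOS API:</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAG_SIZE 8               /* the N from the text; illustrative value */

/* A magazine as a LIFO stack of object pointers. */
typedef struct {
    void *objs[MAG_SIZE];
    size_t busy;                 /* number of cached objects */
} magazine_t;

static bool magazine_push(magazine_t *m, void *obj)
{
    if (m->busy == MAG_SIZE)
        return false;            /* full: caller swaps magazines or uses depot */
    m->objs[m->busy++] = obj;
    return true;
}

static void *magazine_pop(magazine_t *m)
{
    if (m->busy == 0)
        return NULL;             /* empty: caller swaps magazines or reloads */
    return m->objs[--m->busy];
}
```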
      </section>

      <section>
        <title>Allocation/deallocation</title>

        <para>Every cache contains a list of full slabs and a list of
        partially full slabs. Empty slabs are immediately freed
        (thrashing is avoided because of the magazines).</para>

        <para>The SLAB allocator allocates a lot of space and does not
        free it. When the frame allocator fails to allocate a frame, it
        calls slab_reclaim(). It tries a 'light reclaim' first, then a
        brutal reclaim. The light reclaim releases slabs from the
        cpu-shared magazine list until at least one slab is deallocated
        in each cache (this algorithm should probably change). The brutal
        reclaim removes all cached objects, even from CPU-bound
        magazines.</para>
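        <para>The difference between the two reclaim passes can be
        modelled over per-cache slab counts. An illustrative sketch, not
        the real slab_reclaim():</para>

```c
#include <assert.h>
#include <stddef.h>

#define NCACHES 3                /* toy number of caches */

/* Toy model: per cache, how many reclaimable slabs sit on its magazine
 * list. Illustrative only. */
static size_t cached_slabs[NCACHES];

/* Light reclaim: free at most one slab per cache. */
static size_t slab_reclaim_light(void)
{
    size_t freed = 0;
    for (size_t i = 0; i < NCACHES; i++) {
        if (cached_slabs[i] > 0) {
            cached_slabs[i]--;
            freed++;
        }
    }
    return freed;
}

/* Brutal reclaim: drop every cached slab. */
static size_t slab_reclaim_brutal(void)
{
    size_t freed = 0;
    for (size_t i = 0; i < NCACHES; i++) {
        freed += cached_slabs[i];
        cached_slabs[i] = 0;
    }
    return freed;
}
```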

        <formalpara>
          <title>Allocation</title>

          <para><emphasis>Step 1.</emphasis> When an allocation request
          arrives, the slab allocator first of all checks the
          availability of memory in the local CPU-bound magazine. If the
          memory is there, it simply "pops" the CPU magazine and returns
          the pointer to the object.</para>

          <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
          empty, the allocator attempts to reload the magazine by
          swapping it with the second CPU magazine and returns to the
          first step.</para>

          <para><emphasis>Step 3.</emphasis> Now both CPU-bound magazines
          are empty, which makes the allocator access the shared
          full-magazines depot to reload the CPU-bound magazines. If the
          reload is successful (meaning there are full magazines in the
          depot), the algorithm continues at Step 1.</para>

          <para><emphasis>Step 4.</emphasis> The final step of the
          allocation. In this step the object is allocated from the
          conventional slab layer and a pointer is returned.</para>
        </formalpara>

        <formalpara>
          <title>Deallocation</title>

          <para><emphasis>Step 1.</emphasis> During a deallocation
          request, the slab allocator checks whether the local CPU-bound
          magazine is full. If it is not, it simply pushes the pointer
          into this magazine.</para>

          <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
          full, the allocator attempts to reload the magazine by swapping
          it with the second CPU magazine and returns to the first
          step.</para>

          <para><emphasis>Step 3.</emphasis> Now both CPU-bound magazines
          are full, which makes the allocator access the shared
          full-magazines depot to put one of the magazines into the depot
          and create a new empty magazine. The algorithm continues at
          Step 1.</para>
        </formalpara>
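        <para>The deallocation steps can be modelled by tracking only the
        fill counts of the CPU-bound magazine pair and the number of full
        magazines in the depot. A toy sketch under assumed names:</para>

```c
#include <assert.h>
#include <stddef.h>

#define MAG_CAP 4                /* toy magazine capacity */

/* Toy model: only fill counts, no real objects. Names are illustrative,
 * not the HelenOS API. */
typedef struct {
    size_t current;              /* objects in the active magazine */
    size_t last;                 /* objects in the spare magazine */
    size_t depot_full_mags;      /* full magazines put into the shared depot */
} cpu_cache_t;

static void cache_free_one(cpu_cache_t *c)
{
    if (c->current < MAG_CAP) {          /* Step 1: push into active magazine */
        c->current++;
    } else if (c->last < MAG_CAP) {      /* Step 2: swap with the spare magazine */
        size_t tmp = c->current;
        c->current = c->last;
        c->last = tmp;
        c->current++;
    } else {                             /* Step 3: both full, flush to depot */
        c->depot_full_mags++;
        c->current = 1;                  /* new empty magazine takes the object */
    }
}
```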
      </section>
    </section>
  </section>

  <!-- End of Physmem -->
</section>