<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <para>In previous chapters, this book described the scheduling subsystem as
  the creator of the impression that threads execute in parallel. The memory
  management subsystem, on the other hand, creates the impression that there
  is enough physical memory for the kernel and that userspace tasks have the
  entire address space only for themselves.</para>

  <section>
    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <title>Zones and frames</title>

      <para>HelenOS represents continuous areas of physical memory in
      structures called frame zones (abbreviated as zones). Each zone contains
      information about the number of allocated and unallocated physical
      memory frames as well as the physical base address of the zone and the
      number of frames contained in it. A zone also contains an array of frame
      structures describing each frame of the zone and, last but not least,
      each zone is equipped with a buddy system that facilitates effective
      allocation of power-of-two sized blocks of frames.</para>

      <para>This organization of physical memory provides good preconditions
      for hot-plugging of more zones. There is also one currently unused zone
      attribute: <code>flags</code>. The attribute could be used to give a
      special meaning to some zones in the future.</para>

      <para>The zones are linked in a doubly-linked list. This might seem a
      bit inefficient because the zone list is walked every time a frame is
      allocated or deallocated. However, this does not represent a significant
      performance problem as it is expected that the number of zones will be
      rather low. Moreover, most architectures merge all zones into
      one.</para>

      <para>Every physical memory frame in a zone is described by a structure
      that contains the number of references and other data used by the buddy
      system.</para>
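
      <para>A simplified sketch of how such structures might look follows; the
      field names below are illustrative rather than verbatim copies of the
      HelenOS sources.</para>

      <programlisting><![CDATA[
/* Illustrative sketch only -- not the exact HelenOS definitions. */
typedef struct {
    int refcount;          /* number of references to the frame */
    uint8_t buddy_order;   /* order of the buddy block, if this frame heads one */
    void *buddy_link;      /* link to other blocks of the same order */
} frame_t;

typedef struct {
    uintptr_t base;        /* physical base address of the zone */
    size_t count;          /* total number of frames in the zone */
    size_t free_count;     /* number of unallocated frames */
    size_t busy_count;     /* number of allocated frames */
    frame_t *frames;       /* array describing each frame of the zone */
    void *buddy_system;    /* buddy system managing the zone's frames */
    int flags;             /* currently unused zone attribute */
} zone_t;
]]></programlisting>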
    </section>

    <section id="frame_allocator">
      <indexterm>
        <primary>frame allocator</primary>
      </indexterm>

      <title>Frame allocator</title>

      <para>The frame allocator satisfies kernel requests to allocate
      power-of-two sized blocks of physical memory. Because of the zonal
      organization of physical memory, the frame allocator always works
      within the context of a particular frame zone. In order to carry out the
      allocation requests, the frame allocator is tightly integrated with the
      buddy system belonging to the zone. The frame allocator is also
      responsible for updating information about the number of free and busy
      frames in the zone. <figure float="1">
          <mediaobject id="frame_alloc">
            <imageobject role="pdf">
              <imagedata fileref="images/frame_alloc.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/frame_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/frame_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Frame allocator scheme.</title>
        </figure></para>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request via the function
        <code>frame_alloc()</code>, the frame allocator first tries to find a
        zone that can satisfy the request (i.e. has the required amount of
        free frames). Once a suitable zone is found, the frame allocator uses
        the buddy allocator on the zone's buddy system to perform the
        allocation. If no suitable zone is found, the frame allocator tries to
        reclaim slab memory.</para>

        <para>During deallocation, which is triggered by a call to
        <code>frame_free()</code>, the frame allocator looks up the respective
        zone that contains the frame being deallocated. Afterwards, it calls
        the buddy allocator again, this time to take care of deallocation
        within the zone's buddy system.</para>
      </formalpara>
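
      <para>The following fragment sketches how a kernel client might use
      these functions; the exact signatures (order argument, flags, returned
      address) are simplified assumptions, not the authoritative HelenOS
      prototypes.</para>

      <programlisting><![CDATA[
/* Sketch: allocate a naturally aligned block of 2^3 = 8 frames
 * and release it again. Signatures are simplified for illustration. */
void example(void)
{
    uintptr_t frames = frame_alloc(3 /* order */, 0 /* flags */);
    if (frames == 0)
        return;  /* allocation failed even after reclaiming slab memory */

    /* ... use the physical memory block ... */

    frame_free(frames);  /* the owning zone is looked up internally */
}
]]></programlisting>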
    </section>

    <section id="buddy_allocator">
      <indexterm>
        <primary>buddy system</primary>
      </indexterm>

      <title>Buddy allocator</title>

      <para>In the buddy system, the memory is broken down into power-of-two
      sized naturally aligned blocks. These blocks are organized in an array
      of lists, in which the list with index <emphasis>i</emphasis> contains
      all unallocated blocks of size
      <emphasis>2<superscript>i</superscript></emphasis>. The index
      <emphasis>i</emphasis> is called the order of the block. Should there be
      two adjacent equally sized blocks in the list <emphasis>i</emphasis>
      (i.e. buddies), the buddy allocator would coalesce them and put the
      resulting block in the list <emphasis>i + 1</emphasis>, provided that
      the resulting block would be naturally aligned. Similarly, when the
      allocator is asked to allocate a block of size
      <emphasis>2<superscript>i</superscript></emphasis>, it first tries to
      satisfy the request from the list with index <emphasis>i</emphasis>. If
      the request cannot be satisfied (i.e. the list <emphasis>i</emphasis> is
      empty), the buddy allocator will try to allocate and split a larger
      block from the list with index <emphasis>i + 1</emphasis>. Both of these
      algorithms are recursive. The recursion ends either when there are no
      blocks to coalesce in the former case or when there are no blocks that
      can be split in the latter case.</para>

      <para>This approach greatly reduces external fragmentation of memory and
      helps in allocating bigger continuous blocks of memory aligned to their
      size. On the other hand, the buddy allocator suffers from increased
      internal fragmentation of memory and is not suitable for general kernel
      allocations. This purpose is better addressed by the <link
      linkend="slab">slab allocator</link>.<figure float="1">
          <mediaobject id="buddy_alloc">
            <imageobject role="pdf">
              <imagedata fileref="images/buddy_alloc.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/buddy_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/buddy_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Buddy system scheme.</title>
        </figure></para>

      <section>
        <title>Implementation</title>

        <para>The buddy allocator is, in fact, an abstract framework which can
        be easily specialized to serve one particular task. It knows nothing
        about the nature of the memory it helps to allocate. To overcome this
        lack of knowledge, the buddy allocator exports an interface that each
        of its clients is required to implement. When supplied with an
        implementation of this interface, the buddy allocator can use
        specialized external functions to find a buddy for a block, split and
        coalesce blocks, manipulate block order and mark blocks busy or
        available.</para>
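
        <para>A client-supplied operations table could look roughly as
        follows; the structure and function names here are illustrative
        rather than the exact HelenOS interface.</para>

        <programlisting><![CDATA[
/* Illustrative sketch of the buddy allocator client interface. */
typedef struct buddy_system buddy_system_t;

typedef struct {
    /* Return the buddy of the given block, or NULL if it is unavailable. */
    link_t *(*find_buddy)(buddy_system_t *b, link_t *block);
    /* Split the block and return its second half. */
    link_t *(*bisect)(buddy_system_t *b, link_t *block);
    /* Coalesce two buddies into one block of higher order. */
    link_t *(*coalesce)(buddy_system_t *b, link_t *block, link_t *buddy);
    /* Get and set the order stored in the block. */
    uint8_t (*get_order)(buddy_system_t *b, link_t *block);
    void (*set_order)(buddy_system_t *b, link_t *block, uint8_t order);
    /* Mark the block as busy or available. */
    void (*mark_busy)(buddy_system_t *b, link_t *block);
    void (*mark_available)(buddy_system_t *b, link_t *block);
} buddy_system_operations_t;
]]></programlisting>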

        <formalpara>
          <title>Data organization</title>

          <para>Each entity allocable by the buddy allocator is required to
          contain space for storing the block order number and a link variable
          used to interconnect blocks within the same order.</para>

          <para>Whatever entities are allocated by the buddy allocator, the
          first entity within a block is used to represent the entire block.
          The first entity keeps the order of the whole block. Other entities
          within the block are assigned the magic value
          <constant>BUDDY_SYSTEM_INNER_BLOCK</constant>. This is especially
          important for effective identification of buddies in a
          one-dimensional array because the entity that represents a potential
          buddy cannot be associated with
          <constant>BUDDY_SYSTEM_INNER_BLOCK</constant> (i.e. if it is
          associated with <constant>BUDDY_SYSTEM_INNER_BLOCK</constant> then
          it is not a buddy).</para>
        </formalpara>
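
        <para>For frames stored in a one-dimensional array, the candidate
        buddy of a block can be computed directly from the block's index and
        order, as the sketch below illustrates (a hypothetical helper, not
        part of the kernel interface, using the illustrative structures from
        the previous section).</para>

        <programlisting><![CDATA[
/* Sketch: locate the potential buddy of the block starting at 'index'
 * within a zone's frame array. Returns NULL when no valid buddy exists. */
frame_t *potential_buddy(zone_t *zone, size_t index)
{
    uint8_t order = zone->frames[index].buddy_order;
    size_t buddy_index = index ^ ((size_t) 1 << order);

    if (buddy_index >= zone->count)
        return NULL;

    /* An inner entity of a larger block can never be a buddy. */
    if (zone->frames[buddy_index].buddy_order == BUDDY_SYSTEM_INNER_BLOCK)
        return NULL;

    return &zone->frames[buddy_index];
}
]]></programlisting>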
      </section>
    </section>

    <section id="slab">
      <indexterm>
        <primary>slab allocator</primary>
      </indexterm>

      <title>Slab allocator</title>

      <para>The majority of memory allocation requests in the kernel is for
      small, frequently used data structures. The basic idea behind the slab
      allocator is that commonly used objects are preallocated in continuous
      areas of physical memory called slabs<footnote>
          <para>Slabs are in fact blocks of physical memory frames allocated
          from the frame allocator.</para>
        </footnote>. Whenever an object is to be allocated, the slab allocator
      returns the first available item from a suitable slab corresponding to
      the object type<footnote>
          <para>The mechanism is rather more complicated, see the next
          paragraph.</para>
        </footnote>. Due to the fact that the sizes of the requested and
      allocated objects match, the slab allocator significantly reduces
      internal fragmentation.</para>

      <indexterm>
        <primary>slab allocator</primary>

        <secondary>- slab cache</secondary>
      </indexterm>

      <para>Slabs of one object type are organized in a structure called slab
      cache. There are usually more slabs in the slab cache, depending on
      previous allocations. If the slab cache runs out of available slabs,
      new slabs are allocated. In order to exploit parallelism and to avoid
      locking of shared spinlocks, slab caches can have variants of
      processor-private slabs called magazines. On each processor, there is a
      two-magazine cache. Full magazines that are not part of any
      per-processor magazine cache are stored in a global list of full
      magazines.</para>

      <indexterm>
        <primary>slab allocator</primary>

        <secondary>- magazine</secondary>
      </indexterm>

      <para>Each object begins its life in a slab. When it is allocated from
      there, the slab allocator calls a constructor that is registered in the
      respective slab cache. The constructor initializes and brings the object
      into a known state. The object is then used by the user. When the user
      later frees the object, the slab allocator puts it into a processor
      private <indexterm>
          <primary>slab allocator</primary>

          <secondary>- magazine</secondary>
        </indexterm>magazine cache, from where it can subsequently be
      allocated again. Note that allocations satisfied from a magazine are
      already initialized by the constructor. When both of the
      processor-cached magazines get full, the allocator will move one of the
      magazines to the list of full magazines. Similarly, when allocating from
      an empty processor magazine cache, the kernel will reload only one
      magazine from the list of full magazines. In other words, the slab
      allocator tries to keep the processor magazine cache only half-full in
      order to prevent thrashing when allocations and deallocations interleave
      on magazine boundaries. The advantage of this setup is that during most
      of the allocations, no global spinlock needs to be held.</para>
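
      <para>The relationships described above might be captured by structures
      along the following lines; the layout is a simplified illustration, not
      the exact kernel definition.</para>

      <programlisting><![CDATA[
/* Illustrative sketch of a slab cache with per-processor magazine caches. */
typedef struct {
    size_t busy;       /* number of objects currently in the magazine */
    size_t size;       /* capacity of the magazine */
    void *objs[];      /* pointers to constructed objects */
} slab_magazine_t;

typedef struct {
    slab_magazine_t *current;  /* magazine used for allocations and frees */
    slab_magazine_t *last;     /* magazine swapped in when current is unusable */
} slab_mag_cache_t;

typedef struct {
    size_t object_size;
    int (*constructor)(void *obj, unsigned int flags);
    size_t (*destructor)(void *obj);
    slab_mag_cache_t *mag_cache;   /* one entry per processor */
    /* ... lists of full/partial slabs and full magazines ... */
} slab_cache_t;
]]></programlisting>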

      <para>Should HelenOS run short of memory, it would start deallocating
      objects from magazines, calling the slab cache destructor on them and
      putting them back into slabs. When a slab contains no allocated objects,
      it is immediately freed.</para>

      <para>
        <figure float="1">
          <mediaobject id="slab_alloc">
            <imageobject role="pdf">
              <imagedata fileref="images/slab_alloc.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/slab_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/slab_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Slab allocator scheme.</title>
        </figure>
      </para>

      <section>
        <title>Implementation</title>

        <para>The slab allocator is closely modelled after <xref
        linkend="Bonwick01" /> with the following exceptions:<itemizedlist>
            <listitem>
              <para>empty slabs are immediately deallocated and</para>
            </listitem>

            <listitem>
              <para>empty magazines are deallocated when not needed.</para>
            </listitem>
          </itemizedlist>The following features are not currently supported
        but would be easy to do: <itemizedlist>
            <listitem>cache coloring and</listitem>

            <listitem>dynamic magazine growing (different magazine sizes are
            already supported, but the allocation strategy would need to be
            adjusted).</listitem>
          </itemizedlist></para>

        <section>
          <title>Allocation/deallocation</title>

          <para>The following two paragraphs summarize and complete the
          description of the slab allocator operation (i.e.
          <code>slab_alloc()</code> and <code>slab_free()</code>
          functions).</para>

          <formalpara>
            <title>Allocation</title>

            <para><emphasis>Step 1.</emphasis> When an allocation request
            comes, the slab allocator checks availability of memory in the
            current magazine of the local processor magazine cache. If an
            object is available there, the allocator just pops it from the
            magazine and returns it.</para>

            <para><emphasis>Step 2.</emphasis> If the current magazine in the
            processor magazine cache is empty, the allocator will attempt to
            swap it with the last magazine from the cache and return to the
            first step. If the last magazine is also empty, the algorithm will
            fall through to Step 3.</para>

            <para><emphasis>Step 3.</emphasis> Now the allocator is in the
            situation when both magazines in the processor magazine cache are
            empty. The allocator reloads one magazine from the shared list of
            full magazines. If the reload is successful (i.e. there are full
            magazines in the list), the algorithm continues with Step
            1.</para>

            <para><emphasis>Step 4.</emphasis> In this fail-safe step, an
            object is allocated from the conventional slab layer and a pointer
            to it is returned. If also the last slab is full, a new slab is
            allocated.</para>
          </formalpara>
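
          <para>Put together, the allocation path might be sketched as
          follows; the function and field names are illustrative and the
          locking is omitted for brevity.</para>

          <programlisting><![CDATA[
/* Sketch of the slab allocation algorithm described above. */
void *slab_alloc_sketch(slab_cache_t *cache)
{
    slab_mag_cache_t *mc = &cache->mag_cache[CPU_ID];

    /* Step 1: try the current magazine. */
    if (mc->current && mc->current->busy > 0)
        return mc->current->objs[--mc->current->busy];

    /* Step 2: swap the current and last magazines and retry. */
    swap_magazines(&mc->current, &mc->last);
    if (mc->current && mc->current->busy > 0)
        return mc->current->objs[--mc->current->busy];

    /* Step 3: reload one magazine from the shared list of full magazines. */
    slab_magazine_t *full = get_full_magazine(cache);
    if (full) {
        mc->current = full;
        return mc->current->objs[--mc->current->busy];
    }

    /* Step 4: fail-safe path -- allocate from the conventional slab layer,
     * possibly allocating a new slab if the existing ones are full. */
    return slab_object_from_slab_layer(cache);
}
]]></programlisting>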

          <formalpara>
            <title>Deallocation</title>

            <para><emphasis>Step 1.</emphasis> During a deallocation request,
            the slab allocator checks whether the current magazine of the
            local processor magazine cache has room for another object. If it
            does, the pointer to the object is just pushed into the magazine
            and the algorithm returns.</para>

            <para><emphasis>Step 2.</emphasis> If the current magazine is
            full, the allocator will attempt to swap it with the last magazine
            from the cache and return to the first step. If the last magazine
            is also full, the algorithm will fall through to Step 3.</para>

            <para><emphasis>Step 3.</emphasis> Now the allocator is in the
            situation when both magazines in the processor magazine cache are
            full. The allocator tries to allocate a new empty magazine and
            flush one of the full magazines to the shared list of full
            magazines. If it is successful, the algorithm continues with Step
            1.</para>

            <para><emphasis>Step 4.</emphasis> In case of a low memory
            condition when the allocation of an empty magazine fails, the
            object is moved directly into a slab. This way, in the worst case,
            object deallocation does not need to allocate any additional
            memory.</para>
          </formalpara>
        </section>
      </section>
    </section>
  </section>

  <section>
    <title>Virtual memory management</title>

    <para>Virtual memory is essential for an operating system because it makes
    several things possible. First, it helps to isolate tasks from each other
    by encapsulating them in their private address spaces. Second, virtual
    memory can give tasks the feeling of more memory being available than is
    actually physically present. And third, by using virtual memory, there
    might be multiple copies of the same program, linked to the same
    addresses, running in the system. There are at least two known mechanisms
    for implementing virtual memory: segmentation and paging. Even though some
    processor architectures supported by HelenOS<footnote>
        <para>ia32 has full-fledged segmentation.</para>
      </footnote> provide both mechanisms, the kernel makes use solely of
    paging.</para>

    <section id="paging">
      <title>VAT subsystem</title>

      <para>In a paged virtual memory, the entire virtual address space is
      divided into small power-of-two sized naturally aligned blocks called
      pages. The processor implements a translation mechanism that allows the
      operating system to manage mappings between the set of pages and the set
      of identically sized and identically aligned pieces of physical memory
      called frames. As a result, references to continuous virtual memory
      areas don't necessarily need to reference continuous areas of physical
      memory. Supported page sizes usually range from several kilobytes to
      several megabytes. Each page that takes part in the mapping is
      associated with certain attributes that further describe the mapping
      (e.g. access rights, dirty and accessed bits and the present
      bit).</para>

      <para>When the processor accesses a page that is not present (i.e. its
      present bit is not set), the operating system is notified through a
      special exception called page fault. It is then up to the operating
      system to service the page fault. In HelenOS, some page faults are fatal
      and result in either task termination or, in the worst case, kernel
      panic<footnote>
          <para>Such a condition would be either caused by a hardware failure
          or a bug in the kernel.</para>
        </footnote>, while other page faults are used to load memory on demand
      or to notify the kernel about certain events.</para>

      <indexterm>
        <primary>page tables</primary>
      </indexterm>

      <para>The set of all page mappings is stored in a memory structure
      called page tables. Some architectures have no hardware support for page
      tables<footnote>
          <para>On mips32, the TLB-only model is used and the operating system
          is responsible for managing software defined page tables.</para>
        </footnote> while other processor architectures<footnote>
          <para>Like amd64 and ia32.</para>
        </footnote> understand the whole in-memory format thereof. Despite all
      the possible differences in page table formats, the HelenOS VAT
      subsystem<footnote>
          <para>Virtual Address Translation subsystem.</para>
        </footnote> unifies the page table operations under one programming
      interface. For all parts of the kernel, three basic functions are
      provided:</para>

      <itemizedlist>
        <listitem>
          <para><code>page_mapping_insert()</code>,</para>
        </listitem>

        <listitem>
          <para><code>page_mapping_find()</code> and</para>
        </listitem>

        <listitem>
          <para><code>page_mapping_remove()</code>.</para>
        </listitem>
      </itemizedlist>

      <para>The <code>page_mapping_insert()</code> function is used to
      introduce a mapping for one virtual memory page belonging to a
      particular address space into the page tables. Once the mapping is in
      the page tables, it can be searched by <code>page_mapping_find()</code>
      and removed by <code>page_mapping_remove()</code>. All of these
      functions internally select the page table mechanism specific functions
      that carry out the actual operation.</para>
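
      <para>A usage sketch follows; the exact parameter lists are simplified
      assumptions rather than the authoritative prototypes.</para>

      <programlisting><![CDATA[
/* Sketch: map one page, look the mapping up and remove it again.
 * Assumed simplified signatures; the real prototypes may differ. */
void mapping_example(as_t *as, uintptr_t page, uintptr_t frame)
{
    /* Insert a mapping of 'page' to 'frame' into the address space 'as'. */
    page_mapping_insert(as, page, frame, PAGE_READ | PAGE_WRITE);

    /* The mapping can later be searched... */
    pte_t *pte = page_mapping_find(as, page);
    if (pte) {
        /* ...and removed again. */
        page_mapping_remove(as, page);
    }
}
]]></programlisting>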

      <para>There are currently two supported mechanisms: generic 4-level
      hierarchical page tables and the global page hash table. Both of the
      mechanisms are generic as they cover several hardware platforms. For
      instance, the 4-level hierarchical page table mechanism is used by
      amd64, ia32, mips32 and ppc32. These architectures have the following
      native page table formats: 4-level, 2-level, TLB-only and hardware hash
      table, respectively. On the other hand, the global page hash table is
      used on ia64, which can be TLB-only or use a hardware hash table.
      Although only two mechanisms are currently implemented, other mechanisms
      (e.g. B+tree) can be easily added.</para>

      <section id="page_tables">
        <indexterm>
          <primary>page tables</primary>

          <secondary>- hierarchical</secondary>
        </indexterm>

        <title>Hierarchical 4-level page tables</title>

        <para>Hierarchical 4-level page tables are a generalization of the
        frequently used hierarchical model of page tables. In this mechanism,
        each address space has its own page tables. To avoid confusion in
        terminology used by hardware vendors, in HelenOS, the root level page
        table is called PTL0, the two middle levels are called PTL1 and PTL2,
        and, finally, the leaf level is called PTL3. All architectures using
        this mechanism are required to use PTL0 and PTL3. However, the middle
        levels can be left out, depending on the hardware hierarchy or
        structure of software-only page tables. The genericity is achieved
        through a set of macros that define transitions from one level to
        another. Unused levels are optimized out by the compiler.
        <figure float="1">
          <mediaobject id="mm_pt">
            <imageobject role="pdf">
              <imagedata fileref="images/mm_pt.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/mm_pt.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/mm_pt.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Hierarchical 4-level page tables.</title>
        </figure>
      </para>
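
        <para>As an illustration, the level-transition macros for an
        architecture with only two hardware levels could collapse the middle
        levels roughly as follows; the macro names are hypothetical and shown
        only to demonstrate the idea.</para>

        <programlisting><![CDATA[
/* Illustrative sketch (hypothetical names): on a 2-level architecture the
 * middle levels are unused, so their transition macros pass the table
 * through unchanged and the compiler optimizes the unused levels away. */
#define PTL1_ENTRIES               0          /* level left out */
#define PTL2_ENTRIES               0          /* level left out */

#define PTL1_INDEX(vaddr)          0
#define PTL2_INDEX(vaddr)          0

#define GET_PTL1_ADDRESS(ptl0, i)  (ptl0)     /* no transition, same table */
#define GET_PTL2_ADDRESS(ptl1, i)  (ptl1)     /* no transition, same table */
]]></programlisting>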
      </section>

      <section>
        <indexterm>
          <primary>page tables</primary>

          <secondary>- hashing</secondary>
        </indexterm>

        <title>Global page hash table</title>

        <para>Implementation of the global page hash table was encouraged by
        64-bit architectures that can have rather sparse address spaces. The
        hash table contains valid mappings only. Each entry of the hash table
        contains an address space pointer, virtual memory page number (VPN),
        physical memory frame number (PFN) and a set of flags. The pair of the
        address space pointer and the virtual memory page number is used as a
        key for the hash table. One of the major differences between the
        global page hash table and hierarchical 4-level page tables is that
        there is only a single global page hash table in the system while
        hierarchical page tables exist per address space. Thus, the global
        page hash table contains information about mappings of all address
        spaces in the system.
        <figure float="1">
          <mediaobject id="mm_hash">
            <imageobject role="pdf">
              <imagedata fileref="images/mm_hash.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/mm_hash.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/mm_hash.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Global page hash table.</title>
        </figure>
      </para>

        <para>The global page hash table mechanism uses the generic hash table
        type as described in the chapter dedicated to <link
        linkend="hashtables">data structures</link> earlier in this
        book.</para>
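
        <para>The entries and the two-part key might be sketched as follows;
        the layout and the hash function are illustrative, not the exact
        kernel definitions.</para>

        <programlisting><![CDATA[
/* Illustrative sketch of a global page hash table entry. */
typedef struct {
    link_t link;        /* collision chain link used by the generic hash table */
    as_t *as;           /* address space the mapping belongs to */
    uintptr_t page;     /* virtual address of the page (VPN) */
    uintptr_t frame;    /* physical address of the frame (PFN) */
    unsigned int flags; /* access rights and other attributes */
} pte_t;

/* The hash is computed from the (address space, page) pair. */
static size_t ht_hash(as_t *as, uintptr_t page)
{
    return ((uintptr_t) as ^ (page >> PAGE_WIDTH)) % PAGE_HT_ENTRIES;
}
]]></programlisting>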
      </section>
    </section>
  </section>

  <section id="tlb">
    <indexterm>
      <primary>TLB</primary>
    </indexterm>

    <title>Translation Lookaside Buffer</title>

    <para>Due to the extensive overhead of the several extra memory accesses
    during page table lookup that would otherwise be necessary on every
    instruction, modern architectures deploy a fast associative cache of
    recently used page mappings. This cache is called TLB - Translation
    Lookaside Buffer - and is present on every processor in the system. As has
    already been pointed out, the TLB is the only page translation mechanism
    for some architectures.</para>

    <section id="tlb_shootdown">
      <indexterm>
        <primary>TLB</primary>

        <secondary>- TLB shootdown</secondary>
      </indexterm>

      <title>TLB consistency</title>

      <para>The operating system is responsible for keeping the TLB consistent
      with the page tables. Whenever mappings are modified or purged from the
      page tables, or when an address space identifier is reused, the kernel
      needs to invalidate the respective contents of the TLB. Some TLB types
      support partial invalidation of their content (e.g. ranges of pages or
      address spaces) while other types can be invalidated only entirely. The
      invalidation must be done on all processors because there is one TLB per
      processor. Maintaining TLB consistency on multiprocessor configurations
      is not as trivial as it might look at first glance.</para>

      <para>The remote TLB invalidation is called TLB shootdown. HelenOS uses
      a simplified variant of the algorithm described in <xref
      linkend="Black89" />.</para>

      <para>TLB shootdown is performed in three phases.</para>

      <formalpara>
        <title>Phase 1.</title>

        <para>The initiator clears its TLB flag and locks the global TLB
        spinlock. The request is then enqueued into all other processors' TLB
        shootdown message queues. When the TLB shootdown message queue is full
        on any processor, the queue is purged and a single request to
        invalidate the entire TLB is stored there. Once all the TLB shootdown
        messages have been dispatched, the initiator sends all other
        processors an interrupt to notify them about the incoming TLB
        shootdown message. It then spins until all processors accept the
        interrupt and clear their TLB flags.</para>
      </formalpara>

      <formalpara>
        <title>Phase 2.</title>

        <para>Except for the initiator, all other processors are spinning on
        the TLB spinlock. The initiator is now free to modify the page tables
        and purge its own TLB. The initiator then unlocks the global TLB
        spinlock and sets its TLB flag.</para>
      </formalpara>

      <formalpara>
        <title>Phase 3.</title>

        <para>When the spinlock is unlocked by the initiator, other processors
        are sequentially granted the spinlock. However, once they manage to
        lock it, they immediately release it. Each processor invalidates its
        TLB according to messages found in its TLB shootdown message queue. In
        the end, each processor sets its TLB flag and resumes its previous
        operation.</para>
      </formalpara>
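
      <para>The initiator's side of the algorithm might be sketched as
      follows; the names of the helper routines are illustrative only.</para>

      <programlisting><![CDATA[
/* Illustrative sketch of the initiator's part of TLB shootdown. */
void tlb_shootdown_sketch(void)
{
    /* Phase 1: announce the shootdown. */
    clear_my_tlb_flag();
    spinlock_lock(&tlb_global_lock);
    enqueue_shootdown_messages_on_all_cpus();
    send_ipi_to_all_other_cpus();
    while (!all_other_cpus_cleared_tlb_flags())
        ;  /* spin until everyone has acknowledged the interrupt */

    /* Phase 2: other processors spin on tlb_global_lock; the initiator
     * is free to modify the page tables and purge its own TLB. */
    modify_page_tables_and_purge_own_tlb();
    spinlock_unlock(&tlb_global_lock);
    set_my_tlb_flag();

    /* Phase 3 happens on the other processors: each briefly acquires and
     * releases the lock, invalidates its TLB according to its message
     * queue, sets its TLB flag and resumes its previous operation. */
}
]]></programlisting>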
    </section>
  </section>

  <section>
    <title>Address spaces</title>

    <para>In HelenOS, address spaces are objects that encapsulate the
    following items:</para>

    <itemizedlist>
      <listitem>
        <para>address space identifier,</para>
      </listitem>

      <listitem>
        <para>page table PTL0 pointer and</para>
      </listitem>

      <listitem>
        <para>a set of mutually disjoint address space areas.</para>
      </listitem>
    </itemizedlist>

    <para>Address space identifiers will be discussed later in this section.
    The address space contains a pointer to PTL0, provided that the
    architecture uses per address space page tables such as the hierarchical
    4-level page tables. The most interesting component is the B+tree of
    address space areas belonging to the address space.</para>

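    <para>A simplified sketch of such an object follows; the field names are
    illustrative only.</para>

    <programlisting><![CDATA[
/* Illustrative sketch of an address space object. */
typedef struct {
    asid_t asid;           /* address space identifier */
    pte_t *ptl0;           /* root page table, if per-AS page tables are used */
    btree_t as_area_btree; /* B+tree of mutually disjoint address space areas */
    size_t cpu_refcount;   /* number of processors on which the AS is active */
} as_t;
]]></programlisting>
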
    <section>
      <title>Address space areas</title>

      <para>Because an address space can be composed of heterogeneous mappings
      such as userspace code, data, read-only data and kernel memory, it is
      further broken down into smaller homogeneous units called address space
      areas. An address space area represents a continuous piece of userspace
      virtual memory associated with common flags. Kernel memory mappings do
      not take part in address space areas because they are hardwired either
      into TLBs or page tables and are thus shared by all address spaces. The
      flags are a combination of:</para>

      <itemizedlist>
        <listitem>
          <para><constant>AS_AREA_READ</constant>,</para>
        </listitem>

        <listitem>
          <para><constant>AS_AREA_WRITE</constant>,</para>
        </listitem>

        <listitem>
          <para><constant>AS_AREA_EXEC</constant> and</para>
        </listitem>

        <listitem>
          <para><constant>AS_AREA_CACHEABLE</constant>.</para>
        </listitem>
      </itemizedlist>

      <para>The <constant>AS_AREA_READ</constant> flag is implicit and cannot
      be removed. The <constant>AS_AREA_WRITE</constant> flag denotes a
      writable address space area and the <constant>AS_AREA_EXEC</constant> is
      used for areas containing code. The combination of
      <constant>AS_AREA_WRITE</constant> and <constant>AS_AREA_EXEC</constant>
      is not allowed. Some architectures don't differentiate between
      executable and non-executable mappings. In that case, the
      <constant>AS_AREA_EXEC</constant> flag has no effect on mappings created
      for the address space area in the page tables. If the flags don't have
      <constant>AS_AREA_CACHEABLE</constant> set, the page table entries for
      the area are created with caching disabled. This is useful for address
      space areas containing memory of some memory mapped device.</para>

      <para>Address space areas can be backed by a backend that provides
      virtual functions for servicing page faults that occur within the
      address space area, releasing memory allocated by the area and sharing
      the area. Currently, there are three backends supported by HelenOS:
      anonymous memory backend, ELF image backend and physical memory
      backend.</para>
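
      <para>The backend interface might be sketched as a table of function
      pointers along these lines; the names are illustrative, not the exact
      kernel declaration.</para>

      <programlisting><![CDATA[
/* Illustrative sketch of an address space area backend. */
typedef struct {
    /* Service a page fault at 'addr' inside the area; nonzero on success. */
    int (*page_fault)(as_area_t *area, uintptr_t addr, pf_access_t access);
    /* Release the frame backing the given page of the area. */
    void (*frame_free)(as_area_t *area, uintptr_t page, uintptr_t frame);
    /* Duplicate the mapping information when the area becomes shared. */
    void (*share)(as_area_t *area);
} mem_backend_t;

/* Each area then simply points to one of the available backends,
 * e.g. an anonymous, ELF image or physical memory backend. */
]]></programlisting>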

      <formalpara>
        <title>Anonymous memory backend</title>

        <para>Anonymous memory is memory that has no predefined contents such
        as the userspace stack or heap. Anonymous address space areas are
        backed by memory allocated from the frame allocator. Areas backed by
        this backend can be resized as long as they are not shared.</para>
      </formalpara>

      <formalpara>
        <title>ELF image backend</title>

        <para>Areas backed by the ELF backend are composed of memory that can
        be either initialized, partially initialized or completely anonymous.
        Initialized portions of ELF backend address space areas are those that
        are entirely physically present in the executable image (e.g. code and
        initialized data). Anonymous portions are those pages of the
        <emphasis>bss</emphasis> section that exist entirely outside the
        executable image. Lastly, pages that don't fit into the previous two
        categories are partially initialized as they are both part of the
        image and the <emphasis>bss</emphasis> section. The initialized
        portion does not need any memory from the allocator unless it is
        writable. In that case, pages are duplicated on demand during page
        fault and memory for the copy is allocated from the frame allocator.
        The remaining two parts of the ELF always require memory from the
        frame allocator. Non-shared address space areas backed by the ELF
        image backend can be resized.</para>
      </formalpara>

      <formalpara>
        <title>Physical memory backend</title>

        <para>The physical memory backend is used by device drivers to access
        physical memory directly. No additional memory needs to be allocated
        on a page fault in this area or when sharing this area. Areas backed
        by this backend cannot be resized.</para>
      </formalpara>

      <section>
        <title>Memory sharing</title>

        <para>Address space areas can be shared provided that their backend
        supports sharing<footnote>
            <para>Which is the case for all currently supported
            backends.</para>
          </footnote>. When the kernel calls <code>as_area_share()</code>, a
        check is made to see whether the area is already being shared. If the
        area is already shared, it contains a pointer to the share info
        structure. The pointer is then simply copied into the new address
        space area and a reference count in the share info structure is
        incremented. Otherwise a new address space share info structure needs
        to be created. The backend is then called to duplicate the mapping of
        pages for which a frame is allocated. The duplicated mapping is stored
        in the share info structure B+tree called <varname>pagemap</varname>.
        Note that the reference count of the frames put into the
        <varname>pagemap</varname> must be incremented in order to avoid a
        race condition. If the originating address space area had been
        destroyed before the <varname>pagemap</varname> information made it to
        the page tables of other address spaces that take part in the sharing,
        the reference count of the respective frames would have dropped to
        zero and some of them could have been allocated again.</para>
      </section>

      <section>
        <title>Page faults</title>

        <para>When a page fault is encountered in the address space area, the
        address space page fault handler, <code>as_page_fault()</code>,
        invokes the corresponding backend page fault handler to resolve the
        situation. The backend might either confirm the page fault or perform
        a remedy. In the non-shared case, depending on the backend, the page
        fault can usually be remedied by allocating some memory on demand or
        by looking up the frame for the faulting translation in the ELF
        image.</para>

        <para>Shared address space areas need to consider the
        <varname>pagemap</varname> B+tree. First, the handler needs to find
        out whether the mapping is already present in the
        <varname>pagemap</varname>. If it is, the frame reference count is
        increased and the page fault is resolved. Otherwise the handler
        proceeds similarly to the non-shared case. If it allocates a physical
        memory frame, it must increment its reference count and add it to the
        <varname>pagemap</varname>.</para>
      </section>
    </section>

    <section>
      <indexterm>
        <primary>address space</primary>

        <secondary>- ASID</secondary>
      </indexterm>

      <title>Address Space ID (ASID)</title>

      <para>Modern processor architectures optimize TLB utilization by
      associating TLB entries with address spaces through assigning
      identification numbers to them. In HelenOS, the term ASID, originally
      taken from the mips32 terminology, is used to refer to the address space
      identification number. The advantage of having ASIDs is that the TLB
      does not have to be invalidated on thread context switch as long as
      ASIDs are unique. Unfortunately, the architectures supported by HelenOS
      use different widths of ASID numbers<footnote>
          <para>amd64 and ia32 don't use a similar abstraction at all, mips32
          has 8-bit ASIDs and ia64 can have ASIDs between 18 and 24 bits
          wide.</para>
        </footnote>, none of which is sufficient. The amd64 and ia32
      architectures cannot make use of ASIDs as their TLB doesn't recognize
      such an abstraction. Other architectures have support for ASIDs, but for
      instance ppc32 doesn't make use of them in the current version of
      HelenOS. The rest of the architectures do use ASIDs. However, even on
      the ia64 architecture, the minimal supported width of ASID<footnote>
          <para>RID in ia64 terminology.</para>
        </footnote> is insufficient to provide a unique integer identifier to
      all address spaces that might hypothetically coexist in the running
      system. The situation on mips32 is even worse: the architecture has only
      256 unique identifiers.</para>

      <indexterm>
        <primary>address space</primary>

        <secondary>- ASID stealing</secondary>
      </indexterm>

      <para>To mitigate the shortage of ASIDs, HelenOS uses the following
      strategy. When the system initializes, a FIFO queue<footnote>
          <para>Note that architecture-specific measures are taken to avoid an
          excessively large FIFO queue. For instance, seven consecutive ia64
          RIDs are grouped to form one HelenOS ASID.</para>
        </footnote> is created and filled with all available ASIDs. Moreover,
      every address space remembers the number of processors on which it is
      active. Address spaces that have a valid ASID and that are not active on
      any processor are appended to the list of inactive address spaces with
      valid ASIDs. When an address space needs to be assigned a valid ASID, it
      first checks the FIFO queue. If the queue contains at least one ASID,
      the ASID is allocated. If the queue is empty, an ASID is simply stolen
      from the first address space in the list. In that case, the address
      space that loses the ASID in favor of another address space is removed
      from the list. After the stolen ASID is purged from all TLBs, it can be
      used by the new address space. Note that this approach works due to the
      fact that the number of ASIDs is greater than the maximal number of
      processors supported by HelenOS and that there can be only one active
      address space per processor. In other words, when the FIFO queue is
      empty, there must be address spaces that are not active on any
      processor.</para>
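
      <para>The allocation path might be sketched as follows; the helper names
      are illustrative.</para>

      <programlisting><![CDATA[
/* Illustrative sketch of ASID allocation with stealing. */
asid_t asid_get_sketch(void)
{
    asid_t asid;

    if (!fifo_empty(&free_asids)) {
        /* The common case: take an ASID from the FIFO queue. */
        asid = fifo_pop(&free_asids);
    } else {
        /* Steal the ASID from the first inactive address space. */
        as_t *victim = list_first(&inactive_as_with_asid);
        asid = victim->asid;
        victim->asid = ASID_INVALID;
        list_remove(&inactive_as_with_asid, victim);

        /* The stolen ASID must be purged from all TLBs before reuse. */
        tlb_shootdown_asid(asid);
    }

    return asid;
}
]]></programlisting>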
    </section>
  </section>
</chapter>