Rev 79 | Rev 82 | Go to most recent revision | Details | Compare with Previous | Last modification | View Log | RSS feed
Rev | Author | Line No. | Line |
---|---|---|---|
9 | bondari | 1 | <?xml version="1.0" encoding="UTF-8"?> |
11 | bondari | 2 | <chapter id="mm"> |
3 | <?dbhtml filename="mm.html"?> |
||
9 | bondari | 4 | |
11 | bondari | 5 | <title>Memory management</title> |
9 | bondari | 6 | |
67 | jermar | 7 | <para>In previous chapters, this book described the scheduling subsystem as |
8 | the creator of the impression that threads execute in parallel. The memory |
||
9 | management subsystem, on the other hand, creates the impression that there |
||
10 | is enough physical memory for the kernel and that userspace tasks have the |
||
11 | entire address space only for themselves.</para> |
||
64 | jermar | 12 | |
26 | bondari | 13 | <section> |
64 | jermar | 14 | <title>Physical memory management</title> |
15 | |||
16 | <section id="zones_and_frames"> |
||
17 | <title>Zones and frames</title> |
||
18 | |||
67 | jermar | 19 | <para>HelenOS represents continuous areas of physical memory in |
20 | structures called frame zones (abbreviated as zones). Each zone contains |
||
21 | information about the number of allocated and unallocated physical |
||
22 | memory frames as well as the physical base address of the zone and |
||
23 | number of frames contained in it. A zone also contains an array of frame |
||
24 | structures describing each frame of the zone and, in the last, but not |
||
25 | the least important, front, each zone is equipped with a buddy system |
||
26 | that faciliates effective allocation of power-of-two sized block of |
||
27 | frames.</para> |
||
64 | jermar | 28 | |
67 | jermar | 29 | <para>This organization of physical memory provides good preconditions |
30 | for hot-plugging of more zones. There is also one currently unused zone |
||
31 | attribute: <code>flags</code>. The attribute could be used to give a |
||
32 | special meaning to some zones in the future.</para> |
||
64 | jermar | 33 | |
67 | jermar | 34 | <para>The zones are linked in a doubly-linked list. This might seem a |
35 | bit ineffective because the zone list is walked everytime a frame is |
||
36 | allocated or deallocated. However, this does not represent a significant |
||
37 | performance problem as it is expected that the number of zones will be |
||
38 | rather low. Moreover, most architectures merge all zones into |
||
39 | one.</para> |
||
40 | |||
76 | palkovsky | 41 | <para>Every physical memory frame in a zone, is described by a structure |
42 | that contains number of references and other data used by buddy |
||
67 | jermar | 43 | system.</para> |
64 | jermar | 44 | </section> |
45 | |||
46 | <section id="frame_allocator"> |
||
71 | bondari | 47 | <indexterm> |
48 | <primary>frame allocator</primary> |
||
49 | </indexterm> |
||
50 | |||
64 | jermar | 51 | <title>Frame allocator</title> |
52 | |||
67 | jermar | 53 | <para>The frame allocator satisfies kernel requests to allocate |
54 | power-of-two sized blocks of physical memory. Because of zonal |
||
55 | organization of physical memory, the frame allocator is always working |
||
76 | palkovsky | 56 | within a context of a particular frame zone. In order to carry out the |
67 | jermar | 57 | allocation requests, the frame allocator is tightly integrated with the |
58 | buddy system belonging to the zone. The frame allocator is also |
||
59 | responsible for updating information about the number of free and busy |
||
60 | frames in the zone. <figure> |
||
61 | <mediaobject id="frame_alloc"> |
||
77 | bondari | 62 | <imageobject role="eps"> |
63 | <imagedata fileref="images.vector/frame_alloc.eps" format="EPS" /> |
||
64 | </imageobject> |
||
65 | |||
67 | jermar | 66 | <imageobject role="html"> |
67 | <imagedata fileref="images/frame_alloc.png" format="PNG" /> |
||
68 | </imageobject> |
||
64 | jermar | 69 | |
67 | jermar | 70 | <imageobject role="fop"> |
71 | <imagedata fileref="images.vector/frame_alloc.svg" format="SVG" /> |
||
72 | </imageobject> |
||
73 | </mediaobject> |
||
64 | jermar | 74 | |
67 | jermar | 75 | <title>Frame allocator scheme.</title> |
76 | </figure></para> |
||
64 | jermar | 77 | |
78 | <formalpara> |
||
79 | <title>Allocation / deallocation</title> |
||
80 | |||
67 | jermar | 81 | <para>Upon allocation request via function <code>frame_alloc</code>, |
82 | the frame allocator first tries to find a zone that can satisfy the |
||
83 | request (i.e. has the required amount of free frames). Once a suitable |
||
84 | zone is found, the frame allocator uses the buddy allocator on the |
||
85 | zone's buddy system to perform the allocation. During deallocation, |
||
86 | which is triggered by a call to <code>frame_free</code>, the frame |
||
87 | allocator looks up the respective zone that contains the frame being |
||
88 | deallocated. Afterwards, it calls the buddy allocator again, this time |
||
89 | to take care of deallocation within the zone's buddy system.</para> |
||
64 | jermar | 90 | </formalpara> |
91 | </section> |
||
92 | |||
93 | <section id="buddy_allocator"> |
||
71 | bondari | 94 | <indexterm> |
95 | <primary>buddy system</primary> |
||
96 | </indexterm> |
||
97 | |||
64 | jermar | 98 | <title>Buddy allocator</title> |
99 | |||
67 | jermar | 100 | <para>In the buddy system, the memory is broken down into power-of-two |
101 | sized naturally aligned blocks. These blocks are organized in an array |
||
79 | bondari | 102 | of lists, in which the list with index <emphasis>i</emphasis> contains all unallocated blocks |
103 | of size <emphasis>2<superscript>i</superscript></emphasis>. The |
||
104 | index <emphasis>i</emphasis> is called the order of block. Should there be two adjacent |
||
105 | equally sized blocks in the list <emphasis>i</emphasis> (i.e. buddies), the |
||
67 | jermar | 106 | buddy allocator would coalesce them and put the resulting block in list |
79 | bondari | 107 | <emphasis>i + 1</emphasis>, provided that the resulting block would |
67 | jermar | 108 | be naturally aligned. Similarily, when the allocator is asked to |
109 | allocate a block of size |
||
79 | bondari | 110 | <emphasis>2<superscript>i</superscript></emphasis>, it first tries |
111 | to satisfy the request from the list with index <emphasis>i</emphasis>. If the request cannot |
||
112 | be satisfied (i.e. the list <emphasis>i</emphasis> is empty), the buddy allocator will try to |
||
113 | allocate and split a larger block from the list with index <emphasis>i + 1</emphasis>. Both |
||
67 | jermar | 114 | of these algorithms are recursive. The recursion ends either when there |
115 | are no blocks to coalesce in the former case or when there are no blocks |
||
116 | that can be split in the latter case.</para> |
||
64 | jermar | 117 | |
67 | jermar | 118 | <para>This approach greatly reduces external fragmentation of memory and |
119 | helps in allocating bigger continuous blocks of memory aligned to their |
||
120 | size. On the other hand, the buddy allocator suffers increased internal |
||
121 | fragmentation of memory and is not suitable for general kernel |
||
122 | allocations. This purpose is better addressed by the <link |
||
123 | linkend="slab">slab allocator</link>.<figure> |
||
64 | jermar | 124 | <mediaobject id="buddy_alloc"> |
77 | bondari | 125 | <imageobject role="eps"> |
126 | <imagedata fileref="images.vector/buddy_alloc.eps" format="EPS" /> |
||
127 | </imageobject> |
||
128 | |||
64 | jermar | 129 | <imageobject role="html"> |
130 | <imagedata fileref="images/buddy_alloc.png" format="PNG" /> |
||
131 | </imageobject> |
||
132 | |||
133 | <imageobject role="fop"> |
||
134 | <imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" /> |
||
135 | </imageobject> |
||
136 | </mediaobject> |
||
137 | |||
138 | <title>Buddy system scheme.</title> |
||
67 | jermar | 139 | </figure></para> |
64 | jermar | 140 | |
141 | <section> |
||
142 | <title>Implementation</title> |
||
143 | |||
144 | <para>The buddy allocator is, in fact, an abstract framework wich can |
||
145 | be easily specialized to serve one particular task. It knows nothing |
||
146 | about the nature of memory it helps to allocate. In order to beat the |
||
147 | lack of this knowledge, the buddy allocator exports an interface that |
||
148 | each of its clients is required to implement. When supplied with an |
||
149 | implementation of this interface, the buddy allocator can use |
||
150 | specialized external functions to find a buddy for a block, split and |
||
151 | coalesce blocks, manipulate block order and mark blocks busy or |
||
67 | jermar | 152 | available.</para> |
64 | jermar | 153 | |
154 | <formalpara> |
||
155 | <title>Data organization</title> |
||
156 | |||
157 | <para>Each entity allocable by the buddy allocator is required to |
||
158 | contain space for storing block order number and a link variable |
||
159 | used to interconnect blocks within the same order.</para> |
||
160 | |||
161 | <para>Whatever entities are allocated by the buddy allocator, the |
||
162 | first entity within a block is used to represent the entire block. |
||
163 | The first entity keeps the order of the whole block. Other entities |
||
164 | within the block are assigned the magic value |
||
165 | <constant>BUDDY_INNER_BLOCK</constant>. This is especially important |
||
166 | for effective identification of buddies in a one-dimensional array |
||
167 | because the entity that represents a potential buddy cannot be |
||
168 | associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it |
||
169 | is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is |
||
170 | not a buddy).</para> |
||
171 | </formalpara> |
||
172 | </section> |
||
173 | </section> |
||
174 | |||
175 | <section id="slab"> |
||
71 | bondari | 176 | <indexterm> |
177 | <primary>slab allocator</primary> |
||
178 | </indexterm> |
||
179 | |||
70 | jermar | 180 | <title>Slab allocator</title> |
64 | jermar | 181 | |
67 | jermar | 182 | <para>The majority of memory allocation requests in the kernel is for |
183 | small, frequently used data structures. The basic idea behind the slab |
||
184 | allocator is that commonly used objects are preallocated in continuous |
||
185 | areas of physical memory called slabs<footnote> |
||
186 | <para>Slabs are in fact blocks of physical memory frames allocated |
||
187 | from the frame allocator.</para> |
||
188 | </footnote>. Whenever an object is to be allocated, the slab allocator |
||
189 | returns the first available item from a suitable slab corresponding to |
||
190 | the object type<footnote> |
||
191 | <para>The mechanism is rather more complicated, see the next |
||
192 | paragraph.</para> |
||
193 | </footnote>. Due to the fact that the sizes of the requested and |
||
194 | allocated object match, the slab allocator significantly reduces |
||
195 | internal fragmentation.</para> |
||
64 | jermar | 196 | |
71 | bondari | 197 | <indexterm> |
198 | <primary>slab allocator</primary> |
||
199 | |||
72 | bondari | 200 | <secondary>- slab cache</secondary> |
71 | bondari | 201 | </indexterm> |
202 | |||
67 | jermar | 203 | <para>Slabs of one object type are organized in a structure called slab |
204 | cache. There are ususally more slabs in the slab cache, depending on |
||
205 | previous allocations. If the the slab cache runs out of available slabs, |
||
206 | new slabs are allocated. In order to exploit parallelism and to avoid |
||
207 | locking of shared spinlocks, slab caches can have variants of |
||
208 | processor-private slabs called magazines. On each processor, there is a |
||
209 | two-magazine cache. Full magazines that are not part of any |
||
210 | per-processor magazine cache are stored in a global list of full |
||
211 | magazines.</para> |
||
64 | jermar | 212 | |
71 | bondari | 213 | <indexterm> |
214 | <primary>slab allocator</primary> |
||
215 | |||
72 | bondari | 216 | <secondary>- magazine</secondary> |
71 | bondari | 217 | </indexterm> |
218 | |||
67 | jermar | 219 | <para>Each object begins its life in a slab. When it is allocated from |
220 | there, the slab allocator calls a constructor that is registered in the |
||
221 | respective slab cache. The constructor initializes and brings the object |
||
222 | into a known state. The object is then used by the user. When the user |
||
223 | later frees the object, the slab allocator puts it into a processor |
||
71 | bondari | 224 | private <indexterm> |
225 | <primary>slab allocator</primary> |
||
226 | |||
72 | bondari | 227 | <secondary>- magazine</secondary> |
71 | bondari | 228 | </indexterm>magazine cache, from where it can be precedently allocated |
67 | jermar | 229 | again. Note that allocations satisfied from a magazine are already |
230 | initialized by the constructor. When both of the processor cached |
||
231 | magazines get full, the allocator will move one of the magazines to the |
||
232 | list of full magazines. Similarily, when allocating from an empty |
||
233 | processor magazine cache, the kernel will reload only one magazine from |
||
234 | the list of full magazines. In other words, the slab allocator tries to |
||
235 | keep the processor magazine cache only half-full in order to prevent |
||
236 | thrashing when allocations and deallocations interleave on magazine |
||
70 | jermar | 237 | boundaries. The advantage of this setup is that during most of the |
238 | allocations, no global spinlock needs to be held.</para> |
||
65 | jermar | 239 | |
67 | jermar | 240 | <para>Should HelenOS run short of memory, it would start deallocating |
241 | objects from magazines, calling slab cache destructor on them and |
||
242 | putting them back into slabs. When a slab contanins no allocated object, |
||
243 | it is immediately freed.</para> |
||
66 | bondari | 244 | |
71 | bondari | 245 | <para> |
246 | <figure> |
||
64 | jermar | 247 | <mediaobject id="slab_alloc"> |
77 | bondari | 248 | <imageobject role="eps"> |
249 | <imagedata fileref="images.vector/slab_alloc.eps" format="EPS" /> |
||
250 | </imageobject> |
||
251 | |||
64 | jermar | 252 | <imageobject role="html"> |
253 | <imagedata fileref="images/slab_alloc.png" format="PNG" /> |
||
254 | </imageobject> |
||
77 | bondari | 255 | |
256 | <imageobject role="fop"> |
||
257 | <imagedata fileref="images.vector/slab_alloc.svg" format="SVG" /> |
||
258 | </imageobject> |
||
64 | jermar | 259 | </mediaobject> |
260 | |||
261 | <title>Slab allocator scheme.</title> |
||
71 | bondari | 262 | </figure> |
263 | </para> |
||
64 | jermar | 264 | |
67 | jermar | 265 | <section> |
266 | <title>Implementation</title> |
||
267 | |||
268 | <para>The slab allocator is closely modelled after OpenSolaris slab |
||
71 | bondari | 269 | allocator by Jeff Bonwick and Jonathan Adams <xref |
270 | linkend="Bonwick01" /> with the following exceptions:<itemizedlist> |
||
81 | jermar | 271 | <listitem><para>empty slabs are immediately deallocated and</para></listitem> |
66 | bondari | 272 | |
273 | <listitem> |
||
70 | jermar | 274 | <para>empty magazines are deallocated when not needed.</para> |
66 | bondari | 275 | </listitem> |
70 | jermar | 276 | </itemizedlist>The following features are not currently supported |
277 | but would be easy to do: <itemizedlist> |
||
71 | bondari | 278 | <listitem>cache coloring and</listitem> |
64 | jermar | 279 | |
71 | bondari | 280 | <listitem>dynamic magazine grow (different magazine sizes are |
281 | already supported, but the allocation strategy would need to be |
||
282 | adjusted).</listitem> |
||
64 | jermar | 283 | </itemizedlist></para> |
284 | |||
285 | <section> |
||
286 | <title>Allocation/deallocation</title> |
||
287 | |||
70 | jermar | 288 | <para>The following two paragraphs summarize and complete the |
289 | description of the slab allocator operation (i.e. |
||
290 | <code>slab_alloc</code> and <code>slab_free</code> |
||
291 | operations).</para> |
||
64 | jermar | 292 | |
293 | <formalpara> |
||
294 | <title>Allocation</title> |
||
295 | |||
70 | jermar | 296 | <para><emphasis>Step 1.</emphasis> When an allocation request |
297 | comes, the slab allocator checks availability of memory in the |
||
298 | current magazine of the local processor magazine cache. If the |
||
76 | palkovsky | 299 | available memory is there, the allocator just pops the object from |
300 | magazine and returns it.</para> |
||
64 | jermar | 301 | |
70 | jermar | 302 | <para><emphasis>Step 2.</emphasis> If the current magazine in the |
303 | processor magazine cache is empty, the allocator will attempt to |
||
304 | swap it with the last magazine from the cache and return to the |
||
305 | first step. If also the last magazine is empty, the algorithm will |
||
306 | fall through to Step 3.</para> |
||
64 | jermar | 307 | |
70 | jermar | 308 | <para><emphasis>Step 3.</emphasis> Now the allocator is in the |
309 | situation when both magazines in the processor magazine cache are |
||
310 | empty. The allocator reloads one magazine from the shared list of |
||
311 | full magazines. If the reload is successful (i.e. there are full |
||
312 | magazines in the list), the algorithm continues with Step |
||
313 | 1.</para> |
||
64 | jermar | 314 | |
70 | jermar | 315 | <para><emphasis>Step 4.</emphasis> In this fail-safe step, an |
316 | object is allocated from the conventional slab layer and a pointer |
||
317 | to it is returned. If also the last magazine is full,</para> |
||
64 | jermar | 318 | </formalpara> |
319 | |||
320 | <formalpara> |
||
321 | <title>Deallocation</title> |
||
322 | |||
70 | jermar | 323 | <para><emphasis>Step 1.</emphasis> During a deallocation request, |
324 | the slab allocator checks if the current magazine of the local |
||
76 | palkovsky | 325 | processor magazine cache is not full. If it is, the pointer to the |
70 | jermar | 326 | objects is just pushed into the magazine and the algorithm |
327 | returns.</para> |
||
64 | jermar | 328 | |
70 | jermar | 329 | <para><emphasis>Step 2.</emphasis> If the current magazine is |
330 | full, the allocator will attempt to swap it with the last magazine |
||
331 | from the cache and return to the first step. If also the last |
||
332 | magazine is empty, the algorithm will fall through to Step |
||
333 | 3.</para> |
||
64 | jermar | 334 | |
70 | jermar | 335 | <para><emphasis>Step 3.</emphasis> Now the allocator is in the |
336 | situation when both magazines in the processor magazine cache are |
||
76 | palkovsky | 337 | full. The allocator tries to allocate a new empty magazine and |
338 | flush one of the full magazines to the shared list of full |
||
339 | magazines. If it is successfull, the algoritm continues with Step |
||
340 | 1.</para> |
||
341 | |||
342 | <para><emphasis>Step 4. </emphasis>In case of low memory condition |
||
343 | when the allocation of empty magazine fails, the object is moved |
||
344 | directly into slab. In the worst case object deallocation does not |
||
345 | need to allocate any additional memory.</para> |
||
64 | jermar | 346 | </formalpara> |
347 | </section> |
||
348 | </section> |
||
349 | </section> |
||
350 | </section> |
||
351 | |||
352 | <section> |
||
67 | jermar | 353 | <title>Virtual memory management</title> |
9 | bondari | 354 | |
67 | jermar | 355 | <section> |
356 | <title>Introduction</title> |
||
357 | |||
358 | <para>Virtual memory is a special memory management technique, used by |
||
359 | kernel to achieve a bunch of mission critical goals. <itemizedlist> |
||
360 | <listitem> |
||
361 | Isolate each task from other tasks that are running on the system at the same time. |
||
362 | </listitem> |
||
363 | |||
364 | <listitem> |
||
365 | Allow to allocate more memory, than is actual physical memory size of the machine. |
||
366 | </listitem> |
||
367 | |||
368 | <listitem> |
||
369 | Allowing, in general, to load and execute two programs that are linked on the same address without complicated relocations. |
||
370 | </listitem> |
||
371 | </itemizedlist></para> |
||
372 | |||
373 | <para><!-- |
||
374 | <para> |
||
375 | Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc). |
||
376 | Special address space area type - device - prohibits shrink/extend syscalls to call on it. |
||
377 | Address space has link to mapping tables (hierarchical - per Address space, hash - global tables). |
||
378 | </para> |
||
379 | |||
380 | --></para> |
||
381 | </section> |
||
382 | |||
383 | <section> |
||
71 | bondari | 384 | <title>Address spaces</title> |
70 | jermar | 385 | |
71 | bondari | 386 | <section> |
387 | <indexterm> |
||
388 | <primary>address space</primary> |
||
70 | jermar | 389 | |
72 | bondari | 390 | <secondary>- area</secondary> |
71 | bondari | 391 | </indexterm> |
70 | jermar | 392 | |
393 | <title>Address space areas</title> |
||
67 | jermar | 394 | |
395 | <para>Each address space consists of mutually disjunctive continuous |
||
396 | address space areas. Address space area is precisely defined by its |
||
397 | base address and the number of frames/pages is contains.</para> |
||
398 | |||
399 | <para>Address space area , that define behaviour and permissions on |
||
400 | the particular area. <itemizedlist> |
||
71 | bondari | 401 | <listitem><emphasis>AS_AREA_READ</emphasis> flag indicates reading |
402 | permission.</listitem> |
||
67 | jermar | 403 | |
71 | bondari | 404 | <listitem><emphasis>AS_AREA_WRITE</emphasis> flag indicates |
405 | writing permission.</listitem> |
||
67 | jermar | 406 | |
71 | bondari | 407 | <listitem><emphasis>AS_AREA_EXEC</emphasis> flag indicates code |
408 | execution permission. Some architectures do not support execution |
||
409 | persmission restriction. In this case this flag has no |
||
410 | effect.</listitem> |
||
67 | jermar | 411 | |
71 | bondari | 412 | <listitem><emphasis>AS_AREA_DEVICE</emphasis> marks area as mapped |
413 | to the device memory.</listitem> |
||
67 | jermar | 414 | </itemizedlist></para> |
415 | |||
416 | <para>Kernel provides possibility tasks create/expand/shrink/share its |
||
417 | address space via the set of syscalls.</para> |
||
418 | </section> |
||
419 | |||
420 | <section> |
||
71 | bondari | 421 | <indexterm> |
422 | <primary>address space</primary> |
||
423 | |||
72 | bondari | 424 | <secondary>- ASID</secondary> |
71 | bondari | 425 | </indexterm> |
426 | |||
67 | jermar | 427 | <title>Address Space ID (ASID)</title> |
428 | |||
76 | palkovsky | 429 | <para>Every task in the operating system has it's own view of the |
430 | virtual memory. When performing context switch between different |
||
431 | tasks, the kernel must switch the address space mapping as well. As |
||
432 | modern processors perform very aggressive caching of virtual mappings, |
||
433 | flushing the complete TLB on every context switch would be very |
||
434 | inefficient. To avoid such performance penalty, some architectures |
||
435 | introduce an address space identifier, which allows storing several |
||
436 | different mappings inside TLB.</para> |
||
67 | jermar | 437 | |
76 | palkovsky | 438 | <para>HelenOS kernel can take advantage of this hardware support by |
439 | having an ASID abstraction. I.e. on ia64 kernel ASID is derived from |
||
440 | RID (region identifier) and on the mips32 kernel ASID is actually the |
||
441 | hardware identifier. As expected, this ASID information record is the |
||
442 | part of <emphasis>as_t</emphasis> structure.</para> |
||
67 | jermar | 443 | |
444 | <para>Due to the hardware limitations, hardware ASID has limited |
||
445 | length from 8 bits on ia64 to 24 bits on mips32, which makes it |
||
446 | impossible to use it as unique address space identifier for all tasks |
||
447 | running in the system. In such situations special ASID stealing |
||
448 | algoritm is used, which takes ASID from inactive task and assigns it |
||
449 | to the active task.</para> |
||
450 | |||
71 | bondari | 451 | <indexterm> |
452 | <primary>address space</primary> |
||
453 | |||
72 | bondari | 454 | <secondary>- ASID stealing</secondary> |
71 | bondari | 455 | </indexterm> |
456 | |||
457 | <para> |
||
458 | <classname>ASID stealing algoritm here.</classname> |
||
459 | </para> |
||
67 | jermar | 460 | </section> |
461 | </section> |
||
462 | |||
71 | bondari | 463 | <section id="paging"> |
67 | jermar | 464 | <title>Virtual address translation</title> |
465 | |||
71 | bondari | 466 | <section> |
467 | <title>Introduction</title> |
||
67 | jermar | 468 | |
71 | bondari | 469 | <para>Virtual memory is usually using paged memory model, where |
470 | virtual memory address space is divided into the |
||
471 | <emphasis>pages</emphasis> (usually having size 4096 bytes) and |
||
472 | physical memory is divided into the frames (same sized as a page, of |
||
473 | course). Each page may be mapped to some frame and then, upon memory |
||
474 | access to the virtual address, CPU performs <emphasis>address |
||
475 | translation</emphasis> during the instruction execution. Non-existing |
||
476 | mapping generates page fault exception, calling kernel exception |
||
477 | handler, thus allowing kernel to manipulate rules of memory access. |
||
478 | Information for pages mapping is stored by kernel in the <link |
||
479 | linkend="page_tables">page tables</link></para> |
||
480 | |||
481 | <indexterm> |
||
482 | <primary>page tables</primary> |
||
483 | </indexterm> |
||
484 | |||
485 | <para>The majority of the architectures use multi-level page tables, |
||
486 | which means need to access physical memory several times before |
||
487 | getting physical address. This fact would make serios performance |
||
488 | overhead in virtual memory management. To avoid this <link |
||
489 | linkend="tlb">Traslation Lookaside Buffer (TLB)</link> is used.</para> |
||
490 | |||
70 | jermar | 491 | <para>HelenOS kernel has two different approaches to the paging |
492 | implementation: <emphasis>4 level page tables</emphasis> and |
||
71 | bondari | 493 | <emphasis>global hash table</emphasis>, which are accessible via |
70 | jermar | 494 | generic paging abstraction layer. Such different functionality was |
495 | caused by the major architectural differences between supported |
||
496 | platforms. This abstraction is implemented with help of the global |
||
497 | structure of pointers to basic mapping functions |
||
498 | <emphasis>page_mapping_operations</emphasis>. To achieve different |
||
499 | functionality of page tables, corresponding layer must implement |
||
500 | functions, declared in |
||
501 | <emphasis>page_mapping_operations</emphasis></para> |
||
67 | jermar | 502 | |
71 | bondari | 503 | <para>Thanks to the abstract paging interface, there was a place left |
504 | for more paging implementations (besides already implemented |
||
73 | bondari | 505 | hieararchical page tables and hash table), for example <indexterm> |
506 | <primary>B-tree</primary> |
||
507 | </indexterm> B-Tree based page tables.</para> |
||
71 | bondari | 508 | </section> |
67 | jermar | 509 | |
71 | bondari | 510 | <section id="page_tables"> |
511 | <indexterm> |
||
512 | <primary>page tables</primary> |
||
67 | jermar | 513 | |
72 | bondari | 514 | <secondary>- hierarchical</secondary> |
71 | bondari | 515 | </indexterm> |
67 | jermar | 516 | |
71 | bondari | 517 | <title>Hierarchical 4-level page tables</title> |
67 | jermar | 518 | |
71 | bondari | 519 | <para>Hierarchical 4-level page tables are the generalization of the |
520 | hardware capabilities of most architectures. Each address space has |
||
521 | its own page tables.<itemizedlist> |
||
522 | <listitem>ia32 uses 2-level page tables, with full hardware |
||
523 | support.</listitem> |
||
67 | jermar | 524 | |
71 | bondari | 525 | <listitem>amd64 uses 4-level page tables, also coming with full |
526 | hardware support.</listitem> |
||
67 | jermar | 527 | |
71 | bondari | 528 | <listitem>mips and ppc32 have 2-level tables, software simulated |
529 | support.</listitem> |
||
530 | </itemizedlist></para> |
||
67 | jermar | 531 | </section> |
532 | |||
71 | bondari | 533 | <section> |
534 | <indexterm> |
||
535 | <primary>page tables</primary> |
||
67 | jermar | 536 | |
72 | bondari | 537 | <secondary>- hashing</secondary> |
71 | bondari | 538 | </indexterm> |
67 | jermar | 539 | |
71 | bondari | 540 | <title>Global hash table</title> |
67 | jermar | 541 | |
71 | bondari | 542 | <para>Implementation of the global hash table was encouraged by the |
543 | ia64 architecture support. One of the major differences between global |
||
544 | hash table and hierarchical tables is that global hash table exists |
||
545 | only once in the system and the hierarchical tables are maintained per |
||
546 | address space.</para> |
||
67 | jermar | 547 | |
71 | bondari | 548 | <para>Thus, hash table contains information about all address spaces |
549 | mappings in the system, so, the hash of an entry must contain |
||
550 | information of both address space pointer or id and the virtual |
||
551 | address of the page. Generic hash table implementation assumes that |
||
552 | the addresses of the pointers to the address spaces are likely to be |
||
553 | on the close addresses, so it uses least significant bits for hash; |
||
554 | also it assumes that the virtual page addresses have roughly the same |
||
555 | probability of occurring, so the least significant bits of VPN compose |
||
556 | the hash index.</para> |
||
67 | jermar | 557 | |
74 | bondari | 558 | <para>Paging hash table uses generic hash table with collision chains |
559 | (see the <link linkend="hashtables">Data Structures</link> chapter of |
||
560 | this manual for details).</para> |
||
70 | jermar | 561 | </section> |
562 | </section> |
||
67 | jermar | 563 | |
71 | bondari | 564 | <section id="tlb"> |
565 | <indexterm> |
||
566 | <primary>TLB</primary> |
||
567 | </indexterm> |
||
67 | jermar | 568 | |
71 | bondari | 569 | <title>Translation Lookaside buffer</title> |
67 | jermar | 570 | |
71 | bondari | 571 | <para>Due to the extensive overhead during the page mapping lookup in |
572 | the page tables, all architectures has fast assotiative cache memory |
||
573 | built-in CPU. This memory called TLB stores recently used page table |
||
574 | entries.</para> |
||
575 | |||
576 | <section id="tlb_shootdown"> |
||
577 | <indexterm> |
||
578 | <primary>TLB</primary> |
||
579 | |||
72 | bondari | 580 | <secondary>- TLB shootdown</secondary> |
71 | bondari | 581 | </indexterm> |
582 | |||
583 | <title>TLB consistency. TLB shootdown algorithm.</title> |
||
584 | |||
585 | <para>Operating system is responsible for keeping TLB consistent by |
||
586 | invalidating the contents of TLB, whenever there is some change in |
||
587 | page tables. Those changes may occur when page or group of pages were |
||
588 | unmapped, mapping is changed or system switching active address space |
||
589 | to schedule a new system task. Moreover, this invalidation operation |
||
590 | must be done an all system CPUs because each CPU has its own |
||
591 | independent TLB cache. Thus maintaining TLB consistency on SMP |
||
592 | configuration as not as trivial task as it looks on the first glance. |
||
593 | Naive solution would assume that is the CPU which wants to invalidate |
||
594 | TLB will invalidate TLB caches on other CPUs. It is not possible on |
||
595 | the most of the architectures, because of the simple fact - flushing |
||
596 | TLB is allowed only on the local CPU and there is no possibility to |
||
597 | access other CPUs' TLB caches, thus invalidate TLB remotely.</para> |
||
598 | |||
599 | <para>Technique of remote invalidation of TLB entries is called "TLB |
||
600 | shootdown". HelenOS uses a variation of the algorithm described by D. |
||
601 | Black et al., "Translation Lookaside Buffer Consistency: A Software |
||
602 | Approach," Proc. Third Int'l Conf. Architectural Support for |
||
603 | Programming Languages and Operating Systems, 1989, pp. 113-122. <xref |
||
604 | linkend="Black89" /></para> |
||
605 | |||
606 | <para>As the situation demands, you will want partitial invalidation |
||
607 | of TLB caches. In case of simple memory mapping change it is necessary |
||
608 | to invalidate only one or more adjacent pages. In case if the |
||
609 | architecture is aware of ASIDs, when kernel needs to dump some ASID to |
||
610 | use by another task, it invalidates only entries from this particular |
||
611 | address space. Final option of the TLB invalidation is the complete |
||
612 | TLB cache invalidation, which is the operation that flushes all |
||
613 | entries in TLB.</para> |
||
614 | |||
615 | <para>TLB shootdown is performed in two phases.</para> |
||
616 | |||
617 | <formalpara> |
||
618 | <title>Phase 1.</title> |
||
619 | |||
620 | <para>First, initiator locks a global TLB spinlock, then request is |
||
621 | being put to the local request cache of every other CPU in the |
||
622 | system protected by its spinlock. In case the cache is full, all |
||
623 | requests in the cache are replaced by one request, indicating global |
||
624 | TLB flush. Then the initiator thread sends an IPI message indicating |
||
625 | the TLB shootdown request to the rest of the CPUs and waits actively |
||
626 | until all CPUs confirm TLB invalidating action execution by setting |
||
627 | up a special flag. After setting this flag this thread is blocked on |
||
628 | the TLB spinlock, held by the initiator.</para> |
||
629 | </formalpara> |
||
630 | |||
631 | <formalpara> |
||
632 | <title>Phase 2.</title> |
||
633 | |||
634 | <para>All CPUs are waiting on the TLB spinlock to execute TLB |
||
635 | invalidation action and have indicated their intention to the |
||
636 | initiator. Initiator continues, cleaning up its TLB and releasing |
||
637 | the global TLB spinlock. After this all other CPUs gain and |
||
638 | immidiately release TLB spinlock and perform TLB invalidation |
||
639 | actions.</para> |
||
640 | </formalpara> |
||
641 | </section> |
||
67 | jermar | 642 | </section> |
26 | bondari | 643 | </section> |
11 | bondari | 644 | </chapter> |