<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <para>In previous chapters, this book described the scheduling subsystem as
  the creator of the impression that threads execute in parallel. The memory
  management subsystem, on the other hand, creates the impression that there
  is enough physical memory for the kernel and that userspace tasks have the
  entire address space only for themselves.</para>

  <section>
    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <indexterm><primary>memory management</primary><secondary>memory zone</secondary></indexterm>
      <title>Zones and frames</title>

      <para>HelenOS represents contiguous areas of physical memory in
      structures called frame zones (abbreviated as zones). Each zone contains
      information about the number of allocated and unallocated physical
      memory frames as well as the physical base address of the zone and the
      number of frames contained in it. A zone also contains an array of frame
      structures describing each frame of the zone and, last but not least,
      each zone is equipped with a buddy system that facilitates effective
      allocation of power-of-two sized blocks of frames.</para>

      <para>This organization of physical memory provides good preconditions
      for hot-plugging of more zones. There is also one currently unused zone
      attribute: <code>flags</code>. The attribute could be used to give a
      special meaning to some zones in the future.</para>

      <para>The zones are linked in a doubly-linked list. This might seem a
      bit inefficient because the zone list is walked every time a frame is
      allocated or deallocated. However, this does not represent a significant
      performance problem as it is expected that the number of zones will be
      rather low. Moreover, most architectures merge all zones into
      one.</para>

      <para>For each physical memory frame found in a zone, there is a frame
      structure that contains the number of references and data used by the
      buddy system.</para>
    </section>

    <section id="frame_allocator">
      <indexterm><primary>memory management</primary><secondary>frame allocator</secondary></indexterm>
      <title>Frame allocator</title>

      <para>The frame allocator satisfies kernel requests to allocate
      power-of-two sized blocks of physical memory. Because of the zonal
      organization of physical memory, the frame allocator always works
      within the context of some frame zone. In order to carry out the
      allocation requests, the frame allocator is tightly integrated with the
      buddy system belonging to the zone. The frame allocator is also
      responsible for updating information about the number of free and busy
      frames in the zone. <figure>
          <mediaobject id="frame_alloc">
            <imageobject role="html">
              <imagedata fileref="images/frame_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images.vector/frame_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Frame allocator scheme.</title>
        </figure></para>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request via the <code>frame_alloc</code>
        function, the frame allocator first tries to find a zone that can
        satisfy the request (i.e. has the required amount of free frames).
        Once a suitable zone is found, the frame allocator uses the buddy
        allocator on the zone's buddy system to perform the allocation.
        During deallocation, which is triggered by a call to
        <code>frame_free</code>, the frame allocator looks up the respective
        zone that contains the frame being deallocated. Afterwards, it calls
        the buddy allocator again, this time to take care of deallocation
        within the zone's buddy system.</para>
      </formalpara>
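
      <para>The following C fragment sketches the allocation path just
      described. It is only an illustration: the structure layout and the
      helper names (<code>find_suitable_zone</code>,
      <code>buddy_alloc</code>, <code>block_index</code>) are assumptions
      made for this example, not the exact kernel interface.</para>

      <programlisting language="c"><![CDATA[
/* Illustrative sketch only; names do not copy the exact HelenOS API. */
typedef struct zone {
    uintptr_t base;               /* physical base address of the zone */
    size_t free_count;            /* number of unallocated frames */
    size_t busy_count;            /* number of allocated frames */
    buddy_system_t *buddy_system; /* buddy system serving this zone */
} zone_t;

/* Allocate a block of 2^order physical memory frames. */
uintptr_t frame_alloc(uint8_t order)
{
    /* Find a zone that has at least 2^order free frames. */
    zone_t *zone = find_suitable_zone(1 << order);
    if (!zone)
        return 0;  /* no zone can satisfy the request */

    /* Delegate the actual allocation to the zone's buddy system. */
    block_t *block = buddy_alloc(zone->buddy_system, order);

    /* The frame allocator keeps the zone accounting up to date. */
    zone->free_count -= (1 << order);
    zone->busy_count += (1 << order);

    return zone->base + block_index(zone, block) * FRAME_SIZE;
}
]]></programlisting>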
    </section>

    <section id="buddy_allocator">
      <indexterm><primary>memory management</primary><secondary>buddy system</secondary></indexterm>
      <title>Buddy allocator</title>

      <para>In the buddy system, the memory is broken down into power-of-two
      sized naturally aligned blocks. These blocks are organized in an array
      of lists, in which the list with index
      <mathphrase>i</mathphrase> contains all unallocated blocks of size
      <mathphrase>2<superscript>i</superscript></mathphrase>. The index
      <mathphrase>i</mathphrase> is called the order of the block. Should
      there be two adjacent equally sized blocks in the list
      <mathphrase>i</mathphrase> (i.e. buddies), the buddy allocator would
      coalesce them and put the resulting block in list
      <mathphrase>i + 1</mathphrase>, provided that the resulting block would
      be naturally aligned. Similarly, when the allocator is asked to
      allocate a block of size
      <mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
      to satisfy the request from the list with index
      <mathphrase>i</mathphrase>. If the request cannot be satisfied (i.e.
      the list <mathphrase>i</mathphrase> is empty), the buddy allocator will
      try to allocate and split a larger block from the list with index
      <mathphrase>i + 1</mathphrase>. Both of these algorithms are recursive.
      The recursion ends either when there are no blocks to coalesce in the
      former case or when there are no blocks that can be split in the latter
      case.</para>
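
      <para>The two recursive walks can be sketched in C as follows. The
      list helpers and the <code>block_t</code> type are illustrative
      assumptions; only the recursion mirrors the description above.</para>

      <programlisting language="c"><![CDATA[
/* Allocation: take a block of order i, splitting a bigger one on demand. */
block_t *buddy_alloc(buddy_system_t *b, uint8_t i)
{
    if (!list_empty(&b->order[i]))
        return list_pop(&b->order[i]);     /* request satisfied directly */

    if (i + 1 >= b->max_order)
        return NULL;                       /* no block left to split */

    block_t *big = buddy_alloc(b, i + 1);  /* recursion: get order i+1 */
    if (!big)
        return NULL;

    block_t *half = bisect(big, i);        /* big now has order i */
    list_push(&b->order[i], half);         /* the unused half stays free */
    return big;
}

/* Deallocation: coalesce with the buddy while it is free and aligned. */
void buddy_free(buddy_system_t *b, block_t *block, uint8_t i)
{
    block_t *buddy = find_buddy(b, block, i);  /* NULL if not free */
    if (buddy != NULL) {
        list_remove(&b->order[i], buddy);
        buddy_free(b, coalesce(block, buddy), i + 1);
    } else {
        list_push(&b->order[i], block);
    }
}
]]></programlisting>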

      <para>This approach greatly reduces external fragmentation of memory
      and helps in allocating larger contiguous blocks of memory aligned to
      their size. On the other hand, the buddy allocator suffers from
      increased internal fragmentation of memory and is not suitable for
      general kernel allocations. This purpose is better addressed by the
      <link linkend="slab">slab allocator</link>.<figure>
          <mediaobject id="buddy_alloc">
            <imageobject role="html">
              <imagedata fileref="images/buddy_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Buddy system scheme.</title>
        </figure></para>

      <section>
        <title>Implementation</title>

        <para>The buddy allocator is, in fact, an abstract framework which
        can be easily specialized to serve one particular task. It knows
        nothing about the nature of the memory it helps to allocate. To
        compensate for this lack of knowledge, the buddy allocator exports an
        interface that each of its clients is required to implement. When
        supplied with an implementation of this interface, the buddy
        allocator can use specialized external functions to find a buddy for
        a block, split and coalesce blocks, manipulate block order and mark
        blocks busy or available.</para>
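
        <para>A sketch of such a client-supplied interface follows. The
        operation names are modelled on the description above; the exact
        signatures in the kernel sources may differ.</para>

        <programlisting language="c"><![CDATA[
/* Operations every client of the buddy framework has to supply.
 * The framework itself never interprets the memory it manages. */
typedef struct buddy_system_operations {
    /* Return the buddy of a block, or NULL if the buddy is not free. */
    link_t *(*find_buddy)(buddy_system_t *b, link_t *block);
    /* Split a block in two and return its second half. */
    link_t *(*bisect)(buddy_system_t *b, link_t *block);
    /* Merge two buddies into one block of a higher order. */
    link_t *(*coalesce)(buddy_system_t *b, link_t *blk, link_t *buddy);
    /* Read and store the order kept inside a block. */
    uint8_t (*get_order)(buddy_system_t *b, link_t *block);
    void (*set_order)(buddy_system_t *b, link_t *block, uint8_t order);
    /* Mark a block as busy or available. */
    void (*mark_busy)(buddy_system_t *b, link_t *block);
    void (*mark_available)(buddy_system_t *b, link_t *block);
} buddy_system_operations_t;
]]></programlisting>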

        <formalpara>
          <title>Data organization</title>

          <para>Each entity allocable by the buddy allocator is required to
          contain space for storing the block order number and a link
          variable used to interconnect blocks within the same order.</para>

          <para>Whatever entities are allocated by the buddy allocator, the
          first entity within a block is used to represent the entire block.
          The first entity keeps the order of the whole block. Other entities
          within the block are assigned the magic value
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially
          important for effective identification of buddies in a
          one-dimensional array because the entity that represents a
          potential buddy cannot be associated with
          <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is associated
          with <constant>BUDDY_INNER_BLOCK</constant> then it is not a
          buddy).</para>
        </formalpara>
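
        <para>In a hypothetical client allocating frame structures, the
        required fields and the buddy test could look as follows; the field
        names are illustrative.</para>

        <programlisting language="c"><![CDATA[
typedef struct frame {
    uint8_t buddy_order;  /* order of the block; BUDDY_INNER_BLOCK in
                             every entity except the first one */
    link_t buddy_link;    /* link to other blocks of the same order */
    /* ... client-specific data ... */
} frame_t;

/* An entity can represent a buddy only if it represents a block. */
bool can_be_buddy(frame_t *frame)
{
    return frame->buddy_order != BUDDY_INNER_BLOCK;
}
]]></programlisting>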
      </section>
    </section>

    <section id="slab">
      <indexterm><primary>memory management</primary><secondary>slab</secondary></indexterm>
      <title>Slab allocator</title>

      <para>The majority of memory allocation requests in the kernel are for
      small, frequently used data structures. The basic idea behind the slab
      allocator is that commonly used objects are preallocated in contiguous
      areas of physical memory called slabs<footnote>
          <para>Slabs are in fact blocks of physical memory frames allocated
          from the frame allocator.</para>
        </footnote>. Whenever an object is to be allocated, the slab
      allocator returns the first available item from a suitable slab
      corresponding to the object type<footnote>
          <para>The mechanism is rather more complicated, see the next
          paragraph.</para>
        </footnote>. Because the sizes of the requested and allocated
      objects match, the slab allocator significantly reduces internal
      fragmentation.</para>

      <para>Slabs of one object type are organized in a structure called a
      slab cache. There are usually more slabs in the slab cache, depending
      on previous allocations. If the slab cache runs out of available slabs,
      new slabs are allocated. In order to exploit parallelism and to avoid
      locking of shared spinlocks, slab caches can have variants of
      processor-private slabs called magazines. On each processor, there is a
      two-magazine cache. Full magazines that are not part of any
      per-processor magazine cache are stored in a global list of full
      magazines.</para>

      <para>Each object begins its life in a slab. When it is allocated from
      there, the slab allocator calls a constructor that is registered in the
      respective slab cache. The constructor initializes and brings the
      object into a known state. The object is then used by the user. When
      the user later frees the object, the slab allocator puts it into a
      processor-private magazine cache, from where it can be subsequently
      allocated again. Note that allocations satisfied from a magazine are
      already initialized by the constructor. When both of the processor
      cached magazines get full, the allocator will move one of the magazines
      to the list of full magazines. Similarly, when allocating from an empty
      processor magazine cache, the kernel will reload only one magazine from
      the list of full magazines. In other words, the slab allocator tries to
      keep the processor magazine cache only half-full in order to prevent
      thrashing when allocations and deallocations interleave on magazine
      boundaries.</para>
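
      <para>The described life cycle can be illustrated on a hypothetical
      cache of <code>thread_t</code> objects. The signature of
      <code>slab_cache_create</code> and the helper names are simplified
      assumptions made for this example.</para>

      <programlisting language="c"><![CDATA[
/* Constructor: bring a freshly taken object into a known state. */
static int thread_constructor(void *obj, unsigned int flags)
{
    thread_t *thread = (thread_t *) obj;
    link_initialize(&thread->rq_link);
    return 0;
}

static slab_cache_t *thread_cache;

void threads_init(void)
{
    /* One cache per object type; the constructor is registered here. */
    thread_cache = slab_cache_create("thread_t", sizeof(thread_t),
        thread_constructor, NULL /* no destructor */, 0);
}

thread_t *thread_create(void)
{
    /* May be served from a magazine, in which case the object was
     * already initialized by the constructor in the past. */
    return slab_alloc(thread_cache);
}

void thread_destroy(thread_t *thread)
{
    /* Goes into the CPU-private magazine, not back into the slab. */
    slab_free(thread_cache, thread);
}
]]></programlisting>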

      <para>Should HelenOS run short of memory, it would start deallocating
      objects from magazines, calling the slab cache destructor on them and
      putting them back into slabs. When a slab contains no allocated
      objects, it is immediately freed.</para>

      <para><figure>
          <mediaobject id="slab_alloc">
            <imageobject role="html">
              <imagedata fileref="images/slab_alloc.png" format="PNG" />
            </imageobject>
          </mediaobject>

          <title>Slab allocator scheme.</title>
        </figure></para>

      <section>
        <title>Implementation</title>

        <para>The slab allocator is closely modelled after the OpenSolaris
        slab allocator by Jeff Bonwick and Jonathan Adams with the following
        exceptions:<itemizedlist>
            <listitem>
              <para>empty slabs are immediately deallocated</para>
            </listitem>

            <listitem>
              <para>empty magazines are deallocated when not needed</para>
            </listitem>
          </itemizedlist> The following features are not currently supported
        but would be easy to do: <itemizedlist>
            <listitem>
              <para>cache coloring</para>
            </listitem>

            <listitem>
              <para>dynamic magazine growing (different magazine sizes are
              already supported, but the allocation strategy would need to
              be adjusted)</para>
            </listitem>
          </itemizedlist></para>

        <section>
          <indexterm><primary>memory management</primary><secondary>slab magazine</secondary></indexterm>
          <title>Magazine layer</title>

250 | <para>Due to the extensive bottleneck on SMP architures, caused by |
||
251 | global slab locking mechanism, making processing of all slab |
||
252 | allocation requests serialized, a new layer was introduced to the |
||
253 | classic slab allocator design. Slab allocator was extended to |
||
254 | support per-CPU caches 'magazines' to achieve good SMP scaling. |
||
255 | <termdef>Slab SMP perfromance bottleneck was resolved by introducing |
||
256 | a per-CPU caching scheme called as <glossterm>magazine |
||
257 | layer</glossterm></termdef>.</para> |
||

          <para>A magazine is an N-element cache of objects, so each
          magazine can satisfy N allocations. A magazine behaves like an
          automatic weapon magazine (LIFO, stack), so allocation and
          deallocation become simple push and pop pointer operations. The
          trick is that the CPU does not access global slab allocator data
          during allocation from its magazine, thus making parallel
          allocations between CPUs possible.</para>
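
          <para>A minimal sketch of a magazine and its stack discipline
          follows; the structure layout is an assumption made for
          illustration.</para>

          <programlisting language="c"><![CDATA[
typedef struct magazine {
    size_t busy;   /* number of objects currently cached */
    size_t size;   /* capacity N of this magazine */
    void *objs[];  /* cached object pointers (LIFO stack) */
} magazine_t;

/* Allocation from a non-empty magazine is a plain pointer pop and
 * touches no shared slab allocator state. */
static void *magazine_pop(magazine_t *mag)
{
    return (mag->busy == 0) ? NULL : mag->objs[--mag->busy];
}

/* Deallocation into a non-full magazine is a plain pointer push. */
static bool magazine_push(magazine_t *mag, void *obj)
{
    if (mag->busy == mag->size)
        return false;  /* full; the caller must exchange magazines */
    mag->objs[mag->busy++] = obj;
    return true;
}
]]></programlisting>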

          <para>The implementation also requires an additional feature: the
          CPU-bound magazine is actually a pair of magazines, in order to
          avoid thrashing when allocations and deallocations of a single
          item occur at the magazine size boundary. LIFO order is enforced,
          which should avoid fragmentation as much as possible.</para>

          <para>Another important entity of the magazine layer is the common
          full magazine list (also called a depot), which stores full
          magazines that may be used by any of the CPU magazine caches to
          reload the active CPU magazine. This list of magazines can be
          pre-filled with full magazines during initialization, but in the
          current implementation it is filled during object deallocation,
          when a CPU magazine becomes full.</para>

          <para>Slab allocator control structures are allocated from special
          slabs, which are marked by a special flag indicating that they
          should not be used for the slab magazine layer. This is done to
          avoid possible infinite recursion and deadlock during conventional
          slab allocation requests.</para>
        </section>

        <section>
          <title>Allocation/deallocation</title>

          <para>Every cache contains a list of full slabs and a list of
          partially full slabs. Empty slabs are immediately freed (thrashing
          will be avoided because of the magazines).</para>

          <para>The slab allocator allocates a lot of space and does not
          free it on its own. When the frame allocator fails to allocate a
          frame, it calls <code>slab_reclaim()</code>. It tries 'light
          reclaim' first, then brutal reclaim. The light reclaim releases
          slabs from the CPU-shared magazine list until at least one slab is
          deallocated in each cache (this algorithm should probably change).
          The brutal reclaim removes all cached objects, even from CPU-bound
          magazines.</para>
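
          <para>The reclaim policy can be sketched as follows; the helper
          names and the cache iteration are illustrative assumptions.</para>

          <programlisting language="c"><![CDATA[
/* Called when the frame allocator fails to allocate a frame.
 * Returns the number of frames given back to the frame allocator. */
size_t slab_reclaim(bool brutal)
{
    size_t frames = 0;

    for (slab_cache_t *cache = first_cache(); cache != NULL;
        cache = next_cache(cache)) {
        /* Light reclaim: free magazines from the shared depot. */
        frames += depot_reclaim(cache);

        /* Brutal reclaim: additionally empty the CPU-bound magazines. */
        if (brutal)
            frames += cpu_magazines_reclaim(cache);
    }

    return frames;
}
]]></programlisting>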

          <formalpara>
            <title>Allocation</title>

            <para><emphasis>Step 1.</emphasis> When an allocation request
            arrives, the slab allocator first checks the availability of
            memory in the local CPU-bound magazine. If the memory is there,
            the allocator simply "pops" the CPU magazine and returns the
            pointer to the object.</para>

            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
            empty, the allocator will attempt to reload the magazine,
            swapping it with the second CPU magazine, and return to the
            first step.</para>

            <para><emphasis>Step 3.</emphasis> Now we are in the situation
            when both CPU-bound magazines are empty, which makes the
            allocator access the shared full-magazines depot to reload the
            CPU-bound magazines. If the reload is successful (meaning there
            are full magazines in the depot), the algorithm continues at
            Step 1.</para>

            <para><emphasis>Step 4.</emphasis> In the final step of the
            allocation, the object is allocated from the conventional slab
            layer and a pointer is returned.</para>
          </formalpara>
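
          <para>The four steps translate into the following loop; all names
          are illustrative and the necessary locking is omitted for
          brevity.</para>

          <programlisting language="c"><![CDATA[
void *slab_alloc(slab_cache_t *cache)
{
    cpu_magazines_t *cc = &cache->mag_cache[CPU->id];

    while (true) {
        /* Step 1: pop from the active CPU-bound magazine. */
        void *obj = magazine_pop(cc->current);
        if (obj != NULL)
            return obj;

        /* Step 2: active magazine empty; swap with the second one. */
        if (!magazine_empty(cc->last)) {
            magazine_swap(&cc->current, &cc->last);
            continue;
        }

        /* Step 3: both empty; reload one from the full-magazines depot. */
        magazine_t *full = depot_get_full(cache);
        if (full != NULL) {
            depot_put_empty(cache, cc->current);
            cc->current = full;
            continue;
        }

        /* Step 4: fall back to the conventional slab layer. */
        return slab_obj_create(cache);
    }
}
]]></programlisting>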

          <formalpara>
            <title>Deallocation</title>

            <para><emphasis>Step 1.</emphasis> During a deallocation
            request, the slab allocator checks whether the local CPU-bound
            magazine is full. If it is not, the pointer to the object is
            simply pushed into the magazine.</para>

            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
            full, the allocator will attempt to reload the magazine,
            swapping it with the second CPU magazine, and return to the
            first step.</para>

            <para><emphasis>Step 3.</emphasis> Now we are in the situation
            when both CPU-bound magazines are full, which makes the
            allocator access the shared full-magazines depot to put one of
            the magazines into the depot and create a new empty magazine.
            The algorithm continues at Step 1.</para>
          </formalpara>
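
          <para>The deallocation path mirrors the allocation path; the same
          illustrative names are used.</para>

          <programlisting language="c"><![CDATA[
void slab_free(slab_cache_t *cache, void *obj)
{
    cpu_magazines_t *cc = &cache->mag_cache[CPU->id];

    while (true) {
        /* Step 1: push into the active magazine if it is not full. */
        if (magazine_push(cc->current, obj))
            return;

        /* Step 2: active magazine full; swap with the second one. */
        if (!magazine_full(cc->last)) {
            magazine_swap(&cc->current, &cc->last);
            continue;
        }

        /* Step 3: both full; hand one over to the depot and install
         * a fresh empty magazine, then retry Step 1. */
        depot_put_full(cache, cc->current);
        cc->current = magazine_empty_create(cache);
    }
}
]]></programlisting>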
        </section>
      </section>
    </section>

    <!-- End of Physmem -->
  </section>

  <section>
    <title>Virtual memory management</title>

    <section>
      <title>Introduction</title>

      <para>Virtual memory is a special memory management technique used by
      the kernel to achieve several mission-critical goals: <itemizedlist>
          <listitem>
            <para>Isolate each task from the other tasks that are running on
            the system at the same time.</para>
          </listitem>

          <listitem>
            <para>Allow allocation of more memory than is physically present
            in the machine.</para>
          </listitem>

          <listitem>
            <para>Allow, in general, loading and executing two programs that
            are linked to the same address without complicated
            relocations.</para>
          </listitem>
        </itemizedlist></para>

      <para><!--

        <para>
        Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
        Special address space area type - device - prohibits shrink/extend syscalls to call on it.
        Address space has link to mapping tables (hierarchical - per Address space, hash - global tables).
        </para>

      --></para>
    </section>

    <section>
      <title>Address spaces</title>

      <section>
        <indexterm><primary>memory management</primary><secondary>address space area</secondary></indexterm>
        <title>Address space areas</title>

        <para>Each address space consists of mutually disjoint contiguous
        address space areas. An address space area is precisely defined by
        its base address and the number of frames/pages it contains.</para>

        <para>Each address space area has flags that define the behaviour
        and permissions on the particular area. <itemizedlist>
            <listitem>
              <para>The <emphasis>AS_AREA_READ</emphasis> flag indicates
              reading permission.</para>
            </listitem>

            <listitem>
              <para>The <emphasis>AS_AREA_WRITE</emphasis> flag indicates
              writing permission.</para>
            </listitem>

            <listitem>
              <para>The <emphasis>AS_AREA_EXEC</emphasis> flag indicates
              code execution permission. Some architectures do not support
              execution permission restriction. In this case this flag has
              no effect.</para>
            </listitem>

            <listitem>
              <para>The <emphasis>AS_AREA_DEVICE</emphasis> flag marks the
              area as mapped to device memory.</para>
            </listitem>
          </itemizedlist></para>
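
        <para>The flags can be thought of as a bitmask checked on every
        faulting access; the values below and the check itself are purely
        illustrative assumptions.</para>

        <programlisting language="c"><![CDATA[
/* Illustrative values; the real definitions may differ. */
#define AS_AREA_READ    (1 << 0)
#define AS_AREA_WRITE   (1 << 1)
#define AS_AREA_EXEC    (1 << 2)
#define AS_AREA_DEVICE  (1 << 3)

/* Does the faulting access respect the permissions of the area? */
bool as_area_access_allowed(unsigned int area_flags, unsigned int access)
{
    return (area_flags & access) == access;
}
]]></programlisting>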

        <para>The kernel provides a set of syscalls that allow tasks to
        create, expand, shrink and share their address space areas.</para>
      </section>

      <section>
        <indexterm><primary>memory management</primary><secondary>asid</secondary></indexterm>
        <title>Address Space ID (ASID)</title>

        <para>When switching to a different task, the kernel also needs to
        switch to a different address space mapping. If the TLB cannot
        distinguish between address space mappings, all mapping information
        in the TLB from the old address space must be flushed, which can
        create unnecessary overhead during task switching. To avoid this,
        some architectures have the capability to segregate different
        address spaces at the hardware level by introducing an address space
        identifier as a part of the TLB record, telling the virtual address
        translation unit to which address space this record is
        applicable.</para>

        <para>The HelenOS kernel can take advantage of this
        hardware-supported identifier by having an ASID abstraction which is
        related to the corresponding architecture identifier. For example,
        on ia64 the kernel ASID is derived from the RID (region identifier)
        and on mips32 the kernel ASID is actually the hardware identifier.
        As expected, this ASID information record is part of the
        <emphasis>as_t</emphasis> structure.</para>

        <para>Due to hardware limitations, the hardware ASID has a limited
        length, ranging from 8 bits on mips32 to 24 bits on ia64, which
        makes it impossible to use it as a unique address space identifier
        for all tasks running in the system. In such situations, a special
        ASID stealing algorithm is used, which takes the ASID from an
        inactive task and assigns it to the active task.</para>
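
        <para>A sketch of such a stealing policy follows; every name in it
        is illustrative, as the concrete algorithm is architecture
        dependent.</para>

        <programlisting language="c"><![CDATA[
/* Return an ASID for the address space being activated. */
asid_t asid_get(void)
{
    if (free_asids_count() > 0)
        return asid_find_free();

    /* Pool exhausted: steal the ASID of an inactive address space. */
    as_t *victim = inactive_as_with_asid_pick();
    asid_t asid = victim->asid;
    victim->asid = ASID_INVALID;  /* victim re-acquires an ASID later */

    /* Entries tagged with the stolen ASID are now stale. */
    tlb_invalidate_asid(asid);
    return asid;
}
]]></programlisting>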

        <para><classname>ASID stealing algorithm here.</classname></para>
      </section>
    </section>

    <section>
      <title>Virtual address translation</title>

      <section id="paging">
        <title>Paging</title>

        <section>
          <title>Introduction</title>

          <para>Virtual memory usually uses a paged memory model, where the
          virtual address space is divided into <emphasis>pages</emphasis>
          (usually 4096 bytes in size) and physical memory is divided into
          frames (of the same size as a page, of course). Each page may be
          mapped to some frame and then, upon a memory access to a virtual
          address, the CPU performs <emphasis>address
          translation</emphasis> during the instruction execution. A
          non-existing mapping generates a page fault exception, which
          invokes the kernel exception handler, thus allowing the kernel to
          manipulate the rules of memory access. Information about page
          mappings is stored by the kernel in page tables.</para>

          <para>The majority of architectures use multi-level page tables,
          which means that physical memory must be accessed several times
          before the physical address is obtained. This fact would cause
          serious performance overhead in virtual memory management. To
          avoid it, the <link linkend="tlb">Translation Lookaside Buffer
          (TLB)</link> is used.</para>

          <para>The HelenOS kernel has two different approaches to the
          paging implementation: <emphasis>4-level page tables</emphasis>
          and a <emphasis>global hash table</emphasis>, which are accessible
          via a generic paging abstraction layer. These two variants exist
          because of the major architectural differences between the
          supported platforms. The abstraction is implemented with the help
          of a global structure of pointers to the basic mapping functions,
          <emphasis>page_mapping_operations</emphasis>. To achieve different
          page table functionality, the corresponding layer must implement
          the functions declared in
          <emphasis>page_mapping_operations</emphasis>.</para>
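
          <para>The shape of such an operations structure might be as
          follows; the exact signatures are simplified for
          illustration.</para>

          <programlisting language="c"><![CDATA[
typedef struct {
    /* Map virtual page to physical frame in address space as. */
    void (*mapping_insert)(as_t *as, uintptr_t page, uintptr_t frame,
        unsigned int flags);
    /* Undo a previously inserted mapping. */
    void (*mapping_remove)(as_t *as, uintptr_t page);
    /* Find the page table entry for page, or return NULL. */
    pte_t *(*mapping_find)(as_t *as, uintptr_t page);
} page_mapping_operations_t;

/* Chosen during boot: hierarchical page tables or global hash table. */
extern page_mapping_operations_t *page_mapping_operations;

/* Generic wrapper used by the rest of the kernel. */
void page_mapping_insert(as_t *as, uintptr_t page, uintptr_t frame,
    unsigned int flags)
{
    page_mapping_operations->mapping_insert(as, page, frame, flags);
}
]]></programlisting>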

          <para>Thanks to the abstract paging interface, room was left for
          more paging implementations (besides the already implemented
          hierarchical page tables and the hash table), for example B-Tree
          based page tables.</para>
        </section>

        <section>
          <title>Hierarchical 4-level page tables</title>

          <para>Hierarchical 4-level page tables are a generalization of the
          hardware capabilities of most architectures. Each address space
          has its own page tables.<itemizedlist>
              <listitem>
                <para>ia32 uses 2-level page tables, with full hardware
                support.</para>
              </listitem>

              <listitem>
                <para>amd64 uses 4-level page tables, also coming with full
                hardware support.</para>
              </listitem>

              <listitem>
                <para>mips32 and ppc32 have 2-level page tables with
                software-simulated support.</para>
              </listitem>
            </itemizedlist></para>
        </section>

        <section>
          <indexterm><primary>memory management</primary><secondary>hash table</secondary></indexterm>
          <title>Global hash table</title>

          <para>The implementation of the global hash table was encouraged
          by the ia64 architecture support. One of the major differences
          between the global hash table and the hierarchical tables is that
          the global hash table exists only once in the system, whereas the
          hierarchical tables are maintained per address space.</para>

          <para>Thus, the hash table contains information about all address
          space mappings in the system, so the hash of an entry must be
          derived from both the address space pointer (or id) and the
          virtual address of the page. The generic hash table implementation
          assumes that the pointers to the different address spaces are
          likely to lie close to each other, so it uses their least
          significant bits for the hash; it also assumes that the virtual
          page addresses have roughly the same probability of occurring, so
          the least significant bits of the VPN compose the hash
          index.</para>
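
          <para>Combining the two assumptions gives a hash of roughly this
          shape; the constants and the mixing of the two components are
          illustrative.</para>

          <programlisting language="c"><![CDATA[
#define PAGE_WIDTH  12                       /* 4 KiB pages (example) */
#define HASH_BITS   20                       /* example table size */
#define HASH_MASK   ((1UL << HASH_BITS) - 1)

/* Hash of (address space, virtual address) for the global table. */
size_t pte_hash(as_t *as, uintptr_t vaddr)
{
    uintptr_t vpn = vaddr >> PAGE_WIDTH;     /* virtual page number */

    /* The least significant bits of both the VPN and the address
     * space pointer contribute to the index. */
    return ((vpn ^ (uintptr_t) as) & HASH_MASK);
}
]]></programlisting>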

          <para>Collision chains ...</para>
        </section>
      </section>

      <section id="tlb">
        <indexterm><primary>memory management</primary><secondary>tlb</secondary></indexterm>
        <title>Translation Lookaside Buffer</title>

        <para>Due to the extensive overhead of page mapping lookups in the
        page tables, all architectures have a fast associative cache memory
        built into the CPU. This memory, called the TLB, stores recently
        used page table entries.</para>

        <section id="tlb_shootdown">
          <title>TLB consistency. TLB shootdown algorithm.</title>

          <para>The operating system is responsible for keeping the TLB
          consistent by invalidating its contents whenever there is some
          change in the page tables. Those changes may occur when a page or
          a group of pages is unmapped, when a mapping is changed or when
          the system switches the active address space in order to schedule
          a new task. Moreover, this invalidation operation must be done on
          all system CPUs because each CPU has its own independent TLB
          cache. Thus, maintaining TLB consistency in an SMP configuration
          is not as trivial a task as it might look at first glance. A naive
          solution would assume that the CPU which wants to invalidate the
          TLB can simply invalidate the TLB caches of other CPUs as well.
          This is not possible on most architectures because of the simple
          fact that flushing the TLB is allowed only on the local CPU and
          there is no way to access other CPUs' TLB caches and thus
          invalidate them remotely.</para>

581 | <para>Technique of remote invalidation of TLB entries is called "TLB |
||
582 | shootdown". HelenOS uses a variation of the algorithm described by |
||
583 | D. Black et al., "Translation Lookaside Buffer Consistency: A |
||
584 | Software Approach," Proc. Third Int'l Conf. Architectural Support |
||
585 | for Programming Languages and Operating Systems, 1989, pp. |
||
586 | 113-122.</para> |
||
587 | |||
          <para>As the situation demands, partial invalidation of the TLB
          caches may be needed. In the case of a simple memory mapping
          change, it is necessary to invalidate only one or several adjacent
          pages. If the architecture is aware of ASIDs, then whenever the
          kernel needs to take an ASID away so that it can be used by
          another task, it invalidates only the entries belonging to that
          particular address space. The final option is complete TLB cache
          invalidation, an operation that flushes all entries in the
          TLB.</para>

          <para>TLB shootdown is performed in two phases.</para>

          <formalpara>
            <title>Phase 1.</title>

            <para>First, the initiator locks the global TLB spinlock. Then a
            request is put into the local request cache of every other CPU
            in the system, each cache being protected by its own spinlock.
            In case a cache is full, all requests in that cache are replaced
            by one request indicating a global TLB flush. The initiator then
            sends an IPI message indicating the TLB shootdown request to the
            rest of the CPUs and actively waits until all of them confirm
            the TLB invalidation request by setting a special flag. After
            setting this flag, each of these CPUs blocks on the global TLB
            spinlock, which is held by the initiator.</para>
          </formalpara>

          <formalpara>
            <title>Phase 2.</title>

            <para>At this point, all CPUs are waiting on the TLB spinlock to
            execute their TLB invalidation actions and have indicated this
            intention to the initiator. The initiator continues by cleaning
            up its own TLB and then releases the global TLB spinlock. After
            that, the other CPUs acquire and immediately release the TLB
            spinlock, and perform their TLB invalidation actions.</para>
          </formalpara>
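
          <para>Both phases are condensed into the following sketch; the
          names are illustrative and the per-CPU message bookkeeping is
          simplified.</para>

          <programlisting language="c"><![CDATA[
/* Executed by the initiator of the shootdown. */
void tlb_shootdown(as_t *as, uintptr_t page, size_t count)
{
    /* Phase 1: publish the request and gather acknowledgements. */
    spinlock_lock(&tlb_global_lock);
    for (size_t i = 0; i < cpu_count; i++) {
        if (i == CPU->id)
            continue;
        spinlock_lock(&cpus[i].tlb_messages_lock);
        if (!message_enqueue(&cpus[i], as, page, count))
            message_make_global_flush(&cpus[i]);  /* cache was full */
        spinlock_unlock(&cpus[i].tlb_messages_lock);
    }
    ipi_broadcast(IPI_TLB_SHOOTDOWN);
    wait_for_all_acknowledgements();  /* active waiting; the other CPUs
                                         then block on tlb_global_lock */

    /* The initiator invalidates its own TLB entries. */
    tlb_invalidate_local(as, page, count);

    /* Phase 2: releasing the lock lets the blocked CPUs acquire and
     * immediately release it, then process their message caches. */
    spinlock_unlock(&tlb_global_lock);
}
]]></programlisting>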
        </section>
      </section>
    </section>
  </section>
</chapter>