<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <para>In previous chapters, this book described the scheduling subsystem as
  the creator of the impression that threads execute in parallel. The memory
  management subsystem, on the other hand, creates the impression that there
  is enough physical memory for the kernel and that each userspace task has
  the entire address space to itself.</para>

  <section>
    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <title>Zones and frames</title>

      <para>HelenOS represents continuous areas of physical memory in
      structures called frame zones (abbreviated as zones). Each zone
      contains information about the number of allocated and unallocated
      physical memory frames as well as the physical base address of the
      zone and the number of frames contained in it. A zone also contains an
      array of frame structures describing each frame of the zone and, last
      but not least, each zone is equipped with a buddy system that
      facilitates effective allocation of power-of-two sized blocks of
      frames.</para>

      <para>This organization of physical memory provides good preconditions
      for hot-plugging of more zones. There is also one currently unused
      zone attribute: <code>flags</code>. The attribute could be used to
      give a special meaning to some zones in the future.</para>

      <para>The zones are linked in a doubly-linked list. This might seem a
      bit inefficient because the zone list is walked every time a frame is
      allocated or deallocated. However, this does not represent a
      significant performance problem as it is expected that the number of
      zones will be rather low. Moreover, most architectures merge all zones
      into one.</para>

      <para>Every physical memory frame in a zone is described by a
      structure that contains the number of references and other data used
      by the buddy system.</para>
    </section>
| 45 | |||
| 46 | <section id="frame_allocator"> |
||
| 71 | bondari | 47 | <indexterm> |
| 48 | <primary>frame allocator</primary> |
||
| 49 | </indexterm> |
||
| 50 | |||
| 64 | jermar | 51 | <title>Frame allocator</title> |
| 52 | |||
| 67 | jermar | 53 | <para>The frame allocator satisfies kernel requests to allocate |
| 54 | power-of-two sized blocks of physical memory. Because of zonal |
||
| 55 | organization of physical memory, the frame allocator is always working |
||
| 76 | palkovsky | 56 | within a context of a particular frame zone. In order to carry out the |
| 67 | jermar | 57 | allocation requests, the frame allocator is tightly integrated with the |
| 58 | buddy system belonging to the zone. The frame allocator is also |
||
| 59 | responsible for updating information about the number of free and busy |
||
| 60 | frames in the zone. <figure> |
||
| 61 | <mediaobject id="frame_alloc"> |
||
| 77 | bondari | 62 | <imageobject role="eps"> |
| 63 | <imagedata fileref="images.vector/frame_alloc.eps" format="EPS" /> |
||
| 64 | </imageobject> |
||
| 65 | |||
| 67 | jermar | 66 | <imageobject role="html"> |
| 67 | <imagedata fileref="images/frame_alloc.png" format="PNG" /> |
||
| 68 | </imageobject> |
||
| 64 | jermar | 69 | |
| 67 | jermar | 70 | <imageobject role="fop"> |
| 71 | <imagedata fileref="images.vector/frame_alloc.svg" format="SVG" /> |
||
| 72 | </imageobject> |
||
| 73 | </mediaobject> |
||
| 64 | jermar | 74 | |
| 67 | jermar | 75 | <title>Frame allocator scheme.</title> |
| 76 | </figure></para> |

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request via the <code>frame_alloc()</code>
        function, the frame allocator first tries to find a zone that can
        satisfy the request (i.e. has the required amount of free frames).
        Once a suitable zone is found, the frame allocator uses the buddy
        allocator on the zone's buddy system to perform the allocation.
        During deallocation, which is triggered by a call to
        <code>frame_free()</code>, the frame allocator looks up the
        respective zone that contains the frame being deallocated.
        Afterwards, it calls the buddy allocator again, this time to take
        care of deallocation within the zone's buddy system.</para>
      </formalpara>
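<para>The lookup-then-delegate flow can be sketched in plain C. This is an
illustrative model only, not the HelenOS implementation: the zone fields
and helper names are invented for the example, and the real allocator
additionally maintains per-frame structures and locking.</para>

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified zone descriptor (illustrative only). */
typedef struct zone {
    uintptr_t base;        /* physical base address of the zone */
    size_t frames;         /* total number of frames in the zone */
    size_t free_frames;    /* number of currently unallocated frames */
    struct zone *next;     /* zones are kept in a linked list */
} zone_t;

/* Walk the zone list and pick the first zone that can satisfy a
 * request for 2^order frames, as frame_alloc() does before it
 * delegates the request to the zone's buddy system. */
zone_t *zone_find_for_alloc(zone_t *head, unsigned order)
{
    size_t needed = (size_t) 1 << order;
    for (zone_t *z = head; z != NULL; z = z->next) {
        if (z->free_frames >= needed)
            return z;
    }
    return NULL;  /* no zone can satisfy the request */
}

/* For frame_free(), the zone is looked up by the physical address
 * of the frame being deallocated. */
zone_t *zone_find_by_addr(zone_t *head, uintptr_t addr, size_t frame_size)
{
    for (zone_t *z = head; z != NULL; z = z->next) {
        if (addr >= z->base && addr < z->base + z->frames * frame_size)
            return z;
    }
    return NULL;
}
```

<para>Once the zone is found, both operations delegate the rest of the work
to the buddy system associated with that zone.</para>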
    </section>

    <section id="buddy_allocator">
      <indexterm>
        <primary>buddy system</primary>
      </indexterm>

      <title>Buddy allocator</title>

      <para>In the buddy system, the memory is broken down into power-of-two
      sized naturally aligned blocks. These blocks are organized in an array
      of lists, in which the list with index <emphasis>i</emphasis> contains
      all unallocated blocks of size
      <emphasis>2<superscript>i</superscript></emphasis>. The index
      <emphasis>i</emphasis> is called the order of the block. Should there
      be two adjacent equally sized blocks in the list
      <emphasis>i</emphasis> (i.e. buddies), the buddy allocator coalesces
      them and puts the resulting block in the list <emphasis>i +
      1</emphasis>, provided that the resulting block would be naturally
      aligned. Similarly, when the allocator is asked to allocate a block of
      size <emphasis>2<superscript>i</superscript></emphasis>, it first
      tries to satisfy the request from the list with index
      <emphasis>i</emphasis>. If the request cannot be satisfied (i.e. the
      list <emphasis>i</emphasis> is empty), the buddy allocator will try to
      allocate and split a larger block from the list with index
      <emphasis>i + 1</emphasis>. Both of these algorithms are recursive.
      The recursion ends either when there are no blocks to coalesce in the
      former case or when there are no blocks that can be split in the
      latter case.</para>
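<para>The index arithmetic behind buddy coalescing and splitting can be
shown in a few lines of C. These helpers operate on frame indices within a
zone and are written for this illustration; they are not the HelenOS
functions themselves.</para>

```c
#include <stdbool.h>
#include <stddef.h>

/* For naturally aligned power-of-two blocks, the buddy of the block
 * starting at frame index `idx` with order `order` (size 2^order
 * frames) is found by flipping bit `order` of the index. */
size_t buddy_index(size_t idx, unsigned order)
{
    return idx ^ ((size_t) 1 << order);
}

/* After coalescing, the combined block of order `order + 1` starts
 * at the lower of the two buddy indices and is naturally aligned. */
size_t coalesced_index(size_t idx, unsigned order)
{
    return idx & ~((size_t) 1 << order);
}

/* A block of order `order` is naturally aligned iff its index is a
 * multiple of 2^order; only such blocks may appear on list `order`. */
bool naturally_aligned(size_t idx, unsigned order)
{
    return (idx & (((size_t) 1 << order) - 1)) == 0;
}
```

<para>The natural-alignment test is what guarantees that coalescing two
order-<emphasis>i</emphasis> buddies always yields a valid
order-<emphasis>i + 1</emphasis> block.</para>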

      <para>This approach greatly reduces external fragmentation of memory
      and helps in allocating bigger continuous blocks of memory aligned to
      their size. On the other hand, the buddy allocator suffers from
      increased internal fragmentation of memory and is not suitable for
      general kernel allocations. This purpose is better addressed by the
      <link linkend="slab">slab allocator</link>.<figure>
          <mediaobject id="buddy_alloc">
            <imageobject role="eps">
              <imagedata fileref="images.vector/buddy_alloc.eps" format="EPS" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/buddy_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Buddy system scheme.</title>
        </figure></para>

      <section>
        <title>Implementation</title>

        <para>The buddy allocator is, in fact, an abstract framework which
        can be easily specialized to serve one particular task. It knows
        nothing about the nature of the memory it helps to allocate. To
        compensate for this lack of knowledge, the buddy allocator exports
        an interface that each of its clients is required to implement. When
        supplied with an implementation of this interface, the buddy
        allocator can use specialized external functions to find a buddy for
        a block, split and coalesce blocks, manipulate block order and mark
        blocks busy or available.</para>
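<para>Such a client interface can be pictured as a structure of function
pointers. The sketch below is modeled on the description above; the member
names, signatures and the demo client are invented for this illustration
and differ from the actual HelenOS declarations.</para>

```c
#include <stddef.h>
#include <stdint.h>

typedef struct buddy_system buddy_system_t;
typedef struct link { struct link *next; } link_t;

/* Client-supplied operations through which the abstract buddy
 * framework manipulates blocks it knows nothing about. */
typedef struct buddy_system_operations {
    link_t *(*find_buddy)(buddy_system_t *b, link_t *block);
    link_t *(*bisect)(buddy_system_t *b, link_t *block);
    link_t *(*coalesce)(buddy_system_t *b, link_t *blk1, link_t *blk2);
    uint8_t (*get_order)(buddy_system_t *b, link_t *block);
    void (*set_order)(buddy_system_t *b, link_t *block, uint8_t order);
    void (*mark_busy)(buddy_system_t *b, link_t *block);
    void (*mark_available)(buddy_system_t *b, link_t *block);
} buddy_system_operations_t;

struct buddy_system {
    const buddy_system_operations_t *op;  /* client implementation */
    uint8_t max_order;                    /* largest supported order */
};

/* A trivial demo client used only to show the dispatch: it stores
 * the order directly behind the link. */
typedef struct { link_t link; uint8_t order; } demo_block_t;

static uint8_t demo_get_order(buddy_system_t *b, link_t *block)
{
    (void) b;
    return ((demo_block_t *) block)->order;
}

static void demo_set_order(buddy_system_t *b, link_t *block, uint8_t order)
{
    (void) b;
    ((demo_block_t *) block)->order = order;
}
```

<para>The framework itself only ever calls through <code>op</code>, which
is what keeps it independent of the entities being allocated.</para>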

        <formalpara>
          <title>Data organization</title>

          <para>Each entity allocable by the buddy allocator is required to
          contain space for storing the block order number and a link
          variable used to interconnect blocks within the same order.</para>

          <para>Whatever entities are allocated by the buddy allocator, the
          first entity within a block is used to represent the entire block.
          The first entity keeps the order of the whole block. Other
          entities within the block are assigned the magic value
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially
          important for effective identification of buddies in a
          one-dimensional array because the entity that represents a
          potential buddy cannot be associated with
          <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is associated
          with <constant>BUDDY_INNER_BLOCK</constant> then it is not a
          buddy).</para>
        </formalpara>
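<para>The buddy-identification rule can be sketched over a one-dimensional
array of frame structures. The sentinel value and field layout below are
assumptions made for this example; only the principle, that inner entities
are distinguishable from block-starting entities, comes from the text.</para>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel order value marking an entity that lies inside a larger
 * block (the value itself is invented for this sketch). */
#define BUDDY_INNER_BLOCK  0xff

typedef struct {
    uint8_t order;  /* block order, or BUDDY_INNER_BLOCK */
} frame_t;

/* The entity at the buddy's index represents a real candidate buddy
 * only if it is not marked as an inner part of a bigger block. */
bool is_potential_buddy(frame_t *frames, size_t count, size_t idx,
                        unsigned order)
{
    size_t buddy = idx ^ ((size_t) 1 << order);
    if (buddy >= count)
        return false;
    return frames[buddy].order != BUDDY_INNER_BLOCK;
}
```
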
      </section>
    </section>

    <section id="slab">
      <indexterm>
        <primary>slab allocator</primary>
      </indexterm>

      <title>Slab allocator</title>

      <para>The majority of memory allocation requests in the kernel are for
      small, frequently used data structures. The basic idea behind the slab
      allocator is that commonly used objects are preallocated in continuous
      areas of physical memory called slabs<footnote>
          <para>Slabs are in fact blocks of physical memory frames allocated
          from the frame allocator.</para>
        </footnote>. Whenever an object is to be allocated, the slab
      allocator returns the first available item from a suitable slab
      corresponding to the object type<footnote>
          <para>The mechanism is rather more complicated, see the next
          paragraph.</para>
        </footnote>. Because the sizes of the requested and allocated
      objects match, the slab allocator significantly reduces internal
      fragmentation.</para>

      <indexterm>
        <primary>slab allocator</primary>

        <secondary>- slab cache</secondary>
      </indexterm>

      <para>Slabs of one object type are organized in a structure called a
      slab cache. There are usually more slabs in the slab cache, depending
      on previous allocations. If the slab cache runs out of available
      slabs, new slabs are allocated. In order to exploit parallelism and to
      avoid locking of shared spinlocks, slab caches can have variants of
      processor-private slabs called magazines. On each processor, there is
      a two-magazine cache. Full magazines that are not part of any
      per-processor magazine cache are stored in a global list of full
      magazines.</para>

      <indexterm>
        <primary>slab allocator</primary>

        <secondary>- magazine</secondary>
      </indexterm>

      <para>Each object begins its life in a slab. When it is allocated from
      there, the slab allocator calls a constructor that is registered in
      the respective slab cache. The constructor initializes and brings the
      object into a known state. The object is then used by the user. When
      the user later frees the object, the slab allocator puts it into a
      processor-private <indexterm>
          <primary>slab allocator</primary>

          <secondary>- magazine</secondary>
        </indexterm>magazine cache, from where it can be subsequently
      allocated again. Note that allocations satisfied from a magazine are
      already initialized by the constructor. When both of the processor
      cached magazines get full, the allocator will move one of the
      magazines to the list of full magazines. Similarly, when allocating
      from an empty processor magazine cache, the kernel will reload only
      one magazine from the list of full magazines. In other words, the slab
      allocator tries to keep the processor magazine cache only half-full in
      order to prevent thrashing when allocations and deallocations
      interleave on magazine boundaries. The advantage of this setup is that
      during most of the allocations, no global spinlock needs to be
      held.</para>

      <para>Should HelenOS run short of memory, it would start deallocating
      objects from magazines, calling the slab cache destructor on them and
      putting them back into slabs. When a slab contains no allocated
      objects, it is immediately freed.</para>
      <para>
        <figure>
          <mediaobject id="slab_alloc">
            <imageobject role="eps">
              <imagedata fileref="images.vector/slab_alloc.eps" format="EPS" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/slab_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images.vector/slab_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Slab allocator scheme.</title>
        </figure>
      </para>

      <section>
        <title>Implementation</title>

        <para>The slab allocator is closely modelled after <xref
        linkend="Bonwick01" /> with the following exceptions:<itemizedlist>
            <listitem>
              <para>empty slabs are deallocated immediately and</para>
            </listitem>

            <listitem>
              <para>empty magazines are deallocated when not needed.</para>
            </listitem>
          </itemizedlist>The following features are not currently supported
        but would be easy to add: <itemizedlist>
            <listitem>cache coloring and</listitem>

            <listitem>dynamic magazine growing (different magazine sizes are
            already supported, but the allocation strategy would need to be
            adjusted).</listitem>
          </itemizedlist></para>
| 287 | |||
| 288 | <section> |
||
| 289 | <title>Allocation/deallocation</title> |
||
| 290 | |||
| 70 | jermar | 291 | <para>The following two paragraphs summarize and complete the |
| 292 | description of the slab allocator operation (i.e. |
||
| 82 | jermar | 293 | <code>slab_alloc()</code> and <code>slab_free()</code> |
| 294 | functions).</para> |
||
| 64 | jermar | 295 | |
| 296 | <formalpara> |
||
| 297 | <title>Allocation</title> |
||
| 298 | |||
| 70 | jermar | 299 | <para><emphasis>Step 1.</emphasis> When an allocation request |
| 300 | comes, the slab allocator checks availability of memory in the |
||
| 301 | current magazine of the local processor magazine cache. If the |
||
| 76 | palkovsky | 302 | available memory is there, the allocator just pops the object from |
| 303 | magazine and returns it.</para> |
||
| 64 | jermar | 304 | |
| 70 | jermar | 305 | <para><emphasis>Step 2.</emphasis> If the current magazine in the |
| 306 | processor magazine cache is empty, the allocator will attempt to |
||
| 307 | swap it with the last magazine from the cache and return to the |
||
| 308 | first step. If also the last magazine is empty, the algorithm will |
||
| 309 | fall through to Step 3.</para> |
||
| 64 | jermar | 310 | |
| 70 | jermar | 311 | <para><emphasis>Step 3.</emphasis> Now the allocator is in the |
| 312 | situation when both magazines in the processor magazine cache are |
||
| 313 | empty. The allocator reloads one magazine from the shared list of |
||
| 314 | full magazines. If the reload is successful (i.e. there are full |
||
| 315 | magazines in the list), the algorithm continues with Step |
||
| 316 | 1.</para> |
||
| 64 | jermar | 317 | |
| 70 | jermar | 318 | <para><emphasis>Step 4.</emphasis> In this fail-safe step, an |
| 319 | object is allocated from the conventional slab layer and a pointer |
||
| 320 | to it is returned. If also the last magazine is full,</para> |
||
| 64 | jermar | 321 | </formalpara> |
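<para>Steps 1 to 3 can be sketched over a simplified two-magazine per-CPU
cache. This is a single-threaded illustration without the locking the real
kernel needs; the structure layout and names are invented for the
example.</para>

```c
#include <stddef.h>

#define MAG_SIZE 8

/* Simplified magazine: a small stack of object pointers. */
typedef struct magazine {
    void *objs[MAG_SIZE];
    size_t count;
    struct magazine *next;   /* link in the shared full-magazine list */
} magazine_t;

typedef struct {
    magazine_t *current;     /* magazine served first */
    magazine_t *last;        /* fallback magazine */
} cpu_cache_t;

/* Steps 1-3 of the allocation algorithm. Returns NULL when the
 * request has to fall through to the slab layer (Step 4). */
void *magazine_obj_get(cpu_cache_t *cache, magazine_t **full_list)
{
    /* Step 1: pop from the current magazine if it is non-empty. */
    if (cache->current && cache->current->count > 0)
        return cache->current->objs[--cache->current->count];

    /* Step 2: swap the current and last magazines and retry. */
    if (cache->last && cache->last->count > 0) {
        magazine_t *tmp = cache->current;
        cache->current = cache->last;
        cache->last = tmp;
        return cache->current->objs[--cache->current->count];
    }

    /* Step 3: reload one full magazine from the shared list. */
    if (*full_list) {
        cache->last = cache->current;
        cache->current = *full_list;
        *full_list = (*full_list)->next;
        return cache->current->objs[--cache->current->count];
    }

    return NULL;  /* Step 4: go to the slab layer */
}
```

<para>Note that only Step 3 touches shared state; Steps 1 and 2 operate
purely on processor-private data, which is what makes the common path
lock-free.</para>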
| 322 | |||
| 323 | <formalpara> |
||
| 324 | <title>Deallocation</title> |
||
| 325 | |||
| 70 | jermar | 326 | <para><emphasis>Step 1.</emphasis> During a deallocation request, |
| 327 | the slab allocator checks if the current magazine of the local |
||
| 76 | palkovsky | 328 | processor magazine cache is not full. If it is, the pointer to the |
| 70 | jermar | 329 | objects is just pushed into the magazine and the algorithm |
| 330 | returns.</para> |
||
| 64 | jermar | 331 | |
| 70 | jermar | 332 | <para><emphasis>Step 2.</emphasis> If the current magazine is |
| 333 | full, the allocator will attempt to swap it with the last magazine |
||
| 334 | from the cache and return to the first step. If also the last |
||
| 335 | magazine is empty, the algorithm will fall through to Step |
||
| 336 | 3.</para> |
||
| 64 | jermar | 337 | |
| 70 | jermar | 338 | <para><emphasis>Step 3.</emphasis> Now the allocator is in the |
| 339 | situation when both magazines in the processor magazine cache are |
||
| 76 | palkovsky | 340 | full. The allocator tries to allocate a new empty magazine and |
| 341 | flush one of the full magazines to the shared list of full |
||
| 342 | magazines. If it is successfull, the algoritm continues with Step |
||
| 343 | 1.</para> |
||
| 344 | |||
| 345 | <para><emphasis>Step 4. </emphasis>In case of low memory condition |
||
| 346 | when the allocation of empty magazine fails, the object is moved |
||
| 347 | directly into slab. In the worst case object deallocation does not |
||
| 348 | need to allocate any additional memory.</para> |
||
| 64 | jermar | 349 | </formalpara> |
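<para>The deallocation path mirrors the allocation sketch above. The same
caveats apply: this is an invented, lock-free single-threaded model, and
the empty-magazine allocation hook stands in for a kernel allocation that
may fail under memory pressure.</para>

```c
#include <stdbool.h>
#include <stddef.h>

#define MAG_SIZE 8

typedef struct magazine {
    void *objs[MAG_SIZE];
    size_t count;
    struct magazine *next;
} magazine_t;

typedef struct {
    magazine_t *current;
    magazine_t *last;
} cpu_cache_t;

/* Stand-in hook for allocating a fresh empty magazine; NULL models
 * the low-memory condition that triggers the fail-safe Step 4. */
magazine_t *(*magazine_alloc)(void);

/* Steps 1-3 of the deallocation algorithm. Returns false when the
 * object must be returned directly to its slab (Step 4). */
bool magazine_obj_put(cpu_cache_t *cache, void *obj, magazine_t **full_list)
{
    /* Step 1: push into the current magazine if it has spare room. */
    if (cache->current && cache->current->count < MAG_SIZE) {
        cache->current->objs[cache->current->count++] = obj;
        return true;
    }

    /* Step 2: swap the current and last magazines and retry. */
    if (cache->last && cache->last->count < MAG_SIZE) {
        magazine_t *tmp = cache->current;
        cache->current = cache->last;
        cache->last = tmp;
        cache->current->objs[cache->current->count++] = obj;
        return true;
    }

    /* Step 3: flush one full magazine to the shared list and install
     * a newly allocated empty magazine in its place. */
    magazine_t *empty = magazine_alloc ? magazine_alloc() : NULL;
    if (empty) {
        cache->current->next = *full_list;
        *full_list = cache->current;
        cache->current = empty;
        cache->current->objs[cache->current->count++] = obj;
        return true;
    }

    return false;  /* Step 4: free directly into the slab */
}
```
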
        </section>
      </section>
    </section>
  </section>

  <section>
    <title>Virtual memory management</title>

    <para>Virtual memory is essential for an operating system because it
    makes several things possible. First, it helps to isolate tasks from
    each other by encapsulating them in their private address spaces.
    Second, virtual memory can give tasks the impression of more memory
    being available than is actually present. And third, by using virtual
    memory, there might be multiple copies of the same program, linked to
    the same addresses, running in the system. There are at least two known
    mechanisms for implementing virtual memory: segmentation and paging.
    Even though some processor architectures supported by HelenOS<footnote>
        <para>ia32 has full-fledged segmentation.</para>
      </footnote> provide both mechanisms, the kernel makes use solely of
    paging.</para>

    <section id="paging">
      <title>VAT subsystem</title>

      <para>In paged virtual memory, the entire virtual address space is
      divided into small power-of-two sized naturally aligned blocks called
      pages. The processor implements a translation mechanism that allows
      the operating system to manage mappings between the set of pages and
      the set of identically sized and identically aligned pieces of
      physical memory called frames. As a result, references to continuous
      virtual memory areas don't necessarily need to reference continuous
      areas of physical memory. Supported page sizes usually range from
      several kilobytes to several megabytes. Each page that takes part in
      the mapping is associated with certain attributes that further
      describe the mapping (e.g. access rights, dirty and accessed bits and
      the present bit).</para>

      <para>When the processor accesses a page that is not present (i.e.
      its present bit is not set), the operating system is notified through
      a special exception called a page fault. It is then up to the
      operating system to service the page fault. In HelenOS, some page
      faults are fatal and result in either task termination or, in the
      worst case, kernel panic<footnote>
          <para>Such a condition would be caused either by a hardware
          failure or a bug in the kernel.</para>
        </footnote>, while other page faults are used to load memory on
      demand or to notify the kernel about certain events.</para>

      <indexterm>
        <primary>page tables</primary>
      </indexterm>

      <para>The set of all page mappings is stored in a memory structure
      called page tables. Some architectures have no hardware support for
      page tables<footnote>
          <para>On mips32, a TLB-only model is used and the operating system
          is responsible for managing software defined page tables.</para>
        </footnote> while other processor architectures<footnote>
          <para>Like amd64 and ia32.</para>
        </footnote> understand their whole memory format. Despite all the
      possible differences in page table formats, the HelenOS VAT
      subsystem<footnote>
          <para>Virtual Address Translation subsystem.</para>
        </footnote> unifies the page table operations under one programming
      interface. For all parts of the kernel, three basic functions are
      provided:</para>

      <itemizedlist>
        <listitem>
          <para><code>page_mapping_insert()</code>,</para>
        </listitem>

        <listitem>
          <para><code>page_mapping_find()</code> and</para>
        </listitem>

        <listitem>
          <para><code>page_mapping_remove()</code>.</para>
        </listitem>
      </itemizedlist>

      <para>The <code>page_mapping_insert()</code> function is used to
      introduce a mapping for one virtual memory page belonging to a
      particular address space into the page tables. Once the mapping is in
      the page tables, it can be searched for by
      <code>page_mapping_find()</code> and removed by
      <code>page_mapping_remove()</code>. All of these functions internally
      select the page table mechanism specific functions that carry out the
      actual operation.</para>
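<para>The dispatch through mechanism-specific functions can be sketched as
follows. The operations structure and the toy single-slot "mechanism" are
modeled on the description above, not copied from the HelenOS sources, so
the exact names and signatures should be treated as assumptions.</para>

```c
#include <stddef.h>
#include <stdint.h>

typedef struct as as_t;    /* address space (opaque in this sketch) */
typedef struct pte pte_t;  /* mechanism-specific page table entry */

/* One programming interface covering several page table mechanisms:
 * the generic functions only dispatch through this table. */
typedef struct {
    void (*mapping_insert)(as_t *as, uintptr_t page, uintptr_t frame,
                           unsigned flags);
    pte_t *(*mapping_find)(as_t *as, uintptr_t page);
    void (*mapping_remove)(as_t *as, uintptr_t page);
} page_mapping_operations_t;

/* Selected according to the active mechanism (hierarchical page
 * tables or the global page hash table). */
static const page_mapping_operations_t *page_mapping_operations;

void page_mapping_insert(as_t *as, uintptr_t page, uintptr_t frame,
                         unsigned flags)
{
    page_mapping_operations->mapping_insert(as, page, frame, flags);
}

pte_t *page_mapping_find(as_t *as, uintptr_t page)
{
    return page_mapping_operations->mapping_find(as, page);
}

void page_mapping_remove(as_t *as, uintptr_t page)
{
    page_mapping_operations->mapping_remove(as, page);
}

/* A toy mechanism for demonstration: a single-slot map. */
static uintptr_t toy_page, toy_frame;
static int toy_valid;

static void toy_insert(as_t *as, uintptr_t page, uintptr_t frame,
                       unsigned flags)
{
    (void) as; (void) flags;
    toy_page = page; toy_frame = frame; toy_valid = 1;
}

static pte_t *toy_find(as_t *as, uintptr_t page)
{
    (void) as;
    return (toy_valid && toy_page == page) ? (pte_t *) &toy_frame : NULL;
}

static void toy_remove(as_t *as, uintptr_t page)
{
    (void) as;
    if (toy_valid && toy_page == page)
        toy_valid = 0;
}

static const page_mapping_operations_t toy_ops = {
    toy_insert, toy_find, toy_remove
};
```

<para>Adding a new mechanism (e.g. the B+tree mentioned below) would then
amount to providing another operations table.</para>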

      <para>There are currently two supported mechanisms: generic 4-level
      hierarchical page tables and the global page hash table. Both of the
      mechanisms are generic as they cover several hardware platforms. For
      instance, the 4-level hierarchical page table mechanism is used by
      amd64, ia32, mips32 and ppc32. These architectures have the following
      native page table formats: 4-level, 2-level, TLB-only and hardware
      hash table, respectively. On the other hand, the global page hash
      table is used on ia64, which can be TLB-only or use a hardware hash
      table. Although only two mechanisms are currently implemented, other
      mechanisms (e.g. B+tree) can be easily added.</para>

      <section id="page_tables">
        <indexterm>
          <primary>page tables</primary>

          <secondary>- hierarchical</secondary>
        </indexterm>

        <title>Hierarchical 4-level page tables</title>

        <para>Hierarchical 4-level page tables are a generalization of the
        frequently used hierarchical model of page tables. In this
        mechanism, each address space has its own page tables. To avoid
        confusion with the terminology used by hardware vendors, in HelenOS
        the root level page table is called PTL0, the two middle levels are
        called PTL1 and PTL2, and, finally, the leaf level is called PTL3.
        All architectures using this mechanism are required to use PTL0 and
        PTL3. However, the middle levels can be left out, depending on the
        hardware hierarchy or the structure of software-only page tables.
        The genericity is achieved through a set of macros that define
        transitions from one level to another. Unused levels are optimised
        out by the compiler.</para>
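<para>A macro-driven 4-level walk might look like the sketch below. The bit
layout (9 bits per level, 4 KiB pages) and the macro names are assumptions
for the demo; an architecture with fewer levels would define the unused
transition macros as identities, which the compiler then optimises
away.</para>

```c
#include <stddef.h>
#include <stdint.h>

/* Per-level index extraction for a hypothetical 48-bit virtual
 * address with 9 bits per level and 4 KiB pages. */
#define PTL0_INDEX(va)  (((va) >> 39) & 0x1ff)
#define PTL1_INDEX(va)  (((va) >> 30) & 0x1ff)
#define PTL2_INDEX(va)  (((va) >> 21) & 0x1ff)
#define PTL3_INDEX(va)  (((va) >> 12) & 0x1ff)

typedef struct { void *next[512]; } ptl_t;

/* Walk PTL0 -> PTL1 -> PTL2 -> PTL3 and return whatever the leaf
 * level stores for the page (NULL when the mapping is absent). */
void *pt_walk(ptl_t *ptl0, uint64_t va)
{
    ptl_t *ptl1 = ptl0->next[PTL0_INDEX(va)];
    if (!ptl1) return NULL;
    ptl_t *ptl2 = ptl1->next[PTL1_INDEX(va)];
    if (!ptl2) return NULL;
    ptl_t *ptl3 = ptl2->next[PTL2_INDEX(va)];
    if (!ptl3) return NULL;
    return ptl3->next[PTL3_INDEX(va)];
}
```
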
      </section>

      <section>
        <indexterm>
          <primary>page tables</primary>

          <secondary>- hashing</secondary>
        </indexterm>

        <title>Global page hash table</title>

        <para>Implementation of the global page hash table was encouraged
        by 64-bit architectures that can have rather sparse address spaces.
        The hash table contains valid mappings only. Each entry of the hash
        table contains an address space pointer, virtual memory page number
        (VPN), physical memory frame number (PFN) and a set of flags. The
        pair of the address space pointer and the virtual memory page
        number is used as a key for the hash table. One of the major
        differences between the global page hash table and hierarchical
        4-level page tables is that there is only a single global page hash
        table in the system while hierarchical page tables exist per
        address space. Thus, the global page hash table contains
        information about mappings of all address spaces in the
        system.</para>
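<para>The entry layout and the two-component key can be sketched as
follows. The hash function and its mixing constant are arbitrary choices
for this illustration; only the entry contents and the (address space,
VPN) key come from the description above.</para>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct as as_t;  /* address space, opaque here */

/* One entry of the global page hash table: valid mappings only,
 * keyed by the (address space, VPN) pair. */
typedef struct pht_entry {
    as_t *as;                /* address space pointer */
    uintptr_t vpn;           /* virtual page number */
    uintptr_t pfn;           /* physical frame number */
    unsigned flags;
    struct pht_entry *next;  /* collision chain */
} pht_entry_t;

/* The hash must combine both key components. */
size_t pht_hash(as_t *as, uintptr_t vpn, size_t buckets)
{
    uintptr_t h = (uintptr_t) as ^ (vpn * 2654435761u);
    return (size_t) (h % buckets);
}

/* Both key components must match: the same VPN in two different
 * address spaces denotes two different mappings. */
bool pht_key_equal(const pht_entry_t *e, as_t *as, uintptr_t vpn)
{
    return e->as == as && e->vpn == vpn;
}
```
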

        <para>The global page hash table mechanism uses the generic hash
        table type as described in the chapter dedicated to <link
        linkend="hashtables">data structures</link> earlier in this
        book.</para>
      </section>
    </section>
  </section>

  <section id="tlb">
    <indexterm>
      <primary>TLB</primary>
    </indexterm>

    <title>Translation Lookaside Buffer</title>

    <para>Due to the extensive overhead of the several extra memory
    accesses during page table lookup that would be necessary on every
    instruction, modern architectures deploy a fast associative cache of
    recently used page mappings. This cache is called the TLB (Translation
    Lookaside Buffer) and is present on every processor in the system. As
    has already been pointed out, the TLB is the only page translation
    mechanism for some architectures.</para>

    <section id="tlb_shootdown">
      <indexterm>
        <primary>TLB</primary>

        <secondary>- TLB shootdown</secondary>
      </indexterm>

      <title>TLB consistency</title>

      <para>The operating system is responsible for keeping the TLB
      consistent with the page tables. Whenever mappings are modified or
      purged from the page tables, or when an address space identifier is
      reused, the kernel needs to invalidate the respective contents of the
      TLB. Some TLB types support partial invalidation of their contents
      (e.g. ranges of pages or address spaces) while other types can be
      invalidated only entirely. The invalidation must be done on all
      processors because there is one TLB per processor. Maintaining TLB
      consistency on multiprocessor configurations is not as trivial as it
      might seem at first glance.</para>

      <para>The remote TLB invalidation is called TLB shootdown. HelenOS
      uses a simplified variant of the algorithm described in <xref
      linkend="Black89" />.</para>

      <para>TLB shootdown is performed in three phases.</para>
      <formalpara>
        <title>Phase 1.</title>

        <para>The initiator clears its TLB flag and locks the global TLB
        spinlock. The request is then enqueued into all other processors'
        TLB shootdown message queues. When the TLB shootdown message queue
        is full on any processor, the queue is purged and a single request
        to invalidate the entire TLB is stored there. Once all the TLB
        shootdown messages have been dispatched, the initiator sends all
        other processors an interrupt to notify them about the incoming TLB
        shootdown message. It then spins until all processors accept the
        interrupt and clear their TLB flags.</para>
      </formalpara>

      <formalpara>
        <title>Phase 2.</title>

        <para>Except for the initiator, all other processors are spinning
        on the TLB spinlock. The initiator is now free to modify the page
        tables and purge its own TLB. The initiator then unlocks the global
        TLB spinlock and sets its TLB flag.</para>
      </formalpara>

      <formalpara>
        <title>Phase 3.</title>

        <para>When the spinlock is unlocked by the initiator, other
        processors are sequentially granted the spinlock. However, once
        they manage to lock it, they immediately release it. Each processor
        invalidates its TLB according to the messages found in its TLB
        shootdown message queue. In the end, each processor sets its TLB
        flag and resumes its previous operation.</para>
      </formalpara>
||
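
      <para>The three phases above can be illustrated by the following
      single-threaded C sketch. All names here
      (<code>cpu_t</code>, <code>shootdown_start</code>,
      <code>shootdown_finalize</code>, <code>QUEUE_LEN</code>) are
      hypothetical and only approximate the real HelenOS structures; the
      global spinlock and the interprocessor interrupts are merely hinted
      at in comments.</para>

```c
#include <assert.h>
#include <stdbool.h>

#define CPUS      4
#define QUEUE_LEN 2   /* deliberately tiny to demonstrate queue overflow */

/* Hypothetical per-processor state; real HelenOS structures differ. */
typedef struct {
	bool tlb_active;        /* the "TLB flag" from the text */
	int queue[QUEUE_LEN];   /* pending shootdown messages (page numbers) */
	int queue_head;
	bool invalidate_all;    /* set when the queue overflowed */
} cpu_t;

static cpu_t cpu[CPUS];         /* zero-initialized by the C runtime */

/* Phase 1: the initiator clears its TLB flag and enqueues the request on
 * every other processor; a full queue is purged and replaced by a single
 * "invalidate everything" request. */
static void shootdown_start(int initiator, int page)
{
	cpu[initiator].tlb_active = false;
	for (int i = 0; i < CPUS; i++) {
		if (i == initiator)
			continue;
		if (cpu[i].queue_head == QUEUE_LEN) {
			cpu[i].queue_head = 0;          /* purge the full queue */
			cpu[i].invalidate_all = true;
		} else {
			cpu[i].queue[cpu[i].queue_head++] = page;
		}
	}
	/* ...send IPIs and spin until all other TLB flags are cleared... */
}

/* Phase 3 (per processor): briefly acquire and release the spinlock,
 * drain the message queue and set the TLB flag again. */
static void shootdown_finalize(int me)
{
	cpu[me].queue_head = 0;     /* messages consumed, TLB invalidated */
	cpu[me].invalidate_all = false;
	cpu[me].tlb_active = true;
}
```

      <para>Phase 2, the initiator's page table update, would take place
      between these two calls while the other processors spin.</para>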
    </section>

    <section>
      <title>Address spaces</title>

      <section>
        <indexterm>
          <primary>address space</primary>

          <secondary>- area</secondary>
        </indexterm>

        <title>Address space areas</title>

        <para>Each address space consists of mutually disjoint contiguous
        address space areas. An address space area is precisely defined by
        its base address and the number of frames/pages it
        contains.</para>

        <para>Each address space area has a set of flags that define the
        behaviour of and permissions on the particular area.
        <itemizedlist>
            <listitem><emphasis>AS_AREA_READ</emphasis> flag indicates
            reading permission.</listitem>

            <listitem><emphasis>AS_AREA_WRITE</emphasis> flag indicates
            writing permission.</listitem>

            <listitem><emphasis>AS_AREA_EXEC</emphasis> flag indicates code
            execution permission. Some architectures do not support
            restriction of execution permission; on those architectures,
            this flag has no effect.</listitem>

            <listitem><emphasis>AS_AREA_DEVICE</emphasis> flag marks the
            area as mapped to device memory.</listitem>
          </itemizedlist></para>
| 608 | |||
| 609 | <para>Kernel provides possibility tasks create/expand/shrink/share its |
||
| 610 | address space via the set of syscalls.</para> |
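
        <para>A minimal C sketch of how such an area descriptor and a
        permission check might look. The <code>as_area_t</code> layout, the
        flag values and the helper function are hypothetical; the real
        HelenOS definitions differ.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; the real HelenOS constants may differ. */
#define AS_AREA_READ    (1 << 0)
#define AS_AREA_WRITE   (1 << 1)
#define AS_AREA_EXEC    (1 << 2)
#define AS_AREA_DEVICE  (1 << 3)

/* Base address plus page count define the area precisely;
 * flags define its behaviour and permissions. */
typedef struct {
	uintptr_t base;     /* base virtual address */
	size_t pages;       /* number of pages the area contains */
	int flags;          /* AS_AREA_* permission bits */
} as_area_t;

/* Would a write to the given address be permitted by this area? */
static bool as_area_allows_write(const as_area_t *a, uintptr_t addr)
{
	const size_t PAGE_SIZE = 4096;   /* illustrative page size */
	if (addr < a->base || addr >= a->base + a->pages * PAGE_SIZE)
		return false;                /* address lies outside the area */
	return (a->flags & AS_AREA_WRITE) != 0;
}
```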

      </section>

      <section>
        <indexterm>
          <primary>address space</primary>

          <secondary>- ASID</secondary>
        </indexterm>

        <title>Address Space ID (ASID)</title>
| 621 | |||
| 76 | palkovsky | 622 | <para>Every task in the operating system has it's own view of the |
| 623 | virtual memory. When performing context switch between different |
||
| 624 | tasks, the kernel must switch the address space mapping as well. As |
||
| 625 | modern processors perform very aggressive caching of virtual mappings, |
||
| 626 | flushing the complete TLB on every context switch would be very |
||
| 627 | inefficient. To avoid such performance penalty, some architectures |
||
| 628 | introduce an address space identifier, which allows storing several |
||
| 629 | different mappings inside TLB.</para> |

        <para>The HelenOS kernel can take advantage of this hardware
        support through its ASID abstraction. For example, on ia64 the
        kernel ASID is derived from the RID (region identifier), while on
        mips32 the kernel ASID is the hardware identifier itself. As
        expected, the ASID information is part of the
        <emphasis>as_t</emphasis> structure.</para>

        <para>Due to hardware limitations, the hardware ASID has a limited
        length, ranging from 8 bits on mips32 to 24 bits on ia64, which
        makes it impossible to use it as a unique address space identifier
        for all tasks running in the system. In such situations, a special
        ASID stealing algorithm is used, which takes an ASID from an
        inactive task and assigns it to the active task.</para>

| 643 | |||
| 71 | bondari | 644 | <indexterm> |
| 645 | <primary>address space</primary> |
||
| 646 | |||
| 72 | bondari | 647 | <secondary>- ASID stealing</secondary> |
| 71 | bondari | 648 | </indexterm> |
| 649 | |||
| 650 | <para> |
||
| 651 | <classname>ASID stealing algoritm here.</classname> |
||
| 652 | </para> |
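
        <para>The source above leaves the description of the stealing
        algorithm as a placeholder. The following C sketch only illustrates
        the general idea stated in the text (steal an ASID from an inactive
        address space when none are free); it is not the actual HelenOS
        implementation, and all names (<code>asid_get</code>,
        <code>ASIDS_AVAILABLE</code>, the <code>as_t</code> layout) are
        hypothetical.</para>

```c
#include <assert.h>

#define ASIDS_AVAILABLE 8    /* tiny for illustration; hardware offers 2^8 to 2^24 */
#define ASID_INVALID    (-1)

/* Hypothetical record; the real as_t structure differs. */
typedef struct {
	int asid;
	int active;    /* currently running on some processor? */
} as_t;

/* Allocate an ASID; once all are taken, steal one from an inactive
 * address space. */
static int asid_get(as_t *spaces, int nspaces, as_t *me)
{
	static int next_free = 0;

	if (next_free < ASIDS_AVAILABLE)
		return me->asid = next_free++;    /* a fresh ASID is still free */

	for (int i = 0; i < nspaces; i++) {
		if (!spaces[i].active && spaces[i].asid != ASID_INVALID) {
			int stolen = spaces[i].asid;
			spaces[i].asid = ASID_INVALID;  /* victim must re-acquire later */
			/* ...all TLB entries tagged with `stolen` must be purged... */
			return me->asid = stolen;
		}
	}
	return ASID_INVALID;    /* no inactive victim found */
}
```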
      </section>
    </section>
  </section>
</chapter>