Rev 62 | Rev 65 | Go to most recent revision | Details | Compare with Previous | Last modification | View Log | RSS feed
| Rev | Author | Line No. | Line |
|---|---|---|---|
| 9 | bondari | 1 | <?xml version="1.0" encoding="UTF-8"?> |
| 11 | bondari | 2 | <chapter id="mm"> |
| 3 | <?dbhtml filename="mm.html"?> |
||
| 9 | bondari | 4 | |
| 11 | bondari | 5 | <title>Memory management</title> |
| 9 | bondari | 6 | |
| 64 | jermar | 7 | <para>In previous chapters, this book described the scheduling subsystem as |
| 8 | the creator of the impression that threads execute in parallel. The memory |
||
| 9 | management subsystem, on the other hand, creates the impression that there |
||
| 10 | is enough physical memory for the kernel and that userspace tasks have the |
||
| 11 | entire address space only for themselves.</para> |
||
| 12 | |||
| 26 | bondari | 13 | <section> |
| 64 | jermar | 14 | <title>Physical memory management</title> |
| 15 | |||
| 16 | <section id="zones_and_frames"> |
||
| 17 | <title>Zones and frames</title> |
||
| 18 | |||
| 19 | <para>HelenOS represents continuous areas of physical memory in |
||
| 20 | structures called frame zones (abbreviated as zones). Each zone contains |
||
| 21 | information about the number of allocated and unallocated physical |
||
| 22 | memory frames as well as the physical base address of the zone and |
||
| 23 | number of frames contained in it. A zone also contains an array of frame |
||
| 24 | structures describing each frame of the zone and, in the last, but not |
||
| 25 | the least important, front, each zone is equipped with a buddy system |
||
| 26 | that faciliates effective allocation of power-of-two sized block of |
||
| 27 | frames.</para> |
||
| 28 | |||
| 29 | <para>This organization of physical memory provides good preconditions |
||
| 30 | for hot-plugging of more zones. There is also one currently unused zone |
||
| 31 | attribute: <code>flags</code>. The attribute could be used to give a |
||
| 32 | special meaning to some zones in the future.</para> |
||
| 33 | |||
| 34 | <para>The zones are linked in a doubly-linked list. This might seem a |
||
| 35 | bit ineffective because the zone list is walked everytime a frame is |
||
| 36 | allocated or deallocated. However, this does not represent a significant |
||
| 37 | performance problem as it is expected that the number of zones will be |
||
| 38 | rather low. Moreover, most architectures merge all zones into |
||
| 39 | one.</para> |
||
| 40 | |||
| 41 | <para>For each physical memory frame found in a zone, there is a frame |
||
| 42 | structure that contains number of references and data used by buddy |
||
| 43 | system.</para> |
||
| 44 | </section> |
||
| 45 | |||
| 46 | <section id="frame_allocator"> |
||
| 47 | <title>Frame allocator</title> |
||
| 48 | |||
| 49 | <para>The frame allocator satisfies kernel requests to allocate |
||
| 50 | power-of-two sized blocks of physical memory. Because of zonal |
||
| 51 | organization of physical memory, the frame allocator is always working |
||
| 52 | within a context of some frame zone. In order to carry out the |
||
| 53 | allocation requests, the frame allocator is tightly integrated with the |
||
| 54 | buddy system belonging to the zone. The frame allocator is also |
||
| 55 | responsible for updating information about the number of free and busy |
||
| 56 | frames in the zone. <figure> |
||
| 57 | <mediaobject id="frame_alloc"> |
||
| 58 | <imageobject role="html"> |
||
| 59 | <imagedata fileref="images/frame_alloc.png" format="PNG" /> |
||
| 60 | </imageobject> |
||
| 61 | |||
| 62 | <imageobject role="fop"> |
||
| 63 | <imagedata fileref="images.vector/frame_alloc.svg" format="SVG" /> |
||
| 64 | </imageobject> |
||
| 65 | </mediaobject> |
||
| 66 | |||
| 67 | <title>Frame allocator scheme.</title> |
||
| 68 | </figure></para> |
||
| 69 | |||
| 70 | <formalpara> |
||
| 71 | <title>Allocation / deallocation</title> |
||
| 72 | |||
| 73 | <para>Upon allocation request via function <code>frame_alloc</code>, |
||
| 74 | the frame allocator first tries to find a zone that can satisfy the |
||
| 75 | request (i.e. has the required amount of free frames). Once a suitable |
||
| 76 | zone is found, the frame allocator uses the buddy allocator on the |
||
| 77 | zone's buddy system to perform the allocation. During deallocation, |
||
| 78 | which is triggered by a call to <code>frame_free</code>, the frame |
||
| 79 | allocator looks up the respective zone that contains the frame being |
||
| 80 | deallocated. Afterwards, it calls the buddy allocator again, this time |
||
| 81 | to take care of deallocation within the zone's buddy system.</para> |
||
| 82 | </formalpara> |
||
| 83 | </section> |
||
| 84 | |||
| 85 | <section id="buddy_allocator"> |
||
| 86 | <title>Buddy allocator</title> |
||
| 87 | |||
| 88 | <para>In the buddy system, the memory is broken down into power-of-two |
||
| 89 | sized naturally aligned blocks. These blocks are organized in an array |
||
| 90 | of lists, in which the list with index i contains all unallocated blocks |
||
| 91 | of size <mathphrase>2<superscript>i</superscript></mathphrase>. The |
||
| 92 | index i is called the order of block. Should there be two adjacent |
||
| 93 | equally sized blocks in the list i<mathphrase />(i.e. buddies), the |
||
| 94 | buddy allocator would coalesce them and put the resulting block in list |
||
| 95 | <mathphrase>i + 1</mathphrase>, provided that the resulting block would |
||
| 96 | be naturally aligned. Similarily, when the allocator is asked to |
||
| 97 | allocate a block of size |
||
| 98 | <mathphrase>2<superscript>i</superscript></mathphrase>, it first tries |
||
| 99 | to satisfy the request from the list with index i. If the request cannot |
||
| 100 | be satisfied (i.e. the list i is empty), the buddy allocator will try to |
||
| 101 | allocate and split a larger block from the list with index i + 1. Both |
||
| 102 | of these algorithms are recursive. The recursion ends either when there |
||
| 103 | are no blocks to coalesce in the former case or when there are no blocks |
||
| 104 | that can be split in the latter case.</para> |
||
| 105 | |||
| 106 | <para>This approach greatly reduces external fragmentation of memory and |
||
| 107 | helps in allocating bigger continuous blocks of memory aligned to their |
||
| 108 | size. On the other hand, the buddy allocator suffers increased internal |
||
| 109 | fragmentation of memory and is not suitable for general kernel |
||
| 110 | allocations. This purpose is better addressed by the <link |
||
| 111 | linkend="slab">slab allocator</link>.<figure> |
||
| 112 | <mediaobject id="buddy_alloc"> |
||
| 113 | <imageobject role="html"> |
||
| 114 | <imagedata fileref="images/buddy_alloc.png" format="PNG" /> |
||
| 115 | </imageobject> |
||
| 116 | |||
| 117 | <imageobject role="fop"> |
||
| 118 | <imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" /> |
||
| 119 | </imageobject> |
||
| 120 | </mediaobject> |
||
| 121 | |||
| 122 | <title>Buddy system scheme.</title> |
||
| 123 | </figure></para> |
||
| 124 | |||
| 125 | <section> |
||
| 126 | <title>Implementation</title> |
||
| 127 | |||
| 128 | <para>The buddy allocator is, in fact, an abstract framework wich can |
||
| 129 | be easily specialized to serve one particular task. It knows nothing |
||
| 130 | about the nature of memory it helps to allocate. In order to beat the |
||
| 131 | lack of this knowledge, the buddy allocator exports an interface that |
||
| 132 | each of its clients is required to implement. When supplied with an |
||
| 133 | implementation of this interface, the buddy allocator can use |
||
| 134 | specialized external functions to find a buddy for a block, split and |
||
| 135 | coalesce blocks, manipulate block order and mark blocks busy or |
||
| 136 | available.</para> |
||
| 137 | |||
| 138 | <formalpara> |
||
| 139 | <title>Data organization</title> |
||
| 140 | |||
| 141 | <para>Each entity allocable by the buddy allocator is required to |
||
| 142 | contain space for storing block order number and a link variable |
||
| 143 | used to interconnect blocks within the same order.</para> |
||
| 144 | |||
| 145 | <para>Whatever entities are allocated by the buddy allocator, the |
||
| 146 | first entity within a block is used to represent the entire block. |
||
| 147 | The first entity keeps the order of the whole block. Other entities |
||
| 148 | within the block are assigned the magic value |
||
| 149 | <constant>BUDDY_INNER_BLOCK</constant>. This is especially important |
||
| 150 | for effective identification of buddies in a one-dimensional array |
||
| 151 | because the entity that represents a potential buddy cannot be |
||
| 152 | associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it |
||
| 153 | is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is |
||
| 154 | not a buddy).</para> |
||
| 155 | </formalpara> |
||
| 156 | </section> |
||
| 157 | </section> |
||
| 158 | |||
| 159 | <section id="slab"> |
||
| 160 | <title>Slab allocator</title> |
||
| 161 | |||
| 162 | <para>The majority of memory allocation requests in the kernel is for |
||
| 163 | small, frequently used data structures. The basic idea behind the slab |
||
| 164 | allocator is that deployment of lists of preallocated, commonly used |
||
| 165 | objects. Whenever an object is to be allocated, the slab allocator takes |
||
| 166 | the first item from the list corresponding to the object type. This |
||
| 167 | avoids the overhead of allocating and dellocating commonly used types of |
||
| 168 | objects such as threads, B+tree nodes etc. Due to the fact that the |
||
| 169 | sizes of the requested and allocated object match, the slab allocator |
||
| 170 | significantly eliminates internal fragmentation. Moreover, each list can |
||
| 171 | have a constructor and a destructor, which leads to performance gains |
||
| 172 | because constructed and then destroyed objects don't need to be |
||
| 173 | reinitialized<footnote> |
||
| 174 | <para>Provided that the assumption that the destructor leaves the |
||
| 175 | object in a consistent state holds.</para> |
||
| 176 | </footnote>.</para> |
||
| 177 | |||
| 178 | <para>In the slab allocator, objects of one type are kept in continuous |
||
| 179 | areas of physical memory called slabs. Slabs can span from one to |
||
| 180 | several physical memory frames. Slabs of objects of one type are stored |
||
| 181 | in slab caches. When the allocator needs to allocate an object, it |
||
| 182 | searches available slabs. When the slab does not contain any free |
||
| 183 | obejct, a new slab is allocated and added to the cache. Contrary to |
||
| 184 | allocation, deallocated objects are returned to their respective slabs. |
||
| 185 | Empty slabs are deallocated immediately while empty slab caches are not |
||
| 186 | freed until HelenOS runs short of memory.</para> |
||
| 187 | |||
| 188 | <para><figure> |
||
| 189 | <mediaobject id="slab_alloc"> |
||
| 190 | <imageobject role="html"> |
||
| 191 | <imagedata fileref="images/slab_alloc.png" format="PNG" /> |
||
| 192 | </imageobject> |
||
| 193 | |||
| 194 | <imageobject role="fop"> |
||
| 195 | <imagedata fileref="images.vector/slab_alloc.svg" format="SVG" /> |
||
| 196 | </imageobject> |
||
| 197 | </mediaobject> |
||
| 198 | |||
| 199 | <title>Slab allocator scheme.</title> |
||
| 200 | </figure></para> |
||
| 201 | |||
| 202 | <section> |
||
| 203 | <para> |
||
| 204 | <termdef /> |
||
| 205 | |||
| 206 | |||
| 207 | |||
| 208 | <termdef /> |
||
| 209 | </para> |
||
| 210 | </section> |
||
| 211 | |||
| 212 | <section> |
||
| 213 | <title>Implementation</title> |
||
| 214 | |||
| 215 | <para>The slab allocator is closely modelled after <ulink |
||
| 216 | url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/"> |
||
| 217 | OpenSolaris slab allocator by Jeff Bonwick and Jonathan Adams </ulink> |
||
| 218 | with the following exceptions:<itemizedlist> |
||
| 219 | <listitem></listitem> |
||
| 220 | |||
| 221 | <listitem> |
||
| 222 | empty magazines are deallocated when not needed |
||
| 223 | </listitem> |
||
| 224 | </itemizedlist> Following features are not currently supported but |
||
| 225 | would be easy to do: <itemizedlist> |
||
| 226 | <listitem> |
||
| 227 | cache coloring |
||
| 228 | </listitem> |
||
| 229 | |||
| 230 | <listitem> |
||
| 231 | dynamic magazine grow (different magazine sizes are already supported, but we would need to adjust allocation strategy) |
||
| 232 | </listitem> |
||
| 233 | </itemizedlist></para> |
||
| 234 | |||
| 235 | <section> |
||
| 236 | <title>Magazine layer</title> |
||
| 237 | |||
| 238 | <para>Due to the extensive bottleneck on SMP architures, caused by |
||
| 239 | global slab locking mechanism, making processing of all slab |
||
| 240 | allocation requests serialized, a new layer was introduced to the |
||
| 241 | classic slab allocator design. Slab allocator was extended to |
||
| 242 | support per-CPU caches 'magazines' to achieve good SMP scaling. |
||
| 243 | <termdef>Slab SMP perfromance bottleneck was resolved by introducing |
||
| 244 | a per-CPU caching scheme called as <glossterm>magazine |
||
| 245 | layer</glossterm></termdef>.</para> |
||
| 246 | |||
| 247 | <para>Magazine is a N-element cache of objects, so each magazine can |
||
| 248 | satisfy N allocations. Magazine behaves like a automatic weapon |
||
| 249 | magazine (LIFO, stack), so the allocation/deallocation become simple |
||
| 250 | push/pop pointer operation. Trick is that CPU does not access global |
||
| 251 | slab allocator data during the allocation from its magazine, thus |
||
| 252 | making possible parallel allocations between CPUs.</para> |
||
| 253 | |||
| 254 | <para>Implementation also requires adding another feature as the |
||
| 255 | CPU-bound magazine is actually a pair of magazines to avoid |
||
| 256 | thrashing when during allocation/deallocatiion of 1 item at the |
||
| 257 | magazine size boundary. LIFO order is enforced, which should avoid |
||
| 258 | fragmentation as much as possible.</para> |
||
| 259 | |||
| 260 | <para>Another important entity of magazine layer is the common full |
||
| 261 | magazine list (also called a depot), that stores full magazines that |
||
| 262 | may be used by any of the CPU magazine caches to reload active CPU |
||
| 263 | magazine. This list of magazines can be pre-filled with full |
||
| 264 | magazines during initialization, but in current implementation it is |
||
| 265 | filled during object deallocation, when CPU magazine becomes |
||
| 266 | full.</para> |
||
| 267 | |||
| 268 | <para>Slab allocator control structures are allocated from special |
||
| 269 | slabs, that are marked by special flag, indicating that it should |
||
| 270 | not be used for slab magazine layer. This is done to avoid possible |
||
| 271 | infinite recursions and deadlock during conventional slab allocaiton |
||
| 272 | requests.</para> |
||
| 273 | </section> |
||
| 274 | |||
| 275 | <section> |
||
| 276 | <title>Allocation/deallocation</title> |
||
| 277 | |||
| 278 | <para>Every cache contains list of full slabs and list of partialy |
||
| 279 | full slabs. Empty slabs are immediately freed (thrashing will be |
||
| 280 | avoided because of magazines).</para> |
||
| 281 | |||
| 282 | <para>The SLAB allocator allocates lots of space and does not free |
||
| 283 | it. When frame allocator fails to allocate the frame, it calls |
||
| 284 | slab_reclaim(). It tries 'light reclaim' first, then brutal reclaim. |
||
| 285 | The light reclaim releases slabs from cpu-shared magazine-list, |
||
| 286 | until at least 1 slab is deallocated in each cache (this algorithm |
||
| 287 | should probably change). The brutal reclaim removes all cached |
||
| 288 | objects, even from CPU-bound magazines.</para> |
||
| 289 | |||
| 290 | <formalpara> |
||
| 291 | <title>Allocation</title> |
||
| 292 | |||
| 293 | <para><emphasis>Step 1.</emphasis> When it comes to the allocation |
||
| 294 | request, slab allocator first of all checks availability of memory |
||
| 295 | in local CPU-bound magazine. If it is there, we would just "pop" |
||
| 296 | the CPU magazine and return the pointer to object.</para> |
||
| 297 | |||
| 298 | <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is |
||
| 299 | empty, allocator will attempt to reload magazin, swapping it with |
||
| 300 | second CPU magazine and returns to the first step.</para> |
||
| 301 | |||
| 302 | <para><emphasis>Step 3.</emphasis> Now we are in the situation |
||
| 303 | when both CPU-bound magazines are empty, which makes allocator to |
||
| 304 | access shared full-magazines depot to reload CPU-bound magazines. |
||
| 305 | If reload is succesful (meaning there are full magazines in depot) |
||
| 306 | algoritm continues at Step 1.</para> |
||
| 307 | |||
| 308 | <para><emphasis>Step 4.</emphasis> Final step of the allocation. |
||
| 309 | In this step object is allocated from the conventional slab layer |
||
| 310 | and pointer is returned.</para> |
||
| 311 | </formalpara> |
||
| 312 | |||
| 313 | <formalpara> |
||
| 314 | <title>Deallocation</title> |
||
| 315 | |||
| 316 | <para><emphasis>Step 1.</emphasis> During deallocation request, |
||
| 317 | slab allocator will check if the local CPU-bound magazine is not |
||
| 318 | full. In this case we will just push the pointer to this |
||
| 319 | magazine.</para> |
||
| 320 | |||
| 321 | <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is |
||
| 322 | full, allocator will attempt to reload magazin, swapping it with |
||
| 323 | second CPU magazine and returns to the first step.</para> |
||
| 324 | |||
| 325 | <para><emphasis>Step 3.</emphasis> Now we are in the situation |
||
| 326 | when both CPU-bound magazines are full, which makes allocator to |
||
| 327 | access shared full-magazines depot to put one of the magazines to |
||
| 328 | the depot and creating new empty magazine. Algoritm continues at |
||
| 329 | Step 1.</para> |
||
| 330 | </formalpara> |
||
| 331 | </section> |
||
| 332 | </section> |
||
| 333 | </section> |
||
| 334 | |||
| 335 | <!-- End of Physmem --> |
||
| 336 | </section> |
||
| 337 | |||
| 338 | <section> |
||
| 11 | bondari | 339 | <title>Virtual memory management</title> |
| 9 | bondari | 340 | |
| 341 | <section> |
||
| 35 | bondari | 342 | <title>Introduction</title> |
| 343 | |||
| 344 | <para>Virtual memory is a special memory management technique, used by |
||
| 345 | kernel to achieve a bunch of mission critical goals. <itemizedlist> |
||
| 346 | <listitem> |
||
| 347 | Isolate each task from other tasks that are running on the system at the same time. |
||
| 348 | </listitem> |
||
| 349 | |||
| 350 | <listitem> |
||
| 351 | Allow to allocate more memory, than is actual physical memory size of the machine. |
||
| 352 | </listitem> |
||
| 353 | |||
| 354 | <listitem> |
||
| 355 | Allowing, in general, to load and execute two programs that are linked on the same address without complicated relocations. |
||
| 356 | </listitem> |
||
| 357 | </itemizedlist></para> |
||
| 38 | bondari | 358 | |
| 39 | bondari | 359 | <para><!-- |
| 360 | |||
| 38 | bondari | 361 | TLB shootdown ASID/ASID:PAGE/ALL. |
| 362 | TLB shootdown requests can come in asynchroniously |
||
| 363 | so there is a cache of TLB shootdown requests. Upon cache overflow TLB shootdown ALL is executed |
||
| 364 | |||
| 365 | |||
| 366 | <para> |
||
| 367 | Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc). |
||
| 368 | Special address space area type - device - prohibits shrink/extend syscalls to call on it. |
||
| 369 | Address space has link to mapping tables (hierarchical - per Address space, hash - global tables). |
||
| 370 | </para> |
||
| 371 | |||
| 372 | --></para> |
||
| 35 | bondari | 373 | </section> |
| 374 | |||
| 375 | <section> |
||
| 376 | <title>Paging</title> |
||
| 377 | |||
| 378 | <para>Virtual memory is usually using paged memory model, where virtual |
||
| 379 | memory address space is divided into the <emphasis>pages</emphasis> |
||
| 380 | (usually having size 4096 bytes) and physical memory is divided into the |
||
| 39 | bondari | 381 | frames (same sized as a page, of course). Each page may be mapped to |
| 382 | some frame and then, upon memory access to the virtual address, CPU |
||
| 383 | performs <emphasis>address translation</emphasis> during the instruction |
||
| 35 | bondari | 384 | execution. Non-existing mapping generates page fault exception, calling |
| 385 | kernel exception handler, thus allowing kernel to manipulate rules of |
||
| 386 | memory access. Information for pages mapping is stored by kernel in the |
||
| 387 | <link linkend="page_tables">page tables</link></para> |
||
| 388 | |||
| 389 | <para>The majority of the architectures use multi-level page tables, |
||
| 390 | which means need to access physical memory several times before getting |
||
| 391 | physical address. This fact would make serios performance overhead in |
||
| 392 | virtual memory management. To avoid this <link linkend="tlb">Traslation |
||
| 393 | Lookaside Buffer (TLB)</link> is used.</para> |
||
| 394 | </section> |
||
| 395 | |||
| 396 | <section> |
||
| 11 | bondari | 397 | <title>Address spaces</title> |
| 9 | bondari | 398 | |
| 35 | bondari | 399 | <section> |
| 46 | bondari | 400 | <title>Address space areas</title> |
| 35 | bondari | 401 | |
| 46 | bondari | 402 | <para>Each address space consists of mutually disjunctive continuous |
| 403 | address space areas. Address space area is precisely defined by its |
||
| 47 | bondari | 404 | base address and the number of frames/pages is contains.</para> |
| 35 | bondari | 405 | |
| 47 | bondari | 406 | <para>Address space area , that define behaviour and permissions on |
| 407 | the particular area. <itemizedlist> |
||
| 46 | bondari | 408 | <listitem> |
| 409 | |||
| 410 | |||
| 411 | <emphasis>AS_AREA_READ</emphasis> |
||
| 412 | |||
| 413 | flag indicates reading permission. |
||
| 414 | </listitem> |
||
| 415 | |||
| 416 | <listitem> |
||
| 417 | |||
| 418 | |||
| 419 | <emphasis>AS_AREA_WRITE</emphasis> |
||
| 420 | |||
| 421 | flag indicates writing permission. |
||
| 422 | </listitem> |
||
| 423 | |||
| 424 | <listitem> |
||
| 425 | |||
| 426 | |||
| 427 | <emphasis>AS_AREA_EXEC</emphasis> |
||
| 428 | |||
| 429 | flag indicates code execution permission. Some architectures do not support execution persmission restriction. In this case this flag has no effect. |
||
| 430 | </listitem> |
||
| 431 | |||
| 432 | <listitem> |
||
| 433 | |||
| 434 | |||
| 435 | <emphasis>AS_AREA_DEVICE</emphasis> |
||
| 436 | |||
| 437 | marks area as mapped to the device memory. |
||
| 438 | </listitem> |
||
| 439 | </itemizedlist></para> |
||
| 440 | |||
| 441 | <para>Kernel provides possibility tasks create/expand/shrink/share its |
||
| 442 | address space via the set of syscalls.</para> |
||
| 35 | bondari | 443 | </section> |
| 444 | |||
| 445 | <section> |
||
| 446 | <title>Address Space ID (ASID)</title> |
||
| 447 | |||
| 46 | bondari | 448 | <para>When switching to the different task, kernel also require to |
| 449 | switch mappings to the different address space. In case TLB cannot |
||
| 47 | bondari | 450 | distinguish address space mappings, all mapping information in TLB |
| 451 | from the old address space must be flushed, which can create certain |
||
| 452 | uncessary overhead during the task switching. To avoid this, some |
||
| 453 | architectures have capability to segregate different address spaces on |
||
| 454 | hardware level introducing the address space identifier as a part of |
||
| 455 | TLB record, telling the virtual address space translation unit to |
||
| 456 | which address space this record is applicable.</para> |
||
| 35 | bondari | 457 | |
| 46 | bondari | 458 | <para>HelenOS kernel can take advantage of this hardware supported |
| 47 | bondari | 459 | identifier by having an ASID abstraction which is somehow related to |
| 460 | the corresponding architecture identifier. I.e. on ia64 kernel ASID is |
||
| 461 | derived from RID (region identifier) and on the mips32 kernel ASID is |
||
| 462 | actually the hardware identifier. As expected, this ASID information |
||
| 463 | record is the part of <emphasis>as_t</emphasis> structure.</para> |
||
| 35 | bondari | 464 | |
| 47 | bondari | 465 | <para>Due to the hardware limitations, hardware ASID has limited |
| 466 | length from 8 bits on ia64 to 24 bits on mips32, which makes it |
||
| 467 | impossible to use it as unique address space identifier for all tasks |
||
| 468 | running in the system. In such situations special ASID stealing |
||
| 469 | algoritm is used, which takes ASID from inactive task and assigns it |
||
| 470 | to the active task.</para> |
||
| 471 | |||
| 472 | <para><classname>ASID stealing algoritm here.</classname></para> |
||
| 35 | bondari | 473 | </section> |
| 9 | bondari | 474 | </section> |
| 475 | |||
| 476 | <section> |
||
| 11 | bondari | 477 | <title>Virtual address translation</title> |
| 9 | bondari | 478 | |
| 35 | bondari | 479 | <section id="page_tables"> |
| 480 | <title>Page tables</title> |
||
| 34 | bondari | 481 | |
| 35 | bondari | 482 | <para>HelenOS kernel has two different approaches to the paging |
| 483 | implementation: <emphasis>4 level page tables</emphasis> and |
||
| 484 | <emphasis>global hash tables</emphasis>, which are accessible via |
||
| 47 | bondari | 485 | generic paging abstraction layer. Such different functionality was |
| 486 | caused by the major architectural differences between supported |
||
| 487 | platforms. This abstraction is implemented with help of the global |
||
| 488 | structure of pointers to basic mapping functions |
||
| 489 | <emphasis>page_mapping_operations</emphasis>. To achieve different |
||
| 490 | functionality of page tables, corresponding layer must implement |
||
| 491 | functions, declared in |
||
| 492 | <emphasis>page_mapping_operations</emphasis></para> |
||
| 34 | bondari | 493 | |
| 35 | bondari | 494 | <formalpara> |
| 495 | <title>4-level page tables</title> |
||
| 34 | bondari | 496 | |
| 35 | bondari | 497 | <para>4-level page tables are the generalization of the hardware |
| 47 | bondari | 498 | capabilities of several architectures.<itemizedlist> |
| 35 | bondari | 499 | <listitem> |
| 500 | ia32 uses 2-level page tables, with full hardware support. |
||
| 501 | </listitem> |
||
| 34 | bondari | 502 | |
| 35 | bondari | 503 | <listitem> |
| 504 | amd64 uses 4-level page tables, also coming with full hardware support. |
||
| 505 | </listitem> |
||
| 506 | |||
| 507 | <listitem> |
||
| 508 | mips and ppc32 have 2-level tables, software simulated support. |
||
| 509 | </listitem> |
||
| 510 | </itemizedlist></para> |
||
| 511 | </formalpara> |
||
| 512 | |||
| 513 | <formalpara> |
||
| 514 | <title>Global hash tables</title> |
||
| 515 | |||
| 516 | <para>- global page hash table: existuje jen jedna v celem systemu |
||
| 517 | (vyuziva ji ia64), pozn. ia64 ma zatim vypnuty VHPT. Pouziva se |
||
| 46 | bondari | 518 | genericke hash table s oddelenymi collision chains. ASID support is |
| 519 | required to use global hash tables.</para> |
||
| 35 | bondari | 520 | </formalpara> |
| 521 | |||
| 522 | <para>Thanks to the abstract paging interface, there is possibility |
||
| 523 | left have more paging implementations, for example B-Tree page |
||
| 524 | tables.</para> |
||
| 525 | </section> |
||
| 526 | |||
| 527 | <section id="tlb"> |
||
| 54 | bondari | 528 | <title>Translation Lookaside buffer</title> |
| 35 | bondari | 529 | |
| 54 | bondari | 530 | <para>Due to the extensive overhead during the page mapping lookup in |
| 531 | the page tables, all architectures has fast assotiative cache memory |
||
| 532 | built-in CPU. This memory called TLB stores recently used page table |
||
| 533 | entries.</para> |
||
| 35 | bondari | 534 | |
| 54 | bondari | 535 | <section id="tlb_shootdown"> |
| 536 | <title>TLB consistency. TLB shootdown algorithm.</title> |
||
| 35 | bondari | 537 | |
| 54 | bondari | 538 | <para>Operating system is responsible for keeping TLB consistent by |
| 539 | invalidating the contents of TLB, whenever there is some change in |
||
| 540 | page tables. Those changes may occur when page or group of pages |
||
| 541 | were unmapped, mapping is changed or system switching active address |
||
| 542 | space to schedule a new system task (which is a batch unmap of all |
||
| 543 | address space mappings). Moreover, this invalidation operation must |
||
| 544 | be done an all system CPUs because each CPU has its own independent |
||
| 545 | TLB cache. Thus maintaining TLB consistency on SMP configuration as |
||
| 546 | not as trivial task as it looks at the first glance. Naive solution |
||
| 547 | would assume remote TLB invalidatation, which is not possible on the |
||
| 548 | most of the architectures, because of the simple fact - flushing TLB |
||
| 549 | is allowed only on the local CPU and there is no possibility to |
||
| 550 | access other CPUs' TLB caches.</para> |
||
| 551 | |||
| 552 | <para>Technique of remote invalidation of TLB entries is called "TLB |
||
| 553 | shootdown". HelenOS uses a variation of the algorithm described by |
||
| 554 | D. Black et al., "Translation Lookaside Buffer Consistency: A |
||
| 555 | Software Approach," Proc. Third Int'l Conf. Architectural Support |
||
| 556 | for Programming Languages and Operating Systems, 1989, pp. |
||
| 557 | 113-122.</para> |
||
| 558 | |||
| 559 | <para>As the situation demands, you will want partitial invalidation |
||
| 560 | of TLB caches. In case of simple memory mapping change it is |
||
| 561 | necessary to invalidate only one or more adjacent pages. In case if |
||
| 562 | the architecture is aware of ASIDs, during the address space |
||
| 563 | switching, kernel invalidates only entries from this particular |
||
| 564 | address space. Final option of the TLB invalidation is the complete |
||
| 565 | TLB cache invalidation, which is the operation that flushes all |
||
| 566 | entries in TLB.</para> |
||
| 567 | |||
| 568 | <para>TLB shootdown is performed in two phases. First, the initiator |
||
| 569 | process sends an IPI message indicating the TLB shootdown request to |
||
| 570 | the rest of the CPUs. Then, it waits until all CPUs confirm TLB |
||
| 571 | invalidating action execution.</para> |
||
| 572 | </section> |
||
| 35 | bondari | 573 | </section> |
| 574 | </section> |
||
| 46 | bondari | 575 | |
| 576 | <section> |
||
| 577 | <title>---</title> |
||
| 578 | |||
| 579 | <para>At the moment HelenOS does not support swapping.</para> |
||
| 580 | |||
| 581 | <para>- pouzivame vypadky stranky k alokaci ramcu on-demand v ramci |
||
| 582 | as_area - na architekturach, ktere to podporuji, podporujeme non-exec |
||
| 583 | stranky</para> |
||
| 584 | </section> |
||
| 26 | bondari | 585 | </section> |
| 11 | bondari | 586 | </chapter> |