            </figure></para>

          <formalpara>
            <title>Allocation / deallocation</title>

            <para>Upon an allocation request via the <code>frame_alloc()</code>
            function, the frame allocator first tries to find a zone that can
            satisfy the request (i.e. has the required amount of free frames).
            Once a suitable zone is found, the frame allocator uses the buddy
            allocator on the zone's buddy system to perform the allocation.
            During deallocation, which is triggered by a call to
            <code>frame_free()</code>, the frame allocator looks up the
            respective zone that contains the frame being deallocated.
            Afterwards, it calls the buddy allocator again, this time to take
            care of deallocation within the zone's buddy system.</para>
          </formalpara>
        </section>

        <title>Buddy allocator</title>

        <para>In the buddy system, the memory is broken down into power-of-two
        sized naturally aligned blocks. These blocks are organized in an array
        of lists, in which the list with index <emphasis>i</emphasis> contains
        all unallocated blocks of size
        <emphasis>2<superscript>i</superscript></emphasis>. The index
        <emphasis>i</emphasis> is called the order of the block. Should there
        be two adjacent equally sized blocks in the list
        <emphasis>i</emphasis> (i.e. buddies), the buddy allocator would
        coalesce them and put the resulting block in the list <emphasis>i +
        1</emphasis>, provided that the resulting block would be naturally
        aligned. Similarly, when the allocator is asked to allocate a block of
        size <emphasis>2<superscript>i</superscript></emphasis>, it first
        tries to satisfy the request from the list with index
        <emphasis>i</emphasis>. If the request cannot be satisfied (i.e. the
        list <emphasis>i</emphasis> is empty), the buddy allocator will try to
        allocate and split a larger block from the list with index
        <emphasis>i + 1</emphasis>. Both of these algorithms are recursive.
        The recursion ends either when there are no blocks to coalesce in the
        former case or when there are no blocks that can be split in the
        latter case.</para>

        <para>This approach greatly reduces external fragmentation of memory and
        helps in allocating bigger continuous blocks of memory aligned to their
        size. On the other hand, the buddy allocator suffers increased internal
        fragmentation of memory and is not suitable for general kernel
        <title>Implementation</title>

        <para>The slab allocator is closely modelled after the OpenSolaris slab
        allocator by Jeff Bonwick and Jonathan Adams <xref
        linkend="Bonwick01" /> with the following exceptions:<itemizedlist>
            <listitem>
              <para>empty slabs are immediately deallocated and</para>
            </listitem>

            <listitem>
              <para>empty magazines are deallocated when not needed.</para>
            </listitem>
          </itemizedlist>The following features are not currently supported
      <section>
        <title>Allocation/deallocation</title>

        <para>The following two paragraphs summarize and complete the
        description of the slab allocator operation (i.e. the
        <code>slab_alloc()</code> and <code>slab_free()</code>
        functions).</para>

        <formalpara>
          <title>Allocation</title>

          <para><emphasis>Step 1.</emphasis> When an allocation request
      </section>

    <section>
      <title>Virtual memory management</title>

      <para>Virtual memory is essential for an operating system because it
      makes several things possible. First, it helps to isolate tasks from
      each other by encapsulating them in their private address spaces.
      Second, virtual memory can give tasks the illusion of having more memory
      available than is physically present. And third, by using virtual
      memory, there might be multiple copies of the same program, linked to
      the same addresses, running in the system. There are at least two known
      mechanisms for implementing virtual memory: segmentation and paging.
      Even though some processor architectures supported by HelenOS<footnote>
          <para>ia32 has full-fledged segmentation.</para>
        </footnote> provide both mechanisms, the kernel makes use solely of
      paging.</para>

      <section id="paging">
        <title>VAT subsystem</title>

        <para>In paged virtual memory, the entire virtual address space is
        divided into small power-of-two sized naturally aligned blocks called
        pages. The processor implements a translation mechanism that allows
        the operating system to manage mappings between the set of pages and
        the set of identically sized and identically aligned pieces of
        physical memory called frames. As a result, references to continuous
        virtual memory areas don't necessarily need to reference a continuous
        area of physical memory. Supported page sizes usually range from
        several kilobytes to several megabytes. Each page that takes part in
        the mapping is associated with certain attributes that further
        describe the mapping (e.g. access rights, dirty and accessed bits and
        the present bit).</para>

        <para>When the processor accesses a page that is not present (i.e. its
        present bit is not set), the operating system is notified through a
        special exception called a page fault. It is then up to the operating
        system to service the page fault. In HelenOS, some page faults are
        fatal and result in either task termination or, in the worst case,
        kernel panic<footnote>
            <para>Such a condition would be caused either by a hardware
            failure or a bug in the kernel.</para>
          </footnote>, while other page faults are used to load memory on
        demand or to notify the kernel about certain events.</para>

        <indexterm>
          <primary>page tables</primary>
        </indexterm>

        <para>The set of all page mappings is stored in a memory structure
        called page tables. Some architectures have no hardware support for
        page tables<footnote>
            <para>On mips32, a TLB-only model is used and the operating
            system is responsible for managing software defined page
            tables.</para>
          </footnote> while other processor architectures<footnote>
            <para>Like amd64 and ia32.</para>
          </footnote> understand their in-memory format directly. Despite all
        the possible differences in page table formats, the HelenOS VAT
        subsystem<footnote>
            <para>Virtual Address Translation subsystem.</para>
          </footnote> unifies the page table operations under one programming
        interface. For all parts of the kernel, three basic functions are
        provided:</para>

        <itemizedlist>
          <listitem>
            <para><code>page_mapping_insert()</code>,</para>
          </listitem>

          <listitem>
            <para><code>page_mapping_find()</code> and</para>
          </listitem>

          <listitem>
            <para><code>page_mapping_remove()</code>.</para>
          </listitem>
        </itemizedlist>

        <para>The <code>page_mapping_insert()</code> function is used to
        introduce a mapping for one virtual memory page belonging to a
        particular address space into the page tables. Once the mapping is in
        the page tables, it can be searched for by
        <code>page_mapping_find()</code> and removed by
        <code>page_mapping_remove()</code>. All of these functions internally
        select the page table mechanism specific functions that carry out the
        operation itself.</para>

        <para>There are currently two supported mechanisms: generic 4-level
        hierarchical page tables and the global page hash table. Both
        mechanisms are generic in that each of them covers several hardware
        platforms. For instance, the 4-level hierarchical page table mechanism
        is used by amd64, ia32, mips32 and ppc32, even though these
        architectures have different native page table formats: 4-level,
        2-level, TLB-only and hardware hash table, respectively. On the other
        hand, the global page hash table is used on ia64, which can be
        TLB-only or use a hardware hash table. Although only two mechanisms
        are currently implemented, other mechanisms (e.g. B+tree) can be
        easily added.</para>

        <section id="page_tables">
          <indexterm>
            <primary>page tables</primary>

            <secondary>- hierarchical</secondary>
          </indexterm>

          <title>Hierarchical 4-level page tables</title>

          <para>Hierarchical 4-level page tables are a generalization of the
          frequently used hierarchical model of page tables. In this
          mechanism, each address space has its own page tables. To avoid
          confusion in terminology used by hardware vendors, in HelenOS, the
          root level page table is called PTL0, the two middle levels are
          called PTL1 and PTL2, and, finally, the leaf level is called PTL3.
          All architectures using this mechanism are required to use PTL0 and
          PTL3. However, the middle levels can be left out, depending on the
          hardware hierarchy or structure of software-only page tables. The
          genericity is achieved through a set of macros that define
          transitions from one level to another. Unused levels are optimised
          out by the compiler.</para>
        </section>

        <section>
          <indexterm>
            <primary>page tables</primary>

            <secondary>- hashing</secondary>
          </indexterm>

          <title>Global page hash table</title>

          <para>Implementation of the global page hash table was encouraged by
          64-bit architectures that can have rather sparse address spaces. The
          hash table contains valid mappings only. Each entry of the hash
          table contains an address space pointer, virtual memory page number
          (VPN), physical memory frame number (PFN) and a set of flags. The
          pair of the address space pointer and the virtual memory page number
          is used as a key for the hash table. One of the major differences
          between the global page hash table and hierarchical 4-level page
          tables is that there is only a single global page hash table in the
          system while hierarchical page tables exist per address space. Thus,
          the global page hash table contains information about mappings of
          all address spaces in the system.</para>

          <para>The global page hash table mechanism uses the generic hash
          table type as described in the chapter about <link
          linkend="hashtables">data structures</link> earlier in this
          book.</para>
        </section>
      </section>
    </section>

    <section id="tlb">
      <indexterm>
        <primary>TLB</primary>
      </indexterm>

      <title>Translation Lookaside Buffer</title>

      <para>Because the page mapping lookup in the page tables incurs
      considerable overhead, all supported architectures have a fast
      associative cache built into the CPU. This memory, called the
      Translation Lookaside Buffer (TLB), stores recently used page table
      entries.</para>

      <section id="tlb_shootdown">
        <indexterm>
          <primary>TLB</primary>

          <secondary>- TLB shootdown</secondary>
        </indexterm>

        <title>TLB consistency and the TLB shootdown algorithm</title>

        <para>The operating system is responsible for keeping the TLB
        consistent by invalidating its contents whenever the page tables
        change. Such changes occur when a page or a group of pages is
        unmapped, when a mapping is modified, or when the system switches the
        active address space in order to schedule a new task. Moreover, the
        invalidation must be carried out on all CPUs in the system, because
        each CPU has its own independent TLB. Maintaining TLB consistency on
        SMP configurations is therefore not as trivial as it might look at
        first glance. A naive solution would have the CPU that wants to
        invalidate the TLB also invalidate the TLBs of the other CPUs. This is
        impossible on most architectures for a simple reason: flushing the TLB
        is allowed only on the local CPU, so there is no way to access other
        CPUs' TLBs and invalidate them remotely.</para>

        <para>The technique of remote invalidation of TLB entries is called
        "TLB shootdown". HelenOS uses a variation of the algorithm described
        by D. Black et al., "Translation Lookaside Buffer Consistency: A
        Software Approach," Proc. Third Int'l Conf. Architectural Support for
        Programming Languages and Operating Systems, 1989, pp. 113-122. <xref
        linkend="Black89" /></para>

        <para>Depending on the situation, partial invalidation of the TLB may
        be sufficient. After a simple change of a memory mapping, it is
        necessary to invalidate only one or several adjacent pages. If the
        architecture is aware of ASIDs and the kernel needs to reclaim an ASID
        for use by another task, it invalidates only the entries belonging to
        that particular address space. The final option is complete TLB
        invalidation, which flushes all entries in the TLB.</para>

        <para>TLB shootdown is performed in two phases.</para>

        <formalpara>
          <title>Phase 1.</title>

          <para>First, the initiator locks the global TLB spinlock. It then
          puts a request into the local request cache of every other CPU in
          the system, each cache being protected by its own spinlock. If a
          cache is full, all requests in it are replaced by a single request
          indicating a global TLB flush. The initiator then sends an IPI
          announcing the TLB shootdown request to the rest of the CPUs and
          actively waits until all of them confirm, by setting a special flag,
          that they are about to execute the invalidation. After setting the
          flag, each of these CPUs blocks on the TLB spinlock held by the
          initiator.</para>
        </formalpara>

        <formalpara>
          <title>Phase 2.</title>

          <para>All CPUs now wait on the TLB spinlock to execute the TLB
          invalidation, having indicated their intention to the initiator. The
          initiator finishes by cleaning up its own TLB and releasing the
          global TLB spinlock. After that, each of the other CPUs acquires and
          immediately releases the TLB spinlock and performs its TLB
          invalidation action.</para>
        </formalpara>
      </section>

      <section>
        <title>Address spaces</title>

          <para>
            <classname>ASID stealing algorithm here.</classname>
          </para>
        </section>
      </section>
    </section>
</chapter>