<formalpara> |
<title>Allocation / deallocation</title> |
|
<para>Upon allocation request via function <code>frame_alloc()</code>, |
the frame allocator first tries to find a zone that can satisfy the |
request (i.e. has the required amount of free frames). Once a suitable |
zone is found, the frame allocator uses the buddy allocator on the |
zone's buddy system to perform the allocation. During deallocation, |
which is triggered by a call to <code>frame_free()</code>, the frame |
allocator looks up the respective zone that contains the frame being |
deallocated. Afterwards, it calls the buddy allocator again, this time |
to take care of deallocation within the zone's buddy system.</para> |
|
<para>In the buddy system, the memory is broken down into power-of-two |
sized naturally aligned blocks. These blocks are organized in an array |
of lists, in which the list with index <emphasis>i</emphasis> contains |
all unallocated blocks of size |
<emphasis>2<superscript>i</superscript></emphasis>. The index |
      <emphasis>i</emphasis> is called the order of the block. Should there be two
adjacent equally sized blocks in the list <emphasis>i</emphasis> (i.e. |
buddies), the buddy allocator would coalesce them and put the resulting |
block in list <emphasis>i + 1</emphasis>, provided that the resulting |
      block would be naturally aligned. Similarly, when the allocator is
asked to allocate a block of size |
<emphasis>2<superscript>i</superscript></emphasis>, it first tries to |
satisfy the request from the list with index <emphasis>i</emphasis>. If |
the request cannot be satisfied (i.e. the list <emphasis>i</emphasis> is |
empty), the buddy allocator will try to allocate and split a larger |
block from the list with index <emphasis>i + 1</emphasis>. Both of these |
algorithms are recursive. The recursion ends either when there are no |
blocks to coalesce in the former case or when there are no blocks that |
can be split in the latter case.</para> |
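<para>The split and coalesce rules above can be sketched in C. The following
is a hypothetical toy model of one zone's buddy system; the names and the
array-backed free lists are illustrative, not the HelenOS implementation:</para>

```c
/* A toy buddy system over one 2^MAX_ORDER-frame zone -- a hypothetical
 * sketch of the split/coalesce algorithm described above, not HelenOS code. */
#define MAX_ORDER 5                     /* largest block: 2^5 = 32 frames */
#define ZONE_FRAMES (1u << MAX_ORDER)

/* free_list[i] holds start frames of free, naturally aligned 2^i-frame
 * blocks; free_count[i] is the length of list i. */
static unsigned free_list[MAX_ORDER + 1][ZONE_FRAMES];
static int free_count[MAX_ORDER + 1];

static void push(int order, unsigned block)
{
    free_list[order][free_count[order]++] = block;
}

/* Allocate a 2^order block: take one from list `order`, or recursively
 * split a block taken from list `order + 1`. Returns first frame or -1. */
int buddy_alloc(int order)
{
    if (free_count[order] > 0)
        return (int) free_list[order][--free_count[order]];
    if (order == MAX_ORDER)
        return -1;                      /* nothing left to split */
    int big = buddy_alloc(order + 1);
    if (big < 0)
        return -1;
    /* split: keep the lower half, return the upper half to list `order` */
    push(order, (unsigned) big + (1u << order));
    return big;
}

/* Free a 2^order block: while the buddy (start frame differing only in
 * bit `order`) is also free, coalesce into a naturally aligned bigger block. */
void buddy_free(unsigned block, int order)
{
    while (order < MAX_ORDER) {
        unsigned buddy = block ^ (1u << order);
        int i, found = -1;
        for (i = 0; i < free_count[order]; i++)
            if (free_list[order][i] == buddy)
                found = i;
        if (found < 0)
            break;                      /* buddy busy: stop coalescing */
        free_list[order][found] = free_list[order][--free_count[order]];
        block &= ~(1u << order);        /* merged block starts at aligned half */
        order++;
    }
    push(order, block);
}
```

<para>Note that freeing the last allocated block lets the coalescing loop run
all the way up, so a fully freed zone collapses back into a single block of
the maximum order.</para>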
|
    <para>This approach greatly reduces external fragmentation of memory and
    helps in allocating larger contiguous blocks of memory aligned to their
    <para>The slab allocator is closely modelled after the OpenSolaris slab
    allocator by Jeff Bonwick and Jonathan Adams <xref
linkend="Bonwick01" /> with the following exceptions:<itemizedlist> |
<listitem> |
<para>empty slabs are immediately deallocated and</para> |
</listitem> |
|
<listitem> |
<para>empty magazines are deallocated when not needed.</para> |
|
<para>The following two paragraphs summarize and complete the |
description of the slab allocator operation (i.e. |
<code>slab_alloc()</code> and <code>slab_free()</code> |
functions).</para> |
|
<formalpara> |
<title>Allocation</title> |
<section> |
<title>Virtual memory management</title> |
|
<section> |
<title>Introduction</title> |
      <para>Virtual memory is essential for an operating system because it makes
      several things possible. First, it helps to isolate tasks from each other
      by encapsulating them in their private address spaces. Second, virtual
      memory can give tasks the illusion of having more memory available than
      the machine physically provides. And third, thanks to virtual memory,
      multiple copies of the same program, linked to the same addresses, can run
      in the system at once. There are at least two known mechanisms for
      implementing virtual memory: segmentation and paging. Even though some
      processor architectures supported by HelenOS<footnote>
          <para>ia32 has full-fledged segmentation.</para>
        </footnote> provide both mechanisms, the kernel makes use solely of
      paging.</para>
|
    <section id="paging">
      <title>VAT subsystem</title>

      <para>In a paged virtual memory, the entire virtual address space is
      divided into small power-of-two sized naturally aligned blocks called
      pages. The processor implements a translation mechanism that allows the
      operating system to manage mappings between the set of pages and the set
      of identically sized and identically aligned pieces of physical memory
      called frames. As a result, references to contiguous virtual memory
      areas don't necessarily need to reference contiguous areas of physical
      memory. Supported page sizes usually range from several kilobytes to
      several megabytes. Each page that takes part in the mapping is
      associated with certain attributes that further describe the mapping
      (e.g. access rights, dirty and accessed bits and present bit).</para>
<para>When the processor accesses a page that is not present (i.e. its |
present bit is not set), the operating system is notified through a |
special exception called page fault. It is then up to the operating |
system to service the page fault. In HelenOS, some page faults are fatal |
      and result in either task termination or, in the worst case, kernel
panic<footnote> |
<para>Such a condition would be either caused by a hardware failure |
or a bug in the kernel.</para> |
</footnote>, while other page faults are used to load memory on demand |
or to notify the kernel about certain events.</para> |
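<para>The fatal-versus-serviceable decision can be illustrated by a small
sketch. The names here (<code>as_area_find()</code>, the area structure) are
hypothetical stand-ins for the kernel's real address space bookkeeping, not
the HelenOS page fault handler:</para>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the page fault decision described above. */
#define PAGE_SIZE 4096

typedef enum { PF_OK, PF_FATAL } pf_result_t;

typedef struct {
    uintptr_t base;     /* start of the area, page aligned */
    size_t pages;       /* length of the area in pages */
} as_area_t;

/* Does any area of the faulting task's address space cover `addr`? */
static const as_area_t *as_area_find(const as_area_t *areas, size_t n,
    uintptr_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= areas[i].base &&
            addr < areas[i].base + areas[i].pages * PAGE_SIZE)
            return &areas[i];
    return NULL;
}

/* A fault is fatal unless the address lies in a known area, in which case
 * the kernel would allocate a frame and insert the mapping (demand load). */
pf_result_t page_fault(const as_area_t *areas, size_t n, uintptr_t fault_addr)
{
    const as_area_t *area = as_area_find(areas, n, fault_addr);
    if (area == NULL)
        return PF_FATAL;    /* e.g. terminate the offending task */
    /* demand load: map a fresh frame at fault_addr here */
    return PF_OK;
}
```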
|
      <indexterm>
        <primary>page tables</primary>
      </indexterm>
<para>The set of all page mappings is stored in a memory structure |
called page tables. Some architectures have no hardware support for page |
tables<footnote> |
<para>On mips32, TLB-only model is used and the operating system is |
responsible for managing software defined page tables.</para> |
</footnote> while other processor architectures<footnote> |
<para>Like amd64 and ia32.</para> |
        </footnote> natively understand their whole in-memory format. Despite all
the possible differences in page table formats, the HelenOS VAT |
subsystem<footnote> |
<para>Virtual Address Translation subsystem.</para> |
</footnote> unifies the page table operations under one programming |
interface. For all parts of the kernel, three basic functions are |
provided:</para> |
|
<itemizedlist> |
<listitem> |
<para><code>page_mapping_insert()</code>,</para> |
</listitem> |
|
<listitem> |
<para><code>page_mapping_find()</code> and</para> |
</listitem> |
|
<listitem> |
<para><code>page_mapping_remove()</code>.</para> |
</listitem> |
</itemizedlist> |
|
<para>The <code>page_mapping_insert()</code> function is used to |
introduce a mapping for one virtual memory page belonging to a |
particular address space into the page tables. Once the mapping is in |
the page tables, it can be searched by <code>page_mapping_find()</code> |
and removed by <code>page_mapping_remove()</code>. All of these |
functions internally select the page table mechanism specific functions |
      that carry out the actual operation.</para>
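<para>The dispatch behind this interface can be sketched as a structure of
function pointers selected at boot time. The types below and the linear-table
demo backend are illustrative assumptions, not the actual HelenOS
definitions:</para>

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a unified page mapping interface dispatching to a
 * mechanism-specific backend; illustrative, not HelenOS code. */
typedef struct {
    uintptr_t frame;
    unsigned flags;
    int used;
} pte_t;

typedef struct as as_t;     /* opaque address space handle */

typedef struct {
    void (*mapping_insert)(as_t *as, uintptr_t page, uintptr_t frame,
        unsigned flags);
    pte_t *(*mapping_find)(as_t *as, uintptr_t page);
    void (*mapping_remove)(as_t *as, uintptr_t page);
} page_mapping_operations_t;

/* selected at boot according to the platform's page table mechanism */
static const page_mapping_operations_t *page_mapping_operations;

void page_mapping_insert(as_t *as, uintptr_t page, uintptr_t frame,
    unsigned flags)
{
    page_mapping_operations->mapping_insert(as, page, frame, flags);
}

pte_t *page_mapping_find(as_t *as, uintptr_t page)
{
    return page_mapping_operations->mapping_find(as, page);
}

void page_mapping_remove(as_t *as, uintptr_t page)
{
    page_mapping_operations->mapping_remove(as, page);
}

/* --- trivial demo backend: a small linear table instead of real PTLs --- */
#define DEMO_SLOTS 16
static uintptr_t demo_pages[DEMO_SLOTS];
static pte_t demo_ptes[DEMO_SLOTS];

static void demo_insert(as_t *as, uintptr_t page, uintptr_t frame,
    unsigned flags)
{
    (void) as;
    for (int i = 0; i < DEMO_SLOTS; i++)
        if (!demo_ptes[i].used) {
            demo_pages[i] = page;
            demo_ptes[i] = (pte_t) { frame, flags, 1 };
            return;
        }
}

static pte_t *demo_find(as_t *as, uintptr_t page)
{
    (void) as;
    for (int i = 0; i < DEMO_SLOTS; i++)
        if (demo_ptes[i].used && demo_pages[i] == page)
            return &demo_ptes[i];
    return NULL;
}

static void demo_remove(as_t *as, uintptr_t page)
{
    pte_t *pte = demo_find(as, page);
    if (pte)
        pte->used = 0;
}

static const page_mapping_operations_t demo_ops = {
    demo_insert, demo_find, demo_remove
};
```

<para>Swapping the backend structure is the only change needed to move from,
say, hierarchical page tables to a hash table; callers of the three public
functions are unaffected.</para>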
|
<para>There are currently two supported mechanisms: generic 4-level |
hierarchical page tables and global page hash table. Both of the |
mechanisms are generic as they cover several hardware platforms. For |
      instance, the 4-level hierarchical page table mechanism is used by
      amd64, ia32, mips32 and ppc32. These architectures have the following
      native page table formats: 4-level, 2-level, TLB-only and hardware
      hash table, respectively. On the other hand, the global page hash table
      is used on ia64, which can either be TLB-only or use a hardware hash table.
Although only two mechanisms are currently implemented, other mechanisms |
(e.g. B+tree) can be easily added.</para> |
|
<section id="page_tables"> |
<indexterm> |
<primary>page tables</primary> |
|
<secondary>- hierarchical</secondary> |
</indexterm> |
|
<title>Hierarchical 4-level page tables</title> |
|
        <para>Hierarchical 4-level page tables are a generalization of the
frequently used hierarchical model of page tables. In this mechanism, |
each address space has its own page tables. To avoid confusion in |
terminology used by hardware vendors, in HelenOS, the root level page |
table is called PTL0, the two middle levels are called PTL1 and PTL2, |
and, finally, the leaf level is called PTL3. All architectures using |
this mechanism are required to use PTL0 and PTL3. However, the middle |
        levels can be left out, depending on the hardware hierarchy or
structure of software-only page tables. The genericity is achieved |
through a set of macros that define transitions from one level to |
another. Unused levels are optimised out by the compiler.</para> |
</section> |
|
<section> |
<indexterm> |
<primary>page tables</primary> |
|
<secondary>- hashing</secondary> |
</indexterm> |
|
<title>Global page hash table</title> |
|
<para>Implementation of the global page hash table was encouraged by |
64-bit architectures that can have rather sparse address spaces. The |
hash table contains valid mappings only. Each entry of the hash table |
contains an address space pointer, virtual memory page number (VPN), |
physical memory frame number (PFN) and a set of flags. The pair of the |
address space pointer and the virtual memory page number is used as a |
key for the hash table. One of the major differences between the |
global page hash table and hierarchical 4-level page tables is that |
there is only a single global page hash table in the system while |
hierarchical page tables exist per address space. Thus, the global |
page hash table contains information about mappings of all address |
spaces in the system. </para> |
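<para>A minimal sketch of such a table might look as follows. The entry
layout matches the description above (address space pointer, VPN, PFN,
flags), while the chaining scheme, hash function and sizes are illustrative
assumptions rather than the HelenOS implementation:</para>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch of a single global page hash table keyed by the
 * pair (address space pointer, virtual page number). */
#define HASH_BUCKETS 1024

typedef struct as as_t;                 /* opaque address space handle */

typedef struct pht_entry {
    as_t *as;                           /* key, part 1 */
    uintptr_t vpn;                      /* key, part 2: virtual page number */
    uintptr_t pfn;                      /* value: physical frame number */
    unsigned flags;
    struct pht_entry *next;             /* collision chain */
} pht_entry_t;

static pht_entry_t *page_ht[HASH_BUCKETS];

/* low VPN bits spread well; mix in the address space pointer too */
static size_t pht_hash(const as_t *as, uintptr_t vpn)
{
    return (vpn ^ ((uintptr_t) as >> 4)) % HASH_BUCKETS;
}

void pht_insert(as_t *as, uintptr_t vpn, uintptr_t pfn, unsigned flags)
{
    pht_entry_t *e = malloc(sizeof(*e));
    *e = (pht_entry_t) { as, vpn, pfn, flags, page_ht[pht_hash(as, vpn)] };
    page_ht[pht_hash(as, vpn)] = e;
}

pht_entry_t *pht_find(as_t *as, uintptr_t vpn)
{
    for (pht_entry_t *e = page_ht[pht_hash(as, vpn)]; e; e = e->next)
        if (e->as == as && e->vpn == vpn)   /* both key parts must match */
            return e;
    return NULL;
}
```

<para>Because the address space pointer is part of the key, two tasks can map
the same virtual page number to different frames without colliding
semantically, even if the entries share a bucket.</para>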
|
<para>The global page hash table mechanism uses the generic hash table |
type as described in the chapter about <link linkend="hashtables">data |
structures</link> earlier in this book.</para> |
</section> |
</section> |
</section> |
|
<section id="tlb"> |
<indexterm> |
<primary>TLB</primary> |
</indexterm> |
|
    <title>Translation Lookaside Buffer</title>
|
    <para>Because the page mapping lookup in the page tables incurs extensive
    overhead, all architectures have a fast associative cache memory built
    into the CPU. This memory, called the TLB, stores recently used page
    table entries.</para>
|
<section id="tlb_shootdown"> |
<indexterm> |
<primary>TLB</primary> |
|
<secondary>- TLB shootdown</secondary> |
</indexterm> |
|
      <title>TLB consistency and the TLB shootdown algorithm</title>
|
      <para>The operating system is responsible for keeping the TLB consistent
      by invalidating its contents whenever there is a change in the page
      tables. Such changes occur when a page or a group of pages is unmapped,
      when a mapping is changed, or when the system switches the active
      address space in order to schedule a new task. Moreover, this
      invalidation must be carried out on all CPUs in the system because each
      CPU has its own independent TLB. Maintaining TLB consistency in SMP
      configurations is thus not as trivial a task as it might seem at first
      glance. A naive solution would have the CPU that wants to invalidate the
      TLB also invalidate the TLB caches of the other CPUs. This is impossible
      on most architectures for a simple reason: flushing the TLB is allowed
      only on the local CPU and there is no way to access other CPUs' TLB
      caches and thus invalidate them remotely.</para>
|
      <para>The technique of invalidating TLB entries remotely is called "TLB
      shootdown". HelenOS uses a variation of the algorithm described by D.
Black et al., "Translation Lookaside Buffer Consistency: A Software |
Approach," Proc. Third Int'l Conf. Architectural Support for Programming |
Languages and Operating Systems, 1989, pp. 113-122. <xref |
linkend="Black89" /></para> |
|
      <para>Depending on the situation, partial invalidation of the TLB may be
      all that is needed. In the case of a simple memory mapping change, it is
      necessary to invalidate only the one or more adjacent pages affected. If
      the architecture is aware of ASIDs and the kernel needs to reclaim an
      ASID for use by another task, it invalidates only the entries belonging
      to that particular address space. The final option is complete TLB
      invalidation, an operation that flushes all entries in the TLB.</para>
|
<para>TLB shootdown is performed in two phases.</para> |
|
<formalpara> |
<title>Phase 1.</title> |
|
        <para>First, the initiator locks a global TLB spinlock. It then puts a
        request into the local request cache of every other CPU in the system,
        each such cache being protected by its own spinlock. If a cache is
        full, all requests in it are replaced by a single request indicating a
        global TLB flush. The initiator then sends an IPI message announcing
        the TLB shootdown request to the rest of the CPUs and actively waits
        until every CPU confirms the intention to carry out the invalidation
        by setting a special flag. After setting this flag, each of these CPUs
        blocks on the TLB spinlock, which is held by the initiator.</para>
</formalpara> |
|
<formalpara> |
<title>Phase 2.</title> |
|
        <para>At this point, all CPUs are waiting on the TLB spinlock to
        execute the TLB invalidation action and have indicated their intention
        to the initiator. The initiator proceeds to clean up its own TLB and
        releases the global TLB spinlock. Each of the other CPUs then acquires
        and immediately releases the TLB spinlock and performs its TLB
        invalidation actions.</para>
</formalpara> |
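<para>The two phases can be modelled by a single-threaded simulation. In
reality the per-CPU work runs in IPI handlers and the handshake uses the
global TLB spinlock and per-CPU flags; every name below is an illustrative
assumption, not the HelenOS implementation:</para>

```c
#include <stdbool.h>

/* Single-threaded simulation of the two-phase TLB shootdown above. */
#define CPUS 4
#define CACHE_SIZE 2

typedef enum { TLB_INVL_PAGE, TLB_INVL_ALL } tlb_invl_type_t;

typedef struct {
    tlb_invl_type_t requests[CACHE_SIZE];   /* local request cache */
    int count;
    bool acked;         /* CPU has seen the IPI (phase 1 handshake) */
    bool flushed;       /* CPU has performed the invalidation (phase 2) */
} cpu_t;

static cpu_t cpus[CPUS];

/* Phase 1: the initiator queues the request on every other CPU; a full
 * cache degenerates into one "flush everything" request. */
static void shootdown_start(int initiator, tlb_invl_type_t type)
{
    for (int c = 0; c < CPUS; c++) {
        if (c == initiator)
            continue;
        if (cpus[c].count == CACHE_SIZE) {
            cpus[c].count = 0;
            type = TLB_INVL_ALL;
        }
        cpus[c].requests[cpus[c].count++] = type;
        cpus[c].acked = true;   /* stands in for the IPI flag handshake */
    }
}

/* Phase 2: the initiator flushes its own TLB and drops the lock; the other
 * CPUs then drain their request caches and invalidate. */
static void shootdown_finalize(int initiator)
{
    cpus[initiator].flushed = true;
    for (int c = 0; c < CPUS; c++)
        if (c != initiator && cpus[c].acked) {
            cpus[c].count = 0;  /* process the queued invalidations */
            cpus[c].flushed = true;
        }
}
```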
</section> |
|
<section> |
<title>Address spaces</title> |
|
</para> |
</section> |
</section> |
|
<section id="paging"> |
<title>Virtual address translation</title> |
|
<section> |
<title>Introduction</title> |
|
<para>Virtual memory is usually using paged memory model, where |
virtual memory address space is divided into the |
<emphasis>pages</emphasis> (usually having size 4096 bytes) and |
physical memory is divided into the frames (same sized as a page, of |
course). Each page may be mapped to some frame and then, upon memory |
access to the virtual address, CPU performs <emphasis>address |
translation</emphasis> during the instruction execution. Non-existing |
mapping generates page fault exception, calling kernel exception |
handler, thus allowing kernel to manipulate rules of memory access. |
Information for pages mapping is stored by kernel in the <link |
linkend="page_tables">page tables</link></para> |
|
<indexterm> |
<primary>page tables</primary> |
</indexterm> |
|
<para>The majority of the architectures use multi-level page tables, |
which means need to access physical memory several times before |
getting physical address. This fact would make serios performance |
overhead in virtual memory management. To avoid this <link |
linkend="tlb">Traslation Lookaside Buffer (TLB)</link> is used.</para> |
|
<para>HelenOS kernel has two different approaches to the paging |
implementation: <emphasis>4 level page tables</emphasis> and |
<emphasis>global hash table</emphasis>, which are accessible via |
generic paging abstraction layer. Such different functionality was |
caused by the major architectural differences between supported |
platforms. This abstraction is implemented with help of the global |
structure of pointers to basic mapping functions |
<emphasis>page_mapping_operations</emphasis>. To achieve different |
functionality of page tables, corresponding layer must implement |
functions, declared in |
<emphasis>page_mapping_operations</emphasis></para> |
|
<para>Thanks to the abstract paging interface, there was a place left |
for more paging implementations (besides already implemented |
hieararchical page tables and hash table), for example <indexterm> |
<primary>B-tree</primary> |
</indexterm> B-Tree based page tables.</para> |
</section> |
|
<section id="page_tables"> |
<indexterm> |
<primary>page tables</primary> |
|
<secondary>- hierarchical</secondary> |
</indexterm> |
|
<title>Hierarchical 4-level page tables</title> |
|
<para>Hierarchical 4-level page tables are the generalization of the |
hardware capabilities of most architectures. Each address space has |
its own page tables.<itemizedlist> |
<listitem>ia32 uses 2-level page tables, with full hardware |
support.</listitem> |
|
<listitem>amd64 uses 4-level page tables, also coming with full |
hardware support.</listitem> |
|
<listitem>mips and ppc32 have 2-level tables, software simulated |
support.</listitem> |
</itemizedlist></para> |
</section> |
|
<section> |
<indexterm> |
<primary>page tables</primary> |
|
<secondary>- hashing</secondary> |
</indexterm> |
|
<title>Global hash table</title> |
|
<para>Implementation of the global hash table was encouraged by the |
ia64 architecture support. One of the major differences between global |
hash table and hierarchical tables is that global hash table exists |
only once in the system and the hierarchical tables are maintained per |
address space.</para> |
|
<para>Thus, hash table contains information about all address spaces |
mappings in the system, so, the hash of an entry must contain |
information of both address space pointer or id and the virtual |
address of the page. Generic hash table implementation assumes that |
the addresses of the pointers to the address spaces are likely to be |
on the close addresses, so it uses least significant bits for hash; |
also it assumes that the virtual page addresses have roughly the same |
probability of occurring, so the least significant bits of VPN compose |
the hash index.</para> |
|
<para>Paging hash table uses generic hash table with collision chains |
(see the <link linkend="hashtables">Data Structures</link> chapter of |
this manual for details).</para> |
</section> |
</section> |
|
<section id="tlb"> |
<indexterm> |
<primary>TLB</primary> |
</indexterm> |
|
<title>Translation Lookaside buffer</title> |
|
<para>Due to the extensive overhead during the page mapping lookup in |
the page tables, all architectures has fast assotiative cache memory |
built-in CPU. This memory called TLB stores recently used page table |
entries.</para> |
|
<section id="tlb_shootdown"> |
<indexterm> |
<primary>TLB</primary> |
|
<secondary>- TLB shootdown</secondary> |
</indexterm> |
|
<title>TLB consistency. TLB shootdown algorithm.</title> |
|
<para>Operating system is responsible for keeping TLB consistent by |
invalidating the contents of TLB, whenever there is some change in |
page tables. Those changes may occur when page or group of pages were |
unmapped, mapping is changed or system switching active address space |
to schedule a new system task. Moreover, this invalidation operation |
must be done an all system CPUs because each CPU has its own |
independent TLB cache. Thus maintaining TLB consistency on SMP |
configuration as not as trivial task as it looks on the first glance. |
Naive solution would assume that is the CPU which wants to invalidate |
TLB will invalidate TLB caches on other CPUs. It is not possible on |
the most of the architectures, because of the simple fact - flushing |
TLB is allowed only on the local CPU and there is no possibility to |
access other CPUs' TLB caches, thus invalidate TLB remotely.</para> |
|
<para>Technique of remote invalidation of TLB entries is called "TLB |
shootdown". HelenOS uses a variation of the algorithm described by D. |
Black et al., "Translation Lookaside Buffer Consistency: A Software |
Approach," Proc. Third Int'l Conf. Architectural Support for |
Programming Languages and Operating Systems, 1989, pp. 113-122. <xref |
linkend="Black89" /></para> |
|
<para>As the situation demands, you will want partitial invalidation |
of TLB caches. In case of simple memory mapping change it is necessary |
to invalidate only one or more adjacent pages. In case if the |
architecture is aware of ASIDs, when kernel needs to dump some ASID to |
use by another task, it invalidates only entries from this particular |
address space. Final option of the TLB invalidation is the complete |
TLB cache invalidation, which is the operation that flushes all |
entries in TLB.</para> |
|
<para>TLB shootdown is performed in two phases.</para> |
|
<formalpara> |
<title>Phase 1.</title> |
|
<para>First, initiator locks a global TLB spinlock, then request is |
being put to the local request cache of every other CPU in the |
system protected by its spinlock. In case the cache is full, all |
requests in the cache are replaced by one request, indicating global |
TLB flush. Then the initiator thread sends an IPI message indicating |
the TLB shootdown request to the rest of the CPUs and waits actively |
until all CPUs confirm TLB invalidating action execution by setting |
up a special flag. After setting this flag this thread is blocked on |
the TLB spinlock, held by the initiator.</para> |
</formalpara> |
|
<formalpara> |
<title>Phase 2.</title> |
|
<para>All CPUs are waiting on the TLB spinlock to execute TLB |
invalidation action and have indicated their intention to the |
initiator. Initiator continues, cleaning up its TLB and releasing |
the global TLB spinlock. After this all other CPUs gain and |
immidiately release TLB spinlock and perform TLB invalidation |
actions.</para> |
</formalpara> |
</section> |
</section> |
</section> |
</chapter> |