Subversion Repositories HelenOS-doc

Rev 81 → Rev 82

/design/trunk/src/ch_memory_management.xml
78,12 → 78,12
<formalpara>
<title>Allocation / deallocation</title>
 
<para>Upon allocation request via function <code>frame_alloc()</code>,
the frame allocator first tries to find a zone that can satisfy the
request (i.e. has the required amount of free frames). Once a suitable
zone is found, the frame allocator uses the buddy allocator on the
zone's buddy system to perform the allocation. During deallocation,
which is triggered by a call to <code>frame_free()</code>, the frame
allocator looks up the respective zone that contains the frame being
deallocated. Afterwards, it calls the buddy allocator again, this time
to take care of deallocation within the zone's buddy system.</para>
99,21 → 99,23
 
<para>In the buddy system, the memory is broken down into power-of-two
sized naturally aligned blocks. These blocks are organized in an array
of lists, in which the list with index <emphasis>i</emphasis> contains
all unallocated blocks of size
<emphasis>2<superscript>i</superscript></emphasis>. The index
<emphasis>i</emphasis> is called the order of the block. Should there be
two adjacent equally sized blocks in the list <emphasis>i</emphasis>
(i.e. buddies), the buddy allocator would coalesce them and put the
resulting block in list <emphasis>i + 1</emphasis>, provided that the
resulting block would be naturally aligned. Similarly, when the
allocator is asked to allocate a block of size
<emphasis>2<superscript>i</superscript></emphasis>, it first tries to
satisfy the request from the list with index <emphasis>i</emphasis>. If
the request cannot be satisfied (i.e. the list <emphasis>i</emphasis> is
empty), the buddy allocator will try to allocate and split a larger
block from the list with index <emphasis>i + 1</emphasis>. Both of these
algorithms are recursive. The recursion ends either when there are no
blocks to coalesce in the former case or when there are no blocks that
can be split in the latter case.</para>
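
<para>The following self-contained C sketch illustrates the buddy policy
just described: free blocks of size
<emphasis>2<superscript>i</superscript></emphasis> are kept in list
<emphasis>i</emphasis>, allocation splits larger blocks and deallocation
coalesces naturally aligned buddies. It is a simplified illustration of
the idea only, not the actual HelenOS zone or buddy system code, and all
names in it are invented for this example.</para>

<programlisting><![CDATA[
/* Simplified, self-contained illustration of the buddy policy. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_ORDER 5                        /* manage 2^5 = 32 frames */
#define NFRAMES   (1u << MAX_ORDER)

/* free_list[i] holds start indices of free blocks of size 2^i. */
static unsigned free_list[MAX_ORDER + 1][NFRAMES];
static unsigned free_count[MAX_ORDER + 1];

static void push(unsigned order, unsigned start)
{
    free_list[order][free_count[order]++] = start;
}

/* Remove a block from list 'order'; if 'specific', only the block at
 * 'start' qualifies (used to grab a buddy during coalescing). */
static bool pop(unsigned order, unsigned start, bool specific, unsigned *out)
{
    for (unsigned k = 0; k < free_count[order]; k++) {
        if (specific && free_list[order][k] != start)
            continue;
        *out = free_list[order][k];
        free_list[order][k] = free_list[order][--free_count[order]];
        return true;
    }
    return false;
}

/* Allocate a naturally aligned block of 2^order frames; returns the start
 * index or -1.  If list 'order' is empty, split a larger block. */
static int buddy_alloc(unsigned order)
{
    unsigned start;

    if (pop(order, 0, false, &start))
        return (int) start;
    if (order == MAX_ORDER)
        return -1;                         /* nothing left to split */

    int big = buddy_alloc(order + 1);
    if (big < 0)
        return -1;
    push(order, (unsigned) big + (1u << order));  /* upper half stays free */
    return big;                                   /* lower half is handed out */
}

/* Free a block of 2^order frames at 'start', coalescing with its buddy as
 * long as the buddy is free and the merged block stays naturally aligned. */
static void buddy_free(unsigned start, unsigned order)
{
    while (order < MAX_ORDER) {
        unsigned buddy = start ^ (1u << order);
        unsigned dummy;

        if (!pop(order, buddy, true, &dummy))
            break;                         /* buddy is not free: stop merging */
        start &= ~(1u << order);           /* start of the merged block */
        order++;
    }
    push(order, start);
}

int main(void)
{
    push(MAX_ORDER, 0);                    /* one free 32-frame block */

    int a = buddy_alloc(2);                /* 4 frames */
    int b = buddy_alloc(3);                /* 8 frames */
    printf("a = %d, b = %d\n", a, b);

    buddy_free((unsigned) a, 2);
    buddy_free((unsigned) b, 3);
    printf("free blocks of order %u: %u\n", MAX_ORDER, free_count[MAX_ORDER]);
    return 0;
}
]]></programlisting>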
 
<para>This approach greatly reduces external fragmentation of memory and
helps in allocating bigger contiguous blocks of memory aligned to their
268,7 → 270,9
<para>The slab allocator is closely modelled after the OpenSolaris slab
allocator by Jeff Bonwick and Jonathan Adams <xref
linkend="Bonwick01" /> with the following exceptions:<itemizedlist>
<listitem>
<para>empty slabs are immediately deallocated and</para>
</listitem>
 
<listitem>
<para>empty magazines are deallocated when not needed.</para>
287,8 → 291,8
 
<para>The following two paragraphs summarize and complete the
description of the slab allocator operation (i.e. the
<code>slab_alloc()</code> and <code>slab_free()</code>
functions).</para>
 
<formalpara>
<title>Allocation</title>
352,34 → 356,226
<section>
<title>Virtual memory management</title>
 
<section>
<title>Introduction</title>
<para>Virtual memory is essential for an operating system because it makes
several things possible. First, it helps to isolate tasks from each other
by encapsulating them in their private address spaces. Second, virtual
memory can give tasks the illusion of more memory than is actually
physically available. And third, by using virtual memory, there can be
multiple copies of the same program, linked to the same addresses, running
in the system. There are at least two known mechanisms for implementing
virtual memory: segmentation and paging. Even though some processor
architectures supported by HelenOS<footnote>
<para>ia32 has full-fledged segmentation.</para>
</footnote> provide both mechanisms, the kernel makes use solely of
paging.</para>
 
<section id="paging">
<title>VAT subsystem</title>
 
<para>In a paged virtual memory, the entire virtual address space is
divided into small power-of-two sized naturally aligned blocks called
pages. The processor implements a translation mechanism that allows the
operating system to manage mappings between the set of pages and a set
of identically sized and identically aligned pieces of physical memory
called frames. As a result, references to contiguous virtual memory
areas don't necessarily need to reference a contiguous area of physical
memory. Supported page sizes usually range from several kilobytes to
several megabytes. Each page that takes part in the mapping is
associated with certain attributes that further describe the mapping
(e.g. access rights, dirty and accessed bits and present bit).</para>
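
<para>The page/frame split can be illustrated with a short, hedged C
example. The 4 KiB page size and the attribute flag names below are
assumptions made for the sake of the example, not HelenOS
definitions.</para>

<programlisting><![CDATA[
/* Minimal illustration of the page/offset split and the per-mapping
 * attributes mentioned above (assumed 4 KiB pages, assumed flag names). */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u
#define PAGE_MASK (PAGE_SIZE - 1)

enum {
    PG_PRESENT  = 1u << 0,
    PG_WRITABLE = 1u << 1,
    PG_ACCESSED = 1u << 2,
    PG_DIRTY    = 1u << 3
};

int main(void)
{
    uintptr_t va = 0x40001234;                     /* some virtual address */
    uintptr_t page = va & ~(uintptr_t) PAGE_MASK;  /* page base */
    uintptr_t offset = va & PAGE_MASK;             /* offset within the page */

    /* A mapping pairs the page with an identically sized and identically
     * aligned frame plus a set of attributes. */
    uintptr_t frame = 0x00200000;
    unsigned attrs = PG_PRESENT | PG_WRITABLE;

    /* Translation keeps the in-page offset and substitutes the frame. */
    uintptr_t pa = frame + offset;
    printf("page %#lx + offset %#lx -> pa %#lx (attrs %#x)\n",
        (unsigned long) page, (unsigned long) offset,
        (unsigned long) pa, attrs);
    return 0;
}
]]></programlisting>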
 
<para>When the processor accesses a page that is not present (i.e. its
present bit is not set), the operating system is notified through a
special exception called a page fault. It is then up to the operating
system to service the page fault. In HelenOS, some page faults are fatal
and result in either task termination or, in the worst case, kernel
panic<footnote>
<para>Such a condition would be either caused by a hardware failure
or a bug in the kernel.</para>
</footnote>, while other page faults are used to load memory on demand
or to notify the kernel about certain events.</para>
 
<indexterm>
<primary>page tables</primary>
</indexterm>
 
<para>The set of all page mappings is stored in a memory structure
called page tables. Some architectures have no hardware support for page
tables<footnote>
<para>On mips32, a TLB-only model is used and the operating system is
responsible for managing software-defined page tables.</para>
</footnote> while other processor architectures<footnote>
<para>Like amd64 and ia32.</para>
</footnote> understand the entire in-memory page table format. Despite
all the possible differences in page table formats, the HelenOS VAT
subsystem<footnote>
<para>Virtual Address Translation subsystem.</para>
</footnote> unifies the page table operations under one programming
interface. For all parts of the kernel, three basic functions are
provided:</para>
 
<itemizedlist>
<listitem>
<para><code>page_mapping_insert()</code>,</para>
</listitem>
 
<listitem>
<para><code>page_mapping_find()</code> and</para>
</listitem>
 
<listitem>
<para><code>page_mapping_remove()</code>.</para>
</listitem>
</itemizedlist>
 
<para>The <code>page_mapping_insert()</code> function is used to
introduce a mapping for one virtual memory page belonging to a
particular address space into the page tables. Once the mapping is in
the page tables, it can be searched for by
<code>page_mapping_find()</code> and removed by
<code>page_mapping_remove()</code>. All of these functions internally
select the functions specific to the page table mechanism in use, which
carry out the operation itself.</para>
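
<para>The sketch below shows, in simplified form, how such a unified
interface can dispatch to a mechanism-specific back end through a table
of function pointers. The type names, fields and the stub back end are
illustrative assumptions, not the actual HelenOS declarations.</para>

<programlisting><![CDATA[
/* Sketch of dispatching generic page mapping operations to one concrete
 * page table mechanism through a table of function pointers. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct as as_t;                    /* opaque address space */

typedef struct {
    void (*mapping_insert)(as_t *, uintptr_t page, uintptr_t frame, unsigned flags);
    bool (*mapping_find)(as_t *, uintptr_t page, uintptr_t *frame);
    void (*mapping_remove)(as_t *, uintptr_t page);
} page_mapping_ops_t;

/* A concrete back end (e.g. hierarchical page tables) fills in the table;
 * these stubs only report what would be done. */
static void pt_insert(as_t *as, uintptr_t page, uintptr_t frame, unsigned flags)
{
    (void) as; (void) flags;
    printf("insert: page %#lx -> frame %#lx\n",
        (unsigned long) page, (unsigned long) frame);
}

static bool pt_find(as_t *as, uintptr_t page, uintptr_t *frame)
{
    (void) as; (void) page;
    *frame = 0x200000;                     /* pretend the mapping exists */
    return true;
}

static void pt_remove(as_t *as, uintptr_t page)
{
    (void) as;
    printf("remove: page %#lx\n", (unsigned long) page);
}

static const page_mapping_ops_t pt_ops = { pt_insert, pt_find, pt_remove };

/* The generic front end talks only to the selected operations table. */
static const page_mapping_ops_t *page_mapping_operations = &pt_ops;

int main(void)
{
    uintptr_t frame;

    page_mapping_operations->mapping_insert(NULL, 0x40000000, 0x200000, 0);
    if (page_mapping_operations->mapping_find(NULL, 0x40000000, &frame))
        printf("find: frame %#lx\n", (unsigned long) frame);
    page_mapping_operations->mapping_remove(NULL, 0x40000000);
    return 0;
}
]]></programlisting>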
 
<para>There are currently two supported mechanisms: generic 4-level
hierarchical page tables and the global page hash table. Both of the
mechanisms are generic as each covers several hardware platforms. For
instance, the 4-level hierarchical page table mechanism is used by
amd64, ia32, mips32 and ppc32, even though these architectures have
different native page table formats: 4-level, 2-level, TLB-only and
hardware hash table, respectively. On the other hand, the global page
hash table is used on ia64, which can be TLB-only or use a hardware
hash table. Although only two mechanisms are currently implemented,
other mechanisms (e.g. B+tree) can be easily added.</para>
 
<section id="page_tables">
<indexterm>
<primary>page tables</primary>
 
<secondary>- hierarchical</secondary>
</indexterm>
 
<title>Hierarchical 4-level page tables</title>
 
<para>Hierarchical 4-level page tables are a generalization of the
frequently used hierarchical model of page tables. In this mechanism,
each address space has its own page tables. To avoid confusion with the
terminology used by hardware vendors, in HelenOS, the root level page
table is called PTL0, the two middle levels are called PTL1 and PTL2,
and, finally, the leaf level is called PTL3. All architectures using
this mechanism are required to use PTL0 and PTL3. However, the middle
levels can be left out, depending on the hardware hierarchy or the
structure of software-only page tables. The genericity is achieved
through a set of macros that define transitions from one level to
another. Unused levels are optimised out by the compiler.</para>
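
<para>A minimal sketch of indexing PTL0 through PTL3 during a page table
walk follows. The 9-bit-per-level split and the index helper are
assumptions chosen for illustration; in HelenOS the transitions between
levels are architecture-specific macros and unused middle levels
collapse to no-ops.</para>

<programlisting><![CDATA[
/* Hedged sketch of computing the per-level page table indices. */
#include <stdint.h>
#include <stdio.h>

#define PTL_BITS  9                        /* 2^9 = 512 entries per level */
#define PAGE_BITS 12                       /* 4 KiB pages */

/* Index of the entry describing 'va' at the given level
 * (0 = PTL0 root, 3 = PTL3 leaf). */
static unsigned ptl_index(unsigned level, uint64_t va)
{
    unsigned shift = PAGE_BITS + (3 - level) * PTL_BITS;
    return (unsigned) ((va >> shift) & ((1u << PTL_BITS) - 1));
}

int main(void)
{
    uint64_t va = 0x00007f00deadb000ULL;

    for (unsigned level = 0; level <= 3; level++)
        printf("PTL%u index: %u\n", level, ptl_index(level, va));
    printf("page offset: %u\n", (unsigned) (va & ((1u << PAGE_BITS) - 1)));
    return 0;
}
]]></programlisting>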
</section>
 
<section>
<indexterm>
<primary>page tables</primary>
 
<secondary>- hashing</secondary>
</indexterm>
 
<title>Global page hash table</title>
 
<para>The implementation of the global page hash table was encouraged
by 64-bit architectures that can have rather sparse address spaces. The
hash table contains valid mappings only. Each entry of the hash table
contains an address space pointer, virtual memory page number (VPN),
physical memory frame number (PFN) and a set of flags. The pair of the
address space pointer and the virtual memory page number is used as a
key for the hash table. One of the major differences between the
global page hash table and hierarchical 4-level page tables is that
there is only a single global page hash table in the system while
hierarchical page tables exist per address space. Thus, the global
page hash table contains information about mappings of all address
spaces in the system. </para>
 
<para>The global page hash table mechanism uses the generic hash table
type as described in the chapter about <link linkend="hashtables">data
structures</link> earlier in this book.</para>
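
<para>A hedged sketch of the lookup key and collision chains described
above is shown below. The bucket count, hash function and field names
are simplifications for illustration, not the actual HelenOS hash table
implementation.</para>

<programlisting><![CDATA[
/* Sketch of the (address space, virtual page number) key and a lookup
 * with collision chains. */
#include <stdint.h>
#include <stdio.h>

#define BUCKETS 8

typedef struct as as_t;                    /* opaque address space */

typedef struct pte {
    as_t *as;                              /* key, part 1 */
    uintptr_t vpn;                         /* key, part 2 */
    uintptr_t pfn;                         /* physical frame number */
    unsigned flags;                        /* mapping attributes */
    struct pte *next;                      /* collision chain */
} pte_t;

static pte_t *bucket[BUCKETS];

static unsigned hash(as_t *as, uintptr_t vpn)
{
    /* Mix both key components; the real hash function differs. */
    return (unsigned) (((uintptr_t) as ^ vpn) % BUCKETS);
}

static void ht_insert(pte_t *entry)
{
    unsigned h = hash(entry->as, entry->vpn);

    entry->next = bucket[h];
    bucket[h] = entry;
}

static pte_t *ht_find(as_t *as, uintptr_t vpn)
{
    for (pte_t *p = bucket[hash(as, vpn)]; p != NULL; p = p->next)
        if (p->as == as && p->vpn == vpn)  /* both key parts must match */
            return p;
    return NULL;                           /* no valid mapping */
}

int main(void)
{
    as_t *as = (as_t *) 0x1000;            /* stands in for an address space */
    pte_t entry = { .as = as, .vpn = 0x40000, .pfn = 0x200, .flags = 1 };

    ht_insert(&entry);
    pte_t *found = ht_find(as, 0x40000);
    printf("pfn = %#lx\n", found != NULL ? (unsigned long) found->pfn : 0UL);
    return 0;
}
]]></programlisting>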
</section>
</section>
</section>
 
<section id="tlb">
<indexterm>
<primary>TLB</primary>
</indexterm>
 
<title>Translation Lookaside Buffer</title>
 
<para>Due to the extensive overhead of page mapping lookups in the page
tables, all architectures provide a fast associative cache built into
the CPU. This memory, called the TLB, stores recently used page table
entries.</para>
 
<section id="tlb_shootdown">
<indexterm>
<primary>TLB</primary>
 
<secondary>- TLB shootdown</secondary>
</indexterm>
 
<title>TLB consistency. TLB shootdown algorithm.</title>
 
<para>The operating system is responsible for keeping the TLB consistent
by invalidating its contents whenever there is some change in the page
tables. Such changes occur when a page or a group of pages is unmapped,
when a mapping is changed, or when the system switches the active address
space in order to schedule a new task. Moreover, this invalidation
operation must be done on all CPUs in the system, because each CPU has
its own independent TLB. Maintaining TLB consistency on SMP
configurations is therefore not as trivial a task as it might look at
first glance. A naive solution would assume that the CPU which wants to
invalidate the TLB simply invalidates the TLB caches of the other CPUs as
well. This is not possible on most architectures, because flushing the
TLB is allowed only on the local CPU and there is no way to access other
CPUs' TLB caches and thus invalidate their entries remotely.</para>
 
<para>The technique of remote invalidation of TLB entries is called "TLB
shootdown". HelenOS uses a variation of the algorithm described by D.
Black et al. in "Translation Lookaside Buffer Consistency: A Software
Approach," Proc. Third Int'l Conf. Architectural Support for Programming
Languages and Operating Systems, 1989, pp. 113-122. <xref
linkend="Black89" /></para>
 
<para>Depending on the situation, partial invalidation of the TLB caches
is often sufficient. In the case of a simple memory mapping change, it
is necessary to invalidate only one or more adjacent pages. If the
architecture is aware of ASIDs and the kernel needs to reclaim an ASID
for use by another task, it invalidates only the entries belonging to
that particular address space. The final option is the complete TLB
invalidation, an operation that flushes all entries in the TLB.</para>
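
<para>The three granularities can be summarized by the following hedged
sketch. The enumeration and function names are illustrative assumptions
rather than the real HelenOS TLB interface.</para>

<programlisting><![CDATA[
/* Sketch of the three TLB invalidation granularities described above. */
#include <stdint.h>
#include <stdio.h>

typedef unsigned asid_t;

typedef enum {
    TLB_INVL_PAGES,                        /* one or more adjacent pages */
    TLB_INVL_ASID,                         /* all entries of one address space */
    TLB_INVL_ALL                           /* full TLB flush */
} tlb_invl_type_t;

static void tlb_invalidate(tlb_invl_type_t type, asid_t asid,
    uintptr_t page, unsigned count)
{
    switch (type) {
    case TLB_INVL_PAGES:
        printf("flush %u page(s) from %#lx (ASID %u)\n",
            count, (unsigned long) page, asid);
        break;
    case TLB_INVL_ASID:
        printf("flush all entries of ASID %u\n", asid);
        break;
    case TLB_INVL_ALL:
        printf("flush the whole TLB\n");
        break;
    }
}

int main(void)
{
    tlb_invalidate(TLB_INVL_PAGES, 7, 0x40000000, 2);
    tlb_invalidate(TLB_INVL_ASID, 7, 0, 0);
    tlb_invalidate(TLB_INVL_ALL, 0, 0, 0);
    return 0;
}
]]></programlisting>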
 
<para>TLB shootdown is performed in two phases.</para>
 
<formalpara>
<title>Phase 1.</title>
 
<para>First, the initiator locks a global TLB spinlock. The request is
then put into the local request cache of every other CPU in the system,
each cache being protected by its own spinlock. In case a cache is full,
all requests in that cache are replaced by a single request indicating a
global TLB flush. The initiator thread then sends an IPI message
announcing the TLB shootdown request to the rest of the CPUs and
actively waits until all CPUs confirm the request by setting a special
flag. After setting this flag, each of these CPUs blocks on the TLB
spinlock, which is held by the initiator.</para>
</formalpara>
 
<formalpara>
<title>Phase 2.</title>
 
<para>All CPUs are now waiting on the TLB spinlock to execute the TLB
invalidation action and have indicated their intention to the initiator.
The initiator continues by cleaning up its own TLB and releasing the
global TLB spinlock. After this, all other CPUs acquire and immediately
release the TLB spinlock and perform their TLB invalidation
actions.</para>
</formalpara>
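
<para>The following sketch simulates one round of the two-phase protocol
on a hypothetical two-CPU system, using POSIX threads in place of CPUs
and IPIs. It only illustrates the ordering of the steps; the real
HelenOS code runs in the kernel and uses real IPIs, spinlocks and
per-CPU request caches.</para>

<programlisting><![CDATA[
/* Hedged simulation of the two-phase TLB shootdown. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define CPUS       2
#define CACHE_SIZE 4

static pthread_mutex_t tlb_lock = PTHREAD_MUTEX_INITIALIZER; /* global TLB lock */
static atomic_int ack;                     /* CPUs that confirmed the request */
static atomic_int ipi_pending;             /* stands in for the shootdown IPI */

/* Per-CPU request caches; a full cache degenerates into one global flush. */
static uintptr_t requests[CPUS][CACHE_SIZE];
static int request_count[CPUS];

static void local_tlb_invalidate(int cpu)
{
    /* A real implementation would walk the cached requests here. */
    printf("cpu%d: invalidating local TLB (%d pending request(s))\n",
        cpu, request_count[cpu]);
    request_count[cpu] = 0;
}

/* Initiator side (cpu0): phase 1 followed by phase 2. */
static void *initiator(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&tlb_lock);                /* phase 1: take the lock */
    for (int cpu = 1; cpu < CPUS; cpu++) {        /* enqueue on every other CPU */
        if (request_count[cpu] == CACHE_SIZE) {
            requests[cpu][0] = 0;                 /* 0 = flush everything */
            request_count[cpu] = 1;
        } else {
            requests[cpu][request_count[cpu]++] = 0x40000000;
        }
    }
    atomic_store(&ipi_pending, 1);                /* "send" the IPI */
    while (atomic_load(&ack) < CPUS - 1)          /* wait for confirmations */
        ;
    local_tlb_invalidate(0);                      /* phase 2: own TLB first */
    pthread_mutex_unlock(&tlb_lock);              /* let the other CPUs proceed */
    return NULL;
}

/* Responder side: any other CPU receiving the shootdown IPI. */
static void *responder(void *arg)
{
    int cpu = (int) (intptr_t) arg;

    while (!atomic_load(&ipi_pending))            /* wait for the IPI */
        ;
    atomic_fetch_add(&ack, 1);                    /* confirm to the initiator */
    pthread_mutex_lock(&tlb_lock);                /* blocks until phase 2 */
    pthread_mutex_unlock(&tlb_lock);
    local_tlb_invalidate(cpu);                    /* drain its own request cache */
    return NULL;
}

int main(void)
{
    pthread_t t[CPUS];

    pthread_create(&t[1], NULL, responder, (void *) (intptr_t) 1);
    pthread_create(&t[0], NULL, initiator, NULL);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    return 0;
}
]]></programlisting>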
</section>
 
<section>
<title>Address spaces</title>
 
459,186 → 655,5
</para>
</section>
</section>
 
<section id="paging">
<title>Virtual address translation</title>
 
<section>
<title>Introduction</title>
 
<para>Virtual memory is usually using paged memory model, where
virtual memory address space is divided into the
<emphasis>pages</emphasis> (usually having size 4096 bytes) and
physical memory is divided into the frames (same sized as a page, of
course). Each page may be mapped to some frame and then, upon memory
access to the virtual address, CPU performs <emphasis>address
translation</emphasis> during the instruction execution. Non-existing
mapping generates page fault exception, calling kernel exception
handler, thus allowing kernel to manipulate rules of memory access.
Information for pages mapping is stored by kernel in the <link
linkend="page_tables">page tables</link></para>
 
<indexterm>
<primary>page tables</primary>
</indexterm>
 
<para>The majority of the architectures use multi-level page tables,
which means need to access physical memory several times before
getting physical address. This fact would make serios performance
overhead in virtual memory management. To avoid this <link
linkend="tlb">Traslation Lookaside Buffer (TLB)</link> is used.</para>
 
<para>HelenOS kernel has two different approaches to the paging
implementation: <emphasis>4 level page tables</emphasis> and
<emphasis>global hash table</emphasis>, which are accessible via
generic paging abstraction layer. Such different functionality was
caused by the major architectural differences between supported
platforms. This abstraction is implemented with help of the global
structure of pointers to basic mapping functions
<emphasis>page_mapping_operations</emphasis>. To achieve different
functionality of page tables, corresponding layer must implement
functions, declared in
<emphasis>page_mapping_operations</emphasis></para>
 
<para>Thanks to the abstract paging interface, there was a place left
for more paging implementations (besides already implemented
hieararchical page tables and hash table), for example <indexterm>
<primary>B-tree</primary>
</indexterm> B-Tree based page tables.</para>
</section>
 
<section id="page_tables">
<indexterm>
<primary>page tables</primary>
 
<secondary>- hierarchical</secondary>
</indexterm>
 
<title>Hierarchical 4-level page tables</title>
 
<para>Hierarchical 4-level page tables are the generalization of the
hardware capabilities of most architectures. Each address space has
its own page tables.<itemizedlist>
<listitem>ia32 uses 2-level page tables, with full hardware
support.</listitem>
 
<listitem>amd64 uses 4-level page tables, also coming with full
hardware support.</listitem>
 
<listitem>mips and ppc32 have 2-level tables, software simulated
support.</listitem>
</itemizedlist></para>
</section>
 
<section>
<indexterm>
<primary>page tables</primary>
 
<secondary>- hashing</secondary>
</indexterm>
 
<title>Global hash table</title>
 
<para>Implementation of the global hash table was encouraged by the
ia64 architecture support. One of the major differences between global
hash table and hierarchical tables is that global hash table exists
only once in the system and the hierarchical tables are maintained per
address space.</para>
 
<para>Thus, hash table contains information about all address spaces
mappings in the system, so, the hash of an entry must contain
information of both address space pointer or id and the virtual
address of the page. Generic hash table implementation assumes that
the addresses of the pointers to the address spaces are likely to be
on the close addresses, so it uses least significant bits for hash;
also it assumes that the virtual page addresses have roughly the same
probability of occurring, so the least significant bits of VPN compose
the hash index.</para>
 
<para>Paging hash table uses generic hash table with collision chains
(see the <link linkend="hashtables">Data Structures</link> chapter of
this manual for details).</para>
</section>
</section>
 
<section id="tlb">
<indexterm>
<primary>TLB</primary>
</indexterm>
 
<title>Translation Lookaside buffer</title>
 
<para>Due to the extensive overhead during the page mapping lookup in
the page tables, all architectures has fast assotiative cache memory
built-in CPU. This memory called TLB stores recently used page table
entries.</para>
 
<section id="tlb_shootdown">
<indexterm>
<primary>TLB</primary>
 
<secondary>- TLB shootdown</secondary>
</indexterm>
 
<title>TLB consistency. TLB shootdown algorithm.</title>
 
<para>Operating system is responsible for keeping TLB consistent by
invalidating the contents of TLB, whenever there is some change in
page tables. Those changes may occur when page or group of pages were
unmapped, mapping is changed or system switching active address space
to schedule a new system task. Moreover, this invalidation operation
must be done an all system CPUs because each CPU has its own
independent TLB cache. Thus maintaining TLB consistency on SMP
configuration as not as trivial task as it looks on the first glance.
Naive solution would assume that is the CPU which wants to invalidate
TLB will invalidate TLB caches on other CPUs. It is not possible on
the most of the architectures, because of the simple fact - flushing
TLB is allowed only on the local CPU and there is no possibility to
access other CPUs' TLB caches, thus invalidate TLB remotely.</para>
 
<para>Technique of remote invalidation of TLB entries is called "TLB
shootdown". HelenOS uses a variation of the algorithm described by D.
Black et al., "Translation Lookaside Buffer Consistency: A Software
Approach," Proc. Third Int'l Conf. Architectural Support for
Programming Languages and Operating Systems, 1989, pp. 113-122. <xref
linkend="Black89" /></para>
 
<para>As the situation demands, you will want partitial invalidation
of TLB caches. In case of simple memory mapping change it is necessary
to invalidate only one or more adjacent pages. In case if the
architecture is aware of ASIDs, when kernel needs to dump some ASID to
use by another task, it invalidates only entries from this particular
address space. Final option of the TLB invalidation is the complete
TLB cache invalidation, which is the operation that flushes all
entries in TLB.</para>
 
<para>TLB shootdown is performed in two phases.</para>
 
<formalpara>
<title>Phase 1.</title>
 
<para>First, initiator locks a global TLB spinlock, then request is
being put to the local request cache of every other CPU in the
system protected by its spinlock. In case the cache is full, all
requests in the cache are replaced by one request, indicating global
TLB flush. Then the initiator thread sends an IPI message indicating
the TLB shootdown request to the rest of the CPUs and waits actively
until all CPUs confirm TLB invalidating action execution by setting
up a special flag. After setting this flag this thread is blocked on
the TLB spinlock, held by the initiator.</para>
</formalpara>
 
<formalpara>
<title>Phase 2.</title>
 
<para>All CPUs are waiting on the TLB spinlock to execute TLB
invalidation action and have indicated their intention to the
initiator. Initiator continues, cleaning up its TLB and releasing
the global TLB spinlock. After this all other CPUs gain and
immidiately release TLB spinlock and perform TLB invalidation
actions.</para>
</formalpara>
</section>
</section>
</section>
</chapter>