<title>Physical memory management</title> |
|
<section id="zones_and_frames"> |
<indexterm><primary>memory management</primary><secondary>memory zone</secondary></indexterm> |
<title>Zones and frames</title> |
|
<para>HelenOS represents contiguous areas of physical memory in
</section> |
|
<section id="frame_allocator"> |
<indexterm><primary>memory management</primary><secondary>frame allocator</secondary></indexterm> |
<title>Frame allocator</title> |
|
<para>The frame allocator satisfies kernel requests to allocate |
</section> |
|
<section id="buddy_allocator"> |
<indexterm><primary>memory management</primary><secondary>buddy system</secondary></indexterm> |
<title>Buddy allocator</title> |
|
<para>In the buddy system, the memory is broken down into power-of-two |
</section> |
|
<section id="slab"> |
<indexterm><primary>memory management</primary><secondary>slab</secondary></indexterm> |
<title>Slab allocator</title>
|
<para>The majority of memory allocation requests in the kernel are
for small, frequently used data structures. The basic idea behind the slab
the list of full magazines. In other words, the slab allocator tries to |
keep the processor magazine cache only half-full in order to prevent |
thrashing when allocations and deallocations interleave on magazine |
boundaries. The advantage of this setup is that during most of the
allocations, no global spinlock needs to be held.</para>
|
<para>Should HelenOS run short of memory, it would start deallocating
objects from magazines, calling the slab cache destructor on them and
allocator by Jeff Bonwick and Jonathan Adams with the following |
exceptions:<itemizedlist> |
<listitem> |
<para>empty slabs are immediately deallocated and</para>
</listitem> |
|
<listitem> |
<para>empty magazines are deallocated when not needed.</para>
</listitem> |
</itemizedlist>The following features are not currently supported
but would be easy to implement: <itemizedlist>
<listitem> |
cache coloring and
</listitem> |
|
<listitem> |
dynamic magazine growing (different magazine sizes are already supported, but the allocation strategy would need to be adjusted).
</listitem> |
</itemizedlist></para> |
|
<section> |
<indexterm><primary>memory management</primary><secondary>slab magazine</secondary></indexterm> |
<title>Magazine layer</title> |
|
<para>On SMP architectures, the classic slab allocator design
suffers from a serious bottleneck: the global slab locking mechanism
serializes the processing of all slab allocation requests. In order
to achieve good SMP scaling, a new layer was introduced and the slab
allocator was extended to support per-CPU caches of objects called
magazines. <termdef>The slab SMP performance bottleneck was resolved
by introducing a per-CPU caching scheme called the
<glossterm>magazine layer</glossterm></termdef>.</para>
|
<para>A magazine is an N-element cache of objects, so each magazine
can satisfy N allocations. A magazine behaves like an automatic
weapon magazine (LIFO, stack), so allocation and deallocation become
simple push and pop pointer operations. The trick is that the CPU
does not access global slab allocator data during allocation from
its magazine, thus making parallel allocations between CPUs
possible.</para>
|
<para>The implementation also requires one more feature: the
CPU-bound magazine is actually a pair of magazines in order to avoid
thrashing when allocations and deallocations of a single object
alternate right at the magazine size boundary. LIFO order is
enforced, which should avoid fragmentation as much as
possible.</para>
|
<para>Another important entity of the magazine layer is the common
list of full magazines (also called a depot), which stores full
magazines that may be used by any of the CPU magazine caches to
reload the active CPU magazine. This list of magazines can be
pre-filled with full magazines during initialization, but in the
current implementation it is filled during object deallocation, when
a CPU magazine becomes full.</para>
|
<para>Slab allocator control structures are allocated from special
slabs that are marked by a special flag indicating that they should
not be used for the slab magazine layer. This is done to avoid
possible infinite recursion and deadlock during conventional slab
allocation requests.</para>
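<para>The LIFO push and pop behavior of a single magazine can be
sketched in C as follows. The names <code>magazine_t</code>,
<code>magazine_push</code> and <code>magazine_pop</code> and the fixed
capacity are illustrative assumptions of this sketch, not the actual
HelenOS identifiers:</para>

```c
#include <stddef.h>

#define MAGAZINE_SIZE 8  /* N: capacity of one magazine (illustrative) */

typedef struct {
    size_t busy;                    /* number of cached objects */
    void *objs[MAGAZINE_SIZE];      /* LIFO stack of object pointers */
} magazine_t;

/* Pop one object; returns NULL when the magazine is empty and the
 * caller has to fall back to the depot or the slab layer. */
static void *magazine_pop(magazine_t *mag)
{
    if (mag->busy == 0)
        return NULL;
    return mag->objs[--mag->busy];
}

/* Push one object; returns 0 on success, -1 when the magazine is full
 * and the caller has to swap magazines or use the depot. */
static int magazine_push(magazine_t *mag, void *obj)
{
    if (mag->busy == MAGAZINE_SIZE)
        return -1;
    mag->objs[mag->busy++] = obj;
    return 0;
}
```

<para>Because both operations touch only the local magazine, no global
lock is involved on this fast path.</para>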
</section> |
|
<section> |
<title>Allocation/deallocation</title> |
|
<para>The following two paragraphs summarize and complete the |
description of the slab allocator operation (i.e. |
<code>slab_alloc</code> and <code>slab_free</code> |
operations).</para> |
|
<para>The slab allocator tends to allocate a lot of space and not
free it. When the frame allocator fails to allocate a frame, it
calls slab_reclaim(). Light reclaim is tried first: it releases
slabs from the CPU-shared list of magazines until at least one slab
is deallocated in each cache (this algorithm should probably
change). If that does not help, brutal reclaim removes all cached
objects, even from the CPU-bound magazines.</para>
|
<formalpara> |
<title>Allocation</title> |
|
<para><emphasis>Step 1.</emphasis> When an allocation request |
comes, the slab allocator checks availability of memory in the |
current magazine of the local processor magazine cache. If the |
available memory is there, the allocator just pops the magazine |
and returns pointer to the object.</para> |
|
<para><emphasis>Step 2.</emphasis> If the current magazine in the
processor magazine cache is empty, the allocator will attempt to
swap it with the last magazine from the cache and return to the
first step. If the last magazine is also empty, the algorithm will
fall through to Step 3.</para>
|
<para><emphasis>Step 3.</emphasis> Now the allocator is in the |
situation when both magazines in the processor magazine cache are |
empty. The allocator reloads one magazine from the shared list of |
full magazines. If the reload is successful (i.e. there are full |
magazines in the list), the algorithm continues with Step |
1.</para> |
|
<para><emphasis>Step 4.</emphasis> In this fail-safe step, an
object is allocated from the conventional slab layer and a pointer
to it is returned.</para>
</formalpara> |
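<para>The four allocation steps above can be sketched in C. The
structures, the hypothetical <code>depot_pop()</code> and
<code>slab_layer_alloc()</code> stand-ins, and all names are
illustrative assumptions of this sketch, not the HelenOS
implementation:</para>

```c
#include <stddef.h>

#define MAG_SIZE 4

typedef struct {
    size_t busy;                    /* number of cached objects */
    void *objs[MAG_SIZE];           /* LIFO stack of object pointers */
} magazine_t;

typedef struct {
    magazine_t *current;            /* active magazine */
    magazine_t *last;               /* second magazine of the pair */
} cpu_cache_t;

/* Trivial hypothetical stand-ins for the shared list of full magazines
 * and for the conventional slab layer. */
static magazine_t *depot_pop(void)
{
    return NULL;                    /* pretend the depot is empty */
}

static void *slab_layer_alloc(void)
{
    static int object;              /* pretend slab-layer object */
    return &object;
}

void *slab_alloc_sketch(cpu_cache_t *cc)
{
    for (;;) {
        /* Step 1: pop from the current magazine if it is not empty. */
        if (cc->current->busy > 0)
            return cc->current->objs[--cc->current->busy];

        /* Step 2: current magazine empty; swap it with the last one
         * and retry Step 1. */
        if (cc->last->busy > 0) {
            magazine_t *tmp = cc->current;
            cc->current = cc->last;
            cc->last = tmp;
            continue;
        }

        /* Step 3: both magazines empty; reload one magazine from the
         * shared list of full magazines and retry Step 1. */
        magazine_t *full = depot_pop();
        if (full != NULL) {
            cc->current = full;
            continue;
        }

        /* Step 4: fail-safe; allocate from the conventional slab layer. */
        return slab_layer_alloc();
    }
}
```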
|
<formalpara> |
<title>Deallocation</title> |
|
<para><emphasis>Step 1.</emphasis> During a deallocation request,
the slab allocator checks if the current magazine of the local
processor magazine cache is not full. If it is not full, the
pointer to the object is just pushed into the magazine and the
algorithm returns.</para>
|
<para><emphasis>Step 2.</emphasis> If the current magazine is
full, the allocator will attempt to swap it with the last magazine
from the cache and return to the first step. If the last magazine
is also full, the algorithm will fall through to Step 3.</para>
|
<para><emphasis>Step 3.</emphasis> Now the allocator is in the
situation when both magazines in the processor magazine cache are
full. The allocator moves one magazine to the shared list of full
magazines, replacing it with a new empty magazine. The algorithm
continues with Step 1.</para>
</formalpara> |
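<para>The deallocation steps admit a symmetric sketch. Again, all
names are illustrative, and the hypothetical
<code>depot_swap_full_for_empty()</code> merely stands in for the
shared-list manipulation:</para>

```c
#include <stddef.h>

#define MAG_SIZE 4

typedef struct { size_t busy; void *objs[MAG_SIZE]; } magazine_t;
typedef struct { magazine_t *current; magazine_t *last; } cpu_cache_t;

/* Hypothetical stand-in: hand a full magazine over to the shared list
 * of full magazines and receive a fresh empty one. */
static magazine_t *depot_swap_full_for_empty(magazine_t *full)
{
    (void) full;                     /* a real depot would enqueue it */
    static magazine_t empty;         /* sketch only: one reusable slot */
    empty.busy = 0;
    return &empty;
}

void slab_free_sketch(cpu_cache_t *cc, void *obj)
{
    for (;;) {
        /* Step 1: push into the current magazine if it is not full. */
        if (cc->current->busy < MAG_SIZE) {
            cc->current->objs[cc->current->busy++] = obj;
            return;
        }
        /* Step 2: current magazine full; swap it with the last one
         * and retry Step 1. */
        if (cc->last->busy < MAG_SIZE) {
            magazine_t *tmp = cc->current;
            cc->current = cc->last;
            cc->last = tmp;
            continue;
        }
        /* Step 3: both full; move one magazine to the shared list of
         * full magazines, replace it with an empty one, retry Step 1. */
        cc->current = depot_swap_full_for_empty(cc->current);
    }
}
```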
</section> |
</section> |
</section> |
|
<!-- End of Physmem --> |
</section> |
|
<section> |
|
<para><!-- |
|
TLB shootdown ASID/ASID:PAGE/ALL.
TLB shootdown requests can come in asynchronously,
so there is a cache of TLB shootdown requests. Upon cache overflow, TLB shootdown ALL is executed.
|
|
<para> |
Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc). |
Special address space area type - device - prohibits shrink/extend syscalls to call on it. |
</section> |
|
<section> |
<title>Paging</title> |
|
<para>Virtual memory usually uses a paged memory model, where the
virtual memory address space is divided into
<emphasis>pages</emphasis> (usually 4096 bytes in size) and physical
memory is divided into frames (of the same size as a page, of
course). Each page may be mapped to some frame and then, upon a
memory access to a virtual address, the CPU performs
<emphasis>address translation</emphasis> during the instruction
execution. A non-existing mapping generates a page fault exception
and invokes the kernel exception handler, thus allowing the kernel
to manipulate the rules of memory access. Information about page
mappings is stored by the kernel in the <link
linkend="page_tables">page tables</link>.</para>
|
<para>The majority of architectures use multi-level page tables,
which means that physical memory needs to be accessed several times
before the physical address is obtained. This fact would cause
serious performance overhead in virtual memory management. To avoid
it, the <link linkend="tlb">Translation Lookaside Buffer
(TLB)</link> is used.</para>
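<para>A software walk of hypothetical 4-level page tables can be
sketched as follows. The 9-bit indices, 4096-byte pages and the
<code>pte_t</code> layout are illustrative assumptions of this
sketch, not any particular architecture's format:</para>

```c
#include <stdint.h>
#include <stddef.h>

#define LEVELS     4
#define INDEX_BITS 9               /* 512 entries per table */
#define PAGE_BITS  12              /* 4096-byte pages */

typedef struct pte {
    int present;                   /* is the mapping valid? */
    union {
        struct pte *next;          /* next-level table (upper levels) */
        uint64_t frame;            /* physical frame address (last level) */
    } u;
} pte_t;

/* Returns 0 and fills *phys on success; returns -1 on a missing
 * mapping, where real hardware would raise a page fault exception. */
int translate(pte_t *root, uint64_t virt, uint64_t *phys)
{
    pte_t *table = root;
    for (int level = 0; level < LEVELS; level++) {
        /* Each level consumes INDEX_BITS of the virtual address. */
        unsigned shift = PAGE_BITS + (LEVELS - 1 - level) * INDEX_BITS;
        size_t index = (size_t) (virt >> shift) & ((1u << INDEX_BITS) - 1);
        pte_t *pte = &table[index];

        if (!pte->present)
            return -1;             /* page fault: no mapping exists */

        if (level == LEVELS - 1) { /* leaf: compose the physical address */
            *phys = pte->u.frame | (virt & ((1u << PAGE_BITS) - 1));
            return 0;
        }
        table = pte->u.next;       /* one more physical memory access */
    }
    return -1;                     /* not reached */
}
```

<para>Note that every level of the walk costs one extra memory access,
which is exactly the overhead the TLB is designed to hide.</para>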
</section> |
|
<section> |
<title>Address spaces</title> |
|
<section> |
<indexterm><primary>memory management</primary><secondary>address space area</secondary></indexterm> |
<title>Address space areas</title>
|
<para>Each address space consists of mutually disjoint contiguous
address space areas. An address space area is precisely defined by its
</section> |
|
<section> |
<indexterm><primary>memory management</primary><secondary>asid</secondary></indexterm> |
<title>Address Space ID (ASID)</title> |
|
<para>When switching to a different task, the kernel also needs to
<section> |
<title>Virtual address translation</title> |
|
<section id="page_tables">
<title>Page tables</title>
<para>The HelenOS kernel has two different approaches to the paging
implementation: <emphasis>4-level page tables</emphasis> and a
<emphasis>global hash table</emphasis>, which are accessible via a
generic paging abstraction layer. This distinction was caused by the
major architectural differences between the supported platforms. The
abstraction is implemented with the help of a global structure of
pointers to the basic mapping functions,
<emphasis>page_mapping_operations</emphasis>. To achieve different
page table functionality, the corresponding layer must implement the
functions declared in
<emphasis>page_mapping_operations</emphasis>.</para>
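<para>Such an abstraction can be pictured as a structure of function
pointers. The member names, signatures and the one-entry demo backend
below are illustrative guesses for this sketch, not the actual HelenOS
<code>page_mapping_operations</code> interface:</para>

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    /* Insert a mapping of the virtual page to the physical frame. */
    void (*mapping_insert)(uintptr_t page, uintptr_t frame, unsigned flags);
    /* Remove the mapping of the virtual page. */
    void (*mapping_remove)(uintptr_t page);
    /* Find the mapping; returns NULL when the page is not mapped. */
    uintptr_t *(*mapping_find)(uintptr_t page);
} page_mapping_operations_t;

/* Trivial demo backend mapping at most one page, to show the wiring;
 * a real backend would manipulate page tables or a hash table. */
static uintptr_t demo_page, demo_frame;
static int demo_valid;

static void demo_insert(uintptr_t page, uintptr_t frame, unsigned flags)
{
    (void) flags;
    demo_page = page;
    demo_frame = frame;
    demo_valid = 1;
}

static void demo_remove(uintptr_t page)
{
    if (demo_valid && demo_page == page)
        demo_valid = 0;
}

static uintptr_t *demo_find(uintptr_t page)
{
    return (demo_valid && demo_page == page) ? &demo_frame : NULL;
}

/* Each paging implementation fills in its own set of operations. */
static page_mapping_operations_t demo_ops = {
    .mapping_insert = demo_insert,
    .mapping_remove = demo_remove,
    .mapping_find   = demo_find,
};
```

<para>Generic kernel code then calls through the structure only, so the
choice of backend is invisible to it.</para>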
|
<formalpara> |
<title>4-level page tables</title> |
|
|
|
|
<para>4-level page tables are a generalization of the hardware
capabilities of several architectures. Each address space has its
own page tables.<itemizedlist>
<listitem> |
ia32 uses 2-level page tables, with full hardware support. |
</listitem> |
mips and ppc32 have 2-level page tables, with software-simulated support.
</listitem> |
</itemizedlist></para> |
</formalpara>
|
<indexterm><primary>memory management</primary><secondary>hash table</secondary></indexterm>

<formalpara>
<title>Global hash tables</title>
|
<para>The implementation of the global hash table was encouraged by
the ia64 architecture. One of the major differences between the
global hash table and hierarchical page tables is that only one
global hash table exists in the whole system, while hierarchical
tables are maintained per address space. The generic hash table
implementation with separate collision chains is used, and ASID
support is required in order to use the global hash table. (Note
that ia64 has its VHPT disabled for now.)</para>
</formalpara>
|
<para>Thus, the hash table contains information about all address
space mappings in the system, so the hash of an entry must be
computed from both the address space identification (pointer or
ASID) and the virtual address of the page. The generic hash table
implementation assumes that the pointers to the address spaces are
likely to lie close to each other, so it uses their least
significant bits for the hash; it also assumes that the virtual page
addresses have roughly the same probability of occurring, so the
least significant bits of the VPN compose the hash index.</para>
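<para>One possible way to compose such a hash index is sketched below.
The XOR mixing, the bit widths and the name <code>page_hash</code> are
illustrative assumptions, not the actual HelenOS hash function:</para>

```c
#include <stdint.h>
#include <stddef.h>

#define HASH_BITS 12                       /* 4096 collision chains */
#define PAGE_BITS 12                       /* 4096-byte pages */
#define HASH_MASK ((1u << HASH_BITS) - 1)

/* Combine the least significant bits of the address space identifier
 * (pointer or ASID) with the least significant bits of the virtual
 * page number into a chain index. */
size_t page_hash(uintptr_t as_id, uintptr_t vaddr)
{
    uintptr_t vpn = vaddr >> PAGE_BITS;    /* virtual page number */
    return (size_t) ((as_id ^ vpn) & HASH_MASK);
}
```

<para>Entries hashing to the same index are then kept on that chain
and distinguished by comparing the full key.</para>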
|
<para>Thanks to the abstract paging interface, there is room left
for other paging implementations, for example B-Tree based page
tables.</para>
</section> |
|
<section id="tlb"> |
<indexterm><primary>memory management</primary><secondary>tlb</secondary></indexterm> |
<title>Translation Lookaside Buffer</title>
|
<para>Due to the extensive overhead during the page mapping lookup in |
invalidating the contents of the TLB whenever there is some change
in the page tables. Those changes may occur when a page or a group
of pages is unmapped, a mapping is changed or the system switches
the active address
space to schedule a new system task (which is effectively a batch
unmap of all the address space mappings). Moreover, this
invalidation operation must be done on all system CPUs because each
CPU has its own independent TLB cache. Thus maintaining TLB
consistency on SMP configurations is not as trivial a task as it
might look at first glance. The naive solution would assume remote
TLB invalidation, which is not possible on most architectures,
because of the simple fact that flushing the TLB is allowed only on
the local CPU and there is no way to access other CPUs' TLB caches
and invalidate them remotely.</para>
|
<para>The technique of remote invalidation of TLB entries is called
"TLB shootdown". HelenOS uses a variation of the algorithm described
by
<para>As the situation demands, partial invalidation of the TLB
caches may be needed. In the case of a simple memory mapping change,
it is necessary to invalidate only one or more adjacent pages. If
the architecture is aware of ASIDs, the kernel invalidates only
entries from one particular address space during the address space
switch. The final option of TLB invalidation is the complete TLB
cache invalidation, which is the operation that flushes all entries
in the TLB.</para>
|
<para>TLB shootdown is performed in two phases. First, the initiator
sends an IPI message indicating the TLB shootdown request to the
rest of the CPUs. Then, it waits until all CPUs confirm that they
have executed the TLB invalidation action.</para>
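<para>The two-phase handshake can be sketched with C11 atomics
standing in for real IPIs and spinlocks. The names, the confirmation
counter and the flag are illustrative assumptions of this sketch; the
actual HelenOS implementation differs:</para>

```c
#include <stdatomic.h>
#include <stddef.h>

#define CPU_COUNT 4

static atomic_int pending;         /* CPUs that still have to confirm */
static atomic_int do_shootdown;    /* "IPI" flag observed by other CPUs */

/* Initiator: announce the request, then wait for all confirmations. */
void tlb_shootdown_start(void)
{
    atomic_store(&pending, CPU_COUNT - 1);
    atomic_store(&do_shootdown, 1);        /* stands in for sending IPIs */
    while (atomic_load(&pending) > 0)
        ;                                  /* actively wait for confirmations */
    atomic_store(&do_shootdown, 0);
}

/* Other CPUs, upon receiving the "IPI": invalidate the local TLB and
 * confirm the execution to the initiator. */
void tlb_shootdown_ipi_recv(void)
{
    /* local_tlb_invalidate(); -- architecture-specific, omitted */
    atomic_fetch_sub(&pending, 1);         /* confirm to the initiator */
}
```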
</section> |
</section> |
</section> |
|
<section> |
<title>---</title> |
|
<para>At the moment HelenOS does not support swapping.</para> |
|
<para>Page faults are used to allocate frames on demand within an
address space area. On architectures which support it,
non-executable pages are supported.</para>
</section> |
</section> |
</chapter> |