<title>Memory management</title>

<section>
<!-- VM -->

<title>Virtual memory management</title>

<section>
<title>Address spaces</title>

<para />
</section>

<section>
<title>Virtual address translation</title>

<para />
</section>
|
<para>Page tables. 4-level hierarchical and hash page tables are directly
supported; a B+ tree variant can also be implemented.</para>
|
<para>Paging is handled through an abstraction layer.</para>
|
<para>TLB shootdown implementation (the TLB is updated upon mapping
update/removal). TLB shootdown can target an ASID, an ASID:PAGE pair, or
ALL entries. TLB shootdown requests can come in asynchronously, so there is
a cache of TLB shootdown requests. Upon cache overflow, TLB shootdown ALL
is executed.</para>
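<para>The overflow behaviour described above can be sketched in C as follows. This is a minimal illustration only: the type and function names (<varname>tlb_shootdown_cache_t</varname>, <varname>tlb_shootdown_enqueue</varname>, the <varname>TLB_INVL_*</varname> constants) are made up for the example and are not the actual HelenOS identifiers.</para>

```c
#include <stddef.h>
#include <stdbool.h>

#define TLB_CACHE_LEN 8   /* capacity of the shootdown request cache */

typedef enum {
    TLB_INVL_ASID,   /* invalidate a whole address space */
    TLB_INVL_PAGES,  /* invalidate given pages (ASID:PAGE) */
    TLB_INVL_ALL     /* invalidate everything */
} tlb_invalidate_type_t;

typedef struct {
    tlb_invalidate_type_t type;
    unsigned asid;
    unsigned long page;
} tlb_shootdown_t;

typedef struct {
    tlb_shootdown_t cache[TLB_CACHE_LEN];
    size_t count;          /* pending requests */
    bool invalidate_all;   /* set on cache overflow */
} tlb_shootdown_cache_t;

/* Record an asynchronously arriving shootdown request; on overflow,
 * degrade to a single "TLB shootdown ALL" operation. */
void tlb_shootdown_enqueue(tlb_shootdown_cache_t *c,
                           tlb_invalidate_type_t type,
                           unsigned asid, unsigned long page)
{
    if (c->invalidate_all)
        return;  /* everything will be flushed anyway */
    if (c->count == TLB_CACHE_LEN) {
        c->invalidate_all = true;  /* overflow: shootdown ALL */
        c->count = 0;
        return;
    }
    c->cache[c->count++] = (tlb_shootdown_t){ type, asid, page };
}
```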
|
<para>Address spaces. Address space areas (kept in a B+ tree) exist only
for user space. A set of syscalls (shrink/extend etc.) operates on them. A
special address space area type, device, prohibits the shrink/extend
syscalls from being called on it. Each address space has a link to its
mapping tables (hierarchical tables are per address space, hash tables are
global).</para>
</section>

<!-- End of VM -->
<section id="zones_and_frames">
<title>Zones and frames</title>

<para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>

<para>On some architectures, not all of the physical memory is available
for conventional use. These limitations require the kernel to maintain a
</itemizedlist></para>
</formalpara>
</section>
</section>

<section id="buddy_allocator">
<title>Buddy allocator</title>

<section>
<title>Overview</title>
|
<para>In the buddy allocator, memory is broken down into power-of-two
sized, naturally aligned blocks. These blocks are organized in an array of
lists, in which the list with index i contains all unallocated blocks of
size <mathphrase>2<superscript>i</superscript></mathphrase>. The index i is
called the order of the block. Should there be two adjacent, equally sized
blocks in list <mathphrase>i</mathphrase> (i.e. buddies), the buddy
allocator coalesces them and puts the resulting block in list
<mathphrase>i + 1</mathphrase>, provided that the resulting block would be
naturally aligned. Similarly, when the allocator is asked to allocate a
block of size <mathphrase>2<superscript>i</superscript></mathphrase>, it
first tries to satisfy the request from the list with index i. If the
request cannot be satisfied (i.e. list i is empty), the buddy allocator
tries to allocate and split a larger block from the list with index i + 1.
Both of these algorithms are recursive. The recursion ends either when
there are no blocks to coalesce in the former case, or when there are no
blocks that can be split in the latter case.</para>
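<para>The order arithmetic above can be sketched in a few lines of C. This is an illustration of the general technique, not the HelenOS API: with naturally aligned blocks identified by the index of their first frame, the buddy of an order-i block differs from it only in bit i of the index, which also encodes the "resulting block must be naturally aligned" condition.</para>

```c
#include <stddef.h>

/* Index of the buddy of the block starting at 'index' with order 'order'.
 * Because blocks are naturally aligned, the buddy differs only in bit
 * 'order' of the first-frame index. */
static inline size_t buddy_index(size_t index, unsigned order)
{
    return index ^ ((size_t) 1 << order);
}

/* Two equally sized order-'order' blocks may be coalesced only if they
 * are actual buddies, i.e. the resulting order+1 block is again naturally
 * aligned.  Mere adjacency is not enough. */
static inline int are_buddies(size_t a, size_t b, unsigned order)
{
    return buddy_index(a, order) == b;
}
```

<para>For example, the order-3 blocks starting at frames 8 and 16 are adjacent and equally sized, but they are not buddies, because the block they would form (frames 8-23) would not be aligned to its size.</para>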
|
<!--graphic fileref="images/mm1.png" format="EPS" /-->
|
<para>This approach greatly reduces external fragmentation of memory and
helps with allocating bigger contiguous blocks of memory aligned to their
size. On the other hand, the buddy allocator suffers from increased
internal fragmentation of memory and is not suitable for general kernel
allocations. This purpose is better addressed by the <link
linkend="slab">slab allocator</link>.</para>
</section>
|
<section>
<title>Implementation</title>
|
<para>The buddy allocator is, in fact, an abstract framework which can
easily be specialized to serve one particular task. It knows nothing about
the nature of the memory it helps to allocate. To compensate for this lack
of knowledge, the buddy allocator exports an interface that each of its
clients is required to implement. When supplied with an implementation of
this interface, the buddy allocator can use specialized external functions
to find the buddy of a block, split and coalesce blocks, manipulate block
order and mark blocks busy or available. For precise documentation of this
interface, refer to the <link linkend="???">HelenOS Generic Kernel
Reference Manual</link>.</para>
|
<formalpara>
<title>Data organization</title>

<para>Each entity allocable by the buddy allocator is required to
contain space for storing the block order number and a link variable
used to interconnect blocks within the same order.</para>

<para>Whatever entities are allocated by the buddy allocator, the first
entity within a block is used to represent the entire block. The first
entity keeps the order of the whole block. Other entities within the
block are assigned the magic value
<constant>BUDDY_INNER_BLOCK</constant>. This is especially important for
effective identification of buddies in a one-dimensional array, because
the entity that represents a potential buddy cannot be associated with
<constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is associated with
<constant>BUDDY_INNER_BLOCK</constant>, then it is not a buddy).</para>

<para>The buddy allocator always uses the first frame to represent a
frame block. This frame contains the <varname>buddy_order</varname>
variable, which records the size of the block it represents (a block of
<mathphrase>2<superscript>buddy_order</superscript></mathphrase>
frames). Other frames in the block have this value set to the magic
<constant>BUDDY_INNER_BLOCK</constant>, which is much greater than the
buddy <varname>max_order</varname> value.</para>

<para>Each <varname>frame_t</varname> also contains a link member used
to keep the frame in the linked list of its order.</para>
</formalpara>
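<para>The per-entity data described above can be sketched as follows. This is a simplified illustration: the real <varname>frame_t</varname> in HelenOS has more members, and the concrete value of <constant>BUDDY_INNER_BLOCK</constant> here is assumed for the example only.</para>

```c
#include <stdint.h>

/* Magic order value: this entity lies inside a block and does not
 * represent one (assumed value, chosen to exceed any valid order). */
#define BUDDY_INNER_BLOCK 0xff

/* Minimal per-frame data required by the buddy framework: the block
 * order (valid only in the first frame of a block) and a link for the
 * free list of that order. */
typedef struct frame {
    uint8_t buddy_order;       /* order, or BUDDY_INNER_BLOCK */
    struct frame *buddy_link;  /* link within one order's free list */
} frame_t;

/* Only the first frame of a block carries a valid order; any frame
 * marked BUDDY_INNER_BLOCK can immediately be ruled out as a buddy. */
static inline int frame_is_block_head(const frame_t *f)
{
    return f->buddy_order != BUDDY_INNER_BLOCK;
}
```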
<formalpara>
<title>Allocation algorithm</title>

<para>Upon a <mathphrase>2<superscript>i</superscript></mathphrase>-frame
block allocation request, the allocator checks whether there are any
blocks available in the order list <varname>i</varname>. If yes, it
removes a block from the order list and returns its address. If not, it
recursively allocates a
<mathphrase>2<superscript>i+1</superscript></mathphrase>-frame block and
splits it into two
<mathphrase>2<superscript>i</superscript></mathphrase>-frame blocks. It
then adds one of the blocks to the order list <varname>i</varname> and
returns the address of the other.</para>
</formalpara>
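<para>The recursive allocation step can be sketched as below. This is purely illustrative: free blocks are represented by bare first-frame indices kept on per-order stacks, whereas the real allocator links <varname>frame_t</varname> structures; all names are made up for the example.</para>

```c
#include <stddef.h>

#define MAX_ORDER 9
#define STACK_LEN 64

/* Per-order stacks of free blocks, each block identified by the index
 * of its first frame. */
static size_t free_stack[MAX_ORDER + 1][STACK_LEN];
static size_t free_top[MAX_ORDER + 1];  /* free blocks per order */

static void push_block(unsigned order, size_t index)
{
    free_stack[order][free_top[order]++] = index;
}

/* Allocate a 2^order frame block; returns its first-frame index, or
 * (size_t) -1 when no block of that size can be made. */
size_t buddy_alloc(unsigned order)
{
    if (order > MAX_ORDER)
        return (size_t) -1;              /* recursion bottoms out */

    if (free_top[order] > 0)             /* satisfy from list 'order' */
        return free_stack[order][--free_top[order]];

    /* List is empty: allocate a larger block and split it in two. */
    size_t big = buddy_alloc(order + 1);
    if (big == (size_t) -1)
        return (size_t) -1;
    push_block(order, big + ((size_t) 1 << order)); /* second half stays free */
    return big;                                     /* first half is returned */
}
```

<para>Starting with a single free order-4 block (frames 0-15), a request for an order-2 block splits it twice: the caller gets frames 0-3, while frames 4-7 (order 2) and 8-15 (order 3) end up on the free lists.</para>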
<formalpara>
<title>Deallocation algorithm</title>

<para>Check whether the block has a so-called buddy (another free
<mathphrase>2<superscript>i</superscript></mathphrase>-frame block that
can be linked with the freed block into a
<mathphrase>2<superscript>i+1</superscript></mathphrase>-frame block).
Technically, the buddy is the odd/even neighbor block of an even/odd
block, respectively. In addition, we can impose an extra requirement
that the resulting block must be aligned to its size. This requirement
guarantees natural block alignment for the blocks coming out of the
allocation system.</para>

<para>Using direct pointer arithmetic and the
<varname>frame_t::ref_count</varname> and
<varname>frame_t::buddy_order</varname> variables, finding the buddy is
done in constant time.</para>
</formalpara>
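<para>Constant-time buddy lookup can be sketched as follows, assuming a flat array of frame descriptors and a <varname>ref_count</varname> of zero meaning "free". The structure is simplified from the text above and the function name is made up; only the arithmetic is the point.</para>

```c
#include <stddef.h>
#include <stdint.h>

#define BUDDY_INNER_BLOCK 0xff  /* assumed magic order value */

typedef struct {
    uint8_t buddy_order;  /* valid in the first frame of a block */
    unsigned ref_count;   /* 0 means the frame is free */
} frame_t;

/* Find the buddy of the free block starting at frames[index]; returns
 * its index, or (size_t) -1 when the buddy is not a free block head of
 * the same order.  Runs in constant time: one XOR plus two checks. */
size_t buddy_find(frame_t *frames, size_t nframes, size_t index)
{
    uint8_t order = frames[index].buddy_order;
    size_t buddy = index ^ ((size_t) 1 << order);  /* flip bit 'order' */

    if (buddy >= nframes)
        return (size_t) -1;
    /* A frame marked BUDDY_INNER_BLOCK (or of another order) cannot be
     * a buddy block head. */
    if (frames[buddy].buddy_order != order)
        return (size_t) -1;
    if (frames[buddy].ref_count != 0)
        return (size_t) -1;                        /* buddy is allocated */
    return buddy;
}
```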
</section>
</section>
|
<section id="slab">
<title>Slab allocator</title>

<section>
<title>Overview</title>
|
<para><termdef><glossterm>Slab</glossterm> represents a contiguous piece
of memory, usually made of several physically contiguous
pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists of
one or more slabs.</termdef></para>
|
<para>The majority of memory allocation requests in the kernel are for
small, frequently used data structures. For this purpose the slab
allocator is a perfect solution. The basic idea behind the slab
allocator is to have lists of commonly used objects available packed
into pages. This avoids the overhead of allocating and destroying
commonly used types of objects such as threads, virtual memory
structures etc. Also, because the allocated size exactly matches the
object size, slab allocation completely eliminates the internal
fragmentation issue.</para>
</section>

<section>
<title>Implementation</title>

<para>The SLAB allocator is closely modelled after <ulink
url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
</listitem>

<listitem>
dynamic magazine growing (different magazine sizes are already supported, but we would need to adjust the allocation strategy)
</listitem>
</itemizedlist></para>
|
<section>
<title>Magazine layer</title>

<para>Due to the extensive bottleneck on SMP architectures caused by the
global slab locking mechanism, which serialized the processing of all
slab allocation requests, a new layer was introduced into the classic
slab allocator design. The slab allocator was extended to support
per-CPU caches ('magazines') to achieve good SMP scaling.
<termdef>The slab SMP performance bottleneck was resolved by introducing
a per-CPU caching scheme called the <glossterm>magazine
layer</glossterm></termdef>.</para>
|
<para>A magazine is an N-element cache of objects, so each magazine can
satisfy N allocations. A magazine behaves like an automatic weapon
magazine (LIFO, stack), so allocation and deallocation become simple
push/pop pointer operations. The trick is that the CPU does not access
any global slab allocator data while allocating from its magazine, thus
making parallel allocations between CPUs possible.</para>
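<para>The push/pop fast path can be sketched as below. This is a minimal illustration: the magazine layout, the value of N, and the function names are assumptions for the example, not the actual HelenOS definitions.</para>

```c
#include <stddef.h>
#include <stdbool.h>

#define MAG_SIZE 8   /* N: each magazine satisfies N allocations */

/* A per-CPU LIFO stack of cached object pointers. */
typedef struct {
    size_t busy;             /* number of cached objects */
    void *objs[MAG_SIZE];
} magazine_t;

/* Pop: CPU-local allocation; NULL means the magazine is exhausted.
 * No global allocator state is touched. */
void *magazine_obj_get(magazine_t *mag)
{
    if (mag->busy == 0)
        return NULL;
    return mag->objs[--mag->busy];
}

/* Push: CPU-local deallocation; false means the magazine is full. */
bool magazine_obj_put(magazine_t *mag, void *obj)
{
    if (mag->busy == MAG_SIZE)
        return false;
    mag->objs[mag->busy++] = obj;
    return true;
}
```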
|
<para>The implementation also requires an additional feature: the
CPU-bound magazine is actually a pair of magazines, which avoids
thrashing when a single item is repeatedly allocated and deallocated at
the magazine size boundary. LIFO order is enforced, which should avoid
fragmentation as much as possible.</para>
|
<para>Another important entity of the magazine layer is the full
magazine depot, which stores full magazines that any of the CPU magazine
caches may use to reload their active CPU magazine. The magazine depot
can be pre-filled with full magazines during initialization, but in the
current implementation it is filled during object deallocation, when a
CPU magazine becomes full.</para>
|
<para>Slab allocator control structures are allocated from special slabs
that are marked by a special flag indicating that they should not use
the slab magazine layer. This is done to avoid possible infinite
recursion and deadlock during conventional slab allocation
requests.</para>
</section>
|
<section>
<title>Allocation/deallocation</title>
<para>Every cache contains a list of full slabs and a list of partially
full slabs. Empty slabs are immediately freed (thrashing will be avoided
because of magazines).</para>

<para>TODO: <itemizedlist>
<listitem>
For better CPU scaling, the magazine allocation strategy should be extended. Currently, if the cache does not have a magazine, it asks the non-CPU-cached magazine cache to provide one. It might be feasible to add a CPU-cached magazine cache (which would allocate its magazines from the non-CPU-cached magazine cache). This would provide a nice per-CPU buffer. The other possibility is to use the per-cache 'empty-magazine-list', which decreases competition for the single per-system magazine cache.
</listitem>

<listitem>
It might be good to add lock granularity even at the slab level; we could then try_spinlock over all partial slabs and thus improve scalability even at the slab level.
</listitem>
</itemizedlist></para>
<para>The slab allocator allocates a lot of space and does not free it.
When the frame allocator fails to allocate a frame, it calls
slab_reclaim(). It tries 'light reclaim' first, then brutal reclaim. The
light reclaim releases slabs from the CPU-shared magazine list until at
least one slab is deallocated in each cache (this algorithm should
probably change). The brutal reclaim removes all cached objects, even
from CPU-bound magazines.</para>
|
<formalpara>
<title>Allocation</title>

<para><emphasis>Step 1.</emphasis> When an allocation request arrives,
the slab allocator first checks the availability of memory in the
local CPU-bound magazine. If an object is there, we simply "pop" it
from the CPU magazine and return the pointer to it.</para>

<para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is empty,
the allocator attempts to reload the magazine by swapping it with the
second CPU magazine, and returns to the first step.</para>

<para><emphasis>Step 3.</emphasis> Now we are in the situation when
both CPU-bound magazines are empty, which makes the allocator access
the shared full-magazine depot to reload the CPU-bound magazines. If
the reload is successful (meaning there are full magazines in the
depot), the algorithm continues at Step 1.</para>

<para><emphasis>Step 4.</emphasis> In the final step of the
allocation, the object is allocated from the conventional slab layer
and the pointer is returned.</para>
</formalpara>
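<para>The four allocation steps above can be sketched as one function. This is a simplified model: the depot is modeled as a plain array of magazine pointers, a NULL return stands for "fall back to the slab layer" (Step 4), and all names are made up for the example.</para>

```c
#include <stddef.h>

#define MAG_SIZE 4

typedef struct {
    size_t busy;
    void *objs[MAG_SIZE];
} magazine_t;

typedef struct {
    magazine_t *current;  /* active CPU-bound magazine */
    magazine_t *last;     /* second CPU-bound magazine */
} cpu_cache_t;

/* Returns a cached object, or NULL meaning "allocate from the
 * conventional slab layer" (Step 4). */
void *mag_alloc(cpu_cache_t *cc, magazine_t **depot, size_t *ndepot)
{
    /* Step 1: pop from the current CPU-bound magazine. */
    if (cc->current->busy > 0)
        return cc->current->objs[--cc->current->busy];

    /* Step 2: swap with the second magazine and retry. */
    if (cc->last->busy > 0) {
        magazine_t *tmp = cc->current;
        cc->current = cc->last;
        cc->last = tmp;
        return cc->current->objs[--cc->current->busy];
    }

    /* Step 3: reload from the shared full-magazine depot. */
    if (*ndepot > 0) {
        cc->last = cc->current;
        cc->current = depot[--*ndepot];
        return cc->current->objs[--cc->current->busy];
    }

    /* Step 4: caller falls back to the slab layer. */
    return NULL;
}
```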
|
<formalpara>
<title>Deallocation</title>

<para><emphasis>Step 1.</emphasis> During a deallocation request, the
slab allocator checks whether the local CPU-bound magazine is full. If
it is not, we simply push the pointer into this magazine.</para>

<para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is full,
the allocator attempts to reload the magazine by swapping it with the
second CPU magazine, and returns to the first step.</para>

<para><emphasis>Step 3.</emphasis> Now we are in the situation when
both CPU-bound magazines are full, which makes the allocator access
the shared full-magazine depot to put one of the magazines into the
depot and obtain a new empty magazine. The algorithm continues at
Step 1.</para>
</formalpara>
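<para>The deallocation path mirrors allocation and can be sketched as below, under the same simplifying assumptions (array-backed depot, made-up names). A pre-allocated empty magazine stands in for the "new empty magazine" of Step 3; a false return means the caller must free the object directly to the slab layer.</para>

```c
#include <stddef.h>
#include <stdbool.h>

#define MAG_SIZE 4

typedef struct {
    size_t busy;
    void *objs[MAG_SIZE];
} magazine_t;

typedef struct {
    magazine_t *current;  /* active CPU-bound magazine */
    magazine_t *last;     /* second CPU-bound magazine */
} cpu_cache_t;

/* Deallocation steps 1-3; 'spare' is a pre-allocated empty magazine
 * installed when a full one is handed to the depot.  Returns false when
 * the depot itself is full (caller frees to the slab layer). */
bool mag_free(cpu_cache_t *cc, void *obj,
              magazine_t **depot, size_t *ndepot, size_t depot_cap,
              magazine_t *spare)
{
    /* Step 1: push into the current CPU-bound magazine. */
    if (cc->current->busy < MAG_SIZE) {
        cc->current->objs[cc->current->busy++] = obj;
        return true;
    }

    /* Step 2: swap with the second magazine and retry. */
    if (cc->last->busy < MAG_SIZE) {
        magazine_t *tmp = cc->current;
        cc->current = cc->last;
        cc->last = tmp;
        cc->current->objs[cc->current->busy++] = obj;
        return true;
    }

    /* Step 3: hand the full magazine to the depot, install an empty one. */
    if (*ndepot < depot_cap) {
        depot[(*ndepot)++] = cc->current;
        cc->current = spare;
        cc->current->objs[cc->current->busy++] = obj;
        return true;
    }

    return false;  /* depot full: free directly to the slab layer */
}
```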
</section>
</section>
</section>
