WebSVN – HelenOS-doc – Diff – /design/trunk/src/ch_memory_management.xml

   <?dbhtml filename="mm.html"?>
   <title>Memory management</title>
   <section>
-    <!-- VM -->
     <title>Virtual memory management</title>
     <section>
       <title>Address spaces</title>
-      <para></para>
+      <para />
     </section>
     <section>
       <title>Virtual address translation</title>
-      <para></para>
+      <para />
     </section>
+    <para>Page tables. 4 level hierarchical and hash directly supported. B+
+    Tree can be implemented.</para>
+    <para>For paging there is an abstract layer</para>
+    <para>TLB shootdown implementation (update TLB upon mapping
+    update/remove). TLB shootdown ASID/ASID:PAGE/ALL. TLB shootdown requests
+    can come in asynchroniously so there is a cache of TLB shootdown requests.
+    Upon cache overflow TLB shootdown ALL is executed</para>
+    <para>Address spaces. Address space area (B+ tree). Only for uspace. Set
+    of syscalls (shrink/extend etc). Special address space area type - device
+    - prohibits shrink/extend syscalls to call on it. Address space has link
+    to mapping tables (hierarchical - per Address space, hash - global
+    tables).</para>
   </section>
   <!-- End of VM -->
   <section>
     <title>Physical memory management</title>
     <section id="zones_and_frames">
       <title>Zones and frames</title>
-      <para>
-      <!--graphic fileref="images/mm2.png" /-->
-      <!--graphic fileref="images/buddy_alloc.svg" format="SVG" /-->
+      <para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>
-      </para>
       <para>On some architectures not whole physical memory is available for
       conventional usage. This limitations require from kernel to maintain a
       table of available and unavailable ranges of physical memory addresses.
       Main idea of zones is in creating memory zone entity, that is a
                Quickly find out if zone contains required number of frames to allocate and if this chunk of memory is properly aligned. This issue is perfectly solved bu the buddy allocator.
             </listitem>
           </itemizedlist></para>
       </formalpara>
     </section>
-  </section>
-  <section id="buddy_allocator">
-    <title>Buddy allocator</title>
-    <section>
-      <title>Overview</title>
-      <para>In buddy allocator, memory is broken down into power-of-two sized
-      naturally aligned blocks. These blocks are organized in an array of
-      lists in which list with index i contains all unallocated blocks of the
-      size <mathphrase>2<superscript>i</superscript></mathphrase>. The index i
-      is called the order of block. Should there be two adjacent equally sized
-      blocks in list <mathphrase>i</mathphrase> (i.e. buddies), the buddy
-      allocator would coalesce them and put the resulting block in list
-      <mathphrase>i + 1</mathphrase>, provided that the resulting block would
-      be naturally aligned. Similarily, when the allocator is asked to
-      allocate a block of size
-      <mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
-      to satisfy the request from list with index i. If the request cannot be
-      satisfied (i.e. the list i is empty), the buddy allocator will try to
-      allocate and split larger block from list with index i + 1. Both of
-      these algorithms are recursive. The recursion ends either when there are
-      no blocks to coalesce in the former case or when there are no blocks
-      that can be split in the latter case.</para>
-      <!--graphic fileref="images/mm1.png" format="EPS" /-->
-      <para>This approach greatly reduces external fragmentation of memory and
-      helps in allocating bigger continuous blocks of memory aligned to their
-      size. On the other hand, the buddy allocator suffers increased internal
-      fragmentation of memory and is not suitable for general kernel
-      allocations. This purpose is better addressed by the <link
-      linkend="slab">slab allocator</link>.</para>
-    </section>
-    <section>
-      <title>Implementation</title>
-      <para>The buddy allocator is, in fact, an abstract framework wich can be
-      easily specialized to serve one particular task. It knows nothing about
-      the nature of memory it helps to allocate. In order to beat the lack of
-      this knowledge, the buddy allocator exports an interface that each of
-      its clients is required to implement. When supplied an implementation of
-      this interface, the buddy allocator can use specialized external
-      functions to find buddy for a block, split and coalesce blocks,
-      manipulate block order and mark blocks busy or available. For precize
-      documentation of this interface, refer to <link linkend="???">HelenOS
-      Generic Kernel Reference Manual</link>.</para>
-      <formalpara>
-        <title>Data organization</title>
-        <para>Each entity allocable by the buddy allocator is required to
-        contain space for storing block order number and a link variable used
-        to interconnect blocks within the same order.</para>
-        <para>Whatever entities are allocated by the buddy allocator, the
-        first entity within a block is used to represent the entire block. The
-        first entity keeps the order of the whole block. Other entities within
-        the block are assigned the magic value
-        <constant>BUDDY_INNER_BLOCK</constant>. This is especially important
-        for effective identification of buddies in one-dimensional array
-        because the entity that represents a potential buddy cannot be
-        associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is
-        associated with <constant>BUDDY_INNER_BLOCK</constant> then it is not
-        a buddy).</para>
-      </formalpara>
-      <formalpara>
-        <title>Data organization</title>
-        <para>Buddy allocator always uses first frame to represent frame
-        block. This frame contains <varname>buddy_order</varname> variable to
-        provide information about the block size it actually represents (
-        <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
-        frames block). Other frames in block have this value set to magic
-        <constant>BUDDY_INNER_BLOCK</constant> that is much greater than buddy
-        <varname>max_order</varname> value.</para>
-        <para>Each <varname>frame_t</varname> also contains pointer member to
-        hold frame structure in the linked list inside one order.</para>
+    <section id="buddy_allocator">
-      </formalpara>
+      <title>Buddy allocator</title>
-      <formalpara>
+      <section>
-        <title>Allocation algorithm</title>
+        <title>Overview</title>
+        <para>In buddy allocator, memory is broken down into power-of-two
+        sized naturally aligned blocks. These blocks are organized in an array
+        of lists in which list with index i contains all unallocated blocks of
-        <para>Upon <mathphrase>2<superscript>i</superscript></mathphrase>
+        the size <mathphrase>2<superscript>i</superscript></mathphrase>. The
-        frames block allocation request, allocator checks if there are any
+        index i is called the order of block. Should there be two adjacent
-        blocks available at the order list <varname>i</varname>. If yes,
+        equally sized blocks in list <mathphrase>i</mathphrase> (i.e.
+        buddies), the buddy allocator would coalesce them and put the
+        resulting block in list <mathphrase>i + 1</mathphrase>, provided that
-        removes block from order list and returns its address. If no,
+        the resulting block would be naturally aligned. Similarily, when the
-        recursively allocates
+        allocator is asked to allocate a block of size
-        <mathphrase>2<superscript>i+1</superscript></mathphrase> frame block,
+        <mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
+        to satisfy the request from list with index i. If the request cannot
+        be satisfied (i.e. the list i is empty), the buddy allocator will try
+        to allocate and split larger block from list with index i + 1. Both of
+        these algorithms are recursive. The recursion ends either when there
+        are no blocks to coalesce in the former case or when there are no
-        splits it into two
+        blocks that can be split in the latter case.</para>
+        <!--graphic fileref="images/mm1.png" format="EPS" /-->
-        <mathphrase>2<superscript>i</superscript></mathphrase> frame blocks.
+        <para>This approach greatly reduces external fragmentation of memory
+        and helps in allocating bigger continuous blocks of memory aligned to
-        Then adds one of the blocks to the <varname>i</varname> order list and
+        their size. On the other hand, the buddy allocator suffers increased
+        internal fragmentation of memory and is not suitable for general
+        kernel allocations. This purpose is better addressed by the <link
-        returns address of another.</para>
+        linkend="slab">slab allocator</link>.</para>
-      </formalpara>
+      </section>
-      <formalpara>
+      <section>
-        <title>Deallocation algorithm</title>
+        <title>Implementation</title>
+        <para>The buddy allocator is, in fact, an abstract framework wich can
+        be easily specialized to serve one particular task. It knows nothing
+        about the nature of memory it helps to allocate. In order to beat the
+        lack of this knowledge, the buddy allocator exports an interface that
+        each of its clients is required to implement. When supplied an
+        implementation of this interface, the buddy allocator can use
+        specialized external functions to find buddy for a block, split and
+        coalesce blocks, manipulate block order and mark blocks busy or
+        available. For precize documentation of this interface, refer to <link
+        linkend="???">HelenOS Generic Kernel Reference Manual</link>.</para>
+        <formalpara>
+          <title>Data organization</title>
+          <para>Each entity allocable by the buddy allocator is required to
+          contain space for storing block order number and a link variable
+          used to interconnect blocks within the same order.</para>
+          <para>Whatever entities are allocated by the buddy allocator, the
+          first entity within a block is used to represent the entire block.
+          The first entity keeps the order of the whole block. Other entities
+          within the block are assigned the magic value
+          <constant>BUDDY_INNER_BLOCK</constant>. This is especially important
+          for effective identification of buddies in one-dimensional array
+          because the entity that represents a potential buddy cannot be
+          associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
+          is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
+          not a buddy).</para>
+          <para>Buddy allocator always uses first frame to represent frame
+          block. This frame contains <varname>buddy_order</varname> variable
+          to provide information about the block size it actually represents (
+          <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
+          frames block). Other frames in block have this value set to magic
+          <constant>BUDDY_INNER_BLOCK</constant> that is much greater than
+          buddy <varname>max_order</varname> value.</para>
+          <para>Each <varname>frame_t</varname> also contains pointer member
+          to hold frame structure in the linked list inside one order.</para>
+        </formalpara>
+        <formalpara>
+          <title>Allocation algorithm</title>
+          <para>Upon <mathphrase>2<superscript>i</superscript></mathphrase>
+          frames block allocation request, allocator checks if there are any
+          blocks available at the order list <varname>i</varname>. If yes,
+          removes block from order list and returns its address. If no,
+          recursively allocates
+          <mathphrase>2<superscript>i+1</superscript></mathphrase> frame
+          block, splits it into two
+          <mathphrase>2<superscript>i</superscript></mathphrase> frame blocks.
+          Then adds one of the blocks to the <varname>i</varname> order list
+          and returns address of another.</para>
+        </formalpara>
+        <formalpara>
+          <title>Deallocation algorithm</title>
-        <para>Check if block has so called buddy (another free
+          <para>Check if block has so called buddy (another free
-        <mathphrase>2<superscript>i</superscript></mathphrase> frame block
+          <mathphrase>2<superscript>i</superscript></mathphrase> frame block
-        that can be linked with freed block into the
+          that can be linked with freed block into the
-        <mathphrase>2<superscript>i+1</superscript></mathphrase> block).
+          <mathphrase>2<superscript>i+1</superscript></mathphrase> block).
-        Technically, buddy is a odd/even block for even/odd block
+          Technically, buddy is a odd/even block for even/odd block
-        respectively. Plus we can put an extra requirement, that resulting
+          respectively. Plus we can put an extra requirement, that resulting
-        block must be aligned to its size. This requirement guarantees natural
+          block must be aligned to its size. This requirement guarantees
-        block alignment for the blocks coming out the allocation
+          natural block alignment for the blocks coming out the allocation
-        system.</para>
+          system.</para>
-        <para>Using direct pointer arithmetics,
+          <para>Using direct pointer arithmetics,
-        <varname>frame_t::ref_count</varname> and
+          <varname>frame_t::ref_count</varname> and
-        <varname>frame_t::buddy_order</varname> variables, finding buddy is
+          <varname>frame_t::buddy_order</varname> variables, finding buddy is
-        done at constant time.</para>
+          done at constant time.</para>
-      </formalpara>
+        </formalpara>
+      </section>
     </section>
     <section id="slab">
       <title>Slab allocator</title>
       <section>
-        <title>Introduction</title>
+        <title>Overview</title>
+        <para><termdef><glossterm>Slab</glossterm> represents a contiguous
+        piece of memory, usually made of several physically contiguous
+        pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists
+        of one or more slabs.</termdef></para>
         <para>The majority of memory allocation requests in the kernel are for
         small, frequently used data structures. For this purpose the slab
-        allocator is a perfect solution. The basic idea behind a slab
+        allocator is a perfect solution. The basic idea behind the slab
         allocator is to have lists of commonly used objects available packed
         into pages. This avoids the overhead of allocating and destroying
-        commonly used types of objects such as inodes, threads, virtual memory
+        commonly used types of objects such threads, virtual memory structures
-        structures etc.</para>
-        <para>Original slab allocator locking mechanism has become a
+        etc. Also due to the exact allocated size matching, slab allocation
-        significant preformance bottleneck on SMP architectures. <termdef>Slab
-        SMP perfromance bottleneck was resolved by introducing a per-CPU
-        caching scheme called as <glossterm>magazine
-        layer</glossterm></termdef>.</para>
+        completely eliminates internal fragmentation issue.</para>
       </section>
       <section>
-        <title>Implementation details (needs revision)</title>
+        <title>Implementation</title>
         <para>The SLAB allocator is closely modelled after <ulink
         url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
         OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams </ulink>
         with the following exceptions: <itemizedlist>
             <listitem>
                - cache coloring
             </listitem>
             <listitem>
-               - dynamic magazine growing (different magazine sizes are already supported, but we would need to adjust allocation strategy)
+               - dynamic magazine grow (different magazine sizes are already supported, but we would need to adjust allocation strategy)
             </listitem>
           </itemizedlist></para>
-        <para>The SLAB allocator supports per-CPU caches ('magazines') to
-        facilitate good SMP scaling.</para>
-        <para>When a new object is being allocated, it is first checked, if it
-        is available in CPU-bound magazine. If it is not found there, it is
-        allocated from CPU-shared SLAB - if partial full is found, it is used,
-        otherwise a new one is allocated.</para>
-        <para>When an object is being deallocated, it is put to CPU-bound
-        magazine. If there is no such magazine, new one is allocated (if it
-        fails, the object is deallocated into SLAB). If the magazine is full,
-        it is put into cpu-shared list of magazines and new one is
-        allocated.</para>
-        <para>The CPU-bound magazine is actually a pair of magazines to avoid
-        thrashing when somebody is allocating/deallocating 1 item at the
-        magazine size boundary. LIFO order is enforced, which should avoid
-        fragmentation as much as possible.</para>
-        <para>Every cache contains list of full slabs and list of partialy
-        full slabs. Empty SLABS are immediately freed (thrashing will be
-        avoided because of magazines).</para>
-        <para>The SLAB information structure is kept inside the data area, if
-        possible. The cache can be marked that it should not use magazines.
-        This is used only for SLAB related caches to avoid deadlocks and
-        infinite recursion (the SLAB allocator uses itself for allocating all
-        it's control structures).</para>
-        <para>The SLAB allocator allocates lots of space and does not free it.
-        When frame allocator fails to allocate the frame, it calls
-        slab_reclaim(). It tries 'light reclaim' first, then brutal reclaim.
-        The light reclaim releases slabs from cpu-shared magazine-list, until
-        at least 1 slab is deallocated in each cache (this algorithm should
-        probably change). The brutal reclaim removes all cached objects, even
-        from CPU-bound magazines.</para>
-        <para>TODO: <itemizedlist>
-            <listitem>
+        <section>
-               For better CPU-scaling the magazine allocation strategy should be extended. Currently, if the cache does not have magazine, it asks for non-cpu cached magazine cache to provide one. It might be feasible to add cpu-cached magazine cache (which would allocate it's magazines from non-cpu-cached mag. cache). This would provide a nice per-cpu buffer. The other possibility is to use the per-cache 'empty-magazine-list', which decreases competing for 1 per-system magazine cache.
+          <title>Magazine layer</title>
-            </listitem>
+          <para>Due to the extensive bottleneck on SMP architures, caused by
+          global SLAB locking mechanism, making processing of all slab
+          allocation requests serialized, a new layer was introduced to the
+          classic slab allocator design. Slab allocator was extended to
+          support per-CPU caches 'magazines' to achieve good SMP scaling.
+          <termdef>Slab SMP perfromance bottleneck was resolved by introducing
+          a per-CPU caching scheme called as <glossterm>magazine
+          layer</glossterm></termdef>.</para>
+          <para>Magazine is a N-element cache of objects, so each magazine can
+          satisfy N allocations. Magazine behaves like a automatic weapon
+          magazine (LIFO, stack), so the allocation/deallocation become simple
+          push/pop pointer operation. Trick is that CPU does not access global
+          slab allocator data during the allocation from its magazine, thus
+          making possible parallel allocations between CPUs.</para>
+          <para>Implementation also requires adding another feature as the
+          CPU-bound magazine is actually a pair of magazines to avoid
+          thrashing when during allocation/deallocatiion of 1 item at the
+          magazine size boundary. LIFO order is enforced, which should avoid
+          fragmentation as much as possible.</para>
+          <para>Another important entity of magazine layer is a full magazine
+          depot, that stores full magazines which are used by any of the CPU
+          magazine caches to reload active CPU magazine. Magazine depot can be
+          pre-filled with full magazines during initialization, but in current
+          implementation it is filled during object deallocation, when CPU
+          magazine becomes full.</para>
+          <para>Slab allocator control structures are allocated from special
+          slabs, that are marked by special flag, indicating that it should
+          not be used for slab magazine layer. This is done to avoid possible
+          infinite recursions and deadlock during conventional slab allocaiton
+          requests.</para>
+        </section>
+        <section>
+          <title>Allocation/deallocation</title>
+          <para>Every cache contains list of full slabs and list of partialy
+          full slabs. Empty slabs are immediately freed (thrashing will be
+          avoided because of magazines).</para>
+          <para>The SLAB allocator allocates lots of space and does not free
+          it. When frame allocator fails to allocate the frame, it calls
+          slab_reclaim(). It tries 'light reclaim' first, then brutal reclaim.
+          The light reclaim releases slabs from cpu-shared magazine-list,
+          until at least 1 slab is deallocated in each cache (this algorithm
+          should probably change). The brutal reclaim removes all cached
+          objects, even from CPU-bound magazines.</para>
-            <listitem>
+          <formalpara>
+            <title>Allocation</title>
+            <para><emphasis>Step 1.</emphasis> When it comes to the allocation
+            request, slab allocator first of all checks availability of memory
+            in local CPU-bound magazine. If it is there, we would just "pop"
+            the CPU magazine and return the pointer to object.</para>
+            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
+            empty, allocator will attempt to reload magazin, swapping it with
+            second CPU magazine and returns to the first step.</para>
+            <para><emphasis>Step 3.</emphasis> Now we are in the situation
+            when both CPU-bound magazines are empty, which makes allocator to
+            access shared full-magazines depot to reload CPU-bound magazines.
+            If reload is succesful (meaning there are full magazines in depot)
+            algoritm continues at Step 1.</para>
+            <para><emphasis>Step 4.</emphasis> Final step of the allocation.
-               - it might be good to add granularity of locks even to slab level, we could then try_spinlock over all partial slabs and thus improve scalability even on slab level
+            In this step object is allocated from the conventional slab layer
+            and pointer is returned.</para>
-            </listitem>
+          </formalpara>
+          <formalpara>
+            <title>Deallocation</title>
+            <para><emphasis>Step 1.</emphasis> During deallocation request,
+            slab allocator will check if the local CPU-bound magazine is not
+            full. In this case we will just push the pointer to this
-          </itemizedlist></para>
+            magazine.</para>
+            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
+            full, allocator will attempt to reload magazin, swapping it with
+            second CPU magazine and returns to the first step.</para>
+            <para><emphasis>Step 3.</emphasis> Now we are in the situation
+            when both CPU-bound magazines are full, which makes allocator to
+            access shared full-magazines depot to put one of the magazines to
+            the depot and creating new empty magazine. Algoritm continues at
+            Step 1.</para>
+          </formalpara>
+        </section>
       </section>
     </section>
     <!-- End of Physmem -->
   </section>

Subversion Repositories HelenOS-doc

(root)/design/trunk/src/ch_memory_management.xml @ 185 – Rev 27 → 34