<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <section>
    <title>Virtual memory management</title>

    <section>
      <title>Introduction</title>

      <para>Virtual memory is a memory management technique used by the
      kernel to achieve several mission-critical goals. <itemizedlist>
          <listitem>
             Isolate each task from the other tasks running on the system at the same time.
          </listitem>

          <listitem>
             Allow more memory to be allocated than the actual physical memory size of the machine.
          </listitem>

          <listitem>
             Allow, in general, loading and executing two programs that are linked at the same address without complicated relocations.
          </listitem>
        </itemizedlist></para>

      <para><!--
                TLB shootdown ASID/ASID:PAGE/ALL.
                TLB shootdown requests can come in asynchronously,
                so there is a cache of TLB shootdown requests. Upon cache overflow, TLB shootdown ALL is executed.

                <para>
                        Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
                        Special address space area type - device - prohibits shrink/extend syscalls to call on it.
                        Address space has link to mapping tables (hierarchical - per Address space, hash - global tables).
                </para>
      --></para>
    </section>

    <section>
      <title>Paging</title>

      <para>Virtual memory usually uses a paged memory model, where the
      virtual memory address space is divided into <emphasis>pages</emphasis>
      (usually 4096 bytes each) and physical memory is divided into frames
      (of the same size as a page, of course). Each page may be mapped to
      some frame and then, upon a memory access to a virtual address, the CPU
      performs <emphasis>address translation</emphasis> during the
      instruction execution. A non-existing mapping generates a page fault
      exception, which invokes the kernel exception handler, thus allowing
      the kernel to enforce the rules of memory access. The information
      needed to map pages is stored by the kernel in the
      <link linkend="page_tables">page tables</link>.</para>
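
      <para>To make the translation granularity concrete, the following
      minimal user-space sketch (not kernel code; constants chosen for the
      common 4096-byte page) shows how a virtual address splits into a page
      number, which indexes the page tables, and an offset, which is carried
      over unchanged into the physical address.</para>

      <programlisting><![CDATA[
#include <stdint.h>
#include <stdio.h>

/* Illustrative constants for 4096-byte pages; the real values are
 * architecture-dependent and defined by the kernel. */
#define PAGE_SHIFT  12
#define PAGE_SIZE   (1u << PAGE_SHIFT)            /* 4096 */
#define PAGE_MASK   ((uintptr_t) PAGE_SIZE - 1)   /* 0xfff */

int main(void)
{
        uintptr_t va = 0x7f3a2c15;

        uintptr_t page   = va >> PAGE_SHIFT;  /* index used to walk the page tables */
        uintptr_t offset = va & PAGE_MASK;    /* copied unchanged into the physical address */

        printf("va=%#lx -> page=%#lx, offset=%#lx\n",
            (unsigned long) va, (unsigned long) page, (unsigned long) offset);
        return 0;
}
]]></programlisting>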

      <para>The majority of architectures use multi-level page tables, which
      means that physical memory has to be accessed several times before the
      physical address is obtained. This fact would cause serious performance
      overhead in virtual memory management. To avoid it, the <link
      linkend="tlb">Translation Lookaside Buffer (TLB)</link> is used.</para>

      <para>At the moment HelenOS does not support swapping.</para>

      <para>- we use page faults to allocate frames on demand within an as_area; on architectures that support it, we support non-executable pages</para>
    </section>

    <section>
      <title>Address spaces</title>

      <section>
        <title>Address spaces and areas</title>

        <para>- an address space consists of so-called address space areas,
        organized in a B+tree; these areas describe the used parts of the
        address space belonging to the user address space. Each area is given
        by its base address, size and flags (rwx/dd).</para>

        <para>- user threads can manipulate their address space
        (create/resize/share as_areas) through a set of syscalls</para>
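
        <para>As a rough illustration of the data this implies, the sketch
        below shows what an address space area descriptor could look like;
        the type, field and flag names are invented for illustration and are
        not the actual HelenOS definitions.</para>

        <programlisting><![CDATA[
#include <stdint.h>
#include <stddef.h>

/* Illustrative flags; the real kernel encodes read/write/execute/device
 * permissions in its own way. */
#define AS_AREA_READ    (1 << 0)
#define AS_AREA_WRITE   (1 << 1)
#define AS_AREA_EXEC    (1 << 2)
#define AS_AREA_DEVICE  (1 << 3)

/* Hypothetical descriptor of one address space area: a base address,
 * a size in pages and the access flags, kept in a B+tree keyed by base. */
typedef struct {
        uintptr_t base;      /* starting virtual address, page-aligned */
        size_t pages;        /* area size in pages */
        unsigned int flags;  /* combination of AS_AREA_* bits */
} as_area_sketch_t;
]]></programlisting>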
      </section>

      <section>
        <title>Address Space ID (ASID)</title>

        <para>- some hardware can distinguish different address spaces from
        each other (the goal is to maximize TLB utilization); it does so by
        associating an address space identifier (ASID, RID, ppc32 ???) with
        each TLB/page table entry. These ids come in various widths: 8 bits
        up to 24 bits (how many does ppc32 have?)</para>

        <para>- the kernel understands this and uses its own ASID abstraction
        (on ia64 it is e.g. a number derived from the RID, on mips32 it is
        the ASID itself); the existence of ASIDs is a necessary precondition
        for using the _global_ page hash table mechanism.</para>

        <para>- on all architectures there are far fewer ASIDs than the
        theoretical number of concurrently running tasks ~ address spaces, so
        a mechanism is implemented that allows an ASID to be taken away from
        one address space and assigned to another</para>

        <para>- the relation task ~ address space: in theory an address space
        could be shared by several tasks, but we do not use this possibility
        and it is not handled in any way. It therefore holds that every task
        has its own address space</para>
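
        <para>The reassignment can be pictured with the following simplified
        sketch; the types, names and the policy of picking a victim are
        invented for illustration, and a real implementation must also shoot
        down the victim's stale TLB entries before reusing the ASID.</para>

        <programlisting><![CDATA[
#include <stddef.h>

#define ASIDS_MAX  256   /* e.g. an 8-bit ASID space */

/* Hypothetical per-address-space record. */
typedef struct {
        int asid;    /* valid ASID, or -1 when none is assigned */
        int active;  /* non-zero while some CPU runs threads of this space */
} as_sketch_t;

/*
 * Simplified sketch: when no free ASID exists, take one away from an
 * inactive address space and hand it to the newcomer.  The victim will be
 * given a fresh ASID lazily the next time it becomes active.
 */
static int asid_reassign(as_sketch_t *spaces, size_t count, as_sketch_t *newcomer)
{
        for (size_t i = 0; i < count; i++) {
                if (spaces[i].asid >= 0 && !spaces[i].active) {
                        int asid = spaces[i].asid;
                        spaces[i].asid = -1;    /* victim loses its ASID */
                        newcomer->asid = asid;  /* after invalidating stale TLB entries */
                        return asid;
                }
        }
        return -1;  /* nothing to steal; caller must wait or flush everything */
}
]]></programlisting>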
      </section>
    </section>

    <section>
      <title>Virtual address translation</title>

      <section id="page_tables">
        <title>Page tables</title>

        <para>The HelenOS kernel has two different approaches to the paging
        implementation: <emphasis>4-level page tables</emphasis> and
        <emphasis>global hash tables</emphasis>, both accessible via a
        generic paging abstraction layer. This division was caused by the
        major architectural differences between platforms.</para>

        <formalpara>
          <title>4-level page tables</title>

          <para>4-level page tables are a generalization of the hardware
          capabilities of certain platforms. <itemizedlist>
              <listitem>
                 ia32 uses 2-level page tables, with full hardware support.
              </listitem>

              <listitem>
                 amd64 uses 4-level page tables, also coming with full hardware support.
              </listitem>

              <listitem>
                 mips and ppc32 have 2-level tables with software-simulated support.
              </listitem>
            </itemizedlist></para>
        </formalpara>
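
        <para>To show why a multi-level lookup costs several memory accesses,
        here is a minimal sketch of a 4-level table walk; the number of index
        bits per level, the entry layout and the names are illustrative and
        do not correspond to any particular HelenOS port.</para>

        <programlisting><![CDATA[
#include <stdint.h>
#include <stdbool.h>

/* Illustrative layout: 4 levels, 9 index bits per level, 4 KiB pages. */
#define PAGE_SHIFT      12
#define LEVEL_BITS      9
#define LEVEL_MASK      ((1u << LEVEL_BITS) - 1)
#define PTE_PRESENT     (1u << 0)
#define PTE_FRAME_MASK  (~(uint64_t) 0xfff)

/* Each table is an array of 512 entries; an entry either points to the
 * next-level table or, at the last level, to the frame itself. */
typedef uint64_t pte_t;

static bool translate(pte_t *root, uint64_t va, uint64_t *pa)
{
        pte_t *table = root;

        for (int level = 3; level >= 0; level--) {
                unsigned int index =
                    (va >> (PAGE_SHIFT + level * LEVEL_BITS)) & LEVEL_MASK;
                pte_t pte = table[index];            /* one physical memory access */

                if (!(pte & PTE_PRESENT))
                        return false;                /* would raise a page fault */

                if (level == 0) {
                        *pa = (pte & PTE_FRAME_MASK) |
                            (va & ((1u << PAGE_SHIFT) - 1));
                        return true;
                }
                /* In a real kernel the physical address stored in the entry
                 * would have to be converted to a kernel-accessible pointer. */
                table = (pte_t *) (uintptr_t) (pte & PTE_FRAME_MASK);
        }
        return false;
}
]]></programlisting>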

        <formalpara>
          <title>Global hash tables</title>

          <para>- global page hash table: there is only one in the whole
          system (it is used by ia64); note that ia64 currently has the VHPT
          disabled. A generic hash table with separate collision chains is
          used</para>
        </formalpara>

        <para>Thanks to the abstract paging interface, it is possible to add
        more paging implementations, for example B-tree page tables.</para>
      </section>

      <section id="tlb">
        <title>Translation Lookaside buffer</title>

        <para>- TLBs cache the information stored in the page tables;
        alternatively, the page tables (or various hw extensions [e.g. VHPT,
        ppc32 hw hash table]) can be viewed as large TLBs</para>

        <para>- when a mapping is modified in or removed from the page
        tables, consistency between the TLB and these tables must be
        maintained; this has to be done on all CPUs; for this purpose we have
        a simplified version of the TLB shootdown mechanism; it is a
        variation of the algorithm described in: D. Black et al.,
        "Translation Lookaside Buffer Consistency: A Software Approach,"
        Proc. Third Int'l Conf. Architectural Support for Programming
        Languages and Operating Systems, 1989, pp. 113-122.</para>

        <para>- it should be noted that more lightweight versions of the TLB
        shootdown algorithm exist</para>
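
        <para>The commented notes at the beginning of this chapter mention
        shootdown requests of the kinds ASID / ASID:PAGE / ALL and a per-CPU
        cache of such requests that degrades to ALL on overflow. A rough
        sketch of that idea follows; the types and names are invented for
        illustration and do not match the HelenOS declarations.</para>

        <programlisting><![CDATA[
#include <stdint.h>
#include <stddef.h>

/* Hypothetical shootdown request kinds. */
typedef enum {
        TLB_INVALIDATE_ALL,    /* flush everything */
        TLB_INVALIDATE_ASID,   /* flush one address space */
        TLB_INVALIDATE_PAGES   /* flush a page range of one address space */
} tlb_invalidate_type_t;

typedef struct {
        tlb_invalidate_type_t type;
        unsigned int asid;     /* valid for ASID and PAGES requests */
        uintptr_t page;        /* first page, valid for PAGES requests */
        size_t count;          /* number of pages, valid for PAGES requests */
} tlb_shootdown_msg_t;

#define TLB_QUEUE_LEN 8

/* Per-CPU cache of pending shootdown requests. */
typedef struct {
        tlb_shootdown_msg_t queue[TLB_QUEUE_LEN];
        size_t pending;
} cpu_tlb_queue_t;

/* Enqueue a request for one CPU; on overflow, degrade to a single ALL request. */
static void tlb_shootdown_enqueue(cpu_tlb_queue_t *q, tlb_shootdown_msg_t msg)
{
        if (q->pending == TLB_QUEUE_LEN) {
                q->queue[0].type = TLB_INVALIDATE_ALL;
                q->pending = 1;   /* everything else is subsumed by ALL */
                return;
        }
        q->queue[q->pending++] = msg;
}
]]></programlisting>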
      </section>
    </section>
  </section>

  <!-- End of VM -->

  <section>
    <!-- Phys mem -->

    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <title>Zones and frames</title>

      <para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>

      <para>On some architectures not the whole physical memory is available
      for conventional usage. This limitation requires the kernel to maintain
      a table of available and unavailable ranges of physical memory
      addresses. The main idea behind zones is to create a memory zone
      entity, which is a contiguous chunk of memory available for allocation.
      If some chunk is not available, we simply do not put it in any
      zone.</para>

      <para>A zone also serves informational purposes, keeping track of the
      number of free and busy frames. Physical memory allocation is likewise
      done inside a certain zone. Allocation of zone frames must be organized
      by the <link linkend="frame_allocator">frame allocator</link>
      associated with the zone.</para>

      <para>Some of the architectures (mips32, ppc32) have only one zone that
      covers the whole physical memory, while others (like ia32) may have
      multiple zones. Information about the zones on the current machine is
      stored in BIOS hardware tables or can be hardcoded into the kernel at
      compile time.</para>
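
      <para>A zone can thus be thought of as a record along the following
      lines; the field names are illustrative, not the HelenOS
      definitions.</para>

      <programlisting><![CDATA[
#include <stdint.h>
#include <stddef.h>

/*
 * Hypothetical zone descriptor: one contiguous range of frames available
 * for allocation, together with usage counters and the buddy system that
 * manages it.
 */
typedef struct {
        uintptr_t base;                     /* physical address of the first frame */
        size_t frame_count;                 /* total number of frames in the zone */
        size_t free_count;                  /* frames currently available */
        size_t busy_count;                  /* frames currently allocated */
        struct buddy_system *buddy_system;  /* allocator instance for this zone */
} zone_sketch_t;
]]></programlisting>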
    </section>

    <section id="frame_allocator">
      <title>Frame allocator</title>

      <formalpara>
        <title>Overview</title>

        <para>The frame allocator provides physical memory allocation for the
        kernel. Because of the zonal organization of physical memory, the
        frame allocator always works in the context of some zone, thus making
        it impossible to allocate a piece of memory lying in a different
        zone; this cannot happen anyway, because two adjacent zones can be
        merged into one. The frame allocator is also responsible for updating
        the information on the number of free/busy frames in the zone.
        Physical memory allocation inside one <link
        linkend="zones_and_frames">memory zone</link> is handled by an
        instance of the <link linkend="buddy_allocator">buddy
        allocator</link> tailored to allocate blocks of physical memory
        frames.</para>
      </formalpara>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request, the frame allocator tries to find
        the first zone that can satisfy the incoming request (i.e. has the
        required amount of free frames). During deallocation, the frame
        allocator needs to find the zone that contains the deallocated frame.
        This approach could bring up two potential problems: <itemizedlist>
            <listitem>
               Linear search of the zones does performance no good, but the number of zones is not expected to be high. And if it is, the list of zones can be replaced with a more time-efficient B-tree (a sketch of this search follows below).
            </listitem>

            <listitem>
               Quickly finding out whether a zone contains the required number of frames to allocate and whether that chunk of memory is properly aligned. This issue is perfectly solved by the buddy allocator.
            </listitem>
          </itemizedlist></para>
      </formalpara>
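
      <para>The linear search mentioned in the first item above can be
      pictured with this sketch; the zone record is a reduced version of the
      illustrative one from the previous section and the code is an
      illustration, not the actual HelenOS implementation.</para>

      <programlisting><![CDATA[
#include <stdint.h>
#include <stddef.h>

/* Minimal illustrative zone record; names are not the HelenOS definitions. */
typedef struct {
        uintptr_t base;       /* physical address of the first frame */
        size_t frame_count;   /* total frames in the zone */
        size_t free_count;    /* frames currently available */
} zone_sketch_t;

/* Find the first zone with enough free frames for the request. */
static zone_sketch_t *zone_find_for_alloc(zone_sketch_t *zones, size_t count,
    size_t frames_wanted)
{
        for (size_t i = 0; i < count; i++) {
                if (zones[i].free_count >= frames_wanted)
                        return &zones[i];
        }
        return NULL;    /* no zone can satisfy the request */
}

/* Find the zone containing a frame that is being deallocated. */
static zone_sketch_t *zone_find_for_free(zone_sketch_t *zones, size_t count,
    uintptr_t frame_addr, size_t frame_size)
{
        for (size_t i = 0; i < count; i++) {
                uintptr_t end = zones[i].base + zones[i].frame_count * frame_size;
                if (frame_addr >= zones[i].base && frame_addr < end)
                        return &zones[i];
        }
        return NULL;
}
]]></programlisting>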
    </section>

    <section id="buddy_allocator">
      <title>Buddy allocator</title>

      <section>
        <title>Overview</title>

        <para>In the buddy allocator, memory is broken down into power-of-two
        sized naturally aligned blocks. These blocks are organized in an
        array of lists, in which the list with index i contains all
        unallocated blocks of size
        <mathphrase>2<superscript>i</superscript></mathphrase>. The index i
        is called the order of the block. Should there be two adjacent
        equally sized blocks in the list <mathphrase>i</mathphrase> (i.e.
        buddies), the buddy allocator would coalesce them and put the
        resulting block in list <mathphrase>i + 1</mathphrase>, provided that
        the resulting block would be naturally aligned. Similarly, when the
        allocator is asked to allocate a block of size
        <mathphrase>2<superscript>i</superscript></mathphrase>, it first
        tries to satisfy the request from the list with index i. If the
        request cannot be satisfied (i.e. the list i is empty), the buddy
        allocator will try to allocate and split a larger block from the list
        with index i + 1. Both of these algorithms are recursive. The
        recursion ends either when there are no blocks to coalesce in the
        former case or when there are no blocks that can be split in the
        latter case.</para>

        <!--graphic fileref="images/mm1.png" format="EPS" /-->

        <para>This approach greatly reduces external fragmentation of memory
        and helps in allocating bigger contiguous blocks of memory aligned to
        their size. On the other hand, the buddy allocator suffers from
        increased internal fragmentation of memory and is not suitable for
        general kernel allocations. This purpose is better addressed by the
        <link linkend="slab">slab allocator</link>.</para>
      </section>

      <section>
        <title>Implementation</title>

        <para>The buddy allocator is, in fact, an abstract framework which
        can easily be specialized to serve one particular task. It knows
        nothing about the nature of the memory it helps to allocate. To
        overcome this lack of knowledge, the buddy allocator exports an
        interface that each of its clients is required to implement. When
        supplied with an implementation of this interface, the buddy
        allocator can use specialized external functions to find the buddy of
        a block, split and coalesce blocks, manipulate block order and mark
        blocks busy or available. For precise documentation of this
        interface, refer to the <link linkend="???">HelenOS Generic Kernel
        Reference Manual</link>.</para>

        <formalpara>
          <title>Data organization</title>

          <para>Each entity allocable by the buddy allocator is required to
          contain space for storing the block order number and a link
          variable used to interconnect blocks within the same order.</para>

          <para>Whatever entities are allocated by the buddy allocator, the
          first entity within a block is used to represent the entire block.
          The first entity keeps the order of the whole block. Other entities
          within the block are assigned the magic value
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially
          important for effective identification of buddies in a
          one-dimensional array, because the entity that represents a
          potential buddy cannot be associated with
          <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is associated
          with <constant>BUDDY_INNER_BLOCK</constant> then it is not a
          buddy).</para>

          <para>The buddy allocator always uses the first frame to represent
          a frame block. This frame contains the
          <varname>buddy_order</varname> variable, which provides information
          about the block size it actually represents (a block of
          <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
          frames). Other frames in the block have this value set to the magic
          <constant>BUDDY_INNER_BLOCK</constant>, which is much greater than
          the buddy <varname>max_order</varname> value.</para>

          <para>Each <varname>frame_t</varname> also contains a pointer
          member to hold the frame structure in the linked list inside one
          order.</para>
        </formalpara>
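
        <para>Putting the pieces above together, a simplified frame
        descriptor might look as follows; only the fields discussed in this
        chapter are shown and the exact layout in the kernel differs.</para>

        <programlisting><![CDATA[
#include <stdint.h>

#define MAX_ORDER          9      /* illustrative upper bound */
#define BUDDY_INNER_BLOCK  0xff   /* magic value, greater than any valid order */

/* Simplified frame descriptor: one per physical frame. */
typedef struct frame {
        uint8_t buddy_order;   /* order of the block this frame starts,
                                  or BUDDY_INNER_BLOCK for inner frames */
        uint16_t ref_count;    /* 0 means the frame is free */
        struct frame *next;    /* link in the free list of the same order */
        struct frame *prev;
} frame_sketch_t;
]]></programlisting>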

        <formalpara>
          <title>Allocation algorithm</title>

          <para>Upon a
          <mathphrase>2<superscript>i</superscript></mathphrase> frame block
          allocation request, the allocator checks whether there are any
          blocks available in the order list <varname>i</varname>. If yes, it
          removes a block from the order list and returns its address. If
          not, it recursively allocates a
          <mathphrase>2<superscript>i+1</superscript></mathphrase> frame
          block and splits it into two
          <mathphrase>2<superscript>i</superscript></mathphrase> frame
          blocks. It then adds one of the blocks to the <varname>i</varname>
          order list and returns the address of the other.</para>
        </formalpara>
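
        <para>The recursive descent can be written down roughly as follows;
        the free lists are reduced to trivial per-order stacks so that the
        control flow stands out, and the code is an illustration rather than
        the HelenOS implementation.</para>

        <programlisting><![CDATA[
#include <stddef.h>

#define MAX_ORDER   9
#define STACK_DEPTH 64   /* assumed sufficient for the illustration */

/* A tiny free list per order: just a stack of block start indices. */
static long free_blocks[MAX_ORDER + 1][STACK_DEPTH];
static size_t free_count[MAX_ORDER + 1];

static long list_pop(unsigned int order)
{
        if (free_count[order] == 0)
                return -1;
        return free_blocks[order][--free_count[order]];
}

static void list_push(unsigned int order, long block)
{
        free_blocks[order][free_count[order]++] = block;
}

/* Allocate a block of 2^order frames; returns its first frame index or -1. */
static long buddy_alloc(unsigned int order)
{
        long block = list_pop(order);
        if (block >= 0)
                return block;                   /* satisfied directly from list i */

        if (order == MAX_ORDER)
                return -1;                      /* nothing larger left to split */

        long bigger = buddy_alloc(order + 1);   /* recurse into list i + 1 */
        if (bigger < 0)
                return -1;

        /* Split: keep the first half, return the second half to list i. */
        list_push(order, bigger + (1L << order));
        return bigger;
}
]]></programlisting>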

        <formalpara>
          <title>Deallocation algorithm</title>

          <para>Check whether the block has a so-called buddy (another free
          <mathphrase>2<superscript>i</superscript></mathphrase> frame block
          that can be linked with the freed block into a
          <mathphrase>2<superscript>i+1</superscript></mathphrase> block).
          Technically, the buddy is an odd/even block for an even/odd block,
          respectively. In addition we can impose the extra requirement that
          the resulting block must be aligned to its size. This requirement
          guarantees natural block alignment for the blocks coming out of the
          allocation system.</para>

          <para>Using direct pointer arithmetic and the
          <varname>frame_t::ref_count</varname> and
          <varname>frame_t::buddy_order</varname> variables, finding the
          buddy is done in constant time.</para>
        </formalpara>
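
        <para>The constant-time buddy lookup can be sketched as follows; the
        frame record is a reduced version of the simplified descriptor shown
        under Data organization, and for a block starting at frame index i
        with order o the potential buddy starts at index <mathphrase>i ^
        2<superscript>o</superscript></mathphrase> (its odd/even neighbour of
        the same size).</para>

        <programlisting><![CDATA[
#include <stdint.h>
#include <stddef.h>

#define BUDDY_INNER_BLOCK 0xff

/* Reduced illustrative frame record. */
typedef struct frame {
        uint8_t buddy_order;
        uint16_t ref_count;     /* 0 means free */
} frame_sketch_t;

/*
 * The buddy is usable for coalescing only if it is free, of the same order
 * and not an inner frame of a larger block.
 */
static frame_sketch_t *find_buddy(frame_sketch_t *frames, size_t frame_count,
    size_t index, uint8_t order)
{
        size_t buddy_index = index ^ ((size_t) 1 << order);

        if (buddy_index >= frame_count)
                return NULL;

        frame_sketch_t *buddy = &frames[buddy_index];
        if (buddy->buddy_order != order)   /* also rejects BUDDY_INNER_BLOCK */
                return NULL;
        if (buddy->ref_count != 0)         /* buddy is allocated */
                return NULL;

        return buddy;
}
]]></programlisting>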
      </section>
    </section>

    <section id="slab">
      <title>Slab allocator</title>

      <section>
        <title>Overview</title>

        <para><termdef><glossterm>Slab</glossterm> represents a contiguous
        piece of memory, usually made of several physically contiguous
        pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists
        of one or more slabs.</termdef></para>

        <para>The majority of memory allocation requests in the kernel are
        for small, frequently used data structures. For this purpose the slab
        allocator is a perfect solution. The basic idea behind the slab
        allocator is to have lists of commonly used objects available packed
        into pages. This avoids the overhead of allocating and destroying
        commonly used types of objects such as threads, virtual memory
        structures etc. Also, due to the exact matching of allocated sizes,
        slab allocation completely eliminates the internal fragmentation
        issue.</para>
      </section>

      <section>
        <title>Implementation</title>

        <para>The SLAB allocator is closely modelled after the <ulink
        url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
        OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams </ulink>
        with the following exceptions: <itemizedlist>
            <listitem>
               empty SLABs are deallocated immediately (in Linux they are kept in a linked list, in Solaris ???)
            </listitem>

            <listitem>
               empty magazines are deallocated when not needed (in Solaris they are held in a linked list in the slab cache)
            </listitem>
          </itemizedlist> The following features are not currently supported
        but would be easy to add: <itemizedlist>
            <listitem>
               - cache coloring
            </listitem>

            <listitem>
               - dynamic magazine growing (different magazine sizes are already supported, but we would need to adjust the allocation strategy)
            </listitem>
          </itemizedlist></para>

        <section>
          <title>Magazine layer</title>

          <para>Due to the extensive bottleneck on SMP architectures caused
          by the global SLAB locking mechanism, which serializes the
          processing of all slab allocation requests, a new layer was
          introduced into the classic slab allocator design. The slab
          allocator was extended to support per-CPU caches, 'magazines', to
          achieve good SMP scaling. <termdef>The slab SMP performance
          bottleneck was resolved by introducing a per-CPU caching scheme
          called the <glossterm>magazine
          layer</glossterm></termdef>.</para>

          <para>A magazine is an N-element cache of objects, so each magazine
          can satisfy N allocations. The magazine behaves like an automatic
          weapon magazine (LIFO, stack), so allocation/deallocation becomes a
          simple push/pop pointer operation. The trick is that the CPU does
          not access global slab allocator data during an allocation from its
          magazine, thus making parallel allocations between CPUs
          possible.</para>
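
          <para>A magazine's push/pop behaviour is essentially that of a
          small per-CPU stack of object pointers, as in the following sketch
          (illustrative names and capacity, not the HelenOS
          declarations).</para>

          <programlisting><![CDATA[
#include <stddef.h>

#define MAGAZINE_SIZE 8   /* illustrative capacity N */

/* Hypothetical per-CPU magazine: a LIFO stack of cached object pointers. */
typedef struct {
        size_t busy;                /* number of objects currently cached */
        void *objs[MAGAZINE_SIZE];
} magazine_sketch_t;

/* Allocation fast path: pop a cached object, or NULL when the magazine is empty. */
static void *magazine_pop(magazine_sketch_t *mag)
{
        if (mag->busy == 0)
                return NULL;
        return mag->objs[--mag->busy];
}

/* Deallocation fast path: push the object, or report failure when full. */
static int magazine_push(magazine_sketch_t *mag, void *obj)
{
        if (mag->busy == MAGAZINE_SIZE)
                return -1;          /* caller must swap or refill magazines */
        mag->objs[mag->busy++] = obj;
        return 0;
}
]]></programlisting>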

          <para>The implementation also requires another feature: the
          CPU-bound magazine is actually a pair of magazines, in order to
          avoid thrashing when allocation/deallocation of 1 item oscillates
          around the magazine size boundary. LIFO order is enforced, which
          should avoid fragmentation as much as possible.</para>

          <para>Another important entity of the magazine layer is the full
          magazine depot, which stores full magazines that may be used by any
          of the CPU magazine caches to reload the active CPU magazine. The
          magazine depot can be pre-filled with full magazines during
          initialization, but in the current implementation it is filled
          during object deallocation, when a CPU magazine becomes
          full.</para>

          <para>Slab allocator control structures are allocated from special
          slabs, which are marked by a special flag indicating that they
          should not be used for the slab magazine layer. This is done to
          avoid possible infinite recursion and deadlock during conventional
          slab allocation requests.</para>
        </section>

        <section>
          <title>Allocation/deallocation</title>

          <para>Every cache contains a list of full slabs and a list of
          partially full slabs. Empty slabs are immediately freed (thrashing
          will be avoided because of the magazines).</para>

          <para>The SLAB allocator allocates a lot of space and does not free
          it. When the frame allocator fails to allocate a frame, it calls
          slab_reclaim(). It tries 'light reclaim' first, then brutal
          reclaim. The light reclaim releases slabs from the cpu-shared
          magazine-list, until at least 1 slab is deallocated in each cache
          (this algorithm should probably change). The brutal reclaim removes
          all cached objects, even from CPU-bound magazines.</para>

          <formalpara>
            <title>Allocation</title>

            <para><emphasis>Step 1.</emphasis> When an allocation request
            comes, the slab allocator first of all checks the availability of
            memory in the local CPU-bound magazine. If an object is there, we
            just "pop" the CPU magazine and return the pointer to the
            object.</para>

            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
            empty, the allocator will attempt to reload the magazine by
            swapping it with the second CPU magazine, and returns to the
            first step.</para>

            <para><emphasis>Step 3.</emphasis> Now we are in the situation
            when both CPU-bound magazines are empty, which makes the
            allocator access the shared full-magazines depot to reload the
            CPU-bound magazines. If the reload is successful (meaning there
            are full magazines in the depot), the algorithm continues at Step
            1.</para>

            <para><emphasis>Step 4.</emphasis> Final step of the allocation.
            In this step the object is allocated from the conventional slab
            layer and a pointer is returned.</para>
          </formalpara>
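
          <para>The four steps can be compressed into the following
          control-flow sketch; the helper functions are stand-ins for the
          magazine, depot and conventional slab operations described in this
          chapter, declared here only so that the flow is visible, and none
          of the names are the actual HelenOS functions.</para>

          <programlisting><![CDATA[
#include <stddef.h>

/* Illustrative stand-ins for the entities described in this chapter. */
typedef struct magazine magazine_t;

extern magazine_t *cpu_current_magazine(void);
extern void *magazine_pop(magazine_t *mag);   /* NULL when the magazine is empty */
extern int cpu_swap_magazines(void);          /* 0 only if the other magazine is non-empty */
extern int cpu_reload_from_depot(void);       /* 0 only if a full magazine was installed */
extern void *slab_obj_alloc(void);            /* conventional slab layer */

/* Sketch of the allocation path (Steps 1-4). */
static void *slab_alloc_sketch(void)
{
        for (;;) {
                /* Step 1: try the active CPU-bound magazine. */
                void *obj = magazine_pop(cpu_current_magazine());
                if (obj != NULL)
                        return obj;

                /* Step 2: swap with the second CPU magazine and retry. */
                if (cpu_swap_magazines() == 0)
                        continue;

                /* Step 3: reload from the shared full-magazines depot. */
                if (cpu_reload_from_depot() == 0)
                        continue;

                /* Step 4: fall back to the conventional slab layer. */
                return slab_obj_alloc();
        }
}
]]></programlisting>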

          <formalpara>
            <title>Deallocation</title>

            <para><emphasis>Step 1.</emphasis> During a deallocation request,
            the slab allocator checks whether the local CPU-bound magazine is
            not full. If it is not, we just push the pointer into this
            magazine.</para>

            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
            full, the allocator will attempt to reload the magazine by
            swapping it with the second CPU magazine, and returns to the
            first step.</para>

            <para><emphasis>Step 3.</emphasis> Now we are in the situation
            when both CPU-bound magazines are full, which makes the allocator
            access the shared full-magazines depot to put one of the
            magazines into the depot and create a new empty magazine. The
            algorithm continues at Step 1.</para>
          </formalpara>
        </section>
      </section>
    </section>

    <!-- End of Physmem -->
  </section>

  <section>
    <title>Memory sharing</title>

    <para>Not implemented yet(?)</para>
  </section>
</chapter>