Subversion Repositories HelenOS-doc


Rev 45 Rev 46
1
<?xml version="1.0" encoding="UTF-8"?>
1
<?xml version="1.0" encoding="UTF-8"?>
2
<chapter id="mm">
2
<chapter id="mm">
3
  <?dbhtml filename="mm.html"?>
3
  <?dbhtml filename="mm.html"?>
4
 
4
 
5
  <title>Memory management</title>
5
  <title>Memory management</title>
6
 
6
 
7
  <section>
7
  <section>
8
    <title>Virtual memory management</title>
8
    <title>Virtual memory management</title>
9
 
9
 
10
    <section>
10
    <section>
11
      <title>Introduction</title>
11
      <title>Introduction</title>
12
 
12
 
13
      <para>Virtual memory is a memory management technique used by
13
      <para>Virtual memory is a memory management technique used by
14
      the kernel to achieve several mission-critical goals. <itemizedlist>
14
      the kernel to achieve several mission-critical goals. <itemizedlist>
15
          <listitem>
15
          <listitem>
16
             Isolate each task from other tasks that are running on the system at the same time.
16
             Isolate each task from other tasks that are running on the system at the same time.
17
          </listitem>
17
          </listitem>
18
 
18
 
19
          <listitem>
19
          <listitem>
20
             Allow allocating more memory than the actual physical memory size of the machine.
20
             Allow allocating more memory than the actual physical memory size of the machine.
21
          </listitem>
21
          </listitem>
22
 
22
 
23
          <listitem>
23
          <listitem>
24
             Allow, in general, loading and executing two programs that are linked at the same address without complicated relocations.
24
             Allow, in general, loading and executing two programs that are linked at the same address without complicated relocations.
25
          </listitem>
25
          </listitem>
26
        </itemizedlist></para>
26
        </itemizedlist></para>
27
 
27
 
28
      <para><!--
28
      <para><!--
29
 
29
 
30
                TLB shootdown ASID/ASID:PAGE/ALL.
30
                TLB shootdown ASID/ASID:PAGE/ALL.
31
                TLB shootdown requests can come in asynchronously
31
                TLB shootdown requests can come in asynchronously
32
                so there is a cache of TLB shootdown requests. Upon cache overflow TLB shootdown ALL is executed
32
                so there is a cache of TLB shootdown requests. Upon cache overflow TLB shootdown ALL is executed
33
 
33
 
34
 
34
 
35
                <para>
35
                <para>
36
                        Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
36
                        Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
37
                        Special address space area type - device - prohibits shrink/extend syscalls to call on it.
37
                        Special address space area type - device - prohibits shrink/extend syscalls to call on it.
38
                        Address space has link to mapping tables (hierarchical - per Address space, hash - global tables).
38
                        Address space has link to mapping tables (hierarchical - per Address space, hash - global tables).
39
                </para>
39
                </para>
40
 
40
 
41
--></para>
41
--></para>
42
    </section>
42
    </section>
43
 
43
 
44
    <section>
44
    <section>
45
      <title>Paging</title>
45
      <title>Paging</title>
46
 
46
 
47
      <para>Virtual memory usually uses a paged memory model, in which the
47
      <para>Virtual memory usually uses a paged memory model, in which the
48
      virtual address space is divided into <emphasis>pages</emphasis>
48
      virtual address space is divided into <emphasis>pages</emphasis>
49
      (usually 4096 bytes in size) and physical memory is divided into
49
      (usually 4096 bytes in size) and physical memory is divided into
50
      frames (of the same size as a page). Each page may be mapped to
50
      frames (of the same size as a page). Each page may be mapped to
51
      some frame and then, upon memory access to the virtual address, the CPU
51
      some frame and then, upon memory access to the virtual address, the CPU
52
      performs <emphasis>address translation</emphasis> during instruction
52
      performs <emphasis>address translation</emphasis> during instruction
53
      execution. A missing mapping generates a page fault exception, which calls
53
      execution. A missing mapping generates a page fault exception, which calls
54
      the kernel exception handler, thus allowing the kernel to manipulate the rules of
54
      the kernel exception handler, thus allowing the kernel to manipulate the rules of
55
      memory access. Information about page mappings is stored by the kernel in the
55
      memory access. Information about page mappings is stored by the kernel in the
56
      <link linkend="page_tables">page tables</link>.</para>
56
      <link linkend="page_tables">page tables</link>.</para>
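      <para>The split of a virtual address into a page number and an offset
      can be sketched as follows. This is only an illustration that assumes the
      4096-byte page size mentioned above; the names
      <constant>PAGE_WIDTH</constant>, <constant>PAGE_SIZE</constant> and the
      three helper functions are hypothetical and are not taken from the
      HelenOS sources.</para>

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4096-byte pages, i.e. the low 12 bits are the page offset. */
#define PAGE_WIDTH 12u
#define PAGE_SIZE  ((uintptr_t)1 << PAGE_WIDTH)

/* Number of the page that contains the virtual address. */
uintptr_t page_number(uintptr_t vaddr)
{
    return vaddr >> PAGE_WIDTH;
}

/* Offset of the virtual address within its page. */
uintptr_t page_offset(uintptr_t vaddr)
{
    return vaddr & (PAGE_SIZE - 1);
}

/*
 * Physical address obtained once the page has been mapped to a frame:
 * the frame number replaces the page number, the offset is unchanged.
 */
uintptr_t frame_to_paddr(uintptr_t frame, uintptr_t vaddr)
{
    return (frame << PAGE_WIDTH) | page_offset(vaddr);
}
```

      <para>Only the page number takes part in the lookup; the offset is
      carried over unchanged because pages and frames have the same
      size.</para>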
57
 
57
 
58
      <para>The majority of architectures use multi-level page tables,
58
      <para>The majority of architectures use multi-level page tables,
59
      which means that physical memory must be accessed several times before the
59
      which means that physical memory must be accessed several times before the
60
      physical address is obtained. This would cause serious performance overhead in
60
      physical address is obtained. This would cause serious performance overhead in
61
      virtual memory management. To avoid this, the <link linkend="tlb">Translation
61
      virtual memory management. To avoid this, the <link linkend="tlb">Translation
62
      Lookaside Buffer (TLB)</link> is used.</para>
62
      Lookaside Buffer (TLB)</link> is used.</para>
63
 
-
 
64
      <para>At the moment HelenOS does not support swapping.</para>
-
 
65
 
-
 
66
      <para>- we use page faults to allocate frames on demand within an
-
 
67
      as_area - on architectures that support it, we support non-executable
-
 
68
      pages</para>
-
 
69
    </section>
63
    </section>
70
 
64
 
71
    <section>
65
    <section>
72
      <title>Address spaces</title>
66
      <title>Address spaces</title>
73
 
67
 
74
      <section>
68
      <section>
75
        <title>Address spaces and areas</title>
69
        <title>Address space areas</title>
-
 
70
 
-
 
71
        <para>Each address space consists of mutually disjoint contiguous
-
 
72
        address space areas. An address space area is precisely defined by its
-
 
73
        base address and the number of frames it contains.</para>
-
 
74
 
-
 
75
        <para>An address space area also has flags that define the behaviour
-
 
76
        and permissions of the particular area. <itemizedlist>
-
 
77
            <listitem>
-
 
78
               
-
 
79
 
-
 
80
              <emphasis>AS_AREA_READ</emphasis>
-
 
81
 
-
 
82
               flag indicates reading permission.
-
 
83
            </listitem>
-
 
84
 
-
 
85
            <listitem>
-
 
86
               
-
 
87
 
-
 
88
              <emphasis>AS_AREA_WRITE</emphasis>
-
 
89
 
-
 
90
               flag indicates writing permission.
-
 
91
            </listitem>
-
 
92
 
-
 
93
            <listitem>
-
 
94
               
-
 
95
 
-
 
96
              <emphasis>AS_AREA_EXEC</emphasis>
76
 
97
 
77
        <para>- an address space consists of so-called address space areas
98
               flag indicates code execution permission. Some architectures do not support restricting execution permission; on those architectures this flag has no effect.
-
 
99
            </listitem>
-
 
100
 
-
 
101
            <listitem>
-
 
102
               
-
 
103
 
78
        organized in a B+tree; these areas describe the used parts of the
104
              <emphasis>AS_AREA_DEVICE</emphasis>
-
 
105
 
79
        address space that belong to the user address space. Each part is given
106
               marks the area as mapped to device memory.
-
 
107
            </listitem>
80
        by its base address, size and flags (rwx/dd).</para>
108
          </itemizedlist></para>
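        <para>How such flags might be combined and checked can be sketched as
        below. The flag names come from the list above, but their numeric
        values and both helper functions are assumptions made for this
        illustration, not the kernel's definitions. The resize rule for device
        areas follows the note in this chapter that shrink/extend syscalls are
        prohibited on them.</para>

```c
#include <stdbool.h>

/* Assumed bit values; the text names the flags but not their encoding. */
enum {
    AS_AREA_READ   = 1 << 0,
    AS_AREA_WRITE  = 1 << 1,
    AS_AREA_EXEC   = 1 << 2,
    AS_AREA_DEVICE = 1 << 3
};

/* Hypothetical helper: true if every requested permission bit is granted. */
bool area_access_allowed(unsigned area_flags, unsigned requested)
{
    return (area_flags & requested) == requested;
}

/* Hypothetical helper: device areas may not be shrunk or extended. */
bool area_resizable(unsigned area_flags)
{
    return (area_flags & AS_AREA_DEVICE) == 0;
}
```

        <para>For example, an area created with <constant>AS_AREA_READ |
        AS_AREA_WRITE</constant> would pass a read check but fail an execute
        check.</para>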
81
 
109
 
82
        <para>- user threads can manipulate their address space
110
        <para>The kernel lets tasks create/expand/shrink/share areas of their
83
        (create/resize/share as_areas) by means of syscalls</para>
111
        address space via a set of syscalls.</para>
84
      </section>
112
      </section>
85
 
113
 
86
      <section>
114
      <section>
87
        <title>Address Space ID (ASID)</title>
115
        <title>Address Space ID (ASID)</title>
88
 
116
 
89
        <para>- some hardware can distinguish different address spaces from each
117
        <para>When switching to a different task, the kernel also needs to
90
        other (the goal is to maximize TLB utilization); it does so by associating
118
        switch mappings to a different address space. If the TLB cannot
91
        an address space identifier with each TLB/page table entry
119
        distinguish address space mappings, all mappings from the old address
92
        (ASID, RID, ppc32 ???). These ids have varying widths: from 8 bits
120
        space must be flushed, which creates unnecessary
93
        to 24 bits (how many does ppc32 have?)</para>
121
        overhead.</para>
94
 
122
 
95
        <para>- the kernel understands this and uses its own ASID abstraction (on ia64 it
123
        <para>To avoid this, some architectures are able to segregate
96
        is e.g. a number derived from the RID, on mips32 it is the ASID itself);
124
        different address spaces at the hardware level by introducing the ASID (address
-
 
125
        space ID). On those architectures each TLB record contains an address
97
        the existence of ASIDs is a necessary condition for using the _global_ page hash table
126
        space identifier that tells which address space the record
98
        mechanism.</para>
127
        applies to.</para>
99
 
128
 
100
        <para>- on all architectures there are far fewer ASIDs than the theoretical
129
        <para>The HelenOS kernel can take advantage of this hardware-supported
101
        number of concurrently running tasks ~ address spaces, so a
130
        identifier through an ASID abstraction that is connected to the
102
        mechanism is implemented that allows an ASID to be taken from one address space
131
        corresponding architecture identifier. For example, on ia64 the kernel ASID is
-
 
132
        built from the RID (region identifier) and on mips32 the kernel ASID is
103
        and assigned to another</para>
133
        actually the hardware identifier.</para>
104
 
134
 
105
        <para>- relationship task ~ address space: in theory,
135
        <para>Due to hardware limitations, the ASID has a limited length, from 8
-
 
136
        bits on mips32 to 24 bits on ia64, which makes it impossible to use as
106
        an address space could be shared by several tasks, but we do not use this possibility
137
        a unique address space identifier for all tasks running in the system.
107
        and it is not handled in any way. Consequently, each task has its own
138
        In such situations a special ASID stealing algorithm is used, which takes
108
        address space</para>
139
        an ASID from an inactive task and assigns it to the active task.</para>
109
      </section>
140
      </section>
110
    </section>
141
    </section>
111
 
142
 
112
    <section>
143
    <section>
113
      <title>Virtual address translation</title>
144
      <title>Virtual address translation</title>
114
 
145
 
115
      <section id="page_tables">
146
      <section id="page_tables">
116
        <title>Page tables</title>
147
        <title>Page tables</title>
117
 
148
 
118
        <para>The HelenOS kernel has two different approaches to the paging
149
        <para>The HelenOS kernel has two different approaches to the paging
119
        implementation: <emphasis>4-level page tables</emphasis> and
150
        implementation: <emphasis>4-level page tables</emphasis> and
120
        <emphasis>global hash tables</emphasis>, which are accessible via a
151
        <emphasis>global hash tables</emphasis>, which are accessible via a
121
        generic paging abstraction layer. This division was caused by the
152
        generic paging abstraction layer. This division was caused by the
122
        major architectural differences between the supported platforms.</para>
153
        major architectural differences between the supported platforms.</para>
123
 
154
 
124
        <formalpara>
155
        <formalpara>
125
          <title>4-level page tables</title>
156
          <title>4-level page tables</title>
126
 
157
 
127
          <para>4-level page tables are a generalization of the hardware
158
          <para>4-level page tables are a generalization of the hardware
128
          capabilities of certain platforms. <itemizedlist>
159
          capabilities of certain platforms. <itemizedlist>
129
              <listitem>
160
              <listitem>
130
                 ia32 uses 2-level page tables, with full hardware support.
161
                 ia32 uses 2-level page tables, with full hardware support.
131
              </listitem>
162
              </listitem>
132
 
163
 
133
              <listitem>
164
              <listitem>
134
                 amd64 uses 4-level page tables, also coming with full hardware support.
165
                 amd64 uses 4-level page tables, also coming with full hardware support.
135
              </listitem>
166
              </listitem>
136
 
167
 
137
              <listitem>
168
              <listitem>
138
                 mips and ppc32 use 2-level page tables that are simulated in software.
169
                 mips and ppc32 use 2-level page tables that are simulated in software.
139
              </listitem>
170
              </listitem>
140
            </itemizedlist></para>
171
            </itemizedlist></para>
141
        </formalpara>
172
        </formalpara>
142
 
173
 
143
        <formalpara>
174
        <formalpara>
144
          <title>Global hash tables</title>
175
          <title>Global hash tables</title>
145
 
176
 
146
          <para>- global page hash table: there is only one in the whole system
177
          <para>- global page hash table: there is only one in the whole system
147
          (it is used by ia64); note that ia64 has VHPT disabled for now. A
178
          (it is used by ia64); note that ia64 has VHPT disabled for now. A
148
          generic hash table with separate collision chains is used</para>
179
          generic hash table with separate collision chains is used. ASID support is
-
 
180
          required to use global hash tables.</para>
149
        </formalpara>
181
        </formalpara>
150
 
182
 
151
        <para>Thanks to the abstract paging interface, it remains possible
183
        <para>Thanks to the abstract paging interface, it remains possible
152
        to add more paging implementations, for example B-tree page
184
        to add more paging implementations, for example B-tree page
153
        tables.</para>
185
        tables.</para>
154
      </section>
186
      </section>
155
 
187
 
156
      <section id="tlb">
188
      <section id="tlb">
157
        <title>Translation Lookaside buffer</title>
189
        <title>Translation Lookaside Buffer</title>
158
 
190
 
159
        <para>- TLBs cache information stored in the page tables; alternatively,
191
        <para>- TLBs cache information stored in the page tables; alternatively,
160
        the page tables (or various hw extensions [e.g. VHPT, ppc32
192
        the page tables (or various hw extensions [e.g. VHPT, ppc32
161
        hw hash table]) can be viewed as large TLBs</para>
193
        hw hash table]) can be viewed as large TLBs</para>
162
 
194
 
163
        <para>- when a mapping is modified in or removed from the
195
        <para>- when a mapping is modified in or removed from the
164
        page tables, the TLB must be kept consistent with these
196
        page tables, the TLB must be kept consistent with these
165
        tables; this must be done on all CPUs; for this we have a simplified version of the TLB
197
        tables; this must be done on all CPUs; for this we have a simplified version of the TLB
166
        shootdown mechanism; it is a variation on the algorithm described in: D.
198
        shootdown mechanism; it is a variation on the algorithm described in: D.
167
        Black et al., "Translation Lookaside Buffer Consistency: A Software
199
        Black et al., "Translation Lookaside Buffer Consistency: A Software
168
        Approach," Proc. Third Int'l Conf. Architectural Support for
200
        Approach," Proc. Third Int'l Conf. Architectural Support for
169
        Programming Languages and Operating Systems, 1989, pp. 113-122.</para>
201
        Programming Languages and Operating Systems, 1989, pp. 113-122.</para>
170
 
202
 
171
        <para>- it should be noted that more lightweight versions of the TLB shootdown
203
        <para>- it should be noted that more lightweight versions of the TLB shootdown
172
        algorithm exist</para>
204
        algorithm exist</para>
173
      </section>
205
      </section>
174
    </section>
206
    </section>
-
 
207
 
-
 
208
    <section>
-
 
209
      <title>---</title>
-
 
210
 
-
 
211
      <para>At the moment HelenOS does not support swapping.</para>
-
 
212
 
-
 
213
      <para>- we use page faults to allocate frames on demand within an
-
 
214
      as_area - on architectures that support it, we support non-executable
-
 
215
      pages</para>
-
 
216
    </section>
175
  </section>
217
  </section>
176
 
218
 
177
  <!-- End of VM -->
219
  <!-- End of VM -->
178
 
220
 
179
  <section>
221
  <section>
180
    <!-- Phys mem -->
222
    <!-- Phys mem -->
181
 
223
 
182
    <title>Physical memory management</title>
224
    <title>Physical memory management</title>
183
 
225
 
184
    <section id="zones_and_frames">
226
    <section id="zones_and_frames">
185
      <title>Zones and frames</title>
227
      <title>Zones and frames</title>
186
 
228
 
187
      <para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>
229
      <para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>
188
 
230
 
189
      <para>On some architectures not all physical memory is available for
231
      <para>On some architectures not all physical memory is available for
190
      conventional use. This limitation requires the kernel to maintain a
232
      conventional use. This limitation requires the kernel to maintain a
191
      table of available and unavailable ranges of physical memory addresses.
233
      table of available and unavailable ranges of physical memory addresses.
192
      The main idea of zones is to create a memory zone entity, which is a
234
      The main idea of zones is to create a memory zone entity, which is a
193
      contiguous chunk of memory available for allocation. If some chunk is
235
      contiguous chunk of memory available for allocation. If some chunk is
194
      not available, we simply do not put it in any zone.</para>
236
      not available, we simply do not put it in any zone.</para>
195
 
237
 
196
      <para>A zone also serves informational purposes, containing
238
      <para>A zone also serves informational purposes, containing
197
      information about the number of free and busy frames. Physical memory
239
      information about the number of free and busy frames. Physical memory
198
      allocation is also done inside a particular zone. Allocation of zone
240
      allocation is also done inside a particular zone. Allocation of zone
199
      frames must be organized by the <link linkend="frame_allocator">frame
241
      frames must be organized by the <link linkend="frame_allocator">frame
200
      allocator</link> associated with the zone.</para>
242
      allocator</link> associated with the zone.</para>
201
 
243
 
202
      <para>Some architectures (mips32, ppc32) have only one zone that
244
      <para>Some architectures (mips32, ppc32) have only one zone that
203
      covers the whole physical memory, while others (like ia32) may have
245
      covers the whole physical memory, while others (like ia32) may have
204
      multiple zones. Information about zones on the current machine is stored in
246
      multiple zones. Information about zones on the current machine is stored in
205
      BIOS hardware tables or can be hardcoded into the kernel at compile
247
      BIOS hardware tables or can be hardcoded into the kernel at compile
206
      time.</para>
248
      time.</para>
207
    </section>
249
    </section>
208
 
250
 
209
    <section id="frame_allocator">
251
    <section id="frame_allocator">
210
      <title>Frame allocator</title>
252
      <title>Frame allocator</title>
211
 
253
 
212
      <para><mediaobject id="frame_alloc">
254
      <para><mediaobject id="frame_alloc">
213
          <imageobject role="html">
255
          <imageobject role="html">
214
            <imagedata fileref="images/frame_alloc.png" format="PNG" />
256
            <imagedata fileref="images/frame_alloc.png" format="PNG" />
215
          </imageobject>
257
          </imageobject>
216
 
258
 
217
          <imageobject role="fop">
259
          <imageobject role="fop">
218
            <imagedata fileref="images.vector/frame_alloc.svg" format="SVG" />
260
            <imagedata fileref="images.vector/frame_alloc.svg" format="SVG" />
219
          </imageobject>
261
          </imageobject>
220
        </mediaobject></para>
262
        </mediaobject></para>
221
 
263
 
222
      <formalpara>
264
      <formalpara>
223
        <title>Overview</title>
265
        <title>Overview</title>
224
 
266
 
225
        <para>The frame allocator provides physical memory allocation for the
267
        <para>The frame allocator provides physical memory allocation for the
226
        kernel. Because of the zonal organization of physical memory, the frame
268
        kernel. Because of the zonal organization of physical memory, the frame
227
        allocator always works in the context of some zone, so it cannot
269
        allocator always works in the context of some zone, so it cannot
228
        allocate a piece of memory that lies in more than one
270
        allocate a piece of memory that lies in more than one
229
        zone. This is not a limitation in practice, because two adjacent zones can be merged
271
        zone. This is not a limitation in practice, because two adjacent zones can be merged
230
        into one. The frame allocator is also responsible for updating
272
        into one. The frame allocator is also responsible for updating
231
        information on the number of free/busy frames in the zone. Physical memory
273
        information on the number of free/busy frames in the zone. Physical memory
232
        allocation inside one <link linkend="zones_and_frames">memory
274
        allocation inside one <link linkend="zones_and_frames">memory
233
        zone</link> is handled by an instance of the <link
275
        zone</link> is handled by an instance of the <link
234
        linkend="buddy_allocator">buddy allocator</link> tailored to allocate
276
        linkend="buddy_allocator">buddy allocator</link> tailored to allocate
235
        blocks of physical memory frames.</para>
277
        blocks of physical memory frames.</para>
236
      </formalpara>
278
      </formalpara>
237
 
279
 
238
      <formalpara>
280
      <formalpara>
239
        <title>Allocation / deallocation</title>
281
        <title>Allocation / deallocation</title>
240
 
282
 
241
        <para>Upon an allocation request, the frame allocator tries to find the first
283
        <para>Upon an allocation request, the frame allocator tries to find the first
242
        zone that can satisfy the incoming request (has the required number of
284
        zone that can satisfy the incoming request (has the required number of
243
        free frames to allocate). During deallocation, the frame allocator needs
285
        free frames to allocate). During deallocation, the frame allocator needs
244
        to find the zone that contains the deallocated frame. This approach could
286
        to find the zone that contains the deallocated frame. This approach could
245
        bring up two potential problems: <itemizedlist>
287
        bring up two potential problems: <itemizedlist>
246
            <listitem>
288
            <listitem>
247
               Linear search of zones does not do any good to performance, but the number of zones is not expected to be high. If it were, the list of zones could be replaced with a more time-efficient B-tree.
289
               Linear search of zones does not do any good to performance, but the number of zones is not expected to be high. If it were, the list of zones could be replaced with a more time-efficient B-tree.
248
            </listitem>
290
            </listitem>
249
 
291
 
250
            <listitem>
292
            <listitem>
251
               Quickly finding out whether a zone contains the required number of frames to allocate and whether this chunk of memory is properly aligned. This issue is solved by the buddy allocator.
293
               Quickly finding out whether a zone contains the required number of frames to allocate and whether this chunk of memory is properly aligned. This issue is solved by the buddy allocator.
252
            </listitem>
294
            </listitem>
253
          </itemizedlist></para>
295
          </itemizedlist></para>
254
      </formalpara>
296
      </formalpara>
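      <para>The two linear searches described above can be sketched as
      follows. The <type>zone_sketch_t</type> structure and both functions are
      hypothetical names invented for this illustration and do not reproduce
      the kernel's data structures. The free-frame count is only a first
      filter; whether a suitably sized and aligned block really exists is left
      to the buddy allocator of the chosen zone.</para>

```c
#include <stddef.h>

/* Hypothetical zone descriptor, reduced to what the searches need. */
typedef struct {
    size_t base;        /* number of the first frame in the zone */
    size_t count;       /* total number of frames in the zone */
    size_t free_count;  /* number of frames that are currently free */
} zone_sketch_t;

/* Allocation: first zone with at least `frames` free frames, or NULL. */
zone_sketch_t *find_free_zone(zone_sketch_t *zones, size_t n, size_t frames)
{
    for (size_t i = 0; i < n; i++) {
        if (zones[i].free_count >= frames)
            return &zones[i];
    }
    return NULL;
}

/* Deallocation: zone that contains the given frame, or NULL. */
zone_sketch_t *find_zone(zone_sketch_t *zones, size_t n, size_t frame)
{
    for (size_t i = 0; i < n; i++) {
        if (frame >= zones[i].base && frame - zones[i].base < zones[i].count)
            return &zones[i];
    }
    return NULL;
}
```

      <para>Both loops are linear in the number of zones, which matches the
      first concern in the list above.</para>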
255
    </section>
297
    </section>
256
 
298
 
257
    <section id="buddy_allocator">
299
    <section id="buddy_allocator">
258
      <title>Buddy allocator</title>
300
      <title>Buddy allocator</title>
259
 
301
 
260
      <section>
302
      <section>
261
        <title>Overview</title>
303
        <title>Overview</title>
262
 
304
 
263
        <para><mediaobject id="buddy_alloc">
305
        <para><mediaobject id="buddy_alloc">
264
            <imageobject role="html">
306
            <imageobject role="html">
265
              <imagedata fileref="images/buddy_alloc.png" format="PNG" />
307
              <imagedata fileref="images/buddy_alloc.png" format="PNG" />
266
            </imageobject>
308
            </imageobject>
267
 
309
 
268
            <imageobject role="fop">
310
            <imageobject role="fop">
269
              <imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" />
311
              <imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" />
270
            </imageobject>
312
            </imageobject>
271
          </mediaobject></para>
313
          </mediaobject></para>
272
 
314
 
273
        <para>In the buddy allocator, the memory is broken down into
315
        <para>In the buddy allocator, the memory is broken down into
274
        power-of-two sized naturally aligned blocks. These blocks are
316
        power-of-two sized naturally aligned blocks. These blocks are
275
        organized in an array of lists, in which the list with index i
317
        organized in an array of lists, in which the list with index i
276
        contains all unallocated blocks of size
318
        contains all unallocated blocks of size
277
        <mathphrase>2<superscript>i</superscript></mathphrase>. The index i is
319
        <mathphrase>2<superscript>i</superscript></mathphrase>. The index i is
278
        called the order of the block. Should there be two adjacent equally sized
320
        called the order of the block. Should there be two adjacent equally sized
279
        blocks in the list i<mathphrase> </mathphrase>(i.e. buddies), the
321
        blocks in the list i<mathphrase />(i.e. buddies), the buddy allocator
280
        buddy allocator would coalesce them and put the resulting block in
322
        would coalesce them and put the resulting block in list <mathphrase>i
281
        list <mathphrase>i + 1</mathphrase>, provided that the resulting block
323
        + 1</mathphrase>, provided that the resulting block would be naturally
282
        would be naturally aligned. Similarly, when the allocator is asked to
324
        aligned. Similarly, when the allocator is asked to allocate a block
283
        allocate a block of size
-
 
284
        <mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
325
        of size <mathphrase>2<superscript>i</superscript></mathphrase>, it
285
        to satisfy the request from the list with index i. If the request
326
        first tries to satisfy the request from the list with index i. If the
286
        cannot be satisfied (i.e. the list i is empty), the buddy allocator
327
        request cannot be satisfied (i.e. the list i is empty), the buddy
287
        will try to allocate and split a larger block from the list with index
328
        allocator will try to allocate and split a larger block from the list
288
        i + 1. Both of these algorithms are recursive. The recursion ends
329
        with index i + 1. Both of these algorithms are recursive. The
289
        either when there are no blocks to coalesce in the former case or when
330
        recursion ends either when there are no blocks to coalesce in the
290
        there are no blocks that can be split in the latter case.</para>
331
        former case or when there are no blocks that can be split in the
-
 
332
        latter case.</para>
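        <para>The arithmetic behind buddies and orders can be sketched as
        below. The sketch assumes that blocks are identified by the index of
        their first frame, counted from a suitably aligned start of the
        managed memory; the function names are hypothetical and the code is
        not the kernel's implementation.</para>

```c
#include <assert.h>
#include <stddef.h>

/* First frame of the buddy of the order-`order` block starting at `index`. */
size_t buddy_of(size_t index, unsigned order)
{
    /* Blocks are naturally aligned: the start is a multiple of the size. */
    assert(index % ((size_t)1 << order) == 0);
    return index ^ ((size_t)1 << order);
}

/*
 * First frame of the block of order + 1 created by coalescing the block
 * at `index` with its buddy. Clearing the order bit selects the lower of
 * the two, so the result is naturally aligned for the bigger block.
 */
size_t coalesced_start(size_t index, unsigned order)
{
    return index & ~((size_t)1 << order);
}

/* Smallest order whose block holds at least `frames` frames. */
unsigned order_for(size_t frames)
{
    unsigned order = 0;

    while (((size_t)1 << order) < frames)
        order++;
    return order;
}
```

        <para>Because a request is rounded up to the next power of two, a
        request for 5 frames would be served from an order 3 block of 8
        frames, which is the source of the internal fragmentation discussed
        in this section.</para>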
291
 
333
 
292
        <!--graphic fileref="images/mm1.png" format="EPS" /-->
334
        <!--graphic fileref="images/mm1.png" format="EPS" /-->
293
 
335
 
294
        <para>This approach greatly reduces external fragmentation of memory
336
        <para>This approach greatly reduces external fragmentation of memory
295
        and helps in allocating bigger continuous blocks of memory aligned to
337
        and helps in allocating bigger continuous blocks of memory aligned to
296
        their size. On the other hand, the buddy allocator suffers from increased
338
        their size. On the other hand, the buddy allocator suffers from increased
297
        internal fragmentation of memory and is not suitable for general
339
        internal fragmentation of memory and is not suitable for general
298
        kernel allocations. This purpose is better addressed by the <link
340
        kernel allocations. This purpose is better addressed by the <link
299
        linkend="slab">slab allocator</link>.</para>
341
        linkend="slab">slab allocator</link>.</para>
300
      </section>
342
      </section>
301
 
343
 
302
      <section>
344
      <section>
303
        <title>Implementation</title>
345
        <title>Implementation</title>
304
 
346
 
305
        <para>The buddy allocator is, in fact, an abstract framework which can
347
        <para>The buddy allocator is, in fact, an abstract framework which can
306
        be easily specialized to serve one particular task. It knows nothing
348
        be easily specialized to serve one particular task. It knows nothing
307
        about the nature of memory it helps to allocate. In order to beat the
349
        about the nature of memory it helps to allocate. In order to beat the
308
        lack of this knowledge, the buddy allocator exports an interface that
350
        lack of this knowledge, the buddy allocator exports an interface that
309
        each of its clients is required to implement. When supplied with an
351
        each of its clients is required to implement. When supplied with an
310
        implementation of this interface, the buddy allocator can use
352
        implementation of this interface, the buddy allocator can use
311
        specialized external functions to find a buddy for a block, split and
353
        specialized external functions to find a buddy for a block, split and
312
        coalesce blocks, manipulate block order and mark blocks busy or
354
        coalesce blocks, manipulate block order and mark blocks busy or
313
        available. For precise documentation of this interface, refer to
355
        available. For precise documentation of this interface, refer to
314
        <emphasis>"HelenOS Generic Kernel Reference Manual"</emphasis>.</para>
356
        <emphasis>"HelenOS Generic Kernel Reference Manual"</emphasis>.</para>
315
 
357
 
316
        <formalpara>
358
        <formalpara>
317
          <title>Data organization</title>
359
          <title>Data organization</title>
318
 
360
 
319
          <para>Each entity allocable by the buddy allocator is required to
361
          <para>Each entity allocable by the buddy allocator is required to
320
          contain space for storing block order number and a link variable
362
          contain space for storing block order number and a link variable
321
          used to interconnect blocks within the same order.</para>
363
          used to interconnect blocks within the same order.</para>
322
 
364
 
323
          <para>Whatever entities are allocated by the buddy allocator, the
365
          <para>Whatever entities are allocated by the buddy allocator, the
324
          first entity within a block is used to represent the entire block.
366
          first entity within a block is used to represent the entire block.
325
          The first entity keeps the order of the whole block. Other entities
367
          The first entity keeps the order of the whole block. Other entities
326
          within the block are assigned the magic value
368
          within the block are assigned the magic value
327
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially important
369
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially important
328
          for effective identification of buddies in a one-dimensional array
370
          for effective identification of buddies in a one-dimensional array
329
          because the entity that represents a potential buddy cannot be
371
          because the entity that represents a potential buddy cannot be
330
          associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
372
          associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
331
          is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
373
          is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
332
          not a buddy).</para>
374
          not a buddy).</para>
333
 
375
 
334
          <para>The buddy allocator always uses the first frame to represent
376
          <para>The buddy allocator always uses the first frame to represent
335
          the frame block. This frame contains <varname>buddy_order</varname>
377
          the frame block. This frame contains <varname>buddy_order</varname>
336
          variable to provide information about the block size it actually
378
          variable to provide information about the block size it actually
337
          represents (
379
          represents (
338
          <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
380
          <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
339
          frames block). Other frames in block have this value set to magic
381
          frames block). Other frames in block have this value set to magic
340
          <constant>BUDDY_INNER_BLOCK</constant> that is much greater than
382
          <constant>BUDDY_INNER_BLOCK</constant> that is much greater than
341
          buddy <varname>max_order</varname> value.</para>
383
          buddy <varname>max_order</varname> value.</para>
342
 
384
 
343
          <para>Each <varname>frame_t</varname> also contains pointer member
385
          <para>Each <varname>frame_t</varname> also contains pointer member
344
          to hold frame structure in the linked list inside one order.</para>
386
          to hold frame structure in the linked list inside one order.</para>
345
        </formalpara>
387
        </formalpara>
346
 
388
 
347
        <formalpara>
389
        <formalpara>
348
          <title>Allocation algorithm</title>
390
          <title>Allocation algorithm</title>
349
 
391
 
350
          <para>Upon a request for a <mathphrase>2<superscript>i</superscript></mathphrase>
392
          <para>Upon a request for a <mathphrase>2<superscript>i</superscript></mathphrase>
351
          frame block, the allocator checks whether there are any
393
          frame block, the allocator checks whether there are any
352
          blocks available in the order list <varname>i</varname>. If so, it
394
          blocks available in the order list <varname>i</varname>. If so, it
353
          removes a block from the order list and returns its address. If not, it
395
          removes a block from the order list and returns its address. If not, it
354
          recursively allocates a
396
          recursively allocates a
355
          <mathphrase>2<superscript>i+1</superscript></mathphrase> frame
397
          <mathphrase>2<superscript>i+1</superscript></mathphrase> frame
356
          block and splits it into two
398
          block and splits it into two
357
          <mathphrase>2<superscript>i</superscript></mathphrase> frame blocks.
399
          <mathphrase>2<superscript>i</superscript></mathphrase> frame blocks.
358
          It then adds one of the blocks to the order <varname>i</varname> list
400
          It then adds one of the blocks to the order <varname>i</varname> list
359
          and returns the address of the other.</para>
401
          and returns the address of the other.</para>
360
        </formalpara>
402
        </formalpara>
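        <para>The allocation algorithm above can be sketched as follows. This
        is a minimal illustration, not the HelenOS implementation: the names
        toy_frame_t, toy_init, toy_alloc and the constants MAX_ORDER and
        NO_BLOCK are hypothetical, frames are addressed by array index
        instead of by pointer, and a single zone of 16 frames is
        assumed.</para>

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 4                       /* assumption: one zone of 2^4 frames */
#define NFRAMES ((size_t) 1 << MAX_ORDER)
#define BUDDY_INNER_BLOCK 0xff            /* magic value, much greater than MAX_ORDER */
#define NO_BLOCK ((size_t) -1)            /* hypothetical "no block" marker */

typedef struct {
    unsigned char buddy_order;            /* block order, or BUDDY_INNER_BLOCK */
    size_t next;                          /* link to the next block of the same order */
} toy_frame_t;

static toy_frame_t frames[NFRAMES];
static size_t order_list[MAX_ORDER + 1];  /* head of the free list for each order */

/* Start with one free block of the maximum order covering the whole zone. */
static void toy_init(void)
{
    for (size_t i = 0; i <= MAX_ORDER; i++)
        order_list[i] = NO_BLOCK;
    for (size_t i = 0; i < NFRAMES; i++) {
        frames[i].buddy_order = BUDDY_INNER_BLOCK;
        frames[i].next = NO_BLOCK;
    }
    frames[0].buddy_order = MAX_ORDER;
    order_list[MAX_ORDER] = 0;
}

/* Allocate a block of 2^i frames; returns the index of its first frame. */
static size_t toy_alloc(unsigned i)
{
    if (i > MAX_ORDER)
        return NO_BLOCK;

    /* A block of the requested order is available: unlink and return it. */
    if (order_list[i] != NO_BLOCK) {
        size_t idx = order_list[i];
        order_list[i] = frames[idx].next;
        frames[idx].next = NO_BLOCK;
        return idx;
    }

    /* Otherwise recursively allocate a block of order i+1 and split it. */
    size_t big = toy_alloc(i + 1);
    if (big == NO_BLOCK)
        return NO_BLOCK;

    size_t right = big + ((size_t) 1 << i);
    frames[big].buddy_order = (unsigned char) i;

    /* One half goes to the order i list, the other half is returned. */
    frames[right].buddy_order = (unsigned char) i;
    frames[right].next = order_list[i];
    order_list[i] = right;
    return big;
}
```

        <para>In this sketch, the first request for a single frame splits the
        initial block four times, leaving one free block in each of the lower
        order lists.</para>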
361
 
403
 
362
        <formalpara>
404
        <formalpara>
363
          <title>Deallocation algorithm</title>
405
          <title>Deallocation algorithm</title>
364
 
406
 
365
          <para>Check if block has so called buddy (another free
407
          <para>Check if block has so called buddy (another free
366
          <mathphrase>2<superscript>i</superscript></mathphrase> frame block
408
          <mathphrase>2<superscript>i</superscript></mathphrase> frame block
367
          that can be linked with freed block into the
409
          that can be linked with freed block into the
368
          <mathphrase>2<superscript>i+1</superscript></mathphrase> block).
410
          <mathphrase>2<superscript>i+1</superscript></mathphrase> block).
369
          Technically, the buddy is the odd/even block for an even/odd block,
411
          Technically, the buddy is the odd/even block for an even/odd block,
370
          respectively. In addition, we can require that the resulting
412
          respectively. In addition, we can require that the resulting
371
          block must be aligned to its size. This requirement guarantees
413
          block must be aligned to its size. This requirement guarantees
372
          natural block alignment for the blocks coming out of the allocation
414
          natural block alignment for the blocks coming out of the allocation
373
          system.</para>
415
          system.</para>
374
 
416
 
375
          <para>Using direct pointer arithmetic and the
417
          <para>Using direct pointer arithmetic and the
376
          <varname>frame_t::ref_count</varname> and
418
          <varname>frame_t::ref_count</varname> and
377
          <varname>frame_t::buddy_order</varname> variables, finding the buddy is
419
          <varname>frame_t::buddy_order</varname> variables, finding the buddy is
378
          done in constant time.</para>
420
          done in constant time.</para>
379
        </formalpara>
421
        </formalpara>
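        <para>A constant-time buddy lookup along these lines might look like
        the sketch below. It assumes that frames live in a one-dimensional
        array and that every block is aligned to its size, so the buddy index
        differs from the block index in exactly one bit. The type toy_frame_t
        and the functions buddy_index and find_buddy are hypothetical; only
        the field names ref_count and buddy_order come from the text
        above.</para>

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BUDDY_INNER_BLOCK 0xff   /* magic value marking non-first frames of a block */

typedef struct {
    unsigned ref_count;          /* 0 means the frame is free */
    unsigned char buddy_order;   /* block order, or BUDDY_INNER_BLOCK */
} toy_frame_t;

/* Index of the buddy of the size-aligned block that starts at idx. */
static size_t buddy_index(size_t idx, unsigned order)
{
    return idx ^ ((size_t) 1 << order);
}

/*
 * Return the buddy of block if the two can be coalesced, NULL otherwise.
 * Only pointer arithmetic and two field reads are needed, so the lookup
 * takes constant time.
 */
static toy_frame_t *find_buddy(toy_frame_t *base, size_t nframes,
    toy_frame_t *block)
{
    size_t idx = (size_t) (block - base);
    unsigned order = block->buddy_order;

    /* Guards against BUDDY_INNER_BLOCK and other out-of-range orders. */
    if (order >= sizeof(size_t) * CHAR_BIT)
        return NULL;

    size_t bidx = buddy_index(idx, order);
    if (bidx >= nframes)
        return NULL;

    toy_frame_t *buddy = &base[bidx];

    /* A frame marked BUDDY_INNER_BLOCK never matches a valid order. */
    if (buddy->buddy_order != order || buddy->ref_count != 0)
        return NULL;

    return buddy;
}
```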
380
      </section>
422
      </section>
381
    </section>
423
    </section>
382
 
424
 
383
    <section id="slab">
425
    <section id="slab">
384
      <title>Slab allocator</title>
426
      <title>Slab allocator</title>
385
 
427
 
386
      <section>
428
      <section>
387
        <title>Overview</title>
429
        <title>Overview</title>
388
 
430
 
389
        <para><termdef><glossterm>Slab</glossterm> represents a contiguous
431
        <para><termdef><glossterm>Slab</glossterm> represents a contiguous
390
        piece of memory, usually made of several physically contiguous
432
        piece of memory, usually made of several physically contiguous
391
        pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists
433
        pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists
392
        of one or more slabs.</termdef></para>
434
        of one or more slabs.</termdef></para>
393
 
435
 
394
        <para>The majority of memory allocation requests in the kernel are for
436
        <para>The majority of memory allocation requests in the kernel are for
395
        small, frequently used data structures. For this purpose the slab
437
        small, frequently used data structures. For this purpose the slab
396
        allocator is a perfect solution. The basic idea behind the slab
438
        allocator is a perfect solution. The basic idea behind the slab
397
        allocator is to have lists of commonly used objects available packed
439
        allocator is to have lists of commonly used objects available packed
398
        into pages. This avoids the overhead of allocating and destroying
440
        into pages. This avoids the overhead of allocating and destroying
399
        commonly used types of objects such as threads, virtual memory structures,
441
        commonly used types of objects such as threads, virtual memory structures,
400
        etc. Also, because the allocated size matches the object size exactly, slab allocation
442
        etc. Also, because the allocated size matches the object size exactly, slab allocation
401
        completely eliminates the internal fragmentation issue.</para>
443
        completely eliminates the internal fragmentation issue.</para>
402
      </section>
444
      </section>
403
 
445
 
404
      <section>
446
      <section>
405
        <title>Implementation</title>
447
        <title>Implementation</title>
406
 
448
 
407
        <para><mediaobject id="slab_alloc">
449
        <para><mediaobject id="slab_alloc">
408
            <imageobject role="html">
450
            <imageobject role="html">
409
              <imagedata fileref="images/slab_alloc.png" format="PNG" />
451
              <imagedata fileref="images/slab_alloc.png" format="PNG" />
410
            </imageobject>
452
            </imageobject>
411
 
453
 
412
            <imageobject role="fop">
454
            <imageobject role="fop">
413
              <imagedata fileref="images.vector/slab_alloc.svg" format="SVG" />
455
              <imagedata fileref="images.vector/slab_alloc.svg" format="SVG" />
414
            </imageobject>
456
            </imageobject>
415
          </mediaobject></para>
457
          </mediaobject></para>
416
 
458
 
417
        <para>The SLAB allocator is closely modelled after <ulink
459
        <para>The SLAB allocator is closely modelled after <ulink
418
        url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
460
        url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
419
        OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams </ulink>
461
        OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams </ulink>
420
        with the following exceptions: <itemizedlist>
462
        with the following exceptions: <itemizedlist>
421
            <listitem>
463
            <listitem>
422
               empty SLABS are deallocated immediately (in Linux they are kept in linked list, in Solaris ???)
464
               empty SLABS are deallocated immediately (in Linux they are kept in linked list, in Solaris ???)
423
            </listitem>
465
            </listitem>
424
 
466
 
425
            <listitem>
467
            <listitem>
426
               empty magazines are deallocated when not needed (in Solaris they are held in linked list in slab cache)
468
               empty magazines are deallocated when not needed (in Solaris they are held in linked list in slab cache)
427
            </listitem>
469
            </listitem>
428
          </itemizedlist> The following features are not currently supported but
470
          </itemizedlist> The following features are not currently supported but
429
        would be easy to add: <itemizedlist>
471
        would be easy to add: <itemizedlist>
430
            <listitem>
472
            <listitem>
431
               cache coloring
473
               cache coloring
432
            </listitem>
474
            </listitem>
433
 
475
 
434
            <listitem>
476
            <listitem>
435
               dynamic magazine growth (different magazine sizes are already supported, but we would need to adjust the allocation strategy)
477
               dynamic magazine growth (different magazine sizes are already supported, but we would need to adjust the allocation strategy)
436
            </listitem>
478
            </listitem>
437
          </itemizedlist></para>
479
          </itemizedlist></para>
438
 
480
 
439
        <section>
481
        <section>
440
          <title>Magazine layer</title>
482
          <title>Magazine layer</title>
441
 
483
 
442
          <para>Due to the severe bottleneck on SMP architectures caused by the
484
          <para>Due to the severe bottleneck on SMP architectures caused by the
443
          global SLAB locking mechanism, which serializes the processing of all slab
485
          global SLAB locking mechanism, which serializes the processing of all slab
444
          allocation requests serialized, a new layer was introduced to the
486
          allocation requests, a new layer was introduced to the
445
          classic slab allocator design. The slab allocator was extended to
487
          classic slab allocator design. The slab allocator was extended to
446
          support per-CPU caches called 'magazines' to achieve good SMP scaling.
488
          support per-CPU caches called 'magazines' to achieve good SMP scaling.
447
          <termdef>The slab SMP performance bottleneck was resolved by introducing
489
          <termdef>The slab SMP performance bottleneck was resolved by introducing
448
          a per-CPU caching scheme called the <glossterm>magazine
490
          a per-CPU caching scheme called the <glossterm>magazine
449
          layer</glossterm></termdef>.</para>
491
          layer</glossterm></termdef>.</para>
450
 
492
 
451
          <para>A magazine is an N-element cache of objects, so each magazine can
493
          <para>A magazine is an N-element cache of objects, so each magazine can
452
          satisfy N allocations. A magazine behaves like an automatic weapon
494
          satisfy N allocations. A magazine behaves like an automatic weapon
453
          magazine (LIFO, stack), so allocation and deallocation become simple
495
          magazine (LIFO, stack), so allocation and deallocation become simple
454
          push/pop pointer operations. The trick is that a CPU does not access global
496
          push/pop pointer operations. The trick is that a CPU does not access global
455
          slab allocator data during the allocation from its magazine, thus
497
          slab allocator data during the allocation from its magazine, thus
456
          making parallel allocations on different CPUs possible.</para>
498
          making parallel allocations on different CPUs possible.</para>
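          <para>The LIFO behaviour of a magazine can be illustrated with a
          small stack of object pointers. This is only a sketch: the type
          toy_magazine_t, the functions mag_push and mag_pop and the capacity
          MAG_SIZE are hypothetical names chosen for the example, and no
          locking is shown because a magazine is assumed to be touched only by
          its own CPU.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAG_SIZE 4               /* assumption: N = 4 objects per magazine */

typedef struct {
    size_t busy;                 /* number of objects currently cached */
    void *objs[MAG_SIZE];        /* cached object pointers, used as a stack */
} toy_magazine_t;

/* Deallocation path: cache the object; fails when the magazine is full. */
static bool mag_push(toy_magazine_t *mag, void *obj)
{
    if (mag->busy == MAG_SIZE)
        return false;
    mag->objs[mag->busy++] = obj;
    return true;
}

/* Allocation path: hand out the most recently cached object, if any. */
static void *mag_pop(toy_magazine_t *mag)
{
    if (mag->busy == 0)
        return NULL;
    return mag->objs[--mag->busy];
}
```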
457
 
499
 
458
          <para>The implementation also requires another feature: the
500
          <para>The implementation also requires another feature: the
459
          CPU-bound magazine is actually a pair of magazines, which avoids
501
          CPU-bound magazine is actually a pair of magazines, which avoids
460
          thrashing during allocation/deallocation of 1 item at the
502
          thrashing during allocation/deallocation of 1 item at the
461
          magazine size boundary. LIFO order is enforced, which should avoid
503
          magazine size boundary. LIFO order is enforced, which should avoid
462
          fragmentation as much as possible.</para>
504
          fragmentation as much as possible.</para>
463
 
505
 
464
          <para>Another important entity of magazine layer is a full magazine
506
          <para>Another important entity of magazine layer is the common full
-
 
507
          magazine list (also called a depot), that stores full magazines that
465
          depot, that stores full magazines which are used by any of the CPU
508
          may be used by any of the CPU magazine caches to reload active CPU
466
          magazine caches to reload active CPU magazine. Magazine depot can be
509
          magazine. This list of magazines can be pre-filled with full
467
          pre-filled with full magazines during initialization, but in current
510
          magazines during initialization, but in current implementation it is
468
          implementation it is filled during object deallocation, when CPU
511
          filled during object deallocation, when CPU magazine becomes
469
          magazine becomes full.</para>
512
          full.</para>
470
 
513
 
471
          <para>Slab allocator control structures are allocated from special
514
          <para>Slab allocator control structures are allocated from special
472
          slabs that are marked by a special flag, indicating that they should
515
          slabs that are marked by a special flag, indicating that they should
473
          not be used for the slab magazine layer. This is done to avoid possible
516
          not be used for the slab magazine layer. This is done to avoid possible
474
          infinite recursion and deadlock during conventional slab allocation
517
          infinite recursion and deadlock during conventional slab allocation
475
          requests.</para>
518
          requests.</para>
476
        </section>
519
        </section>
477
 
520
 
478
        <section>
521
        <section>
479
          <title>Allocation/deallocation</title>
522
          <title>Allocation/deallocation</title>
480
 
523
 
481
          <para>Every cache contains a list of full slabs and a list of partially
524
          <para>Every cache contains a list of full slabs and a list of partially
482
          full slabs. Empty slabs are immediately freed (thrashing will be
525
          full slabs. Empty slabs are immediately freed (thrashing will be
483
          avoided because of magazines).</para>
526
          avoided because of magazines).</para>
484
 
527
 
485
          <para>The SLAB allocator allocates lots of space and does not free
528
          <para>The SLAB allocator allocates lots of space and does not free
486
          it. When frame allocator fails to allocate the frame, it calls
529
          it. When frame allocator fails to allocate the frame, it calls
487
          slab_reclaim(). It tries 'light reclaim' first, then brutal reclaim.
530
          slab_reclaim(). It tries 'light reclaim' first, then brutal reclaim.
488
          The light reclaim releases slabs from cpu-shared magazine-list,
531
          The light reclaim releases slabs from cpu-shared magazine-list,
489
          until at least 1 slab is deallocated in each cache (this algorithm
532
          until at least 1 slab is deallocated in each cache (this algorithm
490
          should probably change). The brutal reclaim removes all cached
533
          should probably change). The brutal reclaim removes all cached
491
          objects, even from CPU-bound magazines.</para>
534
          objects, even from CPU-bound magazines.</para>
492
 
535
 
493
          <formalpara>
536
          <formalpara>
494
            <title>Allocation</title>
537
            <title>Allocation</title>
495
 
538
 
496
            <para><emphasis>Step 1.</emphasis> When it comes to the allocation
539
            <para><emphasis>Step 1.</emphasis> When it comes to the allocation
497
            request, slab allocator first of all checks availability of memory
540
            request, slab allocator first of all checks availability of memory
498
            in local CPU-bound magazine. If it is there, we would just "pop"
541
            in local CPU-bound magazine. If it is there, we would just "pop"
499
            the CPU magazine and return the pointer to object.</para>
542
            the CPU magazine and return the pointer to object.</para>
500
 
543
 
501
            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
544
            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
502
            empty, the allocator will attempt to reload the magazine by swapping it with
545
            empty, the allocator will attempt to reload the magazine by swapping it with
503
            the second CPU magazine and then return to the first step.</para>
546
            the second CPU magazine and then return to the first step.</para>
504
 
547
 
505
            <para><emphasis>Step 3.</emphasis> Now we are in the situation
548
            <para><emphasis>Step 3.</emphasis> Now we are in the situation
506
            when both CPU-bound magazines are empty, which makes the allocator
549
            when both CPU-bound magazines are empty, which makes the allocator
507
            access the shared full-magazine depot to reload the CPU-bound magazines.
550
            access the shared full-magazine depot to reload the CPU-bound magazines.
508
            If the reload is successful (meaning there are full magazines in the depot),
551
            If the reload is successful (meaning there are full magazines in the depot),
509
            the algorithm continues at Step 1.</para>
552
            the algorithm continues at Step 1.</para>
510
 
553
 
511
            <para><emphasis>Step 4.</emphasis> Final step of the allocation.
554
            <para><emphasis>Step 4.</emphasis> Final step of the allocation.
512
            In this step object is allocated from the conventional slab layer
555
            In this step object is allocated from the conventional slab layer
513
            and pointer is returned.</para>
556
            and pointer is returned.</para>
514
          </formalpara>
557
          </formalpara>
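          <para>The four allocation steps can be sketched as a single loop.
          This is a simplified illustration under several assumptions: all
          type and function names (toy_magazine_t, toy_cpu_cache_t,
          toy_depot_t, slab_layer_alloc, toy_slab_alloc) are hypothetical,
          locking of the depot is omitted, the conventional slab layer is
          replaced by a trivial static pool, and the empty magazine displaced
          in Step 3 is simply dropped.</para>

```c
#include <assert.h>
#include <stddef.h>

#define MAG_SIZE 2               /* assumption: tiny magazines for illustration */
#define DEPOT_CAP 4

typedef struct {
    size_t busy;                 /* number of cached objects */
    void *objs[MAG_SIZE];
} toy_magazine_t;

typedef struct {
    toy_magazine_t *current;     /* magazine used for push/pop */
    toy_magazine_t *last;        /* second magazine of the CPU-bound pair */
} toy_cpu_cache_t;

typedef struct {
    toy_magazine_t *full[DEPOT_CAP];   /* shared list of full magazines */
    size_t count;
} toy_depot_t;

/* Stand-in for the conventional slab layer: hands out static buffers. */
static char slab_pool[16][8];
static size_t slab_next;

static void *slab_layer_alloc(void)
{
    return slab_next < 16 ? (void *) slab_pool[slab_next++] : NULL;
}

static void *toy_slab_alloc(toy_cpu_cache_t *cpu, toy_depot_t *depot)
{
    for (;;) {
        /* Step 1: pop from the local CPU-bound magazine. */
        if (cpu->current != NULL && cpu->current->busy > 0)
            return cpu->current->objs[--cpu->current->busy];

        /* Step 2: swap with the second CPU magazine and retry. */
        if (cpu->last != NULL && cpu->last->busy > 0) {
            toy_magazine_t *tmp = cpu->current;
            cpu->current = cpu->last;
            cpu->last = tmp;
            continue;
        }

        /* Step 3: both are empty, reload from the shared depot and retry. */
        if (depot->count > 0) {
            cpu->last = cpu->current;   /* the old 'last' is dropped here */
            cpu->current = depot->full[--depot->count];
            continue;
        }

        /* Step 4: fall back to the conventional slab layer. */
        return slab_layer_alloc();
    }
}
```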
515
 
558
 
516
          <formalpara>
559
          <formalpara>
517
            <title>Deallocation</title>
560
            <title>Deallocation</title>
518
 
561
 
519
            <para><emphasis>Step 1.</emphasis> During deallocation request,
562
            <para><emphasis>Step 1.</emphasis> During deallocation request,
520
            slab allocator will check if the local CPU-bound magazine is not
563
            slab allocator will check if the local CPU-bound magazine is not
521
            full. In this case we will just push the pointer to this
564
            full. In this case we will just push the pointer to this
522
            magazine.</para>
565
            magazine.</para>
523
 
566
 
524
            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
567
            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
525
            full, the allocator will attempt to reload the magazine by swapping it with
568
            full, the allocator will attempt to reload the magazine by swapping it with
526
            the second CPU magazine and then return to the first step.</para>
569
            the second CPU magazine and then return to the first step.</para>
527
 
570
 
528
            <para><emphasis>Step 3.</emphasis> Now we are in the situation
571
            <para><emphasis>Step 3.</emphasis> Now we are in the situation
529
            when both CPU-bound magazines are full, which makes the allocator
572
            when both CPU-bound magazines are full, which makes the allocator
530
            access the shared full-magazine depot, put one of the magazines into
573
            access the shared full-magazine depot, put one of the magazines into
531
            the depot and create a new empty magazine. The algorithm continues at
574
            the depot and create a new empty magazine. The algorithm continues at
532
            Step 1.</para>
575
            Step 1.</para>
533
          </formalpara>
576
          </formalpara>
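          <para>The deallocation steps mirror the allocation path and can be
          sketched in the same way. As before, the names (toy_magazine_t,
          toy_cpu_cache_t, toy_depot_t, toy_slab_free) are hypothetical, depot
          locking is omitted, and new magazines come from calloc() rather than
          from the special control-structure slabs described earlier. The
          return value -1 is an assumption of this sketch meaning that the
          object could not be cached and would have to go back to the
          conventional slab layer.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAG_SIZE 2               /* assumption: tiny magazines for illustration */
#define DEPOT_CAP 4

typedef struct {
    size_t busy;                 /* number of cached objects */
    void *objs[MAG_SIZE];
} toy_magazine_t;

typedef struct {
    toy_magazine_t *current;     /* magazine used for push/pop */
    toy_magazine_t *last;        /* second magazine of the CPU-bound pair */
} toy_cpu_cache_t;

typedef struct {
    toy_magazine_t *full[DEPOT_CAP];   /* shared list of full magazines */
    size_t count;
} toy_depot_t;

/* Returns 0 when the object was cached in a magazine, -1 otherwise. */
static int toy_slab_free(toy_cpu_cache_t *cpu, toy_depot_t *depot, void *obj)
{
    for (;;) {
        /* Step 1: push to the local CPU-bound magazine if it is not full. */
        if (cpu->current != NULL && cpu->current->busy < MAG_SIZE) {
            cpu->current->objs[cpu->current->busy++] = obj;
            return 0;
        }

        /* Step 2: swap with the second CPU magazine and retry. */
        if (cpu->last != NULL && cpu->last->busy < MAG_SIZE) {
            toy_magazine_t *tmp = cpu->current;
            cpu->current = cpu->last;
            cpu->last = tmp;
            continue;
        }

        /* Step 3: both are full (or missing); move one magazine to the
         * depot, create a new empty magazine and retry. */
        if (cpu->last != NULL && depot->count == DEPOT_CAP)
            return -1;

        toy_magazine_t *fresh = calloc(1, sizeof(*fresh));
        if (fresh == NULL)
            return -1;

        if (cpu->last != NULL)
            depot->full[depot->count++] = cpu->last;
        cpu->last = cpu->current;
        cpu->current = fresh;
    }
}
```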
534
        </section>
577
        </section>
535
      </section>
578
      </section>
536
    </section>
579
    </section>
537
 
580
 
538
    <!-- End of Physmem -->
581
    <!-- End of Physmem -->
539
  </section>
582
  </section>
540
 
583
 
541
  <section>
584
  <section>
542
    <title>Memory sharing</title>
585
    <title>Memory sharing</title>
543
 
586
 
544
    <para>Not implemented yet(?)</para>
587
    <para>Not implemented yet(?)</para>
545
  </section>
588
  </section>
546
</chapter>
589
</chapter>