Subversion Repositories HelenOS-doc

Rev 185 Rev 186
1
<?xml version="1.0" encoding="UTF-8"?>
1
<?xml version="1.0" encoding="UTF-8"?>
2
<chapter id="mm">
2
<chapter id="mm">
3
  <?dbhtml filename="mm.html"?>
3
  <?dbhtml filename="mm.html"?>
4
 
4
 
5
  <title>Memory management</title>
5
  <title>Memory management</title>
6
 
6
 
7
  <para>In previous chapters, this book described the scheduling subsystem as
7
  <para>In previous chapters, this book described the scheduling subsystem as
8
  the creator of the impression that threads execute in parallel. The memory
8
  the creator of the impression that threads execute in parallel. The memory
9
  management subsystem, on the other hand, creates the impression that there
9
  management subsystem, on the other hand, creates the impression that there
10
  is enough physical memory for the kernel and that userspace tasks have the
10
  is enough physical memory for the kernel and that userspace tasks have the
11
  entire address space only for themselves.</para>
11
  entire address space only for themselves.</para>
12
 
12
 
13
  <section>
13
  <section>
14
    <title>Physical memory management</title>
14
    <title>Physical memory management</title>
15
 
15
 
16
    <section id="zones_and_frames">
16
    <section id="zones_and_frames">
17
      <title>Zones and frames</title>
17
      <title>Zones and frames</title>
18
 
18
 
19
      <para>HelenOS represents continuous areas of physical memory in
19
      <para>HelenOS represents continuous areas of physical memory in
20
      structures called frame zones (abbreviated as zones). Each zone contains
20
      structures called frame zones (abbreviated as zones). Each zone contains
21
      information about the number of allocated and unallocated physical
21
      information about the number of allocated and unallocated physical
22
      memory frames as well as the physical base address of the zone and
22
      memory frames as well as the physical base address of the zone and
23
      number of frames contained in it. A zone also contains an array of frame
23
      number of frames contained in it. A zone also contains an array of frame
24
      structures describing each frame of the zone. Last but not least,
24
      structures describing each frame of the zone. Last but not least,
25
      each zone is equipped with a buddy system
25
      each zone is equipped with a buddy system
26
      that facilitates efficient allocation of power-of-two sized blocks of
26
      that facilitates efficient allocation of power-of-two sized blocks of
27
      frames.</para>
27
      frames.</para>
28
 
28
 
29
      <para>This organization of physical memory provides good preconditions
29
      <para>This organization of physical memory provides good preconditions
30
      for hot-plugging of more zones. There is also one currently unused zone
30
      for hot-plugging of more zones. There is also one currently unused zone
31
      attribute: <code>flags</code>. The attribute could be used to give a
31
      attribute: <code>flags</code>. The attribute could be used to give a
32
      special meaning to some zones in the future.</para>
32
      special meaning to some zones in the future.</para>
33
 
33
 
34
      <para>The zones are linked in a doubly-linked list. This might seem a
34
      <para>The zones are linked in a doubly-linked list. This might seem a
35
      bit inefficient because the zone list is walked every time a frame is
35
      bit inefficient because the zone list is walked every time a frame is
36
      allocated or deallocated. However, this does not represent a significant
36
      allocated or deallocated. However, this does not represent a significant
37
      performance problem as it is expected that the number of zones will be
37
      performance problem as it is expected that the number of zones will be
38
      rather low. Moreover, most architectures merge all zones into
38
      rather low. Moreover, most architectures merge all zones into
39
      one.</para>
39
      one.</para>
40
 
40
 
41
      <para>Every physical memory frame in a zone is described by a structure
41
      <para>Every physical memory frame in a zone is described by a structure
42
      that contains the number of references and other data used by the buddy
42
      that contains the number of references and other data used by the buddy
43
      system.</para>
43
      system.</para>
44
    </section>
44
    </section>
45
 
45
 
46
    <section id="frame_allocator">
46
    <section id="frame_allocator">
47
      <indexterm>
47
      <indexterm>
48
        <primary>frame allocator</primary>
48
        <primary>frame allocator</primary>
49
      </indexterm>
49
      </indexterm>
50
 
50
 
51
      <title>Frame allocator</title>
51
      <title>Frame allocator</title>
52
 
52
 
53
      <para>The frame allocator satisfies kernel requests to allocate
53
      <para>The frame allocator satisfies kernel requests to allocate
54
      power-of-two sized blocks of physical memory. Because of zonal
54
      power-of-two sized blocks of physical memory. Because of zonal
55
      organization of physical memory, the frame allocator is always working
55
      organization of physical memory, the frame allocator is always working
56
      within a context of a particular frame zone. In order to carry out the
56
      within a context of a particular frame zone. In order to carry out the
57
      allocation requests, the frame allocator is tightly integrated with the
57
      allocation requests, the frame allocator is tightly integrated with the
58
      buddy system belonging to the zone. The frame allocator is also
58
      buddy system belonging to the zone. The frame allocator is also
59
      responsible for updating information about the number of free and busy
59
      responsible for updating information about the number of free and busy
60
      frames in the zone. <figure float="1">
60
      frames in the zone. <figure float="1">
61
          <mediaobject id="frame_alloc">
61
          <mediaobject id="frame_alloc">
62
            <imageobject role="pdf">
62
            <imageobject role="pdf">
63
              <imagedata fileref="images/frame_alloc.pdf" format="PDF" />
63
              <imagedata fileref="images/frame_alloc.pdf" format="PDF" />
64
            </imageobject>
64
            </imageobject>
65
 
65
 
66
            <imageobject role="html">
66
            <imageobject role="html">
67
              <imagedata fileref="images/frame_alloc.png" format="PNG" />
67
              <imagedata fileref="images/frame_alloc.png" format="PNG" />
68
            </imageobject>
68
            </imageobject>
69
 
69
 
70
            <imageobject role="fop">
70
            <imageobject role="fop">
71
              <imagedata fileref="images/frame_alloc.svg" format="SVG" />
71
              <imagedata fileref="images/frame_alloc.svg" format="SVG" />
72
            </imageobject>
72
            </imageobject>
73
          </mediaobject>
73
          </mediaobject>
74
 
74
 
75
          <title>Frame allocator scheme.</title>
75
          <title>Frame allocator scheme.</title>
76
        </figure></para>
76
        </figure></para>
77
 
77
 
78
      <formalpara>
78
      <formalpara>
79
        <title>Allocation / deallocation</title>
79
        <title>Allocation / deallocation</title>
80
 
80
 
81
        <para>Upon allocation request via function <code>frame_alloc()</code>,
81
        <para>Upon allocation request via function <code>frame_alloc()</code>,
82
        the frame allocator first tries to find a zone that can satisfy the
82
        the frame allocator first tries to find a zone that can satisfy the
83
        request (i.e. has the required amount of free frames). Once a suitable
83
        request (i.e. has the required amount of free frames). Once a suitable
84
        zone is found, the frame allocator uses the buddy allocator on the
84
        zone is found, the frame allocator uses the buddy allocator on the
85
        zone's buddy system to perform the allocation. During deallocation,
85
        zone's buddy system to perform the allocation. If no free zone is
86
        which is triggered by a call to <code>frame_free()</code>, the frame
86
        found, the frame allocator tries to reclaim slab memory.</para>
-
 
87
 
87
        allocator looks up the respective zone that contains the frame being
88
        <para>During deallocation, which is triggered by a call to
88
        deallocated. Afterwards, it calls the buddy allocator again, this time
89
        <code>frame_free()</code>, the frame allocator looks up the respective
-
 
90
        zone that contains the frame being deallocated. Afterwards, it calls
-
 
91
        the buddy allocator again, this time to take care of deallocation
89
        to take care of deallocation within the zone's buddy system.</para>
92
        within the zone's buddy system.</para>
90
      </formalpara>
93
      </formalpara>
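The allocation and deallocation paths described above can be sketched as a small Python model. This is only an illustration under stated assumptions: the `Zone` class, its `take()` and `give()` helpers and the function signatures are hypothetical, and the per-zone buddy system is replaced by a simple scan for a free, zone-relative aligned run of frames. It shows the zone walk and the free/busy bookkeeping, not the HelenOS implementation.

```python
class Zone:
    """Toy model of a frame zone: a contiguous run of frames plus counters."""

    def __init__(self, base, count):
        self.base = base                # index of the first frame in the zone
        self.count = count              # number of frames in the zone
        self.busy = [False] * count     # stand-in for the per-frame structures
        self.free_count = count
        self.busy_count = 0

    def contains(self, frame):
        return self.base <= frame < self.base + self.count

    def take(self, order):
        """Stand-in for the zone's buddy system: first free aligned run."""
        size = 1 << order
        for start in range(0, self.count - size + 1, size):
            if not any(self.busy[start:start + size]):
                self.busy[start:start + size] = [True] * size
                self.free_count -= size
                self.busy_count += size
                return self.base + start
        return None

    def give(self, frame, order):
        """Return a block of 2**order frames to the zone."""
        size = 1 << order
        start = frame - self.base
        self.busy[start:start + size] = [False] * size
        self.free_count += size
        self.busy_count -= size


def frame_alloc(zones, order):
    """Walk the zone list; allocate 2**order frames from the first zone that can."""
    size = 1 << order
    for zone in zones:
        if zone.free_count >= size:
            frame = zone.take(order)
            if frame is not None:
                return frame
    # Per the text, the real allocator would now try to reclaim slab memory.
    return None


def frame_free(zones, frame, order):
    """Look up the zone that contains the frame and free the block there."""
    for zone in zones:
        if zone.contains(frame):
            zone.give(frame, order)
            return
    raise ValueError("frame does not belong to any zone")
```

In this model a request for four frames skips a two-frame zone and is served by the next zone in the list, which mirrors the zone walk described in the text.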
91
    </section>
94
    </section>
92
 
95
 
93
    <section id="buddy_allocator">
96
    <section id="buddy_allocator">
94
      <indexterm>
97
      <indexterm>
95
        <primary>buddy system</primary>
98
        <primary>buddy system</primary>
96
      </indexterm>
99
      </indexterm>
97
 
100
 
98
      <title>Buddy allocator</title>
101
      <title>Buddy allocator</title>
99
 
102
 
100
      <para>In the buddy system, the memory is broken down into power-of-two
103
      <para>In the buddy system, the memory is broken down into power-of-two
101
      sized naturally aligned blocks. These blocks are organized in an array
104
      sized naturally aligned blocks. These blocks are organized in an array
102
      of lists, in which the list with index <emphasis>i</emphasis> contains
105
      of lists, in which the list with index <emphasis>i</emphasis> contains
103
      all unallocated blocks of size
106
      all unallocated blocks of size
104
      <emphasis>2<superscript>i</superscript></emphasis>. The index
107
      <emphasis>2<superscript>i</superscript></emphasis>. The index
105
      <emphasis>i</emphasis> is called the order of the block. Should there be two
108
      <emphasis>i</emphasis> is called the order of the block. Should there be two
106
      adjacent equally sized blocks in the list <emphasis>i</emphasis> (i.e.
109
      adjacent equally sized blocks in the list <emphasis>i</emphasis> (i.e.
107
      buddies), the buddy allocator would coalesce them and put the resulting
110
      buddies), the buddy allocator would coalesce them and put the resulting
108
      block in list <emphasis>i + 1</emphasis>, provided that the resulting
111
      block in list <emphasis>i + 1</emphasis>, provided that the resulting
109
      block would be naturally aligned. Similarly, when the allocator is
112
      block would be naturally aligned. Similarly, when the allocator is
110
      asked to allocate a block of size
113
      asked to allocate a block of size
111
      <emphasis>2<superscript>i</superscript></emphasis>, it first tries to
114
      <emphasis>2<superscript>i</superscript></emphasis>, it first tries to
112
      satisfy the request from the list with index <emphasis>i</emphasis>. If
115
      satisfy the request from the list with index <emphasis>i</emphasis>. If
113
      the request cannot be satisfied (i.e. the list <emphasis>i</emphasis> is
116
      the request cannot be satisfied (i.e. the list <emphasis>i</emphasis> is
114
      empty), the buddy allocator will try to allocate and split a larger
117
      empty), the buddy allocator will try to allocate and split a larger
115
      block from the list with index <emphasis>i + 1</emphasis>. Both of these
118
      block from the list with index <emphasis>i + 1</emphasis>. Both of these
116
      algorithms are recursive. The recursion ends either when there are no
119
      algorithms are recursive. The recursion ends either when there are no
117
      blocks to coalesce in the former case or when there are no blocks that
120
      blocks to coalesce in the former case or when there are no blocks that
118
      can be split in the latter case.</para>
121
      can be split in the latter case.</para>
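The recursive split and coalesce steps described above can be modelled in a few lines of Python. This is a minimal sketch, not the HelenOS code: the `BuddySystem` class and its method names are hypothetical, blocks are identified by the index of their first frame, and the buddy of a block is assumed to be found by flipping bit <emphasis>i</emphasis> of that index, which also keeps every coalesced block naturally aligned.

```python
class BuddySystem:
    """Toy buddy system over 2**max_order frames, addressed by frame index."""

    def __init__(self, max_order):
        self.max_order = max_order
        # free_lists[i] holds the first-frame index of each free block of order i.
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)

    def alloc(self, order):
        """Return the first frame of a free block of 2**order frames, or None."""
        if order > self.max_order:
            return None                      # nothing left to split
        if self.free_lists[order]:
            block = min(self.free_lists[order])
            self.free_lists[order].remove(block)
            return block
        # List i is empty: allocate a block of order i + 1 and split it.
        block = self.alloc(order + 1)
        if block is None:
            return None
        self.free_lists[order].add(block + (1 << order))  # upper half stays free
        return block

    def free(self, block, order):
        """Free a block and coalesce it with its buddy while possible."""
        buddy = block ^ (1 << order)
        if order < self.max_order and buddy in self.free_lists[order]:
            self.free_lists[order].remove(buddy)
            self.free(min(block, buddy), order + 1)       # recurse one order up
        else:
            self.free_lists[order].add(block)
```

Allocating a single frame from a fresh eight-frame system splits the order-3 block three times; freeing every block again coalesces the pieces back into one order-3 block.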
119
 
122
 
120
      <para>This approach greatly reduces external fragmentation of memory and
123
      <para>This approach greatly reduces external fragmentation of memory and
121
      helps in allocating bigger continuous blocks of memory aligned to their
124
      helps in allocating bigger continuous blocks of memory aligned to their
122
      size. On the other hand, the buddy allocator suffers increased internal
125
      size. On the other hand, the buddy allocator suffers increased internal
123
      fragmentation of memory and is not suitable for general kernel
126
      fragmentation of memory and is not suitable for general kernel
124
      allocations. This purpose is better addressed by the <link
127
      allocations. This purpose is better addressed by the <link
125
      linkend="slab">slab allocator</link>.<figure float="1">
128
      linkend="slab">slab allocator</link>.<figure float="1">
126
          <mediaobject id="buddy_alloc">
129
          <mediaobject id="buddy_alloc">
127
            <imageobject role="pdf">
130
            <imageobject role="pdf">
128
              <imagedata fileref="images/buddy_alloc.pdf" format="PDF" />
131
              <imagedata fileref="images/buddy_alloc.pdf" format="PDF" />
129
            </imageobject>
132
            </imageobject>
130
 
133
 
131
            <imageobject role="html">
134
            <imageobject role="html">
132
              <imagedata fileref="images/buddy_alloc.png" format="PNG" />
135
              <imagedata fileref="images/buddy_alloc.png" format="PNG" />
133
            </imageobject>
136
            </imageobject>
134
 
137
 
135
            <imageobject role="fop">
138
            <imageobject role="fop">
136
              <imagedata fileref="images/buddy_alloc.svg" format="SVG" />
139
              <imagedata fileref="images/buddy_alloc.svg" format="SVG" />
137
            </imageobject>
140
            </imageobject>
138
          </mediaobject>
141
          </mediaobject>
139
 
142
 
140
          <title>Buddy system scheme.</title>
143
          <title>Buddy system scheme.</title>
141
        </figure></para>
144
        </figure></para>
142
 
145
 
143
      <section>
146
      <section>
144
        <title>Implementation</title>
147
        <title>Implementation</title>
145
 
148
 
146
        <para>The buddy allocator is, in fact, an abstract framework which can
149
        <para>The buddy allocator is, in fact, an abstract framework which can
147
        be easily specialized to serve one particular task. It knows nothing
150
        be easily specialized to serve one particular task. It knows nothing
148
        about the nature of memory it helps to allocate. To make up for the
151
        about the nature of memory it helps to allocate. To make up for the
149
        lack of this knowledge, the buddy allocator exports an interface that
152
        lack of this knowledge, the buddy allocator exports an interface that
150
        each of its clients is required to implement. When supplied with an
153
        each of its clients is required to implement. When supplied with an
151
        implementation of this interface, the buddy allocator can use
154
        implementation of this interface, the buddy allocator can use
152
        specialized external functions to find a buddy for a block, split and
155
        specialized external functions to find a buddy for a block, split and
153
        coalesce blocks, manipulate block order and mark blocks busy or
156
        coalesce blocks, manipulate block order and mark blocks busy or
154
        available.</para>
157
        available.</para>
155
 
158
 
156
        <formalpara>
159
        <formalpara>
157
          <title>Data organization</title>
160
          <title>Data organization</title>
158
 
161
 
159
          <para>Each entity allocable by the buddy allocator is required to
162
          <para>Each entity allocable by the buddy allocator is required to
160
          contain space for storing block order number and a link variable
163
          contain space for storing block order number and a link variable
161
          used to interconnect blocks within the same order.</para>
164
          used to interconnect blocks within the same order.</para>
162
 
165
 
163
          <para>Whatever entities are allocated by the buddy allocator, the
166
          <para>Whatever entities are allocated by the buddy allocator, the
164
          first entity within a block is used to represent the entire block.
167
          first entity within a block is used to represent the entire block.
165
          The first entity keeps the order of the whole block. Other entities
168
          The first entity keeps the order of the whole block. Other entities
166
          within the block are assigned the magic value
169
          within the block are assigned the magic value
167
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially important
170
          <constant>BUDDY_SYSTEM_INNER_BLOCK</constant>. This is especially important
168
          for effective identification of buddies in a one-dimensional array
171
          for effective identification of buddies in a one-dimensional array
169
          because the entity that represents a potential buddy cannot be
172
          because the entity that represents a potential buddy cannot be
170
          associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
173
          associated with <constant>BUDDY_SYSTEM_INNER_BLOCK</constant> (i.e. if it
171
          is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
174
          is associated with <constant>BUDDY_SYSTEM_INNER_BLOCK</constant> then it is
172
          not a buddy).</para>
175
          not a buddy).</para>
173
        </formalpara>
176
        </formalpara>
174
      </section>
177
      </section>
175
    </section>
178
    </section>
176
 
179
 
177
    <section id="slab">
180
    <section id="slab">
178
      <indexterm>
181
      <indexterm>
179
        <primary>slab allocator</primary>
182
        <primary>slab allocator</primary>
180
      </indexterm>
183
      </indexterm>
181
 
184
 
182
      <title>Slab allocator</title>
185
      <title>Slab allocator</title>
183
 
186
 
184
      <para>The majority of memory allocation requests in the kernel is for
187
      <para>The majority of memory allocation requests in the kernel is for
185
      small, frequently used data structures. The basic idea behind the slab
188
      small, frequently used data structures. The basic idea behind the slab
186
      allocator is that commonly used objects are preallocated in continuous
189
      allocator is that commonly used objects are preallocated in continuous
187
      areas of physical memory called slabs<footnote>
190
      areas of physical memory called slabs<footnote>
188
          <para>Slabs are in fact blocks of physical memory frames allocated
191
          <para>Slabs are in fact blocks of physical memory frames allocated
189
          from the frame allocator.</para>
192
          from the frame allocator.</para>
190
        </footnote>. Whenever an object is to be allocated, the slab allocator
193
        </footnote>. Whenever an object is to be allocated, the slab allocator
191
      returns the first available item from a suitable slab corresponding to
194
      returns the first available item from a suitable slab corresponding to
192
      the object type<footnote>
195
      the object type<footnote>
193
          <para>The mechanism is rather more complicated, see the next
196
          <para>The mechanism is rather more complicated, see the next
194
          paragraph.</para>
197
          paragraph.</para>
195
        </footnote>. Due to the fact that the sizes of the requested and
198
        </footnote>. Due to the fact that the sizes of the requested and
196
      allocated object match, the slab allocator significantly reduces
199
      allocated object match, the slab allocator significantly reduces
197
      internal fragmentation.</para>
200
      internal fragmentation.</para>
198
 
201
 
199
      <indexterm>
202
      <indexterm>
200
        <primary>slab allocator</primary>
203
        <primary>slab allocator</primary>
201
 
204
 
202
        <secondary>- slab cache</secondary>
205
        <secondary>- slab cache</secondary>
203
      </indexterm>
206
      </indexterm>
204
 
207
 
205
      <para>Slabs of one object type are organized in a structure called slab
208
      <para>Slabs of one object type are organized in a structure called slab
206
      cache. There are ususally more slabs in the slab cache, depending on
209
      cache. There are usually more slabs in the slab cache, depending on
207
      previous allocations. If the slab cache runs out of available slabs,
210
      previous allocations. If the slab cache runs out of available slabs,
208
      new slabs are allocated. In order to exploit parallelism and to avoid
211
      new slabs are allocated. In order to exploit parallelism and to avoid
209
      locking of shared spinlocks, slab caches can have variants of
212
      locking of shared spinlocks, slab caches can have variants of
210
      processor-private slabs called magazines. On each processor, there is a
213
      processor-private slabs called magazines. On each processor, there is a
211
      two-magazine cache. Full magazines that are not part of any
214
      two-magazine cache. Full magazines that are not part of any
212
      per-processor magazine cache are stored in a global list of full
215
      per-processor magazine cache are stored in a global list of full
213
      magazines.</para>
216
      magazines.</para>
214
 
217
 
215
      <indexterm>
218
      <indexterm>
216
        <primary>slab allocator</primary>
219
        <primary>slab allocator</primary>
217
 
220
 
218
        <secondary>- magazine</secondary>
221
        <secondary>- magazine</secondary>
219
      </indexterm>
222
      </indexterm>
220
 
223
 
221
      <para>Each object begins its life in a slab. When it is allocated from
224
      <para>Each object begins its life in a slab. When it is allocated from
222
      there, the slab allocator calls a constructor that is registered in the
225
      there, the slab allocator calls a constructor that is registered in the
223
      respective slab cache. The constructor initializes and brings the object
226
      respective slab cache. The constructor initializes and brings the object
224
      into a known state. The object is then used by the user. When the user
227
      into a known state. The object is then used by the user. When the user
225
      later frees the object, the slab allocator puts it into a processor
228
      later frees the object, the slab allocator puts it into a processor
226
      private <indexterm>
229
      private <indexterm>
227
          <primary>slab allocator</primary>
230
          <primary>slab allocator</primary>
228
 
231
 
229
          <secondary>- magazine</secondary>
232
          <secondary>- magazine</secondary>
230
        </indexterm>magazine cache, from where it can be preferentially allocated
233
        </indexterm>magazine cache, from where it can be preferentially allocated
231
      again. Note that allocations satisfied from a magazine are already
234
      again. Note that allocations satisfied from a magazine are already
232
      initialized by the constructor. When both of the processor cached
235
      initialized by the constructor. When both of the processor cached
233
      magazines get full, the allocator will move one of the magazines to the
236
      magazines get full, the allocator will move one of the magazines to the
234
      list of full magazines. Similarly, when allocating from an empty
237
      list of full magazines. Similarly, when allocating from an empty
235
      processor magazine cache, the kernel will reload only one magazine from
238
      processor magazine cache, the kernel will reload only one magazine from
236
      the list of full magazines. In other words, the slab allocator tries to
239
      the list of full magazines. In other words, the slab allocator tries to
237
      keep the processor magazine cache only half-full in order to prevent
240
      keep the processor magazine cache only half-full in order to prevent
238
      thrashing when allocations and deallocations interleave on magazine
241
      thrashing when allocations and deallocations interleave on magazine
239
      boundaries. The advantage of this setup is that during most of the
242
      boundaries. The advantage of this setup is that during most of the
240
      allocations, no global spinlock needs to be held.</para>
243
      allocations, no global spinlock needs to be held.</para>
241
 
244
 
242
      <para>Should HelenOS run short of memory, it would start deallocating
245
      <para>Should HelenOS run short of memory, it would start deallocating
243
      objects from magazines, calling slab cache destructor on them and
246
      objects from magazines, calling slab cache destructor on them and
244
      putting them back into slabs. When a slab contains no allocated object,
247
      putting them back into slabs. When a slab contains no allocated object,
245
      it is immediately freed.</para>
248
      it is immediately freed.</para>
246
 
249
 
247
      <para>
250
      <para>
248
        <figure float="1">
251
        <figure float="1">
249
          <mediaobject id="slab_alloc">
252
          <mediaobject id="slab_alloc">
250
            <imageobject role="pdf">
253
            <imageobject role="pdf">
251
              <imagedata fileref="images/slab_alloc.pdf" format="PDF" />
254
              <imagedata fileref="images/slab_alloc.pdf" format="PDF" />
252
            </imageobject>
255
            </imageobject>
253
 
256
 
254
            <imageobject role="html">
257
            <imageobject role="html">
255
              <imagedata fileref="images/slab_alloc.png" format="PNG" />
258
              <imagedata fileref="images/slab_alloc.png" format="PNG" />
256
            </imageobject>
259
            </imageobject>
257
 
260
 
258
            <imageobject role="fop">
261
            <imageobject role="fop">
259
              <imagedata fileref="images/slab_alloc.svg" format="SVG" />
262
              <imagedata fileref="images/slab_alloc.svg" format="SVG" />
260
            </imageobject>
263
            </imageobject>
261
          </mediaobject>
264
          </mediaobject>
262
 
265
 
263
          <title>Slab allocator scheme.</title>
266
          <title>Slab allocator scheme.</title>
264
        </figure>
267
        </figure>
265
      </para>
268
      </para>
266
 
269
 
267
      <section>
270
      <section>
268
        <title>Implementation</title>
271
        <title>Implementation</title>
269
 
272
 
270
        <para>The slab allocator is closely modelled after <xref
273
        <para>The slab allocator is closely modelled after <xref
271
        linkend="Bonwick01" /> with the following exceptions:<itemizedlist>
274
        linkend="Bonwick01" /> with the following exceptions:<itemizedlist>
272
            <listitem>
275
            <listitem>
273
              <para>empty slabs are immediately deallocated and</para>
276
              <para>empty slabs are immediately deallocated and</para>
274
            </listitem>
277
            </listitem>
275
 
278
 
276
            <listitem>
279
            <listitem>
277
              <para>empty magazines are deallocated when not needed.</para>
280
              <para>empty magazines are deallocated when not needed.</para>
278
            </listitem>
281
            </listitem>
279
          </itemizedlist>The following features are not currently supported
282
          </itemizedlist>The following features are not currently supported
280
        but would be easy to do: <itemizedlist>
283
        but would be easy to do: <itemizedlist>
281
            <listitem>cache coloring and</listitem>
284
            <listitem>cache coloring and</listitem>
282
 
285
 
283
            <listitem>dynamic magazine growth (different magazine sizes are
286
            <listitem>dynamic magazine growth (different magazine sizes are
284
            already supported, but the allocation strategy would need to be
287
            already supported, but the allocation strategy would need to be
285
            adjusted).</listitem>
288
            adjusted).</listitem>
286
          </itemizedlist></para>
289
          </itemizedlist></para>
287
 
290
 
288
        <section>
291
        <section>
289
          <title>Allocation/deallocation</title>
292
          <title>Allocation/deallocation</title>
290
 
293
 
291
          <para>The following two paragraphs summarize and complete the
294
          <para>The following two paragraphs summarize and complete the
292
          description of the slab allocator operation (i.e.
295
          description of the slab allocator operation (i.e.
293
          <code>slab_alloc()</code> and <code>slab_free()</code>
296
          <code>slab_alloc()</code> and <code>slab_free()</code>
294
          functions).</para>
297
          functions).</para>
295
 
298
 
296
          <formalpara>
299
          <formalpara>
297
            <title>Allocation</title>
300
            <title>Allocation</title>
298
 
301
 
299
            <para><emphasis>Step 1.</emphasis> When an allocation request
302
            <para><emphasis>Step 1.</emphasis> When an allocation request
            arrives, the slab allocator checks whether the current magazine of
            the local processor magazine cache holds any objects. If it does,
            the allocator simply pops an object from the magazine and returns
            it.</para>

            <para><emphasis>Step 2.</emphasis> If the current magazine in the
            processor magazine cache is empty, the allocator attempts to swap
            it with the last magazine from the cache and returns to Step 1. If
            the last magazine is empty as well, the algorithm falls through to
            Step 3.</para>

            <para><emphasis>Step 3.</emphasis> At this point, both magazines
            in the processor magazine cache are empty. The allocator tries to
            reload one magazine from the shared list of full magazines. If the
            reload succeeds (i.e. there are full magazines in the list), the
            algorithm continues with Step 1.</para>

            <para><emphasis>Step 4.</emphasis> In this fail-safe step, an
            object is allocated from the conventional slab layer and a pointer
            to it is returned. If no existing slab has a free object, a new
            slab is allocated.</para>
          </formalpara>
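          <para>The four allocation steps can be sketched in C as follows.
          This is a minimal illustration, not the HelenOS implementation: the
          type names, <code>MAG_SIZE</code>, the array-based shared list and
          the <code>slab_obj_create()</code> stand-in for the slab layer are
          all assumptions made for the example, and locking is left out
          entirely.</para>

          <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define MAG_SIZE 4   /* objects per magazine (assumed value) */
#define MAX_FULL 8   /* capacity of the shared list (assumed value) */

typedef struct {
    size_t busy;            /* number of objects currently stored */
    void *objs[MAG_SIZE];
} magazine_t;

typedef struct {
    magazine_t *current;
    magazine_t *last;
} cpu_cache_t;

/* Shared list of full magazines, modelled as a simple stack. */
typedef struct {
    magazine_t *mags[MAX_FULL];
    size_t count;
} depot_t;

/* Hypothetical stand-in for the conventional slab layer (Step 4). */
static char slab_storage[16][32];
static size_t slab_next;

static void *slab_obj_create(void)
{
    return slab_next < 16 ? (void *) slab_storage[slab_next++] : NULL;
}

void *magazine_alloc(cpu_cache_t *cpu, depot_t *depot)
{
    for (;;) {
        /* Step 1: pop from the current magazine if it holds an object. */
        if (cpu->current->busy > 0)
            return cpu->current->objs[--cpu->current->busy];

        /* Step 2: swap current and last if the last one is not empty. */
        if (cpu->last->busy > 0) {
            magazine_t *tmp = cpu->current;
            cpu->current = cpu->last;
            cpu->last = tmp;
            continue;
        }

        /* Step 3: reload a full magazine from the shared list. The
         * displaced empty magazine is simply dropped in this sketch;
         * a real allocator would free it. */
        if (depot->count > 0) {
            cpu->last = cpu->current;
            cpu->current = depot->mags[--depot->count];
            continue;
        }

        /* Step 4: fall back to the slab layer. */
        return slab_obj_create();
    }
}
```
]]></programlisting>

          <para>The loop mirrors the wording of the steps: Steps 2 and 3 do
          not return an object themselves, they only rearrange magazines and
          go back to Step 1.</para>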

          <formalpara>
            <title>Deallocation</title>

            <para><emphasis>Step 1.</emphasis> On a deallocation request, the
            slab allocator checks whether the current magazine of the local
            processor magazine cache is full. If it is not, the pointer to the
            object is simply pushed into the magazine and the algorithm
            returns.</para>

            <para><emphasis>Step 2.</emphasis> If the current magazine is
            full, the allocator attempts to swap it with the last magazine
            from the cache and returns to Step 1. If the last magazine is full
            as well, the algorithm falls through to Step 3.</para>

            <para><emphasis>Step 3.</emphasis> At this point, both magazines
            in the processor magazine cache are full. The allocator tries to
            allocate a new empty magazine and flush one of the full magazines
            to the shared list of full magazines. If this succeeds, the
            algorithm continues with Step 1.</para>

            <para><emphasis>Step 4.</emphasis> Under low memory conditions,
            when the allocation of an empty magazine fails, the object is
            returned directly to its slab. Thus, even in the worst case,
            object deallocation does not need to allocate any additional
            memory.</para>
          </formalpara>
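          <para>The deallocation path can be sketched in the same spirit. As
          before, this is only an illustration under stated assumptions: the
          type names, the tiny <code>spare</code> pool that models running out
          of memory for new magazines, and the
          <code>slab_obj_destroy()</code> counter are hypothetical, and no
          locking is shown.</para>

          <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define MAG_SIZE 2   /* objects per magazine (assumed value) */
#define MAX_FULL 8   /* capacity of the shared list (assumed value) */

typedef struct { size_t busy; void *objs[MAG_SIZE]; } magazine_t;
typedef struct { magazine_t *current; magazine_t *last; } cpu_cache_t;
typedef struct { magazine_t *mags[MAX_FULL]; size_t count; } depot_t;

/* Hypothetical pool of empty magazines; once it is used up,
 * magazine_create() fails, which models a low memory condition. */
static magazine_t spare[1];
static size_t spare_used;

static magazine_t *magazine_create(void)
{
    if (spare_used >= sizeof(spare) / sizeof(spare[0]))
        return NULL;
    spare[spare_used].busy = 0;
    return &spare[spare_used++];
}

/* Hypothetical stand-in for returning an object to its slab. */
static size_t slab_returned;

static void slab_obj_destroy(void *obj)
{
    (void) obj;
    slab_returned++;
}

void magazine_free(cpu_cache_t *cpu, depot_t *depot, void *obj)
{
    for (;;) {
        /* Step 1: push into the current magazine if there is room. */
        if (cpu->current->busy < MAG_SIZE) {
            cpu->current->objs[cpu->current->busy++] = obj;
            return;
        }

        /* Step 2: swap current and last if the last one is not full. */
        if (cpu->last->busy < MAG_SIZE) {
            magazine_t *tmp = cpu->current;
            cpu->current = cpu->last;
            cpu->last = tmp;
            continue;
        }

        /* Step 3: both are full; get an empty magazine and flush one
         * full magazine to the shared list. */
        magazine_t *empty =
            depot->count < MAX_FULL ? magazine_create() : NULL;
        if (empty != NULL) {
            depot->mags[depot->count++] = cpu->last;
            cpu->last = cpu->current;
            cpu->current = empty;
            continue;
        }

        /* Step 4: no memory for a magazine; bypass the magazine layer. */
        slab_obj_destroy(obj);
        return;
    }
}
```
]]></programlisting>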
        </section>
      </section>
    </section>
  </section>

  <section>
    <title>Virtual memory management</title>

    <para>Virtual memory is essential for an operating system because it makes
    several things possible. First, it helps to isolate tasks from each other
    by encapsulating them in their private address spaces. Second, virtual
    memory can give tasks the impression that more memory is available than
    is physically present. Third, virtual memory allows multiple copies of
    the same program, linked to the same addresses, to run in the system at
    the same time. There are at least two known mechanisms for implementing
    virtual memory: segmentation and paging. Even though some processor
    architectures supported by HelenOS<footnote>
        <para>ia32 has full-fledged segmentation.</para>
      </footnote> provide both mechanisms, the kernel makes use solely of
    paging.</para>

    <section id="paging">
      <title>VAT subsystem</title>

      <para>In paged virtual memory, the entire virtual address space is
      divided into small, power-of-two sized, naturally aligned blocks called
      pages. The processor implements a translation mechanism that allows the
      operating system to manage mappings between a set of pages and a set of
      identically sized and identically aligned pieces of physical memory
      called frames. As a result, references to contiguous virtual memory
      areas do not necessarily reference a contiguous area of physical
      memory. Supported page sizes usually range from several kilobytes to
      several megabytes. Each page that takes part in the mapping is
      associated with certain attributes that further describe the mapping
      (e.g. access rights, dirty and accessed bits and present bit).</para>
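      <para>Because pages are power-of-two sized and naturally aligned, a
      virtual address splits into a page part and an offset with simple bit
      operations. The following sketch assumes 4 KiB pages; the
      <code>pte_t</code> structure, the flag names and the
      <code>translate()</code> helper are hypothetical and serve only to
      illustrate the idea.</para>

      <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

#define PAGE_WIDTH 12                          /* assumes 4 KiB pages */
#define PAGE_SIZE  ((uintptr_t) 1 << PAGE_WIDTH)

/* Mapping attributes named in the text, as hypothetical flag bits. */
enum {
    PTE_PRESENT  = 1 << 0,
    PTE_WRITE    = 1 << 1,
    PTE_ACCESSED = 1 << 2,
    PTE_DIRTY    = 1 << 3
};

/* One mapping: a frame base address plus its attributes. */
typedef struct {
    uintptr_t frame;
    unsigned flags;
} pte_t;

/* Base address of the page containing va. */
uintptr_t page_of(uintptr_t va)
{
    return va & ~(PAGE_SIZE - 1);
}

/* Offset of va within its page. */
uintptr_t offset_of(uintptr_t va)
{
    return va & (PAGE_SIZE - 1);
}

/*
 * Translate va through a single mapping. Returns 0 and stores the
 * physical address in *pa, or returns -1 where hardware would raise
 * a page fault because the present bit is not set.
 */
int translate(const pte_t *pte, uintptr_t va, uintptr_t *pa)
{
    if (!(pte->flags & PTE_PRESENT))
        return -1;
    *pa = pte->frame | offset_of(va);
    return 0;
}
```
]]></programlisting>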

      <para>When the processor accesses a page that is not present (i.e. its
      present bit is not set), the operating system is notified through a
      special exception called a page fault. It is then up to the operating
      system to service the page fault. In HelenOS, some page faults are fatal
      and result in either task termination or, in the worst case, kernel
      panic<footnote>
          <para>Such a condition would be caused either by a hardware failure
          or by a bug in the kernel.</para>
        </footnote>, while other page faults are used to load memory on demand
      or to notify the kernel about certain events.</para>

      <indexterm>
        <primary>page tables</primary>
      </indexterm>

      <para>The set of all page mappings is stored in a memory structure
      called page tables. Some architectures have no hardware support for page
      tables<footnote>
          <para>On mips32, a TLB-only model is used and the operating system
          is responsible for managing software-defined page tables.</para>
        </footnote> while other processor architectures<footnote>
          <para>Like amd64 and ia32.</para>
        </footnote> understand their whole in-memory format. Despite all
      the possible differences in page table formats, the HelenOS VAT
      subsystem<footnote>
          <para>Virtual Address Translation subsystem.</para>
        </footnote> unifies the page table operations under one programming
      interface. For all parts of the kernel, three basic functions are
      provided:</para>

      <itemizedlist>
        <listitem>
          <para><code>page_mapping_insert()</code>,</para>
        </listitem>

        <listitem>
          <para><code>page_mapping_find()</code> and</para>
        </listitem>

        <listitem>
          <para><code>page_mapping_remove()</code>.</para>
        </listitem>
      </itemizedlist>

      <para>The <code>page_mapping_insert()</code> function is used to
      introduce a mapping for one virtual memory page belonging to a
      particular address space into the page tables. Once the mapping is in
      the page tables, it can be looked up by <code>page_mapping_find()</code>
      and removed by <code>page_mapping_remove()</code>. All of these
      functions internally select the functions specific to the page table
      mechanism in use, which carry out the actual operation.</para>
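      <para>One common way to implement this kind of dispatch is a table of
      function pointers that each page table mechanism fills in. The sketch
      below illustrates the pattern only: the three function names come from
      the text, but their signatures, the <code>page_ops_t</code> structure
      and the array-backed <code>toy_*</code> back end are assumptions made
      for the example and may differ from the kernel sources.</para>

      <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int asid; } as_t;   /* simplified address space */

/* Hypothetical operations table, one instance per mechanism. */
typedef struct {
    void (*mapping_insert)(as_t *as, uintptr_t page, uintptr_t frame,
        unsigned flags);
    int (*mapping_find)(as_t *as, uintptr_t page, uintptr_t *frame);
    void (*mapping_remove)(as_t *as, uintptr_t page);
} page_ops_t;

/* Toy back end: a small fixed array of mappings. */
#define TOY_SLOTS 8
static struct {
    as_t *as;
    uintptr_t page, frame;
    unsigned flags;
    int used;
} toy[TOY_SLOTS];

static void toy_insert(as_t *as, uintptr_t page, uintptr_t frame,
    unsigned flags)
{
    /* A full table silently drops the mapping in this toy. */
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        if (!toy[i].used) {
            toy[i].as = as;
            toy[i].page = page;
            toy[i].frame = frame;
            toy[i].flags = flags;
            toy[i].used = 1;
            return;
        }
    }
}

static int toy_find(as_t *as, uintptr_t page, uintptr_t *frame)
{
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        if (toy[i].used && toy[i].as == as && toy[i].page == page) {
            *frame = toy[i].frame;
            return 1;
        }
    }
    return 0;
}

static void toy_remove(as_t *as, uintptr_t page)
{
    for (size_t i = 0; i < TOY_SLOTS; i++) {
        if (toy[i].used && toy[i].as == as && toy[i].page == page)
            toy[i].used = 0;
    }
}

static const page_ops_t toy_ops = { toy_insert, toy_find, toy_remove };

/* The active mechanism; an architecture would select it at boot. */
static const page_ops_t *page_ops = &toy_ops;

/* Generic entry points simply forward to the selected mechanism. */
void page_mapping_insert(as_t *as, uintptr_t page, uintptr_t frame,
    unsigned flags)
{
    page_ops->mapping_insert(as, page, frame, flags);
}

int page_mapping_find(as_t *as, uintptr_t page, uintptr_t *frame)
{
    return page_ops->mapping_find(as, page, frame);
}

void page_mapping_remove(as_t *as, uintptr_t page)
{
    page_ops->mapping_remove(as, page);
}
```
]]></programlisting>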

      <para>There are currently two supported mechanisms: generic 4-level
      hierarchical page tables and the global page hash table. Both
      mechanisms are generic as they cover several hardware platforms. For
      instance, the 4-level hierarchical page table mechanism is used by
      amd64, ia32, mips32 and ppc32. The native page table formats of these
      architectures are 4-level, 2-level, TLB-only and hardware hash table,
      respectively. On the other hand, the global page hash table is used on
      ia64, which can be TLB-only or use a hardware hash table. Although only
      two mechanisms are currently implemented, other mechanisms (e.g.
      B+tree) can be easily added.</para>

      <section id="page_tables">
        <indexterm>
          <primary>page tables</primary>

          <secondary>- hierarchical</secondary>
        </indexterm>

        <title>Hierarchical 4-level page tables</title>

        <para>Hierarchical 4-level page tables are a generalization of the
        frequently used hierarchical model of page tables. In this mechanism,
        each address space has its own page tables. To avoid confusion in
        terminology used by hardware vendors, in HelenOS, the root level page
        table is called PTL0, the two middle levels are called PTL1 and PTL2,
        and, finally, the leaf level is called PTL3. All architectures using
        this mechanism are required to use PTL0 and PTL3. However, the middle
        levels can be left out, depending on the hardware hierarchy or
        structure of software-only page tables. The genericity is achieved
        through a set of macros that define transitions from one level to
        another. Unused levels are optimised out by the compiler.
        <figure float="1">
          <mediaobject id="mm_pt">
            <imageobject role="pdf">
              <imagedata fileref="images/mm_pt.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/mm_pt.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/mm_pt.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Hierarchical 4-level page tables.</title>
        </figure>
        </para>
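        <para>The effect of leaving out the middle levels can be illustrated
        with index macros for a hypothetical 32-bit, two-level configuration
        (10 index bits for PTL0, none for PTL1 and PTL2, 10 for PTL3 and a
        12-bit page offset). The macro names and widths below are assumptions
        chosen for the example. Because the unused levels have zero width,
        their index is the constant 0 and a compiler can fold the
        corresponding lookups away.</para>

        <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Assumed two-level layout: 10 + 0 + 0 + 10 index bits, 12-bit offset. */
#define PAGE_WIDTH  12
#define PTL3_WIDTH  10
#define PTL2_WIDTH  0    /* unused level */
#define PTL1_WIDTH  0    /* unused level */
#define PTL0_WIDTH  10

/* Each level starts where the level below it ends. */
#define PTL3_SHIFT  PAGE_WIDTH
#define PTL2_SHIFT  (PTL3_SHIFT + PTL3_WIDTH)
#define PTL1_SHIFT  (PTL2_SHIFT + PTL2_WIDTH)
#define PTL0_SHIFT  (PTL1_SHIFT + PTL1_WIDTH)

/* Extract 'width' bits of the address starting at bit 'shift'. */
#define PTL_INDEX(va, shift, width) \
    (((uint32_t) (va) >> (shift)) & ((1u << (width)) - 1))

#define PTL0_INDEX(va)  PTL_INDEX(va, PTL0_SHIFT, PTL0_WIDTH)
#define PTL1_INDEX(va)  PTL_INDEX(va, PTL1_SHIFT, PTL1_WIDTH)
#define PTL2_INDEX(va)  PTL_INDEX(va, PTL2_SHIFT, PTL2_WIDTH)
#define PTL3_INDEX(va)  PTL_INDEX(va, PTL3_SHIFT, PTL3_WIDTH)

/* Small wrappers so the macros can be called like functions. */
uint32_t ptl0_index(uint32_t va) { return PTL0_INDEX(va); }
uint32_t ptl1_index(uint32_t va) { return PTL1_INDEX(va); }
uint32_t ptl2_index(uint32_t va) { return PTL2_INDEX(va); }
uint32_t ptl3_index(uint32_t va) { return PTL3_INDEX(va); }
```
]]></programlisting>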
      </section>

      <section>
        <indexterm>
          <primary>page tables</primary>

          <secondary>- hashing</secondary>
        </indexterm>

        <title>Global page hash table</title>

        <para>The implementation of the global page hash table was motivated
        by 64-bit architectures that can have rather sparse address spaces.
        The hash table contains valid mappings only. Each entry of the hash
        table contains an address space pointer, virtual memory page number
        (VPN), physical memory frame number (PFN) and a set of flags. The
        pair of the address space pointer and the virtual memory page number
        is used as a key for the hash table. One of the major differences
        between the global page hash table and hierarchical 4-level page
        tables is that there is only a single global page hash table in the
        system while hierarchical page tables exist per address space. Thus,
        the global page hash table contains information about mappings of all
        address spaces in the system.
        <figure float="1">
          <mediaobject id="mm_hash">
            <imageobject role="pdf">
              <imagedata fileref="images/mm_hash.pdf" format="PDF" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/mm_hash.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images/mm_hash.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Global page hash table.</title>
        </figure>
        </para>
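        <para>The role of the (address space, VPN) key can be shown with a
        small stand-alone table. Note that this sketch deliberately swaps in
        a fixed-size table with linear probing and no removal, rather than
        the generic kernel hash table, and that every name in it
        (<code>pht_entry_t</code>, <code>pht_insert()</code>,
        <code>pht_find()</code>, the hash function) is hypothetical.</para>

        <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PHT_BUCKETS 16   /* assumed table size */

typedef struct { int asid; } as_t;   /* simplified address space */

typedef struct {
    const as_t *as;   /* key, part 1: owning address space */
    uintptr_t vpn;    /* key, part 2: virtual page number */
    uintptr_t pfn;    /* physical frame number */
    unsigned flags;
    int valid;        /* only valid mappings are stored */
} pht_entry_t;

/* A single table shared by all address spaces in the system. */
static pht_entry_t pht[PHT_BUCKETS];

static size_t pht_hash(const as_t *as, uintptr_t vpn)
{
    return (size_t) ((((uintptr_t) as) >> 4) ^ vpn) % PHT_BUCKETS;
}

/* Insert a mapping; returns 0 on success, -1 if the table is full. */
int pht_insert(const as_t *as, uintptr_t vpn, uintptr_t pfn,
    unsigned flags)
{
    size_t h = pht_hash(as, vpn);
    for (size_t i = 0; i < PHT_BUCKETS; i++) {
        pht_entry_t *e = &pht[(h + i) % PHT_BUCKETS];
        if (!e->valid) {
            e->as = as;
            e->vpn = vpn;
            e->pfn = pfn;
            e->flags = flags;
            e->valid = 1;
            return 0;
        }
    }
    return -1;
}

/* Look up a mapping by its (address space, VPN) key. */
const pht_entry_t *pht_find(const as_t *as, uintptr_t vpn)
{
    size_t h = pht_hash(as, vpn);
    for (size_t i = 0; i < PHT_BUCKETS; i++) {
        const pht_entry_t *e = &pht[(h + i) % PHT_BUCKETS];
        if (!e->valid)
            return NULL;   /* an empty slot ends the probe sequence */
        if (e->as == as && e->vpn == vpn)
            return e;
    }
    return NULL;
}
```
]]></programlisting>

        <para>Two address spaces can map the same VPN to different frames
        without conflict, because the address space pointer is part of the
        key.</para>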

        <para>The global page hash table mechanism uses the generic hash table
        type as described in the chapter dedicated to <link
        linkend="hashtables">data structures</link> earlier in this
        book.</para>
      </section>
    </section>
  </section>

  <section id="tlb">
    <indexterm>
      <primary>TLB</primary>
    </indexterm>

    <title>Translation Lookaside buffer</title>

    <para>Due to the considerable overhead of the several extra memory
    accesses that a page table lookup would otherwise require on every memory
    reference, modern architectures deploy a fast associative cache of
    recently used page mappings. This cache is called the TLB (Translation
    Lookaside Buffer) and is present on every processor in the system. As has
    already been pointed out, the TLB is the only page translation mechanism
    on some architectures.</para>

    <section id="tlb_shootdown">
      <indexterm>
        <primary>TLB</primary>

        <secondary>- TLB shootdown</secondary>
      </indexterm>

      <title>TLB consistency</title>

      <para>The operating system is responsible for keeping the TLB consistent
      with the page tables. Whenever mappings are modified or purged from the
      page tables, or when an address space identifier is reused, the kernel
      needs to invalidate the respective contents of the TLB. Some TLB types
      support partial invalidation of their content (e.g. ranges of pages or
      address spaces) while other types can be invalidated only entirely. The
      invalidation must be done on all processors because there is one TLB
      per processor. Maintaining TLB consistency on multiprocessor
      configurations is not as trivial as it might look at first
      glance.</para>

      <para>Remote TLB invalidation is called TLB shootdown. HelenOS uses
      a simplified variant of the algorithm described in <xref
      linkend="Black89" />.</para>

      <para>TLB shootdown is performed in three phases.</para>

      <formalpara>
        <title>Phase 1.</title>

        <para>The initiator clears its TLB flag and locks the global TLB
        spinlock. The request is then enqueued into all other processors' TLB
        shootdown message queues. When the TLB shootdown message queue is full
        on any processor, the queue is purged and a single request to
        invalidate the entire TLB is stored there. Once all the TLB shootdown
        messages have been dispatched, the initiator sends all other
        processors an interrupt to notify them about the incoming TLB
        shootdown message. It then spins until all processors accept the
        interrupt and clear their TLB flags.</para>
      </formalpara>

      <formalpara>
        <title>Phase 2.</title>

        <para>Except for the initiator, all other processors are spinning on
        the TLB spinlock. The initiator is now free to modify the page tables
        and purge its own TLB. The initiator then unlocks the global TLB
        spinlock and sets its TLB flag.</para>
      </formalpara>

      <formalpara>
        <title>Phase 3.</title>

        <para>When the spinlock is unlocked by the initiator, the other
        processors are sequentially granted the spinlock. However, once they
        manage to lock it, they immediately release it. Each processor then
        invalidates its TLB according to the messages found in its TLB
        shootdown message queue. In the end, each processor sets its TLB flag
        and resumes its previous operation.</para>
      </formalpara>
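      <para>The queue overflow rule from Phase 1 is the easiest part of the
      protocol to show in isolation. The following sketch models one
      processor's shootdown message queue; the queue length, the message
      layout and all identifiers are assumptions made for the example, and
      the spinlock, flags and interrupt delivery are not modelled.</para>

      <programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_MSG_QUEUE_LEN 4   /* assumed queue capacity */

typedef enum {
    TLB_INVL_PAGES,   /* invalidate a range of pages */
    TLB_INVL_ASID,    /* invalidate one address space */
    TLB_INVL_ALL      /* invalidate the entire TLB */
} tlb_invl_type_t;

typedef struct {
    tlb_invl_type_t type;
    int asid;
    uintptr_t page;
    size_t count;
} tlb_msg_t;

/* Per-processor queue of pending shootdown requests. */
typedef struct {
    tlb_msg_t msgs[TLB_MSG_QUEUE_LEN];
    size_t len;
} tlb_queue_t;

/*
 * Enqueue a request for one processor. If the queue is already full,
 * it is purged and replaced by a single "invalidate everything"
 * request, which covers all the requests that were dropped.
 */
void tlb_queue_post(tlb_queue_t *q, tlb_msg_t msg)
{
    if (q->len == TLB_MSG_QUEUE_LEN) {
        q->len = 0;
        msg.type = TLB_INVL_ALL;
        msg.asid = 0;
        msg.page = 0;
        msg.count = 0;
    }
    q->msgs[q->len++] = msg;
}
```
]]></programlisting>

      <para>Collapsing a full queue into one whole-TLB invalidation trades
      precision for a bounded queue: the target processor may flush more than
      strictly necessary, but it can never miss a required
      invalidation.</para>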
    </section>
  </section>

  <section>
    <title>Address spaces</title>

    <para>In HelenOS, address spaces are objects that encapsulate the
    following items:</para>

    <itemizedlist>
      <listitem>
        <para>address space identifier,</para>
      </listitem>

      <listitem>
        <para>page table PTL0 pointer and</para>
      </listitem>

      <listitem>
        <para>a set of mutually disjoint address space areas.</para>
      </listitem>
    </itemizedlist>

    <para>Address space identifiers will be discussed later in this section.
    The address space contains a pointer to PTL0, provided that the
    architecture uses per address space page tables such as the hierarchical
    4-level page tables. The most interesting component is the B+tree of
    address space areas belonging to the address space.</para>

    <section>
      <title>Address space areas</title>

      <para>Because an address space can be composed of heterogeneous mappings
      such as userspace code, data, read-only data and kernel memory, it is
      further broken down into smaller homogeneous units called address space
      areas. An address space area represents a contiguous piece of userspace
      virtual memory associated with common flags. Kernel memory mappings do
      not take part in address space areas because they are hardwired either
      into TLBs or page tables and are thus shared by all address spaces. The
      flags are a combination of:</para>
651
 
654
 
652
      <itemizedlist>
        <listitem>
          <para><constant>AS_AREA_READ</constant>,</para>
        </listitem>

        <listitem>
          <para><constant>AS_AREA_WRITE</constant>,</para>
        </listitem>

        <listitem>
          <para><constant>AS_AREA_EXEC</constant> and</para>
        </listitem>

        <listitem>
          <para><constant>AS_AREA_CACHEABLE</constant>.</para>
        </listitem>
      </itemizedlist>

      <para>The <constant>AS_AREA_READ</constant> flag is implicit and cannot
      be removed. The <constant>AS_AREA_WRITE</constant> flag denotes a
      writable address space area and the <constant>AS_AREA_EXEC</constant>
      flag is used for areas containing code. The combination of
      <constant>AS_AREA_WRITE</constant> and <constant>AS_AREA_EXEC</constant>
      is not allowed. Some architectures do not differentiate between
      executable and non-executable mappings. In that case, the
      <constant>AS_AREA_EXEC</constant> flag has no effect on mappings created
      for the address space area in the page tables. If the flags do not have
      <constant>AS_AREA_CACHEABLE</constant> set, the page table entries of
      the area are created with caching disabled. This is useful for address
      space areas containing memory of a memory-mapped device.</para>

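      <para>The flag rules above can be sketched in C as follows. This is
      only an illustration: the bit values and the helper names
      <code>area_flags_normalize()</code>, <code>area_flags_valid()</code>
      and <code>area_is_uncached()</code> are assumptions made for this
      example and are not taken from the HelenOS sources.</para>

      <programlisting><![CDATA[
#include <stdbool.h>

/* Hypothetical bit values; the real HelenOS constants may differ. */
#define AS_AREA_READ      (1u << 0)
#define AS_AREA_WRITE     (1u << 1)
#define AS_AREA_EXEC      (1u << 2)
#define AS_AREA_CACHEABLE (1u << 3)

/* Hypothetical helper: the read flag is implicit, so always add it. */
unsigned int area_flags_normalize(unsigned int flags)
{
    return flags | AS_AREA_READ;
}

/* Hypothetical helper: writable and executable at once is rejected. */
bool area_flags_valid(unsigned int flags)
{
    const unsigned int wx = AS_AREA_WRITE | AS_AREA_EXEC;

    return (flags & wx) != wx;
}

/* Hypothetical helper: areas without the cacheable flag are mapped uncached. */
bool area_is_uncached(unsigned int flags)
{
    return (flags & AS_AREA_CACHEABLE) == 0;
}
]]></programlisting>

      <para>Under these assumptions, a request for
      <constant>AS_AREA_WRITE</constant> together with
      <constant>AS_AREA_EXEC</constant> would be refused, while an area
      requested with <constant>AS_AREA_EXEC</constant> alone would end up
      readable and executable.</para>
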
      <para>Address space areas can be backed by a backend that provides
      virtual functions for servicing page faults that occur within the
      address space area, releasing memory allocated by the area and sharing
      the area. Currently, HelenOS supports three backends: the anonymous
      memory backend, the ELF image backend and the physical memory
      backend.</para>

      <formalpara>
        <title>Anonymous memory backend</title>

        <para>Anonymous memory is memory that has no predefined contents, such
        as the userspace stack or heap. Anonymous address space areas are
        backed by memory allocated from the frame allocator. Areas backed by
        this backend can be resized as long as they are not shared.</para>
      </formalpara>

      <formalpara>
        <title>ELF image backend</title>

        <para>Areas backed by the ELF backend are composed of memory that can
        be initialized, partially initialized or completely anonymous.
        Initialized portions of ELF backend address space areas are those that
        are entirely physically present in the executable image (e.g. code and
        initialized data). Anonymous portions are those pages of the
        <emphasis>bss</emphasis> section that exist entirely outside the
        executable image. Lastly, pages that do not fit into the previous two
        categories are partially initialized because they belong both to the
        image and to the <emphasis>bss</emphasis> section. The initialized
        portion does not need any memory from the allocator unless it is
        writable. In that case, pages are duplicated on demand during a page
        fault and memory for the copy is allocated from the frame allocator.
        The remaining two parts of the ELF image always require memory from
        the frame allocator. Non-shared address space areas backed by the ELF
        image backend can be resized.</para>
      </formalpara>

      <formalpara>
        <title>Physical memory backend</title>

        <para>The physical memory backend is used by device drivers to access
        physical memory. No additional memory needs to be allocated on a page
        fault in this area or when sharing this area. Areas backed by this
        backend cannot be resized.</para>
      </formalpara>

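      <para>One way to picture the backend abstraction described above is a
      table of function pointers attached to each area. The sketch below is
      hypothetical: the type <code>backend_ops_t</code>, its members, the
      stub handlers and <code>area_can_resize()</code> are names invented for
      this illustration and need not match the kernel sources. It only
      encodes the resizing rules stated for the three backends.</para>

      <programlisting><![CDATA[
#include <stdbool.h>
#include <stdint.h>

typedef struct as_area as_area_t;

/* Hypothetical operations table; member names are illustrative only. */
typedef struct {
    const char *name;
    bool resizable;   /* may non-shared areas of this backend be resized? */
    bool (*page_fault)(as_area_t *area, uintptr_t addr);
    void (*frame_free)(as_area_t *area, uintptr_t page, uintptr_t frame);
    void (*share)(as_area_t *area);
} backend_ops_t;

struct as_area {
    const backend_ops_t *backend;
    bool shared;
};

/* Stub handlers standing in for the real backend logic. */
static bool stub_page_fault(as_area_t *area, uintptr_t addr)
{
    (void) area;
    (void) addr;
    return true;
}

static void stub_frame_free(as_area_t *area, uintptr_t page, uintptr_t frame)
{
    (void) area;
    (void) page;
    (void) frame;
}

static void stub_share(as_area_t *area)
{
    area->shared = true;
}

const backend_ops_t anon_backend = {
    "anon", true, stub_page_fault, stub_frame_free, stub_share
};
const backend_ops_t elf_backend = {
    "elf", true, stub_page_fault, stub_frame_free, stub_share
};
const backend_ops_t phys_backend = {
    "phys", false, stub_page_fault, stub_frame_free, stub_share
};

/* Resizing needs a resizable backend and a non-shared area. */
bool area_can_resize(const as_area_t *area)
{
    return area->backend->resizable && !area->shared;
}
]]></programlisting>

      <para>With this layout, generic address space code never needs to know
      which backend it is dealing with; it simply calls through the table
      stored in the area.</para>
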
      <section>
        <title>Memory sharing</title>

        <para>Address space areas can be shared provided that their backend
        supports sharing<footnote>
            <para>Which is the case for all currently supported
            backends.</para>
          </footnote>. When the kernel calls <code>as_area_share()</code>, a
        check is made to see whether the area is already being shared. If the
        area is already shared, it contains a pointer to the share info
        structure. The pointer is then simply copied into the new address
        space area and a reference count in the share info structure is
        incremented. Otherwise a new address space share info structure needs
        to be created. The backend is then called to duplicate the mapping of
        pages for which a frame is allocated. The duplicated mapping is stored
        in the share info structure B+tree called <varname>pagemap</varname>.
        Note that the reference count of the frames put into the
        <varname>pagemap</varname> must be incremented in order to avoid a
        race condition. If the originating address space area were destroyed
        before the <varname>pagemap</varname> information made it to the page
        tables of the other address spaces that take part in the sharing, the
        reference count of the respective frames would drop to zero and some
        of them could be allocated again.</para>
      </section>

      <section>
        <title>Page faults</title>

        <para>When a page fault is encountered in the address space area, the
        address space page fault handler, <code>as_page_fault()</code>,
        invokes the corresponding backend page fault handler to resolve the
        situation. The backend might either confirm the page fault or perform
        a remedy. In the non-shared case, depending on the backend, the page
        fault can usually be remedied by allocating some memory on demand or
        by looking up the frame for the faulting translation in the ELF
        image.</para>

        <para>Shared address space areas need to consider the
        <varname>pagemap</varname> B+tree. First they need to check whether
        the mapping is already present in the <varname>pagemap</varname>.
        If it is there, then the frame reference count is increased and the
        page fault is resolved. Otherwise the handler proceeds similarly to
        the non-shared case. If it allocates a physical memory frame, it must
        increment its reference count and add it to the
        <varname>pagemap</varname>.</para>
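
        <para>The shared-area fault path can be sketched as below. To keep
        the example short, a small fixed-size array stands in for the
        <varname>pagemap</varname> B+tree and a toy frame pool stands in for
        the frame allocator; all names (<code>pagemap_t</code>,
        <code>frame_alloc_sketch()</code>,
        <code>shared_page_fault()</code>) are assumptions made for this
        illustration.</para>

        <programlisting><![CDATA[
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAMES  8          /* hypothetical tiny frame pool */
#define MAX_ENTRIES 8          /* hypothetical pagemap capacity */
#define NO_FRAME    SIZE_MAX   /* returned when the fault cannot be served */

/* Per-frame reference counts; zero means the frame is free. */
static unsigned int frame_refcount[MAX_FRAMES];

/* Linear array used here in place of the pagemap B+tree. */
typedef struct {
    uintptr_t page[MAX_ENTRIES];
    size_t frame[MAX_ENTRIES];
    size_t count;
} pagemap_t;

/* Toy allocator: hand out the first free frame with one reference. */
size_t frame_alloc_sketch(void)
{
    for (size_t i = 0; i < MAX_FRAMES; i++) {
        if (frame_refcount[i] == 0) {
            frame_refcount[i] = 1;
            return i;
        }
    }
    return NO_FRAME;
}

/* Resolve a fault on a page of a shared area. */
size_t shared_page_fault(pagemap_t *pm, uintptr_t page)
{
    /* Already in the pagemap: take one more reference and reuse it. */
    for (size_t i = 0; i < pm->count; i++) {
        if (pm->page[i] == page) {
            size_t frame = pm->frame[i];

            frame_refcount[frame]++;
            return frame;
        }
    }

    if (pm->count == MAX_ENTRIES)
        return NO_FRAME;

    /* Not yet shared: proceed as in the non-shared case. */
    size_t frame = frame_alloc_sketch();
    if (frame == NO_FRAME)
        return NO_FRAME;

    /* Extra reference on behalf of the pagemap entry itself. */
    frame_refcount[frame]++;
    pm->page[pm->count] = page;
    pm->frame[pm->count] = frame;
    pm->count++;
    return frame;
}
]]></programlisting>

        <para>In this sketch, a second address space faulting on the same
        page receives the frame recorded by the first one instead of a fresh
        allocation, and the frame stays referenced for as long as the
        <varname>pagemap</varname> entry exists.</para>
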
      </section>
    </section>

    <section>
      <indexterm>
        <primary>address space</primary>

        <secondary>- ASID</secondary>
      </indexterm>

      <title>Address Space ID (ASID)</title>

      <para>Modern processor architectures optimize TLB utilization by
      associating TLB entries with address spaces through assigning
      identification numbers to them. In HelenOS, the term ASID, originally
      taken from the mips32 terminology, is used to refer to the address space
      identification number. The advantage of having ASIDs is that the TLB
      does not have to be invalidated on thread context switch as long as
      ASIDs are unique. Unfortunately, the architectures supported by HelenOS
      all use different widths of ASID numbers<footnote>
          <para>amd64 and ia32 do not use a similar abstraction at all, mips32
          has 8-bit ASIDs and ia64 can have ASIDs between 18 and 24 bits
          wide.</para>
        </footnote>, none of which is sufficient. The amd64 and ia32
      architectures cannot make use of ASIDs as their TLB does not recognize
      such an abstraction. Other architectures have support for ASIDs, but,
      for instance, ppc32 does not make use of them in the current version of
      HelenOS. The rest of the architectures do use ASIDs. However, even on
      the ia64 architecture, the minimal supported width of ASID<footnote>
          <para>RID in ia64 terminology.</para>
        </footnote> is insufficient to provide a unique integer identifier to
      all address spaces that might hypothetically coexist in the running
      system. The situation on mips32 is even worse: the architecture has only
      256 unique identifiers.</para>

      <indexterm>
        <primary>address space</primary>

        <secondary>- ASID stealing</secondary>
      </indexterm>

      <para>To mitigate the shortage of ASIDs, HelenOS uses the following
      strategy. When the system initializes, a FIFO queue<footnote>
          <para>Note that architecture-specific measures are taken to avoid
          an overly large FIFO queue. For instance, seven consecutive ia64
          RIDs are grouped to form one HelenOS ASID.</para>
        </footnote> is created and filled with all available ASIDs. Moreover,
      every address space remembers the number of processors on which it is
      active. Address spaces that have a valid ASID and that are not active on
      any processor are appended to the list of inactive address spaces with
      a valid ASID. When an address space needs to be assigned a valid ASID,
      the FIFO queue is checked first. If it contains at least one ASID, the
      ASID is allocated. If the queue is empty, an ASID is simply stolen from
      the first address space in the list. In that case, the address space
      that loses the ASID in favor of another address space is removed from
      the list. After the new ASID is purged from all TLBs, it can be used by
      the address space. Note that this approach works because the number of
      ASIDs is greater than the maximal number of processors supported by
      HelenOS and because there can be only one active address space per
      processor. In other words, when the FIFO queue is empty, there must be
      address spaces that are not active on any processor.</para>
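
      <para>The allocation and stealing strategy can be sketched as follows.
      The sketch is deliberately simplified and makes several assumptions:
      there is no locking, only four ASIDs exist, the TLB purge is an empty
      placeholder, and all identifiers (<code>as_t</code>,
      <code>asid_init()</code>, <code>as_activate()</code>,
      <code>as_deactivate()</code>) are hypothetical names chosen for this
      example.</para>

      <programlisting><![CDATA[
#include <stddef.h>

#define ASID_COUNT   4      /* hypothetical; real widths are per architecture */
#define ASID_INVALID (-1)

typedef struct as {
    int asid;                  /* ASID_INVALID when none is assigned */
    unsigned int cpu_refcount; /* processors on which the space is active */
    struct as *next;           /* link in the inactive list */
} as_t;

static int asid_fifo[ASID_COUNT];
static size_t fifo_head;
static size_t fifo_len;
static as_t *inactive_head;
static as_t *inactive_tail;

/* Placeholder for the architecture-specific purge of one ASID from all TLBs. */
static void tlb_purge_asid(int asid)
{
    (void) asid;
}

/* Fill the FIFO queue with all available ASIDs. */
void asid_init(void)
{
    for (int i = 0; i < ASID_COUNT; i++)
        asid_fifo[i] = i;
    fifo_head = 0;
    fifo_len = ASID_COUNT;
    inactive_head = NULL;
    inactive_tail = NULL;
}

static void inactive_append(as_t *as)
{
    as->next = NULL;
    if (inactive_tail != NULL)
        inactive_tail->next = as;
    else
        inactive_head = as;
    inactive_tail = as;
}

static void inactive_remove(as_t *as)
{
    as_t *prev = NULL;

    for (as_t *cur = inactive_head; cur != NULL; prev = cur, cur = cur->next) {
        if (cur != as)
            continue;
        if (prev != NULL)
            prev->next = cur->next;
        else
            inactive_head = cur->next;
        if (inactive_tail == cur)
            inactive_tail = prev;
        cur->next = NULL;
        return;
    }
}

/* Make the address space active on one more processor.
 * Returns 0 on success, -1 if no ASID could be obtained. */
int as_activate(as_t *as)
{
    if (as->asid != ASID_INVALID) {
        /* Still owns an ASID; leave the inactive list if it was idle. */
        if (as->cpu_refcount == 0)
            inactive_remove(as);
    } else if (fifo_len > 0) {
        /* Take a free ASID from the FIFO queue. */
        as->asid = asid_fifo[fifo_head];
        fifo_head = (fifo_head + 1) % ASID_COUNT;
        fifo_len--;
    } else {
        /* Queue is empty: steal from the first inactive address space. */
        as_t *victim = inactive_head;

        if (victim == NULL)
            return -1;
        inactive_remove(victim);
        as->asid = victim->asid;
        victim->asid = ASID_INVALID;
        tlb_purge_asid(as->asid);
    }
    as->cpu_refcount++;
    return 0;
}

/* Drop one processor; an idle space with an ASID becomes a steal candidate. */
void as_deactivate(as_t *as)
{
    if (as->cpu_refcount > 0 && --as->cpu_refcount == 0)
        inactive_append(as);
}
]]></programlisting>

      <para>The failure return in <code>as_activate()</code> exists only
      because this toy version does not enforce that the number of ASIDs
      exceeds the number of processors; under the invariant stated above, the
      inactive list cannot be empty when the queue is.</para>
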
    </section>
  </section>
</chapter>