<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <para>In previous chapters, this book described the scheduling subsystem as
  the creator of the impression that threads execute in parallel. The memory
  management subsystem, on the other hand, creates the impression that there
  is enough physical memory for the kernel and that userspace tasks have the
  entire address space only for themselves.</para>

  <section>
    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <title>Zones and frames</title>

      <para>HelenOS represents contiguous areas of physical memory in
      structures called frame zones (abbreviated as zones). Each zone contains
      information about the number of allocated and unallocated physical
      memory frames, as well as the physical base address of the zone and the
      number of frames contained in it. A zone also contains an array of frame
      structures describing each frame of the zone and, last but not least,
      each zone is equipped with a buddy system that facilitates effective
      allocation of power-of-two sized blocks of frames.</para>

      <para>This organization of physical memory provides good preconditions
      for hot-plugging of more zones. There is also one currently unused zone
      attribute: <code>flags</code>. The attribute could be used to give a
      special meaning to some zones in the future.</para>

      <para>The zones are linked in a doubly-linked list. This might seem a
      bit ineffective because the zone list is walked every time a frame is
      allocated or deallocated. However, this does not represent a significant
      performance problem as it is expected that the number of zones will be
      rather low. Moreover, most architectures merge all zones into
      one.</para>

      <para>Every physical memory frame in a zone is described by a structure
      that contains the number of references and other data used by the buddy
      system.</para>
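
      <para>The zone bookkeeping described above can be sketched as follows.
      This is an illustrative sketch only: the field names, types and the
      assumed frame size are assumptions for the example, not the actual
      HelenOS definitions.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of a frame zone and its frame array; field
 * names and types are hypothetical, not HelenOS's definitions. */
typedef struct frame {
    uint16_t refcount;    /* number of references to the frame */
    uint8_t buddy_order;  /* data used by the zone's buddy system */
} frame_t;

typedef struct zone {
    struct zone *prev;    /* links in the doubly-linked list of zones */
    struct zone *next;
    uintptr_t base;       /* physical base address of the zone */
    size_t count;         /* number of frames contained in the zone */
    size_t free_count;    /* unallocated frames */
    size_t busy_count;    /* allocated frames */
    int flags;            /* currently unused */
    frame_t *frames;      /* one frame_t per frame of the zone */
} zone_t;

#define FRAME_SIZE 4096   /* assumed frame size for this sketch */

/* Map a physical address within the zone to an index into zone->frames. */
size_t zone_frame_index(const zone_t *zone, uintptr_t pa)
{
    return (pa - zone->base) / FRAME_SIZE;
}
```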
    </section>

    <section id="frame_allocator">
      <indexterm>
        <primary>frame allocator</primary>
      </indexterm>

      <title>Frame allocator</title>

      <para>The frame allocator satisfies kernel requests to allocate
      power-of-two sized blocks of physical memory. Because of the zonal
      organization of physical memory, the frame allocator always works
      within the context of a particular frame zone. In order to carry out
      the allocation requests, the frame allocator is tightly integrated with
      the buddy system belonging to the zone. The frame allocator is also
      responsible for updating information about the number of free and busy
      frames in the zone. <figure>
          <mediaobject id="frame_alloc">
            <imageobject role="eps">
              <imagedata fileref="images.vector/frame_alloc.eps" format="EPS" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/frame_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images.vector/frame_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Frame allocator scheme.</title>
        </figure></para>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request via the <code>frame_alloc()</code>
        function, the frame allocator first tries to find a zone that can
        satisfy the request (i.e. has the required number of free frames).
        Once a suitable zone is found, the frame allocator uses the buddy
        allocator on the zone's buddy system to perform the allocation.
        During deallocation, which is triggered by a call to
        <code>frame_free()</code>, the frame allocator looks up the zone
        that contains the frame being deallocated. Afterwards, it calls the
        buddy allocator again, this time to take care of deallocation within
        the zone's buddy system.</para>
      </formalpara>
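
      <para>The zone lookup performed during allocation can be sketched as
      follows. This is a minimal sketch under assumed names; only the fields
      needed for the search are shown, and the real HelenOS code differs.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Minimal illustrative zone descriptor for the search only. */
typedef struct zone {
    struct zone *next;   /* next zone in the zone list */
    size_t free_count;   /* number of unallocated frames */
} zone_t;

/* Walk the zone list and return the first zone with enough free
 * frames to satisfy a request for 2^order frames, or NULL if none. */
zone_t *zone_find_suitable(zone_t *head, unsigned order)
{
    size_t needed = (size_t) 1 << order;
    for (zone_t *z = head; z != NULL; z = z->next) {
        if (z->free_count >= needed)
            return z;
    }
    return NULL;
}
```

Once such a zone is found, the request is handed to that zone's buddy system; a NULL result means the request cannot be satisfied by any zone.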
    </section>

    <section id="buddy_allocator">
      <indexterm>
        <primary>buddy system</primary>
      </indexterm>

      <title>Buddy allocator</title>

      <para>In the buddy system, the memory is broken down into power-of-two
      sized, naturally aligned blocks. These blocks are organized in an array
      of lists, in which the list with index <emphasis>i</emphasis> contains
      all unallocated blocks of size
      <emphasis>2<superscript>i</superscript></emphasis>. The index
      <emphasis>i</emphasis> is called the order of the block. Should there be
      two adjacent equally sized blocks in the list <emphasis>i</emphasis>
      (i.e. buddies), the buddy allocator would coalesce them and put the
      resulting block in the list <emphasis>i + 1</emphasis>, provided that
      the resulting block would be naturally aligned. Similarly, when the
      allocator is asked to allocate a block of size
      <emphasis>2<superscript>i</superscript></emphasis>, it first tries to
      satisfy the request from the list with index <emphasis>i</emphasis>. If
      the request cannot be satisfied (i.e. the list <emphasis>i</emphasis> is
      empty), the buddy allocator will try to allocate and split a larger
      block from the list with index <emphasis>i + 1</emphasis>. Both of these
      algorithms are recursive. The recursion ends either when there are no
      blocks to coalesce in the former case or when there are no blocks that
      can be split in the latter case.</para>
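
      <para>The split and coalesce recursion can be illustrated with a small
      index-based simulation. It is purely illustrative: it tracks free blocks
      by frame index and order in plain arrays, whereas HelenOS runs the same
      recursion over the frame structures of a zone.</para>

```c
#include <assert.h>

/* free_heads[i] heads a list of free, naturally aligned blocks of 2^i
 * frames, linked through next_free[] by frame index; -1 ends a list. */
#define MAX_ORDER 4
#define NFRAMES (1 << MAX_ORDER)

static int free_heads[MAX_ORDER + 1];
static int next_free[NFRAMES];

void buddy_init(void)
{
    for (int i = 0; i <= MAX_ORDER; i++)
        free_heads[i] = -1;
    free_heads[MAX_ORDER] = 0;  /* one maximal free block at index 0 */
    next_free[0] = -1;
}

/* Allocate a block of 2^order frames; when the list of this order is
 * empty, recursively split a block of order + 1. Returns -1 on failure. */
int buddy_alloc(unsigned order)
{
    if (order > MAX_ORDER)
        return -1;                     /* no block can be split */
    if (free_heads[order] != -1) {
        int idx = free_heads[order];
        free_heads[order] = next_free[idx];
        return idx;
    }
    int big = buddy_alloc(order + 1);  /* recurse into order i + 1 */
    if (big == -1)
        return -1;
    int upper = big + (1 << order);    /* upper half stays free at order i */
    next_free[upper] = free_heads[order];
    free_heads[order] = upper;
    return big;
}

/* Free a block, coalescing with its buddy while the buddy is free.
 * The buddy of index idx at order k is idx with bit k flipped. */
void buddy_free(int idx, unsigned order)
{
    while (order < MAX_ORDER) {
        int buddy = idx ^ (1 << order);
        int *p = &free_heads[order];
        while (*p != -1 && *p != buddy)
            p = &next_free[*p];
        if (*p != buddy)
            break;                     /* buddy not free: stop coalescing */
        *p = next_free[buddy];         /* unlink buddy, merge upwards */
        if (buddy < idx)
            idx = buddy;
        order++;
    }
    next_free[idx] = free_heads[order];
    free_heads[order] = idx;
}
```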

      <para>This approach greatly reduces external fragmentation of memory and
      helps in allocating bigger contiguous blocks of memory aligned to their
      size. On the other hand, the buddy allocator suffers from increased
      internal fragmentation of memory and is not suitable for general kernel
      allocations. This purpose is better addressed by the <link
      linkend="slab">slab allocator</link>.<figure>
          <mediaobject id="buddy_alloc">
            <imageobject role="eps">
              <imagedata fileref="images.vector/buddy_alloc.eps" format="EPS" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/buddy_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Buddy system scheme.</title>
        </figure></para>

      <section>
        <title>Implementation</title>

        <para>The buddy allocator is, in fact, an abstract framework which can
        be easily specialized to serve one particular task. It knows nothing
        about the nature of the memory it helps to allocate. To compensate for
        this lack of knowledge, the buddy allocator exports an interface that
        each of its clients is required to implement. When supplied with an
        implementation of this interface, the buddy allocator can use
        specialized external functions to find a buddy for a block, split and
        coalesce blocks, manipulate block order and mark blocks busy or
        available.</para>
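
        <para>Such an interface can be sketched as a table of client-supplied
        operations. The names and signatures below are illustrative
        assumptions, not the exact HelenOS declarations.</para>

```c
#include <assert.h>
#include <stdint.h>

/* Opaque types standing in for the framework's notions of a block and
 * a buddy system instance; purely illustrative. */
typedef struct block block_t;
typedef struct buddy_system buddy_system_t;

/* Client-supplied operations the abstract buddy allocator calls. */
typedef struct buddy_ops {
    block_t *(*find_buddy)(buddy_system_t *, block_t *);
    block_t *(*bisect)(buddy_system_t *, block_t *);
    block_t *(*coalesce)(buddy_system_t *, block_t *, block_t *);
    void (*set_order)(buddy_system_t *, block_t *, uint8_t);
    uint8_t (*get_order)(buddy_system_t *, block_t *);
    void (*mark_busy)(buddy_system_t *, block_t *);
    void (*mark_available)(buddy_system_t *, block_t *);
} buddy_ops_t;

struct buddy_system {
    const buddy_ops_t *ops;  /* supplied by the client (e.g. a zone) */
    uint8_t max_order;
};

struct block {
    uint8_t order;
};

/* Example client implementation of one operation. */
static uint8_t demo_get_order(buddy_system_t *b, block_t *blk)
{
    (void) b;
    return blk->order;
}
```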

        <formalpara>
          <title>Data organization</title>

          <para>Each entity allocable by the buddy allocator is required to
          contain space for storing the block order number and a link variable
          used to interconnect blocks within the same order.</para>

          <para>Whatever entities are allocated by the buddy allocator, the
          first entity within a block is used to represent the entire block.
          The first entity keeps the order of the whole block. Other entities
          within the block are assigned the magic value
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially important
          for effective identification of buddies in a one-dimensional array
          because the entity that represents a potential buddy cannot be
          associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
          is associated with <constant>BUDDY_INNER_BLOCK</constant>, then it
          is not a buddy).</para>
        </formalpara>
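
        <para>The marking scheme can be sketched as follows; the concrete
        value chosen for <constant>BUDDY_INNER_BLOCK</constant> is an
        assumption of this sketch.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed magic value for inner entities of a block. */
#define BUDDY_INNER_BLOCK 0xff

/* The first entity of a block keeps the block's order; all inner
 * entities are stamped with the magic value. */
void block_mark(uint8_t *orders, size_t first, uint8_t order)
{
    orders[first] = order;
    for (size_t i = 1; i < ((size_t) 1 << order); i++)
        orders[first + i] = BUDDY_INNER_BLOCK;
}

/* An entity can only represent a potential buddy if it is not an
 * inner entity of some larger block. */
bool is_block_head(const uint8_t *orders, size_t idx)
{
    return orders[idx] != BUDDY_INNER_BLOCK;
}
```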
      </section>
    </section>

    <section id="slab">
      <indexterm>
        <primary>slab allocator</primary>
      </indexterm>

      <title>Slab allocator</title>

      <para>The majority of memory allocation requests in the kernel are for
      small, frequently used data structures. The basic idea behind the slab
      allocator is that commonly used objects are preallocated in contiguous
      areas of physical memory called slabs<footnote>
          <para>Slabs are in fact blocks of physical memory frames allocated
          from the frame allocator.</para>
        </footnote>. Whenever an object is to be allocated, the slab allocator
      returns the first available item from a suitable slab corresponding to
      the object type<footnote>
          <para>The mechanism is in fact more complicated, see the next
          paragraph.</para>
        </footnote>. Because the sizes of the requested and allocated objects
      match, the slab allocator significantly reduces internal
      fragmentation.</para>

      <indexterm>
        <primary>slab allocator</primary>

        <secondary>- slab cache</secondary>
      </indexterm>

      <para>Slabs of one object type are organized in a structure called a
      slab cache. There are usually multiple slabs in the slab cache,
      depending on previous allocations. If the slab cache runs out of
      available slabs, new slabs are allocated. In order to exploit
      parallelism and to avoid locking of shared spinlocks, slab caches can
      have variants of processor-private slabs called magazines. On each
      processor, there is a two-magazine cache. Full magazines that are not
      part of any per-processor magazine cache are stored in a global list of
      full magazines.</para>

      <indexterm>
        <primary>slab allocator</primary>

        <secondary>- magazine</secondary>
      </indexterm>

      <para>Each object begins its life in a slab. When it is allocated from
      there, the slab allocator calls a constructor that is registered in the
      respective slab cache. The constructor initializes and brings the object
      into a known state. The object is then used by the user. When the user
      later frees the object, the slab allocator puts it into a processor
      private <indexterm>
          <primary>slab allocator</primary>

          <secondary>- magazine</secondary>
        </indexterm>magazine cache, from where it can be allocated again.
      Note that allocations satisfied from a magazine are already initialized
      by the constructor. When both of the processor cached magazines get
      full, the allocator will move one of the magazines to the list of full
      magazines. Similarly, when allocating from an empty processor magazine
      cache, the kernel will reload only one magazine from the list of full
      magazines. In other words, the slab allocator tries to keep the
      processor magazine cache only half-full in order to prevent thrashing
      when allocations and deallocations interleave on magazine boundaries.
      The advantage of this setup is that during most of the allocations, no
      global spinlock needs to be held.</para>
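
      <para>The role of the registered constructor can be sketched as
      follows. The structure and function names here are assumptions for the
      sketch, not the HelenOS API.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative slab cache descriptor with registered callbacks. */
typedef struct slab_cache {
    size_t obj_size;
    void (*constructor)(void *obj);
    void (*destructor)(void *obj);
} slab_cache_t;

typedef struct example_obj {
    int state;
} example_obj_t;

static void example_ctor(void *obj)
{
    ((example_obj_t *) obj)->state = 42;  /* bring to a known state */
}

/* The constructor runs when an object first comes from a slab; objects
 * cycled through a magazine keep their initialized state, so it is not
 * run again on magazine hits. */
void *slab_obj_init(slab_cache_t *cache, void *raw)
{
    if (cache->constructor)
        cache->constructor(raw);
    return raw;
}
```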

      <para>Should HelenOS run short of memory, it would start deallocating
      objects from magazines, calling the slab cache destructor on them and
      putting them back into slabs. When a slab contains no allocated
      objects, it is immediately freed.</para>

      <para>
        <figure>
          <mediaobject id="slab_alloc">
            <imageobject role="eps">
              <imagedata fileref="images.vector/slab_alloc.eps" format="EPS" />
            </imageobject>

            <imageobject role="html">
              <imagedata fileref="images/slab_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images.vector/slab_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Slab allocator scheme.</title>
        </figure>
      </para>

      <section>
        <title>Implementation</title>

        <para>The slab allocator is closely modelled after the slab allocator
        by Jeff Bonwick and Jonathan Adams <xref linkend="Bonwick01" /> with
        the following exceptions:<itemizedlist>
            <listitem>
              <para>empty slabs are immediately deallocated and</para>
            </listitem>

            <listitem>
              <para>empty magazines are deallocated when not needed.</para>
            </listitem>
          </itemizedlist>The following features are not currently supported
        but would be easy to implement: <itemizedlist>
            <listitem>cache coloring and</listitem>

            <listitem>dynamic magazine growth (different magazine sizes are
            already supported, but the allocation strategy would need to be
            adjusted).</listitem>
          </itemizedlist></para>

        <section>
          <title>Allocation/deallocation</title>

          <para>The following two paragraphs summarize and complete the
          description of the slab allocator operation (i.e. the
          <code>slab_alloc()</code> and <code>slab_free()</code>
          functions).</para>

          <formalpara>
            <title>Allocation</title>

            <para><emphasis>Step 1.</emphasis> When an allocation request
            comes, the slab allocator checks the availability of memory in the
            current magazine of the local processor magazine cache. If an
            object is available there, the allocator simply pops it from the
            magazine and returns it.</para>

            <para><emphasis>Step 2.</emphasis> If the current magazine in the
            processor magazine cache is empty, the allocator will attempt to
            swap it with the last magazine from the cache and return to the
            first step. If the last magazine is also empty, the algorithm will
            fall through to Step 3.</para>

            <para><emphasis>Step 3.</emphasis> Now the allocator is in the
            situation when both magazines in the processor magazine cache are
            empty. The allocator reloads one magazine from the shared list of
            full magazines. If the reload is successful (i.e. there are full
            magazines in the list), the algorithm continues with Step
            1.</para>

            <para><emphasis>Step 4.</emphasis> In this fail-safe step, an
            object is allocated from the conventional slab layer and a pointer
            to it is returned.</para>
          </formalpara>

<formalpara>
<title>Deallocation</title>

<para><emphasis>Step 1.</emphasis> During a deallocation request,
the slab allocator checks whether the current magazine of the local
processor magazine cache has spare capacity. If it does, the pointer
to the object is simply pushed into the magazine and the algorithm
returns.</para>

<para><emphasis>Step 2.</emphasis> If the current magazine is
full, the allocator attempts to swap it with the last magazine
from the cache and returns to the first step. If the last magazine
is also full, the algorithm falls through to Step 3.</para>

<para><emphasis>Step 3.</emphasis> Now the allocator is in the
situation when both magazines in the processor magazine cache are
full. The allocator tries to allocate a new empty magazine and
flush one of the full magazines to the shared list of full
magazines. If it succeeds, the algorithm continues with Step 1.</para>

<para><emphasis>Step 4.</emphasis> In a low memory condition, when
the allocation of an empty magazine fails, the object is returned
directly to the slab layer. Thus even in the worst case, object
deallocation never needs to allocate any additional memory.</para>
</formalpara>
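<para>The deallocation steps above can be sketched in C. The following is
a minimal, single-threaded model: the magazine capacity
<code>MAG_SIZE</code>, all type and function names, and the
caller-supplied magazine allocator are illustrative assumptions and do not
match the actual HelenOS sources, and the flush of a full magazine to the
shared list is omitted.</para>

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of the per-processor magazine cache; the names
 * and the magazine size are made up for this sketch. */
#define MAG_SIZE 4

typedef struct {
    void *objs[MAG_SIZE];
    size_t busy;                /* number of objects in the magazine */
} magazine_t;

typedef struct {
    magazine_t *current;        /* current magazine */
    magazine_t *last;           /* last (spare) magazine */
} mag_cache_t;

static bool mag_full(const magazine_t *mag)
{
    return mag->busy == MAG_SIZE;
}

/* Returns true if the object was cached in a magazine; false means the
 * fail-safe Step 4 was taken and the caller must return the object
 * directly to the slab layer. */
bool mag_obj_free(mag_cache_t *cache, void *obj,
                  magazine_t *(*alloc_empty_mag)(void))
{
    while (true) {
        /* Step 1: push into the current magazine if it has room. */
        if (!mag_full(cache->current)) {
            cache->current->objs[cache->current->busy++] = obj;
            return true;
        }

        /* Step 2: swap with the last magazine if that one is not full. */
        if (!mag_full(cache->last)) {
            magazine_t *tmp = cache->current;
            cache->current = cache->last;
            cache->last = tmp;
            continue;
        }

        /* Step 3: both magazines are full; try to obtain a new empty
         * magazine (flushing a full one to the shared list of full
         * magazines is omitted in this sketch). */
        magazine_t *empty = alloc_empty_mag();
        if (empty != NULL) {
            cache->last = cache->current;
            cache->current = empty;
            continue;
        }

        /* Step 4: low memory; deallocate without allocating anything. */
        return false;
    }
}
```

<para>Note how the loop structure guarantees the property stated in Step 4:
the only allocation attempt is optional, so the path that returns the
object to the slab layer never allocates memory itself.</para>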
</section>
</section>
</section>
</section>

<section>
<title>Virtual memory management</title>

<para>Virtual memory is essential for an operating system because it makes
several things possible. First, it helps to isolate tasks from each other
by encapsulating them in their private address spaces. Second, virtual
memory can give tasks the impression of more memory available than is
actually present. And third, by using virtual memory, there might be
multiple copies of the same program, linked to the same addresses, running
in the system. There are at least two known mechanisms for implementing
virtual memory: segmentation and paging. Even though some processor
architectures supported by HelenOS<footnote>
<para>ia32 has full-fledged segmentation.</para>
</footnote> provide both mechanisms, the kernel makes use solely of
paging.</para>

<section id="paging">
<title>VAT subsystem</title>

<para>In a paged virtual memory, the entire virtual address space is
divided into small power-of-two sized, naturally aligned blocks called
pages. The processor implements a translation mechanism that allows the
operating system to manage mappings between the set of pages and the set of
identically sized and identically aligned pieces of physical memory
called frames. As a result, references to contiguous virtual memory
areas don't necessarily need to reference contiguous areas of physical
memory. Supported page sizes usually range from several kilobytes to
several megabytes. Each page that takes part in the mapping is
associated with certain attributes that further describe the mapping
(e.g. access rights, dirty and accessed bits and the present bit).</para>

<para>When the processor accesses a page that is not present (i.e. its
present bit is not set), the operating system is notified through a
special exception called a page fault. It is then up to the operating
system to service the page fault. In HelenOS, some page faults are fatal
and result in either task termination or, in the worst case, kernel
panic<footnote>
<para>Such a condition would be caused either by a hardware failure
or by a bug in the kernel.</para>
</footnote>, while other page faults are used to load memory on demand
or to notify the kernel about certain events.</para>

<indexterm>
<primary>page tables</primary>
</indexterm>

<para>The set of all page mappings is stored in a memory structure
called page tables. Some architectures have no hardware support for page
tables<footnote>
<para>On mips32, a TLB-only model is used and the operating system is
responsible for managing software defined page tables.</para>
</footnote> while other processor architectures<footnote>
<para>Like amd64 and ia32.</para>
</footnote> understand the whole in-memory format of the page tables.
Despite all the possible differences in page table formats, the HelenOS VAT
subsystem<footnote>
<para>Virtual Address Translation subsystem.</para>
</footnote> unifies the page table operations under one programming
interface. For all parts of the kernel, three basic functions are
provided:</para>

<itemizedlist>
<listitem>
<para><code>page_mapping_insert()</code>,</para>
</listitem>

<listitem>
<para><code>page_mapping_find()</code> and</para>
</listitem>

<listitem>
<para><code>page_mapping_remove()</code>.</para>
</listitem>
</itemizedlist>

<para>The <code>page_mapping_insert()</code> function is used to
introduce a mapping for one virtual memory page belonging to a
particular address space into the page tables. Once the mapping is in
the page tables, it can be looked up by <code>page_mapping_find()</code>
and removed by <code>page_mapping_remove()</code>. All of these
functions internally select the page table mechanism specific functions
that carry out the actual operation.</para>
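<para>The dispatch pattern described above can be sketched as a table of
function pointers selected once for the configured mechanism. The types,
signatures and the <code>page_ops</code> variable below are hypothetical
illustrations, not the actual HelenOS definitions.</para>

```c
#include <stdint.h>
#include <stddef.h>

/* A sketch of how one generic page-mapping interface can hide several
 * page table mechanisms behind a table of function pointers. */
typedef struct as as_t;         /* opaque address space type */

typedef struct {
    void      (*mapping_insert)(as_t *as, uintptr_t page,
                                uintptr_t frame, unsigned int flags);
    uintptr_t (*mapping_find)(as_t *as, uintptr_t page);
    void      (*mapping_remove)(as_t *as, uintptr_t page);
} page_mapping_ops_t;

/* Selected once, e.g. during boot, according to the configured
 * mechanism (hierarchical page tables or global page hash table). */
static const page_mapping_ops_t *page_ops;

void page_mapping_insert(as_t *as, uintptr_t page, uintptr_t frame,
                         unsigned int flags)
{
    page_ops->mapping_insert(as, page, frame, flags);
}

uintptr_t page_mapping_find(as_t *as, uintptr_t page)
{
    return page_ops->mapping_find(as, page);
}

void page_mapping_remove(as_t *as, uintptr_t page)
{
    page_ops->mapping_remove(as, page);
}
```

<para>The rest of the kernel calls only the three generic functions; adding
another mechanism (e.g. the B+tree mentioned below) amounts to providing
another operations table.</para>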

<para>There are currently two supported mechanisms: generic 4-level
hierarchical page tables and the global page hash table. Both
mechanisms are generic in the sense that each covers several hardware
platforms. For instance, the 4-level hierarchical page table mechanism is
used by amd64, ia32, mips32 and ppc32, even though these architectures
have different native page table formats: 4-level, 2-level, TLB-only and
hardware hash table, respectively. On the other hand, the global page
hash table is used on ia64, which can be TLB-only or use a hardware hash
table. Although only two mechanisms are currently implemented, other
mechanisms (e.g. B+tree) can be easily added.</para>

<section id="page_tables">
<indexterm>
<primary>page tables</primary>

<secondary>- hierarchical</secondary>
</indexterm>

<title>Hierarchical 4-level page tables</title>

<para>Hierarchical 4-level page tables are a generalization of the
frequently used hierarchical model of page tables. In this mechanism,
each address space has its own page tables. To avoid confusion in
terminology used by hardware vendors, in HelenOS, the root level page
table is called PTL0, the two middle levels are called PTL1 and PTL2,
and, finally, the leaf level is called PTL3. All architectures using
this mechanism are required to use PTL0 and PTL3. However, the middle
levels can be left out, depending on the hardware hierarchy or the
structure of software-only page tables. The genericity is achieved
through a set of macros that define transitions from one level to
another. Unused levels are optimised out by the compiler.</para>
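<para>The macro mechanism can be illustrated on a hypothetical two-level
architecture expressed in the 4-level model. All macro names, index
widths (10+10 bits, 4 KiB pages) and the walk function below are made up
for this sketch; they do not reproduce the actual HelenOS macros.</para>

```c
#include <stdint.h>

/* On this hypothetical two-level configuration the PTL1 and PTL2
 * transitions are identities, so the compiler optimises the unused
 * middle levels away. */
#define PTL0_INDEX(va)   (((va) >> 22) & 0x3ffu)
#define PTL1_INDEX(va)   0                        /* level left out */
#define PTL2_INDEX(va)   0                        /* level left out */
#define PTL3_INDEX(va)   (((va) >> 12) & 0x3ffu)

#define GET_PTL1_ADDRESS(ptl0, i)   ((uintptr_t *) (ptl0)[(i)])
#define GET_PTL2_ADDRESS(ptl1, i)   (ptl1)        /* identity */
#define GET_PTL3_ADDRESS(ptl2, i)   (ptl2)        /* identity */
#define GET_FRAME_ADDRESS(ptl3, i)  ((ptl3)[(i)])

/* A generic walk written against all four levels; on this
 * configuration it compiles down to a two-level lookup. */
uintptr_t pt_walk(uintptr_t *ptl0, uintptr_t va)
{
    uintptr_t *ptl1 = GET_PTL1_ADDRESS(ptl0, PTL0_INDEX(va));
    uintptr_t *ptl2 = GET_PTL2_ADDRESS(ptl1, PTL1_INDEX(va));
    uintptr_t *ptl3 = GET_PTL3_ADDRESS(ptl2, PTL2_INDEX(va));
    return GET_FRAME_ADDRESS(ptl3, PTL3_INDEX(va));
}
```

<para>Generic kernel code is always written against the full four levels;
only the macro definitions differ per architecture.</para>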
</section>

<section>
<indexterm>
<primary>page tables</primary>

<secondary>- hashing</secondary>
</indexterm>

<title>Global page hash table</title>

<para>Implementation of the global page hash table was encouraged by
64-bit architectures that can have rather sparse address spaces. The
hash table contains valid mappings only. Each entry of the hash table
contains an address space pointer, virtual memory page number (VPN),
physical memory frame number (PFN) and a set of flags. The pair of the
address space pointer and the virtual memory page number is used as a
key for the hash table. One of the major differences between the
global page hash table and hierarchical 4-level page tables is that
there is only a single global page hash table in the system while
hierarchical page tables exist per address space. Thus, the global
page hash table contains information about mappings of all address
spaces in the system.</para>
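<para>The entry layout and the composite key can be sketched as follows.
This is a toy model: the real implementation uses the generic (chained)
hash table type mentioned below, whereas this sketch uses a fixed-size
open-addressing table, and all names, sizes and the hash function are
assumptions made for the example.</para>

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define PHT_BUCKETS 64u   /* tiny, fixed size; illustration only */

typedef struct {
    void      *as;        /* address space pointer (part of the key) */
    uintptr_t  vpn;       /* virtual page number (part of the key) */
    uintptr_t  pfn;       /* physical frame number */
    unsigned   flags;
    bool       used;
} pht_entry_t;

static pht_entry_t pht[PHT_BUCKETS];

/* Mix both key components so that equal VPNs in different address
 * spaces land in different buckets. */
static size_t pht_hash(void *as, uintptr_t vpn)
{
    return (size_t) (((uintptr_t) as ^ (vpn * 2654435761u)) % PHT_BUCKETS);
}

bool pht_insert(void *as, uintptr_t vpn, uintptr_t pfn, unsigned flags)
{
    size_t i = pht_hash(as, vpn);
    for (size_t n = 0; n < PHT_BUCKETS; n++, i = (i + 1) % PHT_BUCKETS) {
        if (!pht[i].used) {
            pht[i] = (pht_entry_t) { as, vpn, pfn, flags, true };
            return true;
        }
    }
    return false;   /* table full */
}

pht_entry_t *pht_find(void *as, uintptr_t vpn)
{
    size_t i = pht_hash(as, vpn);
    for (size_t n = 0; n < PHT_BUCKETS; n++, i = (i + 1) % PHT_BUCKETS) {
        if (pht[i].used && pht[i].as == as && pht[i].vpn == vpn)
            return &pht[i];
        if (!pht[i].used)
            return NULL;
    }
    return NULL;
}
```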

<para>The global page hash table mechanism uses the generic hash table
type as described in the chapter dedicated to <link
linkend="hashtables">data structures</link> earlier in this
book.</para>
</section>
</section>
</section>

<section id="tlb">
<indexterm>
<primary>TLB</primary>
</indexterm>

<title>Translation Lookaside Buffer</title>

<para>Due to the extensive overhead of the several extra memory accesses
that a page table lookup requires on every instruction, modern
architectures deploy a fast associative cache of recently used page
mappings. This cache is called the TLB (Translation Lookaside Buffer)
and is present on every processor in the system. As has already been
pointed out, the TLB is the only page translation mechanism on some
architectures.</para>

<section id="tlb_shootdown">
<indexterm>
<primary>TLB</primary>

<secondary>- TLB shootdown</secondary>
</indexterm>

<title>TLB consistency</title>

<para>The operating system is responsible for keeping the TLB consistent
with the page tables. Whenever mappings are modified or purged from the
page tables, or when an address space identifier is reused, the kernel
needs to invalidate the respective contents of the TLB. Some TLB types
support partial invalidation of their content (e.g. ranges of pages or
address spaces) while other types can be invalidated only entirely. The
invalidation must be done on all processors because there is one TLB per
processor. Maintaining TLB consistency on multiprocessor configurations
is not as trivial as it might seem at first glance.</para>

<para>The remote TLB invalidation is called TLB shootdown. HelenOS uses
a simplified variant of the algorithm described in <xref
linkend="Black89" />.</para>
<para>TLB shootdown is performed in three phases.</para>

<formalpara>
<title>Phase 1.</title>

<para>The initiator clears its TLB flag and locks the global TLB
spinlock. The request is then enqueued into all other processors' TLB
shootdown message queues. When the TLB shootdown message queue is full
on any processor, the queue is purged and a single request to
invalidate the entire TLB is stored there. Once all the TLB shootdown
messages have been dispatched, the initiator sends all other processors
an interrupt to notify them about the incoming TLB shootdown message. It
then spins until all processors accept the interrupt and clear their
TLB flags.</para>
</formalpara>

<formalpara>
<title>Phase 2.</title>

<para>Except for the initiator, all other processors are spinning on
the TLB spinlock. The initiator is now free to modify the page tables
and purge its own TLB. The initiator then unlocks the global TLB
spinlock and sets its TLB flag.</para>
</formalpara>

<formalpara>
<title>Phase 3.</title>

<para>When the spinlock is unlocked by the initiator, the other
processors are sequentially granted the spinlock. However, once they
manage to lock it, they immediately release it. Each processor
invalidates its TLB according to the messages found in its TLB shootdown
message queue. In the end, each processor sets its TLB flag and resumes
its previous operation.</para>
</formalpara>
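<para>The queue-purging rule from Phase 1 can be sketched in isolation.
The queue length, the message types and all names below are illustrative
assumptions for this single-threaded model, not the actual HelenOS
definitions.</para>

```c
#include <stdint.h>
#include <stddef.h>

#define TLB_MESSAGE_QUEUE_LEN 4   /* tiny, for illustration */

typedef enum {
    TLB_INVL_PAGES,   /* invalidate a range of pages */
    TLB_INVL_ASID,    /* invalidate one address space */
    TLB_INVL_ALL      /* invalidate the entire TLB */
} tlb_invalidate_type_t;

typedef struct {
    tlb_invalidate_type_t type;
    uintptr_t page;
    size_t count;
} tlb_shootdown_msg_t;

typedef struct {
    tlb_shootdown_msg_t msgs[TLB_MESSAGE_QUEUE_LEN];
    size_t len;
} tlb_msg_queue_t;

/* Enqueue a shootdown request; when the queue overflows, purge it and
 * store a single request to invalidate the entire TLB instead. */
void tlb_shootdown_enqueue(tlb_msg_queue_t *q, tlb_invalidate_type_t type,
                           uintptr_t page, size_t count)
{
    if (q->len < TLB_MESSAGE_QUEUE_LEN) {
        q->msgs[q->len++] = (tlb_shootdown_msg_t) { type, page, count };
    } else {
        q->len = 1;
        q->msgs[0] = (tlb_shootdown_msg_t) { TLB_INVL_ALL, 0, 0 };
    }
}
```

<para>Collapsing an overflowing queue into one "invalidate all" request
keeps the queue bounded at the cost of over-invalidation, which is always
safe.</para>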
</section>

<section>
<title>Address spaces</title>

<section>
<indexterm>
<primary>address space</primary>

<secondary>- area</secondary>
</indexterm>

<title>Address space areas</title>

<para>Each address space consists of mutually disjoint contiguous
address space areas. An address space area is precisely defined by its
base address and the number of frames/pages it contains.</para>

<para>Each address space area is associated with a set of flags that
define the behaviour and permissions on the particular area. <itemizedlist>
<listitem><emphasis>AS_AREA_READ</emphasis> flag indicates reading
permission.</listitem>

<listitem><emphasis>AS_AREA_WRITE</emphasis> flag indicates
writing permission.</listitem>

<listitem><emphasis>AS_AREA_EXEC</emphasis> flag indicates code
execution permission. Some architectures do not support execution
permission restriction. In this case this flag has no
effect.</listitem>

<listitem><emphasis>AS_AREA_DEVICE</emphasis> marks the area as mapped
to device memory.</listitem>
</itemizedlist></para>
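<para>The flags above are naturally encoded as a bitmask. The bit values,
the <code>as_area_t</code> layout and the <code>as_area_access_ok()</code>
helper in the following sketch are hypothetical, chosen only to
illustrate the idea.</para>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit assignments; not the actual HelenOS values. */
#define AS_AREA_READ    (1 << 0)
#define AS_AREA_WRITE   (1 << 1)
#define AS_AREA_EXEC    (1 << 2)
#define AS_AREA_DEVICE  (1 << 3)

typedef struct {
    uintptr_t base;          /* base address of the area */
    size_t pages;            /* number of pages the area contains */
    unsigned int flags;      /* combination of AS_AREA_* flags */
} as_area_t;

/* Check that every requested access right is granted by the area. */
bool as_area_access_ok(const as_area_t *area, unsigned int access)
{
    return (area->flags & access) == access;
}
```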

<para>The kernel allows tasks to create, expand, shrink and share
address space areas via a set of syscalls.</para>
</section>

<section>
<indexterm>
<primary>address space</primary>

<secondary>- ASID</secondary>
</indexterm>

<title>Address Space ID (ASID)</title>

<para>Every task in the operating system has its own view of the
virtual memory. When performing a context switch between different
tasks, the kernel must switch the address space mapping as well. As
modern processors perform very aggressive caching of virtual mappings,
flushing the complete TLB on every context switch would be very
inefficient. To avoid such a performance penalty, some architectures
introduce an address space identifier, which allows storing several
different mappings inside the TLB.</para>

<para>The HelenOS kernel can take advantage of this hardware support
through an ASID abstraction. For example, on ia64 the kernel ASID is
derived from the RID (region identifier) and on mips32 the kernel ASID
is actually the hardware identifier. As expected, this ASID information
record is part of the <emphasis>as_t</emphasis> structure.</para>

<para>Due to hardware limitations, the hardware ASID has limited
length, ranging from 8 bits on mips32 to 24 bits on ia64, which makes it
impossible to use it as a unique address space identifier for all tasks
running in the system. In such situations a special ASID stealing
algorithm is used, which takes an ASID away from an inactive task and
assigns it to the active task.</para>

<indexterm>
<primary>address space</primary>

<secondary>- ASID stealing</secondary>
</indexterm>

<para>
<classname>ASID stealing algorithm here.</classname>
</para>
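<para>One possible shape of such an algorithm is sketched below: allocate
a free ASID if one exists, otherwise take one away from an inactive
address space. This is not HelenOS's actual algorithm; the tiny ASID
range, the <code>as_t</code> fields and all names are assumptions made
for this illustration.</para>

```c
#include <stdbool.h>
#include <stddef.h>

#define ASIDS_MAX 2   /* tiny for illustration; real widths are 8-24 bits */

typedef struct as {
    int asid;         /* -1 when the address space holds no ASID */
    bool active;      /* true when running on some processor */
} as_t;

static as_t *asid_owner[ASIDS_MAX];   /* current holder of each ASID */

/* Allocate an ASID for `as`; when none is free, steal one from an
 * inactive address space. Returns -1 if stealing is impossible. */
int asid_get(as_t *as, as_t **candidates, size_t n)
{
    for (int a = 0; a < ASIDS_MAX; a++) {
        if (asid_owner[a] == NULL) {
            asid_owner[a] = as;
            return as->asid = a;
        }
    }

    for (size_t i = 0; i < n; i++) {
        as_t *victim = candidates[i];
        if (victim != as && victim->asid != -1 && !victim->active) {
            int a = victim->asid;
            victim->asid = -1;   /* victim re-acquires an ASID later */
            asid_owner[a] = as;
            /* entries tagged with the stolen ASID must be invalidated
             * in all TLBs before it is reused */
            return as->asid = a;
        }
    }
    return -1;
}
```

<para>Note the TLB interaction: before a stolen ASID is reused, all TLB
entries tagged with it must be invalidated, which ties ASID stealing back
to the TLB shootdown mechanism described earlier.</para>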
</section>
</section>
</section>
</chapter>