<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <para>In previous chapters, this book described the scheduling subsystem as
  creating the impression that threads execute in parallel. The memory
  management subsystem, on the other hand, creates the impression that there
  is enough physical memory for the kernel and that userspace tasks have the
  entire address space to themselves.</para>

  <section>
    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <title>Zones and frames</title>

      <para>HelenOS represents contiguous areas of physical memory in
      structures called frame zones (abbreviated as zones). Each zone contains
      information about the number of allocated and unallocated physical
      memory frames as well as the physical base address of the zone and the
      number of frames contained in it. A zone also contains an array of frame
      structures describing each frame of the zone and, last but not least,
      each zone is equipped with a buddy system that facilitates effective
      allocation of power-of-two sized blocks of frames.</para>
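
      <para>The zone bookkeeping described above can be pictured as a C
      structure. The following is an illustrative sketch only; the field
      names are hypothetical and do not necessarily match the actual HelenOS
      definitions.</para>

<programlisting language="c"><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of per-frame and per-zone bookkeeping. */
typedef struct frame {
    unsigned refcount;      /* number of references to this frame */
    uint8_t buddy_order;    /* order of the buddy block this frame heads */
} frame_t;

typedef struct zone {
    struct zone *next;      /* zones are kept in a doubly-linked list */
    struct zone *prev;
    uintptr_t base;         /* physical base address of the zone */
    size_t count;           /* number of frames in the zone */
    size_t free_count;      /* unallocated frames */
    size_t busy_count;      /* allocated frames */
    int flags;              /* currently unused zone attribute */
    frame_t *frames;        /* one frame_t per physical frame */
} zone_t;

int main(void)
{
    frame_t frames[8] = { { 0, 0 } };
    zone_t z = { .base = 0x100000, .count = 8, .free_count = 8,
                 .busy_count = 0, .frames = frames };

    /* Allocating one frame updates the zone counters. */
    z.free_count--;
    z.busy_count++;
    frames[0].refcount = 1;

    assert(z.free_count + z.busy_count == z.count);
    printf("free=%zu busy=%zu\n", z.free_count, z.busy_count);
    return 0;
}
]]></programlisting>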

      <para>This organization of physical memory provides good preconditions
      for hot-plugging of more zones. There is also one currently unused zone
      attribute: <code>flags</code>. The attribute could be used to give a
      special meaning to some zones in the future.</para>

      <para>The zones are linked in a doubly-linked list. This might seem a
      bit inefficient because the zone list is walked every time a frame is
      allocated or deallocated. However, this does not represent a significant
      performance problem as it is expected that the number of zones will be
      rather low. Moreover, most architectures merge all zones into
      one.</para>

      <para>Every physical memory frame in a zone is described by a frame
      structure that contains the number of references to the frame and other
      data used by the buddy system.</para>
    </section>

    <section id="frame_allocator">
      <indexterm>
        <primary>frame allocator</primary>
      </indexterm>

      <title>Frame allocator</title>

      <para>The frame allocator satisfies kernel requests to allocate
      power-of-two sized blocks of physical memory. Because of the zonal
      organization of physical memory, the frame allocator always works
      within the context of a particular frame zone. In order to carry out
      allocation requests, the frame allocator is tightly integrated with the
      buddy system belonging to the zone. The frame allocator is also
      responsible for updating information about the number of free and busy
      frames in the zone. <figure>
          <mediaobject id="frame_alloc">
            <imageobject role="html">
              <imagedata fileref="images/frame_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images.vector/frame_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Frame allocator scheme.</title>
        </figure></para>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request via the function
        <code>frame_alloc</code>, the frame allocator first tries to find a
        zone that can satisfy the request (i.e. has the required number of
        free frames). Once a suitable zone is found, the frame allocator uses
        the buddy allocator on the zone's buddy system to perform the
        allocation. During deallocation, which is triggered by a call to
        <code>frame_free</code>, the frame allocator looks up the zone that
        contains the frame being deallocated. Afterwards, it calls the buddy
        allocator again, this time to take care of deallocation within the
        zone's buddy system.</para>
      </formalpara>
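
      <para>The zone-lookup part of the allocation path can be sketched as
      follows. The <code>zone_t</code> layout and the helper name
      <code>find_zone</code> are illustrative stand-ins, not the actual
      HelenOS code.</para>

<programlisting language="c"><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy model: each zone knows how many free frames it has. */
typedef struct zone {
    struct zone *next;
    size_t free_count;
} zone_t;

/* Walk the zone list and pick the first zone that can satisfy
 * a request for 2^order frames (hypothetical helper). */
static zone_t *find_zone(zone_t *head, unsigned order)
{
    for (zone_t *z = head; z != NULL; z = z->next)
        if (z->free_count >= ((size_t) 1 << order))
            return z;
    return NULL;  /* no zone can satisfy the request */
}

int main(void)
{
    zone_t big   = { .next = NULL, .free_count = 64 };
    zone_t small = { .next = &big, .free_count = 2 };

    assert(find_zone(&small, 0) == &small); /* 1 frame fits anywhere */
    assert(find_zone(&small, 4) == &big);   /* 16 frames need the big zone */
    assert(find_zone(&small, 7) == NULL);   /* 128 frames: no zone fits */
    puts("zone lookup ok");
    return 0;
}
]]></programlisting>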
    </section>

    <section id="buddy_allocator">
      <indexterm>
        <primary>buddy system</primary>
      </indexterm>

      <title>Buddy allocator</title>

      <para>In the buddy system, the memory is broken down into power-of-two
      sized, naturally aligned blocks. These blocks are organized in an array
      of lists, in which the list with index <mathphrase>i</mathphrase>
      contains all unallocated blocks of size
      <mathphrase>2<superscript>i</superscript></mathphrase>. The index
      <mathphrase>i</mathphrase> is called the order of the block. Should
      there be two adjacent equally sized blocks in the list
      <mathphrase>i</mathphrase> (i.e. buddies), the buddy allocator would
      coalesce them and put the resulting block in the list <mathphrase>i +
      1</mathphrase>, provided that the resulting block would be naturally
      aligned. Similarly, when the allocator is asked to allocate a block of
      size <mathphrase>2<superscript>i</superscript></mathphrase>, it first
      tries to satisfy the request from the list with index
      <mathphrase>i</mathphrase>. If the request cannot be satisfied (i.e.
      the list <mathphrase>i</mathphrase> is empty), the buddy allocator will
      try to allocate and split a larger block from the list with index
      <mathphrase>i + 1</mathphrase>. Both of these algorithms are recursive.
      The recursion ends either when there are no blocks to coalesce in the
      former case or when there are no blocks that can be split in the latter
      case.</para>
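
      <para>Natural alignment is what makes finding a buddy cheap: for a
      block starting at frame index i with order o, the buddy starts at
      index i XOR 2<superscript>o</superscript>, and the coalesced block is
      automatically aligned to its doubled size. A minimal sketch of this
      arithmetic (the helper name is illustrative):</para>

<programlisting language="c"><![CDATA[
#include <assert.h>
#include <stdio.h>

/* For naturally aligned power-of-two blocks, the buddy of the
 * block starting at frame index idx with order o starts at
 * idx XOR 2^o; the merged block starts at the lower of the two
 * indices and is automatically aligned to 2^(o+1). */
static unsigned buddy_index(unsigned idx, unsigned order)
{
    return idx ^ (1u << order);
}

int main(void)
{
    assert(buddy_index(0, 0) == 1);   /* frames 0 and 1 are buddies */
    assert(buddy_index(4, 1) == 6);   /* blocks [4,5] and [6,7] */
    assert(buddy_index(6, 1) == 4);   /* the relation is symmetric */
    /* Blocks [2,3] and [4,5] are adjacent but NOT buddies: the
     * merged block [2..5] would not be aligned to its size 4. */
    assert(buddy_index(2, 1) != 4);
    puts("buddy math ok");
    return 0;
}
]]></programlisting>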

      <para>This approach greatly reduces external fragmentation of memory
      and helps in allocating bigger contiguous blocks of memory aligned to
      their size. On the other hand, the buddy allocator suffers from
      increased internal fragmentation of memory and is not suitable for
      general kernel allocations. This purpose is better addressed by the
      <link linkend="slab">slab allocator</link>.<figure>
          <mediaobject id="buddy_alloc">
            <imageobject role="html">
              <imagedata fileref="images/buddy_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Buddy system scheme.</title>
        </figure></para>

      <section>
        <title>Implementation</title>

        <para>The buddy allocator is, in fact, an abstract framework which
        can be easily specialized to serve one particular task. It knows
        nothing about the nature of the memory it helps to allocate. To
        overcome this lack of knowledge, the buddy allocator exports an
        interface that each of its clients is required to implement. When
        supplied with an implementation of this interface, the buddy
        allocator can use specialized external functions to find a buddy for
        a block, split and coalesce blocks, manipulate block order and mark
        blocks busy or available.</para>
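
        <para>Such a client interface can be sketched as a structure of
        function pointers. The operation names below merely paraphrase the
        description above; they are not guaranteed to match the real HelenOS
        declarations.</para>

<programlisting language="c"><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical client interface: the buddy framework calls back
 * into these operations and never touches memory itself. */
typedef struct buddy_ops {
    void *(*find_buddy)(void *block);
    void *(*bisect)(void *block);            /* split into two halves */
    void *(*coalesce)(void *block, void *buddy);
    void (*set_order)(void *block, unsigned order);
    unsigned (*get_order)(void *block);
    void (*mark_busy)(void *block);
    void (*mark_available)(void *block);
} buddy_ops_t;

/* A trivial client: blocks are unsigned ints carrying their order. */
static unsigned toy_get_order(void *block) { return *(unsigned *) block; }
static void toy_set_order(void *block, unsigned o) { *(unsigned *) block = o; }

int main(void)
{
    buddy_ops_t ops = { .get_order = toy_get_order,
                        .set_order = toy_set_order };
    unsigned block = 0;
    ops.set_order(&block, 3);
    assert(ops.get_order(&block) == 3);
    puts("interface ok");
    return 0;
}
]]></programlisting>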

        <formalpara>
          <title>Data organization</title>

          <para>Each entity allocable by the buddy allocator is required to
          contain space for storing the block order number and a link
          variable used to interconnect blocks within the same order.</para>

          <para>Whatever entities are allocated by the buddy allocator, the
          first entity within a block is used to represent the entire block.
          The first entity keeps the order of the whole block. Other entities
          within the block are assigned the magic value
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially
          important for effective identification of buddies in a
          one-dimensional array because the entity that represents a
          potential buddy cannot be associated with
          <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is associated
          with <constant>BUDDY_INNER_BLOCK</constant> then it is not a
          buddy).</para>
        </formalpara>
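
        <para>The following sketch shows how the magic value rules out inner
        entities during buddy identification; the concrete value and the
        array layout are illustrative assumptions, not the real
        constants.</para>

<programlisting language="c"><![CDATA[
#include <assert.h>
#include <stdio.h>

#define BUDDY_INNER_BLOCK 0xffu  /* illustrative magic value */

/* Orders of 8 entities: an allocated order-2 block heads at index 0
 * (entities 1-3 are inner) and a free order-2 block heads at index 4. */
static unsigned char order[8] = {
    2, BUDDY_INNER_BLOCK, BUDDY_INNER_BLOCK, BUDDY_INNER_BLOCK,
    2, BUDDY_INNER_BLOCK, BUDDY_INNER_BLOCK, BUDDY_INNER_BLOCK
};

/* An entity can only represent a buddy if it is not an inner
 * entity and heads a block of the same order. */
static int is_buddy_candidate(unsigned idx, unsigned o)
{
    return order[idx] != BUDDY_INNER_BLOCK && order[idx] == o;
}

int main(void)
{
    assert(is_buddy_candidate(4, 2));   /* heads an order-2 block */
    assert(!is_buddy_candidate(5, 2));  /* inner entity: not a buddy */
    puts("inner-block check ok");
    return 0;
}
]]></programlisting>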
      </section>
    </section>

    <section id="slab">
      <indexterm>
        <primary>slab allocator</primary>
      </indexterm>

      <title>Slab allocator</title>

      <para>The majority of memory allocation requests in the kernel are for
      small, frequently used data structures. The basic idea behind the slab
      allocator is that commonly used objects are preallocated in contiguous
      areas of physical memory called slabs<footnote>
          <para>Slabs are in fact blocks of physical memory frames allocated
          from the frame allocator.</para>
        </footnote>. Whenever an object is to be allocated, the slab
      allocator returns the first available item from a suitable slab
      corresponding to the object type<footnote>
          <para>The mechanism is rather more complicated, see the next
          paragraph.</para>
        </footnote>. Because the sizes of the requested and allocated object
      match, the slab allocator significantly reduces internal
      fragmentation.</para>

      <indexterm>
        <primary>slab allocator</primary>

        <secondary>- slab cache</secondary>
      </indexterm>

      <para>Slabs of one object type are organized in a structure called a
      slab cache. There are usually more slabs in the slab cache, depending
      on previous allocations. If the slab cache runs out of available slabs,
      new slabs are allocated. In order to exploit parallelism and to avoid
      locking of shared spinlocks, slab caches can have variants of
      processor-private slabs called magazines. On each processor, there is a
      two-magazine cache. Full magazines that are not part of any
      per-processor magazine cache are stored in a global list of full
      magazines.</para>
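
      <para>The per-processor two-magazine cache can be pictured with the
      following illustrative structures; the names and the magazine capacity
      are hypothetical, not the HelenOS definitions.</para>

<programlisting language="c"><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAG_SIZE 4  /* illustrative magazine capacity */

/* A magazine is a small stack of pointers to constructed objects. */
typedef struct magazine {
    size_t busy;              /* number of objects currently held */
    void *objs[MAG_SIZE];
} magazine_t;

/* Each processor keeps a current and a last magazine; full
 * magazines beyond these two go to the shared global list. */
typedef struct mag_cache {
    magazine_t *current;
    magazine_t *last;
} mag_cache_t;

int main(void)
{
    magazine_t a = { .busy = 0 }, b = { .busy = 0 };
    mag_cache_t cpu0 = { .current = &a, .last = &b };

    int obj = 42;
    /* Freeing an object pushes it into the current magazine. */
    cpu0.current->objs[cpu0.current->busy++] = &obj;
    assert(cpu0.current->busy == 1);
    assert(*(int *) cpu0.current->objs[0] == 42);  /* ready for reuse */
    puts("magazine cache ok");
    return 0;
}
]]></programlisting>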

      <indexterm>
        <primary>slab allocator</primary>

        <secondary>- magazine</secondary>
      </indexterm>

      <para>Each object begins its life in a slab. When it is allocated from
      there, the slab allocator calls a constructor that is registered in the
      respective slab cache. The constructor initializes and brings the
      object into a known state. The object is then used by the user. When
      the user later frees the object, the slab allocator puts it into a
      processor-private <indexterm>
          <primary>slab allocator</primary>

          <secondary>- magazine</secondary>
        </indexterm>magazine cache, from where it can be allocated again.
      Note that allocations satisfied from a magazine are already initialized
      by the constructor. When both of the processor-cached magazines get
      full, the allocator will move one of the magazines to the list of full
      magazines. Similarly, when allocating from an empty processor magazine
      cache, the kernel will reload only one magazine from the list of full
      magazines. In other words, the slab allocator tries to keep the
      processor magazine cache only half-full in order to prevent thrashing
      when allocations and deallocations interleave on magazine boundaries.
      The advantage of this setup is that during most of the allocations, no
      global spinlock needs to be held.</para>
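
      <para>The constructor part of this lifecycle can be sketched with a toy
      cache. The names below are hypothetical; the real slab cache API may
      differ. The point is that objects are constructed once when taken from
      a slab, while magazine hits return objects that stay constructed.</para>

<programlisting language="c"><![CDATA[
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy slab cache: remembers a constructor to run on objects
 * allocated from a slab (magazine hits skip it, because freed
 * objects remain constructed). */
typedef struct cache {
    size_t obj_size;
    void (*ctor)(void *obj);
} cache_t;

typedef struct { int initialized; } obj_t;

static void obj_ctor(void *obj)
{
    ((obj_t *) obj)->initialized = 1;  /* bring object to a known state */
}

/* Slab-layer allocation path: construct on first use. */
static void *toy_alloc(cache_t *c)
{
    void *obj = malloc(c->obj_size);
    c->ctor(obj);
    return obj;
}

int main(void)
{
    cache_t c = { .obj_size = sizeof(obj_t), .ctor = obj_ctor };
    obj_t *o = toy_alloc(&c);
    assert(o->initialized);  /* constructor already ran */
    free(o);
    puts("ctor lifecycle ok");
    return 0;
}
]]></programlisting>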

      <para>Should HelenOS run short of memory, it would start deallocating
      objects from magazines, calling the slab cache destructor on them and
      putting them back into slabs. When a slab contains no allocated
      objects, it is immediately freed.</para>

      <para>
        <figure>
          <mediaobject id="slab_alloc">
            <imageobject role="html">
              <imagedata fileref="images/slab_alloc.png" format="PNG" />
            </imageobject>
          </mediaobject>

          <title>Slab allocator scheme.</title>
        </figure>
      </para>

      <section>
        <title>Implementation</title>

        <para>The slab allocator is closely modelled after the OpenSolaris
        slab allocator by Jeff Bonwick and Jonathan Adams <xref
        linkend="Bonwick01" /> with the following exceptions:<itemizedlist>
            <listitem>
              <para>empty slabs are immediately deallocated and</para>
            </listitem>

            <listitem>
              <para>empty magazines are deallocated when not needed.</para>
            </listitem>
          </itemizedlist>The following features are not currently supported
        but would be easy to do: <itemizedlist>
            <listitem>
              <para>cache coloring and</para>
            </listitem>

            <listitem>
              <para>dynamic magazine growth (different magazine sizes are
              already supported, but the allocation strategy would need to
              be adjusted).</para>
            </listitem>
          </itemizedlist></para>

        <section>
          <title>Allocation/deallocation</title>

          <para>The following two paragraphs summarize and complete the
          description of the slab allocator operation (i.e. the
          <code>slab_alloc</code> and <code>slab_free</code>
          operations).</para>

          <formalpara>
            <title>Allocation</title>

            <para><emphasis>Step 1.</emphasis> When an allocation request
            comes, the slab allocator checks availability of memory in the
            current magazine of the local processor magazine cache. If the
            memory is available, the allocator just pops the object from the
            magazine and returns it.</para>

            <para><emphasis>Step 2.</emphasis> If the current magazine in the
            processor magazine cache is empty, the allocator will attempt to
            swap it with the last magazine from the cache and return to the
            first step. If the last magazine is also empty, the algorithm
            will fall through to Step 3.</para>

            <para><emphasis>Step 3.</emphasis> Now the allocator is in the
            situation when both magazines in the processor magazine cache are
            empty. The allocator reloads one magazine from the shared list of
            full magazines. If the reload is successful (i.e. there are full
            magazines in the list), the algorithm continues with Step
            1.</para>

            <para><emphasis>Step 4.</emphasis> In this fail-safe step, an
            object is allocated from the conventional slab layer and a
            pointer to it is returned.</para>
          </formalpara>
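
          <para>The four allocation steps can be condensed into the
          following sketch. The magazine scaffolding is a deliberately
          simplified toy, not the real implementation.</para>

<programlisting language="c"><![CDATA[
#include <assert.h>
#include <stdio.h>

#define MAG_SIZE 2

typedef struct { int busy; void *objs[MAG_SIZE]; } magazine_t;

static magazine_t *current, *last;   /* per-processor magazine cache */
static magazine_t *full_list;        /* shared list, one slot in this toy */

/* Stand-in for the conventional slab layer. */
static void *slab_layer_alloc(void) { static int obj; return &obj; }

/* Condensed allocation path: Steps 1-4 from the text. */
static void *toy_slab_alloc(void)
{
    for (;;) {
        if (current->busy > 0)                  /* Step 1: pop object */
            return current->objs[--current->busy];
        if (last->busy > 0) {                   /* Step 2: swap magazines */
            magazine_t *tmp = current; current = last; last = tmp;
            continue;
        }
        if (full_list != NULL) {                /* Step 3: reload from list */
            last = current; current = full_list; full_list = NULL;
            continue;
        }
        return slab_layer_alloc();              /* Step 4: fall back to slab */
    }
}

int main(void)
{
    static int x;
    magazine_t a = { 0 }, b = { 0 };
    magazine_t full = { .busy = 1, .objs = { &x } };
    current = &a; last = &b; full_list = &full;

    assert(toy_slab_alloc() == &x);    /* served via Step 3, then Step 1 */
    assert(toy_slab_alloc() != NULL);  /* everything empty: Step 4 */
    puts("alloc path ok");
    return 0;
}
]]></programlisting>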

          <formalpara>
            <title>Deallocation</title>

            <para><emphasis>Step 1.</emphasis> During a deallocation request,
            the slab allocator checks whether the current magazine of the
            local processor magazine cache has room for another object. If it
            does, the pointer to the object is just pushed into the magazine
            and the algorithm returns.</para>

            <para><emphasis>Step 2.</emphasis> If the current magazine is
            full, the allocator will attempt to swap it with the last
            magazine from the cache and return to the first step. If the last
            magazine is also full, the algorithm will fall through to Step
            3.</para>

            <para><emphasis>Step 3.</emphasis> Now the allocator is in the
            situation when both magazines in the processor magazine cache are
            full. The allocator moves one magazine to the shared list of full
            magazines. The algorithm continues with Step 1.</para>
          </formalpara>
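
          <para>The deallocation steps can be sketched similarly. The
          scaffolding is again a simplified toy; in particular, the source of
          the fresh empty magazine in Step 3 is glossed over here as a
          preallocated spare.</para>

<programlisting language="c"><![CDATA[
#include <assert.h>
#include <stdio.h>

#define MAG_SIZE 2

typedef struct { int busy; void *objs[MAG_SIZE]; } magazine_t;

static magazine_t *current, *last;  /* per-processor magazine cache */
static magazine_t *full_list;       /* shared list, one slot in this toy */
static magazine_t spare;            /* stands in for a new empty magazine */

/* Condensed deallocation path: Steps 1-3 from the text. */
static void toy_slab_free(void *obj)
{
    for (;;) {
        if (current->busy < MAG_SIZE) {       /* Step 1: push object */
            current->objs[current->busy++] = obj;
            return;
        }
        if (last->busy < MAG_SIZE) {          /* Step 2: swap magazines */
            magazine_t *tmp = current; current = last; last = tmp;
            continue;
        }
        /* Step 3: both full - move one magazine to the shared list
         * and continue with an empty one. */
        full_list = current;
        current = &spare;
    }
}

int main(void)
{
    magazine_t a = { 0 }, b = { 0 };
    int objs[5];
    current = &a; last = &b;

    for (int i = 0; i < 5; i++)
        toy_slab_free(&objs[i]);  /* fifth free triggers Step 3 */

    assert(full_list != NULL && full_list->busy == MAG_SIZE);
    assert(current == &spare && current->busy == 1);
    puts("dealloc path ok");
    return 0;
}
]]></programlisting>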
        </section>
      </section>
    </section>
  </section>
328 | 335 | ||
329 | <section> |
336 | <section> |
330 | <title>Virtual memory management</title> |
337 | <title>Virtual memory management</title> |
331 | 338 | ||
332 | <section> |
339 | <section> |
333 | <title>Introduction</title> |
340 | <title>Introduction</title> |
334 | 341 | ||
      <para>Virtual memory is a special memory management technique used by
      the kernel to achieve several mission-critical goals. <itemizedlist>
          <listitem>
            Isolate each task from the other tasks that are running on the system at the same time.
          </listitem>

          <listitem>
            Allow tasks to allocate more memory than the actual physical memory size of the machine.
          </listitem>

          <listitem>
            Allow, in general, loading and executing two programs that are linked to the same address without complicated relocations.
          </listitem>
        </itemizedlist></para>

      <para><!--
<para>
Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
Special address space area type - device - prohibits shrink/extend syscalls to call on it.
Address space has link to mapping tables (hierarchical - per Address space, hash - global tables).
</para>

--></para>
    </section>

    <section>
      <title>Address spaces</title>

      <section>
        <indexterm>
          <primary>address space</primary>

          <secondary>- area</secondary>
        </indexterm>

        <title>Address space areas</title>

        <para>Each address space consists of mutually disjoint contiguous
        address space areas. An address space area is precisely defined by
        its base address and the number of frames/pages it contains.</para>

        <para>Each address space area has a set of flags that define the
        behaviour and permissions on the particular area. <itemizedlist>
            <listitem><emphasis>AS_AREA_READ</emphasis> flag indicates
            reading permission.</listitem>

            <listitem><emphasis>AS_AREA_WRITE</emphasis> flag indicates
            writing permission.</listitem>

            <listitem><emphasis>AS_AREA_EXEC</emphasis> flag indicates code
            execution permission. Some architectures do not support
            execution permission restriction; in that case this flag has no
            effect.</listitem>

            <listitem><emphasis>AS_AREA_DEVICE</emphasis> marks the area as
            mapped to device memory.</listitem>
          </itemizedlist></para>

        <para>The kernel provides tasks with the possibility to create,
        expand, shrink and share their address space areas via a set of
        syscalls.</para>
      </section>
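      <para>The area descriptor and its permission flags can be pictured
      with the following C sketch. The structure layout, the flag values
      and the <emphasis>as_area_access_ok</emphasis> helper are
      illustrative assumptions, not the actual HelenOS definitions.</para>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag values; the actual HelenOS constants live in
 * kernel headers and may differ. */
#define AS_AREA_READ   (1 << 0)
#define AS_AREA_WRITE  (1 << 1)
#define AS_AREA_EXEC   (1 << 2)
#define AS_AREA_DEVICE (1 << 3)

#define PAGE_SIZE 4096

/* Simplified address space area: base address plus page count,
 * as described in the text above. */
typedef struct {
    uintptr_t base;   /* first virtual address of the area */
    size_t pages;     /* number of pages the area contains */
    int flags;        /* AS_AREA_* permission bits */
} as_area_t;

/* Check whether an access of the given type at addr is permitted by
 * the area (a hypothetical helper, not a real kernel function). */
static int as_area_access_ok(const as_area_t *area, uintptr_t addr,
    int access)
{
    if (addr < area->base ||
        addr >= area->base + area->pages * PAGE_SIZE)
        return 0;  /* address lies outside the area */

    return (area->flags & access) == access;
}
```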

      <section>
        <indexterm>
          <primary>address space</primary>

          <secondary>- ASID</secondary>
        </indexterm>

        <title>Address Space ID (ASID)</title>

        <para>Every task in the operating system has its own view of the
        virtual memory. When performing a context switch between different
        tasks, the kernel must switch the address space mapping as well. As
        modern processors perform very aggressive caching of virtual
        mappings, flushing the complete TLB on every context switch would
        be very inefficient. To avoid such a performance penalty, some
        architectures introduce an address space identifier, which allows
        storing several different mappings inside the TLB.</para>

        <para>The HelenOS kernel can take advantage of this hardware
        support by having an ASID abstraction. For example, on ia64 the
        kernel ASID is derived from the RID (region identifier), while on
        mips32 the kernel ASID is actually the hardware identifier. As
        expected, this ASID record is part of the
        <emphasis>as_t</emphasis> structure.</para>

        <para>Due to hardware limitations, the hardware ASID has a limited
        length, ranging from 8 bits on mips32 to 24 bits on ia64, which
        makes it impossible to use it as a unique address space identifier
        for all tasks running in the system. In such situations a special
        ASID stealing algorithm is used, which takes an ASID from an
        inactive task and assigns it to the active task.</para>

        <indexterm>
          <primary>address space</primary>

          <secondary>- ASID stealing</secondary>
        </indexterm>

        <para>
          <classname>ASID stealing algorithm here.</classname>
        </para>
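        <para>Until the algorithm is documented here, the general idea can
        be sketched as follows: hand out fresh ASIDs sequentially, and once
        the hardware range is exhausted, take an ASID back from an inactive
        address space. All names and the eviction policy below are
        hypothetical, not the HelenOS implementation.</para>

```c
#include <assert.h>
#include <stddef.h>

#define ASID_MAX 255  /* e.g. an 8-bit hardware ASID space */

/* Simplified address space descriptor (illustrative, not the real
 * as_t structure). */
typedef struct as {
    int asid;        /* assigned hardware ASID, or -1 if none */
    int active;      /* nonzero while some CPU runs this address space */
    struct as *next; /* list of inactive address spaces */
} as_t;

static int next_free = 1;  /* ASIDs 1..ASID_MAX handed out sequentially */

/* Assign an ASID: take a fresh one while any remain, otherwise steal
 * one from an inactive address space on the list. */
static int asid_get(as_t *inactive_list)
{
    if (next_free <= ASID_MAX)
        return next_free++;

    for (as_t *as = inactive_list; as != NULL; as = as->next) {
        if (!as->active && as->asid != -1) {
            int asid = as->asid;
            as->asid = -1;  /* victim must obtain a new ASID (and have
                             * its TLB entries flushed) when it runs
                             * again */
            return asid;
        }
    }
    return -1;  /* no ASID available */
}
```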
      </section>
    </section>

    <section id="paging">
      <title>Virtual address translation</title>

      <section>
        <title>Introduction</title>

        <para>Virtual memory implementations usually use a paged memory
        model, where the virtual address space is divided into
        <emphasis>pages</emphasis> (usually 4096 bytes in size) and
        physical memory is divided into frames (of the same size as a
        page, of course). Each page may be mapped to some frame; upon a
        memory access to a virtual address, the CPU then performs
        <emphasis>address translation</emphasis> during the instruction
        execution. A non-existent mapping generates a page fault
        exception, which invokes the kernel exception handler, thus
        allowing the kernel to enforce the rules of memory access.
        Information about page mappings is stored by the kernel in the
        <link linkend="page_tables">page tables</link>.</para>

        <indexterm>
          <primary>page tables</primary>
        </indexterm>

        <para>The majority of architectures use multi-level page tables,
        which means that physical memory must be accessed several times
        before the physical address is obtained. This would impose a
        serious performance overhead on virtual memory management. To
        avoid it, the <link linkend="tlb">Translation Lookaside Buffer
        (TLB)</link> is used.</para>

        <para>The HelenOS kernel has two different approaches to the
        paging implementation: <emphasis>4-level page tables</emphasis>
        and a <emphasis>global hash table</emphasis>, both accessible via
        a generic paging abstraction layer. This duality is caused by the
        major architectural differences between the supported platforms.
        The abstraction is implemented with the help of a global structure
        of pointers to the basic mapping functions,
        <emphasis>page_mapping_operations</emphasis>. To provide its
        flavour of page tables, the corresponding layer must implement the
        functions declared in
        <emphasis>page_mapping_operations</emphasis>.</para>
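        <para>The idea of such an operations structure can be sketched in
        C as follows; the exact field names and signatures in HelenOS may
        differ, and the single-slot backend is purely a demonstration.</para>

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a generic paging interface: each page table implementation
 * (hierarchical tables, global hash table, ...) fills in one instance
 * of this operations structure. Names are illustrative. */
typedef struct {
    void (*mapping_insert)(uintptr_t page, uintptr_t frame, int flags);
    void (*mapping_remove)(uintptr_t page);
    int  (*mapping_find)(uintptr_t page, uintptr_t *frame);
} page_mapping_operations_t;

/* A trivial single-slot backend standing in for a real implementation. */
static uintptr_t demo_page, demo_frame;
static int demo_valid;

static void demo_insert(uintptr_t page, uintptr_t frame, int flags)
{
    (void) flags;
    demo_page = page;
    demo_frame = frame;
    demo_valid = 1;
}

static void demo_remove(uintptr_t page)
{
    if (demo_valid && demo_page == page)
        demo_valid = 0;
}

static int demo_find(uintptr_t page, uintptr_t *frame)
{
    if (demo_valid && demo_page == page) {
        *frame = demo_frame;
        return 1;
    }
    return 0;
}

/* The kernel calls through this global structure of pointers, never
 * through a concrete implementation directly. */
static page_mapping_operations_t page_mapping_operations = {
    .mapping_insert = demo_insert,
    .mapping_remove = demo_remove,
    .mapping_find = demo_find,
};
```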

        <para>Thanks to the abstract paging interface, there is room left
        for more paging implementations (besides the already implemented
        hierarchical page tables and hash table), for example <indexterm>
            <primary>B-tree</primary>
          </indexterm> B-tree based page tables.</para>
      </section>

      <section id="page_tables">
        <indexterm>
          <primary>page tables</primary>

          <secondary>- hierarchical</secondary>
        </indexterm>

        <title>Hierarchical 4-level page tables</title>

        <para>Hierarchical 4-level page tables are a generalization of the
        hardware capabilities of most architectures. Each address space
        has its own page tables.<itemizedlist>
            <listitem>ia32 uses 2-level page tables, with full hardware
            support.</listitem>

            <listitem>amd64 uses 4-level page tables, also coming with
            full hardware support.</listitem>

            <listitem>mips32 and ppc32 have 2-level page tables with
            software-simulated support.</listitem>
          </itemizedlist></para>
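        <para>The way a 4-level scheme slices a virtual address can be
        sketched as follows, assuming an amd64-like layout with four 9-bit
        table indices and a 12-bit page offset; the constants are
        illustrative, not taken from the HelenOS sources.</para>

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative split of a 48-bit virtual address: four 9-bit page
 * table indices on top of a 12-bit page offset. */
#define LEVELS       4
#define INDEX_BITS   9
#define OFFSET_BITS  12
#define INDEX_MASK   ((1u << INDEX_BITS) - 1)

/* Index into the page table at the given level (0 = topmost table).
 * A translation walk would use these indices one level at a time,
 * reading the next table's physical address at each step. */
static unsigned pt_index(uint64_t vaddr, int level)
{
    int shift = OFFSET_BITS + (LEVELS - 1 - level) * INDEX_BITS;
    return (unsigned) ((vaddr >> shift) & INDEX_MASK);
}
```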
      </section>

      <section>
        <indexterm>
          <primary>page tables</primary>

          <secondary>- hashing</secondary>
        </indexterm>

        <title>Global hash table</title>

        <para>The implementation of the global hash table was encouraged
        by the ia64 architecture support. One of the major differences
        between the global hash table and hierarchical tables is that the
        global hash table exists only once in the system, whereas
        hierarchical tables are maintained per address space.</para>

        <para>Thus, the hash table contains information about the mappings
        of all address spaces in the system, so the hash of an entry must
        combine both the address space pointer (or id) and the virtual
        address of the page. The generic hash table implementation assumes
        that the pointers to the address spaces are likely to lie at close
        addresses, so it uses their least significant bits for the hash;
        it also assumes that virtual page addresses have roughly the same
        probability of occurring, so the least significant bits of the VPN
        compose the hash index.</para>
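        <para>The hashing idea can be sketched in C as follows; the table
        size and the way the two components are mixed are assumptions for
        illustration, not the exact HelenOS hash function.</para>

```c
#include <assert.h>
#include <stdint.h>

#define HASH_BITS    10
#define HASH_ENTRIES (1u << HASH_BITS)
#define PAGE_BITS    12

/* Combine the least significant bits of the address space pointer
 * with the least significant bits of the virtual page number (VPN),
 * as described in the text. XOR is one simple way to mix them. */
static unsigned page_hash(uintptr_t as_ptr, uintptr_t vaddr)
{
    uintptr_t vpn = vaddr >> PAGE_BITS;
    return (unsigned) ((as_ptr ^ vpn) & (HASH_ENTRIES - 1));
}
```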

        <para>The paging hash table uses the generic hash table with
        collision chains (see the <link linkend="hashtables">Data
        Structures</link> chapter of this manual for details).</para>
      </section>
    </section>

    <section id="tlb">
      <indexterm>
        <primary>TLB</primary>
      </indexterm>

      <title>Translation Lookaside Buffer</title>

      <para>Due to the extensive overhead of page mapping lookups in the
      page tables, all architectures have a fast associative cache memory
      built into the CPU. This memory, called the TLB, stores recently
      used page table entries.</para>

      <section id="tlb_shootdown">
        <indexterm>
          <primary>TLB</primary>

          <secondary>- TLB shootdown</secondary>
        </indexterm>

        <title>TLB consistency. TLB shootdown algorithm.</title>

        <para>The operating system is responsible for keeping the TLB
        consistent by invalidating its contents whenever there is a change
        in the page tables. Such changes occur when a page or a group of
        pages is unmapped, when a mapping is changed, or when the system
        switches the active address space in order to schedule a new task.
        Moreover, this invalidation must be done on all CPUs in the
        system, because each CPU has its own independent TLB. Maintaining
        TLB consistency on SMP configurations is thus not as trivial a
        task as it might seem at first glance. A naive solution would
        assume that the CPU which wants to invalidate the TLB can also
        invalidate the TLB caches of the other CPUs. This is not possible
        on most architectures for a simple reason: flushing the TLB is
        allowed only on the local CPU, and there is no way to access other
        CPUs' TLB caches and thus invalidate the TLB remotely.</para>

        <para>The technique of remote invalidation of TLB entries is
        called "TLB shootdown". HelenOS uses a variation of the algorithm
        described by D. Black et al., "Translation Lookaside Buffer
        Consistency: A Software Approach," Proc. Third Int'l Conf.
        Architectural Support for Programming Languages and Operating
        Systems, 1989, pp. 113-122. <xref linkend="Black89" /></para>

        <para>As the situation demands, partial invalidation of TLB caches
        may be needed. In the case of a simple memory mapping change, it
        is necessary to invalidate only one or a few adjacent pages. If
        the architecture is aware of ASIDs, then, when the kernel needs to
        reclaim an ASID for use by another task, it invalidates only the
        entries of that particular address space. The final option is the
        complete TLB cache invalidation, an operation that flushes all
        entries in the TLB.</para>

        <para>TLB shootdown is performed in two phases.</para>

        <formalpara>
          <title>Phase 1.</title>

          <para>First, the initiator locks the global TLB spinlock; then a
          request is put into the local request cache of every other CPU
          in the system, each cache being protected by its own spinlock.
          If a cache is full, all requests in it are replaced by a single
          request indicating a global TLB flush. The initiator then sends
          an IPI message indicating the TLB shootdown request to the rest
          of the CPUs and waits actively until all of them confirm the
          pending TLB invalidation by setting a special flag. After
          setting this flag, each such CPU blocks on the TLB spinlock held
          by the initiator.</para>
        </formalpara>


        <formalpara>
          <title>Phase 2.</title>

          <para>All CPUs are now waiting on the TLB spinlock to execute
          their TLB invalidation actions, having indicated this intention
          to the initiator. The initiator continues, cleaning up its own
          TLB and releasing the global TLB spinlock. After this, all other
          CPUs acquire and immediately release the TLB spinlock, and
          perform their TLB invalidation actions.</para>
        </formalpara>
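        <para>The request-cache handling of Phase 1 can be modelled with
        the following sequential C sketch. Locking, the IPI and the
        confirmation flags are omitted, and all names are illustrative
        rather than the actual HelenOS identifiers.</para>

```c
#include <assert.h>
#include <stddef.h>

#define CPUS     4
#define CACHE_SZ 8

enum { TLB_INVL_PAGE, TLB_INVL_ALL };

typedef struct {
    int type;
} tlb_request_t;

/* Per-CPU request cache; in the real kernel each is protected by its
 * own spinlock and drained when the CPU handles the shootdown IPI. */
static tlb_request_t cache[CPUS][CACHE_SZ];
static size_t count[CPUS];

/* Phase 1, enqueue part: put a request into every other CPU's cache;
 * on overflow, collapse the whole cache into one "flush all" request. */
static void tlb_shootdown_enqueue(int initiator, int type)
{
    for (int cpu = 0; cpu < CPUS; cpu++) {
        if (cpu == initiator)
            continue;
        if (count[cpu] == CACHE_SZ) {
            cache[cpu][0].type = TLB_INVL_ALL;
            count[cpu] = 1;
        } else {
            cache[cpu][count[cpu]++].type = type;
        }
    }
    /* ...then send the IPI and wait for confirmations (Phase 1),
     * invalidate the local TLB and release the lock (Phase 2). */
}
```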
      </section>
    </section>
  </section>
</chapter>