WebSVN – HelenOS-doc – Blame – /design/trunk/src/ch_memory_management.xml

Rev	Author	Line No.	Line
9	bondari	1	<?xml version="1.0" encoding="UTF-8"?>
11	bondari	2	<chapter id="mm">
		3	<?dbhtml filename="mm.html"?>
9	bondari	4
11	bondari	5	<title>Memory management</title>
9	bondari	6
67	jermar	7	<para>In previous chapters, this book described the scheduling subsystem as
		8	the creator of the impression that threads execute in parallel. The memory
		9	management subsystem, on the other hand, creates the impression that there
		10	is enough physical memory for the kernel and that userspace tasks have the
		11	entire address space only for themselves.</para>
64	jermar	12
26	bondari	13	<section>
64	jermar	14	<title>Physical memory management</title>
		15
		16	<section id="zones_and_frames">
		17	<title>Zones and frames</title>
		18
67	jermar	19	<para>HelenOS represents continuous areas of physical memory in
		20	structures called frame zones (abbreviated as zones). Each zone contains
		21	information about the number of allocated and unallocated physical
		22	memory frames as well as the physical base address of the zone and
		23	number of frames contained in it. A zone also contains an array of frame
		24	structures describing each frame of the zone and, in the last, but not
		25	the least important, front, each zone is equipped with a buddy system
		26	that faciliates effective allocation of power-of-two sized block of
		27	frames.</para>
64	jermar	28
67	jermar	29	<para>This organization of physical memory provides good preconditions
		30	for hot-plugging of more zones. There is also one currently unused zone
		31	attribute: <code>flags</code>. The attribute could be used to give a
		32	special meaning to some zones in the future.</para>
64	jermar	33
67	jermar	34	<para>The zones are linked in a doubly-linked list. This might seem a
		35	bit ineffective because the zone list is walked everytime a frame is
		36	allocated or deallocated. However, this does not represent a significant
		37	performance problem as it is expected that the number of zones will be
		38	rather low. Moreover, most architectures merge all zones into
		39	one.</para>
		40
		41	<para>For each physical memory frame found in a zone, there is a frame
		42	structure that contains number of references and data used by buddy
		43	system.</para>
64	jermar	44	</section>
		45
		46	<section id="frame_allocator">
71	bondari	47	<indexterm>
		48	<primary>frame allocator</primary>
		49	</indexterm>
		50
64	jermar	51	<title>Frame allocator</title>
		52
67	jermar	53	<para>The frame allocator satisfies kernel requests to allocate
		54	power-of-two sized blocks of physical memory. Because of zonal
		55	organization of physical memory, the frame allocator is always working
		56	within a context of some frame zone. In order to carry out the
		57	allocation requests, the frame allocator is tightly integrated with the
		58	buddy system belonging to the zone. The frame allocator is also
		59	responsible for updating information about the number of free and busy
		60	frames in the zone. <figure>
		61	<mediaobject id="frame_alloc">
		62	<imageobject role="html">
		63	<imagedata fileref="images/frame_alloc.png" format="PNG" />
		64	</imageobject>
64	jermar	65
67	jermar	66	<imageobject role="fop">
		67	<imagedata fileref="images.vector/frame_alloc.svg" format="SVG" />
		68	</imageobject>
		69	</mediaobject>
64	jermar	70
67	jermar	71	<title>Frame allocator scheme.</title>
		72	</figure></para>
64	jermar	73
		74	<formalpara>
		75	<title>Allocation / deallocation</title>
		76
67	jermar	77	<para>Upon allocation request via function <code>frame_alloc</code>,
		78	the frame allocator first tries to find a zone that can satisfy the
		79	request (i.e. has the required amount of free frames). Once a suitable
		80	zone is found, the frame allocator uses the buddy allocator on the
		81	zone's buddy system to perform the allocation. During deallocation,
		82	which is triggered by a call to <code>frame_free</code>, the frame
		83	allocator looks up the respective zone that contains the frame being
		84	deallocated. Afterwards, it calls the buddy allocator again, this time
		85	to take care of deallocation within the zone's buddy system.</para>
64	jermar	86	</formalpara>
		87	</section>
		88
		89	<section id="buddy_allocator">
71	bondari	90	<indexterm>
		91	<primary>buddy system</primary>
		92	</indexterm>
		93
64	jermar	94	<title>Buddy allocator</title>
		95
67	jermar	96	<para>In the buddy system, the memory is broken down into power-of-two
		97	sized naturally aligned blocks. These blocks are organized in an array
		98	of lists, in which the list with index i contains all unallocated blocks
		99	of size <mathphrase>2<superscript>i</superscript></mathphrase>. The
		100	index i is called the order of block. Should there be two adjacent
		101	equally sized blocks in the list i<mathphrase />(i.e. buddies), the
		102	buddy allocator would coalesce them and put the resulting block in list
		103	<mathphrase>i + 1</mathphrase>, provided that the resulting block would
		104	be naturally aligned. Similarily, when the allocator is asked to
		105	allocate a block of size
		106	<mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
		107	to satisfy the request from the list with index i. If the request cannot
		108	be satisfied (i.e. the list i is empty), the buddy allocator will try to
		109	allocate and split a larger block from the list with index i + 1. Both
		110	of these algorithms are recursive. The recursion ends either when there
		111	are no blocks to coalesce in the former case or when there are no blocks
		112	that can be split in the latter case.</para>
64	jermar	113
67	jermar	114	<para>This approach greatly reduces external fragmentation of memory and
		115	helps in allocating bigger continuous blocks of memory aligned to their
		116	size. On the other hand, the buddy allocator suffers increased internal
		117	fragmentation of memory and is not suitable for general kernel
		118	allocations. This purpose is better addressed by the <link
		119	linkend="slab">slab allocator</link>.<figure>
64	jermar	120	<mediaobject id="buddy_alloc">
		121	<imageobject role="html">
		122	<imagedata fileref="images/buddy_alloc.png" format="PNG" />
		123	</imageobject>
		124
		125	<imageobject role="fop">
		126	<imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" />
		127	</imageobject>
		128	</mediaobject>
		129
		130	<title>Buddy system scheme.</title>
67	jermar	131	</figure></para>
64	jermar	132
		133	<section>
		134	<title>Implementation</title>
		135
		136	<para>The buddy allocator is, in fact, an abstract framework wich can
		137	be easily specialized to serve one particular task. It knows nothing
		138	about the nature of memory it helps to allocate. In order to beat the
		139	lack of this knowledge, the buddy allocator exports an interface that
		140	each of its clients is required to implement. When supplied with an
		141	implementation of this interface, the buddy allocator can use
		142	specialized external functions to find a buddy for a block, split and
		143	coalesce blocks, manipulate block order and mark blocks busy or
67	jermar	144	available.</para>
64	jermar	145
		146	<formalpara>
		147	<title>Data organization</title>
		148
		149	<para>Each entity allocable by the buddy allocator is required to
		150	contain space for storing block order number and a link variable
		151	used to interconnect blocks within the same order.</para>
		152
		153	<para>Whatever entities are allocated by the buddy allocator, the
		154	first entity within a block is used to represent the entire block.
		155	The first entity keeps the order of the whole block. Other entities
		156	within the block are assigned the magic value
		157	<constant>BUDDY_INNER_BLOCK</constant>. This is especially important
		158	for effective identification of buddies in a one-dimensional array
		159	because the entity that represents a potential buddy cannot be
		160	associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
		161	is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
		162	not a buddy).</para>
		163	</formalpara>
		164	</section>
		165	</section>
		166
		167	<section id="slab">
71	bondari	168	<indexterm>
		169	<primary>slab allocator</primary>
		170	</indexterm>
		171
70	jermar	172	<title>Slab allocator</title>
64	jermar	173
67	jermar	174	<para>The majority of memory allocation requests in the kernel is for
		175	small, frequently used data structures. The basic idea behind the slab
		176	allocator is that commonly used objects are preallocated in continuous
		177	areas of physical memory called slabs<footnote>
		178	<para>Slabs are in fact blocks of physical memory frames allocated
		179	from the frame allocator.</para>
		180	</footnote>. Whenever an object is to be allocated, the slab allocator
		181	returns the first available item from a suitable slab corresponding to
		182	the object type<footnote>
		183	<para>The mechanism is rather more complicated, see the next
		184	paragraph.</para>
		185	</footnote>. Due to the fact that the sizes of the requested and
		186	allocated object match, the slab allocator significantly reduces
		187	internal fragmentation.</para>
64	jermar	188
71	bondari	189	<indexterm>
		190	<primary>slab allocator</primary>
		191
72	bondari	192	<secondary>- slab cache</secondary>
71	bondari	193	</indexterm>
		194
67	jermar	195	<para>Slabs of one object type are organized in a structure called slab
		196	cache. There are ususally more slabs in the slab cache, depending on
		197	previous allocations. If the the slab cache runs out of available slabs,
		198	new slabs are allocated. In order to exploit parallelism and to avoid
		199	locking of shared spinlocks, slab caches can have variants of
		200	processor-private slabs called magazines. On each processor, there is a
		201	two-magazine cache. Full magazines that are not part of any
		202	per-processor magazine cache are stored in a global list of full
		203	magazines.</para>
64	jermar	204
71	bondari	205	<indexterm>
		206	<primary>slab allocator</primary>
		207
72	bondari	208	<secondary>- magazine</secondary>
71	bondari	209	</indexterm>
		210
67	jermar	211	<para>Each object begins its life in a slab. When it is allocated from
		212	there, the slab allocator calls a constructor that is registered in the
		213	respective slab cache. The constructor initializes and brings the object
		214	into a known state. The object is then used by the user. When the user
		215	later frees the object, the slab allocator puts it into a processor
71	bondari	216	private <indexterm>
		217	<primary>slab allocator</primary>
		218
72	bondari	219	<secondary>- magazine</secondary>
71	bondari	220	</indexterm>magazine cache, from where it can be precedently allocated
67	jermar	221	again. Note that allocations satisfied from a magazine are already
		222	initialized by the constructor. When both of the processor cached
		223	magazines get full, the allocator will move one of the magazines to the
		224	list of full magazines. Similarily, when allocating from an empty
		225	processor magazine cache, the kernel will reload only one magazine from
		226	the list of full magazines. In other words, the slab allocator tries to
		227	keep the processor magazine cache only half-full in order to prevent
		228	thrashing when allocations and deallocations interleave on magazine
70	jermar	229	boundaries. The advantage of this setup is that during most of the
		230	allocations, no global spinlock needs to be held.</para>
65	jermar	231
67	jermar	232	<para>Should HelenOS run short of memory, it would start deallocating
		233	objects from magazines, calling slab cache destructor on them and
		234	putting them back into slabs. When a slab contanins no allocated object,
		235	it is immediately freed.</para>
66	bondari	236
71	bondari	237	<para>
		238	<figure>
64	jermar	239	<mediaobject id="slab_alloc">
		240	<imageobject role="html">
		241	<imagedata fileref="images/slab_alloc.png" format="PNG" />
		242	</imageobject>
		243	</mediaobject>
		244
		245	<title>Slab allocator scheme.</title>
71	bondari	246	</figure>
		247	</para>
64	jermar	248
67	jermar	249	<section>
		250	<title>Implementation</title>
		251
		252	<para>The slab allocator is closely modelled after OpenSolaris slab
71	bondari	253	allocator by Jeff Bonwick and Jonathan Adams <xref
		254	linkend="Bonwick01" /> with the following exceptions:<itemizedlist>
		255	<listitem>empty slabs are immediately deallocated and</listitem>
66	bondari	256
		257	<listitem>
70	jermar	258	<para>empty magazines are deallocated when not needed.</para>
66	bondari	259	</listitem>
70	jermar	260	</itemizedlist>The following features are not currently supported
		261	but would be easy to do: <itemizedlist>
71	bondari	262	<listitem>cache coloring and</listitem>
64	jermar	263
71	bondari	264	<listitem>dynamic magazine grow (different magazine sizes are
		265	already supported, but the allocation strategy would need to be
		266	adjusted).</listitem>
64	jermar	267	</itemizedlist></para>
		268
		269	<section>
		270	<title>Allocation/deallocation</title>
		271
70	jermar	272	<para>The following two paragraphs summarize and complete the
		273	description of the slab allocator operation (i.e.
		274	<code>slab_alloc</code> and <code>slab_free</code>
		275	operations).</para>
64	jermar	276
		277	<formalpara>
		278	<title>Allocation</title>
		279
70	jermar	280	<para><emphasis>Step 1.</emphasis> When an allocation request
		281	comes, the slab allocator checks availability of memory in the
		282	current magazine of the local processor magazine cache. If the
		283	available memory is there, the allocator just pops the magazine
		284	and returns pointer to the object.</para>
64	jermar	285
70	jermar	286	<para><emphasis>Step 2.</emphasis> If the current magazine in the
		287	processor magazine cache is empty, the allocator will attempt to
		288	swap it with the last magazine from the cache and return to the
		289	first step. If also the last magazine is empty, the algorithm will
		290	fall through to Step 3.</para>
64	jermar	291
70	jermar	292	<para><emphasis>Step 3.</emphasis> Now the allocator is in the
		293	situation when both magazines in the processor magazine cache are
		294	empty. The allocator reloads one magazine from the shared list of
		295	full magazines. If the reload is successful (i.e. there are full
		296	magazines in the list), the algorithm continues with Step
		297	1.</para>
64	jermar	298
70	jermar	299	<para><emphasis>Step 4.</emphasis> In this fail-safe step, an
		300	object is allocated from the conventional slab layer and a pointer
		301	to it is returned. If also the last magazine is full,</para>
64	jermar	302	</formalpara>
		303
		304	<formalpara>
		305	<title>Deallocation</title>
		306
70	jermar	307	<para><emphasis>Step 1.</emphasis> During a deallocation request,
		308	the slab allocator checks if the current magazine of the local
		309	processor magazine cache is not full. If yes, the pointer to the
		310	objects is just pushed into the magazine and the algorithm
		311	returns.</para>
64	jermar	312
70	jermar	313	<para><emphasis>Step 2.</emphasis> If the current magazine is
		314	full, the allocator will attempt to swap it with the last magazine
		315	from the cache and return to the first step. If also the last
		316	magazine is empty, the algorithm will fall through to Step
		317	3.</para>
64	jermar	318
70	jermar	319	<para><emphasis>Step 3.</emphasis> Now the allocator is in the
		320	situation when both magazines in the processor magazine cache are
		321	full. The allocator moves one magazine to the shared list of full
		322	magazines. The algoritm continues with Step 1.</para>
64	jermar	323	</formalpara>
		324	</section>
		325	</section>
		326	</section>
		327	</section>
		328
		329	<section>
67	jermar	330	<title>Virtual memory management</title>
9	bondari	331
67	jermar	332	<section>
		333	<title>Introduction</title>
		334
		335	<para>Virtual memory is a special memory management technique, used by
		336	kernel to achieve a bunch of mission critical goals. <itemizedlist>
		337	<listitem>
		338	Isolate each task from other tasks that are running on the system at the same time.
		339	</listitem>
		340
		341	<listitem>
		342	Allow to allocate more memory, than is actual physical memory size of the machine.
		343	</listitem>
		344
		345	<listitem>
		346	Allowing, in general, to load and execute two programs that are linked on the same address without complicated relocations.
		347	</listitem>
		348	</itemizedlist></para>
		349
		350	<para><!--
		351	<para>
		352	Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
		353	Special address space area type - device - prohibits shrink/extend syscalls to call on it.
		354	Address space has link to mapping tables (hierarchical - per Address space, hash - global tables).
		355	</para>
		356
		357	--></para>
		358	</section>
		359
		360	<section>
71	bondari	361	<title>Address spaces</title>
70	jermar	362
71	bondari	363	<section>
		364	<indexterm>
		365	<primary>address space</primary>
70	jermar	366
72	bondari	367	<secondary>- area</secondary>
71	bondari	368	</indexterm>
70	jermar	369
		370	<title>Address space areas</title>
67	jermar	371
		372	<para>Each address space consists of mutually disjunctive continuous
		373	address space areas. Address space area is precisely defined by its
		374	base address and the number of frames/pages is contains.</para>
		375
		376	<para>Address space area , that define behaviour and permissions on
		377	the particular area. <itemizedlist>
71	bondari	378	<listitem><emphasis>AS_AREA_READ</emphasis> flag indicates reading
		379	permission.</listitem>
67	jermar	380
71	bondari	381	<listitem><emphasis>AS_AREA_WRITE</emphasis> flag indicates
		382	writing permission.</listitem>
67	jermar	383
71	bondari	384	<listitem><emphasis>AS_AREA_EXEC</emphasis> flag indicates code
		385	execution permission. Some architectures do not support execution
		386	persmission restriction. In this case this flag has no
		387	effect.</listitem>
67	jermar	388
71	bondari	389	<listitem><emphasis>AS_AREA_DEVICE</emphasis> marks area as mapped
		390	to the device memory.</listitem>
67	jermar	391	</itemizedlist></para>
		392
		393	<para>Kernel provides possibility tasks create/expand/shrink/share its
		394	address space via the set of syscalls.</para>
		395	</section>
		396
		397	<section>
71	bondari	398	<indexterm>
		399	<primary>address space</primary>
		400
72	bondari	401	<secondary>- ASID</secondary>
71	bondari	402	</indexterm>
		403
67	jermar	404	<title>Address Space ID (ASID)</title>
		405
		406	<para>When switching to the different task, kernel also require to
		407	switch mappings to the different address space. In case TLB cannot
		408	distinguish address space mappings, all mapping information in TLB
		409	from the old address space must be flushed, which can create certain
		410	uncessary overhead during the task switching. To avoid this, some
		411	architectures have capability to segregate different address spaces on
		412	hardware level introducing the address space identifier as a part of
		413	TLB record, telling the virtual address space translation unit to
		414	which address space this record is applicable.</para>
		415
		416	<para>HelenOS kernel can take advantage of this hardware supported
		417	identifier by having an ASID abstraction which is somehow related to
		418	the corresponding architecture identifier. I.e. on ia64 kernel ASID is
		419	derived from RID (region identifier) and on the mips32 kernel ASID is
		420	actually the hardware identifier. As expected, this ASID information
		421	record is the part of <emphasis>as_t</emphasis> structure.</para>
		422
		423	<para>Due to the hardware limitations, hardware ASID has limited
		424	length from 8 bits on ia64 to 24 bits on mips32, which makes it
		425	impossible to use it as unique address space identifier for all tasks
		426	running in the system. In such situations special ASID stealing
		427	algoritm is used, which takes ASID from inactive task and assigns it
		428	to the active task.</para>
		429
71	bondari	430	<indexterm>
		431	<primary>address space</primary>
		432
72	bondari	433	<secondary>- ASID stealing</secondary>
71	bondari	434	</indexterm>
		435
		436	<para>
		437	<classname>ASID stealing algoritm here.</classname>
		438	</para>
67	jermar	439	</section>
		440	</section>
		441
71	bondari	442	<section id="paging">
67	jermar	443	<title>Virtual address translation</title>
		444
71	bondari	445	<section>
		446	<title>Introduction</title>
67	jermar	447
71	bondari	448	<para>Virtual memory is usually using paged memory model, where
		449	virtual memory address space is divided into the
		450	<emphasis>pages</emphasis> (usually having size 4096 bytes) and
		451	physical memory is divided into the frames (same sized as a page, of
		452	course). Each page may be mapped to some frame and then, upon memory
		453	access to the virtual address, CPU performs <emphasis>address
		454	translation</emphasis> during the instruction execution. Non-existing
		455	mapping generates page fault exception, calling kernel exception
		456	handler, thus allowing kernel to manipulate rules of memory access.
		457	Information for pages mapping is stored by kernel in the <link
		458	linkend="page_tables">page tables</link></para>
		459
		460	<indexterm>
		461	<primary>page tables</primary>
		462	</indexterm>
		463
		464	<para>The majority of the architectures use multi-level page tables,
		465	which means need to access physical memory several times before
		466	getting physical address. This fact would make serios performance
		467	overhead in virtual memory management. To avoid this <link
		468	linkend="tlb">Traslation Lookaside Buffer (TLB)</link> is used.</para>
		469
70	jermar	470	<para>HelenOS kernel has two different approaches to the paging
		471	implementation: <emphasis>4 level page tables</emphasis> and
71	bondari	472	<emphasis>global hash table</emphasis>, which are accessible via
70	jermar	473	generic paging abstraction layer. Such different functionality was
		474	caused by the major architectural differences between supported
		475	platforms. This abstraction is implemented with help of the global
		476	structure of pointers to basic mapping functions
		477	<emphasis>page_mapping_operations</emphasis>. To achieve different
		478	functionality of page tables, corresponding layer must implement
		479	functions, declared in
		480	<emphasis>page_mapping_operations</emphasis></para>
67	jermar	481
71	bondari	482	<para>Thanks to the abstract paging interface, there was a place left
		483	for more paging implementations (besides already implemented
		484	hieararchical page tables and hash table), for example B-Tree based
		485	page tables.</para>
		486	</section>
67	jermar	487
71	bondari	488	<section id="page_tables">
		489	<indexterm>
		490	<primary>page tables</primary>
67	jermar	491
72	bondari	492	<secondary>- hierarchical</secondary>
71	bondari	493	</indexterm>
67	jermar	494
71	bondari	495	<title>Hierarchical 4-level page tables</title>
67	jermar	496
71	bondari	497	<para>Hierarchical 4-level page tables are the generalization of the
		498	hardware capabilities of most architectures. Each address space has
		499	its own page tables.<itemizedlist>
		500	<listitem>ia32 uses 2-level page tables, with full hardware
		501	support.</listitem>
67	jermar	502
71	bondari	503	<listitem>amd64 uses 4-level page tables, also coming with full
		504	hardware support.</listitem>
67	jermar	505
71	bondari	506	<listitem>mips and ppc32 have 2-level tables, software simulated
		507	support.</listitem>
		508	</itemizedlist></para>
67	jermar	509	</section>
		510
71	bondari	511	<section>
		512	<indexterm>
		513	<primary>page tables</primary>
67	jermar	514
72	bondari	515	<secondary>- hashing</secondary>
71	bondari	516	</indexterm>
67	jermar	517
71	bondari	518	<title>Global hash table</title>
67	jermar	519
71	bondari	520	<para>Implementation of the global hash table was encouraged by the
		521	ia64 architecture support. One of the major differences between global
		522	hash table and hierarchical tables is that global hash table exists
		523	only once in the system and the hierarchical tables are maintained per
		524	address space.</para>
67	jermar	525
71	bondari	526	<para>Thus, hash table contains information about all address spaces
		527	mappings in the system, so, the hash of an entry must contain
		528	information of both address space pointer or id and the virtual
		529	address of the page. Generic hash table implementation assumes that
		530	the addresses of the pointers to the address spaces are likely to be
		531	on the close addresses, so it uses least significant bits for hash;
		532	also it assumes that the virtual page addresses have roughly the same
		533	probability of occurring, so the least significant bits of VPN compose
		534	the hash index.</para>
67	jermar	535
71	bondari	536	<para>- global page hash table: existuje jen jedna v celem systemu
		537	(vyuziva ji ia64), pozn. ia64 ma zatim vypnuty VHPT. Pouziva se
		538	genericke hash table s oddelenymi collision chains. ASID support is
		539	required to use global hash tables.</para>
70	jermar	540	</section>
		541	</section>
67	jermar	542
71	bondari	543	<section id="tlb">
		544	<indexterm>
		545	<primary>TLB</primary>
		546	</indexterm>
67	jermar	547
71	bondari	548	<title>Translation Lookaside buffer</title>
67	jermar	549
71	bondari	550	<para>Due to the extensive overhead during the page mapping lookup in
		551	the page tables, all architectures has fast assotiative cache memory
		552	built-in CPU. This memory called TLB stores recently used page table
		553	entries.</para>
		554
		555	<section id="tlb_shootdown">
		556	<indexterm>
		557	<primary>TLB</primary>
		558
72	bondari	559	<secondary>- TLB shootdown</secondary>
71	bondari	560	</indexterm>
		561
		562	<title>TLB consistency. TLB shootdown algorithm.</title>
		563
		564	<para>Operating system is responsible for keeping TLB consistent by
		565	invalidating the contents of TLB, whenever there is some change in
		566	page tables. Those changes may occur when page or group of pages were
		567	unmapped, mapping is changed or system switching active address space
		568	to schedule a new system task. Moreover, this invalidation operation
		569	must be done an all system CPUs because each CPU has its own
		570	independent TLB cache. Thus maintaining TLB consistency on SMP
		571	configuration as not as trivial task as it looks on the first glance.
		572	Naive solution would assume that is the CPU which wants to invalidate
		573	TLB will invalidate TLB caches on other CPUs. It is not possible on
		574	the most of the architectures, because of the simple fact - flushing
		575	TLB is allowed only on the local CPU and there is no possibility to
		576	access other CPUs' TLB caches, thus invalidate TLB remotely.</para>
		577
		578	<para>Technique of remote invalidation of TLB entries is called "TLB
		579	shootdown". HelenOS uses a variation of the algorithm described by D.
		580	Black et al., "Translation Lookaside Buffer Consistency: A Software
		581	Approach," Proc. Third Int'l Conf. Architectural Support for
		582	Programming Languages and Operating Systems, 1989, pp. 113-122. <xref
		583	linkend="Black89" /></para>
		584
		585	<para>As the situation demands, you will want partitial invalidation
		586	of TLB caches. In case of simple memory mapping change it is necessary
		587	to invalidate only one or more adjacent pages. In case if the
		588	architecture is aware of ASIDs, when kernel needs to dump some ASID to
		589	use by another task, it invalidates only entries from this particular
		590	address space. Final option of the TLB invalidation is the complete
		591	TLB cache invalidation, which is the operation that flushes all
		592	entries in TLB.</para>
		593
		594	<para>TLB shootdown is performed in two phases.</para>
		595
		596	<formalpara>
		597	<title>Phase 1.</title>
		598
		599	<para>First, initiator locks a global TLB spinlock, then request is
		600	being put to the local request cache of every other CPU in the
		601	system protected by its spinlock. In case the cache is full, all
		602	requests in the cache are replaced by one request, indicating global
		603	TLB flush. Then the initiator thread sends an IPI message indicating
		604	the TLB shootdown request to the rest of the CPUs and waits actively
		605	until all CPUs confirm TLB invalidating action execution by setting
		606	up a special flag. After setting this flag this thread is blocked on
		607	the TLB spinlock, held by the initiator.</para>
		608	</formalpara>
		609
		610	<formalpara>
		611	<title>Phase 2.</title>
		612
		613	<para>All CPUs are waiting on the TLB spinlock to execute TLB
		614	invalidation action and have indicated their intention to the
		615	initiator. Initiator continues, cleaning up its TLB and releasing
		616	the global TLB spinlock. After this all other CPUs gain and
		617	immidiately release TLB spinlock and perform TLB invalidation
		618	actions.</para>
		619	</formalpara>
		620	</section>
67	jermar	621	</section>
26	bondari	622	</section>
11	bondari	623	</chapter>

Subversion Repositories HelenOS-doc

(root)/design/trunk/src/ch_memory_management.xml @ 185 – Rev 72