<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <para>In previous chapters, this book described the scheduling subsystem as
  the creator of the impression that threads execute in parallel. The memory
  management subsystem, on the other hand, creates the impression that there
  is enough physical memory for the kernel and that userspace tasks have the
  entire address space only for themselves.</para>

  <section>
    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <indexterm><primary>memory management</primary><secondary>memory zone</secondary></indexterm>
      <title>Zones and frames</title>

      <para>HelenOS represents contiguous areas of physical memory in
      structures called frame zones (abbreviated as zones). Each zone contains
      information about the number of allocated and unallocated physical
      memory frames as well as the physical base address of the zone and the
      number of frames contained in it. A zone also contains an array of frame
      structures describing each frame of the zone and, last but not least,
      each zone is equipped with a buddy system that facilitates effective
      allocation of power-of-two sized blocks of frames.</para>

      <para>This organization of physical memory provides good preconditions
      for hot-plugging of more zones. There is also one currently unused zone
      attribute: <code>flags</code>. The attribute could be used to give a
      special meaning to some zones in the future.</para>

      <para>The zones are linked in a doubly-linked list. This might seem a
      bit inefficient because the zone list is walked every time a frame is
      allocated or deallocated. However, this does not represent a significant
      performance problem as it is expected that the number of zones will be
      rather low. Moreover, most architectures merge all zones into
      one.</para>

      <para>For each physical memory frame found in a zone, there is a frame
      structure that contains the number of references and data used by the
      buddy system.</para>
    </section>

    <section id="frame_allocator">
      <indexterm><primary>memory management</primary><secondary>frame allocator</secondary></indexterm>
      <title>Frame allocator</title>

      <para>The frame allocator satisfies kernel requests to allocate
      power-of-two sized blocks of physical memory. Because of the zonal
      organization of physical memory, the frame allocator always works
      within the context of some frame zone. In order to carry out the
      allocation requests, the frame allocator is tightly integrated with the
      buddy system belonging to the zone. The frame allocator is also
      responsible for updating information about the number of free and busy
      frames in the zone. <figure>
          <mediaobject id="frame_alloc">
            <imageobject role="html">
              <imagedata fileref="images/frame_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images.vector/frame_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Frame allocator scheme.</title>
        </figure></para>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request via the <code>frame_alloc</code>
        function, the frame allocator first tries to find a zone that can
        satisfy the request (i.e. has the required amount of free frames).
        Once a suitable zone is found, the frame allocator uses the buddy
        allocator on the zone's buddy system to perform the allocation.
        During deallocation, which is triggered by a call to
        <code>frame_free</code>, the frame allocator looks up the respective
        zone that contains the frame being deallocated. Afterwards, it calls
        the buddy allocator again, this time to take care of deallocation
        within the zone's buddy system.</para>
      </formalpara>
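
      <para>The following C fragment sketches the allocation path just
      described. It is only an illustration: the structure layout and the
      helper names (<code>find_suitable_zone</code>,
      <code>buddy_alloc</code>, <code>block_index</code>) are assumptions
      made for this example, not the exact kernel interface.</para>

      <programlisting language="c"><![CDATA[
/* Illustrative sketch only; names do not copy the exact HelenOS API. */
typedef struct zone {
    uintptr_t base;               /* physical base address of the zone */
    size_t free_count;            /* number of unallocated frames */
    size_t busy_count;            /* number of allocated frames */
    buddy_system_t *buddy_system; /* buddy system serving this zone */
} zone_t;

/* Allocate a block of 2^order physical memory frames. */
uintptr_t frame_alloc(uint8_t order)
{
    /* Find a zone that has at least 2^order free frames. */
    zone_t *zone = find_suitable_zone(1 << order);
    if (!zone)
        return 0;  /* no zone can satisfy the request */

    /* Delegate the actual allocation to the zone's buddy system. */
    block_t *block = buddy_alloc(zone->buddy_system, order);

    /* The frame allocator keeps the zone accounting up to date. */
    zone->free_count -= (1 << order);
    zone->busy_count += (1 << order);

    return zone->base + block_index(zone, block) * FRAME_SIZE;
}
]]></programlisting>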
    </section>

    <section id="buddy_allocator">
      <indexterm><primary>memory management</primary><secondary>buddy system</secondary></indexterm>
      <title>Buddy allocator</title>

      <para>In the buddy system, the memory is broken down into power-of-two
      sized naturally aligned blocks. These blocks are organized in an array
      of lists, in which the list with index
      <mathphrase>i</mathphrase> contains all unallocated blocks of size
      <mathphrase>2<superscript>i</superscript></mathphrase>. The index
      <mathphrase>i</mathphrase> is called the order of the block. Should
      there be two adjacent equally sized blocks in the list
      <mathphrase>i</mathphrase> (i.e. buddies), the buddy allocator would
      coalesce them and put the resulting block in list
      <mathphrase>i + 1</mathphrase>, provided that the resulting block would
      be naturally aligned. Similarly, when the allocator is asked to
      allocate a block of size
      <mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
      to satisfy the request from the list with index
      <mathphrase>i</mathphrase>. If the request cannot be satisfied (i.e.
      the list <mathphrase>i</mathphrase> is empty), the buddy allocator will
      try to allocate and split a larger block from the list with index
      <mathphrase>i + 1</mathphrase>. Both of these algorithms are recursive.
      The recursion ends either when there are no blocks to coalesce in the
      former case or when there are no blocks that can be split in the latter
      case.</para>
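
      <para>The two recursive walks can be sketched in C as follows. The
      list helpers and the <code>block_t</code> type are illustrative
      assumptions; only the recursion mirrors the description above.</para>

      <programlisting language="c"><![CDATA[
/* Allocation: take a block of order i, splitting a bigger one on demand. */
block_t *buddy_alloc(buddy_system_t *b, uint8_t i)
{
    if (!list_empty(&b->order[i]))
        return list_pop(&b->order[i]);     /* request satisfied directly */

    if (i + 1 >= b->max_order)
        return NULL;                       /* no block left to split */

    block_t *big = buddy_alloc(b, i + 1);  /* recursion: get order i+1 */
    if (!big)
        return NULL;

    block_t *half = bisect(big, i);        /* big now has order i */
    list_push(&b->order[i], half);         /* the unused half stays free */
    return big;
}

/* Deallocation: coalesce with the buddy while it is free and aligned. */
void buddy_free(buddy_system_t *b, block_t *block, uint8_t i)
{
    block_t *buddy = find_buddy(b, block, i);  /* NULL if not free */
    if (buddy != NULL) {
        list_remove(&b->order[i], buddy);
        buddy_free(b, coalesce(block, buddy), i + 1);
    } else {
        list_push(&b->order[i], block);
    }
}
]]></programlisting>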

      <para>This approach greatly reduces external fragmentation of memory
      and helps in allocating larger contiguous blocks of memory aligned to
      their size. On the other hand, the buddy allocator suffers from
      increased internal fragmentation of memory and is not suitable for
      general kernel allocations. This purpose is better addressed by the
      <link linkend="slab">slab allocator</link>.<figure>
          <mediaobject id="buddy_alloc">
            <imageobject role="html">
              <imagedata fileref="images/buddy_alloc.png" format="PNG" />
            </imageobject>

            <imageobject role="fop">
              <imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" />
            </imageobject>
          </mediaobject>

          <title>Buddy system scheme.</title>
        </figure></para>

      <section>
        <title>Implementation</title>

        <para>The buddy allocator is, in fact, an abstract framework which
        can be easily specialized to serve one particular task. It knows
        nothing about the nature of the memory it helps to allocate. To
        compensate for this lack of knowledge, the buddy allocator exports an
        interface that each of its clients is required to implement. When
        supplied with an implementation of this interface, the buddy
        allocator can use specialized external functions to find a buddy for
        a block, split and coalesce blocks, manipulate block order and mark
        blocks busy or available.</para>
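
        <para>A sketch of such a client-supplied interface follows. The
        operation names are modelled on the description above; the exact
        signatures in the kernel sources may differ.</para>

        <programlisting language="c"><![CDATA[
/* Operations every client of the buddy framework has to supply.
 * The framework itself never interprets the memory it manages. */
typedef struct buddy_system_operations {
    /* Return the buddy of a block, or NULL if the buddy is not free. */
    link_t *(*find_buddy)(buddy_system_t *b, link_t *block);
    /* Split a block in two and return its second half. */
    link_t *(*bisect)(buddy_system_t *b, link_t *block);
    /* Merge two buddies into one block of a higher order. */
    link_t *(*coalesce)(buddy_system_t *b, link_t *blk, link_t *buddy);
    /* Read and store the order kept inside a block. */
    uint8_t (*get_order)(buddy_system_t *b, link_t *block);
    void (*set_order)(buddy_system_t *b, link_t *block, uint8_t order);
    /* Mark a block as busy or available. */
    void (*mark_busy)(buddy_system_t *b, link_t *block);
    void (*mark_available)(buddy_system_t *b, link_t *block);
} buddy_system_operations_t;
]]></programlisting>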

        <formalpara>
          <title>Data organization</title>

          <para>Each entity allocable by the buddy allocator is required to
          contain space for storing the block order number and a link
          variable used to interconnect blocks within the same order.</para>

          <para>Whatever entities are allocated by the buddy allocator, the
          first entity within a block is used to represent the entire block.
          The first entity keeps the order of the whole block. Other entities
          within the block are assigned the magic value
          <constant>BUDDY_INNER_BLOCK</constant>. This is especially
          important for effective identification of buddies in a
          one-dimensional array because the entity that represents a
          potential buddy cannot be associated with
          <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is associated
          with <constant>BUDDY_INNER_BLOCK</constant> then it is not a
          buddy).</para>
        </formalpara>
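
        <para>In a hypothetical client allocating frame structures, the
        required fields and the buddy test could look as follows; the field
        names are illustrative.</para>

        <programlisting language="c"><![CDATA[
typedef struct frame {
    uint8_t buddy_order;  /* order of the block; BUDDY_INNER_BLOCK in
                             every entity except the first one */
    link_t buddy_link;    /* link to other blocks of the same order */
    /* ... client-specific data ... */
} frame_t;

/* An entity can represent a buddy only if it represents a block. */
bool can_be_buddy(frame_t *frame)
{
    return frame->buddy_order != BUDDY_INNER_BLOCK;
}
]]></programlisting>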
      </section>
    </section>

    <section id="slab">
      <indexterm><primary>memory management</primary><secondary>slab</secondary></indexterm>
      <title>Slab allocator</title>

      <para>The majority of memory allocation requests in the kernel are for
      small, frequently used data structures. The basic idea behind the slab
      allocator is that commonly used objects are preallocated in contiguous
      areas of physical memory called slabs<footnote>
          <para>Slabs are in fact blocks of physical memory frames allocated
          from the frame allocator.</para>
        </footnote>. Whenever an object is to be allocated, the slab
      allocator returns the first available item from a suitable slab
      corresponding to the object type<footnote>
          <para>The mechanism is rather more complicated, see the next
          paragraph.</para>
        </footnote>. Because the sizes of the requested and allocated
      objects match, the slab allocator significantly reduces internal
      fragmentation.</para>

      <para>Slabs of one object type are organized in a structure called a
      slab cache. There are usually more slabs in the slab cache, depending
      on previous allocations. If the slab cache runs out of available slabs,
      new slabs are allocated. In order to exploit parallelism and to avoid
      locking of shared spinlocks, slab caches can have variants of
      processor-private slabs called magazines. On each processor, there is a
      two-magazine cache. Full magazines that are not part of any
      per-processor magazine cache are stored in a global list of full
      magazines.</para>

      <para>Each object begins its life in a slab. When it is allocated from
      there, the slab allocator calls a constructor that is registered in the
      respective slab cache. The constructor initializes and brings the
      object into a known state. The object is then used by the user. When
      the user later frees the object, the slab allocator puts it into a
      processor-private magazine cache, from where it can be subsequently
      allocated again. Note that allocations satisfied from a magazine are
      already initialized by the constructor. When both of the processor
      cached magazines get full, the allocator will move one of the magazines
      to the list of full magazines. Similarly, when allocating from an empty
      processor magazine cache, the kernel will reload only one magazine from
      the list of full magazines. In other words, the slab allocator tries to
      keep the processor magazine cache only half-full in order to prevent
      thrashing when allocations and deallocations interleave on magazine
      boundaries.</para>
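
      <para>The described life cycle can be illustrated on a hypothetical
      cache of <code>thread_t</code> objects. The signature of
      <code>slab_cache_create</code> and the helper names are simplified
      assumptions made for this example.</para>

      <programlisting language="c"><![CDATA[
/* Constructor: bring a freshly taken object into a known state. */
static int thread_constructor(void *obj, unsigned int flags)
{
    thread_t *thread = (thread_t *) obj;
    link_initialize(&thread->rq_link);
    return 0;
}

static slab_cache_t *thread_cache;

void threads_init(void)
{
    /* One cache per object type; the constructor is registered here. */
    thread_cache = slab_cache_create("thread_t", sizeof(thread_t),
        thread_constructor, NULL /* no destructor */, 0);
}

thread_t *thread_create(void)
{
    /* May be served from a magazine, in which case the object was
     * already initialized by the constructor in the past. */
    return slab_alloc(thread_cache);
}

void thread_destroy(thread_t *thread)
{
    /* Goes into the CPU-private magazine, not back into the slab. */
    slab_free(thread_cache, thread);
}
]]></programlisting>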

      <para>Should HelenOS run short of memory, it would start deallocating
      objects from magazines, calling the slab cache destructor on them and
      putting them back into slabs. When a slab contains no allocated
      objects, it is immediately freed.</para>

      <para><figure>
          <mediaobject id="slab_alloc">
            <imageobject role="html">
              <imagedata fileref="images/slab_alloc.png" format="PNG" />
            </imageobject>
          </mediaobject>

          <title>Slab allocator scheme.</title>
        </figure></para>

      <section>
        <title>Implementation</title>

        <para>The slab allocator is closely modelled after the OpenSolaris
        slab allocator by Jeff Bonwick and Jonathan Adams with the following
        exceptions:<itemizedlist>
            <listitem>
              <para>empty slabs are immediately deallocated</para>
            </listitem>

            <listitem>
              <para>empty magazines are deallocated when not needed</para>
            </listitem>
          </itemizedlist> The following features are not currently supported
        but would be easy to do: <itemizedlist>
            <listitem>
              <para>cache coloring</para>
            </listitem>

            <listitem>
              <para>dynamic magazine growing (different magazine sizes are
              already supported, but the allocation strategy would need to
              be adjusted)</para>
            </listitem>
          </itemizedlist></para>

        <section>
          <indexterm><primary>memory management</primary><secondary>slab magazine</secondary></indexterm>
          <title>Magazine layer</title>

250 | <para>Due to the extensive bottleneck on SMP architures, caused by |
||
251 | global slab locking mechanism, making processing of all slab |
||
252 | allocation requests serialized, a new layer was introduced to the |
||
253 | classic slab allocator design. Slab allocator was extended to |
||
254 | support per-CPU caches 'magazines' to achieve good SMP scaling. |
||
255 | <termdef>Slab SMP perfromance bottleneck was resolved by introducing |
||
256 | a per-CPU caching scheme called as <glossterm>magazine |
||
257 | layer</glossterm></termdef>.</para> |
||

          <para>A magazine is an N-element cache of objects, so each
          magazine can satisfy N allocations. A magazine behaves like an
          automatic weapon magazine (LIFO, stack), so allocation and
          deallocation become simple push and pop pointer operations. The
          trick is that the CPU does not access global slab allocator data
          during allocation from its magazine, thus making parallel
          allocations between CPUs possible.</para>
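
          <para>A minimal sketch of a magazine and its stack discipline
          follows; the structure layout is an assumption made for
          illustration.</para>

          <programlisting language="c"><![CDATA[
typedef struct magazine {
    size_t busy;   /* number of objects currently cached */
    size_t size;   /* capacity N of this magazine */
    void *objs[];  /* cached object pointers (LIFO stack) */
} magazine_t;

/* Allocation from a non-empty magazine is a plain pointer pop and
 * touches no shared slab allocator state. */
static void *magazine_pop(magazine_t *mag)
{
    return (mag->busy == 0) ? NULL : mag->objs[--mag->busy];
}

/* Deallocation into a non-full magazine is a plain pointer push. */
static bool magazine_push(magazine_t *mag, void *obj)
{
    if (mag->busy == mag->size)
        return false;  /* full; the caller must exchange magazines */
    mag->objs[mag->busy++] = obj;
    return true;
}
]]></programlisting>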

          <para>The implementation also requires an additional feature: the
          CPU-bound magazine is actually a pair of magazines, in order to
          avoid thrashing when allocations and deallocations of a single
          item occur at the magazine size boundary. LIFO order is enforced,
          which should avoid fragmentation as much as possible.</para>

          <para>Another important entity of the magazine layer is the common
          full magazine list (also called a depot), which stores full
          magazines that may be used by any of the CPU magazine caches to
          reload the active CPU magazine. This list of magazines can be
          pre-filled with full magazines during initialization, but in the
          current implementation it is filled during object deallocation,
          when a CPU magazine becomes full.</para>

          <para>Slab allocator control structures are allocated from special
          slabs, which are marked by a special flag indicating that they
          should not be used for the slab magazine layer. This is done to
          avoid possible infinite recursion and deadlock during conventional
          slab allocation requests.</para>
        </section>

        <section>
          <title>Allocation/deallocation</title>

          <para>Every cache contains a list of full slabs and a list of
          partially full slabs. Empty slabs are immediately freed (thrashing
          will be avoided because of the magazines).</para>

          <para>The slab allocator allocates a lot of space and does not
          free it on its own. When the frame allocator fails to allocate a
          frame, it calls <code>slab_reclaim()</code>. It tries 'light
          reclaim' first, then brutal reclaim. The light reclaim releases
          slabs from the CPU-shared magazine list until at least one slab is
          deallocated in each cache (this algorithm should probably change).
          The brutal reclaim removes all cached objects, even from CPU-bound
          magazines.</para>
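
          <para>The reclaim policy can be sketched as follows; the helper
          names and the cache iteration are illustrative assumptions.</para>

          <programlisting language="c"><![CDATA[
/* Called when the frame allocator fails to allocate a frame.
 * Returns the number of frames given back to the frame allocator. */
size_t slab_reclaim(bool brutal)
{
    size_t frames = 0;

    for (slab_cache_t *cache = first_cache(); cache != NULL;
        cache = next_cache(cache)) {
        /* Light reclaim: free magazines from the shared depot. */
        frames += depot_reclaim(cache);

        /* Brutal reclaim: additionally empty the CPU-bound magazines. */
        if (brutal)
            frames += cpu_magazines_reclaim(cache);
    }

    return frames;
}
]]></programlisting>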

          <formalpara>
            <title>Allocation</title>

            <para><emphasis>Step 1.</emphasis> When an allocation request
            arrives, the slab allocator first checks the availability of
            memory in the local CPU-bound magazine. If the memory is there,
            the allocator simply "pops" the CPU magazine and returns the
            pointer to the object.</para>

            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
            empty, the allocator will attempt to reload the magazine,
            swapping it with the second CPU magazine, and return to the
            first step.</para>

            <para><emphasis>Step 3.</emphasis> Now we are in the situation
            when both CPU-bound magazines are empty, which makes the
            allocator access the shared full-magazines depot to reload the
            CPU-bound magazines. If the reload is successful (meaning there
            are full magazines in the depot), the algorithm continues at
            Step 1.</para>

            <para><emphasis>Step 4.</emphasis> In the final step of the
            allocation, the object is allocated from the conventional slab
            layer and a pointer is returned.</para>
          </formalpara>
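
          <para>The four steps translate into the following loop; all names
          are illustrative and the necessary locking is omitted for
          brevity.</para>

          <programlisting language="c"><![CDATA[
void *slab_alloc(slab_cache_t *cache)
{
    cpu_magazines_t *cc = &cache->mag_cache[CPU->id];

    while (true) {
        /* Step 1: pop from the active CPU-bound magazine. */
        void *obj = magazine_pop(cc->current);
        if (obj != NULL)
            return obj;

        /* Step 2: active magazine empty; swap with the second one. */
        if (!magazine_empty(cc->last)) {
            magazine_swap(&cc->current, &cc->last);
            continue;
        }

        /* Step 3: both empty; reload one from the full-magazines depot. */
        magazine_t *full = depot_get_full(cache);
        if (full != NULL) {
            depot_put_empty(cache, cc->current);
            cc->current = full;
            continue;
        }

        /* Step 4: fall back to the conventional slab layer. */
        return slab_obj_create(cache);
    }
}
]]></programlisting>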

          <formalpara>
            <title>Deallocation</title>

            <para><emphasis>Step 1.</emphasis> During a deallocation
            request, the slab allocator checks whether the local CPU-bound
            magazine is full. If it is not, the pointer to the object is
            simply pushed into the magazine.</para>

            <para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
            full, the allocator will attempt to reload the magazine,
            swapping it with the second CPU magazine, and return to the
            first step.</para>

            <para><emphasis>Step 3.</emphasis> Now we are in the situation
            when both CPU-bound magazines are full, which makes the
            allocator access the shared full-magazines depot to put one of
            the magazines into the depot and create a new empty magazine.
            The algorithm continues at Step 1.</para>
          </formalpara>
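
          <para>The deallocation path mirrors the allocation path; the same
          illustrative names are used.</para>

          <programlisting language="c"><![CDATA[
void slab_free(slab_cache_t *cache, void *obj)
{
    cpu_magazines_t *cc = &cache->mag_cache[CPU->id];

    while (true) {
        /* Step 1: push into the active magazine if it is not full. */
        if (magazine_push(cc->current, obj))
            return;

        /* Step 2: active magazine full; swap with the second one. */
        if (!magazine_full(cc->last)) {
            magazine_swap(&cc->current, &cc->last);
            continue;
        }

        /* Step 3: both full; hand one over to the depot and install
         * a fresh empty magazine, then retry Step 1. */
        depot_put_full(cache, cc->current);
        cc->current = magazine_empty_create(cache);
    }
}
]]></programlisting>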
        </section>
      </section>
    </section>

    <!-- End of Physmem -->
  </section>

  <section>
    <title>Virtual memory management</title>

    <section>
      <title>Introduction</title>

      <para>Virtual memory is a special memory management technique used by
      the kernel to achieve several mission-critical goals: <itemizedlist>
          <listitem>
            <para>Isolate each task from the other tasks that are running on
            the system at the same time.</para>
          </listitem>

          <listitem>
            <para>Allow allocation of more memory than is physically present
            in the machine.</para>
          </listitem>

          <listitem>
            <para>Allow, in general, loading and executing two programs that
            are linked to the same address without complicated
            relocations.</para>
          </listitem>
        </itemizedlist></para>

      <para><!--

        <para>
        Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
        Special address space area type - device - prohibits shrink/extend syscalls to call on it.
        Address space has link to mapping tables (hierarchical - per Address space, hash - global tables).
        </para>

      --></para>
    </section>

    <section>
      <title>Address spaces</title>

      <section>
        <indexterm><primary>memory management</primary><secondary>address space area</secondary></indexterm>
        <title>Address space areas</title>

        <para>Each address space consists of mutually disjoint contiguous
        address space areas. An address space area is precisely defined by
        its base address and the number of frames/pages it contains.</para>

        <para>Each address space area has flags that define the behaviour
        and permissions on the particular area. <itemizedlist>
            <listitem>
              <para>The <emphasis>AS_AREA_READ</emphasis> flag indicates
              reading permission.</para>
            </listitem>

            <listitem>
              <para>The <emphasis>AS_AREA_WRITE</emphasis> flag indicates
              writing permission.</para>
            </listitem>

            <listitem>
              <para>The <emphasis>AS_AREA_EXEC</emphasis> flag indicates
              code execution permission. Some architectures do not support
              execution permission restriction. In this case this flag has
              no effect.</para>
            </listitem>

            <listitem>
              <para>The <emphasis>AS_AREA_DEVICE</emphasis> flag marks the
              area as mapped to device memory.</para>
            </listitem>
          </itemizedlist></para>
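
        <para>The flags can be thought of as a bitmask checked on every
        faulting access; the values below and the check itself are purely
        illustrative assumptions.</para>

        <programlisting language="c"><![CDATA[
/* Illustrative values; the real definitions may differ. */
#define AS_AREA_READ    (1 << 0)
#define AS_AREA_WRITE   (1 << 1)
#define AS_AREA_EXEC    (1 << 2)
#define AS_AREA_DEVICE  (1 << 3)

/* Does the faulting access respect the permissions of the area? */
bool as_area_access_allowed(unsigned int area_flags, unsigned int access)
{
    return (area_flags & access) == access;
}
]]></programlisting>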

        <para>The kernel provides a set of syscalls that allow tasks to
        create, expand, shrink and share their address space areas.</para>
      </section>

      <section>
        <indexterm><primary>memory management</primary><secondary>asid</secondary></indexterm>
        <title>Address Space ID (ASID)</title>

        <para>When switching to a different task, the kernel also needs to
        switch to a different address space mapping. If the TLB cannot
        distinguish between address space mappings, all mapping information
        in the TLB from the old address space must be flushed, which can
        create unnecessary overhead during task switching. To avoid this,
        some architectures have the capability to segregate different
        address spaces at the hardware level by introducing an address space
        identifier as a part of the TLB record, telling the virtual address
        translation unit to which address space this record is
        applicable.</para>

        <para>The HelenOS kernel can take advantage of this
        hardware-supported identifier by having an ASID abstraction which is
        related to the corresponding architecture identifier. For example,
        on ia64 the kernel ASID is derived from the RID (region identifier)
        and on mips32 the kernel ASID is actually the hardware identifier.
        As expected, this ASID information record is part of the
        <emphasis>as_t</emphasis> structure.</para>

        <para>Due to hardware limitations, the hardware ASID has a limited
        length, ranging from 8 bits on mips32 to 24 bits on ia64, which
        makes it impossible to use it as a unique address space identifier
        for all tasks running in the system. In such situations, a special
        ASID stealing algorithm is used, which takes the ASID from an
        inactive task and assigns it to the active task.</para>
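
        <para>A sketch of such a stealing policy follows; every name in it
        is illustrative, as the concrete algorithm is architecture
        dependent.</para>

        <programlisting language="c"><![CDATA[
/* Return an ASID for the address space being activated. */
asid_t asid_get(void)
{
    if (free_asids_count() > 0)
        return asid_find_free();

    /* Pool exhausted: steal the ASID of an inactive address space. */
    as_t *victim = inactive_as_with_asid_pick();
    asid_t asid = victim->asid;
    victim->asid = ASID_INVALID;  /* victim re-acquires an ASID later */

    /* Entries tagged with the stolen ASID are now stale. */
    tlb_invalidate_asid(asid);
    return asid;
}
]]></programlisting>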

        <para><classname>ASID stealing algorithm here.</classname></para>
      </section>
    </section>

    <section>
      <title>Virtual address translation</title>

      <section id="paging">
        <title>Paging</title>

        <section>
          <title>Introduction</title>

          <para>Virtual memory usually uses a paged memory model, where the
          virtual address space is divided into <emphasis>pages</emphasis>
          (usually 4096 bytes in size) and physical memory is divided into
          frames (of the same size as a page, of course). Each page may be
          mapped to some frame and then, upon a memory access to a virtual
          address, the CPU performs <emphasis>address
          translation</emphasis> during the instruction execution. A
          non-existing mapping generates a page fault exception, which
          invokes the kernel exception handler, thus allowing the kernel to
          manipulate the rules of memory access. Information about page
          mappings is stored by the kernel in page tables.</para>

          <para>The majority of architectures use multi-level page tables,
          which means that physical memory must be accessed several times
          before the physical address is obtained. This fact would cause
          serious performance overhead in virtual memory management. To
          avoid it, the <link linkend="tlb">Translation Lookaside Buffer
          (TLB)</link> is used.</para>

          <para>The HelenOS kernel has two different approaches to the
          paging implementation: <emphasis>4-level page tables</emphasis>
          and a <emphasis>global hash table</emphasis>, which are accessible
          via a generic paging abstraction layer. These two variants exist
          because of the major architectural differences between the
          supported platforms. The abstraction is implemented with the help
          of a global structure of pointers to the basic mapping functions,
          <emphasis>page_mapping_operations</emphasis>. To achieve different
          page table functionality, the corresponding layer must implement
          the functions declared in
          <emphasis>page_mapping_operations</emphasis>.</para>
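
          <para>The shape of such an operations structure might be as
          follows; the exact signatures are simplified for
          illustration.</para>

          <programlisting language="c"><![CDATA[
typedef struct {
    /* Map virtual page to physical frame in address space as. */
    void (*mapping_insert)(as_t *as, uintptr_t page, uintptr_t frame,
        unsigned int flags);
    /* Undo a previously inserted mapping. */
    void (*mapping_remove)(as_t *as, uintptr_t page);
    /* Find the page table entry for page, or return NULL. */
    pte_t *(*mapping_find)(as_t *as, uintptr_t page);
} page_mapping_operations_t;

/* Chosen during boot: hierarchical page tables or global hash table. */
extern page_mapping_operations_t *page_mapping_operations;

/* Generic wrapper used by the rest of the kernel. */
void page_mapping_insert(as_t *as, uintptr_t page, uintptr_t frame,
    unsigned int flags)
{
    page_mapping_operations->mapping_insert(as, page, frame, flags);
}
]]></programlisting>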

          <para>Thanks to the abstract paging interface, room was left for
          more paging implementations (besides the already implemented
          hierarchical page tables and the hash table), for example B-Tree
          based page tables.</para>
        </section>

        <section>
          <title>Hierarchical 4-level page tables</title>

          <para>Hierarchical 4-level page tables are a generalization of the
          hardware capabilities of most architectures. Each address space
          has its own page tables.<itemizedlist>
              <listitem>
                <para>ia32 uses 2-level page tables, with full hardware
                support.</para>
              </listitem>

              <listitem>
                <para>amd64 uses 4-level page tables, also coming with full
                hardware support.</para>
              </listitem>

              <listitem>
                <para>mips32 and ppc32 have 2-level page tables with
                software-simulated support.</para>
              </listitem>
            </itemizedlist></para>
        </section>

        <section>
          <indexterm><primary>memory management</primary><secondary>hash table</secondary></indexterm>
          <title>Global hash table</title>

          <para>The implementation of the global hash table was encouraged
          by the ia64 architecture support. One of the major differences
          between the global hash table and the hierarchical tables is that
          the global hash table exists only once in the system, whereas the
          hierarchical tables are maintained per address space.</para>

          <para>Thus, the hash table contains information about all address
          space mappings in the system, so the hash of an entry must be
          derived from both the address space pointer (or id) and the
          virtual address of the page. The generic hash table implementation
          assumes that the pointers to the different address spaces are
          likely to lie close to each other, so it uses their least
          significant bits for the hash; it also assumes that the virtual
          page addresses have roughly the same probability of occurring, so
          the least significant bits of the VPN compose the hash
          index.</para>
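
          <para>Combining the two assumptions gives a hash of roughly this
          shape; the constants and the mixing of the two components are
          illustrative.</para>

          <programlisting language="c"><![CDATA[
#define PAGE_WIDTH  12                       /* 4 KiB pages (example) */
#define HASH_BITS   20                       /* example table size */
#define HASH_MASK   ((1UL << HASH_BITS) - 1)

/* Hash of (address space, virtual address) for the global table. */
size_t pte_hash(as_t *as, uintptr_t vaddr)
{
    uintptr_t vpn = vaddr >> PAGE_WIDTH;     /* virtual page number */

    /* The least significant bits of both the VPN and the address
     * space pointer contribute to the index. */
    return ((vpn ^ (uintptr_t) as) & HASH_MASK);
}
]]></programlisting>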

          <para>Collision chains ...</para>
        </section>
      </section>

      <section id="tlb">
        <indexterm><primary>memory management</primary><secondary>tlb</secondary></indexterm>
        <title>Translation Lookaside Buffer</title>

        <para>Due to the extensive overhead of page mapping lookups in the
        page tables, all architectures have a fast associative cache memory
        built into the CPU. This memory, called the TLB, stores recently
        used page table entries.</para>

        <section id="tlb_shootdown">
          <title>TLB consistency. TLB shootdown algorithm.</title>

          <para>The operating system is responsible for keeping the TLB
          consistent by invalidating its contents whenever there is some
          change in the page tables. Those changes may occur when a page or
          a group of pages is unmapped, when a mapping is changed or when
          the system switches the active address space in order to schedule
          a new task. Moreover, this invalidation operation must be done on
          all system CPUs because each CPU has its own independent TLB
          cache. Thus, maintaining TLB consistency in an SMP configuration
          is not as trivial a task as it might look at first glance. A naive
          solution would assume that the CPU which wants to invalidate the
          TLB can simply invalidate the TLB caches of other CPUs as well.
          This is not possible on most architectures because of the simple
          fact that flushing the TLB is allowed only on the local CPU and
          there is no way to access other CPUs' TLB caches and thus
          invalidate them remotely.</para>

581 | <para>Technique of remote invalidation of TLB entries is called "TLB |
||
582 | shootdown". HelenOS uses a variation of the algorithm described by |
||
583 | D. Black et al., "Translation Lookaside Buffer Consistency: A |
||
584 | Software Approach," Proc. Third Int'l Conf. Architectural Support |
||
585 | for Programming Languages and Operating Systems, 1989, pp. |
||
586 | 113-122.</para> |
||
587 | |||
          <para>As the situation demands, partial invalidation of the TLB
          caches may be needed. In the case of a simple memory mapping
          change, it is necessary to invalidate only one or several adjacent
          pages. If the architecture is aware of ASIDs, then whenever the
          kernel needs to take an ASID away so that it can be used by
          another task, it invalidates only the entries belonging to that
          particular address space. The final option is complete TLB cache
          invalidation, an operation that flushes all entries in the
          TLB.</para>

          <para>TLB shootdown is performed in two phases.</para>

          <formalpara>
            <title>Phase 1.</title>

            <para>First, the initiator locks the global TLB spinlock. Then a
            request is put into the local request cache of every other CPU
            in the system, each cache being protected by its own spinlock.
            In case a cache is full, all requests in that cache are replaced
            by one request indicating a global TLB flush. The initiator then
            sends an IPI message indicating the TLB shootdown request to the
            rest of the CPUs and actively waits until all of them confirm
            the TLB invalidation request by setting a special flag. After
            setting this flag, each of these CPUs blocks on the global TLB
            spinlock, which is held by the initiator.</para>
          </formalpara>

          <formalpara>
            <title>Phase 2.</title>

            <para>At this point, all CPUs are waiting on the TLB spinlock to
            execute their TLB invalidation actions and have indicated this
            intention to the initiator. The initiator continues by cleaning
            up its own TLB and then releases the global TLB spinlock. After
            that, the other CPUs acquire and immediately release the TLB
            spinlock, and perform their TLB invalidation actions.</para>
          </formalpara>
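
          <para>Both phases are condensed into the following sketch; the
          names are illustrative and the per-CPU message bookkeeping is
          simplified.</para>

          <programlisting language="c"><![CDATA[
/* Executed by the initiator of the shootdown. */
void tlb_shootdown(as_t *as, uintptr_t page, size_t count)
{
    /* Phase 1: publish the request and gather acknowledgements. */
    spinlock_lock(&tlb_global_lock);
    for (size_t i = 0; i < cpu_count; i++) {
        if (i == CPU->id)
            continue;
        spinlock_lock(&cpus[i].tlb_messages_lock);
        if (!message_enqueue(&cpus[i], as, page, count))
            message_make_global_flush(&cpus[i]);  /* cache was full */
        spinlock_unlock(&cpus[i].tlb_messages_lock);
    }
    ipi_broadcast(IPI_TLB_SHOOTDOWN);
    wait_for_all_acknowledgements();  /* active waiting; the other CPUs
                                         then block on tlb_global_lock */

    /* The initiator invalidates its own TLB entries. */
    tlb_invalidate_local(as, page, count);

    /* Phase 2: releasing the lock lets the blocked CPUs acquire and
     * immediately release it, then process their message caches. */
    spinlock_unlock(&tlb_global_lock);
}
]]></programlisting>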
        </section>
      </section>
    </section>
  </section>
</chapter>