WebSVN – HelenOS-doc – Blame – /design/trunk/src/ch_memory_management.xml

<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
<?dbhtml filename="mm.html"?>

<title>Memory management</title>

<para>In previous chapters, this book described the scheduling subsystem as
the creator of the impression that threads execute in parallel. The memory
management subsystem, on the other hand, creates the impression that there
is enough physical memory for the kernel and that userspace tasks have the
entire address space to themselves.</para>

<section>
<title>Physical memory management</title>

<section id="zones_and_frames">
<title>Zones and frames</title>

<para>HelenOS represents contiguous areas of physical memory in
structures called frame zones (abbreviated as zones). Each zone contains
information about the number of allocated and unallocated physical
memory frames, as well as the physical base address of the zone and the
number of frames contained in it. A zone also contains an array of frame
structures describing each frame of the zone. Last but not least, each
zone is equipped with a buddy system that facilitates efficient
allocation of power-of-two sized blocks of frames.</para>

<para>This organization of physical memory provides good preconditions
for hot-plugging of more zones. There is also one currently unused zone
attribute: <code>flags</code>. The attribute could be used to give a
special meaning to some zones in the future.</para>

<para>The zones are linked in a doubly-linked list. This might seem a
bit inefficient because the zone list is walked every time a frame is
allocated or deallocated. However, this does not represent a significant
performance problem, as the number of zones is expected to be rather
low. Moreover, most architectures merge all zones into one.</para>

<para>For each physical memory frame found in a zone, there is a frame
structure that contains the number of references and the data used by the
buddy system.</para>
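<para>The following C sketch illustrates these structures. It models a zone with its array of frame structures and walks the zone list to find the zone containing a given physical address. This is a minimal sketch under stated assumptions. The names <code>zone_t</code>, <code>frame_t</code>, <code>zone_find()</code> and <code>zone_frame()</code>, the field layout and the 4096-byte frame size are hypothetical and are not taken from the HelenOS sources.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of a frame zone. */
#define FRAME_SIZE 4096u              /* assumed frame size */

typedef struct frame {
    unsigned refcount;                /* number of references to the frame */
    uint8_t buddy_order;              /* data used by the buddy system */
} frame_t;

typedef struct zone {
    struct zone *prev, *next;         /* doubly-linked list of zones */
    uintptr_t base;                   /* physical base address */
    size_t count;                     /* number of frames in the zone */
    size_t free_count;                /* unallocated frames */
    size_t busy_count;                /* allocated frames */
    unsigned flags;                   /* currently unused */
    frame_t *frames;                  /* one frame_t per physical frame */
} zone_t;

/* Walk the zone list and return the zone containing physical address
 * 'addr', or NULL. The walk is linear in the number of zones, which is
 * expected to be small. */
zone_t *zone_find(zone_t *head, uintptr_t addr)
{
    for (zone_t *z = head; z != NULL; z = z->next) {
        if (addr >= z->base && addr - z->base < z->count * FRAME_SIZE)
            return z;
    }
    return NULL;
}

/* Return the frame structure describing physical address 'addr' in 'z'. */
frame_t *zone_frame(zone_t *z, uintptr_t addr)
{
    return &z->frames[(addr - z->base) / FRAME_SIZE];
}
```
]]></programlisting>

<para>The frame structure is reached by simple index arithmetic once the zone is known. Only the zone lookup depends on the number of zones.</para>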
</section>

<section id="frame_allocator">
<title>Frame allocator</title>

<para>The frame allocator satisfies kernel requests to allocate
power-of-two sized blocks of physical memory. Because of the zonal
organization of physical memory, the frame allocator always works
within the context of some frame zone. In order to carry out the
allocation requests, the frame allocator is tightly integrated with the
buddy system belonging to the zone. The frame allocator is also
responsible for updating information about the number of free and busy
frames in the zone. <figure>
<title>Frame allocator scheme.</title>

<mediaobject id="frame_alloc">
<imageobject role="html">
<imagedata fileref="images/frame_alloc.png" format="PNG" />
</imageobject>

<imageobject role="fop">
<imagedata fileref="images.vector/frame_alloc.svg" format="SVG" />
</imageobject>
</mediaobject>
</figure></para>

<formalpara>
<title>Allocation / deallocation</title>

<para>Upon an allocation request via the function <code>frame_alloc</code>,
the frame allocator first tries to find a zone that can satisfy the
request (i.e. one that has the required number of free frames). Once a
suitable zone is found, the frame allocator uses the buddy allocator on
the zone's buddy system to perform the allocation. During deallocation,
which is triggered by a call to <code>frame_free</code>, the frame
allocator looks up the zone that contains the frame being
deallocated. Afterwards, it calls the buddy allocator again, this time
to take care of deallocation within the zone's buddy system.</para>
</formalpara>
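<para>Both paths can be sketched in C as follows. This is a hypothetical model with several assumptions. The zone's buddy system is replaced by a trivial stand-in, <code>zone_buddy_alloc()</code>, which does a naturally aligned first-fit search over per-frame busy flags. Frame numbers are used instead of physical addresses. The signatures of <code>frame_alloc()</code> and <code>frame_free()</code> shown here need not match the kernel's. Skipping a zone whose free frames are too fragmented is also a choice made only for this sketch.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: names and signatures are illustrative only. */
#define MAX_FRAMES 64

typedef struct zone {
    struct zone *next;                /* next zone in the zone list */
    size_t base;                      /* first frame number of the zone */
    size_t count;                     /* number of frames in the zone */
    size_t free_count, busy_count;
    bool busy[MAX_FRAMES];            /* stand-in buddy state */
    unsigned char order[MAX_FRAMES];  /* block order stored at block start */
} zone_t;

/* Stand-in for the zone's buddy allocator: first naturally aligned run of
 * 2^order free frames. Returns a zone-relative index, or -1. */
static long zone_buddy_alloc(zone_t *z, unsigned order)
{
    size_t n = (size_t)1 << order;
    for (size_t i = 0; i + n <= z->count; i += n) {
        size_t j = 0;
        while (j < n && !z->busy[i + j])
            j++;
        if (j == n) {
            for (j = 0; j < n; j++)
                z->busy[i + j] = true;
            z->order[i] = (unsigned char)order;
            return (long)i;
        }
    }
    return -1;
}

/* Allocate 2^order frames; returns an absolute frame number or -1. */
long frame_alloc(zone_t *zones, unsigned order)
{
    size_t n = (size_t)1 << order;
    for (zone_t *z = zones; z != NULL; z = z->next) {
        if (z->free_count < n)
            continue;                 /* zone cannot satisfy the request */
        long idx = zone_buddy_alloc(z, order);
        if (idx < 0)
            continue;                 /* enough frames, but too fragmented */
        z->free_count -= n;
        z->busy_count += n;
        return (long)z->base + idx;
    }
    return -1;
}

/* Free the block starting at absolute frame number 'frame'. */
void frame_free(zone_t *zones, size_t frame)
{
    for (zone_t *z = zones; z != NULL; z = z->next) {
        if (frame < z->base || frame - z->base >= z->count)
            continue;                 /* frame is not in this zone */
        size_t idx = frame - z->base;
        size_t n = (size_t)1 << z->order[idx];
        for (size_t j = 0; j < n; j++)
            z->busy[idx + j] = false; /* stand-in buddy deallocation */
        z->free_count += n;
        z->busy_count -= n;
        return;
    }
}
```
]]></programlisting>

<para>The sketch shows the division of labour. The frame allocator selects the zone and keeps its free and busy counters. The placement of the block inside the zone is delegated to the zone's buddy system.</para>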
</section>

<section id="buddy_allocator">
<title>Buddy allocator</title>

<para>In the buddy system, memory is broken down into power-of-two
sized, naturally aligned blocks. These blocks are organized in an array
of lists, in which the list with index i contains all unallocated blocks
of size <mathphrase>2<superscript>i</superscript></mathphrase>. The
index i is called the order of the block. Should there be two adjacent
equally sized blocks in list <mathphrase>i</mathphrase> (i.e. buddies), the
buddy allocator would coalesce them and put the resulting block in list
<mathphrase>i + 1</mathphrase>, provided that the resulting block would
be naturally aligned. Similarly, when the allocator is asked to
allocate a block of size
<mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
to satisfy the request from the list with index i. If the request cannot
be satisfied (i.e. list i is empty), the buddy allocator tries to
allocate and split a larger block from the list with index i + 1. Both
of these algorithms are recursive. The recursion ends either when there
are no blocks to coalesce in the former case or when there are no blocks
that can be split in the latter case.</para>
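<para>The C sketch below models the buddy algorithm over frame indices. It is not the HelenOS implementation. The free lists are plain index arrays, the system manages a single block of <code>2^MAX_ORDER</code> frames, and all names are hypothetical. The buddy of a naturally aligned block of order i that starts at index b is the block starting at index <mathphrase>b XOR 2<superscript>i</superscript></mathphrase>, which is what makes coalescing cheap. Splitting is written recursively. Coalescing is written as a loop, which is equivalent to the recursive formulation above.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Minimal buddy system over 2^MAX_ORDER frames (illustrative only). */
#define MAX_ORDER 4
#define NFRAMES (1 << MAX_ORDER)
#define NONE (-1)

static int head[MAX_ORDER + 1];   /* free list heads, one per order */
static int next_free[NFRAMES];    /* link to the next free block */
static int free_order[NFRAMES];   /* order if block start is free, else NONE */

static void list_push(int blk, int order)
{
    next_free[blk] = head[order];
    head[order] = blk;
    free_order[blk] = order;
}

/* Unlink 'blk' from list 'order'; blk must be on that list. */
static void list_remove(int blk, int order)
{
    int *p = &head[order];
    while (*p != blk)
        p = &next_free[*p];
    *p = next_free[blk];
    free_order[blk] = NONE;
}

void buddy_init(void)
{
    for (int i = 0; i <= MAX_ORDER; i++)
        head[i] = NONE;
    for (int i = 0; i < NFRAMES; i++)
        free_order[i] = NONE;
    list_push(0, MAX_ORDER);      /* one naturally aligned block */
}

/* Allocate 2^order frames; returns the first frame index or NONE. */
int buddy_alloc(int order)
{
    if (order > MAX_ORDER)
        return NONE;              /* no larger block can be split */
    if (head[order] != NONE) {
        int blk = head[order];
        list_remove(blk, order);
        return blk;
    }
    /* List 'order' is empty: split a block of order + 1 (recursively). */
    int big = buddy_alloc(order + 1);
    if (big == NONE)
        return NONE;
    list_push(big + (1 << order), order);   /* upper half stays free */
    return big;
}

/* Free a block and coalesce it with its buddy while the buddy is free. */
void buddy_free(int blk, int order)
{
    while (order < MAX_ORDER) {
        int buddy = blk ^ (1 << order);      /* naturally aligned buddy */
        if (free_order[buddy] != order)
            break;                           /* buddy is busy or split */
        list_remove(buddy, order);
        blk &= ~(1 << order);                /* start of the merged block */
        order++;
    }
    list_push(blk, order);
}
```
]]></programlisting>

<para>In this model, allocating a single frame from a fresh system splits the block four times. The lists for orders 0 to 3 are then left holding the free upper halves, and freeing everything coalesces them back into one block.</para>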

<para>This approach greatly reduces external fragmentation of memory and
helps in allocating bigger contiguous blocks of memory aligned to their
size. On the other hand, the buddy allocator suffers from increased
internal fragmentation, because every request is rounded up to a power of
two, and it is therefore not suitable for general kernel
allocations. This purpose is better addressed by the <link
linkend="slab">slab allocator</link>.<figure>
<title>Buddy system scheme.</title>

<mediaobject id="buddy_alloc">
<imageobject role="html">
<imagedata fileref="images/buddy_alloc.png" format="PNG" />
</imageobject>

<imageobject role="fop">
<imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" />
</imageobject>
</mediaobject>
</figure></para>

<section>
<title>Implementation</title>

<para>The buddy allocator is, in fact, an abstract framework which can
be easily specialized to serve one particular task. It knows nothing
about the nature of the memory it helps to allocate. To compensate for
this lack of knowledge, the buddy allocator exports an interface that
each of its clients is required to implement. When supplied with an
implementation of this interface, the buddy allocator can use
specialized external functions to find a buddy for a block, split and
coalesce blocks, manipulate block order and mark blocks busy or
available.</para>
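<para>The following C sketch shows what such a client-supplied interface can look like. Generic buddy code manipulates otherwise opaque blocks through a table of function pointers. The client's entities are the elements of a frame array. The member names mirror the operations listed above but are assumptions, and the real HelenOS interface may name or shape them differently. <code>buddy_split()</code> is a hypothetical example of generic code that touches blocks only through the table.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct block block_t;     /* opaque to the generic buddy code */

/* Hypothetical operations table supplied by each client. */
typedef struct {
    block_t *(*find_buddy)(void *ctx, block_t *blk);
    block_t *(*bisect)(void *ctx, block_t *blk);     /* returns upper half */
    block_t *(*coalesce)(void *ctx, block_t *a, block_t *b);
    void (*set_order)(void *ctx, block_t *blk, uint8_t order);
    uint8_t (*get_order)(void *ctx, block_t *blk);
    void (*mark_busy)(void *ctx, block_t *blk);
    void (*mark_available)(void *ctx, block_t *blk);
} buddy_ops_t;

/* Generic code: split a block once, using only the client's operations. */
block_t *buddy_split(const buddy_ops_t *ops, void *ctx, block_t *blk)
{
    uint8_t order = ops->get_order(ctx, blk);
    block_t *hi = ops->bisect(ctx, blk);
    ops->set_order(ctx, blk, (uint8_t)(order - 1));
    ops->set_order(ctx, hi, (uint8_t)(order - 1));
    ops->mark_available(ctx, hi);
    return hi;
}

/* Client: one entity per frame in an array. */
struct block {
    uint8_t order;
    bool busy;
};

typedef struct {
    block_t *base;                /* first entity of the array */
} frame_ctx_t;

static block_t *f_find_buddy(void *ctx, block_t *b)
{
    block_t *base = ((frame_ctx_t *)ctx)->base;
    size_t i = (size_t)(b - base);
    return base + (i ^ ((size_t)1 << b->order));
}

static block_t *f_bisect(void *ctx, block_t *b)
{
    (void)ctx;
    return b + ((size_t)1 << (b->order - 1));
}

static block_t *f_coalesce(void *ctx, block_t *a, block_t *b)
{
    (void)ctx;
    return a < b ? a : b;         /* lower half represents the merged block */
}

static void f_set_order(void *ctx, block_t *b, uint8_t o) { (void)ctx; b->order = o; }
static uint8_t f_get_order(void *ctx, block_t *b) { (void)ctx; return b->order; }
static void f_mark_busy(void *ctx, block_t *b) { (void)ctx; b->busy = true; }
static void f_mark_available(void *ctx, block_t *b) { (void)ctx; b->busy = false; }

static const buddy_ops_t frame_ops = {
    f_find_buddy, f_bisect, f_coalesce, f_set_order,
    f_get_order, f_mark_busy, f_mark_available
};
```
]]></programlisting>

<para>A different client, for example one that manages slots in some other table, would supply its own functions. The generic code would stay unchanged.</para>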

<formalpara>
<title>Data organization</title>

<para>Each entity allocable by the buddy allocator is required to
contain space for storing the block order number and a link variable
used to interconnect blocks within the same order.</para>

<para>Whatever entities are allocated by the buddy allocator, the
first entity within a block is used to represent the entire block.
The first entity keeps the order of the whole block. Other entities
within the block are assigned the magic value
<constant>BUDDY_INNER_BLOCK</constant>. This is especially important
for effective identification of buddies in a one-dimensional array,
because the entity that represents a potential buddy cannot be
associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
is associated with <constant>BUDDY_INNER_BLOCK</constant>, then it is
not a buddy).</para>
</formalpara>
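<para>The convention can be illustrated with a small C sketch. The numeric value chosen for <constant>BUDDY_INNER_BLOCK</constant>, the <code>entity_t</code> type and both helper functions are assumptions. A block of order i starting at index b has its candidate buddy at index <mathphrase>b XOR 2<superscript>i</superscript></mathphrase>. The candidate qualifies only if it heads a block of the same order, so an entity marked as inner is rejected immediately. Whether the buddy is also free is a separate check that the sketch leaves out.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value for the marker; the real constant may differ. */
#define BUDDY_INNER_BLOCK 0xff

typedef struct {
    uint8_t order;                /* block order, or BUDDY_INNER_BLOCK */
    /* a list link used within one order would live here as well */
} entity_t;

/* Record that entities [first, first + 2^order) form one block: the first
 * entity keeps the order, all others are marked as inner. */
void mark_block(entity_t *e, size_t first, uint8_t order)
{
    size_t n = (size_t)1 << order;
    e[first].order = order;
    for (size_t i = 1; i < n; i++)
        e[first + i].order = BUDDY_INNER_BLOCK;
}

/* Return the buddy of the block starting at 'first', or NULL if the
 * candidate entity does not head a block of the same order. */
entity_t *find_buddy(entity_t *e, size_t count, size_t first)
{
    uint8_t order = e[first].order;
    size_t cand = first ^ ((size_t)1 << order);
    if (cand >= count)
        return NULL;
    if (e[cand].order == BUDDY_INNER_BLOCK)
        return NULL;              /* inside another block: never a buddy */
    if (e[cand].order != order)
        return NULL;              /* heads a block of a different order */
    return &e[cand];
}
```
]]></programlisting>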
</section>
</section>

<section id="slab">
<title>Slab allocator</title>

<para>The majority of memory allocation requests in the kernel are for
small, frequently used data structures. The basic idea behind the slab
allocator is that commonly used objects are preallocated in contiguous
areas of physical memory called slabs<footnote>
<para>Slabs are in fact blocks of physical memory frames allocated
from the frame allocator.</para>
</footnote>. Whenever an object is to be allocated, the slab allocator
returns the first available item from a suitable slab corresponding to
the object type<footnote>
<para>The mechanism is rather more complicated, see the next
paragraph.</para>
</footnote>. Because the sizes of the requested and the
allocated objects match, the slab allocator significantly reduces
internal fragmentation.</para>

<para>Slabs of one object type are organized in a structure called a slab
cache. A slab cache usually contains several slabs, depending on
previous allocations. If the slab cache runs out of available slabs,
new slabs are allocated. In order to exploit parallelism and to avoid
locking of shared spinlocks, slab caches can have processor-private
object caches called magazines. On each processor, there is a
two-magazine cache. Full magazines that are not part of any
per-processor magazine cache are stored in a global list of full
magazines.</para>

<para>Each object begins its life in a slab. When it is allocated from
there, the slab allocator calls a constructor that is registered in the
respective slab cache. The constructor initializes the object and brings
it into a known state. The object is then used by the user. When the user
later frees the object, the slab allocator puts it into a
processor-private magazine cache, from where it can preferentially be
allocated again. Note that allocations satisfied from a magazine are
already initialized by the constructor. When both of the processor's
cached magazines become full, the allocator moves one of the magazines to
the list of full magazines. Similarly, when allocating from an empty
processor magazine cache, the kernel reloads only one magazine from
the list of full magazines. In other words, the slab allocator tries to
keep the processor magazine cache only half-full in order to prevent
thrashing when allocations and deallocations interleave on magazine
boundaries.</para>

<para>Should HelenOS run short of memory, it would start deallocating
objects from magazines, calling the slab cache destructor on them and
putting them back into slabs. When a slab contains no allocated objects,
it is immediately freed.</para>

<para><figure>
<title>Slab allocator scheme.</title>

<mediaobject id="slab_alloc">
<imageobject role="html">
<imagedata fileref="images/slab_alloc.png" format="PNG" />
</imageobject>
</mediaobject>
</figure></para>

<section>
<title>Implementation</title>

<para>The slab allocator is closely modelled after the OpenSolaris slab
allocator by Jeff Bonwick and Jonathan Adams, with the following
exceptions:<itemizedlist>
<listitem>
<para>empty slabs are immediately deallocated</para>
</listitem>

<listitem>
<para>empty magazines are deallocated when not needed</para>
</listitem>
</itemizedlist> The following features are not currently supported but
would be easy to add: <itemizedlist>
<listitem>
<para>cache coloring</para>
</listitem>

<listitem>
<para>dynamic magazine growth (different magazine sizes are already supported, but the allocation strategy would need to be adjusted)</para>
</listitem>
</itemizedlist></para>

<section>
<title>Magazine layer</title>

<para>The global slab locking mechanism serializes the processing of all
slab allocation requests, which causes a severe bottleneck on SMP
architectures. A new layer was therefore introduced into the classic
slab allocator design. The slab allocator was extended to support
per-CPU caches, called magazines, to achieve good SMP scaling.
<termdef>The slab SMP performance bottleneck was resolved by introducing
a per-CPU caching scheme called the <glossterm>magazine
layer</glossterm></termdef>.</para>

<para>A magazine is an N-element cache of objects, so each magazine can
satisfy N allocations. A magazine behaves like an automatic weapon
magazine (LIFO, i.e. a stack), so allocation and deallocation become
simple pointer pop and push operations. The trick is that a CPU does not
access global slab allocator data during an allocation from its
magazine, which makes parallel allocations on different CPUs possible.</para>
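<para>A magazine reduces to a fixed-size stack of object pointers. The sketch below assumes a capacity of four and uses hypothetical names. Each magazine is private to one CPU, so the push and pop operations take no lock. The sketch assumes the caller cannot be migrated to another CPU while it manipulates the magazine.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sketch of a magazine as a LIFO stack; MAG_SIZE is an assumption. */
#define MAG_SIZE 4

typedef struct {
    size_t busy;                  /* number of cached objects */
    void *objs[MAG_SIZE];
} magazine_t;

/* Allocation: pop the most recently freed object, or NULL if empty. */
void *mag_pop(magazine_t *m)
{
    if (m->busy == 0)
        return NULL;
    return m->objs[--m->busy];
}

/* Deallocation: push the object; returns false if the magazine is full. */
bool mag_push(magazine_t *m, void *obj)
{
    if (m->busy == MAG_SIZE)
        return false;
    m->objs[m->busy++] = obj;
    return true;
}
```
]]></programlisting>

<para>Returning the most recently freed object first tends to hand out memory that is still warm in the CPU cache. This is a general property of LIFO reuse rather than something the sketch measures.</para>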

<para>The implementation also requires another feature: the CPU-bound
magazine is actually a pair of magazines. This avoids thrashing when a
single item is repeatedly allocated and deallocated at the magazine size
boundary. LIFO order is enforced, which should avoid fragmentation as
much as possible.</para>

<para>Another important entity of the magazine layer is the common full
magazine list (also called a depot). It stores full magazines that
may be used by any of the CPU magazine caches to reload the active CPU
magazine. This list of magazines could be pre-filled with full
magazines during initialization, but in the current implementation it is
filled during object deallocation, when a CPU magazine becomes
full.</para>

<para>Slab allocator control structures are allocated from special
slabs that are marked by a flag indicating that they must not use the
magazine layer. This is done to avoid possible infinite recursion and
deadlocks during conventional slab allocation requests.</para>
</section>

<section>
<title>Allocation/deallocation</title>

<para>Every cache contains a list of full slabs and a list of partially
full slabs. Empty slabs are immediately freed (thrashing is avoided
thanks to the magazines).</para>

<para>The slab allocator allocates lots of space and does not free
it on its own. When the frame allocator fails to allocate a frame, it calls
<code>slab_reclaim()</code>. This function tries a light reclaim first,
then a brutal reclaim. The light reclaim releases slabs from the
CPU-shared magazine list until at least one slab is deallocated in each
cache (this algorithm should probably change). The brutal reclaim removes
all cached objects, even those in CPU-bound magazines.</para>

<formalpara>
<title>Allocation</title>

<para><emphasis>Step 1.</emphasis> When an allocation request arrives,
the slab allocator first checks whether an object is available in the
local CPU-bound magazine. If it is, the allocator simply pops the CPU
magazine and returns the pointer to the object.</para>

<para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
empty, the allocator attempts to reload it by swapping it with the
second CPU magazine and returns to Step 1.</para>

<para><emphasis>Step 3.</emphasis> At this point both CPU-bound
magazines are empty, so the allocator accesses the shared depot of full
magazines to reload a CPU-bound magazine. If the reload is successful
(meaning there are full magazines in the depot), the algorithm continues
at Step 1.</para>

<para><emphasis>Step 4.</emphasis> This is the final step of the
allocation. The object is allocated from the conventional slab layer
and the pointer is returned.</para>
</formalpara>
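<para>The allocation steps can be modelled by the C sketch below, which is simplified under several assumptions. There is no locking, although Step 3 would need the depot lock. The conventional slab layer is replaced by a hypothetical stand-in, <code>slab_layer_alloc()</code>. The magazine that becomes empty in Step 3 is simply dropped instead of being freed. All type and function names are illustrative.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: names are hypothetical and locking is omitted. */
#define MAG_SIZE 4

typedef struct magazine {
    struct magazine *next;        /* link in the depot */
    size_t busy;                  /* number of cached objects */
    void *objs[MAG_SIZE];
} magazine_t;

typedef struct {
    magazine_t *current;          /* active CPU magazine */
    magazine_t *last;             /* second CPU magazine */
} cpu_cache_t;

static magazine_t *depot;         /* shared list of full magazines */
static int slab_layer_calls;

/* Hypothetical stand-in for allocation from the conventional slab layer. */
static void *slab_layer_alloc(void)
{
    static int pool[16];
    return &pool[slab_layer_calls++ % 16];
}

void *cache_alloc(cpu_cache_t *cc)
{
    for (;;) {
        /* Step 1: pop an object from the current CPU magazine. */
        if (cc->current != NULL && cc->current->busy > 0)
            return cc->current->objs[--cc->current->busy];

        /* Step 2: swap in the second CPU magazine if it holds objects. */
        if (cc->last != NULL && cc->last->busy > 0) {
            magazine_t *tmp = cc->current;
            cc->current = cc->last;
            cc->last = tmp;
            continue;
        }

        /* Step 3: both are empty; reload a full magazine from the depot.
         * The empty magazine is simply dropped in this sketch. */
        if (depot != NULL) {
            cc->current = depot;
            depot = depot->next;
            continue;
        }

        /* Step 4: fall back to the conventional slab layer. */
        return slab_layer_alloc();
    }
}
```
]]></programlisting>

<para>The sketch swaps the two CPU magazines only when the second one still holds objects. Otherwise the swap would achieve nothing, so it proceeds directly to the depot.</para>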

<formalpara>
<title>Deallocation</title>

<para><emphasis>Step 1.</emphasis> During a deallocation request, the
slab allocator checks whether the local CPU-bound magazine is full. If it
is not, the allocator simply pushes the pointer onto this
magazine.</para>

<para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
full, the allocator attempts to reload it by swapping it with the
second CPU magazine and returns to Step 1.</para>

<para><emphasis>Step 3.</emphasis> At this point both CPU-bound
magazines are full, so the allocator accesses the shared depot of full
magazines, puts one of the magazines into the depot and creates a new
empty magazine. The algorithm continues at Step 1.</para>
</formalpara>
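<para>The deallocation steps mirror the allocation path. The sketch below again omits locking and uses hypothetical names. It makes three further assumptions: new magazines come from <code>calloc()</code> rather than from a dedicated slab cache, the older of the two full CPU magazines is the one moved to the depot in Step 3, and a CPU cache may start with no magazines at all.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Sketch only: names are hypothetical and locking is omitted. */
#define MAG_SIZE 2

typedef struct magazine {
    struct magazine *next;        /* link in the depot */
    size_t busy;                  /* number of cached objects */
    void *objs[MAG_SIZE];
} magazine_t;

typedef struct {
    magazine_t *current;          /* active CPU magazine */
    magazine_t *last;             /* second CPU magazine */
} cpu_cache_t;

static magazine_t *depot;         /* shared list of full magazines */

/* Returns false only if a new magazine cannot be created; a real
 * allocator would then return the object directly to its slab. */
bool cache_free(cpu_cache_t *cc, void *obj)
{
    for (;;) {
        /* Step 1: push into the current magazine if it is not full. */
        if (cc->current != NULL && cc->current->busy < MAG_SIZE) {
            cc->current->objs[cc->current->busy++] = obj;
            return true;
        }

        /* Step 2: swap in the second magazine if it has room. */
        if (cc->last != NULL && cc->last->busy < MAG_SIZE) {
            magazine_t *tmp = cc->current;
            cc->current = cc->last;
            cc->last = tmp;
            continue;
        }

        /* Step 3: both are full (or missing); move the older one to the
         * depot and start a new empty magazine. */
        magazine_t *fresh = calloc(1, sizeof(*fresh));
        if (fresh == NULL)
            return false;
        if (cc->last != NULL) {
            cc->last->next = depot;
            depot = cc->last;
        }
        cc->last = cc->current;
        cc->current = fresh;
    }
}
```
]]></programlisting>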
</section>
</section>
</section>

<!-- End of Physmem -->
</section>

<section>
<title>Virtual memory management</title>

<section>
<title>Introduction</title>

<para>Virtual memory is a special memory management technique used by
the kernel to achieve several mission-critical goals: <itemizedlist>
<listitem>
<para>Isolate each task from other tasks that are running on the system at the same time.</para>
</listitem>

<listitem>
<para>Allow allocation of more memory than the actual physical memory size of the machine.</para>
</listitem>

<listitem>
<para>Allow, in general, loading and executing two programs that are linked at the same address without complicated relocations.</para>
</listitem>
</itemizedlist></para>

<para><!--

<para>
Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
Special address space area type - device - prohibits shrink/extend syscalls to call on it.
Address space has link to mapping tables (hierarchical - per Address space, hash - global tables).
</para>

--></para>
</section>

<section>
<title>Address spaces</title>

<section>
<title>Address space areas</title>

<para>Each address space consists of mutually disjoint, contiguous
address space areas. An address space area is precisely defined by its
base address and the number of frames/pages it contains.</para>

<para>Each address space area has flags that define the behaviour and
permissions of that particular area. <itemizedlist>
<listitem>
<para><emphasis>AS_AREA_READ</emphasis> flag indicates reading permission.</para>
</listitem>

<listitem>
<para><emphasis>AS_AREA_WRITE</emphasis> flag indicates writing permission.</para>
</listitem>

<listitem>
<para><emphasis>AS_AREA_EXEC</emphasis> flag indicates code execution permission. Some architectures do not support restricting execution permission. On such architectures, this flag has no effect.</para>
</listitem>

<listitem>
<para><emphasis>AS_AREA_DEVICE</emphasis> marks the area as mapped to device memory.</para>
</listitem>
</itemizedlist></para>

<para>The kernel provides a set of syscalls that allow tasks to create,
expand, shrink and share areas of their address space.</para>
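<para>The area model can be sketched in C as follows. The flag names come from the list above. Their numeric values, the <code>as_area_t</code> layout, the 4096-byte page size and both helper functions are assumptions. The overlap test expresses the requirement that areas be mutually disjoint. The <code>exec_restrictable</code> parameter models architectures on which <emphasis>AS_AREA_EXEC</emphasis> has no effect.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag names from the text; numeric values are assumptions. */
#define AS_AREA_READ   1u
#define AS_AREA_WRITE  2u
#define AS_AREA_EXEC   4u
#define AS_AREA_DEVICE 8u

#define PAGE_SIZE 4096u           /* assumed page size */

typedef struct {
    uintptr_t base;               /* base virtual address */
    size_t pages;                 /* number of pages in the area */
    unsigned flags;
} as_area_t;

/* Areas of one address space must be mutually disjoint. */
bool as_area_overlaps(const as_area_t *a, const as_area_t *b)
{
    uintptr_t a_end = a->base + a->pages * PAGE_SIZE;
    uintptr_t b_end = b->base + b->pages * PAGE_SIZE;
    return a->base < b_end && b->base < a_end;
}

/* Is an access of kind 'access' (one of READ, WRITE, EXEC) permitted? */
bool as_area_allows(const as_area_t *a, unsigned access, bool exec_restrictable)
{
    if (access == AS_AREA_EXEC && !exec_restrictable)
        return true;              /* the architecture cannot forbid execution */
    return (a->flags & access) != 0;
}
```
]]></programlisting>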
</section>

<section>
<title>Address Space ID (ASID)</title>

<para>When switching to a different task, the kernel also needs to
switch mappings to a different address space. If the TLB cannot
distinguish mappings of different address spaces, all mapping
information in the TLB belonging to the old address space must be
flushed. This can create unnecessary overhead during task switching.
To avoid this, some architectures can segregate different address
spaces at the hardware level. They introduce an address space identifier
as a part of each TLB record, which tells the virtual address
translation unit to which address space the record applies.</para>

<para>The HelenOS kernel can take advantage of this hardware-supported
identifier through an ASID abstraction that is related to
the corresponding architecture identifier. For example, on ia64 the kernel ASID is
derived from the RID (region identifier), and on mips32 the kernel ASID is
the hardware identifier itself. As expected, this ASID information
is part of the <emphasis>as_t</emphasis> structure.</para>

<para>Due to hardware limitations, the hardware ASID has a limited
length, from 8 bits on mips32 to 24 bits on ia64. This makes it
impossible to use it as a unique address space identifier for all tasks
running in the system. In such situations, a special ASID stealing
algorithm is used, which takes an ASID from an inactive task and assigns
it to the active task.</para>
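<para>A minimal form of ASID stealing is sketched below under several assumptions. The pool has only four hardware ASIDs, address spaces are kept in a plain array with an <code>active</code> flag, and the first inactive holder is chosen as the victim. The stand-in <code>tlb_invalidate_asid()</code> only counts calls. The type <code>aspace_t</code> and all function names are hypothetical, and <code>aspace_t</code> in particular is not the kernel's <emphasis>as_t</emphasis>.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch; not the HelenOS data structures. */
#define ASIDS 4                   /* size of the hardware ASID space */
#define ASID_INVALID (-1)

typedef struct {
    int asid;                     /* ASID_INVALID if none is assigned */
    bool active;                  /* currently running on some CPU */
} aspace_t;

static bool asid_used[ASIDS];
static int tlb_flushes;

/* Stand-in for invalidating all TLB entries tagged with 'asid'. */
static void tlb_invalidate_asid(int asid)
{
    (void)asid;
    tlb_flushes++;
}

/* Make sure 'as' owns an ASID, stealing one from an inactive address
 * space in 'all' when the pool is exhausted. */
int asid_get(aspace_t *as, aspace_t *all, size_t n)
{
    if (as->asid != ASID_INVALID)
        return as->asid;

    for (int i = 0; i < ASIDS; i++) {
        if (!asid_used[i]) {
            asid_used[i] = true;
            as->asid = i;
            return i;
        }
    }

    for (size_t i = 0; i < n; i++) {
        aspace_t *victim = &all[i];
        if (victim == as || victim->active || victim->asid == ASID_INVALID)
            continue;
        int asid = victim->asid;
        victim->asid = ASID_INVALID;  /* victim gets a new ASID later */
        tlb_invalidate_asid(asid);    /* drop the victim's stale entries */
        as->asid = asid;
        return asid;
    }
    return ASID_INVALID;  /* every ASID is held by an active address space */
}
```
]]></programlisting>

<para>Invalidating the stolen ASID is essential. Without it, TLB entries created by the victim could be used to translate addresses of the new owner. A more careful implementation would also choose the victim by some policy, for example least recently used, instead of taking the first candidate.</para>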

<para><classname>ASID stealing algorithm here.</classname></para>
</section>
</section>

<section>
<title>Virtual address translation</title>

<section id="paging">
<title>Paging</title>

<section>
<title>Introduction</title>

<para>Virtual memory usually uses the paged memory model, in which the
virtual address space is divided into
<emphasis>pages</emphasis> (usually 4096 bytes in size) and
physical memory is divided into frames (of the same size as a page, of
course). Each page may be mapped to some frame. Upon a memory access to a
virtual address, the CPU performs <emphasis>address
translation</emphasis> during instruction execution.
A non-existent mapping generates a page fault exception, which invokes
the kernel exception handler and thus allows the kernel to manipulate
the rules of memory access. The kernel stores the information about page
mappings in the <link linkend="page_tables">page tables</link>.</para>

<para>The majority of architectures use multi-level page tables, which
means that physical memory has to be accessed several times before the
physical address is obtained. This would cause serious performance
overhead in virtual memory management. To avoid it, a <link
linkend="tlb">Translation Lookaside Buffer (TLB)</link> is
used.</para>

<para>The HelenOS kernel has two different approaches to the paging
implementation: <emphasis>4-level page tables</emphasis> and a
<emphasis>global hash table</emphasis>, both accessible via a
generic paging abstraction layer. The two approaches exist because of
the major architectural differences between the supported
platforms. This abstraction is implemented with the help of a global
structure of pointers to basic mapping functions,
<emphasis>page_mapping_operations</emphasis>. To provide a different
page table implementation, the corresponding layer must implement the
functions declared in
<emphasis>page_mapping_operations</emphasis>.</para>
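<para>The idea of the abstraction layer can be shown with a short C sketch. Generic code calls mapping functions only through a global table of function pointers, and a backend fills that table in. The member names, the <code>mapping_ops_t</code> type and the toy flat-array backend are assumptions made for illustration. They are not the actual definition of <emphasis>page_mapping_operations</emphasis>.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct aspace aspace_t;   /* opaque address space */

/* Hypothetical table of mapping functions chosen by architecture code. */
typedef struct {
    void (*insert)(aspace_t *as, uintptr_t page, uintptr_t frame);
    void (*remove)(aspace_t *as, uintptr_t page);
    uintptr_t *(*find)(aspace_t *as, uintptr_t page);  /* NULL if unmapped */
} mapping_ops_t;

static const mapping_ops_t *mapping_ops;

/* Generic layer: knows nothing about the page table format. */
void page_map(aspace_t *as, uintptr_t page, uintptr_t frame)
{
    mapping_ops->insert(as, page, frame);
}

void page_unmap(aspace_t *as, uintptr_t page)
{
    mapping_ops->remove(as, page);
}

uintptr_t *page_lookup(aspace_t *as, uintptr_t page)
{
    return mapping_ops->find(as, page);
}

/* Toy backend: a flat array of (page, frame) pairs, ignoring 'as'. */
#define SLOTS 8
static struct { uintptr_t page, frame; int used; } slots[SLOTS];

static void toy_insert(aspace_t *as, uintptr_t page, uintptr_t frame)
{
    (void)as;
    for (size_t i = 0; i < SLOTS; i++) {
        if (!slots[i].used) {
            slots[i].page = page;
            slots[i].frame = frame;
            slots[i].used = 1;
            return;
        }
    }
    /* the toy silently ignores overflow */
}

static void toy_remove(aspace_t *as, uintptr_t page)
{
    (void)as;
    for (size_t i = 0; i < SLOTS; i++)
        if (slots[i].used && slots[i].page == page)
            slots[i].used = 0;
}

static uintptr_t *toy_find(aspace_t *as, uintptr_t page)
{
    (void)as;
    for (size_t i = 0; i < SLOTS; i++)
        if (slots[i].used && slots[i].page == page)
            return &slots[i].frame;
    return NULL;
}

const mapping_ops_t toy_mapping_ops = { toy_insert, toy_remove, toy_find };
```
]]></programlisting>

<para>A hierarchical or hash-table backend would provide its own three functions. Architecture code would point <code>mapping_ops</code> at the matching table during initialization.</para>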

<para>Thanks to the abstract paging interface, there is room for
more paging implementations besides the already implemented
hierarchical page tables and hash table, for example B-tree based
page tables.</para>
</section>

<section>
<title>Hierarchical 4-level page tables</title>

<para>Hierarchical 4-level page tables are a generalization of the
hardware capabilities of most architectures. Each address space has
its own page tables.<itemizedlist>
<listitem>
<para>ia32 uses 2-level page tables, with full hardware support.</para>
</listitem>

<listitem>
<para>amd64 uses 4-level page tables, also with full hardware support.</para>
</listitem>

<listitem>
<para>mips and ppc32 have 2-level tables with software-simulated support.</para>
</listitem>
</itemizedlist></para>
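<para>The following C sketch walks 4-level page tables in software. It assumes the amd64-style split of a 48-bit virtual address into four 9-bit indices and a 12-bit offset, stores a present bit in bit 0 of a leaf entry, and uses hypothetical names. It shows the structure of the walk, not any architecture's exact entry format. An architecture with 2-level hardware tables could presumably be accommodated by treating the extra levels as pass-through.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch only: layout and names are assumptions (amd64-like split). */
#define PAGE_WIDTH 12             /* 4096-byte pages */
#define LEVEL_BITS 9              /* 512 entries per table */
#define ENTRIES (1u << LEVEL_BITS)
#define LEVELS 4
#define PRESENT 1u                /* bit 0 of a leaf entry */
#define OFFSET_MASK ((UINT64_C(1) << PAGE_WIDTH) - 1)

typedef struct table table_t;

typedef union {
    table_t *next;                /* levels 0 to 2: next-level table */
    uint64_t pte;                 /* level 3: frame address | PRESENT */
} entry_t;

struct table {
    entry_t e[ENTRIES];
};

/* Index into the table at 'level' (0 = top level) for address 'va'. */
static unsigned index_at(uint64_t va, int level)
{
    int shift = PAGE_WIDTH + LEVEL_BITS * (LEVELS - 1 - level);
    return (unsigned)((va >> shift) & (ENTRIES - 1));
}

/* Map page 'va' to frame 'pa', allocating missing tables (zeroed memory
 * is assumed to read back as NULL pointers). Returns 0 or -1. */
int pt_map(table_t *root, uint64_t va, uint64_t pa)
{
    table_t *t = root;
    for (int level = 0; level < LEVELS - 1; level++) {
        entry_t *e = &t->e[index_at(va, level)];
        if (e->next == NULL) {
            e->next = calloc(1, sizeof(table_t));
            if (e->next == NULL)
                return -1;
        }
        t = e->next;
    }
    t->e[index_at(va, LEVELS - 1)].pte = (pa & ~OFFSET_MASK) | PRESENT;
    return 0;
}

/* Translate 'va'. Returns -1 where a page fault would be raised,
 * otherwise stores the physical address in '*pa' and returns 0. */
int pt_translate(const table_t *root, uint64_t va, uint64_t *pa)
{
    const table_t *t = root;
    for (int level = 0; level < LEVELS - 1; level++) {
        t = t->e[index_at(va, level)].next;
        if (t == NULL)
            return -1;
    }
    uint64_t pte = t->e[index_at(va, LEVELS - 1)].pte;
    if (!(pte & PRESENT))
        return -1;
    *pa = (pte & ~OFFSET_MASK) | (va & OFFSET_MASK);
    return 0;
}
```
]]></programlisting>

<para>Each translation costs one memory access per level. This is the overhead that the TLB described below is meant to hide.</para>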
</section>

<section>
<title>Global hash table</title>

<para>The implementation of the global hash table was motivated by
support for the ia64 architecture. One of the major differences between
the global hash table and the hierarchical tables is that the global
hash table exists only once in the system, whereas the hierarchical
tables are maintained per address space.</para>

<para>Because the hash table contains information about the mappings of
all address spaces in the system, the hash of an entry must incorporate
both the address space pointer (or ID) and the virtual address of the
page. The generic hash table implementation assumes that pointers to
address spaces are likely to lie close to each other, so it uses their
least significant bits for the hash. It also assumes that virtual page
addresses occur with roughly the same probability, so the least
significant bits of the VPN compose the hash index.</para>
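<para>A possible hash function in the spirit of this description is sketched below. The exact bit split is an assumption, not the HelenOS formula. The 6 least significant VPN bits form the low part of a 10-bit bucket index, so that neighbouring pages of one address space fall into different buckets. Four low bits of the address space pointer, taken above an assumed 16-byte alignment, form the high part.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: the bit split is an assumption, not the HelenOS formula. */
#define PAGE_WIDTH 12             /* 4096-byte pages */
#define HASH_BITS 10              /* 1024 buckets */
#define VPN_BITS 6                /* low VPN bits in the index */
#define AS_BITS (HASH_BITS - VPN_BITS)
#define AS_ALIGN_SHIFT 4          /* assumed alignment of address spaces */

/* Bucket index for the mapping of virtual address 'va' in address space
 * 'as'. Each bucket would head a collision chain of mapping entries. */
size_t page_hash(const void *as, uintptr_t va)
{
    uintptr_t vpn = va >> PAGE_WIDTH;
    uintptr_t as_part = ((uintptr_t)as >> AS_ALIGN_SHIFT) & ((1u << AS_BITS) - 1);
    uintptr_t vpn_part = vpn & ((1u << VPN_BITS) - 1);
    return (size_t)((as_part << VPN_BITS) | vpn_part);
}
```
]]></programlisting>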

<para>Collision chains ...</para>
</section>
</section>

<section id="tlb">
<title>Translation Lookaside Buffer</title>

<para>Because looking up a page mapping in the page tables is expensive,
all architectures have a fast associative cache memory built into the
CPU. This memory, called the TLB, stores recently used page table
entries.</para>

<section id="tlb_shootdown">
<title>TLB consistency. TLB shootdown algorithm.</title>

<para>The operating system is responsible for keeping the TLB consistent
by invalidating its contents whenever there is some change in the
page tables. Such changes occur when a page or a group of pages is
unmapped, when a mapping is changed, or when the system switches the
active address space to schedule a new task. Moreover, this invalidation
must be done on all CPUs in the system because each CPU has its
own independent TLB. Thus, maintaining TLB consistency on an SMP
configuration is not as trivial a task as it looks at first
glance. A naive solution would assume that the CPU which wants to
invalidate the TLB simply invalidates the TLBs of the other CPUs. This is
not possible on most architectures, for a simple reason: flushing the
TLB is allowed only on the local CPU, and there is no way to access the
TLBs of other CPUs and thus invalidate them remotely.</para>

<para>The technique of remote invalidation of TLB entries is called "TLB
shootdown". HelenOS uses a variation of the algorithm described by
D. Black et al., "Translation Lookaside Buffer Consistency: A
Software Approach," Proc. Third Int'l Conf. Architectural Support
for Programming Languages and Operating Systems, 1989, pp.
113-122.</para>

<para>Depending on the situation, only a partial invalidation of the TLB
may be needed. In the case of a simple memory mapping change, it is
necessary to invalidate only one or more adjacent pages. If
the architecture is aware of ASIDs and the kernel needs to reassign some
ASID to another task, it invalidates only the entries of this
particular address space. The final option is complete TLB
invalidation, which flushes all entries in the TLB.</para>

<para>TLB shootdown is performed in two phases.</para>

<formalpara>
<title>Phase 1.</title>

<para>First, the initiator locks a global TLB spinlock. It then puts a
request into the local request queue of every other CPU in the system,
each queue being protected by its own spinlock. If a queue is full, all
requests in it are replaced by a single request indicating a
global TLB flush. Then the initiator thread sends an IPI message
indicating the TLB shootdown request to the rest of the CPUs and
actively waits until all CPUs confirm the TLB invalidation action
by setting a special flag. After setting this flag, each of these
CPUs blocks on the TLB spinlock held by the
initiator.</para>
</formalpara>

<formalpara>
<title>Phase 2.</title>

<para>All other CPUs are now waiting on the TLB spinlock to execute the TLB
invalidation action and have indicated their intention to the
initiator. The initiator continues by cleaning up its own TLB and releasing
the global TLB spinlock. After this, all other CPUs acquire and
immediately release the TLB spinlock and perform the TLB invalidation
actions.</para>
</formalpara>
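<para>The request queue handling of Phase 1 and the three kinds of invalidation described above can be modelled sequentially, as in the C sketch below. The sketch omits locking, IPIs and the waiting protocol. The queue length is an arbitrary assumption, the TLB is a toy array, pages are expressed as page numbers, and all names are hypothetical.</para>

<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch only: sizes and names are assumptions. */
#define QUEUE_LEN 4
#define TLB_SIZE 8

typedef enum { TLB_INVL_ALL, TLB_INVL_ASID, TLB_INVL_PAGES } tlb_invl_t;

typedef struct {
    tlb_invl_t type;
    int asid;                     /* for TLB_INVL_ASID and TLB_INVL_PAGES */
    uintptr_t page;               /* first page number for TLB_INVL_PAGES */
    size_t count;                 /* number of adjacent pages */
} tlb_req_t;

typedef struct {
    size_t len;
    tlb_req_t req[QUEUE_LEN];
} tlb_queue_t;

typedef struct {
    int asid;
    uintptr_t page;
    int valid;
} tlb_entry_t;

/* Phase 1, initiator side: queue a request for another CPU. */
void tlb_queue_put(tlb_queue_t *q, tlb_req_t r)
{
    if (q->len == QUEUE_LEN) {
        /* Queue full: replace everything by one global flush request. */
        q->req[0] = (tlb_req_t){ .type = TLB_INVL_ALL };
        q->len = 1;
        return;
    }
    q->req[q->len++] = r;
}

/* Phase 2, receiving CPU: apply all queued requests to the local TLB. */
void tlb_process(tlb_queue_t *q, tlb_entry_t tlb[TLB_SIZE])
{
    for (size_t i = 0; i < q->len; i++) {
        const tlb_req_t *r = &q->req[i];
        for (size_t j = 0; j < TLB_SIZE; j++) {
            tlb_entry_t *e = &tlb[j];
            if (!e->valid)
                continue;
            switch (r->type) {
            case TLB_INVL_ALL:
                e->valid = 0;
                break;
            case TLB_INVL_ASID:
                if (e->asid == r->asid)
                    e->valid = 0;
                break;
            case TLB_INVL_PAGES:
                if (e->asid == r->asid && e->page >= r->page &&
                    e->page < r->page + r->count)
                    e->valid = 0;
                break;
            }
        }
    }
    q->len = 0;
}
```
]]></programlisting>

<para>Collapsing a full queue into a single global flush trades precision for bounded memory. The receiving CPU may discard more TLB entries than necessary, but it never misses an invalidation.</para>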
</section>
</section>
</section>
</section>
</chapter>


(root)/design/trunk/src/ch_memory_management.xml – Rev 68