WebSVN – HelenOS-doc – Blame – /design/trunk/src/ch_memory_management.xml

Rev	Author	Line No.	Line
9	bondari	1	<?xml version="1.0" encoding="UTF-8"?>
11	bondari	2	<chapter id="mm">
		3	<?dbhtml filename="mm.html"?>
9	bondari	4
11	bondari	5	<title>Memory management</title>
9	bondari	6
67	jermar	7	<para>In previous chapters, this book described the scheduling subsystem as
		8	the creator of the impression that threads execute in parallel. The memory
		9	management subsystem, on the other hand, creates the impression that there
		10	is enough physical memory for the kernel and that userspace tasks have the
		11	entire address space only for themselves.</para>
64	jermar	12
26	bondari	13	<section>
64	jermar	14	<title>Physical memory management</title>
		15
		16	<section id="zones_and_frames">
		17	<title>Zones and frames</title>
		18
67	jermar	19	<para>HelenOS represents continuous areas of physical memory in
		20	structures called frame zones (abbreviated as zones). Each zone contains
		21	information about the number of allocated and unallocated physical
		22	memory frames as well as the physical base address of the zone and
		23	number of frames contained in it. A zone also contains an array of frame
		24	structures describing each frame of the zone and, in the last, but not
		25	the least important, front, each zone is equipped with a buddy system
		26	that faciliates effective allocation of power-of-two sized block of
		27	frames.</para>
64	jermar	28
67	jermar	29	<para>This organization of physical memory provides good preconditions
		30	for hot-plugging of more zones. There is also one currently unused zone
		31	attribute: <code>flags</code>. The attribute could be used to give a
		32	special meaning to some zones in the future.</para>
64	jermar	33
67	jermar	34	<para>The zones are linked in a doubly-linked list. This might seem a
		35	bit ineffective because the zone list is walked everytime a frame is
		36	allocated or deallocated. However, this does not represent a significant
		37	performance problem as it is expected that the number of zones will be
		38	rather low. Moreover, most architectures merge all zones into
		39	one.</para>
		40
76	palkovsky	41	<para>Every physical memory frame in a zone, is described by a structure
		42	that contains number of references and other data used by buddy
67	jermar	43	system.</para>
64	jermar	44	</section>
		45
		46	<section id="frame_allocator">
71	bondari	47	<indexterm>
		48	<primary>frame allocator</primary>
		49	</indexterm>
		50
64	jermar	51	<title>Frame allocator</title>
		52
67	jermar	53	<para>The frame allocator satisfies kernel requests to allocate
		54	power-of-two sized blocks of physical memory. Because of zonal
		55	organization of physical memory, the frame allocator is always working
76	palkovsky	56	within a context of a particular frame zone. In order to carry out the
67	jermar	57	allocation requests, the frame allocator is tightly integrated with the
		58	buddy system belonging to the zone. The frame allocator is also
		59	responsible for updating information about the number of free and busy
		60	frames in the zone. <figure>
		61	<mediaobject id="frame_alloc">
77	bondari	62	<imageobject role="eps">
		63	<imagedata fileref="images.vector/frame_alloc.eps" format="EPS" />
		64	</imageobject>
		65
67	jermar	66	<imageobject role="html">
		67	<imagedata fileref="images/frame_alloc.png" format="PNG" />
		68	</imageobject>
64	jermar	69
67	jermar	70	<imageobject role="fop">
		71	<imagedata fileref="images.vector/frame_alloc.svg" format="SVG" />
		72	</imageobject>
		73	</mediaobject>
64	jermar	74
67	jermar	75	<title>Frame allocator scheme.</title>
		76	</figure></para>
64	jermar	77
		78	<formalpara>
		79	<title>Allocation / deallocation</title>
		80
67	jermar	81	<para>Upon allocation request via function <code>frame_alloc</code>,
		82	the frame allocator first tries to find a zone that can satisfy the
		83	request (i.e. has the required amount of free frames). Once a suitable
		84	zone is found, the frame allocator uses the buddy allocator on the
		85	zone's buddy system to perform the allocation. During deallocation,
		86	which is triggered by a call to <code>frame_free</code>, the frame
		87	allocator looks up the respective zone that contains the frame being
		88	deallocated. Afterwards, it calls the buddy allocator again, this time
		89	to take care of deallocation within the zone's buddy system.</para>
64	jermar	90	</formalpara>
		91	</section>
		92
		93	<section id="buddy_allocator">
71	bondari	94	<indexterm>
		95	<primary>buddy system</primary>
		96	</indexterm>
		97
64	jermar	98	<title>Buddy allocator</title>
		99
67	jermar	100	<para>In the buddy system, the memory is broken down into power-of-two
		101	sized naturally aligned blocks. These blocks are organized in an array
		102	of lists, in which the list with index i contains all unallocated blocks
		103	of size <mathphrase>2<superscript>i</superscript></mathphrase>. The
		104	index i is called the order of block. Should there be two adjacent
		105	equally sized blocks in the list i<mathphrase />(i.e. buddies), the
		106	buddy allocator would coalesce them and put the resulting block in list
		107	<mathphrase>i + 1</mathphrase>, provided that the resulting block would
		108	be naturally aligned. Similarily, when the allocator is asked to
		109	allocate a block of size
		110	<mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
		111	to satisfy the request from the list with index i. If the request cannot
		112	be satisfied (i.e. the list i is empty), the buddy allocator will try to
		113	allocate and split a larger block from the list with index i + 1. Both
		114	of these algorithms are recursive. The recursion ends either when there
		115	are no blocks to coalesce in the former case or when there are no blocks
		116	that can be split in the latter case.</para>
64	jermar	117
67	jermar	118	<para>This approach greatly reduces external fragmentation of memory and
		119	helps in allocating bigger continuous blocks of memory aligned to their
		120	size. On the other hand, the buddy allocator suffers increased internal
		121	fragmentation of memory and is not suitable for general kernel
		122	allocations. This purpose is better addressed by the <link
		123	linkend="slab">slab allocator</link>.<figure>
64	jermar	124	<mediaobject id="buddy_alloc">
77	bondari	125	<imageobject role="eps">
		126	<imagedata fileref="images.vector/buddy_alloc.eps" format="EPS" />
		127	</imageobject>
		128
64	jermar	129	<imageobject role="html">
		130	<imagedata fileref="images/buddy_alloc.png" format="PNG" />
		131	</imageobject>
		132
		133	<imageobject role="fop">
		134	<imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" />
		135	</imageobject>
		136	</mediaobject>
		137
		138	<title>Buddy system scheme.</title>
67	jermar	139	</figure></para>
64	jermar	140
		141	<section>
		142	<title>Implementation</title>
		143
		144	<para>The buddy allocator is, in fact, an abstract framework wich can
		145	be easily specialized to serve one particular task. It knows nothing
		146	about the nature of memory it helps to allocate. In order to beat the
		147	lack of this knowledge, the buddy allocator exports an interface that
		148	each of its clients is required to implement. When supplied with an
		149	implementation of this interface, the buddy allocator can use
		150	specialized external functions to find a buddy for a block, split and
		151	coalesce blocks, manipulate block order and mark blocks busy or
67	jermar	152	available.</para>
64	jermar	153
		154	<formalpara>
		155	<title>Data organization</title>
		156
		157	<para>Each entity allocable by the buddy allocator is required to
		158	contain space for storing block order number and a link variable
		159	used to interconnect blocks within the same order.</para>
		160
		161	<para>Whatever entities are allocated by the buddy allocator, the
		162	first entity within a block is used to represent the entire block.
		163	The first entity keeps the order of the whole block. Other entities
		164	within the block are assigned the magic value
		165	<constant>BUDDY_INNER_BLOCK</constant>. This is especially important
		166	for effective identification of buddies in a one-dimensional array
		167	because the entity that represents a potential buddy cannot be
		168	associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
		169	is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
		170	not a buddy).</para>
		171	</formalpara>
		172	</section>
		173	</section>
		174
		175	<section id="slab">
71	bondari	176	<indexterm>
		177	<primary>slab allocator</primary>
		178	</indexterm>
		179
70	jermar	180	<title>Slab allocator</title>
64	jermar	181
67	jermar	182	<para>The majority of memory allocation requests in the kernel is for
		183	small, frequently used data structures. The basic idea behind the slab
		184	allocator is that commonly used objects are preallocated in continuous
		185	areas of physical memory called slabs<footnote>
		186	<para>Slabs are in fact blocks of physical memory frames allocated
		187	from the frame allocator.</para>
		188	</footnote>. Whenever an object is to be allocated, the slab allocator
		189	returns the first available item from a suitable slab corresponding to
		190	the object type<footnote>
		191	<para>The mechanism is rather more complicated, see the next
		192	paragraph.</para>
		193	</footnote>. Due to the fact that the sizes of the requested and
		194	allocated object match, the slab allocator significantly reduces
		195	internal fragmentation.</para>
64	jermar	196
71	bondari	197	<indexterm>
		198	<primary>slab allocator</primary>
		199
72	bondari	200	<secondary>- slab cache</secondary>
71	bondari	201	</indexterm>
		202
67	jermar	203	<para>Slabs of one object type are organized in a structure called slab
		204	cache. There are ususally more slabs in the slab cache, depending on
		205	previous allocations. If the the slab cache runs out of available slabs,
		206	new slabs are allocated. In order to exploit parallelism and to avoid
		207	locking of shared spinlocks, slab caches can have variants of
		208	processor-private slabs called magazines. On each processor, there is a
		209	two-magazine cache. Full magazines that are not part of any
		210	per-processor magazine cache are stored in a global list of full
		211	magazines.</para>
64	jermar	212
71	bondari	213	<indexterm>
		214	<primary>slab allocator</primary>
		215
72	bondari	216	<secondary>- magazine</secondary>
71	bondari	217	</indexterm>
		218
67	jermar	219	<para>Each object begins its life in a slab. When it is allocated from
		220	there, the slab allocator calls a constructor that is registered in the
		221	respective slab cache. The constructor initializes and brings the object
		222	into a known state. The object is then used by the user. When the user
		223	later frees the object, the slab allocator puts it into a processor
71	bondari	224	private <indexterm>
		225	<primary>slab allocator</primary>
		226
72	bondari	227	<secondary>- magazine</secondary>
71	bondari	228	</indexterm>magazine cache, from where it can be precedently allocated
67	jermar	229	again. Note that allocations satisfied from a magazine are already
		230	initialized by the constructor. When both of the processor cached
		231	magazines get full, the allocator will move one of the magazines to the
		232	list of full magazines. Similarily, when allocating from an empty
		233	processor magazine cache, the kernel will reload only one magazine from
		234	the list of full magazines. In other words, the slab allocator tries to
		235	keep the processor magazine cache only half-full in order to prevent
		236	thrashing when allocations and deallocations interleave on magazine
70	jermar	237	boundaries. The advantage of this setup is that during most of the
		238	allocations, no global spinlock needs to be held.</para>
65	jermar	239
67	jermar	240	<para>Should HelenOS run short of memory, it would start deallocating
		241	objects from magazines, calling slab cache destructor on them and
		242	putting them back into slabs. When a slab contanins no allocated object,
		243	it is immediately freed.</para>
66	bondari	244
71	bondari	245	<para>
		246	<figure>
64	jermar	247	<mediaobject id="slab_alloc">
77	bondari	248	<imageobject role="eps">
		249	<imagedata fileref="images.vector/slab_alloc.eps" format="EPS" />
		250	</imageobject>
		251
64	jermar	252	<imageobject role="html">
		253	<imagedata fileref="images/slab_alloc.png" format="PNG" />
		254	</imageobject>
77	bondari	255
		256	<imageobject role="fop">
		257	<imagedata fileref="images.vector/slab_alloc.svg" format="SVG" />
		258	</imageobject>
64	jermar	259	</mediaobject>
		260
		261	<title>Slab allocator scheme.</title>
71	bondari	262	</figure>
		263	</para>
64	jermar	264
67	jermar	265	<section>
		266	<title>Implementation</title>
		267
		268	<para>The slab allocator is closely modelled after OpenSolaris slab
71	bondari	269	allocator by Jeff Bonwick and Jonathan Adams <xref
		270	linkend="Bonwick01" /> with the following exceptions:<itemizedlist>
		271	<listitem>empty slabs are immediately deallocated and</listitem>
66	bondari	272
		273	<listitem>
70	jermar	274	<para>empty magazines are deallocated when not needed.</para>
66	bondari	275	</listitem>
70	jermar	276	</itemizedlist>The following features are not currently supported
		277	but would be easy to do: <itemizedlist>
71	bondari	278	<listitem>cache coloring and</listitem>
64	jermar	279
71	bondari	280	<listitem>dynamic magazine grow (different magazine sizes are
		281	already supported, but the allocation strategy would need to be
		282	adjusted).</listitem>
64	jermar	283	</itemizedlist></para>
		284
		285	<section>
		286	<title>Allocation/deallocation</title>
		287
70	jermar	288	<para>The following two paragraphs summarize and complete the
		289	description of the slab allocator operation (i.e.
		290	<code>slab_alloc</code> and <code>slab_free</code>
		291	operations).</para>
64	jermar	292
		293	<formalpara>
		294	<title>Allocation</title>
		295
70	jermar	296	<para><emphasis>Step 1.</emphasis> When an allocation request
		297	comes, the slab allocator checks availability of memory in the
		298	current magazine of the local processor magazine cache. If the
76	palkovsky	299	available memory is there, the allocator just pops the object from
		300	magazine and returns it.</para>
64	jermar	301
70	jermar	302	<para><emphasis>Step 2.</emphasis> If the current magazine in the
		303	processor magazine cache is empty, the allocator will attempt to
		304	swap it with the last magazine from the cache and return to the
		305	first step. If also the last magazine is empty, the algorithm will
		306	fall through to Step 3.</para>
64	jermar	307
70	jermar	308	<para><emphasis>Step 3.</emphasis> Now the allocator is in the
		309	situation when both magazines in the processor magazine cache are
		310	empty. The allocator reloads one magazine from the shared list of
		311	full magazines. If the reload is successful (i.e. there are full
		312	magazines in the list), the algorithm continues with Step
		313	1.</para>
64	jermar	314
70	jermar	315	<para><emphasis>Step 4.</emphasis> In this fail-safe step, an
		316	object is allocated from the conventional slab layer and a pointer
		317	to it is returned. If also the last magazine is full,</para>
64	jermar	318	</formalpara>
		319
		320	<formalpara>
		321	<title>Deallocation</title>
		322
70	jermar	323	<para><emphasis>Step 1.</emphasis> During a deallocation request,
		324	the slab allocator checks if the current magazine of the local
76	palkovsky	325	processor magazine cache is not full. If it is, the pointer to the
70	jermar	326	objects is just pushed into the magazine and the algorithm
		327	returns.</para>
64	jermar	328
70	jermar	329	<para><emphasis>Step 2.</emphasis> If the current magazine is
		330	full, the allocator will attempt to swap it with the last magazine
		331	from the cache and return to the first step. If also the last
		332	magazine is empty, the algorithm will fall through to Step
		333	3.</para>
64	jermar	334
70	jermar	335	<para><emphasis>Step 3.</emphasis> Now the allocator is in the
		336	situation when both magazines in the processor magazine cache are
76	palkovsky	337	full. The allocator tries to allocate a new empty magazine and
		338	flush one of the full magazines to the shared list of full
		339	magazines. If it is successfull, the algoritm continues with Step
		340	1.</para>
		341
		342	<para><emphasis>Step 4. </emphasis>In case of low memory condition
		343	when the allocation of empty magazine fails, the object is moved
		344	directly into slab. In the worst case object deallocation does not
		345	need to allocate any additional memory.</para>
64	jermar	346	</formalpara>
		347	</section>
		348	</section>
		349	</section>
		350	</section>
		351
		352	<section>
67	jermar	353	<title>Virtual memory management</title>
9	bondari	354
67	jermar	355	<section>
		356	<title>Introduction</title>
		357
		358	<para>Virtual memory is a special memory management technique, used by
		359	kernel to achieve a bunch of mission critical goals. <itemizedlist>
		360	<listitem>
		361	Isolate each task from other tasks that are running on the system at the same time.
		362	</listitem>
		363
		364	<listitem>
		365	Allow to allocate more memory, than is actual physical memory size of the machine.
		366	</listitem>
		367
		368	<listitem>
		369	Allowing, in general, to load and execute two programs that are linked on the same address without complicated relocations.
		370	</listitem>
		371	</itemizedlist></para>
		372
		373	<para><!--
		374	<para>
		375	Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
		376	Special address space area type - device - prohibits shrink/extend syscalls to call on it.
		377	Address space has link to mapping tables (hierarchical - per Address space, hash - global tables).
		378	</para>
		379
		380	--></para>
		381	</section>
		382
		383	<section>
71	bondari	384	<title>Address spaces</title>
70	jermar	385
71	bondari	386	<section>
		387	<indexterm>
		388	<primary>address space</primary>
70	jermar	389
72	bondari	390	<secondary>- area</secondary>
71	bondari	391	</indexterm>
70	jermar	392
		393	<title>Address space areas</title>
67	jermar	394
		395	<para>Each address space consists of mutually disjunctive continuous
		396	address space areas. Address space area is precisely defined by its
		397	base address and the number of frames/pages is contains.</para>
		398
		399	<para>Address space area , that define behaviour and permissions on
		400	the particular area. <itemizedlist>
71	bondari	401	<listitem><emphasis>AS_AREA_READ</emphasis> flag indicates reading
		402	permission.</listitem>
67	jermar	403
71	bondari	404	<listitem><emphasis>AS_AREA_WRITE</emphasis> flag indicates
		405	writing permission.</listitem>
67	jermar	406
71	bondari	407	<listitem><emphasis>AS_AREA_EXEC</emphasis> flag indicates code
		408	execution permission. Some architectures do not support execution
		409	persmission restriction. In this case this flag has no
		410	effect.</listitem>
67	jermar	411
71	bondari	412	<listitem><emphasis>AS_AREA_DEVICE</emphasis> marks area as mapped
		413	to the device memory.</listitem>
67	jermar	414	</itemizedlist></para>
		415
		416	<para>Kernel provides possibility tasks create/expand/shrink/share its
		417	address space via the set of syscalls.</para>
		418	</section>
		419
		420	<section>
71	bondari	421	<indexterm>
		422	<primary>address space</primary>
		423
72	bondari	424	<secondary>- ASID</secondary>
71	bondari	425	</indexterm>
		426
67	jermar	427	<title>Address Space ID (ASID)</title>
		428
76	palkovsky	429	<para>Every task in the operating system has it's own view of the
		430	virtual memory. When performing context switch between different
		431	tasks, the kernel must switch the address space mapping as well. As
		432	modern processors perform very aggressive caching of virtual mappings,
		433	flushing the complete TLB on every context switch would be very
		434	inefficient. To avoid such performance penalty, some architectures
		435	introduce an address space identifier, which allows storing several
		436	different mappings inside TLB.</para>
67	jermar	437
76	palkovsky	438	<para>HelenOS kernel can take advantage of this hardware support by
		439	having an ASID abstraction. I.e. on ia64 kernel ASID is derived from
		440	RID (region identifier) and on the mips32 kernel ASID is actually the
		441	hardware identifier. As expected, this ASID information record is the
		442	part of <emphasis>as_t</emphasis> structure.</para>
67	jermar	443
		444	<para>Due to the hardware limitations, hardware ASID has limited
		445	length from 8 bits on ia64 to 24 bits on mips32, which makes it
		446	impossible to use it as unique address space identifier for all tasks
		447	running in the system. In such situations special ASID stealing
		448	algoritm is used, which takes ASID from inactive task and assigns it
		449	to the active task.</para>
		450
71	bondari	451	<indexterm>
		452	<primary>address space</primary>
		453
72	bondari	454	<secondary>- ASID stealing</secondary>
71	bondari	455	</indexterm>
		456
		457	<para>
		458	<classname>ASID stealing algoritm here.</classname>
		459	</para>
67	jermar	460	</section>
		461	</section>
		462
71	bondari	463	<section id="paging">
67	jermar	464	<title>Virtual address translation</title>
		465
71	bondari	466	<section>
		467	<title>Introduction</title>
67	jermar	468
71	bondari	469	<para>Virtual memory is usually using paged memory model, where
		470	virtual memory address space is divided into the
		471	<emphasis>pages</emphasis> (usually having size 4096 bytes) and
		472	physical memory is divided into the frames (same sized as a page, of
		473	course). Each page may be mapped to some frame and then, upon memory
		474	access to the virtual address, CPU performs <emphasis>address
		475	translation</emphasis> during the instruction execution. Non-existing
		476	mapping generates page fault exception, calling kernel exception
		477	handler, thus allowing kernel to manipulate rules of memory access.
		478	Information for pages mapping is stored by kernel in the <link
		479	linkend="page_tables">page tables</link></para>
		480
		481	<indexterm>
		482	<primary>page tables</primary>
		483	</indexterm>
		484
		485	<para>The majority of the architectures use multi-level page tables,
		486	which means need to access physical memory several times before
		487	getting physical address. This fact would make serios performance
		488	overhead in virtual memory management. To avoid this <link
		489	linkend="tlb">Traslation Lookaside Buffer (TLB)</link> is used.</para>
		490
70	jermar	491	<para>HelenOS kernel has two different approaches to the paging
		492	implementation: <emphasis>4 level page tables</emphasis> and
71	bondari	493	<emphasis>global hash table</emphasis>, which are accessible via
70	jermar	494	generic paging abstraction layer. Such different functionality was
		495	caused by the major architectural differences between supported
		496	platforms. This abstraction is implemented with help of the global
		497	structure of pointers to basic mapping functions
		498	<emphasis>page_mapping_operations</emphasis>. To achieve different
		499	functionality of page tables, corresponding layer must implement
		500	functions, declared in
		501	<emphasis>page_mapping_operations</emphasis></para>
67	jermar	502
71	bondari	503	<para>Thanks to the abstract paging interface, there was a place left
		504	for more paging implementations (besides already implemented
73	bondari	505	hieararchical page tables and hash table), for example <indexterm>
		506	<primary>B-tree</primary>
		507	</indexterm> B-Tree based page tables.</para>
71	bondari	508	</section>
67	jermar	509
71	bondari	510	<section id="page_tables">
		511	<indexterm>
		512	<primary>page tables</primary>
67	jermar	513
72	bondari	514	<secondary>- hierarchical</secondary>
71	bondari	515	</indexterm>
67	jermar	516
71	bondari	517	<title>Hierarchical 4-level page tables</title>
67	jermar	518
71	bondari	519	<para>Hierarchical 4-level page tables are the generalization of the
		520	hardware capabilities of most architectures. Each address space has
		521	its own page tables.<itemizedlist>
		522	<listitem>ia32 uses 2-level page tables, with full hardware
		523	support.</listitem>
67	jermar	524
71	bondari	525	<listitem>amd64 uses 4-level page tables, also coming with full
		526	hardware support.</listitem>
67	jermar	527
71	bondari	528	<listitem>mips and ppc32 have 2-level tables, software simulated
		529	support.</listitem>
		530	</itemizedlist></para>
67	jermar	531	</section>
		532
71	bondari	533	<section>
		534	<indexterm>
		535	<primary>page tables</primary>
67	jermar	536
72	bondari	537	<secondary>- hashing</secondary>
71	bondari	538	</indexterm>
67	jermar	539
71	bondari	540	<title>Global hash table</title>
67	jermar	541
71	bondari	542	<para>Implementation of the global hash table was encouraged by the
		543	ia64 architecture support. One of the major differences between global
		544	hash table and hierarchical tables is that global hash table exists
		545	only once in the system and the hierarchical tables are maintained per
		546	address space.</para>
67	jermar	547
71	bondari	548	<para>Thus, hash table contains information about all address spaces
		549	mappings in the system, so, the hash of an entry must contain
		550	information of both address space pointer or id and the virtual
		551	address of the page. Generic hash table implementation assumes that
		552	the addresses of the pointers to the address spaces are likely to be
		553	on the close addresses, so it uses least significant bits for hash;
		554	also it assumes that the virtual page addresses have roughly the same
		555	probability of occurring, so the least significant bits of VPN compose
		556	the hash index.</para>
67	jermar	557
74	bondari	558	<para>Paging hash table uses generic hash table with collision chains
		559	(see the <link linkend="hashtables">Data Structures</link> chapter of
		560	this manual for details).</para>
70	jermar	561	</section>
		562	</section>
67	jermar	563
71	bondari	564	<section id="tlb">
		565	<indexterm>
		566	<primary>TLB</primary>
		567	</indexterm>
67	jermar	568
71	bondari	569	<title>Translation Lookaside buffer</title>
67	jermar	570
71	bondari	571	<para>Due to the extensive overhead during the page mapping lookup in
		572	the page tables, all architectures has fast assotiative cache memory
		573	built-in CPU. This memory called TLB stores recently used page table
		574	entries.</para>
		575
		576	<section id="tlb_shootdown">
		577	<indexterm>
		578	<primary>TLB</primary>
		579
72	bondari	580	<secondary>- TLB shootdown</secondary>
71	bondari	581	</indexterm>
		582
		583	<title>TLB consistency. TLB shootdown algorithm.</title>
		584
		585	<para>Operating system is responsible for keeping TLB consistent by
		586	invalidating the contents of TLB, whenever there is some change in
		587	page tables. Those changes may occur when page or group of pages were
		588	unmapped, mapping is changed or system switching active address space
		589	to schedule a new system task. Moreover, this invalidation operation
		590	must be done an all system CPUs because each CPU has its own
		591	independent TLB cache. Thus maintaining TLB consistency on SMP
		592	configuration as not as trivial task as it looks on the first glance.
		593	Naive solution would assume that is the CPU which wants to invalidate
		594	TLB will invalidate TLB caches on other CPUs. It is not possible on
		595	the most of the architectures, because of the simple fact - flushing
		596	TLB is allowed only on the local CPU and there is no possibility to
		597	access other CPUs' TLB caches, thus invalidate TLB remotely.</para>
		598
		599	<para>Technique of remote invalidation of TLB entries is called "TLB
		600	shootdown". HelenOS uses a variation of the algorithm described by D.
		601	Black et al., "Translation Lookaside Buffer Consistency: A Software
		602	Approach," Proc. Third Int'l Conf. Architectural Support for
		603	Programming Languages and Operating Systems, 1989, pp. 113-122. <xref
		604	linkend="Black89" /></para>
		605
		606	<para>As the situation demands, you will want partitial invalidation
		607	of TLB caches. In case of simple memory mapping change it is necessary
		608	to invalidate only one or more adjacent pages. In case if the
		609	architecture is aware of ASIDs, when kernel needs to dump some ASID to
		610	use by another task, it invalidates only entries from this particular
		611	address space. Final option of the TLB invalidation is the complete
		612	TLB cache invalidation, which is the operation that flushes all
		613	entries in TLB.</para>
		614
		615	<para>TLB shootdown is performed in two phases.</para>
		616
		617	<formalpara>
		618	<title>Phase 1.</title>
		619
		620	<para>First, initiator locks a global TLB spinlock, then request is
		621	being put to the local request cache of every other CPU in the
		622	system protected by its spinlock. In case the cache is full, all
		623	requests in the cache are replaced by one request, indicating global
		624	TLB flush. Then the initiator thread sends an IPI message indicating
		625	the TLB shootdown request to the rest of the CPUs and waits actively
		626	until all CPUs confirm TLB invalidating action execution by setting
		627	up a special flag. After setting this flag this thread is blocked on
		628	the TLB spinlock, held by the initiator.</para>
		629	</formalpara>
		630
		631	<formalpara>
		632	<title>Phase 2.</title>
		633
		634	<para>All CPUs are waiting on the TLB spinlock to execute TLB
		635	invalidation action and have indicated their intention to the
		636	initiator. Initiator continues, cleaning up its TLB and releasing
		637	the global TLB spinlock. After this all other CPUs gain and
		638	immidiately release TLB spinlock and perform TLB invalidation
		639	actions.</para>
		640	</formalpara>
		641	</section>
67	jermar	642	</section>
26	bondari	643	</section>
11	bondari	644	</chapter>

Subversion Repositories HelenOS-doc

(root)/design/trunk/src/ch_memory_management.xml @ 185 – Rev 77