WebSVN – HelenOS-doc – Blame – /design/trunk/src/ch_memory_management.xml

Rev	Author	Line No.	Line
9	bondari	1	<?xml version="1.0" encoding="UTF-8"?>
11	bondari	2	<chapter id="mm">
		3	<?dbhtml filename="mm.html"?>
9	bondari	4
11	bondari	5	<title>Memory management</title>
9	bondari	6
26	bondari	7	<section>
11	bondari	8	<title>Virtual memory management</title>
9	bondari	9
		10	<section>
35	bondari	11	<title>Introduction</title>
		12
		13	<para>Virtual memory is a memory management technique that the
		14	kernel uses to achieve several critical goals: <itemizedlist>
		15	<listitem>
		16	Isolate each task from other tasks that are running on the system at the same time.
		17	</listitem>
		18
		19	<listitem>
		20	Allow allocation of more memory than the physical memory size of the machine.
		21	</listitem>
		22
		23	<listitem>
		24	Allow, in general, two programs linked at the same address to be loaded and executed without complicated relocations.
		25	</listitem>
		26	</itemizedlist></para>
		27	</section>
		28
		29	<section>
		30
		31
		32	<title>Paging</title>
		33
		34
		35
		36	<para>Virtual memory usually uses a paged memory model, in which the virtual
		37	address space is divided into <emphasis>pages</emphasis>
		38	(usually 4096 bytes in size) and physical memory is divided into
		39	frames (of the same size as a page, of course). Each page may be mapped to some
		40	frame and then, upon a memory access to a virtual address, the CPU performs
		41	<emphasis>address translation</emphasis> during instruction
		42	execution. A missing mapping generates a page fault exception, which invokes
		43	the kernel exception handler and thus allows the kernel to control the rules of
		44	memory access. The kernel stores the page mapping information in the
		45	<link linkend="page_tables">page tables</link>.</para>
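<para>As a minimal sketch of the address split that precedes translation, assuming 4096-byte pages (a 12-bit offset), the following C fragment separates a virtual address into a page number and an offset and recombines the offset with a frame number. The names <varname>PAGE_WIDTH</varname>, <varname>page_number</varname>, <varname>page_offset</varname> and <varname>translate</varname> are illustrative assumptions and are not taken from the HelenOS sources.</para>

<programlisting><![CDATA[
```c
#include <stdint.h>

#define PAGE_WIDTH 12u                          /* assumed: 4096-byte pages */
#define PAGE_SIZE  ((uintptr_t)1 << PAGE_WIDTH)

/* Page number: the part of the address that is looked up in the page tables. */
static inline uintptr_t page_number(uintptr_t vaddr)
{
    return vaddr >> PAGE_WIDTH;
}

/* Offset: the part of the address that is carried over unchanged. */
static inline uintptr_t page_offset(uintptr_t vaddr)
{
    return vaddr & (PAGE_SIZE - 1);
}

/* Combine the frame number found for the page with the original offset. */
static inline uintptr_t translate(uintptr_t frame_no, uintptr_t vaddr)
{
    return (frame_no << PAGE_WIDTH) | page_offset(vaddr);
}
```
]]></programlisting>

<para>In this sketch the frame number is simply passed in by the caller; on real hardware it comes from the page tables or from the TLB.</para>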
		46
		47
		48
		49	<para>The majority of architectures use multi-level page tables,
		50	which means that physical memory must be accessed several times before the
		51	physical address is obtained. This would impose a serious performance overhead on
		52	virtual memory management. To avoid it, the <link linkend="tlb">Translation
		53	Lookaside Buffer (TLB)</link> is used.</para>
		54
		55
		56
		57	<para>At the moment HelenOS does not support swapping.</para>
		58
		59	- we use page faults to allocate frames on demand within an as_area - on architectures that support it, we support non-executable pages
		60	</section>
		61
		62	<section>
11	bondari	63	<title>Address spaces</title>
9	bondari	64
35	bondari	65	<section>
		66	<title>Address spaces and areas</title>
		67
		68	<para>- an address space consists of so-called address space areas
		69	organized in a B+tree; these areas describe the parts of the address
		70	space in use that belong to the user address space. Each part is given by
		71	its base address, size and flags (rwx/dd).</para>
		72
		73	<para>- userspace threads can manipulate their address space
		74	(create/resize/share as_areas) by means of syscalls</para>
		75	</section>
		76
		77	<section>
		78	<title>Address Space ID (ASID)</title>
		79
		80	<para>- some hardware can tell different address spaces apart
		81	(the goal is to maximize TLB utilization); it does so by associating
		82	an address space identifier (ASID, RID, ppc32 ???) with each
		83	TLB/page table entry. These ids come in various widths: from 8 bits
		84	to 24 bits (how many does ppc32 have?)</para>
		85
		86	<para>- the kernel understands this and uses an ASID abstraction itself (on ia64
		87	it is, e.g., a number derived from the RID, on mips32 it is the ASID itself);
		88	the existence of ASIDs is a necessary condition for using the _global_ page hash table
		89	mechanism.</para>
		90
		91	<para>- on all architectures, there are far fewer ASIDs than the theoretical
		92	number of concurrently running tasks ~ address spaces, so a
		93	mechanism is implemented that allows an ASID to be taken away from one address space
		94	and assigned to another</para>
		95
		96	<para>- task ~ address space relationship: theoretically, an
		97	address space could be shared by several tasks, but we do not use this possibility
		98	and it is not handled in any way. Consequently, each task has its own
		99	address space</para>
		100	</section>
9	bondari	101	</section>
		102
		103	<section>
11	bondari	104	<title>Virtual address translation</title>
9	bondari	105
35	bondari	106	<section id="page_tables">
		107	<title>Page tables</title>
34	bondari	108
35	bondari	109	<para>The HelenOS kernel has two different paging
		110	implementations: <emphasis>4-level page tables</emphasis> and
		111	<emphasis>global hash tables</emphasis>, both accessible via a
		112	generic paging abstraction layer. This division is due to
		113	major architectural differences between the supported platforms.</para>
34	bondari	114
35	bondari	115	<formalpara>
		116	<title>4-level page tables</title>
34	bondari	117
35	bondari	118	<para>4-level page tables are a generalization of the hardware
		119	capabilities of certain platforms. <itemizedlist>
		120	<listitem>
		121	ia32 uses 2-level page tables, with full hardware support.
		122	</listitem>
34	bondari	123
35	bondari	124	<listitem>
		125	amd64 uses 4-level page tables, also coming with full hardware support.
		126	</listitem>
		127
		128	<listitem>
		129	mips and ppc32 use 2-level page tables, with software-simulated support.
		130	</listitem>
		131	</itemizedlist></para>
		132	</formalpara>
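<para>The way a virtual address selects an entry at each level can be sketched as follows. This is only an illustration, assuming the amd64 layout of a 12-bit page offset and four 9-bit indices; the names <varname>OFFSET_BITS</varname>, <varname>INDEX_BITS</varname> and <varname>level_index</varname> are hypothetical and do not come from the HelenOS generic paging interface.</para>

<programlisting><![CDATA[
```c
#include <stdint.h>

#define OFFSET_BITS 12u                        /* assumed: 4096-byte pages */
#define INDEX_BITS  9u                         /* assumed: 512 entries per table */
#define INDEX_MASK  ((1u << INDEX_BITS) - 1)

/*
 * Index into the table at the given level for a virtual address.
 * Level 0 is the root table, level 3 is the table that maps pages.
 */
static inline unsigned level_index(uint64_t vaddr, unsigned level)
{
    unsigned shift = OFFSET_BITS + INDEX_BITS * (3u - level);
    return (unsigned)((vaddr >> shift) & INDEX_MASK);
}
```
]]></programlisting>

<para>A translation walks the tables from level 0 to level 3, using each index to pick the entry that points to the next table, which is why one translation costs several memory accesses.</para>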
		133
		134	<formalpara>
		135	<title>Global hash tables</title>
		136
		137	<para>- global page hash table: there is only one in the whole system
		138	(it is used by ia64); note that ia64 has the VHPT turned off for now. A
		139	generic hash table with separate collision chains is used</para>
		140	</formalpara>
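<para>A single system-wide table has to tell apart equal page numbers from different address spaces, so the lookup key needs to include the ASID. The sketch below shows one possible shape of such a table with separate collision chains. The structure <varname>pte_t</varname>, the bucket count and the hash function are assumptions made for this example and are not the HelenOS implementation.</para>

<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

#define BUCKETS 64u                  /* hypothetical table size */

typedef struct pte {
    uint32_t asid;                   /* address space identifier */
    uintptr_t page;                  /* virtual page number */
    uintptr_t frame;                 /* physical frame number */
    struct pte *next;                /* next entry in the collision chain */
} pte_t;

static pte_t *table[BUCKETS];        /* one table for the whole system */

/* Illustrative hash that mixes the ASID into the page number. */
static size_t bucket_of(uint32_t asid, uintptr_t page)
{
    return (size_t)((page ^ ((uintptr_t)asid * 31u)) % BUCKETS);
}

/* Insert a mapping at the head of its collision chain. */
static void pht_insert(pte_t *e)
{
    size_t b = bucket_of(e->asid, e->page);
    e->next = table[b];
    table[b] = e;
}

/* Find the mapping of a page in the given address space, or NULL. */
static pte_t *pht_find(uint32_t asid, uintptr_t page)
{
    for (pte_t *e = table[bucket_of(asid, page)]; e != NULL; e = e->next) {
        if (e->asid == asid && e->page == page)
            return e;
    }
    return NULL;
}
```
]]></programlisting>

<para>Locking and removal of entries are left out of the sketch.</para>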
		141
		142	<para>Thanks to the abstract paging interface, it remains possible
		143	to add more paging implementations, for example B-tree page
		144	tables.</para>
		145	</section>
		146
		147	<section id="tlb">
		148	<title>Translation Lookaside Buffer</title>
		149
		150	<para>- TLBs cache information from the page tables; alternatively,
		151	the page tables (or various hw extensions [e.g. VHPT, ppc32
		152	hw hash table]) can be viewed as large TLBs</para>
		153
		154	<para>- when a mapping is modified in or removed from the
		155	page tables, consistency between the TLB and these tables must be ensured;
		156	this has to be done on all CPUs; for this we have a simplified version of the TLB
		157	shootdown mechanism; it is a variation on the algorithm described in: D.
		158	Black et al., "Translation Lookaside Buffer Consistency: A Software
		159	Approach," Proc. Third Int'l Conf. Architectural Support for
		160	Programming Languages and Operating Systems, 1989, pp. 113-122.</para>
		161
		162	<para>- it should be noted that more lightweight versions of the TLB shootdown
		163	algorithm exist</para>
		164	</section>
		165	</section>
26	bondari	166	</section>
9	bondari	167
26	bondari	168	<!-- End of VM -->
24	bondari	169
26	bondari	170	<section>
		171	<!-- Phys mem -->
		172
11	bondari	173	<title>Physical memory management</title>
9	bondari	174
24	bondari	175	<section id="zones_and_frames">
		176	<title>Zones and frames</title>
		177
34	bondari	178	<para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>
26	bondari	179
		180	<para>On some architectures, not all of physical memory is available for
		181	conventional use. This limitation requires the kernel to maintain a
		182	table of available and unavailable ranges of physical memory addresses.
		183	The main idea of zones is to create a memory zone entity, which is a
		184	contiguous chunk of memory available for allocation. If some chunk is
		185	not available, we simply do not put it in any zone.</para>
		186
		187	<para>A zone also serves informational purposes: it holds the
		188	number of free and busy frames. Physical memory
		189	allocation is also done within a particular zone. Allocation of zone
		190	frames must be handled by the <link linkend="frame_allocator">frame
		191	allocator</link> associated with the zone.</para>
		192
		193	<para>Some architectures (mips32, ppc32) have only one zone that
		194	covers the whole physical memory, while others (like ia32) may have
		195	multiple zones. Information about the zones on the current machine is stored in
		196	BIOS hardware tables or can be hardcoded into the kernel at compile
		197	time.</para>
24	bondari	198	</section>
		199
		200	<section id="frame_allocator">
		201	<title>Frame allocator</title>
		202
26	bondari	203	<formalpara>
		204	<title>Overview</title>
24	bondari	205
26	bondari	206	<para>The frame allocator provides physical memory allocation for the
		207	kernel. Because physical memory is organized into zones, the frame
		208	allocator always works in the context of a single zone, so it
		209	cannot allocate a piece of memory that spans different
		210	zones. Such a request cannot arise, because two adjacent zones can be merged
		211	into one. The frame allocator is also responsible for updating
		212	the number of free/busy frames in the zone. Physical memory
		213	allocation inside one <link linkend="zones_and_frames">memory
		214	zone</link> is handled by an instance of the <link
		215	linkend="buddy_allocator">buddy allocator</link> tailored to allocate
		216	blocks of physical memory frames.</para>
		217	</formalpara>
24	bondari	218
26	bondari	219	<formalpara>
		220	<title>Allocation / deallocation</title>
24	bondari	221
26	bondari	222	<para>Upon an allocation request, the frame allocator tries to find the first
		223	zone that can satisfy the request (i.e. has the required number of
		224	free frames). During deallocation, the frame allocator needs
		225	to find the zone that contains the deallocated frame. This approach could
		226	bring up two potential problems: <itemizedlist>
		227	<listitem>
		228	Linear search of zones does not do any good to performance, but the number of zones is not expected to be high. If it were, the list of zones could be replaced with a more time-efficient B-tree.
		229	</listitem>
24	bondari	230
26	bondari	231	<listitem>
		232	Quickly finding out whether a zone contains the required number of frames to allocate and whether this chunk of memory is properly aligned. This issue is solved by the buddy allocator.
		233	</listitem>
		234	</itemizedlist></para>
		235	</formalpara>
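<para>The two linear searches described above can be sketched as follows. The structure <varname>zone_t</varname> with its fields and the two function names are hypothetical simplifications for this example; a real zone would also carry its buddy allocator instance.</para>

<programlisting><![CDATA[
```c
#include <stddef.h>

typedef struct {
    size_t base;          /* number of the first frame in the zone */
    size_t count;         /* total number of frames in the zone */
    size_t free_count;    /* number of frames that are currently free */
} zone_t;

/* Allocation: first zone with at least `frames` free frames, or NULL. */
static zone_t *find_zone_for_alloc(zone_t *zones, size_t n, size_t frames)
{
    for (size_t i = 0; i < n; i++) {
        if (zones[i].free_count >= frames)
            return &zones[i];
    }
    return NULL;
}

/* Deallocation: zone that contains frame number `frame`, or NULL. */
static zone_t *find_zone_of_frame(zone_t *zones, size_t n, size_t frame)
{
    for (size_t i = 0; i < n; i++) {
        if (frame >= zones[i].base && frame - zones[i].base < zones[i].count)
            return &zones[i];
    }
    return NULL;
}
```
]]></programlisting>

<para>Note that the free frame counter is only a necessary condition: it does not show whether the free frames form one properly aligned block, which is the question left to the buddy allocator.</para>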
		236	</section>
17	jermar	237
34	bondari	238	<section id="buddy_allocator">
		239	<title>Buddy allocator</title>
17	jermar	240
34	bondari	241	<section>
		242	<title>Overview</title>
17	jermar	243
34	bondari	244	<para>In the buddy allocator, memory is broken down into power-of-two
		245	sized naturally aligned blocks. These blocks are organized in an array
		246	of lists in which list with index i contains all unallocated blocks of
		247	the size <mathphrase>2<superscript>i</superscript></mathphrase>. The
		248	index i is called the order of block. Should there be two adjacent
		249	equally sized blocks in list <mathphrase>i</mathphrase> (i.e.
		250	buddies), the buddy allocator would coalesce them and put the
		251	resulting block in list <mathphrase>i + 1</mathphrase>, provided that
		252	the resulting block would be naturally aligned. Similarly, when the
		253	allocator is asked to allocate a block of size
		254	<mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
		255	to satisfy the request from list with index i. If the request cannot
		256	be satisfied (i.e. the list i is empty), the buddy allocator will try
		257	to allocate and split larger block from list with index i + 1. Both of
		258	these algorithms are recursive. The recursion ends either when there
		259	are no blocks to coalesce in the former case or when there are no
		260	blocks that can be split in the latter case.</para>
17	jermar	261
34	bondari	262	<!--graphic fileref="images/mm1.png" format="EPS" /-->
17	jermar	263
34	bondari	264	<para>This approach greatly reduces external fragmentation of memory
		265	and helps in allocating bigger contiguous blocks of memory aligned to
		266	their size. On the other hand, the buddy allocator suffers from increased
		267	internal fragmentation of memory and is not suitable for general
		268	kernel allocations. This purpose is better addressed by the <link
		269	linkend="slab">slab allocator</link>.</para>
		270	</section>
17	jermar	271
34	bondari	272	<section>
		273	<title>Implementation</title>
17	jermar	274
34	bondari	275	<para>The buddy allocator is, in fact, an abstract framework which can
		276	be easily specialized to serve one particular task. It knows nothing
		277	about the nature of memory it helps to allocate. In order to beat the
		278	lack of this knowledge, the buddy allocator exports an interface that
		279	each of its clients is required to implement. When supplied an
		280	implementation of this interface, the buddy allocator can use
		281	specialized external functions to find buddy for a block, split and
		282	coalesce blocks, manipulate block order and mark blocks busy or
		283	available. For precise documentation of this interface, refer to <link
		284	linkend="???">HelenOS Generic Kernel Reference Manual</link>.</para>
17	jermar	285
34	bondari	286	<formalpara>
		287	<title>Data organization</title>
17	jermar	288
34	bondari	289	<para>Each entity allocable by the buddy allocator is required to
		290	contain space for storing block order number and a link variable
		291	used to interconnect blocks within the same order.</para>
15	bondari	292
34	bondari	293	<para>Whatever entities are allocated by the buddy allocator, the
		294	first entity within a block is used to represent the entire block.
		295	The first entity keeps the order of the whole block. Other entities
		296	within the block are assigned the magic value
		297	<constant>BUDDY_INNER_BLOCK</constant>. This is especially important
		298	for effective identification of buddies in one-dimensional array
		299	because the entity that represents a potential buddy cannot be
		300	associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
		301	is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
		302	not a buddy).</para>
15	bondari	303
34	bondari	304	<para>The buddy allocator always uses the first frame to represent a block
		305	of frames. This frame contains the <varname>buddy_order</varname> variable,
		306	which gives the size of the block it represents (
		307	<mathphrase>2<superscript>buddy_order</superscript></mathphrase>
		308	frames). Other frames in the block have this value set to the magic
		309	<constant>BUDDY_INNER_BLOCK</constant>, which is much greater than the
		310	buddy <varname>max_order</varname> value.</para>
15	bondari	311
34	bondari	312	<para>Each <varname>frame_t</varname> also contains a link member
		313	that holds the frame structure in the linked list of one order.</para>
		314	</formalpara>
15	bondari	315
34	bondari	316	<formalpara>
		317	<title>Allocation algorithm</title>
15	bondari	318
34	bondari	319	<para>Upon a request to allocate a block of <mathphrase>2<superscript>i</superscript></mathphrase>
		320	frames, the allocator checks whether any
		321	blocks are available in the order list <varname>i</varname>. If so,
		322	it removes a block from the order list and returns its address. If not,
		323	it recursively allocates a
		324	<mathphrase>2<superscript>i+1</superscript></mathphrase> frame
		325	block and splits it into two
		326	<mathphrase>2<superscript>i</superscript></mathphrase> frame blocks.
		327	It then adds one of the blocks to the order list <varname>i</varname>
		328	and returns the address of the other.</para>
		329	</formalpara>
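<para>The recursive allocation can be sketched on a toy zone of 16 frames, where blocks are identified by the index of their first frame. All names in the sketch (<varname>order_head</varname>, <varname>next_free</varname>, <varname>buddy_init</varname>, <varname>buddy_alloc</varname>) and the zone size are assumptions for illustration; the real allocator works through the client interface described above.</para>

<programlisting><![CDATA[
```c
#include <stddef.h>

#define MAX_ORDER 4                       /* hypothetical zone of 2^4 = 16 frames */
#define NFRAMES   (1u << MAX_ORDER)
#define NONE      ((size_t)-1)            /* "no block" marker */

static size_t next_free[NFRAMES];         /* link to the next block in one order list */
static size_t order_head[MAX_ORDER + 1];  /* first free block of each order */

/* Start with one free block of the highest order covering the whole zone. */
static void buddy_init(void)
{
    for (int i = 0; i <= MAX_ORDER; i++)
        order_head[i] = NONE;
    next_free[0] = NONE;
    order_head[MAX_ORDER] = 0;
}

/* Put a free block at the head of the list of the given order. */
static void push(int order, size_t frame)
{
    next_free[frame] = order_head[order];
    order_head[order] = frame;
}

/* Return the first frame of a free 2^order block, or NONE. */
static size_t buddy_alloc(int order)
{
    if (order > MAX_ORDER)
        return NONE;

    size_t block = order_head[order];
    if (block != NONE) {
        /* The order list is not empty: take its first block. */
        order_head[order] = next_free[block];
        return block;
    }

    /* Otherwise allocate a block twice as large and split it. */
    block = buddy_alloc(order + 1);
    if (block == NONE)
        return NONE;

    /* Keep the upper half free and hand out the lower half. */
    push(order, block + ((size_t)1 << order));
    return block;
}
```
]]></programlisting>

<para>For example, after <varname>buddy_init()</varname> a request for order 2 splits the order 4 block twice, leaving frames 8 to 15 in the order 3 list and frames 4 to 7 in the order 2 list.</para>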
15	bondari	330
34	bondari	331	<formalpara>
		332	<title>Deallocation algorithm</title>
17	jermar	333
34	bondari	334	<para>Check whether the block has a so-called buddy (another free
		335	<mathphrase>2<superscript>i</superscript></mathphrase> frame block
		336	that can be merged with the freed block into a
		337	<mathphrase>2<superscript>i+1</superscript></mathphrase> block).
		338	Technically, the buddy is the odd/even block for an even/odd block,
		339	respectively. In addition, we can impose the requirement that the resulting
		340	block be aligned to its size. This requirement guarantees
		341	natural alignment of the blocks coming out of the allocation
		342	system.</para>
9	bondari	343
34	bondari	344	<para>Using direct pointer arithmetic and the
		345	<varname>frame_t::ref_count</varname> and
		346	<varname>frame_t::buddy_order</varname> variables, the buddy is
		347	found in constant time.</para>
		348	</formalpara>
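<para>A possible constant-time buddy lookup is sketched below, assuming that frames are numbered relative to a zone whose start is aligned to the largest block size. The buddy of a block of order i then differs from it only in bit i of the frame number. The type <varname>sketch_frame_t</varname> and the function names are hypothetical; only the roles of the reference count and of the buddy order follow the text.</para>

<programlisting><![CDATA[
```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-frame bookkeeping, reduced to the two fields used here. */
typedef struct {
    unsigned ref_count;      /* 0 means the frame is free */
    unsigned buddy_order;    /* order of the block this frame represents */
} sketch_frame_t;

/* First frame of the buddy: flip bit `order` (even block to odd and back). */
static size_t buddy_of(size_t frame, unsigned order)
{
    return frame ^ ((size_t)1 << order);
}

/* First frame of the merged block: the lower of the two buddies. */
static size_t merged_block(size_t frame, unsigned order)
{
    return frame & ~((size_t)1 << order);
}

/*
 * The buddy can be coalesced only if it is free and represents a block of
 * the same order. A frame marked as an inner frame of a larger block fails
 * the order comparison.
 */
static bool can_coalesce(const sketch_frame_t *frames, size_t frame, unsigned order)
{
    const sketch_frame_t *buddy = &frames[buddy_of(frame, order)];
    return buddy->ref_count == 0 && buddy->buddy_order == order;
}
```
]]></programlisting>

<para>Because the merged block starts at the lower buddy, it is aligned to its own size, which matches the natural alignment requirement stated above.</para>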
		349	</section>
26	bondari	350	</section>
		351
15	bondari	352	<section id="slab">
11	bondari	353	<title>Slab allocator</title>
9	bondari	354
26	bondari	355	<section>
34	bondari	356	<title>Overview</title>
9	bondari	357
34	bondari	358	<para><termdef><glossterm>Slab</glossterm> represents a contiguous
		359	piece of memory, usually made of several physically contiguous
		360	pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists
		361	of one or more slabs.</termdef></para>
		362
26	bondari	363	<para>The majority of memory allocation requests in the kernel are for
		364	small, frequently used data structures. For this purpose the slab
34	bondari	365	allocator is a perfect solution. The basic idea behind the slab
26	bondari	366	allocator is to have lists of commonly used objects available packed
		367	into pages. This avoids the overhead of allocating and destroying
34	bondari	368	commonly used types of objects such as threads, virtual memory structures
		369	etc. Also, because the allocated size matches exactly, slab allocation
		370	completely eliminates the internal fragmentation issue.</para>
26	bondari	371	</section>
24	bondari	372
26	bondari	373	<section>
34	bondari	374	<title>Implementation</title>
9	bondari	375
26	bondari	376	<para>The SLAB allocator is closely modelled after <ulink
		377	url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
		378	OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams </ulink>
		379	with the following exceptions: <itemizedlist>
		380	<listitem>
		381	empty SLABs are deallocated immediately (in Linux they are kept in a linked list, in Solaris ???)
		382	</listitem>
		383
		384	<listitem>
		385	empty magazines are deallocated when not needed (in Solaris they are held in a linked list in the slab cache)
		386	</listitem>
		387	</itemizedlist> The following features are not currently supported but
		388	would be easy to add: <itemizedlist>
		389	<listitem>
		390	cache coloring
		391	</listitem>
		392
		393	<listitem>
34	bondari	394	dynamic magazine growth (different magazine sizes are already supported, but we would need to adjust the allocation strategy)
26	bondari	395	</listitem>
		396	</itemizedlist></para>
		397
34	bondari	398	<section>
		399	<title>Magazine layer</title>
26	bondari	400
34	bondari	401	<para>The global SLAB locking mechanism serializes the processing of all slab
		402	allocation requests, which is a severe bottleneck on SMP
		403	architectures. For this reason, a new layer was added to the
		404	classic slab allocator design: the slab allocator was extended to
		405	support per-CPU caches called 'magazines' to achieve good SMP scaling.
		406	<termdef>The slab SMP performance bottleneck was resolved by introducing
		407	a per-CPU caching scheme called the <glossterm>magazine
		408	layer</glossterm></termdef>.</para>
26	bondari	409
34	bondari	410	<para>A magazine is an N-element cache of objects, so each magazine can
		411	satisfy N allocations. A magazine behaves like an automatic weapon
		412	magazine (LIFO, stack), so allocation and deallocation become simple
		413	pop/push pointer operations. The trick is that a CPU does not access global
		414	slab allocator data while allocating from its magazine, which
		415	makes parallel allocations on different CPUs possible.</para>
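<para>The stack behaviour of a single magazine can be sketched as follows. The type <varname>mag_t</varname>, the capacity <varname>MAG_SIZE</varname> and the function names are illustrative assumptions and are not the HelenOS definitions.</para>

<programlisting><![CDATA[
```c
#include <stdbool.h>
#include <stddef.h>

#define MAG_SIZE 4               /* hypothetical magazine capacity N */

typedef struct {
    size_t busy;                 /* number of objects currently cached */
    void *objs[MAG_SIZE];        /* cached object pointers, used as a stack */
} mag_t;

/* Deallocation path: push an object; fails if the magazine is full. */
static bool mag_push(mag_t *m, void *obj)
{
    if (m->busy == MAG_SIZE)
        return false;
    m->objs[m->busy++] = obj;
    return true;
}

/* Allocation path: pop the most recently pushed object, or NULL if empty. */
static void *mag_pop(mag_t *m)
{
    if (m->busy == 0)
        return NULL;
    return m->objs[--m->busy];
}
```
]]></programlisting>

<para>Neither operation touches any shared data, so as long as a magazine belongs to one CPU only, no global lock is needed on this path.</para>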
26	bondari	416
34	bondari	417	<para>The implementation also requires another feature: the
		418	CPU-bound magazine is actually a pair of magazines, which avoids
		419	thrashing when 1 item is repeatedly allocated and deallocated at the
		420	magazine size boundary. LIFO order is enforced, which should avoid
		421	fragmentation as much as possible.</para>
26	bondari	422
34	bondari	423	<para>Another important entity of the magazine layer is the full magazine
		424	depot, which stores full magazines that any of the CPU
		425	magazine caches can use to reload its active CPU magazine. The magazine depot can be
		426	pre-filled with full magazines during initialization, but in the current
		427	implementation it is filled during object deallocation, when a CPU
		428	magazine becomes full.</para>
26	bondari	429
34	bondari	430	<para>Slab allocator control structures are allocated from special
		431	slabs that are marked by a special flag, indicating that they should
		432	not be used for the slab magazine layer. This is done to avoid possible
		433	infinite recursion and deadlock during conventional slab allocation
		434	requests.</para>
		435	</section>
26	bondari	436
34	bondari	437	<section>
		438	<title>Allocation/deallocation</title>
26	bondari	439
34	bondari	440	<para>Every cache contains a list of full slabs and a list of partially
		441	full slabs. Empty slabs are immediately freed (thrashing is
		442	avoided thanks to the magazines).</para>
26	bondari	443
34	bondari	444	<para>The SLAB allocator allocates lots of space and does not free
		445	it. When the frame allocator fails to allocate a frame, it calls
		446	slab_reclaim(). It tries 'light reclaim' first, then brutal reclaim.
		447	The light reclaim releases slabs from the CPU-shared magazine list
		448	until at least 1 slab is deallocated in each cache (this algorithm
		449	should probably change). The brutal reclaim removes all cached
		450	objects, even from CPU-bound magazines.</para>
		451
		452	<formalpara>
		453	<title>Allocation</title>
		454
		455	<para><emphasis>Step 1.</emphasis> Upon an allocation
		456	request, the slab allocator first checks the availability of memory
		457	in the local CPU-bound magazine. If an object is there, we just "pop"
		458	the CPU magazine and return the pointer to the object.</para>
		459
		460	<para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
		461	empty, the allocator attempts to reload the magazine by swapping it with
		462	the second CPU magazine and returns to the first step.</para>
		463
		464	<para><emphasis>Step 3.</emphasis> Now we are in the situation
		465	when both CPU-bound magazines are empty, which makes the allocator
		466	access the shared full-magazine depot to reload the CPU-bound magazines.
		467	If the reload is successful (meaning there are full magazines in the depot),
		468	the algorithm continues at Step 1.</para>
		469
		470	<para><emphasis>Step 4.</emphasis> Final step of the allocation.
		471	In this step the object is allocated from the conventional slab layer
		472	and a pointer to it is returned.</para>
		473	</formalpara>
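<para>The four allocation steps can be put together in a sketch like the one below. It is a simplified model and not the HelenOS code: all type and function names are hypothetical, locking is omitted, <varname>malloc</varname> stands in for the conventional slab layer, and an empty magazine displaced in Step 3 is merely collected in a <varname>discarded</varname> list where a real allocator would deallocate it.</para>

<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdlib.h>

#define MAG_SIZE 2                /* hypothetical magazine capacity N */

typedef struct mag {
    size_t busy;                  /* number of objects currently cached */
    void *objs[MAG_SIZE];         /* cached object pointers, used as a stack */
    struct mag *next;             /* link in a magazine list */
} mag_t;

typedef struct {
    mag_t *current;               /* active CPU-bound magazine */
    mag_t *last;                  /* second CPU-bound magazine */
} cpu_mags_t;

typedef struct {
    mag_t *depot;                 /* shared list of full magazines */
    mag_t *discarded;             /* displaced empty magazines (stand-in for freeing) */
    size_t obj_size;              /* size of the objects served by this cache */
} cache_t;

static void *cache_alloc(cache_t *cache, cpu_mags_t *cpu)
{
    for (;;) {
        /* Step 1: pop from the active CPU-bound magazine. */
        if (cpu->current != NULL && cpu->current->busy > 0)
            return cpu->current->objs[--cpu->current->busy];

        /* Step 2: swap in the second magazine if it holds objects. */
        if (cpu->last != NULL && cpu->last->busy > 0) {
            mag_t *tmp = cpu->current;
            cpu->current = cpu->last;
            cpu->last = tmp;
            continue;
        }

        /* Step 3: both are empty, so take a full magazine from the depot. */
        if (cache->depot != NULL) {
            mag_t *full = cache->depot;
            cache->depot = full->next;
            if (cpu->last != NULL) {
                cpu->last->next = cache->discarded;
                cache->discarded = cpu->last;
            }
            cpu->last = cpu->current;
            cpu->current = full;
            continue;
        }

        /* Step 4: fall back to the conventional slab layer (malloc here). */
        return malloc(cache->obj_size);
    }
}
```
]]></programlisting>

<para>The loop mirrors the "continue at Step 1" wording of the steps: each successful reload is followed by another attempt to pop from the active magazine.</para>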
		474
		475	<formalpara>
		476	<title>Deallocation</title>
		477
		478	<para><emphasis>Step 1.</emphasis> Upon a deallocation request,
		479	the slab allocator checks whether the local CPU-bound magazine is not
		480	full. If it is not, we just push the pointer to this
		481	magazine.</para>
		482
		483	<para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
		484	full, the allocator attempts to reload the magazine by swapping it with
		485	the second CPU magazine and returns to the first step.</para>
		486
		487	<para><emphasis>Step 3.</emphasis> Now we are in the situation
		488	when both CPU-bound magazines are full, which makes the allocator
		489	access the shared full-magazine depot, put one of the magazines into
		490	the depot and create a new empty magazine. The algorithm continues at
		491	Step 1.</para>
		492	</formalpara>
		493	</section>
26	bondari	494	</section>
15	bondari	495	</section>
26	bondari	496
		497	<!-- End of Physmem -->
		498	</section>
		499
		500	<section>
		501	<title>Memory sharing</title>
		502
		503	<para>Not implemented yet(?)</para>
		504	</section>
11	bondari	505	</chapter>

(root)/design/trunk/src/ch_memory_management.xml @ 185 – Rev 35