WebSVN – HelenOS-doc – Blame – /design/trunk/src/ch_memory_management.xml

Rev	Author	Line No.	Line
9	bondari	1	<?xml version="1.0" encoding="UTF-8"?>
11	bondari	2	<chapter id="mm">
		3	<?dbhtml filename="mm.html"?>
9	bondari	4
11	bondari	5	<title>Memory management</title>
9	bondari	6
26	bondari	7	<section>
11	bondari	8	<title>Virtual memory management</title>
9	bondari	9
		10	<section>
35	bondari	11	<title>Introduction</title>
		12
		13	<para>Virtual memory is a memory management technique that the kernel
		14	uses to achieve several important goals: <itemizedlist>
		15	<listitem>
		16	Isolate each task from the other tasks running on the system at the same time.
		17	</listitem>
		18
		19	<listitem>
		20	Allow more memory to be allocated than the machine physically has.
		21	</listitem>
		22
		23	<listitem>
		24	Allow, in general, two programs linked at the same address to be loaded and executed without complicated relocations.
		25	</listitem>
		26	</itemizedlist></para>
38	bondari	27
		28
		29	<para><!--
		30
		31	TLB shootdown ASID/ASID:PAGE/ALL.
		32	TLB shootdown requests can come in asynchronously,
		33	so there is a cache of TLB shootdown requests. Upon cache overflow, TLB shootdown ALL is executed.
		34
		35
		36	<para>
		37	Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
		38	Special address space area type - device - prohibits shrink/extend syscalls to call on it.
		39	Address space has link to mapping tables (hierarchical - per Address space, hash - global tables).
		40	</para>
		41
		42	--></para>
35	bondari	43	</section>
		44
		45	<section>
		46
		47
		48	<title>Paging</title>
		49
		50	<para>Virtual memory usually uses a paged memory model, in which the
		51	virtual address space is divided into <emphasis>pages</emphasis>
		52	(usually 4096 bytes in size) and physical memory is divided into
37	bondari	53	frames (of the same size as a page). Each page may be mapped to a
35	bondari	54	frame; upon a memory access to a virtual address, the CPU performs
		55	<emphasis>address translation</emphasis> during instruction
		56	execution. A missing mapping generates a page fault exception and
		57	invokes the kernel exception handler, which allows the kernel to control
		58	the rules of memory access. The kernel stores page mapping information in the
		59	<link linkend="page_tables">page tables</link>.</para>
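<para>The translation step described above can be illustrated with a minimal
sketch. It assumes 4096-byte pages and a flat, single-level mapping array;
the names <varname>translate</varname>, <constant>NO_FRAME</constant> and
<constant>PAGE_WIDTH</constant> are hypothetical and are not taken from the
HelenOS sources.</para>
<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_WIDTH 12u                  /* 4096-byte pages, as in the text */
#define PAGE_SIZE  (1u << PAGE_WIDTH)
#define NO_FRAME   UINT32_MAX           /* hypothetical "not mapped" marker */

/*
 * Translate a virtual address using a flat page -> frame array.
 * Returns 0 and stores the physical address on success, or -1 when
 * the page has no mapping (where real hardware would raise a page fault).
 */
static int translate(const uint32_t *map, uint32_t pages,
    uint32_t vaddr, uint32_t *paddr)
{
    uint32_t page = vaddr >> PAGE_WIDTH;          /* page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);    /* offset inside the page */

    assert(map != NULL && paddr != NULL);
    if (page >= pages || map[page] == NO_FRAME)
        return -1;
    *paddr = (map[page] << PAGE_WIDTH) | offset;
    return 0;
}
```
]]></programlisting>
<para>Only the page number is translated; the offset within the page is
carried over unchanged.</para>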
		60
		61
		62
		63	<para>The majority of architectures use multi-level page tables,
		64	which means that physical memory must be accessed several times before
		65	the physical address is obtained. This would impose a serious performance
		66	overhead on virtual memory management. To avoid it, the <link linkend="tlb">Translation
		67	Lookaside Buffer (TLB)</link> is used.</para>
		68
		69
		70
		71	<para>At the moment HelenOS does not support swapping.</para>
		72
37	bondari	73	<para>Page faults are used to allocate frames on demand within an as_area. On architectures that support it, non-executable pages are supported.</para>
35	bondari	74	</section>
		75
		76	<section>
11	bondari	77	<title>Address spaces</title>
9	bondari	78
35	bondari	79	<section>
		80	<title>Address spaces and areas</title>
		81
37	bondari	82	<para>
		83
		84	The address space consists of so-called address space areas
35	bondari	85	organized in a B+tree; these areas describe the used parts of the
		86	address space that belong to the user address space. Each part is given by
37	bondari	87	its base address, size and flags (rwx/dd).
35	bondari	88
37	bondari	89	</para>
		90
35	bondari	91	<para>User threads can manipulate their address space
		92	(create, resize and share as_areas) by means of syscalls.</para>
		93	</section>
		94
		95	<section>
		96	<title>Address Space ID (ASID)</title>
		97
		98	<para>Some hardware can distinguish different address spaces from one
		99	another (the goal is to maximize TLB utilization); it does so by
		100	associating an address space identifier (ASID, RID, ppc32 ???) with each
		101	TLB or page table entry. These identifiers have various widths: from
		102	8 bits to 24 bits (how many does ppc32 have?).</para>
		103
		104	<para>The kernel understands this and itself uses the abstraction of an ASID
		105	(on ia64 it is, for example, a number derived from the RID; on mips32 it is the ASID itself);
		106	the existence of ASIDs is a necessary condition for using the _global_ page hash table
		107	mechanism.</para>
		108
		109	<para>On all architectures, there are far fewer ASIDs than the theoretical
		110	number of concurrently running tasks ~ address spaces, so a
		111	mechanism is implemented that allows an ASID to be taken away from one address space
		112	and assigned to another.</para>
		113
		114	<para>Relationship task ~ address space: in theory, an
		115	address space could be shared by several tasks, but we do not use this possibility
		116	and it is not handled in any way. Consequently, every task has its own
		117	address space.</para>
		118	</section>
38	bondari	119
		120
		121
9	bondari	122	</section>
		123
		124	<section>
11	bondari	125	<title>Virtual address translation</title>
9	bondari	126
35	bondari	127	<section id="page_tables">
		128	<title>Page tables</title>
34	bondari	129
35	bondari	130	<para>The HelenOS kernel has two different paging
		131	implementations: <emphasis>4-level page tables</emphasis> and
		132	<emphasis>global hash tables</emphasis>, both accessible via a
		133	generic paging abstraction layer. This division is due to
		134	major architectural differences between the supported platforms.</para>
34	bondari	135
35	bondari	136	<formalpara>
		137	<title>4-level page tables</title>
34	bondari	138
35	bondari	139	<para>4-level page tables are a generalization of the hardware
		140	capabilities of certain platforms. <itemizedlist>
		141	<listitem>
		142	ia32 uses 2-level page tables, with full hardware support.
		143	</listitem>
34	bondari	144
35	bondari	145	<listitem>
		146	amd64 uses 4-level page tables, also coming with full hardware support.
		147	</listitem>
		148
		149	<listitem>
		150	mips and ppc32 use 2-level page tables with software-simulated support.
		151	</listitem>
		152	</itemizedlist></para>
		153	</formalpara>
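<para>One way to picture this generalization is a layout descriptor that
gives the bit width of each of the four table indices, where a width of zero
folds a level away. The sketch below is only an illustration: the type
<varname>pt_layout_t</varname>, the function <varname>pt_index</varname> and
the concrete widths used in the usage note (9/9/9/9 with a 12-bit offset for
amd64, 10/0/0/10 with a 12-bit offset for ia32) are assumptions, not
definitions from the HelenOS sources.</para>
<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout descriptor: widths of the four indices and the offset. */
typedef struct {
    unsigned width[4];       /* width[0] belongs to the top-level table */
    unsigned offset_width;   /* bits of the offset inside a page */
} pt_layout_t;

/* Return the index into the table at the given level (0..3) for vaddr. */
static uint64_t pt_index(const pt_layout_t *l, unsigned level, uint64_t vaddr)
{
    unsigned shift = l->offset_width;

    assert(level < 4);
    /* Skip the bits consumed by all lower levels. */
    for (unsigned i = 3; i > level; i--)
        shift += l->width[i];
    if (l->width[level] == 0)
        return 0;            /* level is folded away on this platform */
    return (vaddr >> shift) & ((UINT64_C(1) << l->width[level]) - 1);
}
```
]]></programlisting>
<para>With the layout <literal>{ {9, 9, 9, 9}, 12 }</literal> all four levels
yield real indices, whereas with <literal>{ {10, 0, 0, 10}, 12 }</literal>
the two middle levels always yield index 0, so generic code can walk
four levels on both kinds of platform.</para>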
		154
		155	<formalpara>
		156	<title>Global hash tables</title>
		157
		158	<para>There is only one global page hash table in the whole system
		159	(it is used by ia64); note that ia64 currently has the VHPT disabled. A
		160	generic hash table with separate collision chains is used.</para>
		161	</formalpara>
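<para>A single system-wide table has to tell address spaces apart, which is
why the ASID is part of the lookup key. The following minimal sketch shows a
chained hash table keyed by the pair (ASID, page). The structure names, the
bucket count and the hash function are hypothetical choices made for this
illustration only.</para>
<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PHT_BUCKETS 64u          /* hypothetical bucket count */

typedef struct pte {
    uint32_t asid;               /* address space identifier */
    uintptr_t page;              /* virtual page address */
    uintptr_t frame;             /* physical frame address */
    struct pte *next;            /* separate collision chain */
} pte_t;

typedef struct {
    pte_t *bucket[PHT_BUCKETS];
} page_ht_t;

/* Hypothetical hash: mix the page number with the ASID. */
static size_t pht_hash(uint32_t asid, uintptr_t page)
{
    return (size_t) (((page >> 12) ^ asid) % PHT_BUCKETS);
}

static void pht_insert(page_ht_t *ht, pte_t *e)
{
    assert(ht != NULL && e != NULL);
    size_t i = pht_hash(e->asid, e->page);
    e->next = ht->bucket[i];     /* prepend to the chain */
    ht->bucket[i] = e;
}

static pte_t *pht_find(const page_ht_t *ht, uint32_t asid, uintptr_t page)
{
    assert(ht != NULL);
    for (pte_t *e = ht->bucket[pht_hash(asid, page)]; e != NULL; e = e->next)
        if (e->asid == asid && e->page == page)
            return e;
    return NULL;                 /* no mapping: the caller handles the fault */
}
```
]]></programlisting>
<para>Two address spaces can map the same virtual page to different frames
because their entries differ in the ASID part of the key.</para>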
		162
		163	<para>Thanks to the abstract paging interface, it remains possible
		164	to add further paging implementations, for example B-tree page
		165	tables.</para>
		166	</section>
		167
		168	<section id="tlb">
		169	<title>Translation Lookaside Buffer</title>
		170
		171	<para>TLBs cache information stored in the page tables; alternatively,
		172	the page tables (or various hardware extensions [e.g. VHPT, ppc32
		173	hw hash table]) can be viewed as large TLBs.</para>
		174
		175	<para>When a mapping is modified in or removed from the
		176	page tables, the TLB must be kept consistent with these
		177	tables, and this must be done on all CPUs. For this purpose we have a simplified version of the TLB
		178	shootdown mechanism; it is a variation of the algorithm described in: D.
		179	Black et al., "Translation Lookaside Buffer Consistency: A Software
		180	Approach," Proc. Third Int'l Conf. Architectural Support for
		181	Programming Languages and Operating Systems, 1989, pp. 113-122.</para>
		182
		183	<para>It should be noted that more lightweight versions of the TLB shootdown
		184	algorithm exist.</para>
		185	</section>
		186	</section>
26	bondari	187	</section>
9	bondari	188
26	bondari	189	<!-- End of VM -->
24	bondari	190
26	bondari	191	<section>
		192	<!-- Phys mem -->
		193
11	bondari	194	<title>Physical memory management</title>
9	bondari	195
24	bondari	196	<section id="zones_and_frames">
		197	<title>Zones and frames</title>
		198
34	bondari	199	<para><!--graphic fileref="images/mm2.png" /--><!--graphic fileref="images/buddy_alloc.svg" format="SVG" /--></para>
26	bondari	200
		201	<para>On some architectures, not all physical memory is available for
		202	conventional use. This limitation requires the kernel to maintain a
		203	table of available and unavailable ranges of physical memory addresses.
		204	The main idea of zones is to create a memory zone entity, which is a
		205	contiguous chunk of memory available for allocation. If a chunk is
		206	not available, we simply do not put it in any zone.</para>
		207
		208	<para>A zone also serves informational purposes: it records
		209	the number of free and busy frames. Physical memory
		210	allocation is always done within a particular zone. Allocation of zone
		211	frames must be handled by the <link linkend="frame_allocator">frame
		212	allocator</link> associated with the zone.</para>
		213
		214	<para>Some architectures (mips32, ppc32) have only one zone that
		215	covers the whole physical memory, while others (such as ia32) may have
		216	multiple zones. Information about the zones on the current machine is stored in
		217	BIOS hardware tables or can be hardcoded into the kernel at compile
		218	time.</para>
24	bondari	219	</section>
		220
		221	<section id="frame_allocator">
		222	<title>Frame allocator</title>
		223
26	bondari	224	<formalpara>
		225	<title>Overview</title>
24	bondari	226
26	bondari	227	<para>The frame allocator provides physical memory allocation for the
		228	kernel. Because physical memory is organized into zones, the frame
		229	allocator always works in the context of a particular zone, so a single
		230	allocation cannot span more than one zone. This is not a limitation in
		231	practice, because two adjacent zones can be merged
		232	into one. The frame allocator is also responsible for updating
		233	the number of free and busy frames in the zone. Physical memory
		234	allocation inside one <link linkend="zones_and_frames">memory
		235	zone</link> is handled by an instance of the <link
		236	linkend="buddy_allocator">buddy allocator</link> tailored to allocate
		237	blocks of physical memory frames.</para>
		238	</formalpara>
24	bondari	239
26	bondari	240	<formalpara>
		241	<title>Allocation / deallocation</title>
24	bondari	242
26	bondari	243	<para>Upon an allocation request, the frame allocator tries to find the first
		244	zone that can satisfy the request (that is, one with the required number of
		245	free frames). During deallocation, the frame allocator needs
		246	to find the zone that contains the deallocated frame. This approach
		247	raises two potential problems: <itemizedlist>
		248	<listitem>
		249	A linear search of zones does not help performance, but the number of zones is not expected to be high. If it is, the list of zones can be replaced with a more time-efficient B-tree.
		250	</listitem>
24	bondari	251
26	bondari	252	<listitem>
		253	The allocator must quickly find out whether a zone contains the required number of frames and whether this chunk of memory is properly aligned. This issue is solved by the buddy allocator.
		254	</listitem>
		255	</itemizedlist></para>
		256	</formalpara>
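<para>The two zone searches described above can be sketched as plain linear
scans. The structure <varname>zone_t</varname> below and both function names
are hypothetical simplifications; in particular, a sufficient free frame
count does not guarantee a contiguous, aligned block, which is the part left
to the buddy allocator.</para>
<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified zone descriptor. */
typedef struct {
    size_t base;        /* number of the first frame in the zone */
    size_t count;       /* total number of frames in the zone */
    size_t free_count;  /* frames currently free */
    size_t busy_count;  /* frames currently allocated */
} zone_t;

/* Allocation: index of the first zone with enough free frames, or -1. */
static int find_zone_for_alloc(const zone_t *zones, size_t n, size_t frames)
{
    assert(zones != NULL);
    for (size_t i = 0; i < n; i++)
        if (zones[i].free_count >= frames)
            return (int) i;
    return -1;
}

/* Deallocation: index of the zone that contains the frame, or -1. */
static int find_zone_of_frame(const zone_t *zones, size_t n, size_t frame)
{
    assert(zones != NULL);
    for (size_t i = 0; i < n; i++)
        if (frame >= zones[i].base && frame - zones[i].base < zones[i].count)
            return (int) i;
    return -1;
}
```
]]></programlisting>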
		257	</section>
17	jermar	258
34	bondari	259	<section id="buddy_allocator">
		260	<title>Buddy allocator</title>
17	jermar	261
34	bondari	262	<section>
		263	<title>Overview</title>
17	jermar	264
34	bondari	265	<para>In the buddy allocator, memory is broken down into power-of-two
		266	sized, naturally aligned blocks. These blocks are organized in an array
		267	of lists in which the list with index i contains all unallocated blocks of
		268	the size <mathphrase>2<superscript>i</superscript></mathphrase>. The
		269	index i is called the order of block. Should there be two adjacent
		270	equally sized blocks in list <mathphrase>i</mathphrase> (i.e.
		271	buddies), the buddy allocator would coalesce them and put the
		272	resulting block in list <mathphrase>i + 1</mathphrase>, provided that
		273	the resulting block would be naturally aligned. Similarly, when the
		274	allocator is asked to allocate a block of size
		275	<mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
		276	to satisfy the request from list with index i. If the request cannot
		277	be satisfied (i.e. the list i is empty), the buddy allocator will try
		278	to allocate and split larger block from list with index i + 1. Both of
		279	these algorithms are recursive. The recursion ends either when there
		280	are no blocks to coalesce in the former case or when there are no
		281	blocks that can be split in the latter case.</para>
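<para>For naturally aligned power-of-two blocks, the buddy relationship
reduces to simple index arithmetic. The sketch below works with frame indices
relative to the start of the managed area; the function names are
hypothetical.</para>
<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Index of the buddy of the block of the given order that starts at index. */
static size_t buddy_index(size_t index, unsigned order)
{
    size_t size = (size_t) 1 << order;

    assert(index % size == 0);   /* blocks are naturally aligned */
    return index ^ size;         /* flip the bit that tells the buddies apart */
}

/* Start of the block of order + 1 formed by coalescing the two buddies. */
static size_t merged_index(size_t index, unsigned order)
{
    size_t size = (size_t) 1 << order;

    assert(index % size == 0);
    return index & ~size;        /* the lower of the two buddies */
}
```
]]></programlisting>
<para>For example, the order-2 blocks starting at frames 8 and 12 are
buddies, and coalescing them gives an order-3 block starting at frame 8,
which is again naturally aligned.</para>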
17	jermar	282
34	bondari	283	<!--graphic fileref="images/mm1.png" format="EPS" /-->
17	jermar	284
34	bondari	285	<para>This approach greatly reduces external fragmentation of memory
		286	and helps in allocating bigger continuous blocks of memory aligned to
		287	their size. On the other hand, the buddy allocator suffers increased
		288	internal fragmentation of memory and is not suitable for general
		289	kernel allocations. This purpose is better addressed by the <link
		290	linkend="slab">slab allocator</link>.</para>
		291	</section>
17	jermar	292
34	bondari	293	<section>
		294	<title>Implementation</title>
17	jermar	295
34	bondari	296	<para>The buddy allocator is, in fact, an abstract framework which can
		297	be easily specialized to serve one particular task. It knows nothing
		298	about the nature of the memory it helps to allocate. To make up for
		299	this lack of knowledge, the buddy allocator exports an interface that
		300	each of its clients is required to implement. When supplied with an
		301	implementation of this interface, the buddy allocator can use
		302	specialized external functions to find the buddy of a block, split and
		303	coalesce blocks, manipulate block order and mark blocks busy or
		304	available. For precise documentation of this interface, refer to <link
		305	linkend="???">HelenOS Generic Kernel Reference Manual</link>.</para>
17	jermar	306
34	bondari	307	<formalpara>
		308	<title>Data organization</title>
17	jermar	309
34	bondari	310	<para>Each entity allocable by the buddy allocator is required to
		311	contain space for storing block order number and a link variable
		312	used to interconnect blocks within the same order.</para>
15	bondari	313
34	bondari	314	<para>Whatever entities are allocated by the buddy allocator, the
		315	first entity within a block is used to represent the entire block.
		316	The first entity keeps the order of the whole block. Other entities
		317	within the block are assigned the magic value
		318	<constant>BUDDY_INNER_BLOCK</constant>. This is especially important
		319	for effective identification of buddies in one-dimensional array
		320	because the entity that represents a potential buddy cannot be
		321	associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
		322	is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
		323	not a buddy).</para>
15	bondari	324
34	bondari	325	<para>The buddy allocator always uses the first frame to represent a frame
		326	block. This frame contains the <varname>buddy_order</varname> variable,
		327	which gives the size of the block it represents (
		328	<mathphrase>2<superscript>buddy_order</superscript></mathphrase>
		329	frames). Other frames in the block have this value set to the magic
		330	<constant>BUDDY_INNER_BLOCK</constant>, which is much greater than
		331	the buddy <varname>max_order</varname> value.</para>
15	bondari	332
34	bondari	333	<para>Each <varname>frame_t</varname> also contains a pointer member
		334	that links the frame structure into the list of its order.</para>
		335	</formalpara>
15	bondari	336
34	bondari	337	<formalpara>
		338	<title>Allocation algorithm</title>
15	bondari	339
34	bondari	340	<para>Upon a request to allocate a block of <mathphrase>2<superscript>i</superscript></mathphrase>
		341	frames, the allocator checks whether there are any
		342	blocks available in the order list <varname>i</varname>. If so, it
		343	removes a block from the order list and returns its address. If not, it
		344	recursively allocates a
		345	<mathphrase>2<superscript>i+1</superscript></mathphrase> frame
		346	block and splits it into two
		347	<mathphrase>2<superscript>i</superscript></mathphrase> frame blocks.
		348	Then it adds one of the blocks to the <varname>i</varname> order list
		349	and returns the address of the other.</para>
		350	</formalpara>
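<para>The recursive allocation can be sketched on a toy zone of 16 frames.
This is a simplified model, not the HelenOS implementation: blocks are
identified by the index of their first frame, the order lists are singly
linked through a <varname>next</varname> array, and all names
(<varname>buddy_t</varname>, <varname>buddy_init</varname>,
<varname>buddy_alloc</varname>, <constant>NIL</constant>) are hypothetical.
The choice to return the lower half of a split block and keep the upper half
free is also an assumption; the text only says that one half is kept.</para>
<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 4u                  /* toy zone of 2^4 = 16 frames */
#define FRAMES    (1u << MAX_ORDER)
#define NIL       ((size_t) -1)       /* "no block" marker */

typedef struct {
    size_t head[MAX_ORDER + 1];       /* head[i]: first free block of order i */
    size_t next[FRAMES];              /* next free block in the same list */
} buddy_t;

static void buddy_init(buddy_t *b)
{
    assert(b != NULL);
    for (unsigned i = 0; i <= MAX_ORDER; i++)
        b->head[i] = NIL;
    b->head[MAX_ORDER] = 0;           /* one free block covers the whole zone */
    b->next[0] = NIL;
}

/* Allocate 2^i frames; returns the index of the first frame, or NIL. */
static size_t buddy_alloc(buddy_t *b, unsigned i)
{
    assert(b != NULL);
    if (i > MAX_ORDER)
        return NIL;                   /* recursion ends: nothing left to split */
    if (b->head[i] != NIL) {
        size_t block = b->head[i];    /* take a block from order list i */
        b->head[i] = b->next[block];
        return block;
    }
    size_t big = buddy_alloc(b, i + 1);   /* get a block twice as large */
    if (big == NIL)
        return NIL;
    size_t half = big + ((size_t) 1 << i);
    b->next[half] = b->head[i];       /* upper half goes to order list i */
    b->head[i] = half;
    return big;                       /* lower half is handed out */
}
```
]]></programlisting>
<para>On a freshly initialized zone, the first order-0 request splits the
single order-4 block four times and leaves one free block in each of the
lists 0 to 3.</para>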
15	bondari	351
34	bondari	352	<formalpara>
		353	<title>Deallocation algorithm</title>
17	jermar	354
34	bondari	355	<para>The allocator checks whether the freed block has a so-called buddy (another free
		356	<mathphrase>2<superscript>i</superscript></mathphrase> frame block
		357	that can be merged with the freed block into a
		358	<mathphrase>2<superscript>i+1</superscript></mathphrase> block).
		359	Technically, the buddy is the odd block for an even block and the even block for an odd block.
		360	In addition, we require that the resulting
		361	block be aligned to its size. This requirement guarantees
		362	natural alignment of the blocks coming out of the allocation
		363	system.</para>
9	bondari	364
34	bondari	365	<para>Using direct pointer arithmetic and the
		366	<varname>frame_t::ref_count</varname> and
		367	<varname>frame_t::buddy_order</varname> variables, the buddy is
		368	found in constant time.</para>
		369	</formalpara>
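<para>The deallocation path can be sketched in the same spirit. The
simplified <varname>frame_t</varname> below keeps only the two members named
above, the order lists are left out for brevity, and the function names, the
value chosen for <constant>BUDDY_INNER_BLOCK</constant> and the 16-frame zone
are assumptions made for this illustration.</para>
<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 4u
#define FRAMES    (1u << MAX_ORDER)
#define BUDDY_INNER_BLOCK 0xffu       /* assumed value, far above MAX_ORDER */

typedef struct {
    unsigned ref_count;               /* 0 means the frame is free */
    unsigned buddy_order;             /* order, or BUDDY_INNER_BLOCK */
} frame_t;

/* Constant-time buddy lookup; returns FRAMES when no free buddy exists. */
static size_t find_buddy(const frame_t *f, size_t idx, unsigned order)
{
    size_t b = idx ^ ((size_t) 1 << order);

    if (b >= FRAMES)
        return FRAMES;
    /* The buddy must be free and must head a block of the same order. */
    if (f[b].ref_count != 0 || f[b].buddy_order != order)
        return FRAMES;
    return b;
}

/* Free the block headed by idx and coalesce it as far as possible. */
static size_t buddy_free(frame_t *f, size_t idx, unsigned *order_out)
{
    unsigned order = f[idx].buddy_order;

    assert(order <= MAX_ORDER);
    for (size_t i = idx; i < idx + ((size_t) 1 << order); i++)
        f[i].ref_count = 0;
    while (order < MAX_ORDER) {
        size_t b = find_buddy(f, idx, order);
        if (b == FRAMES)
            break;                    /* no free buddy: stop coalescing */
        size_t lo = idx < b ? idx : b;
        size_t hi = idx < b ? b : idx;
        f[hi].buddy_order = BUDDY_INNER_BLOCK;  /* no longer a block head */
        f[lo].buddy_order = ++order;            /* head of the merged block */
        idx = lo;
    }
    *order_out = order;
    return idx;                       /* first frame of the resulting block */
}
```
]]></programlisting>
<para>Because inner frames carry <constant>BUDDY_INNER_BLOCK</constant>, they
can never be mistaken for the head of a free block of the requested
order.</para>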
		370	</section>
26	bondari	371	</section>
		372
15	bondari	373	<section id="slab">
11	bondari	374	<title>Slab allocator</title>
9	bondari	375
26	bondari	376	<section>
34	bondari	377	<title>Overview</title>
9	bondari	378
34	bondari	379	<para><termdef><glossterm>Slab</glossterm> represents a contiguous
		380	piece of memory, usually made of several physically contiguous
		381	pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists
		382	of one or more slabs.</termdef></para>
		383
26	bondari	384	<para>The majority of memory allocation requests in the kernel are for
		385	small, frequently used data structures. The slab
34	bondari	386	allocator is well suited to this purpose. The basic idea behind the slab
26	bondari	387	allocator is to have lists of commonly used objects available packed
		388	into pages. This avoids the overhead of allocating and destroying
34	bondari	389	commonly used types of objects such as threads, virtual memory structures
		390	etc. Also, because the allocated size matches exactly, slab allocation
		391	avoids the internal fragmentation issue.</para>
26	bondari	392	</section>
24	bondari	393
26	bondari	394	<section>
34	bondari	395	<title>Implementation</title>
9	bondari	396
26	bondari	397	<para>The SLAB allocator is closely modelled after <ulink
		398	url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
		399	OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams </ulink>
		400	with the following exceptions: <itemizedlist>
		401	<listitem>
		402	empty slabs are deallocated immediately (in Linux they are kept in a linked list, in Solaris ???)
		403	</listitem>
		404
		405	<listitem>
		406	empty magazines are deallocated when not needed (in Solaris they are held in a linked list in the slab cache)
		407	</listitem>
		408	</itemizedlist> The following features are not currently supported but
		409	would be easy to add: <itemizedlist>
		410	<listitem>
		411	cache coloring
		412	</listitem>
		413
		414	<listitem>
34	bondari	415	dynamic magazine growth (different magazine sizes are already supported, but the allocation strategy would need to be adjusted)
26	bondari	416	</listitem>
		417	</itemizedlist></para>
		418
34	bondari	419	<section>
		420	<title>Magazine layer</title>
26	bondari	421
34	bondari	422	<para>On SMP architectures, the global slab locking mechanism
		423	serializes the processing of all slab allocation requests and
		424	becomes a serious bottleneck. To remove it, a new layer was added to the
		425	classic slab allocator design: the slab allocator was extended to
		426	support per-CPU caches, called magazines, to achieve good SMP scaling.
		427	<termdef>The slab SMP performance bottleneck was resolved by introducing
		428	a per-CPU caching scheme called the <glossterm>magazine
		429	layer</glossterm></termdef>.</para>
26	bondari	430
34	bondari	431	<para>A magazine is an N-element cache of objects, so each magazine can
		432	satisfy N allocations. A magazine behaves like an automatic weapon
		433	magazine (LIFO, stack), so allocation and deallocation become simple
		434	push and pop pointer operations. The trick is that a CPU does not access global
		435	slab allocator data when allocating from its magazine, which
		436	makes parallel allocations on different CPUs possible.</para>
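<para>A magazine can be modelled as a small fixed-size stack of object
pointers. The following sketch uses hypothetical names
(<varname>magazine_t</varname>, <varname>mag_push</varname>,
<varname>mag_pop</varname>) and an arbitrary capacity of four objects.</para>
<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define MAG_SIZE 4u                   /* N, chosen arbitrarily for the sketch */

typedef struct {
    size_t busy;                      /* number of cached objects */
    void *objs[MAG_SIZE];             /* LIFO stack of object pointers */
} magazine_t;

/* Deallocation path: returns 0 on success, -1 when the magazine is full. */
static int mag_push(magazine_t *m, void *obj)
{
    assert(m != NULL);
    if (m->busy == MAG_SIZE)
        return -1;
    m->objs[m->busy++] = obj;
    return 0;
}

/* Allocation path: returns the most recently pushed object, or NULL. */
static void *mag_pop(magazine_t *m)
{
    assert(m != NULL);
    if (m->busy == 0)
        return NULL;
    return m->objs[--m->busy];
}
```
]]></programlisting>
<para>Neither operation touches any shared state, so a CPU that owns the
magazine needs no global lock on this path.</para>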
26	bondari	437
34	bondari	438	<para>The implementation also requires another feature: the
		439	CPU-bound magazine is actually a pair of magazines, which avoids
		440	thrashing when single items are repeatedly allocated and deallocated at the
		441	magazine size boundary. LIFO order is enforced, which should avoid
		442	fragmentation as much as possible.</para>
26	bondari	443
34	bondari	444	<para>Another important entity of the magazine layer is the full magazine
		445	depot, which stores full magazines that any of the CPU
		446	magazine caches can use to reload its active CPU magazine. The magazine depot could be
		447	pre-filled with full magazines during initialization, but in the current
		448	implementation it is filled during object deallocation, when a CPU
		449	magazine becomes full.</para>
26	bondari	450
34	bondari	451	<para>Slab allocator control structures are allocated from special
		452	slabs that are marked by a special flag indicating that they should
		453	not be used with the slab magazine layer. This is done to avoid possible
		454	infinite recursion and deadlock during conventional slab allocation
		455	requests.</para>
		456	</section>
26	bondari	457
34	bondari	458	<section>
		459	<title>Allocation/deallocation</title>
26	bondari	460
34	bondari	461	<para>Every cache contains a list of full slabs and a list of partially
		462	full slabs. Empty slabs are freed immediately (thrashing is
		463	avoided thanks to the magazines).</para>
26	bondari	464
34	bondari	465	<para>The slab allocator allocates a lot of space and does not free
		466	it. When the frame allocator fails to allocate a frame, it calls
		467	slab_reclaim(), which tries a 'light reclaim' first and then a 'brutal reclaim'.
		468	The light reclaim releases slabs from the CPU-shared magazine list
		469	until at least one slab is deallocated in each cache (this algorithm
		470	should probably change). The brutal reclaim removes all cached
		471	objects, even from CPU-bound magazines.</para>
		472
		473	<formalpara>
		474	<title>Allocation</title>
		475
		476	<para><emphasis>Step 1.</emphasis> On an allocation
		477	request, the slab allocator first checks whether an object is available
		478	in the local CPU-bound magazine. If so, it simply pops
		479	the CPU magazine and returns the pointer to the object.</para>
		480
		481	<para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
		482	empty, the allocator attempts to reload it by swapping it with the
		483	second CPU magazine and returns to the first step.</para>
		484
		485	<para><emphasis>Step 3.</emphasis> If
		486	both CPU-bound magazines are empty, the allocator
		487	accesses the shared full-magazine depot to reload the CPU-bound magazines.
		488	If the reload is successful (meaning there are full magazines in the depot),
		489	the algorithm continues at Step 1.</para>
		490
		491	<para><emphasis>Step 4.</emphasis> Final step of the allocation.
		492	In this step the object is allocated from the conventional slab layer
		493	and a pointer to it is returned.</para>
		494	</formalpara>
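<para>The four allocation steps can be sketched as a single loop. This is a
simplified, single-threaded model without any locking. All names
(<varname>cpu_cache_t</varname>, <varname>depot_t</varname>,
<varname>cache_alloc</varname>, <varname>slab_obj_alloc</varname>) are
hypothetical, and the conventional slab layer is replaced by a tiny static
pool so that the example stays self-contained.</para>
<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define MAG_SIZE 2u

typedef struct magazine {
    size_t busy;                      /* number of cached objects */
    void *objs[MAG_SIZE];
    struct magazine *next;            /* link inside the depot */
} magazine_t;

typedef struct { magazine_t *current, *last; } cpu_cache_t;  /* magazine pair */
typedef struct { magazine_t *full; } depot_t;         /* list of full magazines */

/* Stand-in for the conventional slab layer (hypothetical static pool). */
static char pool[8][16];
static size_t pool_used;

static void *slab_obj_alloc(void)
{
    return pool_used < 8 ? (void *) pool[pool_used++] : NULL;
}

static void *cache_alloc(cpu_cache_t *cpu, depot_t *depot)
{
    assert(cpu != NULL && depot != NULL);
    for (;;) {
        /* Step 1: pop from the active CPU-bound magazine. */
        if (cpu->current != NULL && cpu->current->busy > 0)
            return cpu->current->objs[--cpu->current->busy];
        /* Step 2: swap with the second magazine if it holds objects. */
        if (cpu->last != NULL && cpu->last->busy > 0) {
            magazine_t *tmp = cpu->current;
            cpu->current = cpu->last;
            cpu->last = tmp;
            continue;
        }
        /* Step 3: both are empty, reload from the full-magazine depot. */
        if (depot->full != NULL) {
            magazine_t *m = depot->full;
            depot->full = m->next;
            cpu->last = cpu->current; /* the old 'last' is simply dropped here */
            cpu->current = m;
            continue;
        }
        /* Step 4: fall back to the conventional slab layer. */
        return slab_obj_alloc();
    }
}
```
]]></programlisting>
<para>In this sketch the magazines are owned by the caller, so the empty
magazine that leaves the pair in Step 3 is merely forgotten; a real allocator
would have to release it.</para>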
		495
		496	<formalpara>
		497	<title>Deallocation</title>
		498
		499	<para><emphasis>Step 1.</emphasis> On a deallocation request,
		500	the slab allocator checks whether the local CPU-bound magazine is
		501	full. If it is not, the pointer is simply pushed to this
		502	magazine.</para>
		503
		504	<para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
		505	full, the allocator attempts to reload it by swapping it with the
		506	second CPU magazine and returns to the first step.</para>
		507
		508	<para><emphasis>Step 3.</emphasis> If
		509	both CPU-bound magazines are full, the allocator
		510	accesses the shared full-magazine depot, puts one of the magazines into
		511	the depot and creates a new empty magazine. The algorithm continues at
		512	Step 1.</para>
		513	</formalpara>
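<para>The deallocation steps mirror the allocation path. The sketch below is
again a single-threaded model with hypothetical names; empty magazines come
from a small static pool (<varname>mag_new</varname>), and the choice to move
the second magazine of the pair into the depot is an assumption, since the
text only says that one of the magazines is moved.</para>
<programlisting language="c"><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

#define MAG_SIZE 2u

typedef struct magazine {
    size_t busy;                      /* number of cached objects */
    void *objs[MAG_SIZE];
    struct magazine *next;            /* link inside the depot */
} magazine_t;

typedef struct { magazine_t *current, *last; } cpu_cache_t;  /* magazine pair */
typedef struct { magazine_t *full; } depot_t;         /* list of full magazines */

/* Hypothetical source of empty magazines: a tiny static pool. */
static magazine_t spare[4];
static size_t spare_used;

static magazine_t *mag_new(void)
{
    if (spare_used == 4)
        return NULL;
    magazine_t *m = &spare[spare_used++];
    m->busy = 0;
    m->next = NULL;
    return m;
}

/* Returns 0 when obj was cached, -1 when it must go back to the slab layer. */
static int cache_free(cpu_cache_t *cpu, depot_t *depot, void *obj)
{
    assert(cpu->current != NULL && cpu->last != NULL && depot != NULL);
    for (;;) {
        /* Step 1: push to the active magazine if it has room. */
        if (cpu->current->busy < MAG_SIZE) {
            cpu->current->objs[cpu->current->busy++] = obj;
            return 0;
        }
        /* Step 2: swap with the second magazine if that one has room. */
        if (cpu->last->busy < MAG_SIZE) {
            magazine_t *tmp = cpu->current;
            cpu->current = cpu->last;
            cpu->last = tmp;
            continue;
        }
        /* Step 3: both are full, so one goes to the depot. */
        magazine_t *fresh = mag_new();
        if (fresh == NULL)
            return -1;
        cpu->last->next = depot->full;
        depot->full = cpu->last;
        cpu->last = cpu->current;
        cpu->current = fresh;         /* continue at Step 1 */
    }
}
```
]]></programlisting>
<para>With two-element magazines, the fifth consecutive deallocation is the
first one that reaches Step 3 and puts a full magazine into the depot.</para>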
		514	</section>
26	bondari	515	</section>
15	bondari	516	</section>
26	bondari	517
		518	<!-- End of Physmem -->
		519	</section>
		520
		521	<section>
		522	<title>Memory sharing</title>
		523
		524	<para>Not implemented yet(?)</para>
		525	</section>
11	bondari	526	</chapter>

(root)/design/trunk/src/ch_memory_management.xml @ 185 – Rev 38