WebSVN – HelenOS-doc – Blame – /design/trunk/src/ch_memory_management.xml

Rev	Author	Line No.	Line
9	bondari	1	<?xml version="1.0" encoding="UTF-8"?>
11	bondari	2	<chapter id="mm">
		3	<?dbhtml filename="mm.html"?>
9	bondari	4
11	bondari	5	<title>Memory management</title>
9	bondari	6
26	bondari	7	<section>
		8	<!-- VM -->
24	bondari	9
11	bondari	10	<title>Virtual memory management</title>
9	bondari	11
		12	<section>
11	bondari	13	<title>Address spaces</title>
9	bondari	14
		15	<para></para>
		16	</section>
		17
		18	<section>
11	bondari	19	<title>Virtual address translation</title>
9	bondari	20
		21	<para></para>
		22	</section>
26	bondari	23	</section>
9	bondari	24
26	bondari	25	<!-- End of VM -->
24	bondari	26
26	bondari	27	<section>
		28	<!-- Phys mem -->
		29
11	bondari	30	<title>Physical memory management</title>
9	bondari	31
24	bondari	32	<section id="zones_and_frames">
		33	<title>Zones and frames</title>
		34
		35	<para>
26	bondari	36	<!--graphic fileref="images/mm2.png" /-->
24	bondari	37
26	bondari	38	<!--graphic fileref="images/buddy_alloc.svg" format="SVG" /-->
		39	<mediaobject
24	bondari	40
26	bondari	41
		42	</para>
		43
		44	<para>On some architectures, not all of physical memory is available for
		45	conventional use. This limitation requires the kernel to maintain a
		46	table of available and unavailable ranges of physical memory addresses.
		47	The main idea of zones is to create a memory zone entity, which is a
		48	contiguous chunk of memory available for allocation. If a chunk is
		49	not available, we simply do not put it in any zone.</para>
		50
		51	<para>A zone also serves informational purposes: it records the
		52	number of free and busy frames. Physical memory allocation is always
		53	done within a particular zone. Frames of a zone are allocated by the
		54	<link linkend="frame_allocator">frame
		55	allocator</link> associated with the zone.</para>
		56
		57	<para>Some architectures (mips32, ppc32) have only one zone, which
		58	covers the whole physical memory, while others (like ia32) may have
		59	multiple zones. Information about the zones on the current machine is
		60	stored in BIOS hardware tables or can be hardcoded into the kernel at
		61	compile time.</para>
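<para>The idea of building zones only from available memory can be
sketched as follows. This is a minimal illustration, not the HelenOS code:
the types <varname>mem_range_t</varname> and <varname>zone_t</varname>, their
fields and the function <varname>zones_from_map</varname> are all
hypothetical names introduced for this example.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not the HelenOS definitions. */
typedef struct {
	uintptr_t base;    /* physical address of the first frame in the range */
	size_t frames;     /* length of the range in frames */
	bool available;    /* false for ranges the kernel must not allocate from */
} mem_range_t;

typedef struct {
	uintptr_t base;    /* physical address of the first frame in the zone */
	size_t frames;     /* total number of frames in the zone */
	size_t free_count; /* frames currently free */
	size_t busy_count; /* frames currently allocated */
} zone_t;

/*
 * Create one zone per available range and skip unavailable ranges.
 * Returns the number of zones written to zones[].
 */
static size_t zones_from_map(const mem_range_t *map, size_t ranges,
    zone_t *zones, size_t max_zones)
{
	size_t count = 0;

	for (size_t i = 0; i < ranges && count < max_zones; i++) {
		if (!map[i].available)
			continue;   /* unavailable memory is put in no zone */

		zones[count].base = map[i].base;
		zones[count].frames = map[i].frames;
		zones[count].free_count = map[i].frames;
		zones[count].busy_count = 0;
		count++;
	}

	return count;
}
```

<para>With a map of two available ranges separated by an unavailable
hole, the sketch produces two zones, and the hole belongs to neither of
them.</para>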
24	bondari	62	</section>
		63
		64	<section id="frame_allocator">
		65	<title>Frame allocator</title>
		66
26	bondari	67	<formalpara>
		68	<title>Overview</title>
24	bondari	69
26	bondari	70	<para>The frame allocator provides physical memory allocation for the
		71	kernel. Because physical memory is organized into zones, the frame
		72	allocator always works in the context of a single zone, so it is
		73	impossible to allocate a block of memory that spans several zones.
		74	This is not a limitation in practice, because two adjacent zones can be
		75	merged into one. The frame allocator is also responsible for updating
		76	the number of free and busy frames in the zone. Physical memory
		77	allocation inside one <link linkend="zones_and_frames">memory
		78	zone</link> is handled by an instance of the <link
		79	linkend="buddy_allocator">buddy allocator</link> tailored to allocate
		80	blocks of physical memory frames.</para>
		81	</formalpara>
24	bondari	82
26	bondari	83	<formalpara>
		84	<title>Allocation / deallocation</title>
24	bondari	85
26	bondari	86	<para>Upon an allocation request, the frame allocator tries to find the
		87	first zone that can satisfy the request (i.e. has the required number
		88	of free frames). During deallocation, the frame allocator needs to
		89	find the zone that contains the deallocated frame. This approach
		90	raises two potential problems: <itemizedlist>
		91	<listitem>
		92	A linear search of zones does no good to performance, but the number of zones is not expected to be high. If it turns out to be, the list of zones can be replaced with a more time-efficient B-tree.
		93	</listitem>
24	bondari	94
26	bondari	95	<listitem>
		96	The allocator must quickly find out whether a zone contains the required number of frames and whether that chunk of memory is properly aligned. This issue is solved by the buddy allocator.
		97	</listitem>
		98	</itemizedlist></para>
		99	</formalpara>
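<para>The two linear searches described above can be sketched as follows.
This is only an illustration under assumed names: the
<varname>zone_t</varname> fields and both function names are hypothetical.
Note that having enough free frames is only a necessary condition; whether
a suitably aligned block exists is decided by the buddy allocator of the
chosen zone.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zone descriptor; field names are assumptions. */
typedef struct {
	size_t first_frame; /* index of the first frame in the zone */
	size_t frames;      /* number of frames in the zone */
	size_t free_count;  /* number of free frames */
} zone_t;

/* Allocation: first zone with enough free frames, or NULL. */
static zone_t *zone_find_for_alloc(zone_t *zones, size_t n, size_t wanted)
{
	for (size_t i = 0; i < n; i++) {
		if (zones[i].free_count >= wanted)
			return &zones[i];
	}
	return NULL;
}

/* Deallocation: zone whose frame range contains the given frame, or NULL. */
static zone_t *zone_find_by_frame(zone_t *zones, size_t n, size_t frame)
{
	for (size_t i = 0; i < n; i++) {
		if (frame >= zones[i].first_frame &&
		    frame - zones[i].first_frame < zones[i].frames)
			return &zones[i];
	}
	return NULL;
}
```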
		100	</section>
		101	</section>
17	jermar	102
26	bondari	103	<section id="buddy_allocator">
		104	<title>Buddy allocator</title>
17	jermar	105
26	bondari	106	<section>
		107	<title>Overview</title>
17	jermar	108
26	bondari	109	<para>In the buddy allocator, memory is broken down into power-of-two sized
		110	naturally aligned blocks. These blocks are organized in an array of
		111	lists in which the list with index i contains all unallocated blocks of
		112	size <mathphrase>2<superscript>i</superscript></mathphrase>. The index i
		113	is called the order of the block. Should there be two adjacent equally sized
		114	blocks in list <mathphrase>i</mathphrase> (i.e. buddies), the buddy
		115	allocator would coalesce them and put the resulting block in list
		116	<mathphrase>i + 1</mathphrase>, provided that the resulting block would
		117	be naturally aligned. Similarly, when the allocator is asked to
		118	allocate a block of size
		119	<mathphrase>2<superscript>i</superscript></mathphrase>, it first tries
		120	to satisfy the request from the list with index i. If the request cannot be
		121	satisfied (i.e. the list i is empty), the buddy allocator will try to
		122	allocate and split a larger block from the list with index i + 1. Both of
		123	these algorithms are recursive. The recursion ends either when there are
		124	no blocks to coalesce in the former case or when there are no blocks
		125	that can be split in the latter case.</para>
17	jermar	126
26	bondari	127	<graphic fileref="images/mm1.png" format="EPS" />
17	jermar	128
26	bondari	129	<para>This approach greatly reduces external fragmentation of memory and
		130	helps in allocating bigger contiguous blocks of memory aligned to their
		131	size. On the other hand, the buddy allocator suffers from increased internal
		132	fragmentation of memory and is not suitable for general kernel
		133	allocations. This purpose is better addressed by the <link
		134	linkend="slab">slab allocator</link>.</para>
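<para>The source of the internal fragmentation can be illustrated by
computing the order needed for a request. The helper names
<varname>order_for</varname> and <varname>wasted_frames</varname> are
hypothetical, and the sketch assumes the request is small enough that the
rounded-up size still fits in <varname>size_t</varname>.</para>

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order i such that 2^i >= frames (frames must be nonzero). */
static unsigned order_for(size_t frames)
{
	unsigned order = 0;

	assert(frames > 0);
	while (((size_t) 1 << order) < frames)
		order++;

	return order;
}

/* Frames lost to rounding the request up to a power of two. */
static size_t wasted_frames(size_t frames)
{
	return ((size_t) 1 << order_for(frames)) - frames;
}
```

<para>For example, a request for 5 frames is served by a block of order 3
(8 frames), so 3 frames are wasted, whereas a request for exactly 8 frames
wastes none.</para>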
		135	</section>
17	jermar	136
26	bondari	137	<section>
		138	<title>Implementation</title>
17	jermar	139
26	bondari	140	<para>The buddy allocator is, in fact, an abstract framework which can be
		141	easily specialized to serve one particular task. It knows nothing about
		142	the nature of the memory it helps to allocate. To make up for this lack
		143	of knowledge, the buddy allocator exports an interface that each of
		144	its clients is required to implement. When supplied an implementation of
		145	this interface, the buddy allocator can use specialized external
		146	functions to find the buddy of a block, split and coalesce blocks,
		147	manipulate block order and mark blocks busy or available. For precise
		148	documentation of this interface, refer to the <link linkend="???">HelenOS
		149	Generic Kernel Reference Manual</link>.</para>
17	jermar	150
26	bondari	151	<formalpara>
		152	<title>Data organization</title>
17	jermar	153
26	bondari	154	<para>Each entity allocatable by the buddy allocator is required to
		155	contain space for storing the block order number and a link variable used
		156	to interconnect blocks within the same order.</para>
15	bondari	157
26	bondari	158	<para>Whatever entities are allocated by the buddy allocator, the
		159	first entity within a block is used to represent the entire block. The
		160	first entity keeps the order of the whole block. Other entities within
		161	the block are assigned the magic value
		162	<constant>BUDDY_INNER_BLOCK</constant>. This is especially important
		163	for efficient identification of buddies in a one-dimensional array
		164	because the entity that represents a potential buddy cannot be
		165	associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is
		166	associated with <constant>BUDDY_INNER_BLOCK</constant> then it is not
		167	a buddy).</para>
		168	</formalpara>
15	bondari	169
26	bondari	170	<formalpara>
		171	<title>Data organization</title>
15	bondari	172
26	bondari	173	<para>The buddy allocator always uses the first frame to represent a frame
		174	block. This frame contains the <varname>buddy_order</varname> variable, which
		175	records the size of the block it represents (a block of
		176	<mathphrase>2<superscript>buddy_order</superscript></mathphrase>
		177	frames). Other frames in the block have this value set to the magic
		178	<constant>BUDDY_INNER_BLOCK</constant>, which is much greater than the buddy
		179	<varname>max_order</varname> value.</para>
15	bondari	180
26	bondari	181	<para>Each <varname>frame_t</varname> also contains a pointer member that
		182	holds the frame structure in the linked list of its order.</para>
		183	</formalpara>
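<para>The frame layout described above can be sketched as follows. The
structure below is an assumption made for illustration: only the names
<varname>frame_t</varname>, <varname>buddy_order</varname>,
<varname>ref_count</varname> and <constant>BUDDY_INNER_BLOCK</constant>
come from this chapter, while the member types, the value of the constant
and the helper functions are hypothetical.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value; it only has to be greater than any valid order. */
#define BUDDY_INNER_BLOCK 0xff

/* Hypothetical layout of a frame structure. */
typedef struct frame {
	struct frame *next;   /* link within the list of one order */
	size_t ref_count;     /* zero when the frame is free */
	uint8_t buddy_order;  /* order of the block, or BUDDY_INNER_BLOCK */
} frame_t;

/* Make first[0] represent a block of 2^order frames. */
static void block_init(frame_t *first, uint8_t order)
{
	first->buddy_order = order;
	for (size_t i = 1; i < ((size_t) 1 << order); i++)
		first[i].buddy_order = BUDDY_INNER_BLOCK;
}

/* Only the first frame of a block can be a buddy candidate. */
static bool is_block_head(const frame_t *frame)
{
	return frame->buddy_order != BUDDY_INNER_BLOCK;
}
```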
15	bondari	184
26	bondari	185	<formalpara>
		186	<title>Allocation algorithm</title>
15	bondari	187
26	bondari	188	<para>Upon a request to allocate a block of <mathphrase>2<superscript>i</superscript></mathphrase>
		189	frames, the allocator checks whether there are any
		190	blocks available in the order list <varname>i</varname>. If so, it
		191	removes a block from the order list and returns its address. If not, it
		192	recursively allocates a
		193	<mathphrase>2<superscript>i+1</superscript></mathphrase> frame block and
		194	splits it into two
		195	<mathphrase>2<superscript>i</superscript></mathphrase> frame blocks.
		196	It then adds one of the blocks to the <varname>i</varname> order list and
		197	returns the address of the other.</para>
		198	</formalpara>
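<para>The recursive allocation described above can be sketched on frame
indices instead of addresses. Everything in this example is a simplifying
assumption: a single zone of 16 frames, free lists kept in plain arrays,
and the names <varname>buddy_init</varname>, <varname>buddy_alloc</varname>
and <constant>NO_BLOCK</constant>. The sketch returns the lower half of a
split block and keeps the upper half free; the text does not say which
half the real allocator returns.</para>

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ORDER 4                 /* hypothetical zone of 2^4 frames */
#define NO_BLOCK  ((size_t) -1)     /* "no block available" marker */

/* free_head[i] is the first free block of order i. */
static size_t free_head[MAX_ORDER + 1];
/* next_free[b] links block b to the next free block of the same order. */
static size_t next_free[1u << MAX_ORDER];

/* Start with one free block that covers the whole zone. */
static void buddy_init(void)
{
	for (unsigned i = 0; i <= MAX_ORDER; i++)
		free_head[i] = NO_BLOCK;
	next_free[0] = NO_BLOCK;
	free_head[MAX_ORDER] = 0;
}

/* Put a block at the head of the free list of the given order. */
static void list_push(unsigned order, size_t block)
{
	next_free[block] = free_head[order];
	free_head[order] = block;
}

/* Allocate a block of 2^order frames; returns its first frame index. */
static size_t buddy_alloc(unsigned order)
{
	if (order > MAX_ORDER)
		return NO_BLOCK;            /* nothing larger left to split */

	if (free_head[order] != NO_BLOCK) {
		size_t block = free_head[order];
		free_head[order] = next_free[block];
		return block;
	}

	/* List is empty: allocate a block one order larger and split it. */
	size_t big = buddy_alloc(order + 1);
	if (big == NO_BLOCK)
		return NO_BLOCK;

	list_push(order, big + ((size_t) 1 << order)); /* upper half stays free */
	return big;                                    /* lower half is returned */
}
```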
17	jermar	199
26	bondari	200	<formalpara>
		201	<title>Deallocation algorithm</title>
9	bondari	202
26	bondari	203	<para>Upon deallocation, the allocator checks whether the freed block has a so-called buddy (another free
		204	<mathphrase>2<superscript>i</superscript></mathphrase> frame block
		205	that can be merged with the freed block into a
		206	<mathphrase>2<superscript>i+1</superscript></mathphrase> block).
		207	Technically, the buddy of an even block is an odd block and vice
		208	versa. In addition, we require that the resulting
		209	block be aligned to its size. This requirement guarantees natural
		210	block alignment for the blocks coming out of the allocation
		211	system.</para>
24	bondari	212
26	bondari	213	<para>Using direct pointer arithmetic together with the
		214	<varname>frame_t::ref_count</varname> and
		215	<varname>frame_t::buddy_order</varname> variables, a buddy is
		216	found in constant time.</para>
		217	</formalpara>
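<para>A common way to find the buddy in constant time is to flip one bit
of the block's frame index, which is sketched below together with the
coalescing loop. This is an illustration under assumptions, not the
HelenOS code: it works on frame indices in a single 16-frame zone, uses a
hypothetical <varname>is_free</varname> array where the text refers to
<varname>ref_count</varname>, and the function names are invented.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ORDER 4                  /* hypothetical zone of 2^4 frames */
#define NFRAMES   (1u << MAX_ORDER)
#define BUDDY_INNER_BLOCK 0xff       /* assumed magic value */

/* Order for the first frame of a block, BUDDY_INNER_BLOCK for the rest. */
static unsigned char order_of[NFRAMES];
/* Stand-in for ref_count == 0; true only for heads of free blocks. */
static bool is_free[NFRAMES];

/* The buddy of a naturally aligned block differs in exactly one index bit. */
static size_t buddy_of(size_t block, unsigned order)
{
	return block ^ ((size_t) 1 << order);
}

/* Free a block, coalesce while possible, return the resulting block head. */
static size_t buddy_free(size_t block)
{
	unsigned order = order_of[block];

	while (order < MAX_ORDER) {
		size_t buddy = buddy_of(block, order);

		/* The buddy must be a free block head of the same order. */
		if (!is_free[buddy] || order_of[buddy] != order)
			break;

		is_free[buddy] = false;
		size_t head = block < buddy ? block : buddy;
		size_t tail = block < buddy ? buddy : block;

		order_of[tail] = BUDDY_INNER_BLOCK; /* tail becomes an inner frame */
		order++;
		order_of[head] = (unsigned char) order;
		block = head;                       /* merged block stays aligned */
	}

	is_free[block] = true;
	return block;
}
```

<para>Because the merged block always starts at the lower of the two
buddies, it is aligned to its new size, which is the natural alignment
requirement stated above.</para>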
		218	</section>
		219
15	bondari	220	<section id="slab">
11	bondari	221	<title>Slab allocator</title>
9	bondari	222
26	bondari	223	<section>
		224	<title>Introduction</title>
9	bondari	225
26	bondari	226	<para>The majority of memory allocation requests in the kernel are for
		227	small, frequently used data structures. The slab allocator is well
		228	suited to this purpose. The basic idea behind a slab
		229	allocator is to keep lists of commonly used objects packed
		230	into pages. This avoids the overhead of allocating and destroying
		231	commonly used types of objects such as inodes, threads, virtual memory
		232	structures etc.</para>
24	bondari	233
26	bondari	234	<para>The locking mechanism of the original slab allocator became a
		235	significant performance bottleneck on SMP architectures. <termdef>The slab
		236	SMP performance bottleneck was resolved by introducing a per-CPU
		237	caching scheme called the <glossterm>magazine
		238	layer</glossterm></termdef>.</para>
		239	</section>
24	bondari	240
26	bondari	241	<section>
		242	<title>Implementation details (needs revision)</title>
9	bondari	243
26	bondari	244	<para>The SLAB allocator is closely modelled after <ulink
		245	url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
		246	OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams </ulink>
		247	with the following exceptions: <itemizedlist>
		248	<listitem>
		249	Empty SLABs are deallocated immediately (in Linux they are kept in a linked list, in Solaris ???).
		250	</listitem>
		251
		252	<listitem>
		253	Empty magazines are deallocated when not needed (in Solaris they are held in a linked list in the slab cache).
		254	</listitem>
		255	</itemizedlist> The following features are not currently supported but
		256	would be easy to add: <itemizedlist>
		257	<listitem>
		258	Cache coloring.
		259	</listitem>
		260
		261	<listitem>
		262	Dynamic magazine growing (different magazine sizes are already supported, but the allocation strategy would need to be adjusted).
		263	</listitem>
		264	</itemizedlist></para>
		265
		266	<para>The SLAB allocator supports per-CPU caches ('magazines') to
		267	facilitate good SMP scaling.</para>
		268
		269	<para>When a new object is allocated, the allocator first checks whether
		270	one is available in the CPU-bound magazine. If not, the object is
		271	allocated from a CPU-shared SLAB: if a partially full SLAB is found, it
		272	is used; otherwise a new one is allocated.</para>
		273
		274	<para>When an object is deallocated, it is put into the CPU-bound
		275	magazine. If there is no such magazine, a new one is allocated (if that
		276	fails, the object is deallocated into the SLAB). If the magazine is full,
		277	it is put into the CPU-shared list of magazines and a new one is
		278	allocated.</para>
		279
		280	<para>The CPU-bound magazine is actually a pair of magazines, which avoids
		281	thrashing when somebody allocates and deallocates one item at the
		282	magazine size boundary. LIFO order is enforced, which should avoid
		283	fragmentation as much as possible.</para>
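<para>The role of the magazine pair can be sketched as follows. This is a
simplified illustration under assumptions, not the HelenOS implementation:
the types <varname>magazine_t</varname> and <varname>cpu_cache_t</varname>,
the magazine size and all function names are hypothetical, locking is
omitted, and the fallbacks to the CPU-shared magazine list and to the slabs
are left to the caller.</para>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAG_SIZE 4   /* hypothetical magazine capacity */

typedef struct {
	size_t busy;            /* number of cached objects */
	void *objs[MAG_SIZE];   /* stack of cached objects (LIFO) */
} magazine_t;

typedef struct {
	magazine_t *current;    /* magazine used first */
	magazine_t *last;       /* second magazine of the pair */
} cpu_cache_t;

static void cache_swap(cpu_cache_t *cache)
{
	magazine_t *tmp = cache->current;
	cache->current = cache->last;
	cache->last = tmp;
}

/* Returns a cached object, or NULL if the caller must use a shared slab. */
static void *cpu_cache_alloc(cpu_cache_t *cache)
{
	if (cache->current && cache->current->busy > 0)
		return cache->current->objs[--cache->current->busy];

	if (cache->last && cache->last->busy > 0) {
		cache_swap(cache);  /* current is empty, last still has objects */
		return cache->current->objs[--cache->current->busy];
	}

	return NULL;
}

/* Returns false if both magazines are full and one must be handed over. */
static bool cpu_cache_free(cpu_cache_t *cache, void *obj)
{
	if (cache->current && cache->current->busy < MAG_SIZE) {
		cache->current->objs[cache->current->busy++] = obj;
		return true;
	}

	if (cache->last && cache->last->busy < MAG_SIZE) {
		cache_swap(cache);  /* current is full, last still has room */
		cache->current->objs[cache->current->busy++] = obj;
		return true;
	}

	return false;
}
```

<para>With a single magazine, alternating allocations and deallocations at
the full or empty boundary would exchange magazines with the shared list on
every operation. With the pair, the sketch only swaps two local
pointers.</para>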
		284
		285	<para>Every cache contains a list of full slabs and a list of partially
		286	full slabs. Empty SLABs are freed immediately (thrashing will be
		287	avoided because of magazines).</para>
		288
		289	<para>The SLAB information structure is kept inside the data area, if
		290	possible. A cache can be marked so that it does not use magazines.
		291	This is used only for SLAB related caches to avoid deadlocks and
		292	infinite recursion (the SLAB allocator uses itself for allocating all
		293	its control structures).</para>
		294
		295	<para>The SLAB allocator allocates lots of space and does not free it.
		296	When the frame allocator fails to allocate a frame, it calls
		297	slab_reclaim(). It tries 'light reclaim' first, then 'brutal reclaim'.
		298	The light reclaim releases slabs from the CPU-shared magazine list until
		299	at least one slab is deallocated in each cache (this algorithm should
		300	probably change). The brutal reclaim removes all cached objects, even
		301	from CPU-bound magazines.</para>
		302
		303	<para>TODO: <itemizedlist>
		304	<listitem>
		305	For better CPU scaling, the magazine allocation strategy should be extended. Currently, if the cache does not have a magazine, it asks the non-CPU-cached magazine cache to provide one. It might be feasible to add a CPU-cached magazine cache (which would allocate its magazines from the non-CPU-cached magazine cache). This would provide a nice per-CPU buffer. The other possibility is to use the per-cache 'empty-magazine-list', which decreases competition for the single per-system magazine cache.
		306	</listitem>
		307
		308	<listitem>
		309	It might be good to extend lock granularity down to the slab level; we could then try_spinlock over all partial slabs and thus improve scalability even at the slab level.
		310	</listitem>
		311	</itemizedlist></para>
		312	</section>
15	bondari	313	</section>
26	bondari	314
		315	<!-- End of Physmem -->
		316	</section>
		317
		318	<section>
		319	<title>Memory sharing</title>
		320
		321	<para>Not implemented yet(?)</para>
		322	</section>
11	bondari	323	</chapter>


(root)/design/trunk/src/ch_memory_management.xml @ 185 – Rev 26