<?xml version="1.0" encoding="UTF-8"?>
<chapter id="mm">
  <?dbhtml filename="mm.html"?>

  <title>Memory management</title>

  <section>
    <!-- VM -->

    <title>Virtual memory management</title>

    <section>
      <title>Address spaces</title>

      <para></para>
    </section>

    <section>
      <title>Virtual address translation</title>

      <para></para>
    </section>
  </section>

  <!-- End of VM -->

  <section>
    <!-- Phys mem -->

    <title>Physical memory management</title>

    <section id="zones_and_frames">
      <title>Zones and frames</title>

      <para>
      <!--graphic fileref="images/mm2.png" /-->
      <!--graphic fileref="images/buddy_alloc.svg" format="SVG" /-->
      </para>

      <para>On some architectures, not all physical memory is available
      for conventional use. This limitation requires the kernel to
      maintain a table of available and unavailable ranges of physical
      memory addresses. The main idea behind zones is to create a memory
      zone entity, that is, a contiguous chunk of memory available for
      allocation. If some chunk is not available, it is simply not put
      into any zone.</para>

      <para>A zone also serves informational purposes, keeping track of
      the number of free and busy frames. Physical memory allocation is
      always done within a certain zone. Allocation of a frame in a zone
      must be handled by the <link linkend="frame_allocator">frame
      allocator</link> associated with the zone.</para>

      <para>Some architectures (mips32, ppc32) have only one zone that
      covers the whole physical memory, while others (like ia32) may
      have multiple zones. Information about the zones on the current
      machine is read from BIOS hardware tables or hardcoded into the
      kernel at compile time.</para>
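<para>The per-zone bookkeeping described above can be sketched as a
small C structure. The names below are hypothetical illustrations, not
the actual HelenOS definitions:</para>

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of per-zone bookkeeping; field names are
 * hypothetical, not the actual HelenOS definitions. */
typedef struct {
    uintptr_t base;        /* physical address of the first frame */
    size_t    frame_count; /* total frames in this contiguous chunk */
    size_t    free_count;  /* frames currently available */
    size_t    busy_count;  /* frames currently allocated */
} zone_sketch_t;

/* A frame address belongs to the zone iff it falls within
 * [base, base + frame_count * frame_size). */
static int zone_contains(const zone_sketch_t *z, uintptr_t addr,
                         size_t frame_size)
{
    return addr >= z->base &&
           addr < z->base + z->frame_count * frame_size;
}
```

<para>An unavailable chunk of physical memory simply has no such
structure describing it, so no allocation can ever land there.</para>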
    </section>

    <section id="frame_allocator">
      <title>Frame allocator</title>

      <formalpara>
        <title>Overview</title>

        <para>The frame allocator provides physical memory allocation
        for the kernel. Because of the zonal organization of physical
        memory, the frame allocator always operates in the context of
        some zone; a single allocation therefore cannot span two
        different zones. When that would be needed, two adjacent zones
        can instead be merged into one. The frame allocator is also
        responsible for updating information about the number of free
        and busy frames in the zone. Physical memory allocation inside
        one <link linkend="zones_and_frames">memory zone</link> is
        handled by an instance of the <link linkend="buddy_allocator">buddy
        allocator</link> tailored to allocate blocks of physical memory
        frames.</para>
      </formalpara>

      <formalpara>
        <title>Allocation / deallocation</title>

        <para>Upon an allocation request, the frame allocator tries to
        find the first zone that can satisfy the request (i.e. has the
        required amount of free frames). During deallocation, the frame
        allocator needs to find the zone that contains the deallocated
        frame. This approach brings up two potential problems:
        <itemizedlist>
            <listitem>
               A linear search of the zones does not help performance,
               but the number of zones is not expected to be high. If it
               is, the list of zones can be replaced with a more
               time-efficient B-tree.
            </listitem>

            <listitem>
               The allocator must quickly find out whether a zone
               contains the required number of free frames and whether
               that chunk of memory is properly aligned. This issue is
               solved by the buddy allocator.
            </listitem>
          </itemizedlist></para>
      </formalpara>
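<para>The first-fit zone search described above can be sketched in a
few lines of C. This is an illustrative model only, not the HelenOS
implementation; <varname>zone_sketch_t</varname> and
<varname>find_zone</varname> are hypothetical names:</para>

```c
#include <stddef.h>

/* Hypothetical model of a zone, reduced to the one field the search
 * needs; not the actual HelenOS zone structure. */
typedef struct {
    size_t free_count; /* frames currently available in this zone */
} zone_sketch_t;

/* Linear first-fit search: return the index of the first zone with
 * enough free frames, or -1 if no zone can satisfy the request. */
static int find_zone(const zone_sketch_t *zones, size_t nzones,
                     size_t frames_wanted)
{
    for (size_t i = 0; i < nzones; i++) {
        if (zones[i].free_count >= frames_wanted)
            return (int) i;
    }
    return -1;
}
```

<para>As the text notes, the linear scan is acceptable only because the
number of zones stays small; a B-tree keyed by base address would make
both the allocation search and the deallocation lookup logarithmic.</para>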
    </section>
  </section>

  <section id="buddy_allocator">
    <title>Buddy allocator</title>

    <section>
      <title>Overview</title>

      <para>In the buddy allocator, memory is broken down into
      power-of-two sized, naturally aligned blocks. These blocks are
      organized in an array of lists, in which the list with index i
      contains all unallocated blocks of size
      <mathphrase>2<superscript>i</superscript></mathphrase>. The index
      i is called the order of the block. Should there be two adjacent
      equally sized blocks in list <mathphrase>i</mathphrase> (i.e.
      buddies), the buddy allocator would coalesce them and put the
      resulting block in list <mathphrase>i + 1</mathphrase>, provided
      that the resulting block would be naturally aligned. Similarly,
      when the allocator is asked to allocate a block of size
      <mathphrase>2<superscript>i</superscript></mathphrase>, it first
      tries to satisfy the request from the list with index i. If the
      request cannot be satisfied (i.e. the list i is empty), the buddy
      allocator will try to allocate and split a larger block from the
      list with index i + 1. Both of these algorithms are recursive. The
      recursion ends either when there are no blocks to coalesce in the
      former case or when there are no blocks that can be split in the
      latter case.</para>
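<para>For naturally aligned power-of-two blocks, the buddy of a block
and the coalesced parent block can be computed with simple bit
arithmetic on frame indices. The following C sketch is illustrative
only; the function names are hypothetical and do not appear in
HelenOS:</para>

```c
#include <stdint.h>

/* For a naturally aligned block of order `order` starting at frame
 * index `idx`, the buddy starts at the index with bit `order`
 * flipped: the two halves of the parent block differ only there. */
static uintptr_t buddy_index(uintptr_t idx, unsigned order)
{
    return idx ^ ((uintptr_t) 1 << order);
}

/* Coalescing two buddies of order `order` yields the naturally
 * aligned block of order `order + 1`; its index is the smaller of
 * the two, i.e. the index with bit `order` cleared. */
static uintptr_t parent_index(uintptr_t idx, unsigned order)
{
    return idx & ~((uintptr_t) 1 << order);
}
```

<para>Natural alignment is what makes this work: because every block of
order i starts at a multiple of 2<superscript>i</superscript>, flipping
a single bit is guaranteed to land on the neighbouring block of the
same size.</para>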

      <!--graphic fileref="images/mm1.png" format="EPS" /-->

      <para>This approach greatly reduces external fragmentation of
      memory and helps in allocating bigger contiguous blocks of memory
      aligned to their size. On the other hand, the buddy allocator
      suffers from increased internal fragmentation of memory and is not
      suitable for general kernel allocations. This purpose is better
      addressed by the <link linkend="slab">slab allocator</link>.</para>
    </section>

    <section>
      <title>Implementation</title>

      <para>The buddy allocator is, in fact, an abstract framework which
      can be easily specialized to serve one particular task. It knows
      nothing about the nature of the memory it helps to allocate. To
      overcome this lack of knowledge, the buddy allocator exports an
      interface that each of its clients is required to implement. When
      supplied with an implementation of this interface, the buddy
      allocator can use specialized external functions to find the buddy
      of a block, split and coalesce blocks, manipulate block order and
      mark blocks busy or available. For precise documentation of this
      interface, refer to the <link linkend="???">HelenOS Generic Kernel
      Reference Manual</link>.</para>

      <formalpara>
        <title>Data organization</title>

        <para>Each entity allocable by the buddy allocator is required
        to contain space for storing the block order number and a link
        variable used to interconnect blocks within the same
        order.</para>

        <para>Whatever entities are allocated by the buddy allocator,
        the first entity within a block is used to represent the entire
        block. The first entity keeps the order of the whole block.
        Other entities within the block are assigned the magic value
        <constant>BUDDY_INNER_BLOCK</constant>. This is especially
        important for effective identification of buddies in a
        one-dimensional array, because the entity that represents a
        potential buddy cannot be associated with
        <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it is associated
        with <constant>BUDDY_INNER_BLOCK</constant> then it is not a
        buddy).</para>
      </formalpara>

      <formalpara>
        <title>Data organization in the frame allocator</title>

        <para>The buddy allocator always uses the first frame of a block
        to represent the whole block. This frame contains the
        <varname>buddy_order</varname> variable, which records the size
        of the block it represents (a block of
        <mathphrase>2<superscript>buddy_order</superscript></mathphrase>
        frames). The other frames in the block have this value set to
        the magic constant <constant>BUDDY_INNER_BLOCK</constant>, which
        is much greater than the buddy <varname>max_order</varname>
        value.</para>

        <para>Each <varname>frame_t</varname> also contains a pointer
        member to hold the frame structure in the linked list of its
        order.</para>
      </formalpara>
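<para>A minimal sketch of this per-frame metadata, assuming only the
two members discussed above (the real HelenOS
<varname>frame_t</varname> has more fields, and the magic value shown
is hypothetical):</para>

```c
#include <stdint.h>

/* Hypothetical magic value: marks a frame that lies inside a larger
 * block; any value above the maximum order would do. */
#define BUDDY_INNER_BLOCK 0xFF

/* Illustrative per-frame metadata; not the actual HelenOS frame_t. */
typedef struct frame_sketch {
    uint8_t buddy_order;             /* order of the block this frame
                                      * heads, or BUDDY_INNER_BLOCK */
    struct frame_sketch *buddy_link; /* next free block of same order */
} frame_sketch_t;

/* Only the first frame of a block can be a buddy candidate. */
static int is_block_head(const frame_sketch_t *f)
{
    return f->buddy_order != BUDDY_INNER_BLOCK;
}
```

<para>This is why buddy identification in the one-dimensional frame
array is cheap: a frame marked <constant>BUDDY_INNER_BLOCK</constant>
can be rejected immediately without inspecting anything else.</para>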

      <formalpara>
        <title>Allocation algorithm</title>

        <para>Upon a request to allocate a block of
        <mathphrase>2<superscript>i</superscript></mathphrase> frames,
        the allocator checks whether any blocks are available in the
        order list <varname>i</varname>. If so, it removes a block from
        the list and returns its address. If not, it recursively
        allocates a block of
        <mathphrase>2<superscript>i+1</superscript></mathphrase> frames
        and splits it into two
        <mathphrase>2<superscript>i</superscript></mathphrase> frame
        blocks. It then adds one of the blocks to the order list
        <varname>i</varname> and returns the address of the
        other.</para>
      </formalpara>
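<para>The recursion above can be sketched in C. To keep the sketch
self-contained, the order lists are modelled here as plain counters of
free blocks per order; a real allocator keeps linked lists of block
addresses. All names are illustrative, not HelenOS code:</para>

```c
#include <stddef.h>

#define MAX_ORDER 8 /* hypothetical upper bound on block order */

/* free_blocks[i] counts free blocks of order i. Returns 0 on success,
 * -1 when no block of sufficient size exists. */
static int alloc_block(size_t free_blocks[MAX_ORDER + 1], unsigned order)
{
    if (order > MAX_ORDER)
        return -1;                 /* recursion ran out of orders */
    if (free_blocks[order] > 0) {
        free_blocks[order]--;      /* take a block from list `order` */
        return 0;
    }
    /* List `order` is empty: recursively obtain a block of order + 1
     * and split it into two halves of order `order`. */
    if (alloc_block(free_blocks, order + 1) != 0)
        return -1;                 /* nothing left to split */
    free_blocks[order]++;          /* one half goes back on the list */
    return 0;                      /* the other half is the result */
}
```

<para>The recursion terminates exactly as the text describes: either a
list is non-empty at some order, or the top order is exceeded and there
is no block left to split.</para>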

      <formalpara>
        <title>Deallocation algorithm</title>

        <para>The allocator checks whether the freed block has a
        so-called buddy (another free
        <mathphrase>2<superscript>i</superscript></mathphrase> frame
        block that can be linked with the freed block into a
        <mathphrase>2<superscript>i+1</superscript></mathphrase> frame
        block). Technically, the buddy is the odd block for an even
        block and vice versa. In addition, we impose the requirement
        that the resulting block must be aligned to its size. This
        requirement guarantees natural block alignment for the blocks
        coming out of the allocation system.</para>

        <para>Using direct pointer arithmetic and the
        <varname>frame_t::ref_count</varname> and
        <varname>frame_t::buddy_order</varname> variables, the buddy is
        found in constant time.</para>
      </formalpara>
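<para>A toy model of the deallocation path, coalescing a freed block
with its buddy as long as the buddy is also free and of the same order.
The arrays stand in for the per-frame metadata; sizes and names are
illustrative, not the HelenOS implementation:</para>

```c
#define FRAMES 16 /* hypothetical toy zone of 16 frames */

/* free_map[idx] is nonzero when a free block of order order_map[idx]
 * starts at frame index idx. Globals start zeroed (all busy). */
static unsigned char free_map[FRAMES];
static unsigned char order_map[FRAMES];

/* Free the block of the given order starting at idx, coalescing
 * upward while the buddy is free and of matching order. */
static void free_block(unsigned idx, unsigned order)
{
    for (;;) {
        unsigned buddy = idx ^ (1u << order); /* constant-time lookup */
        if (buddy >= FRAMES || !free_map[buddy] || order_map[buddy] != order)
            break;                 /* no mergeable buddy: stop */
        free_map[buddy] = 0;       /* remove buddy from its list */
        idx &= ~(1u << order);     /* merged block is naturally aligned */
        order++;                   /* ...and one order larger */
    }
    free_map[idx] = 1;
    order_map[idx] = order;
}
```

<para>Note how the alignment requirement is enforced by construction:
the merged block's index always has bit <varname>order</varname>
cleared, so every block handed back to the lists stays aligned to its
size.</para>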
    </section>

    <section id="slab">
      <title>Slab allocator</title>

      <section>
        <title>Introduction</title>

        <para>The majority of memory allocation requests in the kernel
        are for small, frequently used data structures. For this purpose
        the slab allocator is a perfect solution. The basic idea behind
        a slab allocator is to have lists of commonly used objects
        available packed into pages. This avoids the overhead of
        allocating and destroying commonly used types of objects such as
        inodes, threads, virtual memory structures etc.</para>

        <para>The original slab allocator locking mechanism has become a
        significant performance bottleneck on SMP architectures.
        <termdef>The slab SMP performance bottleneck was resolved by
        introducing a per-CPU caching scheme called the
        <glossterm>magazine layer</glossterm></termdef>.</para>
      </section>

      <section>
        <title>Implementation details (needs revision)</title>

        <para>The SLAB allocator is closely modelled after the <ulink
        url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
        OpenSolaris SLAB allocator by Jeff Bonwick and Jonathan Adams
        </ulink> with the following exceptions: <itemizedlist>
            <listitem>
               empty SLABs are deallocated immediately (in Linux they are kept in a linked list, in Solaris ???)
            </listitem>

            <listitem>
               empty magazines are deallocated when not needed (in Solaris they are held in a linked list in the slab cache)
            </listitem>
          </itemizedlist> The following features are not currently
        supported but would be easy to add: <itemizedlist>
            <listitem>
               cache coloring
            </listitem>

            <listitem>
               dynamic magazine growing (different magazine sizes are already supported, but we would need to adjust the allocation strategy)
            </listitem>
          </itemizedlist></para>

        <para>The SLAB allocator supports per-CPU caches ('magazines')
        to facilitate good SMP scaling.</para>

        <para>When a new object is being allocated, the allocator first
        checks whether it is available in the CPU-bound magazine. If it
        is not found there, it is allocated from a CPU-shared slab: if a
        partially full slab is found, it is used; otherwise a new one is
        allocated.</para>

        <para>When an object is being deallocated, it is put into the
        CPU-bound magazine. If there is no such magazine, a new one is
        allocated (if the allocation fails, the object is deallocated
        into the slab). If the magazine is full, it is put onto the
        CPU-shared list of magazines and a new one is allocated.</para>
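<para>The fast paths above can be sketched with a magazine modelled as
a small LIFO stack of object pointers. The names and the fixed size are
hypothetical illustrations, not the HelenOS definitions; the slab layer
is reduced to a caller-supplied fallback object:</para>

```c
#include <stddef.h>

#define MAG_SIZE 4 /* hypothetical magazine capacity */

/* Toy per-CPU magazine: a LIFO stack of object pointers. */
typedef struct {
    void  *objs[MAG_SIZE];
    size_t count;
} magazine_t;

/* Allocation path: prefer the CPU-bound magazine; on a miss, fall
 * back to the slab layer (modelled by slab_fallback here). */
static void *mag_alloc(magazine_t *mag, void *slab_fallback)
{
    if (mag->count > 0)
        return mag->objs[--mag->count]; /* hit in CPU-bound magazine */
    return slab_fallback;               /* miss: go to shared slabs */
}

/* Deallocation path: push into the magazine; report overflow so the
 * caller can swap in an empty magazine or free into the slab. */
static int mag_free(magazine_t *mag, void *obj)
{
    if (mag->count == MAG_SIZE)
        return -1;                      /* magazine full */
    mag->objs[mag->count++] = obj;
    return 0;
}
```

<para>Both fast paths touch only CPU-local state, which is the whole
point of the magazine layer: no lock shared between CPUs is taken
unless the magazine misses or overflows.</para>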

        <para>The CPU-bound magazine is actually a pair of magazines, to
        avoid thrashing when somebody is allocating and deallocating one
        item at the magazine size boundary. LIFO order is enforced,
        which should avoid fragmentation as much as possible.</para>

        <para>Every cache contains a list of full slabs and a list of
        partially full slabs. Empty slabs are freed immediately
        (thrashing is avoided because of the magazines).</para>

        <para>The SLAB information structure is kept inside the data
        area, if possible. A cache can be marked so that it does not use
        magazines. This is used only for SLAB-related caches, to avoid
        deadlocks and infinite recursion (the SLAB allocator uses itself
        for allocating all of its control structures).</para>

        <para>The SLAB allocator allocates lots of space and does not
        free it. When the frame allocator fails to allocate a frame, it
        calls slab_reclaim(). It tries 'light reclaim' first, then
        brutal reclaim. The light reclaim releases slabs from the
        CPU-shared magazine list until at least one slab is deallocated
        in each cache (this algorithm should probably change). The
        brutal reclaim removes all cached objects, even from CPU-bound
        magazines.</para>

        <para>TODO: <itemizedlist>
            <listitem>
               For better CPU scaling, the magazine allocation strategy should be extended. Currently, if a cache does not have a magazine, it asks the non-CPU-cached magazine cache to provide one. It might be feasible to add a CPU-cached magazine cache (which would allocate its magazines from the non-CPU-cached magazine cache). This would provide a nice per-CPU buffer. The other possibility is to use the per-cache 'empty-magazine-list', which decreases contention for the one per-system magazine cache.
            </listitem>

            <listitem>
               It might be good to add lock granularity down to the slab level; we could then try_spinlock over all partial slabs and thus improve scalability even at the slab level.
            </listitem>
          </itemizedlist></para>
      </section>
    </section>

    <!-- End of Physmem -->
  </section>

  <section>
    <title>Memory sharing</title>

    <para>Not implemented yet(?)</para>
  </section>
</chapter>