WebSVN – HelenOS-doc – Blame – /design/trunk/src/ch_memory_management.xml

Rev	Author	Line No.	Line
9	bondari	1	<?xml version="1.0" encoding="UTF-8"?>
11	bondari	2	<chapter id="mm">
		3	<?dbhtml filename="mm.html"?>
9	bondari	4
11	bondari	5	<title>Memory management</title>
9	bondari	6
66	bondari	7	<section>
		8	<title>Virtual memory management</title>
64	jermar	9
66	bondari	10	<section>
		11	<title>Introduction</title>
		12
		13	<para>Virtual memory is a memory management technique used by the
		14	kernel to achieve several mission-critical goals: <itemizedlist>
		15	<listitem>
		16	Isolate each task from other tasks that are running on the system at the same time.
		17	</listitem>
		18
		19	<listitem>
		20	Allow allocating more memory than the physical memory size of the machine.
		21	</listitem>
		22
		23	<listitem>
		24	Allow, in general, loading and executing two programs that are linked at the same address without complicated relocations.
		25	</listitem>
		26	</itemizedlist></para>
		27
		28	<para><!--
		29	<para>
		30	Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc).
		31	Special address space area type - device - prohibits shrink/extend syscalls to call on it.
		32	Address space has link to mapping tables (hierarchical - per Address space, hash - global tables).
		33	</para>
		34
		35	--></para>
		36	</section>
		37
		38	<section>
		39	<title>Address spaces</title>
		40
		41	<section>
		42	<title>Address space areas</title>
		43
		44	<para>Each address space consists of mutually disjoint, contiguous
		45	address space areas. An address space area is precisely defined by its
		46	base address and the number of frames/pages it contains.</para>
		47
		48	<para>Each address space area has flags that define the behaviour and
		49	permissions of that particular area. <itemizedlist>
		50	<listitem>
		51
		52
		53	<emphasis>AS_AREA_READ</emphasis>
		54
		55	flag indicates reading permission.
		56	</listitem>
		57
		58	<listitem>
		59
		60
		61	<emphasis>AS_AREA_WRITE</emphasis>
		62
		63	flag indicates writing permission.
		64	</listitem>
		65
		66	<listitem>
		67
		68
		69	<emphasis>AS_AREA_EXEC</emphasis>
		70
		71	flag indicates code execution permission. Some architectures do not support restricting execution permission; on those, this flag has no effect.
		72	</listitem>
		73
		74	<listitem>
		75
		76
		77	<emphasis>AS_AREA_DEVICE</emphasis>
		78
		79	marks area as mapped to the device memory.
		80	</listitem>
		81	</itemizedlist></para>
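<para>As a rough illustration of how such flags combine, the following
Python sketch models them as bits of an integer and checks a requested
access against an area. The flag names come from the list above; the numeric
values, the <emphasis>access_allowed</emphasis> helper and its
<emphasis>arch_enforces_exec</emphasis> parameter are assumptions made for
this example, not the kernel's actual definitions.</para>

<programlisting language="python"><![CDATA[
```python
# Illustrative bit values only; the real kernel constants may differ.
AS_AREA_READ = 1
AS_AREA_WRITE = 2
AS_AREA_EXEC = 4
AS_AREA_DEVICE = 8


def access_allowed(area_flags, requested, arch_enforces_exec=True):
    """Return True when every requested permission is granted by the area."""
    if not arch_enforces_exec:
        # On architectures without execute protection the flag has no effect.
        requested &= ~AS_AREA_EXEC
    return (requested & ~area_flags) == 0
```
]]></programlisting>

<para>For example, under these assumptions a write to an area created with
only <emphasis>AS_AREA_READ</emphasis> is rejected, while an instruction
fetch from the same area is accepted on an architecture that cannot restrict
execution.</para>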
		82
		83	<para>The kernel allows tasks to create, expand, shrink and share
		84	their address space through a set of syscalls.</para>
		85	</section>
		86
		87	<section>
		88	<title>Address Space ID (ASID)</title>
		89
		90	<para>When switching to a different task, the kernel also needs to
		91	switch mappings to a different address space. If the TLB cannot
		92	distinguish address space mappings, all mapping information from the
		93	old address space must be flushed from the TLB, which creates
		94	unnecessary overhead during task switching. To avoid this, some
		95	architectures can segregate different address spaces at the hardware
		96	level by introducing an address space identifier as a part of each
		97	TLB record, telling the address translation unit to which address
		98	space the record applies.</para>
		99
		100	<para>The HelenOS kernel takes advantage of this hardware-supported
		101	identifier through an ASID abstraction that is tied to the
		102	corresponding architecture identifier. For example, on ia64 the kernel
		103	ASID is derived from the RID (region identifier), and on mips32 the
		104	kernel ASID is the hardware identifier itself. As expected, the ASID
		105	is stored in the <emphasis>as_t</emphasis> structure.</para>
		106
		107	<para>Due to hardware limitations, the hardware ASID has a limited
		108	length, from 8 bits on mips32 to 24 bits on ia64, which makes it
		109	impossible to use it as a unique address space identifier for all
		110	tasks running in the system. In such situations a special ASID
		111	stealing algorithm is used, which takes an ASID from an inactive task
		112	and assigns it to the active task.</para>
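<para>The stealing idea can be sketched as follows. This is a simplified
Python model rather than the HelenOS algorithm: the class name, the choice
of the longest-assigned inactive address space as the victim, and the
<emphasis>flushed</emphasis> list that stands in for TLB invalidation are
all assumptions made for illustration.</para>

<programlisting language="python"><![CDATA[
```python
class AsidAllocator:
    """Toy ASID pool that steals from an inactive address space when empty."""

    def __init__(self, asid_count):
        self.free = list(range(asid_count))  # ASIDs not owned by anyone
        self.owners = {}     # address space name -> ASID, in assignment order
        self.active = set()  # address spaces currently running on some CPU
        self.flushed = []    # ASIDs whose TLB entries would be invalidated

    def get(self, space):
        """Return the ASID of an address space, assigning one if needed."""
        if space in self.owners:
            return self.owners[space]
        if self.free:
            asid = self.free.pop(0)
        else:
            victims = [s for s in self.owners if s not in self.active]
            if not victims:
                raise RuntimeError("no inactive address space to steal from")
            asid = self.owners.pop(victims[0])
            # Stale TLB entries tagged with the stolen ASID must go first.
            self.flushed.append(asid)
        self.owners[space] = asid
        return asid
```
]]></programlisting>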
		113	</section>
		114	</section>
		115
		116	<section>
		117	<title>Virtual address translation</title>
		118
		119	<section id="pagING">
		120	<title>Paging</title>
		121
		122	<section>
		123	<title>Introduction</title>
		124
		125	<para>Virtual memory usually uses a paged memory model, in which the
		126	virtual address space is divided into
		127	<emphasis>pages</emphasis> (usually 4096 bytes in size) and
		128	physical memory is divided into frames (of the same size as a page).
		129	Each page may be mapped to some frame; then, upon memory
		130	access to a virtual address, the CPU performs <emphasis>address
		131	translation</emphasis> during instruction execution. A
		132	non-existent mapping generates a page fault exception and invokes the
		133	kernel exception handler, thus allowing the kernel to control the
		134	rules of memory access. Page mapping information is stored by the
		135	kernel in the <link linkend="page_tables">page tables</link>.</para>
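<para>The translation step can be illustrated with a small Python model. It
assumes the 4096-byte page size mentioned above and represents the page
tables as a plain dictionary from page number to frame number; the
<emphasis>translate</emphasis> function and the
<emphasis>PageFault</emphasis> exception are hypothetical names standing in
for what the CPU and the kernel exception handler do.</para>

<programlisting language="python"><![CDATA[
```python
PAGE_SIZE = 4096  # the usual page size mentioned in the text


class PageFault(Exception):
    """Stands in for the CPU exception raised on a missing mapping."""


def translate(page_table, vaddr):
    """Translate a virtual address using a {page number: frame number} dict."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    if page not in page_table:
        raise PageFault(hex(vaddr))
    # The offset within the page is preserved; only the page number changes.
    return page_table[page] * PAGE_SIZE + offset
```
]]></programlisting>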
		136
		137	<para>The majority of architectures use multi-level page tables,
		138	which means physical memory must be accessed several times before
		139	the physical address is obtained. This would cause serious performance
		140	overhead in virtual memory management. To avoid it, the <link
		141	linkend="tlb">Translation Lookaside Buffer (TLB)</link> is
		142	used.</para>
		143
		144	<para>The HelenOS kernel has two different paging
		145	implementations: <emphasis>4-level page tables</emphasis> and a
		146	<emphasis>global hash table</emphasis>, both accessible via a
		147	generic paging abstraction layer. Two implementations are needed
		148	because of major architectural differences between the supported
		149	platforms. The abstraction is implemented with the help of a global
		150	structure of pointers to basic mapping functions,
		151	<emphasis>page_mapping_operations</emphasis>. To provide a different
		152	kind of page tables, the corresponding layer must implement the
		153	functions declared in
		154	<emphasis>page_mapping_operations</emphasis>.</para>
		155
		156	<para>Thanks to the abstract paging interface, there is room
		157	for more paging implementations (besides the already implemented
		158	hierarchical page tables and hash table), for example B-tree based
		159	page tables.</para>
		160	</section>
		161
		162	<section>
		163	<title>Hierarchical 4-level page tables</title>
		164
		165	<para>Hierarchical 4-level page tables are a generalization of the
		166	hardware capabilities of most architectures. Each address space has
		167	its own page tables.<itemizedlist>
		168	<listitem>
		169	ia32 uses 2-level page tables, with full hardware support.
		170	</listitem>
		171
		172	<listitem>
		173	amd64 uses 4-level page tables, also coming with full hardware support.
		174	</listitem>
		175
		176	<listitem>
		177	mips and ppc32 use 2-level tables with software-simulated support.
		178	</listitem>
		179	</itemizedlist></para>
		180	</section>
		181
		182	<section>
		183	<title>Global hash table</title>
		184
		185	<para>Implementation of the global hash table was motivated by
		186	support for the ia64 architecture. One of the major differences between
		187	the global hash table and hierarchical tables is that the global hash
		188	table exists only once in the system, whereas hierarchical tables are
		189	maintained per address space.</para>
		190
		191	<para>The hash table thus contains information about the mappings of
		192	all address spaces in the system, so the hash of an entry must combine
		193	both the address space pointer or id and the virtual
		194	address of the page. The generic hash table implementation assumes
		195	that the address space structures are likely to lie at nearby
		196	addresses, so it uses the least significant bits of the pointer for
		197	the hash; it also assumes that all virtual page addresses are roughly
		198	equally likely to occur, so the least significant bits of the VPN
		199	compose the hash index.</para>
		200
		201	<para>The global page hash table exists only once in the whole system
		202	(it is used by ia64); note that ia64 currently has the VHPT disabled. A
		203	generic hash table with separate collision chains is used. ASID support is
		204	required to use global hash tables.</para>
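<para>A minimal Python sketch of such a table is given below. It only
illustrates the idea of building the index from the low bits of the address
space id and of the VPN and of resolving collisions by chaining; the
constants (page width, table size, how many bits come from each source) and
the function names are assumptions, not values taken from the kernel.</para>

<programlisting language="python"><![CDATA[
```python
PAGE_WIDTH = 12  # assumed: 4096-byte pages
TABLE_BITS = 10  # assumed: 1024 buckets
AS_BITS = 4      # assumed split: 4 index bits from the id, 6 from the VPN


def hash_index(as_id, vaddr):
    """Combine the low bits of the address space id and of the VPN."""
    vpn_bits = TABLE_BITS - AS_BITS
    low_as = as_id & ((1 << AS_BITS) - 1)
    low_vpn = (vaddr >> PAGE_WIDTH) & ((1 << vpn_bits) - 1)
    return (low_as << vpn_bits) | low_vpn


def insert(table, as_id, vaddr, frame):
    """Append a mapping to the collision chain of its bucket."""
    chain = table.setdefault(hash_index(as_id, vaddr), [])
    chain.append((as_id, vaddr >> PAGE_WIDTH, frame))


def find(table, as_id, vaddr):
    """Walk the collision chain and compare the full (id, VPN) key."""
    key = (as_id, vaddr >> PAGE_WIDTH)
    for entry_as, entry_vpn, frame in table.get(hash_index(as_id, vaddr), []):
        if (entry_as, entry_vpn) == key:
            return frame
    return None
```
]]></programlisting>

<para>Because only a few bits of each value enter the index, two address
spaces whose ids share the same low bits land in the same bucket for the
same page; the full key comparison in <emphasis>find</emphasis> keeps them
apart.</para>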
		205	</section>
		206	</section>
		207
		208	<section id="tlb">
		209	<title>Translation Lookaside Buffer</title>
		210
		211	<para>Because looking up a page mapping in the page tables is
		212	expensive, all architectures have a fast associative cache memory
		213	built into the CPU. This memory, called the TLB, stores recently used
		214	page table entries.</para>
		215
		216	<section id="tlb_shootdown">
		217	<title>TLB consistency. TLB shootdown algorithm.</title>
		218
		219	<para>The operating system is responsible for keeping the TLB
		220	consistent by invalidating its contents whenever the page tables
		221	change. Such changes occur when a page or a group of pages is
		222	unmapped, when a mapping is changed, or when the system switches the
		223	active address space to schedule a new task. Moreover, the
		224	invalidation must be done on all CPUs in the system because each CPU
		225	has its own independent TLB. Maintaining TLB consistency on an SMP
		226	configuration is therefore not as trivial as it looks at first
		227	glance. A naive solution would assume that the CPU which wants to
		228	invalidate the TLB also invalidates the TLBs of the other CPUs. This
		229	is not possible on most architectures for a simple reason: flushing
		230	the TLB is allowed only on the local CPU, and there is no way to
		231	access the TLBs of other CPUs and thus invalidate them
		232	remotely.</para>
		233
		234	<para>The technique of remote invalidation of TLB entries is called "TLB
		235	shootdown". HelenOS uses a variation of the algorithm described by
		236	D. Black et al., "Translation Lookaside Buffer Consistency: A
		237	Software Approach," Proc. Third Int'l Conf. Architectural Support
		238	for Programming Languages and Operating Systems, 1989, pp.
		239	113-122.</para>
		240
		241	<para>Depending on the situation, only a partial invalidation of the
		242	TLB may be needed. For a simple change of a memory mapping, it is
		243	enough to invalidate one or more adjacent pages. If the
		244	architecture is aware of ASIDs and the kernel needs to take an ASID
		245	away for use by another task, it invalidates only the entries of that
		246	particular address space. The final option is complete TLB
		247	invalidation, which flushes all entries in the
		248	TLB.</para>
		249
		250	<para>TLB shootdown is performed in two phases.</para>
		251
		252	<formalpara>
		253	<title>Phase 1.</title>
		254
		255	<para>First, the initiator locks a global TLB spinlock. It then puts
		256	the request into the local request cache of every other CPU in the
		257	system; each cache is protected by its own spinlock. If a cache is
		258	full, all requests in it are replaced by a single request indicating
		259	a global TLB flush. The initiator then sends an IPI message
		260	announcing the TLB shootdown request to the rest of the CPUs and
		261	waits actively until all of them acknowledge it by setting a
		262	special flag. After setting this flag,
		263	each of those CPUs blocks on the TLB spinlock held by the
		264	initiator.</para>
		265	</formalpara>
		266
		267	<formalpara>
		268	<title>Phase 2.</title>
		269
		270	<para>All CPUs are now waiting on the TLB spinlock to execute the TLB
		271	invalidation action and have indicated their intention to the
		272	initiator. The initiator continues, cleaning up its own TLB and
		273	releasing the global TLB spinlock. After that, all other CPUs acquire
		274	and immediately release the TLB spinlock and perform their TLB
		275	invalidation actions.</para>
		276	</formalpara>
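<para>The handling of the per-CPU request cache can be sketched in Python as
shown below. The sketch covers only the queueing rule from Phase 1 and the
three kinds of invalidation; it does not model the spinlocks or the IPI. The
queue length, the tuple layout of a request and the function names are
assumptions, and the TLB is modelled as a set of (ASID, page) pairs.</para>

<programlisting language="python"><![CDATA[
```python
INVL_ALL, INVL_ASID, INVL_PAGES = "all", "asid", "pages"
QUEUE_LEN = 4  # assumed capacity of a per-CPU request cache


def enqueue(requests, kind, asid=None, page=None, count=0):
    """Queue a request; a full cache collapses into one global flush."""
    if len(requests) == QUEUE_LEN:
        requests[:] = [(INVL_ALL, None, None, 0)]
    else:
        requests.append((kind, asid, page, count))


def process(tlb, requests):
    """Apply queued requests to a TLB modelled as a set of (asid, page)."""
    for kind, asid, page, count in requests:
        if kind == INVL_ALL:
            tlb.clear()
        elif kind == INVL_ASID:
            tlb.difference_update({e for e in tlb if e[0] == asid})
        else:
            tlb.difference_update({(asid, page + i) for i in range(count)})
    requests.clear()
```
]]></programlisting>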
		277	</section>
		278	</section>
		279	</section>
		280
		281	<section>
		282	<title>---</title>
		283
		284	<para>At the moment HelenOS does not support swapping.</para>
		285
		286	<para>Page faults are used to allocate frames on demand within an
		287	as_area. On architectures that support it, non-executable pages are
		288	supported.</para>
		289	</section>
		290	</section>
		291
26	bondari	292	<section>
64	jermar	293	<title>Physical memory management</title>
		294
		295	<section id="zones_and_frames">
		296	<title>Zones and frames</title>
		297
66	bondari	298	<para>On some architectures, not all physical memory is available for
		299	conventional use. This limitation requires the kernel to maintain a
		300	table of available and unavailable ranges of physical memory addresses.
		301	The main idea of zones is to create a memory zone entity that is a
		302	contiguous chunk of memory available for allocation. If some chunk is
		303	not available, it is simply not put into any zone.</para>
64	jermar	304
66	bondari	305	<para>A zone also serves informational purposes, as it keeps track
		306	of the number of free and busy frames. Physical memory
		307	allocation is always done inside a particular zone. Allocation of zone
		308	frames is handled by the <link linkend="frame_allocator">frame
		309	allocator</link> associated with the zone.</para>
64	jermar	310
66	bondari	311	<para>Some architectures (mips32, ppc32) have only one zone that
		312	covers the whole physical memory, while others (like ia32) may have
		313	multiple zones. Information about zones on the current machine is stored in
		314	BIOS hardware tables or can be hardcoded into the kernel at compile
		315	time.</para>
64	jermar	316	</section>
		317
		318	<section id="frame_allocator">
		319	<title>Frame allocator</title>
		320
66	bondari	321	<figure>
		322	<mediaobject id="frame_alloc">
		323	<imageobject role="html">
		324	<imagedata fileref="images/frame_alloc.png" format="PNG" />
		325	</imageobject>
64	jermar	326
66	bondari	327	<imageobject role="fop">
		328	<imagedata fileref="images.vector/frame_alloc.svg" format="SVG" />
		329	</imageobject>
		330	</mediaobject>
64	jermar	331
66	bondari	332	<title>Frame allocator scheme.</title>
		333	</figure>
64	jermar	334
		335	<formalpara>
66	bondari	336	<title>Overview</title>
		337
		338	<para>The frame allocator provides physical memory allocation for the
		339	kernel. Because physical memory is organized into zones, the frame
		340	allocator always works in the context of a single zone, so it cannot
		341	allocate a piece of memory that spans two zones. This is not a
		342	limitation in practice, because two adjacent zones can be merged
		343	into one. The frame allocator is also responsible for updating
		344	information on the number of free/busy frames in the zone. Physical memory
		345	allocation inside one <link linkend="zones_and_frames">memory
		346	zone</link> is handled by an instance of the <link
		347	linkend="buddy_allocator">buddy allocator</link> tailored to allocate
		348	blocks of physical memory frames.</para>
		349	</formalpara>
		350
		351	<formalpara>
64	jermar	352	<title>Allocation / deallocation</title>
		353
66	bondari	354	<para>Upon an allocation request, the frame allocator tries to find the
		355	first zone that can satisfy the request (i.e. has the required number of
		356	free frames). During deallocation, the frame allocator needs
		357	to find the zone that contains the deallocated frame. This approach
		358	raises two potential problems: <itemizedlist>
		359	<listitem>
		360	Linear search of zones does not help performance, but the number of zones is not expected to be high. If it ever is, the list of zones can be replaced with a more time-efficient B-tree.
		361	</listitem>
		362
		363	<listitem>
		364	The allocator must quickly find out whether a zone contains the required number of frames and whether this chunk of memory is properly aligned. This issue is solved by the buddy allocator.
		365	</listitem>
		366	</itemizedlist></para>
64	jermar	367	</formalpara>
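<para>The two zone lookups described above can be sketched as plain linear
searches. The <emphasis>Zone</emphasis> record and both function names are
hypothetical, and the free-frame counter is only a first filter: whether a
suitably aligned contiguous block really exists is a question for the buddy
allocator of the zone.</para>

<programlisting language="python"><![CDATA[
```python
from dataclasses import dataclass


@dataclass
class Zone:
    base: int   # index of the first frame in the zone
    count: int  # total number of frames in the zone
    free: int   # number of currently free frames


def find_zone_for_alloc(zones, frames):
    """Return the first zone with enough free frames, or None."""
    for zone in zones:
        if zone.free >= frames:
            return zone
    return None


def find_zone_of_frame(zones, frame):
    """Return the zone that contains the given frame index, or None."""
    for zone in zones:
        if zone.base <= frame < zone.base + zone.count:
            return zone
    return None
```
]]></programlisting>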
		368	</section>
		369
		370	<section id="buddy_allocator">
		371	<title>Buddy allocator</title>
		372
66	bondari	373	<section>
		374	<title>Overview</title>
64	jermar	375
66	bondari	376	<figure>
64	jermar	377	<mediaobject id="buddy_alloc">
		378	<imageobject role="html">
		379	<imagedata fileref="images/buddy_alloc.png" format="PNG" />
		380	</imageobject>
		381
		382	<imageobject role="fop">
		383	<imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" />
		384	</imageobject>
		385	</mediaobject>
		386
		387	<title>Buddy system scheme.</title>
66	bondari	388	</figure>
64	jermar	389
66	bondari	390	<para>In the buddy allocator, the memory is broken down into
		391	power-of-two sized naturally aligned blocks. These blocks are
		392	organized in an array of lists, in which the list with index i
		393	contains all unallocated blocks of size
		394	<mathphrase>2<superscript>i</superscript></mathphrase>. The index i is
		395	called the order of block. Should there be two adjacent equally sized
		396	blocks in the list i (i.e. buddies), the buddy allocator
		397	would coalesce them and put the resulting block in list <mathphrase>i
		398	+ 1</mathphrase>, provided that the resulting block would be naturally
		399	aligned. Similarly, when the allocator is asked to allocate a block
		400	of size <mathphrase>2<superscript>i</superscript></mathphrase>, it
		401	first tries to satisfy the request from the list with index i. If the
		402	request cannot be satisfied (i.e. the list i is empty), the buddy
		403	allocator will try to allocate and split a larger block from the list
		404	with index i + 1. Both of these algorithms are recursive. The
		405	recursion ends either when there are no blocks to coalesce in the
		406	former case or when there are no blocks that can be split in the
		407	latter case.</para>
		408
		409	<!--graphic fileref="images/mm1.png" format="EPS" /-->
		410
		411	<para>This approach greatly reduces external fragmentation of memory
		412	and helps in allocating bigger contiguous blocks of memory aligned to
		413	their size. On the other hand, the buddy allocator suffers increased
		414	internal fragmentation of memory and is not suitable for general
		415	kernel allocations. This purpose is better addressed by the <link
		416	linkend="slab">slab allocator</link>.</para>
		417	</section>
		418
64	jermar	419	<section>
		420	<title>Implementation</title>
		421
		422	<para>The buddy allocator is, in fact, an abstract framework which can
		423	be easily specialized to serve one particular task. It knows nothing
		424	about the nature of memory it helps to allocate. To compensate for
		425	this lack of knowledge, the buddy allocator exports an interface that
		426	each of its clients is required to implement. When supplied with an
		427	implementation of this interface, the buddy allocator can use
		428	specialized external functions to find a buddy for a block, split and
		429	coalesce blocks, manipulate block order and mark blocks busy or
66	bondari	430	available. For precise documentation of this interface, refer to
		431	<emphasis>"HelenOS Generic Kernel Reference Manual"</emphasis>.</para>
64	jermar	432
		433	<formalpara>
		434	<title>Data organization</title>
		435
		436	<para>Each entity allocable by the buddy allocator is required to
		437	contain space for storing block order number and a link variable
		438	used to interconnect blocks within the same order.</para>
		439
		440	<para>Whatever entities are allocated by the buddy allocator, the
		441	first entity within a block is used to represent the entire block.
		442	The first entity keeps the order of the whole block. Other entities
		443	within the block are assigned the magic value
		444	<constant>BUDDY_INNER_BLOCK</constant>. This is especially important
		445	for effective identification of buddies in a one-dimensional array
		446	because the entity that represents a potential buddy cannot be
		447	associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it
		448	is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is
		449	not a buddy).</para>
66	bondari	450
		451	<para>The buddy allocator always uses the first frame to represent
		452	the frame block. This frame contains <varname>buddy_order</varname>
		453	variable to provide information about the block size it actually
		454	represents (
		455	<mathphrase>2<superscript>buddy_order</superscript></mathphrase>
		456	frames block). Other frames in block have this value set to magic
		457	<constant>BUDDY_INNER_BLOCK</constant> that is much greater than
		458	buddy <varname>max_order</varname> value.</para>
		459
		460	<para>Each <varname>frame_t</varname> also contains a link member
		461	that holds the frame structure in the linked list of its order.</para>
64	jermar	462	</formalpara>
66	bondari	463
		464	<formalpara>
		465	<title>Allocation algorithm</title>
		466
		467	<para>Upon a request for a block of <mathphrase>2<superscript>i</superscript></mathphrase>
		468	frames, the allocator checks whether there are any
		469	blocks available in the order list <varname>i</varname>. If so, it
		470	removes a block from the order list and returns its address. If not, it
		471	recursively allocates a
		472	<mathphrase>2<superscript>i+1</superscript></mathphrase> frame
		473	block and splits it into two
		474	<mathphrase>2<superscript>i</superscript></mathphrase> frame blocks.
		475	It then adds one of the blocks to the <varname>i</varname> order list
		476	and returns the address of the other.</para>
		477	</formalpara>
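<para>The recursive allocation can be modelled in a few lines of Python.
This is a sketch under stated assumptions, not the kernel code: the free
lists are plain Python lists of first-frame indices, the function returns a
frame index instead of an address, and handing out the lower half of a split
block is an arbitrary choice.</para>

<programlisting language="python"><![CDATA[
```python
def buddy_alloc(free_lists, order):
    """Allocate a block of 2**order frames; return its first frame or None.

    free_lists[i] holds the first-frame indices of free blocks of order i.
    """
    if order >= len(free_lists):
        return None  # no larger block exists that could be split
    if free_lists[order]:
        return free_lists[order].pop()
    block = buddy_alloc(free_lists, order + 1)  # get a block twice as large
    if block is None:
        return None
    # Split it: the upper half goes to the free list, the lower is returned.
    free_lists[order].append(block + (1 << order))
    return block
```
]]></programlisting>

<para>Starting from a single free block of 8 frames, a request for one frame
splits the block three times and leaves one free block each of 1, 2 and 4
frames behind.</para>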
		478
		479	<formalpara>
		480	<title>Deallocation algorithm</title>
		481
		482	<para>Check whether the block has a so-called buddy (another free
		483	<mathphrase>2<superscript>i</superscript></mathphrase> frame block
		484	that can be merged with the freed block into a
		485	<mathphrase>2<superscript>i+1</superscript></mathphrase> block).
		486	Technically, the buddy is the odd/even block for an even/odd block,
		487	respectively. In addition, the resulting block is required to
		488	be aligned to its size. This requirement guarantees
		489	natural block alignment for the blocks coming out of the allocation
		490	system.</para>
		491
		492	<para>Using direct pointer arithmetic and the
		493	<varname>frame_t::ref_count</varname> and
		494	<varname>frame_t::buddy_order</varname> variables, finding the buddy
		495	is done in constant time.</para>
		496	</formalpara>
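<para>For naturally aligned blocks, the even/odd relationship means that the
buddy's first frame index differs from the block's only in the bit that
corresponds to the block size, so it can be computed with one XOR. The
Python sketch below uses this to coalesce on deallocation. The list-based
free lists and function names are assumptions; in particular, the membership
test on a list is linear, whereas the kernel reaches constant time through
the per-frame variables mentioned above.</para>

<programlisting language="python"><![CDATA[
```python
def buddy_of(block, order):
    """First-frame index of the buddy of a naturally aligned block."""
    return block ^ (1 << order)


def buddy_free(free_lists, block, order):
    """Return a block to the free lists, coalescing with free buddies."""
    while order + 1 < len(free_lists):
        buddy = buddy_of(block, order)
        if buddy not in free_lists[order]:
            break  # the buddy is busy or split, so stop merging
        free_lists[order].remove(buddy)
        block = min(block, buddy)  # the merged block starts at the lower half
        order += 1
    free_lists[order].append(block)
```
]]></programlisting>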
64	jermar	497	</section>
		498	</section>
		499
		500	<section id="slab">
		501	<title>Slab allocator</title>
		502
66	bondari	503	<section>
		504	<title>Overview</title>
64	jermar	505
66	bondari	506	<para><termdef><glossterm>Slab</glossterm> represents a contiguous
		507	piece of memory, usually made of several physically contiguous
		508	pages.</termdef> <termdef><glossterm>Slab cache</glossterm> consists
		509	of one or more slabs.</termdef></para>
64	jermar	510
66	bondari	511	<para>The majority of memory allocation requests in the kernel are for
		512	small, frequently used data structures. The slab
		513	allocator is well suited to this purpose. The basic idea behind the slab
		514	allocator is to have lists of commonly used objects available, packed
		515	into pages. This avoids the overhead of allocating and destroying
		516	commonly used types of objects such as threads, virtual memory structures
		517	etc. Also, because the allocated size matches exactly, slab allocation
		518	largely eliminates internal fragmentation.</para>
		519	</section>
65	jermar	520
66	bondari	521	<section>
		522	<title>Implementation</title>
		523
		524	<figure>
64	jermar	525	<mediaobject id="slab_alloc">
		526	<imageobject role="html">
		527	<imagedata fileref="images/slab_alloc.png" format="PNG" />
		528	</imageobject>
66	bondari	529
		530	<imageobject role="fop">
		531	<imagedata fileref="images.vector/slab_alloc.svg" format="SVG" />
		532	</imageobject>
64	jermar	533	</mediaobject>
		534
		535	<title>Slab allocator scheme.</title>
66	bondari	536	</figure>
64	jermar	537
		538	<para>The slab allocator is closely modelled after <ulink
		539	url="http://www.usenix.org/events/usenix01/full_papers/bonwick/bonwick_html/">
		540	OpenSolaris slab allocator by Jeff Bonwick and Jonathan Adams </ulink>
66	bondari	541	with the following exceptions: <itemizedlist>
64	jermar	542	<listitem>
66	bondari	543	empty slabs are deallocated immediately (in Linux they are kept in linked list, in Solaris ???)
64	jermar	544	</listitem>
66	bondari	545
		546	<listitem>
		547	empty magazines are deallocated when not needed (in Solaris they are held in linked list in slab cache)
		548	</listitem>
64	jermar	549	</itemizedlist> The following features are not currently supported but
		550	would be easy to add: <itemizedlist>
		551	<listitem>
66	bondari	552	cache coloring
64	jermar	553	</listitem>
		554
		555	<listitem>
66	bondari	556	dynamic magazine growth (different magazine sizes are already supported, but the allocation strategy would need to be adjusted)
64	jermar	557	</listitem>
		558	</itemizedlist></para>
		559
		560	<section>
		561	<title>Magazine layer</title>
		562
		563	<para>On SMP architectures, the global slab locking mechanism is a
		564	severe bottleneck, because it serializes the processing of all slab
		565	allocation requests. For this reason a new layer was added to the
		566	classic slab allocator design: the slab allocator was extended with
		567	per-CPU caches called 'magazines' to achieve good SMP scaling.
		568	<termdef>The slab SMP performance bottleneck was resolved by introducing
		569	a per-CPU caching scheme called the <glossterm>magazine
		570	layer</glossterm></termdef>.</para>
		571
		572	<para>A magazine is an N-element cache of objects, so each magazine can
		573	satisfy N allocations. A magazine behaves like an automatic weapon
		574	magazine (LIFO, stack), so allocation and deallocation become simple
		575	push/pop pointer operations. The trick is that a CPU does not access global
		576	slab allocator data while allocating from its magazine, which
		577	makes parallel allocations on different CPUs possible.</para>
		578
		579	<para>The implementation requires one more feature: the
		580	CPU-bound magazine is actually a pair of magazines, which avoids
		581	thrashing when a single item is repeatedly allocated and deallocated
		582	at the magazine size boundary. LIFO order is enforced, which should
		583	avoid fragmentation as much as possible.</para>
		584
		585	<para>Another important entity of magazine layer is the common full
		586	magazine list (also called a depot), that stores full magazines that
		587	may be used by any of the CPU magazine caches to reload active CPU
		588	magazine. This list of magazines can be pre-filled with full
		589	magazines during initialization, but in current implementation it is
		590	filled during object deallocation, when CPU magazine becomes
		591	full.</para>
		592
		593	<para>Slab allocator control structures are allocated from special
		594	slabs that are marked by a special flag, indicating that they should
		595	not be used with the slab magazine layer. This is done to avoid possible
		596	infinite recursion and deadlock during conventional slab allocation
		597	requests.</para>
		598	</section>
		599
		600	<section>
		601	<title>Allocation/deallocation</title>
		602
		603	<para>Every cache contains a list of full slabs and a list of partially
		604	full slabs. Empty slabs are immediately freed (thrashing will be
		605	avoided because of magazines).</para>
		606
		607	<para>The slab allocator allocates lots of space and does not free
		608	it. When the frame allocator fails to allocate a frame, it calls
		609	slab_reclaim(). It tries 'light reclaim' first, then brutal reclaim.
		610	The light reclaim releases slabs from the CPU-shared magazine list
		611	until at least 1 slab is deallocated in each cache (this algorithm
		612	should probably change). The brutal reclaim removes all cached
		613	objects, even from CPU-bound magazines.</para>
		614
		615	<formalpara>
		616	<title>Allocation</title>
		617
		618	<para><emphasis>Step 1.</emphasis> On an allocation
		619	request, the slab allocator first checks whether an object is available
		620	in the local CPU-bound magazine. If so, it simply "pops"
		621	the CPU magazine and returns the pointer to the object.</para>
		622
		623	<para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
		624	empty, the allocator attempts to reload it by swapping it with the
		625	second CPU magazine and returns to the first step.</para>
		626
		627	<para><emphasis>Step 3.</emphasis> Now both CPU-bound magazines
		628	are empty, which makes the allocator
		629	access the shared depot of full magazines to reload the CPU-bound magazines.
		630	If the reload is successful (meaning there are full magazines in the depot),
		631	the algorithm continues at Step 1.</para>
		632
		633	<para><emphasis>Step 4.</emphasis> Final step of the allocation.
		634	In this step the object is allocated from the conventional slab layer
		635	and a pointer to it is returned.</para>
		636	</formalpara>
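<para>The four allocation steps can be traced with a small single-CPU model
in Python. Everything here is an illustrative assumption: magazines are
Python lists used as stacks, the slab layer is reduced to a list of
preallocated objects, and the class and attribute names are invented for the
sketch.</para>

<programlisting language="python"><![CDATA[
```python
class CacheModel:
    """Single-CPU model of the allocation path through the magazine layer."""

    def __init__(self, slab_objects):
        self.current = []   # CPU-bound magazine, used as a LIFO stack
        self.previous = []  # second CPU-bound magazine
        self.depot = []     # shared list of full magazines
        self.slab = list(slab_objects)  # stand-in for the slab layer

    def alloc(self):
        while True:
            if self.current:    # Step 1: pop from the CPU magazine
                return self.current.pop()
            if self.previous:   # Step 2: swap the two CPU magazines
                self.current, self.previous = self.previous, self.current
                continue
            if self.depot:      # Step 3: reload from the depot
                self.current = self.depot.pop()
                continue
            # Step 4: fall back to the conventional slab layer.
            return self.slab.pop() if self.slab else None
```
]]></programlisting>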
		637
		638	<formalpara>
		639	<title>Deallocation</title>
		640
		641	<para><emphasis>Step 1.</emphasis> On a deallocation request, the
		642	slab allocator checks whether the local CPU-bound magazine is not
		643	full. If it is not, the pointer is simply pushed to this
		644	magazine.</para>
		645
		646	<para><emphasis>Step 2.</emphasis> If the CPU-bound magazine is
		647	full, the allocator attempts to reload it by swapping it with the
		648	second CPU magazine and returns to the first step.</para>
		649
		650	<para><emphasis>Step 3.</emphasis> Now both CPU-bound magazines
		651	are full, which makes the allocator
		652	access the shared depot of full magazines, put one of the magazines
		653	into the depot and create a new empty magazine. The algorithm continues at
		654	Step 1.</para>
		655	</formalpara>
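<para>The deallocation steps can be modelled in the same spirit. In this
Python sketch the magazines are again lists used as stacks and all names are
hypothetical. The text does not say which of the two full magazines goes to
the depot in Step 3; moving the current one is an arbitrary choice made
here.</para>

<programlisting language="python"><![CDATA[
```python
class CpuMagazines:
    """Single-CPU model of the deallocation path through the magazines."""

    def __init__(self, size):
        self.size = size    # N, the capacity of one magazine
        self.current = []   # CPU-bound magazine, used as a LIFO stack
        self.previous = []  # second CPU-bound magazine
        self.depot = []     # shared list of full magazines

    def free(self, obj):
        while True:
            if len(self.current) < self.size:   # Step 1: push the pointer
                self.current.append(obj)
                return
            if len(self.previous) < self.size:  # Step 2: swap magazines
                self.current, self.previous = self.previous, self.current
                continue
            # Step 3: both are full, so move one to the depot and
            # continue with a new empty magazine.
            self.depot.append(self.current)
            self.current = []
```
]]></programlisting>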
		656	</section>
		657	</section>
		658	</section>
		659
		660	<!-- End of Physmem -->
		661	</section>
		662
		663	<section>
66	bondari	664	<title>Memory sharing</title>
9	bondari	665
66	bondari	666	<para>Not implemented yet(?)</para>
26	bondari	667	</section>
11	bondari	668	</chapter>


(root)/design/trunk/src/ch_memory_management.xml @ 185 – Rev 66