| Rev | Author | Line No. | Line |
|---|---|---|---|
| 9 | bondari | 1 | <?xml version="1.0" encoding="UTF-8"?> |
| 11 | bondari | 2 | <chapter id="mm"> |
| 3 | <?dbhtml filename="mm.html"?> |
||
| 9 | bondari | 4 | |
| 11 | bondari | 5 | <title>Memory management</title> |
| 9 | bondari | 6 | |
| 67 | jermar | 7 | <para>In previous chapters, this book described the scheduling subsystem as |
| 8 | the creator of the impression that threads execute in parallel. The memory |
||
| 9 | management subsystem, on the other hand, creates the impression that there |
||
| 10 | is enough physical memory for the kernel and that userspace tasks have the |
||
| 11 | entire address space only for themselves.</para> |
||
| 64 | jermar | 12 | |
| 26 | bondari | 13 | <section> |
| 64 | jermar | 14 | <title>Physical memory management</title> |
| 15 | |||
| 16 | <section id="zones_and_frames"> |
||
| 17 | <title>Zones and frames</title> |
||
| 18 | |||
| 67 | jermar | 19 | <para>HelenOS represents continuous areas of physical memory in |
| 20 | structures called frame zones (abbreviated as zones). Each zone contains |
||
| 21 | information about the number of allocated and unallocated physical |
||
| 22 | memory frames as well as the physical base address of the zone and |
||
| 23 | number of frames contained in it. A zone also contains an array of frame |
||
| 24 | structures describing each frame of the zone. Last but not |
||
| 25 | least, each zone is equipped with a buddy system |
||
| 26 | that facilitates effective allocation of power-of-two sized blocks of |
||
| 27 | frames.</para> |
||
| 64 | jermar | 28 | |
| 67 | jermar | 29 | <para>This organization of physical memory provides good preconditions |
| 30 | for hot-plugging of more zones. There is also one currently unused zone |
||
| 31 | attribute: <code>flags</code>. The attribute could be used to give a |
||
| 32 | special meaning to some zones in the future.</para> |
||
| 64 | jermar | 33 | |
| 67 | jermar | 34 | <para>The zones are linked in a doubly-linked list. This might seem a |
| 35 | bit inefficient because the zone list is walked every time a frame is |
||
| 36 | allocated or deallocated. However, this does not represent a significant |
||
| 37 | performance problem as it is expected that the number of zones will be |
||
| 38 | rather low. Moreover, most architectures merge all zones into |
||
| 39 | one.</para> |
||
| 40 | |||
| 41 | <para>For each physical memory frame found in a zone, there is a frame |
||
| 42 | structure that contains the number of references and data used by the |
||
| 43 | buddy system.</para> |
||
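The zone and frame bookkeeping described above can be sketched in C. All structure and field names below are illustrative assumptions modelled on the description, not HelenOS's actual definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the per-zone bookkeeping described in the
 * text; names are illustrative, not HelenOS's real zone/frame types. */
typedef struct frame {
    uint16_t refcount;        /* number of references to this frame */
    uint8_t buddy_order;      /* order used by the zone's buddy system */
    struct frame *buddy_link; /* linkage within the buddy free lists */
} frame_t;

typedef struct zone {
    uintptr_t base;           /* physical base address of the zone */
    size_t count;             /* number of frames in the zone */
    size_t free_count;        /* unallocated frames */
    size_t busy_count;        /* allocated frames */
    int flags;                /* currently unused, reserved for future use */
    frame_t *frames;          /* one frame_t per physical frame */
    struct zone *next, *prev; /* doubly-linked list of zones */
} zone_t;

/* Invariant the allocator maintains: free + busy == count. */
static inline int zone_consistent(const zone_t *z)
{
    return z->free_count + z->busy_count == z->count;
}
```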
| 64 | jermar | 44 | </section> |
| 45 | |||
| 46 | <section id="frame_allocator"> |
||
| 47 | <title>Frame allocator</title> |
||
| 48 | |||
| 67 | jermar | 49 | <para>The frame allocator satisfies kernel requests to allocate |
| 50 | power-of-two sized blocks of physical memory. Because of zonal |
||
| 51 | organization of physical memory, the frame allocator is always working |
||
| 52 | within the context of some frame zone. In order to carry out the |
||
| 53 | allocation requests, the frame allocator is tightly integrated with the |
||
| 54 | buddy system belonging to the zone. The frame allocator is also |
||
| 55 | responsible for updating information about the number of free and busy |
||
| 56 | frames in the zone. <figure> |
||
| 57 | <mediaobject id="frame_alloc"> |
||
| 58 | <imageobject role="html"> |
||
| 59 | <imagedata fileref="images/frame_alloc.png" format="PNG" /> |
||
| 60 | </imageobject> |
||
| 64 | jermar | 61 | |
| 67 | jermar | 62 | <imageobject role="fop"> |
| 63 | <imagedata fileref="images.vector/frame_alloc.svg" format="SVG" /> |
||
| 64 | </imageobject> |
||
| 65 | </mediaobject> |
||
| 64 | jermar | 66 | |
| 67 | jermar | 67 | <title>Frame allocator scheme.</title> |
| 68 | </figure></para> |
||
| 64 | jermar | 69 | |
| 70 | <formalpara> |
||
| 71 | <title>Allocation / deallocation</title> |
||
| 72 | |||
| 67 | jermar | 73 | <para>Upon allocation request via function <code>frame_alloc</code>, |
| 74 | the frame allocator first tries to find a zone that can satisfy the |
||
| 75 | request (i.e. one that has the required number of free frames). Once a suitable |
||
| 76 | zone is found, the frame allocator uses the buddy allocator on the |
||
| 77 | zone's buddy system to perform the allocation. During deallocation, |
||
| 78 | which is triggered by a call to <code>frame_free</code>, the frame |
||
| 79 | allocator looks up the respective zone that contains the frame being |
||
| 80 | deallocated. Afterwards, it calls the buddy allocator again, this time |
||
| 81 | to take care of deallocation within the zone's buddy system.</para> |
||
| 64 | jermar | 82 | </formalpara> |
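The allocation/deallocation flow above can be condensed into a runnable sketch. Only the names frame_alloc and frame_free come from the text; the zone walk and accounting are illustrative, and the per-zone buddy system is reduced to a free-frame counter so the zone-selection logic stands out:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch of the frame_alloc/frame_free flow; 4 KiB frames
 * are assumed and the buddy layer is reduced to counters. */
typedef struct zone {
    uintptr_t base;
    size_t count;      /* frames in the zone */
    size_t free_count; /* unallocated frames */
    struct zone *next;
} zone_t;

static zone_t *zone_list; /* walked on every allocation/deallocation */

/* Find a zone able to satisfy a request for 2^order frames. */
static zone_t *zone_find_free(unsigned order)
{
    for (zone_t *z = zone_list; z; z = z->next)
        if (z->free_count >= (1u << order))
            return z;
    return NULL;
}

uintptr_t frame_alloc(unsigned order)
{
    zone_t *z = zone_find_free(order);
    if (!z)
        return 0; /* out of memory */
    /* The buddy allocator would pick the exact block; here only the
     * free-frame accounting mentioned in the text is updated. */
    z->free_count -= (1u << order);
    return z->base; /* placeholder for the block's physical address */
}

void frame_free(uintptr_t addr, unsigned order)
{
    /* Look up the zone containing the frame being deallocated. */
    for (zone_t *z = zone_list; z; z = z->next) {
        if (addr >= z->base && addr < z->base + z->count * 4096) {
            z->free_count += (1u << order);
            return;
        }
    }
}
```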
| 83 | </section> |
||
| 84 | |||
| 85 | <section id="buddy_allocator"> |
||
| 86 | <title>Buddy allocator</title> |
||
| 87 | |||
| 67 | jermar | 88 | <para>In the buddy system, the memory is broken down into power-of-two |
| 89 | sized naturally aligned blocks. These blocks are organized in an array |
||
| 90 | of lists, in which the list with index i contains all unallocated blocks |
||
| 91 | of size <mathphrase>2<superscript>i</superscript></mathphrase>. The |
||
| 92 | index i is called the order of the block. Should there be two adjacent |
||
| 93 | equally sized blocks in the list with index i (i.e. buddies), the |
||
| 94 | buddy allocator would coalesce them and put the resulting block in list |
||
| 95 | <mathphrase>i + 1</mathphrase>, provided that the resulting block would |
||
| 96 | be naturally aligned. Similarly, when the allocator is asked to |
||
| 97 | allocate a block of size |
||
| 98 | <mathphrase>2<superscript>i</superscript></mathphrase>, it first tries |
||
| 99 | to satisfy the request from the list with index i. If the request cannot |
||
| 100 | be satisfied (i.e. the list i is empty), the buddy allocator will try to |
||
| 101 | allocate and split a larger block from the list with index i + 1. Both |
||
| 102 | of these algorithms are recursive. The recursion ends either when there |
||
| 103 | are no blocks to coalesce in the former case or when there are no blocks |
||
| 104 | that can be split in the latter case.</para> |
||
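The split and coalesce recursion can be illustrated with a toy index-based buddy system (not the HelenOS implementation). Under the stated assumptions, a naturally aligned block of order i and its buddy differ exactly in bit i of the block's first-frame index:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy buddy system: blocks are identified by their first frame index,
 * free lists are kept as per-order bitmaps. Sizes are illustrative. */
#define MAX_ORDER 5
#define NFRAMES (1 << MAX_ORDER)

static bool free_block[MAX_ORDER + 1][NFRAMES];

void buddy_init(void)
{
    for (int o = 0; o <= MAX_ORDER; o++)
        for (int i = 0; i < NFRAMES; i++)
            free_block[o][i] = false;
    free_block[MAX_ORDER][0] = true; /* one maximal free block */
}

/* Allocate a naturally aligned block of 2^order frames; returns the
 * index of its first frame, or -1. Splits larger blocks recursively. */
int buddy_alloc(int order)
{
    for (int i = 0; i < NFRAMES; i += (1 << order))
        if (free_block[order][i]) {
            free_block[order][i] = false;
            return i;
        }
    if (order == MAX_ORDER)
        return -1; /* no larger block left to split */
    int big = buddy_alloc(order + 1);
    if (big < 0)
        return -1;
    free_block[order][big + (1 << order)] = true; /* keep upper half */
    return big;                                   /* hand out lower half */
}

/* Free a block; coalesce with its buddy repeatedly while possible. */
void buddy_free(int idx, int order)
{
    while (order < MAX_ORDER) {
        int buddy = idx ^ (1 << order);
        if (!free_block[order][buddy])
            break;                        /* buddy busy: stop merging */
        free_block[order][buddy] = false; /* merge the pair */
        idx &= ~(1 << order);             /* merged block stays aligned */
        order++;
    }
    free_block[order][idx] = true;
}
```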
| 64 | jermar | 105 | |
| 67 | jermar | 106 | <para>This approach greatly reduces external fragmentation of memory and |
| 107 | helps in allocating bigger continuous blocks of memory aligned to their |
||
| 108 | size. On the other hand, the buddy allocator suffers from increased internal |
||
| 109 | fragmentation of memory and is not suitable for general kernel |
||
| 110 | allocations. This purpose is better addressed by the <link |
||
| 111 | linkend="slab">slab allocator</link>.<figure> |
||
| 64 | jermar | 112 | <mediaobject id="buddy_alloc"> |
| 113 | <imageobject role="html"> |
||
| 114 | <imagedata fileref="images/buddy_alloc.png" format="PNG" /> |
||
| 115 | </imageobject> |
||
| 116 | |||
| 117 | <imageobject role="fop"> |
||
| 118 | <imagedata fileref="images.vector/buddy_alloc.svg" format="SVG" /> |
||
| 119 | </imageobject> |
||
| 120 | </mediaobject> |
||
| 121 | |||
| 122 | <title>Buddy system scheme.</title> |
||
| 67 | jermar | 123 | </figure></para> |
| 64 | jermar | 124 | |
| 125 | <section> |
||
| 126 | <title>Implementation</title> |
||
| 127 | |||
| 128 | <para>The buddy allocator is, in fact, an abstract framework which can |
||
| 129 | be easily specialized to serve one particular task. It knows nothing |
||
| 130 | about the nature of the memory it helps to allocate. To compensate for |
||
| 131 | this lack of knowledge, the buddy allocator exports an interface that |
||
| 132 | each of its clients is required to implement. When supplied with an |
||
| 133 | implementation of this interface, the buddy allocator can use |
||
| 134 | specialized external functions to find a buddy for a block, split and |
||
| 135 | coalesce blocks, manipulate block order and mark blocks busy or |
||
| 67 | jermar | 136 | available.</para> |
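The exported client interface can be sketched as a structure of function pointers; the names and signatures below are assumptions for illustration, not the exact HelenOS declarations:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch: the buddy framework manipulates blocks only
 * through operations supplied by its client. */
typedef struct buddy_ops {
    void *(*find_buddy)(void *block);
    void *(*bisect)(void *block); /* split, return the second half */
    void *(*coalesce)(void *block, void *buddy);
    void (*set_order)(void *block, uint8_t order);
    uint8_t (*get_order)(void *block);
    void (*mark_busy)(void *block);
    void (*mark_available)(void *block);
} buddy_ops_t;

/* A toy client: "blocks" are small integer ids indexing an order array. */
static uint8_t orders[16];

static void toy_set_order(void *b, uint8_t o) { orders[(uintptr_t)b] = o; }
static uint8_t toy_get_order(void *b) { return orders[(uintptr_t)b]; }

static const buddy_ops_t toy_ops = {
    .set_order = toy_set_order,
    .get_order = toy_get_order,
    /* remaining operations omitted in this sketch */
};

/* Framework-side code never touches the blocks directly. */
uint8_t buddy_query_order(const buddy_ops_t *ops, void *block)
{
    return ops->get_order(block);
}
```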
| 64 | jermar | 137 | |
| 138 | <formalpara> |
||
| 139 | <title>Data organization</title> |
||
| 140 | |||
| 141 | <para>Each entity allocable by the buddy allocator is required to |
||
| 142 | contain space for storing the block order number and a link variable |
||
| 143 | used to interconnect blocks within the same order.</para> |
||
| 144 | |||
| 145 | <para>Whatever entities are allocated by the buddy allocator, the |
||
| 146 | first entity within a block is used to represent the entire block. |
||
| 147 | The first entity keeps the order of the whole block. Other entities |
||
| 148 | within the block are assigned the magic value |
||
| 149 | <constant>BUDDY_INNER_BLOCK</constant>. This is especially important |
||
| 150 | for effective identification of buddies in a one-dimensional array |
||
| 151 | because the entity that represents a potential buddy cannot be |
||
| 152 | associated with <constant>BUDDY_INNER_BLOCK</constant> (i.e. if it |
||
| 153 | is associated with <constant>BUDDY_INNER_BLOCK</constant> then it is |
||
| 154 | not a buddy).</para> |
||
| 155 | </formalpara> |
||
| 156 | </section> |
||
| 157 | </section> |
||
| 158 | |||
| 159 | <section id="slab"> |
||
| 70 | jermar | 160 | <title>Slab allocator</title> |
| 64 | jermar | 161 | |
| 67 | jermar | 162 | <para>The majority of memory allocation requests in the kernel is for |
| 163 | small, frequently used data structures. The basic idea behind the slab |
||
| 164 | allocator is that commonly used objects are preallocated in continuous |
||
| 165 | areas of physical memory called slabs<footnote> |
||
| 166 | <para>Slabs are in fact blocks of physical memory frames allocated |
||
| 167 | from the frame allocator.</para> |
||
| 168 | </footnote>. Whenever an object is to be allocated, the slab allocator |
||
| 169 | returns the first available item from a suitable slab corresponding to |
||
| 170 | the object type<footnote> |
||
| 171 | <para>The mechanism is rather more complicated, see the next |
||
| 172 | paragraph.</para> |
||
| 173 | </footnote>. Due to the fact that the sizes of the requested and |
||
| 174 | allocated object match, the slab allocator significantly reduces |
||
| 175 | internal fragmentation.</para> |
||
| 64 | jermar | 176 | |
| 67 | jermar | 177 | <para>Slabs of one object type are organized in a structure called slab |
| 178 | cache. There are usually several slabs in the slab cache, depending on |
||
| 179 | previous allocations. If the slab cache runs out of available slabs, |
||
| 180 | new slabs are allocated. In order to exploit parallelism and to avoid |
||
| 181 | locking of shared spinlocks, slab caches can have variants of |
||
| 182 | processor-private slabs called magazines. On each processor, there is a |
||
| 183 | two-magazine cache. Full magazines that are not part of any |
||
| 184 | per-processor magazine cache are stored in a global list of full |
||
| 185 | magazines.</para> |
||
| 64 | jermar | 186 | |
| 67 | jermar | 187 | <para>Each object begins its life in a slab. When it is allocated from |
| 188 | there, the slab allocator calls a constructor that is registered in the |
||
| 189 | respective slab cache. The constructor initializes and brings the object |
||
| 190 | into a known state. The object is then used by the user. When the user |
||
| 191 | later frees the object, the slab allocator puts it into a processor |
||
| 192 | private magazine cache, from where it can be readily allocated |
||
| 193 | again. Note that allocations satisfied from a magazine are already |
||
| 194 | initialized by the constructor. When both of the processor cached |
||
| 195 | magazines get full, the allocator will move one of the magazines to the |
||
| 196 | list of full magazines. Similarly, when allocating from an empty |
||
| 197 | processor magazine cache, the kernel will reload only one magazine from |
||
| 198 | the list of full magazines. In other words, the slab allocator tries to |
||
| 199 | keep the processor magazine cache only half-full in order to prevent |
||
| 200 | thrashing when allocations and deallocations interleave on magazine |
||
| 70 | jermar | 201 | boundaries. The advantage of this setup is that during most of the |
| 202 | allocations, no global spinlock needs to be held.</para> |
||
| 65 | jermar | 203 | |
| 67 | jermar | 204 | <para>Should HelenOS run short of memory, it would start deallocating |
| 205 | objects from magazines, calling slab cache destructor on them and |
||
| 206 | putting them back into slabs. When a slab contains no allocated objects, |
||
| 207 | it is immediately freed.</para> |
||
| 66 | bondari | 208 | |
| 67 | jermar | 209 | <para><figure> |
| 64 | jermar | 210 | <mediaobject id="slab_alloc"> |
| 211 | <imageobject role="html"> |
||
| 212 | <imagedata fileref="images/slab_alloc.png" format="PNG" /> |
||
| 213 | </imageobject> |
||
| 214 | </mediaobject> |
||
| 215 | |||
| 216 | <title>Slab allocator scheme.</title> |
||
| 67 | jermar | 217 | </figure></para> |
| 64 | jermar | 218 | |
| 67 | jermar | 219 | <section> |
| 220 | <title>Implementation</title> |
||
| 221 | |||
| 222 | <para>The slab allocator is closely modelled after the OpenSolaris slab |
||
| 223 | allocator by Jeff Bonwick and Jonathan Adams, with the following |
||
| 224 | exceptions:<itemizedlist> |
||
| 64 | jermar | 225 | <listitem> |
| 70 | jermar | 226 | <para>empty slabs are immediately deallocated and</para> |
| 64 | jermar | 227 | </listitem> |
| 66 | bondari | 228 | |
| 229 | <listitem> |
||
| 70 | jermar | 230 | <para>empty magazines are deallocated when not needed.</para> |
| 66 | bondari | 231 | </listitem> |
| 70 | jermar | 232 | </itemizedlist>The following features are not currently supported |
| 233 | but would be easy to do: <itemizedlist> |
||
| 64 | jermar | 234 | <listitem> |
| 70 | jermar | 235 | <para>cache coloring and</para> |
| 64 | jermar | 236 | </listitem> |
| 237 | |||
| 238 | <listitem> |
||
| 70 | jermar | 239 | <para>dynamic magazine growth (different magazine sizes are already supported, but the allocation strategy would need to be adjusted).</para> |
| 64 | jermar | 240 | </listitem> |
| 241 | </itemizedlist></para> |
||
| 242 | |||
| 243 | <section> |
||
| 244 | <title>Allocation/deallocation</title> |
||
| 245 | |||
| 70 | jermar | 246 | <para>The following two paragraphs summarize and complete the |
| 247 | description of the slab allocator operation (i.e. |
||
| 248 | <code>slab_alloc</code> and <code>slab_free</code> |
||
| 249 | operations).</para> |
||
| 64 | jermar | 250 | |
| 251 | <formalpara> |
||
| 252 | <title>Allocation</title> |
||
| 253 | |||
| 70 | jermar | 254 | <para><emphasis>Step 1.</emphasis> When an allocation request |
| 255 | arrives, the slab allocator checks the availability of objects in the |
||
| 256 | current magazine of the local processor magazine cache. If an |
||
| 257 | object is available, the allocator simply pops it off the magazine |
||
| 258 | and returns a pointer to it.</para> |
||
| 64 | jermar | 259 | |
| 70 | jermar | 260 | <para><emphasis>Step 2.</emphasis> If the current magazine in the |
| 261 | processor magazine cache is empty, the allocator will attempt to |
||
| 262 | swap it with the last magazine from the cache and return to the |
||
| 263 | first step. If also the last magazine is empty, the algorithm will |
||
| 264 | fall through to Step 3.</para> |
||
| 64 | jermar | 265 | |
| 70 | jermar | 266 | <para><emphasis>Step 3.</emphasis> Now the allocator is in the |
| 267 | situation when both magazines in the processor magazine cache are |
||
| 268 | empty. The allocator reloads one magazine from the shared list of |
||
| 269 | full magazines. If the reload is successful (i.e. there are full |
||
| 270 | magazines in the list), the algorithm continues with Step |
||
| 271 | 1.</para> |
||
| 64 | jermar | 272 | |
| 70 | jermar | 273 | <para><emphasis>Step 4.</emphasis> In this fail-safe step, an |
| 274 | object is allocated from the conventional slab layer and a pointer |
||
| 275 | to it is returned.</para> |
||
| 64 | jermar | 276 | </formalpara> |
| 277 | |||
| 278 | <formalpara> |
||
| 279 | <title>Deallocation</title> |
||
| 280 | |||
| 70 | jermar | 281 | <para><emphasis>Step 1.</emphasis> During a deallocation request, |
| 282 | the slab allocator checks if the current magazine of the local |
||
| 283 | processor magazine cache is full. If it is not, the pointer to the |
||
| 284 | object is simply pushed into the magazine and the algorithm |
||
| 285 | returns.</para> |
||
| 64 | jermar | 286 | |
| 70 | jermar | 287 | <para><emphasis>Step 2.</emphasis> If the current magazine is |
| 288 | full, the allocator will attempt to swap it with the last magazine |
||
| 289 | from the cache and return to the first step. If also the last |
||
| 290 | magazine is full, the algorithm will fall through to Step |
||
| 291 | 3.</para> |
||
| 64 | jermar | 292 | |
| 70 | jermar | 293 | <para><emphasis>Step 3.</emphasis> Now the allocator is in the |
| 294 | situation when both magazines in the processor magazine cache are |
||
| 295 | full. The allocator moves one magazine to the shared list of full |
||
| 296 | magazines. The algorithm continues with Step 1.</para> |
||
| 64 | jermar | 297 | </formalpara> |
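The allocation and deallocation steps above can be condensed into a single-CPU executable sketch. The magazine size, the object pool standing in for the conventional slab layer, and all helper names are illustrative assumptions, not HelenOS code:

```c
#include <assert.h>
#include <stddef.h>

#define MAG_SIZE 4 /* illustrative magazine capacity */

typedef struct magazine {
    int nobjs;
    void *objs[MAG_SIZE];
} magazine_t;

static magazine_t mag_a, mag_b;  /* the two-magazine CPU cache */
static magazine_t *cur = &mag_a; /* current magazine */
static magazine_t *last = &mag_b;

static magazine_t full_store[8]; /* shared list of full magazines */
static int full_count;

static int pool[64]; /* stand-in for slab-layer objects */
static int pool_next;

void *slab_alloc(void)
{
    for (;;) {
        /* Step 1: pop from the current magazine if it has objects. */
        if (cur->nobjs > 0)
            return cur->objs[--cur->nobjs];
        /* Step 2: swap with the last magazine and retry Step 1. */
        if (last->nobjs > 0) {
            magazine_t *tmp = cur; cur = last; last = tmp;
            continue;
        }
        /* Step 3: reload one magazine from the shared full list. */
        if (full_count > 0) {
            *cur = full_store[--full_count];
            continue;
        }
        /* Step 4: fall back to the conventional slab layer. */
        return &pool[pool_next++];
    }
}

void slab_free(void *obj)
{
    for (;;) {
        /* Step 1: push into the current magazine if it is not full. */
        if (cur->nobjs < MAG_SIZE) {
            cur->objs[cur->nobjs++] = obj;
            return;
        }
        /* Step 2: swap with the last magazine and retry Step 1. */
        if (last->nobjs < MAG_SIZE) {
            magazine_t *tmp = cur; cur = last; last = tmp;
            continue;
        }
        /* Step 3: move one full magazine to the shared list. */
        full_store[full_count++] = *cur;
        cur->nobjs = 0;
    }
}
```

Note how a free followed by an allocation is served entirely from the processor-local magazine, which is what lets most operations avoid any global spinlock.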
| 298 | </section> |
||
| 299 | </section> |
||
| 300 | </section> |
||
| 301 | </section> |
||
| 302 | |||
| 303 | <section> |
||
| 67 | jermar | 304 | <title>Virtual memory management</title> |
| 9 | bondari | 305 | |
| 67 | jermar | 306 | <section> |
| 307 | <title>Introduction</title> |
||
| 308 | |||
| 309 | <para>Virtual memory is a special memory management technique, used by |
||
| 310 | the kernel to achieve several mission-critical goals. <itemizedlist> |
||
| 311 | <listitem> |
||
| 312 | Isolate each task from other tasks that are running on the system at the same time. |
||
| 313 | </listitem> |
||
| 314 | |||
| 315 | <listitem> |
||
| 316 | Allow allocation of more memory than the actual physical memory size of the machine. |
||
| 317 | </listitem> |
||
| 318 | |||
| 319 | <listitem> |
||
| 320 | Allow, in general, loading and executing two programs that are linked at the same address without complicated relocations. |
||
| 321 | </listitem> |
||
| 322 | </itemizedlist></para> |
||
| 323 | |||
| 324 | <para><!-- |
||
| 325 | |||
| 70 | jermar | 326 | TLB shootdown ASID/ASID:PAGE/ALL. |
| 327 | TLB shootdown requests can come in asynchroniously |
||
| 328 | so there is a cache of TLB shootdown requests. Upon cache overflow TLB shootdown ALL is executed |
||
| 329 | |||
| 330 | |||
| 67 | jermar | 331 | <para> |
| 332 | Address spaces. Address space area (B+ tree). Only for uspace. Set of syscalls (shrink/extend etc). |
||
| 333 | Special address space area type - device - prohibits shrink/extend syscalls to call on it. |
||
| 334 | Address space has link to mapping tables (hierarchical - per Address space, hash - global tables). |
||
| 335 | </para> |
||
| 336 | |||
| 337 | --></para> |
||
| 338 | </section> |
||
| 339 | |||
| 340 | <section> |
||
| 70 | jermar | 341 | <title>Paging</title> |
| 342 | |||
| 343 | <para>Virtual memory is usually implemented using a paged memory model, where the virtual |
||
| 344 | memory address space is divided into <emphasis>pages</emphasis> |
||
| 345 | (usually 4096 bytes in size) and physical memory is divided into |
||
| 346 | frames (of the same size as a page, of course). Each page may be mapped to |
||
| 347 | some frame and then, upon memory access to a virtual address, the CPU |
||
| 348 | performs <emphasis>address translation</emphasis> during the instruction |
||
| 349 | execution. A non-existent mapping generates a page fault exception, invoking |
||
| 350 | the kernel exception handler, thus allowing the kernel to enforce the rules of |
||
| 351 | memory access. Information about page mappings is stored by the kernel in the |
||
| 352 | <link linkend="page_tables">page tables</link>.</para> |
||
| 353 | |||
| 354 | <para>The majority of the architectures use multi-level page tables, |
||
| 355 | which means that physical memory must be accessed several times before the |
||
| 356 | physical address is obtained. This would introduce serious performance overhead into |
||
| 357 | virtual memory management. To avoid it, the <link linkend="tlb">Translation |
||
| 358 | Lookaside Buffer (TLB)</link> is used.</para> |
||
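The paged translation model described above can be sketched with a toy two-level walk. The page size, the 10+10+12 bit address split and all names are illustrative assumptions, not any particular architecture's layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12
#define PAGE_SIZE (1u << PAGE_BITS) /* 4096-byte pages */
#define PTL_ENTRIES 1024

typedef struct {
    uintptr_t frame; /* physical base of the mapped frame, 0 = unmapped */
} pte_t;

typedef struct {
    pte_t *tables[PTL_ENTRIES]; /* second-level tables, NULL = absent */
} root_table_t;

/* Returns the physical address, or 0 to signal the page-fault path. */
uintptr_t translate(root_table_t *root, uintptr_t vaddr)
{
    size_t i1 = (vaddr >> (PAGE_BITS + 10)) & (PTL_ENTRIES - 1);
    size_t i2 = (vaddr >> PAGE_BITS) & (PTL_ENTRIES - 1);
    pte_t *l2 = root->tables[i1];
    if (!l2 || l2[i2].frame == 0)
        return 0; /* non-existent mapping: a real CPU raises a page fault */
    return l2[i2].frame + (vaddr & (PAGE_SIZE - 1));
}
```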
| 359 | </section> |
||
| 360 | |||
| 361 | <section> |
||
| 67 | jermar | 362 | <title>Address spaces</title> |
| 363 | |||
| 364 | <section> |
||
| 70 | jermar | 365 | <title>Address space areas</title> |
| 67 | jermar | 366 | |
| 367 | <para>Each address space consists of mutually disjoint contiguous |
||
| 368 | address space areas. An address space area is precisely defined by its |
||
| 369 | base address and the number of frames/pages it contains.</para> |
||
| 370 | |||
| 371 | <para>Each address space area has a set of flags that define the behaviour and permissions on |
||
| 372 | the particular area. <itemizedlist> |
||
| 373 | <listitem> |
||
| 374 | |||
| 375 | |||
| 376 | <emphasis>AS_AREA_READ</emphasis> |
||
| 377 | |||
| 378 | flag indicates reading permission. |
||
| 379 | </listitem> |
||
| 380 | |||
| 381 | <listitem> |
||
| 382 | |||
| 383 | |||
| 384 | <emphasis>AS_AREA_WRITE</emphasis> |
||
| 385 | |||
| 386 | flag indicates writing permission. |
||
| 387 | </listitem> |
||
| 388 | |||
| 389 | <listitem> |
||
| 390 | |||
| 391 | |||
| 392 | <emphasis>AS_AREA_EXEC</emphasis> |
||
| 393 | |||
| 394 | flag indicates code execution permission. Some architectures do not support restricting execution permission; in this case the flag has no effect. |
||
| 395 | </listitem> |
||
| 396 | |||
| 397 | <listitem> |
||
| 398 | |||
| 399 | |||
| 400 | <emphasis>AS_AREA_DEVICE</emphasis> |
||
| 401 | |||
| 402 | marks area as mapped to the device memory. |
||
| 403 | </listitem> |
||
| 404 | </itemizedlist></para> |
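The flags above are naturally represented as bit values that can be combined per area. The numeric encoding below is an assumption for illustration, not HelenOS's actual one:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit encoding of the area flags; values are assumed. */
#define AS_AREA_READ   (1 << 0)
#define AS_AREA_WRITE  (1 << 1)
#define AS_AREA_EXEC   (1 << 2)
#define AS_AREA_DEVICE (1 << 3)

/* A write access is permitted only in an area with AS_AREA_WRITE set. */
bool area_allows_write(unsigned flags)
{
    return (flags & AS_AREA_WRITE) != 0;
}
```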
||
| 405 | |||
| 406 | <para>The kernel provides tasks with the possibility to create, expand, shrink, and share |
||
| 407 | their address space via a set of syscalls.</para> |
||
| 408 | </section> |
||
| 409 | |||
| 410 | <section> |
||
| 411 | <title>Address Space ID (ASID)</title> |
||
| 412 | |||
| 413 | <para>When switching to a different task, the kernel also needs to |
||
| 414 | switch mappings to a different address space. If the TLB cannot |
||
| 415 | distinguish address space mappings, all mapping information in the TLB |
||
| 416 | from the old address space must be flushed, which can create |
||
| 417 | unnecessary overhead during task switching. To avoid this, some |
||
| 418 | architectures have the capability to segregate different address spaces at the |
||
| 419 | hardware level by introducing an address space identifier as a part of the |
||
| 420 | TLB record, telling the virtual address translation unit to |
||
| 421 | which address space the record is applicable.</para> |
||
| 422 | |||
| 423 | <para>The HelenOS kernel can take advantage of this hardware-supported |
||
| 424 | identifier by having an ASID abstraction which is related to |
||
| 425 | the corresponding architecture-specific identifier. E.g. on ia64, the kernel ASID is |
||
| 426 | derived from the RID (region identifier), while on mips32 the kernel ASID is |
||
| 427 | actually the hardware identifier. As expected, this ASID information |
||
| 428 | record is part of the <emphasis>as_t</emphasis> structure.</para> |
||
| 429 | |||
| 430 | <para>Due to hardware limitations, the hardware ASID has a limited |
||
| 431 | length, from 8 bits on mips32 to 24 bits on ia64, which makes it |
||
| 432 | impossible to use it as a unique address space identifier for all tasks |
||
| 433 | running in the system. In such situations a special ASID stealing |
||
| 434 | algorithm is used, which takes the ASID from an inactive task and assigns it |
||
| 435 | to the active task.</para> |
||
| 436 | |||
| 437 | <para><classname>ASID stealing algorithm here.</classname></para> |
||
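A hedged sketch of what such a stealing scheme could look like, under the assumptions in the comments. This is not HelenOS code: ASID_MAX, the as_t fields and asid_get are all illustrative, and the placeholder above still stands for the actual algorithm:

```c
#include <assert.h>
#include <stddef.h>

#define ASID_MAX 8 /* deliberately tiny so stealing is exercised */

typedef struct as {
    int asid;   /* -1 = no hardware ASID currently assigned */
    int active; /* nonzero while running on some CPU */
} as_t;

static as_t *asid_owner[ASID_MAX]; /* which address space holds each ASID */

int asid_get(as_t *as)
{
    /* Prefer a free ASID. */
    for (int a = 0; a < ASID_MAX; a++)
        if (asid_owner[a] == NULL) {
            asid_owner[a] = as;
            return as->asid = a;
        }
    /* Otherwise steal one from an inactive address space. */
    for (int a = 0; a < ASID_MAX; a++)
        if (!asid_owner[a]->active) {
            asid_owner[a]->asid = -1; /* victim must reacquire later */
            asid_owner[a] = as;
            return as->asid = a;
        }
    return -1; /* every ASID belongs to an active address space */
}
```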
| 438 | </section> |
||
| 439 | </section> |
||
| 440 | |||
| 441 | <section> |
||
| 442 | <title>Virtual address translation</title> |
||
| 443 | |||
| 70 | jermar | 444 | <section id="page_tables"> |
| 445 | <title>Page tables</title> |
||
| 67 | jermar | 446 | |
| 70 | jermar | 447 | <para>The HelenOS kernel has two different approaches to the paging |
| 448 | implementation: <emphasis>4-level page tables</emphasis> and |
||
| 449 | <emphasis>global hash tables</emphasis>, which are accessible via a |
||
| 450 | generic paging abstraction layer. This duality is |
||
| 451 | caused by the major architectural differences between the supported |
||
| 452 | platforms. The abstraction is implemented with the help of a global |
||
| 453 | structure of pointers to the basic mapping functions, |
||
| 454 | <emphasis>page_mapping_operations</emphasis>. To provide a particular |
||
| 455 | page table implementation, the corresponding layer must implement the |
||
| 456 | functions declared in |
||
| 457 | <emphasis>page_mapping_operations</emphasis>.</para> |
||
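The abstraction layer can be sketched as follows. The member signatures are simplified guesses rather than the exact HelenOS declarations, and the backend is a one-bucket stand-in for a real hash table or page-table walk:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of a global structure of pointers to mapping functions that
 * each page-table implementation fills in. Signatures are simplified. */
typedef struct {
    void (*mapping_insert)(uintptr_t page, uintptr_t frame);
    uintptr_t (*mapping_find)(uintptr_t page);
} page_mapping_operations_t;

/* A trivial one-bucket backend, for illustration only. */
static uintptr_t stored_page, stored_frame;

static void ht_insert(uintptr_t page, uintptr_t frame)
{
    stored_page = page;
    stored_frame = frame;
}

static uintptr_t ht_find(uintptr_t page)
{
    return page == stored_page ? stored_frame : 0;
}

/* The backend is selected once; generic code only goes through ops. */
static const page_mapping_operations_t page_mapping_operations = {
    .mapping_insert = ht_insert,
    .mapping_find = ht_find,
};

void page_mapping_insert(uintptr_t page, uintptr_t frame)
{
    page_mapping_operations.mapping_insert(page, frame);
}

uintptr_t page_mapping_find(uintptr_t page)
{
    return page_mapping_operations.mapping_find(page);
}
```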
| 67 | jermar | 458 | |
| 70 | jermar | 459 | <formalpara> |
| 460 | <title>4-level page tables</title> |
||
| 67 | jermar | 461 | |
| 70 | jermar | 462 | <para>4-level page tables are the generalization of the hardware |
| 463 | capabilities of several architectures.<itemizedlist> |
||
| 67 | jermar | 464 | <listitem> |
| 465 | ia32 uses 2-level page tables, with full hardware support. |
||
| 466 | </listitem> |
||
| 467 | |||
| 468 | <listitem> |
||
| 469 | amd64 uses 4-level page tables, also coming with full hardware support. |
||
| 470 | </listitem> |
||
| 471 | |||
| 472 | <listitem> |
||
| 473 | mips and ppc32 have 2-level tables, software simulated support. |
||
| 474 | </listitem> |
||
| 475 | </itemizedlist></para> |
||
| 70 | jermar | 476 | </formalpara> |
| 67 | jermar | 477 | |
| 70 | jermar | 478 | <formalpara> |
| 479 | <title>Global hash tables</title> |
||
| 67 | jermar | 480 | |
| 70 | jermar | 481 | <para>There is a single global page hash table in the whole system |
| 482 | (it is used by ia64; note that ia64 has the VHPT disabled for now). A |
||
| 483 | generic hash table with separate collision chains is used. ASID support is |
||
| 484 | required to use global hash tables.</para> |
||
| 485 | </formalpara> |
||
| 67 | jermar | 486 | |
| 70 | jermar | 487 | <para>Thanks to the abstract paging interface, it is possible to |
| 488 | add further paging implementations, for example B-tree page |
||
| 489 | tables.</para> |
||
| 67 | jermar | 490 | </section> |
| 491 | |||
| 492 | <section id="tlb"> |
||
| 493 | <title>Translation Lookaside buffer</title> |
||
| 494 | |||
| 495 | <para>Due to the extensive overhead of page mapping lookups in |
||
| 496 | the page tables, all architectures have a fast associative cache memory |
||
| 497 | built into the CPU. This memory, called the TLB, stores recently used page table |
||
| 498 | entries.</para> |
||
| 499 | |||
| 500 | <section id="tlb_shootdown"> |
||
| 501 | <title>TLB consistency. TLB shootdown algorithm.</title> |
||
| 502 | |||
| 503 | <para>The operating system is responsible for keeping the TLB consistent by |
||
| 504 | invalidating its contents whenever there is some change in the |
||
| 505 | page tables. Those changes may occur when a page or a group of pages |
||
| 506 | is unmapped, when a mapping is changed, or when the system switches the active address |
||
| 70 | jermar | 507 | space to schedule a new task (which is a batch unmap of all |
| 508 | address space mappings). Moreover, this invalidation operation must |
||
| 509 | be done on all system CPUs because each CPU has its own independent |
||
| 510 | TLB cache. Thus maintaining TLB consistency in an SMP configuration is |
||
| 511 | not as trivial a task as it looks at first glance. A naive solution |
||
| 512 | would assume remote TLB invalidation, which is not possible on |
||
| 513 | most of the architectures because of a simple fact: flushing the TLB |
||
| 514 | is allowed only on the local CPU and there is no possibility to |
||
| 515 | access other CPUs' TLB caches.</para> |
||
| 67 | jermar | 516 | |
| 517 | <para>The technique of remote invalidation of TLB entries is called "TLB |
||
| 518 | shootdown". HelenOS uses a variation of the algorithm described by |
||
| 519 | D. Black et al., "Translation Lookaside Buffer Consistency: A |
||
| 520 | Software Approach," Proc. Third Int'l Conf. Architectural Support |
||
| 521 | for Programming Languages and Operating Systems, 1989, pp. |
||
| 522 | 113-122.</para> |
||
| 523 | |||
| 524 | <para>As the situation demands, partial invalidation |
||
| 525 | of the TLB caches may be wanted. In case of a simple memory mapping change it is |
||
| 526 | necessary to invalidate only one or more adjacent pages. If |
||
| 70 | jermar | 527 | the architecture is aware of ASIDs, then during address space |
| 528 | switching the kernel invalidates only entries from that particular |
||
| 529 | address space. The final option of TLB invalidation is the complete |
||
| 530 | TLB cache invalidation, which is the operation that flushes all |
||
| 531 | entries in the TLB.</para> |
||
| 67 | jermar | 532 | |
| 70 | jermar | 533 | <para>TLB shootdown is performed in two phases. First, the initiator |
| 534 | processor sends an IPI message indicating the TLB shootdown request to |
||
| 535 | the rest of the CPUs. Then, it waits until all CPUs confirm TLB |
||
| 536 | invalidating action execution.</para> |
||
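The two phases can be simulated sequentially. In a real kernel the request travels via IPIs and the initiator spins waiting for acknowledgements; everything below is an illustrative model, not HelenOS code:

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4

static bool tlb_dirty[NCPUS]; /* per-CPU TLB holds stale entries */
static bool acked[NCPUS];     /* per-CPU acknowledgement flags */

/* Handler run on each remote CPU upon "receiving" the IPI. */
static void ipi_handler(int cpu)
{
    tlb_dirty[cpu] = false; /* each CPU can flush only its own TLB */
    acked[cpu] = true;      /* confirm execution to the initiator */
}

void tlb_shootdown(int initiator)
{
    /* Phase 1: request invalidation on all other CPUs. */
    for (int c = 0; c < NCPUS; c++)
        if (c != initiator)
            ipi_handler(c); /* models IPI delivery */
    tlb_dirty[initiator] = false; /* invalidate locally as well */
    /* Phase 2: wait until every CPU has confirmed. */
    for (int c = 0; c < NCPUS; c++)
        if (c != initiator)
            assert(acked[c]); /* models the busy-wait for confirmation */
}
```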
| 537 | </section> |
||
| 538 | </section> |
||
| 539 | </section> |
||
| 67 | jermar | 540 | |
| 70 | jermar | 541 | <section> |
| 542 | <title>Swapping and demand paging</title> |
||
| 67 | jermar | 543 | |
| 70 | jermar | 544 | <para>At the moment HelenOS does not support swapping.</para> |
| 67 | jermar | 545 | |
| 70 | jermar | 546 | <para>Page faults are used to allocate frames on demand within an |
| 547 | as_area. On architectures that support it, non-executable |
||
| 548 | pages are supported.</para> |
||
| 67 | jermar | 549 | </section> |
| 26 | bondari | 550 | </section> |
| 11 | bondari | 551 | </chapter> |